In the last decade, specialised hardware has become a viable solution to address the end of Moore's Law [MooreLaw] and the breakdown of Dennard scaling [DennardScaling], and to deal with the constant growth in performance and efficiency requirements of computer systems. Within this context, short time to market, higher design productivity, and reusability of existing modules are only a subset of the challenges that have driven Electronic Design Automation (EDA) industry and research. In particular, the recent widespread adoption of solutions based on artificial intelligence and deep learning techniques [lecun2015deep, schmidhuber2015deep] has led to an increasing demand for hardware accelerators able to target a large variety of computational workloads.
Automating the design process by exploiting the predictive power of modern machine learning (ML) models is an appealing approach that, while accelerating the development of computer architectures, would also allow the ML community to benefit from the improved computing platforms. In fact, progress in one field ripples through the other, creating a virtuous cycle [fahim2021hls4ml]. In this setting, graphs are natural candidates to capture the functionally dependent structure of software and hardware systems. It is not surprising, then, that the advent of graph neural networks (GNNs) [scarselli2008graph, bronstein2017geometric, battaglia2018relational] led to impressive developments in computer-aided hardware and software design, from chip placement [mirhoseini2020chip] to compilers for ML computational graphs [zhou2020transferable]. In this context, we advocate for the adoption of analogous automation strategies to close the gap between the increasing demand for efficient and effective hardware acceleration of existing applications and hardware design productivity.
Traditional Integrated Circuit (IC) design methodologies rely on Hardware Description Languages (HDLs) to describe logical components and their interaction at the Register Transfer Level (RTL). This approach requires designers to manually define the concurrent behavior of millions of transistors working in parallel to carry out the desired computations. High-Level Synthesis (HLS) reduces the burden of this task. HLS tools make it possible to start the design process from high-level behavioural specifications in C/C++/SystemC, sparing designers the tedious and error-prone task of implementing the functionality at RTL. Besides specifying the desired behavior, as shown in Figure 1, designers can guide the synthesis process by applying directives that tune the resulting RTL implementation according to target performance and cost requirements. Synthesis directives specify how software constructs such as loops, arrays, and functions are implemented in hardware. For example, the designer can use directives to tune the degree of hardware parallelization of a loop by specifying a loop unrolling factor. While HLS makes it possible to explore a vast design space of micro-architectural variations by using different directives, the resulting performance and resource utilization of each implementation cannot be determined a priori. In fact, exhaustive exploration involves time-consuming syntheses, whose number grows exponentially w.r.t. the number of applied optimizations. Moreover, among all the possible configurations (i.e., combinations of directives), only a few are Pareto-optimal from a performance and cost perspective. As designers are interested in effective methodologies to automate the Design-Space Exploration (DSE) process, the HLS-driven DSE problem consists in identifying as accurately as possible the set of Pareto-optimal implementations while, at the same time, minimizing the number of synthesis runs.
Recent works demonstrated the possibility of guiding DSEs by exploiting the notion of function similarity and the knowledge acquired from past explorations performed on different functions [ferretti2020leveraging, jhye2020transfer, WangJun20], and have shown strong empirical results despite the small number of source domains. In light of this, we claim that data-driven approaches are the way forward for HLS-driven DSE. In this work, we introduce a methodological framework in which we firmly believe progress in the field is bound to happen. We propose both a data representation able to capture the critical elements of the HLS process and the data-processing tools to profit from such a representation. First, we introduce a novel graph representation of computer programs, based on an augmented hybrid control and data flow graph, capturing invariant properties and relevant information from the HLS perspective. Then, we exploit this representation to train a novel graph neural network model in a supervised fashion, by fitting the model on a dataset of previously synthesized configurations (behavioral specifications plus optimization directives) to predict the latency and resource utilization corresponding to each point. To the best of our knowledge, this is the first attempt to use graph representation learning from software specifications to perform HLS-driven design space exploration. We show that the learned model can be used for effective DSE after fine-tuning on a small set of samples and that our method compares favorably against state-of-the-art baselines on relevant benchmarks. We refer to our framework as gnn4hls. We believe that the results achieved here constitute a strong signal for the community and call for a general effort in collecting large datasets of syntheses to unlock the full potential of graph deep learning solutions to the HLS-driven DSE problem. In order to accelerate progress in this direction, the methods and datasets presented here, together with a platform for data collection, will be open sourced at the end of the blind review process.
The rest of the paper is organized as follows. In Section 2 we define the problem by introducing the main concepts and terminology for HLS-driven DSE and graph neural networks. Then, in Section 3 we lay out the details of our approach. In Section 4 we evaluate the proposed method on relevant benchmarks. Finally, we discuss related work in Section 5, and draw our conclusions and discuss future work in Section 6.
2.1 High-Level Synthesis driven Design-Space Exploration
Given a software functionality, e.g., the sum_scan function from the Radix Sort algorithm in MachSuite [reagen2014machsuite] (used as a running example in the rest of the document), we define as HLS design, or simply design, the functionality to be realized in hardware, and as specification the behavioural description of the design in a high-level programming language such as C/C++. The specification is given as input to the HLS tool together with the target technology library and a target frequency. The result of the synthesis process is named implementation, and it consists of automatically generated RTL code, usually in VHDL or Verilog. The resulting RTL is coupled to a performance metric (e.g., latency or throughput) and a cost metric (e.g., area or energy costs).
An implementation is generated by applying a set of directives, specified using compiler pragmas, to the specification, affecting the resulting performance and costs. The set of directives affecting an implementation is named configuration. Each directive is associated with a target in the specification, which can be either a label or a code construct (e.g., labeled statements, function names, variables, etc.). In addition, a directive is characterized by its type and an associated value. Examples of directive types are loop unrolling, which affects the number of resources required to implement the loop body in hardware and enables its parallel execution, and array partitioning, which splits the input array into multiple memory banks and enables parallel access to the data. The directive value forces a given directive type to a specific value. As an example, an unrolling factor of 2 doubles the logic required to implement the loop body in hardware, enabling parallel hardware execution of two iterations.
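The terminology above (directive, target, type, value, configuration) can be made concrete with a small data structure. The following is only an illustrative sketch: the class name, field names, and the target labels `loop_1` and `bucket` are hypothetical and are not taken from any HLS tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Directive:
    target: str  # label or code construct in the specification (hypothetical names below)
    kind: str    # directive type, e.g. "unroll" or "array_partition"
    value: int   # value the directive type is forced to


# A configuration is the set of directives applied to one specification.
configuration = frozenset({
    Directive(target="loop_1", kind="unroll", value=2),
    Directive(target="bucket", kind="array_partition", value=4),
})


def unrolled_body_copies(directive: Directive) -> int:
    """Number of loop-body copies instantiated in hardware by an unroll directive."""
    if directive.kind != "unroll":
        raise ValueError("not an unroll directive")
    return directive.value


# An unrolling factor of 2 instantiates the loop body twice.
copies = unrolled_body_copies(Directive("loop_1", "unroll", 2))
```

Using a frozen dataclass keeps directives hashable, so a configuration can be stored as a set and compared with other configurations directly.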
The HLS design flow and the different elements characterizing it are shown in Figure 1. Figure 1(left) shows the inputs of the HLS design flow: a behavioural description of an HLS design (the sum_scan function in the Radix Sort benchmark from MachSuite [reagen2014machsuite]), an example of pragmas applied to the array and loop constructs of the specification, the technology library adopted, and the target frequency. After being processed by the HLS tool, the resulting implementation is generated as an RTL design, together with the synthesis scripts and the performance and cost reports (Figure 1(right)).
Given a design, a designer limits the set of configurations to explore during a DSE by defining a configuration space. The configuration space $CS$ is defined as the Cartesian product of the sets of directive values $V_i$ associated with each $i$-th directive, i.e., $CS = V_1 \times V_2 \times \dots \times V_n$, where $n$ is the number of considered directives. The size of the configuration space is given by its cardinality $|CS| = \prod_{i=1}^{n} |V_i|$. Given a configuration space $CS$, a design space can be defined as the set of implementations resulting from the synthesis of the configurations in $CS$.
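As a minimal sketch of the definition above, a configuration space can be enumerated as a Cartesian product of per-directive value sets. The directive targets and value lists below are made up for illustration.

```python
from itertools import product
from math import prod

# Hypothetical sets of admissible values, one per (target, directive type) pair.
directive_values = {
    ("loop_1", "unroll"): [1, 2, 4, 8],
    ("bucket", "array_partition"): [1, 2, 4],
    ("sum", "array_partition"): [1, 2],
}


def configuration_space(values):
    """Enumerate every configuration as a dict mapping each directive to one value."""
    keys = list(values)
    return [dict(zip(keys, combo)) for combo in product(*(values[k] for k in keys))]


space = configuration_space(directive_values)

# The cardinality is the product of the sizes of the individual value sets.
size = prod(len(v) for v in directive_values.values())
```

Because the size is a product over directives, adding one more directive with k admissible values multiplies the number of syntheses needed for an exhaustive exploration by k.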
The DSE problem is a Multi-Objective Optimization Problem (MOOP) having cost and merit as objective functions. In the context of hardware design, common performance and cost measures are latency, and area or power, respectively. In this work we use as performance measure the effective latency (LAT), i.e., the number of clock cycles required by the hardware implementation to execute its functionality multiplied by the target clock period of the system. For cost we consider the percentage of resource (silicon) utilization required to implement the IC in hardware. Since our target architecture is a Field Programmable Gate Array (FPGA), costs are expressed in terms of the number of Flip-Flops (FF), Look-Up Tables (LUT), Digital Signal Processors (DSP), and Block RAMs (BRAM). (Footnote 1: For the HLS designs considered in this work, the DSEs have not affected the number of BRAMs; therefore, we have not included BRAM estimation in our experiments.) In particular, the objective is to identify a subset of the configuration space such that every configuration in the subset is Pareto. A configuration of a design is Pareto if no other configuration in the configuration space yields an implementation whose cost and latency are both no worse and at least one of which is strictly better, where cost and merit are those associated with the implementations of the two configurations being compared.
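The Pareto condition can be checked directly on (cost, latency) pairs. The sketch below uses the standard dominance test with both objectives minimized; the sample points are invented, and reading "target clock" as a clock period in nanoseconds is an assumption.

```python
def effective_latency(clock_cycles, clock_period_ns):
    """Latency in nanoseconds: cycle count times the (assumed) target clock period."""
    return clock_cycles * clock_period_ns


def dominates(a, b):
    """True if point a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives are minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented (cost, latency) pairs for five implementations.
points = [(10, 100), (20, 60), (30, 65), (40, 30), (15, 100)]
front = pareto_front(points)
```

This quadratic scan is adequate for configuration spaces of a few thousand points; it is meant as an illustration of the definition, not as the procedure used in the experiments.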
2.2 Graph neural networks
Input behavioural specifications (programs) differ significantly in terms of size, structure, constructs, and optimization directives. This variability is hardly captured by vector representations, which have been, we argue, one of the main limiting factors of previous attempts to learn predictive models of the HLS process [LiuJun13, WangJun20]. In contrast, when modeling the problem in the space of graphs we can use methods that naturally exploit existing functional dependencies, account for the variability in the structure of different specifications, and seamlessly transfer learned representations across different configuration spaces. Furthermore, graph-processing techniques make it possible to exploit the properties of graph representations (e.g., permutation invariance), inducing positive inductive biases that restrict the hypothesis space explored by the learning system to plausible models.
We consider attributed directed graphs, with attributes associated with both nodes and edges. In particular, a graph is a tuple $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{u})$, where $\mathcal{V}$ is a set of nodes, $\mathcal{E}$ is a set of edges, and $\mathbf{u}$ is a global attribute vector. We denote by $\mathbf{x}_i$ the raw features associated with node $i$ and by $\mathbf{e}_{ij}$ the attribute vector associated with the edge connecting nodes $i$ and $j$.
Different works propose general frameworks to design GNNs: inspired by gilmer2017neural and battaglia2018relational, we consider a very general class of message-passing neural networks (MPNNs) with global attributes, where each propagation layer (or step) is defined by update functions, message functions, and aggregation functions (Equation 1). The update and message functions can be implemented by any differentiable function, e.g., Multi-Layer Perceptrons (MLPs). The aggregation functions can be any permutation-invariant operation. Multi-layer GNNs are built by stacking several propagation layers, which allows each node to aggregate messages from different neighborhoods.
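To make the roles of the message, aggregation, and update functions concrete, here is a minimal, framework-free sketch of one propagation step on a small directed graph. The `toy_message` and `toy_update` functions are simple stand-ins for the MLPs mentioned in the text, and mean aggregation is just one possible permutation-invariant choice; none of this is the exact layer used in the paper.

```python
def mean_aggregate(messages, dim):
    """Permutation-invariant aggregation: element-wise mean (zeros if no messages)."""
    if not messages:
        return [0.0] * dim
    return [sum(col) / len(messages) for col in zip(*messages)]


def toy_message(h_src, h_dst, e):
    """Stand-in for a message MLP: sender features scaled by the sum of edge attributes."""
    weight = sum(e)
    return [weight * x for x in h_src]


def toy_update(h, m, u):
    """Stand-in for an update MLP: add the aggregated message and the mean global attribute."""
    bias = sum(u) / len(u)
    return [x + y + bias for x, y in zip(h, m)]


def propagate(nodes, edges, u, message, update, aggregate):
    """One message-passing step; edges are keyed by (source, destination)."""
    inbox = {i: [] for i in nodes}
    for (src, dst), e in edges.items():
        inbox[dst].append(message(nodes[src], nodes[dst], e))
    return {i: update(h, aggregate(inbox[i], len(h)), u) for i, h in nodes.items()}


nodes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = {(0, 2): [1.0], (1, 2): [1.0]}
u = [0.0]
out = propagate(nodes, edges, u, toy_message, toy_update, mean_aggregate)
```

Stacking several calls to `propagate` lets information from nodes that are several hops away reach a given node, which is the effect of stacking propagation layers.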
3.1 Data representation
Control Flow Graphs (CFGs) are graphs representing the possible execution paths in a program. The CFG is a directed graph whose nodes correspond to the basic blocks of the program and whose edges represent the possible control flows among basic blocks. An example of CFG for the sum_scan function is shown in Figure 2-(top). Traditionally, a basic block is defined as a consecutive sequence of instructions without incoming and outgoing branches, except for the first and last instructions of the block, respectively. In this work we adopt a different definition of basic block: we differentiate basic blocks according to the type of instructions they perform. We distinguish the following types of blocks: loop blocks, which include loop instructions; read blocks, which include a single load instruction from main memory; write blocks, which include a single store instruction to main memory; function blocks, which include a function invocation instruction; and standard blocks, which include instructions performing computations that do not belong to any of the above categories. This taxonomy makes the CFG representation finer-grained, increasing the number of blocks w.r.t. the traditional one. Figure 2-(bottom) shows the differences between the proposed CFG representation, with its higher block granularity, and the traditional one. This choice aims at avoiding the limitations of approaches relying on a vector-based representation of the program and directives [LiuJun13], while, at the same time, focusing only on the information that is most relevant from the HLS and DSE perspective. Notably, recent works highlight the effectiveness of adopting a similar taxonomy of basic blocks to capture similarities among points in different configuration spaces [ferretti2020leveraging].
Compared to previous works in program analysis (e.g., li2019graph), we include in our CFG representation only coarse-grained information on the types of operations performed in each block. In addition to the node type, the attribute vector associated with each node includes block-type related information, such as the number of instructions in a block, the number of iterations of a loop block, the presence of loop-carried dependencies, and other information extracted through static and dynamic code analysis performed using custom LLVM [lattner2004llvm] compiler passes (more details in Section 4). While CFGs contain information about the execution flow of a program, they do not model the flow of data and information. Data Flow Graphs (DFGs) address this aspect. DFGs are used to represent the dependencies between the instructions of a program. In particular, they represent the use-def chains among variables in the program execution. The DFG is a directed graph whose nodes correspond to the instructions in the input source code and whose edges represent the possible data flows among instructions. The DFG representation for the running example function is shown in Figure 2-(top).
Hybrid Control Data Flow Graph
HLS tools use instrumented Control Flow Graphs and Data Flow Graphs as program representations to decide how to implement the design functionality in hardware. In our approach we aim at using a similar representation directly as input to a learning method. We propose a graph representation of the software description that includes both CFG and DFG information. In particular, we augment the CFG representation by adding data flow edges and nodes representing the input and output parameters of the function. Data flow edges are added among the nodes involving operations that affect the input and output parameters. These edges are identified by tracking the def-use chains among parameter variables in the DFG and embedding them in the CFG. In addition, edges between the parameter nodes and the CFG read and write blocks are added to the set of edges (param flow). We refer to this representation as the Hybrid Control Data Flow Graph (hybrid CDFG) and we use it as input representation in our methodology. To sum up, the hybrid CDFG consists of a set of nodes corresponding to basic blocks and function parameters, and a set of attributed edges representing the control, data, and parameter flows. Each edge attribute vector is a one-hot encoding of the edge type. Lastly, the feature vector associated with each node contains a one-hot encoding of the node type, the node attributes, and the value of each optimization directive. Figure 2-(bottom) shows the Hybrid Control Data Flow Graph representation for the running example function.
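The construction described above can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the node and edge vocabularies are inferred from the prose (five block types plus parameter nodes; control, data, and param edges), and the block names, attribute layout (`[num_instructions, trip_count]`), and directive values are hypothetical.

```python
NODE_TYPES = ["loop", "read", "write", "function", "standard", "param"]
EDGE_TYPES = ["control", "data", "param"]


def one_hot(item, vocabulary):
    """One-hot encode item against a fixed vocabulary."""
    if item not in vocabulary:
        raise ValueError(f"unknown type: {item}")
    return [1.0 if item == v else 0.0 for v in vocabulary]


def build_hybrid_cdfg(blocks, control_edges, data_edges, param_edges):
    """Return node features and typed, attributed edges of a hybrid graph."""
    nodes = {
        name: one_hot(kind, NODE_TYPES) + list(attrs) + list(directives)
        for name, (kind, attrs, directives) in blocks.items()
    }
    edges = {}
    typed = {"control": control_edges, "data": data_edges, "param": param_edges}
    for kind, pairs in typed.items():
        for src, dst in pairs:
            edges[(src, dst, kind)] = one_hot(kind, EDGE_TYPES)
    return nodes, edges


# name -> (block type, [num_instructions, trip_count], [directive value]); all hypothetical.
blocks = {
    "bucket": ("param", [0, 0], [4]),
    "read_bucket": ("read", [1, 0], [0]),
    "loop_1": ("loop", [3, 128], [2]),
    "write_sum": ("write", [1, 0], [0]),
    "sum": ("param", [0, 0], [1]),
}
nodes, edges = build_hybrid_cdfg(
    blocks,
    control_edges=[("loop_1", "read_bucket"), ("read_bucket", "write_sum"), ("write_sum", "loop_1")],
    data_edges=[("read_bucket", "write_sum")],
    param_edges=[("bucket", "read_bucket"), ("write_sum", "sum")],
)
```

Keying edges by `(source, destination, type)` allows a control edge and a data edge to coexist between the same pair of blocks, as happens between `read_bucket` and `write_sum` here.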
3.2 Graph neural networks for high-level synthesis design-space exploration
Given a graph representing a program annotated with optimization directives, we process it with message-passing neural networks by parametrizing the computation of messages exchanged between neighboring nodes of the graph with MLPs. A schematic of the model is shown in Figure 3: we first describe each computational block in detail and then discuss the training procedure.
The Encoder block maps node, edge, and global graph features into a first hidden representation without performing any message-passing operation. The Encoder is implemented by using standard MLPs as update functions, since using feed-forward layers before message passing has been shown to be beneficial to final performance in GNNs [you2020design]. The Encoder block is followed by a stack of propagation layers.
Propagation layers are instantiated in the message-passing framework shown in Equation 1. In particular, the node update and message functions are implemented as MLPs applied to the concatenation of their inputs, and incoming messages are aggregated by averaging them. To update the global representation we use an MLP as update function that takes as input the concatenation of the global attributes at the previous propagation step and node attributes aggregated by exploiting the attention mechanism [vaswani2017attention, velickovic2018graph]. In particular, we use an MLP to compute a raw attention score for each node attribute vector given the global features; raw scores (i.e., logits) are then normalized over the nodes of the graph with a softmax function. Normalized node scores are used to aggregate node features processed by a third MLP: the processed node features are multiplied element-wise by the normalized scores and then summed over the graph. In practice, multiple attention heads can be used in parallel for increased model capacity.
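The attention-based global update can be illustrated with a small, framework-free sketch. It assumes one scalar score per node and a single attention head; `dot_score`, `identity`, and `concat_update` are toy stand-ins for the three MLPs described in the text, not the functions used in the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_readout(node_feats, u, score_fn, value_fn):
    """Score each node given u, normalize over nodes, then sum the weighted node values."""
    weights = softmax([score_fn(h, u) for h in node_feats])
    values = [value_fn(h) for h in node_feats]
    dim = len(values[0])
    pooled = [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
    return weights, pooled


def update_global(u, pooled, update_fn):
    """Feed the concatenation of the previous global vector and the pooled features to update_fn."""
    return update_fn(list(u) + list(pooled))


def dot_score(h, u):
    return sum(a * b for a, b in zip(h, u))


def identity(h):
    return list(h)


def concat_update(z):
    return z


weights, pooled = attention_readout([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], dot_score, identity)
new_u = update_global([0.0, 0.0], pooled, concat_update)
```

With a zero global vector every node receives the same score, so the readout reduces to a plain average; a non-zero global vector lets the model emphasize some nodes over others.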
After the message-passing blocks, node representations are pooled into a single vector using a permutation-invariant aggregation function (e.g., by taking the sum or the average of node representations, optionally weighted by learned attention scores). The pooled representation is then concatenated with the global attributes, leading to a vector representation of the input graph. This feature vector is fed through a last MLP, which maps it to a prediction of latency and resources, as shown in Figure 3.
Training procedure and transfer learning
We train the GNN by supervised learning to predict the outcome of the HLS procedure. We use data from several synthesized designs, with program specifications relevant to different domains (we refer to Section 4 for more details). While learning to predict the outcome of the synthesis process for configuration spaces already partially explored is interesting from a research perspective, we are interested in assessing the possibility of exploiting the model for DSE. In particular, we aim at assessing whether the learned representation can support transfer to different domains (designs) when only a few samples, or none, from the target configuration space are available. We note that our method differs from previous approaches, which are usually domain specific and tied to the characteristics of the target design space. Instead, our methodology is general and can easily incorporate knowledge from different design spaces by simply including synthesized points in the training dataset.
4 Experimental evaluation
The graph representation of HLS designs and the associated pragma values are generated by combining LLVM [lattner2004llvm] compiler passes, Clang Abstract Syntax Tree (AST) analysis, the Frama-C [Couq2021framac] internal representation of program dependencies, and HLS synthesis information from a recently published database of HLS-driven DSEs [ferretti2021db4hls]. The custom LLVM pass generates the CFG representation from the compiler Intermediate Representation (IR) and performs static program analysis to identify the block properties. In order to account for the information lost in the LLVM IR representation, an AST visitor extracts the software information and maps it to the CFG blocks. Data flow information is extracted from the Frama-C program dependency analysis and used to generate the data flow edges of the hybrid CDFG. We generated graph representations from different functions in MachSuite [reagen2014machsuite]. The considered functions include a wide range of computationally intensive applications such as matrix-matrix multiplication, sparse matrix-vector multiplication, sorting algorithms, stencil computations, molecular dynamics, and forward passes of fully-connected neural networks. For each design, we used configurations and synthesis results available from db4hls [ferretti2021db4hls], an open source database of HLS DSEs. The total number of configurations considered in this work is
. Configuration spaces contain from several hundreds, up to several thousand design points. For the graph neural network, we use as global graph attributes the number of LLVM instructions, the number of input parameters, and the average value of each directive set within the configuration minus the mean value of the directive sets computed over the entire configuration space. To increase robustness to outliers, we also concatenate to the representation the same values minus the median.
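The centered global attributes described above admit more than one reading. The sketch below implements one plausible interpretation: for each directive type, the average value set within a configuration, minus the mean (and, separately, the median) of that average over the whole configuration space. The data layout and the example space are hypothetical.

```python
from statistics import mean, median


def centered_directive_features(config, space):
    """config and each element of space map a directive type to the list of values set for it.

    Returns, for each directive type (sorted by name), the per-configuration average
    minus the space-wide mean, followed by the same averages minus the space-wide median.
    """
    features = []
    for center in (mean, median):
        for kind in sorted(config):
            per_config = [mean(c[kind]) for c in space]
            features.append(mean(config[kind]) - center(per_config))
    return features


# Hypothetical configuration space with a single directive type applied to two loops.
space = [{"unroll": [1, 1]}, {"unroll": [2, 4]}, {"unroll": [8, 8]}]
features = centered_directive_features(space[1], space)
```

In this example the space-wide mean (4) is pulled up by the aggressively unrolled configuration, while the median (3) is not, which is the kind of outlier robustness the median-centered copy of the features is meant to provide.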
In the following, we first perform an experiment to assess the accuracy of our model in inferring the performance and costs of unseen directive combinations, and then we switch to the DSE setting. Hyperparameters of the models and the full experimental setup are provided in the supplementary materials.
4.1 Performance and cost estimation
We split the synthesized configuration space of each function in three folds, keeping of the available points for training, for validation, and for testing. Selecting a proper baseline for comparison is not easy, since none of the approaches existing in the literature can easily be extended to our setting. The most similar approach is the one introduced by jhye2020transfer, where an MLP is trained in a multi-domain setting. However, their approach, based on multi-task learning [caruana1997multitask], relies on training different input and output layers for each domain, hence limiting flexibility. Furthermore, they use only optimization directives as input and predict normalized scores for performance and costs instead of the actual latency and resources. For these reasons we consider a DeepSets [zaheer2017deep] model to be a more appropriate baseline: in practice, we use a node-level MLP followed by a permutation-invariant aggregation and a second MLP to process the aggregated features. The network architectures were chosen to have a similar model complexity. The boxplot in Fig. 6 shows, for each figure of merit, the median, quartiles, and interquartile range of the mean absolute percentage error (MAPE) of the two models over the 23 functions in the dataset. Results, averaged over independent runs, show that our model drastically outperforms the baseline by achieving an average MAPE, averaged over performance and costs estimates, of against : an improvement of over . Furthermore, the performance obtained by our model is qualitatively similar to that of SoA simulators, which exploit an analytical model of the HLS process: HLScope+ [choi2017hlscope+] for latency and MPSeeker [zhong2017design] for resources. (Footnote 2: A direct comparison with these methodologies was not possible since they were not available open source. Thus, we compare against the performance reported in the original papers.)
In particular, latency estimation performance is comparable to the best from the SoA (1.1% MAPE of HLScope+ vs. 2.1% of gnn4hls), while our area estimation outperforms that of existing models: gnn4hls achieves 4.8%, 2.6%, and 1.3% estimation error for FFs, LUTs, and DSPs, compared to 14.7%, 13.2%, and 12.7%, respectively, from MPSeeker. In addition, inference time is greatly reduced, from seconds to tens of milliseconds, w.r.t. SoA alternatives. In particular, the network used here requires to process a single point from the get_delta_matrix_weights1 function on an Intel(R) Xeon(R) Silver CPU.
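For reference, the MAPE used as the error measure above can be computed as in the following minimal sketch. The sample values are invented and are unrelated to the reported results.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; true values must be non-zero."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Invented latencies (in clock cycles) and predictions for three configurations.
error = mape([100, 200, 400], [110, 190, 400])
```

Because each error is normalized by the true value, MAPE is comparable across figures of merit with very different scales, such as latency and LUT counts.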
4.2 Ablation study
To evaluate the effectiveness of the proposed graph representation, we performed an ablation study on the graph edges. Fig. 6 shows the results obtained with three different ablations of the graph structure: no data flow edges, no param edges, and both data flow and param edges removed. For all representations, the same types of nodes and attributes have been considered. Results show that the proposed Hybrid CDFG representation leads to the best results. Note that the DFG can largely be inferred from param edges, which explains the similar performance of this setting.
4.3 Design-space exploration
The second set of experiments addresses DSE. Herein, we focus on approximating the Pareto frontier of a target HLS design, given the synthesis outcomes of all the considered functions but the target one (i.e., we perform a leave-one-out evaluation w.r.t. the available functions). We consider the setting where the designer performs an initial naïve random sampling of the configuration space, uses the synthesized points to fine-tune the model, and then uses the model's estimates over the configuration space to approximate the Pareto curve.
In particular, we select only the configurations expected to be Pareto-optimal by considering as cost the weighted sum of the utilized resources. We assessed the quality of the DSEs by measuring the Average Distance from Reference Set (ADRS) [schafer2020survey, ferretti2020leveraging, jhye2020transfer] between the real Pareto solutions and those estimated to be Pareto-optimal. A low ADRS value implies a close approximation of the real Pareto frontier. To increase robustness to prediction errors, we iteratively select candidate Pareto-optimal points by removing the already selected configurations from the configuration space and recomputing the Pareto curve up to times. Results are shown in Figure 6. In particular, we compare the performance of the fine-tuning approach against the current state-of-the-art one on the considered dataset, namely the prior-knowl. approach [ferretti2020leveraging] (see Section 5). For the fine-tuning procedure we use a maximum of points from the target domain, capped at of the configuration space dimension; we fix the number of SGD updates to , with a batch size of . We also compare the performance in the zero-shot setting (no fine-tuning).
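The two ingredients of this evaluation, ADRS and the iterative selection of successive Pareto fronts, can be sketched as follows. The ADRS formula used here is a common formulation from the HLS DSE literature (average, over the reference points, of the smallest normalized worsening among the approximate points); the paper does not spell out its exact variant, so this is an assumption, and the sample points are invented.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Points not dominated by any other point (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def adrs(reference, approximation):
    """Average distance from the reference set; 0 means the reference front is matched."""
    def distance(ref, approx):
        return max(0.0, *((a - r) / r for r, a in zip(ref, approx)))
    return sum(min(distance(r, a) for a in approximation) for r in reference) / len(reference)


def iterative_pareto(points, rounds):
    """Select up to `rounds` successive fronts, removing selected points after each round."""
    remaining, selected = list(points), []
    for _ in range(rounds):
        front = pareto_front(remaining)
        if not front:
            break
        selected.extend(front)
        remaining = [p for p in remaining if p not in front]
    return selected


# Invented (cost, latency) predictions for five configurations.
predicted = [(10, 100), (20, 60), (30, 65), (40, 30), (15, 100)]
first_front = iterative_pareto(predicted, rounds=1)
two_fronts = iterative_pareto(predicted, rounds=2)
score = adrs([(10, 100), (20, 60), (40, 30)], [(10, 100), (22, 60), (40, 30)])
```

Selecting more fronts trades additional synthesis runs for a better chance that true Pareto points mispredicted as slightly dominated are still synthesized.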
Figure 6 shows the performance obtained by the fine-tuned approach (averaged over independent runs) w.r.t. prior knowl. and the zero-shot model. As previously mentioned, we consider up to the Pareto-front and compare against the reported results for prior knowl., which iterates up to the frontier. We considered only the portion of the configuration space that is actually synthesizable (i.e., we considered only configurations present in the database). Results show the distributions of the ADRS across the considered functions. In particular, the fine-tuned model obtains Pareto-frontier approximations comparable to the state of the art, while reducing the number of outliers compared to prior knowl. and obtaining an average ADRS of vs : a remarkable improvement. This result is particularly appealing given that our approach neither uses any heuristic to perform the initial sampling, nor relies on domain knowledge provided by the designer. Furthermore, our method provides the user with performance and cost estimates of the candidate configurations, which could be instrumental in further reducing the number of syntheses required to obtain the desired performance and satisfy hardware constraints. Figure 8 shows how the ADRS scores change w.r.t. the number of iteratively estimated Pareto frontiers during the DSE. While the ADRS decreases exponentially (plot in logarithmic scale along the axis), the number of required syntheses grows linearly. Compared to prior knowl., our approach requires a higher number of syntheses. However, we expect this gap to narrow significantly with a larger dataset: the performance of the zero-shot approach, in fact, should be considered in light of the fact that our dataset contains only different designs. Finally, Figure 8 compares the results of gnn4hls and prior knowl. across the different classes of applications considered in this work (see the appendix for details on the taxonomy and the complete set of results). Our method shows lower variance but a higher mean in the ML designs, and lower ADRS in the linear algebra designs, which are relevant for deep learning.
5 Related works
Design Space Exploration approaches in High-Level Synthesis
In past years, the hardware design community has proposed different works to address the HLS-driven DSE problem. A recent survey by schafer2020survey summarizes them. Among these works we can identify two main categories: model-based approaches and refinement-based ones. Model-based approaches [ZhongDec14, PhamMar15, ZhongJun16, ZhaoOct17, choi2017hlscope+] rely on estimates of the performance and resource requirements of a given optimization. These approaches require very few synthesis runs to approximate the Pareto frontier, but often have difficulties dealing with multiple, interdependent optimizations. Conversely, refinement-based methodologies are agnostic to the number and types of directives considered, and rely on the outcome of a few heuristically sampled synthesis runs as a starting point for DSE. After the initial syntheses, these methods aim to improve the initial solutions using different strategies such as genetic algorithms [SchaferMay12], simulated annealing [mahapatra2014machine], clustering [FerrettiJan18], or local search techniques [ferretti2018lattice]. Refinement-based approaches are not limited by the number and type of pre-characterized synthesis directives, but usually converge more slowly to the Pareto frontier than model-based approaches. Taking a different stance, more recently ferretti2020leveraging have proposed a DSE strategy able to map the results of past DSEs targeting different designs to unseen ones. The proposed approach searches for similarities in the source code of the already explored designs and, based on the results of past DSEs, decides how to optimize the target one. This approach has been used as a baseline for the DSE performance comparison in Section 4.3. Similarly, a recent work from jhye2020transfer proposes a neural network model for mixed-sharing multi-domain transfer learning, which transfers the knowledge obtained from previously explored design spaces to the exploration of a new target one.
Graph neural networks for hardware/software design.
The first GNNs date back to the models developed by gori2005new and scarselli2008graph, and to earlier ideas on how to process graph-structured data with recurrent neural networks [sperduti1997supervised, frasconi1998general]. In recent years, the field of graph deep learning has surged in popularity, and several architectural improvements and variants of GNNs have been widely adopted by the community [li2016gated, kipf2017semi, monti2017geometric, hamilton2017inductive, velickovic2018graph, bianchi2021graph]. Among their many applications to structured data processing, GNNs have been widely used as the learning system of choice in software engineering to automate program analysis and code optimization [si2018learning, li2019graph, bieber2020learning, zhou2020transferable]. Furthermore, graphs have also been used to capture the structure of hardware architectures. Notably, mirhoseini2020chip used a GNN to learn a transferable representation to tackle the critical hardware design problem of chip placement. Finally, thanks to the flexibility of graph representations, GNNs have been used to automate design processes in many other areas of science and engineering: prime examples are the discovery of new molecules [li2018learning, you2018graph] and automatic robot design [wang2019neural, zhao2020robogrammar].
6 Conclusion and future works
In this work we presented gnn4hls, a graph-based learning framework for HLS-driven DSE. Compared with the state of the art, our method offers tools that are general and can easily be applied to any configuration space. A key aspect of gnn4hls is its simplicity relative to its effectiveness. We show that our method compares favorably against the state of the art w.r.t. both quantitative and qualitative metrics. In the future, we plan to investigate more advanced solutions from the few-shot learning literature to improve transfer to unseen domains, and to consider a finer-grained representation of basic blocks. We also argue for the possibility of replacing the existing heuristics used to perform DSE with an exploration policy learned by (model-based) reinforcement learning. To support breakthroughs in this direction, we renew our invitation to the community to participate in a common effort to collect datasets that fully enable the application of deep learning in the context of HLS-driven DSE. We believe that this work represents a milestone for the application of graph deep learning in EDA and a significant step towards the fully automated design of hardware accelerators with no human in the loop.
Appendix A Dataset and experimental set-up
In this section we provide details on the dataset and the experimental setting used for the experiments presented in the paper. All experiments were carried out on a server with Intel Xeon Silver CPUs and Nvidia Titan Xp/V GPUs.
For developing the models and the infrastructure to run experiments we relied on the following open-source tools and libraries:
PyTorch Geometric [fey2019fast];
PyTorch Lightning [falcon2019pytorch];
The code used to run the experiments is attached as supplementary material and it will be released on GitHub upon publication.
A.1 Dataset description
The synthesis data used to train and test the model come from the db4hls database [ferretti2021db4hls]. The database includes a collection of DSEs performed for HLS designs of the MachSuite benchmark suite [reagen2014machsuite]. In this work we use the data collected in the database according to the configuration space definition described by ferretti2021db4hls. The performed DSEs include up to five different types of pragmas: resource type, array partitioning type, array partitioning factor, loop unrolling, and function inlining. The pragma values specified for array partitioning factor and loop unrolling are limited to powers of two or integer divisors of the input/output array sizes and loop trip counts. We consider only a subset of the functions available in db4hls. In particular, we select only HLS designs that do not require the allocation of arrays or structs. This restriction is due to the LLVM compiler analysis performed to extract the CFG representation of the software specifications, and it will be addressed in future work. The original designs from MachSuite are compiled with the lowest optimisation level (-O0). Then, we process the intermediate representation (IR) produced by LLVM through a custom LLVM pass in order to generate the CFG and distinguish among the different types of blocks. Due to some LLVM optimisations performed on the IR, memory allocations are in some cases transformed into function calls, generating an undesired code transformation that affects the resulting CFG: additional call blocks are generated without an explicit function invocation in the original specifications. We aim to address this limitation in the near future.
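The constraint on array partitioning factors and loop unrolling factors can be illustrated with a short sketch that enumerates the admissible values for a given array size or loop trip count. The function name `candidate_factors` and the choice to keep only factors no larger than the size are assumptions made for this example; they are not taken from db4hls.

```python
from typing import List


def candidate_factors(size: int) -> List[int]:
    """Return the factors that are powers of two or integer divisors of `size`.

    Assumption of this sketch: only factors in the range [1, size] are kept.
    """
    if size < 1:
        raise ValueError("size must be a positive integer")
    # Integer divisors of the array size or loop trip count.
    divisors = {f for f in range(1, size + 1) if size % f == 0}
    # Powers of two that do not exceed the size.
    powers_of_two = set()
    p = 1
    while p <= size:
        powers_of_two.add(p)
        p *= 2
    return sorted(divisors | powers_of_two)


if __name__ == "__main__":
    # Admissible factors for a loop with 12 iterations.
    print(candidate_factors(12))
```

For a trip count of 12, the divisors contribute 3, 6 and 12, while the powers of two contribute 8 in addition to the shared values 1, 2 and 4.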
In addition to the block type, the LLVM pass extracts block-specific information, which is used to populate the attribute vectors representing each node of the hybrid control data flow graph. To extract this information, we analyse the LLVM IR, the Clang abstract syntax tree (AST) and the source code. An LLVM pass generates the intermediate representation, which is first used to identify the basic blocks in the code and the traditional CFG representation. Then, for each instruction in the basic blocks, we check the type of operation performed and discriminate among the different types of blocks. Basic blocks that include store instructions are split, separating each store instruction from its predecessor and successor instructions. The newly created store block is then connected to the new basic blocks resulting from the split. Read blocks are generated in the same way. Loop blocks are identified from the IR, and loop information is extracted to model loop-carried dependencies, loop stride, and loop trip count. Similarly, call blocks are identified from the IR; in this case, the number of parameters, the number of function invocations, and the number of LLVM instructions of the invoked function are extracted and included in the attribute vector. For the remaining uncategorized standard blocks, the number of LLVM instructions they execute is extracted and added to the attribute vector.
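The splitting of basic blocks around memory operations can be sketched as follows. This is a simplified illustration that works on a plain list of opcode names rather than on real LLVM IR; the function names, the block labels (`read`, `store`, `standard`) and the assumption that `load` instructions produce read blocks are choices made for this example and do not reproduce the authors' LLVM pass.

```python
from typing import List, Tuple

# A block is a (block type, list of opcodes) pair in this sketch.
Block = Tuple[str, List[str]]


def split_basic_block(opcodes: List[str]) -> List[Block]:
    """Isolate every load/store of a basic block into its own block."""
    blocks: List[Block] = []
    standard: List[str] = []
    for op in opcodes:
        if op in ("load", "store"):
            # Close the run of ordinary instructions preceding the memory op.
            if standard:
                blocks.append(("standard", standard))
                standard = []
            blocks.append(("read" if op == "load" else "store", [op]))
        else:
            standard.append(op)
    if standard:
        blocks.append(("standard", standard))
    return blocks


def chain_edges(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Connect consecutive blocks produced by the split, preserving order."""
    return [(i, i + 1) for i in range(len(blocks) - 1)]


if __name__ == "__main__":
    blocks = split_basic_block(["load", "add", "mul", "store", "ret"])
    print(blocks)
    print(chain_edges(blocks))
```

The edges returned by `chain_edges` only reconnect the pieces of one original basic block; edges to other basic blocks of the CFG would be inherited from the first and last piece.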
Parameter blocks, instead, are generated by combining information from the LLVM IR and the AST analysis. In particular, we extract the parameter type (pointer or value), the parameter data type (i.e., the size of the data type), and the number of elements of pointer parameters. This last piece of information is required by the HLS tools in order to know the memory required by the hardware accelerator. However, since it is not preserved in the LLVM IR, we extract it from a joint analysis of the AST and the source code. To avoid large numerical differences among attributes, we use a logarithmic scale for the number of LLVM instructions, loop trip counts, array partitioning factors, data types, and their related directive values.
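The logarithmic rescaling of attributes can be sketched in a few lines. The attribute names below are hypothetical, and the use of a base-2 logarithm is an assumption of this example, since the base is not stated in the text.

```python
import math
from typing import Dict

# Attributes rescaled in this sketch; the key names are hypothetical.
LOG_SCALED = {"num_instructions", "trip_count", "partition_factor", "data_type_size"}


def scale_attributes(attrs: Dict[str, float]) -> Dict[str, float]:
    """Apply a base-2 logarithm to the selected attributes, leave the rest as is."""
    scaled: Dict[str, float] = {}
    for name, value in attrs.items():
        if name in LOG_SCALED:
            if value <= 0:
                raise ValueError(f"{name} must be positive to be log-scaled")
            scaled[name] = math.log2(value)
        else:
            scaled[name] = float(value)
    return scaled


if __name__ == "__main__":
    print(scale_attributes({"trip_count": 1024, "is_pointer": 1}))
```

With this transformation a trip count of 1024 and a partitioning factor of 2 differ by a factor of ten instead of several hundred, which keeps the node attributes in comparable ranges.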
Table 1 reports the designs considered in this work. For each function we show: the application domain (used to group the applications in Figure 8), the original benchmark name in MachSuite, the HLS design name, the size of the design in terms of lines of code, the types of pragmas considered during the exploration, and the configuration space size.
|Domain||Benchmark||HLS design||Lines of code||Type of pragmas||Configuration space size|
A.2 Global attributes
In addition to the attribute vectors associated to each node, we introduce a global representation of the configuration with respect to its configuration space.
Given a design and its associated configuration space, we define as the vector having for each component the average among the directive values set associated to each pragma type. As an example, given a configuration space having directive value sets with 2 resource types (one-hot encoded), 2 partitioning types (one-hot encoded), partitioning factors of , , , , unrolling factor of , , , and 2 options for function inlining (one-hot encoded), the resulting configuration space vector will be = . Then, given a specific configuration, we generate the vector as the average vector among directive values of the same type in that particular configuration. Similarly, we generate vectors and using the median among the directive set values instead of the mean. Finally, given the number of instructions in the function and the number of input parameters we define the global attribute vector of the configuration as:
where is the vector concatenation operation. Intuitively, this vector representation captures how much a configuration leans toward a region of the design space. In practice, the vectors and are also normalized so that each element has unitary variance across the configuration space.
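The construction of the global attribute vector can be sketched as below. This is a minimal illustration under several assumptions: only numeric pragma types are handled (one-hot encoded types are left out), the order of the concatenated parts is chosen arbitrarily, the final unit-variance normalization is omitted, and all function and key names are hypothetical.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Sequence


def summarize(values_by_type: Dict[str, Sequence[float]],
              reduce: Callable[[Sequence[float]], float]) -> List[float]:
    """Apply `reduce` to the values of each pragma type, in sorted key order."""
    return [float(reduce(values_by_type[k])) for k in sorted(values_by_type)]


def global_attributes(space: Dict[str, Sequence[float]],
                      config: Dict[str, Sequence[float]],
                      n_instructions: int,
                      n_parameters: int) -> List[float]:
    """Concatenate mean and median summaries of the space and of one configuration.

    `space` maps each pragma type to its full set of directive values;
    `config` maps each pragma type to the values used in one configuration.
    Both are assumed to contain the same pragma types.
    """
    return (summarize(space, mean) + summarize(config, mean)
            + summarize(space, median) + summarize(config, median)
            + [float(n_instructions), float(n_parameters)])


if __name__ == "__main__":
    space = {"partition_factor": [1, 2, 4, 8], "unroll_factor": [1, 2, 4, 8]}
    config = {"partition_factor": [2, 8], "unroll_factor": [4]}
    print(global_attributes(space, config, n_instructions=120, n_parameters=3))
```

Comparing the configuration summaries with the space summaries is what indicates how far a configuration leans towards one region of the design space.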
A.2.1 Design space exploration performance metric
The results of the DSEs have been evaluated in terms of Average Distance from Reference Set (ADRS). The ADRS metric quantifies the distance between a reference curve , and an approximated one . In our case, the reference curve is the ground-truth Pareto frontier computed over the synthesized designs available in the dataset, while the approximated one is the Pareto frontier resulting from the DSE performed by using the trained model. The ADRS for two objective functions is defined as:
where and are the area and latency of an element of the reference Pareto frontier, while and are the area and latency of an element of the approximated one. Intuitively, lower ADRS values imply that the approximated curve is closer to the reference one. We consider as latency () the number of clock cycles required by the hardware implementation to execute the functionality and, as measure of the area (), the aggregated values of , , in the form of a linear combination of their utilisation.
This formulation yields a single metric for the area cost, which evaluates the overall utilisation of the resources required by an implementation (, , and ) with respect to the ones available on a specific FPGA (, , and ).
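The evaluation procedure can be sketched in code. The example below assumes one common formulation of ADRS from the DSE literature, in which each reference point is matched to the closest approximated point using the larger of the relative area and latency excess, clipped at zero; the exact variant used in this work may differ. The area aggregation with equal weights, the resource names and all function names are likewise assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (area, latency)


def area_score(used: Dict[str, float], available: Dict[str, float]) -> float:
    """Average utilisation ratio over resources (equal weights are an assumption)."""
    return sum(used[r] / available[r] for r in used) / len(used)


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point matches or beats on both objectives."""
    front = [p for p in points
             if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]
    return sorted(set(front))


def adrs(reference: List[Point], approximated: List[Point]) -> float:
    """Average, over reference points, of the distance to the closest approximated point."""
    if not reference or not approximated:
        raise ValueError("both fronts must be non-empty")
    total = 0.0
    for a_ref, l_ref in reference:
        total += min(max(0.0, (a - a_ref) / a_ref, (l - l_ref) / l_ref)
                     for a, l in approximated)
    return total / len(reference)


if __name__ == "__main__":
    designs = [(1.0, 10.0), (2.0, 5.0), (3.0, 6.0), (4.0, 2.0)]
    reference = pareto_front(designs)
    explored = pareto_front([(1.0, 10.0), (4.0, 2.0)])
    print(reference, adrs(reference, explored))
```

In this toy case the exploration misses the reference point (2.0, 5.0); the closest explored points are 100% worse on one objective, so the ADRS is one third.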
Appendix B Hyperparameters and additional results
In this section, we first provide details on the hyperparameters and architectures used for the different models and on how training was performed; then we show additional, more complete experimental results.
B.1 Performance and cost estimation experiment
We use a GNN with propagation blocks. All the MLPs required to implement the encoding, update and message functions are implemented as networks with a single hidden layer and ELU activation function [clvert2016elu]. The MLPs processing the node representations all have hidden units, while we use a width of to process global graph features, both in the propagation blocks and in the regression head. The MLPs computing node messages () have an additional linear layer after the nonlinear one. Finally, the MLP used to compute attention scores has a hidden layer of units and a number of output units equal to the number of attention heads, with LeakyReLU ( negative slope) activation as in [velickovic2018graph]. We use attention heads in parallel and concatenate their outputs before processing them with the global update function . For the baseline we use a node-level MLP with hidden layers with units each and ReLU activation function, followed by aggregation and a second global MLP with a single hidden layer with the same activation and number of neurons.
For training, we use the natural logarithm of the true values as targets and the mean absolute error as loss function. Models are trained for epochs with a batch size of without early stopping, and we take as final model the one that achieves the lowest validation error across the training epochs. For optimization we use the Adam optimizer [kingma2015adam] with an initial learning rate of and cosine annealing (with no restarts) as a schedule, with a minimum learning rate of . We also clip the gradient norm to a maximum value of to avoid learning instabilities. We did not use any form of regularization since we did not observe any sign of overfitting. We do not apply any additional scaling to the input features beyond the preprocessing steps described in A.1.
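The training ingredients described above can be sketched without any deep learning library. The example shows the closed-form cosine annealing schedule without restarts, gradient-norm clipping, and the mean absolute error computed against log-transformed targets. All numeric values in the usage example are placeholders, not the hyperparameters used in this work, and the function names are hypothetical.

```python
import math
from typing import List


def cosine_annealing_lr(epoch: int, total_epochs: int,
                        lr_max: float, lr_min: float) -> float:
    """Learning rate at `epoch` for cosine annealing with no restarts."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


def clip_by_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def mae_on_log_targets(predicted_logs: List[float], true_values: List[float]) -> float:
    """Mean absolute error between predictions and the natural log of the targets."""
    errors = [abs(p - math.log(t)) for p, t in zip(predicted_logs, true_values)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Placeholder values, chosen only to exercise the functions.
    for epoch in (0, 50, 100):
        print(epoch, cosine_annealing_lr(epoch, 100, lr_max=1e-3, lr_min=1e-5))
    print(clip_by_norm([3.0, 4.0], max_norm=1.0))
    print(mae_on_log_targets([0.0, 1.0], [1.0, math.e]))
```

Training on log-transformed targets makes the absolute error behave like a relative error on the original scale, which suits quantities such as latency that span several orders of magnitude.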
Table 2 and Table 3 show the estimation accuracy for gnn4hls and the DeepSets baseline, respectively. In particular, we show performance in terms of Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) for all the prediction targets of each design. Note that the MAPE of the DSP estimate is not defined for functions where DSP is not used; in such cases the tables report a error for both models. Results are averaged over independent runs; we do not report standard deviations for the sake of presentation clarity.
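The two accuracy metrics, and the reason why MAPE is undefined when a resource is not used, can be illustrated with a short sketch. Returning `None` for the undefined case is a convention of this example only; it is not the placeholder used in the tables.

```python
from typing import List, Optional


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: List[float], y_pred: List[float]) -> Optional[float]:
    """Mean Absolute Percentage Error, or None when any true value is zero.

    The percentage error divides by the true value, so it is undefined
    for targets equal to zero (e.g., a resource that is never used).
    """
    if any(t == 0 for t in y_true):
        return None
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    print(mae([10.0, 20.0], [12.0, 18.0]), mape([10.0, 20.0], [12.0, 18.0]))
    print(mape([0.0, 0.0], [0.0, 1.0]))
```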
|Function||gnn4hls ADRS||gnn4hls # of syntheses||prior-knowl. ADRS||prior-knowl. # of syntheses|
|ncubed||0.043 ± 0.042||146 ± 27||0.012||35|
|bbgemm||0.016 ± 0.018||148 ± 29||0.007||46|
|ellpack||0.032 ± 0.02||115 ± 21||0.034||65|
|hist||0.107 ± 0.199||119 ± 20||0.007||46|
|init||1.3 ± 2.36||61 ± 30||0.078||68|
|last_step_scan||0.211 ± 0.166||104 ± 19||0.004||90|
|sum_scan||0.030 ± 0.02||143 ± 17||0.136||25|
|local_scan||0.167 ± 0.24||88 ± 10||0.005||71|
|update||0.002 ± 0.002||97 ± 17||0.009||28|
|ss_sort||0.042 ± 0.02||42 ± 10||0.0005||21|
|stencil2d||0.193 ± 0.24||90 ± 20||0.015||46|
|stencil3d||0.115 ± 0.19||97 ± 28||1.88||16|
|knn||0.006 ± 0.004||284 ± 42||0.006||25|
|get_delta_matrix_weights1||0.087 ± 0.036||427 ± 73||0.002||139|
|get_delta_matrix_weights2||0.054 ± 0.017||525 ± 71||0.010||77|
|get_delta_matrix_weights3||0.087 ± 0.037||568 ± 88||0.030||222|
|product_with_bias_input_layer||0.005 ± 0.005||215 ± 36||3.560||3|
|product_with_bias_second_layer||0.0001 ± 0.001||49 ± 18||0||30|
|product_with_bias_output_layer||0.003 ± 0.013||71 ± 32||2.5E-5||24|
|take_difference||0.002 ± 0.005||224 ± 56||0.0002||8|
|get_oracle_activations1||0.048 ± 0.022||220 ± 29||2.907||67|
|get_oracle_activations2||0.013 ± 0.009||244 ± 41||0.051||19|
|update_weights||0.0002 ± 0.0002||63 ± 24||1.1E-5||3|
B.2 Design space exploration experiment
For the DSE experiments we pretrained a model for each of the available functions using a leave-one-out approach. Then we fine-tuned each pretrained model on the target domain as described in Section 4.3. The model architecture is the same as the one used for the previous experiment. In the fine-tuning stage we used Adam with a constant learning rate of and clipped the gradient norm to . To evaluate the impact of the initial random sampling, the fine-tuning runs were repeated times with different random seeds.
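The leave-one-out protocol can be sketched as follows: each function in turn is held out as the target domain, and the remaining ones form the pretraining set. The helper name is hypothetical and the three function names are used only as a small example.

```python
from typing import List, Tuple


def leave_one_out_splits(functions: List[str]) -> List[Tuple[str, List[str]]]:
    """Return one (target, pretraining functions) pair per available function."""
    return [(target, [f for f in functions if f != target])
            for target in functions]


if __name__ == "__main__":
    for target, sources in leave_one_out_splits(["ncubed", "bbgemm", "ellpack"]):
        # A model would be pretrained on `sources` and fine-tuned on `target`.
        print(f"pretrain on {sources}, fine-tune on {target}")
```

Each pretrained model therefore never sees syntheses of its target function before the fine-tuning stage.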
Table 4 reports the DSE results obtained by the gnn4hls framework compared against the prior-knowl. approach from ferretti2020leveraging. The table lists, for all the considered functions, the ADRS values obtained and the number of syntheses required by the two methodologies. For gnn4hls, results are averaged over different runs and we report the standard deviations.