1. Introduction

Domain-specific accelerators are one of the key solutions for continuing to increase performance and efficiency beyond the end of Moore’s law scaling (Esmaeilzadeh et al., 2012; Dally et al., 2020). These accelerators use only the minimal required resources, consume less power, and compute faster than general-purpose hardware (Horowitz, 2014). However, the design of such components is complex (Dally et al., 2020).
Modern big data and machine learning applications need to process huge and potentially distributed data sets under stringent requirements. Managing these data sets requires a combination of solutions to hide the communication latency and exploit the inherent data parallelism (Pilato et al., 2021). Researchers have proposed accelerators with local caches and private local memories for storing data on chip, while multiple channels help combine classic DRAM with non-volatile memories (NVM) for off-chip data. Memory architectures with intelligent data transfers can greatly improve such systems, but they require specialization based on the application (Mutlu, 2020).
On the one hand, domain-specific languages like Spatial (Koeplinger et al., 2018) can abstract memory operations while remaining hardware-oriented, but they lack a complete tool flow for porting software-oriented algorithms to hardware. On the other hand, high-level synthesis (HLS) automatically generates hardware modules from high-level descriptions (Nane et al., 2016; Cong et al., 2011), but memory optimization is still an open problem (Pilato et al., 2017). This line of research proposes a compiler-based approach for optimizing accelerator memories on top of traditional HLS. The main idea is to use domain-specific annotations to pass useful information to the compiler, transform the intermediate representations, and interface directly with modern HLS tools.
2. High-Level Synthesis: The Present
High-level synthesis raises the abstraction level, enabling high-level, software-like methods for hardware design. Modern HLS tools build on state-of-the-art compilers, most commonly GCC or LLVM, to extract a language-agnostic intermediate representation from common software languages (Nane et al., 2016). Using compiler frontends also allows designers to apply common compiler transformations like constant propagation, dead-code elimination, and loop transformations. In the following phase, the HLS engine determines how to distribute the operations over time (scheduling) and over the hardware resources (allocation and binding). These steps determine the hardware architecture of the controller, which governs the evolution of the circuit in each clock cycle, and the datapath, which contains the hardware resources and their interconnections.
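To make the scheduling step concrete, the following is a minimal sketch of as-soon-as-possible (ASAP) scheduling over a tiny dataflow graph. It is an illustrative toy with invented names, not the engine of any particular HLS tool: each operation is placed in the earliest clock cycle after all of its operand-producing operations have finished (assuming, for simplicity, single-cycle operations).

```python
def asap_schedule(ops):
    """ops: dict mapping operation name -> list of operand operation names.
    Returns a dict mapping each operation to its earliest clock cycle."""
    cycle = {}

    def visit(name):
        if name not in cycle:
            deps = ops[name]
            # An op with no dependencies starts at cycle 0; otherwise one
            # cycle after its latest-finishing operand.
            cycle[name] = (1 + max(visit(d) for d in deps)) if deps else 0
        return cycle[name]

    for name in ops:
        visit(name)
    return cycle

# y = (a * b) + (c * d): the two independent multiplies share cycle 0,
# and the addition follows in cycle 1.
dfg = {"m1": [], "m2": [], "add": ["m1", "m2"]}
schedule = asap_schedule(dfg)
```

Binding would then map operations scheduled in the same cycle onto distinct functional units; operations in different cycles may share a unit.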
Current HLS tools have a strong focus on the computational aspects, while the surrounding memory architecture is merely adapted to sustain the required data accesses. For data-intensive applications, the optimizations should instead focus on coordinating memory transfers and accesses, rather than on the actual computation. To do so, compilers need to integrate, propagate, and expose more data-related information. If passed to the HLS engine, this information can help specialize the memory architecture together with the accelerators.
3. Domain-Specific Memory Template
Specialized architectures are designed for a single accelerator, but the process is time consuming and must be repeated for each new design. Domain-specific architectures are more general, since their structure can be reused across multiple applications, but they sacrifice some performance. For the memory aspects of a hardware accelerator, we propose an approach in between: a domain-specific template that allows the specialization of particular components.
The lower part of Figure 1 shows the proposed template. It is composed of existing memory primitives, like caches, DMA engines, prefetchers, and multi-port memories. Given the area constraints, only part of the data can stay on chip, while the rest is stored in DRAM or non-volatile memories (either on the same device or remotely). On-chip data are stored in different memories based on the application data structures and on the expected access patterns. Irregular accesses can be implemented with custom latency-insensitive memory architectures (Minutoli et al., 2016). Data with regular accesses can be stored in fixed-latency private local memories (PLMs), customized with multi-bank configurations to expose a large number of ports to the accelerator logic. Data reuse buffers can remove unnecessary data transfers. Data accesses with a certain degree of locality can benefit from architectures featuring caches that are local or shared with the processor by means of a coherence protocol (Shao et al., 2016; Mantovani et al., 2020). The template also features a direct memory access (DMA) engine to make data transfers more efficient and a prefetcher that anticipates known data transfers to hide the communication latency. These IP blocks can be augmented with special functions, like data protection (e.g., encryption) or application-specific transformations (e.g., matrix transpose).
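As a small illustration of the multi-bank PLM idea, a cyclic partitioning maps element i of an array to bank i mod NB at local offset i div NB, so that NB consecutive elements land in distinct banks and can be read in the same cycle through independent ports. The sketch below assumes this simple scheme; the function names are ours, not a tool's API:

```python
# Cyclic multi-bank partitioning sketch (illustrative only; real flows
# choose the banking factor from the observed access pattern).

def bank_of(i, num_banks):
    """Bank holding array element i under cyclic partitioning."""
    return i % num_banks

def offset_of(i, num_banks):
    """Local address of element i inside its bank."""
    return i // num_banks

# With 4 banks, elements 0..3 hit four distinct banks, so an accelerator
# with 4 read ports can fetch them all in one cycle.
banks_hit = {bank_of(i, 4) for i in range(4)}
```

A block (rather than cyclic) partitioning would instead use i div (N/NB) as the bank index, which suits strided accesses better; choosing between the two is exactly the kind of decision the access-pattern analysis drives.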
This template is general enough to be reused across multiple applications, but it can also be specialized based on the accelerator characteristics. For instance, we can vary the number of ports of a multi-bank memory based on the specific access patterns of the application. Components can also be removed when they are unnecessary: if the data reside entirely on chip, the prefetcher can be removed, and if there is only a single memory, the multi-channel controller can be simplified. We propose a compiler-based approach to progressively refine this template.
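A hedged sketch of this pruning step, with invented component names and boolean flags standing in for the real application analysis:

```python
# Template specialization sketch: start from the full memory template and
# drop components a given application does not need. Names are illustrative,
# not an actual tool's configuration format.

FULL_TEMPLATE = {"plm", "cache", "dma", "prefetcher", "multi_channel_ctrl"}

def specialize(template, all_data_on_chip=False, single_memory=False):
    t = set(template)
    if all_data_on_chip:
        # No off-chip transfers: DMA engine and prefetcher are unnecessary.
        t -= {"dma", "prefetcher"}
    if single_memory:
        # A single external memory needs no multi-channel controller.
        t.discard("multi_channel_ctrl")
    return t
```

In the actual flow these decisions would be taken by compiler passes over the annotated IR rather than by explicit flags.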
4. Specialization of the Memory Template
To achieve better performance and reduce costs, designers can specialize the memory template for the given accelerator. Our approach builds on the idea of platform-based design (Sangiovanni-Vincentelli and Martin, 2001): the memory template is refined in stages, starting from the general organization of the data in memory down to the actual interaction with the accelerator. The upper part of Figure 1 shows our compiler-based customization flow.
Intermediate Representation. The compiler infrastructure needs to carry more hardware-related information. We target novel multi-level representations, like MLIR (Lattner et al., 2020), to include this information early in the compilation flow and to make progressive refinements of the architecture at the proper levels of abstraction. A novel flow is required because existing approaches are not fully compatible with HLS. CIRCT (Wilson et al., 2020) proposes MLIR extensions for low-level hardware synthesis (below the HLS level). Calyx (Nigam et al., 2021) follows, instead, a different approach with a novel IR and associated compiler. SODA (Minutoli et al., 2020) proposes an MLIR-based synthesis framework for machine learning accelerators with more focus on the computational aspects.
Compilation Flow. We extend the LLVM-MLIR compilation flow with additional passes that include memory-related information and transform the IR accordingly. Our passes include solutions to define the data layout, size the physical memories (both caches and PLMs), optimize the access patterns, and create multi-port PLMs for fast access. Currently, we use custom generators like Mnemosyne (http://github.com/chrpilat/mnemosyne) to derive the HDL descriptions from this information. We will also investigate the possibility of interfacing directly with MLIR formats for hardware, like CIRCT.
The customization flow shown at the top of Figure 1 proceeds as follows. At the highest abstraction level, the data organization phase analyzes the data representations to determine the coarse memory structure, i.e., deciding which data are stored off-chip or on-chip. The next step, the layout phase, reorganizes the computation to better exploit local memories (either caches or PLMs). Then, in the communication phase, the prefetcher is configured to hide transfer latency based on the data access patterns. After this, the local partitioning phase determines the multi-bank PLM architecture, also sharing physical memories among data with disjoint lifetimes (Pilato et al., 2017). Finally, the HLS phase generates the computation part of the component with traditional HLS, producing the complete synthesizable description of the accelerator.
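The phase ordering above can be sketched as a pipeline of passes over a shared specification; the dictionary-based "spec" below is an invented stand-in for the MLIR-level representation, used only to show how each phase consumes and refines the result of the previous one:

```python
# Refinement flow sketch: each phase is a pass that takes the current
# specification and returns a refined one. Phase names follow the text;
# the spec format is illustrative.

PHASES = ("data_organization", "layout", "communication",
          "local_partitioning", "hls")

def run_flow(spec, passes):
    for phase in PHASES:
        spec = passes[phase](spec)  # progressively lower the abstraction
    return spec

def make_pass(phase):
    # Placeholder pass that only records its execution order.
    def p(spec):
        return {**spec, "log": spec.get("log", []) + [phase]}
    return p

passes = {ph: make_pass(ph) for ph in PHASES}
result = run_flow({}, passes)
```

The key property this models is that decisions flow strictly downward: by the time the HLS phase runs, all memory-level choices are already fixed in the specification.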
Accelerator Logic HLS. With our approach, the accelerator is designed only at the end of the flow, according to the resulting memory organization. The accelerator features state-of-the-art solutions for memory management (e.g., dynamic address resolution (Pilato et al., 2011; Pilato and Ferrandi, 2013)). The accelerator is mostly unaware of the data organization and layout, since the IR has already been updated based on the memory transformations; it is only optimized to efficiently access the data with fixed or unbounded latency. This part can leverage existing HLS tools that start from low-level intermediate representations. For example, the final LLVM IR can be directly interfaced with the Xilinx Vitis HLS front-end (https://github.com/Xilinx/HLS).
5. Conclusion

We described a novel approach for specializing domain-specific memory templates during the compilation flow and before high-level synthesis of the accelerator logic. Starting from a high-level memory template, we apply a multi-level compilation flow based on MLIR that progressively refines the memory architecture and then interfaces with commercial HLS tools. Our approach borrows ideas from platform-based design, trading off flexibility and specialization based on the specific needs of the designers.
This project is partially funded by the EU Horizon 2020 Programme under grant agreement No 957269 (EVEREST).
References

- Cong et al. (2011). High-level synthesis for FPGAs: from prototyping to deployment. IEEE Transactions on CAD of Integrated Circuits and Systems 30(4), pp. 473–491.
- Dally et al. (2020). Domain-specific hardware accelerators. Communications of the ACM 63(7), pp. 48–57.
- Esmaeilzadeh et al. (2012). Dark silicon and the end of multicore scaling. IEEE Micro 32, pp. 122–134.
- Horowitz (2014). 1.1 Computing’s energy problem (and what we can do about it). In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pp. 10–14.
- Koeplinger et al. (2018). Spatial: a language and compiler for application accelerators. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
- Lattner et al. (2020). MLIR: a compiler infrastructure for the end of Moore’s law. arXiv preprint.
- Mantovani et al. (2020). Agile SoC development with open ESP. In Proceedings of the ACM/IEEE International Conference on Computer-Aided Design (ICCAD).
- Minutoli et al. (2020). SODA: a new synthesis infrastructure for agile hardware design of machine learning accelerators. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
- Minutoli et al. (2016). Enabling the high level synthesis of data analytics accelerators. In Proceedings of the IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).
- Mutlu (2020). Intelligent architectures for intelligent machines.
- Nane et al. (2016). A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35(10), pp. 1591–1604.
- Nigam et al. (2021). A compiler infrastructure for accelerator generators. In Proceedings of the ACM SIGPLAN Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
- Pilato and Ferrandi (2013). Bambu: a modular framework for the high level synthesis of memory-intensive applications.
- Pilato et al. (2017). System-level optimization of accelerator local memory for heterogeneous systems-on-chip. IEEE Transactions on CAD of Integrated Circuits and Systems 36(3), pp. 435–448.
- Pilato et al. (2021). EVEREST: a design environment for extreme-scale big data analytics on heterogeneous platforms. In Proceedings of the ACM/IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE).
- Pilato et al. (2011). A design methodology to implement memory accesses in high-level synthesis.
- Sangiovanni-Vincentelli and Martin (2001). Platform-based design and software design methodology for embedded systems. IEEE Design & Test 18(6), pp. 23–33.
- Shao et al. (2016). Co-designing accelerators and SoC interfaces using gem5-Aladdin. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
- Wilson et al. (2020). CIRCT: circuit IR compilers and tools. https://github.com/llvm/circt.