Deep neural networks (DNNs) [deeplearning] have gained major interest in recent years due to their extraordinary and robust ability to make predictions on large amounts of data. These prediction abilities have been applied to computer vision [imagenet], machine translation [sutskever2014sequence], gaming [brockman2016openai], robotics [mahler2017dex, brockman2016openai], and many other fields. Hardware accelerators are a natural solution to the large computational requirements imposed by DNNs.
A large portion of the DNN accelerators produced by major vendors such as Google [tpu], Samsung [samsung], and Tesla [bannon2019accelerated] use systolic array architectures for matrix multiplication and convolution operations. Systolic arrays were originally proposed in the 1980s [why-systolic, kung1979systolic], but have recently regained interest due to their effectiveness in accelerating general matrix multiplications (GEMM) and convolutions in modern machine-learning (ML) workloads.
Accelerators can be used in various stages of the machine learning process: in training or inference, on edge devices or in the cloud. Each of these use cases places different constraints on the accelerator, including latency, power, throughput, energy, area, programmability, and system integration. Nevertheless, the intrinsic computational kernels used in these scenarios remain the same. Critically, the differences between edge inference and cloud training accelerators can be cast as different accelerator parameters rather than changes to the basic computational kernels.
For these reasons, hardware generators [bora_generators, bora_generators2] are an attractive approach to building DNN accelerators. Although computational kernels may stay the same across workloads, characteristics such as layer dimensions or model size impact how workloads are optimally scheduled and mapped to any particular hardware accelerator [squeeze]. Thus, full-stack generators must target software frontends as well as hardware backends, so that workloads and accelerators can be tuned together.
Systolic array hardware generators should target pertinent architectural parameters such as dataflow, pipeline depth, banking strategy, precision, and on-chip memory capacity. Such generators also need to consider parameters for system-level integration such as bus widths, off-chip memory bandwidth, and host CPU architecture. Accurately evaluating the generated system requires a high-fidelity simulator which can faithfully model system-level interactions with operating systems, DRAM controllers, networks, etc.
Many of these architectural parameters impact the physical realizability, as well as the power, area, and maximum clock frequency of the generated hardware. Therefore, any generator needs to be evaluated not only on its architectural or RTL characteristics, but also on the full physical design flow which it enables.
In this paper, we address these needs and present Gemmini, an agile systolic array generator, which is integrated with the Rocket Chip system-on-chip generator [rocket] and the BOOM out-of-order processor [celio2015berkeley]. Gemmini is composed of a hardware/software generator which produces both RTL and optimized C libraries for common neural network (NN) workloads. We utilize Firesim [karandikar2018firesim], a cycle-exact FPGA-accelerated simulation platform, to extract accurate performance figures from a design-space exploration (DSE) over architectural and system-level parameters. Additionally, we evaluate Gemmini across the full physical design flow to produce tapeout-ready designs with varying timing constraints and floorplans.
Our DSE revealed that evaluation of any one neural network layer in isolation is not representative of performance on an entire network, partly because the performance and energy-efficiency of different layers can vary widely based on their dimensions and tiling factors. Furthermore, we show that CPU performance and system-level constraints such as bus protocols can severely limit the maximum performance an accelerator can provide. We also demonstrate that Gemmini can produce tapeout-ready designs which meet timing and power constraints with a variety of different floorplans and configurations. Gemmini designs have even been taped-out in TSMC 16nm and Intel 22FFL process technologies.
2 Gemmini Generator
Gemmini is an open-source, modular, and flexible generator of systolic array accelerators, supporting multiple dataflows and targeting both ASIC and FPGA implementations. It is written in the Chisel hardware description language [chisel], enabling parameterization and configurability through high-level meta-programming and functional programming abstractions. Gemmini produces instances of systolic architectures that can be integrated with the Rocket Chip SoC generator. Its parameterization and system-level integration enable efficient hardware/software co-design and agile design space exploration. This section describes the architecture of a systolic array generated by Gemmini (Section 2.1), the major generator parameters (Section 2.2), and the accelerator programming model (Section 2.3).
2.1 Architecture
A system-level view of the Gemmini-generated accelerator is illustrated in Figure 1. The core unit is a 2-D systolic array that performs matrix multiplications, represented by the equation:

C = A · B + D

where A and B are the multiplied matrices, C is the result, and D is a bias matrix. The array is fed by a banked scratchpad memory made of SRAMs, with access to main memory handled by a direct memory access (DMA) engine in the controller. There are dedicated components for non-linear activation functions, such as ReLU and ReLU6, as well as components necessary for retaining network accuracy after quantization [jacob2017], such as rounding and saturating bitshifts. The accelerator also includes an accumulator with a wider bitwidth than the systolic array itself, used to accumulate partial results.
The architecture of the systolic array is illustrated in Figure 2. The basic element of the systolic array is a fully combinational processing element (PE), which performs multiply-accumulate (MAC) operations and, optionally, rounding bitshifts. The PEs can support a variety of dataflows, which may either be fixed at design time or configurable at runtime. The PEs can support different bitwidths for their inputs, outputs, and internal buffers (Section 2.2), as determined at elaboration time. To enable full utilization of the MAC units, each PE is double-buffered so that weights/biases can be loaded for a future computation while the current computation is running. PEs are arranged in a combinational grid to form a tile, and tiles are arranged in a pipelined grid to form the systolic array itself.
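As a rough illustration of how partial sums stay resident in the array, the following Python sketch models an output-stationary PE grid functionally, with one virtual PE per output element; it ignores pipelining and the tile structure described above, and is not a model of Gemmini's actual RTL.

```python
# Functional model of an output-stationary PE grid: each PE (i, j) owns one
# output element and accumulates A[i][k] * B[k][j] over time.
def os_systolic_matmul(A, B, D):
    """Compute C = A*B + D the way an output-stationary array does:
    partial sums stay resident in each PE's accumulator."""
    rows, inner, cols = len(A), len(B), len(B[0])
    # Preload the bias matrix D into the PE accumulators.
    acc = [[D[i][j] for j in range(cols)] for i in range(rows)]
    # One "cycle" per k: every PE consumes one A operand and one B operand.
    for k in range(inner):
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += A[i][k] * B[k][j]
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
D = [[1, 0], [0, 1]]
assert os_systolic_matmul(A, B, D) == [[20, 22], [43, 51]]
```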
To perform a GEMM operation, the A, B, and D matrices must be explicitly moved into the scratchpad from main memory (D may also be moved directly into the accumulator). The systolic array is then configured with the desired dataflow and activation functions. Afterwards, the A, B, and D matrices are fed into the systolic array, which writes the result, C, either back into the scratchpad or into the accumulator. Finally, the result may be written back into main memory.
For workloads that are sensitive to precision and rounding, the result of a matrix multiplication must often be of a higher bitwidth than the input matrices. To support this pattern, the Gemmini architecture also includes a higher-bitwidth accumulator external to the systolic array, which is implemented as a dual-port SRAM with adders at its inputs.
The template architecture also includes peripheral circuitry, which performs activation functions and scales high-bitwidth values down to lower-bitwidth values if necessary. For example, Gemmini supports rounding bitshifts, which can be applied within PEs (for the output-stationary dataflow) or at the output of the accumulator (for the weight-stationary dataflow). In a quantized neural network, output activations are usually accumulated to higher precision, e.g., 32 bits. However, before being fed into other layers, these activations must be scaled back down to a lower precision, such as 8 bits. Gemmini saturates and rounds such scaling operations to the nearest bit in order to maximize accuracy [jacob2017].
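The scaling step described above can be sketched as a rounding, saturating bitshift; the function below is an illustrative model of that behavior (the exact rounding of negative values and ties in Gemmini's circuit may differ):

```python
def scale_and_saturate(acc, shift, bits=8):
    """Scale a high-bitwidth accumulator value down by a power of two,
    rounding to nearest, then saturate to a signed `bits`-wide integer.
    A sketch of the behavior described above, not Gemmini's exact circuit."""
    if shift > 0:
        rounding = 1 << (shift - 1)      # round-to-nearest: add half an LSB
        acc = (acc + rounding) >> shift
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, acc))         # saturate instead of wrapping

# A 32-bit accumulator value scaled down to 8 bits with a shift of 8:
assert scale_and_saturate(1000, 8) == 4     # round(1000 / 256) = 4
assert scale_and_saturate(70000, 8) == 127  # 273 saturates to the int8 max
```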
Some of this peripheral circuitry also preprocesses the data. For example, our architecture includes a dedicated transposer, which is itself implemented as a smaller systolic array. For the output-stationary dataflow, a PE must consume the rows of one input matrix while consuming the columns of the other. However, matrices are typically all stored in main memory in row-major order. The transposer allows both input matrices to be stored row-major, making the required matrix transformations transparent to the programmer.
The accelerator is integrated with the Rocket Chip SoC generator, which can be configured to use either the Rocket [rocket] in-order core or the BOOM [celio2015berkeley] out-of-order core. The accelerator communicates with the host processor through the Rocket Custom Coprocessor (RoCC) interface, which enables the host RISC-V core to send the accelerator a stream of custom instructions. The RoCC interface also enables the accelerator to be integrated into the processor's cache-coherent TileLink [tilelink] memory system, and provides execution ordering semantics with respect to the host processor.
2.2 Generator Parameters
Although all the accelerator instances produced by Gemmini share the same general architecture, a designer can explore different trade-offs in performance, energy, and area based on a range of tunable generator parameters. Choosing an appropriate design point for a specific application is extremely important for a kernel as general as matrix multiplication. For example, an energy-conscious accelerator for a mobile device with limited parallelism would likely choose smaller array sizes and a single dataflow (at the cost of performance and flexibility), while larger cloud-based accelerators with batch-level parallelism can choose larger array sizes and multiple dataflows (for optimal performance). Some of the current parameters supported by Gemmini are described below.
Dataflow: a dataflow describes the data movement into and out of the systolic array and the communication patterns between PEs. In the classic three-level nested for-loop for matrix multiplications, the dataflow determines which loops are unrolled spatially and which are unrolled temporally. Currently our generator supports both the output-stationary and the weight-stationary dataflows. The dataflow can either be fixed at elaboration time (improving energy efficiency and physical design), or configured at runtime (improving flexibility and possible performance). Previous work has demonstrated that runtime configurable dataflows can improve DNN inference performance and energy efficiency [squeeze].
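The loop-nest view of a dataflow can be made concrete with a short sketch; the comments below mark one plausible spatial/temporal assignment for each dataflow, and are illustrative rather than a description of Gemmini's exact mapping:

```python
# The classic three-level GEMM loop nest. A dataflow picks which loops are
# unrolled in space (one PE per iteration) and which advance in time.
def gemm(A, B, M, K, N):
    C = [[0] * N for _ in range(M)]
    for i in range(M):          # output-stationary: spatial (PE row)
        for j in range(N):      # output-stationary: spatial (PE column)
            for k in range(K):  # output-stationary: temporal (one MAC/cycle)
                # A weight-stationary array instead pins B[k][j] in a PE
                # (k and j spatial) and streams the i loop through over time.
                C[i][j] += A[i][k] * B[k][j]
    return C

assert gemm([[1, 2]], [[3], [4]], 1, 2, 1) == [[11]]
```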
Dimensions: systolic arrays can be generated with any number of PEs the user chooses. As arrays get larger, more operations will be executed per cycle, and data reuse will improve. However, large arrays increase the latency of small matrix multiplications, as operands must traverse the entire length and height of an array before the result can be read out. Large arrays also suffer from low utilization when operating on small matrices, wasting energy and area. Furthermore, large arrays can have a significant impact on physical design and cycle time, since the scratchpad memories need to be placed appropriately to reduce wire-delay between the memory and the array edges.
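A first-order model makes the utilization penalty concrete; the function below is an illustrative approximation, not a measured result:

```python
import math

def array_utilization(M, N, dim):
    """Fraction of PEs doing useful work when an M x N set of outputs is
    mapped onto a dim x dim array (first-order model: each output tile
    occupies the whole array for its duration, padding wastes PEs)."""
    tiles = math.ceil(M / dim) * math.ceil(N / dim)
    return (M * N) / (tiles * dim * dim)

# A 16x16 output block fully utilizes a 16x16 array...
assert array_utilization(16, 16, 16) == 1.0
# ...but the same output on a 32x32 array wastes 3/4 of the PEs.
assert array_utilization(16, 16, 32) == 0.25
```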
Bitwidth: the generator can be configured at elaboration time to operate on arbitrary bitwidth matrices. The final accumulated result of a matrix multiplication can also have a different bitwidth than the input matrices. Previous work has demonstrated that DNN compression and quantization enable significant energy savings [han2015deep, lin2017deep] at the potential cost of accuracy. Our bitwidth parameterization enables a designer to explore the accuracy-efficiency trade-off and choose an appropriate design point for the respective application.
Pipeline Depth: traditionally, systolic arrays place registers between each PE. However, our generator allows the density of these registers to be reduced, even to the point of the array being made of fully combinational logic. Fewer pipeline registers reduce the area requirement for our accelerator, but may reduce the maximum achievable clock frequency. The optimal pipeline depth is impacted by physical design and the choice of fabrication process technology.
Memory Capacity: both the scratchpad and accumulator memories (implemented using SRAMs) can be configured to have arbitrary capacities. Previous work has found that data movement and coherency management between main memory and accelerators' private memories can consume up to 40% of an accelerated workload's total runtime [shao2016co]. Since data transfer between main memory and the private scratchpad/accumulator memory is expensive, large scratchpads are beneficial for maximizing data re-use. However, over-provisioned private memory leads to energy and area inefficiency. Therefore, memory capacity should be balanced against the system bus and DMA bandwidths which feed the memory, as well as the data re-use potential of the accelerated workload.
Memory Banks: the private memory scratchpad is divided into banks in order to maximize read/write throughput. A larger number of banks allows for higher throughput, but results in additional wiring and physical design constraints.
System Parameters: since Gemmini is integrated with the Rocket Chip SoC ecosystem, it can use SoC-level parameters which have been shown to have an impact on accelerator performance [shao2016co]. One such parameter is the host processor, which can be an in-order Rocket core or an out-of-order BOOM core. Another example is the SoC system-bus width, which impacts the bandwidth with which the accelerator can communicate and move data between main memory and the private scratchpads.
Datatype Parameters: through the use of the Chisel hardware description language and Scala typeclass features, our generator is type-generic over the concrete datatype processed by the systolic array. Gemmini can create accelerator instances which operate on signed integers, unsigned integers, floating-point values, or any user-defined datatype, such as a posit [posits] or dynamic fixed-point number, through the implementation of the relevant Scala typeclass. This level of parameterization enables the generator to produce instances specialized for low-precision integer DNN inference, as well as for high-precision floating-point DNN training and scientific computing.
2.3 Programming Model
Gemmini is programmed via a stream of custom RISC-V instructions transmitted directly from a host processor to our accelerator. Gemmini connects directly to the datapath of a RISC-V core, through the Rocket Custom Coprocessor Interface [rocket]. The accelerator has its own instruction queues, allowing it to run in parallel with the host processor.
Data and memory management between the accelerator and the host processor is explicit; i.e., data must be explicitly moved between the processor's main address space and the accelerator's private address space using a sequence of movement instructions. The ISA defines two data movement instructions, mvin and mvout, shown in Figure 3. These instructions use Gemmini's DMA unit to move multiple systolic-dimension (DIM) matrices between main memory and the accelerator's private memory space, which consists of the scratchpad and the accumulator's SRAM banks.
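The explicit data-movement model can be sketched functionally; the addresses, sizes, and names below are illustrative stand-ins, not the actual Gemmini ISA encoding:

```python
# Toy model of explicit data movement: mvin copies DIM rows from a "main
# memory" array into the private scratchpad; mvout copies them back.
DIM = 4                     # systolic dimension (illustrative)
main_mem = list(range(64))  # stand-in for the processor's address space
scratchpad = [0] * 32       # stand-in for the private scratchpad

def mvin(mem_addr, sp_addr):
    for r in range(DIM):
        scratchpad[sp_addr + r] = main_mem[mem_addr + r]

def mvout(sp_addr, mem_addr):
    for r in range(DIM):
        main_mem[mem_addr + r] = scratchpad[sp_addr + r]

mvin(8, 0)                     # bring main_mem[8:12] into scratchpad[0:4]
assert scratchpad[:4] == [8, 9, 10, 11]
scratchpad[0] = 99             # the accelerator computes in place
mvout(0, 8)                    # write the result back to main memory
assert main_mem[8:12] == [99, 9, 10, 11]
```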
Once matrices have been brought in from main memory, the Gemmini ISA provides a compute instruction that can be configured with a dataflow, a scaling factor, and an activation function. The compute instruction takes the local addresses of the A and B matrices, which can be stored in any scratchpad or accumulator bank.
The output-stationary (OS) variant of compute (illustrated in Figure 4) executes by loading the D matrix into the PEs' internal accumulators, pushing A and B through the systolic array, and leaving the result resident in each PE's accumulator. Providing addresses for the C and D matrices is optional in the OS case. This is useful, for example, when a programmer wants to repeatedly accumulate submatrix multiplications on top of each other without reading the results out of the systolic array until the final result has been calculated.
The weight-stationary (WS) variant of compute (illustrated in Figure 5) takes local addresses for A, B, and C. First, B is preloaded into the PEs' weight buffers, then A is pushed through the systolic array, and the result is written to the accumulator. A bias matrix D can be used in the WS dataflow by first executing a mvin into the accumulator. Specifying B is optional, so the programmer can reuse the weights already loaded into the systolic array.
Gemmini uses a decoupled access-execute architecture [smith1982], in which all instructions are issued to one of three independent, parallel command queues: the LOAD queue (mvin), the STORE queue (mvout), and the EXECUTE queue (compute). Any data hazards within a command queue are handled transparently by hardware. However, dependencies between queues must be encoded into the instructions themselves by the compiler or programmer. Each instruction has four reserved bits which specify whether the instruction depends upon an instruction in another queue, or whether an instruction in another queue will depend upon it. This scheme is inexpensive to implement in hardware, but it increases software complexity. Similar software-based dependency management schemes have been implemented in other NN accelerator works [moreau2018].
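A sketch of how such dependency bits might be encoded is shown below; the bit names and assignments are hypothetical, chosen only to illustrate the idea of software-managed inter-queue dependencies, and do not reflect Gemmini's actual instruction encoding:

```python
# Hypothetical encoding of four inter-queue dependency bits: three "wait
# for" bits (one per other queue) and one "a later instruction will wait
# on me" bit. Purely illustrative.
DEP_ON_LOAD, DEP_ON_STORE, DEP_ON_EX, PUSH_DEP = 1, 2, 4, 8

def encode_deps(*flags):
    bits = 0
    for f in flags:
        bits |= f
    return bits

def depends_on_load(instr_bits):
    return bool(instr_bits & DEP_ON_LOAD)

# A compute instruction that must wait for a prior mvin, and that a later
# mvout will wait on, carries two of the four bits set:
bits = encode_deps(DEP_ON_LOAD, PUSH_DEP)
assert depends_on_load(bits)
assert bits == 9
```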
To make it easier for programmers to use Gemmini accelerators, we provide a software library that implements hand-tuned, tiled GEMM functions (supporting both dataflows): matrix multiplication of any size, multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), non-linear activations, and quantization. Tiling is performed along the parameterized size of the systolic array and the accelerator scratchpad. The tiling parameters are generated by the Chisel generator and included as a header file in the software libraries. This approach facilitates rapid software-hardware co-design.
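The tiling pattern the library implements can be sketched in a few lines; this pure-Python stand-in mirrors the loop structure (tiles sized by the array dimension DIM), with the innermost block standing in for a mvin/compute/mvout sequence in the real C library:

```python
DIM = 4  # systolic array dimension (a generator parameter; illustrative)

def tiled_gemm(A, B, M, K, N):
    """GEMM tiled into DIM x DIM blocks, as the hand-tuned library does."""
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, DIM):            # tile over output rows
        for j0 in range(0, N, DIM):        # tile over output columns
            for k0 in range(0, K, DIM):    # accumulate over inner tiles
                # One block: in hardware this is mvin of the A/B tiles, a
                # compute, and a mvout once the k0 loop finishes.
                for i in range(i0, min(i0 + DIM, M)):
                    for j in range(j0, min(j0 + DIM, N)):
                        for k in range(k0, min(k0 + DIM, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1] * 6 for _ in range(6)]
B = [[2] * 6 for _ in range(6)]
assert tiled_gemm(A, B, 6, 6, 6) == [[12] * 6 for _ in range(6)]
```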
3 Design Space Exploration
Table 1: Evaluated design points, each varying one parameter relative to the baseline configuration (64 KiB scratchpad, 128-bit system bus, Rocket host CPU); the variants include a combined OS + WS dataflow, a 256 KiB scratchpad, a 64-bit system bus, and a BOOM host CPU.
A major advantage of a generator-based methodology is the ability to perform elaborate design space exploration across a multi-dimensional design space. In this section we explore the performance of multiple design points on different DNN workloads.
3.1 Evaluation Method
We chose to run the DNN applications under Linux to evaluate their performance in a full-system context. Performing this type of RTL evaluation using a logic simulator would take multiple compute-years. Therefore, for full-system performance evaluation we used FireSim, an FPGA-accelerated cycle-exact simulation platform [karandikar2018firesim]. Unlike FPGA prototyping, FireSim is designed to simulate ASIC RTL designs with timing-accurate system components. FireSim facilitates full-system simulation by enabling integration of the simulated SoC with accurate peripheral and system-level interface models such as DDR3 memory and a last-level-cache (LLC) [biancolin2019fased]. By using FireSim’s timing-accurate models, we faithfully simulate our target ASIC designs.
Power and area are evaluated using a Cadence VLSI flow with TSMC 16 nm FinFET technology libraries. Logic synthesis was performed using Genus, physical design was performed using Innovus, and power estimation was performed using Voltus. The accelerators were synthesized to meet frequencies of 1 GHz and 500 MHz.
For performance evaluation, the memory system includes a 256 KiB L2 Cache and a 4 MiB last level cache (LLC). The simulated backing memory preserves the same latency for the 500 MHz and 1 GHz design points, while proportionally scaling the memory bandwidth for the 1 GHz design. At 500 MHz, we used the DDR3 1066 8-8-8 model and at 1 GHz we used the DDR3 2133 14-14-14 model.
Each design point for the DSE was selected by varying a single design parameter relative to a baseline which matches the baseline design point in Table 1. This method attempts to isolate the impact of each parameter on area, performance, and power consumption. The baseline design point was selected based on common parameters published in the literature.
3.2 Area and Power
The area and power consumption of our designs, normalized and illustrated in Figure 6, were heavily correlated. According to synthesis, our baseline 500 MHz design occupied 0.467 mm² and consumed 611 mW, including the area and power of the RISC-V core connected to our systolic array. At 1 GHz, synthesis reported that the baseline design occupied 0.541 mm² and consumed 1.4 W. However, we found that synthesis results were generally pessimistic: designs that we placed-and-routed sometimes consumed less than half of what synthesis predicted, as seen in Section 4. The trends between different design points, however, remained the same.
The weight-stationary dataflow design point consumed less power than the output-stationary baseline, as it did not require 32-bit accumulators in the PEs of the systolic mesh. Configurations which increased the size of the systolic mesh, on the other hand, such as by scaling up its dimensions or bitwidth, increased power consumption by up to 3.4× and area by up to 2.3×. The design point which replaced the default in-order Rocket processor with a four-wide out-of-order BOOM processor also significantly increased both area and power consumption, whereas in the other design points, the CPU had only a minor impact on overall power and area.
3.3 Performance
We evaluate the selected design points by running DNNs such as MobileNet, ResNet-50, and ResNet-152, as well as an additional collection of MLPs, which we refer to in Figure 7 as MLP 1 [claudiu2010deep], MLP 2 [meier2011better], MLP 3 [lu2013speech], and MLP 4 [ngiam2011multimodal]. The evaluated DNNs represent a wide range of modern state-of-the-art neural network architectures, including MLPs (which make up more than 61% of Google's inference workloads [tpu]), autoencoders, non-linear activations, convolutions, quantization, and depthwise convolutions.
We observed that many of the design points expected to boost performance did not have a large impact, due to system-level and architectural effects, while also noting significant variability in the performance gains across workloads.
As seen in Figure 6(a), for DNN workloads, using a beefier host processor boosted performance substantially, while increasing scratchpad memory had little impact, contrary to typical intuition. Since the DNN workloads used the CPU core to perform tasks that map poorly to GEMMs, the CPU often became the bottleneck that limited the maximum achievable speedup. For example, our DNNs performed im2col reshaping [caffecontroll, cudnn] to convert 2-D convolutions into GEMMs. With MobileNet in particular, accelerated computation time was dominated by depthwise convolutions running on the CPU. Some layers of the evaluated DNNs include convolutional kernels that can be mapped directly to matrix multiplication without any reshaping; ResNet-152 included the highest portion of such kernels, and thus performed comparatively well across all design points.
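For reference, im2col lowers a convolution to a GEMM by unrolling each receptive field into a row; a minimal single-channel, stride-1, no-padding sketch:

```python
# Minimal im2col: each k x k window of the image becomes one row, so the
# convolution reduces to (patches matrix) x (flattened kernel).
def im2col(img, k):
    H, W = len(img), len(img[0])
    rows = []
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            rows.append([img[y + dy][x + dx]
                         for dy in range(k) for dx in range(k)])
    return rows  # shape: (out_h * out_w) x (k * k)

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
patches = im2col(img, 2)
assert patches[0] == [1, 2, 4, 5]   # top-left 2x2 window
assert len(patches) == 4            # 2x2 output positions
```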
While the larger scratchpad added more data locality, it improved performance only marginally on our DNNs, since its benefit was limited by the CPU bottleneck. On the other hand, increasing bitwidths to 32 bits reduced performance significantly in all cases, as it increased memory requirements, limiting re-use and locality within the scratchpad.
For MLP workloads, the CPU is only used for bookkeeping, so increasing the memory and compute capacity of the accelerator had a larger impact on performance, as seen in Figure 6(b). Increasing the host CPU's performance did help, but not as substantially as increasing the dimensions of the systolic array or boosting its scratchpad size.
One would expect Gemmini to be memory-bandwidth limited, and thus that cutting the memory bus width would degrade performance. However, we observe no significant performance hit in the narrower-bus design point, owing to a system-level limitation on the number of memory requests in flight. This limitation turns a bandwidth constraint into a memory latency constraint: since the round-trip latency of a memory request and the maximum number of requests in flight are independent of the bus width, decreasing the bus width does not impact the effective bandwidth. This reveals the critical importance of system-level evaluation, since an ideal memory model at Gemmini's memory port would not reveal such system-level bottlenecks.
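A back-of-the-envelope model (with illustrative, not measured, numbers) shows why the bus width drops out when the in-flight request count is the binding constraint:

```python
# First-order model: with a cap on requests in flight, effective bandwidth
# is set by request size and round-trip latency, not by bus width.
def effective_bandwidth(bus_bytes, max_inflight, latency_cycles, beats):
    bytes_per_request = bus_bytes * beats
    # Little's law: throughput = outstanding bytes / latency
    return max_inflight * bytes_per_request / latency_cycles

# Halving the bus width doubles the beats per (fixed-size) request, so the
# bytes delivered per cycle are unchanged while the in-flight cap binds:
wide = effective_bandwidth(bus_bytes=16, max_inflight=8,
                           latency_cycles=100, beats=4)
narrow = effective_bandwidth(bus_bytes=8, max_inflight=8,
                             latency_cycles=100, beats=8)
assert wide == narrow
```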
We observe a substantial performance improvement on MLP inference when increasing the size of the systolic array. The Gemmini architecture requests multiple systolic-dimension matrix rows at a time when executing the mvin instruction, so increasing the array dimension results in larger blocks of memory being requested per mvin over TileLink. Doubling the systolic array dimensions doubles the effective memory bandwidth and quadruples the compute throughput. Depending on how much reuse there is within a layer and on the tiling factors, the expected performance boost can be anywhere from 2× to 4×.
For all the models, before the data is fed into the systolic array, the operands are zero-padded so that their dimensions are multiples of the systolic array size. In most of our benchmarks, this resulted in negligible overhead from multiplying zeros. The overhead was highest in MobileNet, where it consumed 10% of the workload's runtime, but it dropped significantly for larger DNNs like ResNet.
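The padding overhead is easy to estimate; the layer shapes below are illustrative examples, not measurements from our benchmarks:

```python
import math

def padding_overhead(M, N, dim):
    """Fraction of MACs spent multiplying padded zeros when an M x N
    operand is padded up to multiples of the array dimension."""
    padded = (math.ceil(M / dim) * dim) * (math.ceil(N / dim) * dim)
    return 1 - (M * N) / padded

# A 112x112 activation divides evenly on a 16x16 array: no padding...
assert padding_overhead(112, 112, 16) == 0.0
# ...but a small 7x7 late-stage layer pads heavily on the same array:
assert round(padding_overhead(7, 7, 16), 2) == 0.81
```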
We also found that, due to their low arithmetic intensity and large memory footprint, depthwise convolutions would require feeding inputs sequentially into the systolic array, which would limit their performance. Therefore, we perform the depthwise convolutions in MobileNet on the host processor itself. Prior work has demonstrated that the low arithmetic intensity of depthwise convolution can be an impediment to efficient acceleration of MobileNet. This is also demonstrated in the results of our DSE: while depthwise convolution layers take up 18% of the runtime in our CPU-only implementation, they account for nearly 100% of the execution time in the accelerated workload.
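This bottleneck follows directly from Amdahl's law; using the 18% figure above:

```python
# Amdahl's-law arithmetic for the depthwise-convolution bottleneck: if a
# fraction f of the CPU runtime is not accelerated, the overall speedup is
# capped at 1 / f no matter how fast the systolic array is.
def max_speedup(unaccelerated_fraction):
    return 1.0 / unaccelerated_fraction

# With depthwise convolutions at ~18% of MobileNet's CPU runtime, even an
# infinitely fast accelerator is limited to about 5.6x end to end:
assert round(max_speedup(0.18), 1) == 5.6
```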
We observe from Figure 6(b) that the performance improvement varies widely between MLP topologies, as a consequence of the shapes of their layers. The layer shapes affected not only input/weight reuse, but also the amount by which GEMMs could be tiled in the scratchpad. The maximum tiling factors are a function of the scratchpad size and array dimensions, but narrow or non-divisible layers often reduced the achievable tiling factors, which in turn reduced performance. For example, MLP 4 outperformed MLP 3 because its dimensions, which were powers of two, mapped better onto our maximum tiling factors.
3.4 Design Space Analysis
We integrate our results for power, performance, and area in Figure 8, plotting the performance per joule of each workload against the performance per unit area. In workloads which were not CPU-limited, such as MLPs, increasing memory capacity improved Gemmini's area and energy efficiency by minimizing main memory accesses and improving locality within the scratchpad. In all workloads, design points which demanded extra memory bandwidth, such as the one which increased bitwidth to 32 bits, reduced energy and area efficiency significantly.
Although the weight-stationary dataflow design point did not noticeably improve performance, it did increase energy and area efficiency by removing the 32-bit accumulators within each PE, which greatly reduced the power consumption of the systolic mesh. The 32×32 design, on the other hand, suffered from very low efficiency despite its high performance, because of its greatly increased power consumption and area overhead. The BOOM-based design point improved energy and area efficiency significantly on MobileNet, which was severely CPU-limited, but hurt efficiency on workloads where the CPU was not the bottleneck.
Additionally, 500 MHz designs were generally more energy- and area-efficient than 1 GHz designs. The 500 MHz designs used more high-voltage-threshold (HVT) logic gates, reducing leakage power consumption more than enough to compensate for their slower compute performance.
4 Physical Design
We evaluate the physical design properties of the generated accelerator designs to explore the area and power requirements of different design points, as well as to assess the place-and-route feasibility of Gemmini accelerators.
Since the end of Dennard scaling, power density has proven to be a significant factor in the design of digital systems-on-chip. In particular, custom accelerators with a dense collection of compute units (MACs, in the case of a matrix multiplication unit) are known to be sensitive to such thermal and energy constraints. As an example, the Google TPUv3 uses liquid cooling to assist the power dissipation from its dense array of compute units [tpu2_3]. As such, it is important that the design space of such a matrix multiplication unit be evaluated through full physical design VLSI flows (placement and routing), allowing for the evaluation of the feasibility of the RTL design (timing closure), as well as power density and energy under different floorplans. Furthermore, physical design of the selected design points was necessary for the integration of the accelerator into test systems-on-chip for fabrication.
We evaluated four design points based on the results of the earlier DSE, choosing two Gemmini configurations and place-and-routing each of them at both 500 MHz and 1 GHz. The first configuration is a 16×16 systolic array with dual dataflows, a 4-banked 256 KiB scratchpad memory, and a 64 KiB accumulator. The second configuration is a 32×32 systolic array with dual dataflows, a 4-banked 512 KiB scratchpad memory, and a 128 KiB accumulator.
Each of our selected design points was also evaluated using two different floorplans in TSMC 16 nm FinFET process technology. The first floorplan (Figure 9) organizes the accelerator’s SRAMs in a major block, leaving space on the side for the systolic mesh, as well as a routing channel across the block. The second floorplan, in Figure 10, organizes the accelerator’s SRAMs in a semi-ring around the computational mesh.
The placed designs are shown in Figures 9 and 10. In both floorplans, the controller was placed next to the Rocket host processor since the processor interacts with the controller to send instructions and data. Each floorplan has intuitive benefits: while the block floorplan provides the systolic mesh more vertical access to the SRAM address lines, the semi-ring floorplan allows for more surface area contact between the systolic mesh and the SRAMs.
The comparison results are presented in Table 2. We observe that the 16×16 arrays performed similarly with both floorplans, even when increasing the frequency from 500 MHz to 1 GHz. For example, all 16×16 design points achieved nearly the same Worst Negative Slack (WNS) at any given frequency, although the Total Negative Slack (TNS) was 79% higher with the semi-ring floorplan at 1 GHz. Additionally, the worst WNS for 16×16 designs was -14 ps, which can easily be adjusted to meet timing requirements. The power consumption of 16×16 designs also showed little variation across floorplans, differing by only 1-3%. We conclude that the generator is flexible enough to allow for a variety of floorplans, with reasonable Quality-of-Results (QoR) for each.
However, we observe that for larger design points such as the 32×32 configuration, the difference between the floorplans begins to have a more noticeable impact on timing results. For example, the semi-ring floorplan achieves a 41% better setup WNS at 1 GHz than the block floorplan, as well as an 81% better TNS, because wires between the scratchpad's SRAMs and the systolic mesh can be routed to travel a shorter average distance. However, neither of the 32×32 floorplans was able to meet timing at 1 GHz, and the setup violations, which were several hundred picoseconds long, were significant enough to require several iterations of physical design to have a chance of closing timing.
Overall, our physical design evaluation demonstrates that semi-ring floorplans can reduce power consumption and achieve faster clock frequencies. They do so by reducing wire lengths and by placing SRAMs where it is easier for physical design tools to route them to the systolic mesh. Furthermore, with 16×16 designs, semi-ring floorplans also reduce area requirements. However, as the systolic array grows to 32×32, the area of the semi-ring floorplan grows faster than that of the block floorplan. Thus, as an accelerator design grows, it may be necessary to switch to a floorplan which clusters SRAMs closer together to meet area requirements.
Table 2. Area (mm²) and power (mW) of each floorplan at the evaluated design points.
Embedding the Gemmini generator within a composable, silicon-proven, open-source platform such as Rocket Chip allows for seamless integration with additional system components such as complex out-of-order cores and vector accelerators. We demonstrate integration with a Rocket RISC-V core, representative of the low-energy processors found in embedded devices, as well as integration with BOOM, a high-performance out-of-order core, which revealed the power/performance tradeoffs of using a more capable host processor.
Gemmini was designed as a flexible generator for systolic GEMM accelerators to identify power, performance, and area trends as various parameters are varied, rather than to achieve state-of-the-art ML inference performance. Gemmini’s DSE revealed the benefit of specializing the hardware for a weight-stationary-only dataflow as is the case in the TPU [tpu], and the system-level evaluation demonstrated that a larger scratchpad is only valuable if a workload is not CPU-limited.
Gemmini targets the most common kernel across many network architectures: matrix multiplication. A large portion of ML inference time is spent on fully connected layers [park2018], which are implemented as matrix multiplications. Furthermore, the compute-intensive convolutions used in CNNs can be efficiently mapped to matrix multiplications. By targeting GEMMs, Gemmini can adapt to different network architectures and layer types, in contrast to accelerators that specialize for convolutional layers. Additionally, GEMM is a fundamental linear algebra primitive, which can expose the Gemmini generator to other application domains such as scientific computing.
Due to the speed limitations of RTL software simulation, some prior works choose to evaluate single-layer performance and then extrapolate to report the performance of a full DNN. However, extrapolation of layer-by-layer performance neglects shared-cache state between the host processor and the accelerator, as well as host-processor time spent between layers.
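The pitfall can be illustrated with a toy model (all numbers below are hypothetical, not measurements): extrapolating from per-layer accelerator time alone omits the host-processor time between layers, underestimating end-to-end runtime.

```python
# Hypothetical per-layer accelerator times (ms) and host-CPU time
# spent between layers (scheduling, data movement, etc.), also in ms:
layer_times = [1.2, 0.8, 1.5, 0.9]
host_gaps = [0.5, 0.5, 0.5, 0.5]

extrapolated = sum(layer_times)                  # layer-by-layer estimate
full_system = sum(layer_times) + sum(host_gaps)  # what a full-system run sees

print(extrapolated, full_system)  # 4.4 6.4
```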
Furthermore, FPGA prototypes used to evaluate ASIC DNN accelerators often connect directly to an on-FPGA DRAM controller and thus see a higher memory throughput than an ASIC implementation would. This can distort performance and energy numbers. Gemmini has been evaluated using FireSim, which accurately models a last-level cache and DRAM to preserve simulation fidelity while executing on an FPGA.
Full-system simulation of RTL implementations is important not only for performance evaluation, but also for functional validation. While an early version of the Gemmini design passed many individual benchmarks and micro-benchmarks, bugs in design decisions such as exception handling, memory-management race conditions, and TLB flushes were exposed only in a full-system, multi-process environment such as Linux.
6 Related Work
Systolic architectures first came into prominence in the early 1980s [why-systolic, kung1979systolic], and since then, many systolic accelerators have been developed. There has also been much work on algorithms which can design new systolic arrays methodologically, rather than through ad-hoc intuition [algoinfo].
Early systolic arrays were used to compute convolutions [kungconv], solutions to triangular linear systems [why-systolic], matrix multiplications, and more. Systolic architectures enable modular and extensible designs, use local neighbor-to-neighbor message passing, and contain easy-to-floorplan regular structures.
Systolic architectures have recently regained popularity, since the convolution and matrix multiplication kernels common in machine learning and deep learning applications are highly amenable to acceleration using multi-dimensional systolic arrays.
Commercially deployed ASIC implementations of NN accelerators include the Google TPU [tpu] for cloud workloads, as well as edge inference implementations by Samsung [samsung], Nvidia [nvidia], Apple [apple], and Tesla [bannon2019accelerated, bannon2019systems]. In particular, a detailed description of the original TPU implementation includes a matrix multiplication unit implemented using a reduced-precision systolic MAC array with a weight stationary dataflow for NN inference in the cloud. Successor versions included floating-point representation, additional memory, and improved utilization for both training and inference [tpu2_3].
Prior work has demonstrated the integration of an open-source commercial DNN accelerator (NVDLA) with the Rocket Chip ecosystem and the FireSim platform [farzad2019rocketnvdla]. The accelerator in this work was integrated using the memory bus, as opposed to Gemmini which is integrated using the RoCC interface. Prior work [seldridge] has also demonstrated the integration of academic NN accelerators with the Rocket Chip ecosystem using the RoCC interface, but did not use systolic architectures for that purpose. Gemmini puts an emphasis on enabling design space exploration rather than single design-point integration.
Academic researchers have proposed numerous systolic accelerators, especially for neural-network inference. For example, NeuFlow [neuflow] was a systolic-inspired architecture which allowed individual processing elements (PEs) to be reconfigured at runtime to perform tasks such as multiply-accumulates, divisions, and non-linear activations. ShiDianNao [shidiannao], similarly, allowed PEs to be reconfigured at runtime to perform multiply-accumulates, additions, and max poolings. Eyeriss [eyeriss] implemented a row-stationary dataflow using a spatial array. Eyeriss v2 [eyeriss2] improved on the original Eyeriss by demonstrating a new PE architecture that can operate on sparse CSC-encoded matrices, and a hierarchical mesh NoC capable of unicast, multicast, and broadcast data transfers to maximize reuse. These and other systolic-inspired architectures typically permit both global and local connections between PEs and global memory, which is not strictly systolic, but often improves performance.
Several previous proposals [squeeze, lu2017flexflow, fu2017] have presented the performance and energy benefits of flexible dataflow options in NN accelerators. However, the impact of an NN accelerator's dataflow structure is still an active area of research, and some works [overrrated] have shown that optimal memory hierarchies and loop-blocking strategies can have a more significant impact on energy efficiency than the choice of dataflow.
Various energy-efficient neural network accelerator proposals have also been presented in the integrated-circuits community [ueyoshi2018quest, lee2018unpu, bankman2018always, karnik2018cm, shin201714, yin20171, ando2017brein, kim20192, sayal201914, lee20197, yue20197]. Many of these proposals focus on exploiting sparsity and quantization features of DNNs. While some of them address runtime configurability, they still target only a single fabrication-ready design point, and most do not present design-time and elaboration-time parameterization. Further, most of these accelerators are tested in isolation, often without a fully integrated software environment, potentially neglecting system-level effects.
A host of DNN accelerators targeted for FPGA implementation have also been proposed [zeng2018, wang2016, shen2018, dicecco2016, guan2017, guo2018, zhang2018, zhang2015, venieris2016, sharma2016, zhang2017, wang2017], taking advantage of FPGA reconfigurability to implement exotic layers, specialize the hardware for a specific network, and evaluate multiple design points. However, FPGA acceleration frameworks do not necessarily translate well to ASIC implementations, and are not ideal for scenarios where energy efficiency is critical.
Some prior works [yazdani2016, song2017, srivastava2018, sharify2018, squeeze, angizi2018, min2019, nowatzki2017] use analytical or high-level model-based simulations to evaluate different parameterizations of a proposed accelerator architecture. In contrast, Gemmini performs design space exploration on the RTL directly and uses feedback from FPGA-accelerated simulation and physical design to find optimal design points for ASIC implementation.
Since the energy consumed during DNN inference and matrix multiplication is often dominated by external memory accesses, academic researchers have proposed processing in memory [liu2018, chi2016, yan2018, yan2018_2, zha2019, ji2019, chang2019]. These works include the development of new SRAM circuits and the use of novel devices such as ReRAMs. Gemmini is designed and validated for CMOS implementation, and uses design space exploration to discover the ideal memory access patterns and memory hierarchy to conserve energy.
Researchers have also proposed methodological systems and algorithms to automatically generate systolic architectures directly from the algorithms they are meant to accelerate. For example, PolySA [polysa] analyzes polyhedral models to attempt to find the optimal mapping between a sequential algorithm and a set of parallel PEs. Yang et al. [overrrated] extended the Halide programming language to automatically generate C++ high-level-synthesis (HLS) implementations of systolic arrays.
Prior work has also introduced TVM [chen2018] and VTA [moreau2018] as an integrated research platform for SW/HW evaluation of NN accelerators. While Gemmini and VTA share many architectural similarities, including a GEMM core, explicit memory management, and explicit instruction-dependency handling, VTA has primarily targeted FPGA accelerator implementations, whereas Gemmini currently targets primarily ASIC designs and has been used in the fabrication of multiple test chips. Furthermore, Gemmini's integration with the RISC-V ecosystem enables an additional level of customization in SW/HW co-design.
7 Future Work
The Gemmini generator has been used in the fabrication of two test systems-on-chip. The chips were taped out within approximately a month of each other in different process technologies, demonstrating the flexibility and utility of the Gemmini generator. Further evaluation of the integration of Gemmini accelerators within these larger embedded vision processors will be performed once the chips return from fabrication.
As demonstrated previously, CPU operations can significantly slow down inference on our workloads. Operations which map convolutions to GEMMs, such as im2col, can make up a significant portion of this overhead. To address this, we intend to map convolutions to GEMMs transparently in hardware.
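As a concrete illustration of the im2col transformation mentioned above, the sketch below (plain Python, single channel, stride 1, no padding) unrolls each k×k patch of the input into a row so that the convolution reduces to a matrix-vector product. The helper names are our own illustrative choices, not Gemmini's API.

```python
def im2col(image, k):
    """Unroll the k x k patches of a 2D image (list of lists) into rows,
    so that a convolution becomes a matrix multiplication between the
    patch matrix and the flattened filter. Stride 1, no padding."""
    h, w = len(image), len(image[0])
    return [[image[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1)
            for j in range(w - k + 1)]

def conv_via_gemm(image, kernel):
    """Compute a 2D convolution as (im2col patch matrix) x (flat kernel)."""
    k = len(kernel)
    flat = [v for row in kernel for v in row]  # k*k-element vector
    return [sum(p * f for p, f in zip(patch, flat))
            for patch in im2col(image, k)]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ker = [[1, 0],
       [0, 1]]
print(conv_via_gemm(img, ker))  # [6, 8, 12, 14]
```

In hardware, performing this unrolling transparently would remove the host-CPU loop above from the critical path.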
Additional overhead comes from zero-padding matrices so that their dimensions tile evenly onto the systolic array, which reduces utilization at the boundaries of the array. By breaking up a single, large systolic array into numerous smaller ones operating in parallel, we may be able to reduce zero-padding requirements while preserving the same compute throughput [kungSmallArrays].
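A back-of-the-envelope model shows the effect: padding an m×n operand up to multiples of the array dimension wastes multiply-accumulates on zeros, and smaller tiles waste fewer. The function and dimensions below are illustrative assumptions, not Gemmini measurements.

```python
import math

def padded_utilization(m, n, tile):
    """Fraction of MACs doing useful work when an m x n operand is
    zero-padded up to multiples of the systolic-array dimension."""
    pm = math.ceil(m / tile) * tile  # padded rows
    pn = math.ceil(n / tile) * tile  # padded columns
    return (m * n) / (pm * pn)

# A 20 x 20 operand on one 16x16 array vs. smaller 8x8 arrays:
print(padded_utilization(20, 20, 16))  # 0.390625 (pads up to 32 x 32)
print(padded_utilization(20, 20, 8))   # ~0.694   (pads up to 24 x 24)
```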
Finally, a generator-based methodology can aid the hardware/software co-design process through the integration of hardware generator and compiler parameters. Future integration with optimizing DSLs and compilers such as Halide or TVM will allow for code generation that considers the generator parameters, enabling better cross-layer data re-use and optimization.
This work presented Gemmini, an open-source and agile systolic array generator that enables systematic evaluation of deep-learning architectures. This systematic evaluation is demonstrated through a DSE case study, identifying bottlenecks in common DNN inference workloads, and capturing the variation in performance improvements of different workloads running on different hardware configurations. With a baseline design equipped with a 16×16 systolic array, Gemmini demonstrated inference speedups of up to 90× on ResNet-152 and ResNet-50 when compared to a cache-optimized CPU implementation, and two to three orders of magnitude speedups on MLPs. We demonstrate the critical importance of full-system evaluation by showing that even though an accelerator can effectively accelerate individual layers of a DNN, it often fails to achieve impressive performance improvements on the entire DNN if any part of the network is not efficiently mapped onto the accelerator. For example, although a Gemmini baseline design was able to accelerate the first layer of MobileNet by 330×, it failed to accelerate the entire network beyond 6× using a Rocket host processor and 18× using a BOOM host processor, due to the presence of depthwise convolutions. We also show that even DNNs with similar network architectures may vary in performance based upon the shape and size of their layers. Looking forward, we believe Gemmini will enable a new range of systematic evaluations and HW/SW co-design of deep learning workloads.
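The MobileNet result above is an instance of Amdahl's law: when only a fraction of the runtime is accelerated, the overall speedup is bounded by the unaccelerated remainder. The 85% fraction below is a hypothetical assumption chosen for illustration, not a measured Gemmini profile.

```python
def overall_speedup(accel_fraction, accel_speedup):
    """Amdahl's law: overall speedup when only accel_fraction of the
    original runtime is sped up by a factor of accel_speedup."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)

# Hypothetical: 85% of runtime in GEMM-friendly layers, each sped up 330x.
# The remaining 15% (e.g. depthwise convolutions on the host) dominates.
print(round(overall_speedup(0.85, 330), 2))  # 6.55
```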
This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) through the Circuit Realization at Faster Timescales (CRAFT) Program under Grant HR0011-16-C0052. This research was partially funded by ADEPT Lab industrial sponsors and affiliates. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.