Deep neural networks (DNNs) have emerged as a prominent solution for a wide range of machine learning applications, such as smart environments, automotive, and health care (LeCun et al., 2015). To achieve high accuracy, larger and deeper neural networks are required; hence, hardware accelerators are employed to expedite the inference process. Many DNN accelerators have been proposed over the past few years (Chen et al., 2014; Han et al., 2016; Chen et al., 2017; Jouppi et al., 2017; Parashar et al., 2017; Lu et al., 2017; Kwon et al., 2018; Hanif et al., 2018). These accelerators offer significantly higher performance efficiency than any general-purpose CPU-based solution. However, the energy consumption of off-chip memory (i.e., DRAM) accesses hinders these designs from achieving further efficiency gains, as the energy cost incurred by off-chip accesses is significantly higher than that of other operations (Sze et al., 2017), as also highlighted in Fig. 1. Therefore, minimizing the number of DRAM accesses is the key to reducing the overall energy consumption of these accelerators.
1.1. State-of-the-art and Their Limitations
The data reuse technique is extensively employed by state-of-the-art DNN accelerators to minimize DRAM accesses. The idea is to use the same data multiple times once it has been fetched from DRAM, thereby avoiding redundant fetches from the off-chip memory. Previous works that exploit this technique can be loosely classified into two categories based on how they define the data reuse factor, namely fixed data type reuse and dynamic data type reuse.
Fixed data type reuse gives reuse priority to only one specific data type: either the input activation/feature map (ifmap), the output activation/feature map (ofmap), or the weights. This concept has been widely used in previous works, such as (Zhang et al., 2015; Alwani et al., 2016). However, it suffers from inefficiency if the fixed dataflow forces the data type with the highest reuse factor to be fetched multiple times from DRAM. For example, in Fig. 2b, the 1st layer of VGG-16 has the highest reuse factor for the weights data type, so the weights will most likely stay longer inside the on-chip memory than the other data types. However, in the 8th layer of VGG-16, the weights data type has the lowest reuse factor, so the efficiency of employing such a fixed dataflow goes down. To address this limitation, dynamic data type reuse is proposed in (Li et al., 2018). It defines a dynamic dataflow based on the statistics observed in each layer, as shown in Fig. 2a-b. Although dynamic data type reuse shows efficiency improvements in terms of the reduction in the number of DRAM accesses, it considers only two reuse schemes, i.e., weights- and ofmap-reuse-based dataflows, which forgoes the efficiency gains of the other possible dataflow patterns, as we will show with the help of the motivational case study.
1.2. Motivational Case Study
To achieve high energy efficiency, state-of-the-art DNN accelerators exploit the reuse of three data types, namely ifmap, ofmap, and weights. Fig. 2a-b show that the reuse factor of each data type varies across layers and networks. If we sort the data types by reuse factor, there are six possible orders for the data reuse scheme; meanwhile, the state-of-the-art provides only two reuse schemes. Given limited hardware resources, keeping the data type with the higher reuse factor in the on-chip memory for a longer time than the other data types is desired to minimize redundant data fetches. Hence, supporting only a subset of the possible reuse schemes is not sufficient to obtain efficient dataflow patterns.
Furthermore, considering that DNN processing tightly couples the ifmap, ofmap, and weights, the order of the data reuse factors is indeed important. Thereby, we argue that sticking to one type of data reuse (i.e., fixed data type reuse) and/or supporting only a subset of the possible reuse schemes in dynamic data type reuse limits the achievable efficiency of DNN accelerators, which in turn limits the energy efficiency improvements. Therefore, there is a significant need for dataflow patterns that can exploit data reuse efficiently and dynamically adapt to the different layer configurations of a network. However, determining such dataflow patterns bears several challenges, as discussed below.
1.3. Associated Scientific Challenges
Such a strategy should account for how much data of the ifmap, ofmap, and weights should be fetched and computed in a single phase of computation, while fully utilizing the available DRAM bandwidth, the size of the on-chip memories, and the size of the computing array. A single phase of computation is one in which a tile of the ifmap and weights is fetched and computed to produce a tile of the ofmap. Furthermore, such dataflow patterns should be supported by a data mapping strategy inside the DRAM to minimize row buffer misses and increase data parallelism. In this manner, the number of DRAM accesses can be reduced significantly.
1.4. Novel Contributions
In this paper, we propose the ROMANet methodology (Section 3): fine-grained reuse-driven data organization and off-chip memory access management for deep neural network accelerators. It makes the following novel contributions to overcome the aforementioned scientific challenges.
Data Reuse Strategy (Section 3.1): We propose a strategy to define an efficient dataflow pattern for each layer of the given network. It defines how much data of the ifmap, ofmap, and weights is to be fetched and computed in a single computation phase, using a data tiling approach. Each data type has its own data tiling parameters, and these parameters are defined by considering the DRAM bandwidth, the size of the on-chip memories, and the size of the computing array.
DRAM Data Mapping (Section 3.2): We propose an efficient data mapping inside the DRAM to minimize row buffer misses and increase the data throughput. This mapping strategy considers the data tiling and the DRAM configuration employed in the accelerator.
On-Chip Data Mapping (Section 3.3): We also propose a data mapping inside the on-chip memory to efficiently support the dataflow patterns employed in the accelerator. This mapping strategy also considers the available DRAM bandwidth and the size of the computing array.
2.1. Deep Neural Networks
In DNNs, the convolutional layers are computationally intensive, as they have a high number of MAC operations (Fig. 2c) and a high reuse factor (Fig. 2a-b). The convolution processing and its pseudocode are presented in Fig. 3. Here, the terms P and Q represent the number of rows and columns in the weights; M and N represent the number of rows and columns in the ofmap; I and J represent the number of ifmap and ofmap channels, respectively; and the terms H and W represent the number of rows and columns in the ifmap.
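The loop nest of Fig. 3 can be sketched in Python as follows, using the dimension names defined above. This is a minimal reference implementation; unit stride and no padding are our assumptions, and the actual pseudocode in Fig. 3 may order the loops differently.

```python
# Convolutional-layer loop nest using the paper's dimension names:
# P, Q = weight rows/cols; M, N = ofmap rows/cols;
# I, J = ifmap/ofmap channels; H, W = ifmap rows/cols.
def conv_layer(ifmap, weights):
    I = len(ifmap)                               # ifmap channels
    H, W = len(ifmap[0]), len(ifmap[0][0])       # ifmap rows/cols
    J = len(weights)                             # number of filters (ofmap channels)
    P, Q = len(weights[0][0]), len(weights[0][0][0])
    M, N = H - P + 1, W - Q + 1                  # ofmap rows/cols (stride 1, no padding)
    ofmap = [[[0] * N for _ in range(M)] for _ in range(J)]
    for j in range(J):
        for m in range(M):
            for n in range(N):
                acc = 0                          # accumulate I*P*Q partial sums
                for i in range(I):
                    for p in range(P):
                        for q in range(Q):
                            acc += ifmap[i][m + p][n + q] * weights[j][i][p][q]
                ofmap[j][m][n] = acc
    return ofmap
```

The innermost three loops make the reuse structure visible: each weight element is revisited M × N times, and each ofmap element accumulates I × P × Q partial sums.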
As illustrated in Fig. 4a, DRAM is hierarchically organized, from top to bottom, into rank, chip, bank, row, and column. When an access request reaches the DRAM, one specific DRAM rank responds. The DRAM chips within a rank are accessed in parallel and together form the DRAM word. Inside a DRAM chip, the request is directed to a specific DRAM bank. Even though only one bank can be accessed at a time, a multi-bank design can still be used to obtain parallelism; its benefit is illustrated in Fig. 4b. In the specific bank, an access request is decoded into array (i.e., row and column) addresses. If the accessed row is already activated, then the data in this entire row are already in the row buffer, and the column address selects the data to be accessed from the row buffer. This condition is called a row buffer hit, since the desired data are already in the row buffer, thereby reducing the DRAM access latency and energy consumption. Otherwise, a row buffer miss occurs and the newly referenced row needs to be activated into the row buffer, which increases the DRAM access latency and energy consumption. From the above, it is clear that DRAM operation enables two types of parallelism that we can exploit, namely bank-level and chip-level parallelism.
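The row-buffer behavior described above can be captured by a tiny open-row bank model; the sketch below is illustrative and ignores timing and energy, counting only hits and misses.

```python
# Minimal open-row DRAM bank model: an access hits if its row is already
# held in the row buffer; otherwise the new row must be activated (miss).
class DramBank:
    def __init__(self):
        self.open_row = None
        self.hits = 0
        self.misses = 0

    def access(self, row):
        if row == self.open_row:
            self.hits += 1        # row buffer hit: cheap column access
        else:
            self.misses += 1      # row buffer miss: activate the new row
            self.open_row = row

bank = DramBank()
for row in [0, 0, 0, 1, 1, 0]:    # consecutive same-row accesses hit
    bank.access(row)
```

Running the trace above yields 3 hits and 3 misses: grouping accesses to the same row, as the mapping in Section 3.2 does, is what turns misses into hits.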
3. The ROMANet Methodology
In this work, we propose the ROMANet methodology to reduce the number of DRAM accesses and increase the energy efficiency of DNN accelerators. An overview of the ROMANet methodology flow is presented in Fig. 5; it consists of the following key steps.
Observe the reuse factor of all data types (i.e. ifmap, ofmap, and weights) for each layer of the given network.
Define the reuse scheme for each layer of a network based on the rank/order of the reuse factor. This order defines the data reuse scheme. From this step, each layer will have a specific scheme out of the six possible schemes as presented in Table 1.
Define the data tiling configuration for all data types (i.e. ifmap, ofmap, and weights) based on the previously selected reuse scheme. This tiling configuration is constrained by the DNN accelerator configuration, such as DRAM bandwidth, size of on-chip memories, and size of the computing array. It will be discussed in Section 3.1.
Define the memory mapping based on the data tiling configuration from the previous step. Here, DRAM and on-chip mapping are defined. The DRAM mapping strategy considers the DRAM and tiling configurations. Meanwhile, on-chip mapping strategy considers the DRAM bandwidth and size of the computing array. These will be discussed in Section 3.2 and 3.3.
Evaluate the number of DRAM accesses, the DRAM access volume, and the DRAM access energy. The evaluation uses a systolic-array-based DNN accelerator design and compares it with the state-of-the-art methodology. These will be discussed in Section 4.
3.1. Data Reuse Strategy
The proposed data reuse strategy exploits the information of the reuse factors and considers their order. The data type with the highest reuse factor takes the highest priority of reuse; the data type with the lowest reuse factor takes the lowest priority; and the remaining data type takes the medium priority in between. As presented in Table 1, there are six possible schemes, and each scheme has its own dataflow pattern, which we explain further with the help of an example and the illustration in Fig. 6.
| Scheme | Reuse factor order | Description | Main tiling flow (Fig. 6) |
| --- | --- | --- | --- |
| 1 | ifmap > weights > ofmap | maximally reuse ifmap, tiled to efficiently reuse weights | ifmap: ① → ② → ③ |
| 2 | ifmap > ofmap > weights | maximally reuse ifmap, tiled to efficiently reuse ofmap | ifmap: ③ → ① → ② |
| 3 | weights > ifmap > ofmap | maximally reuse weights, tiled to efficiently reuse ifmap | weights: ① → ② |
| 4 | weights > ofmap > ifmap | maximally reuse weights, tiled to efficiently reuse ofmap | weights: ② → ① |
| 5 | ofmap > ifmap > weights | maximally reuse ofmap, tiled to efficiently reuse ifmap | ofmap: ① → ② → ③ |
| 6 | ofmap > weights > ifmap | maximally reuse ofmap, tiled to efficiently reuse weights | ofmap: ③ → ① → ② |
For example, if a layer of the given network has its reuse factors sorted as ifmap > weights > ofmap, then the ifmap has a higher priority of reuse than the others. Which portion of the ifmap should be fetched from DRAM is determined by the data type with the second-highest priority (i.e., the weights in this case). The idea is to maximally reuse the data with the highest priority while also considering the second-highest one when developing the dataflow pattern. Hence, the movement of the dataflow pattern for each data type is determined as follows.
For ifmap: The priority is to fetch as much of the ifmap as possible within its tiling parameters so that the tile fits into the allocated on-chip memory. To enable high reusability of the weights, the priority is focused on the spatial (row and column) tiling parameters, because the data reuse in the computation between ifmap and weights happens in a 2-dimensional space. Therefore, we can increase the tile size by expanding it in the directions of ifmap-① and ifmap-②.
For weights: The priority is to maximally use the filter tiling parameter, in line with the fact that the data reuse in the computation between weights and ifmap happens in a 2-dimensional space. Therefore, we can increase the tile size by expanding it in the direction of weights-①. How much data a tile of weights filters should contain is limited by the size of the allocated on-chip memory. Here, we tile full filter rows and columns, since the row and column sizes of a weights filter are typically small.
For ofmap: The priority is to maximally use the partial sums generated by the computation part. Therefore, how much ofmap data a tile should hold on-chip is constrained by the tile sizes of the ifmap and weights that generate the partial sums, as well as by the allocated on-chip memory.
The order of dataflow directions presented in Table 1 is the main tiling flow, which should be followed by all data types. For example, if the main tiling flow is ifmap: ① → ② → ③, the highest priority of the dataflow is in the direction ifmap-① → ifmap-② → ifmap-③. As an implication, the weights follow the direction weights-① → weights-②.
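The scheme selection behind Table 1 can be sketched as follows. The per-element reuse-factor formulas here are our assumption (derived from the loop nest in Fig. 3), not necessarily the paper's exact definition: each weight is reused for every ofmap pixel (M × N), each ifmap element is reused across the J filters and the P × Q kernel overlaps, and each ofmap element accumulates I × P × Q partial sums.

```python
# Hedged sketch: rank the three data types by per-element reuse factor
# for one convolutional layer; the resulting order selects one of the
# six schemes in Table 1.
def select_reuse_scheme(H, W, P, Q, I, J):
    M, N = H - P + 1, W - Q + 1        # ofmap rows/cols (stride 1 assumed)
    reuse = {
        "weights": M * N,              # each weight used once per ofmap pixel
        "ifmap": J * P * Q,            # each ifmap element: filters x kernel overlaps
        "ofmap": I * P * Q,            # each ofmap element: accumulated partial sums
    }
    return sorted(reuse, key=reuse.get, reverse=True)

# An early VGG-16-like layer: large feature maps make weights the most reused.
order = select_reuse_scheme(H=224, W=224, P=3, Q=3, I=3, J=64)
```

For this layer the ranking is weights > ifmap > ofmap, i.e., Scheme 3; a late layer with small feature maps and many channels would rank differently, which is exactly why a per-layer scheme is needed.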
To conduct the aforementioned steps, the DNN accelerator configuration should be considered (Eq. 1).
The equations state that, for each layer of a network, the ifmap tile has to fit inside the allocated ifmap buffer (iBuff), the weights tile has to fit inside the allocated weights buffer (wBuff), and the ofmap tile has to fit inside the allocated ofmap buffer (oBuff). Otherwise, the tiling configurations of all data types should be adjusted according to the available on-chip buffers.
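The buffer constraints of Eq. 1 amount to a simple feasibility check on the tile sizes; the sketch below assumes illustrative buffer capacities in bytes and one byte per element, both of which are our assumptions.

```python
from math import prod

# Check the Eq. 1 constraints: each tile (product of its tiling
# parameters, in elements) must fit in its allocated on-chip buffer.
def fits_buffers(ifmap_tile, weights_tile, ofmap_tile,
                 iBuff, wBuff, oBuff, bytes_per_elem=1):
    return (prod(ifmap_tile) * bytes_per_elem <= iBuff and
            prod(weights_tile) * bytes_per_elem <= wBuff and
            prod(ofmap_tile) * bytes_per_elem <= oBuff)
```

If the check fails, the tiling parameters are shrunk (e.g., along the lowest-priority dimension) until every tile fits, which is the adjustment described above.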
3.2. DRAM Data Mapping
To use the DRAM operations efficiently, the ROMANet methodology also proposes a DRAM data mapping strategy. It exploits two basic concepts: reducing the number of row buffer misses and increasing the data throughput. To achieve these, the row-buffer locality, bank-level parallelism, and chip-level parallelism are exploited. Row-buffer locality means that data items to be fetched at subsequent times should be placed in the same row of a DRAM bank, so that all the data in that row can be accessed consecutively. Bank-level parallelism means that data items to be accessed in parallel can be placed across DRAM banks; similarly, chip-level parallelism means that such data can be placed across DRAM chips. To exploit these concepts efficiently, the information of the tiling configurations for all data types is needed. Then, the DRAM data mapping can be done as illustrated in Fig. 7a, with the following strategy.
For ifmap: Each ifmap is tiled using its tiling parameters. Using these parameters, we can estimate the number of tile accesses required for the ifmap. Based on that, we can map the ifmap inside the DRAM efficiently. First, to exploit row-buffer locality, we map a portion of the ifmap tile into multiple subsequent rows within a DRAM bank. In this manner, on a row hit, we can efficiently access the entire data in that row consecutively, and we activate another row only after all of its data have been accessed; therefore, row-buffer misses are reduced significantly. Second, to exploit bank-level and chip-level parallelism, we map other portions of the ifmap tile across multiple banks and multiple chips, exploiting data parallelism in both aspects to increase the throughput.
For weights: The weights filters are tiled using their tiling parameters. Using these parameters, we can estimate the number of tile accesses required for the weights. Based on that information, we can map the weights inside the DRAM efficiently. To exploit row-buffer locality, we map a portion of the weights tile into multiple subsequent rows within a DRAM bank. Meanwhile, to exploit bank-level and chip-level parallelism, we map other portions of the weights tile across multiple banks and multiple chips, exploiting data parallelism in both aspects to increase the throughput.
For ofmap: The ofmap data type can follow the strategy of the ifmap, since the ofmap of one convolutional layer becomes the ifmap of the next.
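The mapping above can be sketched as an address-interleaving function: consecutive elements spread across chips and banks first (for parallelism), then fill the columns of one row before opening the next (for row-buffer locality). The geometry values below are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative DRAM mapping for one data type: linear element index ->
# (chip, bank, row, column). Consecutive elements interleave over chips
# and banks, then fill a whole row's columns before moving to a new row.
def dram_address(linear_idx, n_chips=4, n_banks=8, row_size=1024):
    chip = linear_idx % n_chips
    bank = (linear_idx // n_chips) % n_banks
    col = (linear_idx // (n_chips * n_banks)) % row_size
    row = linear_idx // (n_chips * n_banks * row_size)
    return chip, bank, row, col
```

With this layout, a burst of n_chips × n_banks consecutive elements is served fully in parallel, and a new row is activated only once every n_chips × n_banks × row_size elements, which is the row-miss reduction the strategy targets.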
3.3. On-Chip Data Mapping
To efficiently shuttle the data from the DRAM to the computing engine, an on-chip data mapping strategy is also needed. Here, we use a scratchpad memory (SPM) as the on-chip buffer. Efficient data mapping in the SPM can be done as illustrated in Fig. 7b, with the following strategy.
For ifmap: Each tile of the ifmap is placed across multiple SPM banks to increase the data throughput. For a systolic-array-based accelerator, it is recommended that the number of SPM banks for the ifmap equals the number of rows of the systolic array, so that each bank can supply data to a specific row of the systolic array engine. In this manner, the data parallelism of the ifmap can be exploited efficiently.
For weights: The tile of weights filters is placed across multiple SPM banks to increase the data throughput. For a systolic-array-based accelerator, it is recommended that the number of SPM banks for the weights equals the number of columns of the systolic array, so that each bank can supply data to a specific column of the systolic array engine. Furthermore, each filter is placed in a different bank to ensure that each bank supplies a different filter to a specific column of the systolic array. In this manner, the data parallelism of the weights can be exploited efficiently.
For ofmap: The ofmap data type can also follow the strategy of the ifmap, since the ofmap of one convolutional layer becomes the ifmap of the next.
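The SPM placement above reduces to a round-robin assignment of data to banks that matches the systolic-array geometry. The 12 × 14 array dimensions below follow the accelerator configuration in Table 2; the bank-assignment functions themselves are our illustrative sketch.

```python
# On-chip (SPM) placement sketch: ifmap rows are distributed round-robin
# over as many SPM banks as systolic-array rows, so each bank feeds one
# array row; weights filters map one filter per bank to feed one column.
def spm_bank_for_ifmap(ifmap_row, n_array_rows=12):
    return ifmap_row % n_array_rows

def spm_bank_for_filter(filter_idx, n_array_cols=14):
    return filter_idx % n_array_cols
```

Because consecutive rows (and consecutive filters) land in different banks, all rows and columns of the array can be fed in the same cycle without bank conflicts.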
4. Evaluation Methodology
To evaluate the ROMANet methodology, we developed a simulator that models its behavior and its strategy for finding the efficient dataflow patterns; it takes the DNN accelerator and memory configurations as constraints. We designed a state-of-the-art Tensor Processing Unit (TPU) (Jouppi et al., 2017)-like DNN accelerator with a reduced size of the on-chip memory and the computing array (Table 2), and used it as a reference for the experiments. We synthesized this accelerator with Synopsys Design Compiler on a CMOS technology to obtain timing, power, and area estimates. We also extracted power and energy estimates for the off-chip and on-chip memories using CACTI (Muralimanohar et al., 2018). The experimental setup and the tool flow are illustrated in Fig. 8. In the experiments, we evaluated the ROMANet methodology and compared it with the state-of-the-art using AlexNet and VGG-16, investigating the number of DRAM accesses, the DRAM access volume, and the dynamic energy consumption of the DRAM accesses.
| Component | Configuration |
| --- | --- |
| Systolic Array | 12 × 14 MAC processing elements |
| Data Buffer | Total buffer size = 108 KB |
| Accumulator | Register size = 256 B |
| Activation-Pooling | Register size = 256 B |
| DRAM | Size = 2 Gb, Bandwidth = 12.8 GB/s (Malladi et al., 2012; Micron, 2010) |
5. Results and Discussion
The experimental results for AlexNet are presented in Fig. 9a-c. For the number of DRAM accesses, the ROMANet methodology achieves up to overall improvements as compared to the state-of-the-art. If the memory mapping is also considered in the state-of-the-art, the ROMANet methodology can still achieve up to improvements. In that scenario, if the observation is conducted per layer of the given network, the improvements achieved by the ROMANet methodology are within the range . For the DRAM access volume, the ROMANet methodology also achieves up to overall improvements as compared to the state-of-the-art, and if the memory mapping is considered in the state-of-the-art. For the layer-wise observation, the improvements achieved are within the range . Similar percentages of improvement are also observed in the dynamic energy consumption of the DRAM accesses, for both overall and layer-wise observations.
The experimental results for VGG-16 are presented in Fig. 9d-f. For the number of DRAM accesses, the ROMANet methodology achieves up to overall improvements as compared to the state-of-the-art. If the memory mapping is also employed in the state-of-the-art, the ROMANet methodology can still achieve improvements of up to . In that case, if the observation is conducted per layer of the given network, the improvements achieved by the ROMANet methodology are within the range . For the DRAM access volume, the ROMANet methodology also achieves up to overall improvements as compared to the state-of-the-art, and if the memory mapping is considered in the state-of-the-art. The improvements are within the range for the layer-wise observation. Similar percentages of improvement are also observed in the dynamic energy consumption of the DRAM accesses, for both overall and layer-wise observations.
These results show that the memory mapping provides a significant benefit in reducing the overall data accesses, because it minimizes the number of accesses to redundant data. The significant improvements of the ROMANet methodology over the state-of-the-art come from dataflow patterns that efficiently exploit both data reuse and memory mapping for each layer of the given network. Furthermore, the experimental results show that the improvements are not evenly distributed across the layers. This gives us the insight that in some layers the ROMANet methodology is more efficient than the state-of-the-art, thanks to our efficient dataflow patterns and memory mapping, while in other layers the ROMANet methodology has a comparable efficiency to the state-of-the-art.
In this work, we demonstrated that efficient dataflow patterns for DNN accelerators can be obtained through the proposed ROMANet methodology, which defines an efficient data reuse strategy and memory mapping for each layer of a network. The experimental results show that the proposed methodology can significantly reduce the number of DRAM accesses, the DRAM access volume, and the DRAM dynamic energy consumption, thereby increasing the overall dynamic energy efficiency up to for AlexNet and for VGG-16 in state-of-the-art DNN accelerators. Our novel concepts would enable further research on more comprehensive studies of energy-efficient DNN accelerators.
- Alwani et al. (2016) M. Alwani, H. Chen, M. Ferdman, and P. Milder. 2016. Fused-layer CNN accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.
- Chen et al. (2014) T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’14). ACM, New York, NY, USA, 269–284.
- Chen et al. (2017) Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52, 1 (Jan 2017), 127–138.
- Muralimanohar et al. (2018) N. Muralimanohar et al. 2018. CACTI 7.0. https://github.com/HewlettPackard/cacti
- Jouppi et al. (2017) N. P. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 1–12.
- Han et al. (2016) S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 243–254.
- Hanif et al. (2018) M. A. Hanif, R. V. W. Putra, M. Tanvir, R. Hafiz, S. Rehman, and M. Shafique. 2018. MPNA: A Massively-Parallel Neural Array Accelerator with Dataflow Optimization for Convolutional Neural Networks. arXiv preprint arXiv:1810.12910 (2018).
- Kwon et al. (2018) H. Kwon, A. Samajdar, and T. Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 461–475.
- LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436.
- Li et al. (2018) J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li. 2018. SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators. In 2018 Design, Automation Test in Europe Conference Exhibition (DATE). 343–348.
- Lu et al. (2017) W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. 2017. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 553–564.
- Malladi et al. (2012) K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz. 2012. Towards energy-proportional datacenter memory with mobile DRAM. In 2012 39th Annual International Symposium on Computer Architecture (ISCA). 37–48.
- Micron (2010) Micron. 2010. Micron 2Gb: x4, x8, x16 DDR3 SDRAM. Data Sheet MT41J128M16HA-12.
- Parashar et al. (2017) A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 27–40.
- Sze et al. (2017) V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer. 2017. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 105, 12 (Dec 2017), 2295–2329.
- Zhang et al. (2015) C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’15). ACM, New York, NY, USA, 161–170.
- Zhang et al. (2016) S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.