ROMANet: Fine-Grained Reuse-Driven Data Organization and Off-Chip Memory Access Management for Deep Neural Network Accelerators

02/04/2019 · Rachmad Vidya Wicaksana Putra, et al.

Many hardware accelerators have been proposed to improve the computational efficiency of the inference process in deep neural networks (DNNs). However, off-chip memory accesses, being the most energy-consuming operations in such architectures, limit the designs from achieving efficiency gains at their full potential. Towards this, we propose ROMANet, a methodology to investigate efficient dataflow patterns for reducing the number of off-chip memory accesses. ROMANet adaptively determines the data reuse pattern for each convolutional layer of a network by considering the reuse factors of the weights, input activations, and output activations. It also considers the data mapping inside the off-chip memory to reduce row buffer misses and increase parallelism. Our experimental results show that the ROMANet methodology is able to achieve up to 50% dynamic energy savings in state-of-the-art DNN accelerators.


1. Introduction

Deep neural networks (DNNs) are emerging as a prominent solution for a wide range of machine learning applications, such as smart environments, automotive, and health care (LeCun et al., 2015). To achieve high accuracy, larger and deeper neural networks are required; hence, hardware accelerators are employed to expedite the inference process. Many DNN accelerators have been proposed over the past few years (Chen et al., 2014; Han et al., 2016; Chen et al., 2017; et al, 2017; Parashar et al., 2017; Lu et al., 2017; Kwon et al., 2018; Hanif et al., 2018). These accelerators offer significantly higher performance efficiency compared to general-purpose CPU-based solutions. However, the energy consumption of off-chip memory (i.e., DRAM) accesses hinders the designs from achieving further efficiency gains, as the energy cost incurred by off-chip accesses is significantly higher than that of other operations (Sze et al., 2017), as also highlighted in Fig. 1. Therefore, minimizing the number of DRAM accesses is key to reducing the overall energy consumption of these accelerators.

Figure 1. Breakdown of (a) the overall energy consumption of DNN accelerators: Cambricon-X (Zhang et al., 2016), DianNao (Chen et al., 2014), and a systolic array (Hanif et al., 2018); and (b) the layer-wise energy consumption of the AlexNet convolutional layers in a systolic-array-based accelerator.

1.1. State-of-the-art and Their Limitations

The data reuse technique is extensively employed by state-of-the-art DNN accelerators to minimize DRAM accesses. The idea is to use the same data multiple times once it is fetched from DRAM, thereby avoiding redundant fetches from the off-chip memory. Previous works exploiting this technique can be loosely classified into two categories based on how they define the data reuse factor, namely fixed data type reuse and dynamic data type reuse.

Fixed data type reuse gives reuse priority to only one specific data type, either the input activations/feature maps (ifmap), the output activations/feature maps (ofmap), or the weights. This concept has been widely used in previous works, such as (Zhang et al., 2015; Alwani et al., 2016). However, it suffers from inefficiency when the fixed dataflow forces the data type with the highest reuse factor to be fetched multiple times from DRAM. For example, in Fig. 2b, the 1st layer of VGG-16 has the highest reuse factor for the weights, so the weights will most likely stay longer inside the on-chip memory than the other data types. However, in the 8th layer of VGG-16, the weights have the lowest reuse factor; hence, the efficiency of employing such a fixed dataflow decreases. To address this limitation, dynamic data type reuse was proposed in (Li et al., 2018). It defines a dynamic dataflow based on the statistics observed in each layer, as shown in Fig. 2a-b. Although dynamic data type reuse improves efficiency in terms of reducing the number of DRAM accesses, it only considers two reuse schemes, i.e., weight- and ofmap-reuse-based dataflows, which forgoes the efficiency gains of the other possible dataflow patterns, as we will show with the help of the motivational case study.

Figure 2. Data reuse factors of the ifmap, ofmap, and weights in (a) AlexNet and (b) VGG-16; (c) percentage of MAC operations and weights in AlexNet and VGG-16.

1.2. Motivational Case Study

To achieve high energy efficiency, state-of-the-art DNN accelerators exploit the reuse of three data types, namely ifmap, ofmap, and weights. Fig. 2a-b show that the reuse factor of each data type actually varies across layers and networks. If we sort the data types by their reuse factors, there are six possible orders for the data reuse scheme, whereas the state-of-the-art provides only two reuse schemes. Given limited hardware resources, keeping the data type with the higher reuse factor on-chip for a longer time than the other data types is desirable to minimize redundant data fetches. Hence, supporting only a subset of the possible reuse schemes is not sufficient to obtain efficient dataflow patterns.

Furthermore, since DNN processing tightly couples the ifmap, ofmap, and weights, the order of the data reuse factors is indeed important. Thereby, we argue that sticking to one type of data reuse (i.e., fixed data type reuse) and/or supporting only a subset of the possible reuse schemes in dynamic data type reuse limits the efficiency gains of DNN accelerators, which in turn limits the energy efficiency improvements. Therefore, there is a significant need for dataflow patterns that can exploit data reuse efficiently and dynamically adapt to the different configurations of the layers in a network. However, determining such dataflow patterns bears several challenges, as discussed below.

1.3. Associated Scientific Challenges

This strategy should account for how much ifmap, ofmap, and weights data should be fetched and computed in a single computation phase, while fully utilizing the available DRAM bandwidth, the size of the on-chip memories, and the size of the computing array. A single computation phase is one in which a tile of the ifmap and a tile of the weights are fetched and computed to produce a tile of the ofmap. Furthermore, such dataflow patterns should be supported by a data mapping strategy inside the DRAM that minimizes row buffer misses and increases data parallelism. In this manner, the number of DRAM accesses can be reduced significantly.

1.4. Novel Contributions

In this paper, we propose the ROMANet methodology (Section 3): Fine-Grained Reuse-Driven Data Organization and Off-Chip Memory Access Management for Deep Neural Network Accelerators. It makes the following novel contributions to overcome the aforementioned scientific challenges.

  • Data Reuse Strategy (Section 3.1): We propose a strategy to define an efficient dataflow pattern for each layer of a given network. It defines how much ifmap, ofmap, and weights data is fetched and computed in a single computation phase, using a data tiling approach. Each data type has its own data tiling parameters, and these parameters are determined by considering the DRAM bandwidth, the size of the on-chip memories, and the size of the computing array.

  • DRAM Data Mapping (Section 3.2): We propose an efficient data mapping inside the DRAM to minimize row buffer misses and increase the data throughput. This mapping strategy considers the data tiling and the DRAM configuration employed in the accelerator.

  • On-Chip Data Mapping (Section 3.3): We also propose a data mapping inside the on-chip memory to efficiently support the dataflow patterns employed in the accelerator. This mapping strategy also considers the available DRAM bandwidth and the size of the computing array.

2. Preliminaries

2.1. Deep Neural Networks

In DNNs, the convolutional layers are computationally intensive as they have a high number of MAC operations (Fig. 2c) and a high reuse factor (Fig. 2a-b). The convolution processing and its pseudocode are presented in Fig. 3. Here, the terms P and Q represent the number of rows and columns in the weights filters; M and N represent the number of rows and columns in the ofmap; I and J represent the number of ifmaps and ofmaps, respectively; and the terms H and W represent the number of rows and columns in the ifmap.

Figure 3. Overview of the convolutional processing.
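To make the notation concrete, the sketch below implements the convolution loop nest of Fig. 3 directly in Python/NumPy using the dimension names defined above. The stride-1, no-padding setting and the helper that estimates the rough per-element reuse counts are our own simplifications for illustration, not part of the paper.

```python
import numpy as np

def conv_layer(ifmap, weights):
    """Direct convolution with the dimension names of Section 2.1.

    ifmap   : I x H x W     (I input feature maps with H rows, W columns)
    weights : J x I x P x Q (J filters, each of size I x P x Q)
    returns ofmap : J x M x N, where M = H - P + 1 and N = W - Q + 1
    (stride 1 and no padding are assumed for brevity).
    """
    I, H, W = ifmap.shape
    J, I2, P, Q = weights.shape
    assert I == I2, "filter depth must match the number of ifmaps"
    M, N = H - P + 1, W - Q + 1
    ofmap = np.zeros((J, M, N), dtype=ifmap.dtype)

    for j in range(J):                      # each ofmap
        for m in range(M):                  # ofmap rows
            for n in range(N):              # ofmap columns
                for i in range(I):          # accumulate over ifmaps
                    for p in range(P):      # filter rows
                        for q in range(Q):  # filter columns
                            ofmap[j, m, n] += (ifmap[i, m + p, n + q]
                                               * weights[j, i, p, q])
    return ofmap

def reuse_factors(I, J, P, Q, M, N):
    """Rough per-element reuse counts implied by the loop nest above
    (stride 1, ignoring border effects): each weight is used once per ofmap
    position, each ofmap element accumulates I*P*Q partial sums, and each
    ifmap element feeds roughly J*P*Q multiply-accumulate operations."""
    return {"weights": M * N, "ofmap": I * P * Q, "ifmap": J * P * Q}
```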

2.2. DRAM

As illustrated in Fig. 4a, DRAM is hierarchically organized into ranks, chips, banks, rows, and columns, viewed from top to bottom. When there is an access request to DRAM, one specific DRAM rank responds. The DRAM chips within a rank are accessed in parallel and together form the DRAM word. Inside a DRAM chip, the request is directed to a specific DRAM bank. Even though only one bank can be accessed at a time, a multi-bank design can still be used to obtain parallelism; its benefit is illustrated in Fig. 4b. Within a bank, an access request is decoded into array (i.e., row and column) addresses. If the accessed row is already activated, the data of the entire row is already in the row buffer, and the column address selects the data to be read from the row buffer. This condition is called a row buffer hit, since the desired data is already in the row buffer, thereby reducing the DRAM access latency and energy consumption. Otherwise, a row buffer miss occurs and the newly referenced row needs to be activated into the row buffer, which increases the DRAM access latency and energy consumption. From the above, it is clear that DRAM operation enables two types of parallelism that we can exploit, namely bank-level and chip-level parallelism.
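The following toy model illustrates, under an address split of our own choosing, how consecutive accesses to an open row produce row buffer hits while accesses that hop between rows of the same bank produce misses. It is only a sketch of the behavior described above, not the DRAM model used in this work.

```python
class DRAMBank:
    """Toy model of a single DRAM bank with one open-row buffer."""
    def __init__(self):
        self.open_row = None        # row currently held in the row buffer
        self.hits = 0
        self.misses = 0

    def access(self, row):
        if row == self.open_row:    # row buffer hit: data already latched
            self.hits += 1
        else:                       # row buffer miss: activate the new row
            self.misses += 1
            self.open_row = row


def access_stream(addresses, num_banks=8, row_size=2048):
    """Map a stream of byte addresses to (bank, row) and count hits/misses.
    The address split (bank bits below row bits) is illustrative only."""
    banks = [DRAMBank() for _ in range(num_banks)]
    for addr in addresses:
        bank = (addr // row_size) % num_banks
        row = addr // (row_size * num_banks)
        banks[bank].access(row)
    return sum(b.hits for b in banks), sum(b.misses for b in banks)


# Sequential accesses within a row mostly hit; strided accesses that keep
# hopping to new rows of the same bank mostly miss.
print(access_stream(range(0, 16384, 8)))           # mostly row buffer hits
print(access_stream(range(0, 16384 * 64, 16384)))  # all row buffer misses
```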

Figure 4. (a) Overview of the DRAM organization and (b) the multi-bank operations to increase the data parallelism

3. The ROMANet Methodology

In this work, we propose the ROMANet methodology to reduce the number of DRAM accesses and increase the energy efficiency of DNN accelerators. The overview of the ROMANet methodology flow is presented in Fig. 5; it consists of the following key steps.

  1. Observe the reuse factors of all data types (i.e., ifmap, ofmap, and weights) for each layer of the given network.

  2. Define the reuse scheme for each layer of the network based on the rank/order of the reuse factors. This order defines the data reuse scheme. From this step, each layer is assigned one of the six possible schemes presented in Table 1 (see the sketch after this list).

  3. Define the data tiling configuration for all data types (i.e., ifmap, ofmap, and weights) based on the previously selected reuse scheme. This tiling configuration is constrained by the DNN accelerator configuration, such as the DRAM bandwidth, the size of the on-chip memories, and the size of the computing array. It is discussed in Section 3.1.

  4. Define the memory mapping based on the data tiling configuration from the previous step. Both the DRAM mapping and the on-chip mapping are defined here. The DRAM mapping strategy considers the DRAM and tiling configurations, while the on-chip mapping strategy considers the DRAM bandwidth and the size of the computing array. These are discussed in Sections 3.2 and 3.3.

  5. Evaluate the number of DRAM accesses, the DRAM access volume, and the DRAM access energy. The evaluation uses a systolic-array-based DNN accelerator design and compares against the state-of-the-art methodology, as discussed in Section 4.
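As a minimal sketch of steps 1 and 2 above, the snippet below ranks the per-layer reuse factors and maps the resulting order to one of the six schemes of Table 1. The closed-form reuse estimates reuse the rough formulas given after Fig. 3, and all names and data structures are our own illustrative assumptions.

```python
SCHEMES = {                       # (highest, medium, lowest) -> scheme id
    ("ifmap", "weights", "ofmap"): 1,
    ("ifmap", "ofmap", "weights"): 2,
    ("weights", "ifmap", "ofmap"): 3,
    ("weights", "ofmap", "ifmap"): 4,
    ("ofmap", "ifmap", "weights"): 5,
    ("ofmap", "weights", "ifmap"): 6,
}

def select_scheme(layer):
    """layer: dict with I, J, P, Q, M, N for one convolutional layer."""
    reuse = {
        "ifmap":   layer["J"] * layer["P"] * layer["Q"],   # rough estimates
        "weights": layer["M"] * layer["N"],
        "ofmap":   layer["I"] * layer["P"] * layer["Q"],
    }
    order = tuple(sorted(reuse, key=reuse.get, reverse=True))
    return SCHEMES[order], reuse

# Example: an early layer with large feature maps and few input channels
# tends to rank the weights highest, i.e. a weight-reuse-first scheme.
print(select_scheme({"I": 3, "J": 64, "P": 3, "Q": 3, "M": 224, "N": 224}))
```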

Figure 5. Overview of the ROMANet methodology flow.

3.1. Data Reuse Strategy

The proposed data reuse strategy exploits the information on the reuse factors and considers their order. The data type with the highest reuse factor takes the highest reuse priority; the data type with the lowest reuse factor takes the lowest priority; and the remaining data type takes the medium priority. As presented in Table 1, there are six possible schemes, and each scheme has its own dataflow pattern, which we explain further with the help of an example and the illustration in Fig. 6.

Scheme | Reuse factor (highest / medium / lowest) | Description | Main tiling flow (Fig. 6)
1 | ifmap / weights / ofmap | maximally reuse ifmap to efficiently reuse weights | ifmap: ① → ② → ③
2 | ifmap / ofmap / weights | maximally reuse ifmap to efficiently reuse ofmap | ifmap: ③ → ① → ②
3 | weights / ifmap / ofmap | maximally reuse weights to efficiently reuse ifmap | weights: ① → ②
4 | weights / ofmap / ifmap | maximally reuse weights to efficiently reuse ofmap | weights: ② → ①
5 | ofmap / ifmap / weights | maximally reuse ofmap to efficiently reuse ifmap | ofmap: ① → ② → ③
6 | ofmap / weights / ifmap | maximally reuse ofmap to efficiently reuse weights | ofmap: ③ → ① → ②
Table 1. Proposed strategy for the tiling configuration based on the possible data reuse schemes.

For example, if a layer of the given network has reuse factors sorted as ifmap > weights > ofmap, this means that the ifmap has a higher reuse priority than the other data types. Which portion of the ifmap should be fetched from DRAM is determined by the data type with the second-highest priority (i.e., the weights in this case). The idea is to maximally reuse the data with the highest priority and also consider the second-highest one when developing the dataflow pattern. Hence, the movement of the dataflow patterns for each data type is determined as follows.

  • For ifmap: The priority is to maximally fetch the ifmap data type within its tiling parameters so that it fits into the allocated on-chip memory. To enable a high reusability of the weights, the priority is focused more on the two spatial tiling parameters, because the data reuse in the computation between the ifmap and the weights happens in a 2-dimensional space. Therefore, we can increase these parameters by expanding them in the directions ifmap-① and ifmap-②.

  • For weights: The priority is to maximally use the corresponding tiling parameter, in line with the fact that the data reuse in the computation between the weights and the ifmap happens in a 2-dimensional space. Therefore, we can increase this parameter by expanding it in the direction Weights-①. How much data of the weights filters should be fetched in one tile is limited by the size of the allocated on-chip memory. Here, the row and column tiling parameters are set to cover the full filter, since the number of rows and columns in a weights filter is typically small.

  • For ofmap: The priority is to maximally use the partial sums generated by the computation part. Therefore, how much ofmap data should be managed on-chip in one tile is constrained by the tile sizes of the ifmap and the weights that generate the partial sums, as well as by the allocated on-chip memory.

  • Once a single tile of the ifmap has been computed with a single tile of the weights filters, we can continue fetching other tiles for the next computation. The next data tiling configuration to be fetched for the ifmap and the weights follows the tiling flow directions presented in Fig. 6 and Table 1.

  • The dataflow direction presented in Table 1 is the main tiling flow, which should be followed by all data types. For example, if the main tiling flow is ifmap-① → ② → ③, the highest priority of the dataflow is in the direction ifmap-① → ifmap-② → ifmap-③. As an implication, the weights follow the direction weights-① → weights-②. One possible interpretation of such a tile-fetch order is sketched after this list.
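The snippet below gives one possible interpretation of such a tile-fetch order for a scheme in which the ifmap has the highest reuse priority and the weights the second highest: the fetched ifmap tile stays on-chip while the weight tiles are streamed past it. The helpers are hypothetical placeholders for illustration, not an implementation of the paper's dataflow.

```python
def process_layer(ifmap_tiles, weight_tiles, fetch, compute, write_back):
    """Hypothetical tile-fetch order when ifmap has the highest reuse
    priority and weights the second highest (e.g., Scheme 1 in Table 1)."""
    for it in ifmap_tiles:          # highest priority: ifmap tile stays on-chip
        ifmap_buf = fetch(it)       # fetched from DRAM only once per tile
        for wt in weight_tiles:     # second priority: stream weight tiles past it
            weight_buf = fetch(wt)
            psums = compute(ifmap_buf, weight_buf)
            write_back(psums)       # ofmap (lowest priority) leaves the chip first
```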

Figure 6. Illustration of data tiling configuration and terms used for ifmap, ofmap, and weights in data reuse strategy.

To conduct the aforementioned steps, some constraints of the DNN accelerator configuration should be considered (Eq. 1).

size(ifmap tile) ≤ iBuff,    size(weights tile) ≤ wBuff,    size(ofmap tile) ≤ oBuff        (1)

These equations state that, for each layer of the network, the ifmap tile has to fit inside the allocated ifmap buffer (iBuff), the weights tile has to fit inside the allocated weights buffer (wBuff), and the ofmap tile has to fit inside the allocated ofmap buffer (oBuff). Otherwise, the tiling configuration of all data types has to be adjusted according to the available on-chip buffers.
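A minimal sketch of the constraint in Eq. (1) is given below, assuming 1-byte elements, hypothetical tile-dimension tuples, and an assumed 48/12/48 KB split of the 108 KB on-chip buffer listed later in Table 2; none of these values come from the paper.

```python
from math import prod

def tiles_fit(tile, buffers, bytes_per_elem=1):
    """Check the constraint of Eq. (1): every tile must fit in its buffer.

    tile    : dict of tile dimensions, e.g. {"ifmap": (i, h, w),
              "weights": (j, i, p, q), "ofmap": (j, m, n)}
    buffers : dict of buffer capacities in bytes, e.g.
              {"ifmap": iBuff, "weights": wBuff, "ofmap": oBuff}
    The tuple layout and the 1-byte element size are illustrative assumptions.
    """
    return all(prod(dims) * bytes_per_elem <= buffers[name]
               for name, dims in tile.items())

# If any tile overflows its buffer, its tiling parameters must be reduced.
print(tiles_fit(
    {"ifmap": (3, 32, 32), "weights": (8, 3, 3, 3), "ofmap": (8, 30, 30)},
    {"ifmap": 48 * 1024, "weights": 12 * 1024, "ofmap": 48 * 1024}))   # True
```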

3.2. DRAM Data Mapping

To use the DRAM operations efficiently, the ROMANet methodology also proposes a DRAM data mapping strategy. It exploits two basic concepts: reducing the number of row buffer misses and increasing the data throughput. To achieve this, the row-buffer locality, bank-level parallelism, and chip-level parallelism are exploited. Row-buffer locality means that data to be fetched at subsequent times should be placed in the same row of a DRAM bank, so that all the data in that row can be accessed back-to-back. Bank-level parallelism means that data to be accessed in parallel can be placed across DRAM banks; similarly, chip-level parallelism means that such data can be placed across DRAM chips. To exploit these concepts efficiently, the tiling configurations of all data types are needed. The DRAM data mapping can then be done as illustrated in Fig. 7a, with the following strategy (a toy layout is sketched after the list).

Figure 7. (a) DRAM data mapping and (b) on-chip memory mapping considering the dataflow patterns and tiling configuration.
  • For ifmap: Each ifmap is tiled using its tiling parameters. Using these parameters, we can estimate the number of tile accesses required for the ifmap and map the ifmap inside the DRAM accordingly. First, to exploit the row-buffer locality, we map a portion of the ifmap tile into multiple subsequent rows within a DRAM bank. In this manner, on a row hit we can efficiently access the entire data of that row back-to-back, and another row is activated only once all of its data has been accessed; therefore, row-buffer misses are reduced significantly. Second, to exploit bank-level and chip-level parallelism, we map the other portions of the ifmap tile across multiple banks and multiple chips, so that data parallelism in both dimensions increases the throughput.

  • For weights: The weights filters are tiled using their tiling parameters. Using these parameters, we can estimate the number of tile accesses required for the weights and map the weights inside the DRAM accordingly. To exploit the row-buffer locality, we map a portion of the weights tile into multiple subsequent rows within a DRAM bank; to exploit bank-level and chip-level parallelism, we map the other portions of the weights tile across multiple banks and multiple chips. In this manner, data parallelism in both dimensions increases the throughput.

  • For ofmap: The ofmap can follow the strategy of the ifmap, since the ofmap of one convolutional layer becomes the ifmap of the next.
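As a toy illustration of this strategy, the sketch below lays out the elements of one tile so that consecutive elements fill a row of one bank (row-buffer locality) before successive row-sized chunks are spread over chips and banks (chip- and bank-level parallelism). The sizes and the exact interleaving order are our own assumptions, not the configuration used in the paper.

```python
def map_tile_to_dram(num_elems, num_chips=4, num_banks=8, row_capacity=1024):
    """Toy layout: consecutive elements fill one DRAM row of one bank before
    moving on; successive row-sized chunks are spread across chips and banks.
    Returns a list of (chip, bank, row, column) coordinates."""
    coords = []
    for e in range(num_elems):
        chunk, column = divmod(e, row_capacity)   # which row-sized chunk
        chip = chunk % num_chips                  # spread chunks over chips
        bank = (chunk // num_chips) % num_banks   # then over banks
        row = chunk // (num_chips * num_banks)    # then over rows
        coords.append((chip, bank, row, column))
    return coords

# The first row_capacity elements stay in (chip 0, bank 0, row 0), so they
# are served back-to-back from one open row; the next chunk lands on chip 1,
# which can be accessed in parallel with chip 0.
print(map_tile_to_dram(2 * 1024)[::1024])
```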

3.3. On-Chip Data Mapping

To efficiently shuttle data from the DRAM to the computing engine, an on-chip data mapping strategy is also needed. Here, we use a scratch-pad memory (SPM) as the on-chip buffer. An efficient data mapping in the SPM can be done as illustrated in Fig. 7b, with the following strategy (a minimal sketch follows the list).

  • For ifmap: Each tile of the ifmap is placed across multiple SPM banks to increase the data throughput. For a systolic-array-based accelerator, it is recommended that the number of SPM banks for the ifmap equals the number of rows of the systolic array, so that each bank can supply data to a specific row of the systolic array engine. In this manner, the data parallelism of the ifmap can be exploited efficiently.

  • For weights: The tiles of the weights filters are placed across multiple SPM banks to increase the data throughput. For a systolic-array-based accelerator, it is recommended that the number of SPM banks for the weights equals the number of columns of the systolic array, so that each bank can supply data to a specific column of the systolic array engine. Furthermore, different filters are placed in different banks to ensure that each bank supplies a different filter to its column of the systolic array. In this manner, the data parallelism of the weights can be exploited efficiently.

  • For ofmap: The ofmap can also follow the strategy of the ifmap, since the ofmap of one convolutional layer becomes the ifmap of the next.
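The following minimal sketch distributes a 2-D tile over SPM banks in a round-robin fashion so that each bank can feed one row (or column) of a 12 × 14 systolic array, as configured later in Table 2; the round-robin split and all names are illustrative assumptions rather than the paper's exact mapping.

```python
import numpy as np

def to_spm_banks(tile_2d, num_banks):
    """Distribute the rows of a 2-D tile round-robin over SPM banks so that
    bank b can feed row/column b of the systolic array (illustrative)."""
    return [tile_2d[b::num_banks] for b in range(num_banks)]

# With the 12 x 14 array of Table 2, the ifmap tile would be split over
# 12 banks (one per array row) and the weights tile over 14 banks (one per
# array column, ideally one filter per bank).
ifmap_tile = np.arange(48 * 64).reshape(48, 64)
ifmap_banks = to_spm_banks(ifmap_tile, num_banks=12)
print(len(ifmap_banks), ifmap_banks[0].shape)   # 12 banks, 4 tile rows each
```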

4. Evaluation Methodology

To evaluate the ROMANet methodology, we developed a simulator that models its behavior and its strategy for finding efficient dataflow patterns; the DNN accelerator and memory configurations are integrated as constraints. We designed a Tensor Processing Unit (TPU) (et al, 2017)-like DNN accelerator with reduced on-chip memory and computing array sizes (Table 2) and used it as the reference for our experiments. We synthesized this accelerator with the Synopsys Design Compiler on a CMOS technology to obtain timing, power, and area estimates, and we extracted power and energy estimates for the off-chip and on-chip memories using CACTI (et al, 2018). The experimental setup and tool flow are illustrated in Fig. 8. In the experiments, we evaluated the ROMANet methodology and compared it with the state-of-the-art using AlexNet and VGG-16, investigating the number of DRAM accesses, the DRAM access volume, and the dynamic energy consumption of the DRAM accesses.

Module | Description
Systolic Array | 12 × 14 MAC processing elements
Data Buffer | Total buffer size = 108 KB
Accumulator | Register size = 256 B
Activation-Pooling | Register size = 256 B
DRAM | Size = 2 Gb, bandwidth = 12.8 GB/s (Malladi et al., 2012; Micron, 2010)
Table 2. Configuration of the systolic array accelerator.
Figure 8. Overview of the experimental setup and tool flow.

5. Results and Discussion

Figure 9. Evaluation and comparison of ROMANet with the state-of-the-art on the number of DRAM accesses for (a) AlexNet and (d) VGG-16; DRAM access volume for (b) AlexNet and (e) VGG-16; and DRAM access energy for (c) AlexNet and (f) VGG-16.

The experimental results for AlexNet are presented in Fig. 9a-c. For observation of the number of DRAM accesses, the ROMANet methodology achieves up to overall improvements as compared to the state-of-the-art. If the memory mapping is considered in the state-of-the-art, then the ROMANet methodology can still achieve up to improvements. In such scenario, if the observation is conducted in each layer of the given network, the improvements achieved by the ROMANet methodology are within range . For observation of the DRAM access volume, the ROMANet methodology also achieves up to overall improvements as compared to the state-of-the-art and if the memory mapping is considered in the state-of-the-art. For layer-wise observation, the improvements achieved are within range . The similar percentages of improvements are also observed in the dynamic energy consumption for DRAM accesses for overall and layer-wise observations.

The experimental results for VGG-16 are presented in Fig. 9d-f. For observation of the number of DRAM accesses, the ROMANet methodology achieves up to overall improvements as compared to the state-of-the-art. If the memory mapping is employed in the state-of-the-art, then the ROMANet methodology can still achieve improvements up to . In such case, if the observation is conducted in each layer of the given network, the improvements achieved by the ROMANet methodology are within range . For observation of the DRAM access volume, the ROMANet methodology also achieves up to overall improvements as compared to the state-of-the-art and if memory mapping is considered in the state-of-the-art. The improvements are within range for layer-wise observation. The similar percentages of improvements are also observed in the dynamic energy consumption for DRAM accesses for overall and layer-wise observation.

These results show that the memory mapping gives a significant benefit in reducing the overall data accesses, because it minimizes the number of accesses to redundant data. The significant improvements of the ROMANet methodology over the state-of-the-art come from dataflow patterns that efficiently exploit both data reuse and memory mapping for each layer of the given network. Furthermore, the experimental results show that the improvements are not evenly distributed across the layers. This gives the insight that, in some layers, the ROMANet methodology is more efficient than the state-of-the-art thanks to our efficient dataflow patterns and memory mapping, while in other layers the ROMANet methodology has a comparable efficiency to the state-of-the-art.

6. Conclusion

In this work, we demonstrate that efficient dataflow patterns for DNN accelerators can be obtained with the proposed ROMANet methodology, which defines an efficient data reuse strategy and memory mapping for each layer of a network. The experimental results show that the proposed methodology can significantly reduce the number of DRAM accesses, the DRAM access volume, and the DRAM dynamic energy consumption, thereby increasing the overall dynamic energy efficiency up to for AlexNet and for VGG-16 in state-of-the-art DNN accelerators. Our novel concepts enable further research on more comprehensive studies of energy-efficient DNN accelerators.

References

  • Alwani et al. (2016) M. Alwani, H. Chen, M. Ferdman, and P. Milder. 2016. Fused-layer CNN accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.
  • Chen et al. (2014) T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’14). ACM, New York, NY, USA, 269–284.
  • Chen et al. (2017) Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52, 1 (Jan 2017), 127–138.
  • et al (2018) N. Muralimanohar et al. 2018. CACTI 7.0. https://github.com/HewlettPackard/cacti
  • et al (2017) N. P. Jouppi et al. 2017. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 1–12.
  • Han et al. (2016) S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 243–254.
  • Hanif et al. (2018) M. A. Hanif, R. V. W. Putra, M. Tanvir, R. Hafiz, S. Rehman, and M. Shafique. 2018. MPNA: A Massively-Parallel Neural Array Accelerator with Dataflow Optimization for Convolutional Neural Networks. arXiv preprint arXiv:1810.12910 (2018).
  • Kwon et al. (2018) H. Kwon, A. Samajdar, and T. Krishna. 2018. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 461–475.
  • LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436.
  • Li et al. (2018) J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li. 2018. SmartShuttle: Optimizing off-chip memory accesses for deep learning accelerators. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). 343–348.
  • Lu et al. (2017) W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. 2017. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 553–564.
  • Malladi et al. (2012) K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz. 2012. Towards energy-proportional datacenter memory with mobile DRAM. In 2012 39th Annual International Symposium on Computer Architecture (ISCA). 37–48.
  • Micron (2010) Micron. 2010. Micron 2Gb: x4, x8, x16 DDR3 SDRAM. Data Sheet MT41J128M16HA-12.
  • Parashar et al. (2017) A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 27–40.
  • Sze et al. (2017) V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer. 2017. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 105, 12 (Dec 2017), 2295–2329.
  • Zhang et al. (2015) C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’15). ACM, New York, NY, USA, 161–170.
  • Zhang et al. (2016) S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.