DRMap: A Generic DRAM Data Mapping Policy for Energy-Efficient Processing of Convolutional Neural Networks

04/21/2020, by Rachmad Vidya Wicaksana Putra et al. (TU Wien)

Many convolutional neural network (CNN) accelerators face performance- and energy-efficiency challenges that are crucial for embedded implementations, due to high DRAM access latency and energy. Recently, some DRAM architectures have been proposed to exploit subarray-level parallelism for decreasing the access latency. Towards this, we present a design space exploration methodology to study the latency and energy of different mapping policies on different DRAM architectures, and to identify the Pareto-optimal design choices. The results show that energy-efficient DRAM accesses can be achieved by a mapping policy that prioritizes, in order, maximizing row buffer hits, bank-level parallelism, and subarray-level parallelism.


I Introduction

The widespread use of machine learning (ML) algorithms for organizing, analyzing, and inferring information from digital data is growing fast. Among many ML algorithms, convolutional neural network (CNN) algorithms have demonstrated state-of-the-art performance in data analytics tasks, such as image classification, object recognition, smart environments, health care, and automotive [13]. Since CNN algorithms require data-intensive processing, CNN hardware accelerators are typically employed to expedite the inference process. Over the past few years, several CNN accelerators have been proposed [2, 22, 3, 1, 17, 7, 18, 16, 12, 6, 19]. These accelerators offer higher performance- and energy-efficiency as compared to general-purpose CPUs. However, many CNN accelerators still face performance- and energy-efficiency challenges due to the high off-chip memory (i.e., DRAM) access latency and energy, which exceed those of the other compute operations [21]. Therefore, reducing the DRAM access latency and energy is required to improve the performance- and energy-efficiency of CNN accelerators.

I-A The State-of-the-Art and Limitations

Previous works have proposed different techniques to reduce the DRAM access energy by minimizing the number of DRAM accesses [22][14][20]. Their main ideas are similar, i.e., (i) defining a layer partitioning (i.e., the portions of data, in the form of blocks/tiles, that are transferred from DRAM to the on-chip memory/buffer at one time; a detailed explanation is provided in Section II-A) and a schedule in which each partition is transferred, and (ii) maximally reusing the data that are already in the on-chip buffer. The state-of-the-art [14] considers adaptive layer partitioning and scheduling to minimize the number of DRAM accesses, by adaptively switching the reuse priority between different data types, i.e., input activations/feature maps (ifms), output activations/feature maps (ofms), and weights (wghs), across the layers of a network. Although all these works reduce the number of DRAM accesses (and thereby the total DRAM access energy), they do not consider improving (i) the DRAM latency-per-access, and (ii) the DRAM energy-per-access. Therefore, the performance- and energy-efficiency improvements achieved by the state-of-the-art are sub-optimal, which limits the further improvements achievable by CNN accelerators. We illustrate this with the help of the following motivational case study.

I-B Motivational Case Study and Associated Research Challenges

Motivational Case Study: Although there are various types of commodity DRAM (e.g., DDR3, DDR4, etc.), they have similar internal organization and operations [5] (detailed DRAM organization and operations are provided in Section II-B). Therefore, different types of commodity DRAM have similar behavior regarding latency-per-access and energy-per-access. The DRAM latency-per-access and energy-per-access vary depending upon whether a single DRAM access faces a row buffer hit, a row buffer miss, or a row buffer conflict. A row buffer hit means that the requested row is already available in the row buffer, hence the data access can be performed directly without additional operations. In case of a row buffer miss or conflict, the requested row has to be opened first before a data access can be performed. Hence, a row buffer miss and a row buffer conflict require higher latency-per-access and energy-per-access than a row buffer hit. To illustrate this, we performed an experimental analysis of the DRAM latency-per-access and energy-per-access under different conditions (i.e., a row buffer hit, miss, and conflict), and the results are presented in Fig. 1. Furthermore, in commodity DRAM, each request that goes to a DRAM bank can only access a single DRAM subarray at a time, although each bank is composed of multiple subarrays. This limits the DRAM capability to offer lower DRAM access latency and energy. Recently, several DRAM architectures that offer subarray-level parallelism (SALP) in a DRAM bank have been proposed in the literature. In [9], three variants of SALP architectures are presented, i.e., SALP-1, SALP-2, and SALP-MASA (detailed SALP architectures are provided in Section II-C). Our observations in Fig. 1 show that SALP architectures have the potential to further reduce the DRAM latency-per-access and energy-per-access as compared to commodity DRAM.

Fig. 1: DRAM latency-per-access and DRAM energy-per-access for different conditions (a row buffer hit, a row buffer miss, a row buffer conflict, subarray- and bank-level parallelism) in different DRAM architectures (DDR3, SALP-1, SALP-2, and SALP-MASA). Data are obtained from our experiments using state-of-the-art cycle-accurate DRAM simulators [10][4] for DDR3-1600 2Gb x8 and SALP 2Gb x8 with 8 subarrays-per-bank.

Associated Research Challenges: From the above observations, the energy efficiency of DRAM accesses for CNN accelerators can be improved by minimizing the DRAM latency-per-access and energy-per-access. Therefore, there is a need for a generic DRAM mapping policy that can achieve maximum row buffer hits while exploiting subarray- and bank-level parallelism. Furthermore, to show that the proposed DRAM mapping policy is applicable to different design choices, a design space exploration (DSE) is required. This DSE explores different DRAM mapping policies in different DRAM architectures with different layer partitioning and scheduling schemes, to find the minimum energy-delay-product (EDP) of DRAM accesses. This EDP is used as a measure of the energy-efficiency of a CNN accelerator. Therefore, an analytical model for estimating the EDP of different DRAM mapping policies in the DSE is also required.

I-C Our Novel Contributions

In this paper, we make the following novel contributions (the overview is illustrated in Fig. 2) to overcome the associated challenges.

Fig. 2: The overview of our novel contributions, highlighted in the green boxes. We use separate on-chip buffers for different data types: input buffer (iB) for ifms, weight buffer (wB) for wghs, and output buffer (oB) for ofms.
  1. We propose DRMap: a generic DRAM data Mapping policy that offers the minimum energy-delay-product (EDP) of DRAM accesses for a given DRAM architecture, layer partitioning, and scheduling scheme. DRMap prioritizes, in order, maximizing row buffer hits, bank-level parallelism, and subarray-level parallelism.

  2. We propose a design space exploration (DSE) algorithm to find a DRAM mapping that offers the minimum EDP, while considering different DRAM architectures and different layer partitioning and scheduling schemes.

  3. We propose an analytical model for estimating the EDPs of different DRAM mapping policies, which is used in the DSE. The EDP for each DRAM mapping is estimated by multiplying the number of DRAM accesses with the respective number of cycles and energy values.

Key results: DRMap, which prioritizes, in order, maximizing row buffer hits, bank-level parallelism, and subarray-level parallelism, improves the EDP compared to the other mapping policies by up to 96% for DDR3, 94% for SALP-1, 91% for SALP-2, and 80% for SALP-MASA on AlexNet [11].

II Preliminaries

II-A Layer Partitioning and Scheduling in CNNs

The full CNN processing usually cannot be mapped at once onto the accelerator fabric due to the limited on-chip buffer capacity [21]; hence, layer partitioning and scheduling are required. To illustrate this, the pseudo-code of a convolutional layer processing in a CNN accelerator is shown in Fig. 3. It has two parts: inner loops and outer loops. The inner loops represent the on-chip processing. The outer loops represent the scheduling of processing different portions of data (from all data types: ifms, wghs, and ofms), whose sizes have to be less than or equal to the sizes of the respective buffers (iB, wB, and oB). These data are partitioned in the form of blocks/tiles, which are represented by the step sizes. Furthermore, the sequence of the outer loops represents the order in which the tiles are accessed from DRAM into the on-chip buffers, and thereby determines the number of DRAM accesses required to process a layer of a network.

Fig. 3: Pseudo-code of the tiled convolutional neural network processing.
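Since the pseudo-code of Fig. 3 is not reproduced here, the following is a minimal Python sketch of such a tiled loop nest (the layer dimensions, tile sizes, and array names are hypothetical, not taken from the paper): the outer loops transfer one tile of each data type into the buffers, and the inner loops reuse the buffered data for the on-chip computation.

import numpy as np

# Hypothetical layer and tile dimensions (not from the paper).
M, C, H, W, K = 16, 8, 16, 16, 3      # ofm channels, ifm channels, ifm height/width, kernel size
Tm, Tc, Th, Tw = 4, 4, 8, 8           # tile (step) sizes, chosen so the tiles fit iB, wB, and oB

ifms = np.random.rand(C, H, W)
wghs = np.random.rand(M, C, K, K)
ofms = np.zeros((M, H - K + 1, W - K + 1))

# Outer loops: schedule of tile transfers between DRAM and the on-chip buffers.
for m0 in range(0, M, Tm):
    for c0 in range(0, C, Tc):
        for h0 in range(0, ofms.shape[1], Th):
            for w0 in range(0, ofms.shape[2], Tw):
                # "DRAM accesses": copy one tile of each data type into its buffer.
                iB = ifms[c0:c0 + Tc, h0:h0 + Th + K - 1, w0:w0 + Tw + K - 1].copy()
                wB = wghs[m0:m0 + Tm, c0:c0 + Tc].copy()
                oB = ofms[m0:m0 + Tm, h0:h0 + Th, w0:w0 + Tw].copy()
                # Inner loops: on-chip processing that reuses the buffered tiles.
                for m in range(oB.shape[0]):
                    for c in range(iB.shape[0]):
                        for h in range(oB.shape[1]):
                            for w in range(oB.shape[2]):
                                for kh in range(K):
                                    for kw in range(K):
                                        oB[m, h, w] += iB[c, h + kh, w + kw] * wB[m, c, kh, kw]
                # Write the partially accumulated ofm tile back to DRAM.
                ofms[m0:m0 + Tm, h0:h0 + Th, w0:w0 + Tw] = oB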

II-B DRAM Overview

DRAM Organization: From a top-down perspective, a commodity DRAM is organized into channel, rank, chip, bank, row, and column [5][15], as shown in Fig. 4(a). In commodity DRAM, banks are the lowest level of the hierarchy that can be accessed in parallel, which is referred to as bank-level parallelism [8]. In practice, a DRAM bank is not implemented as a monolithic design (a large array of cells with a single row buffer). Instead, it is implemented as multiple subarrays, each of which has its own local row buffer, as shown in Fig. 4(b). Multiple subarrays in a bank share (i) global bitlines, which connect the local row buffers to a global row buffer, and (ii) a global row address decoder [9].

Fig. 4: (a) DRAM organization overview, and (b) physical implementation of a DRAM bank, showing multiple subarrays in a bank.

DRAM Operations: For a single DRAM request, one rank responds, and multiple chips within this rank are accessed in parallel, each contributing a part of the requested DRAM word. In each chip, the request is routed to a specific bank and decoded into a row address and a column address. The activation (ACT) command triggers a row activation, and data from the requested row are copied into the row buffer. Afterward, a read (RD) or write (WR) command can be issued to a specific column of the activated row buffer. If the requested row is already activated, its data are already in the row buffer (a row buffer hit), and no new row activation is needed. If the requested row is not activated yet, the access is either a row buffer miss or a row buffer conflict. In a row buffer miss, no row is activated in the row buffer yet, so the requested row has to be activated. In a row buffer conflict, another row is activated in the row buffer, so the activated row first has to be closed using the precharge (PRE) command before the requested row can be activated using the ACT command.
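This hit/miss/conflict behavior can be summarized by a short sketch that classifies a stream of (bank, row) requests under an open-row policy; the per-command cycle counts below are hypothetical placeholders, not the measured values of Fig. 1.

# Hypothetical per-command latencies in cycles (illustrative only, not the values of Fig. 1).
CYCLES = {"PRE": 11, "ACT": 11, "RDWR": 4}

def classify_requests(requests):
    """Classify (bank, row) requests as row buffer hit, miss, or conflict under an
    open-row policy, and accumulate the cycles of the commands each request needs."""
    open_row = {}                                  # bank -> currently activated row
    outcomes, total_cycles = [], 0
    for bank, row in requests:
        if bank not in open_row:                   # no activated row yet: row buffer miss
            outcomes.append("miss")
            total_cycles += CYCLES["ACT"] + CYCLES["RDWR"]
        elif open_row[bank] == row:                # requested row already open: row buffer hit
            outcomes.append("hit")
            total_cycles += CYCLES["RDWR"]
        else:                                      # another row is open: row buffer conflict
            outcomes.append("conflict")
            total_cycles += CYCLES["PRE"] + CYCLES["ACT"] + CYCLES["RDWR"]
        open_row[bank] = row                       # the requested row remains open afterwards
    return outcomes, total_cycles

print(classify_requests([(0, 5), (0, 5), (0, 7), (1, 2)]))
# -> (['miss', 'hit', 'conflict', 'miss'], 60)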

DRAM Data Mapping: The default data mapping prioritizes mapping consecutively accessed data to different columns of the same row of a bank (to increase row buffer hits) and to different banks of the same rank (to exploit bank-level parallelism). However, it exploits neither subarray-level parallelism nor the different possible layer partitioning and scheduling schemes. Therefore, the default data mapping is suboptimal.
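For illustration, such a default mapping can be viewed as a simple address decomposition; the geometry below (1024 columns, 8 banks) is hypothetical and only shows how consecutive addresses first walk through the columns of a row, then through the banks, and only then move to the next row.

# Hypothetical geometry: 1024 columns per row, 8 banks per chip (illustrative only).
N_COLS, N_BANKS = 1024, 8

def default_map(addr):
    """Default mapping: consecutive addresses fill the columns of a row, then the same
    row index in other banks (bank-level parallelism), and only then the next row."""
    col = addr % N_COLS
    bank = (addr // N_COLS) % N_BANKS
    row = addr // (N_COLS * N_BANKS)
    return row, bank, col

# Consecutive accesses stay within one row (row buffer hits) until it is full, then
# move across banks; the row index, the most expensive to change, varies least often.
print([default_map(a) for a in (0, 1, 1023, 1024, 8191, 8192)])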

II-C DRAM Architectures that Exploit Subarray-level Parallelism

In a commodity DRAM, each request that goes to a DRAM bank can only access a single subarray at a time. This limits the potential to reduce the DRAM access latency and energy. To address this limitation, [9] proposed three DRAM architectures and mechanisms that exploit subarray-level parallelism (SALP) within the same bank, called SALP-1, SALP-2, and SALP-MASA. Their key ideas are as follows.


  • SALP-1 reduces the DRAM service time by overlapping the precharging of one subarray with the activation of another subarray, since precharging and activation are mostly local to a subarray. To enable this mechanism, a re-interpretation of the existing timing constraint for precharging is required.

  • SALP-2 reduces the DRAM service time even more than SALP-1, by overlapping the write-recovery latency of an active subarray, with the activation of another subarray. To enable this, additional circuitry to activate two subarrays at the same time is required.

  • Multitude of Activated Subarrays (MASA) reduces the DRAM service time even more than SALP-2, by activating multiple subarrays at the same time (the activations of different subarrays are overlapped). To enable this, additional circuitry (more than SALP-2) to activate multiple subarrays at the same time is required.

III Our Design Methodology for DRAM Mapping in CNN Accelerators

III-A DRMap: A Generic DRAM Data Mapping Policy

Our observations from the results in Fig. 1 show that different DRAM architectures have similar behavior in terms of latency-per-access and energy-per-access. Therefore, we propose DRMap, a generic DRAM mapping policy for energy-efficient DRAM accesses in CNN accelerators. Its main idea is to prioritize, in order, data mappings that maximize DRAM row buffer hits, bank-level parallelism, and subarray-level parallelism. The flowchart of the DRMap mechanism in a DRAM chip is presented in Fig. 5, while its pseudo-code and the physical representation of the mapping policy are illustrated in Fig. 6. DRMap assumes tile-based partitioning; hence, it is performed for each data tile using the following steps (a code sketch of this ordering is given after Fig. 6):

  1. When accessing a DRAM bank, DRMap prioritizes mapping a data partition to different columns in the same row to achieve maximum row buffer hits. If multiple chips are available within a rank, this step can be performed across different chips to exploit chip-level parallelism.

  2. If all columns in the same row of a bank are fully filled, the remaining data are mapped to different banks in the same chip, to exploit bank-level parallelism. If multiple chips are available, this step can be performed across different chips.

  3. If all columns in the same row of all banks are fully filled, the remaining data are mapped to a different subarray in the same bank, to exploit subarray-level parallelism. If multiple chips are available, this step can be performed across different chips.

  4. If there are data remaining, steps 1) to 3) are repeated for a different subarray until all data are mapped within the same rank. In this manner, DRMap achieves maximum row buffer hits while maximally exploiting bank- and subarray-level parallelism within a DRAM rank.

  5. If there are still data remaining, they are mapped to a different rank (channel), if available, using the same steps 1) to 4). In this manner, DRMap achieves maximum row buffer hits while maximally exploiting bank- and subarray-level parallelism in the other DRAM ranks (channels) as well.

Fig. 5: Flowchart of the DRMap that illustrates how the mapping policy is performed in a DRAM chip.
Fig. 6: Pseudo-code of DRMap and its conceptual implementation in DRAM.
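Since the pseudo-code of Fig. 6 is not reproduced here, the following is a minimal sketch of the DRMap ordering under hypothetical geometry parameters: the words of a tile are laid out column-first, then across banks, then across subarrays, and only then across rows.

# Hypothetical DRAM geometry per chip (illustrative only).
N_COLS, N_BANKS, N_SUBARRAYS, N_ROWS = 4, 8, 8, 32768

def drmap_locations():
    """Yield DRAM locations in DRMap order: columns of the same row first (row buffer
    hits), then different banks (bank-level parallelism), then different subarrays
    (subarray-level parallelism), and only then different rows (the costliest switch)."""
    for row in range(N_ROWS):                       # outer-most: rows change least often
        for subarray in range(N_SUBARRAYS):
            for bank in range(N_BANKS):
                for col in range(N_COLS):           # inner-most: columns change most often
                    yield row, subarray, bank, col

def map_tile(tile_words):
    """Map the words of one data tile to DRAM locations following DRMap."""
    return dict(zip(tile_words, drmap_locations()))

# Example: a 70-word tile first fills all columns of one row across all 8 banks
# (8 banks x 4 columns = 32 words), then spills into the next subarray.
layout = map_tile(range(70))
print(layout[0], layout[31], layout[32], layout[69])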

To show that DRMap always achieves the minimum energy-delay-product (EDP) of DRAM accesses under different possible conditions, we perform an extensive design space exploration (DSE). The DSE investigates different DRAM mapping policies, different DRAM architectures, as well as different layer partitioning and scheduling schemes of a CNN, and estimates the EDP for each of these combinations. This DSE is important to corroborate that the solution providing the minimum EDP in each combination is always the one provided by our DRMap technique.

III-B Design Space Exploration for Evaluating Different DRAM Mapping Policies

To evaluate the impact of different DRAM mapping policies and compare DRMap against the other policies, we performed an extensive design space exploration (DSE). An overview of the DSE is shown in Fig. 7 and its algorithm is presented in Algorithm 1.

Fig. 7: The operational flow of our DSE methodology. Our contributions are highlighted in the green boxes.
Input: (1) CNN configuration: number of layers (L); (2) Buffer sizes: ifms (iB), wghs (wB), ofms (oB); (3) Analytical model of EDP (EDP_model); (4) Layer partitionings for ifms, wghs, and ofms (P); (5) DRAM access scheduling schemes (S); (6) DRAM mapping policies (M);
Output: (1) Efficient DRAM mappings (map*); (2) Minimum EDPs (EDP*);
BEGIN
Initialization:
1:  map* = [ ];
2:  EDP* = [ ];
3:  map_layer = { };
4:  EDP_layer = 0;
Process:
5:  for (layer = 1 to L) do
6:     for (each scheduling scheme s in S) do
7:        for (each partitioning p in P) do
8:           for (each mapping policy m in M) do
9:              if (ifms tile size <= iB) and (wghs tile size <= wB) and (ofms tile size <= oB) then
10:                 Calculate EDP using EDP_model;
11:                 if (first loop) then
12:                    EDP_layer = EDP;
13:                 else if (EDP < EDP_layer) then
14:                    EDP_layer = EDP;
15:                    Save map_layer = m and EDP_layer into map* and EDP*;
16:                 end if
17:              end if
18:           end for
19:        end for
20:     end for
21:  end for
22:  return (1) map*; (2) EDP*; END
Algorithm 1 Pseudo-code of the proposed DSE algorithm
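Algorithm 1 can be expressed compactly in Python. The sketch below only mirrors the loop structure and minimum-EDP bookkeeping of the algorithm; the edp_model callable and the tile-size attributes of a partitioning are hypothetical placeholders that are not specified in the paper.

def dse(num_layers, iB, wB, oB, edp_model, partitionings, schedules, mappings):
    """Exhaustive DSE (Algorithm 1): for each layer, evaluate every (schedule,
    partitioning, mapping) combination whose tiles fit the on-chip buffers,
    and keep the mapping that yields the minimum EDP."""
    best_maps, best_edps = [], []                 # map* and EDP*, one entry per layer
    for layer in range(num_layers):
        best_map, best_edp = None, None
        for sched in schedules:
            for part in partitionings:
                for mapping in mappings:
                    # Tiles of all data types must fit in their buffers (line 9 of Algorithm 1).
                    if part.ifm_tile > iB or part.wgh_tile > wB or part.ofm_tile > oB:
                        continue
                    edp = edp_model(layer, sched, part, mapping)  # line 10
                    if best_edp is None or edp < best_edp:        # lines 11-15
                        best_map, best_edp = mapping, edp
        best_maps.append(best_map)
        best_edps.append(best_edp)
    return best_maps, best_edps                   # line 22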

For each layer of a network, the DSE performs three key steps: (1) defining different sizes of data tiles and scheduling schemes, (2) defining different DRAM mapping policies, and (3) performing the DSE to find a DRAM mapping policy that offers the minimum EDP. The operational flow of the DSE is explained in the following steps:

Step-1. Define (1a) different sizes of data tiles for all data types (ifms, wghs, and ofms), and (1b) different scheduling schemes. The tile sizes are determined by the step sizes in the outer loops of Fig. 3. The tile sizes of ifms, wghs, and ofms have to fit in the corresponding buffers (iB, wB, and oB). Each combination of the tile sizes for all data types defines one possible partitioning, which will be considered in the DSE. The scheduling schemes are determined by the sequence of the outer loops of Fig. 3. In this work, we consider four scheduling schemes, based on the reuse priority of the data types: ifms-reuse, wghs-reuse, ofms-reuse, and adaptive-reuse scheduling. The ifms-reuse scheduling means that the ifms data type is maximally reused while the data are available in the on-chip buffer; similar definitions apply to wghs-reuse and ofms-reuse. Meanwhile, the adaptive-reuse scheduling means that the reuse priority changes across the layers of a network, according to whichever of the ifms-/wghs-/ofms-reuse schedules offers the minimum number of DRAM accesses for that layer. A sketch of the partitioning enumeration is given below.
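The following sketch illustrates Step-1a under hypothetical candidate step sizes and a 16-bit data word: it enumerates combinations of the step sizes of Fig. 3 and keeps only those whose ifms/wghs/ofms footprints fit the corresponding buffers.

from itertools import product

BYTES_PER_WORD = 2                                  # hypothetical 16-bit data words

def valid_partitionings(kernel_size, iB, wB, oB, candidate_steps):
    """Step-1a: enumerate tile (step) sizes whose data footprints fit the on-chip buffers."""
    valid = []
    for tm, tc, th, tw in product(*candidate_steps):   # steps for ofm channels, ifm channels, height, width
        ifm_tile = tc * (th + kernel_size - 1) * (tw + kernel_size - 1) * BYTES_PER_WORD
        wgh_tile = tm * tc * kernel_size * kernel_size * BYTES_PER_WORD
        ofm_tile = tm * th * tw * BYTES_PER_WORD
        if ifm_tile <= iB and wgh_tile <= wB and ofm_tile <= oB:
            valid.append((tm, tc, th, tw))
    return valid

# Example with made-up candidate step sizes, a 3x3-kernel layer, and 64KB buffers.
steps = ([4, 8, 16], [4, 8], [8, 16], [8, 16])
print(len(valid_partitionings(3, 64 * 1024, 64 * 1024, 64 * 1024, steps)))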

Step-2. Define different DRAM mapping policies by determining different orders of the mapping loops over columns, rows, subarrays, and banks in the same DRAM chip. For DDR3, the mapping-loop orders are permutations of banks, rows, and columns within a DRAM chip; for SALP, they are permutations of banks, subarrays, rows, and columns within a DRAM chip. Here, we narrow down the design space by selecting only the DRAM mapping policies in which subsequent accesses go to different rows least frequently, since switching rows is the most expensive access within a DRAM chip in both latency and energy (as shown in Fig. 1). Therefore, six mapping policies are explored in the DSE, as presented in Table I and illustrated by the sketch after it.

Step-3. Perform the DSE to find a DRAM mapping policy that offers the minimum EDP across different DRAM architectures and different layer partitioning and scheduling schemes. The minimum EDP and the corresponding DRAM mapping are the outputs of the DSE for a given DRAM architecture, layer partitioning, and scheduling.

Note that both latency and energy are already included in the DSE for determining the EDP in the final results. For each layer of a network, the EDP is obtained by multiplying the DRAM access energy and the DRAM access latency incurred by each combination of DRAM mapping policy, DRAM architecture, layer partitioning, and scheduling scheme. Therefore, the DSE can find the combination that offers the minimum EDP for each layer of a network and the minimum total EDP for the whole network.

Mapping | Inner-most to outer-most loops
1 | column, subarray, bank, row
2 | subarray, column, bank, row
3 | column, bank, subarray, row
4 | bank, column, subarray, row
5 | subarray, bank, column, row
6 | bank, subarray, column, row
TABLE I: Different DRAM mapping policies for the DSE.
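The six policies of Table I can be read as the loop orders whose outer-most loop is the row loop, i.e., those that switch rows least frequently. A short sketch of this reading of Step-2 (the enumeration order here need not match Table I's numbering):

from itertools import permutations

# All loop orders over a SALP chip's dimensions, listed inner-most to outer-most.
dims = ("column", "subarray", "bank", "row")
all_orders = permutations(dims)                    # 24 candidate mapping policies

# Keep only the orders whose outer-most loop is the row loop, i.e., those in which
# subsequent accesses go to a different row least frequently (cf. Fig. 1).
explored = [order for order in all_orders if order[-1] == "row"]
for i, order in enumerate(explored, start=1):
    print(i, ", ".join(order))                     # six policies, matching the set in Table I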

III-C Analytical Model of Energy-Delay-Product (EDP) Estimation for Different DRAM Mapping Policies

Based on the proposed DSE, the optimization problem is formulated as minimizing the EDP of DRAM accesses for each layer of a network, and can be stated as

$\min \; EDP_{layer} = E_{layer} \times T_{layer}$    (1)

The EDP-per-layer ($EDP_{layer}$) is obtained by multiplying the energy-per-layer ($E_{layer}$) and the latency-per-layer ($T_{layer}$). The energy-per-layer is obtained by accumulating the access energy incurred by the DRAM accesses of all data tiles. The latency-per-layer is obtained by accumulating the access latency incurred by the DRAM accesses of all data tiles. The access latency and energy are calculated on the basis of the DRAM accesses of each data tile, since we consider a layer partitioning approach. Therefore, for each tile, the number of cycles required for the DRAM accesses can be formulated as Eq. 2 and the DRAM access energy as Eq. 3.

$cycles_{tile} = \sum_{i} n_i \cdot c_i$    (2)
$energy_{tile} = \sum_{i} n_i \cdot e_i$    (3)

Here, $n_i$ denotes the number of accesses to different DRAM-$i$, $c_i$ denotes the number of cycles incurred when accessing different DRAM-$i$, and $e_i$ denotes the access energy incurred when accessing different DRAM-$i$, with $i \in$ {columns, rows, subarrays, banks}.
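A minimal sketch of this analytical model, assuming hypothetical per-access cycle and energy tables (in practice, these values come from Ramulator and VAMPIRE, as described in Section IV):

# Hypothetical per-access cycle and energy values for one DRAM architecture,
# indexed by the access type i in {columns, rows, subarrays, banks}.
CYCLES_PER_ACCESS = {"columns": 4, "rows": 26, "subarrays": 20, "banks": 15}
ENERGY_PER_ACCESS = {"columns": 1.0, "rows": 6.0, "subarrays": 4.5, "banks": 3.0}   # nJ

def tile_cycles(n_acc):
    """Eq. (2): cycles for one tile = sum_i n_i * c_i."""
    return sum(n_acc[i] * CYCLES_PER_ACCESS[i] for i in n_acc)

def tile_energy(n_acc):
    """Eq. (3): energy for one tile = sum_i n_i * e_i."""
    return sum(n_acc[i] * ENERGY_PER_ACCESS[i] for i in n_acc)

def layer_edp(tiles, clock_period_ns=1.25):
    """Eq. (1): EDP per layer = (energy summed over all tiles) * (latency summed over all tiles)."""
    energy = sum(tile_energy(t) for t in tiles)                      # nJ
    latency = sum(tile_cycles(t) for t in tiles) * clock_period_ns   # ns
    return energy * latency

# Example: two tiles with made-up access counts per access type.
tiles = [
    {"columns": 512, "rows": 2, "subarrays": 8, "banks": 8},
    {"columns": 256, "rows": 1, "subarrays": 4, "banks": 8},
]
print(layer_edp(tiles))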

IV Evaluation Methodology

To evaluate our proposed methodology, we built the experimental setup, as presented in Fig. 8.

Fig. 8: Experimental setup and tool flow.
Module | Description
CNN Processing Array | Size = 8 x 8 MACs
On-chip Buffers | iB: 64KB, wB: 64KB, oB: 64KB
Memory Controller | Policy = open row, scheduler = FCFS
DDR3-1600 | Configuration: 2Gb x8, 1 channel, 1 rank-per-channel, 1 chip-per-rank, 8 banks-per-chip
SALP | Configuration: 2Gb x8, 1 channel, 1 rank-per-channel, 1 chip-per-rank, 8 banks-per-chip, 8 subarrays-per-bank
TABLE II: Configuration of the CNN accelerator.

Tool flow: We used a cycle-accurate DRAM simulator, Ramulator [10], to obtain the statistics (i.e., the number of cycles) of the different DRAM access conditions: a row buffer hit, a row buffer miss, a row buffer conflict, and subarray- and bank-level parallelism. To profile the energy, we used VAMPIRE [4], a DRAM energy simulator based on real measurements. The energy and cycle numbers feed the DSE, which considers different DRAM mapping policies, different DRAM architectures, and different layer partitioning and scheduling schemes to find the DRAM mapping policy that offers the minimum EDP. For the DSE, we considered a state-of-the-art Tensor Processing Unit (TPU) [7]-like CNN accelerator with a reduced size of on-chip buffers and MAC array engine, as specified in Table II. To represent different DRAM architectures, we used DDR3 and the SALP architectures (SALP-1, SALP-2, and SALP-MASA). For scheduling, we considered the ifms-reuse, wghs-reuse, ofms-reuse, and adaptive-reuse scheduling schemes. For mapping, we considered the six mapping policies presented in Table I. As the input workload, we used AlexNet [11] with the ImageNet dataset.

V Results and Discussions

Fig. 9: The EDP in AlexNet for different DRAM mapping policies across different DRAM architectures (DDR3, SALP-1, SALP-2, and SALP-MASA), while considering different scheduling schemes: (a) ifms-reuse scheduling, (b) wghs-reuse scheduling, (c) ofms-reuse scheduling, and (d) adaptive reuse scheduling.

V-A Comparisons of Different DRAM Mapping Policies

We evaluated the impact of different DRAM mapping policies and the results are presented in Fig. 9.

Key Observation-1: Our DRMap (Mapping-3) achieves the lowest EDP across different layers of the network, across different DRAM architectures, and across different scheduling schemes. This indicates that DRMap is the most effective DRAM mapping policy across the different possible conditions. According to Table I, DRMap (Mapping-3) prioritizes, in order, mapping the data to different columns in the same row (leading to row buffer hits in SALP and DDR3), to different banks in the same chip (exploiting bank-level parallelism in SALP and DDR3), to different subarrays in the same bank (exploiting subarray-level parallelism in SALP, but leading to row buffer conflicts in DDR3), and to different rows in the same subarray (leading to row buffer conflicts in SALP and DDR3). Therefore, DRMap is a generic DRAM mapping policy that offers the lowest EDP. Moreover, different DRAM access scheduling schemes can make use of DRMap, so that CNN accelerators with different scheduling schemes can optimize their DRAM access latency and energy. DRMap improves the EDP by up to 96% in DDR3, 94% in SALP-1, 91% in SALP-2, and 80% in SALP-MASA, as compared to the other mapping policies.

Key Observation-2: Mapping-2 and Mapping-5 obtain worse EDPs (across different layers of the network, across different DRAM architectures, and across different scheduling schemes) than the rest of the mapping policies. The reason is that Mapping-2 and Mapping-5 prioritize mapping data across different subarrays in the same bank (exploiting subarray-level parallelism in SALP, but leading to row buffer conflicts in DDR3), which incurs higher latency and energy than row buffer hits and bank-level parallelism.

Key Observation-3: Mapping-1 and Mapping-3 obtain comparable EDPs. The reason is that both prioritize mapping data across different columns in the same row (leading to row buffer hits in SALP and DDR3). The difference is that Mapping-1 prioritizes exploiting subarray-level parallelism over bank-level parallelism, while Mapping-3 does the opposite. From Fig. 1, it is apparent that exploiting subarray-level parallelism incurs higher latency and energy than exploiting bank-level parallelism.

V-B Comparisons of Employing Different DRAM Architectures

In general, employing the SALP architectures provides EDP improvements as compared to employing DDR3. This is mainly due to the latency and energy savings offered by exploiting subarray-level parallelism. Key Observation-4: For instance, if we consider adaptive-reuse scheduling, the EDP improvements achieved by employing the SALP architectures as compared to DDR3 are:


  • For Mapping-1: 0.59% (SALP-1), 3.89% (SALP-2), and 1.05% (SALP-MASA).

  • For Mapping-2: 29.18% (SALP-1), 19.91% (SALP-2), and 81.04% (SALP-MASA).

  • For Mapping-3 (DRMap): 0.6% (SALP-1), 3.87% (SALP-2), and 1.01% (SALP-MASA).

  • For Mapping-4: 0.71% (SALP-1), 0.54% (SALP-2), and 1.41% (SALP-MASA).

  • For Mapping-5: 29.67% (SALP-1), 19.79% (SALP-2), and 81.76% (SALP-MASA).

  • For Mapping-6: 3.15% (SALP-1), 3.39% (SALP-2), and 7.62% (SALP-MASA).

The results show that employing the SALP architectures is beneficial for improving the energy-efficiency of DRAM accesses, as long as an effective mapping policy like DRMap is employed. The EDP of employing different DRAM architectures differs due to their different DRAM access energy and latency. However, since the internal organization of all these DRAM architectures is similar (i.e., composed of channel, rank, chip, bank, subarray, row, and column, from a top-down perspective), our DRMap can be employed for all of them to achieve energy-efficient processing of convolutional neural networks in CNN accelerators.

VI Conclusion

In this paper, we presented DRMap, a generic DRAM mapping policy that offers the lowest EDP of DRAM accesses for CNN accelerators, as compared to other mapping policies. This is shown through an extensive design space exploration that studies the latency and energy of different mapping policies across different DRAM architectures as well as different layer partitioning and scheduling schemes. We expect that this work will enable further studies on energy-efficient CNN accelerators and help existing CNN accelerators optimize their DRAM access latency and energy.

VII Acknowledgment

The authors acknowledge the scholarship granted by the Indonesia Endowment Fund for Education (IEFE/LPDP), Ministry of Finance, Republic of Indonesia.

References

  • [1] J. Albericio et al. (2016) Cnvlutin: ineffectual-neuron-free deep neural network computing. In ISCA.
  • [2] T. Chen et al. (2014) DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS.
  • [3] Y. Chen et al. (2017) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE JSSC 52 (1), pp. 127–138.
  • [4] S. Ghose et al. (2018) What your DRAM power models are not telling you: lessons from a detailed experimental study. Proc. ACM POMACS 2 (3), pp. 38:1–38:41.
  • [5] S. Ghose et al. (2019) Demystifying complex workload-DRAM interactions: an experimental study. In SIGMETRICS.
  • [6] M. A. Hanif et al. (2018) MPNA: a massively-parallel neural array accelerator with dataflow optimization for convolutional neural networks. CoRR abs/1810.12910.
  • [7] N. P. Jouppi et al. (2017) In-datacenter performance analysis of a tensor processing unit. In ISCA.
  • [8] Y. Kim et al. (2011) Thread cluster memory scheduling. IEEE Micro 31 (1), pp. 78–89.
  • [9] Y. Kim et al. (2012) A case for exploiting subarray-level parallelism (SALP) in DRAM. In ISCA.
  • [10] Y. Kim et al. (2016) Ramulator: a fast and extensible DRAM simulator. IEEE CAL 15 (1), pp. 45–49.
  • [11] A. Krizhevsky et al. (2012) ImageNet classification with deep convolutional neural networks. In NIPS.
  • [12] H. Kwon et al. (2018) MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In ASPLOS.
  • [13] Y. LeCun et al. (2015) Deep learning. Nature 521 (7553).
  • [14] J. Li et al. (2018) SmartShuttle: optimizing off-chip memory accesses for deep learning accelerators. In DATE.
  • [15] J. Liu et al. (2012) RAIDR: retention-aware intelligent DRAM refresh. In ISCA.
  • [16] W. Lu et al. (2017) FlexFlow: a flexible dataflow accelerator architecture for convolutional neural networks. In HPCA.
  • [17] T. Luo et al. (2017) DaDianNao: a neural network supercomputer. IEEE TC 66 (1), pp. 73–88.
  • [18] A. Parashar et al. (2017) SCNN: an accelerator for compressed-sparse convolutional neural networks. In ISCA.
  • [19] H. Sharma et al. (2018) Bit Fusion: bit-level dynamically composable architecture for accelerating deep neural network. In ISCA.
  • [20] A. Stoutchinin et al. (2019) Optimally scheduling CNN convolutions for efficient memory access. CoRR abs/1902.01492.
  • [21] V. Sze et al. (2017) Efficient processing of deep neural networks: a tutorial and survey. Proc. of the IEEE 105 (12), pp. 2295–2329.
  • [22] C. Zhang et al. (2015) Optimizing FPGA-based accelerator design for deep convolutional neural networks. In FPGA.