Impact of On-Chip Interconnect on In-Memory Acceleration of Deep Neural Networks

07/06/2021 ∙ by Gokul Krishnan, et al. ∙ Arizona State University ∙ University of Wisconsin-Madison

With the widespread use of Deep Neural Networks (DNNs), machine learning algorithms have evolved in two diverse directions – one with ever-increasing connection density for better accuracy and the other with more compact sizing for energy efficiency. The increase in connection density increases on-chip data movement, which makes efficient on-chip communication a critical function of the DNN accelerator. The contribution of this work is threefold. First, we illustrate that the point-to-point (P2P)-based interconnect is incapable of handling a high volume of on-chip data movement for DNNs. Second, we evaluate P2P and network-on-chip (NoC) interconnect (with a regular topology such as a mesh) for SRAM- and ReRAM-based in-memory computing (IMC) architectures for a range of DNNs. This analysis shows the necessity for the optimal interconnect choice for an IMC DNN accelerator. Finally, we perform an experimental evaluation for different DNNs to empirically obtain the performance of the IMC architecture with both NoC-tree and NoC-mesh. We conclude that, at the tile level, NoC-tree is appropriate for compact DNNs employed at the edge, and NoC-mesh is necessary to accelerate DNNs with high connection density. Furthermore, we propose a technique to determine the optimal choice of interconnect for any given DNN. In this technique, we use analytical models of NoC to evaluate end-to-end communication latency of any given DNN. We demonstrate that the interconnect optimization in the IMC architecture results in up to 6× improvement in energy-delay-area product for VGG-19 inference compared to the state-of-the-art ReRAM-based IMC architectures.


1. Introduction

DNNs have achieved high accuracy that exceeds human-level perception for a variety of applications such as computer vision, natural language processing, and medical imaging (Krizhevsky et al., 2012; Deng et al., 2013; Litjens et al., 2017). The DNNs that achieve higher accuracy tend to consist of deeper and denser network structures. In contrast, DNNs for edge devices tend to use smaller and shallower networks.

Figure 1 shows the trend in connection density for various DNNs in the literature, where connection density is defined as the average number of connections per neuron in a DNN. In the context of DNNs, a neuron is defined as an output feature of a convolution layer or a neural unit of the fully-connected (FC) layer. Three representative DNN structures and connection patterns are illustrated in Figure 2. Linear structures such as LeNet-5 (LeCun et al., 1998) and VGG-19 (Simonyan and Zisserman, 2014) have a connection density of one owing to one connection per neuron. Residual networks such as ResNet (He et al., 2016) contain residual skips and therefore have more connections than neurons, resulting in a connection density higher than one. Dense structures like DenseNet (Huang et al., 2017) have multiple connections from each neuron, resulting in an even higher connection density.
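As a concrete illustration, the sketch below computes this connection-density metric for a small layer graph. The layer sizes, the edge lists, and the per-neuron counting convention (incoming layer-level connections per neuron, excluding the input layer) are our own simplified assumptions, not taken from the paper.

```python
# Illustrative sketch (our assumptions): connection density as the average
# number of incoming layer-level connections per neuron. Layer sizes and
# edges below are made up for demonstration.
def connection_density(neurons_per_layer, edges):
    """edges: (src, dst) pairs between layer indices, src < dst."""
    in_degree = [0] * len(neurons_per_layer)
    for _, dst in edges:
        in_degree[dst] += 1
    # The input layer receives no connections, so exclude it from the average.
    total_neurons = sum(neurons_per_layer[1:])
    total_connections = sum(n * d for n, d in zip(neurons_per_layer, in_degree))
    return total_connections / total_neurons

linear = connection_density([64, 128, 256], [(0, 1), (1, 2)])            # 1.0
residual = connection_density([64, 128, 256], [(0, 1), (1, 2), (0, 2)])  # ~1.67
print(linear, residual)  # density exceeds 1 once skip connections appear
```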

We observe two main trends by analyzing the connection density for different DNNs in Figure 1. First, increasing connection density provides higher accuracy, which is essential for cloud-based computing platforms. Second, lower connection density is observed for compact models, which is necessary for edge computing hardware. Both hardware platforms require the processing of large amounts of data with corresponding power and performance constraints. Hence, there is a need to design optimal hardware architectures with low power and high performance for DNNs with different connection densities.

Figure 1. Connection density of different DNNs for three different datasets. Each output feature map (convolution layer) and neural unit (FC layer) represent a neuron. Larger markers represent higher accuracy.
Figure 2. Different types of DNN structures and their representative connection density.

With limited on-chip memory, conventional DNN architectures inevitably involve a significant amount of communication with off-chip memory, resulting in increased energy consumption (Chen et al., 2019). Indeed, it has been reported that the energy consumption of off-chip communication is 1,000× higher than the energy required to perform the computations (Horowitz, 2014). Dense structures like DenseNet require a massive number of off-chip memory accesses to process a single frame of an image (Huang et al., 2017). As a result, off-chip memory access becomes the energy bottleneck for hardware architectures of dense structures. Employing dense embedded non-volatile memory (NVM) such as ReRAM for in-memory computing (IMC) substantially reduces off-chip memory accesses (Shafiee et al., 2016; Song et al., 2017).

Figure 3. Contribution of routing latency to total latency for different DNNs on a P2P-based IMC architecture (Chen et al., 2018). As connection density increases, routing latency becomes the performance bottleneck.

On-chip interconnect is an integral part of hardware architectures that incorporate in-memory acceleration. Both point-to-point (P2P) interconnects (Kwon et al., 2018; Venkataramani et al., 2017) and NoC-based interconnects (Shafiee et al., 2016; Chen et al., 2019; Krishnan et al., 2020) are used for on-chip communication in state-of-the-art DNN accelerators. Shafiee et al. (Shafiee et al., 2016) utilize a concentrated mesh for the interconnect, while Chen et al. (Chen et al., 2019) employ three different NoCs for on-chip data movement in the architecture. In contrast, Krishnan et al. (Krishnan et al., 2020) utilize a custom mesh-NoC for on-chip communication. The custom NoC derives its structure from the on-chip traffic between different IMC processing elements (PEs), where each PE denotes an SRAM- or ReRAM-based IMC crossbar. A technique to construct a custom NoC that provides minimum communication latency for a given DNN is proposed in (Mandal et al., 2020b). Since a custom NoC requires hardware alterations for different DNNs, our studies focus on regular NoC topologies. A more detailed survey of works that design efficient interconnects for DNN accelerators can be found in (Nabavinejad et al., 2020).

To better understand the need for an NoC-based on-chip interconnect, we analyze the scalability of the P2P interconnect in IMC architectures by evaluating the contribution of routing latency to end-to-end latency for different DNNs, as shown in Figure 3. The contribution of routing latency increases up to 94% with increasing connection density. The high routing latency is attributed to the increased connection density, which translates into more on-chip data movement. VGG-19 shows a reduced contribution compared to lower connection density DNNs due to the high utilization of the IMC PEs or crossbars, resulting in reduced on-chip data movement. Hence, P2P networks do not provide a scalable solution for high connection density DNNs. At the same time, NoC-based interconnects require higher area and energy for operation and can result in a significant overhead for low connection density DNNs. Furthermore, different NoC topologies (mesh or tree) are appropriate for DNNs with different connection densities. Therefore, a connection density-aware interconnect solution is critical to DNN acceleration.

In this work, we first perform an in-depth performance analysis of P2P interconnect-based in-memory computing (IMC) architectures (Song et al., 2017). Through this analysis, we establish that P2P-based interconnects are incapable of handling data communication for dense DNNs and that an NoC-based interconnect is needed for IMC architectures. Next, we evaluate P2P-based and NoC-based SRAM and ReRAM IMC architectures for a range of DNNs. Further, we evaluate NoC-tree, NoC-mesh, and c-mesh topologies for the IMC architectures. A c-mesh NoC is used in (Shafiee et al., 2016) at the tile level to connect different tiles. C-mesh uses more links and routers, providing better performance in terms of communication latency. However, its interconnect area and energy become exorbitantly high. Therefore, the energy-delay-area product (EDAP) of c-mesh is higher than that of NoC-mesh. Hence, we restrict the detailed evaluations to NoC-mesh and NoC-tree. In these evaluations, we perform cycle-accurate NoC simulations through BookSim (Jiang et al., 2013). However, cycle-accurate NoC simulations are very time-consuming and consequently slow down the overall performance analysis of IMC architectures. Our experiments with different DNNs (the simulation framework is described in more detail in Section 3) show that cycle-accurate NoC simulation takes up to 80% of the total simulation time for high connection density DNNs.

Symbol | Definition
$K$ | Number of layers
$(I_x^l, I_y^l)$ | Input image size of layer $l$
$T_l$ | Number of tiles in layer $l$
$C_l$ | Number of input channels of layer $l$
$A_l$ | Number of input activations in layer $l$
$\lambda_{i,j}^{l}$ | Injection rate from tile $i$ of layer $l$ to tile $j$ of layer $l+1$
$L_{comm}$ | Total communication latency
Table 1. Summary of notations

To accelerate the overall performance analysis of the IMC architecture, we propose analytical models to estimate the NoC performance for a given DNN. Specifically, we incorporate the analytical router modeling technique presented in (Ogras et al., 2010) to obtain the performance model for an NoC router. Then we extend the existing analytical model to estimate the end-to-end communication latency of NoC-tree and NoC-mesh for any given DNN as a function of the number of neurons and the connection density. Through the analytical latency model, the variable communication patterns of different DNNs are incorporated using the connection density and the number of neurons. Leveraging this analysis and the analytical model, we establish the importance of the optimal choice of interconnect at different hierarchies of the IMC architecture. Utilizing the same analysis, we provide guidance on the optimal choice of interconnect for IMC architectures. At the tile level, an NoC-mesh for high connection density DNNs and an NoC-tree for low connection density DNNs provide low power and high performance for IMC-based architectures. Leveraging this observation, we propose an NoC-based heterogeneous interconnect IMC architecture for DNN acceleration. We demonstrate that the NoC-based heterogeneous interconnect IMC architecture (ReRAM) achieves up to 6× improvement in the energy-delay-area product (EDAP) for inference of VGG-19 when compared to state-of-the-art implementations. The following are the key contributions of this work:

  • An in-depth analysis of the shortcomings of P2P-based interconnect and the need for NoC in IMC architectures.

  • Analytical and empirical analysis to guide the choice of optimal NoC topology for an NoC-based heterogeneous interconnect.

  • The proposed heterogeneous interconnect IMC architecture achieves 6× improvement in EDAP with respect to state-of-the-art ReRAM-based IMC accelerators.

The rest of the paper is organized as follows. Section 2 introduces the background and motivation, Section 3 discusses the simulation framework used in this work, and Section 4 presents the analytical performance modeling-based technique to obtain the optimal choice of NoC for any given DNN. Section 5 presents the in-memory architecture with heterogeneous interconnect. Section 6 discusses the experimental results, and Section 7 concludes the paper.

2. Motivation and Related Work

2.1. Deep Neural Networks

We categorize DNNs into three main classes: linear (Simonyan and Zisserman, 2014), residual (He et al., 2016), and dense (Xie et al., 2019), as shown in Figure 2. DNN structures include convolution layers stacked on top of each other for feature extraction and a set of classifier layers at the end to classify based on the extracted features. A data point in layer $l+1$ can be expressed using the notation summarized in Table 1 as follows:

$$O_{l+1}(x, y, k) = \sum_{c=1}^{C_l} \sum_{i=1}^{k_r} \sum_{j=1}^{k_c} W_l(i, j, c, k) \cdot O_l(x+i-1,\, y+j-1,\, c) \qquad (1)$$

where $W_l$ is the kernel, $k_r$ is the number of rows in the kernel, and $k_c$ is the number of columns in the kernel of convolution layer $l$. To implement (1) on hardware, $k_r \cdot k_c \cdot C_l$ multiplications and $k_r \cdot k_c \cdot C_l - 1$ additions need to be performed for each output data point. In addition to convolution and FC layers, pooling and non-linear activation layers such as the rectified linear unit (ReLU) are present in DNN algorithms.
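To make these operation counts concrete, the following back-of-the-envelope sketch evaluates them for a hypothetical 3×3 convolution layer; all dimensions below are illustrative, not taken from the paper.

```python
# Hypothetical layer dimensions, chosen only to illustrate the counts above.
k_r, k_c, C_l = 3, 3, 64           # kernel rows, kernel columns, input channels
mults_per_point = k_r * k_c * C_l        # 576 multiplications per output point
adds_per_point = k_r * k_c * C_l - 1     # 575 additions per output point

out_x, out_y, C_out = 32, 32, 128        # output feature-map size and channels
total_mults = mults_per_point * out_x * out_y * C_out
print(f"{total_mults / 1e6:.1f}M multiplications")  # ~75.5M for this one layer
```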

Figure 4. Multi-tiled IMC architecture with routing architectures based on (a) P2P network, (b) NoC-tree, (c) NoC-mesh. NoC-tree is a P2P network with routers at junctions.

2.2. In-Memory Computing with Crossbars

DNNs with a large number of weights require a considerable amount of computation. Conventional architectures separate data access from the memory and computation in the processing unit. This results in increased data movement, reducing both the throughput and energy efficiency of DNN inference. In contrast, in-memory computing (IMC) seamlessly integrates computation and memory access in a single unit such as the crossbar (Shafiee et al., 2016; Song et al., 2017; Krishnan et al., 2020). Through this, IMC achieves higher energy efficiency and throughput than conventional von Neumann architectures.

The IMC technique localizes computation and data memory in a more compact design and enhances parallelism with multiple-row access, resulting in improved performance (Khwa et al., 2018; Jiang et al., 2020). The data accumulation is achieved through either current or charge accumulation. The size of the IMC subarray usually varies from 64×64 to 512×512. Along with the computing unit, peripheral circuits such as the sample-and-hold circuit, analog-to-digital converter (ADC), and shift-and-add circuits are used to obtain each DNN layer's result. In this work, we focus on IMC designs based on both SRAM (Khwa et al., 2018; Jiang et al., 2020; Yin et al., 2019) and ReRAM (Mao et al., 2019; Song et al., 2017; Qiao et al., 2018; Krishnan et al., 2020) crossbars.

2.3. Interconnect Network

Figure 5. Comparison of average latency among P2P, NoC-tree, and NoC-mesh interconnects for different injection bandwidths (Jiang et al., 2013). NoC topologies show better scalability than the P2P interconnect.

As discussed in Section 1, the on-chip interconnect is critical to accelerator performance for DNN acceleration. There are multiple topologies for a Network-on-Chip (NoC); the well-known ones are mesh, tree, torus, hypercube, and concentrated mesh (c-mesh). An NoC with torus topology shows better performance than mesh due to the long links between the nodes located at the edges. However, the power consumption of torus is significantly higher than that of mesh, as shown in (Mirza-Aghatabar et al., 2007). Hypercube and c-mesh share a similar disadvantage with torus. Therefore, only NoC-tree and NoC-mesh are considered in this work. They are also the industrial standard for SoCs used in heavy workloads (Jeffers et al., 2016).

Figure 4 illustrates representative interconnect schemes of P2P, NoC-tree, and NoC-mesh for multi-tiled IMC architectures. Each tile consists of several crossbar sub-arrays which perform the IMC operation. Existing implementations of DNN accelerators use both P2P-based (Venkataramani et al., 2017; Kwon et al., 2018) and NoC-based (Shafiee et al., 2016; Krishnan et al., 2020; Zhu et al., 2020) interconnects for on-chip communication. To better understand the performance of different interconnect architectures, we plot the average interconnect latency for a P2P network with 64 nodes, an NoC-tree with 64 nodes, and an 8×8 NoC-mesh with X–Y routing, as shown in Figure 5. The NoC utilizes one virtual channel, a buffer size (all input and output buffers) of eight, and three router pipeline stages. We observe that for lower injection rates, the performance is comparable for all topologies, while for higher injection rates, the NoCs perform better in terms of latency. Hence, NoC provides better scalability and performance than P2P interconnects. Moreover, with increasing connection density, the injection bandwidth between layers increases due to increased on-chip data movement. Therefore, P2P interconnect performs poorly for DNNs with high connection density. Hence, there is a need for systematic guidance in choosing the optimal interconnect for in-memory acceleration of DNNs. Other works such as (Chen et al., 2019) utilize three separate NoCs for weights, activations, and partial sums. Such a design choice results in increased area and energy cost for the interconnect fabric. Furthermore, the three NoCs are under-utilized, resulting in a sub-optimal design choice for the acceleration of DNNs.

2.4. Analytical Modeling of NoCs

To date, multiple NoC performance analysis techniques have been proposed for SoCs (Ogras et al., 2010; Mandal et al., 2019, 2020a; Kiasari et al., 2012; Qian et al., 2015).

The analytical performance model for an NoC router in (Ogras et al., 2010) assumes that the probability distribution of the input traffic is in the continuous time domain. However, all transactions in an IMC architecture happen in discrete clock cycles. An analytical performance modeling technique for NoCs in the discrete-time domain is proposed in (Mandal et al., 2019). In this work, we estimate the end-to-end communication latency for different DNNs as a function of the connection density and the number of neurons of the DNN. Specifically, we utilize the analytical model for the NoC router presented in (Ogras et al., 2010) with the modifications for discrete-time input (Mandal et al., 2019) and extend the model to obtain end-to-end communication latency for NoC-tree and NoC-mesh.

3. Simulation Framework

There exist multiple simulators that evaluate the performance of DNNs on different hardware platforms (Dong et al., 2012; Chen et al., 2018). These simulators consider different technologies, platforms, and peripheral circuit models while giving less consideration to the interconnect. With the advent of dense DNN structures (Xie et al., 2019), the interconnect cost has become more important, as discussed in Section 1. In this work, we develop an in-house simulator, where a circuit-level performance estimator of the computing fabric is combined with a cycle-accurate simulator for the interconnect. The simulator also aims at being versatile by supporting multiple DNN algorithms across different datasets and various interconnect schemes.

Figure 6 shows a block-level representation of the simulator. The inputs of the simulator primarily include the DNN structure, technology node, and frequency of operation. In the proposed simulation framework, any circuit-level performance estimator (Dong et al., 2012; Chen et al., 2018) and any interconnect simulator (Jiang et al., 2013; Agarwal et al., 2009) can be plugged in to extract performance metrics such as area, energy, and latency, providing a common platform for system-level evaluation. In this work, we use customized versions of NeuroSim (Chen et al., 2018) for circuit simulation and BookSim (Jiang et al., 2013) for cycle-accurate NoC simulation.

Figure 6. Block-level representation of the proposed architecture simulator.

3.1. Circuit-level Simulator: Customized NeuroSim

The inputs to NeuroSim include the DNN structure (layer sizes and layer count) along with the technology node, the number of bits per in-memory compute cell, the frequency of operation, the read-out mode, etc. The simulator maps the entire DNN to a multi-tiled crossbar architecture by estimating the number of crossbar arrays and the number of tiles per layer. Based on the size of the crossbar ($P \times P$) and the layer dimensions, the number of crossbar arrays is determined by (2):

$$N_{xbar}^{l} = \left\lceil \frac{k_r \cdot k_c \cdot C_l}{P} \right\rceil \times \left\lceil \frac{C_{l+1} \cdot N_w}{P} \right\rceil \qquad (2)$$

where $N_w$ is the precision of the weights and $C_{l+1}$ is the number of kernels (output channels) of layer $l$. The total number of tiles is calculated as the ratio of the total number of crossbar arrays to the number of crossbar arrays per tile. Furthermore, the peripheral circuits are laid out, and the complete tile architecture is determined. The peripheral circuits include an ADC, a sample-and-hold circuit, a shift-and-add circuit, and a multiplexer circuit. However, NeuroSim lacks an accurate estimation of the interconnect cost in latency, energy, and area. Therefore, we replace the interconnect part of NeuroSim with the customized BookSim. We also extract the performance metrics for the tile-to-tile interconnect in NeuroSim and replace it with the BookSim tile-to-tile interconnect. With this customization, our circuit simulator only reports performance metrics, such as area, energy, and latency, of the computing logic. It provides the number of tiles per layer, the activations, and the number of layers to the interconnect simulator.
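A minimal sketch of this mapping step, assuming the Equation (2) form reconstructed above at 1 bit per cell; the helper names and the 16-crossbars-per-tile figure (four CEs of four PEs, matching Section 5.2) are our own illustration.

```python
import math

def crossbars_per_layer(k_r, k_c, C_in, C_out, N_w, P=256):
    """Eq. (2) sketch: number of P x P crossbars for one convolution layer."""
    rows = math.ceil(k_r * k_c * C_in / P)   # crossbar rows hold the layer fan-in
    cols = math.ceil(C_out * N_w / P)        # columns hold channels x weight bits
    return rows * cols

def tiles_per_layer(num_crossbars, crossbars_per_tile=16):
    # Assumed tile capacity: 4 CEs x 4 PEs = 16 crossbars per tile.
    return math.ceil(num_crossbars / crossbars_per_tile)

n = crossbars_per_layer(k_r=3, k_c=3, C_in=256, C_out=256, N_w=8)
print(n, tiles_per_layer(n))  # 72 crossbars -> 5 tiles for this example layer
```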

3.2. Interconnect Simulator: Customized BookSim

1 Input: Number of layers ($K$), number of tiles in each layer ($T_l$), FPS ($F$), number of activations in each layer ($A_l$), interconnect topology
2 Output: End-to-end interconnect latency ($L_{comm}$)
3 for each layer $l = 1, \ldots, K-1$ do
4       for each tile $i$ in layer $l$ do
5             for each tile $j$ in layer $l+1$ do
6                   if tile $i$ sends data to tile $j$ then
7                         Compute $\lambda_{i,j}^{l}$ following Equation (3).
8                   end if
9             end for
10      end for
11      Simulate with the interconnect topology and $\lambda_{i,j}^{l}$
12      Obtain $t_{l,l+1}$ from the simulator.
13      Calculate $L_{l,l+1}$ following Equation (4).
14 end for
15 Calculate $L_{comm}$ following Equation (5).
Algorithm 1 Evaluation of interconnect latency through simulation
Figure 7. Tile numbering and placement while mapping the DNN to the IMC architecture. The red arrows show the flow of the data across the tiles.

DNNs have varying structures, resulting in different traffic loads and data patterns between the IMC PEs. To accurately capture the NoC traffic of a given DNN configuration, we customize BookSim to evaluate the area, energy, and latency of the interconnect, as shown in Figure 6. In the customized version of BookSim, we enable simulation with a non-uniform injection rate. We compute the injection rates for each source-destination pair in the multi-tiled architecture. The placement of tiles and routers in the IMC architecture has a direct impact on the interconnect performance. In this work, we incorporate the impact of mapping into the injection matrix calculation. The mapping of the DNN is performed such that each layer is assigned an integer number of tiles, and no tile is shared between two layers. Figure 7 shows a sixteen-tile IMC architecture with the tiles numbered. The red arrows show the data flow in the IMC architecture. Next, while evaluating the interconnect latency, we create an injection matrix that incorporates the position of the tiles by calculating the number of hops for each source-destination pair. Hence, the injection matrix incorporates the tile placement into the NoC latency calculation. Overall, the proposed approach can be generalized to any tile placement. Algorithm 1 describes the steps performed to compute the injection rates and obtain the interconnect latency. Without loss of generality, we assume that the number of nodes in the interconnect is equal to the total number of tiles across all layers.

The injection rate calculation is shown in lines 4–10 of Algorithm 1. The injection rate from each source tile to each destination tile in each layer is expressed in (3):

$$\lambda_{i,j}^{l} = \frac{A_l \cdot b \cdot F}{f \cdot T_l \cdot T_{l+1} \cdot w} \qquad (3)$$

where $b$, $w$, and $F$ represent the data precision, the bus width, and the frames-per-second (FPS) throughput, respectively. In the numerator of (3), we multiply the number of input activations ($A_l$) of layer $l$ by $b$ to obtain the total number of bits to be transferred from layer $l$ to layer $l+1$ for one frame of an image. We further multiply this term by the FPS ($F$) to obtain the total number of bits transferred between the layers per second. Then, we divide this term by the operating frequency ($f$) to obtain the total number of bits transferred between the layers per cycle. We assume an equal injection rate between all tiles in two consecutive layers. Therefore, to get the number of bits transferred from one tile to another in two consecutive layers, the denominator in (3) includes a multiplication between $T_l$ and $T_{l+1}$. Finally, we divide the expression obtained so far by the bus width $w$ to obtain the injection rate ($\lambda_{i,j}^{l}$). The injection rate from every source to every destination is the input to the interconnect simulator. The interconnect simulator then provides the average latency to complete all transactions from layer $l$ to layer $l+1$ ($t_{l,l+1}$ cycles). Next, we multiply this latency with the number of bits transferred from one tile to the next to get the total number of cycles required to transfer all data between two consecutive layers. The latency from one layer to the next ($L_{l,l+1}$) is therefore given by:

$$L_{l,l+1} = t_{l,l+1} \cdot \frac{A_l \cdot b}{T_l \cdot T_{l+1} \cdot w} \qquad (4)$$

Finally, we accumulate the latency of all layers to compute the end-to-end interconnect latency as

$$L_{comm} = \sum_{l=1}^{K-1} L_{l,l+1} \qquad (5)$$
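The sketch below condenses Algorithm 1 and the reconstructed Equations (3)–(5) into runnable form; `simulate_noc` is a hypothetical stand-in for the customized BookSim call, and the parameter defaults (8-bit precision, 32-bit bus, 1 GHz) mirror Table 2 while the FPS value is illustrative.

```python
def injection_rate(A_l, T_l, T_next, b=8, w=32, F=1000, f=1e9):
    """Eq. (3): injection rate per source-destination tile pair (flits/cycle)."""
    return (A_l * b * F) / (f * T_l * T_next * w)

def end_to_end_latency(A, T, simulate_noc, b=8, w=32, F=1000, f=1e9):
    """Algorithm 1 sketch: A[l] activations, T[l] tiles of layer l."""
    L_comm = 0.0
    for l in range(len(T) - 1):
        lam = injection_rate(A[l], T[l], T[l + 1], b, w, F, f)
        t_avg = simulate_noc(T[l], T[l + 1], lam)   # avg cycles per transaction
        bits_per_pair = A[l] * b / (T[l] * T[l + 1] * w)
        L_comm += t_avg * bits_per_pair             # Eq. (4), summed per Eq. (5)
    return L_comm

# Toy usage with a constant-latency stand-in for the NoC simulator:
print(end_to_end_latency([4096, 2048], [4, 2], lambda s, d, lam: 20.0))
```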

4. Analytical Performance Models for NoCs in IMC Architecture

In this section, we discuss an analytical approach to estimate NoC performance for the IMC architecture. The analytical performance model of NoCs is primarily useful to overcome the long simulation times incurred by cycle-accurate NoC simulators. Specifically, we utilize analytical performance models for NoCs to compare the performance of NoC-tree and NoC-mesh for a given DNN. The analytical model of an NoC router is adopted from the work proposed in (Ogras et al., 2010). We extend this router model for NoC-tree and NoC-mesh to obtain end-to-end communication latency for different DNNs. Algorithm 2 describes the technique to evaluate the communication latency through analytical models. There are two major steps involved in analyzing the performance of an NoC: 1) computing the injection rate matrix and 2) computing the contention probability matrix.

Computing the injection rate matrix ($\Lambda^r$): First, the injection rate from each source to each destination ($\lambda_{i,j}^{l}$) for each layer of the DNN is computed through (3). We note that the injection rate calculation incorporates the tile placement as detailed in Section 3.2. Each NoC router has five ports: North, South, East, West, and Self. The injection rate at each port of every router ($\lambda_p^r$) is computed as:

$$\lambda_p^r = \frac{A_p^r \cdot b \cdot F}{f \cdot w} \qquad (6)$$

where $A_p^r$ is the number of activations through port $p$ of router $r$, which is a function of $A_l$ and the number of tiles in the layer ($T_l$). From $\lambda_p^r$, the injection rate matrix for router $r$ ($\Lambda^r$) is computed (as shown in lines 5–7 of Algorithm 2), where element $\Lambda^r_{pq}$ denotes the injection rate from port $p$ to port $q$.

1 Input: Input activations, number of routers in each layer ($R_l$), number of layers ($K$)
2 Output: End-to-end communication latency ($L_{comm}$)
3 for $l = 1 : K-1$ do
4       for $r = 1 : R_l$ do
             /* Computing injection rate matrix */
5             Compute $A_p^r$
6             Compute $\lambda_p^r$ using (6)
7             Construct $\Lambda^r$
             /* Computing contention matrix */
8             Compute the forwarding probability matrix ($F^r$)
9             Compute the contention matrix ($C^r$)
             /* Computing average waiting time */
10            Compute the average queue length ($Q_p^r$) using (8)
11            Compute the average waiting time ($W^r$) using (9)
12      end for
13      Compute the average latency for the layer ($L_l$) using (10)
14 end for
Algorithm 2 End-to-end latency computation through analytical models

Computing the contention matrix ($C^r$): Each element of the contention matrix ($c_{ij}^r$) denotes the contention between port $i$ and port $j$. To compute the contention matrix of router $r$ ($C^r$), we first compute the forwarding probability matrix $F^r$. Its element $f_{ij}^r$ denotes the probability that a packet which arrived at port $i$ of the router is forwarded to port $j$, and is computed as shown in (7) (Ogras et al., 2010):

$$f_{ij}^r = \frac{\Lambda^r_{ij}}{\sum_{q=1}^{5} \Lambda^r_{iq}} \qquad (7)$$

The contention probability between port $i$ and port $j$ of the router is computed as $c_{ij}^r = \sum_{k=1}^{5} f_{ik}^r \, f_{jk}^r$. Lines 8–9 of Algorithm 2 show the computation of the contention matrix.

Next, the average queue length at each port of the router ($Q_p^r$) is computed through the technique described in (Ogras et al., 2010):

$$Q_p^r = \frac{\lambda_p^r \cdot R}{1 - \lambda_p^r \cdot T} \qquad (8)$$

where $T$ is the service time of the router, and we assume $T = 1$ cycle for our evaluation. $R$ is the average residual time and is calculated assuming that the packets arrive in discrete clock cycles (Mandal et al., 2019). The waiting time of the packets at each port of the router is computed as $W_p^r = Q_p^r / \lambda_p^r$. The end-to-end average latency for each layer ($L_l$) is obtained by averaging the waiting time through all 5 ports of router $r$ ($W^r$) and then adding across all routers, as shown in (9) and (10):

$$W^r = \frac{1}{5} \sum_{p=1}^{5} W_p^r \qquad (9)$$

$$L_l = \sum_{r=1}^{R_l} W^r \qquad (10)$$

Finally, the total communication latency ($L_{comm}$) is obtained by adding the end-to-end average latency of each layer:

$$L_{comm} = \sum_{l=1}^{K-1} L_l \qquad (11)$$
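A condensed sketch of the per-router computation in Algorithm 2, under the reconstructed Equations (7)–(9). In the full model of (Ogras et al., 2010), the contention matrix feeds the residual-time calculation; here $R$ is passed in directly for brevity, and the 5×5 injection matrix is illustrative.

```python
import numpy as np

def contention_matrix(Lam):
    """Eq. (7): forwarding probabilities, then c_ij = sum_k f_ik * f_jk."""
    lam_in = Lam.sum(axis=1)                 # arrival rate at each input port
    F = np.divide(Lam, lam_in[:, None], out=np.zeros_like(Lam),
                  where=lam_in[:, None] > 0)
    return F @ F.T

def avg_waiting_time(Lam, R=0.5, T=1.0):
    """Eqs. (8)-(9): per-port queue length, waits via Little's law, port mean."""
    lam_in = Lam.sum(axis=1)
    Q = lam_in * R / np.maximum(1.0 - lam_in * T, 1e-9)   # Eq. (8)
    W = np.divide(Q, lam_in, out=np.zeros_like(Q), where=lam_in > 0)
    return W.mean()                                        # Eq. (9)

Lam = np.full((5, 5), 0.01)   # illustrative uniform 5-port injection matrix
print(contention_matrix(Lam).round(2), avg_waiting_time(Lam))
```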

5. Connection-centric Architecture

Figure 8. Throughput comparison of three interconnect topologies (P2P, NoC-tree, and NoC-mesh) for the SRAM-based IMC architecture, normalized to P2P, for different DNNs. NoC shows superior performance and scalability compared to the P2P-based network.

In this section, we first discuss a multi-tiled SRAM-based IMC architecture with three different interconnect topologies, namely, P2P, NoC-tree, and NoC-mesh at the tile level. We perform a comprehensive analysis of these three interconnect-based SRAM IMC architectures for different DNNs using the simulation framework described in Section 3. Based on the analysis, we show the need for an NoC-based heterogeneous interconnect IMC architecture for efficient DNN acceleration. We assume all weights are stored on-chip to avoid any DRAM access. The weights are loaded pre-execution and stored on-chip. The inputs are then loaded, and the computation is performed. There is no re-loading of intermediate results or weights from the off-chip memory during the execution of the DNN. The SRAM buffer is designed large enough to hold the intermediate results on-chip rather than moving them off-chip. Multiple inferences can be performed with a single pre-execution loading of the weights. Hence, we do not include the initial loading of the weights in the energy calculation, consistent with the prior works (Shafiee et al., 2016; Song et al., 2017) compared in this manuscript. In addition, we adhere to a layer-by-layer design instead of a layer-pipelined design, since a pipelined design introduces pipeline bubbles in the execution flow and complicates the control logic (Qiao et al., 2018).

5.1. Design Space Exploration

We evaluate different performance metrics for a wide range of DNNs with P2P, NoC-tree, and NoC-mesh-based interconnect for SRAM-based IMC architectures. We consider routers with five ports, one virtual channel for NoCs and X–Y routing for NoC-mesh for this evaluation. To facilitate fair comparison, we normalize the throughput of the hardware architectures with three interconnect topologies to that of P2P interconnect.

Figure 8 shows the throughput comparison for different DNNs. For low connection density DNNs such as MLP and LeNet-5 (LeCun et al., 1998), the choice of interconnect does not make a significant difference to the throughput, due to the low data movement between different tiles of the IMC architecture. However, the P2P interconnect incurs 1.25× and 2× higher area cost than NoC-tree for MLP and LeNet-5, respectively. Hence, NoC-tree provides better overall performance than P2P for both MLP and LeNet-5. We further analyze dense DNNs such as NiN (Lin et al., 2013), VGG-19, ResNet-50 (He et al., 2016), and DenseNet-100 (Huang et al., 2017). The performance comparison shows that the NoC-tree- and NoC-mesh-based IMC architectures perform better than the P2P-based architectures (up to 15× for DenseNet-100). Since the higher connection density of these DNNs results in increased on-chip data movement, the routing latency dominates the end-to-end latency. We see a similar trend with ReRAM-based IMC architectures: similar throughput for MLP and a 15× improvement in throughput for DenseNet-100. Through this, we establish that the performance of the P2P-based IMC architecture (SRAM- or ReRAM-based) diminishes with increasing connection density. In contrast, the performance of the NoC-based (tree, mesh) IMC architecture scales better (Figure 8).

Figure 9. Comparison of energy-delay-area product (EDAP) of NoC-tree, NoC-mesh, and c-mesh for different DNNs.

Exploration of other NoC topologies: Apart from tree and mesh, other commonly known NoC topologies include c-mesh, hypercube, and torus. These topologies utilize more resources in terms of routers and links to reduce communication latency. However, the usage of more resources increases the power consumption and area of the NoC. For example, we performed experiments with the c-mesh topology for different DNNs. Figure 9 compares the energy-delay-area product (EDAP) of mesh-, tree-, and c-mesh-based NoCs for different DNNs. We observe that while mesh- and tree-NoC provide comparable EDAP, the EDAP of c-mesh is at least five orders of magnitude higher than that of mesh- and tree-NoC.

5.2. Hardware Architecture

Figure 10. NoC-based heterogeneous interconnect IMC architecture. A three-level interconnect scheme consisting of NoC (tree or mesh) between tiles, P2P network between CEs, and bus between PEs.

Based on the conclusions from Section 5.1, we derive an NoC-based heterogeneous interconnect IMC architecture for DNN acceleration. Figure 10 shows the hardware architecture which employs the heterogeneous interconnect system.

The proposed architecture is divided into a number of tiles, with each tile having a set of computing elements (CEs). The tile architecture includes non-linear activation units, an I/O buffer, and accumulators to manage data transfer efficiently. Each CE further consists of multiple processing elements (PEs) or crossbar arrays, multiplexers, buffers, a sense amplifier, and flash ADCs. The ADC precision is set to four bits such that there is minimal or no accuracy degradation for DNNs. In addition, the architecture does not utilize a digital-to-analog converter (DAC); instead, it uses sequential signaling to represent multi-bit inputs (Peng et al., 2019). The proposed heterogeneous tile architecture can be used for both SRAM and ReRAM (1T1R) technologies. However, the peripheral circuits change based on the technology. In this work, we choose a homogeneous tile design consisting of four CEs, with each CE consisting of four PEs. We evaluate both SRAM- and ReRAM-based IMC architectures for PE sizes varying from 64×64 to 512×512. We sample 8 DNNs (LeNet, NiN, SqueezeNet, ResNet-152, ResNet-50, VGG-16, VGG-19, and DenseNet-100), and a crossbar size of 256×256 provides the lowest EDAP for 75% of the DNNs. Hence, in this work, we choose 256×256 as the crossbar size for both SRAM- and ReRAM-based IMC architectures. To maximize performance, the architecture uses heterogeneous interconnects: an NoC-based interconnect at the global tile level, a P2P interconnect (H-Tree) at the CE level, and a bus at the PE level, owing to the significantly lower data volume at the lower levels. For low data volumes, an NoC-based interconnect provides marginal performance gain while increasing energy consumption.
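For concreteness, the configuration sketch below captures the hierarchy just described; the dataclass and field names are our own shorthand, not an interface from the paper.

```python
from dataclasses import dataclass

@dataclass
class HeterogeneousIMCConfig:
    crossbar_size: int = 256   # 256x256 SRAM or ReRAM (1T1R) PE array
    pes_per_ce: int = 4        # PEs share a bus inside each CE
    ces_per_tile: int = 4      # CEs connected by a P2P H-Tree inside each tile
    tile_noc: str = "mesh"     # tile-level NoC: "mesh" (dense DNNs) or "tree"
    adc_bits: int = 4          # flash ADC resolution; no DAC, sequential inputs

edge_cfg = HeterogeneousIMCConfig(tile_noc="tree")  # compact, low-density DNNs
print(edge_cfg)
```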

6. Experiments and Results

6.1. Experimental Setup

We consider an IMC architecture (Figure 10) with a homogeneous tile structure (SRAM, ReRAM) and one NoC router per tile. Table 2 summarizes the design parameters considered. We report the end-to-end latency, chip area, and total energy obtained for a PE size of 256×256 for each of the DNNs using the simulation framework discussed in Section 3. We incorporate conventional mapping (Shafiee et al., 2016), the IMC SRAM bitcell/array design from (Khwa et al., 2018), and the 1T1R ReRAM bitcell/array properties from (Chen et al., 2018). The IMC compute fabric utilizes a parallel read-out method. We utilize the same crossbar array size of 256×256 for both SRAM- and ReRAM-based IMC architectures. All rows of the IMC crossbar are asserted together, analog MAC computation is performed along the bitline, and the analog voltage/current is digitized with a 4-bit flash ADC at the column periphery. We perform an extensive evaluation of the IMC architecture with both SRAM-based and ReRAM-based PE arrays for both NoC-tree and NoC-mesh. Unless specified otherwise, the NoC utilizes one virtual channel, a buffer size (all input and output buffers) of eight, and three router pipeline stages.

PE array size: 256×256 | Read-out method: Parallel
Technology node: 32 nm | Flash ADC resolution: 4 bits
Cell levels: 1 bit/cell | Operating frequency: 1 GHz
Data precision: 8 bits | NoC bus width: 32
Table 2. Summary of design parameters

6.2. Evaluation of NoC Analytical Model

Figure 11. Accuracy of NoC analytical model for NoC-mesh and NoC-tree with respect to cycle-accurate simulator (Jiang et al., 2013).

Figure 11 shows the accuracy of the analytical model (presented in Algorithm 2 in Section 4) in estimating the end-to-end communication latency with both NoC-tree and NoC-mesh. We observe that the accuracy is always more than 85% for different DNNs. On average, the NoC analytical model achieves 93% accuracy with respect to cycle-accurate NoC simulation (Jiang et al., 2013). Moreover, we achieve a 100–2,000× speed-up with the NoC analytical model with respect to cycle-accurate NoC simulation. Figure 12 shows the speed-up for different DNNs with the mesh-NoC. This speed-up is useful for performing design space exploration considering various PE array sizes and other NoC topologies. Due to the high speed-up in NoC performance analysis, we achieve an 8× speed-up in overall performance analysis with respect to the framework that uses cycle-accurate NoC simulation.

Figure 12. Speed-up (in NoC simulation) with NoC analytical models with respect to cycle-accurate NoC simulation for different DNNs with mesh-NoC.

6.3. Analysis on Traffic Congestion in NoC

In this section, we present an analysis of traffic congestion in the NoC for various DNNs. To this end, we discuss the average queue length of different buffers in the NoC and the worst-case communication latency.

Analysis of the average queue length: We investigated the average queue length at different ports of different routers in the NoC through the cycle-accurate NoC simulator. We performed this experiment with the mesh-NoC, considering the configuration parameters shown in Table 2. Figure 13 shows that 64%–100% of the queues contain no flit when a new flit arrives for different DNNs. The percentage of queues with zero occupancy for LeNet-5 and NiN is 91% and 65%, respectively. These two DNNs utilize fewer routers, which results in less parallelism in data communication. However, we note that determining the optimal number of routers for a given DNN is beyond the scope of this work.

Figure 14 shows the average queue length for NiN and VGG-19 for the queues with non-zero length when a new flit arrives. We observe that the average queue length varies from 0.004 to 0.5 for these DNNs. The average queue length is very low in these cases since the injection rates into the queues are low, and the NoC introduces a high degree of parallelism in data transmission between routers.

Figure 13. Percentage of queues with zero occupancy when a new flit arrives.
Figure 14. Average Occupancy of Queues with non-zero length for (a) NiN, (b) VGG-19.

Analysis of the worst-case latency: Furthermore, we extracted the worst-case latency ($t_s^{worst}$) for different source-to-destination pairs of different DNNs with the mesh-NoC. We compared the $t_s^{worst}$ of each source-to-destination pair with the corresponding average latency ($t_s^{avg}$). Then we compute the mean absolute percentage deviation (MAPD) of $t_s^{worst}$ from $t_s^{avg}$ as shown in (12):

$$\text{MAPD} = \frac{100\%}{S} \sum_{s=1}^{S} \frac{\left| t_s^{worst} - t_s^{avg} \right|}{t_s^{avg}} \qquad (12)$$

where $S$ is the total number of source-to-destination pairs with non-zero average latency, and $t_s^{worst}$ and $t_s^{avg}$ are the worst-case latency and the average latency, respectively, of source-to-destination pair $s$. Table 3 shows the mean absolute percentage deviation for different DNNs. We observe that the deviation is insignificant, except for LeNet-5 and NiN, for which the deviations are 9.13% and 20.76%, respectively.
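A one-function sketch of Equation (12), with illustrative latency values:

```python
# Eq. (12) sketch: MAPD of worst-case vs. average latency over the pairs
# with non-zero average latency. Latency values below are illustrative.
def mapd(worst, avg):
    pairs = [(w, a) for w, a in zip(worst, avg) if a > 0]
    return 100.0 * sum(abs(w - a) / a for w, a in pairs) / len(pairs)

print(f"{mapd([12, 10, 8], [11, 10, 8]):.2f}%")  # 3.03% deviation
```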

Furthermore, Figure 15 shows the absolute difference between the worst-case latency and the average latency for LeNet-5 and NiN for different source-to-destination pairs with non-zero latency. The maximum difference is 6 cycles for both LeNet-5 and NiN. This analysis shows that the worst-case latency deviates little from the average latency. Therefore, the studies of average queue length and worst-case latency confirm that there is no congestion in the NoC.

DNN: MLP | LeNet-5 | NiN | ResNet-50 | VGG-19 | DenseNet-100
MAPD (%): 0 | 9.13 | 20.76 | 0 | 0.14 | 0
Table 3. Mean absolute percentage deviation (MAPD) of the worst-case NoC latency from the average NoC latency for different DNNs.
Figure 15. Comparison between average latency and worst-case latency for source to destination pairs with non-zero latency for (a) LeNet-5 and (b) NiN.

6.4. Guidance on Optimal Choice of Interconnect

6.4.1. Empirical Analysis

We compare the performance of the IMC architecture using both NoC-tree and NoC-mesh for both SRAM- and ReRAM-based technologies. We perform the experiments for representative DNNs: MLP, LeNet-5, and NiN represent low connection density DNNs, while ResNet-50, VGG-19, and DenseNet-100 represent high connection density DNNs. We report the throughput and the product of energy consumption, end-to-end latency, and area (EDAP) of the IMC architectures. EDAP is used as the metric to guide the optimal choice of interconnect for IMC architectures.

Figure 16(a) shows the ratio of the throughput of the SRAM-based IMC architecture using NoC-tree and NoC-mesh interconnect. We normalize the throughput values with respect to that of NoC-tree. NoC-tree performs better than the NoC-mesh for DNNs with low connection density. This is because of the reduced injection bandwidth into the interconnect. In addition, while NoC-mesh provides lower interconnect latency than NoC-tree, it comes at an increased area and energy cost. However, NoC-mesh performs better for DNNs with high connection density. The improved performance stems from the reduced interconnect latency for high injection rates of data into the interconnect. The reduction in latency is much higher than the additional overhead due to both area and energy of NoC-mesh.

Figure 16. (a) Normalized throughput and (b) normalized EDAP of NoC-tree and NoC-mesh-based on-chip interconnect for SRAM-based IMC architecture for different DNNs. Dense DNNs favor NoC-mesh while NoC-tree is sufficient for shallow DNNs.
Figure 17. (a) Normalized throughput and (b) normalized EDAP of NoC-tree and NoC-mesh-based on-chip interconnect for ReRAM-based IMC architecture for different DNNs. Dense DNNs favor NoC-mesh while NoC-tree is sufficient for shallow DNNs.

To better understand the performance, we report the EDAP for the SRAM-based IMC architecture. Figure 16(b) shows the normalized EDAP of the NoC-tree and NoC-mesh for both low and high connection density DNNs. DNNs with low connection density have significantly lower EDAP with NoC-tree than with NoC-mesh. Such an improved EDAP for NoC-tree complements the observation for throughput. At the same time, for DNNs with high connection density, the EDAP of NoC-mesh is lower than that of NoC-tree for IMC architectures. A similar observation holds for ReRAM-based IMC architectures, as shown in Figure 17(a) and Figure 17(b). In contrast to the SRAM-based IMC architecture, NiN provides better throughput with the NoC-mesh interconnect. At the same time, NoC-tree provides better EDAP than NoC-mesh, similar to the SRAM-based IMC architecture.

Furthermore, we performed two additional sets of experiments with NoC-mesh and NoC-tree by varying the number of virtual channels and the bus width. In this case, we consider ReRAM-based IMC architectures. Figure 18 shows the comparison with different numbers of virtual channels, and Figure 19 shows the comparison with different bus widths of the NoC. We observe similar trends for different DNNs with different NoC configurations.

Since the injection rate to the input buffer of the NoC is always low (less than one packet in 100 cycles), increasing the number of virtual channels does not alter the inference latency significantly. Therefore, throughput remains similar (for all DNNs) both for NoC-tree and NoC-mesh with an increasing number of virtual channels. However, the area and power of both NoC-mesh and NoC-tree increase proportionally with an increasing number of virtual channels. Therefore, the normalized EDAP (EDAP of mesh-NoC divided by the EDAP of tree-NoC) is similar for all DNNs with different numbers of virtual channels.

When we change the bus width of the NoC, the latency increases (decreases) proportionally with decreasing (increasing) bus width, i.e., the latency with a bus width of 32 is twice the latency with a bus width of 64. Moreover, the area and power of the NoC increase (decrease) proportionally with increasing (decreasing) bus width. Therefore, the normalized EDAP is similar for all DNNs with different NoC bus widths. Consequently, for all configurations, we obtain exactly the same guidance on the choice of NoC for different DNNs. Therefore, the guidance is consistent across different NoC parameters.

Figure 18. Assessment of (a) throughput and (b) EDAP between NoC-tree and NoC-mesh with different numbers of virtual channels for different DNNs. The throughput is normalized to that of NoC-tree. The preferred NoC topology for optimal performance is shown for the regions above and below the red line.
Figure 19. Assessment of (a) throughput and (b) EDAP between NoC-tree and NoC-mesh with different bus width for different DNNs. The throughput is normalized to that of NoC-tree. The preferred NoC topology for optimal performance is shown for the regions above and below the red line.

6.4.2. Theoretical Analysis

We utilize the analytical model in Section 4 and the experimental results described in Figure 16 and Figure 17 to provide guidance on the optimal choice of interconnect for IMC architectures. The injection rate at each port of an NoC router for each layer of the DNN is expressed in (6). The numerator of (6) denotes the total data volume between layer $l$ and layer $l+1$ for each port of the router per cycle. This is divided by the bus width ($w$) to obtain the injection rate for each port of every router, as detailed in Section 4. For a fixed NoC-based IMC architecture, the target throughput ($F$), frequency of operation ($f$), and bus width ($w$) remain constant. Hence, from (6) we obtain:

$$\lambda_p^r \propto \frac{A_l}{T_l \cdot T_{l+1}} \qquad (13)$$

Let the connection density of layer $l$ be $d_l$ and the number of neurons be $N_l$. The data volume between layer $l$ and layer $l+1$ is proportional to the product of $d_l$ and $N_l$, as shown in (14):

$$A_l \propto d_l \cdot N_l \qquad (14)$$

Additionally, the number of tiles in layer $l$ ($T_l$) is directly proportional to $N_l$. Hence, from (13) and (14) we get:

$$\lambda_p^r \propto \frac{d_l \cdot N_l}{N_l \cdot N_{l+1}} = \frac{d_l}{N_{l+1}} \qquad (15)$$

Generalizing (15), we obtain:

$$\lambda \propto \frac{d}{N} \qquad (16)$$
Figure 20. Optimal NoC topology for IMC architectures for different DNNs.

Therefore, the injection rate is directly proportional to the connection density and inversely proportional to the number of neurons of the DNN. Figure 20 presents the preferred regions of NoC-tree and NoC-mesh for the best throughput for different DNNs on IMC architectures. If the connection density of a DNN is more than 2, then NoC-mesh is suitable. If the connection density is less than 1, then NoC-tree is appropriate. Both NoC-tree and NoC-mesh are suitable for DNNs with connection density in the range of 1–2 (the region where the red and blue ovals overlap in Figure 20).
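The guidance of Figure 20 reduces to a simple decision rule; the sketch below encodes it, with made-up density values for the example networks.

```python
# Decision rule from Figure 20 (density thresholds from the text; the
# example density values below are illustrative, not measured).
def choose_tile_noc(connection_density: float) -> str:
    if connection_density > 2:
        return "NoC-mesh"
    if connection_density < 1:
        return "NoC-tree"
    return "NoC-tree or NoC-mesh"   # overlap region in Figure 20

for name, d in [("compact edge DNN", 0.8), ("ResNet-like", 1.5),
                ("DenseNet-like", 4.0)]:
    print(name, "->", choose_tile_noc(d))
```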

6.5. Comparison with state-of-the-art architectures

Table 4 compares the proposed architecture with state-of-the-art DNN accelerators. Prior works show the efficacy of their ReRAM-based IMC architectures for the VGG-19 DNN (Qiao et al., 2018; Shafiee et al., 2016; Song et al., 2017). Hence, for comparison, we choose the VGG-19 network as the representative DNN. Moreover, we compare the dynamic power consumption of the DNN hardware, since prior works report dynamic power in their results, making the comparison consistent. The inference latency of the proposed architecture with SRAM arrays is 2.2× lower than that of the architecture with ReRAM arrays. The proposed ReRAM-based architecture achieves a 4.7× improvement in FPS and a 6× improvement in EDAP over AtomLayer (Qiao et al., 2018). The improvement in performance is attributed to the optimal choice of interconnect coupled with the absence of off-chip accesses. The proposed ReRAM-based architecture consumes 400× lower power per frame along with a 1.74× improvement in FPS compared to PipeLayer (Song et al., 2017). Moreover, there is a 5.4× improvement in inference latency compared to ISAAC (Shafiee et al., 2016), which is achieved by the heterogeneous interconnect structure.

Architecture | Latency (ms) | Power/frame (W/frame) | FPS | EDAP (J·ms·mm²)
Proposed-SRAM | 0.68 | 1.96 | 1458 | 0.46
Proposed-ReRAM | 1.49 | 0.43 | 670 | 0.28
AtomLayer (Qiao et al., 2018) | 6.92 | 4.8 | 145 | 1.58
PipeLayer (Song et al., 2017) | 2.6* | 168.6 | 385 | 94.17
ISAAC (Shafiee et al., 2016) | 8.0* | 65.8 | 125 | 359.64
Table 4. Inference performance results for VGG-19. *Reported in (Qiao et al., 2018)

6.6. Connection Density and Hardware Performance

Figure 1 showed the trend of DNNs moving toward high connection density structures for accuracy and low connection density structures for compact models. Figure 21 shows the performance of both P2P- and NoC-based interconnects at the tile level of the IMC architecture for DNNs with different connection densities. We observe a steep increase in total latency with a P2P interconnect. However, the IMC architecture with an NoC interconnect shows a stable curve as we move towards high connection density DNNs. With the advent of neural architecture search (NAS) techniques (Xie et al., 2019; Zoph et al., 2018), DNNs are moving towards highly branched structures with very high connection densities. Hence, the NoC-based heterogeneous interconnect architecture provides a scalable and suitable platform for IMC acceleration of DNNs.

Figure 21. Appropriate selection of NoC topology significantly improves performance for both SRAM- and ReRAM-based IMC architectures.

7. Conclusion

The trend of connection density in modern DNNs requires a re-evaluation of the underlying interconnect architecture. Through a comprehensive evaluation, we demonstrate that the P2P-based interconnect is incapable of handling the high volume of on-chip data movement of DNNs. Further, we provide guidance, backed by empirical and analytical results, to select the appropriate NoC topology as a function of the connection density and the number of neurons. We conclude that NoC-mesh is preferred for DNNs with high connection density, while NoC-tree is suitable for DNNs with low connection density. Finally, we show that the NoC-based heterogeneous interconnect IMC architecture achieves 6× lower EDAP than state-of-the-art ReRAM-based IMC accelerators.

8. Acknowledgement

This work was supported by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and SRC task 3012.001.

References

  • N. Agarwal, T. Krishna, L. Peh, and N. K. Jha (2009) GARNET: A Detailed On-Chip Network Model Inside a Full-System Simulator. In 2009 IEEE ISPASS, pp. 33–42. Cited by: §3.
  • P. Chen, X. Peng, and S. Yu (2018) NeuroSim: A Circuit-level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (12), pp. 3067–3080. Cited by: Figure 3, §3, §3, §6.1.
  • Y. Chen, T. Yang, J. Emer, and V. Sze (2019) Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE JETCAS 9 (2), pp. 292–308. Cited by: §1, §1, §2.3.
  • L. Deng, G. Hinton, and B. Kingsbury (2013) New Types of Deep Neural Network Learning for Speech Recognition and Related Applications: An Overview. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 8599–8603. Cited by: §1.
  • X. Dong, C. Xu, Y. Xie, and N. P. Jouppi (2012) Nvsim: A Circuit-level Performance, Energy, and Area Model for Emerging Nonvolatile Memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31 (7), pp. 994–1007. Cited by: §3, §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §2.1, §5.1.
  • M. Horowitz (2014) Computing’s Energy Problem (and What We Can Do About It). In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. Cited by: §1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §1, §1, §5.1.
  • J. Jeffers, J. Reinders, and A. Sodani (2016) Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann. Cited by: §2.3.
  • N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. J. Dally (2013) A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 86–96. Cited by: §1, Figure 5, §3, Figure 11, §6.2.
  • Z. Jiang, S. Yin, J. Seo, and M. Seok (2020) C3SRAM: An In-Memory-Computing SRAM Macro Based on Robust Capacitive Coupling Computing Mechanism. IEEE Journal of Solid-State Circuits (), pp. 1–1. Cited by: §2.2.
  • W. Khwa, J. Chen, J. Li, X. Si, E. Yang, X. Sun, R. Liu, P. Chen, Q. Li, S. Yu, et al. (2018) A 65nm 4kb algorithm-dependent computing-in-memory sram unit-macro with 2.3 ns and 55.8 tops/w fully parallel product-sum operation for binary dnn edge processors. In 2018 IEEE International Solid-State Circuits Conference-(ISSCC), pp. 496–498. Cited by: §2.2, §6.1.
  • A. E. Kiasari, Z. Lu, and A. Jantsch (2012) An Analytical Latency Model for Networks-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21 (1), pp. 113–123. Cited by: §2.4.
  • G. Krishnan, S. K. Mandal, C. Chakrabarti, J. Seo, U. Y. Ogras, and Y. Cao (2020) Interconnect-aware area and energy optimization for in-memory acceleration of dnns. IEEE Design & Test 37 (6), pp. 79–87. Cited by: §1, §2.2, §2.2, §2.3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
  • H. Kwon, A. Samajdar, and T. Krishna (2018) Maeri: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In ACM SIGPLAN Notices, Vol. 53, pp. 461–475. Cited by: §1, §2.3.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proc. of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §5.1.
  • M. Lin, Q. Chen, and S. Yan (2013) Network in Network. arXiv preprint arXiv:1312.4400. Cited by: §5.1.
  • G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez (2017) A Survey on Deep Learning in Medical Image Analysis. Medical Image Analysis 42, pp. 60–88. Cited by: §1.
  • S. K. Mandal, R. Ayoub, M. Kishinevsky, M. M. Islam, and U. Y. Ogras (2020a) Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic. IEEE Embedded Systems Letters. Cited by: §2.4.
  • S. K. Mandal, R. Ayoub, M. Kishinevsky, and U. Y. Ogras (2019) Analytical Performance Models for NoCs with Multiple Priority Traffic Classes. ACM Transactions on Embedded Computing Systems (TECS) 18 (5s), pp. 1–21. Cited by: §2.4, §4.
  • S. K. Mandal, G. Krishnan, C. Chakrabarti, J. Seo, Y. Cao, and U. Y. Ogras (2020b) A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10 (3), pp. 362–375. Cited by: §1.
  • M. Mao, X. Peng, R. Liu, J. Li, S. Yu, and C. Chakrabarti (2019) MAX2: An ReRAM-based Neural Network Accelerator that Maximizes Data Reuse and Area Utilization. IEEE Journal on Emerging and Selected Topics in Circuits and Systems. Cited by: §2.2.
  • M. Mirza-Aghatabar, S. Koohi, S. Hessabi, and M. Pedram (2007) An Empirical Investigation of Mesh and Torus NoC Topologies under Different Routing Algorithms and Traffic Models. In 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007), pp. 19–26. Cited by: §2.3.
  • S. M. Nabavinejad, M. Baharloo, K. Chen, M. Palesi, T. Kogel, and M. Ebrahimi (2020) An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10 (3), pp. 268–282. Cited by: §1.
  • U. Y. Ogras, P. Bogdan, and R. Marculescu (2010) An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29 (12), pp. 2001–2013. Cited by: §1, §2.4, §4, §4, §4.
  • X. Peng, M. Kim, X. Sun, S. Yin, T. Rakshit, R. M. Hatcher, J. A. Kittl, J. Seo, and S. Yu (2019) Inference Engine Benchmarking Across Technological Platforms from CMOS to RRAM. In Proceedings of the International Symposium on Memory Systems, pp. 471–479. Cited by: §5.2.
  • Z. Qian, D. Juan, P. Bogdan, C. Tsui, D. Marculescu, and R. Marculescu (2015) A Support Vector Regression (SVR)-based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35 (3), pp. 471–484. Cited by: §2.4.
  • X. Qiao, X. Cao, H. Yang, L. Song, and H. Li (2018) Atomlayer: A Universal Reram-based CNN Accelerator with Atomic Layer Computation. In IEEE/ACM DAC, Cited by: §2.2, §5, §6.5, Table 4.
  • A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar (2016) ISAAC: A Convolutional Neural Network Accelerator with in-situ Analog Arithmetic in Crossbars. Proceedings of the 43rd International Symposium on Computer Architecture. Cited by: §1, §1, §1, §2.2, §2.3, §5, §6.1, §6.5, Table 4.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §2.1.
  • L. Song, X. Qian, H. Li, and Y. Chen (2017) Pipelayer: A Pipelined Reram-based Accelerator for Deep Learning. In IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 541–552. Cited by: §1, §1, §2.2, §2.2, §5, §6.5, Table 4.
  • S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, et al. (2017) Scaledeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. ACM SIGARCH Computer Architecture News 45 (2). Cited by: §1, §2.3.
  • S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring Randomly Wired Neural Networks for Image Recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293. Cited by: §2.1, §3, §6.6.
  • S. Yin, Z. Jiang, M. Kim, T. Gupta, M. Seok, and J. Seo (2019) Vesti: Energy-Efficient In-Memory Computing Accelerator for Deep Neural Networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28 (1), pp. 48–61. Cited by: §2.2.
  • Z. Zhu, H. Sun, K. Qiu, L. Xia, G. Krishnan, G. Dai, D. Niu, X. Chen, X. S. Hu, Y. Cao, et al. (2020) MNSIM 2.0: a behavior-level modeling tool for memristor-based neuromorphic computing systems. In Proceedings of the 2020 on Great Lakes Symposium on VLSI, pp. 83–88. Cited by: §2.3.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §6.6.