Figure 1 shows the trend in connection density for various DNNs in the literature, where connection density
is defined as the average number of connections per neuron. In the context of DNNs, a neuron is defined as an output feature of a convolution layer and each neural unit of a fully-connected (FC) layer. Three representative DNN structures and connection patterns are illustrated in Figure 2. Linear structures such as LeNet-5 (LeCun et al., 1998) and VGG-19 (Simonyan and Zisserman, 2014) have a connection density of one, since each neuron has exactly one connection. Residual networks such as ResNet (He et al., 2016) contain residual skips, so they have more connections than neurons, resulting in a connection density higher than one. Dense structures like DenseNet (Huang et al., 2017) have multiple connections from each neuron, resulting in an even higher connection density.
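To make the metric concrete, connection density can be computed from a layer graph with a short sketch. The DAG representation and the per-layer normalization below are illustrative assumptions made here; the paper defines density per neuron.

```python
# Model a DNN as a DAG: nodes are layers, edges are inter-layer connections
# (including skip and dense connections). Normalizing by the number of
# receiving layers is an assumption made for illustration.

def connection_density(edges, num_layers):
    return len(edges) / (num_layers - 1)

# Linear chain (LeNet/VGG style): one incoming connection per layer.
chain = [(i, i + 1) for i in range(4)]                        # 5 layers
# Dense structure (DenseNet style): each layer feeds all later layers.
dense = [(i, j) for i in range(5) for j in range(i + 1, 5)]   # 5 layers
```

For the chain the density is 1.0, while for the dense graph it is 2.5, matching the qualitative trend in Figure 1.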
We observe two main trends by analyzing the connection density for different DNNs in Figure 1. First, increasing connection density provides higher accuracy, which is essential for cloud-based computing platforms. Second, lower connection density is observed for compact models, which is necessary for edge computing hardware. Both hardware platforms require the processing of large amounts of data with corresponding power and performance constraints. Hence, there is a need to design optimal hardware architectures with low power and high performance for DNNs with different connection densities.
With limited on-chip memory, conventional DNN architectures inevitably involve a significant amount of communication with off-chip memory, resulting in increased energy consumption (Chen et al., 2019). Moreover, it has been reported that the energy consumption of off-chip communication is 1,000× higher than the energy required to perform the computations themselves (Horowitz, 2014). Dense structures like DenseNet perform a very large number of off-chip memory accesses to process a single image frame (Huang et al., 2017). As a result, off-chip memory access becomes the energy bottleneck for hardware architectures targeting dense structures. Employing dense embedded non-volatile memory (NVM) such as ReRAM for in-memory computing (IMC) substantially reduces off-chip memory accesses (Shafiee et al., 2016; Song et al., 2017).
On-chip interconnect is an integral part of hardware architectures that incorporate in-memory acceleration. Both point-to-point (P2P) interconnects (Kwon et al., 2018; Venkataramani et al., 2017) and NoC-based interconnects (Shafiee et al., 2016; Chen et al., 2019; Krishnan et al., 2020) are used for on-chip communication in state-of-the-art DNN accelerators. Shafiee et al. (Shafiee et al., 2016) utilize a concentrated mesh for the interconnect, while Chen et al. (Chen et al., 2019) employ three different NoCs for on-chip data movement in the architecture. In contrast, Krishnan et al. (Krishnan et al., 2020) utilize a custom mesh-NoC for on-chip communication. The custom NoC derives its structure from the on-chip traffic between different IMC processing elements (PEs), where each PE denotes an SRAM- or ReRAM-based IMC crossbar. A technique to construct a custom NoC that provides minimum communication latency for a given DNN is proposed in (Mandal et al., 2020b). Since a custom NoC requires hardware alterations for different DNNs, our studies focus on regular NoC topologies. A more detailed survey of works that design efficient interconnects for DNN accelerators can be found in (Nabavinejad et al., 2020).
To better understand the need for an NoC-based on-chip interconnect, we analyze the scalability of P2P interconnects in in-memory computing (IMC) architectures by evaluating the contribution of routing latency to end-to-end latency for different DNNs, as shown in Figure 3. The contribution of routing latency increases up to 94% with increasing connection density. The high routing latency is attributed to the increased connection density, which correlates with more on-chip data movement. VGG-19 shows a reduced contribution compared to lower connection density DNNs due to high utilization of the IMC PEs or crossbars, resulting in reduced on-chip data movement. Hence, P2P networks do not provide a scalable solution for high connection density DNNs. At the same time, NoC-based interconnects require higher area and energy to operate and can introduce a significant overhead for low connection density DNNs. Furthermore, different NoC topologies, such as mesh or tree, are appropriate for DNNs with different connection densities. Therefore, a connection density-aware interconnect solution is critical for DNN acceleration.
In this work, we first perform an in-depth performance analysis of P2P interconnect-based in-memory computing (IMC) architectures (Song et al., 2017). Through this analysis, we establish that P2P-based interconnects are incapable of handling the data communication of dense DNNs and that an NoC-based interconnect is needed for IMC architectures. Next, we evaluate P2P-based and NoC-based SRAM and ReRAM IMC architectures for a range of DNNs. Further, we evaluate NoC-tree, NoC-mesh, and c-mesh topologies for the IMC architectures. A c-mesh NoC is used in (Shafiee et al., 2016) at the tile-level to connect different tiles. C-mesh uses more links and routers, providing better performance in terms of communication latency; however, its interconnect area and energy become exorbitantly high. Therefore, the energy-delay-area product (EDAP) of c-mesh is higher than that of NoC-mesh. Hence, we restrict the detailed evaluations to NoC-mesh and NoC-tree. In these evaluations, we perform cycle-accurate NoC simulations through BookSim (Jiang et al., 2013). However, cycle-accurate NoC simulations are very time-consuming and consequently slow down the overall performance analysis of IMC architectures. Our experiments with different DNNs (the simulation framework is described in more detail in Section 3) show that cycle-accurate NoC simulation takes up to 80% of the total simulation time for high connection density DNNs.
To accelerate the overall performance analysis of the IMC architecture, we propose analytical models to estimate the NoC performance for a given DNN. Specifically, we incorporate the analytical router modeling technique presented in (Ogras et al., 2010) to obtain the performance model for an NoC router. We then extend this model to estimate the end-to-end communication latency of NoC-tree and NoC-mesh for any given DNN as a function of the number of neurons and the connection density. Through the analytical latency model, the variable communication patterns of different DNNs are captured by the connection density and the number of neurons. Leveraging this analysis and the analytical model, we establish the importance of the optimal choice of interconnect at different hierarchies of the IMC architecture and provide guidance for making that choice. At the tile-level, NoC-mesh for high connection density DNNs and NoC-tree for low connection density DNNs provide low power and high performance for IMC-based architectures. Leveraging this observation, we propose an NoC-based heterogeneous interconnect IMC architecture for DNN acceleration. We demonstrate that the NoC-based heterogeneous interconnect IMC architecture (ReRAM) achieves up to 6× improvement in the energy-delay-area product (EDAP) for inference of VGG-19 when compared to state-of-the-art implementations. The following are the key contributions of this work:
An in-depth analysis of the shortcomings of P2P-based interconnect and the need for NoC in IMC architectures.
Analytical and empirical analysis to guide the choice of optimal NoC topology for an NoC-based heterogeneous interconnect.
The proposed heterogeneous interconnect IMC architecture achieves 6× improvement in EDAP with respect to state-of-the-art ReRAM-based IMC accelerators.
The rest of the paper is organized as follows. Section 2 introduces the background and motivation, Section 3 discusses the simulation framework used in this work, and Section 4 presents the analytical performance modeling-based technique to obtain the optimal choice of NoC for any given DNN. Section 5 presents the in-memory architecture with heterogeneous interconnect. Section 6 discusses the experimental results, and Section 7 concludes the paper.
2. Motivation and Related Work
2.1. Deep Neural Networks
where K_i is the kernel, k_r is the number of rows in the kernel, and k_c is the number of columns in the kernel of convolution layer i. To implement (1) on hardware, a large number of multiplications and additions need to be performed, which dominates the computational cost of DNN algorithms.
2.2. In-Memory Computing with Crossbars
DNNs with a large number of weights require a considerable amount of computation. Conventional architectures separate data access from the memory and computation in the computing unit. This results in increased data movement, reducing both the throughput and energy-efficiency of DNN inference. In contrast, in-memory computing (IMC) seamlessly integrates computation and memory access in a single unit such as a crossbar (Shafiee et al., 2016; Song et al., 2017; Krishnan et al., 2020). Through this, IMC achieves higher energy efficiency and throughput than conventional von Neumann architectures.
The IMC technique localizes computation and data memory in a more compact design and enhances parallelism with multiple-row access, resulting in improved performance (Khwa et al., 2018; Jiang et al., 2020). Data accumulation is achieved through either current or charge accumulation. The size of the IMC subarray usually varies from 64×64 to 512×512. Along with the computing unit, peripheral circuits such as sample-and-hold circuits, analog-to-digital converters (ADCs), and shift-and-add circuits are used to obtain each DNN layer's result. In this work, we focus on IMC designs based on both SRAM (Khwa et al., 2018; Jiang et al., 2020; Yin et al., 2019) and ReRAM (Mao et al., 2019; Song et al., 2017; Qiao et al., 2018; Krishnan et al., 2020) crossbars.
2.3. Interconnect Network
As discussed in Section 1, the on-chip interconnect is critical to accelerator performance for DNN acceleration. There are multiple topologies for a Network-on-Chip (NoC); well-known ones include mesh, tree, torus, hypercube, and concentrated mesh (c-mesh). An NoC with torus topology shows better performance than mesh due to the long links between nodes located at the edges. However, the power consumption of torus is significantly higher than that of mesh, as shown in (Mirza-Aghatabar et al., ). Hypercube and c-mesh share a similar disadvantage with torus. Therefore, only NoC-tree and NoC-mesh are considered in this work. Moreover, they are the industry standard for SoCs used for heavy workloads (Jeffers et al., 2016).
Figure 4 illustrates representative interconnect schemes of P2P, NoC-tree, and NoC-mesh for multi-tiled IMC architectures. Each tile consists of several crossbar sub-arrays which perform the IMC operation. Existing implementations of DNN accelerators use both P2P-based (Venkataramani et al., 2017; Kwon et al., 2018) and NoC-based (Shafiee et al., 2016; Krishnan et al., 2020; Zhu et al., 2020) interconnects for on-chip communication. To better understand the performance of different interconnect architectures, we plot the average interconnect latency for a P2P network with 64 nodes, an NoC-tree with 64 nodes, and an 8×8 NoC-mesh with X–Y routing, as shown in Figure 5. The NoC utilizes one virtual channel, a buffer size (all input and output buffers) of eight, and three router pipeline stages. We observe that for lower injection rates the performance of all topologies is comparable, while for higher injection rates the NoC performs better in terms of latency. Hence, the NoC provides better scalability and performance than P2P interconnects. Moreover, with increasing connection density, the injection bandwidth between layers increases due to increased on-chip data movement. Therefore, P2P interconnects perform poorly for DNNs with high connection density. Hence, there is a need for systematic guidance in choosing the optimal interconnect for in-memory acceleration of DNNs. Other works such as (Chen et al., 2019) utilize three separate NoCs for weights, activations, and partial sums. Such a design choice increases the area and energy cost of the interconnect fabric. Furthermore, the three NoCs are under-utilized, resulting in a sub-optimal design choice for DNN acceleration.
2.4. Analytical Modeling of NoCs
The analytical performance model for an NoC router in (Ogras et al., 2010) assumes that the probability distribution of input traffic is in the continuous time domain. However, all transactions in an IMC architecture happen in discrete clock cycles. An analytical performance modeling technique for NoCs in the discrete-time domain is proposed in (Mandal et al., 2019). In this work, we estimate the end-to-end communication latency for different DNNs as a function of the connection density and the number of neurons of the DNN. Specifically, we utilize the analytical model for an NoC router presented in (Ogras et al., 2010) with the modifications for discrete-time input (Mandal et al., 2019) and extend the model to obtain end-to-end communication latency for NoC-tree and NoC-mesh.
3. Simulation Framework
There exist multiple simulators that evaluate the performance of DNNs on different hardware platforms (Dong et al., 2012; Chen et al., 2018). These simulators consider different technologies, platforms, and peripheral circuit modeling while giving less consideration to the interconnect. With the advent of dense DNN structures (Xie et al., 2019), the interconnect cost becomes increasingly important, as discussed in Section 1. In this work, we develop an in-house simulator in which a circuit-level performance estimator of the computing fabric is combined with a cycle-accurate simulator for the interconnect. The simulator also aims to be versatile by supporting multiple DNN algorithms across different datasets and various interconnect schemes.
Figure 6 shows a block-level representation of the simulator. The inputs of the simulator primarily include the DNN structure, technology node, and frequency of operation. In the proposed simulation framework, any circuit-level performance estimator (Dong et al., 2012; Chen et al., 2018) and any interconnect simulator (Jiang et al., 2013; Agarwal et al., 2009) can be plugged in to extract performance metrics such as area, energy, and latency, providing a common platform for system-level evaluation. In this work, we use customized versions of NeuroSim (Chen et al., 2018) for circuit simulation and BookSim (Jiang et al., 2013) for cycle-accurate NoC simulation.
3.1. Circuit-level Simulator: Customized NeuroSim
The inputs to NeuroSim include the DNN structure (layer sizes and layer count) along with the technology node, the number of bits per in-memory compute cell, frequency of operation, read-out mode, etc. The simulator maps the entire DNN to a multi-tiled crossbar architecture by estimating the number of crossbar arrays and the number of tiles per layer. Based on the size of the crossbar array and the layer dimensions, the number of crossbar arrays is determined by (2).
where N_bits is the precision of the weights. The total number of tiles is calculated as the ratio of the total number of crossbar arrays to the number of crossbar arrays per tile. Furthermore, the peripheral circuits are laid out, and the complete tile architecture is determined. The peripheral circuits include an ADC, a sample-and-hold circuit, a shift-and-add circuit, and a multiplexer. However, NeuroSim lacks an accurate estimation of the interconnect cost in latency, energy, and area. Therefore, we extract the performance metrics for the tile-to-tile interconnect in NeuroSim and replace them with the customized BookSim tile-to-tile interconnect. With this customization, our circuit simulator reports only the performance metrics of the computing logic, such as area, energy, and latency. It provides the number of tiles per layer, the activations, and the number of layers to the interconnect simulator.
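The array-count estimate of this mapping step can be sketched as follows. The column replication factor for multi-bit weights is an illustrative assumption consistent with the weight-precision term above, not NeuroSim's exact formula.

```python
import math

def crossbars_needed(in_rows, out_cols, pe_size=256, w_bits=8, bits_per_cell=1):
    """Estimate crossbar arrays for one layer: the unrolled weight matrix is
    split across pe_size x pe_size arrays, with each weight occupying
    w_bits / bits_per_cell columns (an illustrative assumption)."""
    row_tiles = math.ceil(in_rows / pe_size)
    col_tiles = math.ceil(out_cols * (w_bits // bits_per_cell) / pe_size)
    return row_tiles * col_tiles

# Example: a 3x3 conv with 128 input and 256 output channels unrolls to a
# (3*3*128) x 256 weight matrix.
arrays = crossbars_needed(3 * 3 * 128, 256)   # 5 row tiles * 8 column tiles
```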
3.2. Interconnect Simulator: Customized BookSim
DNNs have varying structures, resulting in different traffic loads and data patterns between the IMC PEs. To accurately capture the NoC traffic of a given DNN configuration, we customize BookSim to evaluate the area, energy, and latency of the interconnect, as shown in Figure 6. The customized version of BookSim enables simulation with non-uniform injection rates. We compute the injection rates for each source-destination pair in the multi-tiled architecture. The placement of tiles and routers in the IMC architecture has a direct impact on interconnect performance; we therefore incorporate the effect of mapping into the injection matrix calculation. The mapping of the DNN is performed such that each tile can hold at least one layer, while no layer is divided between two tiles. Figure 7 shows a sixteen-tile IMC architecture with the tiles numbered; the red arrows show the data flow in the IMC architecture. When evaluating the interconnect latency, we create an injection matrix that incorporates the position of each tile by calculating the number of hops for each source-destination pair. Hence, the injection matrix incorporates the tile placement into the NoC latency calculation, and the proposed approach generalizes to any tile placement. Algorithm 1 describes the steps performed to compute the injection rates and obtain the interconnect latency. Without loss of generality, we assume that the number of nodes in the interconnect equals the total number of tiles across all layers.
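The hop computation that feeds the injection matrix can be sketched as below for X–Y routing, assuming the row-major tile numbering of the sixteen-tile example in Figure 7.

```python
def xy_hops(src_tile, dst_tile, mesh_cols=4):
    """Hop count under X-Y routing, assuming tiles are numbered row-major
    on a mesh_cols-wide grid as in the 16-tile example of Figure 7."""
    sr, sc = divmod(src_tile, mesh_cols)
    dr, dc = divmod(dst_tile, mesh_cols)
    return abs(sc - dc) + abs(sr - dr)   # X distance first, then Y distance

# Hop matrix for all 16 x 16 source-destination tile pairs.
hops = [[xy_hops(s, d) for d in range(16)] for s in range(16)]
```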
where N_bits, W_bus, and FPS represent the data precision, bus width, and frames-per-second throughput, respectively. In the numerator of (3), we multiply the number of input activations (A_l) of layer l by N_bits to obtain the total number of bits to be transferred from layer l to layer l+1 for one image frame. We further multiply this term by FPS to obtain the total number of bits transferred between the layers per second. Then, we divide this term by the operating frequency (f_clk) to obtain the total number of bits transferred between the layers per cycle. We assume an equal injection rate between all tiles of two consecutive layers. Therefore, to obtain the number of bits transferred from one tile to another in two consecutive layers, the denominator in (3) includes a multiplication of T_l and T_{l+1}, the numbers of tiles of layers l and l+1. Finally, we divide the expression obtained so far by W_bus to obtain the injection rate. The injection rate from every source to every destination is the input to the interconnect simulator. The interconnect simulator then provides the average latency (in cycles) to complete all transactions from layer l to layer l+1. Next, we multiply this latency by the number of bits transferred from one tile to the next to obtain the total number of cycles required to transfer all data between the two consecutive layers. The latency from one layer to the next (L_l) is then given by:
Finally, we accumulate the latency of all layers to compute the end-to-end interconnect latency as L_total = Σ_l L_l.
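A minimal sketch of the injection-rate computation in (3), assuming the uniform split across tile pairs described above (the numeric values are illustrative, not measurements from the paper):

```python
def injection_rate(acts, n_bits, fps, f_clk, tiles_l, tiles_next, bus_width):
    """Sketch of Eq. (3): bus-words injected per cycle for one tile pair
    between consecutive layers, assuming an even split across all pairs."""
    bits_per_cycle = acts * n_bits * fps / f_clk       # layer-to-layer traffic
    per_pair = bits_per_cycle / (tiles_l * tiles_next) # one source-dest pair
    return per_pair / bus_width                        # normalize to bus words

# Illustrative numbers: 100k activations, 8-bit data, 100 FPS, 1 GHz clock,
# 2 tiles in each of the two layers, 32-bit NoC bus.
rate = injection_rate(100_000, 8, 100, 1e9, 2, 2, 32)
```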
4. Analytical Performance Models for NoCs in IMC Architecture
In this section, we discuss an analytical approach to estimate NoC performance for IMC architecture. The analytical performance model of NoCs is primarily useful to overcome longer simulation time incurred by cycle-accurate NoC simulators. Specifically, we utilize analytical performance models for NoCs to compare the performance of NoC-tree and NoC-mesh for a given DNN. The analytical model of an NoC router is adopted from the work proposed in (Ogras et al., 2010). We extend this router model for NoC-tree and NoC-mesh to obtain end-to-end communication latency for different DNNs. Algorithm 2
describes the technique to evaluate the communication latency through analytical models. There are two major steps involved in analyzing the performance of an NoC: 1) Computing injection rate and 2) Computing contention probability matrix.
Computing the injection rate matrix: First, the injection rate from each source to each destination for each layer of the DNN is computed through (3). We note that the injection rate calculation incorporates the tile placement, as detailed in Section 3.2. Each NoC router has five ports: North (N), South (S), East (E), West (W), and Self. The injection rate at each port p of every router r (λ_p^r) is computed as:
where T_l denotes the number of tiles in layer l. λ_p^r is a function of the number of activations passing through port p of router r. From λ_p^r, the injection rate matrix of router r is computed (as shown in lines 5–7 of Algorithm 2).
Computing the contention matrix: Each element (c_jk) of the contention matrix denotes the contention between port j and port k. To compute the contention matrix of router r, we first compute the forwarding probability matrix F^r. Its element f_jk denotes the probability that a packet arriving at port j of the router is forwarded to port k, and is computed as shown in (7) (Ogras et al., 2010).
The contention probability between port j and port k of the router is then computed from the forwarding probabilities. Lines 10–11 of Algorithm 2 show the computation of the contention matrix.
Next, the average queue length at each port of the router (Q_p^r) is computed through the technique described in (Ogras et al., 2010).
where t_s is the service time of the router, assumed constant in our evaluation. R is the average residual time and is calculated assuming that the packets arrive in discrete clock cycles (Mandal et al., 2019). The waiting time of the packets at each port of the router is then computed from the queue length and the injection rate. The end-to-end average latency of each layer (L_l) is obtained by averaging the waiting time over all 5 ports of each router and then adding across all routers, as shown in (9) and (10).
Finally, the total communication latency (L_total) is obtained by adding the end-to-end average latency of all layers: L_total = Σ_l L_l.
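The queue-length and waiting-time steps can be sketched as below. This is a simplified discrete-time M/G/1-style approximation, not the paper's full contention-aware model; the residual-time expression and the unit service time are assumptions made for illustration.

```python
def port_waiting_time(rate, t_s=1.0):
    """Discrete-time M/G/1-style waiting-time estimate for one router port.
    The residual time R = rate * t_s / 2 and the unit service time are
    simplifying assumptions; the paper's model also accounts for contention
    between ports."""
    assert rate * t_s < 1.0, "port must be stable (utilization < 1)"
    residual = rate * t_s / 2.0
    return residual / (1.0 - rate * t_s)

def layer_latency(router_port_rates, t_s=1.0):
    """Eqs. (9)-(10) sketch: average waiting time over the 5 ports of each
    router, summed across all routers traversed by the layer's traffic."""
    return sum(
        sum(port_waiting_time(r, t_s) for r in ports) / len(ports)
        for ports in router_port_rates
    )
```

Because the model is closed-form, sweeping it over all routers and layers is orders of magnitude faster than a cycle-accurate simulation, which is the source of the speed-up reported in Section 6.2.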
5. Connection-centric Architecture
In this section, we first discuss a multi-tiled SRAM-based IMC architecture with three different interconnect topologies, namely, P2P, NoC-tree, and NoC-mesh at the tile level. We perform a comprehensive analysis of these three interconnect-based SRAM IMC architectures for different DNNs using the simulation framework described in Section 3. Based on the analysis, we show the need for an NoC-based heterogeneous interconnect IMC architecture for efficient DNN acceleration. We assume all weights are stored on-chip to avoid any DRAM access. The weights are loaded pre-execution and stored on-chip. The inputs are then loaded, and the computation is performed. There is no re-loading of intermediate results or weights from the off-chip memory during the execution of the DNN. The SRAM buffer is designed large enough to hold the intermediate results on-chip rather than moving them off-chip. Multiple inferences of the images can be performed using one pre-execution loading of the weights. Hence, we do not consider the initial loading of the weights into the energy calculation, consistent with prior work (Shafiee et al., 2016; Song et al., 2017) compared in the manuscript. In addition, we adhere to layer-by-layer design instead of a layer-pipelined design, since a pipelined design introduces pipeline bubbles in the execution flow and complicates the control logic (Qiao et al., 2018).
5.1. Design Space Exploration
We evaluate different performance metrics for a wide range of DNNs with P2P, NoC-tree, and NoC-mesh-based interconnect for SRAM-based IMC architectures. We consider routers with five ports, one virtual channel for NoCs and X–Y routing for NoC-mesh for this evaluation. To facilitate fair comparison, we normalize the throughput of the hardware architectures with three interconnect topologies to that of P2P interconnect.
Figure 8 shows the throughput comparison for different DNNs. For low connection density DNNs such as MLP and LeNet-5 (LeCun et al., 1998), the choice of interconnect does not make a significant difference to the throughput, due to low data movement between different tiles of the IMC architecture. However, P2P interconnect results in 1.25× and 2× higher area cost than NoC-tree for MLP and LeNet-5, respectively. Hence, NoC-tree provides better overall performance than P2P for both MLP and LeNet-5. We further analyze dense DNNs such as NiN (Lin et al., 2013), VGG-19, ResNet-50 (He et al., 2016) and DenseNet-100 (Huang et al., 2017). The performance comparison shows that the NoC-tree and NoC-mesh-based IMC architectures perform better than the P2P-based architectures (up to 15× for DenseNet-100). Since higher connection density of the DNNs results in increased on-chip data movement, the routing latency dominates the end-to-end latency. We see a similar trend with ReRAM-based IMC architectures, with similar throughput for MLP and 15× improvement in throughput for DenseNet-100. Through this, we establish that the performance of the P2P-based IMC architecture (SRAM- or ReRAM-based) diminishes with increasing connection density. In contrast, the performance of the NoC-based (tree, mesh) IMC architecture scales better (Figure 8).
Exploration of other NoC topologies: Apart from tree and mesh, other commonly known NoC topologies include c-mesh, hypercube, and torus. These topologies utilize more resources, in terms of routers and links, to reduce communication latency. However, the use of more resources increases the power consumption and area of the NoC. For example, we performed experiments with the c-mesh topology for different DNNs. Figure 9 compares the energy-delay-area product (EDAP) of mesh-, tree-, and c-mesh-based NoCs for different DNNs. We observe that while mesh- and tree-NoC provide comparable EDAP, the EDAP of c-mesh is at least five orders of magnitude higher.
5.2. Hardware Architecture
Based on the conclusions from Section 5.1, we derive an NoC-based heterogeneous interconnect IMC architecture for DNN acceleration. Figure 10 shows the hardware architecture which employs the heterogeneous interconnect system.
The proposed architecture is divided into a number of tiles, with each tile containing a set of computing elements (CEs). The tile architecture includes non-linear activation units, an I/O buffer, and accumulators to manage data transfer efficiently. Each CE further consists of multiple processing elements (PEs) or crossbar arrays, multiplexers, buffers, a sense amplifier, and flash ADCs. The ADC precision is set to four bits such that there is minimal or no accuracy degradation for the DNNs. In addition, the architecture does not utilize a digital-to-analog converter (DAC); instead, it uses sequential signaling to represent multi-bit inputs (Peng et al., 2019). The proposed heterogeneous tile architecture can be used for both SRAM and ReRAM (1T1R) technologies; however, the peripheral circuits change based on the technology. In this work, we choose a homogeneous tile design consisting of four CEs, with each CE consisting of four PEs. We evaluate both SRAM- and ReRAM-based IMC architectures for PE sizes varying from 64×64 to 512×512. We sample 8 DNNs (LeNet, NiN, SqueezeNet, ResNet-152, ResNet-50, VGG-16, VGG-19, and DenseNet-100), and a crossbar size of 256×256 provides the lowest EDAP for 75% of the DNNs. Hence, in this work, we choose 256×256 as the crossbar size for both SRAM- and ReRAM-based IMC architectures. To maximize performance, the architecture uses heterogeneous interconnects: an NoC-based interconnect at the global tile-level, a P2P interconnect (H-Tree) at the CE-level, and a bus at the PE-level due to the significantly lower data volume. For low data volumes, an NoC-based interconnect provides marginal performance gain while increasing energy consumption.
6. Experiments and Results
6.1. Experimental Setup
We consider an IMC architecture (Figure 10) with a homogeneous tile structure (SRAM, ReRAM) and one NoC router per tile. Table 2 summarizes the design parameters considered. We report the end-to-end latency, chip area, and total energy obtained for a PE size of 256×256 for each of the DNNs using the simulation framework discussed in Section 3. We incorporate conventional mapping (Shafiee et al., 2016), the IMC SRAM bitcell/array design from (Khwa et al., 2018), and the 1T1R ReRAM bitcell/array properties from (Chen et al., 2018). The IMC compute fabric utilizes a parallel read-out method. We utilize the same crossbar array size of 256×256 for both SRAM- and ReRAM-based IMC architectures. All rows of the IMC crossbar are asserted together, the analog MAC computation is performed along the bitline, and the analog voltage/current is digitized with a 4-bit flash ADC at the column periphery. We perform an extensive evaluation of the IMC architecture with both SRAM-based and ReRAM-based PE arrays for both NoC-tree and NoC-mesh. Unless specified, the NoC utilizes one virtual channel, a buffer size (all input and output buffers) of eight, and three router pipeline stages.
| Technology | 32nm | Flash ADC resolution | 4 bits |
| Cell levels | 1 bit/cell | | |
| Data precision | 8 bits | NoC bus width | 32 |
6.2. Evaluation of NoC Analytical Model
Figure 11 shows the accuracy of the analytical model (presented in Algorithm 2 in Section 4) in estimating the end-to-end communication latency with both NoC-tree and NoC-mesh. We observe that the accuracy is always more than 85% across different DNNs. On average, the NoC analytical model achieves 93% accuracy with respect to cycle-accurate NoC simulation (Jiang et al., 2013). Moreover, we achieve 100×–2,000× speed-up with the NoC analytical model with respect to cycle-accurate NoC simulation. Figure 12 shows the speed-up for different DNNs with mesh-NoC. This speed-up is useful for design space exploration considering various PE array sizes and other NoC topologies. Due to the high speed-up in NoC performance analysis, we achieve 8× speed-up in the overall performance analysis with respect to the framework that uses cycle-accurate NoC simulation.
6.3. Analysis on Traffic Congestion in NoC
In this section, we present an analysis of traffic congestion in the NoC for various DNNs. To this end, we discuss the average queue length of different buffers in the NoC and the worst-case communication latency.
Analysis of the average queue length: We investigated the average queue length at different ports of different routers in the NoC through the cycle-accurate NoC simulator. We performed this experiment with mesh-NoC considering the configuration parameters shown in Table 2. Figure 13 shows that 64%–100% of the queues contain no flit when a new flit arrives, across the different DNNs. The percentage of queues with zero occupancy for LeNet-5 and NiN is 91% and 65%, respectively. These two DNNs utilize fewer routers, which results in less parallelism in data communication. However, we note that determining the optimal number of routers for a given DNN is beyond the scope of this work.
Figure 14 shows the average queue length for NiN and VGG-19 for the queues with non-zero length when a new flit arrives. We observe that the average queue length varies from 0.004 to 0.5 for these DNNs. The average queue length is very low in these cases since the injection rates into the queues are low and the NoC introduces a high degree of parallelism in data transmission between routers.
Analysis of the worst-case latency: Furthermore, we extracted the worst-case latency (L_worst) for different source-to-destination pairs of different DNNs with mesh-NoC. We compared the L_worst of each source-to-destination pair with the corresponding average latency (L_avg). Then we computed the mean absolute percentage deviation (MAPD) of L_worst from L_avg as:

MAPD = (100 / N) × Σ_i |L_worst^i − L_avg^i| / L_avg^i
where N is the total number of source-to-destination pairs with non-zero average latency, and L_worst^i and L_avg^i are the worst-case latency and the average latency, respectively, of the i-th source-to-destination pair. Table 3 shows the mean absolute percentage deviation for different DNNs. We observe that the deviation is insignificant, except for LeNet-5 and NiN, for which the deviations are 9.13% and 20.76%, respectively.
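A sketch of the MAPD computation over the pairs with non-zero average latency (the latency values below are illustrative, not the paper's measurements):

```python
def mapd(worst, avg):
    """Mean absolute percentage deviation of worst-case from average latency,
    over source-destination pairs with non-zero average latency."""
    pairs = [(w, a) for w, a in zip(worst, avg) if a > 0]
    return 100.0 * sum(abs(w - a) / a for w, a in pairs) / len(pairs)

# Illustrative latencies (cycles) for three source-destination pairs.
deviation = mapd([10, 12, 8], [9, 10, 8])
```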
Furthermore, in Figure 15 we show the absolute difference between the worst-case latency and the average latency for LeNet-5 and NiN for different source-to-destination pairs with non-zero latency. The maximum difference is 6 cycles for both LeNet-5 and NiN. This analysis shows that the worst-case latency deviates very little from the average latency. Therefore, the studies of average queue length and worst-case latency confirm that there is no congestion in the NoC.
6.4. Guidance on Optimal Choice of Interconnect
6.4.1. Empirical Analysis
We compare the performance of the IMC architecture using both NoC-tree and NoC-mesh for both SRAM- and ReRAM-based technologies. We perform the experiments for representative DNNs: MLP, LeNet-5, and NiN represent low connection density DNNs; ResNet-50, VGG-19, and DenseNet-100 represent high connection density DNNs. We report throughput and the product of energy consumption, end-to-end latency, and area (EDAP) of the IMC architectures. EDAP is used as the metric to guide the optimal choice of interconnect for IMC architectures.
Figure 16(a) shows the ratio of the throughput of the SRAM-based IMC architecture using NoC-tree and NoC-mesh interconnects. We normalize the throughput values with respect to that of NoC-tree. NoC-tree performs better than NoC-mesh for DNNs with low connection density because of the reduced injection bandwidth into the interconnect. In addition, while NoC-mesh provides lower interconnect latency than NoC-tree, it comes at an increased area and energy cost. In contrast, NoC-mesh performs better for DNNs with high connection density. The improved performance stems from the reduced interconnect latency at the high injection rates of data into the interconnect; the reduction in latency far outweighs the additional area and energy overhead of NoC-mesh.
To better understand the performance, we report the EDAP for the SRAM-based IMC architecture. Figure 16(b) shows the normalized EDAP of NoC-tree and NoC-mesh for both low and high connection density DNNs. DNNs with low connection density have significantly lower EDAP with NoC-tree than with NoC-mesh. This improved EDAP for NoC-tree complements the observation for throughput. At the same time, for DNNs with high connection density, the EDAP of NoC-mesh is lower than that of NoC-tree. Similar observations hold for ReRAM-based IMC architectures, as shown in Figure 17(a) and Figure 17(b). In contrast to the SRAM-based IMC architecture, NiN achieves better throughput with the NoC-mesh interconnect, while NoC-tree still provides better EDAP than NoC-mesh, similar to the SRAM-based IMC architecture.
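Since EDAP serves as the figure of merit throughout this comparison, a small helper makes the metric explicit; the function names and the tuple layout below are our assumptions (lower EDAP is better):

```python
def edap(energy, delay, area):
    """Energy-delay-area product of a configuration; lower is better."""
    return energy * delay * area

def normalized_edap(candidate, baseline):
    """EDAP of a candidate interconnect configuration relative to a
    baseline, e.g. NoC-mesh relative to NoC-tree. Each argument is an
    (energy, delay, area) tuple in consistent units."""
    return edap(*candidate) / edap(*baseline)
```

A normalized value below 1 indicates the candidate interconnect is preferable under this metric, mirroring the comparisons in Figures 16(b) and 17(b).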
Furthermore, we performed two additional sets of experiments with NoC-mesh and NoC-tree by varying the number of virtual channels and the bus width, considering ReRAM-based IMC architectures. Figure 19 shows the comparison with different numbers of virtual channels and with different bus widths of the NoC. We observe similar trends for different DNNs with different NoC configurations.
Since the injection rate to the input buffer of the NoC is always low (less than one packet in 100 cycles), increasing the number of virtual channels does not alter the inference latency significantly. Therefore, throughput remains similar for all DNNs with both NoC-tree and NoC-mesh as the number of virtual channels increases. However, the area and power of both NoC-mesh and NoC-tree increase proportionally with the number of virtual channels. Therefore, the normalized EDAP (EDAP of NoC-mesh divided by EDAP of NoC-tree) is similar for all DNNs with different numbers of virtual channels.
When we change the bus width of the NoC, the latency increases (decreases) proportionally with decreasing (increasing) bus width, i.e., the latency with a bus width of 32 is twice the latency with a bus width of 64. Moreover, the area and power of the NoC increase (decrease) proportionally with increasing (decreasing) bus width. Therefore, the normalized EDAP is similar for all DNNs with different NoC bus widths. Consequently, for all configurations, we obtain exactly the same guidance on the choice of NoC for different DNNs; the guidance is consistent across different NoC parameters.
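The bus-width argument can be checked numerically: if latency scales as 1/w while power and area scale as w, the EDAP ratio of two topologies is independent of w. The toy sketch below encodes exactly those idealized proportionalities (the baseline component values are hypothetical, not measured):

```python
def edap_scaled(base_energy, base_delay, base_area, w, w_ref=32):
    """EDAP of a topology at bus width w, given its (energy, delay,
    area) components at the reference width w_ref. Delay is assumed
    proportional to 1/w; power and area proportional to w."""
    scale = w / w_ref
    delay = base_delay / scale
    power = (base_energy / base_delay) * scale   # base power, scaled up
    energy = power * delay                       # energy = power * time
    area = base_area * scale
    return energy * delay * area

mesh = (4.0, 2.0, 3.0)   # hypothetical (energy, delay, area) at w=32
tree = (2.0, 3.0, 1.0)

r32 = edap_scaled(*mesh, w=32) / edap_scaled(*tree, w=32)
r64 = edap_scaled(*mesh, w=64) / edap_scaled(*tree, w=64)
# Under these proportionalities the normalized EDAP is identical
# at both bus widths, consistent with the observation above.
```

Note that under these ideal scalings the energy term (power times delay) is itself width-invariant, so the delay and area factors cancel and each topology's EDAP, not just the ratio, stays constant.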
6.4.2. Theoretical Analysis
We utilize the analytical model in Section 4 and the experimental results in Figure 16 and Figure 17 to provide guidance on the optimal choice of interconnect for IMC architectures. The injection rate at each port of an NoC router for each layer of the DNN is expressed in (6). The numerator of (6) denotes the total data volume between layer $i$ and layer $i+1$ for each port of the router per cycle; dividing by the remaining terms yields the injection rate for each port of every router, as detailed in Section 4. For a fixed NoC-based IMC architecture, the target throughput ($T$), frequency of operation ($f$), and bus width ($w$) remain constant. Hence, from (6) we obtain

$$\lambda_{i,i+1} \propto d_{i,i+1}$$

where $\lambda_{i,i+1}$ is the injection rate and $d_{i,i+1}$ is the data volume between layer $i$ and layer $i+1$.
Let the connection density for layer $i$ be $\kappa_i$ and the number of neurons be $N_i$. The data volume between layer $i$ and layer $i+1$ is proportional to the product of $\kappa_i$ and $N_i$, as shown in (14):

$$d_{i,i+1} \propto \kappa_i N_i \qquad (14)$$
This volume is shared among the source-to-destination pairs between the two layers, whose number scales as the square of the number of neurons; hence the per-pair injection rate for layer $i$ satisfies $\lambda_{i,i+1} \propto \kappa_i / N_i$ (15). Generalising (15) over all layers of a DNN with overall connection density $\kappa$ and $N$ neurons, we obtain

$$\lambda \propto \frac{\kappa}{N}$$
Therefore, the injection rate is directly proportional to the connection density and inversely proportional to the number of neurons of the DNN. Figure 20 presents the preferred regions of NoC-tree and NoC-mesh for best throughput for different DNNs with IMC architectures. If the connection density of a DNN exceeds 2, then NoC-mesh is suitable. If the connection density is less than 1, then NoC-tree is appropriate. Both NoC-tree and NoC-mesh are suitable for DNNs with connection density in the range of 1-2 (the region where the red and blue ovals overlap in Figure 20).
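The regions in Figure 20 amount to a simple decision rule on the connection density; a minimal sketch with the thresholds taken from the text (the function name is ours):

```python
def choose_interconnect(connection_density):
    """Suggest an NoC topology for an IMC architecture from the
    DNN's connection density, following the empirically derived
    regions: NoC-tree below 1, NoC-mesh above 2, either in between."""
    if connection_density < 1.0:
        return "NoC-tree"
    if connection_density > 2.0:
        return "NoC-mesh"
    return "either"
```

For instance, a linear DNN like VGG-19 (density 1) falls in the overlap region, while a dense structure like DenseNet maps to NoC-mesh.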
6.5. Comparison with State-of-the-Art Architectures
Table 4 compares the proposed architecture with state-of-the-art DNN accelerators. Prior works show the efficacy of their ReRAM-based IMC architectures for the VGG-19 DNN (Qiao et al., 2018; Shafiee et al., 2016; Song et al., 2017). Hence, for comparison, we choose the VGG-19 network as the representative DNN. Moreover, we compare the dynamic power consumption of the DNN hardware since prior works report dynamic power in their results, making the comparison consistent. The inference latency of the proposed architecture with SRAM arrays is 2.2× lower than that of the architecture with ReRAM arrays. The proposed ReRAM-based architecture achieves a 4.7× improvement in FPS and a 6× improvement in EDAP over AtomLayer (Qiao et al., 2018). The improvement in performance is attributed to the optimal choice of interconnect coupled with the absence of off-chip accesses. The proposed ReRAM-based architecture consumes 400× lower power per frame along with a 1.74× improvement in FPS compared to PipeLayer (Song et al., 2017). Moreover, there is a 5.4× improvement in inference latency compared to ISAAC (Shafiee et al., 2016), which is achieved by the heterogeneous interconnect structure.
| AtomLayer (Qiao et al., 2018) | 6.92 | 4.8 | 145 | 1.58 |
| PipeLayer (Song et al., 2017) | 2.6* | 168.6 | 385 | 94.17 |
| ISAAC (Shafiee et al., 2016) | 8.0* | 65.8 | 125 | 359.64 |
6.6. Connection Density and Hardware Performance
Figure 1 showed a trend of DNNs moving toward high connection density structures for performance and low connection density structures for compact models. Figure 21 shows the performance of both P2P- and NoC-based interconnects at the tile level of the IMC architecture for DNNs with different connection densities. We observe a steep increase in total latency with a P2P interconnect. In contrast, the IMC architecture with NoC interconnect shows a stable curve as we move toward high connection density DNNs. With the advent of neural architecture search (NAS) techniques (Xie et al., 2019; Zoph et al., 2018), DNNs are moving toward highly branched structures with very high connection densities. Hence, the NoC-based heterogeneous interconnect architecture provides a scalable and suitable platform for IMC acceleration of DNNs.
The trend of connection density in modern DNNs requires a re-evaluation of the underlying interconnect architecture. Through a comprehensive evaluation, we demonstrate that a P2P-based interconnect is incapable of handling the high volume of on-chip data movement of DNNs. Further, we provide guidance, backed by empirical and analytical results, to select the appropriate NoC topology as a function of the connection density and the number of neurons. We conclude that NoC-mesh is preferred for DNNs with high connection density, while NoC-tree is suitable for DNNs with low connection density. Finally, we show that the NoC-based heterogeneous interconnect IMC architecture achieves 6× lower EDAP than state-of-the-art ReRAM-based IMC accelerators.
This work was supported by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and SRC task 3012.001.
- GARNET: A Detailed On-Chip Network Model Inside a Full-System Simulator. In 2009 IEEE ISPASS, pp. 33–42.
- NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37 (12), pp. 3067–3080.
- Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE JETCAS 9 (2), pp. 292–308.
- New Types of Deep Neural Network Learning for Speech Recognition and Related Applications: An Overview. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599–8603.
- NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31 (7), pp. 994–1007.
- Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Computing's Energy Problem (and What We Can Do About It). In IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
- Densely Connected Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
- Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.
- A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 86–96.
- C3SRAM: An In-Memory-Computing SRAM Macro Based on Robust Capacitive Coupling Computing Mechanism. IEEE Journal of Solid-State Circuits, pp. 1–1.
- A 65nm 4Kb Algorithm-Dependent Computing-in-Memory SRAM Unit-Macro with 2.3ns and 55.8TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors. In 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 496–498.
- An Analytical Latency Model for Networks-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21 (1), pp. 113–123.
- Interconnect-Aware Area and Energy Optimization for In-Memory Acceleration of DNNs. IEEE Design & Test 37 (6), pp. 79–87.
- ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. In ACM SIGPLAN Notices, Vol. 53, pp. 461–475.
- Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- Network in Network. arXiv preprint arXiv:1312.4400.
- A Survey on Deep Learning in Medical Image Analysis. Medical Image Analysis 42, pp. 60–88.
- Analytical Performance Modeling of NoCs under Priority Arbitration and Bursty Traffic. IEEE Embedded Systems Letters.
- Analytical Performance Models for NoCs with Multiple Priority Traffic Classes. ACM Transactions on Embedded Computing Systems (TECS) 18 (5s), pp. 1–21.
- A Latency-Optimized Reconfigurable NoC for In-Memory Acceleration of DNNs. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10 (3), pp. 362–375.
- MAX2: An ReRAM-Based Neural Network Accelerator That Maximizes Data Reuse and Area Utilization. IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
- An Empirical Investigation of Mesh and Torus NoC Topologies under Different Routing Algorithms and Traffic Models. In 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007), pp. 19–26.
- An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10 (3), pp. 268–282.
- An Analytical Approach for Network-on-Chip Performance Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 29 (12), pp. 2001–2013.
- Inference Engine Benchmarking Across Technological Platforms from CMOS to RRAM. In Proceedings of the International Symposium on Memory Systems, pp. 471–479.
- A Support Vector Regression (SVR)-Based Latency Model for Network-on-Chip (NoC) Architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 35 (3), pp. 471–484.
- AtomLayer: A Universal ReRAM-Based CNN Accelerator with Atomic Layer Computation. In IEEE/ACM DAC.
- ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture.
- Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.
- PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 541–552.
- ScaleDeep: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. ACM SIGARCH Computer Architecture News 45 (2).
- Exploring Randomly Wired Neural Networks for Image Recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293.
- Vesti: Energy-Efficient In-Memory Computing Accelerator for Deep Neural Networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28 (1), pp. 48–61.
- MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-Based Neuromorphic Computing Systems. In Proceedings of the 2020 Great Lakes Symposium on VLSI, pp. 83–88.
- Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.