Deep Neural Networks (DNNs) are widely adopted for a variety of applications, ranging from speech recognition, object detection, and self-driving cars to cancer detection, drug discovery, and genomics. DNNs are able to extract high-level features from input data, as in statistical learning, rather than relying on the hand-picked features of classic machine learning. This has enabled DNNs to achieve human-level accuracy, which comes at the cost of high communication and computation complexity. The high complexity of DNNs is attributed to the huge number of parameters and multiply-and-accumulate (MAC) operations. Fig. 1 shows the number of weights and MAC operations used in some popular DNN models. AlexNet consists of 61M weights and 724M MACs, while VGG-16 consists of 138M weights and 15.5G MACs.
DNN processing includes a training phase and an inference phase. In the training phase, the network weights and biases are learned. The inference phase takes inputs from the user or a sensor and uses the weights and biases obtained during training to produce the estimated result. Training DNNs often requires a large dataset and is more computation-intensive than inference. Training is also a time-consuming process that may take up to several weeks at a cloud/data center. Inference, on the other hand, can be performed even on edge devices such as mobile phones. Because of the huge number of parameters involved (weights and biases; see Fig. 1), it is not possible to store all of them in the local memory of the processing elements (PEs). Different levels of memory, from DRAM with high access cost to register files with low access cost, are commonly used in DNN accelerators, which imposes the challenge of optimizing data movement in the memory hierarchy.
In a DNN accelerator, PEs perform MAC operations, while the involved parameters are usually stored in the global memory. Data must therefore be transferred from the global memory to the PEs, and vice versa. PEs and the memory elements are often interconnected by a Network-on-Chip (NoC) to realize high throughput. These PEs operate in parallel and reduce memory accesses as much as possible by sharing and reusing parameters with each other, especially in spatial architectures.
In the design of a DNN accelerator, NoCs play an important role in supporting various traffic patterns and dataflow models, enabling processing with communication parallelism and enhancing scalability. NoCs also offer a modular design property that helps in power gating. Mesh is a widely adopted NoC topology that is scalable and can support a large number of PEs. The artificial intelligence (AI) computing system from Cerebras uses a 2D mesh as the communication topology to connect thousands of AI cores. Groq reorganizes a 2D mesh into functional slices to optimize the microarchitecture. Existing mesh-based accelerator systems focus more on improving scalability and data reuse, and little attention is given to enhancing communication support. Hence, this study focuses on efficient communication support for mesh-based DNN accelerators.
Observing the nature of data traffic in DNN processing, many inputs/weights are transferred from the global memory to the PEs, and results (such as partial sums) are collected from the PEs to the global memory. This traffic can be classified as one-to-many (multicast) and many-to-one (gather) traffic, respectively. In the literature, different approaches have been proposed to support multicast traffic in NoCs. Notably, in a DNN workload, the multicast traffic tends to have a fixed communication pattern. Thus, existing general-purpose multicast algorithms may not be well suited to DNN workloads. In addition, support for gather traffic has not been well addressed in NoCs, as many-to-one traffic rarely occurs in conventional parallel workloads.
In this work, based on the observation that the multicast traffic in DNN workloads has the same source and destination set most of the time, we propose a modified mesh architecture with a one-way/two-way streaming bus to speed up the input and weight distribution in an NoC. To support many-to-one traffic on a mesh-based NoC, gather packets are used to efficiently deliver the partial sum results to the global memory. Streaming architectures and gather packets can be used with different dataflow models. To evaluate the performance of the proposed modified mesh-based architecture with gather support, analysis and simulations are conducted on the output stationary dataflow model using AlexNet and VGG-16 as DNN workloads, and the results are compared with the conventional repetitive unicast method.
The remainder of the paper is organized as follows. Section 2 reviews dataflow models, NoC architectures, and communication support proposed for DNN accelerators. Section 3 provides the background and motivation behind this work. In Section 4, we describe the proposed methods and analyze data streaming architecture and gather support improvement. Section 5 presents the experimental results of the proposed methods. Section 6 concludes the paper.
2 Related Work
DNN processing involves tens of layers and a large number of MAC operations using millions of parameters, which imposes tremendous throughput and energy-efficiency challenges on the computing platforms. Recently, spatial DNN accelerators have been gaining attention, as they are optimized to handle DNN processing effectively. Commonly used in FPGA- and ASIC-based designs, spatial architectures take a distributed approach that adopts a large number of PEs, each with its own control logic and limited local memory, along with a shared global memory. Communication between PEs is allowed, which enables data movement between them. In the design of a DNN accelerator, a major consideration is optimizing data movement, which aims to minimize global memory accesses and thereby reduce the power consumption during DNN processing.
The dataflow determines the processing order, as well as where data is stored and reused, i.e., how data (inputs, weights, and partial sums) is communicated between the PEs and the memory element. In the literature, various dataflow models have been proposed, including Weight Stationary (WS), Output Stationary (OS), Row Stationary (RS), and No Local Reuse (NLR); each has its own memory usage and energy advantages. In the WS model, weights are stationary at the PEs, while the inputs and partial sums move through the PEs and the memory element. In contrast, the OS model keeps the output stationary at the PEs, while the inputs and weights move. The NLR model focuses on increasing the size of the global buffer at the expense of the register file, and thus decreases the DRAM accesses. The RS model increases the reuse of all data types rather than focusing on the reuse of one type.
DNN workloads contain different types of communication traffic that manage the data movement, such as partial sum, weight, and input streams to and from the memory. In a DNN accelerator, data movement is expensive in terms of energy, consuming around 50% of the total energy. In some cases, data movement can even increase the latency because of the communication bottleneck. Although a bus-only system has been proposed in some prior work, this kind of system quickly becomes a bottleneck as the DNN size increases. This observation has led to work focused on the NoC architecture and communication support of DNN accelerators.
Various studies have explored NoC topologies for accelerating DNN workloads. One study proposes a hierarchical Neu-NoC architecture that adopts a hybrid ring-mesh topology, in which multiple PEs are connected in a group of rings that are in turn connected via a mesh topology. This structure reduces the communication distance and shows better performance than bus and tree structures. Other research proposes a reconfigurable topology for a 3D neural network accelerator that can be reconfigured as a tree to handle multicast traffic. SpiNNaker, a many-core system, is proposed to simulate spiking neural networks with a torus network topology. Another study compares different topologies and concludes that the mesh NoC is better for realizing spiking neural networks than tree, point-to-point, and bus-based structures. In another design, a fat tree and a mesh are used for intrachip communication and data movement among chips, respectively. The separation of intrachip and interchip communication may create a bottleneck for the gather traffic that is abundant in a DNN workload.
Changes in the NoC topology also change the communication cost and the support for different traffic patterns. Another study proposes a mesh-based interconnection network, called a hierarchical mesh network, for DNN processing. The PEs and memory elements are grouped into clusters, which are then connected via the hierarchical mesh network. The NoC is capable of configuring the network topology based on the needs of the workload. NeuronLink is a chip-to-chip interconnection network for large neural networks that supports both interchip and intrachip communication. Each chip consists of 16 PEs in a mesh topology, and 4 such chips are connected in a star topology to handle a large amount of unicast and multicast traffic.
Various routing methods are adopted to fulfill the communication needs, especially multicast and gather traffic, in a DNN accelerator. One design adopts XY routing for unicast traffic and table-based routing for multicast traffic. Another study, however, proposes different NoC configurations for each data type, i.e., input activations, weights, and partial sums. That method suits an RS dataflow architecture, where partial sums are accumulated across multiple PEs, and hence is not suitable for gathering the unique partial sums across the PEs that exist in an OS dataflow architecture. Other research proposes an array of microswitches that are configured to handle different kinds of DNN traffic by creating a tree. In ClosNN, one or more layers can be executed on the network by mapping the neurons (PEs) onto the input/output ports. Several stages of switching connect the input and output ports in ClosNN, depending on the type of data traffic.
Since the field of DNNs is evolving rapidly, hardware design should also be able to maintain this pace. As a widely adopted NoC topology, mesh is used in most of the recently proposed DNN accelerators. Many-to-one and one-to-many traffic are not inherently supported in a mesh topology. One-to-many traffic has many solutions that can be adapted to DNN workloads, but many-to-one traffic lacks an efficient method. Recent work modifies the topology to emulate a tree or Clos network to support many-to-one traffic, while other work models this traffic as repetitive unicast (RU). In this work, we focus on providing communication support for this traffic on a mesh topology rather than proposing an alternate topology.
3 Background and Motivation
Our study is focused on the inference phase of a feed-forward neural network. In general, the traffic patterns existing in a workload running on an NoC-based system significantly impact overall system performance. Hence, it is important to study the nature of traffic in a DNN workload and provide a communication mechanism to support this traffic efficiently.
A DNN model may include tens of layers (such as convolutional layers, pooling layers, and fully connected layers) and millions of parameters. The neurons (activations) in each layer are connected to neurons (activations) in another layer in full or in part via synapses (weights), as shown in Fig. 2. When these layers are implemented in hardware, neurons are typically mapped to PEs inside a DNN accelerator. These neurons share the weights stored in the memory element; similarly, the outputs of the neurons in one layer are the inputs to the neurons in the adjacent layer. This sharing of data between adjacent PEs (neurons) creates traffic inside accelerators, which can be classified as one-to-one (unicast), one-to-many (multicast), and many-to-one (gather), as shown in Fig. 3.
Unicast traffic usually occurs when sending an input activation or weight from a memory element to a PE, or any other inter-PE traffic. Multicast traffic mainly covers the distribution of weights from the memory element to multiple PEs. Different dataflow mechanisms can be used to support multicast traffic for weight distributions. Gather traffic is used to collect the output from multiple PEs to the memory element. Due to limitations of computing resources, the inference operation of a DNN workload is performed in multiple rounds. When one round of MAC operations is completed, the intermediate results are gathered back to the memory element before initiating a new round.
The computation performed at a neuron can be expressed as

$$y = f\left(\sum_{i=1}^{n} w_i x_i\right) \qquad (1)$$

where $w_i$ represents the weights and $x_i$ represents the input activations for the neurons in a particular layer containing $n$ neurons, $y$ represents the output activation, which will be fed as input to another layer, and $f$ is the activation function in the model. Many DNNs consist of multiple layers, where both convolution and fully-connected layers perform MAC operations, as shown in (1). These layers are computationally intensive and hence are performed on multiple PEs in a distributed way. Moreover, the weights and inputs are stored in the memory element, and these steps require frequent access to the memory element, which is expensive in terms of latency and energy.
When designing a DNN hardware accelerator, dataflow is another crucial aspect to consider. It depends on many factors, such as the input size, the number of kernels, the stride, and the mapping scheme of the DNN workload onto the PE arrays, along with DNN optimizations like pruning and sparsity. An inefficient dataflow model causes stalls, because the appropriate data may not be available at the PE when needed, and low data reuse, because the same data must be fetched multiple times from the memory, resulting in higher latency and energy inefficiency. Compared with other dataflow models, the OS dataflow model achieves good performance with less complexity. In this work, we analyze our proposed method using the OS dataflow model on a mesh-based NoC. Fig. 4 shows the OS dataflow, where input activations and weights are streamed in a row-wise and column-wise manner, respectively, while the partial sums are accumulated on a PE.
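The OS dataflow described above can be illustrated with a minimal Python sketch. It assumes a small PE grid in which PE (r, c) accumulates the product of the input stream on row r and the weight stream on column c; the function name `output_stationary` and the list-of-lists layout are illustrative assumptions, not part of the proposed design.

```python
def output_stationary(inputs, weights):
    """Simulate an output-stationary PE array (illustrative model only).

    inputs:  one stream of K input activations per PE row
    weights: one stream of K weights per PE column
    Returns a grid of partial sums, one per PE.
    """
    rows, cols = len(inputs), len(weights)
    depth = len(inputs[0])
    # Each PE keeps its partial sum locally (the "stationary" output).
    psum = [[0 for _ in range(cols)] for _ in range(rows)]
    for step in range(depth):  # one streamed element per step
        for r in range(rows):
            for c in range(cols):
                # PE (r, c) sees the input on its row and the weight on its column.
                psum[r][c] += inputs[r][step] * weights[c][step]
    return psum


print(output_stationary([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this model, only the input and weight streams move; each partial sum stays at its PE until the round ends, at which point it would be collected by the network.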
Fig. 5 shows an example of how efficient communication support can affect the performance for many-to-one communication traffic. The green-colored nodes in a 6x6 mesh are trying to send data to a memory element. Fig. 5(a) and Fig. 5(b) illustrate the delivery of data using unicast communication and a possible gather communication scheme, respectively. Using unicast communication, each node in the same row sends its own packet to the same destination, which increases the amount of traffic. Is it possible to reduce the network traffic when multiple senders are sending a payload to the same destination? With gather support, the gather packet is initiated at the leftmost node and collects the data payloads from the intermediate nodes on its way to the destination node. As shown in Fig. 5(b), gather support significantly reduces the network traffic and reduces the total hop count from 15 to 5, delivering all data to the memory element with far fewer network resources. In addition, in DNN workloads, the weights and inputs used to calculate Eqn. (1) are continuously streamed to certain groups of PEs, as shown in Fig. 4. In that sense, direct links may be added for distributing weights/inputs, thus eliminating unnecessary hops. As such, both many-to-one and one-to-many traffic can be supported in a mesh-based NoC.
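The hop counts quoted for Fig. 5 can be reproduced with a small sketch. It assumes five senders in one row, located one to five hops from the destination; this placement is an assumption chosen because it matches the figures of 15 and 5, and the function names are hypothetical.

```python
def unicast_hops(num_senders):
    # Sender i (1 = closest to the destination) sends its own packet over i hops.
    return sum(range(1, num_senders + 1))


def gather_hops(num_senders):
    # One gather packet starts at the farthest sender and passes each sender once.
    return num_senders


print(unicast_hops(5), gather_hops(5))
```

Under this assumption, the unicast hop count grows quadratically with the number of senders in a row, while the gather hop count grows linearly.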
4 The Proposed Method
In this section, we describe the gather-supported routing scheme first, followed by the data streaming architectures and the support for multiple PEs per router. We also present an analysis of the performance improvement in this section.
4.1 Gather Support
Assume that a convolutional layer is implemented on a mesh-based NoC. Multiple PEs perform MAC operations in a distributed fashion, as shown in Fig. 4, where multi-channel input activations are streamed in the X-dimension. Similarly, multi-channel filter (weight) streams are streamed in the Y-dimension. The weight and input streams meet at all PEs in the respective dimensions to complete the MAC operations. As shown in Fig. 4, input activations and weights are streamed from the left and top sides of the PE array, respectively, so that the partial sums are generated in the PEs. It is also clear from Fig. 4 that multiple rounds are required to complete all MACs because of the resource limitation in the network. When the first round of MAC operations is completed, the partial sum (PS) result, as shown in Eqn. (2), is collected using gather traffic and sent to the global buffer before the start of the next round. Additionally, after the completion of one layer, the output activations are also moved to the memory element so that a new layer can operate on the PEs.
We propose to use gather-supported routing to support gather traffic. After completing its operation (PS or output activation), the leftmost PE in each row initiates a gather packet, with the packet format shown in Fig. 6(a). The packet consists of multiple fields: a field that identifies the flit type (head, body, or tail); a field that identifies the packet type (unicast, multicast, or gather); a field that indicates the available space for the gather payload; fields for the source and destination addresses; and a field for the multicast destinations.
Algorithm 1 shows the flow of the gather-supported routing implemented at each router. The incoming header flit is used to generate a load signal, which instructs the router to fill a gather payload into an incoming body or tail flit by appending the payload. The size of a gather payload is assumed to be smaller than a flit. Fig. 6(b) shows the logic that generates this signal; the same signal is also used to decrement the space counter so that other PEs can estimate the space available for their gather payloads. If the available space is less than the gather payload size, the router can initiate its own gather packet. However, to avoid flooding the network with gather packets, each router must first wait for a timeout period so that any previously generated gather packet can go through.
The value of the timeout period can be determined based on the router pipeline stages. Additionally, it can be fine-tuned for an individual router, if required. Too low a timeout value results in an increased number of packets in the network, leading to congestion and increased latency, while too high a value causes nodes to wait too long for an incoming gather packet, which may also increase the packet latency. Notably, the timeout also serves as a fault tolerance mechanism. If a link is faulty, the node can initiate its own packet without having to wait indefinitely for a previously generated gather packet. In such a scenario, a large timeout value can lead to higher packet latency. In our experiments, all links are assumed to be fault free and reliable.
It is important to restrict a circular path in the routing algorithm to avoid a potential deadlock. The proposed gather packet still follows XY routing, which is deadlock-free.
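The per-router decision described in this subsection can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: the class `GatherHeader`, the field `space_bits`, and both function names are hypothetical, and the sketch models only the space check, the space-counter update, and the timeout rule.

```python
from dataclasses import dataclass


@dataclass
class GatherHeader:
    # Hypothetical field: free space (in bits) left for gather payloads.
    space_bits: int


def on_gather_header(header, payload_bits, has_payload):
    """Return True if this router should append its payload to the passing packet."""
    if has_payload and header.space_bits >= payload_bits:
        # Reserve room so that downstream routers see the updated space.
        header.space_bits -= payload_bits
        return True
    return False


def should_start_own_packet(has_payload, loaded, cycles_waited, timeout_cycles):
    """A router starts its own gather packet only once the timeout has expired."""
    return has_payload and not loaded and cycles_waited >= timeout_cycles


header = GatherHeader(space_bits=64)
print([on_gather_header(header, 32, True) for _ in range(3)])
```

In the usage line, a header with room for two 32-bit payloads is offered to three routers in turn; the third finds no space left and would fall back to the timeout rule.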
4.2 Router Architecture
An NoC router typically consists of multiple pipeline stages. Fig. 7 shows a four-stage router pipeline including stages of route computation (RC), virtual channel allocation (VC), switch allocation (SA), and switch traversal (ST) . For an incoming packet, only the header flit undergoes the RC stage to determine the output port for the packet. Similarly, only the header flit moves to the VC stage, where the flit arbitrates for a virtual channel corresponding to its output port. In the SA stage, each flit arbitrates for the switch input and output ports. Finally, in the ST stage, each flit traverses the crossbar. All flits of a packet undergo the SA and ST stages. The unused pipeline stages for the body/tail flit can be used to fill the gather payload into a gather packet.
Fig. 7 shows the modified router pipeline that incorporates the gather support. When the header flit of a gather packet arrives at the input buffer, the load signal is generated during the RC stage, and the space counter is updated in the VC stage. Upon the arrival of the body/tail flit, the gather payload is filled into the packet during the RC and VC stages. This modification does not require the packet to leave the router, nor does it add extra pipeline stages; thus, no additional latency is introduced. As the router pipeline is kept the same, there is no impact on the router performance. The modified router microarchitecture is shown in Fig. 8. The Gather Load Generator block generates the load signal and updates the available-space field in the header flit. The Gather Payload block contains a queue that enqueues the gather payload from the PE, and the status of uploading is acknowledged back to the PE. The same status signal is used by the PE to initiate its own gather packet if the incoming gather packet is full, or to initiate its own unicast packet upon the expiration of the timeout period, which is controlled by a counter set by the PE.
4.3 Multicast Support
As shown in Fig. 4, in the OS model, a partial sum is accumulated at a PE with the input activations and weights streaming from the memory element. Assume that one router can connect to multiple PEs, so that these PEs receive multiple sets of inputs and weights. The convolution operation accounts for a large portion of the energy and latency in a DNN operation. A significant portion of the communication traffic involves the distribution of input activations and weights, which can be treated as multicast traffic because of the dataflow pattern shown in Fig. 4. Based on this observation, a modification is proposed in which the input activations and weights are streamed over a bus from the memory elements to the PEs in the same row and column, respectively. The streaming bus helps to avoid the additional routing overhead and thus improves the runtime latency of a DNN workload.
Fig. 10 (a) shows the two-way streaming architecture, where two different stream units will handle the streaming of input activations and filter weights. Each input activation streaming unit handles the streaming of the corresponding input activation from the memory element to a respective PE row. Each router in the same row will receive the same input activations, which are then buffered for MAC operation on a PE’s internal register file. Similarly, the weight streaming unit streams the filter weights from the memory element to a respective PE column. All row and column data are streamed to the respective PEs, similar to the pattern shown in Fig. 4. The partial sums or the output activations are calculated at all PEs. Results in the same row are then collected using a gather packet as it proceeds towards the memory element. Similar architectures can be orchestrated for other dataflow models.
Fig. 10(b) shows the one-way streaming architecture, where both the input activations and the filter weights share the same streaming link to the PEs in the same row. As the link carries either the weight stream or the activation stream in a given clock cycle, there is inherent latency added before the PEs can proceed with the MAC operation. Fig. 10(b) also shows the streaming unit, which streams the input activations and weights in an interleaved manner through a multiplexer on the shared link. The partial sums are collected by a gather packet before being sent back to the memory element. This architecture uses less silicon area than the two-way streaming architecture. It may also be beneficial for other dataflow models, such as WS and RS, where the weights are streamed to the PEs before input streaming begins.
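The difference between the two streaming architectures can be sketched with a simplified cycle model. It assumes that each link delivers one item per clock cycle; the function names and the `("I", ...)`/`("W", ...)` tags are illustrative assumptions, and the resulting cycle counts describe only this model, not measured results.

```python
def two_way_stream(inputs, weights):
    # Separate links: one (input, weight) pair arrives at a PE in every cycle.
    return [(i, w) for i, w in zip(inputs, weights)]


def one_way_stream(inputs, weights):
    # Shared link: the multiplexer alternates input and weight, one item per cycle.
    link = []
    for i, w in zip(inputs, weights):
        link.extend([("I", i), ("W", w)])
    return link


inputs, weights = [1, 2, 3], [10, 20, 30]
print(len(two_way_stream(inputs, weights)))  # cycles with dedicated links
print(len(one_way_stream(inputs, weights)))  # cycles on the shared link
```

In this model, the shared link needs twice as many cycles to deliver the same operands, which is the inherent latency traded for the smaller area of the one-way design.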
4.4 Multiple PEs per Router
We also consider an expanded mesh, where each router can be connected to multiple PEs. Fig. 9 shows such an architecture. The Network Interface (NI) unit handles the packet movement between the router and the PEs. The streaming units feed input(s)/weight(s) (I/W) to the NI, as shown in Fig. 9. An incoming packet from the router or streaming unit is dequeued from an incoming queue into the control logic, where the packet is disassembled and forwarded to the respective PEs. The control logic keeps track of the type of packet, the start and end of the packet, and other information necessary to correctly decompose the data for each PE. When PEs are ready to inject data into the network, the packet format unit collects the outgoing data from the PEs and generates a packet. The generated packet is then forwarded to the control logic, which creates flits to be enqueued in an outgoing queue. The router accesses the outgoing queue and injects the packet into the network. We assume that these PEs are simple, as proposed in prior work, supporting a MAC operation and an activation function with predictable pipeline stages. Hence, synchronization can be done without extra overhead.
Supporting multiple PEs per router allows more partial sums to be generated in parallel and makes better use of a gather payload, which can help accelerate DNN execution with reduced power consumption. Depending on the bus width, multiple input activations and weights can be streamed into each NI at one time. As shown in Fig. 4, these input activations and weights may have different combinations depending on how the PEs are grouped. One option is for multiple PEs in the same column to share one router; then multiple sets of input activations and one set of filter weights are streamed into the NI. For example, with two PEs/router, two sets of input activations and one set of filter weights are streamed into the NI connected to the two PEs over multiple clock cycles. This can be further extended to 4 and 8 PEs/router. Another option is for multiple PEs in the same row to share one router; then one set of input activations and multiple sets of filter weights are streamed into the NI. Other options are possible, at the cost of a more complex control logic design. This information can be fed to the streaming units as a configuration file.
In the proposed method, we have two different networks: one for gather traffic and the other for the streaming bus. In the mesh network, we use a credit-based flow control mechanism. A similar end-to-end flow control mechanism may be employed for the streaming bus, but this may create extra wire overhead from each node to the streaming unit. Hence, we adopt a credit-based mechanism similar to that used in prior work to ensure single-cycle data delivery to the PEs. The global buffer maintains the status of the credits for the PEs, i.e., the incoming queue in the NI, as shown in Fig. 9. The streaming unit performs the streaming only if all the nodes have free space to hold the data. This ensures the integrity of the MAC operation.
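The streaming condition described above, that data is sent only when every node can accept it, can be sketched as a simple credit check. The list `credits` (free slots in each NI incoming queue) and the three function names are hypothetical; the sketch covers only the gating rule and not the wiring or timing of the actual mechanism.

```python
def can_stream(credits):
    # Stream only when every NI on the bus has at least one free queue slot.
    return all(c > 0 for c in credits)


def stream_once(credits):
    """Deliver one item to every NI if possible; return True on success."""
    if not can_stream(credits):
        return False
    for node in range(len(credits)):
        credits[node] -= 1  # the delivered item occupies one queue slot
    return True


def release(credits, node):
    # A PE consumed an item, so its NI returns one credit.
    credits[node] += 1


credits = [1, 2]
print(stream_once(credits), stream_once(credits))
```

In the usage line, the second attempt is blocked because the first NI has no free slot, so no node receives data that another node would miss.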
4.5 Analysis of the Proposed Modification
The total clock cycles required to finish one round of convolution operation can be attributed to the time required for: input activation streaming and weight streaming; the MAC operation; and finally, the generation and collection of the result. Fig. 11 illustrates the pipelined execution of multiple rounds of convolution operations in one row of PEs, assuming one PE/router, where the streaming of input activation and weights (I/W), followed by the MAC operation and activation function, happen in parallel at all PEs. After MAC operations, the partial sum (PS) is generated at each PE, which is then collected by the gather packet. While the gather packet collects the PS results along its way to the global memory, the next round of convolution operation occurs concurrently. Note that with multiple PEs/router, in each round the streaming time will be extended while other parts stay the same.
Equations (3)-(4) analyze the improvement in the runtime latency of a convolution layer using gather support over repetitive unicast (baseline) for the OS dataflow model on the proposed streaming architecture. The terms in these equations are: the time to stream the inputs to the PE, as shown in Fig. 4; a multiplying factor that depends on the number of PEs per router; a factor that reduces the input streaming time with the streaming bus in the proposed method; the computation time for the MAC operation; the number of pipeline stages at each router (with each stage occupying one cycle); and the number of rounds needed to finish the convolution of all inputs and filters using the OS dataflow model. The unicast packet size, the gather packet size, and the flit size are treated as fixed parameters. The gather packet is initiated from the leftmost node of each row in Fig. 4.
Equation (3) derives the runtime latency of a convolution layer using repetitive unicast. Its result-collection terms are the latency for the header flit of the result packet (partial sum) to reach the global buffer, the time needed for the remaining flits to arrive at the global memory, and the latency added due to congestion. When a data streaming bus is used, the transmission of unicast packets from all nodes is parallelized, so the packet from the leftmost node takes the longest time to arrive at the global memory.
Equation (4) derives the runtime latency of a convolution layer using gather support. Its terms are the number of payloads that can be collected by one gather packet, the number of gather packets, the latency for the header flit of the gather packet, the latency for the remaining flits of the gather packet, and the latency added due to congestion.
Note that in Eqs. (3) and (4), the data streaming and MAC operation times are the same; the difference lies in the time taken to transmit the results to the global memory. When n=1, the time taken to transmit the unicast packet from the leftmost node is nearly the same as the time taken to transmit the gather packet. However, as n increases, the delay due to network congestion increases significantly for RU, as reflected by the congestion term in Eq. (3). In comparison, the network congestion for gather packets, reflected by the congestion term in Eq. (4), is much lower. We evaluate the effects of these congestion terms through simulations in Section 5.2.
5 Performance Evaluation
To evaluate the performance of the proposed method, simulations are conducted and compared with the repetitive unicast method on mesh-based NoCs modified with the streaming architectures for the OS dataflow model. In this section, the experiment settings are described first, followed by a study of the timeout period and the gather packet size, before the performance analysis is presented.
5.1 Experiment Setup
We compare the proposed method and the repetitive unicast method in terms of runtime latency and power consumption. We assume that a higher-level entity or mapping framework, similar to those in prior work, maps neurons to the PEs and controls timing for better synchronization without stalls, so that our focus is on evaluating the performance of the on-chip network. In order to fully utilize the spatial PE arrays, the parameters obtained from the PyTorch framework are used to model the traces for the NoC. PEs represent the neurons, which are organized in a 2D mesh. The neurons in each layer are divided into groups that fit the PEs in the mesh. Assume that memory elements (global memory) are located on the north, east, and west sides of the network. Each PE receives the input activations and filter weights from the streaming units on the north and east sides of the mesh, respectively. Accumulation happens locally to generate the partial sums, which are then collected from the left side to the global buffer at the right side of the mesh. The output feature map of the current layer is completely generated before the next layer is processed.
Table I. NoC settings

|Parameter|Value|
|---|---|
|Topology|8x8 Mesh, 16x16 Mesh|
|Latency|router: 4 cycles, link: 1 cycle|
|Buffer Depth|4 flits|
|Flit Size|128 bits/flit|
|Gather Payload|32 bits|
|Number of PE per router|1, 2, 4, 8|
|Gather Packet Size||
|Unicast Packet Size||
A cycle-accurate C++-based NoC simulator is used to simulate the generated traces for AlexNet and VGG-16. Orion 3.0 is used to estimate the power consumption of the NoC, and DSENT is used to estimate the power consumption of the streaming bus. Simulations are performed on 8x8 and 16x16 2D mesh networks. Table I shows the NoC settings used for the performance analysis. As the number of PEs/router increases, the gather payload also increases. To accommodate these gather payloads, we can use either a fixed or a dynamic number of flits per gather packet. For the DNN workload, each node in the same row will generate a prefix sum result. Hence, a fixed number of flits is chosen for our experiments, which avoids the extra router design overhead of a dynamic packet size. For 1, 2, 4, 8 PEs/router, the gather packet size is set to 3, 5, 9, 17 flits/packet by default, respectively. This packet size is enough to collect all the gather payloads in a row of an 8x8 network; for a 16x16 NoC, however, two gather packets are needed, as the first one becomes full halfway to the global memory.
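The default gather packet sizes can be reproduced from the Table I settings with a short calculation. The sketch below is illustrative only: it assumes one header flit per gather packet, which is not stated in the text but is consistent with the 3, 5, 9, 17 flit values, and the function names are hypothetical.

```python
import math

FLIT_BITS = 128     # flit size from Table I
PAYLOAD_BITS = 32   # gather payload per PE from Table I
HEADER_FLITS = 1    # assumption: one header flit per gather packet


def gather_packet_flits(nodes_covered: int, pes_per_router: int) -> int:
    """Flits needed for one gather packet that collects from `nodes_covered` routers."""
    payload_bits = nodes_covered * pes_per_router * PAYLOAD_BITS
    return HEADER_FLITS + math.ceil(payload_bits / FLIT_BITS)


def packets_per_row(row_nodes: int, pes_per_router: int, packet_flits: int) -> int:
    """Gather packets needed for one mesh row when each packet has `packet_flits` flits."""
    capacity_bits = (packet_flits - HEADER_FLITS) * FLIT_BITS
    row_bits = row_nodes * pes_per_router * PAYLOAD_BITS
    return math.ceil(row_bits / capacity_bits)


if __name__ == "__main__":
    for pes in (1, 2, 4, 8):
        flits = gather_packet_flits(8, pes)  # sized for one row of an 8x8 mesh
        print(pes, flits, packets_per_row(8, pes, flits), packets_per_row(16, pes, flits))
```

Under these assumptions, a packet sized for an 8-node row fills up after eight routers, so a 16-node row needs two such packets, matching the description above.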
5.2 Analysis of Timeout Period and Gather Packet Size
The timeout period plays an important role in the performance of the gather routing. It defines how long (in cycles) a PE with a gather payload waits before initiating its own packet, in anticipation that a gather packet sent from its neighbor will arrive. Fig. 12 shows the impact of the timeout period on the total runtime latency and the total power consumption of gather-supported routing. The analysis is done on an 8x8 mesh under a traffic scenario similar to that in Fig. 5, where the nodes in one row try to deliver their gather payloads to the global memory on the right side of the mesh.
The timeout period ( clock cycles) depends on the router pipeline stages (). When , the header flit of a gather packet will not reach its adjacent node before the timeout expires, so each PE initiates its own packet. This situation is similar to sending the result from each node by repetitive unicast, which causes congestion in the network and increases the runtime latency. As the timeout period increases, the gather packet initiated by a PE reaches the adjacent nodes before the timeout occurs at those nodes. This piggyback mechanism effectively decreases the network traffic and improves resource utilization in the NoC.
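The effect of the timeout on the number of injected packets can be illustrated with a toy model. This is a deliberately simplified sketch, not the cycle-accurate simulator used in the experiments: it assumes all payloads are ready at cycle 0, the leftmost node injects its gather packet at cycle 0, each hop costs 5 cycles (4-cycle router plus 1-cycle link from Table I), the packet never fills up, and there is no congestion. The function name and the tie-breaking rule are hypothetical.

```python
HOP_CYCLES = 5  # assumption: 4-cycle router + 1-cycle link per hop (Table I)


def packets_in_row(row_nodes: int, timeout: int, hop_cycles: int = HOP_CYCLES) -> int:
    """Count packets injected in one row under a simplified timeout rule."""
    packets = 1  # gather packet started by the leftmost node at cycle 0
    for node in range(1, row_nodes):
        header_arrival = node * hop_cycles  # cycle at which the header reaches this node
        if header_arrival > timeout:
            # Timeout expires before the header arrives: the node sends its own packet.
            packets += 1
        # Otherwise the node piggybacks its payload on the passing gather packet.
    return packets


if __name__ == "__main__":
    for timeout in (0, 5, 15, 35):
        print(timeout, packets_in_row(8, timeout))
```

In this model a timeout of 0 degenerates to one packet per node, as in repetitive unicast, while a sufficiently long timeout lets a single packet serve the whole row. The exact threshold depends on the assumptions above and need not match the value used in the experiments.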
As shown in Fig. 12(a), the normalized runtime latency (vs. ) decreases as the timeout period increases, except for the 1 PE/router case, where it remains almost the same. With more PEs/router, the gather packet size is increased (3, 5, 9, 17 flits for 1, 2, 4, 8 PEs/router) to accommodate more partial sums. Notably, once the timeout period becomes sufficiently large (), no further latency improvement is observed. This is because the timeout period is then long enough for a single gather packet to collect all the gather payloads. Therefore, for mesh, the timeout period can be set to to ensure that the header flit of the leftmost gather packet arrives at all nodes in the same row, so that all the gather payloads can be loaded into the same gather packet.
Fig. 12(b) shows the normalized power consumption for different values of the timeout period (vs. ). As the timeout period increases, fewer packets are generated in the network because a gather packet collects all the partial sum results. This reduces the total number of hops traversed and improves NoC resource utilization for 1, 2, 4, and 8 PEs/router, and thus lowers the total power consumption. Although there is no improvement in runtime latency for the 1 PE/router case, its power consumption improves significantly.
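The reduction in hops can be made concrete with a simple count for one mesh row. The sketch below is a rough illustration under stated assumptions: node i (0 is the leftmost) is taken to be `row_nodes - i` hops from the global memory on the right, and only packet-hops are counted, so the larger flit count of a gather packet is ignored. The function names are hypothetical.

```python
def unicast_packet_hops(row_nodes: int) -> int:
    """Total packet-hops when every node sends its own packet to the right edge."""
    # Node i (0 = leftmost) is assumed to be row_nodes - i hops from the global memory.
    return sum(row_nodes - i for i in range(row_nodes))


def gather_packet_hops(row_nodes: int, packets: int = 1) -> int:
    """Total packet-hops when `packets` gather packets each cover an equal share of the row."""
    share = row_nodes // packets  # assumes `packets` divides `row_nodes` evenly
    # Packet k is started by the first node of its share and travels to the right edge.
    return sum(row_nodes - k * share for k in range(packets))


if __name__ == "__main__":
    print(unicast_packet_hops(8), gather_packet_hops(8))       # 8-node row, one gather packet
    print(unicast_packet_hops(16), gather_packet_hops(16, 2))  # 16-node row, two gather packets
```

Packet-hops grow quadratically with the row length for repetitive unicast but only linearly for gather traffic, which is consistent with the power trend described above.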
The tradeoff of using different gather packet sizes is further studied for different network sizes and numbers of PEs/router. Fig. 13 compares the performance of gather traffic using one gather packet with a larger number of flits against two gather packets with a smaller number of flits on 8x8 and 16x16 meshes for different numbers of PEs/router. Figs. 13(a) and (b) show the normalized runtime latency and power consumption (vs. repetitive unicast) for the 8x8 mesh. We can see a clear tradeoff: using one gather packet with a larger number of flits gives a better latency improvement, whereas using two gather packets with a smaller number of flits gives a better power improvement. A similar trend on the 16x16 mesh is shown in Figs. 13(c) and (d).
For the 1 PE/router case, a slight increase in runtime latency occurs, as this is the case in which the network does not have much load, i.e., one packet per node. In addition, one might expect a smaller gather packet size to give a shorter runtime; however, the opposite is observed, as shown in Figs. 13(a) and (b). This effect is due to the timeout period: in the case of two gather packets per row, the second packet is injected only when the first packet reaches a node with no space left for further payload. The first node to encounter this situation initiates a new gather packet, which becomes the second gather packet. To avoid this scenario, the router could be hardwired with the information to initiate the gather packet without waiting for the incoming gather packet, but this would reduce the scope and scalability of the proposed method. To balance the tradeoff between latency and power consumption, the following performance analysis uses one gather packet for the 8x8 mesh and two gather packets for the 16x16 mesh, with 3, 5, 9, 17 flits/gather packet set for 1, 2, 4, 8 PEs/router, respectively.
5.3 Performance Analysis
Fig. 14 shows the simulated runtime latency improvement of the proposed gather support with two-way and one-way streaming architectures vs. gather-only, using the NoC parameters from Table I. On average, gather support with the two-way streaming architecture achieves a 1.71 times improvement, and gather support with one-way streaming achieves a 1.48 times improvement over the gather-only method. It is clear that the runtime latency improvement with two-way streaming is better than with one-way streaming in the OS dataflow model. Hence, in the remaining experiments we compare gather support against repetitive unicast using the two-way streaming architecture.
Figs. 15(a) and (c) show the improvement in the total runtime latency of the proposed method against the repetitive unicast method for all convolution layers in AlexNet on 8x8 and 16x16 mesh-based NoCs, respectively. As the number of PEs per router increases on either the 8x8 or the 16x16 mesh, the total runtime latency improves. This improvement is attributed to the additional parallelism enabled by more PEs per router. With more PEs, more MAC operations are done in parallel in one round, which reduces the number of rounds needed. For a lower number of PEs per router, the runtime improvement is minor, which can also be seen from the timeout period analysis (Fig. 12(a)), as the network is not congested enough for the gather packet to improve the latency.
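The relation between the number of PEs and the number of rounds can be sketched as follows. This is a minimal illustration, assuming that each PE computes one output neuron per round (as suggested by the OS dataflow mapping described in Section 5.1); the layer size of 100,000 output neurons is a hypothetical value, not taken from AlexNet or VGG-16.

```python
import math


def rounds_needed(output_neurons: int, mesh_dim: int, pes_per_router: int) -> int:
    """Rounds needed to compute a layer if each PE handles one output neuron per round."""
    total_pes = mesh_dim * mesh_dim * pes_per_router
    return math.ceil(output_neurons / total_pes)


if __name__ == "__main__":
    layer_neurons = 100_000  # hypothetical layer size, for illustration only
    for mesh_dim in (8, 16):
        for pes in (1, 2, 4, 8):
            print(mesh_dim, pes, rounds_needed(layer_neurons, mesh_dim, pes))
```

Doubling the number of PEs per router roughly halves the number of rounds, while each round generates more gather payload per row, which is where the gather packet is expected to help most.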
The performance improvement is higher on the 16x16 mesh than on the 8x8 mesh. The reason is that on the 16x16 mesh, repetitive unicast traffic creates much higher congestion in the network, so the benefit of using gather traffic is more significant than on the 8x8 mesh. Similarly, Figs. 16(a) and (c) show the improvement in total runtime latency for all convolution layers in VGG-16 on the 8x8 and 16x16 meshes. For VGG-16, we see a similar trend in performance improvement, with the 16x16 mesh offering more improvement (up to 1.84 times) than the 8x8 mesh, and the improvement grows with the number of PEs per router. On average, the performance improvement is higher for VGG-16 than for AlexNet, as it has many more convolution layers to process.
Figs. 15(b) and (d) show the improvement in the total network power consumption of the proposed method against the repetitive unicast method for AlexNet on the 8x8 and 16x16 mesh-based NoCs. Unlike runtime latency, network power consumption is determined by the total amount of traffic communicated. For a lower number of PEs per router, the power improvement is minor because the power consumed by streaming is higher than the power saved by the gather traffic. As the number of PEs/router increases, the improvement in total network power also increases (up to 1.4 times) because of the reduction in streaming power. More weights or inputs can be streamed with an increasing number of PEs/router, and the advantage of gather traffic over repetitive unicast traffic becomes more significant. For the 16x16 mesh, the improvement is slightly lower than for the 8x8 mesh, which is due to the increased number of gather packets for the same number of PEs/router. Similarly, Figs. 16(b) and (d) show the improvement in power consumption for VGG-16 on the 8x8 and 16x16 meshes. For both mesh sizes, we see a trend similar to that in Figs. 15(b) and (d). The improvement in power consumption is also higher for VGG-16 than for AlexNet, for the same reason as the latency improvement.
5.4 Hardware Overhead
We used DSENT to estimate the area and power for a router with the configuration shown in Table I. Operating on a clock, the router consumes power with an area of . The hardware overhead of the proposed router from Fig. 8 is evaluated with the synthesis report obtained from the Synopsys Design Compiler with a CMOS library. The power consumption of the proposed router is and the area is . With the proposed changes in the router, the overhead is around a increase in power and increase in area, which is worthwhile considering the performance improvement that the changes bring.
6 Conclusion
In this paper, we proposed using the gather packet and direct data streaming architectures on mesh-based NoCs to handle the abundant many-to-one and one-to-many traffic in DNN workloads. The OS dataflow model is adopted to study the proposed method, which is evaluated using two DNN models: AlexNet and VGG-16. The analysis shows that the two-way streaming architecture achieves a more significant improvement in the runtime latency of a convolutional layer. Simulation results confirm the effectiveness of the proposed method, which achieves up to 1.8 times improvement in the runtime latency and up to 1.7 times improvement in the network power consumption. The hardware overhead of the proposed method is justifiable given the performance improvements achieved over the repetitive unicast method. Future work should include applying the proposed method to other dataflow models, as well as a thorough performance study of different dataflow models.
References
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, 2015, vol. 521, pp. 436–444. https://doi.org/10.1038/nature14539
- C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “DeepDriving: learning affordance for direct perception in autonomous driving,” in Proc. IEEE Int’l Conf. on Computer Vision (ICCV), Santiago, 2015, pp. 2722–2730.
- A. Esteva, B. Kuprel, R. Novoa, et al., “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, 2017, vol. 542, pp. 115–118. https://doi.org/10.1038/nature21056
- A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” CoRR, vol. abs/1404.5997, 2014. [Online]. Available: http://arxiv.org/abs/1404.5997
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.
- V. Sze, Y. Chen, J. Emer, A. Suleiman, and Z. Zhang, “Hardware for machine learning: challenges and opportunities,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), Austin, TX, 2017, pp. 1–8.
- W. J. Dally and B. Towles, “Route packets, not wires: on-chip interconnection networks,” in Proc. 38th DAC, 2001, pp. 684–689.
- H. Kwon, A. Samajdar, and T. Krishna, “MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects,” in Proc. ASPLOS, 2018, pp. 461–475.
- S. Carrillo et al., “Scalable hierarchical network-on-chip architecture for spiking neural network hardware implementations,” IEEE Trans. Parallel and Distributed Systems, 2013, vol. 24, pp. 2451–2461.
- B. Bohnenstiehl et al., “KiloCore: a 32-nm 1000-processor computational array,” IEEE J. of Solid-State Circuits, 2017, vol. 52, no. 4, pp. 891–902.
- A. Touzene, “On all-to-all broadcast in dense gaussian network-on-chip,” IEEE Trans. on Parallel and Distributed Systems, 2015, vol. 26, no. 4, pp. 1085–1095.
- B. Tiwari, M. Yang, Y. Jiang, and X. Wang, “Efficient on-chip multicast routing based on dynamic partition merging,” in Proc. 28th Euromicro Int’l Conf. on Parallel, Distributed and Network-Based Processing (PDP), Vasteras, Sweden, 2020, pp. 274–281.
- Cerebras Systems, “Wafer-scale deep learning,” in Proc. IEEE Hot Chips 31 Symp. (HCS), Cupertino, CA, 2019, pp. 1–31.
- D. Abts et al., “Think Fast: a Tensor Streaming Processor (TSP) for accelerating deep learning workloads,” in Proc. ACM/IEEE 47th Annual Int’l Symp. on Comp. Architecture (ISCA), Valencia, Spain, 2020, pp. 145–158.
- A. Karkar, T. Mak, K. Tong, and A. Yakovlev, “A survey of emerging interconnects for on-chip efficient multicast and broadcast in many-cores,” IEEE Circuits and Systems Magazine, 2016, vol. 16, no. 1, pp. 58–72.
- N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. 44th Int. Symp. Comp. Architecture (ISCA), 2017.
- Y. Chen, T. Yang, J. Emer, and V. Sze, “Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices,” IEEE J. on Emerging and Selected Topics in Circuits and Systems, 2019, vol. 9, no. 2, pp. 292–308.
- Z. Du et al., “ShiDianNao: shifting vision processing closer to the sensor,” in Proc. ACM/IEEE 42nd Annual Int’l Symp. on Comp. Architecture (ISCA), Portland, OR, 2015, pp. 92–104.
- H. Sharma et al., “From high-level deep neural models to FPGAs,” in Proc. 49th Annual IEEE/ACM Int’l Symp. on Microarchitecture (MICRO), Taipei, 2016.
- E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, “Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC,” in Proc. Int’l Conf. on Field-Programmable Technology (FPT), Xi’an, 2016.
- T. Chen et al., “DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proc. ASPLOS, 2014.
- T. Luo et al., “DaDianNao: a neural network supercomputer,” IEEE Trans. on Computers, 2017, vol. 66, no. 1, pp. 73–88.
- D. Vainbrand and R. Ginosar, “Network-on-chip architectures for neural networks,” in Proc. 4th Int’l Symp. Networks-on-Chip (NoCS), Grenoble, 2010, pp. 135–144.
- E. Painkras et al., “SpiNNaker: a 1-W 18-core system-on-chip for massively-parallel neural network simulation,” IEEE J. Solid-State Circuits, 2013, vol. 48, no. 8, pp. 1943–1953.
- S. Xiao et al., “NeuronLink: an efficient chip-to-chip interconnect for large-scale neural network accelerators,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 2020, vol. 28, no. 9, pp. 1966–1978.
- R. Hojabr, M. Modarressi, M. Daneshtalab, A. Yasoubi, and A. Khonsari, “Customizing Clos network-on-chip for neural networks,” IEEE Trans. on Computers, 2017, vol. 66, no. 11, pp. 1865–1877.
- B. Tiwari et al., “Improving the performance of a NoC-based CNN accelerator with gather support,” in Proc. IEEE 33rd Intl. System-on-Chip Conf. (SOCC), 2020.
- X. Liu et al., “Neu-NoC: a high-efficient interconnection network for accelerated neuromorphic systems,” in Proc. 23rd ASP-DAC, 2018, pp. 141–146.
- A. Firuzan, M. Modarressi, M. Daneshtalab, and M. Reshadi, “Reconfigurable network-on-chip for 3D neural network accelerators,” in Proc. 12th Int’l Symp. Networks-on-Chip (NoCS), 2018, pp. 1–8.
- H. Kwon, A. Samajdar, and T. Krishna, “Rethinking NoCs for spatial neural network accelerators,” in Proc. 11th Int’l Symp. Networks-on-Chip (NoCS), 2017.
- S. Mirmahaleh et al., “Flow mapping and data distribution on mesh-based deep learning accelerator,” in Proc. 13th Int’l Symp. Networks-on-Chip (NoCS), 2019.
- Y. Wang et al., “A many-core accelerator design for on-chip deep reinforcement learning,” in Proc. 39th ICCAD, 2020, pp. 1–7.
- N. E. Jerger, L. Peh, and M. Lipasti, “Virtual circuit tree multicasting: a case for on-chip hardware multicast support,” in Proc. Int’l Symp. on Comp. Architecture (ISCA), Beijing, 2008, pp. 229–240.
- W. Dally and B. Towles, “Principles and practices of interconnection networks,” Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
- B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” in Proc. Symp. VLSI, 2016, pp. 1–2.
- H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in Proc. 45th Annual IEEE/ACM Int’l Symp. on Microarchitecture (MICRO), Vancouver, BC, Canada, 2012, pp. 449–460.
- A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. 33rd Conf. on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019, pp. 8024–8035.
- X. Wang, T. Mak, M. Yang, Y. Jiang, et al., “On self-tuning networks-on-chip for dynamic network-flow dominance adaptation,” in Proc. 7th Int’l Symp. Networks-on-Chip (NoCS), 2013, pp. 1–8.
- A. B. Kahng, B. Lin, and S. Nath, “ORION3.0: a comprehensive NoC router estimation tool,” IEEE Embedded Systems Letters, 2015, vol. 7, no. 2, pp. 41–45.
- C. Sun et al., “DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling,” in Proc. 6th Int’l Symp. Networks-on-Chip (NoCS), 2012, pp. 201–210.