Growing interest in machine learning (ML) is driving its fusion with traditional HPC approaches liu2016application ; kurth2017deep as well as with Cloud-based services (known as machine learning as a service (MLaaS); see https://azure.microsoft.com/en-us/services/machine-learning-studio/ and https://cloud.google.com/products/machine-learning/). Traditionally, HPC applications are either compute- or communication-centric. However, there is no easy way to categorise the traffic generated by the applications in the interconnect subsystem of CMPs Barrow2009 . At run-time, threads communicate at different levels. The flow of data they generate depends on several architectural factors (e.g., the type and topology of the interconnection, the presence of distributed memory banks, the number of levels in the cache hierarchy). With the growing adoption of large multithreaded applications, the huge amount of data exchanged among the processing elements (PEs) has started to push the limits of traditional interconnections. For instance, deep learning Lecun2015deep (DL – essentially a subset of ML) algorithms are characterised by huge inherent parallelism. Each layer of artificial neurons has to process a large amount of input data following a dataflow style. The way this computation is performed generally translates into a massive number of parallel threads that can quickly stress the interconnection. To achieve higher performance, data exchange inside the interconnect needs to be optimised. Furthermore, bandwidth usage, latency and energy cost are the primary performance-related concerns, and they must be addressed while taking into account emerging physical limitations (e.g., the way the heat is removed from the chip). In fact, inefficient interconnects may reduce the overall system performance, while consuming a significant portion of the area and power budget of the chip Hoskote20075 . Past research works primarily focused on improving the PEs’ micro-architecture (i.e., compute element with some local storage), while only a little research has been carried out on the architecture of the interconnections. In fact, most of the hardware accelerators use specialised buses, mesh-based interconnects, or crossbars akopyan2015truenorth ; chen2017eyeriss ; chen2014diannao ; du2015shidiannao . Although mesh-based interconnections offer a good trade-off between performance, power consumption and available bandwidth, their scalability to a large core count (i.e., as also required by DL accelerators) remains limited.
Moving from multicore to manycore Bohnenstiehl2016 , the probability of resource contention dramatically increases. Thus, the amount of conflict-free resource sharing inside the chip should be maximised to reduce power cost and also to improve the performance. Although scalable interconnections offer a significant amount of (shared) communication resources to all PEs, increasing the number of connected nodes to hundreds (or even thousands) leaves resource contention an open issue. Further, contention may quickly defeat the advantages of the substantial parallelism provided by the higher core count. To unleash the full capability of today’s and future CMPs, as well as to accelerate DL/ML applications, efficient interconnects must offer high bandwidth, low latency, (possibly) memory coherency support, and also better I/O integration. To this end, Networks-on-Chip (NoCs) built around effective topologies have proven to be a possible solution for implementing massively parallel processors (thanks to their advantages over other alternatives regarding wiring area and power cost Lee2007chip ; Bolotin2004 ). For a small number of PEs, the ring topology has proven very useful, requiring only low-radix routers. Interestingly, a ring topology can outperform a mesh topology for workloads exhibiting moderate to high memory access locality Ravindran1997p . Rings have also been used in commercial systems (e.g., the Intel Xeon Phi co-processor uses a dual ring topology). The main advantage of the ring topology is the low latency with which data packets reach their destinations. However, rings also suffer from low bandwidth when connecting a large number of nodes, and soon become difficult to scale to hundreds of cores (due to their limited bisection bandwidth). Efforts have been made to connect local and global rings via “bridge routers” to improve scalability together with performance and energy consumption Ausavarungnirun2016case ; Hamacher2001h ; Vranesic1995n .
On the other hand, due to its scalability and high level of fault tolerance, the 2D-mesh topology has become very popular for implementing the interconnect (e.g., TILE Bruce2007 , Polaris chip Hoskote20075 ). However, the 2D-mesh topology suffers from space and power trade-offs when connecting a vast number of PEs Hoskote20075 ; Vangal20078 ; Balfour2006design , which our experiments also confirm. To overcome these issues, proposals combining 2D-mesh with rings have been presented Bourduas2007h ; Ausavarungnirun2016case . Recently, power efficiency has also become a major concern for NoC designs connecting several hundreds of cores inside the chip. A conventional 2D-mesh router using an internal crossbar switch can consume the largest portion of the power budget Hoskote20075 . For instance, the NoC for the MIT Raw processor can consume up to 36% of total system power Wang2003pwr , while Vangal et al. showed that on the Intel TeraFLOPS chip, the NoC uses up to 28% of tile power Vangal20078 . Other experiments have shown that for a large core count (i.e., a 256-core CMP) conventional 2D-mesh NoCs consume up to 45% of the total energy Harting2012 .
To tackle the above-mentioned issues and to provide a more scalable, high-performance communication medium for very large tiled CMPs (i.e., CMPs equipped with hundreds or even thousands of PEs), in this work we propose a hybrid on-chip interconnect targeting kilocore-oriented CMPs (up to 1024 PEs in our experiments, similar to Zheng2015c ; Grot2011kilo ). Our target application domain is generic. However, the proposed solution is optimised to support emerging applications in the ML/DL domain well. To this end, the data packet structure along with the micro-architecture of the routers/switches have been tailored to provide a good trade-off between efficiency, performance and flexibility. In fact, while general purpose applications still rely on support for standard floating point arithmetic, DL algorithms tend to adopt more compact data types as well as customised ones. Our solution leverages a highly efficient hybrid topology with the aim of providing high performance along with low area cost and better power consumption. Unlike other hybrid solutions, in our work, ring and 2D-mesh NoCs are combined without using any bridge router. This hybrid approach can exploit the principle of computing and data locality exhibited by massively multithreaded applications to confine the traffic mostly inside the rings for better traffic management Benson2010 ; Kandula2009 (not shown as it is out of the scope of this paper). Also, using efficient run-time systems (e.g., the Codelet model Suettlerlein2013 ) can offer the opportunity to exploit such locality easily. Our proposed architecture provides a customised router architecture which processes the traffic and also bypasses it when necessary to improve latency. In general, the ring topology provides contention-free traffic without consuming a significant amount of power, thanks to its simple architecture. On the other hand, the 2D-mesh topology has a large bisection bandwidth but suffers from a large diameter. Hence, our approach is to use the 2D-mesh interconnect for high-speed data transfer between distant PEs, while exploiting rings for data exchange between PEs that are close to each other.
The proposed NoC design could be a good candidate for serving communications in modern accelerators as it is agnostic to the specific PE’s architecture (we can assume each PE is implementing an in-order execution, and exposing a small local scratchpad memory). Such architecture can be reflected and mapped on modern reconfigurable hardware devices (FPGAs), which may offer “enough” resources to implement complex computing systems equipped with hundreds of specialised cores (e.g., coarse grain reconfigurable architectures – CGRA). Saving hardware resources used by interconnecting logic dramatically contributes to the reduction of overall power and area cost. The efficiency of the proposed NoC architecture allows using a minimal amount of hardware resources, which contributes to energy saving and also can help to reduce the area cost. Finally, by extending the interconnect capabilities to adapt to the application requirements dynamically, it is possible for the proposed design to improve performance further, while still reducing power consumption.
In this paper, our contributions can be summarised as follows. i) We detail our scalable ring-mesh based NoC design and also evaluate our hybrid topology by implementing it on FPGAs. ii) We show how our approach can scale well (from 16 to 1024 connected PEs in our simulations) and also outperform the standard 2D-mesh implementation in terms of latency, throughput and power efficiency. iii) Finally, we also discuss how our approach can benefit frameworks (such as MapReduce Dean2008m ) whose requirements for allocated resources (mainly PEs) can vary dynamically.
2 Related Work
In past years, NoCs have received much attention from the research community. Whereas some works focused on proposing low-latency routers (e.g., Kumar2007ex ; Hoskote20075 ) and power-efficient microarchitectures (e.g., Wang2003pwr ; Moraes2004h ), other researchers focused on proposing different topologies (e.g., Dragonfly Kim2008 , Flattened butterfly Kim2007flattened ). Other works tried to improve the trade-off between performance and power consumption through the introduction of hierarchical NoC topologies (e.g., Ausavarungnirun2016case ; Das2009d ; Bourduas2007h ).
Ring and mesh-based approaches: HiRD is a hierarchical ring-based NoC design for improving energy efficiency, where buffers within individual rings are not used Ausavarungnirun2016case . It provides buffer support between different levels of the ring hierarchy, and upon the saturation of buffers, flits are deflected in the rings. It needs four levels of hierarchy to connect 256 PEs, while the connection among different levels is based on dedicated bridge routers. CSquare proposes a way of clustering routers so that clusters adopt an internal tree-like organisation Zheng2015c . It is a topology with clusters forming a global parallel-oriented structure to provide high scalability. The authors also showed that the proposed topological design improves throughput while lowering the average latency over mesh-like topologies (under the uniform traffic pattern). Transportation network inspired NoC (tNoC) is another proposed hierarchical ring topology Kim2014tran . It employs a hybrid packet-flit, credit-based flow control mechanism for better scalability, as well as priority-based arbitration for achieving better performance. tNoC allocates channels with a flit granularity, while buffers are allocated with a packet granularity to reduce buffer counts. Koohi et al. proposed 2D-HERT, a two-dimensional hierarchical expansion of a ring topology focusing on optical NoCs Koohi2011a . Kilo-NoC is a topology-aware QoS-oriented architecture, adopting a low-diameter topology Grot2011kilo . It provides a service guarantee for applications with reduced power and area costs. It reduces the extent of hardware support to portions of the die, which in turn reduces router complexity to support large core counts. In Bourduas2007h , the authors present a hybrid architecture where a large 2D-mesh is partitioned into several smaller sub-meshes. The sub-meshes are then connected using a hierarchical ring interconnect for delivering global traffic. In that work, a bridge module is used for driving traffic to the different levels of the hierarchy. The addressing and routing scheme has also been modified to support the proposed topology.
Other approaches: In Das2009d , a two-tier hierarchical topology consisting of local networks managed through a bus, and a global network controlled by a low-radix mesh router has been proposed. The authors showed that the proposed topology could reduce the latency, power consumption and energy-delay product only for localised communication-based applications. Apart from hybrid ring-mesh or bus-mesh topologies, concentrated mesh (CMesh) is a modified mesh architecture with replicated sub-networks where express channels are used to incorporate the second network without increasing the die area and wire length Balfour2006design . This approach aims at reducing the hop count and load imbalance. Here, the channel length is kept short to reduce energy dissipation, while express channels are used to improve energy efficiency. In kwon2017rethinking , the authors proposed an efficient NoC design tailored for spatial neural network (NN) accelerators, based on an array of reconfigurable micro-switches. Such micro-switches can be configured cycle-by-cycle (dynamically) to provide fast communication paths from PEs to global memory. Similar to our proposed solution, the adoption of micro-switches with a simple micro-architecture keeps power consumption very low. Although this solution assumes a generic configuration for the PE micro-architecture, the design is optimised for a specific class of algorithms (NNs), thus limiting its applicability to other ML algorithms or even to applications belonging to different domains.
Here, we propose a hybrid topology based on local rings globally connected through a 2D-mesh. The primary insight behind choosing ring and 2D-mesh for this proposed hybrid topology is that rings can outperform other interconnects regarding performance, power savings and area cost when the number of PEs is low. Furthermore, rings keep such features over different traffic patterns, as generated by applications belonging to various domains (e.g., multimedia, ML/DL, scientific simulations). On the contrary, with a relatively high number of connected nodes, the 2D-mesh topology still provides the best trade-off between performance and scalability. Hence, in the proposed solution, the global mesh is intended to route the “global traffic” between distant PEs, while rings offer better performance and energy efficiency in managing “local” traffic among cores located close to each other. This feature allows our architecture to scale with the increased core count. We modified the architecture of the mesh router together with the ring packet transfer mechanism. We avoid the use of bridge routers to connect the local and global networks, so our approach does not require further hierarchical levels to connect larger networks (thus contributing to power and area savings). We compared the performance and the power-area costs of our solution with a flattened 2D-mesh topology since it represents the most adopted NoC topology akopyan2015truenorth ; chen2014diannao ; du2015shidiannao .
3 System Overview
On-chip packet-switched interconnects (i.e., NoCs) provide the physical substrate used by PEs to communicate with each other and also with the primary memory. Besides, NoCs can be extended to support off-chip communications, thus easing the creation of multi-chip computing systems which further extend the design scalability. The main reason behind developing hybrid topologies is to exploit the fact that, under an optimal resource allocation strategy, most of the communication in a parallel application involves a group of resources (i.e., PEs and routers) that are close to each other Das2009d . To this end, efficient Program eXecution Models (PXMs) specifically designed to take advantage of locality of computation can be used to reduce the communication overheads and maximise the efficiency of the system Suettlerlein2013 . Hence, the optimisation of the local communication may lead to a substantial improvement regarding data-packet latency, throughput and energy efficiency.
Our design is based on the observation of how traffic moves on motorways and takes exits. Once the traffic exits the main motorway, it is injected into small roads where simple traffic management decisions need to be taken. Conversely, more complex decisions and management policies are needed at the level of the global interconnect (main motorways). Similarly, our design provides two levels of communication: global traffic is managed by the customised 2D-mesh routers, while local traffic is injected into (or ejected from) small rings. Traffic travelling in these two levels of the hierarchy is decoupled and processed differently to improve the system throughput and reduce the communication latency (similar to Ausavarungnirun2016case ; Udipi2010tow ; Das2009d ). In general, mesh routers are more complicated than ring switching stations (i.e., switch modules are responsible for driving the traffic in the ring or ejecting it towards the mesh). The proposed architecture achieves high levels of performance and efficiency by exploiting the fact that the majority of the traffic remains restricted to the rings. Such restriction can be adequately provided by a dedicated run-time or additional hardware support (i.e., fine-grain threads are grouped to form a task that can be forced to execute on a group of PEs connected by a ring) 7911285 . Moreover, deep neural networks offer a natural and inherent way of exploiting data and computation locality. Restricting the traffic to local communication also allows the NoC to exploit more efficient and scalable communication mechanisms (e.g., rings, compared to a traditional flattened 2D-mesh).
The proposed architecture is aimed at connecting up to 1024 PEs (each supporting in-order execution and having a local scratchpad memory block) and uses the deadlock-free X-Y dimension order routing (XY-DoR) algorithm. Figure 1 depicts the main architecture of our reference CMP. A group of four PEs is locally connected through a small ring called a ringlet. Within a ringlet, one of the PEs is designated as the master core. It is responsible for injecting/ejecting traffic towards the global traffic channels. To this end, a link between the ring switch and a mesh router is enabled along with dedicated buffers. The advantage of this architecture is the absence of a dedicated bridge component to connect the mesh and the ringlets. The proposed NoC is divided into blocks formed by a group of four ringlets. Such ringlets are directly linked to the mesh router, which is responsible for moving traffic outside the block. For instance, to support 256 cores, we need 16 modified mesh routers and 64 ringlets. Multiple blocks are globally connected through a 2D-mesh topology. Each router is equipped with a high-performance internal crossbar switch to support smooth traffic transfer (in and out). The smart packet processing implemented in the mesh routers allows decoupling of global and local traffic. Every time the destination of a packet is outside the local block, the packet is forwarded to another mesh router, thus bypassing PEs in the ringlets, and minimising the overall latency.
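The block arithmetic above (four PEs per ringlet, four ringlets per mesh router) can be checked with a few lines of Python. This is only an illustrative sketch: the function name `fabric_size` and the dictionary keys are hypothetical and do not come from the paper.

```python
PES_PER_RINGLET = 4      # from the text: four PEs form a ringlet
RINGLETS_PER_BLOCK = 4   # from the text: four ringlets attach to one mesh router


def fabric_size(num_pes: int) -> dict:
    """Return how many ringlets and mesh routers are needed for num_pes PEs.

    Hypothetical helper: it only reproduces the counting rule stated in the text.
    """
    pes_per_block = PES_PER_RINGLET * RINGLETS_PER_BLOCK  # 16 PEs per block
    if num_pes <= 0 or num_pes % pes_per_block != 0:
        raise ValueError("num_pes must be a positive multiple of 16")
    return {
        "ringlets": num_pes // PES_PER_RINGLET,
        "mesh_routers": num_pes // pes_per_block,
    }


print(fabric_size(256))   # the 256-core example from the text
print(fabric_size(1024))  # the largest configuration considered
```

For 256 PEs this reproduces the 64 ringlets and 16 mesh routers quoted above; for 1024 PEs the same rule gives 256 ringlets and 64 mesh routers.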
4 Proposed Network-on-Chip Architecture
In this section, we describe the main components of the proposed NoC architecture. Specifically, we provide details regarding the internal organisation of the mesh router and the ring switch (master core).
4.1 Modified 2D-mesh router
Figure 2 depicts the internal organisation of the mesh router and Table 1 provides its main micro-architectural characteristics. The router employs a crossbar switch to support: i) global traffic movement in both dimensions (i.e., North-South and East-West), ii) traffic exchange with local ringlets. Four channels are used for driving global traffic within the 2D-mesh network. The other four input channels are used to steer local traffic to/from the ringlets. Each ringlet is associated with a dedicated channel so that the traffic exchange with the master ring switch happens through this dedicated link. In general, routers can have a significant number of virtual channels (VCs) to hold a large amount of incoming traffic, while VCs are also used to avoid deadlocks. In Figure 2, the path taken by control information carried by the packet headers is highlighted in red, and the control signals activated by the internal router stages are shown with blue lines.
However, large buffer requirements and quality of service (QoS) overheads reduce the ability to support a high number of cores with efficient area and energy usage Grot2011kilo . In our proposed design, to further save energy and area, the output channels do not have any VCs because there is less resource contention on the output channels. Furthermore, a large number of VCs consumes a large share of energy and also leads to higher input buffer counts for traffic management. It is worth mentioning that buffers are one of the largest leakage power sources in the router. Their power consumption can represent up to 64% of the total router’s leakage power Chen2003leakage (sometimes it comprises up to 74% of the total NoC power budget Sun2012dsent ), and also a significant amount of dynamic power Wang2003pwr . In experiments using the PARSEC benchmark (http://parsec.cs.princeton.edu/), it has been found that single-flit packets represent a large segment of the network traffic for real applications Ma2012whole . In this work, the proposed mesh router also supports a single-flit packet (total length of 43 bits) with 32 bits of data (similar to Hoskote20075 ), while the remaining 11 bits are devoted to carrying header information. The small data packet length represents the bit-stream kind of data transmission, and it also reflects traffic for applications that require lower-precision arithmetic and data formats (e.g., artificial NNs use low-precision data types such as 16-bit floating point, 8-bit integer, or even customised ones to represent internal input/output values and the weights). Also, the size of the packet has been chosen by taking into account that increasing the packet size leads to a quadratic increase in the internal crossbar switch overhead. Thus we keep the packet size as small as possible Lee2013we , while still supporting a wide range of applications.
Table 1: Main micro-architectural characteristics of the modified mesh router.
|Parameter|Value|
|---|---|
|No. of input and output ports|8 each (4 ringlets, 4 mesh)|
|Width of each port|32 bits (payload) + 11 bits (header)|
|No. of virtual channels|2 per input port|
|Packet switching|Store-and-Forward (SAF)|
|Switch allocator arbitration|Weighted round-robin|
|Packet routing|X-Y dimension order routing|
|Router pipeline stages|4 stages|
|Latency|1 cycle (speculation)|
The router is internally organised into a standard four-stage pipeline: routing stage, flow-control stage, VC allocation stage, and switch allocation stage. However, to reduce the latency of packets during router traversal, the proposed design can perform pipeline operations in parallel. The entire packet transfer can thus be restricted to a single cycle, thanks to the optimisation of the routing logic for processing single-flit packets with a reduced overall size. These design choices lead to a router architecture with a latency of one cycle. Our routing mechanism is based on the XY-DoR algorithm since it provides a simple implementation with a deterministic routing latency. By decoupling the traffic between local ringlets and the mesh, and by exploiting data-computation locality, the probability of congestion in the 2D-mesh is significantly reduced. Thus, the need for an adaptive algorithm (e.g., hot potato routing Baran1964dis , also known as “deflection routing”) disappears.
Primarily, we fused the routing logic with the flow-control module. We implemented a speculative allocation technique for both the VC allocation stage (VCA) and the switch allocator stage (SA). If the pre-arbitration fails, the packet is buffered while VCA and SA arbitration are performed sequentially. In that case, the latency increases up to four cycles. The timing of the proposed mesh router in case of pre-arbitration success is shown in Figure 3 (a), while the case of failure is shown in Figure 3 (b). Whenever a packet enters the router, the following operations are performed:
Routing/Flow control module (RF) extracts the packet header and processes the information to determine the destination router. If the packet destination is within one of the four ringlets belonging to the block, the RF module selects the corresponding output channel, reducing the latency of the VCA and SA module. A control signal is then used to drive the input multiplexer (MUX) at the input port. In this phase, speculative operations are performed to pre-allocate channels.
VC allocator module (VCA) is responsible for allocating buffer resources for incoming packets by selecting one of the VCs. An allocation request signal (i.e., req) is set, and if the selected VC has space to buffer the incoming packet, an acknowledge signal (i.e., ack) is also set. In that case, the selected VC is also signalled both to the RF module and the SA module.
Switch allocator module (SA) performs two steps of arbitration. First, multiple VCs in each input port are arbitrated to select one of the available VCs. Next, each one of the selected VCs is routed to the selected output port.
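The effect of speculation on router traversal time can be illustrated with a small latency model. It is a rough sketch under stated assumptions: the one-cycle and four-cycle figures are taken from the text (four cycles being the stated upper bound on failure), while the function names and the idea of a single success probability `p_success` are hypothetical simplifications.

```python
FAST_PATH_CYCLES = 1  # pre-arbitration succeeds (from the text)
SLOW_PATH_CYCLES = 4  # pre-arbitration fails; upper bound quoted in the text


def router_latency(speculation_ok: bool) -> int:
    """Cycles spent in the mesh router for one single-flit packet."""
    return FAST_PATH_CYCLES if speculation_ok else SLOW_PATH_CYCLES


def mean_router_latency(p_success: float) -> float:
    """Average traversal latency, assuming each packet speculates
    successfully with independent probability p_success (an assumption)."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    return p_success * FAST_PATH_CYCLES + (1.0 - p_success) * SLOW_PATH_CYCLES


print(mean_router_latency(0.5))  # half of the packets take the slow path
```

Under this simplified model the average latency moves linearly between one and four cycles, which is why keeping contention low (and speculation successful) matters for the single-cycle claim.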
4.2 Ring switch
A bidirectional ring is implemented on top of the structure of a ring switch (RS) (see Figure 4 for the microarchitecture of the RS) to achieve better performance while keeping power consumption low. The RS is responsible for driving the traffic within the ring, and also for steering it towards the mesh router. The ringlet uses a simple policy to check and forward the flits to the next PE based on the header information. The header indicates the destination ring and the PE responsible for the extraction of the packet from the NoC (see Section 4.3). In addition, a simple modification is made to the existing ring routing policy. In a single ringlet, the four PEs are numbered 00, 01, 10 and 11. Packets destined for PEs 00 and 01 are held in VC-0, while VC-1 holds the rest. This simple policy helps to route the packets faster. The RS is composed of two main multiplexers (MX1 and MX2) which manage the traffic within the ring. Compared to conventional RSs, we customised the proposed microarchitecture by incorporating buffers (similar to VCs) to allow the ring to steer the traffic to/from the mesh router.
To avoid complex control logic, the RS prioritises the traffic travelling in the same dimension (i.e., traffic that remains within the ring and moves in the same direction). Prioritisation also helps to reduce the size of internal buffers (buffers Buf-1 and Buf-2, see Figure 4). In particular, one of the two directions is selected as high priority, and thus moves first in the RS. The prioritisation mechanism is implemented directly in the control logic of the input-output multiplexers (MX1 and MX2).
The main drawback of traffic prioritisation is the starvation of low-priority traffic, which may be unable to access the output link and may potentially wait indefinitely without winning link arbitration. To avoid such a situation, we allow traffic coming from low-priority buffers to be injected in the onward link after a fixed number of elapsed cycles. This mechanism is easily implemented as a slightly modified round-robin selection strategy, where moving from one selected input to another is weighted by the priority of the input. The interface with the local PE is implemented using a dedicated buffer (i.e., Buf-3) which is written by the PE (i.e., the PE injects traffic in the ring) and read by the RS. The buffer is accessible by the PE within its address space. Similarly, traffic that is ejected by the ring is collected temporarily in a local output buffer, from where the PE can extract the payload. The interface with the mesh router is implemented similarly: traffic injected in the mesh is stored temporarily in a small buffer, from where it is transferred to the input link of the mesh router. Traffic ejected from the mesh router is moved into a VC buffer. Two VCs are implemented on this interface to better handle resource contention when the mesh router tries to access the RS. From this viewpoint, the RS implements a weighted round-robin selection strategy between the two VCs to keep the control logic simple. Such an allocation strategy also avoids buffer exhaustion and traffic starvation. Similarly to the PE interface, traffic injection in the router has higher priority, since this reduces pressure on the ring buffers. When packets move within the ring or between a ring and the mesh router, the following steps are performed by RSs:
The multiplexer of each input port (i.e., MX1, MX2, MX3 and MX4 – see Figure 4) determines the destination based on the packet’s header information and also based on the arbitration.
Packets travelling in the ring (i.e., horizontal dimension) have higher priority compared to packets coming from the PE or the mesh router. Thus, packets are moved first from the input port to the output port with a minimal delay. The employed arbitration strategy also ensures that packets already in the main ring traffic flow are quickly routed to prevent network bandwidth saturation. To enable the transfer, the RS sets the request signal of the next switch in the ring (following the moving direction of the packets) and waits for the acknowledge signal to be set by the peer switch.
The two available VC buffers are used to temporarily store the packets when the master RS receives a request from the mesh router to inject packets in the ring. If there is room in the selected VC buffer, the RS enables the corresponding acknowledge signal of the mesh router. The buffers take turns sending out packets via a round-robin arbiter to ensure fairness.
It is worth noting that RS modules which are not connected to the mesh router have the same structure depicted in Figure 4, except for the mesh router interface, which is removed to minimise the amount of resources used by routing structures and to save area and power. To reduce the pressure on the mesh routers due to the four rings sending packets to each other, a priority rule is applied: the modified mesh router processes the rings’ traffic first. This approach prevents the mesh router from becoming a bottleneck in the communication network. Also, more VCs can be added to the ring, but at the expense of power and hardware cost. Given these features, completing a transaction (i.e., reading data from, or writing data to, a remote PE scratchpad) on a fabric block requires no more than 12 cycles.
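The starvation-avoidance rule used by the ring switch (high-priority ring traffic goes first, but low-priority traffic is granted the link after a fixed number of waiting cycles) can be sketched as a small behavioural model. This is an illustration only, not the RTL of the RS: the class name `PriorityArbiter`, the two-queue structure and the `max_wait` parameter are assumptions introduced here, and the text does not give the actual waiting threshold.

```python
from collections import deque


class PriorityArbiter:
    """Behavioural sketch of a two-input arbiter with starvation avoidance."""

    def __init__(self, max_wait: int = 4):
        self.max_wait = max_wait  # hypothetical threshold, not from the paper
        self.waited = 0           # cycles the low-priority head has been blocked
        self.high = deque()       # ring traffic moving in the prioritised direction
        self.low = deque()        # traffic from the PE, mesh router or other direction

    def step(self):
        """Grant the output link for one cycle; return the flit sent, or None."""
        low_starving = bool(self.low) and self.waited >= self.max_wait
        if self.high and not low_starving:
            if self.low:
                self.waited += 1  # low-priority traffic lost this arbitration
            return self.high.popleft()
        if self.low:
            self.waited = 0       # low-priority traffic wins; reset its counter
            return self.low.popleft()
        return None


arb = PriorityArbiter(max_wait=2)
arb.high.extend(["h0", "h1", "h2", "h3"])
arb.low.append("l0")
print([arb.step() for _ in range(6)])
```

With `max_wait=2`, the low-priority flit is forced through after losing two arbitrations, so it can never wait indefinitely however much high-priority traffic is queued.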
4.3 Data Packet Structure and Control Flow
The design has been optimised for single-flit packets, and for higher scalability (up to 1024 PEs). Flits have a length of 43 bits. They are formed by an 11-bit header and a 32-bit payload. The header structure allows flits to be routed hierarchically, within the global 2D-mesh and within the ringlets of a block. We have incorporated the XY-DoR protocol in our design. Here, each flit is moved along the X dimension first, and then along the Y dimension, to reach the destination. To support XY-DoR, the flit header contains two fields, each 3 bits long. In this way, a regular mesh of up to $8 \times 8$ routers can be created. Within a physical block (i.e., a router and four ringlets), the ringlets are numbered progressively from 0 (the top-left ringlet) to 3 (bottom-right). Thus, a 2-bit field is used to select the destination ringlet. Similarly, PEs within a ringlet are numbered progressively from 0 to 3. Another 2-bit field is used to select the final destination. Finally, a 1-bit field is used to control the virtual channel assignment. The source router assigns the VC. To separate traffic generated by applications running on the chip from control traffic (i.e., configuration packets), the latter is assigned to VC-0 by default (see Section 7 – whenever the starting flit of the configuration packet sequence is detected the VC arbitration will select VC-0 for the subsequent flit). This design choice also contributes to simplifying routing and control logic. Figure 5 shows the structure of the single-flit packet used by our proposed NoC architecture.
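As a concrete illustration of the flit layout described above, the following sketch packs and unpacks a 43-bit flit. The field widths (3 + 3 + 2 + 2 + 1 header bits and a 32-bit payload) come from the text; the ordering of the fields inside the word, the field names and the helpers `pack`/`unpack` are assumptions made for this example, since the exact bit positions are defined by Figure 5.

```python
from typing import NamedTuple


class Flit(NamedTuple):
    x: int        # destination router, X coordinate (3 bits)
    y: int        # destination router, Y coordinate (3 bits)
    ringlet: int  # destination ringlet inside the block (2 bits)
    pe: int       # destination PE inside the ringlet (2 bits)
    vc: int       # virtual channel selector (1 bit)
    payload: int  # data (32 bits)


# Field order (most significant first) is an assumption, not taken from Figure 5.
_FIELDS = (("x", 3), ("y", 3), ("ringlet", 2), ("pe", 2), ("vc", 1), ("payload", 32))
FLIT_BITS = sum(width for _, width in _FIELDS)  # 11 header bits + 32 payload bits


def pack(flit: Flit) -> int:
    """Encode a flit as a 43-bit integer."""
    word = 0
    for (name, width), value in zip(_FIELDS, flit):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> Flit:
    """Decode a 43-bit integer back into its fields."""
    values = []
    for _, width in reversed(_FIELDS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return Flit(*reversed(values))


flit = Flit(x=5, y=2, ringlet=1, pe=3, vc=0, payload=0xDEADBEEF)
assert unpack(pack(flit)) == flit
print(FLIT_BITS, hex(pack(flit)))
```

The two 3-bit coordinates address 8 × 8 = 64 routers, and each router serves 4 ringlets of 4 PEs, which matches the 1024-PE upper bound stated in the text.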
The adoption of single-flit packets also allows us to simplify the flow-control mechanism. The logic of the routers and of the ring switches manages the control signals used to enable packet transfer. This logic uses a back-pressure mechanism: every time a router/ring switch has to send a flit, it raises a request signal towards the selected next hop. If the selected next hop (router or ring switch) has no free slot in its input queue to store the incoming flit, it resets the acknowledge signal, which temporarily stops the transmission of flits (packets) at the source node. The complexity of this logic is low, yet it ensures correctness and deadlock avoidance.
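The request/acknowledge handshake can be modelled in a few lines. This is only a behavioural sketch under stated assumptions: the queue depth is a free parameter (it is not specified in this section), and the names `InputQueue` and `try_send` are hypothetical, not taken from the authors' RTL.

```python
from collections import deque


class InputQueue:
    """Input buffer of the next hop (a router or a ring switch)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth      # assumed parameter: number of flit slots
        self.slots = deque()

    def acknowledge(self) -> bool:
        # The acknowledge signal is set only while a free slot exists.
        return len(self.slots) < self.depth

    def accept(self, flit: int) -> None:
        self.slots.append(flit)

    def drain(self) -> int:
        # The next hop forwards its oldest flit, freeing one slot.
        return self.slots.popleft()


def try_send(flit: int, next_hop: InputQueue) -> bool:
    """Raise a request; the flit moves only if the next hop acknowledges."""
    if next_hop.acknowledge():
        next_hop.accept(flit)
        return True
    return False  # back pressure: the sender keeps the flit and retries
```

A sender that gets `False` simply holds the flit and repeats the request in a later cycle, which is the temporary stop of transmission described above.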
The data traffic is steered by processing the packet header according to the following cases:
Traffic confined in the ring: the block identifier sub-fields are reset, while ring and PE identifiers are set.
Traffic injected/ejected to/from the 2D-mesh: all sub-fields are set while the bypass logic is disabled.
Traffic confined in the 2D-mesh: the bypass logic is enabled.
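The interplay between XY-DoR on the global mesh and delivery to a ringlet can be sketched as follows. This is an illustrative model, not the authors' router logic: it decides by comparing the destination coordinates with the coordinates of the current router (rather than by testing whether header sub-fields are reset), and the port names and the convention that X grows eastwards and Y grows southwards are assumptions.

```python
from typing import List, Tuple


def next_hop(cur: Tuple[int, int], dst: Tuple[int, int], dst_ringlet: int) -> str:
    """Output port chosen by the mesh router at `cur` for a flit going to `dst`."""
    if dst[0] != cur[0]:                       # X dimension first
        return "east" if dst[0] > cur[0] else "west"
    if dst[1] != cur[1]:                       # then Y dimension
        return "south" if dst[1] > cur[1] else "north"
    return f"ringlet{dst_ringlet}"             # destination block reached


def route(src: Tuple[int, int], dst: Tuple[int, int], dst_ringlet: int) -> List[str]:
    """List of output ports taken from the source router to the destination ringlet."""
    step = {"east": (1, 0), "west": (-1, 0), "south": (0, 1), "north": (0, -1)}
    x, y = src
    ports = []
    while True:
        port = next_hop((x, y), dst, dst_ringlet)
        ports.append(port)
        if port not in step:                   # ejected towards a ringlet
            return ports
        x, y = x + step[port][0], y + step[port][1]
```

For example, `route((0, 0), (2, 1), 3)` moves two hops east, one hop south, and then leaves the mesh towards ringlet 3.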
5 Network-on-Chip Characterization
The proposed hierarchical topology is regular (at each level of the hierarchy) and presents some symmetry, which helps in deriving analytic performance metrics. Here, we consider the two most relevant metrics: the maximum distance, and the bisection bandwidth.
5.1 Maximum Distance
The maximum distance, or diameter ($D$), is defined as the longest of the shortest paths between all pairs of nodes. In a hierarchical topology, the diameter depends on the configuration of each topology level and is proportional to the number of hops to traverse. The diameter determines the worst-case distance. Here, the worst case is represented by the communication between nodes placed at the opposite corners of the global mesh. Such nodes reside in two distinct ringlets, for which the diameter is equal to two. This leads to the following formulation of the topology diameter:
$$D = N_v + N_h + 2 N_r$$
where $N_v$ is the number of links to traverse in the vertical dimension of the global 2D-mesh, $N_h$ is the number of links to traverse in the horizontal dimension of the global 2D-mesh, and $N_r$ is the number of links to traverse within each of the two ringlets (three links each). Specifically, since the ringlets are based on bidirectional rings, at most two links must be traversed to reach any node inside a ring, plus an additional link that connects the ring to the mesh router.
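As a worked example of this formulation, the sketch below computes the diameter for a rectangular global mesh. It assumes that the corner-to-corner path crosses one link fewer than the number of routers in each dimension, and it uses the three links per ringlet stated above; the function and parameter names are illustrative.

```python
def topology_diameter(mesh_cols: int, mesh_rows: int, ringlet_links: int = 3) -> int:
    """Worst-case number of links between two PEs at opposite mesh corners."""
    horizontal = mesh_cols - 1   # links crossed along the X dimension
    vertical = mesh_rows - 1     # links crossed along the Y dimension
    # Two ringlets are involved: one at the source and one at the destination.
    return vertical + horizontal + 2 * ringlet_links
```

Under these assumptions, an 8×8 mesh of routers (1024 PEs) has a diameter of 7 + 7 + 6 = 20 links, while a single block has a diameter of 6 links.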
5.2 Bisection Bandwidth
The bisection bandwidth ($B$) is defined as the bandwidth of the minimum cut that divides the network into two halves. It follows that $B$ is equal to the number of links comprised in the cut multiplied by the bandwidth offered by each link. Given our hierarchical topology, regardless of the specific configuration of the two levels, the least connected cut of the graph representing the network will always be along the global 2D-mesh. Therefore, the bisection bandwidth is given by:
$$B = N_c \cdot b$$
where $N_c$ is the number of global 2D-mesh links comprised in the cut and $b$ represents the bandwidth of each of those links. Similarly, it is possible to determine the bisection bandwidth associated with the other hierarchical levels. For a single block (i.e., a router with its four associated ringlets), the bisection bandwidth is given by half of the bandwidth offered by the internal router crossbar switch, while the bisection bandwidth of each ringlet is equal to that of a bidirectional ring. From the above analysis, the proposed topology preserves characteristics that are in line with those of well-known topologies (i.e., 2D-mesh and rings). Its main advantage resides in the simpler micro-architecture of the rings, which leaves room for exploring more aggressive clock frequencies without negatively impacting the overall power consumption. The proposed architecture provides a large benefit compared to the traditional interconnections adopted in manycore accelerators, along with the exploitation of data and computation locality.
We simulated the proposed architecture taking into account average network latency and average throughput. This section also presents the hardware area and power consumption analysis based on synthesis results of the NoC on an FPGA. FPGAs provide a means to quickly test the effectiveness of the proposed design. Specifically, we synthesised instances of the NoC on a Xilinx Virtex 7 FPGA (XC7VX690-3) using Vivado Design Suite 2017.4. All the tests were performed with the clock frequency set to 400 MHz, and synthetic traffic generators were developed to generate packets in place of the PEs. Vivado tools were used to estimate the area and power cost of the design. In the following, we report the primary results by comparing the proposed architecture with a reference design based on a traditional flattened 2D-mesh interconnect. The main microarchitecture parameters of the flattened 2D-mesh design are the same as those used for the proposed modified mesh router, except for the number of input-output ports, which is smaller (following a canonical design).
6.1 Cost metrics
Cost metrics are represented by the power consumption and the resource utilisation when the proposed design and the flattened 2D-mesh topology are implemented on the FPGA device.
6.1.1 Resource Utilisation: Area
We compared the two designs on a relative scale and reported the values as a percentage of the total used resources (see Table 3). To this end, we counted LUTs, FFs and Block RAMs (BRAMs, each 36 Kb in size). In our proposed design, the implementation of a single block (i.e., four ringlets connected to a mesh router) uses a total of 2434 LUTs, 2768 FFs and 48 BRAMs. Specifically, the four ringlets alone consume a total of 1076 LUTs, 1800 FFs and 40 BRAMs, while the resource consumption of the proposed mesh router is reported in Table 2.
In Table 2, we also compare the resource utilisation and power consumption of a standard 2D-mesh router with those of our proposed one. The table reports the absolute number of resources consumed by the two designs, along with the static and dynamic power breakdown. As previously stated, the standard 2D-mesh router shares its microarchitectural parameters with our proposed mesh router (see Table 1), except for the number of input-output ports (five in total). Unlike a standard router, the modified one can support sixteen cores via four ringlets with around increment in the resource utilisation compared to a traditional mesh router and with less than W increment of power consumption. Regarding the power consumption, we split the values between static power (which is due to leakage currents and depends on the manufacturing process) and dynamic power. Results show that the proposed design consumes mW and mW more, respectively, in terms of static and dynamic power. This result is strictly correlated with the large number of memory blocks (BRAMs) used by the proposed design.
We also compare the resource utilisation of our single block (connecting 16 PEs in total) with that of the publicly available, FPGA-friendly NoC generator CONNECT Papamichael2012co . In this experiment, our design saves 74.65% of LUTs and 39.51% of FFs compared to CONNECT (we used the code available on its official website without any modification) when connecting sixteen cores via an on-chip network. CONNECT also uses 1728 Distributed RAM elements, each 64 bits in size. It is important to highlight that such small SRAM elements have a high wiring cost when compared to BRAMs.
Scaling the system up to 1024 PEs requires 64 modified mesh routers and 256 ringlets. These architectural blocks consume a total of 155776 LUTs, 177152 FFs and 3072 BRAM blocks (a total of four FPGA devices are used, interconnected with each other through dedicated links). From Table 3, we can see that our model is very resource efficient compared to the standard flattened 2D-mesh design. For connecting sixteen cores (one block), our design saves (on average) 2% of LUTs, 0.7% of FFs and 2.2% of BRAMs compared to sixteen 2D-mesh routers. Although this saving might seem small, when the design is scaled to 1024 cores the total average resource saving (over all four FPGAs) increases up to 129.3% for LUTs, 47.2% for FFs and 139.3% for BRAMs.
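The scaling figures quoted above follow from simple arithmetic on the per-block counts, as the short sketch below checks. The per-block and four-ringlet numbers are the ones reported in this section; the router figure is derived here by subtraction, under the assumption that a block consists only of one router plus four ringlets (the authoritative router values are those of Table 2).

```python
# Per-block resource counts reported in the text.
BLOCK = {"LUT": 2434, "FF": 2768, "BRAM": 48}           # router + four ringlets
FOUR_RINGLETS = {"LUT": 1076, "FF": 1800, "BRAM": 40}   # four ringlets alone
PES_PER_BLOCK = 16                                      # 4 ringlets x 4 PEs


def scale(total_pes: int):
    """Number of blocks and total resources for a given PE count."""
    blocks = total_pes // PES_PER_BLOCK
    totals = {name: count * blocks for name, count in BLOCK.items()}
    return blocks, totals


# Derived by subtraction (an assumption, not a value quoted from Table 2).
ROUTER_ESTIMATE = {name: BLOCK[name] - FOUR_RINGLETS[name] for name in BLOCK}

if __name__ == "__main__":
    blocks, totals = scale(1024)
    print(blocks, "routers,", 4 * blocks, "ringlets:", totals)
    print("router share of one block (derived):", ROUTER_ESTIMATE)
```

For 1024 PEs this gives 64 blocks, i.e., 64 routers and 256 ringlets, and reproduces the totals of 155776 LUTs, 177152 FFs and 3072 BRAMs.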
6.1.2 Power consumption
In Figure 6, we show the static and dynamic power breakdown for each network configuration. Reported values are normalised to the total power consumption exhibited by each configuration and expressed as a percentage. Initially, the static part dominates the power consumption, but as the size of the network grows, its share starts to diminish. Interestingly, we also notice that the static part slightly grows with the increase in the network size; however, the dynamic power component quickly starts to dominate. We can also see from Figure 6 that the amount of static power is more related to the implementation of the mesh routers than to that of the ringlets. However, it is worth noting that the ringlets in all cases consume more FFs and BRAMs, while the routers consume more LUTs (specifically, in the range of 0.06% to 4.2%; see Table 3).
Next, we compare the power consumption of the proposed model and the standard 2D-mesh in Figure 7. It is worth noting that the power consumption of the proposed NoC architecture is the sum of the power consumption of both ringlets and mesh routers. For one topology block (16 cores), the power consumption is W and W, respectively for the mesh router and the ringlet. However, as the size of the network grows, the total power consumption of the ringlets starts to dominate. For instance, for 16 topology blocks (i.e., 256 cores), the power consumption of the routers is W while the 64 ringlets consume W, which is more than of the total router power consumption. Following this trend, for the 1024-core configuration, all the ringlets consume around of the total power consumption of all the routers. Apart from that, for a network size of 16 cores, the proposed design and the flattened 2D-mesh consume almost the same amount of power. However, as the core count of the network grows, the 2D-mesh starts to consume comparatively more power. For the network configuration, the proposed model consumes W while the conventional design consumes W. The situation becomes worse when it touches W for connecting 1024 cores, which represents relatively more power compared to the proposed design.
6.2 Performance metrics
We evaluated our proposed routers and ringlets under statistical traffic patterns in terms of throughput and average data packet latency (i.e., the time for a packet to move from source to destination, including the time for a packet to cross the channel). We consider three well-known statistical traffic patterns (namely uniform random, bit-reversal and transpose Dally2004p ) to represent the way real-world applications generate traffic. We developed VHDL-based cycle-accurate models for generating the synthetic traffic patterns. In modern computing applications, communication requirements are dynamic and unknown before the execution. Generally, bit-reversal and transpose do not support smooth traffic operations. For the experiment, we synthetically generated a large number of packets that need to be independently routed to a dynamically determined destination. We used four packet injection rates, where the injection rate is the fraction of nodes that simultaneously inject a packet into the network every cycle. At the lower rates, the nodes generating traffic are selected randomly, while at the highest rate (worst case) all the nodes inject packets at the same time. The stress test was meant to verify the behaviour of the proposed NoC design under both bandwidth and worst/average latency scenarios.
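For reference, the three destination functions can be written compactly using the standard definitions from the interconnection-network literature (bit-reversal mirrors the bits of the source address; transpose swaps its upper and lower halves). The sketch below is a software illustration only, not the authors' VHDL generators; the function names are hypothetical, and any injection rate passed to `select_injectors` is a free parameter chosen by the caller rather than one of the rates used in the experiments.

```python
import random
from typing import List


def uniform_random(num_nodes: int, rng: random.Random) -> int:
    """Destination drawn uniformly among all nodes (may equal the source)."""
    return rng.randrange(num_nodes)


def bit_reversal(src: int, addr_bits: int) -> int:
    """Destination address is the source address with its bits mirrored."""
    dst = 0
    for i in range(addr_bits):
        if src & (1 << i):
            dst |= 1 << (addr_bits - 1 - i)
    return dst


def transpose(src: int, addr_bits: int) -> int:
    """Destination swaps the upper and lower halves of the source address."""
    if addr_bits % 2:
        raise ValueError("transpose needs an even number of address bits")
    half = addr_bits // 2
    mask = (1 << addr_bits) - 1
    return ((src >> half) | (src << half)) & mask


def select_injectors(num_nodes: int, rate: float, rng: random.Random) -> List[int]:
    """Randomly pick the fraction `rate` of nodes that inject in this cycle."""
    return rng.sample(range(num_nodes), round(rate * num_nodes))
```

With 1024 PEs the node address is 10 bits wide, so, for example, node 1 sends to node 512 under bit-reversal and to node 32 under transpose.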
6.2.1 Network latency
Figures 8, 9 and 10 show the average packet latency as a function of the four injection rates for the three traffic patterns. The bars show that the network latency increases with the injection rate, as well as with the network size. However, the proposed system behaves consistently as the network configuration grows. For low injection rates, the latency of each traffic pattern remains very consistent with that of the others. When increasing the injection rate up to and using the bit-reversal traffic pattern, the latency is minimised. The worst case for the packet latency is represented by the transpose traffic pattern with an injection rate of .
When comparing the proposed architecture with the conventional flattened 2D-mesh, we found that, for all three traffic patterns, our design outperforms the traditional NoC design by keeping the latency lower. Specifically, analysing the behaviour of the 2D-mesh NoC, we found that its latency increments are very consistent in all cases, and that it also shows its largest latency for the transpose traffic pattern with the injection rate of (similar to the proposed design). We improved the latency (on average) by 10% for all three traffic patterns for the smallest network configuration (i.e., 16 cores). For the largest case (i.e., 1024 cores), we improved it by for uniform-random traffic, by for transpose traffic, and by for bit-reversal traffic.
6.2.2 Network throughput
Figures 11, 12 and 13 show the achieved network throughput for all three traffic patterns. Similar to the latency, the network throughput is consistent with the packet injection rate. In fact, in the proposed design the average throughput increases as the number of PEs increases. From this viewpoint, by analysing the number of packets delivered per cycle, we observed an increase of a factor with the increase in the number of PEs in the network. For instance, with an injection rate equal to and a uniform random traffic pattern, the average throughput increases from 12 packets/cycle for a single block unit (i.e., 16 cores) to 22 packets/cycle for two block units (i.e., 32 cores).
Similarly, for a configuration with 512 cores, the average throughput is 345 packets/cycle, while it increases up to 680 packets/cycle for a 1024-core configuration. Again, this represents approximately an improvement of a factor with the doubling of the network size. Our design follows this trend also when other traffic patterns are considered. It shows that our NoC architecture is capable of offering higher performance and scalability compared to the traditional flattened 2D-mesh. Interestingly, we also observed a similar trend in the 2D-mesh throughput. Conversely, when the injection rate is low, we see that our design performs better for the transpose traffic pattern. For an injection rate equal to , the proposed design performed well for all the traffic patterns, while for the largest network configuration (i.e., 1024 cores) the design again shows the best throughput for the transpose traffic pattern. However, considering the worst-case injection rate, it is worth noting that the best throughput is achieved with the uniform-random traffic, while the traditional 2D-mesh topology did not demonstrate a similar consistency among the patterns.
These results show that our design improves the performance of the NoC, in terms of both higher throughput and lower average latency, compared to the traditional 2D-mesh topology. The capability of our design to sustain such performance even with high injection rates and a random traffic pattern (which represents a critical pattern) can be mainly ascribed to the hierarchical organisation of the network. In fact, most of the traffic is kept inside the ringlets or is exchanged between ringlets connected to the same mesh router. This ringlet-oriented organisation is the main contributor to the scalability of the proposed design.
6.2.3 Network scalability
To better analyse the scalability of the proposed design, we performed a further set of experiments specifically aimed at evaluating the average packet latency and throughput with an increasing number of cores in the network. For comparison purposes, we averaged the results over the four injection rates for all three traffic patterns. The results are shown in Figure 14.
From the plot, it is evident that the proposed NoC architecture shows a significant reduction of the average latency up to 128 cores, compared to the 2D-mesh. Specifically, moving from a configuration with 16 PEs to the one with 128 PEs, the latency increases from to cycles. Conversely, for the same two configurations, the 2D-mesh topology shows an increment of the latency from to cycles. This behaviour has been observed irrespective of the traffic pattern. For the next two network configurations (i.e., 256 and 512 cores respectively), the latency increment exhibited by our design is smaller (e.g., the increment from 128 to 256 PEs shows a lower slope of the curve), while in the 2D-mesh the latency increment still follows a linear trend, thus showing worse performance. Finally, considering the largest configuration (i.e., 1024 cores), the trend is still non-linear for the proposed design and the latency drops to cycles (this trend is similar for all three traffic patterns). Conversely, the 2D-mesh increases the latency up to cycles. It is worth noting that, although the two architectures behave similarly when the number of PEs increases, the average latency is always significantly lower with the proposed design. If we combine this observation with the lower power consumption and resource utilisation, we can argue that the proposed design scales better than conventional ones and can be a good candidate for supporting next-generation high-performance manycore accelerators. Interestingly, to obtain similar performance using the flattened 2D-mesh topology, more resources are needed (e.g., a larger number of VCs, deeper buffers), leading to a more power-hungry and area-consuming solution (similar to the cases reported in Hoskote20075 ; Vangal20078 ; Balfour2006design ). This is the primary reason why the standard 2D-mesh does not scale well with increasing network size.
To further confirm the capability of our design to scale, we also analysed how the average network throughput improves with the growing network size (using the same packet injection rate for the three traffic patterns). The results of this experiment are shown in Figure 15. From the trend, it is evident that the average throughput tends to increase almost linearly by a factor when the number of PEs is doubled. For instance, considering the transpose traffic pattern, for a single block unit (i.e., 16 cores) the average throughput is packets/cycle while it increases to packets/cycle for 32 PEs. Similar behaviour is observed when moving from 128 cores to 256 cores, as well as when moving towards the largest configuration, i.e., from 512 cores to 1024 cores. Although the 2D-mesh topology shows similar behaviour for lower core counts, it is important to highlight that its average throughput is always lower than that of our proposed design, and quickly starts to decrease when the core count increases (i.e., for more than 128 cores our ring-mesh combination far outperforms the 2D-mesh topology).
Finally, we analysed how the network performs when increasing the number of PEs, by plotting the average throughput versus the number of processing cores together with the average throughput versus the average latency. The result of this comparative analysis is reported in Figure 16. Here, all the reported values of latency and throughput are averages over all the experiments. From the plot, it appears that the average latency grows by a factor of when moving from 256 cores to 512 cores, while in all the other cases we see a lower latency growth. The NoC design also registers an average throughput growth of a factor of for almost all cases. The proposed design shows robust performance: its throughput grows faster than its latency, and the latency growth even starts to slow down as the network size increases.
It is worth noting that the good performance exhibited by our design also derives from the adoption of the single-flit packet format. In fact, this format greatly reduces the processing overhead associated with the packet header, since the information regarding traffic movement within the ringlets and within the global mesh is decoupled.
7 Further improvements: Adaptability
In this section, we discuss possible extensions of the underlying working mechanism of the proposed network architecture, aimed at better supporting multiple applications with dynamic resource requirements, as well as at further reducing power consumption whenever the resources of the whole fabric are not used. For instance, applications based on the MapReduce programming model can vary the number of required PEs during their execution. In general, the number of PEs needed by map functions is higher than the number required by reduce functions. Applications built upon an explicit dataflow programming model (e.g., Suettlerlein2013 ) show a similar behaviour: dataflow graphs representing the execution flow and the thread/task dependencies can grow and shrink during the application lifetime. It is worth noting that in DL/ML applications, the computations of different layers of neurons may require a different mapping onto the fabric resources (i.e., a different number of resources might be required in different stages of a neural network). With the aim of supporting such applications, we provide a set of features that allow the large number of PEs supported by our NoC design to be better exploited, and that provide resiliency of the system against failures. Explicitly, we allow the active elements of the network hierarchy (i.e., mesh routers and RSs) to be selectively bypassed or completely switched off.
7.1 Morphing capability
A smart selection of the specific topology used to run applications can optimise cost and improve performance. The operating system (OS) or a dedicated run-time can work in conjunction with the NoC support to exploit this feature and optimise the thread/task communication overheads. Here, we present an extension of the proposed design aimed at supporting dynamic reconfiguration of the topology (to better adapt to the application needs). The routers and RSs can process a special single-flit configuration packet, called morph packet, which specifies how to configure single links within the mesh routers or the RSs. In HPC environments, the hypervisor/OS is responsible for generating such control packets in consultation with the compiler (similar to Murali2004su ).
Our single-flit packet structure does not allow the flit header to encode the information needed to issue the control signals that configure the network. Such information must be carried within the flit payload. We adopted a communication protocol that avoids permanently reserving part of the payload for control information. Here, control packets are signalled by sending an initial starting flit. Since a payload with all bits set (i.e., 0xFFFFFFFF) is rare, we use this value to announce the transmission of the subsequent control flit. Furthermore, simple combinatorial logic can be used to detect such packets. Conversely, when a data flit with this payload must be transmitted to the destination, a pair of flits with the 0xFFFFFFFF payload is transferred. Preliminary simulations confirmed that the penalty is negligible, especially when real application traffic (e.g., DL/ML applications) is taken into account. To deal with this protocol, the routing logic is slightly augmented with a simple finite state machine that processes configuration packets. The protocol ensures that configuration flits can be correctly delivered without sacrificing bandwidth for the application traffic. Furthermore, configuration traffic can be restricted to use only one of the two VCs available in the routers and RSs (e.g., VC-0 can be assigned to manage morph packets). The morph packet is organised in such a way that the 32 bits of the payload carry the configuration information. Notably, the payload is composed of four sub-fields: hierarchy level (HL, 1 bit), execution region size (ERS, 10 bits), link configuration (LC, 16 bits), and PE-type selector (PTS, 5 bits). Figure 17 (left) shows the internal organisation of the morph packets.
Hierarchy level (HL): A single bit distinguishes whether the configuration must be applied to an RS (HL = 0) or to the mesh router (HL = 1).
Execution region size (ERS): The next 10 bits of the payload form this field. With this length, an application can request for its execution any subset of the total core count, up to the whole CMP computing resources.
Link configuration (LC): This field uses up to 16 bits to specify how to configure single links. Each group of two bits in this field specifies the state of the corresponding link. In the case of the mesh router, there are eight links in total (i.e., north, south, east and west for the 2D-mesh, and four to connect to the ringlets). Conversely, an RS has four links at most; thus, only a reduced number of bits in this field is used in the case of RS configuration. A group of two bits sets the link into one of three main states:
Active: Link is fully active, and the traffic normally flows inside the router/switch for routing decision.
Bypass: Link is configured in such a way that the incoming traffic is directly presented to the corresponding output port, moving in the same direction. For instance, bypassing the east channel of a mesh router injects the traffic directly into the west output channel.
Switch-off: Link is completely switched off, by disabling the logic governing it. The router/switch logic governing the other links is reconfigured accordingly.
PE type selector (PTS): The remaining 5 bits can be used to target particular resources in the chip (such as dedicated accelerating cores when a heterogeneous environment is used). In case all the configuration bits would have to be set, the least significant bit (LSB) is always set to 0, so that the configuration payload can be distinguished from the data payload (0xFFFFFFFF). This design choice still leaves enough room to target several types of embedded cores or IPs.
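The payload layout and the start-flit protocol can be sketched as follows. Only the field widths, the HL meaning and the 0xFFFFFFFF start value come from the text; the position of the fields (HL in the most significant bit, followed by ERS, LC and PTS), the two-bit codes chosen for the three link states and all function names are assumptions made for illustration. The rule that forces the LSB to 0 is not modelled, because with the state codes assumed here the payload can never have all bits set.

```python
from enum import IntEnum
from typing import List, Sequence, Tuple

START_FLIT = 0xFFFFFFFF  # payload value announcing a control flit


class LinkState(IntEnum):
    # The numeric codes are assumptions; the text only names the states.
    ACTIVE = 0b00
    BYPASS = 0b01
    SWITCH_OFF = 0b10


def pack_morph(hl: int, ers: int, links: Sequence[LinkState], pts: int) -> int:
    """Build a 32-bit morph payload: HL(1) | ERS(10) | LC(16) | PTS(5)."""
    if hl not in (0, 1) or not 0 <= ers < (1 << 10) or not 0 <= pts < (1 << 5):
        raise ValueError("field out of range")
    if len(links) > 8:
        raise ValueError("at most eight links (16 LC bits)")
    lc = 0
    for i, state in enumerate(links):
        lc |= int(state) << (2 * i)          # two bits per link
    return (hl << 31) | (ers << 21) | (lc << 5) | pts


def unpack_morph(payload: int) -> Tuple[int, int, List[int], int]:
    """Return (HL, ERS, eight 2-bit link codes, PTS)."""
    lc = (payload >> 5) & 0xFFFF
    links = [(lc >> (2 * i)) & 0b11 for i in range(8)]
    return payload >> 31, (payload >> 21) & 0x3FF, links, payload & 0x1F


def classify_stream(payloads: Sequence[int]) -> List[Tuple[str, int]]:
    """Two-state FSM: a start flit marks the next flit as configuration,
    unless that next flit is also 0xFFFFFFFF (escaped data)."""
    out, after_start = [], False
    for p in payloads:
        if after_start:
            out.append(("data", p) if p == START_FLIT else ("config", p))
            after_start = False
        elif p == START_FLIT:
            after_start = True
        else:
            out.append(("data", p))
    return out
```

In this sketch a mesh router would use all eight link slots, while an RS would use only the first four.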
Morph packets can be generated at the OS/hypervisor level in such a way that they are sent selectively to a subset of the routers. Moreover, the way control information is organised in the payload allows the routers and RSs not involved in the configuration process to skip its processing, thus avoiding further throughput limitations. Starting from the requests of the application, routers can be organised in such a way that they create execution regions for each application instance (i.e., an execution region corresponds to the dynamic group of PEs required by the application). For instance, it is possible for a mesh router to dynamically restrict the execution of an application to two ringlets, and to use the other two ringlets for the execution of another application (application awareness). The proposed morphing solution is flexible enough to be exploited to tailor the NoC topology to the application requirements. Similarly to Scionti2016 , by selectively bypassing or disabling links, the NoC can assume a special (virtualised) configuration that provides more performance for the specific application.
Figure 17 (right) shows the modification to the control signals that enables the morphing capabilities. Blue dashed lines represent the signals set to bypass the crossbar switch and the input/output channel buffers. Every time such a signal is set, packets entering the selected input channels are driven directly to the output links (blue line). Conversely, the switch-off control signal (which is dominant over the bypass signal) completely disables the input and output channel logic. Traffic entering switched-off channels is dropped. Similarly to the mesh router, the bypass and switch-off logic directly operates on the MUXs/DMUXs governing the RSs (see Figure 4, Section 4.2).
Morphing capabilities can also be used to further improve power saving (e.g., with a power-gating technique based on a traffic activity threshold Parikh2014p ) and to ensure system resiliency. Whenever a portion of the fabric is not used by any running application, such resources can be switched off. When an application requires more resources, switched-off elements (e.g., a ringlet) can be enabled by resetting the switch-off control signals. The proposed extension can also play an important role in ensuring system resiliency to faults: when a faulty PE or a failure in the router/switch logic is detected, the component can be easily bypassed.
Finally, the morphing capability is mainly aimed at improving the energy cost by allowing multiple applications to share the spatial as well as the computational resources of the chip. The primary idea behind the use of morphing capabilities is to better tailor the NoC resources available on the fabric to an application context that exploits data and computation locality in a more dynamic way.
8 Conclusion and Future Work
Following the current trend, we will have general-purpose chips with hundreds, or even thousands, of PEs. To efficiently use this massive built-in parallelism, we need to manage the traffic inside the chip efficiently, from both the power (reducing hot-spots) and the performance (lower latency and higher throughput) perspective. In this paper, we propose a simple yet scalable two-level hybrid hierarchical interconnection in which rings and a 2D-mesh topology are fused together without using any bridge router. We have implemented the proposed NoC design with an efficient traffic generator on high-end FPGA devices. Our experiments with multiple synthetic traffic patterns showed that our design is scalable, while keeping high performance in terms of throughput and latency. Experimental results also showed that our hierarchical organisation of the interconnect can outperform traditional 2D-mesh NoCs, a topology that is popular especially in hardware accelerators tailored for the emerging domain of DL/ML. The proposed topological design can also be extended with special configuration packets to better exploit chip resources, depending on the specific application requirements. Finally, we discussed how the proposed design could be used for applications whose requirements change dynamically over their execution lifetime (e.g., applications based on an explicit dataflow execution model, DL/ML-based applications). As future work, we will compare the performance of our design against other NoC topologies and analyse its performance while mapping DL/ML applications. It will also be interesting to explore its performance with different micro-architectural parameters and larger packet sizes.