Rapid advancement in semiconductor technology has made it possible to integrate tens, hundreds, and even thousands of processing elements on a single chip multi-processor (CMP). Networks-on-Chip (NoCs) have become the mainstream intra-chip communication infrastructure for CMPs and many-core systems due to their scalability and power efficiency. Communication in NoCs can be categorized into unicast (one-to-one) and multicast (one-to-many) communication. A significant amount of multicast traffic is observed in parallel applications involving multithreaded programs, replication, barrier/clock synchronization, and cache coherence protocols, as confirmed by simulations of a set of CMP benchmark applications with MESI cache protocols. Depending on the coherence protocol used, both the multicast traffic percentage and the number of destinations per multicast packet can vary. Emerging heterogeneous CPU-GPU many-core systems specifically target machine learning applications, which involve a large amount of communication to the memory controller. Convolutional neural networks (CNNs) are commonly used deep learning architectures. CNNs contain a stack of convolution and pooling layers followed by a fully connected layer at the end. The convolution and pooling layers use distributed operations for moving data between the cores, while the fully connected layer is communication intensive and its traffic is inherently multicast in nature. All these applications call for efficient handling of multicast traffic.
Although multicast communication can be performed by splitting a multicast packet into multiple unicast packets, this approach is inefficient because of poor resource utilization. Prior work shows that even a small percentage of multicast traffic can significantly degrade network performance. Thus, an efficient multicast algorithm is needed to overcome this degradation. Multicast routing algorithms are broadly classified into tree-based and path-based algorithms. In tree-based algorithms, a spanning tree is constructed from the source node to cover all destination nodes. Messages are replicated at intermediate nodes and forwarded along multiple branches. Although a tree-based algorithm can deliver messages to each destination via the shortest path, it is susceptible to deadlocks because of its high blocking probability. Multiple unicast routing can be considered a simple tree-based algorithm, which has an advantage when the multicast destinations are sparsely located.
In path-based multicast algorithms, intermediate branching is prohibited. Multiple paths are identified at the source node, and the packet is delivered to the destination nodes along these paths. Path-based algorithms do not suffer from the deadlock issue; however, the path to each destination tends to be long, resulting in higher packet latency. In the dual-path algorithm, the destination set is divided into two partitions, $D_H$ and $D_L$, covering the destination nodes with higher and lower label numbers than the source node, respectively. The multicast packet is then delivered along one path for $D_H$ and one path for $D_L$. In the multi-path (MP) algorithm, $D_H$ and $D_L$ are each further divided into two subsets, one enclosing destinations whose $x$ coordinate is less than that of the source node and the other enclosing destinations whose $x$ coordinate is greater than or equal to that of the source node. Each multicast packet is thus distributed on four paths to four destination groups. By making full use of the four outgoing channels at the source node, the multi-path algorithm achieves better performance than the dual-path algorithm. However, the existing multi-path algorithms all use static partitions, which lack a global view of path optimization.
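The static partitioning used by the dual-path and multi-path algorithms can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes the usual snake-like (Hamiltonian) node labeling on an n x n mesh and assumes that the multi-path split is made on the x coordinate; the function names are hypothetical.

```python
def hamiltonian_label(x, y, n):
    """Snake-like label of node (x, y) in an n x n mesh (assumed labeling)."""
    return y * n + x if y % 2 == 0 else y * n + (n - 1 - x)


def dual_path_partition(src, dests, n):
    """Split destinations into a high group (labels above the source,
    ascending order) and a low group (labels below, descending order)."""
    s = hamiltonian_label(*src, n)
    key = lambda d: hamiltonian_label(*d, n)
    high = sorted((d for d in dests if key(d) > s), key=key)
    low = sorted((d for d in dests if key(d) < s), key=key, reverse=True)
    return high, low


def multi_path_partition(src, dests, n):
    """Split each dual-path group again by x coordinate (assumption:
    x < source x versus x >= source x), giving four static groups."""
    high, low = dual_path_partition(src, dests, n)

    def split(group):
        left = [d for d in group if d[0] < src[0]]
        right = [d for d in group if d[0] >= src[0]]
        return left, right

    return split(high) + split(low)


if __name__ == "__main__":
    dests = [(0, 0), (3, 0), (0, 1), (2, 2), (3, 3), (0, 3)]
    print(dual_path_partition((1, 1), dests, 4))
    print(multi_path_partition((1, 1), dests, 4))
```

Because the four groups depend only on the labels and coordinates relative to the source, the split is the same no matter how the destinations are actually distributed, which is the limitation the paragraph above points out.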
In this paper, we first formulate the multicast destination set partition problem and propose an efficient dynamic partition merging (DPM) algorithm to solve it. The proposed algorithm divides the multicast destination set into partitions dynamically by comparing the routing cost of different partition merging options and selecting the merged partitions with the lower cost. The multicast routing in each partition takes advantage of both multiple unicast and dual-path algorithms. Deadlock is avoided by separating the physical network into high-channel and low-channel subnetworks with corresponding turn restrictions. Simulation results with both synthetic traffic and PARSEC benchmark workloads show that the DPM-based multicast routing algorithm achieves significant improvement over the existing multi-path algorithms in both average packet latency and power consumption.
The rest of the paper is organized as follows. Section II provides a brief review of existing on-chip multicast routing algorithms. In Section III, the problem of multicast destination set partition is formulated and the proposed algorithm is described in detail. Section IV provides the experimental settings, simulation results, and a discussion of the results. Section V concludes the paper.
II Related Work
In the literature, several tree-based and path-based multicast routing algorithms have been proposed for NoCs. In the Virtual Circuit Tree Multicast (VCTM) scheme, a unique VCT ID is created for each source-destination set, and a virtual circuit is formed before the multicast traffic begins to flow. The overhead of virtual circuit setup time and tree maintenance information is not trivial, especially when the number of multicast destinations is high. In Recursive Partitioning Multicast (RPM), the network is first partitioned according to the source node location, and a tree is selectively formed by replication. After the replica reaches the next hop in each partition, the same process repeats recursively inside the partition until the message is delivered to all destinations. A deadlock-free tree-based multicast routing scheme based on a hold-release tagging method has also been proposed, in which flits can be interleaved and a pheromone tracking method is used to reduce communication energy.
Two multicast routing algorithms have been proposed for 3D NoCs: MXYZ, a tree-based dimension-order routing algorithm in which packets are replicated in the X, Y, or Z direction based on the destinations with common coordinates, and AL+XYZ, which finds an alternate output channel if the selected output channel is not available. An adaptive and deadlock-free tree-based multicast routing scheme has also been proposed that interleaves flits from different sources in a message queue. The K-Means based Multicast Routing (KMCR) scheme employs the K-means clustering method to partition the network so that the partitions are balanced in terms of destinations. The centroid of each partition is then found, a tree is created from the source node to each centroid, and finally a subtree is formed from each centroid to its destinations. All tree-based routing algorithms require support for flit replication in the router microarchitecture.
In the dual-path (DP) and multipath (MP) routing algorithms, the destination nodes are partitioned into two groups and four groups, respectively, according to their position relative to the source node. In the column-path method, the message first travels along the X dimension, in the East or West direction, to the first node whose x coordinate matches that of the destination, and is then delivered along the North or South direction to match the y coordinate of the destination. A low-distance multipath algorithm has also been proposed in which the destination list is sorted according to hop count instead of label. That algorithm is shown to be a better option than both the multipath and column-path algorithms. In addition, a path-based adaptive routing scheme has been proposed that exploits potential alternative paths by prohibiting certain turns, achieving better performance with a high degree of adaptivity.
To reduce the latency of multicast packets, one path-based routing algorithm selects the available output channel according to the buffer space in downstream routers based on the destination partitions. A balanced partitioning method has been proposed for 3D NoCs as an extension of the dual-path, multipath, and column-path methods. A hierarchical multicast routing algorithm combining both multiple unicast and multi-path routing has also been proposed, which shows better resilience of multicast traffic to hardware trojan attacks. All the aforementioned path-based multicast routing algorithms use a static destination set partition strategy, which lacks a global view of path optimization. The partition strategy should instead be flexible enough to cope with various distributions of destination nodes.
III Dynamic Partition Merging-based Multicast Routing
The problem of static destination set partition is illustrated in Fig. 1. Based on the source node, the destination set is divided into four partitions using the static multi-path partition method. Thus, four packets are sent along the four paths shown in Fig. 1a towards the four destination groups. Fig. 1b shows an alternate partition strategy that yields a lower total hop count from the source node to all destination nodes than the static four-partition strategy. This example motivates us to study the problem of finding an optimal partition of the destination set such that the total hop count of the multicast routing paths is minimized. In the following, we first formulate the problem and then propose the Dynamic Partition Merging (DPM)-based multicast routing algorithm.
III-A Problem Formulation
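The destination set partition problem can be modeled as follows. We are given a source node $s$ and a universal set $D$ of multicast destination nodes. A collection of subsets of $D$, denoted as $P = \{P_1, P_2, \dots, P_m\}$ where $P_i \subseteq D$, can be formed with costs $C = \{c_1, c_2, \dots, c_m\}$ correspondingly, where $c_i$ is the cost of multicast routing from source $s$ to $P_i$. The objective of the problem is to find the set $P^* \subseteq P$ with the minimum total cost such that all destination nodes in $D$ are covered, i.e.,

$$\min_{P^* \subseteq P} \sum_{P_i \in P^*} c_i \quad \text{subject to} \quad \bigcup_{P_i \in P^*} P_i = D.$$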
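To make the formulation concrete, the following sketch solves a tiny instance by exhaustive search. It is only an illustration under stated assumptions: the chosen subsets are required to be disjoint (the paper's conclusion refers to the exact weighted set cover problem), the function name `min_cost_exact_cover` is hypothetical, and the subset costs are made-up numbers rather than routing costs from the paper.

```python
from itertools import combinations


def min_cost_exact_cover(universe, subsets, costs):
    """Return (best_cost, chosen_indices) over all selections of subsets that
    cover every destination exactly once. Exponential time: toy sizes only."""
    universe = frozenset(universe)
    best_cost, best_choice = None, None
    for k in range(1, len(subsets) + 1):
        for choice in combinations(range(len(subsets)), k):
            chosen = [subsets[i] for i in choice]
            covered = frozenset().union(*chosen)
            # Disjoint selections have no element counted twice.
            disjoint = sum(len(c) for c in chosen) == len(covered)
            if covered == universe and disjoint:
                cost = sum(costs[i] for i in choice)
                if best_cost is None or cost < best_cost:
                    best_cost, best_choice = cost, choice
    return best_cost, best_choice


if __name__ == "__main__":
    destinations = {1, 2, 3, 4}
    candidate_subsets = [{1, 2}, {3, 4}, {1, 2, 3}, {4}, {1, 2, 3, 4}]
    candidate_costs = [3, 3, 4, 1, 7]  # illustrative costs only
    print(min_cost_exact_cover(destinations, candidate_subsets, candidate_costs))
```

The exhaustive search grows exponentially with the number of candidate subsets, which is one reason a hardware-friendly heuristic is attractive for this problem.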
III-B Proposed Solution
Considering the hardware implementation cost, we propose a heuristic algorithm to solve the destination set partition problem. Following the fine-grained partition used in prior work, the destination set is partitioned into at most eight partitions according to the position of each destination relative to the source node, as shown in Fig. 2a. For each non-corner edge node (Fig. 2b), there are five partitions, while for each corner node (Fig. 2c), three partitions are possible. Without loss of generality, we describe the DPM algorithm using a non-edge node. For each destination node, its basic partition is determined by its position relative to the source node.
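One plausible reading of this basic partitioning is sketched below. It assumes that the eight partitions correspond to the four axis directions and the four quadrants around the source node, which is consistent with the counts of eight, five, and three partitions for interior, edge, and corner nodes. The index order (counter-clockwise starting from East) and all function names are assumptions made for illustration, not the paper's numbering.

```python
# Eight directions around the source, counter-clockwise from East.
# The index assignment is an assumption made for this sketch.
SECTORS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def sign(v):
    return (v > 0) - (v < 0)


def sector_index(src, dst):
    """Index of the basic partition that destination dst falls into."""
    direction = (sign(dst[0] - src[0]), sign(dst[1] - src[1]))
    if direction == (0, 0):
        raise ValueError("destination equals source")
    return SECTORS.index(direction)


def basic_partitions(src, dests):
    """Group destinations into eight mutually exclusive basic partitions."""
    parts = [[] for _ in SECTORS]
    for d in dests:
        parts[sector_index(src, d)].append(d)
    return parts


def possible_sectors(src, n):
    """Number of basic partitions that can be non-empty for src in an n x n mesh."""
    x, y = src
    return sum(0 <= x + dx < n and 0 <= y + dy < n for dx, dy in SECTORS)


if __name__ == "__main__":
    print(possible_sectors((3, 3), 8), possible_sectors((0, 3), 8), possible_sectors((0, 0), 8))
    print(basic_partitions((3, 3), [(5, 3), (5, 4), (3, 6), (0, 0)]))
```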
A basic destination partition set is formed, all of whose elements are mutually exclusive. An extended partition set is then defined from the basic partition set, and the search set is obtained by merging up to 3 consecutive partitions from the extended partition set. The limit is set to 3 to prevent the final partition from converging to two partitions, as in the dual-path algorithm, which has already been shown to perform worse than the multipath algorithm. This limit also saves unnecessary computation. The search set therefore contains all possible partitions formed by merging up to 3 consecutive partitions. The cost of multicast routing for each destination partition in the search set is obtained as defined in Definition 2, after the representative node (Definition 1) of the partition has been found. The saving defined in Definition 3 measures the merging benefit of each element in the search set.
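Definition 1: The representative node of an element $P_i$ of the search set is the destination node nearest to the source node for a given multicast packet, obtained by $R = \arg\min_{d \in P_i} H(s, d)$, where $H(s, d)$ is the hop count from the source node $s$ to the destination node $d$.

Definition 2: The cost of each destination partition $P_i$ is obtained by selecting the lower cost between multiple unicast routing and dual-path routing as $C(P_i) = \min(C_{MU}, C_{DP})$, where $C_{MU}$ is the cost of multiple unicast routing to the $n$ destinations in $P_i$ ($n$ gives the number of destinations in $P_i$), and $C_{DP}$ is the hop count of dual-path routing from node $R$.

Definition 3: The saving for merging $t$ partitions, i.e., for partitions $P_1, \dots, P_t$ in the search set, is obtained by the following rule: $S = \max\left(0, \sum_{k=1}^{t} C(P_k) - C_m\right)$, where $C_m$ is the cost of the merged partition.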
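Definitions 1 to 3 can be illustrated with the short sketch below. Several points are assumptions rather than the paper's exact formulas: hop counts are approximated by Manhattan distance, the partition cost also counts the hops from the source to the representative node so that separate and merged partitions are comparable, the dual-path cost follows destinations in label order from the representative node, and all function names are hypothetical.

```python
def hops(a, b):
    """Hop count between two mesh nodes, approximated by Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def label(node, n):
    """Snake-like (Hamiltonian) label on an n x n mesh (assumed labeling)."""
    x, y = node
    return y * n + x if y % 2 == 0 else y * n + (n - 1 - x)


def representative(src, part):
    """Definition 1: destination in the partition nearest to the source."""
    return min(part, key=lambda d: hops(src, d))


def mu_cost(rep, part):
    """Multiple unicast from the representative node to every other destination."""
    return sum(hops(rep, d) for d in part if d != rep)


def dp_cost(rep, part, n):
    """Dual-path from the representative node: one chain through higher labels
    (ascending) and one chain through lower labels (descending)."""
    key = lambda d: label(d, n)
    high = sorted((d for d in part if key(d) > key(rep)), key=key)
    low = sorted((d for d in part if key(d) < key(rep)), key=key, reverse=True)
    total = 0
    for chain in (high, low):
        cur = rep
        for d in chain:
            total += hops(cur, d)
            cur = d
    return total


def partition_cost(src, part, n):
    """Definition 2, plus the source-to-representative hops (an assumption)."""
    rep = representative(src, part)
    return hops(src, rep) + min(mu_cost(rep, part), dp_cost(rep, part, n))


def saving(src, parts, n):
    """Definition 3: benefit of merging the given partitions into one."""
    merged = [d for p in parts for d in p]
    separate = sum(partition_cost(src, p, n) for p in parts if p)
    return max(0, separate - partition_cost(src, merged, n))


if __name__ == "__main__":
    east, north_east = [(5, 3), (6, 3)], [(5, 4), (6, 5)]
    print(saving((3, 3), [east, north_east], 8))
```

In this toy case the two partitions cost 3 and 5 hops separately and 6 hops when merged, so merging is worthwhile under the assumed cost model.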
The proposed DPM algorithm is shown in Algorithm 1. Each node at $(x, y)$ in an $n \times n$ mesh is labeled as $L(x, y) = y \cdot n + x$ if $y$ is even, and $L(x, y) = y \cdot n + n - x - 1$ if $y$ is odd. The labels are used for calculating the dual-path routing paths. The network is first partitioned into the basic partitions according to the partition rule, and then the extended partition set and the search set are obtained. For each partition in the search set, a representative node is found, and the saving of each element in the search set is calculated. The partition in the search set with the maximum saving is selected as a final partition and added to the final set. The savings of the other partitions that contain the selected partition are then cleared to avoid overlap among the selected partitions. The search converges quickly (within up to 4 iterations) to a point where there is no more saving. Any non-merged basic partition is then added to the final set. For each partition in the final set, a multicast packet is formed with the destination group and the selected routing algorithm, and is delivered to the representative node. From the representative node, the selected routing algorithm (either dual-path or multiple unicast, based on the comparison in Definition 2) is used to deliver the packet.
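The merging loop described above can be sketched as a small greedy procedure. This is an interpretation, not the paper's Algorithm 1: the basic partitions are assumed to be given in circular order, overlapping candidates are skipped rather than having a stored saving reset, ties prefer fewer partitions and then the smaller starting index, and the cost function `example_cost` is a hypothetical stand-in for the routing cost.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def example_cost(src):
    """Toy cost (assumption): hops to the nearest destination plus unicast
    hops from that destination to the others in the partition."""
    def cost(part):
        rep = min(part, key=lambda d: manhattan(src, d))
        return manhattan(src, rep) + sum(manhattan(rep, d) for d in part)
    return cost


def dpm_merge(basic, cost, max_merge=3):
    """Greedily merge up to max_merge consecutive basic partitions.

    basic: list of destination lists in circular order (empty lists allowed).
    Returns the final list of partitions.
    """
    k = len(basic)
    candidates = [tuple((start + j) % k for j in range(size))
                  for size in range(2, max_merge + 1) for start in range(k)]
    used, final = set(), []
    while True:
        best = None
        for idx in candidates:
            if used & set(idx):
                continue  # overlaps an already selected partition
            members = [basic[i] for i in idx if basic[i]]
            if len(members) < 2:
                continue  # nothing to merge
            merged = [d for m in members for d in m]
            gain = sum(cost(m) for m in members) - cost(merged)
            # Tie-break: fewer partitions first, then smaller starting index.
            key = (gain, -len(idx), -idx[0])
            if gain > 0 and (best is None or key > best[0]):
                best = (key, idx, merged)
        if best is None:
            break  # no more saving
        final.append(best[2])
        used.update(best[1])
    # Non-merged, non-empty basic partitions go to the final set unchanged.
    final.extend(basic[i] for i in range(k) if i not in used and basic[i])
    return final


if __name__ == "__main__":
    src = (3, 3)
    basic = [[(5, 3)], [(5, 4)], [], [], [(0, 3)], [], [], []]
    print(dpm_merge(basic, example_cost(src)))
```

With eight basic partitions and at least two partitions consumed per merge, the loop runs at most four merging iterations, in line with the convergence bound stated above.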
Fig. 3 compares the multipath (MP) routing, the new multipath (NMP) routing proposed by Ebrahimi et al., and the proposed DPM-based routing algorithm. In the example, the source node is marked in yellow and all destination nodes are marked in green. Fig. 3a shows the operation of MP, in which the destination set is divided into four partitions. In the partitions whose labels are higher than that of the source node, the destination nodes are sorted in ascending order of node labels and delivered accordingly, whereas in the partitions whose labels are lower, the destinations are sorted in descending order of node labels and delivered accordingly. Fig. 3b shows the operation of NMP, where the partitioning is similar to that of MP but the nodes are labeled differently. In each partition, the destinations are sorted according to the least hop count from the source node. A multicast packet is forwarded to the nearest destination first, and the remaining destinations are then sorted again to select the next destination. This method saves routing hops over multipath in two of the four partitions.
Fig. 3c shows the partitions before merging and Fig. 3d shows the partitions after merging. If two merging options have the same cost, the group containing the fewest partitions is chosen first, and then the group with the smallest partition index is chosen. For a partition with several destinations, after delivery to the representative node (a dotted path), dual-path routing is performed for the rest of the destinations. For a partition with a single destination, a unicast is performed to deliver the packet to that destination. In another partition, after delivery to the representative node, the multiple unicast (MU) option is chosen because dual-path and MU give the same cost, but MU eliminates the overhead of computing the dual-path routes. The proposed method uses dynamic partitioning to regroup the eight original partitions into four partitions, as shown in Fig. 3d. With this optimization, DPM is able to deliver the packet with fewer hops than MP or NMP.
III-C Deadlock Avoidance
Deadlock is an important aspect to consider when designing a multicast routing algorithm. The proposed DPM-based multicast routing algorithm is prone to deadlock because of mixed turns. To avoid deadlocks, the physical network is divided into a high-channel subnetwork and a low-channel subnetwork, as shown in Fig. 4. The high-channel subnetwork is used when the next hop label is higher than the present node label, and the low-channel subnetwork is used when the next hop label is lower than the present node label. Fig. 4 also shows the allowed turns in each subnetwork; invalid turns are not allowed. As long as this rule is followed, deadlocks can be avoided. This method of channel selection and turn restriction is used by both multicast and unicast traffic to avoid deadlocks.
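The channel selection rule can be illustrated with a small label-based router. This sketch assumes the snake-like node labeling and the label-monotone routing function commonly used with the dual-path algorithm; it does not model the specific turn restrictions of Fig. 4, and the function names are hypothetical.

```python
def label(node, n):
    """Snake-like (Hamiltonian) label on an n x n mesh (assumed labeling)."""
    x, y = node
    return y * n + x if y % 2 == 0 else y * n + (n - 1 - x)


def neighbors(node, n):
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < n and 0 <= y + dy < n:
            yield (x + dx, y + dy)


def subnetwork(cur, nxt, n):
    """High channel if the next hop has a higher label, otherwise low channel."""
    return "high" if label(nxt, n) > label(cur, n) else "low"


def next_hop(cur, dst, n):
    """Move to the neighbor whose label is closest to the destination label
    without overshooting it, so labels change monotonically along the path."""
    lc, ld = label(cur, n), label(dst, n)
    key = lambda w: label(w, n)
    if ld > lc:
        return max((w for w in neighbors(cur, n) if lc < key(w) <= ld), key=key)
    return min((w for w in neighbors(cur, n) if ld <= key(w) < lc), key=key)


def route(src, dst, n):
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], dst, n))
    return path


if __name__ == "__main__":
    up = route((0, 0), (2, 2), 4)
    print(up, {subnetwork(a, b, 4) for a, b in zip(up, up[1:])})
    down = route((2, 2), (0, 0), 4)
    print(down, {subnetwork(a, b, 4) for a, b in zip(down, down[1:])})
```

Because each path only ever increases or only ever decreases the label, a packet stays in a single subnetwork from its current node to its next destination, which is what keeps the two subnetworks free of cyclic channel dependencies.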
IV Performance Evaluation
In the following sections, the proposed DPM-based multicast routing algorithm is evaluated against the multipath (MP) and the new multipath (NMP) algorithms using synthetic traffic and real application workloads.
IV-A Simulation Settings
ORION 2.0 is used to estimate the dynamic power consumption. An 8x8 mesh-based NoC is used for the simulation with 4 virtual channels (VCs), 2 for each of the high-channel and low-channel subnetworks. Different destination ranges are adopted based on a study of the average destination set size per multicast packet. In our simulations, the synthetic traffic is assumed to consist of 10% multicast and 90% unicast packets on average. Netrace is used to generate trace files of workloads from the PARSEC benchmark suite. Netrace models a 64-core target system with in-order 2 GHz Alpha ISA cores, 32KB L1 I/D caches, and the MESI coherence protocol. The packet format for the simulation is shown in Fig. 5, where the flit field identifies a head, body, or tail flit. The packet field identifies a multicast or unicast packet. The routing field indicates the dual-path or multiple unicast routing algorithm, as determined in Definition 2. The header flit includes a source node ID and a destination ID, followed by the multicast destinations in bit-string format. A body or tail flit contains only the flit type identifier and the payload.
|Parameter|Value|
|---|---|
|Buffer Depth|4 flits|
|Packet Size|4 flits/packet|
|Multicast Packet Portion|10%|
|Multicast Destination Range|2-5, 4-8, 7-10, 10-16|
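The bit-string encoding of multicast destinations mentioned for the header flit can be sketched as a simple bitmask. The mapping of a node at (x, y) to bit position y * n + x and the function names are assumptions made for illustration; the actual field layout is defined by Fig. 5.

```python
def encode_destinations(dests, n):
    """Pack a set of (x, y) destinations into an n*n-bit integer mask."""
    mask = 0
    for x, y in dests:
        mask |= 1 << (y * n + x)  # assumed node-ID mapping
    return mask


def decode_destinations(mask, n):
    """Recover the (x, y) destinations whose bits are set in the mask."""
    return [(i % n, i // n) for i in range(n * n) if (mask >> i) & 1]


if __name__ == "__main__":
    mask = encode_destinations([(0, 0), (1, 0), (7, 7)], 8)
    print(format(mask, "064b"))
    print(decode_destinations(mask, 8))
```

For the 8x8 mesh used in the simulation this gives a fixed 64-bit destination field regardless of how many destinations a packet has, which keeps header parsing simple at the cost of header size.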
Fig. 6 shows the simulation results for the average packet latency of the 8x8 mesh network under synthetic traffic. Four schemes are compared: multiple unicast (MU), in which a multicast packet is delivered as multiple unicast packets, MP, NMP, and DPM. As the destination range increases from 2-5 to 10-16, all the routing algorithms saturate earlier, mainly because of the increased congestion in the network. The proposed algorithm has better average packet latency in all destination ranges. The improvement is mainly due to the dynamic partition merging, which regroups destinations to reduce the routing cost; saving delivery hops leads to comparatively less network congestion.
The dynamic power consumption of all four algorithms is measured under synthetic traffic, and the percentage improvement of the latter three algorithms over MU at the saturation point of MU is shown in Fig. 7. DPM consumes less power than the MU algorithm at the saturation point in the reported destination ranges. The other two algorithms show a similar trend of increasing improvement in power consumption as the destination range grows. Similarly, on average, DPM consumes less power than both NMP and MP across all four destination ranges.
The average packet latency under the PARSEC benchmark traces is shown in Fig. 8, where multipath is used as the baseline for the benchmark evaluation. For each benchmark, the trace file of the region-of-interest section is extracted for the simulation, and the performance improvement of NMP and DPM over MP is measured. For the average latency in Fig. 8a, the proposed algorithm shows an improvement in fluidanimate, and NMP shows an improvement in the canneal and swaptions workloads. Similarly, for the total power consumption, the proposed algorithm shows an improvement in fluidanimate, while NMP shows an improvement in swaptions and blackscholes. The variation in delay and power performance across the benchmark traces is due to the variation in the destination set size and multicast percentage in the traces.
V Conclusion
In this paper, a dynamic partition merging (DPM)-based multicast routing algorithm is proposed for NoCs. The destination set partition problem is first modeled as the exact weighted set cover problem. The DPM algorithm solves the problem by comparing the different options for merging partitions and selecting the option with the most saving in routing cost. Compared with the static partition strategy, DPM finds partitions with lower routing cost and hence helps reduce network latency and power consumption. The routing algorithm inside each partition is selected from dual-path and multiple unicast routing. The simulation results show that, for synthetic traffic, the proposed algorithm achieves lower average packet latency, saturates later than the existing algorithms, and consumes less power. Similarly, for PARSEC benchmark workloads, a latency improvement of up to 23% and a power improvement of up to 14% are observed. In our future work, we will extend the DPM-based multicast routing algorithm to 3D NoCs and evaluate it on machine-learning benchmarks.
References
-  L. Benini and G. De Micheli, “Networks on chips: a new SoC paradigm,” Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.
-  H. Xu, P. K. McKinley and L. M. Ni, “Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers,” in Proc. 12th Intl. Conf. Distrib. Comput. Syst., 1992, pp. 118-125.
-  N. E. Jerger, L. Peh and M. Lipasti, “Virtual circuit tree multicasting: a case for on-chip hardware multicast support,” in Proc. Intl. Symp. Comput. Archit., 2008, pp. 229-240.
-  A. Karkar, T. Mak, K. Tong and A. Yakovlev, “A survey of emerging interconnects for on-chip efficient multicast and broadcast in many-cores,” IEEE Circuits and Syst. Mag., vol. 16, no. 1, pp. 58-72, 2016.
-  B. K. Joardar, J. R. Doppa, P. P. Pande, D. Marculescu and R. Marculescu, “Hybrid on-chip communication architectures for heterogeneous manycore systems,” in Proc. Intl. Conf. Comput.-Aided Design (ICCAD), 2018, pp. 1-6.
-  W. Choi et al., “On-chip communication network for efficient training of deep convolutional networks on heterogeneous manycore systems,” IEEE Trans. Comput., vol. 67, no. 5, pp. 672-686, May 2018.
-  S. Yasaman et al., “Flow mapping and data distribution on mesh-based deep learning accelerator,” in Proc. 13th IEEE/ACM Intl. Symp. Networks-on-Chip, 2019.
-  M. P. Malumbres, J. Duato and J. Torrellas, “An efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors,” in Proc. 8th IEEE Symp. Parallel and Distrib. Process., 1996, pp. 186-189.
-  D. R. Kumar, W. A. Najjar and P. K. Srimani, “A new adaptive hardware tree-based multicast routing in k-ary n-cubes,” IEEE Trans. Comput., vol. 50, no. 7, pp. 647-659, July 2001.
-  X. Lin, P. K. McKinley and L. M. Ni, “Deadlock-free multicast wormhole routing in 2-D mesh multicomputers,” IEEE Trans. Parallel and Distrib. Syst., vol. 5, no. 8, pp. 793-804, Aug. 1994.
-  L. M. Ni and P. K. McKinley, “A survey of wormhole routing techniques in direct networks,” Computer, vol. 26, no. 2, pp. 62-76, Feb. 1993.
-  L. Wang, Y. Jin, H. Kim and E. J. Kim, “Recursive partitioning multicast: a bandwidth-efficient routing for networks-on-chip,” in Proc. 3rd Intl. Symp. Networks-on-Chip, 2009, pp. 64-73.
-  F. A. Samman, T. Hollstein and M. Glesner, “Planar adaptive network-on-chip supporting deadlock-free and efficient tree-based multicast routing method,” Microprocessors and Microsystems, vol. 36, no. 6, pp. 449-461, 2012.
-  X. Wang et al., “Efficient multicast schemes for 3-D networks-on-chip,” J. Syst. Archit., vol. 59, no. 9, pp. 693-708, 2013.
-  F. A. Samman, T. Hollstein and M. Glesner, “Adaptive and deadlock-free tree-based multicast routing for networks-on-chip,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 7, pp. 1067-1080, July 2010.
-  T. H. Vu and A. B. Abdallah, “Low-latency k-means based multicast routing algorithm and architecture for three dimensional spiking neuromorphic chips,” in Proc. Intl. Conf. Big Data and Smart Comput. (BigComp), 2019, pp. 1-8.
-  R. V. Boppana, S. Chalasani and C. S. Raghavendra, “Resource deadlocks and performance of wormhole multicast routing algorithms,” IEEE Trans. Parallel and Distrib. Syst., vol. 9, no. 6, pp. 535-549, June 1998.
-  M. Ebrahimi et al., “An efficient dynamic multicast routing protocol for distributing traffic in NOCs,” in Proc. Des., Automat. & Test Europe Conf. & Exhib., 2009, pp. 1064-1069.
-  P. Bahrebar and D. Stroobandt, “The hamiltonian-based odd-even turn model for adaptive routing in interconnection networks,” in Proc. Intl. Conf. Reconfigurable Comput. and FPGAs (ReConFig), 2013, pp. 1-6.
-  M. Ebrahimi, M. Daneshtalab, P. Liljeberg and H. Tenhunen, “HAMUM - a novel routing protocol for unicast and multicast traffic in MPSoCs,” in Proc. 18th Euromicro Conf. Parallel, Distrib. and Network-based Process., 2010, pp. 525-532.
-  Z. Wang, H. Gu, Y. Yang, H. Zhang and Y. Chen, “An adaptive partition-based multicast routing scheme for mesh-based networks-on-chip,” Comput. & Elect. Eng., vol. 51, pp. 235-251, 2016.
-  M. Ebrahimi, M. Daneshtalab, P. Liljeberg and H. Tenhunen, “Partitioning methods for unicast/multicast traffic in 3D NoC architecture,” in Proc. 13th IEEE Symp. Des. and Diagnostics of Electronic Circuits and Syst., 2010, pp. 127-132.
-  B. Tiwari, M. Yang, Y. Jiang and X. Wang, “Effect of hardware trojan attacks on the performance of on-chip multicast routing algorithms,” in Proc. 9th Annu. Computing and Communication Workshop and Conf. (CCWC), 2019, pp. 0623-0629.
-  V. V. Vazirani, Approximation Algorithms. Springer-Verlag, 2003.
-  X. Wang, T. Mak, M. Yang, Y. Jiang, M. Daneshtalab and M. Palesi, “On self-tuning networks-on-chip for dynamic network-flow dominance adaptation,” in Proc. 7th Intl. Symp. Networks-on-Chip (NoCS), 2013, pp. 1-8.
-  A. B. Kahng, B. Li, L. Peh and K. Samadi, “ORION 2.0: a power-area simulator for interconnection networks,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 191-196, Jan. 2012.
-  S. Ma, N. E. Jerger and Z. Wang, “Supporting efficient collective communication in NoCs,” in Proc. Intl. Symp. High-Performance Comput. Archit., 2012, pp. 1-12.
-  J. Hestness, B. Grot and S. W. Keckler, “Netrace: dependency-driven, trace-based network-on-chip simulation,” in Proc. 3rd Intl. Workshop Network on Chip Architectures (NoCArc), 2010.
-  C. Bienia and K. Li, “Parsec 2.0: a new benchmark suite for chip-multiprocessors,” in Proc. 5th Annu. Workshop Model., Benchmarking and Simul., 2009.