Deep neural networks (DNNs) have revolutionized the field of machine learning by providing unprecedented human-like performance in solving many real-world problems such as image and speech recognition. However, the training and inference of large DNNs are computationally intensive tasks, which has motivated the search for novel computing architectures targeting this application [Fleischer et al., 2018, Lu et al., 2017, Farabet et al., 2011, Wang et al., 2018]. Recent years have seen an explosion of companies developing customized hardware accelerators for DNN training and inference. Companies are motivated to develop such hardware both to accelerate their large-scale internal DNN workloads and to give access to such devices on a pay-per-use basis in the cloud. The DAWNBench benchmark for training time on ImageNet is currently topped by a system using GPUs with specialized cores designed for DNN training, and the corresponding benchmark for training cost is topped by a cloud-based workload running on TPUs, customized chips specifically designed for DNN training and inference [Jouppi et al., 2017].
Although customized accelerators are now considered mainstream in the machine-learning community, they still suffer from the inherent limitations of von Neumann computing architectures. Namely, synaptic weights must be repeatedly moved between the memory units and the compute units. This bottleneck leads many to consider alternative non-von Neumann architectures such as in-memory computing [Burr et al., 2017, Le Gallo et al., 2018, Hu et al., 2018, Ielmini and Wong, 2018, Prezioso et al., 2015, Xia and Yang, 2019]. Fundamentally, by taking advantage of a set of memristive devices organized in a crossbar array, one can leverage the physical properties of these devices via Ohm’s and Kirchhoff’s laws to perform in-place matrix-vector multiplication with $O(1)$ time complexity. As DNN training and inference are dominated by such operations, which require $O(N^2)$ time for an $N \times N$ matrix on traditional architectures, the potential for acceleration is self-evident. It was also shown recently that it is possible to achieve near-software accuracies when synaptic weights are stored in phase-change memory devices fabricated in the 90 nm technology node [Sebastian et al., 2019].
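As a minimal numerical sketch of the in-place multiplication described above, the following Python function emulates an idealized crossbar. The function name, the row/column convention and the absence of any device non-ideality (noise, drift, wire resistance) are assumptions made for illustration only; the loop merely emulates in software what the physical array would deliver on all columns at once.

```python
def crossbar_mvm(conductances, voltages):
    """Emulate the read-out of an idealized memristive crossbar.

    conductances[i][j] is the conductance of the device joining input row i
    to output column j; voltages[i] is the voltage applied to row i.
    Returns the list of column currents.
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    currents = []
    for j in range(n_cols):
        # Ohm's law gives the current through each device (G_ij * V_i);
        # Kirchhoff's current law sums these currents on the shared column wire.
        currents.append(sum(conductances[i][j] * voltages[i] for i in range(n_rows)))
    return currents
```

With weights encoded as conductances and activations as voltages, the returned currents are the entries of the matrix-vector product.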
However, the ability to perform matrix-vector multiplication in constant time, as well as the one-to-one mapping between synaptic weights and memristive devices, creates a new challenge for designers of such accelerators. If we want to perform computation in a pipelined manner, the bottleneck becomes how quickly we can communicate activations from one layer/computational memory (CM) core to another. This clearly depends on both the architecture of the neural network model itself, i.e. how the different layers are connected, as well as the topology of the communication fabric of the accelerator device, i.e. how long it takes to transfer activations between physically distinct layers. To manufacture a general-purpose inference engine, the challenge would be to design a communication topology that supports low-latency execution for a wide range of neural network architectures without introducing unnecessary redundancy, e.g. unused connections.
Contributions. We attempt to address exactly the above problem with a focus on convolutional neural network (CNN) architectures, owing to their popularity in a range of tasks and their suitability for in-memory computing: every layer executes with $O(1)$ time complexity irrespective of channel depth, and the computation of activations can be pipelined across different layers, making full use of the physically instantiated neurons. Specifically, our contributions are:
- We propose a novel graph topology (5 Parallel Prism, or 5PP) for pipelined executions of CNNs that maximally exploits the physical proximity of computational units.
- We prove theoretically that all state-of-the-art neural networks featuring feedforward, residual, Inception-style and dense connectivity are mappable onto the proposed topology.
- We present a quantitative comparison of 5PP with existing communication topologies and demonstrate its advantages in terms of latency, throughput and efficiency.
- We give a detailed example of how to map ResNet-32 onto an array of CM elements.
2 A Consolidated Graph Representation of CNNs
A CNN can be implemented in CM in a pipelined fashion if there exist communication channels that match the connectivity of the network. If this were not the case, the activations from one CM core might have to be relayed through different cores over several cycles of the execution schedule before reaching their destination. As the execution happens in a pipelined fashion, every core would then systematically stall at every cycle until the slowest data transfer is completed, severely impairing the throughput. Mathematically:
Given a communication fabric with a certain topology and the directed graph representation of a CNN, with vertices representing convolutional layers and edges representing activations directed along the direction of computation, the CNN is executable in a pipelined fashion on the fabric if there exists a homomorphism from the CNN graph to the fabric topology.
As the communication fabric will have to provide enough connection overhead to accommodate a diverse set of networks, the homomorphism will generally be injective. The definition of a homomorphism between a directed and an undirected graph is justified by the fact that, in our representation of the communication fabric, undirected edges represent bi-directional communication channels, which can be assigned either direction when the CNN is mapped onto it.
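The executability condition above can be checked mechanically once a candidate mapping is given. The sketch below is illustrative only: the function name, the edge-list encoding and the `injective` flag are assumptions, not notation from the paper. Directed CNN edges are compared against undirected fabric links, reflecting the bi-directional channels discussed above.

```python
def is_homomorphism(cnn_edges, fabric_edges, mapping, injective=True):
    """Check that `mapping` sends every CNN edge onto a fabric link.

    cnn_edges    : iterable of directed edges (u, v) between CNN layers
    fabric_edges : iterable of undirected links (a, b) between CM cores
    mapping      : dict from CNN layer to CM core
    """
    # Undirected links are stored as frozensets so that direction is ignored.
    fabric = {frozenset(edge) for edge in fabric_edges}
    if injective and len(set(mapping.values())) != len(mapping):
        return False  # two layers were placed on the same core
    return all(frozenset((mapping[u], mapping[v])) in fabric for u, v in cnn_edges)
```

For instance, a three-layer block with a skip connection maps onto a triangle of cores but not onto three cores connected as a path, because the skip edge has no matching link.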
Here we present one such graph representation of CNNs. Typically, CNN architectures are represented as directed graphs, with vertices representing convolutional layers and directed edges representing activations directed toward their next layers. Nevertheless, these representations are not always consistent with the physical implementation of the network: for example, input images or concatenation operations are sometimes represented as vertices of the network, and pooling operations are at times merged with convolutions in the same vertex and at times represented separately. To consolidate these representations, we present five design rules:
R1. Vertices are identified solely with convolutional layers. Any other operation, such as pooling or addition for the residual path, is treated as pre- or post-processing of the convolution.
R2. We do not distinguish between input and output of a vertex. Vertices are either connected or not connected. This is illustrated in Fig. 1a.
R3. Edges that may make the graph non-simple are removed. This is illustrated for the ResNet architecture in Fig. 1b.
R4. Concatenation does not imply any operation on the data, thus it cannot be represented as a vertex in the graph. Given this assumption, the concatenation of m vertices being fed into n others is equivalent to a complete bipartite graph as in Fig. 1c.
R5. A series-parallel (s-p) graph such as Inception can have parallel paths with latency much smaller than that of the critical path (see Fig. 1d, with the nodes laid out in order of latency and the critical path in red). The activations of such paths are unable to reach their physical destination unless they hop to a neighboring vertex while the critical path is executing. We require that such activations hop to their nearest temporally subsequent node (green arrows in Fig. 1d).
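Two of the design rules above lend themselves to a short illustration: the removal of edges that would make the graph non-simple, and the replacement of a concatenation node by a complete bipartite graph. The helper names and the edge-list encoding below are hypothetical and serve only as a sketch of how a network description could be consolidated.

```python
from itertools import product


def concatenation_edges(sources, destinations):
    """Replace a concatenation node by a complete bipartite graph:
    every source layer is connected to every destination layer."""
    return list(product(sources, destinations))


def make_simple(edges):
    """Drop self-loops and parallel edges, ignoring edge direction,
    so that the consolidated graph is simple."""
    seen = set()
    simple = []
    for u, v in edges:
        key = frozenset((u, v))
        if u == v or key in seen:
            continue
        seen.add(key)
        simple.append((u, v))
    return simple
```

A residual connection that duplicates a feedforward edge between the same two vertices is thereby collapsed into a single edge.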
3 A New Communication Fabric: 5 Parallel Prism
Here we propose a topology that can be thought of as a graph arising from the coalescence of multiple smaller graphs, referred to as “unit graphs”. Figure 2 portrays the unit graph as an undirected complete graph $K_6$. The physical implementation of the unit graph in Fig. 2 shows how we maximize the use of spatial proximity in the interconnection of CM units with non-negligible physical size: the basic communication infrastructure spans a 2-by-3 neighborhood, which is the arrangement of six computational units on a mesh grid that yields the shortest maximum intra-unit distance. The construction of a 5PP with a given number of vertices starts from a disjoint union of unit graphs and is described in pseudocode in Algorithm 1, with the resulting topology depicted in Fig. 3. In the depiction of the overall topology, all diagonal edges of the prisms are omitted for clarity. The construction uses a vertex identification operation that keeps the graph simple, as described in Eq. 1, where $G$ is the graph to which the operation is applied and $V(G)$ and $E(G)$ are the sets of its vertices and edges.
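The construction can be sketched in a few lines of Python under explicit assumptions. Here vertices are numbered column by column on a two-row grid, each unit graph is a complete graph on three consecutive columns, and neighboring unit graphs share two columns (four vertices). This overlap is our reading of the eight-layer example in Appendix A.2 and is not a transcription of Algorithm 1; the function names are hypothetical.

```python
from itertools import combinations


def build_pp(num_columns, unit_columns=3):
    """Edge set of a prism-like fabric on a 2 x num_columns grid.

    Vertex 2*c is the first core of column c and vertex 2*c + 1 the second.
    unit_columns=3 corresponds to unit graphs with six vertices.
    """
    edges = set()
    for start in range(max(1, num_columns - unit_columns + 1)):
        cols = range(start, min(start + unit_columns, num_columns))
        block = [2 * c + r for c in cols for r in (0, 1)]
        # Complete graph on the block; taking the set union merges shared
        # vertices and edges, so the resulting graph stays simple.
        edges.update(frozenset(pair) for pair in combinations(block, 2))
    return edges


def earlier_neighbours(edges, v):
    """Vertices with a smaller index that share a link with v."""
    return sorted(u for e in edges if v in e for u in e if u < v)
```

Under these assumptions, the first core of a column is linked to four earlier cores and the second core to five, which agrees with the 5(4) counts used in Section 4 and Appendix A.2.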
4 Mapping CNN architectures onto 5PP
Here we establish the existence of a homomorphism between the consolidated graph representation of four different state-of-the-art CNN architectures and the 5PP topology. The process of verifying the existence of the homomorphism is often referred to as H-coloring [Hell and Nešetřil, 1990]. We propose an iterative H-coloring method that verifies the H-coloring of the vertices of the directed CNN graph in their order from initial to final. That is, at the $i$-th iteration, the $i$-th vertex of the CNN graph is colored on a vertex of the 5PP such that there are enough edges to connect it to all the previous vertices it is connected to. The homomorphism is proven if all vertices of the CNN graph have been successfully H-colored onto the 5PP at the last iteration. Generally speaking, deciding whether a homomorphism exists between two arbitrary graphs is NP-hard whenever the target graph is non-bipartite. Leveraging the regularity of graphs representing CNNs, we present some properties of the 5PP that facilitate verification of the existence of a homomorphism with the topologies of state-of-the-art CNNs.
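The iterative procedure can be illustrated with a small backtracking search. This is a sketch under stated assumptions rather than the method used in the proofs: the function name and data encoding are hypothetical, the CNN vertices are assumed to be listed so that every layer appears after its predecessors, an injective mapping is sought, and the worst-case running time is exponential.

```python
def h_color(cnn_vertices, cnn_edges, fabric_vertices, fabric_edges):
    """Search for an injective placement of CNN layers on fabric cores.

    Layers are placed one at a time in the given order; a layer may only be
    placed on a free core that is linked to the cores of all its already
    placed predecessors. Returns a dict layer -> core, or None.
    """
    fabric = {frozenset(edge) for edge in fabric_edges}
    preds = {v: [u for u, w in cnn_edges if w == v] for v in cnn_vertices}

    def extend(i, mapping):
        if i == len(cnn_vertices):
            return dict(mapping)
        v = cnn_vertices[i]
        for core in fabric_vertices:
            if core in mapping.values():
                continue  # core already hosts another layer
            if all(frozenset((mapping[u], core)) in fabric for u in preds[v]):
                mapping[v] = core
                result = extend(i + 1, mapping)
                if result is not None:
                    return result
                del mapping[v]  # backtrack and try the next core
        return None

    return extend(0, {})
```

The regularity arguments in the remainder of this section are what make such a search unnecessary for the architectures considered.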
We define an even(odd) H-coloring as an H-coloring that colors an even(odd) number of vertices. Furthermore, we define the out-degree of a vertex v of a directed graph as the number of outgoing edges from v. Figure 4 illustrates three H-colorings that are representative of the two main properties of the 5PP:
By construction, every vertex belongs to at least one complete graph $K_6$. Given an odd (even) H-coloring of the 5PP, the most recently colored vertex has a maximum possible out-degree of 5 (4). The maximum number of vertices accessible in parallel in our topology gives it its name.
Given any H-coloring of the 5PP , one can always continue the coloring with a complete bipartite graph with . Furthermore, if the H-coloring is odd(even), the coloring can be continued with a bipartite graph with only for configurations with n and m odd(even) numbers.
4.1 Feedforward and ResNet topologies
The existence of a homomorphism for standard, pre-2014 feedforward networks is trivial: in their graph representation the connectivity is a path, and a path can always be mapped homomorphically onto the 5PP.
Residual networks (ResNet) are relatively deep networks that employ skip connections or shortcuts to jump over certain layers [He et al., 2016].
Theorem 4.1. Any ResNet architecture can be H-colored onto the 5PP.
Proof. Regarding the ResNet architecture, given R2, we treat, without loss of generality, standard residual connections as connections from the input memory of one vertex to the output of the next (see Fig. 1b), so that they merge with the feedforward connection between the two vertices. This implements the residual connection with the shortest connections possible, since it coincides with the feedforward connection between two adjacent vertices. It also simplifies the ResNet graph into a series-parallel graph in which the maximum out-degree of the vertices is 2 over three adjacent vertices; this maximum is reached for a residual path across layers with different strides, where one resampling layer is required, as shown in Fig. 1a. Note that any other implementation of the residual connections would yield the same maximum out-degree in the graph and the same number of adjacent vertices required. A complete mathematical argument is provided in Appendix A.1. ∎
4.2 DenseNet topologies
Dense connectivity as proposed by [Huang et al., 2017] sees a sequence of densely connected layers, all with the same channel depth, where every node receives as input the concatenation of all its preceding layers. The DenseNet CNN comprises dense blocks connected in series.
Theorem 4.2. Any DenseNet topology can be H-colored onto the 5PP by using the maximum number of edges possible.
Proof. To represent densely connected layers as a graph, there are fundamentally two possible representations, based on whether activations are communicated to the subsequent layers before or after the data aggregation implied by the concatenation operation. In the former case, in Fig. 5a, the activations that are communicated between layers are the output activations of each layer, and n densely connected layers are equivalent to n vertices with an edge connecting each pair of vertices: that is, a complete graph $K_n$. Physically, these edges represent channels that communicate the same number of activations. Conversely, in the latter case in Fig. 5b, the activations communicated between layers are the input and output activations of each layer and, after application of R2 and R3, in the same fashion as depicted in Fig. 1a,b, the dense connectivity streamlines into a path connectivity. Physically, these edges represent channels where the number of activations that are communicated increases linearly in the direction of execution. Although representing dense blocks as paths makes their hardware mapping trivial, it makes minimal use of the entire infrastructure and consequently channels the entire traffic onto a limited number of edges, inflating the bandwidth requirement for the single communication link. As our proposed topology is a sort of path connection of complete graphs $K_6$, we adopt both representations in Fig. 5 to distribute the communication and obtain the minimum bandwidth requirements for the physical channels. A complete mathematical argument is provided in Appendix A.2. ∎
4.3 Inception-style topologies
Inception-style architectures feature a less regular, broader spectrum of connections with respect to the previous examples, comprising a sequence of s-p graphs (“Inception blocks”) connected in series through concatenation nodes.
Theorem 4.3. Inception v1, v2, v3, v4 and Inception ResNet v1, v2 are H-colorable on the 5PP.
Proof. Inception blocks feature a source concatenation node connecting to a maximum of four parallel branches in all Inception architectures; the number of parallel branches always tapers in the direction of the destination concatenation node, and all vertices between source and destination nodes have an out-degree lower than or equal to 2. Note that in our graph representation of Inception networks, as discussed in design rule 5, we choose to hop all data from layers prior to the longest latency path through the nearest temporally subsequent layer. Although this simple criterion allows implementation of the Inception blocks of all Inception networks, data from one vertex can be hopped through any temporally subsequent vertex, and this choice can be considered a design parameter to optimize power consumption or maximum bandwidth of the links. We can prove the H-colorability of the single Inception blocks with property 3 based on these patterns. A complete mathematical argument is provided in Appendix A.3. ∎
5 Quantitative Evaluation of 5PP
This section delves into the search space defined by Algorithm 1 and explains why, among the possible complete graphs employed as unit graphs, the complete graph $K_6$ and the corresponding topology 5PP constitute the optimal choice. We also quantitatively compare 5PP with well-established communication topologies.
Evaluation metrics. The core criterion that defines optimality is the latency of a single pipeline stage. When a homomorphism cannot be established, this latency is limited by the additional computational cycles required for data movement. In Fig. 7, pipeline stage latency is measured in number of computational cycles. The bandwidth of the links is determined by the CNN layers with the greatest channel depth. We thus normalize the bandwidth requirements by the quantity $c_{\max}\, b / t_c$, with $c_{\max}$ the maximum number of channels per layer in the network, $b$ the precision of the activations in bits and $t_c$ the duration of the computational cycle.
Variations on 5PP. Figure 7a shows a comparison of these metrics between the 5PP and two topologies built in the same fashion as the 5PP from complete graphs $K_4$ (3PP) and $K_8$ (7PP), with the intent of showing the trend for topologies built on complete graphs with a lower and higher number of edges. In terms of latency, 5PP never stalls the pipeline. For topologies built on $K_n$ with $n < 6$, it is not possible to establish a homomorphism for every class of networks, resulting in an overall higher pipeline stage latency, as is the case for 3PP. Conversely, 5PP is a subgraph of any topology built on $K_n$ with $n > 6$, meaning the H-coloring still holds and the pipeline is never stalled. For DenseNet, the higher the number of edges, the more effectively one can distribute the data movement, yielding lower bandwidth requirements. Indeed, distributing the communication of activations as described in Section 4 yields the minimum bandwidth requirement: if communication of the activations requires bandwidth k, d densely connected layers can be made to communicate synchronously on a topology built on a unit graph $K_n$ with a maximum bandwidth requirement expressed through the Heaviside step function and the computational cycle. These lower requirements come at the price of instantiating multiple additional edges that have no advantage in terms of latency. In Fig. 7a, the 7PP does perform 1.5× better than 5PP solely on the bandwidth requirement for DenseNet-201, at the cost of 1.44× more physical links overall.
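The link overhead quoted for 7PP can be sanity-checked with a simple count. The sketch below assumes a two-row array in which two cores are linked whenever their columns are at most `unit_columns - 1` apart, so that 3PP, 5PP and 7PP correspond to unit graphs spanning two, three and four columns. This layout is our reading of the construction, not a transcription of Algorithm 1, and the function name is hypothetical.

```python
def link_count(num_columns, unit_columns):
    """Number of links in a two-row fabric of num_columns columns whose
    unit graph is a complete graph spanning unit_columns columns."""
    links = num_columns  # one link between the two cores of each column
    for distance in range(1, unit_columns):
        # Two cores per column give 2 x 2 links for every pair of columns
        # lying `distance` columns apart.
        links += 4 * max(0, num_columns - distance)
    return links


# Ratio of 7PP to 5PP links for a long array of cores.
ratio_7pp_to_5pp = link_count(1000, 4) / link_count(1000, 3)
```

Under this assumption the ratio approaches 13/9 ≈ 1.44 for large arrays, in line with the overhead reported above.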
Existing topologies. We now compare the performance of our topology with long-established communication topologies [Sanchez et al., 2010] on the metrics defined above. We consider 2D meshes with different aspect ratios as the prior art in communication fabric topologies. Note that mapping onto these topologies is based on the same hypotheses as mapping onto the 5PP shown in Section 3. Figure 7b gives the bar plot rendition of pipeline stage latency and bandwidth requirements measured in the same fashion as in Fig. 7a. In terms of latency, path-connected networks (AlexNet) and DenseNet in its path representation (Fig. 5b) can be executed without breaking the pipeline. Conversely, the plot shows quantitatively how the absence of an H-coloring for ResNet-32 and Inception v4 impairs execution of the network by stalling the pipeline to allow communication between non-adjacent vertices. With regard to bandwidth, thanks to the DenseNet representation described in Section 4, 5PP lowers bandwidth requirements by 4× with respect to the best 2D mesh performance. Although the better overall performance of 5PP costs a greater number of interconnections, their increase (2.31×) is significantly lower than the best factors of improvement in latency (7×) and bandwidth (4×).
6 Case study: ResNet-32 on a CM array
The physical mapping of a ResNet-32 [He et al., 2016] for the CIFAR-10 dataset on an array of CM cores is depicted in Fig. 8. The network features three layer levels with channel depths equal to 16, 32, and 64, whereas the input image size is 32 × 32 pixels. The CM cores are represented as boxes with their associated input (left) and output (bottom) memory. Each core also comprises modest digital processing capability to scale the crossbar outputs, e.g. to apply the batch normalization, to implement the activation functions and to perform the residual addition. We assume a memristive crossbar size of 576 × 576 per CM core, meaning that the vector-matrix multiplication with a matrix of size 576 × 576 can be performed in one computational cycle, which is assumed to be 100 ns. During execution, the memristive elements of the crossbar hold one convolutional weight each, whereas the input memory stores the pixel neighborhood required for the convolution. The result of the convolution, that is, a single activation across the entire channel depth, is stored in the output memory. We assume 8-bit precision for activations, which has proved sufficient to achieve state-of-the-art accuracy for the CIFAR-10 dataset [Shafiee et al., 2016].
As mentioned in Section 2, the communication channels can deliver data either to the input memory (to communicate the standard feedforward activations) or to the output memory (to communicate the activations used for residual addition). The dataflow occurs row-wise starting from the core located at the top left. Whenever one set of activations is computed by a core, it is communicated to the core assigned to compute the subsequent layer and stored in its input memory. Computation on one core begins when it has received sufficient activations to perform the convolution operation corresponding to its own assigned layer. The dataflow and network/dataset specifications define the memory and bandwidth requirements. The memory requirement is defined by the minimum number of activations to be stored per feature map. The proposed topology ensures that the pipeline is never stalled irrespective of the bandwidth. However, the most efficient implementation would have sufficient bandwidth for all data transfer to occur in parallel with the computation cycle. Conversely, if the bandwidth is not sufficient, a constant communication overhead must be added to each computational cycle. In the former case, we estimate the required bandwidth per channel to be approximately 5 Gbps, which is state-of-the-art for on-chip links [Sacco et al., 2017]. Note that the link that delivers activations from both the feedforward and residual paths would be physically implemented as two separate channels.
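The order of magnitude of the bandwidth estimate can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes that one activation vector of full channel depth is sent per channel per computational cycle and that the convolutions use 3 × 3 kernels; both are illustrative assumptions, and the function names are hypothetical.

```python
def crossbar_rows(kernel_height, kernel_width, input_channels):
    """Crossbar rows needed to hold one unrolled convolution kernel
    across all input channels."""
    return kernel_height * kernel_width * input_channels


def link_bandwidth_gbps(channels, bits_per_activation, cycle_ns):
    """Bandwidth needed to send one activation per channel every cycle.
    Bits per nanosecond are numerically equal to Gbit/s."""
    bits_per_cycle = channels * bits_per_activation
    return bits_per_cycle / cycle_ns


# Deepest ResNet-32 level: 64 channels, 8-bit activations, 100 ns cycle.
estimate = link_bandwidth_gbps(64, 8, 100)
rows = crossbar_rows(3, 3, 64)
```

With these numbers the estimate comes to 5.12 Gbit/s, consistent with the figure of approximately 5 Gbps, and a 3 × 3 kernel over 64 input channels occupies 576 crossbar rows.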
We introduced 5 Parallel Prism (5PP), an interconnection topology for executing CNNs on an array of CM cores. We then proved the executability of ResNet, Inception, and DenseNet networks on the proposed communication fabric by proving the existence of a homomorphism between a consolidated graph representation of the CNNs and 5PP. Moreover, we validated the efficacy of our approach by comparing the proposed topology to various 2D meshes on the metrics of inference latency as well as bandwidth requirements per communication channel. Finally, we provided a case study with the physical mapping of ResNet-32 on an array of CM cores. The presented work is a significant step towards developing general-purpose DNN accelerators based on in-memory computing.
- Burr et al.  G. W. Burr et al. Neuromorphic computing using non-volatile memory. Advances in Physics: X, 2(1):89–124, 2017.
- Farabet et al.  C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun. NeuFlow: A runtime reconfigurable dataflow processor for vision. In CVPR Workshops, pages 109–116, 2011.
- Fleischer et al.  B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan, J. Choi, S. Mueller, A. Agrawal, T. Babinsky, et al. A scalable multi-teraOPS deep learning processor core for AI training and inference. In Proceedings of the IEEE Symposium on VLSI Circuits, pages 35–36. IEEE, 2018.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
- Hell and Nešetřil  P. Hell and J. Nešetřil. On the complexity of H-coloring. Journal of Combinatorial Theory, Series B, 48(1):92–110, 1990.
- Hu et al.  M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R. S. Williams, J. J. Yang, et al. Memristor-based analog computation and neural network classification with a dot product engine. Advanced Materials, 30(9):1705914, 2018.
- Huang et al.  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
- Ielmini and Wong  D. Ielmini and H.-S. P. Wong. In-memory computing with resistive switching devices. Nature Electronics, 1(6):333, 2018.
- Jouppi et al.  N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12. ACM, 2017.
- Le Gallo et al.  M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou. Mixed-precision in-memory computing. Nature Electronics, 2018.
- Lu et al.  W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 553–564. IEEE, 2017.
- Prezioso et al.  M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature, 521(7550):61, 2015.
- Sacco et al.  E. Sacco, P. A. Francese, M. Brändli, C. Menolfi, T. Morf, A. Cevrero, I. Ozkaya, M. Kossel, L. Kull, D. Luu, et al. A 5Gb/s 7.1 fJ/b/mm 8× multi-drop on-chip 10mm data link in 14nm FinFET CMOS SOI at 0.5 V. In 2017 Symposium on VLSI Circuits, pages C54–C55. IEEE, 2017.
- Sanchez et al.  D. Sanchez, G. Michelogiannakis, and C. Kozyrakis. An analysis of on-chip interconnection networks for large-scale chip multiprocessors. ACM Transactions on Architecture and Code Optimization (TACO), 7(1):4, 2010.
- Sebastian et al.  A. Sebastian, I. Boybat, M. Dazzi, I. Giannopoulos, et al. Computational memory-based inference and training of deep neural networks. In Proceedings of the IEEE Symposium on VLSI Circuits, 2019.
- Shafiee et al.  A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 44(3):14–26, 2016.
- Wang et al.  N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In Advances in neural information processing systems, pages 7675–7684, 2018.
- Xia and Yang  Q. Xia and J. J. Yang. Memristive crossbar arrays for brain-inspired computing. Nature materials, 18(4):309, 2019.
Appendix A Homomorphism Proofs
A.1 Proof of Theorem 4.1
Let a graph represent a ResNet topology according to the rules in Section 2. Such a graph is a series connection of paths (R3, Fig. 1b) and resampling layers (R2, Fig. 1a), in that order. We prove the existence of a homomorphism by H-coloring the two elements in sequence. Any path can be iteratively H-colored on the 5PP as shown in Section 4.1, ending with a certain parity. Then the resampling layer is H-colored, which is equivalent to coloring a complete graph $K_3$, represented in Fig. 1a by its three vertices. Regardless of the parity of the coloring before the resampling layer, according to P1 there are always three uncolored vertices on which to color one $K_3$; there are also always enough edges, as $K_3$ is a subgraph of $K_6$ for any three vertices belonging to $K_6$.
A.2 Proof of Theorem 4.2
Let a graph represent an n-layer DenseNet topology. Assume there exists a homomorphism that maps its layers to a sequence of vertices on 5PP using all edges, ordered as in Fig. 3. If the number of densely connected vertices is $n \le 6$, they all belong to the same $K_6$, and the H-coloring follows from Fig. 5a. If the number of densely connected layers is $n > 6$, the complete connection of all individual instances of $K_6$ is used, but there are vertices that do not belong to the same $K_6$. For example, for eight layers, the first two vertices are connected through a complete graph to the following four, but not to the last two. The communication of activations from the first two vertices to the last two is thus distributed among the communication from the four vertices belonging to the same $K_6$ as the last two, as described in Fig. 5b. In general, for an arbitrary number n of densely connected layers, the $i$-th vertex, where $i$ is an even (odd) number, is able to communicate with the previous 5 (4) vertices through the communication fabric. The edges between these 5 (4) vertices and the $i$-th vertex communicate the output activations of the previous 5 (4) layers, plus the activations from the earlier vertices that are not directly connected to it.
A.3 Proof of Theorem 4.3
Let a graph represent one Inception topology. Inception topologies are made up of Inception blocks connected by concatenation nodes. Inception blocks are s-p graphs with four or fewer parallel branches and vertices with an out-degree of at most 2. We first prove the coloring of the individual Inception blocks. From P1, every vertex in a 5PP belongs to at least one $K_6$. Given a maximum of four parallel branches, they are always colorable on a plane whose vertices all belong to a single $K_6$. As these vertices can be chosen to belong to the same $K_6$, every vertex will have a possible out-degree at least equal to the number of uncolored vertices in that $K_6$, which is equal to 2. What remains to be proven is the colorability of the concatenation nodes through which the blocks are connected. From P3, complete bipartite graphs satisfying the condition stated there are always colorable. This condition applies to all concatenations in Inception architectures, which are therefore mappable.