I Introduction
Machine Learning (ML) algorithms are widely used for Internet of Things and Artificial Intelligence applications, such as computer vision [7], speech recognition [3] and natural language processing [2]. Deep Neural Networks (DNNs) have reached state-of-the-art accuracy compared to other ML algorithms. Recently, Sabour et al. [12] proposed the Dynamic Routing algorithm to efficiently perform training and inference on CapsuleNets [5]. CapsuleNets are able to encapsulate multi-dimensional features across the layers, while traditional Convolutional Neural Networks (CNNs) are not. Thus, CapsuleNets can beat traditional CNNs in multiple tasks, like image classification, as shown in [12]. The most evident difference is that CapsuleNets are deeper in width than in height when compared to DNNs, as each capsule incorporates the information hierarchically, thus preserving features like position, orientation and scaling (see an overview of CapsuleNets in Section II). The data is propagated towards the output using the so-called routing-by-agreement algorithm.

Current state-of-the-art DNN accelerators [4][1][6][11][9] propose energy-aware solutions for inference with traditional CNNs. To the best of our knowledge, we are the first to propose a hardware accelerator-based architecture for the complete CapsuleNet inference. Although systolic-array-based designs like [9] perform parallel matrix multiply-and-accumulate (MAC) operations with good efficiency, the existing CNN accelerators cannot compute several key operations of CapsuleNets (i.e., squashing and routing-by-agreement) with high performance. An efficient dataflow mapping requires a direct feedback connection from the outputs of the activation unit back to the inputs of the processing elements. Such key optimizations can greatly increase the performance and reduce the memory accesses.
Our Novel Contributions:

We analyze the memory requirements and the performance of the forward pass of CapsuleNets, through experiments on a high-end GPU, which allows us to identify the corresponding bottlenecks.

We propose CapsAcc, an accelerator that can perform inference on CapsuleNets with an efficient data-reuse-based mapping policy.

We optimize the routing-by-agreement process at the algorithm level, by skipping the first step and directly initializing the coupling coefficients.

We implement and synthesize the complete CapsAcc architecture in a CMOS technology using the ASIC design flow, and perform evaluations of performance, area and power consumption. We performed functional and timing validation through gate-level simulations. Our results demonstrate significant speedups in the ClassCaps layer, in the Squashing operation and in the overall CapsuleNet inference, compared to a highly optimized GPU implementation.
Paper Organization: Section II summarizes the fundamental theory behind CapsuleNets and highlights the differences with traditional DNNs. In Section III, we systematically analyze the forward pass of the CapsuleNets executing on a GPU, to identify the potential bottlenecks. Section IV describes the architectural design of our CapsuleNet accelerator, for which the dataflow mapping is presented in Section V. The results are presented in Section VI.
II Background: An Overview of CapsuleNets
Sabour et al. [12] introduced many novelties compared to CNNs, such as the concept of capsules, the squashing activation function, and the routing-by-agreement algorithm. In this paper, since we analyze the inference process, the layers and the algorithms that are involved only in the training process (e.g., decoder, margin loss and reconstruction loss) are not discussed.

II-A CapsuleNet Architecture
Figure 1 illustrates the CapsuleNet architecture [12] designed for the MNIST [8] dataset. It consists of 3 layers:

Conv1: a traditional convolutional layer with ReLU activation.

PrimaryCaps: the first capsule layer, with 32 channels. Each eight-dimensional (8D) capsule has 9×9 convolutional filters with stride 2.

ClassCaps: the last capsule layer, with one 16D capsule for each output class.
At first glance, it is evident that a capsule layer contains multi-dimensional capsules, which are groups of neurons nested inside a layer. One of the main advantages of CapsuleNets over traditional CNNs is the ability to learn the hierarchy between layers, because each capsule element is able to learn different types of information (e.g., position, orientation and scaling). Indeed, CNNs have limited model capabilities, which they try to compensate for by increasing the amount of training data (with more samples and/or data augmentation) and by applying pooling to select the most important information to propagate to the following layers. In capsule layers, however, the outputs are propagated towards the following layers in the form of a prediction vector, whose size is defined by the capsule dimension. A simple visualization of how a CapsuleNet works is presented in Figure 2. After the multiplication by the weight matrix, the values are multiplied by the coupling coefficients before summing the contributions together and applying the squash function. The coupling coefficients are computed and updated at runtime during each inference pass, using the routing-by-agreement algorithm (Figure 4).

II-B Squashing
The squashing is an activation function designed to efficiently fit the prediction vector. It introduces the non-linearity into an array and normalizes the outputs to values between 0 and 1. Given the input of a capsule (or, from another perspective, the sum of the weighted prediction vectors) and its respective output, the squashing function is defined by Equation 1.
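For reference, Equation 1, i.e., the squashing function of [12], maps the total input s_j of capsule j to its output vector v_j:

```latex
v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \cdot \frac{s_j}{\lVert s_j \rVert}
```

The first factor shrinks short vectors towards zero and saturates long vectors towards unit length, while the second factor preserves the direction of s_j.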
The behaviors of the squashing function and its first derivative are shown in Figure 3. Note that we have plotted the single-dimensional input version, since a multi-dimensional input version cannot be visualized in a chart. The squashing function produces an output bounded between 0 and 1, while its first derivative follows the behavior of the red line, with a peak near the origin.
II-C Routing-by-Agreement Algorithm
The predictions are propagated across two consecutive capsule layers through the routing-by-agreement algorithm. It is an iterative process that introduces a feedback path in the inference pass. For clarity, we present the flow diagram (Figure 4) of the routing-by-agreement at software level. Note that this algorithm introduces a loop in the forward pass, because the coupling coefficients are learned during the routing, as their values depend on the current data. Thus, they cannot be considered constant parameters learned during the training process. Intuitively, this step can cause a computational bottleneck, as demonstrated in Section III.
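A minimal Python sketch of this loop (plain lists and illustrative dimensions; the update rule follows [12]) makes the feedback path explicit: the coupling coefficients are recomputed from the current data at every iteration, so they cannot be precomputed offline.

```python
import math

def squash(s):
    # Eq. 1 in [12]: scale s by ||s||^2 / (1 + ||s||^2) and normalize direction.
    n2 = sum(x * x for x in s)
    n = math.sqrt(n2)
    scale = n2 / (1.0 + n2) / n if n > 0 else 0.0
    return [scale * x for x in s]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    t = sum(e)
    return [x / t for x in e]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from input capsule i to output capsule j."""
    n_in, n_out = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # routing logits
    for _ in range(iterations):
        c = [softmax(bi) for bi in b]                 # coupling coefficients
        s = [[sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
              for d in range(dim)] for j in range(n_out)]
        v = [squash(sj) for sj in s]                  # output capsules
        for i in range(n_in):                         # agreement update
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v
```

The agreement update in the last loop is exactly the data-dependent feedback that a pure feed-forward CNN accelerator cannot map efficiently.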
III Motivational Analysis of CapsuleNet Complexity
In the following, we perform a comprehensive analysis to identify how CapsuleNet inference is performed on a standard GPU platform, like the one used in our experiments, i.e., the Nvidia GeForce GTX 1070 GPU (see Figure 6). First, in Section III-A we quantitatively analyze how many trainable parameters per layer must be fed from the memory. Then, in Section III-B we benchmark our PyTorch [13] based CapsuleNet implementation for the MNIST dataset to measure the performance of the inference process on our GPU.

III-A Trainable Parameters of the CapsuleNet

Figure 5 shows quantitatively how many parameters are needed for each layer. Evidently, the majority of the weights belong to the PrimaryCaps layer, due to its 32 channels and 8D capsules. Even though the ClassCaps layer has a fully-connected behavior, it accounts for only a small fraction of the total parameters of the CapsuleNet. Finally, Conv1 and the coupling coefficients account for a very small percentage of the parameters. The detailed computation of the parameters is reported in Table I. Based on that, we make an observation valuable for designing our hardware accelerator: by considering a fixed-point weight representation, we can estimate that an on-chip memory of a few megabytes is large enough to contain every parameter of the CapsuleNet.

III-B Performance Analysis on a GPU
At this stage, we measure the time required for an inference pass on the GPU. The experimental setup is shown in Figure 7, and Figure 8 shows the measurements for each layer. The ClassCaps layer is the computational bottleneck, because it is considerably slower than the previous layers. To obtain more detailed results, a further analysis has been performed on the performance of each step of the routing-by-agreement (Figure 9). It is evident that the Squashing operation inside the ClassCaps layer represents the most compute-intensive operation. This analysis motivates us to spend more effort in optimizing the routing-by-agreement and squashing in our CapsuleNet accelerator.
III-C Summary of Key Observations from our Analyses
From the analyses performed in Sections III-A and III-B, we derive the following key observations:

The CapsuleNet inference performed on the GPU is more compute-intensive than memory-intensive, because the bottleneck is represented by the squashing operation.

A massive parallel computation capability in the hardware accelerator is desirable to achieve the same or a better level of performance than the GPU for Conv1 and ClassCaps layers.

Since the overall memory required to store all the weights is quite high, buffers located between the on-chip memory and the processing elements are beneficial to maintain high throughput and to mitigate the latency of on-chip memory reads.
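As a rough cross-check of the memory observation above, a back-of-the-envelope parameter count can be done in a few lines. The layer shapes used below (256 Conv1 filters, 9×9 kernels, 32×6×6 = 1152 input capsules into ClassCaps for 10 classes) are assumptions taken from the original CapsuleNet [12], not values from our Table I.

```python
# Parameter count of the MNIST CapsuleNet inference layers (assumed shapes, see
# lead-in). Each capsule-to-capsule connection in ClassCaps carries an 8x16
# weight matrix, one per (input capsule, output class) pair.

conv1   = 9 * 9 * 1 * 256 + 256              # weights + biases
primary = 9 * 9 * 256 * (32 * 8) + (32 * 8)  # treated as a convolutional layer
classc  = 1152 * 10 * (8 * 16)               # 8x16 matrix per (i, j) pair
total   = conv1 + primary + classc

# With a fixed-point representation of one byte per weight, the whole model
# fits in a few megabytes of on-chip memory.
print(total, total / 2**20)
```

Under these assumptions the total lands below 8 MiB at one byte per parameter, consistent with the claim that a few megabytes of on-chip memory suffice.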
IV Designing the CapsAcc Architecture
Following the above observations, we designed the complete CapsAcc accelerator and implemented it in hardware (RTL). The top-level architecture is shown in Figure 10, where the blue-colored blocks highlight our novel contributions over existing accelerators for CNNs. The detailed architectures of the different components of our accelerator are shown in Figure 11. Our CapsAcc architecture has a systolic array supporting a specialized dataflow mapping (see Section V), which allows us to exploit the computational parallelism of multi-dimensional matrix operations. The partial sums are stored and properly added together by the Accumulator unit. The Activation unit performs different activation functions, according to the requirements of each stage. The buffers (Data, Routing and Weight Buffers) are essential to temporarily store information so as to feed the systolic array without accessing the data and weight memories every time. The two multiplexers in front of the systolic array introduce the flexibility to process new data or reuse them, according to the dataflow mapping. The control unit coordinates all the accelerator operations at each stage of the inference.
IV-A Systolic Array
The systolic array of our CapsAcc architecture is shown in Figure 11(a). It is composed of a 2D array of Processing Elements (PEs). For illustration and space reasons, Figure 11(a) presents a reduced version, while our actual CapsAcc design uses a larger systolic array. The inputs are propagated towards the outputs of the systolic array both horizontally (Data) and vertically (Weight, Partial sum). In the first row, the inputs corresponding to the Partial sums are zero-valued, because each sum at this stage is equal to 0. Meanwhile, the Weight outputs of the last row are not connected, because they are not used in the following stages.
Figure 11(b) shows the data path of a single Processing Element (PE). It has three inputs and three outputs: Data, Weight and Partial sum, respectively. The core of the PE is composed of a multiplier followed by an adder. As shown in Figure 11(b), it has four internal registers: (1) a Data Reg. to store and synchronize the Data value coming from the left; (2) a Sum Reg. to store the Partial sum before sending it to the neighboring PE below; (3) a first Weight Reg. to synchronize the vertical transfer; and (4) a second Weight Reg. to store the value for data reuse. The latter is particularly useful for convolutional layers, where the same weight of the filter must be convolved across different data. For fully-connected computations, the second weight register introduces just one clock cycle of latency, without affecting the throughput. The bit-widths have been designed as follows: (1) each PE computes the product between a fixed-point Data and a fixed-point Weight; and (2) the Partial sum is kept at a wider fixed-point precision. At full throttle, each PE produces one output per clock cycle, which also implies one output per clock cycle for every column of the systolic array.
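A functional (not cycle-accurate) sketch of this computation, with hypothetical helper names, shows how each column of PEs accumulates one dot product while Data flows horizontally and Partial sums flow downwards:

```python
# Functional model of the PE array: each PE multiplies the incoming Data by its
# Weight and adds the Partial sum arriving from above.

def pe(data, weight, partial_sum_in):
    return data * weight + partial_sum_in

def systolic_column(data_row, weight_col):
    """One column of PEs: the Partial sum input of the first row is 0."""
    acc = 0
    for d, w in zip(data_row, weight_col):
        acc = pe(d, w, acc)
    return acc

def systolic_matmul(data, weights):
    """data: M x K, weights: K x N -> M x N; one column of PEs per output column."""
    cols = list(zip(*weights))
    return [[systolic_column(row, col) for col in cols] for row in data]
```

In the real array these column accumulations happen in parallel and in a pipelined fashion, which is what yields one output per clock cycle per column at full throttle.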
IV-B Accumulator
The Accumulator unit consists of a FIFO buffer to store the Partial sums coming from the systolic array and to sum them together when needed. The multiplexer allows choosing whether to feed the buffer with the data coming from the systolic array or with the output of the internal adder of the Accumulator. We designed the Accumulator to operate on wide fixed-point data. Figure 11(c) shows the data path of our Accumulator. In the overall CapsAcc, there are as many Accumulators as columns of the systolic array.
IV-C Activation Unit
The Activation Unit follows the Accumulators. As shown in Figure 11(d), it computes different functions in parallel, while the multiplexer (placed at the bottom of the figure) selects which path propagates the information towards the output. As in the case of the Accumulator, the figure shows only one unit, while in the complete CapsAcc architecture there is one Activation Unit for each column of the systolic array. The wide data values coming from the Accumulators are reduced to a narrower fixed-point representation, to reduce the computations at this stage.
Note: the Rectified Linear Unit (ReLU) [10] is a very simple function and its implementation description is omitted, since it is straightforward. This function is used for every feature of the first two layers of the CapsuleNet.

We designed the Normalization operator (Norm) with a structure similar to a Multiply-and-Accumulate operator, where, instead of a traditional multiplier, there is a squaring (Power-of-2) operator. Its data path is shown in Figure 11(f). A register stores the partial sum and the Sqrt operator produces the output. We designed the square operator as a Look-Up Table. The unit produces a valid output every n clock cycles, where n is the size of the array for which we want to compute the Norm. This operator is used either as-is to compute the classification prediction, or as an input to the Squashing function.
We designed and implemented the Squashing function as a Look-Up Table (LUT), as shown in Figure 11(e). Looking at Equation 1, the function takes an input and its norm. The norm comes from the respective Norm unit; hence, the Norm operation is not implemented again inside the Squash unit. The LUT takes as inputs a fixed-point data value and a fixed-point norm to produce the output. We decided to limit the bit-widths to reduce the computational requirements at this stage, following the analysis performed in Section III, which shows the highest computational load for this operation. A valid output is produced with just one additional clock cycle compared to the Norm.
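The LUT-based evaluation can be sketched as follows; the fixed-point format and table range below are illustrative assumptions, not the exact CapsAcc bit-widths. Since Equation 1 gives v = s · ||s|| / (1 + ||s||²) element-wise, the LUT only needs to store that scalar factor, indexed by the quantized norm:

```python
FRAC_BITS = 8  # assumed fixed-point format: value = code / 2**FRAC_BITS

def to_fixed(x):
    return round(x * (1 << FRAC_BITS))

def from_fixed(code):
    return code / (1 << FRAC_BITS)

# Precompute the LUT of the scalar factor n / (1 + n^2) over an assumed
# quantized norm range [0, 16).
SQUASH_LUT = {}
for code in range(16 << FRAC_BITS):
    n = from_fixed(code)
    SQUASH_LUT[code] = to_fixed(n / (1.0 + n * n))

def squash_element(x_code, norm_code):
    # One element of the output vector: x * LUT[||s||], in fixed point.
    scale = SQUASH_LUT[norm_code]
    return (x_code * scale) >> FRAC_BITS
```

Because the norm arrives precomputed from the Norm unit, each output element costs one table lookup and one fixed-point multiply, matching the single extra clock cycle mentioned above.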
The Softmax function design is shown in Figure 11(g). First, it computes the exponential function (as a Look-Up Table) and accumulates the sum in a register, followed by a division. Overall, for an array of n elements, this block is able to compute the softmax function of the whole array in a number of clock cycles linear in n.
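The accumulate-then-divide structure can be modeled as two passes over the array (here math.exp stands in for the hardware exponential LUT, so the cycle count is only indicative):

```python
import math

def softmax_two_pass(xs):
    """Softmax as in the hardware unit: exp + accumulate, then divide."""
    exps = []
    acc = 0.0                       # the accumulation register
    for x in xs:                    # pass 1: exponential LUT + accumulate
        e = math.exp(x)
        exps.append(e)
        acc += e
    return [e / acc for e in exps]  # pass 2: division by the accumulated sum
```

Each pass touches every element once, which is why the latency of the block grows linearly with the array size.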
IV-D Control Unit
At each stage of the inference process, the Control Unit generates the control signals for all the components of the accelerator architecture, according to the operations needed. It is essential for the correct operation of the accelerator.
V Dataflow Mapping
In this section, we provide details on how to map the processing of the different types of layers and operations onto our CapsAcc accelerator, in a step-by-step fashion. To feed the systolic array, we adopt the mapping policy described in Figure 13. For ease of understanding, we illustrate the process with the example of MNIST classification on our CapsAcc accelerator, which also represents our case study. Note that each stage of the CapsuleNet inference requires its own mapping scheme.
V-A Conv1 Mapping
The Conv1 layer consists of convolutional filters applied across multiple channels. As shown in Figure 13(a), we designed the mapping row by row (A, B), and after the last row we move to the next channel (C). Figure 14(a) shows how the dataflow is mapped onto our CapsAcc accelerator. To perform the convolution efficiently, we hold the weight values inside the systolic array to reuse the filter across different input data.
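The weight-reuse idea can be illustrated with a loop-order sketch (a plain Python model, not the RTL): keeping the weight loops outermost means each filter weight is fetched once and applied to every output position while the data streams past it.

```python
# Weight-stationary 2D convolution (single channel, stride 1, no padding).
# Sizes are illustrative, not those of Conv1.

def conv2d_weight_stationary(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    # Weights on the outer loops: each (i, j) weight stays resident while it
    # contributes to every output position.
    for i in range(kh):
        for j in range(kw):
            w = kernel[i][j]
            for y in range(oh):
                for x in range(ow):
                    out[y][x] += w * image[y + i][x + j]
    return out
```

This is the software analogue of holding the weights in the second Weight register of each PE while different Data values flow through.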
V-B PrimaryCaps Mapping
Compared to the Conv1 layer, the PrimaryCaps layer has one more dimension, the capsule size. However, we treat the 8D capsules as a convolutional layer whose capsule dimensions become additional output channels. Thus, Figure 13(b) shows that we map the parameters row-by-row (A, B), then move through the different input channels (C), and only at the third stage do we move on to the next output channel (D). This mapping procedure allows us to minimize the accumulator size, because our CapsAcc accelerator first computes the output features of the same output channel. Since this layer is convolutional, the dataflow is the same as in the previous layer, as reported in Figure 14(a).
V-C ClassCaps Mapping
The mapping of the ClassCaps layer is shown in Figure 13(c). After mapping row by row (A, B), we consider the input capsules and input channels as the third dimension (C), and the output capsules and output channels as the fourth dimension (D).
Then, for each step of the routing-by-agreement process, we design the corresponding dataflow mapping. This is a critical phase, because a less efficient mapping can potentially have a huge impact on the overall performance.
First, we apply an algorithmic optimization to the routing-by-agreement algorithm. During the first operation, instead of initializing the routing logits to 0 and computing the softmax on them, we directly initialize the coupling coefficients to their uniform value. The starting point is indicated by the blue arrow in Figure 4. With this optimization, we can skip the softmax computation at the first routing iteration. In fact, this operation is a dummy, because all its inputs are equal to 0 and thus do not depend on the current data.
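In code, the skipped step amounts to replacing a softmax over all-zero logits with a direct uniform initialization (a sketch; the function names are ours):

```python
import math

def softmax(xs):
    e = [math.exp(x) for x in xs]
    t = sum(e)
    return [v / t for v in e]

def initial_coupling(n_out):
    # softmax([0, 0, ..., 0]) is exactly the uniform distribution 1/n,
    # so the first softmax pass can be skipped and the coupling
    # coefficients written directly.
    return [1.0 / n_out] * n_out
```

The two computations agree exactly, which is why the optimization preserves the functionality (and hence the accuracy) of the original algorithm.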
Regarding the dataflow mapping in our CapsAcc accelerator, we can identify three different dataflow scenarios during the routing-by-agreement algorithm:


First sum generation and squash: the predictions are loaded from the Data Buffer, the coupling coefficients come from the Routing Buffer, the systolic array computes the weighted sums, the Activation Unit selects and computes the Squash function, and the outputs are stored back in the Routing Buffer. This dataflow is shown in Figure 14(b).

Update and softmax: the predictions are reused through the horizontal feedback of the architecture, the previous outputs come from the Routing Buffer, the systolic array computes the updates for the routing logits, and the Softmax in the Activation Unit produces the new coupling coefficients, which are stored back in the Routing Buffer. Figure 14(c) shows this dataflow.

Sum generation and squash: Figure 14(d) shows the dataflow mapping for this scenario. Compared to Figure 14(b), the predictions come from the horizontal feedback link, thus exploiting data reuse in this stage as well.
VI Results and Discussion
VI-A Experimental Setup
We implemented the complete design of our CapsAcc architecture in RTL (VHDL), and evaluated it for the MNIST dataset (to stay consistent with the original CapsuleNet paper). We synthesized the complete architecture in a CMOS technology library using the ASIC design flow with the Synopsys Design Compiler. We performed functional and timing validation through gate-level simulations using ModelSim, and obtained precise area, power and performance figures for our design. The complete synthesis flow is shown in Figure 15, where the orange and blue colored boxes represent the inputs and the output results of our experiments, respectively.
Important Note: since our hardware design is fully functionally compliant with the original CapsuleNet design of [12], we observed the same classification accuracy. Therefore, we do not present any classification results in this paper, and only focus on the performance, area and power results, which are more relevant for an optimized hardware architecture.
VI-B Discussion on Comparative Results
The graphs in Figure 16 report the performance (execution time) results of the different layers of the CapsuleNet inference on our CapsAcc accelerator, while Figure 17 shows the performance of every sequence of the routing process. Compared to the GPU performance (see Figures 8 and 9), we obtained a significant speedup for the overall computation time of a CapsuleNet inference pass. The most notable improvements are witnessed in the ClassCaps layer and in the Squashing operation.
VI-C Detailed Area and Power Breakdown


The details and synthesis parameters of our design are reported in Table II. Table III shows the absolute values of the area and power consumption of all the components of the synthesized accelerator. Figures (b) and (a) show the area and power breakdowns, respectively, of our CapsAcc architecture. These figures show that the area and power contributions are dominated by the buffers, while the systolic array accounts for just 1/4 of the total budget.
VII Conclusions
We presented the first CMOS-based hardware accelerator for the complete CapsuleNet inference. To achieve high performance, our CapsAcc architecture employs a flexible systolic array with several optimized dataflow patterns that enable it to fully exploit a high level of parallelism for the diverse operations of CapsuleNet processing. To efficiently use the proposed hardware design, we also optimized the routing-by-agreement algorithm without changing its functionality, thereby preserving the classification accuracy of the original CapsuleNet design of [12]. Our results show a significant speedup compared to an optimized GPU implementation. We also presented the power and area breakdowns of our hardware design. Our CapsAcc provides the first proof-of-concept for realizing CapsuleNet hardware, and opens new avenues for its high-performance inference deployments.
References
[1] Y. H. Chen, J. Emer, and V. Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In ISCA, 2016.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
[3] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 2005.
[4] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In ISCA, 2016.
[5] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
[6] N. P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
[7] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[9] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In HPCA, 2017.
[10] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[11] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
[12] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NIPS, 2017.
[13] PyTorch framework: https://github.com/pytorch/pytorch