. Such CapsuleNets are able to encapsulate multi-dimensional features across the layers, while traditional Convolutional Neural Networks (CNNs) do not. Thus, CapsuleNets can outperform traditional CNNs in multiple tasks, such as image classification. The most evident difference is that CapsuleNets are deeper in width than in height when compared to DNNs, as each capsule incorporates the information hierarchically, thus preserving other features like position, orientation and scaling (see an overview of CapsuleNets in Section II). The data is propagated towards the output using the so-called routing-by-agreement algorithm.
Current state-of-the-art DNN accelerators have proposed energy-aware solutions for inference using traditional CNNs. To the best of our knowledge, we are the first to propose a hardware accelerator-based architecture for the complete CapsuleNets inference. Although systolic-array-based designs perform parallel matrix multiply-and-accumulate (MAC) operations with good efficiency, the existing CNN accelerators cannot compute several key operations of the CapsuleNets (i.e., squashing and routing-by-agreement) with high performance. An efficient data-flow mapping requires a direct feedback connection from the outputs of the activation unit back to the inputs of the processing elements. Such key optimizations can significantly increase the performance and reduce the memory accesses.
Our Novel Contributions:
We analyze the memory requirements and the performance in the forward pass of CapsuleNets, through experiments on a high-end GPU, which allows us to identify the corresponding bottlenecks.
We propose CapsAcc, an accelerator that can perform inference on CapsuleNets with an efficient data reuse based mapping policy.
We optimize the routing-by-agreement process at algorithm level, by skipping the first step and directly initializing the coupling coefficients.
We implement and synthesize the complete CapsAcc architecture in a CMOS technology using the ASIC design flow, and perform evaluations for performance, area and power consumption. We performed the functional and timing validation through gate-level simulations. Our results demonstrate a speed-up in the ClassCaps layer, in the Squashing operation and in the overall CapsuleNet inference, compared to a highly optimized GPU implementation.
Paper Organization: Section II summarizes the fundamental theory behind CapsuleNets and highlights the differences with traditional DNNs. In Section III, we systematically analyze the forward pass of the CapsuleNets executing on a GPU, to identify the potential bottlenecks. Section IV describes the architectural design of our CapsuleNet accelerator, for which the data-flow mapping is presented in Section V. The results are presented in Section VI.
II Background: An Overview of CapsuleNets
Sabour and Hinton et al. introduced many novelties compared to CNNs, such as the concept of capsules, the squashing activation function, and the routing-by-agreement algorithm. In this paper, since we analyze the inference process, the layers and the algorithms that are involved in the training process only (e.g., decoder, margin loss and reconstruction loss) are not discussed.
II-A CapsuleNet Architecture
PrimaryCaps: first capsule layer, with 32 channels. Each eight-dimensional (8D) capsule has 9x9 convolutional filters with stride=2.
ClassCaps: last capsule layer, with 16D capsules for each output class.
At first glance, it is evident that a capsule layer contains multi-dimensional capsules, which are groups of neurons nested inside a layer. One of the main advantages of CapsuleNets over traditional CNNs is the ability to learn the hierarchy between layers, because each capsule element is able to learn different types of information (e.g., position, orientation and scaling). Indeed, CNNs have limited model capabilities, which they try to compensate for by increasing the amount of training data (with more samples and/or data augmentation) and by applying pooling to select the most important information that will be propagated to the following layers. In capsule layers, however, the outputs are propagated towards the following layers in the form of a prediction vector, whose size is defined by the capsule dimension. A simple visualization of how a CapsuleNet works is presented in Figure 2. After the multiplication by the weight matrix, the resulting values are multiplied by the coupling coefficients, before summing together the contributions and applying the squash function. The coupling coefficients are computed and updated at run-time during each inference pass, using the routing-by-agreement algorithm (Figure 4).
II-B Squashing
The squashing is an activation function designed to fit the prediction vector efficiently. It introduces nonlinearity into an array and normalizes the outputs to a bounded range. Given the input of the capsule (or, from another perspective, the sum of the weighted prediction vectors) and its respective output, the squashing function is defined by Equation 1.
The behaviors of the squashing function and its first derivative are shown in Figure 3. Note that we have plotted the single-dimensional input function, since a multi-dimensional input version cannot be visualized in a chart. The squashing function produces a bounded output, while its first derivative follows the behavior of the red line and exhibits a peak.
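As an illustration of the squashing activation, the following minimal sketch assumes that Equation 1 matches the squashing function of the original CapsuleNet paper by Sabour et al., i.e., the input vector is rescaled by ||s||^2 / (1 + ||s||^2) along its own direction. The function names `squash` and `length` are hypothetical and floating-point arithmetic is used instead of the fixed-point representation of the accelerator.

```python
import math

def squash(s):
    """Squash a capsule input vector s, given as a list of floats.

    Assumed form (Sabour et al.): v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||).
    """
    sq_norm = sum(x * x for x in s)           # ||s||^2
    if sq_norm == 0.0:
        return [0.0] * len(s)                 # avoid dividing by a zero norm
    norm = math.sqrt(sq_norm)                 # ||s||
    scale = sq_norm / (1.0 + sq_norm) / norm  # shrink factor along s
    return [scale * x for x in s]

def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

out = squash([3.0, 4.0])
print(out, length(out))
```

Under this assumption the direction of the input is preserved, while the length of the output stays below 1 for any finite input.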
II-C Routing-by-Agreement Algorithm
The predictions are propagated across two consecutive capsule layers through the routing-by-agreement algorithm. It is an iterative process that introduces a feedback path in the inference pass. For clarity, we present the flow diagram (Figure 4) of the routing-by-agreement at software level. Note that this algorithm introduces a loop in the forward pass, because the coupling coefficients are learned during the routing, as their values depend on the current data. Thus, they cannot be considered as constant parameters learned during the training process. Intuitively, this step can cause a computational bottleneck, as demonstrated in Section III.
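The loop described above can be sketched in software as follows. This is a minimal sketch under stated assumptions: it follows the standard formulation of routing-by-agreement from Sabour et al. (softmax over the output capsules, weighted sum, squash, agreement update by dot product), it uses three iterations as a hypothetical default, and all names (`route`, `u_hat`, `b`, `c`) are illustrative rather than taken from our implementation.

```python
import math

def squash(s):
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(xs):
    m = max(xs)                                # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from input capsule i to output capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]   # routing logits, data dependent
    for it in range(iterations):
        # Coupling coefficients: softmax over the output capsules, per input capsule.
        c = [softmax(row) for row in b]
        # Weighted sum of the predictions, followed by squashing.
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update (feedback path), skipped after the last iteration.
        if it < iterations - 1:
            for i in range(n_in):
                for j in range(n_out):
                    b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Two input capsules agree on output capsule 0 and disagree on output capsule 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
v, c = route(u_hat)
print(c)
```

In this toy example the coupling coefficients shift towards the output capsule on which the predictions agree, which illustrates why they depend on the current data and cannot be stored as trained constants.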
III Motivational Analysis of CapsuleNet Complexity
In the following, we perform a comprehensive analysis to identify how CapsuleNet inference is performed on a standard GPU platform, like the one used in our experiments, i.e., the Nvidia GeForce GTX 1070 GPU (see Figure 6). First, in Section III-A, we quantitatively analyze how many trainable parameters per layer must be fed from the memory. Then, in Section III-B, we benchmark our PyTorch-based CapsuleNet implementation for the MNIST dataset to measure the performance of the inference process on our GPU.
III-A Trainable Parameters of the CapsuleNet
Figure 5 shows quantitatively how many parameters are needed for each layer. As evident, the majority of the weights belong to the PrimaryCaps layer, due to its 32 channels and 8D capsules. Even though the ClassCaps layer has fully-connected behavior, it accounts for a smaller share of the total parameters of the CapsuleNet. Finally, Conv1 and the coupling coefficients account for a very small percentage of the parameters. The detailed computation of the parameters is reported in Table I. Based on that, we make an observation valuable for designing our hardware accelerator: by considering a fixed-point weight representation, we can estimate an on-chip memory size that is large enough to contain every parameter of the CapsuleNet.
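The per-layer parameter count can be approximated with a short calculation. The sketch below is not a reproduction of Table I: every layer shape in it is an assumption taken from the original MNIST CapsuleNet of Sabour et al. (a 9x9 Conv1 with 256 output channels on a single-channel image, 6x6x32 input capsules to ClassCaps, and 10 output classes), and the weight bit-width `WEIGHT_BITS` is a hypothetical value chosen only to show how the memory estimate follows from the count.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel of a square-kernel convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels

def capsnet_params():
    # All shapes below are assumptions based on the original MNIST CapsuleNet.
    conv1 = conv_params(kernel=9, in_channels=1, out_channels=256)
    primary = conv_params(kernel=9, in_channels=256, out_channels=32 * 8)
    in_caps = 6 * 6 * 32                       # assumed number of input capsules
    n_classes = 10                             # MNIST digits
    class_caps = in_caps * n_classes * 8 * 16  # one 8x16 matrix per capsule pair
    coupling = in_caps * n_classes             # one coefficient per capsule pair
    return {"Conv1": conv1, "PrimaryCaps": primary,
            "ClassCaps": class_caps, "Coupling": coupling}

def memory_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params values, rounded up to whole bytes."""
    return (n_params * bits_per_weight + 7) // 8

WEIGHT_BITS = 8                                # hypothetical bit-width
counts = capsnet_params()
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:12s} {n:9d} ({100.0 * n / total:5.1f}%)")
print("total bytes:", memory_bytes(total, WEIGHT_BITS))
```

With these assumed shapes, PrimaryCaps dominates the count, which is consistent with the qualitative picture described for Figure 5.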
III-B Performance Analysis on a GPU
At this stage, we measure the time required for an inference pass on the GPU. The experimental setup is shown in Figure 7. Figure 8 shows the measurements for each layer. The ClassCaps layer is the computational bottleneck, because it is slower than the previous layers. To obtain more detailed results, we further analyzed the performance of each step of the routing-by-agreement (Figure 9). It is evident that the Squashing operation inside the ClassCaps layer represents the most compute-intensive operation. This analysis motivates us to spend more effort on optimizing routing-by-agreement and squashing in our CapsuleNet accelerator.
III-C Summary of Key Observations from our Analyses
The CapsuleNet inference performed on GPU is more compute-intensive than memory-intensive, because the bottleneck is represented by the squashing operation.
A massively parallel computation capability in the hardware accelerator is desirable to achieve the same or a better level of performance than the GPU for Conv1 and ClassCaps layers.
Since the overall memory required to store all the weights is quite high, the buffers located in between the on-chip memory and the processing elements are beneficial to maintain high throughput and to mitigate the latency due to on-chip memory reads.
IV Designing the CapsAcc Architecture
Following the above observations, we designed the complete CapsAcc accelerator and implemented it in hardware (RTL). The top-level architecture is shown in Figure 10, where the blue-colored blocks highlight our novel contributions over other existing accelerators for CNNs. The detailed architectures of different components of our accelerator are shown in Figure 11. Our CapsAcc architecture has a systolic array supporting a specialized data-flow mapping (see Section V), which allows us to exploit the computational parallelism for multi-dimensional matrix operations. The partial sums are stored and properly added together by the accumulator unit. The activation unit performs different activation functions, according to the requirements for each stage. The buffers (Data, Routing and Weight Buffers) are essential to temporarily store the information to feed the systolic array without accessing the data and weight memories every time. The two multiplexers in front of the systolic array introduce the flexibility to process new data or reuse them, according to the data-flow mapping. The control unit coordinates all the accelerator operations, at each stage of the inference.
IV-A Systolic Array
The systolic array of our CapsAcc architecture is shown in Figure 11(a). It is composed of a 2D array of Processing Elements (PEs), arranged in rows and columns. For illustration and space reasons, Figure 11(a) presents a reduced version, while in our actual CapsAcc design we use a larger systolic array. The inputs are propagated towards the outputs of the systolic array both horizontally (Data) and vertically (Weight, Partial sum). In the first row, the inputs corresponding to the Partial sums are zero-valued, because each sum at this stage is equal to zero. Meanwhile, the Weight outputs in the last row are not connected, because they are not used in the following stages.
Figure 11(b) shows the data path of a single Processing Element (PE). It has three inputs and three outputs: Data, Weight and Partial sum, respectively. The core of the PE is composed of the sequence of a multiplier and an adder. As shown in Figure 11(b), it has four internal registers: (1) Data Reg. to store and synchronize the Data value coming from the left; (2) Sum Reg. to store the Partial sum before sending it to the neighbor PE below; (3) the first Weight Reg. to synchronize the vertical transfer; (4) the second Weight Reg. to store the value for data reuse. The latter is particularly useful for convolutional layers, where the same weight of the filter must be convolved across different data. For fully-connected computations, the second weight register introduces just one clock cycle of latency, without affecting the throughput. The bit-widths of each element have been designed as follows: (1) each PE computes the product between a fixed-point Data value and a fixed-point Weight; and (2) the sum is also represented as a fixed-point value. At full throttle, each PE produces one output per clock cycle, which also implies one output per clock cycle for every column of the systolic array.
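To make the PE behavior concrete, the following is a simplified behavioral sketch, not our RTL. It assumes registered outputs, models the four registers described above, ignores bit-widths (plain Python integers are used), and preloads the held weights directly instead of shifting them down the column. The names `ProcessingElement`, `step`, `load_weight` and `column_dot` are hypothetical.

```python
class ProcessingElement:
    """Behavioral model of one PE: a multiplier, an adder and four registers."""

    def __init__(self):
        self.data_reg = 0      # Data value passed to the PE on the right
        self.sum_reg = 0       # Partial sum passed to the PE below
        self.weight_reg = 0    # first weight register: vertical transfer
        self.weight_hold = 0   # second weight register: value kept for reuse

    def step(self, data_in, weight_in, sum_in, load_weight=False):
        """Advance one clock cycle and return the values registered before it."""
        outputs = (self.data_reg, self.weight_reg, self.sum_reg)
        if load_weight:
            self.weight_hold = weight_in       # latch the weight for reuse
        self.weight_reg = weight_in
        self.data_reg = data_in
        self.sum_reg = sum_in + data_in * self.weight_hold
        return outputs

def column_dot(weights, data):
    """Dot product computed by one column of PEs holding stationary weights."""
    pes = [ProcessingElement() for _ in weights]
    for pe, w in zip(pes, weights):
        pe.step(0, w, 0, load_weight=True)     # simplified weight preload
    result = 0
    for cycle in range(len(pes) + 1):
        sum_in = 0                             # zero partial sum for the first row
        for k, pe in enumerate(pes):
            d = data[k] if cycle == k else 0   # data is skewed by one cycle per row
            _, _, sum_in = pe.step(d, 0, sum_in)
        result = sum_in                        # output of the last PE in the column
    return result

print(column_dot([3, 4], [2, 5]))
```

The one-cycle skew between consecutive rows reflects the partial sum being registered in each PE before it moves to the neighbor below.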
IV-B Accumulator
The Accumulator unit consists of a FIFO buffer to store the Partial sums coming from the systolic array, and sum them together when needed. The multiplexer allows the choice to feed the buffer with the data coming from the systolic array or with the one coming from the internal adder of the Accumulator. We designed the Accumulator to operate on fixed-point data. Figure 11(c) shows the data path of our Accumulator. In the overall CapsAcc there are as many Accumulators as the number of columns of the systolic array.
IV-C Activation Unit
The Activation Unit follows the Accumulators. As shown in Figure 11(d), it performs different functions in parallel, while the multiplexer (placed at the bottom of the figure) selects the path to propagate the information towards the output. As for the case of the Accumulator, the figure shows only one unit, while in the complete CapsAcc architecture there is one Activation Unit for each column of the systolic array. The data values coming from the Accumulators are reduced to a narrower fixed-point value, to reduce the computations at this stage.
Note: the Rectified Linear Unit (ReLU) is a very simple function and its implementation description is omitted, since it is straightforward. This function is used for every feature of the first two layers of the CapsuleNet.
We designed the Normalization operator (Norm) with a structure similar to the Multiply-and-Accumulate operator, where, instead of a traditional multiplier, there is the Power2 operator. Its data path is shown in Figure 11(f). A register stores the partial sum and the Sqrt operator produces the output. We designed the square operator as a Look Up Table. The unit produces a valid output after a number of clock cycles that depends on the size of the array for which we want to compute the Norm. This operator is used either as it is to compute the classification prediction, or as an input for the Squashing function.
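The accumulate-then-root structure of the Norm operator can be sketched as follows. This is an illustrative model under assumptions: inputs are signed integers in two's complement rather than fixed-point values, the Power2 operator is modeled as a lookup table, `math.sqrt` stands in for the Sqrt operator, and the 6-bit width used in the example is hypothetical.

```python
import math

def make_power2_lut(bits):
    """Table mapping every two's-complement code of the given width to its square."""
    size = 1 << bits
    lut = []
    for code in range(size):
        value = code - size if code >= size // 2 else code  # decode signed value
        lut.append(value * value)
    return lut

def norm(values, lut, bits):
    """Accumulate the squares one element at a time, then take the square root."""
    mask = (1 << bits) - 1
    acc = 0                          # models the partial-sum register
    for x in values:
        acc += lut[x & mask]         # Power2 lookup replaces the multiplier
    return math.sqrt(acc)            # stands in for the Sqrt operator

BITS = 6                             # hypothetical input width
lut = make_power2_lut(BITS)
print(norm([3, -4], lut, BITS))
```

Because the elements are consumed sequentially, the number of steps before the result is available grows with the length of the input array.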
We designed and implemented the Squashing function as a Look Up Table, as shown in Figure 11(e). Looking at Equation 1, the function takes an input and its norm. The Norm input comes from its respective unit. Hence, this Norm operation is not implemented again inside the Squash unit. The LUT takes a fixed-point data value and a fixed-point norm as inputs to produce the output. We decided to limit the bit-width to reduce the computational requirements at this stage, following the analysis performed in Section III that shows the highest computational load for this operation. A valid output is produced with just one additional clock cycle compared to the Norm.
The Softmax function design is shown in Figure 11(g). First, it computes the exponential function (implemented as a Look Up Table) and accumulates the sum in a register, followed by division. Overall, given an array of elements, this block is able to compute the softmax function of the whole array in a number of clock cycles determined by the array length.
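The two phases of the Softmax unit (exponentiate and accumulate, then divide) can be sketched as below. This is a functional illustration only: `math.exp` stands in for the exponential Look Up Table, floating-point values replace the fixed-point representation, and the name `softmax_unit` is hypothetical.

```python
import math

def softmax_unit(values):
    """Two-phase softmax: accumulate the exponentials, then divide each by the sum."""
    exps = []
    total = 0.0                       # models the accumulation register
    for x in values:                  # phase 1: exponential and running sum
        e = math.exp(x)               # stands in for the exponential LUT
        exps.append(e)
        total += e
    return [e / total for e in exps]  # phase 2: division by the accumulated sum

probs = softmax_unit([0.0, 1.0, 2.0])
print(probs)
```

The division can only start once the full sum is known, so the whole array has to be traversed in the first phase before any output is produced.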
IV-D Control Unit
At each stage of the inference process, the Control Unit generates different control signals for all the components of the accelerator architecture, according to the operations needed. It is essential for the correct operation of the accelerator.
V Data-Flow Mapping
In this section, we provide the details on how to map the processing of different types of layers and operations onto our CapsAcc accelerator, in a step-by-step fashion. To feed the systolic array, we adopt the mapping policy described in Figure 13. For ease of understanding, we illustrate the process with the help of an example performing MNIST classification on our CapsAcc accelerator, which also represents our case study. Note that each stage of the CapsuleNet inference requires its own mapping scheme.
V-A Conv1 mapping
The Conv1 layer is characterized by its filter size and its number of channels. As shown in Figure 13(a), we designed the mapping row by row (A,B), and after the last row we move to the next channel (C). Figure (a) shows how the data-flow is mapped onto our CapsAcc accelerator. To perform the convolution efficiently, we hold the weight values in the systolic array to reuse the filter across different input data.
V-B PrimaryCaps mapping
Compared to the Conv1 layer, the PrimaryCaps layer has one more dimension, which is the capsule size (i.e., 8). However, we treat the 8D capsule as a convolutional layer with multiple output channels. Thus, Figure 13(b) shows that we map the parameters row by row (A,B), then move through the different input channels (C), and only at the third stage move on to the next output channel (D). This mapping procedure allows us to minimize the accumulator size, because our CapsAcc accelerator first computes the output features for the same output channel. Since this layer is convolutional, the data-flow is the same as the one in the previous layer, as reported in Figure (a).
V-C ClassCaps mapping
The mapping of the ClassCaps layer is shown in Figure 13(c). After mapping row by row (A,B), we consider input capsules and input channels as the third dimension (C), and output capsules and output channels as the fourth dimension (D).
Then, for each step of the routing-by-agreement process, we design the corresponding data-flow mapping. It is a critical phase, because a less efficient mapping can potentially have a huge impact on the overall performance.
First, we apply an algorithmic optimization on the routing-by-agreement algorithm. During the first operation, instead of initializing the softmax inputs to zero and computing the softmax on them, we directly initialize the coupling coefficients. The starting point is indicated with the blue arrow in Figure 4. With this optimization, we can skip the softmax computation at the first routing iteration. In fact, this operation is redundant, because all the inputs are equal to zero, and thus they do not depend on the current data.
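The equivalence behind this optimization can be checked with a few lines of code. The sketch below assumes that the softmax is taken over the output capsules, as in the formulation of Sabour et al., and uses 10 output capsules as a hypothetical example; the names `softmax` and `initial_coupling` are illustrative.

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def initial_coupling(n_out):
    """Directly initialized coupling coefficients: a uniform distribution."""
    return [1.0 / n_out] * n_out

N_OUT = 10                                 # hypothetical number of output capsules
via_softmax = softmax([0.0] * N_OUT)       # original first step: softmax of zeros
direct = initial_coupling(N_OUT)           # optimized first step: no softmax
print(via_softmax == direct)
```

Since the softmax of identical inputs is always the uniform distribution, the directly initialized coefficients are independent of the input data, and the first softmax evaluation can be skipped without changing the result.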
Regarding the data-flow mapping in our CapsAcc accelerator, we can identify three different data-flow scenarios during the routing-by-agreement algorithm:
First sum generation and squash: The predictions are loaded from the Data Buffer, the coupling coefficients come from the Routing Buffer, the systolic array computes the sums, the Activation Unit computes and selects Squash, and the outputs are stored back in the Routing Buffer. This data-flow is shown in Figure (b).
Update and softmax: The predictions are reused through the horizontal feedback of the architecture, the previously squashed outputs come from the Routing Buffer, the systolic array computes the updates for the softmax inputs, and the Softmax at the Activation Unit produces the coupling coefficients that are stored back in the Routing Buffer. Figure (c) shows the data-flow described above.
VI Results and Discussion
VI-A Experimental Setup
We implemented the complete design of our CapsAcc architecture in RTL (VHDL), and evaluated it for the MNIST dataset (to stay consistent with the original CapsuleNet paper). We synthesized the complete architecture in a CMOS technology library using the ASIC design flow with the Synopsys Design Compiler. We performed functional and timing validation through gate-level simulations using ModelSim, and obtained the precise area, power and performance of our design. The complete synthesis flow is shown in Figure 15, where the orange and blue colored boxes represent the inputs and the output results of our experiments, respectively.
Important Note: since our hardware design is fully functionally compliant with the original CapsuleNet design of Sabour et al., we observed the same classification accuracy. Therefore, we do not present any classification results in this paper, and only focus on the performance, area and power results, which are more relevant for an optimized hardware architecture.
VI-B Discussion on Comparative Results
The graphs shown in Figure 16 report the performance (execution time) results of the different layers of CapsuleNet inference on our CapsAcc accelerator, while Figure 17 shows the performance of every step of the routing process. Compared with the GPU performance (see Figures 8 and 9), we obtained a significant speed-up for the overall computation time of a CapsuleNet inference pass. The most notable improvements are witnessed in the ClassCaps layer and in the Squashing operation.
VI-C Detailed Area and Power Breakdown
The details and synthesis parameters for our design are reported in Table II. Table III shows the absolute values for the area and power consumption of all the components of the synthesized accelerator. Figures (b) and (a) show the area and power breakdowns, respectively, of our CapsAcc architecture. These figures show that the area and power contributions are dominated by the buffers, and the systolic array is just 1/4 of the total budget.
We presented the first CMOS-based hardware accelerator for the complete CapsuleNet inference. To achieve high performance, our CapsAcc architecture employs a flexible systolic array with several optimized data-flow patterns that enable it to fully exploit a high level of parallelism for diverse operations of the CapsuleNet processing. To efficiently use the proposed hardware design, we also optimized the routing-by-agreement algorithm without changing its functionality, thereby preserving the classification accuracy of the original CapsuleNet design of Sabour et al. Our results show a significant speed-up compared to an optimized GPU implementation. We also presented the power and area breakdown of our hardware design. Our CapsAcc provides the first proof-of-concept for realizing CapsuleNet hardware, and opens new avenues for its high-performance inference deployments.
- Y. H. Chen, J. Emer, and V. Sze. Eyeriss: A spatial architecture for energy efficient dataflow for convolutional neural networks. In ISCA, 2016.
- R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. In JMLR, 2011.
- A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. In Neural Networks, 2005.
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In ISCA, 2016.
- G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
- N. P. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. In ISCA, 2017.
- A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
- W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In HPCA, 2017.
- V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
- A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. In ISCA, 2017.
- S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NIPS, 2017.
- PyTorch framework: https://github.com/pytorch/pytorch