I. Introduction
The recently introduced concept of Capsule Networks (CapsNets) by the Google Brain team [Sabour2017dynamic_routing] has gained a significant spotlight due to its powerful new features offering high accuracy and better learning capabilities. Traditional Convolutional Neural Networks (CNNs) cannot efficiently learn the spatial relations between features in images [Sabour2017dynamic_routing]. Moreover, they make extensive use of pooling layers to reduce the dimensionality of the space, which, as a drawback, further reduces their learning capabilities; a huge amount of training data is therefore required to mitigate this deficit. CapsNets, on the other hand, exploit their novel structure, with so-called capsules and their cross-coupling learned through the dynamic routing algorithm, to overcome this problem. Capsules produce vector outputs, as opposed to the scalar outputs of CNNs [Sabour2017dynamic_routing], and this vector format enables CapsNets to learn the spatial relationships between features. For example, the Google Brain team [Sabour2017dynamic_routing] demonstrated that CNNs recognize an image where the nose is below the mouth as a "face", while CapsNets do not make such a mistake, because they have learned the spatial correlations between features (e.g., the nose must appear above the mouth). Beyond image classification, CapsNets have been successfully showcased for vehicle detection [Yu2019CapsNetDetection], speech recognition [Wu2010CapsNetspeechrecognition], and natural language processing [zhao2019CapsNetNLP].
The biggest roadblock to real-world deployment of CapsNet inference is its extremely high computational complexity, which requires a specialized hardware architecture (like the recent one in [Marchisio2019CapsAcc]) that may consume a significant amount of energy/power. Not only deep CapsNet models [Rajasegaran2019DeepCaps], but also shallow models like [Sabour2017dynamic_routing] require intense computations, due to the matrix multiplications in the capsule processing and the iterative dynamic routing algorithm that learns the cross-coupling between capsules. To deploy CapsNets at the edge, as is commonly done for traditional CNNs [Han2016EIE], network compression techniques (like pruning and quantization) [Han2016DeepCompression] can be applied, but at the cost of some accuracy loss. Moreover, current trends in Approximate Computing can be leveraged to achieve energy-efficient hardware architectures, as well as to enable design-time/run-time energy-quality trade-offs. However, this requires a comprehensive resilience analysis of CapsNets under hardware approximation errors, in order to make correct design decisions about which computational steps of CapsNets are amenable to approximation and which are not. Note that, unlike approximations, an error can also be caused by a malfunction of the computing hardware [Jiao2017VulnerabilityDNN] or of the memory [Kim2014RowHammer]. Fault injection has been demonstrated to fool CNNs [Liu2017FaultInjectionDNN], and can potentially cause CapsNet misclassifications as well.
Concept Overview and our Novel Contributions:
To address these challenges, we propose ReD-CaNe, a novel methodology (see Fig. 1) for analyzing the resilience of CapsNets under approximations, which, to the best of our knowledge, is the first of its kind. First, we devise a noise injection model to simulate real-case scenarios of errors coming from approximate hardware components like multipliers, which are very common in the multiply-and-accumulate (MAC) operations of the matrix multiplications in capsules. Then, we analyze the error resilience of CapsNets by building a systematic methodology for injecting noise into different operations of the CapsNet inference and evaluating their impact on the accuracy. The outcome of this analysis produces guidelines for designing and selecting approximate components, based on the resilience of each operation. At the output, our methodology produces an approximated version of a given CapsNet, to achieve an energy-efficient inference.
In a nutshell, our novel contributions are:


- We analyze and model the noise injections that can be generated by different approximate arithmetic components, e.g., multipliers (Section III).

- We devise ReD-CaNe, a novel methodology for analyzing the Resilience and Designing Capsule Networks under approximations, by systematically adding noise at different operations of the CapsNet inference and monitoring the test accuracy. The approximate components are selected based on the resilience level of the different operations of the CapsNet inference (Section IV).

- We test our methodology on several benchmarks: the DeepCaps model [Rajasegaran2019DeepCaps] on the CIFAR10 [Krizhevsky2009CIFAR], MNIST [LeCun1998MNIST], and SVHN [Netzer2011SVHN] datasets, and the CapsNet model [Sabour2017dynamic_routing] on the MNIST and Fashion-MNIST [Xiao2017FashionMNIST] datasets. Our results demonstrate that the least resilient operations are the convolutions in CapsLayers, while the operations performed during the dynamic routing of the Caps3D and ClassCaps layers are relatively more resilient (Section VI).
Before proceeding to the technical sections, in Section II we summarize the concepts of CapsNets and review existing work on error resilience for traditional CNNs, with the details necessary to understand the rest of the paper.
II. Background and Related Work
II-A. Capsule Networks (CapsNets)
CapsNets, first introduced in [Hinton2011TransformingAutoencoder], gained popularity with [Sabour2017dynamic_routing], thanks to new concepts like capsules and the dynamic routing algorithm. Following this trend, DeepCaps [Rajasegaran2019DeepCaps] proposed to increase the depth of CapsNets, achieving state-of-the-art accuracy on the CIFAR10 [Krizhevsky2009CIFAR] dataset.
A capsule is a group of neurons in which each element of the output vector encodes an instantiation parameter as its orientation, while the length of the vector represents the probability that the entity exists. Moreover, the vector predictions of the capsules need to be supported by non-linear vectorized activation functions. Towards this end, the squashing function bounds the output of the capsule between 0 and 1.
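For concreteness, the squashing function of [Sabour2017dynamic_routing] can be sketched in a few lines of NumPy (the small epsilon guard is our addition for numerical stability):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Non-linear vector activation: shrinks short vectors toward length 0
    and long vectors toward length 1, preserving their orientation."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)

# An input of length 5 is squashed to length 25/26, just below 1.
v = squash(np.array([3.0, 4.0]))
print(np.linalg.norm(v))  # ~0.96
```

The factor `sq_norm / (1 + sq_norm)` is what bounds the output length in [0, 1), so the length can be read as an existence probability.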
In the dynamic routing, the coupling coefficients, which connect two consecutive capsule layers, learn the agreement during inference by iteratively updating their values according to the relevance of the path. As an example, the architecture of the DeepCaps is shown in Fig. 2. (Since we focus on CapsNet inference, we do not discuss the operations involved in the training process only, e.g., the decoder and the reconstruction loss. For further details on CapsNets, we refer the readers to [Sabour2017dynamic_routing][Rajasegaran2019DeepCaps].) The DeepCaps has 16 convolutional capsule layers (ConvCaps), one of which is 3D, and one fully-connected capsule layer (ClassCaps) at the end. A special focus is on the operations required for the dynamic routing, which is performed in the 3D ConvCaps and in the ClassCaps layers, as shown in Fig. 3. Note that operations like matrix-vector multiplications and squash differ from those of traditional CNNs. Hence, a challenge that we address in this paper is to study the interrelation between the precision of these operations and the accuracy of the CapsNets when they are subjected to errors due to approximations.
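For reference, the routing-by-agreement loop described above can be sketched as follows (a simplified NumPy version; the tensor shapes and iteration count are illustrative, not the exact DeepCaps configuration):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    n2 = np.sum(s * s, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: prediction votes of shape (N, M, D) from N input capsules
    to M output capsules. Returns output capsules v of shape (M, D)."""
    N, M, _ = u_hat.shape
    b = np.zeros((N, M))                              # routing logits
    for _ in range(iterations):
        e = np.exp(b - b.max(axis=1, keepdims=True))  # stable softmax
        c = e / e.sum(axis=1, keepdims=True)          # coupling coefficients
        s = np.einsum('nm,nmd->md', c, u_hat)         # weighted sum of votes
        v = squash(s)                                 # vector activation
        b = b + np.einsum('nmd,md->nm', u_hat, v)     # agreement update
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(8, 4, 16)))
print(v.shape)  # (4, 16)
```

The logits `b` are recomputed at every inference, which is the property the resilience analysis later correlates with the higher noise tolerance of the routing layers.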
II-B. Error Resilience of Traditional CNNs
The resilience of traditional CNNs has recently been investigated in the literature. The work in [Du2014ErrorResilientAccelerators] analyzed error resilience, showing that it is possible to obtain high energy savings with minimal accuracy loss. The authors of [Hanif2018ErrorResilienceCNN] proposed a methodology to apply approximations to CNNs based on their error resilience. The work in [Li2017ErrorPropagationDNN] studied error propagation with the end goal of adopting countermeasures for obtaining resilient hardware accelerators. The authors of [Zhang2019faulttolerantDNN] proposed a method to design fault-tolerant systolic array-based accelerators, while [Hanif2019CANN] introduced a method for applying approximate multipliers in CNN accelerators without any error in the final result. The works in [Mrazek2019autoAx][Mrazek2019ALWANN] proposed methodologies to search for and select approximate components for CNN accelerators, and [Hanif2018XDNNs] and [Marchisio2019DL4EC] analyzed cross-layer approximations for CNNs. However, these works targeted only traditional CNN accelerators, and such studies cannot be efficiently extrapolated to CapsNets, as discussed above. Hence, there is a dire need to perform the resilience analysis of CapsNets in a systematic way, such that efficient decisions can be taken about approximating the appropriate CapsNet operations.
II-C. Error Sources
In a generic deep learning application, errors may occur due to different sources, such as software approximations (e.g., quantization), hardware approximations (e.g., approximate multipliers), transient faults (i.e., bit flips due to particle strikes), and permanent faults (e.g., stuck-at-zero and stuck-at-one). In this paper, due to our focus on energy efficiency, we target approximation errors. (For further details on reliability- and security-related works on DNNs that study soft errors, permanent faults, and adversarial noise, we refer the readers to [Goodfellow2015explainingadvexamples][Jiao2017VulnerabilityDNN][Srinivasan20166TSRAMANN][Zhang2019RobustML].)
If the CapsNet inference is performed by specialized hardware accelerators [Marchisio2019CapsAcc], a fixed-point representation is typically preferred over its floating-point counterpart [google2017quant]. Therefore, a floating-point value x that must be represented in b-bit fixed-point arithmetic [Parashar2010WLOptimization] is mapped onto the integer range [0, 2^b - 1]. The quantization function is defined in Eq. 1, where x_min and x_max bound the values of x:

  quant(x) = round( (x - x_min) / (x_max - x_min) * (2^b - 1) )    (1)
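A minimal sketch of such a uniform b-bit quantization and the matching de-quantization, under the assumption of a linear mapping between the float range and the integer grid (the paper's exact scaling may differ):

```python
import numpy as np

def quantize(x, b=8, lo=None, hi=None):
    """Map float values in [lo, hi] onto the b-bit integer grid [0, 2**b - 1]."""
    lo = np.min(x) if lo is None else lo
    hi = np.max(x) if hi is None else hi
    scale = (2 ** b - 1) / (hi - lo)
    q = np.round((x - lo) * scale).astype(np.int64)
    return np.clip(q, 0, 2 ** b - 1), lo, scale

def dequantize(q, lo, scale):
    return q / scale + lo

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
q, lo, scale = quantize(x, b=8)
x_hat = dequantize(q, lo, scale)
print(np.max(np.abs(x - x_hat)))  # worst-case rounding error, at most 0.5/scale
```

With 8 bits over the range [-1, 1], the quantization step is 2/255, so the maximum representation error stays below 0.4%, which motivates the 8-bit wordlength used later in the paper.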
In this work, we simulate the CapsNets with floating-point arithmetic, but the behavior of approximate fixed-point components is simulated by adjusting their values according to the quantization effect. Hence, we focus on modeling the errors resulting from the employment of approximate components in CapsNet hardware accelerators.

III. Modeling the Errors as Injected Noise
III-A. Analysis of Different Operations in CapsNets
We perform a comprehensive analysis to investigate which hardware components have the highest impact on the total energy consumption of the CapsNets' computational blocks. Table I reports the number of operations occurring in the computational path of the DeepCaps [Rajasegaran2019DeepCaps] inference and the energy consumption per operation. The latter has been obtained by synthesizing the implementations with 8-bit fixed-point operands, in a 45nm CMOS technology, using the Synopsys Design Compiler tool. Fig. 4 presents the breakdown of the estimated energy share for each operation. It is worth noticing that the multipliers account for 96% of the total energy share of the computational path of the DeepCaps. The number of additions is also high, but energy-wise the additions consume only 3% of the total share, due to their reduced complexity compared to the multipliers. Hence, as targeted in this paper, it is important to first explore the energy savings obtainable by approximating the multiplier operations.
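The breakdown itself is a simple weighted share. The sketch below uses hypothetical operation counts and per-operation energies (not the actual values of Table I), chosen only to illustrate how a roughly 96% multiplier share arises despite a comparable number of additions:

```python
# Hypothetical (count, energy-per-op in pJ) pairs -- illustrative only,
# NOT the synthesized values reported in Table I of the paper.
ops = {
    'multiply': (2.0e9, 0.58),
    'add':      (2.0e9, 0.0175),
    'others':   (5.0e8, 0.0236),
}

total = sum(n * e for n, e in ops.values())
share = {name: n * e / total for name, (n, e) in ops.items()}
print(share)  # multipliers dominate with roughly 96% of the energy
```

Even with equal counts, the per-operation energy gap between a multiplier and an adder is what concentrates the optimization potential in the multipliers.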

In the following, we study the energy optimization potential of employing approximate components. As a case study, we select the NGR approximate multiplier and the 5LT approximate adder from the EvoApprox8B library [Mrazek2017Lib]. The results in Fig. 5 show that approximating only the multipliers (XM) saves more than 28% of the energy consumption compared to the accurate implementation (Acc). Due to the low share of energy consumed by the additions, the advantage of employing approximate adders (XA), or approximate adders and multipliers (XAM), is negligible compared to the Acc and XM solutions, respectively.
Motivated by the above discussion and analysis, in the following, without loss of generality and for ease of proof-of-concept development, we focus our analyses on approximate multipliers, since they have the highest impact on the energy consumption and thus open huge optimization potential.
III-B. Error Profiles for the Approximate Hardware Multipliers
We selected 35 approximate multipliers from the EvoApprox8B library [Mrazek2017Lib] and analyzed the distributions of the erroneous products generated by these multipliers, compared to the accurate product of an 8-bit multiplier (i.e., with a 16-bit output). The arithmetic error is computed as in Eq. 2, where (a, b) denotes a pair of inputs to the multipliers, taken from a representative set of inputs T:

  ε(a, b) = mul_approx(a, b) - a · b,  for (a, b) ∈ T    (2)
The distributions of the arithmetic errors are computed for three scenarios: a single multiplier, a sequence of 9 multiply-and-accumulate (MAC) units, and a sequence of 81 MAC units, with random input samples for each scenario. These analyses estimate the accumulated error of convolutions with 3x3 and 9x9 filters, respectively. We selected these values because they reflect the sizes of the convolutional kernels of the DeepCaps [Rajasegaran2019DeepCaps] and of the CapsNet [Sabour2017dynamic_routing].
The majority of the components (31 out of 35) show a Gaussian-like distribution of the arithmetic error ε, with a mean value m and a standard deviation std. The error distributions of two approximate multipliers from [Mrazek2017Lib] are shown in Fig. 6. (Since the remaining 29 elements of the EvoApprox8B library [Mrazek2017Lib] with a Gaussian-like distribution show a similar behavior, we report only these two examples.) Modeling a Gaussian noise when employing b-bit fixed-point approximate components in a CapsNet that operates in floating point is an open research problem. We propose to adjust the noise w.r.t. the range R_A of the values of a given array A. Hence, we introduce the noise magnitude (NM) to indicate the standard deviation (σ) of the noise scaled w.r.t. R_A, and the noise average (NA) to indicate the mean value (μ) of the noise scaled w.r.t. R_A.
Since the inputs of the components employed in CapsNets typically follow some specific distribution patterns, the error distribution of an approximate component is application-dependent. This implies that NM and NA can change significantly across different CapsNet models and datasets. Hence, we show experiments for several benchmarks in Section VI.
III-C. Noise Injection Modeling
Based on the above analysis, without loss of generality, we can model the error source coming from approximate components as a Gaussian random noise added to the array under consideration.
An error with given values of NM and NA, associated with a tensor x (i.e., a multi-dimensional output of a CapsNet operation) of shape (s_1, ..., s_n), is modeled as in Eq. 3, and the noisy output x' is given by Eq. 4:

  δ = GaussRand((s_1, ..., s_n), μ = NA · R_A, σ = NM · R_A)    (3)

  x' = x + δ    (4)
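A minimal NumPy sketch of this noise injection model (assuming R_A is taken as the max-min range of the array; the exact range definition is our assumption):

```python
import numpy as np

def inject_noise(x, NM, NA, rng=None):
    """Add Gaussian noise to tensor x with std = NM * R_A and
    mean = NA * R_A, where R_A is the dynamic range of x (Eqs. 3-4)."""
    if rng is None:
        rng = np.random.default_rng()
    R = x.max() - x.min()                       # dynamic range R_A
    delta = rng.normal(loc=NA * R, scale=NM * R, size=x.shape)
    return x + delta

x = np.linspace(-1.0, 1.0, 10_000)
x_noisy = inject_noise(x, NM=0.01, NA=0.0, rng=np.random.default_rng(0))
print(np.std(x_noisy - x))  # close to 0.01 * 2.0 = 0.02
```

Scaling by the range makes the same NM comparable across tensors with very different magnitudes, which is what allows a single noise level to be swept over all groups of operations.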
Here, GaussRand is a function that generates a tensor of random numbers of shape (s_1, ..., s_n), following a Gaussian distribution with mean μ and standard deviation σ.

IV. ReD-CaNe: Our Methodology for Error Resilience Analysis and Design of Approximate CapsNets
Our methodology is composed of 6 steps, as shown in Fig. 7. Once we have identified the lists of arrays in which to inject noise, called Groups, we apply the noise injection described in Sec. III-C. By monitoring the impact of the noise on the test accuracy for different arrays of operations, we can identify the most and the least critical operations of a given CapsNet from the accuracy point of view. Therefore, our ReD-CaNe methodology can provide useful guidelines for designing energy-efficient inference, showing the potential of applying approximations to specific layers and operations (i.e., the more resilient ones) without significantly sacrificing accuracy. The step-by-step flow of our methodology is as follows:


1. Group Extraction: We divide the operations of the CapsNet inference into groups, based on the type of operation (e.g., MAC, activation function, softmax, or logits update). This step generates the Groups.
2. Group-Wise Resilience Analysis: We monitor the test accuracy drop when injecting noise into the different groups.

3. Mark Resilient Groups: Based on the results of the analysis performed in Step 2, we mark the more resilient groups. After this step, the Groups fall into two categories, Resilient and Non-Resilient.

4. Layer-Wise Resilience Analysis for Non-Resilient Groups: For each non-resilient group, we monitor the test accuracy drop when injecting noise at each layer. (Compared to a layer-wise analysis of every group, performing this analysis only for the non-resilient groups skips a considerable amount of unnecessary testing and saves significant exploration time.)

5. Mark Resilient Layers for Each Non-Resilient Group: Based on the results of the analysis performed in Step 4, we mark the more resilient layers.

6. Select Approximate Components: For each operation, we select approximate components from a given library, based on the resilience measured as the tolerable noise magnitude NM.
Note that a resilience analysis step consists of setting the input parameters of the noise injection, i.e., NM and NA, adding the noise to the selected CapsNet operations, and monitoring the accuracy of the noisy CapsNet.
The output of our methodology is an approximated version of the given CapsNet, ready to be executed on a specialized inference hardware accelerator with approximate components. To save area and energy, we select for each operation the approximate components from a given library that correspond to its level of resilience. Hence, more aggressive approximations are selected for more resilient operations, without significantly affecting the classification accuracy of the CapsNet inference.
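The group-wise search of Steps 2-3 can be sketched as a simple loop; here `evaluate` is a placeholder for a full noisy test-set run of the CapsNet, and the accuracy model below is a toy stand-in:

```python
def resilience_analysis(groups, evaluate, baseline_acc,
                        NM_levels=(0.001, 0.01, 0.1), tol=0.01):
    """For each group, find the largest noise magnitude NM whose accuracy
    drop stays within `tol` of the baseline; groups tolerating the highest
    NM tested are marked resilient (Steps 2-3 of the methodology)."""
    tolerated = {}
    for g in groups:
        tolerated[g] = 0.0
        for nm in NM_levels:                  # sweep increasing noise levels
            if baseline_acc - evaluate(g, nm) <= tol:
                tolerated[g] = nm             # still accurate at this level
            else:
                break                         # accuracy dropped, stop sweeping
    resilient = [g for g in groups if tolerated[g] == max(NM_levels)]
    return tolerated, resilient

# Toy accuracy model: 'mac' degrades quickly with noise, 'softmax' barely.
def fake_eval(group, nm):
    return 0.92 - (0.5 if group == 'mac' else 0.001) * nm

tol_levels, resilient = resilience_analysis(['mac', 'softmax'], fake_eval, 0.92)
print(tol_levels, resilient)
```

The same loop, restricted to the non-resilient groups and iterating over layers instead of groups, covers Steps 4-5.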
V. Experimental Setup
The experimental setup is shown in Fig. 8. We train a given CapsNet model on a given dataset using TensorFlow [Google2016TensorFlow], running on two Nvidia GTX 1080 Ti GPUs. The trained model serves as the input to our ReD-CaNe methodology. The noise is injected into the arrays, and the accuracy is then monitored to identify the resilience of the operations.
V-A. CapsNet Models and Datasets
We test our methodology on two networks: the DeepCaps [Rajasegaran2019DeepCaps] and the original CapsNet [Sabour2017dynamic_routing]. We use the CIFAR10 [Krizhevsky2009CIFAR], SVHN [Netzer2011SVHN], MNIST [LeCun1998MNIST], and Fashion-MNIST [Xiao2017FashionMNIST] datasets, to classify generic images, house numbers, handwritten digits, and fashion clothes, respectively. The accuracy results obtained by training these networks on the different datasets are reported in Table II. Table III shows the partitioning of the CapsNet operations into groups, which is used in the group extraction step.

Table II:
Architecture                        | Dataset       | Accuracy (%)
DeepCaps [Rajasegaran2019DeepCaps]  | CIFAR10       | 92.74
DeepCaps [Rajasegaran2019DeepCaps]  | SVHN          | 97.56
DeepCaps [Rajasegaran2019DeepCaps]  | MNIST         | 99.72
CapsNet [Sabour2017dynamic_routing] | Fashion-MNIST | 92.88
CapsNet [Sabour2017dynamic_routing] | MNIST         | 99.67
Table III:
# | Group Name    | Description
1 | MAC Outputs   | Outputs of the matrix multiplications
2 | Activations   | Outputs of the activation functions (ReLU or squash)
3 | Softmax       | Results of the softmax (k coefficients in dynamic routing)
4 | Logits Update | Update of the logits (b coefficients in dynamic routing)
V-B. TensorFlow Implementation
The proposed methodology is implemented in TensorFlow [Google2016TensorFlow]. First, the network is trained using standard approaches. We then modify the computational graph, stored in protobuf format, to include our noise injection model, using the Graph tool. We implemented a specialized node for the noise injection, where the values of NM and NA can be specified as inputs to this node. Hence, for each node x, a new set of nodes N is added to the graph. The nodes in N have the same shape as x, and they consist of the set of operations for adding a Gaussian noise with the given NM and NA, given the range of the node x.
V-C. Approximate Multiplier Library
We use the EvoApprox8B library [Mrazek2017Lib], which consists of 35 8-bit unsigned components. We select an 8-bit wordlength, since it has been shown to be sufficiently accurate for the computational path of CapsNets [Marchisio2019CapsAcc].
VI. Experimental Results
VI-A. Detailed Analysis for the CIFAR10 Dataset
As a case study, we report detailed results for the DeepCaps on the CIFAR10 dataset. The results for the other benchmarks are reported in Section VI-C.
For the following analyses, we sweep the noise magnitude NM and, to analyze the general case of error resilience, select a zero noise average (NA = 0). In the experiment for Step 2 of our methodology, we inject the same noise into every operation within a group, while keeping the other groups accurate. From the results shown in Fig. 9, we notice that the Softmax and Logits Update groups are more resilient than MAC Outputs and Activations, whose accuracy starts to decrease at a correspondingly lower NM. Note that, for low NM, the noise injection slightly increases the accuracy, due to a regularization effect similar to that of dropout [Srivastava2014Dropout].
In Fig. 10, we analyze the resilience of each layer of the non-resilient groups (i.e., MAC Outputs and Activations). We notice that the first convolutional layer is the least resilient, followed by the layers in the middle, while the Caps3D layer is the most resilient. Since this layer is the only convolutional layer that employs the dynamic routing algorithm, we attribute its higher resilience to the iterations performed in this layer: the coefficients are updated dynamically at runtime, so they can adapt to the noise.
VI-B. Evaluating the Selection of Approximate Components
The choice of the approximate component for each operation depends on the level of NM corresponding to a tolerable accuracy loss, which is typically null or very low. Recalling Eq. 2, the parameters NA and NM are dataset-dependent, because their values change according to the representative input set T. In our case study (DeepCaps for CIFAR10), we select a subset of elements from the inputs of every Conv2D layer of the DeepCaps; the corresponding distribution (frequency of occurrence) is shown in Fig. 11 (left). The distribution is approximately Gaussian, but there is a peak for the input feature maps, which is caused by a specific distribution of the input dataset. Indeed, the peak occurs in the first Caps2D layer, as shown in Fig. 11 (right).
Hence, we measure the NA and NM parameters of the selected multipliers of the library (Table IV). We use two different input distributions: the modeled one, based on random inputs generated with a uniform distribution, and the real one, based on the input distribution previously shown in Fig. 11. Note that these values differ slightly, because the NA and NM parameters are dataset-dependent. The major differences are due to an overestimation of NA and NM by our modeled distribution. Therefore, the selection of approximate components based on our models can be systematically employed for designing approximate CapsNets.

Table IV:
Multiplier | Power [µW] (savings) | Area [µm²] (savings) | Modeled NA | Modeled NM | Real NA | Real NM
mul8u_1JFF | 391 (0%)  | 710 (0%)  | 0.0000 | 0.0000 | 0.0000 | 0.0000
mul8u_14VP | 364 (7%)  | 654 (8%)  | 0.0000 | 0.0001 | 0.0000 | 0.0001
mul8u_GS2  | 356 (9%)  | 633 (11%) | 0.0004 | 0.0017 | 0.0001 | 0.0013
mul8u_CK5  | 345 (12%) | 604 (15%) | 0.0000 | 0.0002 | 0.0000 | 0.0002
mul8u_7C1  | 329 (16%) | 607 (14%) | 0.0011 | 0.0033 | 0.0007 | 0.0026
mul8u_96D  | 309 (21%) | 605 (15%) | 0.0035 | 0.0077 | 0.0020 | 0.0051
mul8u_2HH  | 302 (23%) | 542 (24%) | 0.0001 | 0.0007 | 0.0001 | 0.0007
mul8u_NGR  | 276 (29%) | 512 (28%) | 0.0001 | 0.0008 | 0.0002 | 0.0009
mul8u_19DB | 206 (47%) | 396 (44%) | 0.0010 | 0.0019 | 0.0010 | 0.0021
mul8u_DM1  | 195 (50%) | 402 (43%) | 0.0003 | 0.0025 | 0.0005 | 0.0025
mul8u_12N4 | 142 (64%) | 390 (45%) | 0.0018 | 0.0054 | 0.0019 | 0.0056
mul8u_1AGV | 95 (76%)  | 228 (68%) | 0.0027 | 0.0080 | 0.0026 | 0.0117
mul8u_YX7  | 61 (84%)  | 221 (69%) | 0.0484 | 0.0741 | 0.0268 | 0.0347
mul8u_JV3  | 34 (91%)  | 111 (84%) | 0.0021 | 0.0267 | 0.0028 | 0.0301
mul8u_QKX  | 29 (93%)  | 112 (84%) | 0.0509 | 0.0736 | 0.0293 | 0.0350
(We have randomly selected 14 components, representative of the complete library.)
VI-C. Testing our Methodology on Different Benchmarks
We apply our methodology to the other benchmarks. The results of the Step-2 resilience analysis are shown in Fig. 12. A key property that we can observe is that the MAC Outputs and Activations groups are less resilient than the other two groups. Moreover, the logits update in the CapsNet [Sabour2017dynamic_routing] on MNIST (bottom right) is slightly less resilient than the same group in the DeepCaps [Rajasegaran2019DeepCaps] on MNIST (top right), because the CapsNet has only one layer performing dynamic routing, while the DeepCaps has two.
VI-D. Results Discussion
From our analyses, we derive that CapsNets have interesting resilience properties. A key observation, valid for every benchmark, is that the layers computing the dynamic routing (ClassCaps and Caps3D), and the corresponding groups of operations (softmax and logits update), are more resilient than the others. This outcome is attributed to a common feature of the dynamic routing: the values of the involved coefficients (logits b and coupling coefficients k, see Fig. 3) are updated dynamically, thereby adapting to the injected noise. Hence, more aggressive approximations can be tolerated for these computations.
VII. Conclusion
We proposed a systematic methodology for analyzing the resilience of CapsNets under approximation errors, which can provide the foundation for designing approximate CapsNet hardware. We designed an error injection model that accounts for the approximation errors, and modeled the errors of applying approximate multipliers in the computational units of CapsNet accelerators. We systematically analyzed the (group-wise and layer-wise) resilience of the operations and designed approximated CapsNets based on the different resilience levels. We showed that the operations in the dynamic routing are more resilient to approximation errors; hence, more aggressive approximations can be adopted for these computations without significantly sacrificing classification accuracy. Our methodology provides a first step towards real-world approximate CapsNets for energy-efficient inference.
Acknowledgments
This work has been partially supported by the Doctoral College Resilient Embedded Systems, which is run jointly by TU Wien's Faculty of Informatics and FH-Technikum Wien, and partially supported by the Czech Science Foundation project 19-10137S.