The breakthrough concept of Capsule Networks (CapsNets), recently introduced by the Google Brain team, has gained significant attention due to its powerful new features offering high accuracy and better learning capabilities [Sabour2017dynamic_routing]. Traditional Convolutional Neural Networks (CNNs) cannot learn the spatial relations in images efficiently [Sabour2017dynamic_routing]. Moreover, they make extensive use of pooling layers to reduce the dimensionality of the space, which, as a drawback, reduces their learning capabilities. Therefore, a huge amount of training data is required to mitigate this deficit. CapsNets, on the other hand, take advantage of their novel structure, with so-called capsules and their cross-coupling learnt through the dynamic routing algorithm, to overcome this problem. Capsules produce vector outputs, as opposed to the scalar outputs of CNNs [Sabour2017dynamic_routing]. With a vector format, CapsNets are able to learn the spatial relationships between features. For example, the Google Brain team [Sabour2017dynamic_routing] demonstrated that CNNs recognize an image where the nose is below the mouth as a "face", while CapsNets do not make such a mistake because they have learned the spatial correlations between features (e.g., the nose must appear above the mouth). Beyond image classification, CapsNets have been successfully showcased for vehicle detection [Yu2019CapsNetDetection], speech recognition [Wu2010CapsNetspeechrecognition], and natural language processing [zhao2019CapsNetNLP].
The biggest roadblock for real-world deployment of CapsNet inference is its extremely high complexity, requiring a specialized hardware architecture (like the recent one in [Marchisio2019CapsAcc]) that may consume a significant amount of energy/power. Not only deep CapsNet models [Rajasegaran2019DeepCaps], but also shallow models like [Sabour2017dynamic_routing] require intense computations, due to the matrix multiplications in the capsule processing and the iterative dynamic routing algorithm for learning the cross-coupling between capsules. To deploy CapsNets at the edge, as commonly done for traditional CNNs [Han2016EIE], network compression techniques (like pruning and quantization) [Han2016DeepCompression] can be applied, but at the cost of some accuracy loss. Moreover, current trends in Approximate Computing can be leveraged to achieve energy-efficient hardware architectures, as well as to enable design-/run-time energy-quality tradeoffs. However, this requires a comprehensive resilience analysis of CapsNets under approximation errors in the hardware, in order to make correct design decisions about which computational steps of the CapsNets are good candidates for approximation and which are not. Note that, unlike approximations, an error can also be caused by a malfunction of the computing hardware [Jiao2017VulnerabilityDNN] or of the memory [Kim2014RowHammer]. Fault injection has been demonstrated to fool CNNs [Liu2017FaultInjectionDNN], and can potentially cause a CapsNet misclassification as well.
Concept Overview and our Novel Contributions:
To address these challenges, we propose ReD-CaNe, a novel methodology (see Fig. 1) for analyzing the resilience of CapsNets under approximations, which, to the best of our knowledge, is the first of its kind. First, we devise a noise injection model to simulate real-case scenarios of errors coming from approximate hardware components like multipliers, which are very common in multiply-and-accumulate (MAC) operations for the matrix multiplications of capsules. Then, we analyze the error resilience of the CapsNets by building a systematic methodology for injecting noise into different operations of the CapsNet inference, and evaluating their impact on the accuracy. The outcome of such analysis will produce guidelines for designing and selecting approximate components, based on the resilience of each operation. At the output, our methodology produces an approximated version of a given CapsNet, to achieve an energy-efficient inference.
I-A In a nutshell, our novel contributions are:
We analyze and model the noise injections that can be generated by different approximate arithmetic components, e.g., multipliers. (Section III)
We devise ReD-CaNe, a novel methodology for analyzing the Resilience and Designing Capsule Networks under approximations, by systematically adding noise at different operations of the CapsNet inference and by monitoring the test accuracy. The approximated components are selected based on the resilience level of the different operations of the CapsNet inference. (Section IV)
We test our methodology on several benchmarks: the DeepCaps model [Rajasegaran2019DeepCaps] for the CIFAR-10 [Krizhevsky2009CIFAR], MNIST [LeCun1998MNIST], and SVHN [Netzer2011SVHN] datasets, and the CapsNet model [Sabour2017dynamic_routing] for the MNIST and Fashion-MNIST [Xiao2017Fashion-MNIST] datasets. Our results demonstrate that the least resilient operations are the convolutions in CapsLayers, while the operations performed during the dynamic routing of the Caps3D and ClassCaps layers are relatively more resilient. (Section VI)
Before proceeding to the technical sections, in Section II, we summarize the concepts of CapsNets and review the existing works of error resilience for traditional CNNs, with necessary details to understand the rest of the paper.
II Background and Related Work
II-A Capsule Networks (CapsNets)
CapsNets, first introduced in [Hinton2011TransformingAutoencoder], became popular with [Sabour2017dynamic_routing], thanks to new concepts like capsules and the dynamic routing algorithm. Following this trend, DeepCaps [Rajasegaran2019DeepCaps] proposed to increase the depth of CapsNets, achieving state-of-the-art accuracy on the CIFAR-10 [Krizhevsky2009CIFAR] dataset.
A capsule is a group of neurons where the instantiation parameters are represented by the orientation of the vector elements, and the vector length represents the probability that the entity exists. Moreover, the vector predictions of the capsules need to be supported by nonlinear vectorized activation functions. Towards this end, the squashing function bounds the length of the capsule output between 0 and 1.
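As a concrete illustration, the squashing function can be sketched in NumPy as follows (a minimal sketch of the formula from [Sabour2017dynamic_routing]; the small eps term for numerical stability is our addition):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||): keeps the orientation
    # of s while bounding its length to [0, 1)
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```

Long input vectors are mapped close to unit length and short ones are shrunk towards zero, matching the probabilistic interpretation of the capsule output length.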
In the dynamic routing, the coupling coefficients, which connect two consecutive capsule layers, learn the agreement during the inference by iteratively updating their values according to the relevance of the path. As an example, the architecture of the DeepCaps is shown in Fig. 2. (Since we focus on the CapsNet inference, we do not discuss the operations involved in the training process only, e.g., the decoder and the reconstruction loss. For further details on CapsNets, we refer the readers to [Sabour2017dynamic_routing][Rajasegaran2019DeepCaps].) It has 16 convolutional capsule layers (ConvCaps), one of which is 3D, and one fully-connected capsule layer (ClassCaps) at the end. A special focus is on the operations required for the dynamic routing, which is performed in the 3D ConvCaps and in the ClassCaps layers, as shown in Fig. 3. Note that operations like the matrix-vector multiplications and the squash differ from those of traditional CNNs. Hence, a challenge that we address in this paper is to study the inter-relation between the precision of these operations and the accuracy of the CapsNets, when subjected to errors due to approximations.
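The routing-by-agreement loop can be sketched as follows (a simplified NumPy sketch under our reading of [Sabour2017dynamic_routing]; the tensor shapes and the three-iteration default are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def squash(s, axis=-1, eps=1e-8):
    # squashing nonlinearity, as defined earlier
    sq = np.sum(np.square(s), axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: capsule predictions, shape (num_in, num_out, dim_out)."""
    b = np.zeros(u_hat.shape[:2])                      # routing logits b_ij
    for _ in range(num_iters):
        c = softmax(b, axis=1)                         # coupling coefficients
        s = np.sum(c[..., None] * u_hat, axis=0)       # weighted sum per output capsule
        v = squash(s)                                  # squashed capsule outputs
        b = b + np.sum(u_hat * v[None, ...], axis=-1)  # agreement (dot product) update
    return v
```

The iterative update of the logits is the run-time adaptation mechanism that, as discussed in Section VI, makes the routing operations comparatively resilient to injected noise.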
II-B Error Resilience of Traditional CNNs
The resilience of traditional CNNs has recently been investigated in the literature. The work in [Du2014ErrorResilientAccelerators] analyzed the error resilience of CNNs, showing that it is possible to obtain high energy savings with minimal accuracy loss. The authors of [Hanif2018ErrorResilienceCNN] proposed a methodology to apply approximations to CNNs, based on their error resilience. The work in [Li2017ErrorPropagationDNN] studied the error propagation with the end goal of adopting countermeasures for obtaining resilient hardware accelerators. The authors of [Zhang2019faulttolerantDNN] proposed a method to design fault-tolerant systolic array-based accelerators. The work in [Hanif2019CANN] introduced a method for applying approximate multipliers to CNN accelerators without any error in the final result. The works in [Mrazek2019autoAx][Mrazek2019ALWANN] proposed methodologies to successfully search and select approximate components for CNN accelerators. The works in [Hanif2018X-DNNs] and [Marchisio2019DL4EC] analyzed cross-layer approximations for CNNs. However, these works analyzed only traditional CNN accelerators, and such studies cannot be efficiently extrapolated to CapsNets, as discussed before. Hence, there is a dire need to perform the resilience analysis of CapsNets in a systematic way, such that efficient decisions can be taken about approximating the appropriate operations of CapsNets.
II-C Error Sources
In a generic Deep Learning application, errors may occur due to different sources like software approximations (e.g., quantization), hardware approximations (e.g., approximate multipliers), transient faults (i.e., bit flips due to particle strikes), and permanent faults (e.g., stuck-at-zero and stuck-at-one). In this paper, due to our focus on energy efficiency, we target approximation errors. (For further details on reliability- and security-related works on DNNs that study soft errors, permanent faults, and adversarial noise, we refer the readers to [Goodfellow2015explainingadvexamples][Jiao2017VulnerabilityDNN][Srinivasan20166TSRAMANN][Zhang2019RobustML].)
If the CapsNet inference is performed by specialized hardware accelerators [Marchisio2019CapsAcc], a fixed-point representation is typically preferred over its floating-point counterpart [google2017quant]. Therefore, a floating-point value, which must be represented in n-bit fixed-point arithmetic [Parashar2010WLOptimization], is mapped to a bounded quantized range. The quantization function is defined in Eq. 1.
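A uniform quantization of this kind can be sketched as follows (our illustrative sketch, not necessarily the exact function of Eq. 1; the symmetric scaling by the maximum absolute value is an assumption):

```python
import numpy as np

def quantize(x, n_bits=8):
    # map floats onto n-bit signed fixed-point levels and back (assumed scheme)
    scale = np.max(np.abs(x)) / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale) * scale
```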
In this work, we simulate the CapsNets with floating-point arithmetic, but the behavior of approximate fixed-point components is simulated by adjusting their values according to the quantization effect. Hence, we focus on modeling the errors resulting from the employment of approximate components in CapsNet hardware accelerators.
III Modeling the Errors as Injected Noise
III-A Analysis of Different Operations in CapsNets
We perform a comprehensive analysis to investigate which hardware components have the highest impact on the total energy consumption of the CapsNets' computational blocks. Table I reports the number of operations that occur in the computational path of the DeepCaps [Rajasegaran2019DeepCaps] inference and the energy consumption per operation. The latter has been obtained by synthesizing the implementation with 8-bit fixed-point operations, in a 45nm CMOS technology, with the Synopsys Design Compiler tool. Fig. 4 presents the breakdown of the estimated energy share for each operation. It is worth noticing that the multipliers account for 96% of the total energy share of the computational path of the DeepCaps. The number of additions is also high, but energy-wise the additions consume only 3% of the total share, due to their reduced complexity compared to that of the multipliers. Hence, it is important to first explore the energy savings from approximating the multipliers, as we target in this paper.
In the following, we study the energy optimization potential of employing approximate components. As a case study, we select from the EvoApprox8B library [Mrazek2017Lib] the NGR approximate multiplier and the 5LT approximate adder. The results in Fig. 5 show that approximating only the multipliers (XM) can save more than 28% of the energy consumption, compared to the accurate implementation (Acc). Due to the low share of energy consumed by the additions, the advantage of employing approximate adders (XA) or employing approximate adders and multipliers (XAM) is negligible compared to Acc and XM solutions, respectively.
Motivated by the above discussion and analysis, in the following, without loss of generality and for ease of proof-of-concept development, we focus our analyses on approximate multipliers, since they have the highest impact on the energy consumption and thus open up the largest optimization potential.
III-B Error Profiles for the Approximate Hardware Multipliers
We selected 35 approximate multipliers from the EvoApprox8B library [Mrazek2017Lib] and analyzed the distributions of the erroneous products generated by such multipliers, compared to the accurate product of an 8-bit multiplier (i.e., with a 16-bit output). The arithmetic error is computed as in Eq. 2, over pairs of multiplier inputs drawn from a representative input set.
The distributions of the arithmetic errors are calculated for a single multiplier, for a sequence of 9 multiply-and-accumulate (MAC) operations, and for a sequence of 81 MAC operations, using random input samples for each scenario. These analyses estimate the accumulated error of a convolution with 3×3 and 9×9 filters, respectively. We selected these values because they reflect the sizes of the convolutional kernels of the DeepCaps [Rajasegaran2019DeepCaps] and of the CapsNet [Sabour2017dynamic_routing].
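This accumulation experiment can be reproduced with a sketch like the following (the truncation-based multiplier is a hypothetical stand-in, not one of the EvoApprox8B components):

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_mult(a, b):
    # hypothetical stand-in approximate multiplier: drop the 4 LSBs of the product
    return (a * b) & ~0xF

def mac_error_samples(n_macs, n_samples=10_000):
    """Accumulated arithmetic error of a chain of n_macs approximate MACs."""
    a = rng.integers(0, 256, size=(n_samples, n_macs))
    b = rng.integers(0, 256, size=(n_samples, n_macs))
    return np.sum(approx_mult(a, b) - a * b, axis=1)

for n in (1, 9, 81):  # single multiplier, 3x3 kernel, 9x9 kernel
    e = mac_error_samples(n)
    print(f"{n:2d} MACs: mean={e.mean():9.1f}  std={e.std():7.1f}")
```

By the central limit theorem, the accumulated error over longer MAC chains approaches the Gaussian-like shape observed for most library components.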
The majority of the components (31 out of 35) exhibit a Gaussian-like distribution of the arithmetic error, with a mean value m and a standard deviation std. The error distributions of two approximate multipliers from [Mrazek2017Lib] are shown in Fig. 6. (Since the remaining 29 elements of the EvoApprox8B library [Mrazek2017Lib] that have a Gaussian-like distribution show a similar behavior, we only report these two examples of approximate multipliers.)
Modeling such a Gaussian noise, when employing n-bit fixed-point approximate components in a CapsNet that operates in floating-point arithmetic, is an open research problem. We propose to adjust the noise w.r.t. the range of values of a given array. Hence, we introduce the noise magnitude (NM) to indicate the standard deviation (std) of the noise scaled w.r.t. the range of the array, and the noise average (NA) to indicate the mean value (m) of the noise scaled w.r.t. the range of the array.
Since the inputs of the components employed in CapsNets typically follow specific distribution patterns, the error of an approximate component is application-dependent. This implies that the NM and NA can change significantly for different CapsNet models and different datasets. Hence, we show several experiments for different benchmarks in Sec. VI.
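Under this definition, NM and NA can be estimated from error samples as follows (a minimal sketch; the interpretation of the range as max minus min is our assumption):

```python
import numpy as np

def noise_params(errors, x):
    """Noise magnitude (NM) and noise average (NA) of an approximate
    component's error samples, scaled w.r.t. the range of array x."""
    value_range = x.max() - x.min()
    nm = errors.std() / value_range
    na = errors.mean() / value_range
    return nm, na
```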
III-C Noise Injection Modeling
Based on the above analysis, without loss of generality, we can model the error source coming from approximate components as a Gaussian random noise added to the array under consideration.
An error with given values of NM and NA, associated to a tensor x (i.e., a multidimensional output of a CapsNet operation) with shape S, is modeled as in Equation 3, and the noisy output x' is denoted as in Equation 4:

Noise(NM, NA) = G(S, NA · range(x), NM · range(x))   (3)

x' = x + Noise(NM, NA)   (4)

Here, G(S, m, std) is a function which generates a tensor of random numbers with shape S, following a Gaussian distribution with mean m and standard deviation std.
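Equations 3 and 4 translate directly into code (a minimal NumPy sketch; interpreting range(x) as max(x) minus min(x) is our assumption):

```python
import numpy as np

rng = np.random.default_rng(42)

def inject_noise(x, nm, na):
    """Add Gaussian noise scaled to the dynamic range of tensor x (Eqs. 3-4)."""
    value_range = x.max() - x.min()
    noise = rng.normal(loc=na * value_range,    # mean = NA * range(x)
                       scale=nm * value_range,  # std  = NM * range(x)
                       size=x.shape)
    return x + noise
```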
IV ReD-CaNe: Our Methodology for Error Resilience Analysis and Design of Approximate CapsNets
Our methodology is composed of 6 steps, as shown in Fig. 7. Once we identify the lists of arrays in which we want to inject noise, called Groups, we apply the noise injection, as described in Sec. III-C. By monitoring the impact of the noise on the test accuracy for the different arrays of operations, we can identify the most and the least critical operations of a given CapsNet from the accuracy point of view. Therefore, our ReD-CaNe methodology can provide useful guidelines for designing energy-efficient inference, showing the potential of applying approximations to specific layers and operations (i.e., the more resilient ones) without significantly sacrificing the accuracy. The step-by-step flow of our methodology is as follows:
Group Extraction: We divide the operations of the CapsNet inference into groups, based on the type of operation (e.g., MAC, activation function, softmax, or logits update). This step generates the Groups.
Group-Wise Resilience Analysis: We monitor the test accuracy drop by injecting noise to different groups.
Mark Resilient Groups: Based on the results of the analysis performed in Step 2, we mark the more resilient groups. After this step, the Groups fall into two categories, Resilient and Non-Resilient.
Layer-Wise Resilience Analysis for Non-Resilient Groups: For each non-resilient group, we monitor the test accuracy drop by injecting noise at each layer. (Compared to a layer-wise analysis of every group, performing this analysis only for the non-resilient groups skips a considerable amount of unnecessary testing and saves significant exploration time.)
Mark Resilient Layers for Each Non-Resilient Group: Based on the results of the analysis performed in Step 4, we mark the more resilient layers.
Select Approximate Components: For each operation, we select approximate components from a given library, based on the resilience measured in terms of the tolerable noise magnitude (NM).
Note that a step of resilience analysis consists of setting the input parameters of the noise injection, i.e., NM and NA, adding the noise to the selected CapsNet operations, and monitoring the accuracy of the noisy CapsNet.
The output of our methodology is the approximated version of a given CapsNet, which is ready to be executed in a specialized hardware accelerator for inference with approximate components. For the purpose of saving area and energy, we select, for each operation, the approximate components, from a given library, that correspond to their level of resilience. Hence, more aggressive approximations are selected for more resilient operations, without significantly affecting the classification accuracy of the CapsNet inference.
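The group-wise part of this flow (Steps 2-3) can be sketched as a simple search (an illustrative sketch only: the evaluate callback, the accuracy-drop threshold, and the early-exit search are our assumptions, not the paper's exact procedure):

```python
def group_resilience(evaluate, groups, nm_values, max_drop=0.5):
    """For each group, find the largest noise magnitude (NM) whose accuracy
    drop stays within max_drop percentage points of the baseline.

    evaluate(group, nm) -> test accuracy with noise injected into `group`;
    evaluate(None, 0.0) gives the noise-free baseline accuracy.
    """
    baseline = evaluate(None, 0.0)
    tolerated = {}
    for g in groups:
        tolerated[g] = 0.0
        for nm in sorted(nm_values):
            if baseline - evaluate(g, nm) <= max_drop:
                tolerated[g] = nm
            else:
                break  # larger NM values would only degrade accuracy further
    return tolerated
```

Groups with a large tolerated NM are marked Resilient; the rest proceed to the layer-wise analysis of Step 4.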
V Experimental Setup
The experimental setup is shown in Fig. 8. We train a given CapsNet model for a given dataset using TensorFlow [Google2016TensorFlow], running on two Nvidia GTX 1080 Ti GPUs. The trained model serves as an input to our ReD-Cane methodology. The noise is injected to the arrays and then the accuracy is monitored to identify the resilience of the operations.
V-A CapsNet Models and Datasets
We test our methodology on two networks, the DeepCaps [Rajasegaran2019DeepCaps] and the original CapsNet [Sabour2017dynamic_routing]. We use the CIFAR-10 [Krizhevsky2009CIFAR], SVHN [Netzer2011SVHN], MNIST [LeCun1998MNIST], and Fashion-MNIST [Xiao2017Fashion-MNIST] datasets, to classify generic images, house numbers, handwritten digits, and fashion clothes, respectively. The accuracy results obtained by training these networks on the different datasets are reported in Table II. Table III shows the partitioning of the CapsNet operations into groups, which is then used in the group extraction step.
| # | Group | Description |
| 1 | MAC Outputs | Outputs of the matrix multiplications |
| 2 | Activations | Outputs of the activation functions (ReLU or squash) |
| 3 | Softmax | Results of the softmax (k coefficients in dynamic routing) |
| 4 | Logits Update | Update of the logits (b coefficients in dynamic routing) |
V-B TensorFlow Implementation
The proposed methodology is implemented in TensorFlow [Google2016TensorFlow]. First, the network is trained using standard approaches. Then, we modify the computational graph, stored in protobuf format, by including our noise injection model through the Graph tool. We implemented a specialized node for the noise injection, where the values of NM and NA can be specified as inputs. Hence, for each selected node of the graph, a new set of nodes is added. These nodes produce an output of the same shape as the original node, and consist of the set of operations for adding Gaussian noise with the given NM and NA, scaled by the range of the original node.
V-C Approximate Multiplier Library
We use the EvoApprox8B library [Mrazek2017Lib], which consists of 35 8-bit unsigned components. We select an 8-bit wordlength since it has been shown to be sufficiently accurate for the computational path of CapsNets [Marchisio2019CapsAcc].
VI Experimental Results
VI-A Detailed Analysis for the CIFAR-10 Dataset
As a case study, we report detailed results for the DeepCaps on the CIFAR-10 dataset. The results for the other benchmarks are reported in Section VI-C.
For the following analyses, we sweep the noise magnitude NM. To analyze the general case of error resilience, we select a zero-mean noise, i.e., NA = 0. In the experiment for Step 2 of our methodology, we inject the same noise into every operation within a group, while keeping the other groups accurate. From the results shown in Fig. 9, we notice that the Softmax and the Logits Update groups are more resilient than the MAC Outputs and Activations groups, because the CapsNet accuracy starts to decrease at a correspondingly lower NM for the latter two. Note that, for low NM, the noise injection slightly increases the accuracy, due to a regularization effect similar to that of dropout [Srivastava2014Dropout].
In Fig. 10, we analyze the resilience of each layer of the non-resilient groups (i.e., MAC outputs and Activations). We notice that the first convolutional layer is the least resilient, followed by the layers in the middle. Moreover, the Caps3D layer is the most resilient one. Since this layer is the only convolutional layer that employs the dynamic routing algorithm, we correlate the higher resilience to the iterations performed in this layer, because the coefficients are updated dynamically at run-time, thus they can adapt to the noise.
VI-B Evaluating the Selection of Approximate Components
The choice of the approximate component for each operation depends on the level of NM corresponding to a tolerable accuracy loss, which is typically null or very low. Recalling Eq. 2, the NM and NA parameters are dataset-dependent, because their values change according to the range of the inputs. In our case study (DeepCaps for CIFAR-10), we select a subset of elements from the inputs of every Conv2D layer of the DeepCaps, whose distribution (frequency of occurrence) is shown in Fig. 11 (left). The distribution is approximately Gaussian, but there is a peak in the distribution of the input feature maps, which is caused by a specific distribution of the input dataset. Indeed, the peak occurs in the first Caps2D layer, as shown in Fig. 11 (right).
Hence, we measure the NM and NA parameters of the selected multipliers in the library (Tab. IV). We use two different input distributions: the modeled one, based on random inputs generated with a uniform distribution, and the real one, based on the input distribution previously shown in Fig. 11. Note that these values differ slightly, because the NM and NA parameters are dataset-dependent. The major differences are due to an overestimation of the NA and NM by our modeled distribution. Therefore, the selection of approximate components based on our models can be systematically employed for designing approximate CapsNets.
| Multiplier | Power [µW] | Area [µm²] | NA (modeled) | NM (modeled) | NA (real) | NM (real) |
| 1JFF | 391 (-0%) | 710 (-0%) | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| 14VP | 364 (-7%) | 654 (-8%) | 0.0000 | 0.0001 | 0.0000 | 0.0001 |
| GS2 | 356 (-9%) | 633 (-11%) | 0.0004 | 0.0017 | 0.0001 | 0.0013 |
| CK5 | 345 (-12%) | 604 (-15%) | 0.0000 | 0.0002 | 0.0000 | 0.0002 |
| 7C1 | 329 (-16%) | 607 (-14%) | 0.0011 | 0.0033 | 0.0007 | 0.0026 |
| 96D | 309 (-21%) | 605 (-15%) | 0.0035 | 0.0077 | 0.0020 | 0.0051 |
| 2HH | 302 (-23%) | 542 (-24%) | -0.0001 | 0.0007 | -0.0001 | 0.0007 |
| NGR | 276 (-29%) | 512 (-28%) | 0.0001 | 0.0008 | 0.0002 | 0.0009 |
| 19DB | 206 (-47%) | 396 (-44%) | 0.0010 | 0.0019 | 0.0010 | 0.0021 |
| DM1 | 195 (-50%) | 402 (-43%) | 0.0003 | 0.0025 | 0.0005 | 0.0025 |
| 12N4 | 142 (-64%) | 390 (-45%) | 0.0018 | 0.0054 | 0.0019 | 0.0056 |
| 1AGV | 95 (-76%) | 228 (-68%) | 0.0027 | 0.0080 | 0.0026 | 0.0117 |
| YX7 | 61 (-84%) | 221 (-69%) | 0.0484 | 0.0741 | 0.0268 | 0.0347 |
| JV3 | 34 (-91%) | 111 (-84%) | 0.0021 | 0.0267 | -0.0028 | 0.0301 |
| QKX | 29 (-93%) | 112 (-84%) | 0.0509 | 0.0736 | 0.0293 | 0.0350 |

We have randomly selected 14 components, representative of the complete library.
VI-C Testing our Methodology on Different Benchmarks
We apply our methodology to the other benchmarks. The results coming from the resilience analysis of the Step 2 are shown in Fig. 12. A key property that we can observe is that MAC outputs and activations are less resilient than the other two groups. Moreover, we noticed that the logits update on the CapsNet [Sabour2017dynamic_routing] for MNIST (bottom right) is slightly less resilient than the same group on the DeepCaps [Rajasegaran2019DeepCaps] for MNIST (top right), because the CapsNet has only one layer that performs Dynamic routing, while the DeepCaps has two.
VI-D Results Discussion
From our analyses, we derive that CapsNets have interesting resilience properties. A key observation, valid for every benchmark, is that the layers computing the dynamic routing (ClassCaps and Caps3D), and the corresponding groups of operations (Softmax and Logits Update), are more resilient than the others. This outcome is attributed to a common feature of the dynamic routing: the values of the involved coefficients (logits b and coupling coefficients k, see Fig. 3) are updated dynamically, thereby adapting to the injected noise. Hence, more aggressive approximations can be tolerated for these computations.
VII Conclusions
We proposed a systematic methodology for analyzing the resilience of CapsNets under approximation errors, which can provide the foundation for designing approximate CapsNet hardware. We designed an error injection model that accounts for approximation errors, and modeled the errors of applying approximate multipliers in the computational units of CapsNet accelerators. We systematically analyzed the (group-wise and layer-wise) resilience of the operations and designed approximated CapsNets based on the different resilience levels. We showed that the operations in the dynamic routing are more resilient to approximation errors. Hence, more aggressive approximations can be adopted for these computations without sacrificing much classification accuracy. Our methodology provides the first step towards real-world approximate CapsNets, to realize their energy-efficient inference.
Acknowledgments
This work has been partially supported by the Doctoral College Resilient Embedded Systems, which is run jointly by TU Wien's Faculty of Informatics and FH-Technikum Wien, and partially supported by the Czech Science Foundation project 19-10137S.