
Assessing Capsule Networks With Biased Data

Machine learning based methods achieve impressive results in object classification and detection. Utilizing representative data of the visual world during the training phase is crucial to achieve good performance with such data-driven approaches. However, it is not always possible to access bias-free datasets; thus, robustness to biased data is a desirable property for a learning system. Capsule Networks have been introduced recently and their tolerance to biased data has received little attention. This paper aims to fill this gap and proposes two experimental scenarios to assess the tolerance to imbalanced training data and to determine the generalization performance of a model with unfamiliar affine transformations of the images. This paper assesses dynamic routing and EM routing based Capsule Networks and proposes a comparison with Convolutional Neural Networks in the two tested scenarios. The presented results provide new insights into the behaviour of capsule networks.


1 Introduction

A robust classification system is expected to give the same prediction for every image of the same class or for images representing the same element in different poses. Machine learning methods, such as Convolutional Neural Networks (CNNs), have been used in many classification, detection and recognition tasks [16, 10, 3]. However, in order to achieve good performance with data-driven approaches, data that are representative of the visual world are required [19, 14, 11]. While it is possible to mitigate some bias effects with de-biasing techniques [12] or with data augmentation [23], it is important to use machine learning approaches with good generalization performance, as this helps to design applications that are more robust to unseen or underrepresented imaging conditions. This paper focuses on the latter topic and presents a comparison between CNNs and Capsule Networks (CapsNets) [22, 7]. The neurons in a CapsNet are organized in groups denoted as Capsules [8]. In contrast to a single neuron, a capsule can learn a specific image entity over a range of viewing conditions such as viewpoint and rotation. With the use of a routing algorithm to interconnect the capsules, a CapsNet model would be affine invariant and spatially aware. While the behaviour of CNNs with biased data has been extensively investigated [11, 14, 15], how bias influences CapsNets’ performance has received little attention so far.

This paper aims to fill this gap by proposing two experimental scenarios. The first set of experiments evaluates a model’s classification accuracy with unfamiliar affine transformations. It introduces a capture bias [26] obtained with training and test data having transformation intensities sampled from different distributions. The second scenario assesses how a network’s performance varies when it is trained with a dataset in which several classes are overrepresented, compared with a dataset whose classes are evenly distributed. The results are presented for five network models: three dynamic routing-based CapsNets [22] with one, two and three capsule layers respectively, an EM-Matrix routing CapsNet [7] and a CNN, which serves as a comparison baseline.

The rest of this paper is organized as follows. Section 2 provides an overview of related work; Section 3 gives an introduction on capsule networks; Section 4 describes the method and criteria used for the performance evaluation. The results obtained are presented and discussed in Section 5. Finally, Section 6 draws conclusions and proposes possible extensions.

2 Related Work

The impact of bias on data-driven methods has been extensively explored in the literature. A review of various types of bias in machine learning datasets is provided in [5]. The problem of bias in popular datasets, dissected by cause, is presented in [26] and further discussed in [25], where several de-biasing methods are compared. The generalization performance of CNNs is assessed with unfamiliar scale factors in [11] and with unfamiliar yaw pose and lighting conditions in [14], utilizing face recognition tasks. The analysis of imbalanced data is addressed in [19] and [2]. In [19], several imbalanced datasets are built from CIFAR-10 [15] by means of class down-sampling and over-sampling and used to assess CNNs. In [2], the importance of choosing a suitable performance evaluation metric in the presence of imbalanced classes is discussed. To the best of our knowledge, the only work addressing the generalization problem for CapsNets is [6], which demonstrates that dynamic routing based CapsNets generalize faster than CNNs when training data is injected with a few examples of an unfamiliar class. Only a few other works analyze this type of CapsNet, but without considering bias or generalization performance: [27] and [20] only test CapsNets with more complex data than those utilized in the original paper [22]. Our paper aims to fill these gaps by proposing an analysis of the generalization performance with unfamiliar affine transformations and imbalanced training data for both the available architectures of CapsNets: dynamic routing based [22] (denoted as Vector-CapsNet from now on) and EM-Matrix routing based [7] (MatrixEM-CapsNet).

3 Capsule Networks

A capsule is a group of neurons whose activity is a tensor which can learn to detect a specific entity over a limited range of viewing conditions such as viewpoint, rotation and lighting [8]. Two Capsule Networks (CapsNets) are proposed in [22] and [7], which are characterized by the following architecture: 1) an input stage including one or more regular convolution layers; 2) a single Primary Capsule Layer consisting of a convolutional stage whose neurons are grouped into capsules; 3) one or more Capsule Layers, the last of which is the network output and consists of one capsule per class. Every pair of consecutive capsule layers (this includes the Primary layer) is fully connected by means of a routing stage. Routing allows a CapsNet to learn relationships between entities by directing the output of a capsule to the proper parent capsule located in the next level. For example, a capsule that has learned to recognize eyes will be routed towards the parent capsule for faces but not to a torso capsule.

CapsNets from [22] and [7] have significant differences in their capsule architecture and routing algorithm. The architecture from [22] (Vector-CapsNet) utilizes 1D vector capsules whose length is a hyperparameter. A capsule encodes an entity and its pose and, as in a CNN, deeper capsules encode higher level entities. The routing stage fully connects two consecutive capsule layers, thus the total input of a parent capsule $j$ in the upper layer depends on the output of every capsule $i$ in the lower layer. Dynamic routing between capsules works as follows. The output of a capsule $i$ is multiplied by a transformation matrix to obtain the prediction vector $\hat{u}_{i|j}$. If the prediction vector is similar to the output of the parent capsule $j$, then the routing algorithm concludes that $i$ and $j$ are highly related and assigns a high value to the related coupling coefficient $c_{ij}$. As the contribution to the total input of $j$ provided by the capsule $i$ is computed as $\hat{u}_{i|j}c_{ij}$, the coupling coefficient expresses how likely capsule $i$ is to activate capsule $j$. Furthermore, the capability of learning relationships between entities that characterizes CapsNets is due to the transformation matrix learned for each capsule pair $i$ and $j$.
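The routing loop described above can be sketched in a few lines of NumPy. This is a minimal illustration of routing-by-agreement as described in [22], not the implementation used for the experiments: the `squash` non-linearity, the three iterations, and all shapes and random values in the toy example are assumptions made for the sketch.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Keep the direction of s and map its length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Route child capsules i to parent capsules j.

    u_hat: prediction vectors, shape (num_children, num_parents, dim).
    Returns the parent outputs v, shape (num_parents, dim), and the
    coupling coefficients c, shape (num_children, num_parents).
    """
    num_children, num_parents, _ = u_hat.shape
    b = np.zeros((num_children, num_parents))  # routing logits
    for _ in range(num_iterations):
        # Coupling coefficients: softmax of the logits over the parents.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Total input of parent j: sum over i of c_ij * u_hat_ij.
        s = np.einsum("ij,ijd->jd", c, u_hat)
        v = squash(s)
        # Agreement between each prediction vector and the parent output.
        b = b + np.einsum("ijd,jd->ij", u_hat, v)
    return v, c

# Toy example with hypothetical sizes: 6 children of length 8, 3 parents of length 16.
rng = np.random.default_rng(0)
u = rng.normal(size=(6, 8))             # child capsule outputs
W = rng.normal(size=(6, 3, 16, 8))      # one transformation matrix per (i, j) pair
u_hat = np.einsum("ijkd,id->ijk", W, u)  # prediction vectors
v, c = dynamic_routing(u_hat)
```

Each row of `c` sums to one, so a child capsule distributes its output among the parents, and a parent whose output agrees with a child's prediction receives a larger share at the next iteration.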

The capsules of the network proposed in [7] (MatrixEM-CapsNet) consist of a scalar activation and a pose matrix. As in Vector-CapsNet, capsule layers are fully connected. Thus, each capsule in a layer is connected to each capsule in the next layer by means of a transformation matrix which is learned with an iterative routing algorithm based on EM (Expectation Maximization) clustering and denoted as EM Routing. The prediction of the parent capsule’s pose matrix (vote) is computed as the product between the child capsule’s pose matrix and the transformation matrix, and is utilized along with the child capsule’s activation by the routing algorithm to assign routes between each capsule $i$ in one layer and each capsule $j$ in the next layer.

The main difference between CapsNets and CNNs is how features are routed between layers. A CNN utilizes single neurons for representing image features and pooling operations as routing mechanisms. Pooling ensures invariance to small image changes (translation in particular) at the cost of information loss [17] and makes it nearly impossible for a CNN to learn relationships between image entities.
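The trade-off between invariance and information loss in pooling can be seen on a one-dimensional toy signal. The sketch below is only an illustration with made-up values; the window size and stride are assumptions.

```python
def max_pool_1d(signal, size=2, stride=2):
    """Max pooling over non-overlapping windows of a 1D signal."""
    return [max(signal[k:k + size])
            for k in range(0, len(signal) - size + 1, stride)]

a = [0, 0, 9, 0]   # a feature detected at position 2
b = [0, 0, 0, 9]   # the same feature shifted by one position

pooled_a = max_pool_1d(a)
pooled_b = max_pool_1d(b)
```

Both signals pool to the same output, so the one-position shift is tolerated, but the exact location of the feature inside the window can no longer be recovered from the pooled values.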

4 Experimental Setup

The proposed approach consists of two types of experiment to assess a network’s performance with unseen affine transformations and with prominent class imbalance.

4.1 Capture Bias Experiment

Training data and test data are built from the same dataset by applying affine transformations whose intensity is sampled from different distributions. Hence, a model becomes familiar with several image transformations which appear at different intensities in the training and test datasets. For example, if the considered transformation is rotation, the training set would be augmented by a rotation angle sampled from a given range, while the transformation magnitude for testing would be sampled from a wider range.
The performance metric utilized for these experiments is classification accuracy, which is the fraction of correct predictions among all predictions made. Hence, more general models are those achieving higher accuracy on unseen magnitudes of a given affine transformation.

In order to provide more comprehensive insights about the influence of unseen imaging conditions, two different criteria for sampling training data are used: uniform and sparse sampling.

4.1.1 Uniform Sampling

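Let $T$ be an affine transformation, $D_{\mathrm{train}}$ a training dataset, $D_{\mathrm{test}}$ the related test dataset, and $R_{\mathrm{train}}$ and $R_{\mathrm{test}}$ two magnitude ranges such that $R_{\mathrm{train}} \subset R_{\mathrm{test}}$. A network is trained with $D_{\mathrm{train}}$, where every sample $x \in D_{\mathrm{train}}$ is augmented with $T(x, m)$, where $m$ is a magnitude uniformly sampled from $R_{\mathrm{train}}$: $m \sim U(R_{\mathrm{train}})$. Our tests consist of running the model along the complete transformation range $R_{\mathrm{test}}$. Thus, a set of magnitudes $\{m_k\}$ is sampled at fixed-size steps starting from the lower bound of $R_{\mathrm{test}}$ until the end of the range. For each $m_k$, the complete dataset $D_{\mathrm{test}}$ is transformed with $T(\cdot, m_k)$ and used to compute a network’s accuracy. This process results in a curve showing the relationship between transformation magnitude and a model’s accuracy.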

4.1.2 Sparse Sampling

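Let $T$ be an affine transformation, $D_{\mathrm{train}}$ a training dataset, $D_{\mathrm{test}}$ the related test dataset, and $R_{\mathrm{train}}$ and $R_{\mathrm{test}}$ two magnitude ranges such that $R_{\mathrm{train}} \subset R_{\mathrm{test}}$. A finite subset of values is chosen from $R_{\mathrm{train}}$ to form a set $S$. A network is trained with $D_{\mathrm{train}}$, where every sample $x$ is augmented with $T(x, m)$, where $m$ is a magnitude uniformly sampled from $S$: $m \sim U(S)$. Our test procedure is the same as in the Uniform Sampling experiment.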
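The only change with respect to uniform sampling is that training magnitudes come from a finite set rather than from a continuous range. A minimal sketch, with a hypothetical magnitude set and a placeholder transformation:

```python
import random

def sample_sparse_magnitude(magnitude_set, rng=random):
    """Draw one magnitude uniformly from a finite set of allowed values."""
    return rng.choice(sorted(magnitude_set))

def augment_sparse(dataset, transform, magnitude_set, seed=0):
    """Apply the transformation to every sample with a sparsely sampled magnitude."""
    rng = random.Random(seed)
    return [(transform(x, sample_sparse_magnitude(magnitude_set, rng)), y)
            for x, y in dataset]

# Placeholder transformation that only records the magnitude it was given.
toy_dataset = [(0, "a"), (1, "b"), (2, "c")]
augmented = augment_sparse(toy_dataset, lambda x, m: (x, m), {-10.0, 10.0})
```

During testing, the model still sees every magnitude of the test range, including the values that lie between the few magnitudes seen in training.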

4.2 Imbalanced Data

A model trained with imbalanced classes presents a bias towards the overrepresented ones, which results in more frequent prediction of such majority classes [5]. The performance measure is the Matthews Correlation Coefficient (MCC) for multiple classes [9], as it has been shown to be less sensitive to imbalanced data than accuracy [2]. The MCC value falls in $[-1, 1]$, where $1$ corresponds to a perfect classification. A network is trained with both balanced and imbalanced data and the resulting MCC values are compared. Better models are expected to have a narrower gap between the MCC scores of balanced and imbalanced data.
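As an illustration of why MCC is preferred over accuracy here, the sketch below computes the multi-class MCC from a confusion matrix using the standard generalization of the binary formula. The function name and the toy confusion matrices are assumptions; returning 0 when the denominator vanishes is a common convention, not something stated in the paper.

```python
import math

def multiclass_mcc(confusion):
    """Multi-class MCC from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]
    pred_counts = [sum(confusion[r][k] for r in range(n)) for k in range(n)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Convention: MCC is 0 when the denominator is 0 (e.g. a single predicted class).
    return numerator / denominator if denominator else 0.0

perfect = [[5, 0], [0, 5]]
always_majority = [[90, 0], [10, 0]]   # 90% accuracy, but class 1 is never predicted

mcc_perfect = multiclass_mcc(perfect)
mcc_majority = multiclass_mcc(always_majority)
accuracy_majority = (90 + 0) / 100
```

A classifier that always predicts the majority class reaches 90% accuracy on the second matrix while its MCC is zero, which is the behaviour that makes MCC suitable for the imbalanced scenario.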

5 Results

   Model Layers
Table 1: Models assessed: cnn-wp is a CNN similar to the comparison baseline from [22], vcaps-s, vcaps-d and vcaps-t are Vector-CapsNet with single, double and triple capsule layers respectively, caps-em is a MatrixEM-CapsNet.

Results are presented for several models as listed in Table 1: cnn-wp is a CNN with three layers and max pooling, vcaps-s, vcaps-d and vcaps-t are Vector-CapsNets with one, two and three layers of capsules respectively, and caps-em is a MatrixEM-CapsNet. All the networks are implemented with Tensorflow [1]. In particular, vcaps-s, vcaps-d and vcaps-t are built on top of the source code provided by the authors of Vector-CapsNet [21], while caps-em is derived from the code shared at [28]. The cnn-wp model is implemented from scratch and has an architecture and hyperparameters similar to those of the comparison baseline from [22] used to evaluate Vector-CapsNet on the MNIST dataset [18]. For the notation in Table 1, the following convention is utilized. C represents a convolutional layer, described by its kernel size, stride and number of filters; P indicates a max pool layer, described by its kernel size and stride; F is a fully connected network with a single hidden layer, described by its number of neurons; Pr indicates a Primary Capsule Layer, described by its number of capsules, their length and the number of iterations of the routing algorithm; Cps represents a capsule layer, described by the same three parameters as Pr. Except for an additional convolutional layer at the start, caps-em has the same architecture as proposed in [7] but uses fewer capsules per layer. While in [7] the hyperparameters A, B, C and D are all equal to 32, our implementation reduces the complexity of the network by lowering B, C and D. This compromise was necessary to run caps-em with at least 2 routing iterations on our graphics card with 8GB of RAM. The models have been trained with the Adam [13] optimizer with default parameters, using one initial learning rate for Vector-CapsNet and cnn-wp and a separate one for MatrixEM-CapsNet. The loss function used to train vcaps-s, vcaps-d and vcaps-t is Margin Loss [22]. The Spread Loss [7] has been used for caps-em, with the margin increasing gradually over around 10 epochs. Regularization has been obtained with a reconstruction stage consisting of a neural network with two hidden layers.
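For reference, the Margin Loss of [22] for a single sample can be sketched as below. The default values `m_plus = 0.9`, `m_minus = 0.1` and `lam = 0.5` are those reported in [22] and are an assumption here; they may differ from the values used in these experiments. The function name and the toy capsule lengths are hypothetical.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss for one sample.

    lengths: length of each output capsule (one per class), each in [0, 1).
    target: index of the correct class.
    """
    loss = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # The correct class is penalized when its capsule is too short.
            loss += max(0.0, m_plus - length) ** 2
        else:
            # Wrong classes are penalized (down-weighted by lam) when too long.
            loss += lam * max(0.0, length - m_minus) ** 2
    return loss

confident = margin_loss([0.95, 0.05, 0.05], target=0)
uncertain = margin_loss([0.5, 0.5], target=0)
```

A confident, correct prediction incurs no loss, while two capsules of equal length are both penalized, one for being too short and the other for being too long.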

5.1 Generalization Performance on Unfamiliar Affine Transformations

Figure 1: Several MNIST images as they are transformed and padded for testing a model’s accuracy.

Generalization performance with uniformly sampled affine transformations (Section 4.1.1) has been assessed utilizing affMNIST [24] as training data and MNIST [18] for tests. AffMNIST is a dataset obtained from MNIST by applying to each image several uniformly sampled transformations, namely rotation, scale, shear along the $x$ axis and translation. As compared to MNIST, which has 28 pixel images, affMNIST has 40 pixel images in order to fit scaled up digits. Accuracy data is obtained for each transformation using the MNIST test set with extended ranges of rotation, scale factor, horizontal shear and horizontal translation ($x$ axis). As the tests required wider ranges of transformations than those available during training, the models have been fed with 56 pixel images obtained by zero-padding affMNIST images. Padding allowed us to test the models with scale factors up to 2.0 and wider translations than those present in affMNIST without cropping the MNIST digits. Figure 1 shows some samples from MNIST as they are transformed and padded for testing a model’s accuracy.
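The zero-padding step from 40 to 56 pixels can be sketched as follows. Centering the original image inside the padded canvas is an assumption made for this example, as is the function name; the image is represented as a plain list of rows to keep the sketch dependency-free.

```python
def zero_pad(image, target_size):
    """Zero-pad a square image (list of rows) to target_size x target_size, centered."""
    size = len(image)
    before = (target_size - size) // 2
    after = target_size - size - before
    blank_row = [0] * target_size
    padded_rows = [[0] * before + list(row) + [0] * after for row in image]
    return ([blank_row[:] for _ in range(before)]
            + padded_rows
            + [blank_row[:] for _ in range(after)])

# A 40x40 image of ones stands in for an affMNIST sample.
image_40 = [[1] * 40 for _ in range(40)]
image_56 = zero_pad(image_40, 56)
```

With a 40 pixel input and a 56 pixel canvas, 8 pixels of background are added on every side, which leaves room for larger scale factors and translations without cropping the digit.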

The results for uniform sampling experiments are shown in Figure 2 where the accuracy as a function of an affine transformation is plotted for each model.

Figure 2: Accuracy as a function of rotation angle (a), shear along the $x$ axis (b), scale factor (c) and horizontal translation (d). The green area indicates the affine transformation range available in training data (affMNIST).

The most prominent difference among models occurs with unfamiliar scales, where vcaps-t outperforms both cnn-wp and the other capsule networks. A closer look at the scale plot (Figure 2.c) allows us to infer a positive relationship between the number of capsule layers in Vector-CapsNets and generalization performance with unfamiliar scale factors. Indeed, vcaps-t achieves better accuracy than vcaps-s and vcaps-d at each scale factor larger than 1.2, which is the largest scale present in affMNIST. On the contrary, for small test scales this trend is inverted and it appears that Vector-CapsNet has the slowest decay in accuracy among the considered models. Also with rotation, CapsNets generalize better than the other types of networks, maintaining their accuracy over an interval that is wider than the rotation interval used to generate affMNIST.

The same four affine transformations have been considered in sparse sampling experiments. Model training is carried out by augmenting MNIST samples with a single transformation at a time, whose intensity is sampled from a finite set. Hence, rotation, scale, horizontal shear and horizontal translation are each sampled from their own finite set of values.

The models do not present significant differences with respect to each other for rotation and horizontal shear (Figure 3). In particular, the networks show very good generalization performance on unseen shear magnitudes. In fact, including just two shear values in the training set yields an almost flat accuracy plot along the entire shear test range. Generalization performance with sparse shear sampling is consistent with the results obtained with uniform sampling, where the models’ accuracy also has a flat trend along the entire test interval. Similarly to the uniform sampling scenario, the scale results show that deeper Vector-CapsNets generalize better than the other models with unfamiliar scale factors.

The results from sparse translation experiments show that cnn-wp and the three considered Vector-CapsNets have a prominent accuracy drop in the middle of the test interval, while caps-em has stable accuracy over the entire test interval. The performance gap between caps-em and Vector-CapsNet is probably due to the routing algorithm, which is the main difference between these two types of network (Section 3).


Figure 3: Effect of sparse sampling of affine transformations in the training data. Accuracy is represented as a function of rotation angle (a), shear along the $x$ axis (b), scale factor (c) and horizontal translation (d).

5.2 Performance Analysis with Imbalanced Data

Figure 4: Confusion matrices of the vcaps-t model for BAL and I-BAL. The three over-represented classes are predicted more often by the model trained with I-BAL, which results in an increase in misclassifications.
   Model     MNIST (accuracy)  affMNIST (accuracy)  BAL (MCC)  I-BAL (MCC)  GAP
   cnn-wp    0.9923            0.9926               0.9258     0.9021       -0.0237
   vcaps-s   0.9958            0.9999               0.9202     0.8973       -0.0229
   vcaps-d   0.9935            0.9981               0.9336     0.8929       -0.0407
   vcaps-t   0.9933            0.9999               0.9139     0.9004       -0.0135
   caps-em   0.9827            0.9961               0.8899     0.7483       -0.1416
Table 2: The models’ accuracy with MNIST and affMNIST and the models’ MCC with balanced (BAL) and imbalanced (I-BAL) datasets. GAP shows the difference between BAL and I-BAL MCC values.
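The GAP column of Table 2 can be recomputed directly from the BAL and I-BAL columns as I-BAL minus BAL. The short check below uses the MCC values reported in the table; the dictionary layout and the model labels (written as in the text) are just one convenient way to organize them.

```python
# (BAL MCC, I-BAL MCC) for each model, as reported in Table 2.
mcc = {
    "cnn-wp": (0.9258, 0.9021),
    "vcaps-s": (0.9202, 0.8973),
    "vcaps-d": (0.9336, 0.8929),
    "vcaps-t": (0.9139, 0.9004),
    "caps-em": (0.8899, 0.7483),
}

# GAP = I-BAL - BAL, rounded to the four decimals used in the table.
gaps = {name: round(i_bal - bal, 4) for name, (bal, i_bal) in mcc.items()}

most_robust = max(gaps, key=gaps.get)    # gap closest to zero
least_robust = min(gaps, key=gaps.get)   # largest drop
```

All gaps are negative, meaning that every model loses MCC when trained on the imbalanced dataset; the model with the gap closest to zero is the least affected.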

The datasets utilized for these experiments have been generated from EMNIST-Letters [4], which consists of 26 balanced classes of handwritten letters with 4800 samples each. The balanced dataset (BAL) is a subset of EMNIST-Letters including 10 of its classes with 2400 samples each, while for the imbalanced dataset (I-BAL) the classes have been down-sampled to 600 images, except for three classes that keep their 4800 samples from EMNIST-Letters. Figure 4 shows the confusion matrices of vcaps-t for BAL and I-BAL. As expected, the three overrepresented classes are predicted more often. This is particularly evident for classes that are similar to each other. Indeed, the similarities between lowercase and uppercase letters result in several misclassifications even with the BAL dataset. In I-BAL, one letter of such a pair is overrepresented with respect to the other, which is then wrongly classified as the overrepresented letter more than half of the time. The MCC values for all the models are summarized in Table 2. The model least robust to imbalanced data is caps-em, with a gap between BAL and I-BAL of 0.1416. cnn-wp and vcaps-s have similar results, while vcaps-t achieves the best performance with a gap of 0.0135, which is about 60% of vcaps-s’ gap (0.0229).
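The construction of BAL and I-BAL by per-class down-sampling can be sketched as follows. The integer labels, the choice of which three labels are kept at full size, and the synthetic samples are hypothetical stand-ins for the EMNIST-Letters classes; only the per-class counts (4800, 2400 and 600) come from the text.

```python
import random
from collections import defaultdict

def subsample_per_class(samples, per_class, seed=0):
    """Keep a fixed number of randomly chosen samples for each requested label.

    samples: list of (x, label) pairs.
    per_class: dict mapping each label to keep to its number of samples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in samples:
        by_label[y].append(x)
    subset = []
    for label, count in per_class.items():
        chosen = rng.sample(by_label[label], count)
        subset.extend((x, label) for x in chosen)
    return subset

# Synthetic source: 10 classes with 4800 samples each.
labels = list(range(10))
source = [(idx, label) for label in labels for idx in range(4800)]

majority = {0, 1, 2}   # hypothetical choice of the three overrepresented classes
bal = subsample_per_class(source, {label: 2400 for label in labels})
i_bal = subsample_per_class(
    source, {label: 4800 if label in majority else 600 for label in labels}
)
```

With these counts, BAL contains 24000 samples and I-BAL contains 18600, of which 14400 belong to the three overrepresented classes.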

The number of capsule layers alone does not explain the better performance of vcaps-t over vcaps-s. Indeed, vcaps-d outperforms the other networks with BAL (MCC of 0.9336) but it also has the widest gap with imbalanced data among Vector-CapsNets: 0.0407. Several double layer architectures were examined other than vcaps-d, but it was neither possible to find a better model nor to precisely determine the factor that influences the performance the most. For example, replacing the two capsule layers of vcaps-d (Table 1) with a different pair of capsule layers increased the number of learnable parameters; however, the performance decreased slightly from BAL to I-BAL in our experiments.

6 Conclusions

The analysis of capsule networks has received little attention. This paper aimed to provide novel insights into this new type of neural network and proposed several experiments to assess the performance of a network with biased data. Overall, CapsNets outperform CNNs in most of the cases, but not by a large gap. Our results have allowed us to infer that the number of capsule layers (depth) influences generalization performance. This is particularly evident in the scale plots (Figure 2.c), where the accuracy at unseen scales improves with network depth. Apart from this, the influence of a CapsNet’s hyperparameters is not fully understood and deserves a more detailed and specific analysis. On imbalanced data, vcaps-t outperforms all the other networks by a consistent gap, but the contribution of the additional capsule layer of vcaps-d, which is affected by imbalanced data more than vcaps-s, remains unclear. Finally, the worst model in any scenario is caps-em, with the exception of sparse translation (Figure 3). However, it is worth mentioning that the caps-em implementation is not from its authors and includes fewer capsules than the model originally proposed in [7]. Indeed, our Tensorflow implementation is very demanding in terms of RAM, and caps-em is the most complex model that can fit in an 8GB graphics card. A natural extension of this work would include MatrixEM-CapsNet once an official implementation is available. Furthermore, new insights would be provided by a more specific analysis of the relationship between generalization performance and hyperparameters such as the depth and the distribution of capsules among a CapsNet’s layers.


This work has been supported by the UK Engineering and Physical Sciences Research Council EPSRC [EP/K004638/1, EP/R02572X/1 and EP/P017487/1].