In the last decade, Convolutional Neural Networks (CNNs) have shown remarkable performance on a wide range of computer vision tasks [11, 5, 8]. However, CNNs have notable drawbacks, such as the inability to learn viewpoint-invariant representations and the need for large amounts of training data. Capsule Networks (CapsNets) [19, 6] are a neural network architecture that tackles those shortcomings by using capsules. A capsule is a group of neurons (depicted as a vector) whose output represents the various perspectives of an entity, such as its pose, texture, scale, or the relative relationship between the entity and its parts.
This technique has immense potential in the medical field, such as in cell classification, where (i) different types of cells are classified depending on the hierarchical relationship of the cell and its parts (shape of the nucleus, texture of the cytoplasm, presence of subcellular organelles), and where (ii) rotational invariance is crucial. However, CapsNets require a large number of parameters when the network is enlarged, and they have mainly shown promising performance for small images and/or with barely any background noise.
In this work, we present the concept of CapsNet parallelization, where parts of the network are subdivided into branches to isolate capsules, helping the network to (i) identify different entities in different branches, and (ii) avoid instability problems when capsule layers are enlarged. This concept is applied to both types of current CapsNets [19, 6]. We also propose a variation of Sabour et al.'s CapsNet [19]: losing the spatial information in the first layer of capsules, which forces the middle layer of capsules to encode whole entities. Finally, we show how, contrary to general assumption, conventional CapsNets do not seem to perform proficiently as more capsule layers are added, and they are not more robust than CNNs on small datasets.
1.1 Capsule Networks
Sabour et al. [19] proposed the first CapsNet based on dynamic routing (DR), with one CNN and two capsule layers, to solve the MNIST dataset (images of 28×28 px). Their first layer of capsules, Primary-Caps, took the output of a convolution (6×6×256) and considered that every 8 elements along the feature axis represent a capsule instantiation, thereby creating 32 capsules, each one evaluated in a grid of 6×6. The latter simply entails that, in the subsequent steps, the weights ($W_{ij}$) that multiply a lower-layer capsule to produce the next-layer capsule are shared between the capsules of the grid. Furthermore, they proposed that the module of a capsule vector should represent a probability (with range 0–1), and thus they defined a squashing function

$$v_j = \frac{\|s_j\|^2}{1+\|s_j\|^2}\,\frac{s_j}{\|s_j\|}, \qquad s_j = \sum_i c_{ij}\,W_{ij}\,u_i,$$

where $v_j$ and $s_j$ are the squashed and non-squashed capsule $j$, respectively, and the capsule $s_j$ is computed as indicated above for all (except the first) layers of capsules, where $c_{ij}$ are the coefficients obtained by the dynamic routing and $u_i$ are the squashed capsules from the lower layer. Briefly, the dynamic routing aims to determine how close the predicted vectors ($W_{ij}u_i$) are to the mean predicted vector (by using the scalar product), giving a higher $c_{ij}$ to those closer. They also defined a specific loss function, named 'margin loss'. Further details in [19].
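The squashing function and the routing-by-agreement loop described above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; the shapes, the number of routing iterations, and the use of a plain softmax over the routing logits are illustrative assumptions.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity: rescales capsule vector s so its norm
    lies in (0, 1) while preserving its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iter=3):
    """Routing-by-agreement between two capsule layers (sketch).
    u_hat: predicted vectors W_ij @ u_i, shape (n_lower, n_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                          # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs c_ij
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum -> (n_upper, dim)
        v = squash(s)                                         # upper-layer capsules
        b += (u_hat * v[None]).sum(axis=-1)                   # agreement: scalar product
    return v
```

The scalar product in the last line is what pushes a higher $c_{ij}$ towards those lower-layer capsules whose predictions agree with the (squashed) mean prediction.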
Hinton et al. [6] proposed a different network (with one CNN and four capsule layers) to solve the smallNORB dataset (images of 96×96 px). They extended the concept of CapsNets in the following ways: (i) capsules were depicted as matrices instead of vectors; (ii) a routing based on the Expectation-Maximization (EM) algorithm was proposed, where the matching between capsules is done by considering that a higher-layer capsule represents a Gaussian and the lower-layer capsules are data points (further details in [6]); (iii) convolutional capsules were presented, where a higher-layer capsule is computed only from the neighbouring lower-layer capsules (K in Fig. 1); (iv) a technique called 'Coordinate Addition' was proposed (applied only in the last convolutional capsule), which adds the position of the capsule to its vote matrix in order to keep the spatial information; (v) a new loss function, spread loss, was defined.
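Coordinate Addition (point iv) can be sketched as below. This is a hedged illustration, not Hinton et al.'s code: the exact normalization of the grid coordinates and which two elements of the flattened vote matrix receive them are assumptions.

```python
import numpy as np

def coordinate_addition(votes, grid_h, grid_w):
    """Coordinate Addition (sketch): add each capsule's normalized
    (row, col) position on the grid to the first two elements of its
    flattened vote matrix, so spatial information survives the pooling
    into the class capsules.
    votes: shape (grid_h, grid_w, n_caps, 16) -- flattened 4x4 pose votes."""
    votes = votes.copy()
    rows = (np.arange(grid_h) + 0.5) / grid_h   # normalized row centres
    cols = (np.arange(grid_w) + 0.5) / grid_w   # normalized column centres
    votes[..., 0] += rows[:, None, None]        # broadcast over width and capsules
    votes[..., 1] += cols[None, :, None]
    return votes
```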
1.2 Related Work
Many publications have used CapsNets to perform tasks on medical images. The majority simply applied Sabour et al.'s network [19] to small patches to perform tasks such as detection of diabetic retinopathy (fundus images) and mitosis (histology images, H&E), classification of breast cancer [9, 2] and colorectal tissue (H&E), detection of liver lesions (CT, adding attention gates to the network), and blood vessel segmentation (fundus, with an inception block as CNN). Others used Sabour et al.'s network on larger images by adding more CNN blocks (glaucoma detection, OCT) or by heavily downsampling the images (brain tumour classification, MRI). Only a few explored the use of a 3-capsule-layer DR-CapsNet, for the classification of white cells and the detection of cell apoptosis, and an interesting use of DR within a CNN architecture has been reported for the classification of thoracic disease (CT). Special attention goes to LaLonde et al., who have developed CapsNets with several capsule layers to perform polyp classification and have expanded the concept of CapsNets to perform segmentation (CT images of the lungs). Hinton et al.'s architecture [6] has seldom been applied in the literature.
2.1 Parallel Capsule Networks
Parallelization can be applied in several ways, depending on where the branches start and merge. In this work, we focus on studying networks with 3 capsule layers (3-CapsLayer). These contain Primary-, Mid-, and Class-Caps, and the parallelization is performed by creating a unique set of CNNs and Primary-Caps for each Mid-Cap, which are then concatenated to be routed to the Class-Caps (Fig. 1, top). Several CNN blocks precede the capsule section, allowing the image to be reduced to a suitable size for the capsules. The first CNNs, which generate basic features, are common to all branches. Our chosen CNN is ResNeXt [22].
Since there is only one Mid-Cap per branch, there is no routing between Primary-Caps and Mid-Caps, which means that the algorithms only need to find the most appropriate transformation matrix per Primary-Cap. This allows each branch to focus on generating specialized features (either in the CNNs or the Primary-Caps) that are suitable for its Mid-Cap.
For comparative purposes, networks with 2 and 4 capsule layers were also tested. The 2-CapsLayer does not include Mid-Caps, and thus the different sets of Primary-Caps are concatenated before routing to the Class-Caps. The 4-CapsLayer performs the merging before the Class-Caps.
2.1.1 Parallel EM-CapsNets, 3 capsule layers.
Our proposed network (Fig. 1, top) applies parallelization to Hinton et al.'s architecture [6]. All capsules have a 4×4 pose matrix and an activation value. In a branch, there are 32 Primary-Caps and 1 Mid-Cap. The window (K) used in the convolutional capsules is 3×3 for the Mid-Caps and 1×1 for the Class-Caps, with a stride of 2 for the Mid-Caps. Coordinate Addition is not employed.
2.1.2 Parallel DR-CapsNets, 3 capsule layers.
Our DR network (Fig. 1, bottom) introduces two changes to Sabour et al.'s network [19]: the weights are not shared among the capsules of the grid, and the Primary-Caps are 256 elements long (two capsules per grid point). The former entails that the spatial information is lost in the first layer, forcing the Mid-Caps to encode the whole existence of an entity. Furthermore, since all Primary-Caps are directed to a single Mid-Cap, the algorithm can only discard a useless Primary-Cap by setting its transformation matrix to zero; thus, not sharing weights can facilitate such discarding. Both the Mid-Caps and the Class-Caps have a size of 16 elements.
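One branch of this variant can be sketched as follows: with a single Mid-Cap per branch, no routing is needed, so each Primary-Cap is multiplied by its own (non-shared) transformation matrix and the votes are simply summed and squashed. This is a minimal numpy sketch under assumed shapes; the real network learns the matrices by backpropagation.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squashing non-linearity (as in Sabour et al.)."""
    n2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def branch_mid_cap(primaries, W):
    """One branch (sketch): each Primary-Cap i gets its own matrix W[i]
    (weights not shared across the grid), and the votes are summed into
    the branch's single Mid-Cap. A useless Primary-Cap can be discarded
    by driving its W[i] to zero.
    primaries: (n_primary, d_in), W: (n_primary, d_out, d_in)."""
    votes = np.einsum('ioj,ij->io', W, primaries)   # per-capsule transforms
    return squash(votes.sum(axis=0))                # -> Mid-Cap, shape (d_out,)

def parallel_mid_caps(branches):
    """Concatenate the Mid-Caps of all branches before routing to Class-Caps.
    branches: list of (primaries, W) pairs, one per branch."""
    return np.stack([branch_mid_cap(p, W) for p, W in branches])
```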
2.1.3 Parallel CapsNets, 2 capsule layers.
In the EM network, the Primary-Caps are evenly distributed among the different branches, and the last CNN block is duplicated to allow another stride. In the DR network, the length of the Primary-Caps is reduced according to the number of branches.
2.2 Data, baseline networks, implementation details, and metrics
2.2.1 Data Description.
We evaluate the proposed networks on a public dataset of white blood cells (leukocytes) from patients with acute myeloid leukaemia (AML), a blood cancer that leads to the overproduction of abnormal leukocytes, published by Matek et al. [16] (Laboratory of Leukemia Diagnostics at Munich University Hospital, Germany). The dataset contains 18,365 single-cell images of 400×400 pixels from 15 highly unbalanced classes (see Table 2). This dataset is particularly interesting because (i) leukocytes show hierarchical structures (the nucleus might depict a prominent nucleolus or be formed by different segments, and the cytoplasm might show different textures), (ii) it is highly unbalanced, and (iii) many red cells appear in the background (noise).
2.2.2 Baseline Networks.
The network ResNeXt-50 [22] was chosen as the baseline. This network was also used by the dataset's authors [16], but we achieved higher performance with the same network and setup: +6% in PRE and SEN. DenseNets [8] and ResNets [5] were also tested, but they provided significantly inferior performance. An adapted single block of ResNeXt was chosen for the CNN blocks in the CapsNets, which also gave better results than other alternatives. We also tested the non-parallel versions of both architectures (EM and DR).
2.2.3 Implementation Details.
All networks were implemented in TensorFlow 2.2 on a single NVIDIA V100 GPU with 32 GB of memory. In order to determine the most appropriate number of branches in the networks, each class in the dataset was subdivided into 5 folds, using 4 for training and 1 for validation. Once the best branching was established, a 5-fold cross-validation (CV) was performed on the whole dataset for a final comparison (specifically, we used the same aforementioned 5 folds in the CV setup). Each batch contained one example of each class (15 images), randomly shuffling the order within the batch to avoid bias. Data augmentation was performed by flipping the images up-down and left-right and by rotating them (0–180°). For the CNN baseline, the loss function was categorical cross-entropy. For the CapsNets, we used the loss functions suggested by their original authors (margin loss and spread loss). The Nadam optimizer [3] was used, with a learning rate of 0.001. We defined an epoch as 500 iterations and trained for 700 epochs (no early stopping), which required 6 to 20 days depending on the number of layers.
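The class-balanced batch scheme described above can be sketched in a few lines. This is an illustrative sampler, not the paper's code; the `dataset` structure (a mapping from class label to image identifiers) is an assumption.

```python
import random

def balanced_batches(dataset, n_batches, seed=0):
    """Yield batches with exactly one example per class, shuffling the
    order within each batch to avoid positional bias.
    dataset: dict mapping class label -> list of image identifiers (assumed)."""
    rng = random.Random(seed)
    labels = sorted(dataset)
    for _ in range(n_batches):
        batch = [(lbl, rng.choice(dataset[lbl])) for lbl in labels]
        rng.shuffle(batch)   # randomize order within the batch
        yield batch
```

With 15 classes this yields batches of 15 images, so every class is seen equally often regardless of how unbalanced the dataset is.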
2.2.4 Evaluation Metrics.
To quantify the performance, we used the weighted (WAcc) and non-weighted (Acc) categorical accuracy, reported as percentages and computed per class $c$ from the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). We also defined an agreement metric (Agr) that measures the percentage of images classified as the same type in all eight basic orientations (combinations of 90° rotations and up-down/left-right flips), regardless of whether the classification was correct. A higher agreement should suggest better rotational invariance. For the final architectures, we estimated the sensitivity, SEN = TP/(TP+FN), and the precision, PRE = TP/(TP+FP).
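The agreement metric can be sketched as below. This is a minimal illustration under the assumption that the eight basic orientations are the four 90° rotations, each with and without a left-right flip; `predict` stands in for any trained classifier.

```python
import numpy as np

def eight_orientations(img):
    """The eight basic orientations of an image: four 90-degree rotations,
    each with and without a left-right flip (assumed interpretation)."""
    out = []
    for k in range(4):
        r = np.rot90(img, k)
        out.extend([r, np.fliplr(r)])
    return out

def agreement(images, predict):
    """Agr (sketch): percentage of images whose predicted class is identical
    under all eight orientations, regardless of correctness.
    predict: callable mapping an image array to a class label (hypothetical)."""
    agree = sum(
        len({predict(o) for o in eight_orientations(img)}) == 1
        for img in images
    )
    return 100.0 * agree / len(images)
```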
3.0.1 Non-parallel architectures.
In the DR-CapsNets, the 2-layer type performed rather poorly (Table 1), the 3-layer type increased performance but diverged with more than two Mid-Caps (Fig. 2-B), and the 4-layer type was unable to converge. Indeed, our proposed modification of the DR networks (losing the spatial information in the Primary-Caps) is not suitable for networks with only two layers. The EM network allowed for deeper architectures, but performance decreased with the addition of layers and/or Mid-Caps (Fig. 2). Building a 4-layer EM network was feasible, but performance was poor in all possible setups (WAcc in the range of 60–65). This suggests that, contrary to popular belief, the original EM-CapsNet does not perform as expected when the capsule section is enlarged in layers. EM-CapsNets were also highly sensitive to the position of the capsule strides; placing a stride in the Primary-Caps is not advisable. Other instability problems were observed: EM networks were unable to converge when either (i) the image was only reduced to 50×50 before entering the capsule section (regardless of whether further striding was performed in the capsule section), or (ii) the Primary-Caps were increased from 32 to 64 (regardless of the number of capsule layers). In summary, these findings suggest that both routing algorithms fail when a large number of capsules take part. Overall, only the 2-layer EM network (better WAcc) and the 3-layer DR network (better Acc and Agr) were slightly better than ResNeXt-50, and their numbers of parameters were notably lower (Table 1).
3.0.2 Parallel architectures.
As expected, the parallelization of 2-layer networks was rather detrimental: even though DR-CapsNets provided slightly better Acc and Agr in all cases, WAcc was always lower (Fig. 2-A), and the EM-CapsNet provided lower results in all three metrics for all cases. In contrast, parallelizing 3-layer CapsNets was highly beneficial, particularly for our DR network (Fig. 2-B), which did not depict convergence problems with the addition of branches. Regarding EM-CapsNets, the right number of branches yielded a peak in accuracy (Fig. 2-B), and Acc and Agr were always higher than in their non-parallel counterparts. In general, this agrees with our hypothesis that parallelization allows each branch to detect an entity independently (in its Mid-Cap) without affecting the remaining Mid-Caps. The drawback is the large increase in parameters: the count grew by 3M per new branch in the EM network (versus only 0.01M per Mid-Cap in the non-parallel version) and by 6M per branch in the DR network (5M in the non-parallel version). However, the increase in computational time was not excessive (0–50% more). Overall, the DR-CapsNet outperformed ResNeXt-50 with the right selection of branches (much higher Acc and Agr, Table 1), but the EM-CapsNet still performed best with only 2 layers and no branching (none of the EM networks, parallel or not, outperformed ResNeXt-50 in Acc and Agr).
3.0.3 Size of the capsules.
The original CapsNets [19, 6] employed 16 elements to encode a capsule, but the relevance of that size, or whether capsules from different layers should have different sizes, was not discussed. We therefore experimented with several sizes (9, 16, and 25, and their equivalent matrices 3×3, 4×4, 5×5) in all possible combinations, and found that (i) a size of 16 is the best among our options, (ii) it is preferable, but not crucial, that all layers have the same capsule size, (iii) a smaller capsule (9) worked surprisingly well as long as all layers employ it, and (iv) a larger size (25) started overfitting.
3.0.4 Performance for small training data.
To evaluate this, we split the dataset in the following way: up to $n$ images per class were used to create 5 folds, using 4 folds for training and 1 for testing, and the remaining images beyond $n$ were placed in the test set. For the lowest $n$, the classes were balanced, becoming more unbalanced as $n$ increases. This was tested on the baseline ResNeXt-50 and the best CapsNets. Interestingly, our experiments depicted a similar behaviour in all cases as $n$ increased, which contradicts the assumption that CapsNets outperform CNNs on smaller datasets.
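The capped split can be sketched as follows. This is an illustrative reconstruction (the actual values of $n$ and the fold assignment are not recoverable from the text); `dataset` maps each class label to its list of image identifiers.

```python
import random

def capped_split(dataset, n, seed=0):
    """Small-training-data split (sketch): at most n images per class form
    the 5 folds (4 train / 1 test); any images beyond n go straight into
    the test set, so low n gives balanced classes.
    dataset: dict mapping class label -> list of image identifiers (assumed)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(5)]
    extra_test = []
    for label, images in dataset.items():
        images = images[:]                      # don't mutate the caller's lists
        rng.shuffle(images)
        kept, rest = images[:n], images[n:]
        for i, img in enumerate(kept):
            folds[i % 5].append((label, img))   # spread kept images over 5 folds
        extra_test.extend((label, img) for img in rest)
    train = [x for f in folds[:4] for x in f]
    test = folds[4] + extra_test
    return train, test
```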
3.0.5 Overall benefits of our proposed networks.
The non-parallel CapsNets showed convergence problems when a capsule layer became slightly large, which limits the number of entities that can be detected in a layer. However, there is no supportive evidence to believe that capsule layers should be fully connected to each other. Indeed, a capsule (object) would only need connections to a reduced number of lower capsules (parts of the object), and different objects might be formed by completely different parts. Thus, branching helps to overcome the aforementioned limitation, although at the expense of a higher number of parameters. Moreover, the 3-layer CapsNets showed an interesting behaviour: the case with a single Mid-Cap was able to encode a generic white cell in just one capsule of 16 elements that, subsequently, was transformed into 15 different classes. This highlights the power of capsules to encode entities and transformations. It also suggests that branching should be considered for heterogeneous datasets (groups of objects with morphological dissimilarities among the groups).
[Table 2. Columns: type of cell, number of images, and precision/sensitivity for each of the compared networks.]
3.0.6 Final results.
Our proposed 3-layer parallel DR-CapsNet provided slightly better sensitivity and precision (Table 2), but there was no clear pattern (the detection of low-represented classes was not greatly improved by CapsNets). Overall, we believe that three layers of capsules is appropriate for this dataset because of the morphological structure of white cells: the CNNs might denoise the image from background red cells while retaining the important features of the white cells, the Primary-Caps might encode those basic features into capsules, the Mid-Caps might then encode whole entities (nucleus, cytoplasm, or even generic whole cells), and the Class-Caps might simply be the connection (and transformation) of different Mid-Cap entities. Our experiments also seemed to suggest that losing the spatial information in the layer prior to the merging is the most appropriate approach to exploit branching, but we could not test that hypothesis in the EM network due to lack of time. Many other experiments could also be conducted to further improve the performance: branches with different sizes, merging some branches at different layers, etc.
Our work suggests that, for the classification of white cells, the original CapsNets (i) do not generally outperform a well-established CNN (ResNeXt-50) unless a simple 2-layer network is used, and therefore (ii) adding more capsule layers is usually detrimental; moreover, they (iii) are not more robust for small training data, (iv) tend to be very sensitive to the tuning parameters, (v) are unable to converge if a layer contains too many capsules, and (vi) do not seem to provide outstanding rotational encoding. In contrast, our proposed parallel DR-CapsNet seems to better learn viewpoint-invariant representations (highest Agr), provides better accuracy (highest Acc), and does not suffer from convergence problems.
-  (2018) Brain tumor type classification via Capsule Networks. In 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, pp. 3129–3133. Cited by: §1.2.
-  (2019) Breast cancer classification using Capsule Network with preprocessed histology images. In 2019 International Conference on Communication and Signal Processing (ICCSP), Chennai, India. Cited by: §1.2.
-  (2016) Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations Workshop (ICLRW), San Juan, Puerto Rico. Cited by: §2.2.3.
-  (2019) Glaucoma assessment from OCT images using Capsule Network. In 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, pp. 5581–5584. Cited by: §1.2.
-  (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 770–778. Cited by: §1, §2.2.2.
-  (2018) Matrix capsules with EM routing. In International Conference on Learning Representations (ICLR), Cited by: §1.1.2, §1.2, §1, §1, §2.1.1, §3.0.3.
-  (2019) Self-attention Capsule Networks for object classification. arXiv 1904.12483. Cited by: §1.2.
-  (2017) Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 2261–2269. Cited by: §1, §2.2.2.
-  (2018) Convolutional Capsule Network for classification of breast cancer histology images. In 15th International Conference on Image Analysis and Recognition (ICIAR), Póvoa de Varzim, Portugal. Cited by: §1.2.
-  (2018) Capsule Networks against medical imaging data challenges. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. MICCAI Workshop. LABELS 2018, CVII 2018, STENT 2018. LNCS., Vol. 11043. Cited by: §1.2.
-  (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105. Cited by: §1.
-  (2020) Inception Capsule Network for retinal blood vessel segmentation and centerline extraction. In IEEE 17th International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, pp. 1223–1226. Cited by: §1.2.
-  (2018) Capsules for object segmentation. In Medical Imaging with Deep Learning (MIDL) Conference, Amsterdam, The Netherlands. Cited by: §1.2.
-  (2020) Diagnosing colorectal polyps in the wild with Capsule Networks. In 17th IEEE International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, pp. 1086–1090. Cited by: §1.2.
-  (2019) WBCaps: a capsule architecture-based classification model designed for white blood cells identification. In 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, pp. 7027–7030. Cited by: §1.2.
-  (2019) Human-level recognition of blast cells in acute myeloid leukaemia with convolutional neural networks. Nature Machine Intelligence 1, pp. 538–544. Cited by: §2.2.1, §2.2.2.
-  (2019) Automated classification of apoptosis in phase contrast microscopy using Capsule Network. IEEE Transactions on Medical Imaging 31 (1), pp. 1–10. Cited by: §1.2.
-  (2021) Classification of colorectal tissue images from high throughput tissue microarrays by ensemble deep learning methods. Nature Scientific Reports 11:2371. Cited by: §1.2.
-  (2017) Dynamic routing between capsules. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), pp. 3859–3869. Cited by: §1.1.1, §1.2, §1, §1, §2.1.2, §3.0.3.
-  (2018) Dynamic routing on deep neural network for thoracic disease classification and sensitive area localization. In 9th International Conference on Machine Learning in Medical Imaging (MLMI). Workshop. LNCS, Vol. 11046, Granada, Spain, pp. 389–397. Cited by: §1.2.
-  (2015) Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 1–9. Cited by: §1.2.
-  (2017) Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, pp. 5987–5995. Cited by: §2.1, §2.2.2.