Capsule Networks with Max-Min Normalization

03/22/2019, by Zhen Zhao, et al.

Capsule Networks (CapsNet) use the Softmax function to convert the logits of the routing coefficients into a set of normalized values that signify the assignment probabilities between capsules in adjacent layers. We show that the use of Softmax prevents capsule layers from forming optimal couplings between lower and higher-level capsules. Softmax constrains the dynamic range of the routing coefficients and leads to probabilities that remain mostly uniform after several routing iterations. Instead, we propose the use of Max-Min normalization. Max-Min performs a scale-invariant normalization of the logits that allows each lower-level capsule to take on an independent value, constrained only by the bounds of the normalization. Max-Min provides consistent improvement in test accuracy across five datasets and allows more routing iterations without a decrease in network performance. A single CapsNet trained using Max-Min achieves an improved test error of 0.20%. With a simple 3-model majority vote, we achieve a test error of 0.17%.


1 Introduction

Deep learning systems are powerful tools for recognition, prediction and strategy in fields such as vision, speech, language and games Lecun_2015 ; Silver_2017 . The mammalian visual system extracts features from objects in cluttered scenes and then combines them for robust recognition. Inversion of sensory processing, using the ubiquitous feedback pathways present throughout sensory systems, is a major component of object recognition Harth_1987 . Learning systems that utilize the compositionality of objects Lake_2015 , along with dynamic binding of parts (or features) to wholes (or objects), can become powerful architectures. Capsule Networks (CapsNets) have the potential to perform object recognition in a natural and systematic fashion.

Capsule Networks Sabour_2017 ; Hinton_2018 use a dynamic routing algorithm to calculate a set of routing coefficients that link lower and higher-level capsules in adjacent layers of the network. Each routing coefficient represents the probability that an individual lower-level capsule should be assigned to a higher-level capsule. Unlike the rest of the network parameters, these routing coefficients are not learned during training. For each input presented to the network, the routing coefficients are calculated at run-time (during both training and inference) from a set of initial values.

The Softmax function, given by Eq. 1, has been widely used for object recognition tasks due to its ability to reduce the impact of outlier values in the dataset while still allowing those values to have an effect on the network's learning during training. In CapsNets, Softmax is used to convert the log priors between capsules i in layer l and capsules j in layer (l + 1) into a set of assignment probabilities between the capsules. While dampening the effects of outliers can be beneficial when training typical network parameters (following the Maximum Likelihood Estimation principle), outliers in the routing coefficients can provide optimal separation between features in adjacent capsule layers. Since these coefficients are not learned in the conventional sense (i.e., gradients do not flow through the routing coefficients during backpropagation), other normalization functions and methods can be used for the task of dynamic routing. In addition, the function need not be differentiable (e.g., a look-up table can be used to assign lower-level capsules to higher-level capsules).

Here, we show that the use of the scale-invariant Max-Min function (Eq. 2) improves the performance of CapsNets. We focus on the CapsNet formalism of Sabour et al. Sabour_2017 . The lower bound of the normalization is set to zero, which gives higher-level capsules the ability to completely disregard non-essential features presented by one of the lower-level capsules. This serves as a kind of dynamic dropout for the routing coefficients and forces the network to generalize better. The upper bound, in principle, can be set to any value. We tested a range of upper bounds and found that the network performs well for all values within that range; a single fixed upper bound is used in the rest of the paper. Bounding the routing coefficients in this manner allows each lower-level capsule to have an independent assignment probability for each of the higher-level capsules. That is, the sum of the probabilities for a single lower-level capsule across the higher-level capsules is no longer constrained to be one. This can be beneficial for CapsNets since, oftentimes, a single feature might have high probabilities of being assigned to multiple higher-level objects.
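For concreteness, a minimal NumPy sketch of the two normalizations is given below. The function names, the `eps` tie-breaking term, and the default bounds `lb = 0`, `ub = 1` are our own illustrative choices rather than values taken from the paper's implementation.

```python
import numpy as np

def softmax_norm(logits, axis=-1):
    # Softmax (Eq. 1): couples the logits of each lower-level capsule so that
    # its assignment probabilities over the higher-level capsules sum to 1.
    z = logits - logits.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def max_min_norm(logits, lb=0.0, ub=1.0, axis=-1, eps=1e-12):
    # Max-Min (Eq. 2): scale-invariant rescaling of the logits into [lb, ub];
    # each coefficient is independent of the sum over the other coefficients.
    mn = logits.min(axis=axis, keepdims=True)
    mx = logits.max(axis=axis, keepdims=True)
    return lb + (ub - lb) * (logits - mn) / (mx - mn + eps)
```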

The use of Max-Min over Softmax leads to an improvement in test accuracy across five datasets and allows the use of more routing iterations between capsule layers without overfitting to the training data. In addition, we train a single CapsNet (with minimal data augmentation) on the MNIST dataset and achieve a test error of 0.20%. With a 3-model majority voting system, we achieve a test error of 0.17% on MNIST, surpassing the accuracy of the model ensemble used by Wan_2013 .

Section 2 provides a summary of the three-layer CapsNet from Sabour_2017 and the differences in the routing procedure between the Softmax and Max-Min normalizations. In Section 3, we compare the evolution of the logits and routing coefficients for CapsNets trained using Softmax and Max-Min. Section 4 shows the tuning curves (i.e., outputs of the routing layer) for the network. Section 5 presents the main results of the sessions trained using Softmax vs. Max-Min. Performance on the MNIST dataset is detailed in Section 6, along with results using other normalization functions.

2 Capsule Network Architecture

The CapsNet architecture we used follows the network described in Sabour_2017 and is shown in Fig. 1. A 28 × 28 input image is fed into a convolutional layer (Conv1) that operates on the input with 256 9 × 9 kernels using a stride of 1 and the ReLU activation. The output of this operation is a 20 × 20 × 256 feature map tensor that is then fed into a second convolutional layer (PrimaryCaps) that uses 256 9 × 9 kernels, a stride of 2 and the ReLU activation. This results in a 6 × 6 × 256 feature map tensor, which represents the lower-level capsules for the network. Each set of 8 scalar neurons in this tensor is grouped channel-wise and forms a single lower-level capsule, for a total of 1152 (6 × 6 × 32) lower-level capsules.
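As a rough tf.keras sketch of this front end (the function and layer names and the functional-style wiring are ours; the original implementation is adapted from Sabour_Code and an earlier TensorFlow API):

```python
import tensorflow as tf

def primary_capsules(images):
    # images: (batch, 28, 28, 1) MNIST inputs
    # Conv1: 256 9x9 kernels, stride 1, ReLU -> (batch, 20, 20, 256)
    x = tf.keras.layers.Conv2D(256, 9, strides=1, activation='relu',
                               name='conv1')(images)
    # PrimaryCaps: 256 9x9 kernels, stride 2, ReLU -> (batch, 6, 6, 256)
    x = tf.keras.layers.Conv2D(256, 9, strides=2, activation='relu',
                               name='primarycaps')(x)
    # Group channel-wise into 8-D capsules: 6 * 6 * 32 = 1152 lower-level capsules
    return tf.reshape(x, (-1, 1152, 8))
```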

The outputs from PrimaryCaps are fed through a dynamic routing algorithm, resulting in the DigitCaps output matrix. The squashing function used to calculate the output capsules is as given in Sabour_2017 . Each row in the 10 × 16 DigitCaps matrix represents the 16-D instantiation parameters of a single class, and the length of a row's 16-D vector represents the probability of the existence of that class. During training, the non-ground-truth rows are masked with zeros and the matrix is passed to a reconstruction sub-network that consists of two fully-connected layers of dimensions 512 and 1024 with ReLU activations and a final fully-connected layer of dimension 784 with a sigmoid activation. During inference, the row in the DigitCaps matrix with the largest length (i.e., highest probability) is taken as the predicted object class.
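A sketch of that reconstruction sub-network under the layer dimensions stated above; the function name and the masking convention for the input are assumptions made for illustration:

```python
import tensorflow as tf

def reconstruction_decoder(masked_digitcaps):
    # masked_digitcaps: (batch, 10, 16) DigitCaps matrix with non-GT rows zeroed
    x = tf.keras.layers.Flatten()(masked_digitcaps)           # (batch, 160)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.Dense(784, activation='sigmoid')(x)   # 28*28 pixel values
    return tf.reshape(x, (-1, 28, 28, 1))
```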

The inputs to the routing algorithm consist of the prediction vectors. These prediction vectors are calculated using learned transformation weight matrices and the capsule outputs from the PrimaryCaps layer. The prediction vectors remain fixed inside the algorithm as the routing procedure bootstraps the calculation of the DigitCaps capsules from them. Although there are no gradient flows in the routing layer, both the inputs and outputs of the routing layer are subjected to the usual gradient flows during training. In particular, the DigitCaps capsules are passed on to a sub-network that learns to reconstruct the original input image. As a result, the prediction vectors and parent-level capsules tend to evolve such that the scaled summation of the prediction vectors is similar to the parent-level capsules. In other words, during the forward pass, the network calculates a set of parent-level capsules that are used to recreate the original image. Any errors in the reconstruction network backpropagate to the prediction vectors and the preceding layers. During the next forward pass, the prediction vectors evolve (via the transformation matrices) to align with the previously calculated parent-level capsules.
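A NumPy sketch of the quantities referenced above: the prediction vectors û_{j|i} = W_ij u_i and the squashing non-linearity from Sabour_2017. The array shapes (1152 lower-level 8-D capsules, 10 higher-level 16-D capsules) match the architecture in Fig. 1, while the function names are ours:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    # Squashing non-linearity from Sabour_2017: preserves orientation and
    # maps the vector length into [0, 1).
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def prediction_vectors(u, W):
    # u: (1152, 8)  lower-level capsule outputs from PrimaryCaps
    # W: (1152, 10, 16, 8)  learned transformation matrices W_ij
    # returns u_hat: (1152, 10, 16) with u_hat[i, j] = W[i, j] @ u[i]
    return np.einsum('ijkl,il->ijk', W, u)
```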

The routing procedure from Sabour_2017 is given below for reference. The routing procedure using Max-Min normalization remains largely the same, except that the Softmax function is replaced with the Max-Min function given by Eq. 2, where lb/ub are the lower/upper bounds of the normalization. For the first iteration, the routing coefficients are initialized to 1.0 outside of the routing for-loop.

Training is conducted similarly to the approach taken in Sabour_2017 . Our implementation uses TensorFlow TensorFlow and the Adam optimizer Adam_Optimizer with TensorFlow's default parameters and an exponentially decaying learning rate. Unless otherwise noted, the same network hyperparameters as in Sabour_2017 were used for all training sessions. Our code is adapted from the original implementation Sabour_Code .
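A minimal sketch of that optimizer setup in current tf.keras; the decay settings are placeholders, since the paper only states that an exponentially decaying learning rate was used, and the original code targets an earlier TensorFlow API:

```python
import tensorflow as tf

# Placeholder schedule parameters -- the paper does not report the decay settings.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,   # Adam's default learning rate in TensorFlow
    decay_steps=1000,             # placeholder
    decay_rate=0.96)              # placeholder
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```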

Figure 1: (Top) Three-layer CapsNet architecture, following Sabour et al. Sabour_2017 . The PrimaryCaps layer consists of 8-D vector capsules. The dynamic routing algorithm produces the DigitCaps layer, which is used to calculate the margin loss. (Bottom) Reconstruction sub-network with three fully-connected layers. The DigitCaps layer is passed to the reconstruction sub-network, where the non-ground-truth class rows are masked with zeros. The sparse matrix is then passed through the network, which learns to reproduce the input image. Margin and reconstruction loss functions follow those from Sabour_2017 .
Softmax Routing Procedure
1: Input to Routing Procedure: (û_{j|i}, r, l)
2:  for all capsules i in layer l and capsules j in layer (l + 1): b_ij ← 0
3:  for r iterations:
4:   for all capsules i in layer l: c_i ← Softmax(b_i)          (Eq. 1)
5:   for all capsules j in layer (l + 1): s_j ← Σ_i c_ij û_{j|i}
6:   for all capsules j in layer (l + 1): v_j ← Squash(s_j)
7:   for all capsules i in layer l and capsules j in layer (l + 1): b_ij ← b_ij + û_{j|i} · v_j
  return v_j

c_ij = exp(b_ij) / Σ_k exp(b_ik)    (1)

Max-Min Routing Procedure
1: Input to Routing Procedure: (û_{j|i}, r, l)
2:  for all capsules i in layer l and capsules j in layer (l + 1): c_ij ← 1.0
3:  for r iterations:
4:   for all capsules j in layer (l + 1): s_j ← Σ_i c_ij û_{j|i}
5:   for all capsules j in layer (l + 1): v_j ← Squash(s_j)
6:   for all capsules i in layer l and capsules j in layer (l + 1): b_ij ← b_ij + û_{j|i} · v_j
7:   for all capsules i in layer l: c_i ← Max-Min(b_i)          (Eq. 2)
  return v_j

c_ij = lb + (ub − lb) · (b_ij − min_k b_ik) / (max_k b_ik − min_k b_ik)    (2)
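The two procedures above can be condensed into a single NumPy sketch. Here `u_hat` holds the (fixed) prediction vectors, and the Max-Min bounds `lb = 0`, `ub = 1` are assumed for illustration rather than taken from the original code:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, num_iters=3, use_max_min=True, lb=0.0, ub=1.0, eps=1e-12):
    # u_hat: (num_lower, num_upper, dim_upper) prediction vectors, fixed in the loop
    n_i, n_j, _ = u_hat.shape
    b = np.zeros((n_i, n_j))                           # logits
    c = np.ones((n_i, n_j))                            # Max-Min: c initialized to 1.0
    for _ in range(num_iters):
        if not use_max_min:                            # Softmax routing (Eq. 1)
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijk->jk', c, u_hat)          # weighted sum over lower capsules
        v = squash(s)                                  # higher-level capsule outputs
        b = b + np.einsum('ijk,jk->ij', u_hat, v)      # agreement update
        if use_max_min:                                # Max-Min routing (Eq. 2)
            mn, mx = b.min(axis=1, keepdims=True), b.max(axis=1, keepdims=True)
            c = lb + (ub - lb) * (b - mn) / (mx - mn + eps)
    return v
```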

3 Evolution of Logits and Routing Coefficients

In the CapsNet architecture presented in Fig. 1, the capsules in the PrimaryCaps layer can represent features useful for recognizing objects (e.g., hand-written digits). The convolutional layer before PrimaryCaps allows efficient learning of these features. The dynamic binding of parts (i.e., features) to wholes (i.e., objects) is then carried out through the routing coefficients.

In such CapsNets, the ability to create optimal separation between competing features in adjacent capsule layers is crucial for efficient object recognition. The evolution of the logits and routing coefficients in the routing layer of a CapsNet offers insight into how two adjacent capsule layers assign object features to their wholes. When using Softmax, a zero initialization for the logits sets the routing coefficients to 1/10 for the first iteration (assuming ten parent-level output capsules). With Max-Min normalization, the routing coefficients are initialized to 1.0 for the first iteration; thus, for the first iteration, the parent-level capsules are simply the (non-scaled) squashed summation of the prediction vectors. The top rows of Figs. 2 and 3 show the initial values of the logits and routing coefficients for MNIST and CIFAR10, each for the same training image in their respective datasets. The middle and bottom rows show the evolution of the logits and coefficients throughout the routing procedure. For both the Softmax and Max-Min cases, the logits and coefficients are extracted from the network after the respective sessions have finished training under the same conditions.

Figure 2: Evolution of logits and routing coefficients for the same training image from the MNIST dataset for networks trained using (a) Softmax and (b) Max-Min. Y-axes are displayed in log-scale. Note: For clarity, only the routing coefficients associated with the ground-truth (GT) column are shown here. The histograms of all routing coefficients exhibit the same behavior.
Figure 3: Evolution of logits and routing coefficients for the same training image from the CIFAR10 dataset for networks trained using (a) Softmax and (b) Max-Min. Y-axes are displayed in log scale. Note: For clarity, only the routing coefficients associated with the GT column are shown here. The histograms of all routing coefficients exhibit the same behavior.

As the routing progresses, the logits form a tight cluster around zero, with the majority of the values remaining at their initial value of zero (y-axes are log-scale in Figs. 2 and 3). Due to the tight clustering and the non-linear behavior of Softmax, the routing coefficients from a Softmax-trained network evolve in a manner similar to their corresponding logits (c.f. Figs. 2 (a) and 3 (a)); i.e., the majority of the routing coefficients remain at their initial value of 1/10, with only a few evolving to significantly different values. As a result, the routing coefficients only barely separate each lower-level capsule among the higher-level capsules. With Max-Min normalization, the majority of the logits also remain at zero. However, due to the scale-invariant nature of the Max-Min normalization, the tight grouping of logits can be better separated to form the routing coefficients (c.f. Figs. 2 (b) and 3 (b)).
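A small numeric illustration of this behavior; the logit values below are made up to mimic the tight clustering seen in Figs. 2 and 3:

```python
import numpy as np

# Logits of one lower-level capsule across the ten DigitCaps after routing:
# most stay at the zero initialization, a few move slightly.
b_i = np.array([0., 0., 0., 0.02, 0., 0., 0.35, 0., 0., 0.01])

c_softmax = np.exp(b_i) / np.exp(b_i).sum()               # stays near 1/10 everywhere
c_maxmin = (b_i - b_i.min()) / (b_i.max() - b_i.min())    # spreads over [0, 1]

print(np.round(c_softmax, 3))  # ~[0.096 0.096 0.096 0.098 0.096 0.096 0.136 0.096 0.096 0.097]
print(np.round(c_maxmin, 3))   # [0.    0.    0.    0.057 0.    0.    1.    0.    0.    0.029]
```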

Max-Min also allows a lower-level capsule to have high assignment probabilities with multiple higher-level capsules. With Softmax, the competition among the higher-level capsules for each lower-level capsule reduces the likelihood of multiple high-probability assignments between features in adjacent capsule layers. Figure 4 shows examples of the routing coefficients for three lower-level capsules in PrimaryCaps across the ten higher-level capsules in DigitCaps for the MNIST and CIFAR10 datasets at the last routing iteration. Since the majority of the logits are tightly clustered around their initial values, Softmax computes nearly identical assignment probabilities between a lower-level capsule and each of the higher-level capsules. Max-Min normalization, in contrast, can compute high assignment probabilities for multiple higher-level capsules. In addition, the differences between the probabilities among the higher-level capsules (for each lower-level capsule) are larger when Max-Min is used. This leads to a better separation between capsules in adjacent layers.

Figure 4: Examples of logits and corresponding routing coefficients for an individual lower-level capsule from PrimaryCaps for the MNIST (top row) and CIFAR10 (bottom row) datasets for networks trained using Softmax ((a) and (c)) and Max-Min ((b) and (d)) normalizations at the last routing iteration.

4 DigitCaps Outputs

It is also instructive to examine the outputs of the DigitCaps layer (i.e., the parent-level capsules). During inference, the output capsule with the largest vector length (i.e., highest probability) is used to classify the input image. For each input to the network, an ideal CapsNet would have a single output capsule with probability near one corresponding to the GT class and all other classes with probabilities near zero. The outputs can be examined on an input-by-input basis (for each image, there is a corresponding DigitCaps matrix from which the classification is made) or on a class-by-class basis (for each class, there is a corresponding matrix that is the average of the individual matrices for that class).
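A short sketch of how these two views can be computed from the DigitCaps outputs; the shapes assume ten 16-D output capsules, and the helper names are ours:

```python
import numpy as np

def class_probs(digitcaps):
    # digitcaps: (batch, 10, 16); class probability = length of each capsule vector
    return np.linalg.norm(digitcaps, axis=-1)              # (batch, 10)

def predict(digitcaps):
    # Input-by-input view: the longest output capsule gives the predicted class.
    return class_probs(digitcaps).argmax(axis=-1)

def class_averaged_probs(digitcaps, labels, num_classes=10):
    # Class-by-class view: average the probability vectors over all images of a class.
    p = class_probs(digitcaps)
    return np.stack([p[labels == k].mean(axis=0) for k in range(num_classes)])
```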

Figures 5 (a) and (b) show the output capsule probabilities for the same set of test images from the MNIST dataset, and Figs. 5 (c) and (d) show the output capsule probabilities for the same set of test images from the CIFAR10 dataset. For MNIST, the network is properly trained and both normalizations produce digit-class probabilities that are highly peaked for the corresponding GT classes and low for the other classes. For CIFAR10, the network does better at separating the classes when Max-Min normalization is used (i.e., higher GT probability and lower non-ground-truth probabilities). However, both normalizations produce output capsules that have multiple high peaks, signifying that the networks are not able to adequately differentiate between the object classes. This issue with CapsNets was addressed in Sabour_2017 by including a “none-of-the-above” category for the routing Softmax; our network does not have this category.

Figure 5: Output class probabilities for the same set of test images from the MNIST and CIFAR10 datasets calculated using Softmax ((a) and (c)) and Max-Min ((b) and (d)) normalizations.

If we view the capsules in the DigitCaps layer as “grandmother cells” Gross_2002 , then how well-tuned they are to the objects they recognize provides a picture of the robustness of the system. Figures 6 (a) and (b) show the class-averaged output capsule probabilities for the MNIST test images. These can be viewed as the tuning curves of the recognition units and demonstrate the ability of the network to adequately distinguish between each of the digit classes. The similarity between the tuning curves of the respective digits when trained with Softmax (Fig. 6 (a)) and Max-Min (Fig. 6 (b)) shows that Max-Min normalization does not degrade the network's ability to discriminate the ten digits. In contrast, the tuning curves for CIFAR10 (Figs. 6 (c) and (d)) show that, for certain object classes, the discriminability is not as good as for MNIST. This is also reflected in the accuracies given in Table 1.

Figure 6: Class-averaged output probabilities for the MNIST and CIFAR10 test datasets calculated using Softmax ((a) and (c)) and Max-Min ((b) and (d)) normalizations. The shape similarities between the tuning curves of the Softmax and Max-Min trained networks suggest that both models are trained in a similar fashion.

5 Results

We compared the network's performance with Max-Min and Softmax normalizations on five datasets: MNIST MNIST , Background MNIST (bMNIST) and Rotated MNIST (rMNIST) R_and_BG_MNIST , Fashion MNIST (fMNIST) F_MNIST , and CIFAR10 CIFAR10 . In addition, we evaluated the performance of the networks as a function of the number of routing iterations. All sessions were trained using the same three-layer model shown in Fig. 1 and the hyperparameters from Sabour_2017 . For the results in this section, no data augmentation was used, with the exception of CIFAR10, where random crops were taken of the training images and a centered crop of the test images. For the variations of the MNIST dataset, the PrimaryCaps layer has 1152 capsules; for CIFAR10, the size of the PrimaryCaps layer is adjusted to match the larger input images. Three routing iterations were used for all sessions. Unlike Sabour_2017 , we did not introduce a “none-of-the-above” category for the network classifier.

Table 1 lists the mean of the maximum test accuracies and their standard deviations for the five datasets, and shows that Max-Min normalization provides a consistent improvement in test accuracy compared with Softmax.[1] In particular, the improvement is most significant for the datasets that have a non-zero background (i.e., bMNIST and CIFAR10). Max-Min also allows more routing iterations to be conducted without decreasing the test accuracy. As shown in Fig. 7, with Max-Min an improvement in test accuracy is obtained as the number of routing iterations is increased, whereas with Softmax the test accuracy decreases for all five datasets.

[1] Experiments using Softmax normalization and routing coefficients initialized to produced test accuracies of for the MNIST dataset.

Normalization   MNIST [%]      rMNIST [%]     fMNIST [%]     bMNIST [%]     CIFAR10 [%]
Softmax         99.28 ± 0.06   93.72 ± 0.08   90.52 ± 0.14   89.08 ± 0.19   73.65 ± 0.09
Max-Min         99.55 ± 0.02   95.42 ± 0.03   92.07 ± 0.12   93.09 ± 0.04   75.92 ± 0.27
Table 1: Mean of the maximum test accuracies and their standard deviations on five datasets for Max-Min and Softmax normalizations. Five training sessions were conducted for each dataset for both Max-Min and Softmax. For CIFAR10, a “none-of-the-above” category was not used during training.
Figure 7: Test accuracy as a function of the number of routing iterations for CapsNets trained using Max-Min (blue solid lines with triangle markers) and Softmax (orange dashed lines with circle markers) normalization for (a) MNIST, (b) Rotated MNIST, (c) Fashion MNIST, (d) Background MNIST, and (e) CIFAR10. Max-Min normalization allows the network capacity to increase without decreasing the network performance.

Max-Min also prevents the network from overfitting to the training data, especially as the number of routing iterations is increased. As shown in Fig. 8, the differences between the mean of the maximum training and test accuracies are lower for CapsNets trained using Max-Min compared with Softmax. Thus, Max-Min normalization not only prevents the model from overfitting, but also allows the performance of the network to scale positively with the number of routing iterations.

Figure 8: Difference between training and test accuracies as a function of the number of routing iterations for CapsNets trained using Max-Min (blue solid lines with triangle markers) and Softmax (orange dashed lines with circle markers) normalization for (a) MNIST, (b) Rotated MNIST, (c) Fashion MNIST, (d) Background MNIST, and (e) CIFAR10. Max-Min normalization prevents the network from overfitting on the training data, especially as the number of routing iterations is increased.

6 Performance on MNIST

Sabour et al. demonstrate a low test error of 0.25% on MNIST Sabour_2017 using a single three-layer CapsNet with Softmax routing and image translations of up to 2 pixels in each direction with zero padding. Section 5 shows that Max-Min gives an average improvement of about 0.3% in test accuracy compared with Softmax for MNIST (Table 1). Thus, it stands to reason that a single CapsNet trained using Max-Min and minimal augmentation can outperform the current state-of-the-art results on MNIST Wan_2013 . We train the same three-layer CapsNet of Fig. 1 on the full set of 60,000 MNIST training images, using random image translations (with zero padding) and random image rotations about the image center as the only augmentations. In addition, we relax the margin loss constraints m+ and m− and reduce the number of routing iterations. The batch size and number of training epochs were kept fixed across runs; all other parameters follow those from Sabour_2017 .

Table 2 gives a comparison of the test errors on the 10,000 MNIST test images for networks trained with Max-Min and Softmax using the parameters and image augmentations listed above. Each experiment was conducted a total of ten times. A single CapsNet using Max-Min achieves a test error of 0.20%, while a 3-model majority vote achieves a test error of 0.17%. The images misclassified by the model ensemble are shown in Fig. 9. Further discussion of the MNIST results is presented in Appendix A, along with the misclassifications from each of the three models used in the ensemble.

Normalization Maximum Minimum Mean Stdev.
Softmax 0.35% 0.29% 0.32% 0.021%
Max-Min 0.27% 0.20% 0.24% 0.025%
Table 2: Test errors on MNIST dataset for Max-Min and Softmax normalizations. Ten training sessions were conducted for both the Max-Min and Softmax normalizations.
Figure 9: Misclassified MNIST images using 3-model majority vote from CapsNets trained using Max-Min normalization.

6.1 Comparisons with Human Polling Results

Out of the 10,000 predictions the network makes on the MNIST test set, 17 differ from the GT labels. A poll of individuals on these images showed that in some cases (6 out of the 17) the results agreed with the network's predictions. Table 3 illustrates this, and details for one of the images are given in Fig. 10 (a). Figure 10 (b) shows an image that is consistently misclassified by the network but is almost always correctly classified by humans. This particular example is misclassified by several other methods as well Belongie_2002 ; Stuhlsatz_2012 ; Ciresan_2012 , and points to the shortcomings of current machine learning methods compared to the human brain. Several images in the MNIST test set are of poor quality, and Fig. 10 (c) gives the network's predictions and polling results for one such example.

Image Index   GT Label   Human/Network Prob. on GT Label   Network Pred. Label   Human/Network Prob. on Pred. Label
1260 7 0.38/0.53 1 0.54/0.64
1901 9 0.29/0.18 4 0.61/0.79
2597 5 0.47/0.23 3 0.51/0.79
4823 9 0.42/0.52 4 0.55/0.71
5937 5 0.38/0.55 3 0.60/0.76
9729 5 0.15/0.48 6 0.82/0.80
Table 3: Comparison of human and network predictions for 6 out of the 17 misclassified MNIST test images. For these images, the network predicted labels agree with the human predictions. Polling results are from the individuals who labelled the misclassified MNIST test images.
Figure 10: (a) Agreement between human and network predictions for a misclassified image. Here, the majority of the human polling results agree with the network predictions and both disagree with the GT label. (b) Disagreement between human and network predictions for a misclassified image. Here, the majority of the human polling results agree with the GT label. (c) Example of human and network predictions for a poor-quality image. Although the majority of the human predictions agree with the GT label, a significant portion misclassified the image as a different digit. Note: The predictions from human polling are averaged over the individuals who took the same poll, whereas the network predictions are averaged over three separately trained models (and hence, the probabilities do not sum to one).

6.2 Other Normalizations

Max-Min is not unique in its ability to optimally separate the logits. Various other functions can be applied in the routing procedure for CapsNets. We also tested the following functions on the MNIST dataset: 1) Winner-Take-All (WTA), 2) sum, 3) centered Max-Min, 4) Z-score, and 5) adjusted log normalizations. An exhaustive study was not done for each of these methods; our primary goal was to probe the utility of other methods in creating the routing coefficients and whether or not a valid probability distribution is a strict requirement for the assignment of capsules. For WTA, each lower-level capsule in PrimaryCaps contributes to only a single higher-level capsule in DigitCaps; the higher-level capsule assignment is determined by the largest coefficient value for each lower-level capsule. The adjusted log normalization is given by c_ij = log(b_ij − min_k(b_ik) + 1); subtracting the minimum value and adding one ensures that the minimum of the transformed logits is zero. Sum, centered Max-Min, and Z-score normalizations have their usual meanings. A good initialization for each of the five methods is required in order for the network to converge during training. For Max-Min, the initialization of the routing coefficients was robust across two orders of magnitude. To simplify matters, we used the same initialization of the routing coefficients for all five methods listed above.
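To make the comparison concrete, a NumPy sketch of these alternatives is given below. Each operates on the logits of one lower-level capsule across the higher-level capsules (rows), and the exact forms of the sum, centered Max-Min, and Z-score variants reflect their usual meanings rather than the paper's code:

```python
import numpy as np

def wta(b):
    # Winner-Take-All: each lower-level capsule routes only to its best parent.
    c = np.zeros_like(b)
    c[np.arange(b.shape[0]), b.argmax(axis=1)] = 1.0
    return c

def sum_norm(b, eps=1e-12):
    # Divide each row by its sum.
    return b / (b.sum(axis=1, keepdims=True) + eps)

def centered_max_min(b, eps=1e-12):
    # Max-Min rescaled to [-1, 1], so coefficients can be negative.
    mn, mx = b.min(axis=1, keepdims=True), b.max(axis=1, keepdims=True)
    return 2.0 * (b - mn) / (mx - mn + eps) - 1.0

def z_score(b, eps=1e-12):
    # Zero mean, unit variance per row; the output range is unbounded.
    mu, sd = b.mean(axis=1, keepdims=True), b.std(axis=1, keepdims=True)
    return (b - mu) / (sd + eps)

def adjusted_log(b):
    # log(b - min(b) + 1): the minimum of the transformed logits is zero.
    return np.log(b - b.min(axis=1, keepdims=True) + 1.0)
```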

Table 4 shows the test accuracies on the MNIST dataset for the six methods, including Max-Min. Sum normalization performed the worst among the six methods, primarily due to difficulties in loss convergence during training; this issue might be alleviated with a more suitable initialization. Centered Max-Min and adjusted log normalizations performed approximately the same as one another, as did WTA and Z-score. It is worth noting that the WTA method results in only a ~0.3% decrease in test accuracy relative to Max-Min (Table 4). This is somewhat surprising since WTA assigns a value of zero to all but one of the routing coefficients associated with each lower-level capsule. Both centered Max-Min and Z-score normalization allow routing coefficients to take on negative values. However, the range of values produced by Z-score is unbounded and can be difficult to initialize properly. The range of values produced by centered Max-Min is bounded between −1 and 1; however, it is possible for large negative routing coefficients from centered Max-Min normalization to counterbalance large positive routing coefficients, leading to a lower network performance compared with Max-Min. Log transformations generally compress high values and spread low values by expressing the values as orders of magnitude, and are useful when a high degree of variation exists within variables. This transformation gives decent performance on MNIST but is difficult to initialize properly since the range of the transformed values is not bounded.

Normalization Test Accuracy [%]
Max-Min            99.55 ± 0.02
Centered Max-Min   99.52 ± 0.03
Adjusted Log       99.50 ± 0.01
WTA                99.24 ± 0.36
Z-Score            99.21 ± 0.03
Sum                72.62 ± 30.05
Table 4: Test accuracies on the MNIST dataset for several normalization functions. WTA: Winner-Take-All.

7 Summary

In the formalism from Sabour_2017 , the logits are converted to the routing coefficients using the Softmax function. For optimal class separation, the routing coefficients should be widely separated in their values. That is, features that are useful for one class, represented by the output of PrimaryCaps, should be strongly coupled to the features in DigitCaps for that class.

We analyzed the distributions of the logits and routing coefficients generated by CapsNets using Softmax and found that they remain tightly clustered around their initial values. One reason for this may be that the Softmax function is not scale invariant: for the range of logits produced in the network, Softmax normalization compresses the dynamic range of the routing coefficients. With Max-Min normalization, the dynamic range of the routing coefficients is increased. We demonstrate improved recognition accuracies, with gains ranging from roughly 0.3% to 4% across five datasets (Table 1), and show that Max-Min allows more routing iterations between adjacent capsule layers without overfitting to the training data. Finally, a CapsNet trained with Max-Min achieves a state-of-the-art result on the MNIST test set using just a single model with minimal data augmentation.

Acknowledgments

We would like to thank Peter Dolce for setting up and running the human polling. KPU acknowledges many useful conversations with PS Sastry.

References

Appendix A Misclassified Images From the MNIST Dataset

The MNIST images misclassified by each of the three models used in the majority-voting scheme are shown below. Each model misclassified only a small number of the 10,000 test images (c.f. Table 2). Images with missing pieces of information present the greatest challenge to the network. Each of the three models was trained using the same set of network parameters and image augmentations mentioned in Section 6, the only difference being the weight initializations of the network layers. The image index, model prediction, and GT label are listed above each image.

Figure A.1: Misclassified MNIST images using CapsNet model A trained using Max-Min normalization.
Figure A.2: Misclassified MNIST images using CapsNet model B trained using Max-Min normalization.
Figure A.3: Misclassified MNIST images using CapsNet model C trained using Max-Min normalization.