1 Introduction
In recent years, deep learning has made breakthroughs in many computer vision tasks, with convolutional neural networks in particular leading to state-of-the-art performance. In a convolutional neural network, neurons are scalars and cannot learn the complex relationships between features. But in the human brain, neurons usually work together rather than alone. To overcome this shortcoming of convolutional neural networks, Hinton proposed the concept of the “capsule” [4]: a combination of neurons that stacks the features (neurons) of a feature map into vectors (capsules). A capsule network considers not only the attributes of each feature during training but also the relationships between features. The dynamic routing algorithm made the idea of the “capsule” practical [15]: after neurons are stacked into vectors (capsules), the coupling coefficient between low-layer and high-layer capsules is learned through dynamic routing, yielding the relationship between partial features and the whole.
Improving the performance of neural networks is a major direction of deep learning research, and a common approach is to increase the depth of the network. For example, VGG [16], GoogLeNet [17], and ResNet [3] increased network depth through effective architectural solutions and continuously improved classification accuracy on ImageNet [1]. In capsule networks, performance can likewise be improved by increasing the number of capsule layers; Rajasegaran et al. [14] explored this direction and achieved impressive results. However, the dynamic routing algorithm proposed by Sabour et al. [15] does not allow simply increasing the number of capsule layers.
The dynamic routing algorithm is the method used to learn the relationship between partial features and the whole in a capsule network, but it has shortcomings. After several iterations of training, the coupling coefficients of the capsule network become highly sparse, indicating that only a small number of low-layer capsules are useful for the high-layer capsules. Most coupling-coefficient computations are futile, which adds invalid computation during gradient backpropagation. The sparsity of the coupling coefficients makes most of the gradient flow propagating between the capsule layers very small. If capsule layers are simply stacked, the gradients in the front layers of the model become so small that the model stops working. If the interference of the coupling coefficient is removed during routing, the stacked layers can continue to work.
To this end, we propose adaptive routing, a new routing algorithm for capsule networks. Unlike the dynamic routing algorithm, which updates the coupling coefficient at the end of each iteration, our algorithm updates only the low-layer capsules themselves at the end of each iteration, making the low-layer capsules more “similar” to the high-layer capsules. Since there is no coupling coefficient, the propagation of the gradient flow in the capsule network is not suppressed during routing, so the gradient reaches the front layers of the model more easily. More specifically, we make the following contributions in this article:

We present the motivation for the adaptive routing algorithm and explain why the dynamic routing algorithm causes gradient vanishing, which stops the capsule network from working when multiple capsule layers are stacked.

We propose the adaptive routing algorithm to overcome the gradient vanishing that dynamic routing causes when stacking multiple layers. The adaptive routing algorithm can stack multiple capsule layers and improves the performance of the capsule network.

We show that the iterative process of the adaptive routing algorithm can be simplified into a routing-free form: an introduced hyperparameter replaces the iteration number, which reduces the amount of computation and amplifies the gradient.
The rest of the paper is organized as follows: Section 2 discusses related work on capsule networks, Section 3 describes the motivation and the adaptive routing algorithm, Section 4 presents our experimental results, and Section 5 concludes the paper.
2 Related Work
The capsule network is a new neural network architecture that stacks traditional scalar neurons into vector neurons called “capsules” [4], which can store the spatial location information of features, making it closer to the mechanism of the human brain. The dynamic routing algorithm proposed by Sabour et al. [15] is a method for learning the coupling relationship between low-layer and high-layer capsules in neural networks, and it made the capsule network a practical model. Hinton et al. [5] then proposed the EM routing algorithm, which uses matrix capsules instead of vector capsules and iteratively learns the coupling coefficient between low-layer and high-layer matrix capsules. In the research field of capsule networks, almost all work is based on these two algorithms.
In this field, there are many notable extensions. Lenssen et al. [11] proposed a generic routing algorithm that defines provable equivariance and invariance for the capsule network, proving the equivariance of the output pose vectors and the invariance of the output activations. Rajasegaran et al. [14] proposed a deep capsule network architecture to address the shortcoming that dynamic routing cannot simply stack multiple layers. It uses 3D convolution to learn spatial information between capsules and borrows the idea of skip connections from residual networks; skip connections between capsule layers allow a good gradient flow in backpropagation. At the bottom of the network, where connections skip more than one layer, a larger number of routing iterations is used, and 3D convolution generates votes from the capsule tensor for dynamic routing, which helps route a set of localized capsules to a higher-layer capsule. Jeong et al. [7] proposed a new way of defining entities, which deletes unneeded capsules while preserving the spatial relationship between low-layer and high-layer entities, and introduced the concepts of building layers and step layers. To capture the relationship between a part and the whole space, another new layer called the ladder layer is introduced, whose outputs are low-layer capsule outputs regressed from high-layer capsules. Zhang et al. [19] proposed to use capsule vectors instead of neuron activations, projecting an input feature vector onto a set of capsule subspaces and then using the lengths of the resulting capsules as scores for the different classes. Such a capsule projection network (CapProNet) is trained by learning the orthogonal projection matrix of each capsule subspace, and it is shown that each capsule subspace is updated until it contains the input feature vectors of the corresponding class. Since the dimension of each capsule subspace is low and an iterative method estimates the matrix inverse, the network can be trained with only a small computational overhead. Ding et al. [2] divided all capsules into groups and performed a group reconstruction routing algorithm to obtain the corresponding high-layer capsules; capsule max-pooling between the lower and upper layers prevents overfitting. Li et al. [12] proposed to approximate the routing process with two branches: a master branch that collects the main information from its direct contact capsules in the lower layer, and an auxiliary branch that supplements the main information based on pattern variables encoded in other low-layer capsules. The two branches communicate in a fast, supervised, one-time pass, in contrast to previous iterative and unsupervised routing schemes, dramatically reducing the complexity and runtime of the model.
3 Methodology
3.1 Motivation
In capsule networks that use the dynamic routing algorithm, the low-layer capsules learn affine transformations through the affine transformation matrix W_ij. The affine transformation matrix plays a role similar to the Spatial Transformer Networks proposed by Jaderberg et al. [6], enabling the capsule to translate, scale, rotate, etc. The capsule network trains the parameters of the affine transformation matrix by backpropagation. The coupling coefficient c_ij between the low-layer capsules and the high-layer capsules is learned iteratively by the dynamic routing algorithm, which routes the affine-transformed low-layer capsules to the high-layer capsules. During backpropagation, the coupling coefficient weights the gradient flow. Figure 1 illustrates the data flow and gradient flow between adjacent capsule layers under the dynamic routing algorithm. As in the architecture proposed by Sabour et al. [15], the feature maps in the PrimaryCaps layer are F_1, F_2, ..., F_256. Features on the feature maps are defined in Equation 1 below:
F_k = \left\{ f^k_1, f^k_2, \ldots, f^k_{36} \right\}, \quad k = 1, 2, \ldots, 256    (1)
Features at the same spatial position on the different feature maps are stacked (8 feature maps as a group) and formed into capsules. All capsules u_i are in layer l and all capsules v_j are in layer l+1. Capsules in the lower layer are composed of features on the feature maps F_1, F_2, ..., F_256 (36 features on each feature map), as defined in Equation 2 below:
u_i = \left[ f^{8m+1}_n,\; f^{8m+2}_n,\; \ldots,\; f^{8m+8}_n \right]^{\top}, \quad i = 36m + n,\; m = 0, \ldots, 31,\; n = 1, \ldots, 36    (2)
The affine matrix W_ij is defined by Equation 3 and transforms a capsule of dimension 8 into a capsule of dimension 16. The predictions û_{j|i} are obtained by affine transformation of u_i as defined in Equation 4 below:
W_{ij} \in \mathbb{R}^{16 \times 8}    (3)
\hat{u}_{j|i} = W_{ij}\, u_i    (4)
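As a shape-level sketch (not the trained model), the affine prediction step of Equations 3–4 can be written with the layer sizes used later in our experiments (1152 low-layer capsules of dimension 8, 10 high-layer capsules of dimension 16); the weights here are random stand-ins:

```python
import numpy as np

# Shape-level sketch of the affine prediction step (Eq. 3-4), using the
# layer sizes from our experiments: 1152 low-layer capsules of dimension 8
# mapped to 10 high-layer capsules of dimension 16. The weights are random
# stand-ins, not trained values.
rng = np.random.default_rng(0)
num_low, num_high, d_in, d_out = 1152, 10, 8, 16

W = rng.standard_normal((num_low, num_high, d_out, d_in))  # W_ij, Eq. 3
u = rng.standard_normal((num_low, d_in))                   # low-layer capsules u_i
u_hat = np.einsum('ijkl,il->ijk', W, u)                    # u_hat_{j|i} = W_ij u_i, Eq. 4
```

Each low-layer capsule produces one 16-dimensional prediction per high-layer capsule, so `u_hat` has shape (1152, 10, 16).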
The weighted sum of the predictions û_{j|i} with the coupling coefficients c_ij gives s_j, as described in Equation 5 below:
s_j = \sum_i c_{ij}\, \hat{u}_{j|i}    (5)
L_k = T_k \max\!\left(0,\, m^+ - \lVert v_k \rVert\right)^2 + \lambda\, (1 - T_k) \max\!\left(0,\, \lVert v_k \rVert - m^-\right)^2    (6)
It can be seen from Equation 6 that the loss of the capsule network is related to the length of the capsule v_j and thus to the values of the capsule. Here w denotes a parameter of the affine transformation matrix W_ij, which is learned by backpropagation, while c_ij is the coupling coefficient, which is learned by the iterative computation of dynamic routing.
When the gradient flows through the adjacent capsule layers, the result is as below:
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial v_j} \cdot \frac{\partial v_j}{\partial s_j} \cdot c_{ij} \cdot x    (7)
In Equation 7, w is a parameter of the affine transformation matrix W_ij, and x is the feature associated with w in the capsule u_i on the feature maps F_1, F_2, ..., F_256. The gradient values in backpropagation are therefore scaled by the coupling coefficient c_ij.
As shown in Figure 2, the coupling coefficients c_ij obtained by the dynamic routing algorithm are mostly close to 0.1 or even smaller [7]. When the capsule network stacks multiple capsule layers, the presence of c_ij makes the gradient values smaller, which hinders the learning of the parameters in the front layers and stops the capsule network from working.
In Figure 3, we compare the range of gradients in the ReLU Conv1 layer of the original capsule network (dynamic routing with only two capsule layers) and of a capsule network with multiple capsule layers. It turns out that in the front layers of the multi-layer capsule network, the gradient values are too small for the network to work.
In summary, the loss is related to the length of the capsule v_j. During gradient backpropagation, the value of c_ij is close to 0.1 or even smaller, causing gradient vanishing and stopping the capsule network from working. If the coupling coefficient does not participate in the routing iterations, the capsule network continues to work with multiple capsule layers.
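A toy numerical illustration of this point (not the paper's exact model): if a coupling coefficient near 0.1 multiplies the gradient at every routed capsule-to-capsule connection, the gradient reaching the front layers shrinks geometrically with the number of stacked capsule layers.

```python
# Toy illustration, not the exact model: if a coupling coefficient
# c ~ 0.1 multiplies the gradient at every routed capsule connection, the
# gradient reaching the front layers shrinks geometrically with depth.
def gradient_scale(num_routed_connections, coupling=0.1):
    """Scale factor applied to a unit gradient after the given number of
    routed capsule-to-capsule connections."""
    return coupling ** num_routed_connections

print(gradient_scale(1))  # two capsule layers: one routed connection
print(gradient_scale(3))  # four capsule layers: three routed connections
```

With one routed connection the gradient is already scaled by roughly 0.1; with three stacked connections the scale factor drops by three orders of magnitude, which matches the vanishing behavior described above.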
3.2 Adaptive Routing
To overcome the shortcomings of the coupling coefficient in the capsule network, we propose the adaptive routing algorithm, whose route iteration involves no parameter training.
In capsule networks, the direction of a high-layer capsule is close to the direction of the longest low-layer capsules. If the coupling coefficient is removed, all low-layer capsules are directly summed after the affine transformation, as described in Equation 8 below:
s_j = \sum_i \hat{u}_{j|i}    (8)
Squashing s_j with the activation function squash, we obtain v_j (same direction as s_j) as in Equation 9:

v_j = \mathrm{squash}(s_j) = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \, \frac{s_j}{\lVert s_j \rVert}    (9)
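A minimal implementation of the squash function in Equation 9; the small `eps` term is our addition for numerical stability near zero-length vectors.

```python
import numpy as np

# Minimal squash from Eq. 9: shrinks short vectors toward zero and long
# vectors toward unit length while preserving direction. The eps term is
# our addition for numerical stability.
def squash(s, eps=1e-8):
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

v = squash(np.array([3.0, 4.0]))  # length 5 in, length 25/26 out, same direction
```

The output length is always strictly below 1, so it can be read as the probability that the entity represented by the capsule is present.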
As shown in Figure 4, the direction of the corresponding high-layer capsule matches that of the longer capsules in the lower layer. The purpose of the dynamic routing algorithm is that the higher the similarity between a low-layer capsule and the corresponding high-layer capsule, the larger the coupling coefficient between them after iteration. We can achieve the same effect by moving the low-layer capsule û_{j|i} toward the corresponding high-layer capsule. If the low-layer capsule and the corresponding high-layer capsule are highly similar, the new û_{j|i}, moved toward the high-layer capsule, has its directionality enhanced relative to the original û_{j|i}. If they have low similarity, the new û_{j|i} also moves toward the high-layer capsule, but its directionality is reduced relative to the original û_{j|i}. The adaptive update process is defined in Equation 10 below:
\hat{u}_{j|i} \leftarrow \hat{u}_{j|i} + v_j    (10)
The adaptive routing algorithm can be described as Algorithm 1.
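A minimal NumPy sketch of our reading of Algorithm 1 (the exact pseudocode is in the algorithm listing); capsule counts and dimensions here are illustrative.

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

# Our reading of Algorithm 1: no coupling coefficients. Each iteration sums
# the affine-transformed low-layer capsules (Eq. 8), squashes the sum
# (Eq. 9), and moves every low-layer capsule toward the result (Eq. 10).
def adaptive_routing(u_hat, iterations=3):
    """u_hat: array of shape (num_low_capsules, dim)."""
    for _ in range(iterations):
        s = u_hat.sum(axis=0)   # Eq. 8: plain sum, no c_ij
        v = squash(s)           # Eq. 9
        u_hat = u_hat + v       # Eq. 10: move low-layer capsules toward v
    return v

# Two symmetric low-layer capsules: the output keeps the symmetric direction.
v = adaptive_routing(np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Note that nothing here is trained during routing: only the low-layer capsules themselves are updated, so no extra weight on the gradient path is introduced.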
In capsule networks that use the dynamic routing algorithm, û_{j|i} is a low-layer capsule after affine transformation. In the dynamic routing shown in Figure 5, when the algorithm starts iterating, the coupling coefficients of each low-layer capsule for the corresponding high-layer capsule are equal. û_{j|1}, û_{j|2}, û_{j|3} are weighted and summed to get s_j, with weights c_{1j}, c_{2j}, c_{3j}. After the first routing step, the weighted sum is computed over all low-layer capsules. If a low-layer capsule is longer, its direction is more similar to that of the corresponding high-layer capsule. After each iteration, the coupling coefficients are updated according to the dot product (similarity and length) of the low-layer capsules and the corresponding high-layer capsule, yielding new weights c_{1j}, c_{2j}, c_{3j}. If û_{j|1} and the high-layer capsule v_j are more similar, then c_{1j} becomes larger after the update, and v_j grows in the direction it had before the iteration. After dynamic routing, the orientation of a high-layer capsule is close to the direction of the longer low-layer capsules. As the number of iterations increases, the coupling coefficient c_ij (weight) is larger the more similar a low-layer capsule is to the corresponding high-layer capsule, and smaller otherwise.
Similarly, in the adaptive routing shown in Figure 5, when the algorithm starts to iterate there is no coupling coefficient: û_{j|1}, û_{j|2}, û_{j|3} are simply summed to get s_j. After the first routing step, the sum is computed over all low-layer capsules. If a low-layer capsule is longer, its direction is more similar to that of the corresponding high-layer capsule. After each iteration, the low-layer capsules û_{j|1}, û_{j|2}, û_{j|3} move toward the direction of the high-layer capsule v_j, so with each iteration they become closer to its direction. After the adaptive routing process, the orientation of a high-layer capsule is close to the direction of the longer low-layer capsules. The low-layer capsules move adaptively toward the high-layer capsules, and the high-layer capsules still represent the probability that an object is present. Without the influence of the coupling coefficient c_ij, the same effect as the dynamic routing algorithm is obtained.
3.3 Introducing the Gradient Coefficient
The adaptive routing we propose does not involve the coupling coefficient in the routing process, and its training can be simplified further: no parameters need to be trained during the route iteration, only the low-layer capsules are summed. When the iteration number r = 1, the adaptive routing process is given in Equations 11, 12 and 13 below:
s_j^{(1)} = \sum_i \hat{u}_{j|i}    (11)
v_j^{(1)} = \mathrm{squash}\!\left( s_j^{(1)} \right)    (12)
\hat{u}_{j|i}^{(2)} = \hat{u}_{j|i} + v_j^{(1)}    (13)
So after the first iteration, the output of the adaptive routing algorithm is as in Equation 14 below:
v_j^{(1)} = \mathrm{squash}\!\left( \sum_i \hat{u}_{j|i} \right)    (14)
For the second iteration, the updated low-layer capsules are summed again:

s_j^{(2)} = \sum_i \left( \hat{u}_{j|i} + v_j^{(1)} \right) = s_j^{(1)} + n\, v_j^{(1)}    (16)

where n is the number of low-layer capsules. Since v_j^{(1)} has the same direction as s_j^{(1)}, this can be written as

s_j^{(2)} = \left( 1 + \frac{n \lVert v_j^{(1)} \rVert}{\lVert s_j^{(1)} \rVert} \right) s_j^{(1)} = \alpha\, s_j^{(1)}    (17)

with the gradient coefficient

\alpha = 1 + \frac{n \lVert v_j^{(1)} \rVert}{\lVert s_j^{(1)} \rVert} > 1    (18)
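The two-iteration derivation can be checked numerically; this sketch assumes our reconstruction above, i.e. α = 1 + n·||v_j^(1)|| / ||s_j^(1)||, with randomly generated capsules standing in for real activations.

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

# Numerical check of the two-iteration derivation, under our reconstruction
# of Eq. 16-18: s_j^(2) = s_j^(1) + n * v_j^(1) = alpha * s_j^(1), with
# alpha = 1 + n * ||v_j^(1)|| / ||s_j^(1)||, because v_j^(1) is parallel
# to s_j^(1).
rng = np.random.default_rng(0)
u_hat = rng.standard_normal((5, 16))   # n = 5 low-layer capsules, dim 16

s1 = u_hat.sum(axis=0)                 # first-iteration sum (Eq. 11)
v1 = squash(s1)                        # Eq. 12
s2 = (u_hat + v1).sum(axis=0)          # second-iteration sum (Eq. 16)
alpha = 1.0 + u_hat.shape[0] * np.linalg.norm(v1) / np.linalg.norm(s1)
```

Because the squash output is parallel to its input, the second-iteration sum is exactly a scalar multiple of the first, which is what lets the iteration count be folded into a single coefficient.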
The introduction of α indicates that s_j is amplified; after the activation function, the length of v_j is close to 1.
So after the second iteration, the output of the adaptive routing algorithm is as in Equation 19 below:
v_j^{(2)} = \mathrm{squash}\!\left( \alpha \sum_i \hat{u}_{j|i} \right)    (19)
In summary, as the number of iterations increases, α becomes larger, and we finally obtain v_j as in Equation 20 below:
v_j = \mathrm{squash}\!\left( \alpha \sum_i \hat{u}_{j|i} \right), \quad \alpha \geq 1    (20)
The improved adaptive routing without iteration is described as Algorithm 2.
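A minimal sketch of the routing-free form in our reading of Algorithm 2: a single pass sums the low-layer capsules and scales the sum by the hyperparameter α before squashing.

```python
import numpy as np

def squash(s, eps=1e-8):
    norm_sq = np.sum(s * s, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

# Our reading of Algorithm 2: the routing loop is gone. A single pass sums
# the low-layer capsules and scales the sum by the gradient coefficient
# alpha before squashing (Eq. 21). alpha > 1 lengthens the output capsule.
def adaptive_routing_no_iter(u_hat, alpha=3.0):
    s = u_hat.sum(axis=0)     # plain sum of affine-transformed capsules
    return squash(alpha * s)  # one pass, no routing loop
```

A larger α pushes the output capsule's length closer to 1 and, during backpropagation, multiplies the gradient by α instead of by a small coupling coefficient.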
In the adaptive routing algorithm, the output is computed as described in Equation 21 below:
v_j = \mathrm{squash}\!\left( \alpha \sum_i \hat{u}_{j|i} \right)    (21)
Combining Equations 21 and 6, we obtain the gradient flowing through adjacent capsule layers with adaptive routing as below (the meaning of w and x is the same as in Equation 7):
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial v_j} \cdot \frac{\partial v_j}{\partial s_j} \cdot \alpha \cdot x    (22)
Comparing Equation 22 with Equation 7 shows the improvement of the gradients in backpropagation between the capsule layers. In dynamic routing, the coefficient c_ij multiplying the gradient is mostly close to 0.1 or even smaller, which causes gradient vanishing. In adaptive routing, the gradient coefficient α is a hyperparameter, usually a positive integer greater than 1, which amplifies the gradient.
In Figure 6, we show the range of gradients in the ReLU Conv1 layer of the multi-layer capsule network with adaptive routing. Compared with the dynamic routing results in Figure 3, the gradient values in the front layers of the multi-layer network are larger and the capsule network continues to work.
The hyperparameter α not only inhibits gradient vanishing to some extent; an appropriate α also magnifies the gradient so that it propagates more smoothly to the front of the model.
4 Experiments
4.1 Implementation
We tested our proposed adaptive routing algorithm in classification experiments on several common datasets: MNIST [10], FashionMNIST [18], SVHN [13] and CIFAR10 [9]. For CIFAR10 and SVHN, the images are resized and randomly shifted in each direction with zero padding; no other data augmentation or deformation is used. For the other datasets, the original image sizes are used throughout. In the experiments with two capsule layers, the number of capsules per layer is [1152, 10], the same as for the dynamic routing algorithm [15]. For the experiments with three and four capsule layers, the numbers of capsules per layer are [1152, 256, 10] and [1152, 256, 32, 10], respectively. We used PyTorch for the implementation. For training, we used the Adam optimizer [8] with an initial learning rate of 0.001, decayed after each epoch, and a batch size of 128. The models were trained on a GTX 1080 Ti for 150 epochs in every experiment. All experiments were run three times and the results were averaged.
4.2 Classification Results
We tested our proposed adaptive routing algorithm and dynamic routing algorithm on several benchmark datasets, CIFAR10 [9], SVHN [13], FashionMNIST [18] and MNIST [10].
Table 1: Classification accuracy of the dynamic routing algorithm (DRA) and our adaptive routing algorithm (ARA).

Model | CIFAR10 | SVHN   | FashionMNIST | MNIST
DRA   | 76.05%  | 93.65% | 93.02%       | 99.65%
ARA   | 78.41%  | 94.27% | 93.07%       | 99.65%

From Table 1, with the same network configuration our algorithm achieves better performance than the dynamic routing algorithm. The routing algorithm between capsule layers learns the affine transformation of the object and the combination of low-layer and high-layer capsules. Therefore, stacking multiple capsule layers can improve model performance, since the model can learn more powerful affine transformation capabilities and more complex combinations of capsules in adjacent layers.
Table 2: Accuracy on FashionMNIST for different numbers of capsule layers and values of α.

Layers   | α=1    | α=2    | α=3    | α=4
2 layers | 92.78% | 93.23% | 93.07% | 92.96%
3 layers | 93.54% | 93.63% | 93.39% | 93.38%
4 layers | 93.61% | 93.71% | 93.57% | 93.41%
Table 3: Accuracy on CIFAR10 for different numbers of capsule layers and values of α.

Layers   | α=1    | α=2    | α=3    | α=4
2 layers | 78.24% | 77.97% | 78.41% | 78.34%
3 layers | 78.41% | 78.01% | 78.66% | 78.44%
4 layers | 78.42% | 78.13% | 78.68% | 78.50%
From Tables 2 and 3, we obtained different performance for different numbers of capsule layers and different values of α on CIFAR10 [9] and FashionMNIST [18]. With all other configuration parameters identical, performance improves as the number of capsule layers increases. Performance also varies with α: on FashionMNIST, α = 2 or 3 performs better, while on CIFAR10, α = 1 or 3 gives better performance.
Table 4: Accuracy on CIFAR10 for small values of α.

Layers   | α=0.1  | α=0.001 | α=0.0001 | α=0.00001
2 layers | 77.24% | 69.25%  | 10.58%   | 10.42%
3 layers | 10.23% | 10.01%  | 10.22%   | 10.12%
4 layers | 10.18% | 10.15%  | 10.02%   | 10.06%
From Table 4, we obtained different performance for small values of α on CIFAR10 [9]. There are clearly two situations in which the capsule network stops working. First, with two capsule layers the network collapses when α is set to 0.0001 or less, matching the behavior reported in the original paper [15]. Second, with multiple capsule layers (3 and 4 layers) the network also stops working when α is set to 0.1 or less. Capsule networks using the dynamic routing algorithm behave the same way when stacking multiple capsule layers. Finally, comparing the multi-layer results in Tables 3 and 4 with the coupling coefficient values in Figure 2 shows that too-small gradient coefficients cause gradient vanishing in the capsule network.
In our proposed algorithm, α plays a role equivalent to the number of iterations in the routing algorithm. In the capsule network, although increasing the number of iterations brings noise, it enhances the activation probability of the high-layer capsules; the original capsule network performs best with three iterations. In the end, although the hyperparameter α has the same meaning as the iteration number, its scale is different.
5 Conclusion
In the original capsule network (using the dynamic routing algorithm), the gradient vanishes when the model stacks multiple capsule layers. We analyzed the forward and backward propagation of data through the capsule network and found that the coupling coefficient causes the gradient vanishing. We therefore proposed the adaptive routing algorithm, which does not involve the coupling coefficient in the routing process, to overcome gradient vanishing when the network stacks multiple capsule layers. Because routing iterations bring a large amount of computation, we first derived the iterative process of the adaptive routing algorithm and then simplified it by replacing the number of iterations with a hyperparameter α. The hyperparameter α not only inhibits gradient vanishing; an appropriate α also magnifies the gradient so that it propagates more effectively to the front layers of the model. As a result, our adaptive routing algorithm achieves better performance than Sabour et al. [15] on FashionMNIST [18], SVHN [13] and CIFAR10 [9], and state-of-the-art performance on MNIST [10]. Further, we evaluated different numbers of capsule layers and different hyperparameter values and analyzed the experimental results.
As future work, we will continue to research capsule networks, increasing the number of network layers while reducing the amount of computation.
References
 [1] Deng et al. (2009) ImageNet: a large-scale hierarchical image database. CVPR, pp. 248–255.
 [2] Ding et al. (2019) Group reconstruction and max-pooling residual capsule network. IJCAI, pp. 2237–2243.
 [3] He et al. (2016) Deep residual learning for image recognition. CVPR, pp. 770–778.
 [4] Hinton et al. (2011) Transforming auto-encoders. ICANN, pp. 44–51.
 [5] Hinton et al. (2018) Matrix capsules with EM routing. ICLR.
 [6] Jaderberg et al. (2015) Spatial transformer networks. NeurIPS, pp. 2017–2025.
 [7] Jeong et al. (2019) Ladder capsule network. ICML, pp. 3071–3079.
 [8] Kingma and Ba (2015) Adam: a method for stochastic optimization. ICLR.
 [9] Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report.
 [10] LeCun et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
 [11] Lenssen et al. (2018) Group equivariant capsule networks. NeurIPS, pp. 8858–8867.
 [12] Li et al. (2018) Neural network encapsulation. ECCV, pp. 266–282.
 [13] Netzer et al. (2011) Reading digits in natural images with unsupervised feature learning. NeurIPS Workshop.
 [14] Rajasegaran et al. (2019) DeepCaps: going deeper with capsule networks. CVPR, pp. 10725–10733.
 [15] Sabour et al. (2017) Dynamic routing between capsules. NeurIPS, pp. 3856–3866.
 [16] Simonyan and Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR.
 [17] Szegedy et al. (2015) Going deeper with convolutions. CVPR, pp. 1–9.
 [18] Xiao et al. (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
 [19] Zhang et al. (2018) CapProNet: deep feature learning via orthogonal projections onto capsule subspaces. NeurIPS, pp. 5819–5828.