Convolutional Neural Networks (CNNs) have been around for almost two decades. After their introduction in 1998 for the digit classification problem, it took almost 14 years to improve them for more complicated problems [2, 3]. CNNs have since grown deeper and become better at visual recognition [4, 5, 6]. However, they suffer from a key issue in how they generate invariant features: they focus mainly on the amount of activation with respect to different features, without considering the correlation among the presence of various features. A notable attempt to address this issue has been made in capsule networks. In capsule networks, intermediate activations are represented in the form of vectors, or primary capsules. Moreover, unlike in normal CNNs, the activations in the output layer are also represented in the form of vectors, or output capsules (as defined in ). Capsule networks also implement a technique called "dynamic routing" that measures the agreement among the primary capsules and generates the output capsules. Dynamic routing involves a weighted summation of primary capsules, where the weights are iteratively updated during the forward pass in proportion to the similarity between the individual activations and the combined activation. In the original work, the network was demonstrated to perform well on 10-class problems such as MNIST (http://yann.lecun.com/exdb/mnist/), CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) and SVHN (http://ufldl.stanford.edu/housenumbers/).
Capsule networks have also been shown to improve upon more complex networks like AlexNet for Indic character recognition . Another approach, proposed by the authors of the capsule network, demonstrated the use of matrix capsules  instead of vector capsules to improve performance. However, when applied to problems with a higher number of classes, capsule networks were observed to take a long time to train. Moreover, their performance degrades severely as the number of classes increases. The degradation of performance of capsule networks on more complicated datasets was shown in the works of . Thus, the capsule network by itself is not scalable to more complicated problems. In its native form, the dynamic routing algorithm depends directly on the number of classes. While it is easier to model agreement between primary capsules for a low number of classes, the number of interrelationships increases considerably for complicated problems. Thus, normal capsule networks tend to over-fit the data. In the current work we propose to use capsule networks for feature extraction rather than for generating class-specific capsules. By using capsules to generate intermediate features, we remove the dependency of dynamic routing on the number of classes. Moreover, the number of features generated by dynamic routing can be much smaller than the number of classes, which prevents the network from over-fitting while learning more general features in the process. A brief overview of the proposed system is shown in Fig. 1. The proposed approach is tested on three Indic digit databases, namely "Bangla", "Devanagari" and "Telugu", as well as on more complicated datasets with a higher number of classes, namely "Bangla Basic" and "Bangla Compound".
II The Capsule Network
While traditional CNNs are invariant to object positions and orientations, capsule networks aim for equivariance through pose vectors. In capsule networks, a successive layer receives a higher activation when the kernels in the previous layer agree on the same decision. A schematic diagram of a capsule network is shown in Fig. 2. It mainly consists of two different layers, the primary capsule layer and the output capsule layer. The primary capsule layer groups together outputs from multiple convolutions into a single capsule unit. The output of the primary layer is fed into the output capsule layer via dynamic routing of capsules. Each vector of the output capsule layer corresponds to a single output class, and its length corresponds to the likelihood of that class.
II-A Primary Capsule Layer
Capsule networks start with convolution layers that convert image samples into groups of activation tensors. These tensors are converted to primary capsules in the primary capsule layer. Considering single-channel inputs, a number of kernels are convolved with the image at a fixed stride to obtain a multi-channel feature for the image. Unlike the scalar activations of a traditional convolutional layer, primary capsules take the form of 8-dimensional vectors, each generated by a group of kernels convolved at a fixed stride. By having several such groups of kernels, several blocks of primary capsules are generated. These blocks are reshaped such that all the 8-dimensional primary capsules are lined up to form a tensor of shape $N \times 8$, where $N$ is the total number of primary capsules.
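The reshaping step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `to_primary_capsules` and the channel-first layout are assumptions, and the example shape (32 groups of 8 channels on a 6 x 6 grid) is purely hypothetical, since the exact kernel counts and strides are not given here.

```python
import numpy as np

def to_primary_capsules(feature_maps, caps_dim=8):
    """Line up convolutional outputs as a (N, caps_dim) tensor of primary capsules.

    feature_maps: array of shape (groups * caps_dim, H, W), i.e. the stacked
    outputs of `groups` kernel groups (hypothetical channel-first layout).
    """
    channels, h, w = feature_maps.shape
    if channels % caps_dim != 0:
        raise ValueError("channel count must be a multiple of caps_dim")
    groups = channels // caps_dim
    # Split channels into (group, capsule component) and move the component last.
    blocks = feature_maps.reshape(groups, caps_dim, h, w)
    return blocks.transpose(0, 2, 3, 1).reshape(-1, caps_dim)

# Hypothetical example: 32 groups of 8 kernels on a 6 x 6 spatial grid.
maps = np.arange(32 * 8 * 6 * 6, dtype=float).reshape(256, 6, 6)
capsules = to_primary_capsules(maps)
```

Each row of `capsules` collects the 8 responses of one kernel group at one spatial position, so the number of rows equals the number of groups times the number of positions.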
II-B Output Capsule Layer
The output capsule layer takes the primary capsules from the previous layer and, using dynamic routing, produces output capsules. Each capsule of the output capsule layer is a vector describing a single class. Usually, the output layer of a fully connected neural network produces $C$ scalar values as the output, $C$ being the number of classes for a given dataset. In capsule networks, however, the output capsule layer is of dimension $C \times 16$, where each class is represented by a 16-dimensional output capsule. Instead of providing a scalar representation of the likelihood of each class, the information is encapsulated in a vector of dimension 16. The length of the output vector denotes the likelihood of the presence of the corresponding class.
II-C Dynamic Routing
The output capsules are obtained from the primary capsules by the operation of dynamic routing. The trainable weights of dynamic routing, $W_{ij}$, are used to determine the individual opinion of every capsule. Taking $i$ to be the index of the $N$ primary capsules, each of dimension 8, and $j$ to be the index of the $C$ output capsules, each of dimension 16, every $W_{ij}$ is a $16 \times 8$ matrix. The individual opinion of primary capsule $i$ regarding output capsule $j$ is given by:
$$\hat{u}_{j|i} = W_{ij} u_i$$
where the $i$-th primary capsule is denoted by $u_i$. For every primary capsule $i$, we get an output block of shape $C \times 16$. For the operation of dynamic routing, another type of weight of dimension $N \times C$, called routing weights $b_{ij}$, is considered. They are used in combining the individual opinions to form the final output capsules. However, unlike $W_{ij}$, these weights are not learned by backpropagation but are updated through dynamic routing during the forward pass, depending on the degree of agreement of the individual opinions with the combined output. These weights are initialized to zero at the start of every forward pass. The coupling coefficients $c_{ij}$ are given by:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})}$$
These coupling coefficients are used for combining the individual opinions to form the combined output capsule. The $j$-th combined output capsule $s_j$ is given by:
$$s_j = \sum_{i} c_{ij} \hat{u}_{j|i}$$
A non-linear "squashing" function is used to ensure that shorter vectors get shrunk to nearly zero length and longer vectors get shrunk to a length slightly below 1. The output capsule $v_j$ is given by:
$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}$$
A simple dot product calculates the agreement between the individual opinions $\hat{u}_{j|i}$ and the squashed combined output capsules $v_j$. The individual capsules having more agreement with the combined output are given higher preference. This is done by updating $b_{ij}$ as:
$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$$
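The routing procedure described in this subsection can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the array layout `(N, C, 16, 8)` for the transformation weights, the small `eps` term for numerical stability, and the default of three routing iterations are all assumptions made for the example.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink short vectors towards zero and long vectors to length just below 1."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u, W, num_iters=3):
    """Route N primary capsules to C output capsules.

    u: (N, 8) primary capsules.
    W: (N, C, 16, 8) trainable transformation weights (assumed layout).
    Returns the (C, 16) squashed output capsules.
    """
    # Individual opinions: one 16-d prediction per (primary, output) pair.
    u_hat = np.einsum("ncod,nd->nco", W, u)
    # Routing weights start at zero on every forward pass.
    b = np.zeros(u_hat.shape[:2])
    for _ in range(num_iters):
        # Coupling coefficients: softmax of b over the output capsules.
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        # Weighted sum of opinions, then squashing.
        s = np.einsum("nc,nco->co", c, u_hat)
        v = squash(s)
        # Agreement (dot product) raises the weight of agreeing capsules.
        b = b + np.einsum("nco,co->nc", u_hat, v)
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(6, 8)), rng.normal(size=(6, 3, 16, 8)))
```

Note that both `W` and `b` carry a dimension of size `C`, which is why the cost of routing grows with the number of output capsules.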
II-D Loss Function
The loss function is divided into two parts: the margin loss for object existence and a mean squared loss with respect to the images generated from the output capsules. The margin loss for object $k$ is given by:
$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2$$
Here, $T_k = 1$ iff an object of class $k$ is present. The upper and lower bounds $m^+$ and $m^-$ are set to 0.9 and 0.1 respectively. $\lambda$ is set to 0.5.
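As a small worked sketch of the margin loss, the function below evaluates it from a list of output capsule lengths using the constants stated above (0.9, 0.1 and 0.5). The function name and the choice to sum the per-class terms into a single value are assumptions for illustration; the text here does not say how the per-class losses are aggregated.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss summed over classes (summation is an assumption).

    lengths: capsule lengths, one per class.
    target:  index of the class that is present.
    """
    total = 0.0
    for k, length in enumerate(lengths):
        t_k = 1.0 if k == target else 0.0
        present = max(0.0, m_plus - length) ** 2   # penalize a short correct capsule
        absent = max(0.0, length - m_minus) ** 2   # penalize a long wrong capsule
        total += t_k * present + lam * (1.0 - t_k) * absent
    return total
```

For example, with lengths `[0.5, 0.3]` and target class 0, the first term contributes 0.4 squared and the second contributes 0.5 times 0.2 squared, giving 0.18 in total.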
II-E Decoder Network for Regularization
The decoder network is a series of fully connected layers that reconstructs the original input. The output capsule layer is masked so that all capsules except the one corresponding to the label are set to zero. The masked output capsules are sent to the decoder for reconstruction. The masking helps in creating class-specific reconstructions at test time. The reconstruction loss is minimized along with the margin loss so that the model does not overfit the training dataset. To prevent the reconstruction loss from dominating the margin loss, the former is scaled down by a factor of 0.0005.
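The masking step can be illustrated with a short NumPy sketch. The function name `mask_output_capsules` and the flattening of the masked capsules into a single vector for the decoder are assumptions made for this example.

```python
import numpy as np

def mask_output_capsules(v, label):
    """Zero every output capsule except the one for `label`, then flatten.

    v: (C, D) output capsules. Returns a vector of length C * D that a
    fully connected decoder could take as input (assumed interface).
    """
    mask = np.zeros((v.shape[0], 1))
    mask[label] = 1.0
    return (v * mask).reshape(-1)

# Tiny example with 3 classes and 2-dimensional capsules.
demo = mask_output_capsules(np.arange(6.0).reshape(3, 2), label=1)
```

Only the entries of the capsule for class 1 survive, so the decoder input carries information about a single class.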
III The Proposed Approach
The main drawback of the capsule network is the computation time of dynamic routing. The number of routing weights is directly proportional to the number of classes. Traditionally, capsule networks have been shown to perform well only with a small number of classes; classification on datasets with many classes becomes so time-consuming that it is virtually impractical. It has also been observed that the performance of capsule networks drops significantly as the number of classes increases. With more classes, analyzing agreement among capsules becomes more complicated. The network tends to overfit the data, resulting in an overall degradation in performance.
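The proportionality between routing cost and the number of classes can be made concrete with a short count. The sketch below is illustrative only: the function name is hypothetical, and the figure of 1152 primary capsules is an assumed example value, not one taken from this work.

```python
def routing_parameter_counts(num_primary, num_outputs, in_dim=8, out_dim=16):
    """Count transformation weights and routing weights for one routing layer."""
    transform = num_primary * num_outputs * out_dim * in_dim  # one matrix per pair
    routing = num_primary * num_outputs                       # one weight per pair
    return transform, routing

# Assumed example: 1152 primary capsules routed to 10 or to 199 output capsules.
t10, r10 = routing_parameter_counts(1152, 10)
t199, r199 = routing_parameter_counts(1152, 199)
```

Under these assumptions both counts scale linearly with the number of output capsules, so moving from 10 to 199 classes multiplies them by 19.9.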
III-A Capsule Connections as Feature Extractor
To overcome the above-mentioned issues, a modification to the capsule network is proposed. Instead of performing dynamic routing to obtain class-specific output capsules, dynamic routing is used to map the inputs to an intermediate feature space. Thus, the dynamic routing phase becomes independent of the number of classes and depends only on the number of features extracted. A schematic diagram of our proposed model is shown in Fig. 3. In our proposed model, the capsule network gives feature-specific capsules instead of class-specific capsules. Samples in a dataset may be represented using fewer features than the number of classes in that dataset. This enables us to target datasets with many classes using a smaller number of features. Since the majority of the execution time is taken up by the computation of features by the capsule network, an increase in the number of output classes, with the number of features kept the same, does not have a major impact on the time taken to train the network. Furthermore, limiting the number of features also boosts the generalization capability of the network, resulting in improved performance. In the original capsule networks, equivariance among capsules was measured to analyze agreement among them regarding the presence of one of the output classes. In the present work, the agreement among capsules is measured with respect to the presence of specific features, from which the output classes are inferred through another fully connected layer.
III-B Feature Capsule Layer
Instead of a final output capsule layer, we use an intermediate feature capsule layer that creates feature capsules from the original primary capsules. From the primary capsules it receives as input, the feature capsule layer creates a set of feature capsules, whose number should ideally be less than the number of classes for a gain in speed as well as a reduction in memory footprint compared to a normal capsule network. These feature capsules are flattened and passed into a fully connected output layer to compute the class-specific likelihood. The output probability for each class is obtained from the output of this fully connected layer.
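The classification head described above can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and the softmax used to turn the fully connected outputs into probabilities is an assumption, since the exact form of the output probability is not recoverable from the text here.

```python
import numpy as np

def classify_from_feature_capsules(feature_caps, weights, bias):
    """Flatten feature capsules and apply one fully connected layer.

    feature_caps: (n_features, caps_dim) feature capsules.
    weights:      (n_classes, n_features * caps_dim) dense weights.
    bias:         (n_classes,) dense bias.
    The softmax normalization is an assumed choice for this sketch.
    """
    flat = feature_caps.reshape(-1)
    logits = weights @ flat + bias
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Example: 10 feature capsules of dimension 16 mapped to 50 classes.
probs = classify_from_feature_capsules(
    np.zeros((10, 16)), np.zeros((50, 160)), np.zeros(50)
)
```

Only `weights` and `bias` depend on the number of classes; the feature capsules produced by routing keep the same shape for every dataset.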
III-C Regularization using Feature Capsules
Unlike in native capsule networks, the decoder tries to reconstruct the input from the feature capsules rather than the output capsules. These reconstructed images are used for regularization by minimizing the reconstruction loss, thereby preventing the network from over-fitting. Just as in the normal capsule network, the reconstruction loss is scaled down by a factor of 0.0005 so that it does not dominate the margin loss.
III-D The Modified Loss Function
Similar to the original capsule networks, a margin loss is computed for each class. For class $k$, the margin loss is given by
$$L_k = T_k \max(0, m^+ - p_k)^2 + \lambda (1 - T_k) \max(0, p_k - m^-)^2$$
Here, $p_k$ is the output probability for class $k$, and $T_k = 1$ iff a sample of class $k$ is present, else $T_k = 0$. The upper and lower bounds $m^+$ and $m^-$ are set to 0.9 and 0.1 respectively, and $\lambda$ is set to 0.5.
The decoder network tries to reconstruct the original samples from the intermediate feature capsules. Unlike in the capsule networks, the decoder receives the entire set of feature capsules without any form of masking. The reconstruction loss is the simple mean squared error between the input and the reconstructed image, given by
$$L_r = \frac{1}{M} \sum_{m=1}^{M} (x_m - \hat{x}_m)^2$$
where $x$ is the input image, $\hat{x}$ is the reconstructed image and $M$ is the number of pixels. Finally, the net loss is calculated as
$$L = \sum_{k} L_k + \alpha L_r$$
Here $\alpha$ is the scale-down factor that prevents the reconstruction loss from dominating the margin loss. In our experiments, the value of $\alpha$ is taken as 0.0005, the same as the one used in the original capsule networks.
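How the two loss terms combine can be shown with a few lines of plain Python. The helper names below are hypothetical, and images are represented as flat lists of pixel values purely to keep the sketch self-contained.

```python
def mean_squared_error(x, x_hat):
    """Mean squared error between an input image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def net_loss(margin, x, x_hat, alpha=0.0005):
    """Margin loss plus the scaled-down reconstruction loss."""
    return margin + alpha * mean_squared_error(x, x_hat)

# A 4-pixel toy image with one wrong pixel gives an MSE of 0.25.
total = net_loss(0.18, [0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0])
```

With the scale factor of 0.0005 the reconstruction term adds only 0.000125 here, so the margin loss remains the dominant part of the objective.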
IV Experimental Setup
| Dataset | Classes | Train Set | Test Set |
|---|---|---|---|
| Bangla Basic Character | 50 | 12000 | 3000 |
| Bangla Compound Character | 199 | 33282 | 8254 |
Table II: Training time (in seconds) per sample per epoch for numeral datasets
We have used five Indic handwritten datasets (https://code.google.com/archive/p/cmaterdb/downloads) for our experiments. Of these, the Bangla numeral (CMATERdb 3.1.1), Devanagari numeral (CMATERdb 3.2.1) and Telugu numeral (CMATERdb 3.4.1) datasets have 10 classes each and are divided in the ratio 2:1, with two parts for training and one part for testing. The datasets named Bangla basic character (CMATERdb 3.1.2) and Bangla compound character (CMATERdb 3.1.3.3) have 50 classes and 199 classes respectively (Table I). These are the problems with a higher number of classes, which require a considerable amount of time and memory to train on the originally proposed capsule network model.
The proposed approach is benchmarked against the traditional capsule network on the five above-mentioned datasets. The main goal of the experimentation is to demonstrate the efficiency of the proposed network for problems with a large number of classes. The primary factor controlling the performance of the proposed approach is the number of features. The three numeral datasets, namely Bangla, Devanagari and Telugu, have 10 classes each, and we have measured the performance of the proposed approach on them over a range of feature counts. There was no point in going higher, because beyond this range the speed of computation drops below that of traditional capsule networks. For the datasets with more classes, such as the Bangla basic characters or Bangla compound characters, the proposed approach is tested over a separate range of feature counts, and the proposed network started to show saturation for the higher values. The model with the best training performance was saved, and the performance of that model on the test set is reported. A separate validation set was not used because dedicated regularization techniques are already implemented in the process. Along with accuracy, several other factors corresponding to the time and memory consumption of the network are also measured. All the experiments were conducted using an NVIDIA GeForce GTX 1080 Ti (11 GB).
IV-C Results and Discussion
The first set of experiments shows the performance of the network on the numeral datasets. Though there is not much improvement in terms of accuracy (Table V), the overall training time is greatly reduced (Table VI).
While the proposed network itself reduces the computational time for each sample (Table II), the major speedup comes from the decrease in memory consumption (Table III). With a lower memory consumption it is possible to train the network with a larger batch size (Table IV).
The second set of experiments demonstrates the scalability of the model. The proposed network is run on two datasets with a higher number of classes, namely Bangla basic characters with 50 classes and Bangla compound characters with 199 classes. The proposed model obtained a much better accuracy (Table X) in both cases with a much shorter training time (Table XI).
The training time for individual samples per epoch (Table VII) shows a trend similar to that of the numeral datasets, thus providing some improvement in time consumption over basic capsule networks. However, the real benefit is in terms of GPU memory consumption (Table VIII). Unlike in capsule networks, where the memory consumption increases with the number of classes, the proposed method is immune to this problem. Because dynamic routing for a fixed dimension of features is independent of the number of classes, the memory consumption is almost identical across datasets. The only increase in memory is due to the last fully connected layer from the feature capsules to the output layer, and the difference is negligible with respect to the overall consumption. For example, for 10-dimensional feature capsules, the extra memory requirement for the last fully connected layer would be 0.006 MiB for the digit datasets, 0.031 MiB for the Bangla basic dataset and 0.123 MiB for the Bangla compound character dataset. Hence the difference is insignificant with respect to the total consumption. This allows the use of the same batch size across all datasets with different numbers of classes (Table IX).
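The order of magnitude of these figures can be checked with a back-of-the-envelope estimate. The sketch below rests on assumptions that are not stated in the text: 10 feature capsules of dimension 16 (160 inputs to the layer), 32-bit floating point parameters, and one bias per class. The function name is hypothetical.

```python
def fc_layer_mib(num_feature_caps, caps_dim, num_classes,
                 bytes_per_param=4, with_bias=True):
    """Estimate the parameter memory of the final fully connected layer in MiB."""
    inputs = num_feature_caps * caps_dim
    params = inputs * num_classes + (num_classes if with_bias else 0)
    return params * bytes_per_param / 2**20

digits = fc_layer_mib(10, 16, 10)      # numeral datasets, 10 classes
basic = fc_layer_mib(10, 16, 50)       # Bangla basic, 50 classes
compound = fc_layer_mib(10, 16, 199)   # Bangla compound, 199 classes
```

Under these assumptions the estimates round to 0.006 MiB and 0.031 MiB for the 10-class and 50-class cases, matching the values quoted above, and come out at roughly 0.122 MiB for 199 classes, close to the quoted 0.123 MiB.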
| Dataset | Proposed approach | Other work | Accuracy |
|---|---|---|---|
| Bangla | 97.3 (8) | LeNet | 94.6 |
| | | Basu et al. | 96.67 |
| | | Roy et al. | 95.08 |
| Devanagari | 94.8 (4) | LeNet | 92.1 |
| | | Das et al. | 90.44 |
| Telugu | 96.9 (2) | LeNet | 95.8 |
| | | Roy et al. | 87.2 |
| Bangla | 93.23 (8) | LeNet | 63 |
| | | Sarkhel et al. | 86.53 |
| | | Bhattacharya et al. | 92.15 |
| Bangla | 87.44 (6) | LeNet | 75.4 |
| | | Sarkhel et al. | 86.64 |
On several occasions, Table XII shows the proposed approach beating the classic capsule networks. The difference is more prominent for problems with a higher number of classes, such as Bangla Basic and Bangla Compound Characters.
One of the primary concerns with the capsule network is its poor scalability to problems with a larger number of classes. In our approach we have shown how capsule networks can be used to extract feature-specific capsules rather than class-specific capsules. Through this method a considerable improvement has been demonstrated in terms of overall training time per epoch. Moreover, the memory requirement of the proposed network is independent of the number of classes, which allows networks to be trained with much larger batch sizes and makes more efficient use of the parallelization capabilities of a GPU. Agreement among capsules becomes much more complicated for problems with many classes, so the normal capsule network tends to overfit the data. By generating intermediate features of lower dimension, a more generalized learning environment is created. This results in an improvement in accuracy for problems with many classes. The proposed network beats many other popular works on the current Indic datasets. In future, more analysis needs to be carried out regarding the effect of feature capsules on the concept of equivariance. Moreover, since reconstruction is carried out based on the entire set of feature capsules, class-specific reconstruction is not possible, which provides another avenue for extending the work.
This work is partially supported by the project order no. SB/S3/EECE/054/2016, dated 25/11/2016, sponsored by SERB (Government of India) and carried out at the Centre for Microprocessor Application for Training Education and Research, CSE Department, Jadavpur University.
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[3] S. Roy, N. Das, M. Kundu, and M. Nasipuri, “Handwritten isolated bangla compound character recognition: A new benchmark using a novel deep learning approach,” Pattern Recognition Letters, vol. 90, pp. 15–21, 2017.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[6] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, vol. 1, no. 2, 2017, p. 3.
[7] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in Advances in Neural Information Processing Systems, 2017, pp. 3856–3866.
[8] B. Mandal, S. Dubey, R. Sarkhel, and N. Das, “Handwritten indic character recognition using capsule networks,” in Proceedings of the 1st IEEE Conference on Applied Signal Processing, 2018.
[9] G. E. Hinton, S. Sabour, and N. Frosst, “Matrix capsules with em routing,” 2018.
[10] E. Xi, S. Bing, and Y. Jin, “Capsule network performance on complex data,” arXiv preprint arXiv:1712.03480, 2017.
[11] S. M. Obaidullah, C. Halder, K. Santosh, N. Das, and K. Roy, “Phdindic_11: page-level handwritten document image dataset of 11 official indic scripts for script identification,” Multimedia Tools and Applications, vol. 77, no. 2, pp. 1643–1678, 2018.
[12] S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, and D. K. Basu, “An mlp based approach for recognition of handwritten ‘bangla’ numerals,” arXiv preprint arXiv:1203.0876, 2012.
[13] A. Roy, N. Mazumder, N. Das, R. Sarkar, S. Basu, and M. Nasipuri, “A new quad tree based feature set for recognition of handwritten bangla numerals,” in Engineering Education: Innovative Practices and Future Trends (AICERA), 2012 IEEE International Conference on. IEEE, 2012, pp. 1–6.
[14] N. Das, B. Das, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri, “Handwritten bangla basic and compound character recognition using mlp and svm classifier,” arXiv preprint arXiv:1002.4040, 2010.
[15] A. Roy, N. Das, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri, “An axiomatic fuzzy set theory based feature selection methodology for handwritten numeral recognition,” in ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol I. Springer, 2014, pp. 133–140.
[16] R. Sarkhel, A. K. Saha, and N. Das, “An enhanced harmony search method for bangla handwritten character recognition using region sampling,” in Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on. IEEE, 2015, pp. 325–330.
[17] U. Bhattacharya, M. Shridhar, and S. K. Parui, “On recognition of handwritten bangla characters,” in Computer Vision, Graphics and Image Processing. Springer, 2006, pp. 817–828.
[18] R. Sarkhel, N. Das, A. K. Saha, and M. Nasipuri, “A multi-objective approach towards cost effective isolated handwritten bangla character and digit recognition,” Pattern Recognition, vol. 58, pp. 172–189, 2016.