1 Introduction
Recently, the development of Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs), has seen a dramatic increase, leading to many different architectures. An important computing challenge is the optimization of CNNs and their hyperparameters. Most of the fine-tuning optimizations (e.g., choosing the training policy, the learning rate, the optimizer, etc.) are repetitive and time-consuming, because every change must be tested with many epochs of training and repeated many times to achieve statistical significance. This is significantly worsened by increasing CNN complexity, which demands more compute effort to find the right set of optimizations.
Different methods have been explored in the literature to reduce the training time of CNNs, such as the one-cycle policy by Smith2018, warm restarts by Loshchilov, and adaptive batch size by De, Goyal, and Devarakonda. However, a key limitation of CNNs is that no information on the spatial correlation between detected features is retained. This causes poor network performance in terms of accuracy when the object to be recognized is rotated, has a different orientation, or presents any other geometric variation. Currently, this problem is addressed by training networks on expanded datasets that also include transformed and modified objects. However, wider datasets lead to much longer training times. Longer training times not only delay the DNN development cycle, but also make development a complex job for a wide range of developers, requiring super-costly training machines like Nvidia's DGX-2, or relying on outsourcing to third-party cloud services that can compromise privacy and security requirements. Hence, both advanced DNN architectures and fast training techniques are necessary.
Capsule Networks (CapsNets, see Figure 1) aim to solve this limitation of CNNs (w.r.t. preserving the spatial correlation between detected features) by substituting single neurons with so-called Capsules (i.e., groups of neurons) and routing-by-agreement between capsule layers, which provide the ability to encode both the instantiation probability of an object and its instantiation parameters (width, orientation, skew, etc.). However, one of the major hurdles in the adoption of CapsNets is their gigantic training time, which is primarily due to the higher complexity of their constituting elements.
To address this challenge, we present XTrainCaps, a systematic methodology to employ different optimization methods for significantly reducing the training time and number of parameters of CapsNets, while preserving or improving their accuracy.¹ However, this requires a study of the impact of different optimizations (like the learning rate policy, batch size adaptation, etc.) on the training time and accuracy of CapsNets. XTrainCaps provides multiple Pareto-optimal solutions that can be leveraged to trade off training time reductions with no or tolerable accuracy loss. For instance, in our experimental evaluations, XTrainCaps provides a solution that reduces the training time of CapsNets by 58.6% while providing a slight increase in accuracy (i.e., 0.9%). Another solution provides a 15% reduction in the number of parameters of the CapsNet without affecting its accuracy.

¹Our methodology may also be beneficial for other complex CNNs, as it enables the integration of available open-source optimizations in the DNN training loop.
Our Novel Contributions:
- Analysis of the CapsNets' behavior when different learning rate policies (such as the one-cycle policy or warm restarts) are applied (Section 3).
- A novel training methodology, XTrainCaps, which accelerates the training of CapsNets by combining warm restarts, adaptive batch size, and weight sharing in an automated flow specialized for the structure of CapsNets, i.e., capsules and the different capsule layers (Section 4).
Before proceeding to the technical sections, we present an overview of CapsNets and the learning rate policies in Section 2, to a level of detail necessary to understand the contributions of this paper.
2 Background and Related Work
2.1 CapsNets
Capsules were introduced by hintontransforming2011. A capsule is a group of neurons that encodes both the instantiation probability of an object and its spatial information. sabourdynamic2017 and hintonmatrix2018 introduced two architectures based on capsules, tested on MNIST (mnist), with better accuracy compared to state-of-the-art traditional CNNs. The architecture of the CapsNet that we use in our work corresponds to the model of sabourdynamic2017, as shown in Figure 1. The layers composing the encoder are: an initial convolutional layer (ConvLayer); a PrimaryCaps layer, which transforms the scalar outputs of the ConvLayer into vectors; and a DigitCaps layer, which outputs the probabilities of the digits being present in the input image. The encoding network is followed by a decoder, composed of three fully-connected layers, which outputs the reconstructed image. The loss computed on the reconstructed image (Reconstruction Loss) encourages the capsules of the DigitCaps layer to encode the instantiation parameters of the object.
2.2 Learning Rate Schedules
One-Cycle Policy: This method, proposed by Smith2015 and Smith2018, consists of three phases of training. In the first phase, the learning rate is linearly increased from a minimum to a maximum value in an optimal range. In the second phase, the learning rate is symmetrically decreased. In a small fraction of the last steps, the learning rate is annealed to a very low value. Equation 1 reports the formulas of the three phases of the one-cycle policy, where $t$ is the training step, $T$ is the total number of training steps, $T_a$ denotes the step at which the final annealing phase begins, and $\lambda_{min}$ and $\lambda_{max}$ are the learning rate range boundaries. Saddle points in the loss function slow down the training process, since the gradients in these regions have smaller values. Increasing the learning rate helps to traverse the saddle points faster.

\lambda(t) = \begin{cases} \lambda_{min} + \frac{2t}{T_a}\,(\lambda_{max} - \lambda_{min}), & 0 \le t < T_a/2 \\ \lambda_{max} - \left(\frac{2t}{T_a} - 1\right)(\lambda_{max} - \lambda_{min}), & T_a/2 \le t < T_a \\ \lambda_{min}\,\frac{T - t}{T - T_a}, & T_a \le t \le T \end{cases} \quad (1)
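The three-phase schedule can be sketched in a few lines of Python (a minimal illustration; the function name is ours, and the default bounds and the 10% annealing fraction follow the settings listed in Section 4.1):

```python
def one_cycle_lr(t, total_steps, lr_min=0.0001, lr_max=0.001, anneal_frac=0.1):
    """One-cycle learning rate: linear ramp up, symmetric ramp down,
    then annealing toward a very low value in the final steps."""
    t_a = int(total_steps * (1 - anneal_frac))  # start of the annealing phase
    if t < t_a / 2:
        # Phase 1: linear increase from lr_min to lr_max
        return lr_min + (lr_max - lr_min) * (2 * t / t_a)
    elif t < t_a:
        # Phase 2: symmetric linear decrease back to lr_min
        return lr_max - (lr_max - lr_min) * (2 * t / t_a - 1)
    else:
        # Phase 3: anneal lr_min linearly toward zero
        return lr_min * (total_steps - t) / (total_steps - t_a)
```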
Warm Restarts: The Stochastic Gradient Descent with Warm Restarts technique was proposed by Loshchilov. The learning rate is initialized to a maximum value and then decreased with cosine annealing until it reaches the lower bound of a chosen interval. When the learning rate reaches the minimum value, it is reset to the maximum value, realizing a step function. The cosine annealing function is reported in Equation 2, where $\lambda_{min}$ and $\lambda_{max}$ are the learning rate range boundaries, $t$ is the training step (counted within the current cycle), and $T_i$ is the number of training steps in each cycle. When $t = T_i$, $\lambda$ is set to $\lambda_{max}$ and the cycle starts again. This process is repeated cyclically during the whole training time. The period of a cycle needs to be properly set to optimize the training time and the accuracy. Increasing the learning rate stepwise emulates a warm restart of the network and encourages the model to step out from possible local minima or saddle points.

\lambda(t) = \lambda_{min} + \frac{1}{2}(\lambda_{max} - \lambda_{min})\left(1 + \cos\left(\frac{\pi t}{T_i}\right)\right) \quad (2)
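Equation 2 can be implemented directly (a minimal sketch; the function name and the default learning rate bounds, taken from Section 4.1, are ours):

```python
import math

def warm_restarts_lr(t, cycle_steps, lr_min=0.0001, lr_max=0.001):
    """Cosine annealing with warm restarts: the learning rate decays from
    lr_max to lr_min over each cycle, then jumps back up to lr_max."""
    t_cur = t % cycle_steps  # step index within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_steps))
```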
2.3 Adaptive Batch Size
When training a DNN, a small batch size can provide faster convergence, while a larger batch size allows higher data parallelism and, consequently, higher computational efficiency. For this reason, many authors have studied methods to increase the batch size with fixed schedules (Babanezhad, Daneshmand) or following an adaptive criterion (De, Goyal, Devarakonda).
3 Analysis of Learning Rate Schedules applied to CapsNets
The techniques described in Section 2 have been tailored for traditional CNNs, to improve their performance in terms of accuracy and training time. This section aims to study whether different learning rate policies and batch size selection are effective when applied to the training of CapsNets. Since the traditional neurons of CNNs are replaced by multi-dimensional capsules in CapsNets, the number of parameters (weights and biases) to be trained is huge.
For this purpose, we implemented different state-of-the-art learning rate policies in the training loop of the CapsNet, enhancing these techniques for the capsule structures and the relevant parameters of the CapsNet. Figure 2(a) shows the learning rate changes for the different techniques and how the accuracy of the CapsNet varies accordingly. More detailed results of our analyses, including comparisons with LeNet5, are reported in the table of Figure 2(b).
From the results of this analysis, we can derive the following key observations:
- The warm restarts technique is the most promising, because it reaches the same accuracy (99.366%) as the CapsNet with a fixed learning rate while providing a 79.31% reduction in the training time.
- A more extensive training with warm restarts leads to an accuracy improvement of 0.074%.
- The adaptive batch size shows similar improvements in terms of accuracy (99.41%) and number of training epochs.
- The first epochs, which use a smaller batch size, execute in a longer time.
4 XTrainCaps: Our Framework to Accelerate the Training of CapsNets
Training a CapsNet is a multi-objective optimization problem, because our goal is to maximize the accuracy while minimizing the training time and the network complexity. The processing flow of our XTrainCaps framework is shown in Figure 3. Before describing how to integrate the different optimizations in an automated training methodology and how to generate the optimized CapsNet at the output (Section 4.5), we present how these optimizations have been implemented and enhanced for CapsNets, which is necessary to realize an integrated training methodology.
4.1 Learning Rate Policies for CapsNets
The first parameter analyzed to improve the training process of CapsNets is the learning rate. We evaluate the optimal learning rate range to be bounded by 0.0001 and 0.001. For our framework, we use the following parameters in these learning rate policies:
- Fixed learning rate: 0.001;
- Exponential decay: starting value 0.001, decay rate 0.96, decay steps 2000;
- One-cycle policy: lower bound 0.0001, upper bound 0.001, annealing to a very low value in the last 10% of training steps (see Algorithm 1);
- Warm restarts: lower bound 0.0001, upper bound 0.001, cycle length = one epoch (see Algorithm 2).
4.2 Batch Size
To realize the adaptive batch size, the batch size is set to 1 for the first three epochs, and then increased every five epochs, three times. In particular, the user can choose a value B, and the batch size successively assumes three increasing values up to B (see Algorithm 3).
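A minimal sketch of this schedule follows (only the initial batch size of 1 for three epochs and the three increases every five epochs are stated above; the intermediate values B/4 and B/2 are our assumption for illustration):

```python
def adaptive_batch_size(epoch, b=16):
    """Adaptive batch size schedule: batch size 1 for the first three
    epochs, then increased every five epochs, three times, up to a
    user-chosen value b. The intermediate values b/4 and b/2 are an
    illustrative assumption."""
    if epoch < 3:
        return 1                      # small batches: fast initial convergence
    step = min((epoch - 3) // 5, 2)   # 0, 1, 2 -> b/4, b/2, b
    return b // (2 ** (2 - step))
```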
4.3 Complexity of Decoder
The decoder is an essential component of the CapsNet: indeed, its absence results in a lower accuracy. The outputs of the DigitCaps layer are fed to the decoder: the output vector of the capsule with the highest value is left untouched, while the remaining 9 vectors are set to zero (Figure 4, left). Thus, the decoder receives 10x16 values, of which 9x16 are null. Therefore, we optimize the model by using a reduced decoder (Figure 4, right) with only the 1x16 inputs that are linked to the capsule outputting the highest probability. Overall, the original decoder has 1.4M parameters (weights and biases), while the reduced decoder provides a 5% reduction, with 1.3M parameters.
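The input selection for the reduced decoder can be sketched as follows (a plain-Python illustration; the function name is ours, and the DigitCaps outputs are represented as 10 vectors of 16 values each for one input image):

```python
def select_winning_capsule(digit_caps):
    """digit_caps: list of 10 capsule output vectors (16 values each), as
    produced by the DigitCaps layer for one image. Instead of zeroing out
    9 of the 10 vectors and feeding all 10x16 values to the decoder, keep
    only the 1x16 vector of the capsule with the highest activation (its
    L2 norm), so the reduced decoder takes 16 inputs instead of 160."""
    norms = [sum(x * x for x in caps) ** 0.5 for caps in digit_caps]
    winner = max(range(len(digit_caps)), key=lambda i: norms[i])
    return digit_caps[winner]
```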
4.4 Complexity Reduction through Weight Sharing
Algorithm 4 illustrates how to share the weights between the PrimaryCaps and the DigitCaps layers, by having a single weight tensor in common for all the 8-element vectors inside each 6x6 capsule grid. Using this method, it is possible to reduce the total number of parameters by more than 15%, from 8.2 million to 6.7 million. However, the accuracy drops by almost 0.3% when compared to the baseline CapsNet.
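The reported numbers can be checked with a quick parameter count (a sketch under the assumption of the standard shapes of the sabourdynamic2017 model: 32 capsule types on a 6x6 grid, i.e., 1152 input capsules of 8 elements, mapped to 10 output capsules of 16 elements):

```python
def transform_param_count(share_weights):
    """Parameters of the PrimaryCaps -> DigitCaps transformation matrices.
    Without sharing, each of the 32 * 6 * 6 = 1152 input capsules has its
    own 8x16 matrix per output capsule; with sharing, a single weight
    tensor is shared across the 6x6 grid of each capsule type."""
    caps_types, grid, in_dim, out_caps, out_dim = 32, 6 * 6, 8, 10, 16
    in_capsules = caps_types if share_weights else caps_types * grid
    return in_capsules * out_caps * in_dim * out_dim
```

Under these assumed shapes, sharing shrinks the transformation tensor from 1,474,560 to 40,960 parameters, a saving of about 1.43M, consistent with the reported drop from 8.2M to 6.7M total parameters.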
4.5 WarmAdaBatch
Among the explored learning rate policies, warm restarts is the most promising in terms of accuracy, while the adaptive batch size provides a good trade-off to obtain fast convergence. We propose the WarmAdaBatch policy (see Algorithm 5) to expand the space of solutions by combining the best of both worlds. For the first three epochs, the batch size is set to 1; then it is increased to 16 for the remaining training time. A first cycle of the warm restarts policy is performed during the first three epochs, and a second one during the remaining training epochs. The learning rate variation of WarmAdaBatch is shown in Figure 5.
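The combined schedule can be sketched compactly (an illustrative helper of ours, not the paper's Algorithm 5; since the number of steps per epoch depends on the batch size, the cycle length is passed in explicitly):

```python
import math

def warm_ada_batch(epoch, t, cycle_steps, lr_min=0.0001, lr_max=0.001):
    """WarmAdaBatch sketch: batch size 1 for the first three epochs and 16
    afterwards, with one cosine warm-restart cycle over each of the two
    phases. t is the current step within the active cycle and cycle_steps
    is that cycle's length in steps."""
    batch_size = 1 if epoch < 3 else 16
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_steps))
    return batch_size, lr
```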
4.6 Optimization Choices
Our framework automatically optimizes a CapsNet and its training depending on which parameters the user wants to improve: using WarmAdaBatch, the accuracy and the training time are automatically co-optimized. The number of parameters can be reduced, at the cost of some accuracy loss and a training time increase, by enabling weight sharing along with WarmAdaBatch.
5 Evaluation
5.1 Experimental Setup
We developed our framework using the PyTorch library (paszke2017automatic), running on two Nvidia GTX 1080 Ti GPUs. We tested it on the MNIST dataset (mnist), since the original CapsNet proposed by sabourdynamic2017 is tailored to it. The dataset is composed of 60,000 training samples and 10,000 test samples. After each training epoch, a test is performed. At the beginning of each epoch, the training samples are randomly shuffled.
5.2 Accuracy Results
Evaluating the learning rate policies: Among the state-of-the-art learning rate policies that we enhanced for CapsNets, warm restarts is the most promising, improving the maximum accuracy by 0.074%. The CapsNet with warm restarts reaches the maximum accuracy of the baseline (with a fixed learning rate) in 6 epochs rather than 29, with a training time reduction of 62.07%.
Evaluating adaptive batch size: Different combinations of batch sizes in the adaptive batch size algorithm have been tested, since the smaller the batch size, the faster the initial convergence. However, large batch sizes lead to slightly higher accuracy after 30 epochs and, most importantly, to a reduced training time. In fact, a CapsNet training epoch with batch size 1 lasts 7 minutes, while with batch size 128 it lasts only 28 seconds. Batch size 16 is a good trade-off between fast convergence and short training time (i.e., 49 s/epoch). The best results with the adaptive batch size are obtained using batch size 1 for the first three epochs and then increasing it to 16 for the remaining part of the training. With this parameter selection, there is a 0.044% accuracy gain with respect to the baseline, and the maximum accuracy of the baseline is reached in 5 epochs rather than 29. The first three epochs take 88% longer because of the reduced batch size, so the adaptive batch size alone is not the most convenient option; nevertheless, the total training time is still reduced by 30% compared to the baseline.
Evaluating WarmAdaBatch: As for the batch size, the first cycle of the learning rate lasts 3 epochs and the second one 27 epochs. The batch size variation and the learning rate cycles are synchronized. This solution provides a 0.088% accuracy gain with respect to the baseline CapsNet implementation, and the baseline's maximum accuracy is reached by the CapsNet with WarmAdaBatch in 3 epochs instead of 29. After the first three epochs, the batch size changes and the learning rate is restarted, so there is an accuracy drop, which however reconverges in a few steps to the highest stable value obtained.
Evaluating Weight Sharing: Applying weight sharing to the DigitCaps layer, we achieve a 15% reduction in the total number of parameters, decreasing from 8.2 million to 6.7 million. However, this reduction also leads to a decrease in the maximum accuracy, which drops by 0.26%.
5.3 Comparison of different optimization types
We compare the different types of optimizations in terms of accuracy, training time to reach the maximum accuracy, and number of parameters. As shown in Figure 6a, we compare the different optimization methods in a 3-dimensional space. This representation provides Pareto-optimal solutions, depending on the optimization goals. We also compare, in Figure 6b, the accuracy and the learning rate evolution across epochs for AdaBatch, WarmRestarts, and WarmAdaBatch. Among the space of solutions, we discuss in more detail two Pareto-optimal ones: WarmAdaBatch and the combination of WarmAdaBatch and weight sharing, which we call WAB+WS.
WarmAdaBatch: This solution provides the optimal point in terms of accuracy and training time, because it achieves the highest accuracy (99.454%) in the shortest time (2,400 seconds). Varying the batch size boosts the accuracy improvements in the first epoch, and the restart policy contributes to speeding up the training.
WAB+WS: The standalone weight sharing reduces the number of parameters by 15%. By combining it with WarmAdaBatch, the accuracy loss is compensated (99.38% versus 99.366% for the baseline), while the training time is shorter than the baseline (7,200 seconds versus 8,700 seconds) but longer than with WarmAdaBatch alone. Our framework chooses this solution if the reduction of the number of parameters is also selected among the optimization goals.
6 Conclusion
In this paper, we proposed XTrainCaps, a novel framework to automatically optimize the training of a given CapsNet for accuracy, training time, or number of parameters, based on the user requirements. We enhanced different learning rate policies, for the first time, to accelerate the training of CapsNets. Afterwards, we discussed how an integrated training framework can be developed to find Pareto-optimal solutions, including new optimizations for fast training like WarmAdaBatch, complexity reduction for the CapsNet decoder, and weight sharing for CapsNets. These solutions not only provide a significant reduction in the training time while preserving or improving the accuracy, but also enable a new mechanism to trade off training time, network complexity, and achieved accuracy. This enables new design points under different user-provided constraints.