Efficiency and Scalability of Multi-Lane Capsule Networks (MLCN)

by Vanderson M. do Rosario, et al.
University of Campinas

Some Deep Neural Networks (DNNs) have what we call lanes, or can be reorganized to have them. Lanes are data-independent paths in the network that typically learn different features or add resilience to the network. Given their data-independence, lanes are amenable to parallel processing. The Multi-lane CapsNet (MLCN) is a proposed reorganization of the Capsule Network which is shown to achieve better accuracy while bringing highly parallel lanes. However, the efficiency and scalability of MLCN had not been systematically examined. In this work, we study the MLCN network with multiple GPUs and find that it is 2x more efficient than the original CapsNet when using model parallelism. Further, we present the load balancing problem of distributing heterogeneous lanes over homogeneous or heterogeneous accelerators and show that a simple greedy heuristic can be almost 50% faster than a random approach.





I Introduction

Several approaches to the distributed parallelization of Deep Neural Networks (DNNs) have concentrated on their depth [huang2018gpipe, mehta2018high, ben2018demystifying], but DNNs can also be organized so as to be parallelized in their width [jia2018beyond]. The DNN architecture may be organized into distinct data-independent paths, which we call lanes [MLCN-SPL]. This creates separable and resource-efficient paths in the network that can be used to learn different features or add resilience to the network. Examples of neural networks with lanes are the Google Inception network [chollet2017xception, szegedy2017inception] and the Multi-lane Capsule Network (MLCN) [MLCN-SPL]. As these lanes are data-independent, they can be (1) processed in parallel and (2) specialized for distinct computational targets (CPUs, GPUs, FPGAs, and the cloud), as well as resource-constrained mobile and IoT targets, leading to both opportunities and challenges. Recently, our research has focused on the Multi-Lane Capsule Network (MLCN), a separable and resource-efficient organization of the Capsule Network (CapsNet) that allows parallel processing while achieving high accuracy at a reduced cost. Table I shows results for MLCN in comparison with the baseline CapsNet. With a similar number of parameters, MLCN achieves similar accuracy but with a significant speedup stemming from the lane organization. Our initial experiments were performed in single-GPU environments but, with highly parallel lanes, it is natural to explore how MLCN scales with more GPUs. Here we present a first comprehensive study of the scalability and efficiency of MLCN on multi-GPU systems.

Network/set  # of lanes  Width  Params.  Train Time  Test Acc.

Baseline     -           -      11k      240         66.36%
Mlcn2        4           4      5k       53          69.05%
Mlcn2        32          2      14k      204         75.18%
Baseline     -           -      8k       220         91.30%
Mlcn2        2           4      3.6k     20          91.01%
Mlcn2        8           4      10.6k    92          92.63%

Table I: Comparison between baseline CapsNet and MLCN.

Moreover, the lanes do not necessarily need to have the same sizes or shapes and may even learn different features of the given task. This implies that each distinct lane may be better suited to a distinct HW substrate. Further, each lane may tolerate different impacts from various optimizations (such as quantization). Thus, given a set of lanes and a set of hardware (HW) devices, there is an optimal assignment of lanes to HW and an optimal sequence of optimizations for each lane-HW pair.

In this work, we describe and present this lane-hardware matching problem for homogeneous or heterogeneous accelerator scenarios. We also show that a simple greedy heuristic can be almost 50% faster than a random naïve approach.

The main contributions of this work are:

  • We present a first comprehensive analysis of the efficiency and scalability of MLCN showing its advantages over the data-parallelism-limited approach of the original CapsNet.

  • We define the load balancing problem of distributing heterogeneous lanes over homogeneous or heterogeneous hardware.

  • We present a greedy heuristic to solve the lane-hardware matching problem, showing that it is superior to a naïve random approach.

This paper is organized as follows: Section II presents the state of the art in Capsule Networks and DNN parallelization; Section III describes the Multi-Lane Capsule Network (MLCN) and discusses how it can be parallelized; Section IV further discusses the heterogeneous distribution problem and presents a heuristic approach to it; Sections V and VI present the experimental setup and the experimental results; and Section VII presents our conclusions.

II Related Work

II-A Capsule Network

The Convolutional Neural Network (CNN) is a class of DNN commonly used when working with images. CNNs have achieved state-of-the-art results in tasks such as image and video recognition, image classification, and medical image analysis. However, these networks have difficulties with location invariance and suffer from loss of location information; e.g., a CNN able to recognize faces could mistakenly recognize an image with eyes, mouth, and nose at random positions as a face, not understanding that there is an important spatial relationship between the composing elements. To address this problem, many new DNN approaches were proposed, including the notion of capsules proposed by Hinton, Krizhevsky, and Wang in 2011.


To encode spatial relationships, Capsule Networks, also known as CapsNets, represent neurons not as simple scalars (as in regular CNNs) but as vectors. Later, in 2017, an efficient and practical training algorithm for such networks was proposed [sabour2017dynamic]. The algorithm, named Dynamic Routing, dynamically chooses activation paths between capsules from one layer to another, computing the vectors of the next layer as a weighted mean of dynamically selected vectors from the previous layer.

CapsNet [sabour2017dynamic] produces a set of Primary Capsules (PCs) by applying two convolutional steps to the original image and splitting the result into vectors. Each of these PCs (a vector u_i) is multiplied by a weight matrix W_ij and, finally, a final set of capsules, the digit capsules, is created using the dynamic routing algorithm. Each digit-capsule vector represents one of the classes in the classification problem, and the vector's length encodes the probability of that class. The digit capsules can also be used to reconstruct the image, like an auto-encoder.
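For concreteness, the per-capsule computation described above can be sketched in NumPy (a minimal illustration with made-up shapes, not the authors' implementation; `squash` is the standard CapsNet nonlinearity and a single uniform routing step stands in for the full iterative routing):

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Squash nonlinearity: keeps vector orientation, maps length into [0, 1).
    sq_norm = np.sum(v * v, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

# Illustrative shapes: 1152 primary capsules of dim 8, 10 digit capsules of dim 16.
num_pc, pc_dim, num_digit, digit_dim = 1152, 8, 10, 16
rng = np.random.default_rng(0)
u = rng.standard_normal((num_pc, pc_dim))                      # primary capsules u_i
W = rng.standard_normal((num_pc, num_digit, pc_dim, digit_dim))  # weight matrices W_ij

# Prediction vectors u_hat_{j|i} = u_i W_ij for every (PC, digit-capsule) pair.
u_hat = np.einsum('ip,ijpq->ijq', u, W)

# One routing step with uniform coupling coefficients (softmax of zero logits).
c = np.full((num_pc, num_digit), 1.0 / num_digit)
s = np.einsum('ij,ijq->jq', c, u_hat)   # weighted sum per digit capsule
v = squash(s)                           # digit-capsule output vectors
probs = np.linalg.norm(v, axis=-1)      # vector length encodes class probability
```

The full algorithm repeats the weighted-sum/squash step, updating the coupling coefficients from the agreement between `u_hat` and `v`.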

This network with the Dynamic Routing algorithm was shown to have advantages such as a smaller required training set and location invariance. It also has drawbacks, such as slower execution and lower accuracy than CNNs. Since the initial publication, however, multiple improvements have been proposed and the concept has been evolving. Shahroudnejad, Mohammadi, and Plataniotis [shahroudnejad2018improved] presented an analysis of the explainability of CapsNet, showing that it has properties that help understand and explain its behavior. Jaiswal et al. [jaiswal2018capsulegan] used CapsNet in a Generative Adversarial Network (GAN) and showed that it can achieve lower error rates than a simple CNN. Ren and Lu [ren2018compositional] showed that CapsNet can be used for text classification and showed how to adapt the compositional coding mechanism to the CapsNet architecture. Jimenez-Sanchez, Albarqouni, and Mateus [jimenez2018capsule] tested CapsNet on Medical Imaging Data Challenges, showing that it can achieve good performance even with fewer trainable parameters than the counterpart CNNs tested. Mobiny and Nguyen [mobiny2018fast] tested the performance of CapsNet for lung cancer screening and showed that it could outperform CNNs, mainly when the training set was small. A similar result was achieved by Kim et al. in traffic speed prediction [kim2018capsule], with CapsNet outperforming traditional CNN approaches. Mukhometzianov and Carrillo [mukhometzianov2018capsnet] used CapsNet with multiple image datasets and found that, although achieving good results, CapsNet still requires longer training times than other CNNs. Canqun et al. [xiang2018ms] proposed the Multi-Scale CapsNet (MS-CapsNet). They introduced a fixed division of the CapsNet network limited to three "lanes" (though they neither named nor explored the division concept), each with a different number of convolutions.
Also recently developed, the Path Capsule Network (Path-CapsNet) by Amer and Maul [path2019] explores the parallelism of CapsNets by splitting the network such that each path, or lane, is responsible for computing a digit capsule or a primary capsule entirely, unlike the computation of different dimensions/features per lane in MLCN.

III Multi-lane CapsNets (MLCN)

In 2019, we introduced a novel organization for the CapsNet named Multi-Lane CapsNet (MLCN), with improved explainability, performance, and parallelization, without decreasing accuracy or generalization power [MLCN-SPL]. In the original CapsNet, beyond encoding the probability of a class, each digit-capsule vector also contains the information needed to reconstruct the original image, with distinct dimensions of the vector representing different features of the image. With this in mind, we proposed to split the original CapsNet architecture (source code at https://github.com/vandersonmr/lanes-capsnet) as shown in Figure 1, dividing the PCs into independent sets called lanes. Each of these sets of PCs, a lane, is responsible for one of the dimensions in the final digit capsules.

Figure 1: MLCN architecture.

The number of PCs per lane may vary, as well as the way they are computed. In the original CapsNet, two 2D convolutions are applied to the input image and the result is reshaped to produce the PCs. More convolutions may be applied, which we call the depth of a lane, or more filters may be used per convolution, generating more capsules, which we call the width of a lane. Further, distinct dimensions of a final digit capsule can be generated by lanes with different configurations (and thus distinct computational requirements).

There are two key advantages of this organization over the original CapsNet architecture. First, it allows parallel execution, as each set of PCs is constructed independently, improving performance and allowing training and deployment in distributed environments. Second, it improves the explainability of the network by associating different features of the image with each lane.
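A minimal sketch of this organization, with hypothetical names and shapes of our own (in the real network each lane runs convolutions and routing; a random linear map stands in for that here): each lane independently maps the input to its slice of every digit capsule, and the slices are concatenated.

```python
import numpy as np

def lane_forward(x, rng, num_classes=10, dims_per_lane=4):
    # Stand-in for one lane's convolutions + routing: maps an input image to
    # this lane's slice of every digit capsule (dims_per_lane dimensions each).
    flat = x.reshape(-1)
    W = rng.standard_normal((flat.size, num_classes * dims_per_lane)) * 0.01
    return (flat @ W).reshape(num_classes, dims_per_lane)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28))       # one MNIST-sized input
num_lanes, dims_per_lane = 4, 4

# Each lane is data-independent: these calls could run on different GPUs.
slices = [lane_forward(x, rng, dims_per_lane=dims_per_lane) for _ in range(num_lanes)]

# Concatenating lane outputs along the capsule dimension yields the final
# digit capsules: 10 capsules of dimension num_lanes * dims_per_lane = 16.
digit_caps = np.concatenate(slices, axis=1)
```

Because no lane reads another lane's intermediate values, the only synchronization point is this final concatenation.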

III-A CapsNet Parallelization

A DNN can be parallelized in different ways, and finding the best way for a given network is normally a complex task. The three most common approaches are data parallelism, model parallelism, and pipelining.

The first, data parallelism, splits the data to be computed: it divides the input batch into smaller batches, one per compute unit, and synchronizes at the end of the batch. Although very simple and straightforward, it only scales by increasing the batch size, since dividing a small batch too finely results in little computation per unit and frequent synchronization. And varying the batch size impacts accuracy, which can mean a trade-off between accuracy and speedup.
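The data-parallel scheme just described can be sketched as follows (`grad_fn` is a toy stand-in for a real per-shard backward pass; the averaging models the synchronization at the end of the batch):

```python
import numpy as np

def data_parallel_step(batch, num_gpus, grad_fn):
    # Split the batch into per-GPU shards, compute "gradients" independently,
    # then synchronize by averaging (the all-reduce at the end of the batch).
    shards = np.array_split(batch, num_gpus)
    grads = [grad_fn(s) for s in shards]   # these would run on separate GPUs
    return sum(grads) / num_gpus

# Toy gradient: the mean of the shard.
batch = np.arange(8.0)
g = data_parallel_step(batch, 4, grad_fn=lambda s: s.mean())  # -> 3.5
```

Note how shrinking the batch shrinks each shard, which is exactly the scaling limit described above: below some shard size, synchronization dominates the useful work.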

Another possibility, model parallelism, splits the network operations themselves. However, it is not always trivial to find a good place to split: if two data-dependent operations are placed on two different compute units, a lot of communication is required. Moreover, implementing this kind of network division and communication in current frameworks is not trivial.

Last but not least, pipelining splits the network into stages which can compute different data at the same time in a pipelined fashion. It is commonly used in high-performance scenarios.

These are not the only techniques for distributing the training and inference of DNNs, and they are not mutually exclusive: they can be used together [jia2018beyond]. For MLCN, we tested its ability to support easy model parallelism and compared it against the more common data-parallel approach. For very large MLCN networks pipelining could also be used, but we focus on showing how facilitating model parallelism brings advantages over using data parallelism alone. The same advantage could be extended by adding pipelining per lane, but this remains future work.

IV Heterogeneous Distribution Problem

One of the main advantages of having data-independent lanes is that these lanes can be deployed separately on multiple accelerators. If we have multiple identical lanes and multiple identical accelerators, deployment is as simple as dividing the lanes equally over the HW resources, with the communication cost the only concern. If, instead, the lanes have different shapes, characteristics, and computational intensity, and/or the accelerators have different characteristics or computational power, deployment becomes more involved. First, it now involves load balancing the computational intensity of the lanes against the computational power of the accelerators; second, there is now also the opportunity to apply different optimizations to different lane-HW pairs. This scenario can be seen in Figure 2, which shows how multiple lanes can be deployed to different accelerators with different compilation stacks.

Deciding where to execute each lane, and which optimizations to apply to each lane/hardware pair, is not trivial. In this work, we address the first problem with a deployment heuristic. We will address the second problem in future work.

Figure 2: Multiple neural network lanes can be trained in parallel using multiple HW devices, even in heterogeneous scenarios.

IV-A Heuristic Execution Cost for MLCN Lanes

Statically finding the optimal deployment of a lane onto a given set of HW resources is a complex task. For example, aspects such as the compiler version or which other lanes (and their characteristics) are executing concurrently on the same HW can have a significant impact on final performance, and these are only two of many such aspects. However, we observed in our experiments that, at least for MLCN, we do not need the exact final performance to make a good deployment decision: simple predictors provide fair results.

We measured the average execution time over 10 runs of MLCN lanes with different widths and depths on three different NVIDIA GPUs (K80, P100, and V100). For the same number of parameters, independently of the GPU used, performance displays a well-behaved pattern: it varies linearly with depth, quadratically with width, and is scaled by a constant factor when changing GPU.

Thus, for MLCN lanes on NVIDIA GPUs and compilers, the execution time on a given HW substrate can be approximated by Equation 1, which achieves a 0.901 Pearson correlation with our experimental data:

time(lane, GPU) ≈ depth × width² / S_GPU    (1)

The S_GPU in Equation 1 is the speed factor of the GPU being used. It only matters when deploying to a heterogeneous set of GPUs, and the constant for each GPU can be inferred by simply measuring the execution time of a tiny lane on each GPU and normalizing it. This can be done before the execution and has an insignificant cost in the final execution time. In our experiments, we collected the speed factors by executing a 512x512 fully connected network on a small set of data. Normalized to the K80, we used the following speed factors for the M40, P100, and V100: 3.1, 4.2, and 6.
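Assuming this calibration procedure, the speed factors and the lane cost model can be sketched as follows (names are ours; the runtimes passed in stand for measured tiny-lane times and are chosen only to reproduce the factors reported above):

```python
def speed_factors(tiny_lane_runtimes):
    # Normalize measured tiny-lane runtimes so the slowest GPU gets factor 1
    # and faster GPUs get proportionally larger factors (K80 = 1 in the paper).
    slowest = max(tiny_lane_runtimes.values())
    return {gpu: slowest / t for gpu, t in tiny_lane_runtimes.items()}

def lane_cost(depth, width, speed_factor=1.0):
    # Equation 1: time grows linearly with depth and quadratically with
    # width, divided by the GPU's speed factor (faster GPU, lower time).
    return depth * width ** 2 / speed_factor

# Illustrative runtimes (seconds) picked to reproduce the paper's factors.
factors = speed_factors({'K80': 6.0, 'M40': 6.0 / 3.1, 'P100': 6.0 / 4.2, 'V100': 1.0})
```

Once the factors are computed (a one-off, near-free measurement), predicting a lane's runtime on any GPU is a constant-time arithmetic evaluation.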

IV-B Load Balancing Algorithm

We showed that we can make good execution cost predictions for NVIDIA GPUs and MLCN lanes. However, there remains the problem of how to deploy a set of lanes with different sizes and widths to a set of GPUs with different speeds. We can model this as a numerical set partition problem with bins, each bin corresponding to a target GPU. The cost of each lane being deployed (inserted into a bin) is the lane cost (Equation 1) scaled by the host GPU's speed factor, obtained beforehand by executing a tiny lane.

The numerical set partition problem is NP-hard, but very good results can be achieved with heuristic/approximation algorithms, and it can even be solved in pseudo-polynomial time using dynamic programming, making it one of the "easiest hard problems" [hayes2002computing, korf2009multi]. One such heuristic, which achieves good results and is very simple to implement, is the greedy partition, which always inserts the remaining lane with the largest cost into the emptiest bin. Algorithm 1 shows this greedy algorithm, including the pre-execution used to calculate the speed factors.

if using heterogeneous HW:
  for each GPU i:
    execute a tiny lane on GPU i
    Runtime[i] = runtime of the tiny lane
  for each GPU i:
    GPUSpeed[i] = largest(Runtime) / Runtime[i]   # speed factor; slowest GPU = 1

def GreedyPartition(lanes, NumGPUs, GPUSpeed):
  GPUTasks = [[] for i in range(NumGPUs)]
  Load = [0 for i in range(NumGPUs)]
  for lane in lanes, sorted by cost (Equation 1), largest first:
    i = GPU minimizing Load[i] + cost(lane) / GPUSpeed[i]
    append lane to GPUTasks[i]
    Load[i] = Load[i] + cost(lane) / GPUSpeed[i]
  return GPUTasks
Algorithm 1: Greedy Partition Algorithm
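A runnable sketch of the greedy partition (our own implementation of the idea above, using the depth × width² cost model of Equation 1; not the authors' released code):

```python
def greedy_partition(lanes, gpu_speeds):
    """Assign lanes (list of (depth, width) tuples) to GPUs greedily.

    gpu_speeds: one speed factor per GPU (larger = faster).
    Returns one list of lanes per GPU, balancing predicted execution time.
    """
    cost = lambda lane: lane[0] * lane[1] ** 2   # depth * width^2 (Equation 1)
    bins = [[] for _ in gpu_speeds]
    load = [0.0] * len(gpu_speeds)               # predicted time per GPU

    # Largest-cost-first: place each lane on the GPU that would finish earliest.
    for lane in sorted(lanes, key=cost, reverse=True):
        i = min(range(len(gpu_speeds)),
                key=lambda g: load[g] + cost(lane) / gpu_speeds[g])
        bins[i].append(lane)
        load[i] += cost(lane) / gpu_speeds[i]
    return bins

# Four heterogeneous lanes on two identical GPUs: costs 32, 16, 12, 4
# split into perfectly balanced bins of 32 and 32.
bins = greedy_partition([(1, 4), (2, 4), (1, 2), (3, 2)], [1.0, 1.0])
```

With heterogeneous GPUs, passing the measured speed factors (e.g. `[1.0, 6.0]` for a K80 and a V100) makes the faster device absorb proportionally more lane cost.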

V Experimental Setup

In our experiments, we used machines from Google Cloud. All virtual machines instantiated had 24 vCPUs with 50GB of RAM and a default network interface. We used different GPU setups, including NVIDIA Tesla M40, K80, P100 and V100 all with CUDA 10.0, Intel MKL-DNN and Tensorflow 1.13.1.

The results and experiments did not show sensitivity to the input data set (we tested MNIST, CIFAR10, and others), so we chose the MNIST data set. Execution time was measured over 10 MNIST epochs, excluding the first and averaging the rest. Variation between epochs was always very small, so for simplicity we report averages.

Thus, in this work we tested three configurations for CapsNet parallelization, as follows:

  • Original with Data Parallelism (baseline or base): we used the original CapsNet on the MNIST dataset, parallelized using Keras data-parallelism support.

  • MLCN with Data Parallelism (mlcn-data): we used the same approach as in the baseline (Keras data parallelism), but with the MLCN organization.

  • MLCN with Model Parallelism (mlcn-model): we parallelize the execution by running each lane on a different GPU. When using multiple machines, we used the Horovod MPI framework to handle communication.

VI Experimental Results

VI-A MLCN Scalability

To understand how each approach to the parallelization of CapsNet scales, we studied their performance with 1, 2, 4 and 8 NVIDIA Tesla K80 GPUs.

The graph in Figure 3 shows the performance comparison between base (baseline), mlcn-data, and mlcn-model. MLCN is faster than the baseline even on a single GPU, as reported earlier. However, it is interesting to notice that this advantage does not increase when scaling to more GPUs with data parallelism: the speedup difference between mlcn-data and baseline remains constant. This suggests that the reorganization proposed by MLCN does not improve scaling via data parallelism. The same is not true for model parallelism: mlcn-model has a visible advantage, scaling with higher efficiency and achieving a speedup of nearly 7.18 with 8 GPUs over the single-GPU baseline. Thus, MLCN not only is faster than the original CapsNet (baseline) but, because it allows model parallelism, it also scales more efficiently.

Figure 3: Speedup of the three parallelization approaches: baseline with data parallelism (base), MLCN with data parallelism (mlcn-data), and MLCN with model parallelism (mlcn-model). All speedups are relative to the baseline with one GPU.

VI-B Impact of Batch Size

The size of the minibatch, or batch size, has a significant impact on DNN performance, as more computation is available per synchronization, enabling more efficient use of the HW. It particularly affects data-parallel performance, as more data/computation is available to divide among the GPUs. To study the advantage of MLCN over data parallelism we tested both approaches with batch sizes of 100, 150, 300, and 600. The graphs in Figures 4(a) and 4(b) show the speedup versus a single GPU with a batch size of 100. In both cases we observe similar efficiency gains as the batch size grows. So, across batch sizes, the relative advantage of MLCN with model parallelism stays the same, as increasing the batch size increases the efficiency of the data-parallel and model-parallel approaches equally.

(a) Baseline using data parallelism for different mini-batch sizes
(b) MLCN using model parallelism for different mini-batch sizes
Figure 4: MLCN and baseline scalability for 1, 2, 4 and 8 NVIDIA K80 GPUs using Google Cloud VMs with 24 vCPUs and 90GB of RAM.

We also studied the impact of batch size on both baseline and MLCN accuracy, shown in Figure 5. Increasing the batch size has a significant impact on accuracy in both cases. The magnitude of this impact is dataset-dependent, as shown by the differences between the MNIST and Cifar10 results. Since model parallelism has better performance and scalability at smaller batch sizes (Figures 4(a) and 4(b)), it has the advantage of scaling without the need to trade accuracy for efficiency.

Figure 5: Validation accuracy impact of increasing the training batch size for the baseline and MLCN on the Cifar10 and MNIST datasets.

VI-C Impact of Lane Characteristics

The previous results explored the suitability of MLCN for model parallelization. We also explore how the characteristics of the MLCN lanes affect performance and scalability by varying the three main MLCN hyperparameters: lane width, depth, and quantity. The results are shown, respectively, in Figures 6(a), 6(b), and 6(c).

The width and depth of lanes have a direct impact on the number of parameters per lane and, consequently, on the amount of computation per lane. With more computation per lane, the efficient use of multiple GPUs becomes advantageous. This is shown in Figures 6(a) and 6(b): larger lanes increase efficiency. However, increasing the width yields a much more significant efficiency gain for a similar increase in the number of parameters. This indicates that, beyond the number of parameters, the type of computation affects performance: for MLCN, wider lanes result in better performance than deeper lanes with the same number of parameters.

Another interesting observation is that increasing the number of lanes did not significantly increase performance, as shown in Figure 6(c). Even though more lanes also mean more computation available per batch, there is an overhead to keeping these computations separable. So, having several lanes on one GPU is less efficient than having a single, much larger lane.

(a) MLCN using model parallelism with a mini-batch size of 150, varying the width of the lanes.
(b) MLCN using model parallelism with a mini-batch size of 150, varying the depth of the lanes.
(c) MLCN using model parallelism with a batch size of 150, varying the number of lanes.
Figure 6: MLCN scalability with different lane configurations.
VI-D Heterogeneous Lanes and GPUs

One interesting observation about MLCN is that having lanes with different characteristics, such as different sizes and depths, increases the generality of the network. A similar result was reported by Canqun et al. [xiang2018ms] with the MS-CapsNet organization. However, as discussed in Section IV, deploying lanes across multiple GPUs is challenging when the lanes have different computational footprints. To study our proposed heuristic for deploying lanes with different widths and depths, we tested 4 MLCN networks with 6, 9, 12 and 24 lanes, each lane with depth and width values ranging from 1 to 5. As shown in Figure 7, our heuristic obtains a smaller execution time than naïvely distributing the lanes randomly among the GPUs. The advantage increases with the number of lanes, showing that the more lanes there are, the harder it is for a random assignment to find a good distribution. Note that the time reported for the greedy heuristic includes the (almost insignificant) time to run the heuristic itself.

Figure 7: Average execution time (over 10 runs) of heterogeneous lanes running on four NVIDIA K80 GPUs, with a random and a greedy partition of the lane execution distribution. All lanes vary in width and depth.

VI-E Heterogeneous Lanes with Heterogeneous GPUs

Beyond heterogeneous lanes, we also tested a scenario with heterogeneous accelerators. Rather than four NVIDIA Tesla K80s, we deployed four systems, each with a different GPU: one M40, one K80, one P100, and one V100. The results are shown in Figure 8. Total execution time increased significantly because of network communication between the systems. Moreover, the gap between random deployment and our greedy heuristic becomes larger, showing that in more complex scenarios, with many lanes or heterogeneous HW, it is key to deploy the computation carefully.

Figure 8: Average execution time (over 10 runs) of heterogeneous lanes running on one K80, one P100, one V100, and one M40 NVIDIA GPU across multiple machines communicating via MPI, with a random and a greedy partition of the lane execution distribution. All lanes vary in width and depth.

VII Conclusion

The Multi-lane CapsNet (MLCN) is a novel organization of the CapsNet that achieves better accuracy with more efficient HW utilization. Further, MLCN enables model parallelism by running its lanes in parallel. In this work, we analyzed and measured the advantages of this new parallelization scheme for the CapsNet compared to the usual data parallelism.

We found that MLCN is faster than the original CapsNet and scales better with model parallelism, being almost 2x more efficient, even with small batch sizes. We also explored the impact of different lane configurations on performance and scalability, showing that wider lanes usually achieve higher HW efficiency.

Finally, we found that when parallelizing MLCN with lanes of different characteristics (or when deploying to machines with different accelerators), load balancing is a key factor in reaching good performance. We proposed a greedy algorithm to deploy lanes in these scenarios and found that it can be up to 50% more efficient than a naïve random deployment.