
Variant Parallelism: Lightweight Deep Convolutional Models for Distributed Inference on IoT Devices

Two major techniques are commonly used to meet real-time inference limitations when distributing models across resource-constrained IoT devices: (1) model parallelism (MP) and (2) class parallelism (CP). In MP, transmitting bulky intermediate data (orders of magnitude larger than input) between devices imposes huge communication overhead. Although CP solves this problem, it has limitations on the number of sub-models. In addition, both solutions are fault intolerant, an issue when deployed on edge devices. We propose variant parallelism (VP), an ensemble-based deep learning distribution method where different variants of a main model are generated and can be deployed on separate machines. We design a family of lighter models around the original model, and train them simultaneously to improve accuracy over single models. Our experimental results on six common mid-sized object recognition datasets demonstrate that our models can have 5.8-7.1x fewer parameters, 4.3-31x fewer multiply-accumulations (MACs), and 2.5-13.2x less response time on atomic inputs compared to MobileNetV2 while achieving comparable or higher accuracy. Our technique easily generates several variants of the base architecture. Each variant returns only 2k outputs 1 <= k <= (#classes/2), representing Top-k classes, instead of tons of floating point values required in MP. Since each variant provides a full-class prediction, our approach maintains higher availability compared with MP and CP in presence of failure.



I Introduction

Convolutional neural networks (CNNs) are used in several visual analysis tasks such as image recognition [1, 2, 3], object detection [4, 5], and segmentation [6]. In CNNs, computation grows in proportion to the input size, which eventually leads to higher latency. Although reducing the number of parameters and the precision of variables helps improve latency, it often comes at the cost of lower accuracy [2, 7]. Moreover, the complexity of deep learning models and their applications will only grow over the foreseeable future, and hence they will require even more resources.

Model distribution across multiple machines is one viable solution that has consequently attracted growing attention especially when considering the fact that there are typically quite a number of local compute nodes around us that can help get the jobs done.

(a) Model Parallelism
(b) Class Parallelism
(c) Variant Parallelism
Fig. 1: Comparison of different model distribution methods on atomic input data.

For atomic data that cannot be split into further pieces [8], two major solutions are often used to decompose a given model and distribute it among different compute resources, namely model parallelism (MP) and class parallelism (CP). MP decomposes a model into multiple sequential or parallel slices, each sent to a separate compute node for processing, after which large amounts of intermediate data are returned to the master, which combines them to complete the model. Slicing in MP can be inter- or intra-layer (Fig. 1a). CP decomposes the model into multiple disjoint models, each aiming to predict only one or a few non-overlapping classes (Fig. 1b); thus, contrary to MP, CP runs a monolithic model on each computing node. CP improves the efficiency of each model mostly by applying class-aware pruning policies [8].

Both solutions have their own limitations. In MP, intermediate tensors have to be communicated between different nodes. In current CNNs, and depending on the available bandwidth, the time needed to transmit intermediate tensors can be orders of magnitude higher than the inference time itself. Hence, MP would drastically increase end-to-end latency due to link saturation. CP does not suffer from this issue since it only returns a few numbers from each computing node; these numbers represent the prediction of the classes allocated to the model executed on that computing node. Verbatim CP has a firm restriction on scalability, which is limited to the number of classes in the application [8]; this can debatably be extended by choosing overlapping subsets of classes instead. More importantly, both paradigms intrinsically suffer from multiple instances of the single point of failure (SPOF) issue, because every compute node has to complete its job and send the result back to the requesting node before the top-level classification task can be finished. Otherwise, the master node cannot generate the final network prediction. Therefore, failure in a single node compromises the availability of the whole system.

The scenario of interest in this paper is one where we use the free time of IoT devices to do a distributed inference job. Failure is a common problem in distributed systems, especially in this scenario of cycle-stealing from nearby IoT devices, where worker nodes only contribute in idle times and may preempt the inference task at any time to process their main jobs [9]. In addition, some nodes may seriously slow down for various unexpected reasons [10], and communication issues such as network congestion lead to missed deadlines in real-time applications. These concerns motivate the need for a fault-tolerant mechanism for distributed inference. In both MP and CP, higher availability may be achieved by introducing redundant nodes, but this comes at the expense of a large impact on communication and computation cost.

In this paper, we propose a different distribution technique that we call Variant Parallelism (VP). Both MP and CP can be viewed as applying a “top-down” model slicing approach: they break a baseline model into pieces, although differently in MP and CP, and distribute the pieces across multiple nodes. In contrast, VP employs a “bottom-up” approach: we generate multiple lightweight variants (described below) from a chosen basic architecture and then combine their results in an aggregator component (Fig. 1c). In fact, CP and VP are both special types of ensemble learning models, but with different intents and approaches.

In VP, each variant is an independent model that can be deployed on a different machine or accelerator. Unlike CP, our variants have full-class prediction heads, i.e., each worker node predicts probabilities for all classes, and this is the key to achieving a fault-tolerant architecture. Consequently, in a system with several worker nodes, VP can still return a full response even if all but one of them fail, while having distinct variants improves accuracy over naïve replication when more than one worker node is working.

VP’s variants can differ in input resolution and input offset. To achieve faster inference time, we apply three network optimizations: narrowing network width, reducing complexity of the compute intensive layers, and replacing with faster operations when possible.

Since each machine generates a full-class prediction, the amount of transmitted data would be higher than in CP, risking a higher response time. To avoid this, we add a compression/decompression module. The compression layer attaches to each worker’s classification head. In the compression part, we select the classes with the highest confidence scores and send them to the master together with their corresponding indices. In VP, each machine executes a lightweight classification model and sends its compressed result to the master. The master machine then applies a decompression step followed by a score scaling that weights each prediction based on its corresponding model capacity. It eventually combines the results via an aggregation module. The combination modules contain no trainable parameters, since trainable parameters would reduce the system’s flexibility and could compromise availability. For evaluation, we consider a smart home scenario and demonstrate VP’s robustness on six object recognition datasets.

Our major contributions can be summarized as follows:

  • We introduce a bottom-up model distribution called Variant Parallelism which contrary to MP and CP is fault-tolerant, flexible, and therefore, can be freely distributed across several sporadically available compute nodes such as the case of IoT devices or across multiple accelerators within a single machine.

  • Our technique generates multiple variants each having different number of parameters, multiply-accumulate operations (MACs), latency, and model size that can be deployed on different machines based on their computing capacities.

  • We propose an aggregation method that is resilient to failures, while matching or exceeding the accuracy of the baseline model.

  • We propose a fast and simple technique to reduce bandwidth usage. It can compress the output size of each node by up to while losing less than accuracy.

  • Our proposed method can have 5.8-7.1x fewer parameters, 4.3-31x fewer MACs, and 2.5-13.2x lower response time on an atomic input compared to the baseline (MobileNetV2) while achieving comparable or higher accuracy.

II Related Work

Model Parallelism (MP): In model parallelism (Fig. 1a), a deep learning model is sliced into multiple sub-models [11, 12, 13], each of which can be deployed and concurrently executed on parallel machines. These sub-models have to transfer a huge amount of data from intermediate tensors. The time to transfer these tensors can be several orders of magnitude higher than the time to compute the operations of the neural network itself.

Class Parallelism (CP): As its name suggests, class parallelism (Fig. 1b) decouples a basic deep learning model into multiple independent sub-models, each of which predicts one or a few non-overlapping classes. Structured pruning is then applied to each sub-model to reduce its latency. The final result is obtained by combining the predictions of all of these sub-models [8, 14]. CP does not have the issue mentioned for MP, and can improve latency compared to a single model. However, it has a hard limitation on the number of sub-models. More importantly, since both MP and CP require the results of all machines, they suffer from multiple SPOFs.

Data Parallelism (DP): In data parallelism, all machines run the same model, but the data that they observe is different [15, 16]. For example, each may process a different part of a single image. However, DP cannot be leveraged for atomic live data, which cannot be split further.

Variant parallelism (VP) aims to work on atomic live data and by design can tolerate the failure of all but one of the contributing nodes. VP gives us the flexibility to generate enough variants based on different objectives, including latency constraints, model size, and compute resources. VP can be easily combined with other parallelism schemes.

III Variant Parallelism

Fig. 2: Overview of our proposed workflow including offline parts (variant generation and model training) and online actions (dispatcher and aggregator).

In VP, we consider a basic model, and generate multiple lightweight and fast models based on it. Each of these variants can have different storage sizes, parameters, inference times, and accuracies. We add inference-time augmentation. It leads to feeding slightly different inputs for each model. Each variant can be treated as an independent full-class predictor. Every machine can concurrently execute one or more of these variants in-parallel with other machines. Depending on the use-case, these machines may be different GPUs on a single node or different connected edge devices, which is the case of interest in this paper. If there is an idle compute node with limited storage space or computation constraints, VP can generate a lighter variant which provides the opportunity to utilize the available compute capacity. VP aims to reduce response time to enable faster decision making via a distributed deep learning architecture while maintaining availability in presence of failure. Here, we first concisely overview VP’s workflow. We then explain our design decisions, and how each module works.

III-A Overview

Our workflow starts with one basic architecture, but is straightforward to extend to more than one model. As shown in Fig. 2, we generate different optimized variants from the base model(s). We then add classification head, compression, decompression, and scaling modules to each variant. Variants are retrained on the desired dataset. These variants are independent and can be separately deployed on different machines for distributed in-parallel execution. Every node has its own model. Once deployed, the master node multicasts live data to all contributing workers, and waits for a predefined period (proportional to the desired deadline). The workers feed the data into their models and send the compressed result back to the master node. The master then decompresses, scales and combines them through the aggregation module to prepare the final prediction (Fig. 3).
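The online part of this workflow (multicast, wait for a predefined period, keep whatever arrived) can be sketched with Python's standard thread pool. This is only an illustration under stated assumptions: the worker callables, their names, the reply format, and the deadline value are all hypothetical stand-ins, not the authors' dispatcher.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def collect_predictions(workers, sample, deadline_s):
    """Send `sample` to every worker and keep only replies that arrive in time.

    `workers` maps a (hypothetical) node name to a callable that runs the
    variant deployed on that node and returns its compressed prediction.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(workers)))
    futures = {pool.submit(run, sample): name for name, run in workers.items()}
    done, _late = wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False)  # do not block on stragglers; they are ignored
    replies = {}
    for future in done:
        if future.exception() is None:  # a crashed worker counts as a fault
            replies[futures[future]] = future.result()
    return replies


# Hypothetical stand-ins for three worker nodes.
def fast_worker(sample):
    return ([0.60, 0.25], [1, 3])  # (top scores, their class indices)


def slow_worker(sample):
    time.sleep(0.5)  # misses the deadline used below
    return ([0.70, 0.20], [1, 2])


def crashed_worker(sample):
    raise RuntimeError("preempted by the node's main job")


replies = collect_predictions(
    {"fast": fast_worker, "slow": slow_worker, "crashed": crashed_worker},
    sample=None,
    deadline_s=0.1,
)
print(sorted(replies))  # ['fast']
```

Because every reply is a full-class prediction, the master can proceed with any non-empty subset of replies; late and failed nodes are simply left out of the aggregation.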

III-B Reducing Model Complexity

Model Input Size #Params #MACs
MobileNetV2 2.27M 300M
370K 54M
360K 40M
340K 27M
330K 17M
320K 10M
TABLE I: Comparison of different models’ characteristics

For variant generation, we reduce complexity of the basic architecture through the following three steps: (1) reducing width factor, (2) reducing complexity of the last few layers, and (3) replacement with faster operations when possible. These steps help us to improve per machine model size, computation requirements, and latency. Firstly, we reduce width factor of depth-wise separable convolution layers (was first introduced in [2]) by . It drastically improves network runtime. We recover the lost accuracy by combining multiple parallel variants. Secondly, we reduce complexity of the last few layers of each variant based on its input size. The intuition is that in the current efficient CNNs, the last few layers take considerable portion of the computation time. However, it should be proportional to the input resolution and the achievable accuracy. As shown in Table II, each model uses convolutional layers with the number of filters () proportional to its input size (). For example, input size of () is and its last convolution layer has output filters while () with input size of has only output filters. We empirically select the number of these output filters so that they are divisible by for better performance. Lastly, we employ faster operations when possible, e.g., replacing , , and layers by , , and layers, respectively. We generate five different models based on these optimizations. Their characteristics are illustrated in Table I.

Input Operator
conv2d - 32 1
bottleneck 1 16 1
bottleneck 6 24 2
bottleneck 6 32 3
bottleneck 6 64 4
bottleneck 6 96 3
bottleneck 6 160 3
bottleneck 6 320 1
conv2d - 1
globalmaxpool - - 1
conv2d - -
compression - -

Similar to the original MobileNetV2, we apply a convolution layer on input followed by depth-wise convolutional residual bottlenecks. Each row is sequentially applied times, and is the expansion factor of intermediate layers in bottleneck blocks. is the number of classes.

TABLE II: Variants model architecture.
Fig. 3: The flow of data and the online operations on them, for the inference task.

III-C Compression and Decompression


In CP, for tasks with only a few classes (e.g., CIFAR-10), each machine returns only one floating-point value, and for tasks with more classes (e.g., CIFAR-100), 10 to 20 floating-point values are returned by each. In contrast, our variants are independent models that return full-class predictions. This feature increases system reliability in a faulty environment, but raises a challenge in communication time. To address this issue, we leverage the Top-k idea: from a prediction vector, we select the k elements having the highest confidence scores and pack them along with their corresponding indices. Fig. 4 depicts a toy example with 6 classes in which the two highest-scoring elements are kept. The compression step shrinks the vector to two vectors of size k: one stores the floating-point predictions with the highest scores, and the second keeps their integer indices. After receiving the compressed vectors, the master reconstructs a tensor with one row per contributing machine and one column per class. It then scatters the values of the compressed vectors based on their indices. This procedure is fast yet robust. The intuition is that, in each vector, all but a few elements with high confidence scores are close to zero. More importantly, our aggregation method (§ III-D) amplifies both actual information and noise at the same time. Hence, by removing values with lower confidence scores, we mitigate noisy information to some degree, and consequently achieve higher accuracy. The technique also gives us the flexibility to set k as a hyperparameter, trading accuracy for transmission time.
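The compression and decompression steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the six scores below are made-up toy values rather than the numbers shown in Fig. 4.

```python
def compress(scores, k):
    """Worker side: keep the k highest scores and their class indices."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in top], top


def decompress(replies, num_classes):
    """Master side: scatter each (values, indices) pair into a dense row."""
    dense = []
    for values, indices in replies:
        row = [0.0] * num_classes  # classes that were not sent stay at zero
        for value, index in zip(values, indices):
            row[index] = value
        dense.append(row)
    return dense


# Toy prediction vector over 6 classes (illustrative values only).
scores = [0.05, 0.60, 0.02, 0.25, 0.03, 0.05]
values, indices = compress(scores, k=2)
print(values, indices)  # [0.6, 0.25] [1, 3]

dense = decompress([(values, indices)], num_classes=6)
print(dense)  # [[0.0, 0.6, 0.0, 0.25, 0.0, 0.0]]
```

The dense result has one row per reply that actually arrived, so a missing node only removes a row and never changes the row width.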

III-D Scaling and Aggregation

The master is responsible for balancing the prediction vectors of different variants. After the decompression step, we apply a normalization operation to transform the reconstructed prediction vectors into a probability distribution (Fig. 4). We then need to scale each prediction vector based on its network capacity. For example, if one variant has more parameters, and thereby returns more accurate predictions, than another, then it deserves a higher weight in the aggregated result. We use a proxy of the network architecture itself to scale the values of every prediction vector via eq. 1.


Where is the prediction vector of variant with input shape and depth multiplier , and is a parameter to equalize the impact of input shapes and depth multiplier.

In the aggregation module, we combine the prediction results received from the compute nodes, including the master itself. We add the vectors together and apply a normalization afterwards. Using fully connected (FC) layers instead might improve accuracy, but degrades flexibility. Moreover, even a single FC layer with as many neurons as there are classes, fed by the predictions of all contributing variants, would need one multiply-accumulate per input value and output neuron; its computation cost would therefore exceed that of the plain sum by a factor equal to the number of classes.
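A parameter-free aggregation of this kind can be sketched as follows. Two points are assumptions, because the corresponding details are not recoverable from the text above: the normalization is taken to be a softmax, and the per-variant `weights` are hypothetical placeholders standing in for the capacity-based scaling of eq. 1.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(dense_rows, weights):
    """Normalize each row, weight it, sum class-wise, and normalize again."""
    summed = [0.0] * len(dense_rows[0])
    for row, weight in zip(dense_rows, weights):
        for cls, prob in enumerate(softmax(row)):
            summed[cls] += weight * prob
    return softmax(summed)


# Two decompressed replies over 6 classes (illustrative values only).
rows = [
    [0.0, 0.60, 0.0, 0.25, 0.0, 0.0],  # smaller variant, lower weight
    [0.0, 0.10, 0.0, 0.80, 0.0, 0.0],  # larger variant, higher weight
]
both = aggregate(rows, weights=[0.5, 1.0])
only_first = aggregate(rows[:1], weights=[0.5])  # second node failed

print(both.index(max(both)))              # 3
print(only_first.index(max(only_first)))  # 1
```

Since nothing here is trained, the same function works for any number of surviving rows, which is what lets the prediction degrade gracefully instead of failing when nodes drop out.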

Fig. 4: An example of compression, decompression and scaling steps assuming and .

IV Experiments

IV-A Experimental Setup

Compute Platform: We use an in-house server (Table III), to design, train and evaluate our models. We selected the commonly reported vision datasets that are trainable on our server in a reasonable time.

CPU Intel Xeon E5-2630-v3 (x86_64)
Frequency: 2.4GHz (1.2-3.2GHz)
GPU NVidia GeForce GTX 1080 Ti
Total Memory: 11GB
Memory 32GB (1600MHz)
TABLE III: Server specification

Dataset: We evaluated our proposed method on six different vision datasets (Table IV), namely CIFAR-10 [17], CIFAR-100 [17], Fashion-MNIST (F-MNIST) [18], MNIST [19], SVHN [20], and Food-101 [21].

Dataset #Classes #Trains #Tests
Food-101 101 75,750 25,250
SVHN 10 73,257 26,032
CIFAR-10 10 50,000 10,000
CIFAR-100 100 50,000 10,000
MNIST 10 60,000 10,000
F-MNIST 10 60,000 10,000
TABLE IV: Datasets used for evaluation

Edge Device: For latency evaluation of different models, we use a mid-range tablet (Table V), and leverage Google TFLite native benchmark tool [22] to profile the performance.

Samsung Galaxy Tab A 8.0 with S Pen (2019, SM-P205)
CPU Octa-core big.LITTLE (2x1.8 GHz Cortex-A73 and 6x1.6 GHz Cortex-A53)
GPU ARM Mali-G71 MP2
Memory 3GB LPDDR4X
Network 2G/3G/4G(LTE) & Wi-Fi 802.11
TABLE V: Edge device specification

Baseline Model: We consider MobileNetV2 [3] as our basic architecture (Table I) and compare our results with it.

IV-B Training

We trained all models for 135 epochs in total, which is reportedly fewer than the number of epochs required in CP [8]. We port pre-trained weights from ImageNet-1k [23] whenever available. Our training procedure has two steps. For the first 30 epochs, all layers except the last two convolutions were frozen. We used a batch size of 128 in this step, and 64 (32 for Food-101) for the remaining 105 epochs. We used the Adam optimizer, and the learning rate decays once the accuracy stagnates. Since our tasks are classification tasks, and each model can independently predict all of the classes, we apply a separate categorical cross-entropy loss for each model. Contrary to CP, our variants can observe the same data during training. Therefore, we train multiple models along with the combination components all together. The number of models that can be trained simultaneously is a parameter that can be set based on the computation capacity. Having more models requires more memory, but improves GPU utilization, as most parts of our data pipeline execute only once. We apply random horizontal flip, random rotation, and MixUp augmentations.

Dataset Metric MobileNetV2 Ours() Ours() Ours() Ours() Ours()
CIFAR-10 Top-1 95.16 91.76 93.07 93.81 94.43 94.62
Top-5 98.57 97.24 97.87 98.20 98.26 98.29
MNIST Top-1 99.37 99.38 99.41 99.41 99.42 99.46
Top-2 99.87 99.92 99.95 99.89 99.92 99.95
CIFAR-100 Top-1 80.51 69.62 73.57 74.46 74.67 75.20
Top-5 94.9 91.17 93.36 93.47 93.41 93.67
SVHN Top-1 95.52 93.81 95.16 95.34 95.89 95.91
Top-2 98.37 97.47 98.21 98.25 98.46 98.32
Fashion-MNIST Top-1 94.41 93.38 94.16 94.60 94.67 94.55
Top-2 98.82 98.30 98.51 98.76 98.84 98.82
Food-101 Top-1 79.68 58.78 65.35 70.13 71.43 74.23
Top-5 93.5 83.50 87.70 90.19 90.78 92.43
TABLE VI: Accuracy (%) of single models on different datasets

IV-C Accuracy of Single Models

First, we evaluate accuracy of each variant and compare it to the accuracy of our baseline. To have a fair comparison, we trained the baseline with the same training policy, and made sure it has the same or higher accuracy compared with original reports, if applicable. For each dataset, we report the accuracy of five different models. We also report and in aggregator for datasets with and , respectively. Here, by we mean the ground-truth label is among top class predictions with the highest confidence scores. As depicted in Table VI, as we go from the variant to , the accuracy improves. For the datasets with few number of classes, the accuracy of is close to the baseline. Nevertheless, our has fewer parameters and MACs.

IV-D Impact of Compression Method

In this section, we analyze the effect of our compression algorithm on accuracy and the required bandwidth. We feed the same input data to all variants, and evaluate the impact of different values of k in our compression module.

Bandwidth Saving: Output size is a linear function of k: each node sends its k highest confidence scores together with their k indices, i.e., 2k values instead of one value per class. The output size in bytes therefore depends on k and on the number of bytes required to store each value. Among the different values, k = 1 is a special case: because the compressor selects only one element, we can further reduce the communication cost by sending only an integer representing its index. Choosing k larger than half the number of classes is not sensible, as transmitting the whole vector would then cost less than compressing it.
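The bandwidth arithmetic can be made concrete with a small helper. It is a sketch under assumptions: it follows the abstract's statement that each variant returns 2k outputs for 1 <= k <= #classes/2, it assumes an index occupies as many bytes as a score, and the function name, the 4-byte width, and the choice of k = 3 in the example are illustrative values that do not come from the paper.

```python
def payload_bytes(k, num_classes, bytes_per_value):
    """Bytes one node sends for a given k.

    Assumes a class index is stored in the same number of bytes as a score.
    """
    if k == 1:
        return bytes_per_value  # only the index of the winning class is sent
    if 2 * k > num_classes:
        return num_classes * bytes_per_value  # the full vector is cheaper
    return 2 * k * bytes_per_value  # k scores plus k indices


# Illustrative numbers: 100 classes, 4-byte values.
full = payload_bytes(k=100, num_classes=100, bytes_per_value=4)
small = payload_bytes(k=3, num_classes=100, bytes_per_value=4)
print(full, small)  # 400 24
```

Under these assumptions the compressed payload grows linearly in k until it reaches the size of the full vector at k = #classes/2, which is why larger values of k bring no saving.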

(a) CIFAR-10 aggregator Top-1
(b) CIFAR-100 aggregator Top-1
(c) Food-101 aggregator Top-1
(d) CIFAR-10 aggregator Top-2
(e) CIFAR-100 aggregator Top-5
(f) Food-101 aggregator Top-5
Fig. 5: Impact of different k values on the accuracy of different benchmarks. k: the number of top predictions returned by each node.

Ensemble Accuracy: In this experiment, we evaluate how different values of can impact the final accuracy. Fig. 5 illustrates a significant jump in accuracy from to . For , the accuracy does not significantly increase. Another interesting observation is on accuracy: sending the prediction with the highest confidence score per variant improves accuracy. The reason is that when aggregating multiple prediction vectors, classes with lower scores behave like noise. By omitting these elements from our prediction vectors, we indeed achieve higher accuracy. It appears to be a trade-off between higher and accuracies. Apparently, is a good choice and we use it for the rest of our experiments.

IV-E Scaling Accuracy

We generated five variants, each having unique characteristics. Here, we analyze the impact of combining different variants on the ensemble accuracy (Fig. 6). We show results on CIFAR-10, CIFAR-100, and Food-101. The other three datasets behave similarly to CIFAR-10. Although our variants are independent and can be executed in parallel on separate machines, for simpler comparison, we show the final accuracy based on the combined MACs. We also added results for combinations of the variants both with the compression module and without compression. It is worth noting that even the combined number of parameters and MACs of our models is lower than that of the baseline.

On CIFAR-10, we almost reach the same accuracy as the baseline by combining three of the variants. Adding the remaining two improves the accuracy incrementally and enables us to achieve even better results than the baseline. On CIFAR-100, comparable accuracy can be achieved by aggregating the results of all five variants. The most MACs are required by the largest variant (54M, Table I), which is still fewer than the original MobileNetV2 (300M). The Food-101 dataset is rather more complex, as the size of its original images is two orders of magnitude bigger than in the other benchmarks. To achieve comparable results, we generated two additional variants (Table VII) with more filters in their last convolution layers. We reach similar accuracy by combining the first six variants. Inclusion of the seventh further improves the result, however, at the cost of more computation. In all benchmarks, increasing k in the compression step or transmitting the predictions without any compression gives better results, but the differences are not significant. Interestingly, by increasing the number of variants on Food-101, this gap disappears. This can be due to the increase in the number of variants, particularly one or two more accurate ones.

IV-F Effect of Every Variant on Accuracy

Model Input Size #Params #MACs
390k 71M
410K 112M
TABLE VII: Characteristics of additional variants for Food-101.

To find out whether each variant actually impacts the final accuracy, we evaluate the predictions returned by the combination of our variants while excluding only one of them. As shown in Fig. 7, in all datasets, omitting each variant has a negative impact on the final accuracy. This is especially interesting in less accurate variants. For example, on CIFAR-100, has accuracy itself which is more than lower than the baseline, yet when included in the aggregation, it can improve the final accuracy. Same is true in the other benchmarks.

(a) CIFAR-10 (b) CIFAR-100 (c) Food-101
Fig. 6: Impact of different variants on the final accuracy. (Please zoom in)
(a) CIFAR-10 (b) CIFAR-100 (c) Food-101
Fig. 7: How omitting a variant affects the final accuracy. (Please zoom in)

IV-G Availability in Presence of Failure

One serious drawback of utilizing idle nodes is preemption in the presence of higher-priority jobs (e.g., their main tasks). Missed real-time deadlines due to unexpected communication or computation issues are another reason that a fault-tolerant parallelism approach is needed. Nevertheless, MP and CP need the results of all contributing machines to provide the final outcome. By design, our architecture is fault-tolerant even if some worker nodes are not able to return their predictions, and our aggregation module supports this behavior. As shown in Fig. 7, we can still obtain results from the contribution of a fraction of the variants being executed on different machines. As the number of faulty nodes increases (i.e., fewer contributing variants), the final accuracy decreases. The achievable accuracy depends on the variants that remain for the aggregation. In the worst case, all nodes fail except the one that executes the least accurate variant, which is still better than MP/CP, which fail to provide any prediction. VP can thus be viewed as providing graceful degradation in the presence of increasing levels of faults/preemptions in the environment.

IV-H Other Performance Metrics: Inference Time, MACs, Model Size

We report the performance of our variants on the other common metrics. We convert each variant designed for and trained on Food-101 to TFLite [24], and execute them for iterations. The results are illustrated in Table VIII. Since each variant has different characteristics, the slowest one determines the eventual response time. Therefore, depending on the available combination of nodes, one can get speedup compared to the baseline. Considering a smart home scenario with a star topology and the round-trip time

, the estimated speedup would be

IV-I Variant Parallelism vs. Class Parallelism

We briefly compare our method with the current state-of-the-art deep learning parallelism algorithm, sensAI [8] which is a CP technique.

Fault-Tolerance: After sending requests to workers, the master node waits for a time window to receive as many responses as possible. Thus, a preemption due to high-priority tasks, or an issue in a worker or the network, is treated by the master node as a fault. We demonstrated that we can still get a response even when worker nodes fail. In CP, however, since each model predicts different class(es), all contributing machines have to successfully process the input and send their results to the master. One can improve this by executing replicas on additional machines, but it increases both complexity and cost. Another idea is to apply overlapping class-aware pruning [25]. In this type of replication, if one machine fails, the results can be aggregated from the other machines with some accuracy drop. Both ideas increase computation and communication costs.

Response Time, MACs, Model Size, Training Time: As depicted in Table VIII, by leveraging class parallelism, sensAI can reach up to speedup using parallel machines. We showed that variant parallelism gets speedup compared with the baseline while requiring roughly half of in-parallel compute nodes compared to sensAI. Further note that, we trained our variants for epochs which is less than class parallelism, yet we achieve comparable or higher accuracy. However, since our infrastructures and scheduling policies are different, direct comparison might be unfair.

Model Time (ms) Speedup #MACs Gain #Params Gain
Aggregate - - -
sensAI -
TABLE VIII: Comparison of our variants with the baseline and sensAI.

Output Size: In the tasks with few classes (e.g. CIFAR-10, SVHN, and MNIST) almost the same number of values must be transmitted. However, for the datasets with (e.g. CIFAR-100 and Food-101), we transmit ten bytes per variant which depending on the number of clusters in CP, it can be up to less than sensAI.

Flexibility: Since each variant is being executed independently, it gives us more flexibility than CP to achieve different goals. We can generate additional variants based on more than one basic architecture, or design a highly customized variant for contributing machines. We can have several variants with different characteristics to be deployed on heterogeneous compute machines. VP can also be combined with other distribution schemes including CP.

V Conclusion & Future Directions

In this paper, we presented Variant Parallelism, a more flexible and fault-tolerant distribution scheme based on ensemble learning. Our evaluation on six different object recognition datasets demonstrates that our method can improve performance in terms of the number of parameters, MACs, and response time compared to the baseline and the current state of the art.

We note several issues to be considered in future work. Although VP has enough flexibility, it has yet to be evaluated on more complicated benchmarks. In addition, we intend to extend it to more complicated tasks such as object detection and semantic segmentation. We believe more performance gain can be achieved as tasks become more complicated.

Both VP and CP are based on well-known ensemble learning techniques. Therefore, they can have similar weaknesses and strengths. For example, in VP, combining variants with identical characteristics may not yield a significant accuracy boost unless we retrain each of them with a different random seed [26]. Contributing machines in CP and VP observe the same data. In some scenarios, e.g., a smart city with long latency or bandwidth bottlenecks, transmitting input data can impact the end-to-end latency. Combining the split computing paradigm [27, 28] with CP or VP can help mitigate the problem. Thus, in future work we plan to optimize the input communication in CP and VP.


  • [1] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning.   PMLR, 2019, pp. 6105–6114.
  • [2] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [3] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, Jan 2018.
  • [4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision.   Springer, 2016, pp. 21–37.
  • [5] Y. Xiong, H. Liu, S. Gupta, B. Akin, G. Bender, Y. Wang, P.-J. Kindermans, M. Tan, V. Singh, and B. Chen, “Mobiledets: Searching for object detection architectures for mobile accelerators,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3825–3834.
  • [6] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [7] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
  • [8] G. Wang, Z. Liu, B. Hsieh, S. Zhuang, J. Gonzalez, T. Darrell, and I. Stoica, “sensAI: ConvNets Decomposition via Class Parallelism for Fast Inference on Live Data,” in Proceedings of Machine Learning and Systems, A. Smola, A. Dimakis, and I. Stoica, Eds., vol. 3, 2021, pp. 664–679.
  • [9] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, “Quincy: fair scheduling for distributed computing clusters,” in Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, 2009, pp. 261–276.
  • [10] J. Dean and L. A. Barroso, “The tail at scale,” Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.
  • [11] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang et al., “Large scale distributed deep networks,” Advances in neural information processing systems, vol. 25, pp. 1223–1231, 2012.
  • [12] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in neural information processing systems, vol. 32, pp. 103–112, 2019.
  • [13] P. Yu and M. Chowdhury, “Fine-grained GPU sharing primitives for deep learning applications,” in Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020, 2020.
  • [14] Y. Yang, J. Chung, G. Wang, V. Gupta, A. Karnati, K. Jiang, I. Stoica, J. Gonzalez, and K. Ramchandran, “Robust class parallelism-error resilient parallel inference with low communication cost,” in 2020 54th Asilomar Conference on Signals, Systems, and Computers.   IEEE, 2020, pp. 1064–1065.
  • [15] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural networks.” in Proceedings of Machine Learning and Systems, vol. 1, 2019, pp. 1–13.
  • [16] G. Wang, S. Venkataraman, A. Phanishayee, J. Thelin, N. R. Devanur, and I. Stoica, “Blink: Fast and generic collectives for distributed ML,” in Proceedings of Machine Learning and Systems 2020, MLSys 2020, Austin, TX, USA, March 2-4, 2020, 2020.
  • [17] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
  • [18] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
  • [19] Y. LeCun, C. Cortes, and C. J. Burges, “MNIST handwritten digit database,” ATT Labs [Online]. Available:, vol. 2, 2010.
  • [20] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” 2011.
  • [21] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in European conference on computer vision.   Springer, 2014, pp. 446–461.
  • [22] Google, “TFLite Model Benchmark Tool with C++ Binary,” 2021. [Online]. Available:
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [24] G. TFLite, “Tensorflow lite,” 2021. [Online]. Available:
  • [25] Y. Yang, J. Chung, G. Wang, V. Gupta, A. Karnati, K. Jiang, I. Stoica, J. Gonzalez, and K. Ramchandran, “Robust class parallelism-error resilient parallel inference with low communication cost,” in 2020 54th Asilomar Conference on Signals, Systems, and Computers.   IEEE, 2020, pp. 1064–1065.
  • [26] Z. Allen-Zhu and Y. Li, “Towards understanding ensemble, knowledge distillation and self-distillation in deep learning,” arXiv preprint arXiv:2012.09816, 2020.
  • [27] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
  • [28] M. Sbai, M. R. U. Saputra, N. Trigoni, and A. Markham, “Cut, distil and encode (cde): Split cloud-edge deep inference,” in 2021 18th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON).   IEEE, 2021, pp. 1–9.