In the past few years, Artificial Intelligence (AI) applications have been widely deployed on edge devices and have served as the main horsepower driving the new technology wave [nikouei2018smart, lin2018edgespeechnets]. Such edge-based AI applications have specific characteristics, such as user privacy, task uniqueness, and data adaptation. Therefore, growing attention has been paid to "training-on-edge", which is expected to effectively adapt neural network models to practical utilization. As one of the most well-recognized collaborative learning techniques, federated learning leverages scalable data parallelism to expand edge devices' limited computation capacities in terms of computing ability, memory size, and so on [konevcny2016federated, smith2017federated, bonawitz2019towards, geyer2017differentially]. Federated learning has multiple edge devices collaboratively train identical models with their local training data [zhao2018federated]. By aggregating the parameter updates from each device, a global model can be trained efficiently and securely.
However, during federated learning on edge devices, a serious problem has been ignored: when federated learning deploys identical training models (i.e., a Convolutional Neural Network (CNN)) to heterogeneous edge devices, the ones with extremely weak computation capacities may significantly delay the parameter aggregation. As illustrated in the left of Fig. 1, heterogeneous edge devices including an Nvidia Jetson Nano, an AWS DeepLens, and a Raspberry Pi collaboratively train AlexNet on CIFAR-10 through federated learning [Jetson, Deeplens, Raspberry, VGG, cifar10]. The time costs of each training cycle for the Jetson Nano, DeepLens, and Raspberry Pi are 16 mins, 80 mins, and 30 mins, respectively. Therefore, in traditional synchronized federated learning, the aggregation cycle is prolonged to 80 mins by the DeepLens.
These edge devices with weak computation capacities are often referred to as computational stragglers, which are drawing more and more attention from the research community [li2019federated, wu2019safa, smith2017federated, chen2019asynchronous]. Leveraging model training optimization methods, such as model compression, we can accelerate a straggler's local training. However, due to edge device heterogeneity, the optimized models are usually adapted to particular device resource constraints. These models, optimized into fixed diverged structures, may significantly defect the collaborative convergence. Therefore, most works address the straggler issue by compromising with asynchronous edge collaboration. Although asynchronized federated learning can effectively accelerate the parameter aggregation cycle, it cannot fundamentally solve the original weak-computation-capacity problem. Moreover, asynchronized edge devices with stale weight parameters introduce larger loss during each cycle, harming the convergence performance.
In this paper, we propose ELFISH, a resource-aware federated learning framework targeting the computation capacity heterogeneity problem among edge devices. Specifically, ELFISH designs a specialized model optimization method to make each straggler work synchronously. Fig. 1 shows the overview of our proposed framework. Once computational stragglers emerge in federated learning, ELFISH optimizes them with the following major contributions:
As Fig. 1 (a) illustrates, we first profile the model's training computation consumption in terms of time cost, memory usage, and computation workload. Guided by the profiling model, we can determine how many neurons to mask in each layer to ensure the model's training computation consumption satisfies the specific resource constraints.
We further propose a resource-aware soft-training scheme, which is shown in Fig. 1 (b). Rather than generating a deterministically optimized model with a diverged structure, different sets of neurons are dynamically masked in every training cycle and are recovered and updated during the following parameter aggregation, ensuring comprehensive model updates over time.
As demonstrated in Fig. 1 (b), we further propose a corresponding parameter aggregation scheme, which balances the contributions from soft-training and improves the collaborative convergence in terms of speed and accuracy.
Experiments show that the proposed CNN training profiling models achieve an average of 93% accuracy in CNN training time estimation. With the resource-aware soft-training scheme, ELFISH provides up to 2× training speed-up in various federated learning settings with stragglers. Furthermore, ELFISH demonstrates about 4% accuracy improvement and better collaborative convergence robustness.
2.1 Neural Network Training-on-Edge
With the booming development of intelligent edge devices, the traditional centralized training practice of neural networks cannot adapt well to vast end-users, who may have unique data domains, different cognitive tasks, or specific data privacy requirements [geyer2017differentially]. Therefore, growing attention has been paid to "training-on-edge", which is expected to effectively adapt neural network models to practical utilization. Unfortunately, the computation capacity of edge devices still cannot catch up with the heavy computation workload of model training.
In order to reduce the model training workload, many compression-oriented works have been proposed for local training optimization: Caldas et al. leveraged the random dropout technique to reduce the training model volume and therefore minimized the computation consumption [caldas2018expanding]. Du et al. proposed a training model compression method by iteratively masking model neurons with fewer gradient updates and skipping corresponding gradients’ computation for less training efforts [du2018efficient].
In addition to training workload reduction, collaborative learning is another edge training approach that unites resource-constrained devices for computation capacity expansion. Li et al. proposed the parameter server framework to achieve data-parallel collaborative learning with synchronized parameter updates across distributed systems [li2014scaling]. McMahan et al. further developed conventional collaborative learning into Federated Learning (FL) [zhao2018federated], which updates the centralized model by communicating weight updates across edge devices. Due to its flexible collaboration schemes and highly recognized data security, federated learning is considered the most effective collaborative scheme for training-on-edge.
2.2 Federated Learning Stragglers
In practical federated learning deployments for training-on-edge, computational heterogeneity across edge devices is inevitable. When federated learning deploys identical neural network models to edge devices with various resource constraints, the ones with weak computation capacities may fail to satisfy the model's training computation consumption, thereby significantly delaying the parameter aggregation and causing severe computational straggler issues.
Although the aforementioned local training optimization works have demonstrated the expected performance, they cannot be directly applied to collaborative learning scenarios. As local training optimization adapts the training model to particular device resource constraints, deterministically optimized models will be generated with diverged structures, which can significantly defect the collaborative convergence. As shown in [yuan2019distributed], when models with diverged structures are introduced into learning collaboration, the overall model accuracy can drop by as much as 10%.
Therefore, most solutions can only compromise with asynchronous edge collaboration [nishio2019client, caldas2018expanding, wang2019adaptive, chen2019asynchronous]. Nishio et al. proposed an optimized federated learning protocol (i.e., FedCS) that kicks straggled devices with limited computation resource budgets out of the learning collaboration [nishio2019client]. Wang et al. revealed that asynchronized stragglers may introduce considerable training loss and eventually disturb the collaborative convergence, and proposed a dedicated asynchronized collaboration scheme to mitigate this issue to a certain degree [wang2019adaptive].
Although these works accelerated the overall federated learning with asynchronized straggler collaboration, they cannot fundamentally eliminate computational stragglers without device-specific training optimization, and the asynchronized collaboration schemes still suffer from considerable performance loss. Fig. 2 shows our experimental analysis for two collaborative edge devices under three learning settings. We can easily find that synchronized federated learning achieves the best convergence in terms of accuracy and speed, while, when the asynchronized straggler parameter aggregation cycle increases from 2 epochs (setting 2) to 3 epochs (setting 3), both the convergence accuracy and speed decrease. Therefore, we propose a soft-training scheme that optimizes straggler models while still guaranteeing convergence.
3 Training Consumption Profiling
To achieve the proposed resource-aware federated learning on heterogeneous edge devices, our first task is to fully profile the resource consumption for neural network training on edge devices. Specifically, we take CNNs as our primary research target, whose neurons (i.e., convolutional filters) will be treated as the smallest structural units for analysis.
3.1 Theoretical Training Cycle Formulation
Given a certain edge device, the training consumption of a CNN model can be generally evaluated as the computation time of each training cycle. From the edge computation perspective, the computation time is mainly determined by the computation workload and the in-memory data transmission volume.
Training Computation Workload Formulation: For CNN model training, two major processes are iteratively conducted, namely the forward propagation for inference loss evaluation and the back-propagation for weight parameter modification. In the forward propagation, the primary computation workload is introduced by MAC (Multiply-Accumulate) operations, which are brought by the multiplication between feature maps and weight matrices. In the back-propagation, the major workload comes from the calculation of the backward gradients. It is notable that the CNN gradient calculation is also conducted by the multiplication between gradient maps and weight matrices, which has the same computation load as the forward propagation. Therefore, we can double the forward propagation MAC operation amount to approximate the overall computation workload. Moreover, since the model is usually trained with input data per mini-batch for each forward and backward propagation, the mini-batch size and the total mini-batch number should be taken into consideration. Therefore, the computation workload for training each neuron in the $l$-th layer is:

$$c^{l} = 2 \cdot B \cdot N_{B} \cdot S_{w}^{l} \cdot S_{f}^{l},$$

where $S_{w}^{l}$ and $S_{f}^{l}$ represent the calculated sizes of the neuron weights and the input feature map, respectively, and $B$ and $N_{B}$ are the mini-batch size and the total mini-batch number. Let $n^{l}$ denote the neuron number in the $l$-th layer. Based on $c^{l}$, a whole CNN model's computation workload can be formulated as $C = \sum_{l} n^{l} \cdot c^{l}$.
Training Memory Usage Formulation: Since an edge device's memory is iteratively reused per mini-batch within a training epoch, the memory usage in each mini-batch includes the weights, the gradients, and the total feature maps generated in this mini-batch. Moreover, because the gradient matrices have the same size as the weight matrices, we double the weights to represent the sum of weights and gradients. Therefore, the training memory usage for a neuron in the $l$-th layer is modeled as:

$$m^{l} = 2 \cdot b_{w} \cdot S_{w}^{l} + b_{f} \cdot S_{f}^{l},$$

where $b_{w}$ and $b_{f}$ are the data bit widths of weights and feature maps, which usually equal 32 on edge devices. Based on this neuron-level modeling, we can formulate a model's training memory usage as $M = \sum_{l} n^{l} \cdot m^{l}$.
Overall Training Time Consumption Formulation: Based on the analysis of computation workload and in-memory data transmission, the training time cost of a neuron in the $l$-th layer can be approximated as:

$$t^{l} = \frac{c^{l}}{P} + \frac{m^{l}}{V},$$

where $P$ is a given edge device's average computing bandwidth and $V$ indicates the transmission speed between the main memory and the processor. Furthermore, by considering the edge device's memory capacity $M_{c}$, the training time cost for the entire model can be formulated as:

$$T = \sum_{l} n^{l} \cdot t^{l} + \frac{\max(0,\, M - M_{c})}{V_{s}} + t_{o},$$

where $M_{c}$ is the memory capacity for weight parameters and feature maps on the straggler, $(M - M_{c})$ represents the size of the data that needs to be transmitted from secondary memory to main memory, and $V_{s}$ is the corresponding transmission speed. $t_{o}$ indicates the extra time overhead during the training phase, such as data loading and compiling delay.
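The profiling model described above can be sketched in code. The following is a minimal illustration, in which the per-neuron workload is approximated by doubled forward MACs, memory by doubled weights plus feature maps, and time by workload over computing bandwidth plus data volume over transmission speed; all function names, parameter names, and the example layer configuration are our own and purely illustrative:

```python
def neuron_workload(s_w, s_f, batch_size, num_batches):
    """MACs for training one neuron: forward cost doubled to cover back-propagation."""
    return 2 * batch_size * num_batches * s_w * s_f

def neuron_memory(s_w, s_f, bits_w=32, bits_f=32):
    """Bytes per neuron: weights doubled (for gradients) plus feature maps."""
    return (2 * bits_w * s_w + bits_f * s_f) / 8

def model_training_time(layers, P, V, mem_capacity, V_s, overhead=0.0,
                        batch_size=32, num_batches=100):
    """Estimated training-cycle time for a whole model.

    layers: list of (n_neurons, s_w, s_f) tuples per layer.
    P: computing bandwidth (MACs/s); V: main-memory transmission speed (B/s);
    V_s: secondary-to-main memory speed (B/s); mem_capacity in bytes.
    """
    total_time, total_mem = 0.0, 0.0
    for n, s_w, s_f in layers:
        c = neuron_workload(s_w, s_f, batch_size, num_batches)
        m = neuron_memory(s_w, s_f)
        total_time += n * (c / P + m / V)
        total_mem += n * m
    # Data exceeding device memory must be swapped in from secondary storage.
    swap = max(0.0, total_mem - mem_capacity)
    return total_time + swap / V_s + overhead
```

In practice, the bandwidth and transmission-speed parameters would be the device-specific values retrieved by the on-device profiling described in the next subsection.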
3.2 Device-Specific Computation Profiling Accuracy Evaluation
Parameters such as the computing bandwidth and memory transmission speed in the proposed profiling model are specified by edge device specifications; therefore, we conduct on-device computation consumption profiling for parameter retrieval.
We employ a Jetson Nano as the test platform. An Nvidia kernel analysis tool is used to automatically monitor real-time device resource usage, such as memory usage and CPU utilization [Jetson]. During the measurement, we generate 100 CNN models with random structure configurations (i.e., layer number, neuron sizes, neuron amount per layer, etc.). All of these models are deployed on the Jetson Nano, and their training time cost and memory usage are measured. Based on the measured data, the device-specific model parameters are calculated. Table 1 shows the retrieved parameters of the proposed computation consumption profiling models. The retrieved profiling models' estimation accuracy will be further evaluated in Section 6.
|Weight bit width||Feature map bit width||Transmission speed||Computing bandwidth|
|32 bits||32 bits||870 MB/s||2.8 GFLOPS|
We evaluate the accuracy of the proposed CNN training time cost profiling model by comparing estimated results with realistic measurements. The realistic measurements are based on 1,000 test runs of VGG-13 on CIFAR-10, over which we calculate the average values. Table 2 illustrates the comparison results. Since we formulate the resource consumption for each neuron, we set different percentages of neurons for each model to evaluate our formulations. According to Table 2, the profiling model achieves accuracies of 88%–98%, averaging 93%.
4 Resource-Aware Soft-training
Based on the CNN model training computation consumption profiling, we propose a Resource-aware Soft-training scheme to accelerate model local training on heterogeneous edge devices and eventually prevent computational stragglers from delaying the collaborative learning process.
As shown in Fig. 1 (b), the overall concept of the proposed soft-training is as follows: a specialized computation optimization strategy is designed and applied to each edge device depending on its computing performance in each training cycle. Specifically, each training cycle can be summarized as a Mask and Recover process: the straggler model is optimized by Masking a particular number of resource-intensive neurons during the training process. By doing so, the optimized local model can fit the computation capacity and resource constraints of the local edge device and facilitate the globally synchronized weight parameter averaging in collaborative federated learning. Another core novelty of the soft-training scheme lies in the "Soft": the neuron masking selection for a straggler changes dynamically across cycles, so the neurons masked in one training cycle will be Recovered in the next training cycle. From the perspective of the overall training process across many cycles, each local training model still maintains a rather complete model structure. This is the major difference from conducting static pruning optimization on each edge device, which can cause the training divergence problem due to the computational heterogeneity issue. In this section, we describe the soft-training process in detail and discuss how our proposed method helps to guarantee the global model convergence.
4.1 Resource-Aware Masked Neuron Selection
The computation optimization strategy we adopt is neuron masking, which temporarily skips some neurons from the complete model training. To do so, we first identify the number of neurons to be masked so that the model satisfies the training time cost constraint. After that, we select the specific neuron group to mask. For each straggler, two key factors are considered in masked neuron selection: the training time cost and the collaborative convergence contribution of each neuron.
Neuron Number Selection w/ Training Time Cost: Let $N_{all}$ denote the total number of neurons in the training model and $N_{mask}$ the number of neurons chosen to be masked from training. Using the profiling models, the training time cost constraint and the straggler's computation capacities in terms of memory size and computation workload can be formulated as:

$$T(N_{all} - N_{mask}) \le T_{c}, \quad M(N_{all} - N_{mask}) \le M_{c}, \quad C(N_{all} - N_{mask}) \le C_{c},$$

where $T(\cdot)$, $M(\cdot)$, and $C(\cdot)$ are the profiled training time cost, memory usage, and computation workload of the model with the remaining neurons. $T_{c}$ indicates the training time cost constraint, which can be regarded as the training time cost of the normal devices in the federated learning. $M_{c}$ and $C_{c}$ represent the memory capacity and computation workload capacity of the straggled edge device.
Directly calculating $N_{mask}$ lacks a determined solution, so we leverage a simple but efficient greedy method to search for it: we mask $\alpha^{l} \cdot N_{mask}$ neurons in each layer simultaneously, where $\sum_{l} \alpha^{l} = 1$ and $\alpha^{l}$ is a weight parameter accounting for the fact that neurons in different layers have distinct training time costs. In other words, a layer with a higher per-neuron training time cost keeps a smaller ratio of its neurons during the training. Therefore, Eq. 6 is reformulated as:

$$\sum_{l} \left( n^{l} - \alpha^{l} \cdot N_{mask} \right) \cdot t^{l} \le T_{c},$$

where $n^{l}$ and $t^{l}$ are the neuron number and per-neuron training time cost of the $l$-th layer. By solving this inequality, we can identify $\hat{n}^{l} = n^{l} - \alpha^{l} \cdot N_{mask}$, which represents the neuron number in each layer that will join the training.
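The greedy search described above can be sketched as follows. This is a minimal illustration that assumes per-neuron time costs come from the profiling model and that each layer's masking share is proportional to its per-neuron cost; the function and variable names are hypothetical:

```python
def per_layer_keep_counts(layer_times, layer_neurons, time_budget):
    """Greedy search for how many neurons per layer may join training.

    layer_times: per-neuron training time cost for each layer.
    layer_neurons: neuron count per layer.
    Layers whose neurons are costlier shed a proportionally larger share;
    the total masked count grows until the time budget is met.
    """
    total = sum(layer_neurons)
    total_t = sum(layer_times)
    weights = [t / total_t for t in layer_times]  # costlier layer -> larger share
    for n_mask in range(total + 1):
        keep = [max(1, n - round(w * n_mask))
                for n, w in zip(layer_neurons, weights)]
        est = sum(k * t for k, t in zip(keep, layer_times))
        if est <= time_budget:
            return keep
    return [1] * len(layer_neurons)  # budget unreachable: keep a minimal model
```

A linear scan suffices here because the estimated cost decreases monotonically as more neurons are masked; a binary search would serve equally well.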
Specific Neuron Selection w/ Convergence Contribution: After identifying the kept neuron number in each layer, we can select the neurons with the highest contribution to the global convergence and mask the others. In each aggregation cycle $t$, every edge device merges its updated weight parameters into the global model, and the neurons with larger weight parameter updates have larger impacts on the global model. Therefore, we define the weight parameters of the $i$-th neuron in the $l$-th layer at the end of cycle $t$ as $W^{l}_{i,t}$, and the neuron's convergence contribution is calculated as the accumulation of its updates over the cycles:

$$\theta^{l}_{i} = \sum_{t} \left\| W^{l}_{i,t} - W^{l}_{i,t-1} \right\|,$$

where a larger $\theta^{l}_{i}$ represents a higher convergence contribution.
Before the training process in each parameter aggregation cycle, we choose a certain percentage of the neurons with the highest convergence contributions from the last training cycle and randomly select the remaining neurons from the rest. By doing so, a high convergence contribution can be guaranteed.
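The selection rule above might be sketched as follows, assuming the convergence contributions have already been accumulated per neuron; the `top_frac` split and the function name are our own illustration, not the paper's exact setting:

```python
import numpy as np

def select_neurons(contributions, keep_count, top_frac=0.5, rng=None):
    """Pick `keep_count` neurons for the next cycle: a `top_frac` share
    taken from the highest accumulated-update contributions, the rest
    drawn at random from the remaining neurons."""
    rng = rng or np.random.default_rng(0)
    order = np.argsort(contributions)[::-1]   # indices, highest contribution first
    n_top = int(keep_count * top_frac)
    top = order[:n_top]
    rest = rng.choice(order[n_top:], size=keep_count - n_top, replace=False)
    return np.concatenate([top, rest])
```

The random component gives every low-contribution neuron a chance to rejoin training, which is what keeps the masking "soft" over many cycles.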
4.2 Dynamic Soft-Training Scheme
After defining the masked neuron selection rule, we can conduct the soft-training for federated learning. In this section, we demonstrate the specific soft-training process and how it guarantees global convergence.
Neuron Masking and Re-updating Process: Before training in each parameter aggregation cycle, we leverage the proposed neuron masking selection to identify the neurons to be masked. In the next cycle, the neurons masked in the last cycle are assigned the values from the global model and recovered into the local model again. After the masking selection process, a different set of neurons is masked. Therefore, the neurons masked in the last cycle join the training again and are re-updated through back-propagation during the training. The soft-training iteratively conducts the neuron selection, masking, and re-updating steps until the global model converges.
Fig. 1 (b) demonstrates the overview of the proposed resource-aware soft-training. In one aggregation cycle, some neurons are masked from training; the model is trained with the remaining neurons, uploads its weight parameters to the global model, and steps into the next cycle, in which the masked neurons recover by fetching the global model parameters. Through this iterative dynamic masking and recovering process, the overall training fully updates every neuron in the model structure and thus maintains a complete functional model.
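One Mask-and-Recover cycle could look like the following sketch, which assumes per-neuron weights stored in a flat array and a caller-supplied partial-training routine (both heavy simplifications of a real CNN training step):

```python
import numpy as np

def soft_training_cycle(global_w, mask, local_train_fn):
    """One Mask-and-Recover cycle on a straggler.

    global_w: flat per-neuron weight array fetched from the server, which
    recovers the neurons that were masked in the previous cycle.
    mask: boolean array; True marks neurons that join this cycle's training.
    local_train_fn: trains only the unmasked sub-model and returns its
    updated weights (stands in for real back-propagation here).
    """
    local_w = np.asarray(global_w, dtype=float).copy()   # Recover step
    local_w[mask] = local_train_fn(local_w[mask])        # Mask step: partial training
    # Only the trained neurons carry fresh updates back to the server;
    # the masked ones keep the global values and contribute no update.
    return local_w, mask
```

Over successive cycles the mask changes, so every neuron is eventually trained and uploaded, which is what preserves the complete model structure.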
Convergence Guarantee Analysis: In traditional hard-pruning methods, neurons are permanently removed from the model. In federated learning, such a hard manner generates a fixed partial model and prevents the skipped neurons from updating their weight parameters to the global model. Therefore, the corresponding neurons in the global model gradually stop being updated, significantly impacting the collaborative convergence.
However, our soft-training scheme guarantees the collaborative convergence from the following two aspects: 1) According to our masked neuron selection setting, neurons with less significant weight parameter updates are randomly masked. We only mask a neuron in a single cycle and recover it in the following cycles. Such a soft scheme preserves a complete functional model, and each neuron has a chance to contribute enough to the collaboration. 2) In each parameter aggregation cycle, we always keep the neurons with the highest convergence contributions and let them join the next cycle's training process. Such a scheme ensures the optimized model always updates the most significant information to the global model, guaranteeing the collaborative convergence.
5 Parameter Aggregation Scheme
with Soft-Trained Models
With the proposed "soft-training" technique, we further investigate the corresponding parameter aggregation scheme to enhance accuracy and convergence speed for training-on-edge.
5.1 Loss-based Weight Aggregation Scheme
Model Average Scheme: In each parameter aggregation cycle, the soft-training generates models with partial neurons. Therefore, the models on the edge devices have diverged partial structures, introducing more errors into the global model compared to the original full model. Hence, we need to take the model structure comprehensiveness into consideration when aggregating weight parameters.
Weight Average based on Model Loss: The goal of federated learning is to train a global model with each edge device's local data under its computation resource constraints, so the optimization objective of the global model is to minimize the aggregated local losses, $\min_{w} F(w) = \sum_{k} \frac{n_{k}}{n} F_{k}(w)$, where $F_{k}$ is the local loss of the $k$-th device. After the soft-training phase, stragglers train partial models and update parameters synchronously with the other devices. During this process, we should consider the error brought by the models' diverged structures and reduce it in the global model. When averaging the weights aggregated from different devices, we compute $w_{t+1} = \sum_{k} \gamma_{k} \cdot w^{k}_{t+1}$, where $\gamma_{k}$ is a hyperparameter correlated to the diverged model structures.
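The structure-aware averaging described above can be sketched per neuron as follows. This illustration assumes the per-device hyperparameters are given, treats weights as flat per-neuron arrays, and uses each device's training mask to record which neurons it actually updated; the function name and mask-based handling of untrained neurons are our own simplifications:

```python
import numpy as np

def aggregate(global_w, client_ws, client_masks, gammas):
    """Weighted average of soft-trained partial models.

    client_masks[k] marks which neurons device k actually trained this
    cycle; gammas[k] is the per-device hyperparameter reflecting its
    structural divergence (assumed given here)."""
    num = np.zeros_like(global_w, dtype=float)
    den = np.zeros_like(global_w, dtype=float)
    for w, m, g in zip(client_ws, client_masks, gammas):
        m = np.asarray(m, dtype=float)
        num += g * m * np.asarray(w, dtype=float)
        den += g * m
    # Neurons that no client trained this cycle keep their global values.
    return np.where(den > 0, num / np.maximum(den, 1e-12), global_w)
```

Down-weighting devices with more heavily masked models limits the error that their partial structures inject into the global model.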
5.2 Convergence Improvement Discussion
We assume that the global loss function $F$ is $L$-smooth and $\mu$-strongly convex and that each edge device performs one epoch of local training before updating its parameters to the server. Furthermore, we assume the local stochastic gradients are unbiased with bounded variance and bounded norm. With a properly chosen learning rate, after $T$ updates on the server, the global model converges toward an optimum $w^{*}$ up to an additional error term introduced by the partial models.

The model-based hyperparameter $\gamma_{k}$ controls the trade-off between the convergence rate and the additional error brought by the partial models. With suitable $\gamma_{k}$, the convergence rate approaches $O(1/T)$, and the additional error variance caused by the partial model structures is reduced toward 0. Therefore, when smaller partial models are aggregated, applying our method greatly reduces the error variance, improving both accuracy and convergence speed. The proposed aggregation scheme will be evaluated in the next section.
Based on thorough CNN model training consumption profiling on edge devices, we combine the proposed "resource-aware soft-training" (Sec. 4) with the dedicated parameter aggregation scheme (Sec. 5) into a comprehensive federated learning framework, ELFISH, which is expected to resolve the computational straggler issues and enhance training-on-edge. In the next section, we comprehensively evaluate the performance of the proposed ELFISH framework.
6.1 Experiment Setup
Testing Platform Setting: In our experiment, we build our own edge federated learning testbed with multiple Nvidia Jetson Nano development boards. By adjusting the configuration of CPU bandwidth and memory availability, we simulate computation-capable devices as well as straggled devices with different resource constraints. The details of these straggler settings are shown in Table 3.
CNN Models and Dataset: In the experiment, two CNN models are used as our testing targets, namely LeNet and a modified AlexNet. LeNet is trained with the handwritten digit dataset MNIST. AlexNet is trained with the image dataset CIFAR-10, which contains 60K 32×32 color images.
Comparison Schemes: In order to exhibit the effectiveness and superiority of our proposed ELFISH framework, we adopt three other federated learning schemes for comparison: (1) Synchronized Federated Learning (Syn. FL). All devices update their parameters synchronously with flexible parameter aggregation cycles. When straggled devices emerge, the capable devices have to wait for the stragglers to finish their training process and then update parameters to the server all together. (2) Asynchronized Federated Learning (Asyn. FL). All devices update their parameters immediately after local training without waiting for others. (3) Soft-Training Only (S.T. Only). The dynamic training model optimization scheme shown in Sec. 4 without any parameter aggregation optimization. (4) ELFISH. The proposed comprehensive scheme that combines S.T. Only and the parameter aggregation scheme shown in Sec. 5.
|Constraints||Strag. 1||Strag. 2||Strag. 3||Strag. 4|
|Memory Usage (MB)||252||150||100||110|
|Time Cost (Mins)(CIFAR-10)||20.6||23.8||27.2||34|
6.2 General ELFISH Performance Evaluation
We evaluate the general performance of ELFISH in terms of training accuracy and convergence speed against the other federated learning schemes. As shown in Table 3, two experimental settings are involved: (1) four devices join the FL, with two capable devices and two stragglers configured as Strag. 1 and Strag. 2; (2) six devices join the FL, with three capable devices and three stragglers configured as Strag. 1, Strag. 2, and Strag. 3 in Table 3. Both settings are conducted on the MNIST and CIFAR-10 datasets.
Accuracy Evaluation. According to Fig. 3, Asyn. FL always achieves the lowest accuracy, because the stale parameters updated by straggled devices bring more errors to the global model on the server. On the contrary, Syn. FL updates comprehensive model parameters and achieves better accuracy. However, due to straggler issues, many fewer training cycles are accomplished, resulting in slower convergence speed. The accuracy of ELFISH is better than that of all three other schemes, as there are no stragglers and no stale parameter updates. With the improved parameter aggregation scheme, it also achieves better accuracy than S.T. Only.
Convergence Speed Evaluation. It is clear that ELFISH has the fastest convergence speed, while Asyn. FL is hard to converge, as the additional errors caused by stragglers are repetitively brought back into the collaboration through asynchronization. Therefore, we can see many accuracy fluctuations with it. For S.T. Only, without the improved parameter aggregation scheme, the soft-trained models also introduce a certain amount of imbalanced synchronization, causing slower convergence speed than ELFISH.
6.3 Convergence Robustness Evaluation
From the previous section's experimental results, we find that when more straggled devices emerge in the FL collaboration, more accuracy fluctuations appear. Therefore, we further investigate the convergence robustness of different FL schemes. Experiments are set with the straggler number increasing from 1 to 4.
Fig. 4 illustrates the performance comparison; we can clearly see that ELFISH is more robust than Asyn. FL. Especially, when all collaborative edge devices are computational stragglers, ELFISH still demonstrates robust convergence regulation capability. Its average accuracy is 7% and 4% higher than Asyn. FL with LeNet and AlexNet respectively, while its accuracy variance is 6× and 4× smaller than that of Asyn. FL.
6.4 Weight Aggregation Scheme Evaluation
As the number of straggled devices increases, stragglers with more constrained computation resources join the federated learning. With resource-aware soft-training, a weaker straggler trains a more partial model and brings a larger training loss to the global model. The proposed parameter aggregation scheme can effectively resolve this issue, as presented in Sec. 5. Its effectiveness is also evaluated with the straggler number increasing from 1 to 4.
By comparing ELFISH and S.T. Only as shown in Fig. 5, we can demonstrate the effectiveness of the proposed aggregation scheme. Clearly, ELFISH effectively reduces the accuracy variance caused by partial models. Even in the 4-straggler FL setting, ELFISH achieves accuracy benefits of 2% and 15% over S.T. Only for LeNet and AlexNet, respectively.
6.5 Non-IID Setting Evaluation
As we are targeting practical heterogeneity across edge devices, we also evaluate the effectiveness of our method in the Non-IID setting (though our proposed technique is oriented more toward computational heterogeneity than data heterogeneity).
The Non-IID capability is evaluated with four edge devices in an FL collaboration with a single straggled device. We divide the dataset into four parts with different data distributions and assign them to the devices. Although the overall performance is inevitably defected by the Non-IID setting, ELFISH can still achieve better accuracy and faster convergence speed than the other schemes, especially Syn. FL, with a 2× convergence speed-up and 8.21% and 3.72% accuracy benefits for AlexNet and LeNet, respectively.
Compared with the IID setting, the lack of common data has a more serious impact than stragglers on the accuracy and convergence speed of the global model. By applying our method, all data participate in each training cycle and contribute to the convergence of the global model.
In this work, we proposed ELFISH, a resource-aware federated learning framework for heterogeneous edge devices. Leveraging thorough CNN model training consumption profiling on edge devices, an innovative "soft-training" optimization scheme, as well as a dedicated parameter aggregation scheme improvement, ELFISH can effectively introduce local CNN model optimization into federated learning to eliminate computational stragglers, while maintaining the expected collaborative convergence across all edge devices. Compared with conventional synchronized/asynchronized federated learning works, experiments demonstrated that the proposed ELFISH has superior training accuracy, speed, convergence robustness, and Non-IID setting resistance. By well addressing the computational heterogeneity of edge devices, ELFISH significantly enhances the applicability and performance of federated learning for training-on-edge.