Accelerating DNN Training in Wireless Federated Edge Learning System

05/23/2019
by   Jinke Ren, et al.
Zhejiang University

The training task in classical machine learning models, such as deep neural networks (DNNs), is generally carried out at a remote, computationally adequate cloud center for centralized learning, which is typically time-consuming and resource-hungry. It also incurs serious privacy issues and long communication latency since massive data must be transmitted to the centralized node. To overcome these shortcomings, we consider a newly emerged framework, namely federated edge learning (FEEL), in which the edge server aggregates the local learning updates instead of users' raw data. Aiming at accelerating the training process while guaranteeing the learning accuracy, we first define a novel performance evaluation criterion, called learning efficiency, and formulate a training acceleration optimization problem in the CPU scenario, where each user device is equipped with a CPU. Closed-form expressions for joint batchsize selection and communication resource allocation are developed, and some insightful results are highlighted. Further, we extend our learning framework to the GPU scenario and propose a novel training function to characterize the learning property of general GPU modules. The optimal solution in this case is shown to have a structure similar to that of the CPU scenario, suggesting that our proposed algorithm is applicable to more general systems. Finally, extensive experiments validate our theoretical analysis and demonstrate that our proposal can reduce the training time and improve the learning accuracy simultaneously.


I Introduction

With AlphaGo defeating the world's top Go player and the troika of Y. Bengio, G. Hinton, and Y. LeCun winning the 2018 ACM A.M. Turing Award, artificial intelligence (AI) has become the most cutting-edge technique in both academia and industry and is envisioned as a revolutionary innovation enabling a smart earth in the future [1]. The implementation of AI in wireless networks is one of the most fundamental research directions, leading the trend of communication and computation convergence [2, 3, 4]. The key idea of implementing AI in wireless networks is to leverage the rich data collected by massive distributed user devices to learn appropriate AI models for network planning and optimization. Various conceptual and engineering AI breakthroughs have been applied to wireless network design, such as channel estimation [5], signal detection [6], and resource allocation [7, 8].

Despite the substantial progress in AI techniques, current learning algorithms demand enormous computation and memory resources for data processing. However, the training data in wireless networks are unevenly distributed over a large number of resource-constrained user devices, and each device only owns a small fraction of the data [9]. These two hostile conditions make it hard to implement AI algorithms on user devices. The conventional solution generally offloads the local training data to a remote cloud center for centralized learning. Nevertheless, this method suffers from two key disadvantages. On the one hand, the latency for data transmission is typically very large because of the limited communication resources. On the other hand, the private information contained in the training data may be leaked, since the cloud center can be attacked by malicious third parties. Hence, the classical cloud-based learning framework is no longer suitable for scenarios where data privacy is of paramount importance, such as intelligent healthcare and smart banking systems [10].

To address the first issue, an innovative architecture called mobile edge computing (MEC) has been developed by implementing cloud computation capability at the network edge and migrating learning tasks from the cloud center to the edge server [11]. By this means, the communication latency can be significantly reduced [12], mobile energy consumption can be extensively saved [13], and core network congestion can be notably relieved [14]. To overcome the second deficiency, a novel distributed learning framework, namely federated learning (FL), has recently been proposed in [15]. The key idea of this framework is to globally aggregate the local learning updates (gradients or parameters) trained on user devices at a centralized node while keeping the privacy-sensitive raw data on the local devices. In this way, the benefit of shared models trained from abundant data can be reaped and the computation resources of both user devices and cloud servers can be collectively exploited [16]. Motivated by this, the effective collaboration between MEC and FL, referred to as federated edge learning (FEEL), has great potential to facilitate AI implementation in wireless networks [17].

This paper makes use of the FEEL framework to accelerate the training task of general deep neural networks (DNNs). We are inspired by the pioneering studies on applying FL to learn AI models [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]. In [18], a practical FederatedAveraging algorithm for distributed DNN training has been developed, and both independent and identically distributed (IID) and non-IID datasets are used to test its learning performance. Two approaches, called structured updates and sketched updates, have been proposed to reduce the communication cost between the central node and the user devices [19]. Building on this, substantial studies have devoted great effort to further reducing the communication overhead by developing effective compression methods [20, 21, 22, 23]. Moreover, by creating a shared subset of data gathered from user devices, an efficient collaborative learning strategy for systems with non-IID distributed data has been devised to improve the learning accuracy [24]. In addition, the authors in [25] have investigated the multi-task FL system and proposed a system-aware optimization framework to balance the communication cost, stragglers, and fault tolerance. In particular, a blockchained FL architecture has been introduced in [26], where the optimal block generation rate has been derived to minimize the end-to-end latency. Last but not least, a prominent broadband analog aggregation scheme based on over-the-air computation has been proposed in [27], and two communication-and-learning tradeoffs have been presented to achieve a low-latency FEEL system.

The aforementioned works on FL mainly focus on accelerating the training task from the communication perspective, i.e., reducing the communication overhead between the centralized node and the user devices. However, the optimization of the training task itself is also crucial and has not been investigated yet. To facilitate intelligent system implementation, we consider the joint communication and computation resource allocation in this paper. The major hyperparameter, i.e., the training batchsize, is also optimized to improve the training efficiency. This work is most closely related to the pioneering work of [28], which has developed a control algorithm to achieve the tradeoff between local updates and global parameter aggregation under a fixed resource budget. In contrast, we go one step further toward training batchsize optimization from the learning perspective. Our main result is that the batchsize of each device should dynamically adapt to the wireless channel condition in order to achieve lower training latency and higher learning accuracy. The main contributions of this work are summarized as follows.

  • To quantitatively analyze the training process, we first define a global loss decay function with respect to the training batchsize. Based on this, a novel criterion, namely learning efficiency, is developed as the ratio between the global loss decay and the end-to-end latency, which can well evaluate the system learning performance.

  • We theoretically derive the detailed expression of the learning efficiency in the CPU scenario and formulate a training acceleration problem under both communication and learning resource budgets. The closed-form expressions for joint batchsize selection and communication resource allocation are also derived. Specifically, the optimal batchsize is proved to scale linearly with the local training speed and to increase with both the training priority ratio and the uplink data rate in the power of .

  • We extend the training acceleration problem to the GPU scenario and develop a new training function to characterize the relation between the training latency and the batchsize of general GPU modules. The corresponding solution in this scenario is shown to have a structure similar to that in the CPU scenario, revealing that our proposed algorithm can be applied to more general systems.

  • The proposed algorithms in both the CPU and GPU scenarios are implemented in software. Several classical DNN models are used to test the system performance on a real image dataset. The experimental results demonstrate that our proposed scheme attains better learning performance than several benchmark schemes.

The rest of this paper is organized as follows. In Section II, we introduce the FEEL system and establish the DNN model and the channel model. In Section III, we quantitatively analyze the training process and formulate the training acceleration optimization problem in the CPU scenario. The closed-form solution for the CPU scenario is developed in Section IV. In Section V, we extend the training acceleration problem to the GPU scenario and discuss the solution in this case. Section VI presents the experimental results, and the paper is concluded in Section VII.

II System Model

II-A Federated Edge Learning System Model

Fig. 1: Federated edge learning system.

As depicted in Fig. 1, we consider an FEEL system comprising one edge server and distributed single-antenna user devices, denoted by the set . A shared DNN model (e.g., a convolutional neural network, CNN) needs to be collaboratively trained by these devices. By interacting with its own user, each device collects a number of labelled data samples and constitutes its local dataset , where is the training sample and represents the corresponding ground-truth label.

To accomplish the training task, two schemes have been widely employed: 1) centralized learning, i.e., each device directly uploads its raw data to the base station (BS) for global training, and the updated model is then multicast back to each device; 2) individual learning, i.e., each device trains an independent model on its local dataset without any collaboration. The former faces a severe privacy issue since the edge server may be attacked by malicious third parties. The latter avoids privacy disclosure but suffers from the isolated-data-island problem, so the learning accuracy and reliability cannot be guaranteed. To deal with these two issues, we adopt the FEEL scheme to accelerate the training task and improve the learning accuracy simultaneously. In this scheme, the edge server only aggregates the local gradients without centrally collecting the raw data. (Due to the sparsity of the gradient, the communication overhead can be significantly reduced by gradient compression methods [23]; therefore, we aggregate the gradients instead of the parameters in this paper.) For convenience, we define the following five steps as a training period, which is performed repeatedly until a satisfactory learning accuracy is achieved. The detailed procedures are summarized as follows, and a pseudocode sketch is given after the list.

  • Step 1 (Local Gradient Calculation): In each training period, say the -th period, each device first selects training data from its local dataset, performs the forward-backward propagation algorithm, and then derives the local gradient vector , where is the parameter set of the DNN model.

  • Step 2 (Local Gradient Uploading): After quantizing and compressing the gradient vector, each device transmits its local gradient to the edge server via a multiple access scheme, such as time division multiple access (TDMA) or orthogonal frequency division multiple access (OFDMA).

  • Step 3 (Global Gradient Aggregation): The edge server receives the gradient vectors from all user devices and then aggregates (averages) them as the global gradient, as

    (1)
  • Step 4 (Global Gradient Downloading): After finishing the gradient aggregation, the edge server delivers the global gradient to the BS, which broadcasts it to all user devices.

  • Step 5 (Local Model Updating):

    Each device runs the stochastic gradient descent (SGD) algorithm based on the received global gradient. Mathematically, the local DNN models are updated by

    (2)

    where is the step-size in the -th training period.
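The five steps above can be condensed into a compact simulation sketch. The snippet below is only an illustrative toy implementation assumed for this document: the linear-model gradient, the plain (unweighted) averaging in the aggregation step, and the fixed step-size are stand-ins, not the paper's exact procedure.

import numpy as np

def local_gradient(w, dataset, batchsize, rng):
    # Step 1: forward-backward propagation on one mini-batch.
    # A toy linear-regression gradient stands in for the DNN gradient here.
    x, y = dataset
    idx = rng.choice(len(y), size=batchsize, replace=False)
    residual = x[idx] @ w - y[idx]
    return x[idx].T @ residual / batchsize

def training_period(w, local_datasets, batchsizes, step_size, rng):
    # Steps 1-2: every device computes and "uploads" its local gradient.
    grads = [local_gradient(w, d, b, rng) for d, b in zip(local_datasets, batchsizes)]
    # Step 3: the edge server aggregates (averages) the local gradients, cf. (1).
    global_grad = np.mean(grads, axis=0)
    # Steps 4-5: the global gradient is broadcast and each device runs one SGD step, cf. (2).
    return w - step_size * global_grad

# Example: four devices, ten training periods.
rng = np.random.default_rng(0)
datasets = [(rng.normal(size=(200, 8)), rng.normal(size=200)) for _ in range(4)]
w = np.zeros(8)
for _ in range(10):
    w = training_period(w, datasets, batchsizes=[16, 16, 32, 32], step_size=0.05, rng=rng)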

II-B DNN Model

In this work, we take a generalized fully-connected DNN model for analysis. To comprehensively characterize the network structure, we denote as the number of layers, where the -th layer is equipped with neurons. Then in the -th layer, the number of weights between each pair of connected neurons and the number of biases added to each neuron are and , respectively. Therefore, the total number of parameters (weights and biases) is given by

(3)

We assume that the DNN model is deployed in each user device and the edge server. Moreover, the local loss function of each device that measures the training error is defined as

(4)

where is the sample-wise loss function that quantifies the prediction error between the learning output (via input and parameter ) and the ground-truth label . Accordingly, the global loss function at the edge server can be expressed as

(5)

The target of the training task is to optimize the parameters towards minimizing the global loss function via the SGD algorithm. Further, the gradient vector of each device can be expressed as

(6)

where denotes the gradient operator. Note that each parameter has a counterpart gradient; therefore, the total number of gradients at each device exactly equals in (3).
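Since the bodies of (3)-(6) did not survive extraction, the block below records the standard forms that the surrounding prose describes; the symbols ($L$, $n_l$, $\mathcal{D}_k$, $f$, $\mathbf{w}$, $K$) are generic federated-learning notation chosen here for illustration and are not necessarily the paper's own.

\begin{align}
  D &= \sum_{l=1}^{L-1}\bigl(n_l n_{l+1} + n_{l+1}\bigr), && \text{total weights and biases, cf. (3)} \\
  F_k(\mathbf{w}) &= \frac{1}{|\mathcal{D}_k|}\sum_{(\mathbf{x}_i,\,y_i)\in\mathcal{D}_k} f(\mathbf{w};\mathbf{x}_i,y_i), && \text{local loss of device } k,\ \text{cf. (4)} \\
  F(\mathbf{w}) &= \sum_{k=1}^{K}\frac{|\mathcal{D}_k|}{\sum_{j=1}^{K}|\mathcal{D}_j|}\,F_k(\mathbf{w}), && \text{global loss, cf. (5)} \\
  \mathbf{g}_k &= \nabla F_k(\mathbf{w}). && \text{local gradient, cf. (6)}
\end{align}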

II-C Communication Model

Without loss of generality, we adopt the typical TDMA method for data transmission in this paper. Let denote the transmit power of device for gradient uploading and denote the uplink channel power gain. Accordingly, let denote the transmit power of the BS for transmitting the global gradient to device and the downlink channel power gain.

It should be emphasized that the durations of both uplink and downlink time-slots are relatively short (e.g., the frame duration of the LTE protocol is 10 ms), during which the channel power gains remain fixed. However, the duration of one training period is usually on the time scale of seconds because of the high computational complexity of running the SGD algorithm and the limited computation resources of user devices. In view of this, each training period spans multiple time-slots. Since this work focuses on accelerating the training task from the long-term learning perspective, the channel dynamics do not affect the learning performance significantly. Therefore, we employ the average uplink and downlink data rates instead of the instantaneous ones, which can be respectively evaluated as

(7)
(8)

where represents the expectation over channel fading, denotes the system bandwidth, and denotes the variance of the additive white Gaussian noise (AWGN).
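The rate expressions (7) and (8) are missing from the extracted text; under the standard Shannon-capacity model that the prose describes, with assumed symbols $B$ (bandwidth), $\sigma^2$ (noise variance), $p_k, h_k$ (uplink power and gain), and $p_{\mathrm{B}}, g_k$ (downlink power and gain), they would take the form:

\begin{align}
  \bar{r}_k^{\mathrm{u}} &= \mathbb{E}\!\left[\,B\log_2\!\left(1+\frac{p_k h_k}{\sigma^2}\right)\right], && \text{cf. (7)} \\
  \bar{r}_k^{\mathrm{d}} &= \mathbb{E}\!\left[\,B\log_2\!\left(1+\frac{p_{\mathrm{B}}\, g_k}{\sigma^2}\right)\right]. && \text{cf. (8)}
\end{align}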

Thus far, we have elaborated the detailed procedures of the FEEL scheme and established both the learning and communication models. Next, we will analyze the training task and formulate the training acceleration optimization problem. In our work, the edge server is always equipped with powerful GPUs, whereas the training module of a user device can be either a CPU or a GPU. Therefore, we will investigate the CPU and GPU scenarios separately in the sequel.

III Federated Learning in CPU Scenario: Problem Formulation

In this section, we first investigate the CPU scenario, where each device is equipped with only a CPU for DNN training. The training acceleration problem is formulated to maximize the system learning efficiency, and some insightful results about network planning are also discussed.

III-A Training Loss Decay Analysis

To accelerate the training process and achieve a satisfactory learning accuracy, we adopt the mini-batch SGD algorithm in this paper. The major difficulty in performing this algorithm is the selection of the key hyperparameter, i.e., the training batchsize, which greatly affects the learning accuracy and is therefore worth optimizing.

To quantitatively assess the training performance, we first define an auxiliary function, namely global loss decay, as

(9)

which represents the decrease of the global loss function across the -th training period. For brevity, we rewrite as since our following analysis is based on one training period.

Now we analyze the training performance of the introduced FEEL system. Recall that in one training period, each device first selects a subset of data, namely one batch, for local gradient calculation. Denote as the number of data samples in the batch of device . Then the edge server aggregates the gradient information from all devices and calculates the global gradient. Therefore, the FEEL system is capable of processing data samples in one training period, which is referred to as the global batchsize. Note that the target of the training task is to minimize the global loss function. Then, according to [29], the relation between the global loss decay function and the global batchsize can be approximately evaluated as

(10)

where is a coefficient determined by the specific structure of the DNN model. We remark that the global loss decay does not increase linearly with the global batchsize. This is because the learning rate should adapt to the batchsize to ensure the convergence of the mini-batch SGD algorithm and to guarantee the learning accuracy [30].
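The body of (10) is missing; one common concave model consistent with the remark above, used here purely as an illustrative assumption, is a sublinear power law in the global batchsize $m$:

\begin{equation}
  \Theta(m) \;\approx\; \beta\, m^{\gamma}, \qquad 0 < \gamma < 1, \qquad \text{cf. (10)}
\end{equation}
where $\beta$ and $\gamma$ depend on the DNN structure, so that doubling the global batchsize yields less than double the loss decay.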

III-B End-to-End Latency Analysis

As mentioned earlier, we aim at accelerating the training task by collaboratively leveraging the data information and computation resources among all devices. Therefore, the end-to-end latency of the training period is also an essential term to be optimized. In the following, we will mathematically calculate the latency of each step in one training period.

  • Local Gradient Calculation Latency: In the CPU scenario, each device is equipped with a CPU for DNN training. We measure the computation capacity of each device by its CPU frequency, denoted by (in CPU cycles/s). Moreover, let denote the number of CPU cycles required to perform the forward-backward propagation algorithm on one data sample. Since the CPU operates in a serial mode, the local gradient calculation latency is given by

    (11)
  • Local Gradient Upload Latency: The local gradient vector of each device should be quantized and compressed before transmission. (Note that the computational complexity of the gradient quantization and compression algorithm is very low [23]; thus, the corresponding latency can be neglected compared with the local gradient upload latency.) Denote the average number of quantization bits per gradient element as . Let denote the compression ratio, which is defined as the ratio between the compressed gradient data size and the overall raw gradient data size. Then the total data size of each local gradient vector is . Recall that we adopt TDMA for data transmission. Let denote the length of each uplink radio frame (usually 10 ms in the LTE standard) and denote the time-slot duration allocated to device . Therefore, the local gradient upload latency can be expressed as

    (12)
  • Global Gradient Download Latency: When the edge server finishes the gradient aggregation, it broadcasts the global gradient vector to each device. To be consistent with the uploading procedure, we use bits to quantize each global gradient element and leverage the same gradient compression technique. Similarly, let denote the length of each downlink radio frame and denote the time-slot duration allocated to device for global gradient downloading. Thus, the global gradient download latency can be expressed as

    (13)
  • Local Model Update Latency: After receiving the global gradient, each device starts to update its local model via the gradient descent method, as presented in (2). Denote as the number of CPU cycles required for the local model update. Then, the local model update latency is given by

    (14)

It should be noted that the edge server has powerful GPU modules, so the gradient aggregation latency can be reasonably neglected. Moreover, the gradient aggregation at the edge server cannot start until the local gradient vectors of all devices have been received. Therefore, in one training period, the end-to-end latency of each device can be expressed as

(15)

Accordingly, the end-to-end latency of the FEEL system in one training period is given by

(16)
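A minimal sketch of the latency model in (11)-(16), with illustrative variable names (cpu_freq, cycles_per_sample, frame_len, and so on) chosen here because the paper's symbols were lost in extraction; the system-level latency is the maximum over devices since the aggregation waits for the slowest one.

def device_latency_cpu(batchsize, cpu_freq, cycles_per_sample, update_cycles,
                       grad_bits, uplink_rate, downlink_rate,
                       uplink_slot, downlink_slot, frame_len):
    t_cal = cycles_per_sample * batchsize / cpu_freq                   # (11) serial CPU computation
    t_up = grad_bits / (uplink_rate * uplink_slot / frame_len)         # (12) TDMA upload: only a slot per frame is usable
    t_down = grad_bits / (downlink_rate * downlink_slot / frame_len)   # (13) download of the global gradient
    t_update = update_cycles / cpu_freq                                # (14) local model update
    return t_cal + t_up + t_down + t_update                            # (15) per-device end-to-end latency

def system_latency(per_device_latencies):
    return max(per_device_latencies)                                   # (16) the slowest device ends the period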

III-C Problem Formulation

In this paper, we aim at accelerating the training task while guaranteeing the learning accuracy. To better reflect the training performance, we first define a novel evaluation criterion from the learning perspective, as presented below.

Definition 1

The training performance of the FEEL system can be evaluated by the learning efficiency, which is defined as

(17)
Remark 1

The learning efficiency can be interpreted as the average global loss decay rate in the duration of one training period. Therefore, improving the learning efficiency is equivalent to accelerating the training process. In view of this, the learning efficiency is an appropriate metric to evaluate the system training performance.
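The body of (17) is also missing; by Definition 1 and Remark 1 it is simply the ratio of the global loss decay to the end-to-end latency of one training period (notation assumed):

\begin{equation}
  E \;=\; \frac{\Theta(m)}{T}, \qquad \text{cf. (17)}
\end{equation}
where $T$ is the system end-to-end latency in (16).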

Based on the above analysis, the objective of training acceleration can be transformed into the learning efficiency maximization. Therefore, the optimization problem can be mathematically formulated as

(18a)
s.t. (18f)

where (18f) and (18f) represent the uplink and downlink communication resource limitations, respectively, (18f) gives the overall number of data that can be processed in one training period, and (18f) bounds the minimum and maximum batchsizes of each device, where is determined by the memory size and the CPU configuration of each device.

It can be observed that the objective function in problem is complicated and non-convex, making it hard to solve in general. In the next section, we will decompose it into two subproblems and devise efficient algorithms to solve them individually.

IV Federated Learning in CPU Scenario: Optimal Solution

In this section, we first analyze the mathematical characteristics of problem and then equivalently decompose it into two subproblems. The closed-form solutions for both subproblems are derived individually and some insightful results are also discussed.

IV-A Problem Decomposition

The main challenge in solving problem is that the denominator of the objective function is non-smooth. For ease of decomposition, (18a) can be rewritten as

(19)

It can be seen that the local gradient upload latency is determined only by the uplink communication resource , while the global gradient download latency is determined only by the downlink communication resource and is independent of the other variables. Meanwhile, and are generally independent of each other according to (18f) and (18f). Therefore, one training period can be divided into two subperiods. The first subperiod performs the local gradient calculation and uploading, and can be formulated as

(20)

The second subperiod is to download global gradient and update local DNN model, and thus can be formulated as

(21)

It needs to be emphasized that the value of in subproblem should match that in subproblem . Therefore, the global batchsize can be regarded as a global variable that is optimized last. In view of this, we will analyze the two subproblems in the sequel.

IV-B Solution to Subproblem

Subproblem is a min-max optimization problem and is hard to solve directly. To make it more tractable, we first define as the maximum reciprocal of the uplink learning efficiency among all user devices, i.e., . Then, by the parametric algorithm [31], subproblem can be transformed into

(22a)
s.t.

It can be easily verified that constraint (22) is non-convex, resulting in a non-convex optimization problem. Recall that the values of in subproblems and are identical. Therefore, we first keep fixed and then determine the joint optimal batchsize selection and uplink communication resource allocation policy. When is fixed, the problem becomes a convex one, as presented in the following lemma.

Lemma 1

Given the value of , problem can be converted into a classical convex optimization problem.

Proof:

The proof is straightforward: when the value of is fixed, the objective function (22a) is convex and all constraints become affine. The detailed derivation is omitted for brevity.

Lemma 1 is essential for solving problem by applying the fractional optimization method. Moreover, classical convex optimization algorithms can also be utilized. To better characterize the structure of the solution and gain more insight, we first define some auxiliary indicators for each device, as follows.

  • Local training speed is defined as the speed of performing local forward-backward propagation algorithm at each device, i.e., .

  • Training priority ratio is defined as the ratio between the local computation capacity and the overall devices’ computation capacity, i.e., .

Based on the above definitions, the optimal solution to problem with fixed can be described as follows.

Theorem 1

The joint optimal batchsize selection and uplink communication resource allocation policy is given by

(23)

where and are the optimal values satisfying the active time-sharing constraint and the global batchsize limitation , respectively. Here, the operations and .

Proof:

Please refer to Appendix A.

Remark 2

(Threshold-based Batchsize Selection) Theorem 1 reveals that the optimal batchsize has a threshold-based structure and is mainly determined by three parameters, i.e., the local training speed, the training priority ratio, and the uplink data rate. More precisely, the batchsize scales linearly with the local training speed and increases with both the training priority ratio and the uplink data rate in the power of . This result is intuitive and plays an important role in hyperparameter tuning. On the one hand, it theoretically guides devices to increase the batchsize when the local training speed increases or the channel condition becomes better. On the other hand, devices with higher training priority are suggested to choose a larger batchsize since they are better able to promote the training process by speeding up the global loss convergence.
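The closed form in (23) is not reproduced here, so the following is only a schematic illustration of the threshold-based structure described in Remark 2; the scaling constant and the exponent are placeholders, not values from the paper.

def optimal_batchsize(train_speed, priority_ratio, uplink_rate,
                      b_min, b_max, scale, exponent):
    # Grows linearly with the local training speed and as a power of the
    # training priority ratio and the uplink data rate, then is clipped to
    # the feasible interval [b_min, b_max] (the threshold-based structure).
    raw = scale * train_speed * (priority_ratio * uplink_rate) ** exponent
    return min(max(raw, b_min), b_max)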

Remark 3

(Adaptive Resource Allocation) Theorem 1 indicates that the optimal uplink resource allocation depends not only on the uplink data rate and the local training speed, but also on the training batchsize. Specifically, this solution guarantees that the edge server receives all local gradient vectors simultaneously, resulting in a synchronous manner without any waiting delay. This result guides devices to balance the learning and communication performance, which can be regarded as a learning-communication tradeoff. Concisely, the amount of time-slot resource decreases with the uplink data rate since devices with better channel quality require less communication resource. In addition, when the local training speed decreases, the device should occupy more time-slot resources to reduce its end-to-end latency.

Now we determine the optimal values of the above two parameters, i.e., and . A classical method is to perform a two-dimensional search. To facilitate the search process and reduce the computational complexity, we calculate some useful bounds for these two parameters in the following. On the one hand, by investigating a special case where each device trains with the same batchsize and is allocated identical time-slot resources, we can obtain the upper bound of . On the other hand, we relax the constraint (18f) and further apply the Karush-Kuhn-Tucker (KKT) conditions to obtain the lower bound of . With some mathematical analysis, the range of can be expressed as in the following corollary, which is proved in Appendix B.

Corollary 1

The value of shall satisfy

(24)

Based on Corollary 1, we further investigate two extreme cases to determine the range of . The first corresponds to the scenario where the optimal batchsize of every device is . The second corresponds to the scenario of . Then the range of can be expressed as a function of , which is presented in the following corollary and is also proved in Appendix B.

Corollary 2

When there exists at least one device whose optimal batchsize is in the interval , the value of should satisfy

(25)
Remark 4

From Corollary 2, we can observe that the values of and are tightly coupled. Note that even though this result is derived under the assumption that at least one device's batchsize lies between and , it is still applicable since the extreme cases of and rarely occur in practice. On the other hand, these two extreme cases correspond to the cases of and , whose solutions can be directly obtained via the KKT conditions.

Based on the above analysis, we now develop an effective two-dimensional search algorithm to solve subproblem , as described in Table I; a plain Python sketch of the same search structure follows the table. The main idea is to update the values of the training batchsize and the uplink communication resource in each iteration until the time-sharing constraint and the global batchsize limitation are both satisfied. It is easy to prove that this algorithm has a computational complexity of and thus can be easily implemented in practical systems, where is the maximum tolerance.

1:  Initialization
  • Set the maximum tolerance .

  • Calculate and according to (24).

  • Calculate and based on Theorem 1, involving the one-dimensional search for and , respectively.

2:  while or , do
3:      Define .
4:      Calculate with one-dimensional search for .
5:      if , then
6:          , break.
7:      else
8:          if , then
9:              .
10:          else
11:              .
12:          end if
13:          Update and .
14:      end if
15:  end while
Algorithm 1 Two-dimensional search algorithm for subproblem .
TABLE I: The Solution to Subproblem
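As a companion to Algorithm 1, the sketch below shows the overall bisection structure of the two-dimensional search in plain Python. The inner routine solve_inner is a hypothetical stand-in for the one-dimensional search over the second parameter together with the evaluation of Theorem 1, since the closed-form expressions it would use are not reproduced in this text.

def two_dimensional_search(solve_inner, lam_lo, lam_hi, tol=1e-4, max_iter=100):
    # solve_inner(lam) is assumed to return (residual, solution): residual > 0
    # means the global batchsize budget is still exceeded for this lam.
    lam, solution = lam_lo, None
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)          # bisect the outer parameter
        residual, solution = solve_inner(lam)  # inner one-dimensional search hidden here
        if abs(residual) <= tol:               # both budgets met within tolerance
            break
        if residual > 0:
            lam_lo = lam                       # tighten the interval from below
        else:
            lam_hi = lam                       # tighten the interval from above
    return lam, solution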

IV-C Solution to Subproblem and Global Discussion

In this part, we first solve subproblem and then discuss the global solution to the original problem . Similar to subproblem , we denote as the maximum reciprocal of the downlink learning efficiency among all devices. Then, subproblem can be reformulated as

(26a)
s.t.

Given the value of , the mathematical characteristics of problem are similar to those of problem . Therefore, it can be solved using the KKT conditions. The closed-form solution to is presented in the following theorem. The proof is similar to that of Theorem 1 and is omitted here due to page limits.

Theorem 2

The optimal downlink communication resource allocation policy is given by

(27)

where is the optimal value associated with the time-sharing constraint .

Remark 5

(Consistent Resource Allocation) Theorem 2 indicates that the optimal downlink resource decreases with the downlink data rate. The reason is similar to that for Theorem 1. Note that by this means, the local model updating of all devices can be accomplished simultaneously. Combining this with the results in Theorem 1, we conclude that the overall end-to-end latencies of all devices should be identical, leading to a synchronous training system.

So far, we have presented the closed-form expressions of the joint optimal batchsize selection and uplink/downlink time-slot resource allocation as functions of the global batchsize . As a result, problem reduces to a univariate optimization problem and thus can be effectively solved by a classical gradient descent algorithm.

V Extension to GPU Scenario

In this section, we consider the scenario where each device is equipped with a GPU for DNN training. We will first propose a novel GPU training function and then extend the training acceleration problem to this case. In particular, the optimal solution is proved to have a structure similar to that in the CPU scenario.

V-A GPU Training Function

Unlike the serial mode of general CPUs, GPUs perform computation in a parallel mode. In this situation, the local gradient calculation latency is no longer proportional to the training batchsize. Specifically, when the training batchsize is small, the GPU can directly process all data simultaneously, leading to a constant training latency. On the contrary, the latency grows once the training batchsize exceeds the maximum volume of data that the GPU can process at a time. With this consideration, we propose the following function to capture the relation between the local gradient calculation latency and the training batchsize, as presented in the following assumption and shown in Fig. 2(a).

Assumption 1

In the GPU scenario, the relation between the local gradient calculation latency and the training batchsize of each device is given by

(28)

where , , and are three coefficients determined by the specific DNN structure (e.g., the number of layers and the number of neurons) and the concrete GPU configuration (e.g., the video memory size and the number of cores).

(a) Theoretical result.
(b) Experimental result.
Fig. 2: Local gradient calculation latency w.r.t. training batchsize of general GPUs.
Remark 6

As discussed earlier, when the training batchsize is below a threshold , the local gradient calculation latency remains fixed because of the GPU's parallel execution capability. In this case, the data samples are inadequate and the computation resource is not fully exploited. Therefore, we call it the data bound region. On the other hand, the local gradient calculation latency grows linearly once the training batchsize exceeds the threshold, because the computation resource (e.g., the video memory) cannot support processing all data samples simultaneously. We note that the local gradient calculation latency does not increase in a ladder form. The reason is rather intuitive: the operations of data reading and transferring between memory modules and processing modules need additional time, leading to an overall linear trend. Consequently, we call this region the compute bound region.
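Assumption 1 describes a piecewise behavior: a constant latency below the parallelism threshold (data bound region) and linear growth beyond it (compute bound region). A minimal sketch, with the coefficient names t_base, threshold, and slope chosen here for illustration because the symbols in (28) were lost:

def gpu_calc_latency(batchsize, t_base, threshold, slope):
    # Data bound region: the GPU processes the whole batch in parallel,
    # so the latency stays constant.
    if batchsize <= threshold:
        return t_base
    # Compute bound region: latency grows roughly linearly with the samples
    # beyond what the GPU can process at a time.
    return t_base + slope * (batchsize - threshold)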

To validate Assumption 1, we implement three classic DNN models, i.e., DenseNet, GoogleNet, and PNASNet, on a Linux server equipped with three NVIDIA GeForce GTX 1080 Ti GPUs. The experimental results are depicted in Fig. 2(b). It can be observed that the local gradient calculation latency of each model first remains invariant and then increases almost linearly with the training batchsize. The curves obtained via experiments fit the theoretical model in (28) very well, which demonstrates the applicability of the proposed GPU training function.

V-B Problem Formulation and Analysis

Similar to the CPU scenario, the learning efficiency in the GPU scenario can also be defined as the ratio between the global loss decay and the end-to-end latency of each training period. Note that the data sizes of both the local gradient and the global gradient in the GPU scenario are identical to those in the CPU scenario. Thus, the local gradient upload latency and the global gradient download latency can still be expressed as (12) and (13), respectively. Besides, let (in floating-point operations per second) denote the computation capability of the GPU module of device and denote the number of floating-point operations required for the model update. Then, the local model update latency in this case can be expressed as

(29)

Based on the above analysis, the end-to-end latency of each training period in the GPU scenario is given by

(30)

Consequently, the training acceleration problem can be formulated as

s.t.

The above problem is not easy to solve since the local gradient calculation latency is non-differentiable. The traditional method for solving problem is the sub-gradient algorithm, which performs iterative optimization over the three sets of variables via sub-gradient functions and results in high computational complexity. To address this issue, we analyze the mathematical characteristics of the learning efficiency in (31) and derive a necessary condition on the solution to problem , as summarized in the following lemma.

Lemma 2

In the GPU scenario, the optimal batchsize of each device shall lie in the compute bound region, i.e., .

Proof:

Please refer to Appendix C.

The result in Lemma 2 coincides with the empirical observation in practical systems that the computation resources of all devices should be fully exploited to achieve the largest learning efficiency. Accordingly, the data bound region can be neglected and can be reformulated as

(32a)
s.t. (32c)

By comparison, we can observe that the structures of problems and are identical except for the expressions of the local gradient calculation latency. Fortunately, both latencies are affine in the batchsize, so the two problems have essentially the same structure. Therefore, the GPU scenario can be reduced to the CPU scenario, and the algorithms for solving problem remain applicable with a minor modification of the delay expressions. The detailed derivations and results are omitted due to page limits.

VI Experiments

VI-A Experiment Settings

In this section, we conduct experiments to validate our theoretical analysis and evaluate the performance of the proposed algorithms. The system setup is summarized as follows unless otherwise specified. We consider a single-cell network with a radius of m. The BS is located at the center of the network and the devices are uniformly distributed in the cell. The gains of both the uplink and downlink channels are generated following the path-loss model [km], where the small-scale fading is Rayleigh distributed with uniform variance. The transmit powers of the uplink and downlink channels are both dBm. The system bandwidth is MHz and the channel noise variance is dBm. The lengths of the uplink and downlink radio frames are both set as ms according to the LTE standards. For each device, the average number of quantization bits per gradient is bits and the maximum batchsize is bounded by data samples.

To test the system performance, we choose three commonly-used DNN models for image classification: DenseNet121, ResNet18, and MobileNetV2. The well-known CIFAR-10 dataset is used for model training, which consists of 50000 training images and 10000 validation images of 32×32 pixels in 10 classes. Standard data augmentation and preprocessing methods are employed to improve the test accuracy, and effective deep gradient compression methods are also used to reduce the communication overhead [23]. Moreover, to simulate distributed mobile data, we study two typical ways of partitioning the CIFAR-10 data over devices [27]: 1) the IID case, where all data samples are first shuffled and then partitioned into equal parts, and each device is assigned one particular part; 2) the non-IID case, where all data samples are first sorted by class label and then divided into shards of size , and each device is assigned two shards. Note that the latter is a pathological non-IID partition since most devices only obtain two classes.
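The two partitioning strategies can be sketched as follows; the two-shards-per-device rule follows the text, while the function names and the use of numpy are illustrative choices.

import numpy as np

def partition_iid(labels, num_devices, rng):
    # Shuffle all sample indices and split them into equal parts, one per device.
    idx = rng.permutation(len(labels))
    return np.array_split(idx, num_devices)

def partition_non_iid(labels, num_devices, rng, shards_per_device=2):
    # Sort samples by label, cut the sorted index list into shards, and assign
    # each device two randomly chosen shards (a pathological non-IID split).
    sorted_idx = np.argsort(labels)
    shards = np.array_split(sorted_idx, num_devices * shards_per_device)
    order = rng.permutation(len(shards))
    return [np.concatenate([shards[order[d * shards_per_device + s]]
                            for s in range(shards_per_device)])
            for d in range(num_devices)]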

VI-B CPU Scenario: Generalization with Different DNN Models

An important test of the proposed scheme is whether it is applicable to training different DNN models, i.e., its generalization ability. To this end, we implement the above three DNN models in an FEEL system with user devices. Specifically, the CPU frequencies of the devices are configured as follows: devices with GHz, devices with GHz, and devices with GHz.

We test both the loss convergence speed and learning accuracy improvement of each DNN model with different learning rates. To be more realistic, we implement the non-IID case and the results are presented in Fig. 3. It can be observed that our proposed scheme is capable of attaining a satisfactory learning accuracy with a fast loss convergence speed for all three models. Moreover, the ultimate learning accuracy can be well guaranteed with different learning rates. This result demonstrates the strong generalization ability of the proposed algorithm, which promotes its wide implementation in practical systems.

(a) Global training loss
(b) Test accuracy
Fig. 3: Global training loss and test accuracy with different learning rates.

VI-C CPU Scenario: Performance Comparison among Different Schemes

In this part, we test the performance improvement of our proposed scheme as compared with three benchmark schemes in the scenarios of and devices, respectively. Without loss of generality, we take a pre-trained DenseNet121 with initial learning accuracy for the following test. Both the IID and non-IID cases are implemented. The detailed procedures of each scheme can be summarized as follows.

  • Individual Learning: Each device trains its DNN model until the local loss function converges. Then the edge server aggregates all local models, averages the parameters, and transmits the results to each device.

  • Model-Based FL: Each device trains its DNN model on its local dataset for one epoch. Thereafter, the model parameters of each device are transmitted to the edge server for aggregation. The results are then transmitted back to each device for the next training epoch, until the global loss function converges [18].

  • Gradient-Based FL: Each device trains its DNN model by running a one-step SGD algorithm. Then the gradients of each device are transmitted to the edge server for aggregation. After that, the global gradient is transmitted back to each device. This process is performed periodically until the global loss function converges [32].

Clearly, the main difference between our proposed scheme and the gradient-based FL scheme is that we take into account the joint batchsize selection and communication resource allocation.

We use two metrics to evaluate the training performance of each scheme, i.e., the test accuracy and the training speedup, which is defined as the ratio between the training speed of each scheme and that of the individual learning scheme.

(a)
Scheme | Test accuracy (IID) | Training speedup (IID) | Test accuracy (non-IID) | Training speedup (non-IID)
Individual learning | | | |
Model-based FL | | | |
Gradient-based FL | | | |
Our proposed scheme | | | |

(b)
Scheme | Test accuracy (IID) | Training speedup (IID) | Test accuracy (non-IID) | Training speedup (non-IID)
Individual learning | | | |
Model-based FL | | | |
Gradient-based FL | | | |
Our proposed scheme | | | |

TABLE II: Training Performance of Different Schemes

Table II(a) shows the training performance of each scheme in the scenario of devices. Compared with the individual learning scheme, our proposed scheme can speed up the training task by about times with about learning accuracy improvement in the IID case, and by about times with about learning accuracy improvement in the non-IID case. These results demonstrate that the proposed scheme can accelerate the training process and improve the learning accuracy simultaneously. Moreover, compared with the model-based FL scheme, our scheme achieves about times the training speed in the IID case and about times the training speed in the non-IID case, both with a slight learning accuracy improvement. The reason can be explained as follows. The gradients of each device can be deeply compressed because of their sparsity; therefore, the communication overhead for gradient transmission is much smaller than that for parameter transmission. Besides, the gradients contain more information than the parameters and match the essential operations of the mini-batch SGD algorithm closely. Thus, the learning accuracy of the proposed scheme is better than that of the model-based scheme. On the other hand, compared with the gradient-based scheme, our scheme achieves about times and times the training speed in the IID and non-IID cases, respectively, with almost the same learning accuracy. This is intuitive: our scheme optimally selects the training batchsize to effectively balance the communication overhead and the training workload, whereas the gradient-based scheme trains each local model using the whole dataset without considering the communication-learning tradeoff.

We further test the training performance of each scheme in the scenario of devices. The results are presented in Table II(b). It can be observed that our proposed scheme still achieves the fastest training speed as well as a satisfactory learning accuracy improvement among all schemes, which demonstrates the superiority and scalability of our proposal. Moreover, as the number of devices increases, both the learning accuracy improvement and the training speed improvement of the proposed scheme become more evident compared with the individual learning scheme. This is because, with more devices, more computation resources can be exploited to fully train the DNN model in our scheme, resulting in a higher learning accuracy. In particular, the accuracy gap between the IID case and the non-IID case in the individual learning scheme is larger than those of the other schemes. As one might expect, the individual learning scheme cannot properly grasp the characteristics of the non-IID data because it performs local training without any collaboration. Conversely, the other three schemes dynamically share the local learning updates, so their accuracy gaps between the IID and non-IID cases are much smaller. This result also suggests the superiority of the proposed scheme in dealing with non-IID data in distributed FL systems.

VI-D GPU Scenario: Performance Comparison among Different Schemes

(a) Global training loss vs training time.
(b) Test accuracy vs training time.
Fig. 4: Performance comparison among different training schemes in the IID case.
(a) Global training loss vs training time.
(b) Test accuracy vs training time.
Fig. 5: Performance comparison among different training schemes in the non-IID case.

An important test is whether the proposed joint batchsize selection and communication resource allocation policy can accelerate the training task while guaranteeing the learning accuracy in the GPU scenario. Therefore, we compare the training performance of the proposed scheme with three baseline schemes in the scenario of devices. Similarly, DenseNet121 is tested and both the IID and non-IID cases are evaluated. The three baseline schemes are summarized as follows.

  • Online learning scheme, where the training batchsize of each device is .

  • Full batchsize scheme, where the training batchsize of each device is .

  • Random batchsize scheme, where each device randomly selects a training batchsize between and in each training period.

Fig. 4 and Fig. 5 show the loss convergence speed and the learning accuracy in the IID case and the non-IID case, respectively. We can observe from all plots that the proposed scheme achieves the fastest training speed while attaining the highest learning accuracy at the same time. The reason is that our proposed scheme optimally allocates the time-slot resources and effectively selects an appropriate training batchsize to balance the communication overhead and the training workload. Further, the learning accuracies of the proposed scheme in the IID and non-IID cases are almost the same given enough training time. This is because, in the proposed scheme, the edge server can aggregate the data information under different kinds of distributions. Therefore, the learning accuracy in the non-IID case does not deteriorate compared with that in the IID case. This also suggests the applicability of the proposed scheme to hyperparameter adjustment in practical systems with non-IID data.

VII Conclusion

This paper aims at accelerating the DNN training task in the FEEL framework, which not only protects users' privacy but also improves the system learning efficiency. The key idea is to jointly optimize the local training batchsize and the wireless resource allocation to achieve fast training while maintaining the learning accuracy. In the common CPU scenario, we formulate a training acceleration optimization problem after analyzing the global loss decay and the end-to-end latency of each training period. The closed-form solution for joint batchsize selection and communication resource allocation is then developed by problem decomposition and classical optimization tools. Some insightful results are also discussed to provide meaningful guidelines for hyperparameter adjustment and network planning. To gain more insights, we further extend our framework to the GPU scenario and develop a new GPU training function. The optimal solution in this case can be derived by fine-tuning the results in the CPU scenario, indicating that our proposed algorithm is applicable to more general systems. Our studies in this work provide an important step towards the implementation of AI in wireless communication systems.

Appendix A Proof of Theorem 1

According to Lemma 1, problem is a classical convex optimization problem for a given value of . Therefore, it can be solved by the Lagrange multiplier method. The partial Lagrange function can be defined as

(33)

where , , and are the Lagrange multipliers associated with constraints (22), (18f), and (18f), respectively. Denote as the optimal solution to . Note that the uplink communication resource is non-negative while the batchsize of each device is bounded in the interval . Therefore, applying the KKT conditions gives the following necessary and sufficient conditions:

(34)
(35)
(36)
(37)
(38)
(39)
(40)
(41)

With simple mathematical calculation, we can derive the optimal batchsize selection policy as

(42)

Moreover, the minimum is achieved when the inequality in (38) holds with equality. Combining (38) and (42), we obtain the optimal time-slot allocation in Theorem 1.

Appendix B

B-A Proof of Corollary 1

To prove Corollary 1, we first analyze the following two cases.
1) Case A (Equivalent resource allocation)

In this case, the time-slot resource allocated to each device is identical and each device selects the same training batchsize for local gradient calculation, i.e., and . As a result, the corresponding objective value in (22a) can be derived as

(43)

Since the goal of problem is to minimize , the optimal is no greater than , which can be regarded as an upper bound of .
2) Case B (Infinite memory resource)

In this case, the memory of each device is sufficient, so the batchsize limitation (18f) can be relaxed. Further, the convexity of problem is preserved, making the KKT conditions still effective. With derivations similar to those in Appendix A, we can obtain the objective value as

(44)

Since the objective value will not increase after the constraint relaxation, is no less than . Thus, can be viewed as a lower bound of . Combining (43) and (44), we can express the range of in Corollary 1, which completes the proof.

B-B Proof of Corollary 2

According to Theorem 1, we can observe that the optimal batchsize has a threshold-based structure. Similar to the proof of Corollary 1, we also consider two cases.
1) Case A (Online learning)

In this case, we assume that the optimal batchsize of each device is . According to (23), this case happens when , resulting in

(45)

2) Case B (Full batch learning)

Similar to case A, this case demands that the optimal batchsize of each device satisfies . Therefore, this case occurs when , leading to

(46)

Based on the above analysis, when there exists at least one device whose batchsize is between and , the value of will satisfy , which completes the proof.

Appendix C Proof of Lemma 2

The proof of Lemma 2 is straightforward by contradiction. Observe that the global loss decay is an increasing function of each training batchsize . However, the local gradient calculation latency remains unchanged when . Therefore, in the data bound region, the global loss decay (the numerator of (31)) increases with while the end-to-end latency (the denominator of (31)) remains unchanged, resulting in an increasing learning efficiency. Since the objective of problem is to maximize the learning efficiency, the optimal batchsize will not lie in the data bound region. This completes the proof.

References