Traditional machine learning adopts a centralized approach in which the training data are aggregated on a single machine. On the one hand, such centralized training is privacy-intrusive, especially when the data are collected by mobile devices and contain the owners’ sensitive information (e.g., locations, user preferences on websites, social media, etc.). On the other hand, transmitting all the collected data from mobile devices is impractical due to communication resource limitations. With these considerations, the concept of federated learning (FL), which enables training on a large corpus of decentralized data residing on mobile devices, is proposed in [konevcny2016federated].
As a distributed training approach, FL adopts the parameter server paradigm in which most of the computation is offloaded to the mobile devices in a parallel manner and a parameter server is used to coordinate the training process. During each iteration, after receiving the FL model parameters from the server, the workers (i.e., mobile devices) train their local FL models using their local data and transmit the parameter updates to the server, which will aggregate the information from all the workers and send the aggregated result back. Since all the communications between the workers and the server are over wireless links, the learning performance depends on the wireless environments as well as the workers’ resource constraints. There have been some works that study the communication aspects of FL [tran2019federated, zeng2019energy, vu2019cell, ren2019accelerating, chen2019joint, zhu2020one, amiri2019machine, amiri2019federated]. Nonetheless, they either do not consider the existing strategies that have shown promising improvement in communication efficiency (e.g., gradient quantization [seide20141]) or ignore the energy consumption of the workers and the impact of transmission errors. In addition, all these works assume perfect channel-state information (CSI) at both the server side and the worker side, which may not be reasonable in practice.
It is worth mentioning that in real-world FL applications over wireless networks, the communication time between the server and the workers is not negligible. Therefore, it becomes more critical to improve the learning performance with respect to the total training time instead of the number of rounds. With this consideration, the implementation of FL algorithms given a fixed training time is considered in this work. In addition, the idea of SignSGD with majority vote [bernstein2018signsgd1] is adopted to improve the communication efficiency of the FL algorithm, in which only the signs of the parameter updates are shared between the server and the workers. The workers are assumed to transmit their parameter updates over flat-fading channels and CSI is only available at the receiver side. Channel capacity with outage is considered and each worker is supposed to determine its transmission rate and power. In such a case, the learning performance is determined by the number of communication rounds that the FL algorithm can be run and the outage probability per communication round. On the one hand, increasing the transmission power decreases the outage probability. On the other hand, a larger transmission power results in higher energy consumption. Similarly, increasing the transmission rate decreases the communication time and therefore increases the number of communication rounds given a fixed training time, while a larger transmission rate (with fixed transmission power) results in a higher outage probability per communication round. Such tradeoffs play essential roles in the implementation of the FL algorithm and are therefore the main focus of this work. More specifically, our main contributions are summarized as follows.
Two tradeoffs in the implementation of SignSGD over wireless networks are identified. (i) Given the time for each communication round, the energy consumption versus the outage probability per communication round. (ii) Given the energy consumption, the number of communication rounds that the FL algorithm can be run given a fixed training time versus the outage probability at each communication round.
Two optimization problems are formulated and solved. The first problem minimizes the energy consumption given the outage probability (and therefore the learning performance) requirement while the second problem optimizes the learning performance given the energy consumption requirement.
Extensive simulations are performed to demonstrate the effectiveness of the proposed method.
The remainder of this work is organized as follows. Section II discusses the related works. Section III introduces the system model. Some analysis of the performance of SignSGD over wireless networks is provided in Section IV. The optimization problems are formulated in Section V and the corresponding solutions are presented in Section VI. Section VII presents the simulation results. Conclusions are presented in Section VIII.
II Related Works
To improve the communication efficiency of the distributed learning algorithms, various methods have been proposed, including quantization [seide20141, alistarh2017qsgd, bernstein2018signsgd1, wu2018error, wen2017terngrad, agarwal2018cpsgd], sparsification [sattler2019sparse, sattler2019robust, wang2018atomo] and subsampling [konevcny2016federated, caldas2018expanding]. However, most of these works ignore the impact of wireless environments and the resource constraints of the mobile devices, which are of vital importance in the implementation of FL algorithms over real-world wireless networks.
Recently, a number of works have studied the implementation of FL algorithms over wireless networks. In [tran2019federated], the weighted sum of the training time and the energy consumption is optimized by properly selecting the local computation parameters and the communication time allocated to each user. However, the formulation of the optimization problem and the proposed method rely on the assumption that the loss function is strongly convex. [zeng2019energy] also considers the energy consumption of the communications between the mobile devices and the server. The goal is to minimize the weighted sum of the energy consumption and the number of participating mobile devices by mobile device scheduling and effective bandwidth allocation. [vu2019cell] considers a cell-free massive MIMO scenario and the training time is minimized by jointly optimizing the local computation and the communication parameters. [ren2019accelerating] empirically proposes a learning efficiency metric which is a function of the mini-batch size and the time of a communication round. Resource allocation and the mini-batch size are jointly optimized to maximize the learning efficiency. [chen2019joint] takes the effect of packet transmission errors into consideration and analyzes its impact on the performance of FL. A joint bandwidth allocation and mobile device selection problem is formulated and solved to minimize an FL loss function that captures the performance of the FL algorithm. However, in these works, the effective strategies for improving communication efficiency mentioned above are not considered. [zhu2020one] adopts gradient quantization and proposes a one-bit broadband over-the-air aggregation scheme. The impact of wireless channel hostilities is analyzed. [amiri2019machine] and [amiri2019federated] propose to combine the quantization, sparsification and error compensation schemes; however, the energy consumption of the devices as well as the impact of transmission errors are ignored in these two works. Moreover, all these works assume CSI at both the transmitter side and the receiver side.
In this work, we adopt the idea of SignSGD with majority vote [bernstein2018signsgd1] in the design of the communication system and consider flat-fading channels with receiver-only CSI.
III System Model
In this work, a wireless multi-user system consisting of one parameter server and a set of workers is considered. In particular, each worker stores a local dataset, which will be used for local training. The local dataset can be locally generated or collected through each worker’s usage of mobile devices. Considering that the training of a prediction model, especially in deep learning, usually requires a large dataset, the goal of the workers is to cooperatively learn a machine learning model while keeping the local training data at their mobile devices.
III-A Machine Learning Model
A typical federated optimization problem with normal workers is considered. Formally, the goal is to minimize a finite-sum objective of the form
For a machine learning problem, we have a sample space , where is a space of feature vectors and is a label space. Given the hypothesis space , we define a loss function which measures the loss of prediction on the data point made with the hypothesis vector . In such a case, is a local function defined by the local dataset of worker and the hypothesis . More specifically,
where is the size of worker ’s local dataset . The loss function depends on the learning tasks and the machine learning models.
To accommodate the requirement of communication efficiency in FL, we adopt the popular idea of gradient quantization as in SignSGD with majority vote [bernstein2018signsgd1], which is presented in Algorithm 1. At the -th communication round, each worker computes the gradient based on its locally stored model weights and the local dataset . Then, instead of transmitting the gradient directly, the worker transmits to the parameter server, in which is the sign function. After receiving the shared signs of the gradients from the workers, the parameter server performs aggregation using the majority vote rule and sends the aggregated result back to the workers. Finally, the workers update their local model weights using the aggregated result.
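The round described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than Algorithm 1 itself: the function name `sign_sgd_round`, the learning rate `lr` and the toy gradients are assumptions introduced here, and the wireless channel is ignored.

```python
import numpy as np

def sign_sgd_round(weights, local_grads, lr):
    """One communication round of SignSGD with majority vote (illustrative).

    weights:     shared model weights, shape (d,)
    local_grads: list of per-worker gradient vectors, each of shape (d,)
    lr:          learning rate (hypothetical parameter name)
    """
    # Each worker transmits only the sign of its gradient (1 bit per entry).
    worker_signs = [np.sign(g) for g in local_grads]
    # The server aggregates by majority vote: the sign of the summed signs.
    aggregate = np.sign(np.sum(worker_signs, axis=0))
    # Every worker applies the same aggregated sign update.
    return weights - lr * aggregate, aggregate

# Toy example with three workers and a three-dimensional model.
w = np.zeros(3)
grads = [np.array([0.5, -1.0, 2.0]),
         np.array([0.1, 0.3, -0.7]),
         np.array([-0.2, -0.4, 1.5])]
new_w, vote = sign_sgd_round(w, grads, lr=0.1)
```

In the toy example two of the three workers agree on the sign of every entry, so the vote follows that majority even though the gradient magnitudes differ widely.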
III-B Local Computation Model
In this work, we consider a similar local computation model as those in [tran2019federated] and [chen2019joint]. Let and denote the number of CPU cycles required for worker to process per bit data and its CPU cycle frequency, respectively, which are assumed known to the parameter server. Then, the CPU energy consumption of worker for one local iteration of computation is given by [burd1996processor]
in which is the effective capacitance coefficient of worker ’s computing chip, is the size of worker ’s training data per iteration (in bits). In addition, the computation time per local iteration of worker is given by
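As a minimal sketch of this local computation model, the snippet below assumes the commonly used CMOS form in which the energy per local iteration is (alpha/2)·c·D·f² and the computation time is c·D/f, where c is the number of CPU cycles per bit, D is the data size in bits, f is the CPU frequency and alpha is the effective capacitance coefficient. The function names and all numeric values are hypothetical and serve only as an illustration.

```python
def cpu_energy(alpha, cycles_per_bit, data_bits, freq_hz):
    """CPU energy (J) for one local iteration, assuming E = (alpha/2) * c * D * f^2."""
    return 0.5 * alpha * cycles_per_bit * data_bits * freq_hz ** 2

def cpu_time(cycles_per_bit, data_bits, freq_hz):
    """Computation time (s) for one local iteration, assuming T = c * D / f."""
    return cycles_per_bit * data_bits / freq_hz

# Hypothetical values: alpha = 2e-28, c = 20 cycles/bit, D = 1e6 bits, f = 1 GHz.
energy_1ghz = cpu_energy(2e-28, 20, 1e6, 1e9)
time_1ghz = cpu_time(20, 1e6, 1e9)

# Doubling the CPU frequency halves the time but quadruples the energy.
energy_2ghz = cpu_energy(2e-28, 20, 1e6, 2e9)
time_2ghz = cpu_time(20, 1e6, 2e9)
```

Under this assumed form, a higher CPU frequency shortens the local computation at a quadratic cost in energy, which is the effect that couples the CPU frequency to the delay and energy constraints considered later.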
III-C Transmission Model
In this work, it is assumed that the workers transmit their local updates (i.e., the signs of the gradients) to the parameter server via orthogonal frequency division multiple access (OFDMA) and do not interfere with each other. Given that the parameter server has more power and bandwidth compared to the mobile devices, the downlink transmission time is ignored in this work for simplicity (for a fixed transmission rate, the downlink transmission time is a constant that can be readily integrated into the first and the second constraints of the optimization problems (14) and (15), respectively, if needed). Moreover, similar to most of the existing literature (e.g., [tran2019federated, chen2019joint]), it is assumed that the downlink transmissions are error-free.
For the uplink transmission, different from the existing works that consider perfect CSI at both the transmitter side and the receiver side, we consider flat-fading channels with receiver-only CSI and the capacity with outage. We assume a discrete-time channel with stationary and ergodic time-varying gain following a Rayleigh distribution, and additive white Gaussian noise (AWGN) for each worker.
Capacity with outage is defined as the maximum rate that can be transmitted over a channel with some outage probability corresponding to the probability that the transmission cannot be decoded with negligible error probability [goldsmith2005wireless]. Suppose that worker transmits with a rate of , in which is some fixed minimum received SNR. The data can be correctly received if the instantaneous received SNR is greater than or equal to . Here, is the transmission power of worker , is the noise power spectral density and is the corresponding bandwidth. The probability of outage is thus . In particular, for a Rayleigh fading channel, we have
The corresponding communication time and energy consumption are given by
in which is the size of the transmitted data. (Note that in schemes where full-precision gradients are transmitted, each worker is supposed to transmit 32 bits for each element in the gradient vectors. However, Algorithm 1 only requires 1 bit by transmitting the signs and therefore leads to an improvement of 32 times in communication time as well as communication energy consumption.) In addition, also depends on the machine learning model. For instance, in a softmax regression model for -class classification tasks, .
For simplicity, it is assumed that for worker , is transmitted as a single packet in the uplink and the whole packet is decoded incorrectly when an outage happens.
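The transmission model can be sketched as follows. The snippet assumes a unit-mean channel power gain, so that the average received SNR is P/(N0·B), and uses the standard Rayleigh outage expression 1 − exp(−γ_min/γ̄) with γ_min = 2^(R/B) − 1. The communication time is taken as the payload size divided by the rate and the energy as power times time. All function names and numeric values are hypothetical.

```python
import math

def outage_probability(rate, power, noise_psd, bandwidth):
    """Rayleigh outage probability for rate R = B * log2(1 + gamma_min).

    Assumes a unit-mean channel power gain (an assumption of this sketch).
    """
    gamma_min = 2 ** (rate / bandwidth) - 1      # minimum SNR needed for this rate
    mean_snr = power / (noise_psd * bandwidth)   # average received SNR
    return 1 - math.exp(-gamma_min / mean_snr)

def comm_time(size_bits, rate):
    """Uplink time (s) to send size_bits at the given rate (bits/s)."""
    return size_bits / rate

def comm_energy(size_bits, rate, power):
    """Uplink energy (J): transmission power times transmission time."""
    return power * comm_time(size_bits, rate)

# Hypothetical setting: B = 15 kHz, N0 = 1e-8 W/Hz, 7850-bit payload.
B, N0, S = 15e3, 1e-8, 7850
p_low_rate = outage_probability(10e3, 0.1, N0, B)
p_high_rate = outage_probability(30e3, 0.1, N0, B)
p_more_power = outage_probability(30e3, 0.2, N0, B)
```

With the power fixed, the higher rate finishes the upload sooner but suffers a larger outage probability, while doubling the power at the same rate lowers the outage probability at the cost of twice the energy.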
IV Analysis of the Performance of Algorithm 1 over Wireless Networks
Before diving into the details of the system design, we first analyze how the characteristics of wireless networks affect the performance of Algorithm 1. In order to facilitate the analysis, the following commonly adopted assumptions are made.
Assumption 1 (Lower bound). For all and some constant , we have objective value .
Assumption 2 (Smoothness). For all , we require for some non-negative constant
where is the standard inner product.
Given the above assumptions, the following result can be proved.
Theorem 1. Suppose that the model parameter at the beginning of the -th iteration is . Then, by performing one iteration of Algorithm 1, we have
in which is the dimension of the gradients; is the -th entry of the gradient vector and is the -th entry of the vector after taking the sign operation. The expectation and the probability are over the dynamics of the wireless channels.
Proof. Please see Appendix A. ∎
Note that given fixed , the right-hand side of (9) depends on the probability of the signs of the aggregation result being the same as those of the true gradients (i.e., ). For ease of discussion, we consider the -th entry of the gradients and define a series of random variables given by
can be considered as the outcome of one Bernoulli trial with success probability . Let , then it can be verified that follows the Poisson binomial distribution with mean . Since is non-negative, Markov’s inequality gives
Note that and are the expected number of workers that share wrong and correct signs, respectively. The lower bound in (13) represents the difference between the ratios of workers that share the correct signs and that share the wrong signs.
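The Poisson binomial argument can be checked numerically. The sketch below assumes, for illustration only, that the random sum counts the workers whose received sign is wrong, so that the majority vote can only fail when at least half of the workers are wrong. The per-worker probabilities and the function name are hypothetical.

```python
import numpy as np

def poisson_binomial_pmf(probs):
    """PMF of a sum of independent Bernoulli trials with the given success
    probabilities; entry k of the result is P(Z = k)."""
    pmf = np.array([1.0])
    for p in probs:
        # Adding one more trial convolves the PMF with [1 - p, p].
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

# Hypothetical per-worker probabilities of sharing a wrong sign (10 workers).
wrong_probs = np.array([0.10, 0.20, 0.15, 0.30, 0.25,
                        0.10, 0.20, 0.35, 0.15, 0.20])
pmf = poisson_binomial_pmf(wrong_probs)

mean_wrong = float(np.dot(np.arange(len(pmf)), pmf))  # equals sum(wrong_probs)
threshold = len(wrong_probs) // 2                     # half of the workers
tail = float(pmf[threshold:].sum())                   # P(Z >= M/2)
markov_bound = mean_wrong / threshold                 # Markov: P(Z >= a) <= E[Z] / a
```

The exact tail probability is typically much smaller than the Markov bound; the bound is loose but needs only the mean, which is what makes it convenient for the analysis.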
In particular, let denote the probability of (i.e., the -th entry of the gradient of worker has the same sign as that of the true gradients ), then . When , minimizing is equivalent to minimizing . To this end, two tradeoffs can be observed. Firstly, it can be observed from (6) that given fixed bandwidth and noise , the transmission rate and transmission power determine the outage probability. Increasing the transmission power and decreasing the transmission rate both decrease the outage probability. However, according to (7), a larger and a smaller result in higher communication energy consumption. In addition, given fixed time for each communication round (i.e., ), decreasing increases the communication time and therefore requires worker to increase its CPU frequency such that the local computation time can be reduced. As a result, the CPU energy consumption of worker also increases. Therefore, there exists a tradeoff between the energy consumption of the workers and the learning performance. Secondly, given a fixed total training time, transmission power and CPU frequency , although decreasing the transmission rate decreases the outage probability during each iteration, it also increases the time for each communication round and therefore decreases the number of iterations that the FL algorithm can be run. Therefore, another tradeoff between the number of iterations and the outage probability per iteration can be observed.
V Problem Formulation
V-A Energy Minimization Given Learning Performance Constraint
In order to obtain the tradeoff between the energy consumption of the workers and the learning performance, it is essential to know the minimum energy consumption of each worker that guarantees a certain learning performance. In particular, the learning performance is mainly determined by two parameters: the outage probability at each iteration and the total number of iterations, which is inversely proportional to the time per communication round (denoted by ). Given and , the goal of worker is to minimize its energy consumption. The corresponding optimization problem is formulated as follows.
in which the CPU frequency for local computation , the transmission rate and the transmission power are the parameters to be optimized. The first constraint captures the delay requirement per communication round. The feasible regions of CPU frequency and transmission power of worker are imposed by the second and the third constraints, respectively. The last constraint restricts the feasible range of the outage probability.
V-B Learning Performance Optimization Given Energy Consumption Constraint
Recall that the performance of the FL algorithm mainly depends on the outage probability at each iteration and the total number of iterations. According to the discussion in Section IV, given a fixed total training time, transmission power and CPU frequency , minimizing the outage probability per iteration and maximizing the number of iterations are conflicting. Therefore, in this subsection, the goal is to find a good tradeoff such that the learning performance is optimized.
Note that SignSGD converges with a rate of [bernstein2018signsgd1], in which is the total number of iterations. Therefore, in this work, the objective is to maximize , in which captures the impact of the number of iterations and captures the learning performance improvement at each iteration (cf. (13)). In addition, according to the discussion in Section IV, , in which is determined by the local dataset of worker and therefore unknown to the server. To facilitate the discussion, we assume that . (Note that when all the workers have the same dataset, . In the i.i.d. data distribution setting, can be considered as a noisy version of . As long as the noise is not too large (e.g., when the local datasets are large enough), this assumption is approximately true. This is verified in our simulation results.) Given a fixed total training time, since the number of iterations is inversely proportional to the time consumption of each iteration, the optimization problem is formulated as follows.
in which the communication round time and the transmission rate are the parameters to be optimized. The first constraint captures the energy consumption requirement for each worker and the second constraint captures the delay requirement for each iteration.
Furthermore, we assume that the workers transmit with high SNR and therefore we have
VI Optimization of System Parameters for Federated Learning
VI-A Energy Minimization Given Outage Probability Constraint
We note that the optimization problem (14) is not always feasible. In particular, according to the delay requirement , it is required that . Combining it with the power constraint and plugging them into (6) yields
which may contradict the last condition in (14). Therefore, two scenarios are considered.
VI-A1 The optimization problem (14) is infeasible
In this case, we assume that , and .
We note that and are the two most important parameters that determine the performance of the FL algorithm. When the optimization problem (14) is infeasible, the delay requirement and the outage probability requirement cannot be satisfied simultaneously for worker . Since the communication round time is supposed to be determined by the slowest worker (the straggler), we assume that each worker tries its best to reduce its outage probability while accommodating the delay requirement.
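The feasibility condition can be sketched as follows. The snippet assumes that the lowest admissible rate is the payload size divided by the time left after computing at the maximum CPU frequency, and that the smallest achievable outage probability is obtained by transmitting at that rate with the maximum power over a Rayleigh channel with unit-mean power gain. The function name, argument names and numeric values are hypothetical.

```python
import math

def check_feasibility(size_bits, round_time, cycles_per_bit, data_bits,
                      f_max, p_max, noise_psd, bandwidth, outage_target):
    """Return (feasible, best_outage) for one worker under the stated assumptions."""
    # Time left for the uplink when the CPU runs at its maximum frequency.
    comm_budget = round_time - cycles_per_bit * data_bits / f_max
    if comm_budget <= 0:
        # Local computation alone already exceeds the round time.
        return False, 1.0
    rate_min = size_bits / comm_budget                # slowest rate meeting the delay
    gamma_min = 2 ** (rate_min / bandwidth) - 1       # SNR threshold for that rate
    # Smallest outage probability: lowest rate combined with maximum power.
    best_outage = 1 - math.exp(-gamma_min * noise_psd * bandwidth / p_max)
    return best_outage <= outage_target, best_outage

# Hypothetical worker parameters.
params = dict(size_bits=7850, cycles_per_bit=20, data_bits=1e6, f_max=2e9,
              p_max=0.1, noise_psd=1e-8, bandwidth=15e3)
ok_loose, best = check_feasibility(round_time=1.0, outage_target=1e-2, **params)
ok_tight, _ = check_feasibility(round_time=1.0, outage_target=1e-4, **params)
ok_short, _ = check_feasibility(round_time=0.005, outage_target=1e-2, **params)
```

When the check fails, the operating point used to compute `best_outage` (maximum CPU frequency, maximum power, lowest admissible rate) is the natural fallback for a worker that tries to reduce its outage probability while still meeting the delay requirement.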
VI-A2 The optimization problem (14) is feasible
For the ease of presentation, we define , , and .
Lemma 1. Given any , the optimal transmission power is given by
Given any , the optimal CPU frequency for local computation is given by
Proof. Please see Appendix B. ∎
VI-B Learning Performance Optimization Given Energy Consumption Constraint
Lemma 2. In the optimization problem (15), given any fixed , the optimal transmission rate of worker is given by
Proof. Please see Appendix C. ∎
Let . According to Lemma 2, the workers can be divided into two groups. The optimal transmission rates of the workers in the first group (i.e., ) are limited by their energy consumption upper limit, while those of the workers in the second group are limited by the communication round time, which is subject to design. Further define the following two functions.
It can be verified that both and are convex functions of . Therefore, (24) is a difference-of-convex (DC) programming problem, which can be solved by the DC algorithm (DCA) [tao1997convex].
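To illustrate how a difference-of-convex iteration proceeds, the toy sketch below minimizes g(x) − h(x) with g(x) = x⁴ and h(x) = 2x², both convex. This is not problem (24); the functions are chosen only because each step has a closed form. At every iteration the concave part −h is replaced by its linearization at the current point and the resulting convex problem is solved exactly.

```python
import numpy as np

def objective(x):
    """Toy DC objective g(x) - h(x) with g(x) = x**4 and h(x) = 2 * x**2."""
    return x ** 4 - 2.0 * x ** 2

def dca(x0, iters=50):
    """Run a fixed number of DC iterations on the toy objective."""
    x = float(x0)
    history = [objective(x)]
    for _ in range(iters):
        slope = 4.0 * x                  # derivative of h at the current iterate
        # Convex subproblem: minimize x**4 - slope * x.
        # Setting its derivative to zero gives 4 * x**3 = slope.
        x = float(np.cbrt(slope / 4.0))
        history.append(objective(x))
    return x, history

x_pos, hist_pos = dca(0.2)   # converges to the local minimizer at x = 1
x_neg, hist_neg = dca(-3.0)  # converges to the local minimizer at x = -1
```

The objective value never increases from one iteration to the next, which is the general descent property of the DC iteration; the point it converges to depends on the initialization.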
VII Simulation Results
In this section, we examine the performance of the proposed methods through extensive simulations. We implement a softmax regression model on the standard MNIST dataset for handwritten digit recognition, which consists of 60,000 training samples and 10,000 testing samples [lecun1998gradient]. Each sample is a 28×28 gray-level image. It is assumed that there are 10 workers in total. Each worker randomly samples 2000 training samples from the training dataset. For all the workers, we set ; cycles/bit; bits; GHz; GHz; bits; ; W; W/Hz; kHz.
VII-A Energy Minimization Given Learning Performance Constraint
In this subsection, the impact of the outage probability and communication round time in (14) is examined. We set the same outage probability constraints for all the workers, i.e., . The three figures in Fig. 1 show the training accuracy, testing accuracy and the per-worker energy consumption of Algorithm 1 with different and , respectively. It can be observed that as and increase, the energy consumption decreases. This is because the feasible region of (14) corresponding to a smaller and is a subset of that of (14) corresponding to a larger and . On the other hand, both the training accuracy and the testing accuracy decrease as and increase, which indicates the existence of the tradeoff between the energy consumption and the learning performance.
VII-B Learning Performance Optimization Given Energy Consumption Constraint
In this subsection, we examine the impact of the transmission power and the communication round time . The energy consumption upper limit is set as J. Fig. 2 shows the performance of Algorithm 1 with different and . For the solid curves, the transmission rate ’s are given by (21) while the configurations of the marked points are given by the solution of (15). It can be shown that as increases, the learning performance of Algorithm 1 first increases and then decreases. According to (21), when increases, decreases and therefore the outage probability also decreases. However, in the meantime, as increases, the number of communication rounds decreases given the fixed training time. As a result, when the outage probability has a larger impact on the learning performance, increasing results in better performance. When is larger than a certain critical value, the number of communication rounds plays a more important role and therefore increasing leads to worse performance. In addition, such a critical value decreases as the transmission power increases. This is because for the same critical , a larger corresponds to a larger and therefore a smaller . Furthermore, it can be observed from Fig. 2 that the proposed method works close to the optimal operation point for all the examined scenarios, which validates its effectiveness.
VIII Conclusions
In this work, the implementation of the SignSGD algorithm over wireless networks is studied. In particular, two tradeoffs are identified: the tradeoff between the energy consumption and the learning performance, and the tradeoff between the number of iterations that the algorithm can be run given a fixed training time and the outage probability at each iteration. In addition, two optimization problems are formulated and solved, which can serve as guidelines for the selection of local processing and communication parameters. Furthermore, extensive simulations are performed to demonstrate the effectiveness of the proposed method. Exploring the heterogeneity among the workers is an interesting future direction.
Appendix A Proof of Theorem 1
According to Assumption 2, we have
in which is the -th entry of the vector . Taking expectation on both sides yields
Appendix B Proof of Lemma 1
According to the constraint
it can be obtained that
Since the objective function is an increasing function of , we have
According to the constraint
In addition, the objective function is an increasing function of . Therefore,
Appendix C Proof of Lemma 2
According to the constraint
According to the constraint
In addition, it can be shown that the objective function is a decreasing function of . Therefore,