I Introduction
In recent years, with the technological advances in modern smart devices, each phone, tablet, or smart home system generates and stores an abundance of data which, if harvested collaboratively with other users’ data, can lead to learned models that support many intelligent applications such as smart health and image classification [1, 2]. Traditional machine learning approaches require centralizing the training data on one machine, in the cloud, or in a data center. However, the data collected on modern smart devices are often of a sensitive nature, which discourages users from relying on centralized solutions. Federated Learning (FL) [3, 4] has been proposed to decouple the ability to do machine learning from the need to store the data in a centralized location. The idea of FL is to enable smart devices to collaboratively learn a shared prediction model while keeping all the training data on the device.
Figure 1 shows a schematic representation of an FL architecture. In FL, collaborative learning without data sharing is accomplished by each agent receiving the current model weights from the server. Each participating agent then separately updates the model by running stochastic gradient descent (SGD) [5] on its own locally collected dataset. Next, the participating agents send their locally calculated model weights to a server/aggregator, which often combines the models through simple averaging, as in FedAvg [4], and sends the result back to the agents. The process repeats until a satisfactory model is obtained. Federated learning relies heavily on communication between learner agents (clients) and a moderating server. Engaging all the clients in the learning procedure at each time step of the algorithm results in a huge communication cost. On the other hand, poor channel quality and intermittent connectivity can completely derail training. For resource management, in the original popular FL algorithms such as FedAvg [4], a batch of agents is selected uniformly at random at each round to receive the updated model weights and perform local learning. FedAvg and similar FL algorithms come with convergence guarantees [6, 7, 8, 9] under the assumption that the randomly selected agents are available at each round. However, in practice, due to factors such as energy and time constraints, agents are not available at all times. Thus, several works address this problem via device scheduling [10, 11, 12, 13, 14]. Nevertheless, the agents’ availability can be a function of unforeseen factors such as communication channel quality, and thus is not deterministic and known in advance.
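As a rough illustration of the loop described above, the following minimal sketch runs FedAvg-style rounds on a toy scalar least-squares problem. Everything in it is an assumption made for illustration and is not taken from [4]: the function names `local_sgd` and `fedavg_round`, the scalar model, the noiseless synthetic data, and the step size and batch size.

```python
import random

def local_sgd(w, data, lr, steps, rng):
    """Run a few SGD steps on one agent's data for the loss 0.5*(a*w - b)**2."""
    for _ in range(steps):
        a, b = rng.choice(data)      # sample one local data point
        grad = (a * w - b) * a       # stochastic gradient at the current w
        w -= lr * grad
    return w

def fedavg_round(w_server, agents, batch_size, lr, steps, rng):
    """One round: pick agents uniformly at random, train locally, average."""
    selected = rng.sample(range(len(agents)), batch_size)
    local_models = [local_sgd(w_server, agents[i], lr, steps, rng)
                    for i in selected]
    return sum(local_models) / len(local_models)

rng = random.Random(0)
true_w = 2.0
# Hypothetical toy data: 10 agents, 20 noiseless points each, with b = true_w * a.
agents = [[(a, true_w * a) for a in (rng.uniform(-1, 1) for _ in range(20))]
          for _ in range(10)]

w = 0.0
for _ in range(200):
    w = fedavg_round(w, agents, batch_size=4, lr=0.5, steps=5, rng=rng)
print(w)  # close to true_w on this noiseless toy problem
```

Only the selected agents communicate in a given round, which is the resource-saving aspect of uniform batch selection; the sketch assumes every selected agent is available.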
To understand the effect of agents’ stochastic availability on FL, recent work such as [15] proposed moving from random batch selection to an FL model where the agents’ availability and participation at each round are probabilistic; see Fig. 1. In this paper, we adopt this newly proposed framework and contribute an algorithm that achieves faster convergence and lower error covariance. Our focus is on incorporating a variance reduction procedure into the local SGD procedure of participating learner agents at each round. The randomness in SGD algorithms induces variance in the gradient estimate, which forces a decaying learning rate and results in a sublinear convergence rate. Thus, there has been an effort to reduce the variance of the stochastic gradient, which resulted in the so-called Stochastic Variance Reduced Gradient (SVRG) methods. It has been shown that SVRG allows using a constant learning rate and results in linear convergence in expectation.
In this paper, we incorporate an SVRG approach in an FL algorithm where agents’ participation in the update process in each round is probabilistic and nonuniform. Through rigorous analysis, we show that the proposed algorithm has a faster convergence rate. In particular, we show that our algorithm results in a practical convergence in expectation with a rate , which is an improvement over the sublinear rate of in [15]. We demonstrate the effectiveness of our proposed algorithm through a set of numerical studies and by comparing the rate of convergence, covariance, and the circular error probable (CEP) measure. Our results show that our algorithm drastically improves the convergence guarantees, thanks to the decrease in the variance, which results in faster convergence.
Organization: Section II introduces our basic notation, and presents some of the properties of smooth functions. Section III presents the problem formulation and the structure behind it. Section IV includes the proposed algorithm and its scheme. Section V contains our convergence analysis for the proposed algorithm and provides its convergence rate. Section VI presents simulations and Section VII gathers our conclusions and ideas for future work. For the convenience of the reader, we provide some of the proofs in the Appendix.
II Preliminaries
In this section, we introduce our notations and terminologies used throughout the paper. We let , , , denote the set of real, positive real numbers. Consequently, when , is its absolute value. For , denotes the standard Euclidean norm. We let
denotes an inner product between two vectors for two vectors
and . A differentiable function : is Lipschitz with constant , or simply Lipschitz, over a set if and only if , for . Furthermore, if the function is differentiable, we have for all [16]. Lastly, we recall Jensen’s inequality, which states [17]:(1) 
III Problem statement
This section formalizes the problem of interest. Consider a set of agents (clients) that communicate with a server to learn the parameters of a model that they want to fit to their collective data set. Each agent has its own local data, which can be distributed either uniformly or nonuniformly. The learning objective is to obtain the learning function weights from
(2) 
where
is possibly a convex or nonconvex local learning loss function. At each agent
, depends on the training data set (supervised learning). Examples include the square loss and the log loss [18].
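For concreteness, the two example losses can be sketched in their standard textbook forms. The function names, the scalar prediction interface, and the averaging in `local_loss` are assumptions for illustration; the paper's exact expressions may differ.

```python
import math

def square_loss(y_true, y_pred):
    """Squared error between a target and a prediction."""
    return (y_true - y_pred) ** 2

def log_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for a label in {0, 1} and a predicted probability."""
    p = min(max(p_pred, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def local_loss(loss_fn, targets, predictions):
    """Average a per-sample loss over one agent's local data set."""
    return sum(loss_fn(y, yh) for y, yh in zip(targets, predictions)) / len(targets)

print(local_loss(square_loss, [1.0, 2.0], [1.0, 4.0]))  # 2.0
```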
Assumption 1 (Assumption on smoothness of local cost functions).
The local loss functions have Lipschitz gradients, i.e., for any agent we have
(3) 
for any and .
This assumption is technical and common in the literature.
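For reference, the textbook form of a Lipschitz-gradient condition, together with the quadratic upper bound it implies, can be written as follows. The symbols $f_i$, $L$, $x$, $y$, and $d$ are assumed here for illustration and need not match the notation used in (3).

```latex
\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\| \quad \forall\, x, y \in \mathbb{R}^d
\;\;\Longrightarrow\;\;
f_i(y) \le f_i(x) + \nabla f_i(x)^{\top}(y - x) + \frac{L}{2}\,\|y - x\|^2 .
```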
Problem (2) should be solved in the framework of FL in which agents maintain their local data and only interact with the server to update their local learning parameter vector based on a global feedback provided by the server. The server generates this global feedback from the local weights it receives from the agents. In our setting, at each round of the FL algorithm, each agent becomes active to perform local computations and connect to the server with a probability of . We denote the active state by ; thus,
The activation probability at each round can be different.
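The probabilistic activation model can be sketched as independent Bernoulli draws, one per agent and round. The probabilities below are hypothetical and held fixed across rounds for simplicity, whereas the text allows them to change from round to round; `sample_active_agents` is an illustrative name.

```python
import random

def sample_active_agents(probs, rng):
    """Return the indices of agents whose independent Bernoulli draw succeeds."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

rng = random.Random(1)
probs = [0.9, 0.5, 0.1]          # hypothetical per-agent activation probabilities
rounds = 20000
counts = [0] * len(probs)
for _ in range(rounds):
    for i in sample_active_agents(probs, rng):
        counts[i] += 1
freqs = [c / rounds for c in counts]
print(freqs)  # empirical participation frequencies, close to probs
```

Agents with small activation probability participate rarely, which is why the aggregation step needs a correction if the update is to remain unbiased.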
IV Federated learning with variance reduction
To solve (2), we design the FedAvgSVRG Algorithm 1, which has a two-layer structure. In this algorithm, each agent has its own probability of being active in each round, which is denoted by for agent at iteration .
Algorithm 1 is initialized with by the server. We denote the number of FL iterations by . At each round , the set of active agents is denoted by (line 5), which is the set of agents for which . Then, each active agent receives a copy of the learning parameter from the server. Afterward, active agents perform their local updates according to lines 7 to 18. In FL algorithms designed for resource management, e.g., [15], the local update follows an SGD update. However, the SGD update suffers from high variance because of the randomized search of the algorithm, which forces a decaying step size and slows convergence. Instead, we use the SVRG update step stated in lines 7 to 18. In the SVRG update, we calculate the full batch gradient of the agents at some points, which are referred to as snapshots. Then, between every two snapshots, each agent performs its local updates. A schematic of the SVRG update steps is shown in Fig. 2.
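The snapshot-based update can be illustrated with a generic single-agent SVRG loop in the style of [19], applied to a toy scalar least-squares problem. This is a sketch under stated assumptions and not a transcription of Algorithm 1: the names, the data, the step size, and the choice of the last inner iterate as the next snapshot are all illustrative.

```python
import random

def grad_i(w, point):
    """Gradient of the single-sample loss 0.5*(a*w - b)**2."""
    a, b = point
    return (a * w - b) * a

def full_grad(w, data):
    """Full-batch gradient: average of the per-sample gradients."""
    return sum(grad_i(w, pt) for pt in data) / len(data)

def svrg(w0, data, lr, num_snapshots, inner_steps, rng):
    snapshot = w0
    for _ in range(num_snapshots):
        mu = full_grad(snapshot, data)       # full-batch gradient at the snapshot
        w = snapshot
        for _ in range(inner_steps):
            pt = rng.choice(data)
            # Variance-reduced direction; unbiased since E[grad_i(snapshot)] = mu.
            v = grad_i(w, pt) - grad_i(snapshot, pt) + mu
            w -= lr * v                      # constant step size, no decay
        snapshot = w                         # last inner iterate becomes the snapshot
    return snapshot

rng = random.Random(0)
data = []
for _ in range(50):                          # hypothetical noisy samples b = 2a + noise
    a = rng.uniform(-1, 1)
    data.append((a, 2.0 * a + rng.gauss(0, 0.5)))
w_star = sum(a * b for a, b in data) / sum(a * a for a, _ in data)
w_svrg = svrg(0.0, data, lr=0.2, num_snapshots=50, inner_steps=40, rng=rng)
print(abs(w_svrg - w_star))                  # distance to the least-squares minimizer
```

Because the correction term vanishes as the iterate and the snapshot approach the minimizer, the constant step size does not leave a residual noise floor on this example, unlike plain SGD with the same step size.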
We denote the number of snapshots in our SVRG method by . We let be the number of local SVRG updates between two snapshots for each active agent before aggregation. Line 10 of Algorithm 1 corresponds to computing the full batch gradient of each agent at the snapshot points; then, in line 12, each agent performs its local update with the substituted gradient term denoted as . Note that the substituted gradient term in the SVRG update is an unbiased estimator. After completing the SVRG update, each agent updates its snapshot, as stated in line 17 [19][5]. Finally, in line 20, the model parameter is updated. It should be noted that the weight used for updating the model parameter, denoted by , makes the aggregated gradient unbiased: with this weight, agents with a low probability of being selected still have an adequate impact on the model parameter in the rounds in which they participate. Unlike SGD, the stepsize for the SVRG update in line 14 does not have to decay. Hence, one can choose a larger stepsize, which gives rise to faster convergence.
V Convergence analysis
In this section, we study the convergence bound for the proposed algorithm which is applicable for both convex and nonconvex cost functions.
Assumption 2 (Assumption on unbiased stochastic gradients).
(4) 
for any and . As a result, our substituted gradient term denoted by + becomes unbiased where .
Also, we should point out that and are independent for , and the agent activation for each iteration is independent of random function selection. In other words, and are completely independent.
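The role of the aggregation weight in keeping the update unbiased under independent activation can be checked numerically. The sketch below assumes the weight is the inverse of each agent's activation probability, which is the usual choice for this purpose; the paper's exact weight is not reproduced here, and all values and names are hypothetical.

```python
import random

def weighted_aggregate(values, probs, rng):
    """Sum the values of active agents, each scaled by 1/p_i, then divide by N."""
    total = 0.0
    for v, p in zip(values, probs):
        if rng.random() < p:        # agent is active in this round
            total += v / p          # inverse-probability weight (assumed form)
    return total / len(values)

rng = random.Random(2)
values = [1.0, -2.0, 4.0, 0.5]      # hypothetical local quantities, e.g. gradients
probs = [0.9, 0.6, 0.2, 0.4]        # hypothetical activation probabilities
true_mean = sum(values) / len(values)
trials = 100000
estimate = sum(weighted_aggregate(values, probs, rng) for _ in range(trials)) / trials
print(true_mean, estimate)          # the Monte Carlo average approaches true_mean
```

Without the 1/p scaling, the rarely active third agent would be underrepresented and the average would be biased toward the frequently active agents.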
Theorem V.1 (Convergence bound for the proposed algorithm for both convex and nonconvex cost functions).
Proof of Theorem V.1 is given in the appendix.
By incorporating an SVRG approach in our FL algorithm, Theorem V.1 guarantees that we can use a fixed stepsize and achieve a convergence rate of . The improvement is due to the fact that the SVRG update step does not need a decaying stepsize throughout the learning process. Thus, using a constant and larger stepsize leads the algorithm to converge faster to the stationary point. This is an improvement over the existing algorithm of [15], which guarantees as the convergence rate by using the SGD method for its local update step.
VI Numerical Simulations
In this section, we analyze and demonstrate the performance of Algorithm 1 by solving a regression problem (quadratic loss function). In this study, we compare the performance of our algorithm to that of the FedAvg in [15]. We generate data pieces of form and then apply a zero-mean normally distributed noise to the data, i.e., Noise . Then, to observe the effect of stochastic optimization, we distribute the data among agents. Thus, each agent owns quadratic costs. In other words, we seek to solve the following convex optimization problem:
where in our problem, , and are 10 and 50, respectively. Here, , and is the learning parameter (weight), which is a column vector with 10 elements.
We conduct Monte Carlo simulations, in all of which we initialize our algorithm at , and we use the fixed stepsize in all rounds. We also simulate the FedAvg algorithm of [15] with the same initialization but using the decaying stepsize of as mentioned in [15]. For our algorithm we consider two cases: (1) and (2) .
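A data setup of this kind can be sketched as follows. The linear data model, the Gaussian features, the noise level, the round-robin split, and the choice of 50 agents with 10 points each are assumptions for illustration; only the 10-element weight vector is taken from the text.

```python
import random

def make_regression_data(num_points, dim, noise_std, rng):
    """Generate (x, y) pairs from a random linear model with Gaussian noise."""
    w_true = [rng.uniform(-1, 1) for _ in range(dim)]
    data = []
    for _ in range(num_points):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        y = sum(wi * xi for wi, xi in zip(w_true, x)) + rng.gauss(0, noise_std)
        data.append((x, y))
    return w_true, data

def split_among_agents(data, num_agents):
    """Round-robin split; agents get equal shares when the count divides evenly."""
    return [data[i::num_agents] for i in range(num_agents)]

def quadratic_cost(w, local_data):
    """Mean squared residual of the linear model w on one agent's data."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in local_data) / len(local_data)

rng = random.Random(3)
w_true, data = make_regression_data(num_points=500, dim=10, noise_std=0.1, rng=rng)
agents = split_among_agents(data, num_agents=50)
w0 = [0.0] * 10
print(len(agents), len(agents[0]), quadratic_cost(w0, agents[0]))
```

Each agent's cost is an average of quadratic terms, so sampling one term at a time inside the local update is what introduces the stochastic-gradient variance studied in this section.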
The simulation results for the first case are shown in Figs. 3–5, while the results for the second case are shown in Figs. 6–8. Figures 3 and 6 show that in both cases our algorithm converges faster to the optimal cost (the value is ).
Figures 4 and 7 show the variance caused by the two algorithms. The variance of our algorithm is significantly lower than that of the algorithm of [15]. To make the variance of our algorithm visible, we limit the vertical axis to the range between and . Also, the variance of our algorithm decreases as the number of iterations increases, as opposed to the algorithm of [15], which suffers from a high variance of more than , so the mean of its Monte Carlo runs cannot be seen in the plot.
Figures 5 and 8 show the circular error probable (CEP), which we use to observe the variance in the last iteration () for our algorithm and the FedAvg algorithm in [15]. CEP is a measure used in navigation filters. It is defined as the radius of a circle, centered on the mean, whose perimeter is expected to include the landing points of 50% of the rounds; said otherwise, it is the median error radius [20]. Here, then, CEP demonstrates how far the means of the Monte Carlo runs are from 50% of the Monte Carlo iterations for both algorithms. As a result, a smaller radius means less variance around the mean of the Monte Carlo runs. This plot shows that our algorithm not only reaches a closer neighborhood of the optimal cost, but also has a smaller CEP radius than the algorithm of [15]; this is another indication that our algorithm has less variance than the FedAvg algorithm in [15]. For our algorithm, the CEP radii in the first and the second cases are respectively and , while the corresponding values for the algorithm of [15] are respectively and .
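The median-error-radius definition of CEP given above can be computed directly from a set of two-dimensional outcomes. The sketch below is an empirical estimate under the assumption that each outcome is a point in the plane; the function name and the sample points are hypothetical and are not the paper's simulation data.

```python
import math
import statistics

def cep_radius(points):
    """Median distance of 2-D points from their sample mean (the 50% circle)."""
    mx = statistics.fmean(p[0] for p in points)
    my = statistics.fmean(p[1] for p in points)
    radii = [math.hypot(x - mx, y - my) for x, y in points]
    return statistics.median(radii)

# Hypothetical outcomes: mean is (0, 0), distances from the mean are 1, 1, 2, 2.
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
print(cep_radius(points))  # 1.5
```

A smaller value means that half of the outcomes fall within a tighter circle around the mean, which is how the measure is used to compare the spread of the two algorithms.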
To complete our simulation study, we also compare the convergence performance of our algorithm to that of the FedAvg of [4], which uses a uniform agent selection. Figure 9 demonstrates the results when we use the batch size of of the FedAvg of [4] and use the parameters corresponding to the first case for our algorithm. As we can see, our algorithm outperforms the FedAvg of [4] both in mean and variance.
VII Conclusions
We have proposed an algorithm in the FL framework for the setting where each agent can have a nonuniform probability of becoming active (getting selected) in each FL round. The algorithm possesses a doubly-layered structure, like the original FL algorithms. The first layer corresponds to distributing the server parameter to the agents. At the second layer, each agent updates its copy of the server parameter through an SVRG update. Then, after each agent sends back its update, the server parameter is updated. By leveraging the SVRG technique from stochastic optimization, we constructed a local updating rule that allows the agents to use a fixed stepsize. We characterized an upper bound for the gradient of the expected value of the cost function, which showed our algorithm converges to the optimal solution with the rate of no less than . This showed an improvement over the existing results that only have a convergence rate of . We demonstrated the performance of our algorithm through several numerical examples. We used various statistical measures to show our algorithm’s faster convergence and lower variance compared to some state-of-the-art existing FL algorithms. Future work will investigate the extension of the result to the nonuniform selection of snapshots inside the SVRG update for computing the full batch gradient of the agents.
References
[1] D. C. Nguyen, Q. V. Pham, P. N. Pathirana, M. Ding, A. Seneviratne, Z. Lin, O. A. Dobre, and W. J. Hwang, “Federated learning for smart healthcare: A survey,” ACM Computing Surveys (CSUR), vol. 55, no. 3, pp. 1–37, 2022.
[2] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
[3] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, “Federated learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 13, no. 3, 2019.
[4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics, (Fort Lauderdale, FL), pp. 1273–1282, 2017.
[5] R. Xin, S. Kar, and U. A. Khan, “Decentralized stochastic optimization and machine learning: A unified variance-reduction framework for robust performance and fast convergence,” IEEE Signal Processing Magazine, pp. 102–113, 2020.
[6] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-iid data,” 2019.
[7] A. Mitra, R. Jaafar, G. J. Pappas, and H. Hassani, “Achieving linear convergence in federated learning under objective and systems heterogeneity,” arXiv preprint arXiv:2102.07053, 2021.
[8] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Proceedings of Machine Learning and Systems, pp. 429–450, 2020.
[9] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE Journal on Selected Areas in Communications, no. 6, pp. 1205–1221, 2019.
[10] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2019.
[11] Y. J. Cho, J. Wang, and G. Joshi, “Client selection in federated learning: Convergence analysis and power-of-choice selection strategies,” arXiv preprint arXiv:2010.01243, 2020.
[12] Y. Ruan, X. Zhang, S.-C. Liang, and C. Joe-Wong, “Towards flexible device participation in federated learning,” in International Conference on Artificial Intelligence and Statistics, pp. 3403–3411, PMLR, 2021.
[13] H. Yang, M. Fang, and J. Liu, “Achieving linear speedup with partial worker participation in non-iid federated learning,” arXiv preprint arXiv:2101.11203, 2021.
[14] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Transactions on Wireless Communications, vol. 20, pp. 269–283, 2021.
[15] J. Perazzone, S. Wang, M. Ji, and K. Chan, “Communication-efficient device scheduling for federated learning using stochastic optimization,” in IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, (Piscataway, NJ), pp. 1449–1458, IEEE, 2022.
[16] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997.
[17] D. S. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas (Second Edition). Princeton reference, Princeton University Press, 2009.
[18] A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Inc., 1st ed., 2017.
[19] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in Neural Information Processing Systems (C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, eds.), vol. 26, (57 Morehouse Lane, Red Hook, NY), Curran Associates, Inc., 2013.
[20] J. C. Spall and J. L. Maryak, “A feasible Bayesian estimator of quantiles for projectile accuracy from non-iid data,” Journal of the American Statistical Association, vol. 87, no. 419, pp. 676–681, 1992.