I Introduction
The recent success of deep learning approaches in domains such as speech recognition and computer vision stems from many algorithmic improvements, but also from the fact that the size of available training data has grown significantly over the years, together with the available computing power. The current trend is to use larger data sets and to train deeper networks (with a higher number of layers) to improve accuracy. However, the complexity and the memory requirements quickly become unmanageable within the resources of a single machine. An efficient way to complete this colossal computing task within a reasonable training time is to adopt distributed computation and to exploit the computation and memory resources of multiple machines in parallel. In [1], federated learning was introduced to allow mobile devices to perform model training locally on their own training data, according to the model released by the model owner. Such a design enables mobile users to collaboratively learn a shared prediction model while keeping all the training data private on the device.
Most of the popular distributed training algorithms are minibatch versions of stochastic gradient descent (SGD). Unfortunately, bulk synchronous implementations of stochastic optimization are often slow in practice due to the need to wait for the slowest machine in each synchronous batch, i.e., they suffer from the so-called straggler effect. For example, experiments on Amazon EC2 instances show that some workers can be five times slower than a typical worker [2]. There have been several attempts in the literature to mitigate the straggler effect by adding redundancy to the distributed computing system via coding [3, 4] or by scheduling computation tasks [5, 6]. However, these works overlook the inherent heterogeneity in the computing capacity of different workers. It is crucial to consider the implications of such heterogeneity when optimizing the task allocation to different workers, improving learning accuracy, minimizing latency, and/or minimizing energy consumption. In that sense, [7] considers the problem of adaptive task allocation with the aim of maximizing the learning accuracy while satisfying a delay constraint resulting from data distribution/aggregation over heterogeneous channels and local computation on heterogeneous nodes.
Furthermore, the limited computing resource of a user device is shared among all running applications, so independent and rational mobile clients need an incentive to participate in federated learning. Hence, a critical question that needs to be addressed by each worker is “How much central processing unit (CPU) resource of heterogeneous workers should be allocated to the training task of the model owner?” The answer to this question has repercussions for the central model owner, since in the most basic form of synchronous SGD, the model owner has to wait for the gradients from all workers in order to update its current set of model parameters.
In [8], a game-theoretic approach is established to consider a communication incentive in federated learning, where the aim is to construct a relay network and cooperative communications to support the transfer of model updates. Unlike previous works, in this paper we consider an incentive-based approach to motivate the workers to allocate more computation power to local training.
In this setting, at each gradient update step, the model owner offers an incentive to each worker participating in the federated learning process. Based on this incentive, the workers determine the CPU power they will use to calculate their gradients from their local data. The model owner has a finite budget, which it distributes among its workers to achieve fast convergence to a target error rate. We model the interaction between the mobile devices and the model owner as a Stackelberg game. In this game, the model owner is the buyer, as it buys the learning service provided by the mobile devices, and the mobile devices, as the service providers, act as the sellers. The model owner inherently acts as the single leader in the upper level of the Stackelberg game, while the mobile devices are the corresponding followers. We obtain the equilibrium solution of this game by first quantifying the average time required to finish a single iteration of SGD. We also evaluate our game-theoretic algorithm numerically on the MNIST dataset. Our analysis provides insights into the number of workers that achieves a desired balance in the error-latency tradeoff.
II System Model
We consider a cooperative federated learning system as shown in Figure 1. Specifically, a model owner employs a set of mobile devices, i.e., workers, to train a high-quality centralized model. The workers fetch the current parameters from the model owner as and when instructed in the algorithm. Then, they compute gradients using one minibatch and push their gradients back to the model owner. At each iteration, the model owner aggregates the gradients computed by the workers and updates the model parameters.
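The fetch, compute, push, and aggregate cycle described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names `worker_gradient` and `synchronous_sgd_step`, the least-squares loss, and the use of plain averaging as the aggregation rule are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[Sequence[float], float]


def worker_gradient(params: Vector, minibatch: Sequence[Sample]) -> Vector:
    """Hypothetical local computation: gradient of the least-squares loss
    0.5 * (w.x - y)^2, averaged over one minibatch of (x, y) pairs."""
    grad = [0.0] * len(params)
    for x, y in minibatch:
        residual = sum(w * xi for w, xi in zip(params, x)) - y
        for i, xi in enumerate(x):
            grad[i] += residual * xi / len(minibatch)
    return grad


def synchronous_sgd_step(
    params: Vector,
    minibatches: Sequence[Sequence[Sample]],
    learning_rate: float = 0.05,
) -> Vector:
    """One synchronous iteration: every worker computes a gradient on its own
    minibatch, and the model owner waits for all of them before updating."""
    gradients = [worker_gradient(params, mb) for mb in minibatches]
    # Aggregation by simple averaging (an assumption of this sketch).
    averaged = [
        sum(g[i] for g in gradients) / len(gradients) for i in range(len(params))
    ]
    return [w - learning_rate * g for w, g in zip(params, averaged)]


if __name__ == "__main__":
    # Two workers, each holding a single-sample minibatch.
    batches = [[([1.0], 2.0)], [([1.0], 4.0)]]
    print(synchronous_sgd_step([0.0], batches))
```

Because the update uses the gradients of all workers, the wall-clock duration of one such step is set by the slowest worker, which is the quantity the rest of the model focuses on.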
Let $T_k^{(j)}$ be the time elapsed for worker $k$ to update the gradient in iteration $j$. Here, we consider plain synchronous SGD, in which the model owner waits for all the workers to push their gradients. Thus, iteration $j$ is completed in $\max_k T_k^{(j)}$ time, when all workers have sent their gradient updates. We assume that the time taken by a worker to compute the gradient of one minibatch is random and independently distributed across minibatches and workers [9]. Specifically, we assume that $T_k^{(j)}$ is exponentially distributed with mean $c_k / P_k$, where $c_k$ denotes the total number of CPU cycles required to accomplish the computation task, and $P_k$ denotes the CPU power, i.e., the computation ability represented as CPU cycles per second, of worker $k$.
The model owner negotiates with the workers about the CPU power, i.e., $P_k$. In return, each worker will receive the revenue $q_k P_k$ from the model owner, where $q_k$ is the price of one unit of worker $k$'s CPU power. Intuitively, the latency required to finish the learning process depends on the total usage of CPU power of all workers. Specifically, the learning latency becomes smaller as the expected value of the maximum of $T_k^{(j)}$ over the workers reduces. As a result, the model owner aims to minimize the following cost function:
(1) 
where the cost function involves a positive constant optimization parameter. Note that the expected iteration time decreases as the CPU power of any worker increases. Let $B$ denote the maximum allowable budget of the model owner to pay for the CPU power usage of the cooperative workers.
Lemma 1
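The expected value of the time required to finish a single iteration is obtained as:
$$\mathbb{E}\Big[\max_{k \in \mathcal{K}} T_k^{(j)}\Big] = \sum_{\mathcal{S} \subseteq \mathcal{K},\, \mathcal{S} \neq \emptyset} (-1)^{|\mathcal{S}|+1} \Big(\sum_{k \in \mathcal{S}} \lambda_k\Big)^{-1} \tag{2}$$
where the outer sum is over all nonempty subsets $\mathcal{S}$ of the set of workers $\mathcal{K}$, and $|\mathcal{S}|$ denotes the number of elements of $\mathcal{S}$. In addition, $\lambda_k = P_k / c_k$ is the rate of the exponentially distributed computation time $T_k^{(j)}$ of worker $k$ in iteration $j$, with $P_k$ the CPU power of worker $k$ and $c_k$ the number of CPU cycles its task requires.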
Proof: We omit the proof due to lack of space, but it follows the same line of derivation as Proposition 3.2 in [10].
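The expectation in Lemma 1 is the standard inclusion-exclusion expression for the maximum of independent exponential random variables with different rates. The sketch below evaluates that expression and compares it against a Monte Carlo estimate. The function names and the example rates are illustrative assumptions, and the enumeration visits every nonempty subset, so it is only practical for a small number of workers.

```python
import random
from itertools import combinations
from typing import Sequence


def expected_max_exponential(rates: Sequence[float]) -> float:
    """E[max_k T_k] for independent T_k ~ Exp(rates[k]), computed as the sum
    over nonempty subsets S of (-1)^(|S|+1) / sum_{k in S} rates[k]."""
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign / sum(subset)
    return total


def simulated_max_exponential(
    rates: Sequence[float], trials: int = 100_000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the same expectation, for comparison."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        acc += max(rng.expovariate(rate) for rate in rates)
    return acc / trials


if __name__ == "__main__":
    example_rates = [0.5, 1.0, 2.0]  # hypothetical values of P_k / c_k
    print(expected_max_exponential(example_rates))
    print(simulated_max_exponential(example_rates))
```

For two workers with rates 1 and 2, the expression reduces to 1 + 1/2 - 1/3 = 7/6, which is a convenient hand check.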
While worker $k$ obtains a revenue of $q_k P_k$ from the model owner, where $q_k$ is the price per unit of CPU power and $P_k$ is the CPU power it contributes, each mobile device also incurs an energy cost from the computation. This cost depends directly on the CPU power usage, through a coefficient determined by the chip architecture [11]. Thus, the objective of each worker is to maximize the following utility function:
(3) 
where $P_{\max}$ is the maximum allowable CPU power usage of the workers.
III Stackelberg Game Formulation and Equilibrium Solution
Next, we model the interaction between the workers and the model owner as a Stackelberg game. In the lower level of the game, each worker $k$ determines its CPU power, $P_k$, as a function of the price per unit, $q_k$. In the upper level, the model owner decides on the price per unit power for each worker, $q_k$. As a result, the Stackelberg game can be formally defined as follows:
1) Lower-level Subgame: Given the fixed vector of prices of one unit of CPU power, the lower-level subgame problem is defined as:
(4)
subject to (5)
where $P_{\max}$ is the maximum available CPU power.
2) Upper-level Subgame: Given each worker's CPU power utilization as a function of the prices, the model owner faces the upper-level subgame problem defined as:
(6)
subject to (7)
where $B$ is the available budget to pay the workers.
Based on the game formulation, we consider the Stackelberg equilibrium as the solution for the model owner and the workers. Specifically, following backward induction, we first use the first-order optimality condition to obtain the optimal solution to the lower-level subgame. Then, we substitute the Nash equilibrium of the lower-level subgame into the upper-level subgame and investigate the solution to the upper-level subgame.
III-A Solution to Lower-level Subgame
To find the optimal solution for the lower-level subgame at each worker, we take the first derivative of the utility function of each worker in (3) with respect to its CPU power $P_k$:
(8) 
By equating (8) to zero, we obtain the optimal CPU power as:
(9) 
Furthermore, it is easy to show that the utility function of each worker is strictly concave, which guarantees the existence and uniqueness of the Nash equilibrium.
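To make the first-order argument concrete, the sketch below assumes a specific utility: revenue minus a quadratic energy cost, `q * p - kappa * p**2`. The quadratic form and the names `q`, `kappa`, and `p_max` are assumptions for this example only, and the utility in (3) may differ. Under this assumption the stationary point is `q / (2 * kappa)`, and strict concavity means that clipping it to the feasible interval gives the unique best response.

```python
def worker_utility(p: float, q: float, kappa: float) -> float:
    """Assumed utility: revenue q * p minus a quadratic energy cost kappa * p^2."""
    return q * p - kappa * p * p


def best_response(q: float, kappa: float, p_max: float) -> float:
    """CPU power maximizing the assumed utility over the interval [0, p_max].

    Setting the derivative q - 2 * kappa * p to zero gives p = q / (2 * kappa).
    The utility is strictly concave for kappa > 0, so projecting that point
    onto [0, p_max] yields the constrained maximizer.
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive for a strictly concave utility")
    stationary = q / (2.0 * kappa)
    return min(max(stationary, 0.0), p_max)


if __name__ == "__main__":
    # Hypothetical numbers: unit price 4, energy coefficient 1, power cap 10.
    print(best_response(q=4.0, kappa=1.0, p_max=10.0))
```

A higher offered price raises the chosen CPU power until the cap `p_max` binds, which is the lever the model owner uses in the upper-level subgame.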
III-B Solution to Upper-level Subgame
After obtaining the optimal CPU power of each worker as a function of the price per unit of CPU power, we investigate the solution to the upper-level subgame for the model owner. Due to the high nonlinearity of the expected maximum time given in Lemma 1, we cannot obtain a closed-form solution for the general case. Instead, we present Lemma 2, which can be used to develop an efficient update algorithm to reach the equilibrium point for the heterogeneous case, where the workers are not identical.
Lemma 2
When is sufficiently large, the optimal solution is realized at the boundary, i.e., , where is optimal budget allocation per unit of power for worker .
Proof: The proof is given in Appendix A.
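Since Lemma 2 places the optimum on the budget boundary, the heterogeneous problem reduces to deciding how a fully spent budget is divided among the workers. The sketch below does this by grid search for two workers. It is a hedged illustration under assumptions that are not taken from the paper: a quadratic energy cost with coefficient `kappa` (so the worker's best response to a price `q` is `q / (2 * kappa)`), no cap on CPU power, and hypothetical names throughout.

```python
import math
from typing import Sequence, Tuple


def expected_max_two(rate_a: float, rate_b: float) -> float:
    """E[max(T_a, T_b)] for independent exponentials with the given rates."""
    return 1.0 / rate_a + 1.0 / rate_b - 1.0 / (rate_a + rate_b)


def best_budget_split(
    budget: float, kappa: float, cycles: Sequence[float], grid: int = 100
) -> Tuple[float, float]:
    """Grid-search the share of a fully spent budget paid to worker 0.

    With the assumed best response P = q / (2 * kappa), a payment
    q * P = share * budget buys CPU power sqrt(share * budget / (2 * kappa)).
    Returns the best share and the resulting expected iteration time.
    """
    best_share, best_time = 0.0, math.inf
    for i in range(1, grid):
        share = i / grid
        power_0 = math.sqrt(share * budget / (2.0 * kappa))
        power_1 = math.sqrt((1.0 - share) * budget / (2.0 * kappa))
        time = expected_max_two(power_0 / cycles[0], power_1 / cycles[1])
        if time < best_time:
            best_share, best_time = share, time
    return best_share, best_time


if __name__ == "__main__":
    # Worker 0 needs twice as many CPU cycles as worker 1 (hypothetical values).
    print(best_budget_split(budget=2.0, kappa=1.0, cycles=(2.0, 1.0)))
```

In this toy setting the worker with the heavier task receives more than half of the budget, because speeding up the likely straggler shortens the expected maximum the most.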
The closed-form solution to the homogeneous case, in which all workers are identical, is given in the following theorem.
Theorem 1
When the workers are homogeneous, the optimal solution to the upper-level subgame is the same for all workers.
Proof: The proof is given in Appendix B.
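For intuition about the homogeneous case, the sketch below computes the expected duration of one iteration when a fully spent budget is split equally among identical workers. It is not a restatement of Theorem 1: it assumes a quadratic energy cost with coefficient `kappa`, ignores the cap on CPU power, and uses hypothetical function names. It relies on the standard fact that the maximum of `K` independent exponentials with a common mean has expectation equal to that mean times the `K`-th harmonic number.

```python
import math


def homogeneous_price(budget: float, num_workers: int, kappa: float) -> float:
    """Unit price such that each worker is paid budget / num_workers, assuming
    the best response P = q / (2 * kappa), i.e. q * P = q^2 / (2 * kappa)."""
    return math.sqrt(2.0 * kappa * budget / num_workers)


def expected_iteration_time(
    budget: float, num_workers: int, kappa: float, cycles: float
) -> float:
    """Expected time of one synchronous iteration with identical workers."""
    price = homogeneous_price(budget, num_workers, kappa)
    power = price / (2.0 * kappa)  # assumed best response of every worker
    harmonic = sum(1.0 / i for i in range(1, num_workers + 1))
    return (cycles / power) * harmonic


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, expected_iteration_time(budget=8.0, num_workers=k, kappa=1.0, cycles=1.0))
```

Under these assumptions the per-iteration time grows with the number of workers for a fixed budget, both because each worker buys less CPU power and because the maximum is taken over more random times.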
IV Numerical Results
In the simulations, we use the MNIST dataset, for which we first convert the 28 x 28 images into single vectors of length 784. We use a single layer of neurons followed by a softmax cross-entropy (with logits) loss function. Thus, effectively, the parameters consist of a weight matrix of size 784 x 10 and a bias vector of size 1 x 10. We use a regularizer of value 0.01 and a learning rate of 0.05. For the implementation, we used TensorFlow with Python 3. For the runtime simulations, we generate random variables from the respective distributions in Python to represent the computation times. Specifically, the computation time of each worker is generated from an exponential distribution whose mean is the ratio of the number of CPU cycles required by the task to the worker's CPU power, as in the system model. Furthermore, to consider heterogeneous workers in the system, we select the worker-specific parameters uniformly at random. Due to the randomness of this selection and of stochastic gradient descent (SGD), we run each realization 50 times and take the average. In the simulations, we are interested in the error rate, defined as the ratio of the difference between the processed image and the original image to the original image. In all simulations, we define a target error rate; once the target error rate is reached, we stop the simulation and record the time elapsed to reach it.
We first investigate the effect of varying the number of workers and the budget on the latency in Fig. (a). For all budget values, the latency initially decreases with the number of workers: a larger number of workers increases the diversity of the training data processed in each iteration, so the error improves and the system requires fewer iterations to reach the target error value. However, after a certain point, the latency starts to increase. This is because, as the number of workers increases, the positive effect of the diversity of training data diminishes, and the delay incurred in waiting for the updates of all workers starts to dominate. We also observe that the time required to reach the target error rate decreases as the budget of the model owner increases. This is because an increase in the budget results in more CPU power allocated per worker, reducing the time to complete each iteration. As a result, the total latency decreases. Fig. (a) demonstrates the tradeoff between the diversity, which leads to a reduction in the total number of iterations, and the time elapsed to complete each iteration, both of which increase with the number of workers. Hence, for a given budget and target error rate, there exists an optimal number of workers that should be employed by the model owner.
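The runtime part of this setup can be sketched with the Python standard library alone. The block below is an illustration under stated assumptions, not the simulation code used for the figures: the function names, the range from which the per-worker cycle counts are drawn, and the fixed number of iterations are all hypothetical, and the learning process itself (and hence the number of iterations needed to reach a target error) is not modeled.

```python
import random
from typing import Sequence, Tuple


def simulate_training_latency(
    cycles: Sequence[float],
    powers: Sequence[float],
    num_iterations: int,
    rng: random.Random,
) -> float:
    """Total wall-clock time of synchronous SGD: each iteration lasts as long
    as the slowest worker, whose time is exponential with mean cycles / power."""
    if len(cycles) != len(powers):
        raise ValueError("cycles and powers must have the same length")
    total = 0.0
    for _ in range(num_iterations):
        # random.expovariate takes the rate, i.e. 1 / mean = power / cycles.
        times = [rng.expovariate(p / c) for c, p in zip(cycles, powers)]
        total += max(times)
    return total


def average_latency(
    num_workers: int,
    power: float,
    num_iterations: int,
    runs: int = 50,
    seed: int = 0,
    cycle_range: Tuple[float, float] = (1.0, 2.0),  # hypothetical range
) -> float:
    """Average the simulated latency over several independent realizations."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        cycles = [rng.uniform(*cycle_range) for _ in range(num_workers)]
        powers = [power] * num_workers
        results.append(simulate_training_latency(cycles, powers, num_iterations, rng))
    return sum(results) / runs


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, average_latency(num_workers=k, power=1.0, num_iterations=100))
```

With the number of iterations held fixed, the simulated latency only grows with the number of workers; the initial decrease reported in the text comes from the reduction in the number of iterations needed, which would have to be supplied by the actual training run.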
Next, we investigate the optimal number of workers minimizing the total latency for varying budget and target error rates. As depicted in Fig. (b), an increase in the budget leads to an increase in the optimal number of workers, since, as the budget increases, more CPU power can be purchased from more workers. Furthermore, as the target error rate decreases, the optimal number of workers increases. This is because, as the target error rate decreases, the number of iterations needed to complete the process increases, which in turn allows the diversity of training data provided by different workers to become more effective.
V Conclusion
In this paper, we have presented a Stackelberg game model to analyze the CPU allocation strategies of multiple workers as well as the budget allocation of the model owner in synchronous SGD run by the model owner. Specifically, we have investigated the impact of the available budget and the target error rate on the CPU power utilization of the workers and the convergence time of the learning process. We observe that even though a higher number of workers leads to higher diversity in the learning process, there is a maximum number of workers beyond which the delay due to waiting for the SGD updates dominates. This result demonstrates the importance of efficient resource allocation algorithms in a practical learning system.
One important direction of extension of this work is to consider a dynamic game formulation that arises when the dynamic channel and worker CPU conditions are taken into account. Additionally, the interactions between the model owner and workers depend on the learning approach implemented, e.g., AdaGrad, ADAM, etc. Although similar tradeoffs exist regardless of the method implemented, it would be insightful to study the optimal number of workers depending on the method used.
References

[1] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, Apr. 2017.
[2] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proc. Int. Conf. on Machine Learning (ICML), Sydney, Australia, Aug. 2017, pp. 3368–3376.
[3] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” IEEE Transactions on Information Theory, vol. 64, no. 3, pp. 1514–1529, 2018.
[4] C. Karakus, Y. Sun, S. Diggavi, and W. Yin, “Straggler mitigation in distributed optimization through data encoding,” in Advances in Neural Information Processing Systems 30 (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 5440–5448.
[5] A. Harlap, H. Cui, W. Dai, J. Wei, G. R. Ganger, P. B. Gibbons, G. A. Gibson, and E. P. Xing, “Addressing the straggler problem for iterative convergent parallel ML,” in ACM Symposium on Cloud Computing (SoCC), Santa Clara, CA, USA, Oct. 2016, pp. 98–111.
[6] Y. Sun, J. Zhao, S. Zhou, and D. Gündüz, “Heterogeneous computation across heterogeneous workers,” CoRR, vol. abs/1904.07490, 2019. [Online]. Available: http://arxiv.org/abs/1904.07490
[7] U. Mohammad and S. Sorour, “Adaptive task allocation for mobile edge learning,” CoRR, vol. abs/1811.03748, 2018. [Online]. Available: http://arxiv.org/abs/1811.03748
[8] S. Feng, D. Niyato, P. Wang, D. I. Kim, and Y. Liang, “Joint service pricing and cooperative relay communication for federated learning,” CoRR, vol. abs/1811.12082, 2018. [Online]. Available: http://arxiv.org/abs/1811.12082
[9] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, “Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD,” CoRR, vol. abs/1803.01113, 2018. [Online]. Available: https://arxiv.org/abs/1803.01113
[10] M. Bibinger, “Notes on the sum and maximum of independent exponentially distributed random variables with different scale-parameters,” CoRR, vol. abs/1307.3945, 2013. [Online]. Available: http://arxiv.org/abs/1307.3945
[11] J. Zhang, X. Hu, Z. Ning, E. C.-H. Ngai, L. Zhou, J. Wei, J. Cheng, and B. Hu, “Energy-latency tradeoff for energy-aware offloading in mobile edge computing networks,” IEEE Internet of Things Journal, vol. 5, no. 4, pp. 2633–2645, Aug. 2018.
Appendix A Proof of Lemma 2
We first substitute the optimal CPU power allocations of the workers, i.e., $P_k^*$, given in (9), into the cost minimization problem given in (6)-(7). As the constraint in (7) is linear, we adopt the Lagrangian method. The Lagrangian function for the optimization problem (6)-(7) is given as follows:
(10) 
where $\mathcal{L}$ is the Lagrangian function and $\mu$ denotes the Lagrange multiplier.
To obtain the optimal solution, we take the first derivative of the Lagrangian function with respect to the price $q_k$ of each worker.
(11) 
In (11), we use the relation between the optimal CPU power and the price given in (9). Then, we equate the first derivative given in (11) to zero to derive the value of the Lagrange multiplier at the optimal point as:
(12) 
Similarly, the second and third Karush-Kuhn-Tucker (KKT) conditions require the Lagrange multiplier to be nonnegative and complementary slackness to hold for the budget constraint (7). Thus, the first term in (12) should be positive. Intuitively, as the exponential parameter, i.e., the inverse of the mean completion time, increases, the expected maximum of the completion times should decrease. As a result, under the condition of Lemma 2, we can guarantee that the Lagrange multiplier is positive. Thus, from the complementary slackness condition of KKT, the solution lies at the boundary.
Appendix B Proof of Theorem 1
Using the result given in (11) and the homogeneity of the workers, we obtain the following relation:
(13) 