When Edge Meets Learning: Adaptive Control for Resource-Constrained Distributed Machine Learning

Emerging technologies and applications including Internet of Things (IoT), social networking, and crowd-sourcing generate large amounts of data at the network edge. Machine learning models are often built from the collected data, to enable the detection, classification, and prediction of future events. Due to bandwidth, storage, and privacy concerns, it is often impractical to send all the data to a centralized location. In this paper, we consider the problem of learning model parameters from data distributed across multiple edge nodes, without sending raw data to a centralized place. Our focus is on a generic class of machine learning models that are trained using gradient-descent based approaches. We analyze the convergence rate of distributed gradient descent from a theoretical point of view, based on which we propose a control algorithm that determines the best trade-off between local update and global parameter aggregation to minimize the loss function under a given resource budget. The performance of the proposed algorithm is evaluated via extensive experiments with real datasets, both on a networked prototype system and in a larger-scale simulated environment. The experimentation results show that our proposed approach performs near to the optimum with various machine learning models and different data distributions.

I Introduction

The rapid advancement of Internet of Things (IoT) and social networking applications results in an exponential growth of the data generated at the network edge. It has been predicted that the data generation rate will exceed the capacity of today's Internet in the near future [2]. Due to network bandwidth and data privacy concerns, it is impractical and often unnecessary to send all the data to a remote cloud. As a result, research organizations estimate that a large share of the data will be stored and processed locally [3]. Local data storing and processing with global coordination is made possible by the emerging technology of mobile edge computing (MEC) [4, 5], where edge nodes, such as sensors, home gateways, micro servers, and small cells, are equipped with storage and computation capability. Multiple edge nodes work together with the remote cloud to perform large-scale distributed tasks that involve both local processing and remote coordination/execution.

To analyze large amounts of data and obtain useful information for the detection, classification, and prediction of future events, machine learning techniques are often applied. The definition of machine learning is very broad, ranging from simple data summarization with linear regression to multi-class classification with support vector machines (SVMs) and deep neural networks [6, 7]. The latter have shown very promising performance in recent years for complex tasks such as image classification. One key enabler of machine learning is the ability to learn (train) models using a very large amount of data. With the increasing amount of data being generated by new applications and with more applications becoming data-driven, one can foresee that machine learning tasks will become a dominant workload in distributed MEC systems in the future. However, it is challenging to perform distributed machine learning on resource-constrained MEC systems.

In this paper, we address the problem of how to efficiently utilize the limited computation and communication resources at the edge for the optimal learning performance. We consider a typical edge computing architecture where edge nodes are interconnected with the remote cloud via network elements, such as gateways and routers, as illustrated in Fig. 1. The raw data is collected and stored at multiple edge nodes, and a machine learning model is trained from the distributed data without sending the raw data from the nodes to a central place. This variant of distributed machine learning (model training) from a federation of edge nodes is known as federated learning [8, 9, 10].

Fig. 1: System architecture.

We focus on gradient-descent based federated learning algorithms, which have general applicability to a wide range of machine learning models. The learning process includes local update steps where each edge node performs gradient descent to adjust the (local) model parameter to minimize the loss function defined on its own dataset. It also includes global aggregation steps where model parameters obtained at different edge nodes are sent to an aggregator, which is a logical component that can run on the remote cloud, a network element, or an edge node. The aggregator aggregates these parameters (e.g., by taking a weighted average) and sends an updated parameter back to the edge nodes for the next round of iteration. The frequency of global aggregation is configurable; one can aggregate at an interval of one or multiple local updates. Each local update consumes computation resource of the edge node, and each global aggregation consumes communication resource of the network. The amount of consumed resources may vary over time, and there is a complex relationship among the frequency of global aggregation, the model training accuracy, and resource consumption.
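To make the interplay between local updates and global aggregation concrete, the following Python sketch illustrates this round structure: each node performs τ local gradient-descent steps before the aggregator takes a size-weighted average of the local parameters. The helper grad_fn and the dataset layout are hypothetical placeholders, not part of the system described in this paper.

import numpy as np

def federated_gradient_descent(w0, datasets, grad_fn, eta, tau, T):
    # w0: initial parameter vector shared by all nodes; datasets: list of local datasets D_i
    # grad_fn(w, D_i): gradient of the local loss F_i at w (assumed interface)
    # eta: step size; tau: local updates between aggregations; T: total local iterations
    sizes = np.array([len(D) for D in datasets], dtype=float)
    weights = sizes / sizes.sum()                      # |D_i| / |D|
    w_local = [np.copy(w0) for _ in datasets]
    w_global = np.copy(w0)
    for t in range(1, T + 1):
        # Local update: each node takes one gradient step on its own loss.
        w_local = [w - eta * grad_fn(w, D) for w, D in zip(w_local, datasets)]
        # Global aggregation every tau iterations: weighted average of local parameters.
        if t % tau == 0:
            w_global = sum(wgt * w for wgt, w in zip(weights, w_local))
            w_local = [np.copy(w_global) for _ in datasets]
    return w_global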

We propose an algorithm to determine the frequency of global aggregation so that the available resource is most efficiently used. This is important because the training of machine learning models is usually resource-intensive, and a non-optimal operation of the learning task may waste a significant amount of resources. Our main contributions in this paper are as follows:

  1. We analyze the convergence bound of gradient-descent based federated learning from a theoretical perspective, and obtain a novel convergence bound that incorporates non-independent-and-identically-distributed (non-i.i.d.) data distributions among nodes and an arbitrary number of local updates between two global aggregations.

  2. Using the above theoretical convergence bound, we propose a control algorithm that learns the data distribution, system dynamics, and model characteristics, based on which it dynamically adapts the frequency of global aggregation in real time to minimize the learning loss under a fixed resource budget.

  3. We evaluate the performance of the proposed control algorithm via extensive experiments using real datasets both on a hardware prototype and in a simulated environment, which confirm that our proposed approach provides near-optimal performance for different data distributions, various machine learning models, and system configurations with different numbers of edge nodes.

II Related Work

Existing work on MEC focuses on generic applications, where solutions have been proposed for application offloading [11, 12], workload scheduling [13, 14], and service migration triggered by user mobility [15, 16]. However, they do not address the relationship among communication, computation, and training accuracy for machine learning applications, which is important for optimizing the performance of machine learning tasks.

The concept of federated learning was first proposed in [9], which showed its effectiveness through experiments on various datasets. Based on the comparison of synchronous and asynchronous methods of distributed gradient descent in [17], it is proposed in [9] that federated learning should use the synchronous approach because it is more efficient than asynchronous approaches. The approach in [9] uses a fixed global aggregation frequency. It does not provide a theoretical convergence guarantee, and the experiments were not conducted in a network setting. Several extensions have been made to the original federated learning proposal recently. For example, a mechanism for secure global aggregation is proposed in [18]. Methods for compressing the information exchanged within one global aggregation step are proposed in [19, 20]. Adjustments to the standard gradient descent procedure for better performance in the federated setting are studied in [21]. Participant (client) selection for federated learning is studied in [22]. An approach that shares a small amount of data with other nodes for better learning performance with non-i.i.d. data distributions is proposed in [23]. These studies do not consider the adaptation of the global aggregation frequency, and thus they are orthogonal to our work in this paper. To the best of our knowledge, the adaptation of the global aggregation frequency for federated learning with resource constraints has not been studied in the literature.

An area related to federated learning is distributed machine learning in datacenters through the use of worker machines and parameter servers [24]. The main difference between the datacenter environment and the edge computing environment is that in datacenters, shared storage is usually used. The worker machines do not keep persistent data storage on their own, and they fetch the data from the shared storage at the beginning of the learning process. As a result, the data samples obtained by different workers are usually independent and identically distributed (i.i.d.). In federated learning, the data is collected at the edge directly and stored persistently at edge nodes, thus the data distribution at different edge nodes is usually non-i.i.d. Concurrently with our work in this paper, optimization of the synchronization frequency with running-time considerations is studied in [25] for the datacenter setting. It does not consider the characteristics of non-i.i.d. data distributions, which are essential in federated learning.

Distributed machine learning across multiple datacenters in different geographical locations is studied in [26], where a threshold-based approach to reduce the communication among different datacenters is proposed. Although the work in [26] is related to the adaptation of synchronization frequency with resource considerations, it focuses on peer-to-peer connected datacenters, which is different from the federated learning architecture that is not peer-to-peer. It also allows asynchronism among datacenter nodes, which is not the case in federated learning. In addition, the approach in [26] is designed empirically and does not consider a concrete theoretical objective, nor does it consider computation resource constraint which is important in MEC systems in addition to constrained communication resource.

From a theoretical perspective, bounds on the convergence of distributed gradient descent are obtained in [27, 28, 29], which only allow one step of local update before global aggregation. Partial global aggregation is allowed in the decentralized gradient descent approach in [30, 31], where after each local update step, parameter aggregation is performed over a non-empty subset of nodes; this does not apply in our federated learning setting, where there is no aggregation at all after some of the local update steps. Multiple local updates before aggregation are possible in the bound derived in [26], but the number of local updates varies based on the thresholding procedure and cannot be specified as a given constant. Concurrently with our work, bounds with a fixed number of local updates between global aggregation steps are derived in [32, 33]. However, the bound in [32] only works with i.i.d. data distribution, and the bound in [33] is independent of how different the datasets are, which is inefficient because it does not capture the fact that training on i.i.d. data is likely to converge faster than training on non-i.i.d. data. Related studies on distributed optimization that are applicable to machine learning applications also include [34, 35, 36], where a separate solver is used to solve a local problem. The main focus of [34, 35, 36] is the trade-off between communication and optimality, and the complexity of solving the local problem (such as the number of local updates needed) is not studied. In addition, many of the existing studies either explicitly or implicitly assume i.i.d. data distribution at different nodes, which is inappropriate in federated learning. To our knowledge, the convergence bound of distributed gradient descent in the federated learning setting, which captures both the characteristics of different (possibly non-i.i.d. distributed) datasets and a given number of local update steps between two global aggregation steps, has not been studied in the literature.

In contrast to the above research, our work in this paper formally addresses the problem of dynamically determining the global aggregation frequency to optimize the learning with a given resource budget for federated learning in MEC systems. This is a non-trivial problem due to the complex dependency between each learning step and its previous learning steps, which is hard to capture analytically. It is also challenging due to the non-i.i.d. data distributions at different nodes, where the distributions are unknown beforehand and the datasets may have different degrees of similarity with each other, as well as due to the real-time dynamics of the system. We propose an algorithm that is derived from theoretical analysis and adapts to real-time system dynamics.

We start by summarizing the basics of federated learning in the next section. In Section IV, we describe our problem formulation. The convergence analysis and control algorithm are presented in Sections V and VI, respectively. Experimentation results are shown in Section VII and the conclusion is presented in Section VIII.

III Preliminaries and Definitions

III-A Loss Function

Machine learning models include a set of parameters which are learned based on training data. A training data sample j usually consists of two parts. One is a vector x_j that is regarded as the input of the machine learning model (such as the pixels of an image); the other is a scalar y_j that is the desired output of the model (such as the label of the image). To facilitate the learning, each model has a loss function defined on its parameter vector w for each data sample j. The loss function captures the error of the model on the training data, and the model learning process is to minimize the loss function on a collection of training data samples. For each data sample j, we define the loss function as f(w, x_j, y_j), which we write as f_j(w) in short¹.

¹ Note that some unsupervised models (such as K-means) only learn on x_j and do not require the existence of y_j in the training data. In such cases, the loss function value only depends on x_j.

Model | Loss function f_j(w)
Squared-SVM | (λ/2)||w||^2 + (1/2) max{0, 1 − y_j w^T x_j}^2  (λ is const.)
Linear regression | (1/2)||y_j − w^T x_j||^2
K-means | (1/2) min_l ||x_j − w_(l)||^2, where w ≜ [w_(1)^T, w_(2)^T, ...]^T consists of the centroids
Convolutional neural network | Cross-entropy on cascaded linear and non-linear transforms, see [7]
TABLE I: Loss functions for popular machine learning models

Examples of loss functions of popular machine learning models are summarized² in Table I [6, 7, 37]. For convenience, we assume that all vectors are column vectors in this paper and use w^T to denote the transpose of w. We use "≜" to denote "is defined to be equal to" and use ||·|| to denote the L2 norm.

² While our focus is on non-probabilistic learning models, similar loss functions can be defined for probabilistic models where the goal is to minimize the negative of the log-likelihood function, for instance.

Assume that we have N edge nodes with local datasets D_1, D_2, ..., D_N. For each dataset D_i at node i, the loss function on the collection of data samples at this node is

F_i(w) ≜ (1 / |D_i|) Σ_{j ∈ D_i} f_j(w).   (1)

We define D_i ≜ |D_i|, where |·| denotes the size of the set, and D ≜ Σ_{i=1}^{N} D_i. Assuming D_i ∩ D_{i'} = ∅ for i ≠ i', we define the global loss function on all the distributed datasets as

F(w) ≜ (Σ_{i=1}^{N} D_i F_i(w)) / D.   (2)

Note that F(w) cannot be directly computed without sharing information among multiple nodes.
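As a quick illustration of (1) and (2), the snippet below computes each local loss F_i as an average over the node's samples and the global loss F as a size-weighted average of the local losses. This is only a minimal sketch; the per-sample loss f and the toy datasets are hypothetical stand-ins.

import numpy as np

def local_loss(w, D_i, f):
    # F_i(w): average per-sample loss over the local dataset D_i, as in (1).
    return sum(f(w, x, y) for (x, y) in D_i) / len(D_i)

def global_loss(w, datasets, f):
    # F(w): size-weighted average of the local losses, as in (2).
    total = sum(len(D_i) for D_i in datasets)
    return sum(len(D_i) * local_loss(w, D_i, f) for D_i in datasets) / total

# Example with a linear-regression-style per-sample loss (chosen only for illustration).
f = lambda w, x, y: 0.5 * (y - w @ x) ** 2
datasets = [[(np.array([1.0, 2.0]), 1.0)],
            [(np.array([0.5, 1.0]), 0.0), (np.array([2.0, 0.0]), 1.0)]]
print(global_loss(np.zeros(2), datasets, f))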

III-B The Learning Problem

The learning problem is to minimize F(w), i.e., to find

w* ≜ arg min_w F(w).   (3)

Due to the inherent complexity of most machine learning models, it is usually impossible to find a closed-form solution to (3). Thus, (3) is often solved using gradient-descent techniques.

III-C Distributed Gradient Descent

We present a canonical distributed gradient-descent algorithm to solve (3), which is widely used in state-of-the-art federated learning systems (e.g., [9]). Each node i has its local model parameter w_i(t), where t = 0, 1, 2, ... denotes the iteration index. At t = 0, the local parameters for all nodes are initialized to the same value. For t > 0, new values of w_i(t) are computed according to a gradient-descent update rule on the local loss function, based on the parameter value in the previous iteration t − 1. This gradient-descent step on the local loss function (defined on the local dataset) at each node is referred to as the local update. After one or multiple local updates, a global aggregation is performed through the aggregator to update the local parameter at each node to the weighted average of all nodes' parameters. We define that each iteration includes a local update step which is possibly followed by a global aggregation step.

After global aggregation, the local parameter w_i(t) at each node i usually changes. For convenience, we use w̃_i(t) to denote the parameter at node i after possible global aggregation. If no aggregation is performed at iteration t, we have w̃_i(t) = w_i(t). If aggregation is performed at iteration t, then generally w̃_i(t) ≠ w_i(t), and we set w̃_i(t) = w(t), where w(t) is a weighted average of the w_i(t) defined in (5) below. An example of these definitions is shown in Fig. 2.

Fig. 2: Illustration of the values of w_i(t) and w̃_i(t) at node i.

The local update in each iteration is performed on the parameter after possible global aggregation in the previous iteration. For each node i, the update rule is as follows:

w_i(t) = w̃_i(t − 1) − η ∇F_i(w̃_i(t − 1)),   (4)

where η > 0 is the step size. For any iteration t (which may or may not include a global aggregation step), we define

w(t) ≜ (Σ_{i=1}^{N} D_i w_i(t)) / D.   (5)

This global model parameter w(t) is only observable to nodes in the system if global aggregation is performed at iteration t, but we define it for all t to facilitate the analysis later.

We define that the system performs τ steps of local updates at each node between every two global aggregations. We define T as the total number of local iterations at each node. For ease of presentation, we assume that T is an integer multiple of τ in the theoretical analysis, which will be relaxed when we discuss practical aspects in Section VI-B. The logic of distributed gradient descent is presented in Algorithm 1, which ignores aspects related to the communication between the aggregator and edge nodes. Such aspects will be discussed later in Section VI-B.

Input: τ, T
Output: Final model parameter w^f
Initialize w_i(0), w̃_i(0), and w(0) to the same value for all i;
for t = 1, 2, ..., T do
       For each node i in parallel, compute the local update w_i(t) using (4);
       if t is an integer multiple of τ then
              Set w̃_i(t) ← w(t) for all i, where w(t) is defined in (5);  //Global aggregation
              Update w^f ← arg min_{w ∈ {w^f, w(t)}} F(w);
       else
              Set w̃_i(t) ← w_i(t) for all i;  //No global aggregation
Algorithm 1 Distributed gradient descent (logical view)

The final model parameter w^f obtained from Algorithm 1 is the one that has produced the minimum global loss after each global aggregation throughout the entire execution of the algorithm. We use this w^f instead of the last iterate w(T) to align with the theoretical convergence bound that will be presented in Section V-B. In practice, we have seen that w^f and w(T) are usually the same, but using w^f provides theoretical rigor in terms of the convergence guarantee, so we use w^f in this paper. Note that F(w(t)) in Algorithm 1 is computed in a distributed manner according to (2); the details will be presented later.

The rationale behind Algorithm 1 is that when τ = 1, i.e., when we perform global aggregation after every local update step, the distributed gradient descent (ignoring communication aspects) is equivalent to the centralized gradient descent, where the latter assumes that all data samples are available at a centralized location and the global loss function and its gradient can be observed directly. This is due to the linearity of the gradient operator. See Appendix -A as well as [38] for detailed discussions about this.
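The equivalence for τ = 1 follows from the linearity of the gradient operator: the size-weighted average of the local gradients equals the gradient of the global loss. The short check below, a sketch that uses a squared-error loss chosen purely for illustration, compares one aggregated distributed step against one centralized step.

import numpy as np

# Hypothetical per-sample gradient for f_j(w) = 0.5 * (w.x_j - y_j)^2.
grad_sample = lambda w, x, y: (w @ x - y) * x

def local_grad(w, D):
    return sum(grad_sample(w, x, y) for (x, y) in D) / len(D)

datasets = [[(np.array([1.0, 0.0]), 1.0)],
            [(np.array([0.0, 1.0]), 2.0), (np.array([1.0, 1.0]), 0.0)]]
w0, eta = np.zeros(2), 0.1
sizes = np.array([len(D) for D in datasets], dtype=float)

# Distributed step with tau = 1: local updates followed by size-weighted averaging.
w_dist = sum(s * (w0 - eta * local_grad(w0, D)) for s, D in zip(sizes, datasets)) / sizes.sum()

# Centralized step: one gradient step on the global loss over the union of all samples.
all_samples = [sample for D in datasets for sample in D]
w_cent = w0 - eta * sum(grad_sample(w0, x, y) for (x, y) in all_samples) / len(all_samples)

print(np.allclose(w_dist, w_cent))  # True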

The main notations in this paper are summarized in Table II.

F(w)  Global loss function
F_i(w)  Local loss function for node i
t  Iteration index
w_i(t)  Local model parameter at node i in iteration t
w(t)  Global model parameter in iteration t
w^f  Final model parameter obtained at the end of the learning process
w*  True optimal model parameter that minimizes F(w)
η  Gradient descent step size
τ  Number of local update steps between two global aggregations
T  Total number of local update steps at each node
K  Total number of global aggregation steps, equal to T/τ
M (m)  Total number of resource types (the m-th type of resource)
R_m  Total budget of the m-th type of resource
c_m  Consumption of type-m resource in one local update step
b_m  Consumption of type-m resource in one global aggregation step
ρ  Lipschitz parameter of F_i(w) (∀i) and F(w)
β  Smoothness parameter of F_i(w) (∀i) and F(w)
δ  Gradient divergence
h(τ)  Function defined in (11), gap between the model parameters obtained from distributed and centralized gradient descents
φ  Constant defined in Lemma 2, control parameter
G(τ)  Function defined in (18), control objective
τ*  Optimal τ obtained by minimizing G(τ)
TABLE II: Summary of main notations

IV Problem Formulation

When there is a large amount of data (which is usually needed for training an accurate model) distributed at a large number of nodes, the federated learning process can consume a significant amount of resources. The notion of “resources” here is generic and can include time, energy, monetary cost etc. related to both computation and communication. One often has to limit the amount of resources used for learning each model, in order not to backlog the system and to keep the operational cost low. This is particularly important in edge computing environments where the computation and communication resources are not as abundant as in datacenters.

Therefore, a natural question is how to make efficient use of a given amount of resources to minimize the loss function of model training. For the distributed gradient-descent based learning approach presented above, the question narrows down to determining the optimal values of τ and T, so that the global loss function is minimized subject to a given resource constraint for this learning task.

We use K to denote the total number of global aggregations within T iterations. Because we assumed earlier that T is an integer multiple of τ, we have K = T/τ. We define

w^f ≜ arg min_{w ∈ {w(kτ) : k = 1, 2, ..., K}} F(w).   (6)

It is easy to verify that this definition is equivalent to the w^f found from Algorithm 1.

To compute w^f in (6), each node i first computes F_i(w(kτ)) and sends the result to the aggregator, then the aggregator computes F(w(kτ)) according to (2). Since each node only knows the value of w(kτ) after the k-th global aggregation, F_i(w(kτ)) at node i will be sent back to the aggregator at the (k + 1)-th global aggregation, and the aggregator computes F(w(kτ)) afterwards. To compute the last loss value F(w(Kτ)), an additional round of local and global update is performed at the end. We assume that at each node, a local update consumes the same amount of resource no matter whether only the local loss is computed (in the last round) or both the local loss and gradient are computed (in all the other rounds), because the loss and gradient computations can usually be based on the same intermediate result. For example, the back propagation approach for computing gradients in neural networks requires a forward propagation procedure that essentially obtains the loss as an intermediate step [7].

We consider M different types of resources. For example, one type of resource can be time, another type can be energy, a third type can be communication bandwidth, etc. For each m ∈ {1, 2, ..., M}, we define that each local update step at all nodes consumes c_m units of type-m resource, and each global aggregation step consumes b_m units of type-m resource, where c_m and b_m are both finite real numbers. For given τ and T, the total amount of consumed type-m resource is (T + 1)c_m + (K + 1)b_m, where the additional "+1" accounts for computing the last loss value F(w(Kτ)), as discussed above.

Let R_m denote the total budget of type-m resource. We seek the solution to the following problem:

min_{τ, K ∈ {1, 2, 3, ...}}  F(w^f)   (7)
s.t.  (T + 1)c_m + (K + 1)b_m ≤ R_m,  ∀m ∈ {1, ..., M},

where T = Kτ.
To solve (7), we need to find out how τ and K (and thus T) affect the loss function F(w^f) computed on the final model parameter w^f. It is generally impossible to find an exact analytical expression relating τ and K to F(w^f), because it depends on the convergence property of gradient descent (for which only upper/lower bounds are known [39]) and the impact of the global aggregation frequency on the convergence. Further, the resource consumptions c_m and b_m can be time-varying in practice, which makes the problem even more challenging than (7) alone.
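For concreteness, the sketch below checks whether a candidate pair (τ, K) fits within the per-type budgets, using the consumption model (T + 1)c_m + (K + 1)b_m described above; all numeric values are arbitrary illustrations.

def consumed(tau, K, c, b):
    # Total consumption of each resource type for K aggregations of tau local updates each.
    # The extra "+1" accounts for the additional round used to compute the last loss value.
    T = K * tau
    return [(T + 1) * c_m + (K + 1) * b_m for c_m, b_m in zip(c, b)]

def feasible(tau, K, c, b, R):
    # True if every resource type stays within its budget R_m.
    return all(u <= R_m for u, R_m in zip(consumed(tau, K, c, b), R))

print(feasible(tau=4, K=10, c=[1.0, 0.2], b=[5.0, 1.0], R=[120.0, 25.0]))  # True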

We analyze the convergence bound of distributed gradient descent (Algorithm 1) in Section V, then use this bound to approximately solve (7) and propose a control algorithm for adaptively choosing the best values of τ and T to achieve near-optimal resource utilization in Section VI.

V Convergence Analysis

We analyze the convergence of Algorithm 1 in this section and find an upper bound of F(w^f) − F(w*). To facilitate the analysis, we first introduce some notations.

V-A Definitions

We can divide the T iterations into K different intervals, as shown in Fig. 3, with only the first and last iterations in each interval containing global aggregation. We use the shorthand notation [k] to denote the iteration interval [(k − 1)τ, kτ]³, for k = 1, 2, ..., K.

³ With slight abuse of notation, we use [(k − 1)τ, kτ] to denote the integers contained in this interval for simplicity. We use the same convention in other parts of the paper as long as there is no ambiguity.

Fig. 3: Illustration of definitions in different intervals.

For each interval [k], we use v_[k](t) to denote an auxiliary parameter vector that follows a centralized gradient descent according to

v_[k](t) = v_[k](t − 1) − η ∇F(v_[k](t − 1)),   (8)

where v_[k](t) is only defined for t ∈ [(k − 1)τ, kτ] for a given k. This update rule is based on the global loss function F(w), which is only observable when all data samples are available at a central place (thus we call it centralized gradient descent), whereas the iteration in (4) is on the local loss function F_i(w).

We define that v_[k](t) is "synchronized" with w(t) at the beginning of each interval [k], i.e., v_[k]((k − 1)τ) ≜ w((k − 1)τ), where w(t) is the average of local parameters defined in (5). Note that we also have w̃_i((k − 1)τ) = w((k − 1)τ) for all i, because the global aggregation (or initialization when k = 1) is performed in iteration (k − 1)τ.

The above definitions enable us to find the convergence bound of Algorithm 1 by taking a two-step approach. The first step is to find the gap between w(kτ) and v_[k](kτ) for each k, which is the difference between the distributed and centralized gradient descents after τ steps of local updates without global aggregation. The second step is to combine this gap with the convergence bound of v_[k](t) within each interval [k] to obtain the convergence bound of w(t).

For the purpose of the analysis, we make the following assumption on the loss function.

Assumption 1.

We assume the following for all i:

  1. F_i(w) is convex;

  2. F_i(w) is ρ-Lipschitz, i.e., |F_i(w) − F_i(w')| ≤ ρ ||w − w'|| for any w, w';

  3. F_i(w) is β-smooth, i.e., ||∇F_i(w) − ∇F_i(w')|| ≤ β ||w − w'|| for any w, w'.

Assumption 1 is satisfied for squared-SVM and linear regression (see Table I). The experimentation results that will be presented in Section VII show that our control algorithm also works well for models (such as neural networks) whose loss functions do not satisfy Assumption 1.

Lemma 1.

F(w) is convex, ρ-Lipschitz, and β-smooth.

Proof.

This follows straightforwardly from Assumption 1, the definition of F(w), and the triangle inequality. ∎

We also define the following metric to capture the divergence between the gradient of a local loss function and the gradient of the global loss function. This divergence is related to how the data is distributed at different nodes.

Definition 1.

(Gradient Divergence) For any i and w, we define δ_i as an upper bound of ||∇F_i(w) − ∇F(w)||, i.e.,

||∇F_i(w) − ∇F(w)|| ≤ δ_i.   (9)

We also define δ ≜ (Σ_i D_i δ_i) / D.

V-B Main Results

The theorem below gives an upper bound on the difference between w(t) and v_[k](t) when t is within the interval [k].

Theorem 1.

For any interval [k] and t ∈ [(k − 1)τ, kτ], we have

||w(t) − v_[k](t)|| ≤ h(t − (k − 1)τ),   (10)

where

h(x) ≜ (δ/β) ((ηβ + 1)^x − 1) − ηδx   (11)

for any x = 0, 1, 2, ....

Furthermore, as F(·) is ρ-Lipschitz, we have F(w(t)) − F(v_[k](t)) ≤ ρ h(t − (k − 1)τ).

Proof.

We first obtain an upper bound of ||w_i(t) − v_[k](t)|| for each node i, based on which the final result is obtained. For details, see Appendix -B. ∎

Note that we always have η > 0 and β > 0, because otherwise the gradient-descent procedure or the loss function becomes trivial. Therefore, we have (ηβ + 1)^x ≥ 1 + ηβx for x = 0, 1, 2, ... due to Bernoulli's inequality. Substituting this into (11) confirms that we always have h(x) ≥ 0.
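As a sanity check on the reconstructed expression for h(x) in (11), the snippet below evaluates it at a few points (the constants η = 0.01, β = 1, δ = 0.5 are arbitrary illustrative values) and verifies that h(0) = h(1) = 0 and that h is non-negative afterwards.

def h(x, eta, beta, delta):
    # Gap function from (11): (delta/beta) * ((eta*beta + 1)**x - 1) - eta*delta*x.
    return (delta / beta) * ((eta * beta + 1) ** x - 1) - eta * delta * x

eta, beta, delta = 0.01, 1.0, 0.5
values = [h(x, eta, beta, delta) for x in range(6)]
print(values)                                # h(0) = h(1) = 0, then increasing
assert all(v >= -1e-12 for v in values)      # h(x) >= 0, consistent with Bernoulli's inequality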

It is easy to see that h(0) = h(1) = 0. Therefore, when t = (k − 1)τ, i.e., at the beginning of the interval [k], the upper bound in (10) is zero. This is consistent with the definition of v_[k]((k − 1)τ) ≜ w((k − 1)τ) for any k. When t = (k − 1)τ + 1 (i.e., the second iteration in interval [k]), the upper bound in (10) is also zero. This agrees with the discussion at the end of Section III-C, showing that there is no gap between distributed and centralized gradient descents when only one local update is performed after the global aggregation. If τ = 1, then t − (k − 1)τ is either 0 or 1 for any interval [k] and t ∈ [(k − 1)τ, kτ]. Hence, the upper bound in (10) becomes exact for τ = 1.

For τ > 1, the value of h(t − (k − 1)τ) can be larger. When x is large, the exponential term (ηβ + 1)^x in (11) becomes dominant, and the gap between w(t) and v_[k](t) can increase exponentially with t for t ∈ [(k − 1)τ, kτ]. We also note that h(x) is proportional to the gradient divergence δ (see (11)), which is intuitive because the more the local gradient differs from the global gradient (for the same parameter w), the larger the gap will be. The gap is caused by the difference in the local gradients at different nodes starting at the second local update after each global aggregation. In an extreme case when all nodes have exactly the same data samples (and thus the same local loss functions), the gradients will always be the same and δ = 0, in which case w(t) and v_[k](t) are always equal.

Theorem 1 gives an upper bound of the difference between distributed and centralized gradient descents for each iteration interval [k], assuming that v_[k](t) in the centralized gradient descent is synchronized with w(t) at the beginning of each interval [k]. Based on this result, we first obtain the following lemma.

Lemma 2.

When all the following conditions are satisfied:

  1. η ≤ 1/β;

  2. ηφ − ρh(τ)/(τε²) > 0;

  3. F(v_[k](kτ)) − F(w*) ≥ ε for all k;

  4. F(w(T)) − F(w*) ≥ ε

for some ε > 0, where we define ω ≜ min_k 1/||v_[k]((k − 1)τ) − w*||² and φ ≜ ω(1 − βη/2), then the convergence upper bound of Algorithm 1 after T iterations is given by

F(w(T)) − F(w*) ≤ 1 / (T (ηφ − ρh(τ)/(τε²))).   (12)
Proof.

We first analyze the convergence of v_[k](t) within each interval [k]. Then, we combine this result with the gap between w(kτ) and v_[k](kτ) from Theorem 1 to obtain the final result. For details, see Appendix -C. ∎

We then have the following theorem.

Theorem 2.

When η ≤ 1/β, we have

F(w^f) − F(w*) ≤ 1/(2ηφT) + √(1/(4η²φ²T²) + ρh(τ)/(ηφτ)) + ρh(τ).   (13)
Proof.

Condition 1 in Lemma 2 is always satisfied due to the condition η ≤ 1/β in this theorem.

When h(τ) = 0, we can choose ε to be arbitrarily small (but greater than zero) so that conditions 2–4 in Lemma 2 are satisfied. We see that the right-hand sides of (12) and (13) are equal in this case (when h(τ) = 0), and the result in (13) follows directly from Lemma 2 because F(w^f) ≤ F(w(T)) according to the definition of w^f in (6).

We consider h(τ) > 0 in the following. Consider the right-hand side of (12) and let

ε_0 = 1 / (T (ηφ − ρh(τ)/(τε_0²))).   (14)

Solving for ε_0, we obtain

ε_0 = 1/(2ηφT) + √(1/(4η²φ²T²) + ρh(τ)/(ηφτ)),   (15)

where the negative solution is ignored because ε > 0 in Lemma 2. Because ε_0 > 0 according to (15), the denominator in (14) is greater than zero, thus condition 2 in Lemma 2 is satisfied for any ε ≥ ε_0, where we note that ηφ − ρh(τ)/(τε²) increases with ε when ε > 0.

Suppose that there exists ε > ε_0 satisfying conditions 3 and 4 in Lemma 2, so that all the conditions in Lemma 2 are satisfied. Applying Lemma 2 and considering (14), we have

F(w(T)) − F(w*) ≤ 1 / (T (ηφ − ρh(τ)/(τε²))) ≤ 1 / (T (ηφ − ρh(τ)/(τε_0²))) = ε_0 < ε,

which contradicts condition 4 in Lemma 2. Therefore, there does not exist ε > ε_0 that satisfies both conditions 3 and 4 in Lemma 2. This means that either 1) there exists k such that F(v_[k](kτ)) − F(w*) ≤ ε_0, or 2) F(w(T)) − F(w*) ≤ ε_0. It follows that

min{ min_k F(v_[k](kτ)); F(w(T)) } − F(w*) ≤ ε_0.   (16)

From Theorem 1, F(w(kτ)) − F(v_[k](kτ)) ≤ ρh(τ) for any k. Combining this with (16), we get

min{ min_k F(w(kτ)); F(w(T)) } − F(w*) ≤ ε_0 + ρh(τ),

where we recall that w(T) = w(Kτ). Using (6) and (15), we obtain the result in (13). ∎

We note that the bound in (13) has no restriction on how the data is distributed at different nodes. The impact of different data distributions is captured by the gradient divergence δ, which is included in h(τ). It is easy to see from (11) that h(τ) is non-negative, non-decreasing in τ, and proportional to δ. Thus, as one would intuitively expect, for a given total number of local update steps T, the optimality gap (i.e., F(w^f) − F(w*)) becomes larger when τ and δ are larger. For given τ and δ, the optimality gap becomes smaller when T is larger. When τ = 1, we have h(τ) = h(1) = 0, and the optimality gap converges to zero as T → ∞. When τ > 1, we have h(τ) > 0, and we can see from (13) that in this case, convergence is only guaranteed to a non-zero optimality gap as T → ∞. This means that when we have unlimited budget for all types of resources (i.e., R_m → ∞ for all m), it is always optimal to set τ = 1 and perform global aggregation after every step of local update. However, when the resource budget R_m is limited for some m, the training will be terminated after a finite number of iterations, thus the value of T is finite. In this case, it may be better to perform global aggregation less frequently so that more resources can be used for local update, as we will see later in this paper.

VI Control Algorithm

We propose an algorithm that approximately solves (7) in this section. We first assume that the resource consumptions c_m and b_m (∀m) are known, and we solve for the values of τ and T. Then, we consider practical scenarios where c_m, b_m, and some other parameters are unknown and may vary over time, and we propose a control algorithm that estimates the parameters and dynamically adjusts the value of τ in real time.

VI-A Approximate Solution to (7)

We assume that η is chosen to be small enough such that η ≤ 1/β, and use the upper bound in (13) as an approximation of F(w^f) − F(w*). Because the minimum value F(w*) of a given global loss function F(w) is a constant, the minimization of F(w^f) in (7) is equivalent to minimizing this upper bound. With this approximation and rearranging the inequality constraints in (7), we can rewrite (7) as

min_{τ, K ∈ {1, 2, 3, ...}}  1/(2ηφT) + √(1/(4η²φ²T²) + ρh(τ)/(ηφτ)) + ρh(τ)   (17)
s.t.  K (c_m τ + b_m) ≤ R'_m,  ∀m ∈ {1, ..., M},

where T = Kτ and R'_m ≜ R_m − b_m − c_m.

It is easy to see that the objective function in (17) decreases with T, thus it also decreases with K because T = Kτ. Therefore, for any τ, the optimal value of K is min_m ⌊R'_m / (c_m τ + b_m)⌋, i.e., the largest value of K that does not violate any inequality constraint in (17), where ⌊·⌋ denotes the floor function for rounding down to an integer. To simplify the analysis, we approximate K by ignoring the rounding operation and substituting K ≈ min_m R'_m / (c_m τ + b_m) into the objective function in (17), yielding

G(τ) ≜ 1/(2ηφT(τ)) + √(1/(4η²φ²T(τ)²) + ρh(τ)/(ηφτ)) + ρh(τ), with T(τ) ≜ τ min_m R'_m / (c_m τ + b_m),   (18)

and we can define the (approximately) optimal τ as

τ* ≜ arg min_{τ ∈ {1, 2, 3, ...}} G(τ),   (19)

from which we can directly obtain the (approximately) optimal K as K* = min_m ⌊R'_m / (c_m τ* + b_m)⌋, and the (approximately) optimal T as T* = K*τ*.
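A small numerical sketch of this mapping (using the rearranged constraint K(c_m τ + b_m) ≤ R'_m reconstructed above; all constants are arbitrary illustrations) computes K and T for a candidate τ from the per-type budgets.

import math

def K_and_T(tau, c, b, R):
    # Largest feasible number of aggregation rounds K for a given tau, and T = K * tau,
    # using R'_m = R_m - b_m - c_m.
    R_prime = [R_m - b_m - c_m for c_m, b_m, R_m in zip(c, b, R)]
    K = min(math.floor(Rp / (c_m * tau + b_m)) for c_m, b_m, Rp in zip(c, b, R_prime))
    return K, K * tau

print(K_and_T(tau=4, c=[1.0, 0.2], b=[5.0, 1.0], R=[120.0, 25.0]))  # (12, 48) for these numbers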

Proposition 1.

When , , , , we have , where .

Proof.

Because , we have . Thus, . Let . With a slight abuse of notation, we consider continuous values of . We have

where the first inequality is from a lower bound of logarithmic function [40]. We also have

where the first inequality is from a lower bound of [40], the second inequality is because and .

Thus, for any , increases with , and is non-decreasing with . We also note that increases with for any , and . It follows that increases with for any . Hence, . ∎

Combining Proposition 1 with Theorem 2, we know that using the τ* found from (19) guarantees convergence with zero optimality gap as the resource budgets R_m → ∞ (and thus K → ∞ and T → ∞). For general values of R_m (and T), we have the following result.

Proposition 2.

When , , , , there exists a finite value , which only depends on , , , , , , , (), such that . The quantity is defined as

where index (set ), , , . Here, for convenience, we allow to interchangeably return a set and an arbitrary value in that set, we also define .

We also note that , thus .

Proof.

We can show that is finite according to the definition of and , then it is easy to see that is finite. We then show for any , in which case the maximization over in (18) becomes fixing . Then, the proof separately considers the terms inside and outside the square root in (18). It shows that the first order derivatives of both parts are always larger than zero when . Because the square root is an increasing function, increases with for , and thus . See Appendix -D for details. ∎

There is no closed-form solution for τ* because G(τ) includes both polynomial and exponential terms of τ, where the exponential term is embedded in h(τ). Because τ can only be a positive integer, according to Proposition 2, we can compute G(τ) within a finite range of τ to find the τ* that minimizes G(τ).
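Since τ* has no closed form, the search described above amounts to a simple scan over integer values of τ. In the sketch below, G is passed in as a callable because (18) depends on estimated quantities such as ρ, β, δ, and φ; the toy objective and the bound tau_max are assumptions used only to make the example runnable.

def find_tau_star(G, tau_max):
    # Linear search for the integer tau in [1, tau_max] that minimizes the control objective G.
    return min(range(1, tau_max + 1), key=G)

# Usage sketch with a toy objective standing in for (18):
# a term that decreases with tau plus a slowly growing "gap" term.
G_toy = lambda tau: 1.0 / tau + 0.01 * (1.01 ** tau - 1)
print(find_tau_star(G_toy, tau_max=100))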

VI-B Adaptive Federated Learning

In this subsection, we present the complete control algorithm for adaptive federated learning, which recomputes τ* in every global aggregation step based on the most recent system state. We use the theoretical results above to guide the design of the algorithm.

As mentioned earlier, the local updates run on edge nodes and the global aggregation is performed through the assistance of an aggregator, where the aggregator is a logical component that may also run on one of the edge nodes. The complete procedures at the aggregator and each edge node are presented in Algorithms 2 and 3, respectively, where the local update steps in Algorithm 3 are distinguished from the rest, which is considered part of global aggregation, initialization, or the final operation. We assume that the aggregator initiates the learning process, and the initial model parameter is sent by the aggregator to all edge nodes. We note that instead of transmitting the entire model parameter vector in every global aggregation step, one can also transmit compressed or quantized model parameters to further save the communication bandwidth, where the compression or quantization can be performed using techniques described in [19, 20], for instance.

VI-B1 Estimation of Parameters in G(τ)

The expression of G(τ), which includes h(τ), has parameters which need to be estimated in practice. Among these parameters, c_m and b_m (∀m) are related to resource consumption, while ρ, β, and δ are related to the loss function characteristics. These parameters are estimated in real time during the learning process.

The values of c_m and b_m (∀m) are estimated based on measurements of resource consumption at the edge nodes and the aggregator (see Algorithm 2). The estimation depends on the type of resource under consideration. For example, when the type-m resource is energy, the sum energy consumption (per local update) at all nodes is considered as c_m; when the type-m resource is time, the maximum computation time (per local update) at all nodes is considered as c_m. The aggregator also monitors the total consumption of each resource type based on these estimates, and compares the total resource consumption against the resource budget R_m (see Algorithm 2). If the consumed resource is at the budget limit for some m, it stops the learning and returns the final result.
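The per-type aggregation rule described above might look as follows; this is a sketch, and the mapping from resource kind to aggregation rule (energy-like resources summed across nodes, time-like resources taking the maximum because nodes compute in parallel) is an assumption based on the examples in the text.

def estimate_c(per_node_usage, resource_kind):
    # Estimate c_m, the consumption of one local update step, for a single resource type.
    # per_node_usage: measured per-local-update consumption, one entry per edge node.
    if resource_kind == "energy":
        return sum(per_node_usage)   # total energy across nodes
    if resource_kind == "time":
        return max(per_node_usage)   # slowest node dominates wall-clock time
    raise ValueError(f"unknown resource kind: {resource_kind}")

print(estimate_c([0.5, 1.25, 0.25], "energy"))  # 2.0
print(estimate_c([0.5, 1.25, 0.25], "time"))    # 1.25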

The values of ρ, β, and δ are estimated based on the local and global losses and gradients computed at w(t) and w_i(t) (see Algorithms 2 and 3). To perform the estimation, each edge node needs to have access to both its local model parameter w_i(t) and the global model parameter w(t) for the same iteration t, which is only possible when global aggregation is performed in iteration t. Because w(t) is only observable by each node after global aggregation, estimated values of ρ, β, and δ are only available for recomputing τ* starting from the second global aggregation step after initialization, where each recomputation uses the estimates obtained at the previous global aggregation step⁴.

⁴ The parameters that each node sends for this estimation in Algorithm 3 are obtained at the previous global aggregation step.

Input: Resource budgets R_m (∀m), control parameter φ, search range parameter γ, maximum τ value τ_max
Output: w^f
Initialize τ ← 1, t ← 0, and s_m ← 0 (∀m);  // s_m is a resource counter
Initialize w(0) as a constant or a random vector; Initialize w^f ← w(0);
repeat
       Send w(t) and τ to all edge nodes, also send STOP if it is set;
       t_0 ← t;  // Save iteration index of last transmission of w(t)
       t ← t + τ;  // Next global aggregation is after τ iterations
       Receive w_i(t) and the local resource-consumption measurements from each node i; Compute w(t) according to (5);
       if t_0 > 0 then
              Receive the local losses and gradients evaluated at iteration t_0 from each node i; Compute F(w(t_0)) according to (2);
              if F(w(t_0)) < F(w^f) then
                     w^f ← w(t_0);
              if STOP flag is set then
                     break;  // Break out of the loop here if STOP is set
              Estimate ρ and β; estimate δ_i for each i, from which we estimate δ; Compute a new value of τ* according to (19) via linear search on integer values of τ within [1, τ_max], where τ_max is set based on the search range parameter γ;
       for m = 1, ..., M do
              Estimate the resource consumptions c_m and b_m using the measurements received from all nodes and local measurements at the aggregator; s_m ← s_m + c_m τ + b_m;
       if there exists m such that the remaining budget R_m − s_m is insufficient for another full round then
              Decrease τ to the maximum possible value such that the estimated resource consumption for the remaining iterations is within budget for all m, and set the STOP flag;
Send w(t) to all edge nodes; Receive F_i(w(t)) from each node i; Compute