I Introduction
The proliferation of smartphones, unmanned aerial vehicles (UAVs), and other devices comprising the Internet of Things (IoT) is causing an exponential rise in data generation and large demands for machine learning (ML) at the edge [8373692]. For example, sensor and camera modules on self-driving cars produce up to 1.4 terabytes of data per hour [carData1] with the objective of training ML models for intelligent navigation. The traditional paradigm in ML of centralized training at a server is often not feasible in such environments since (i) transferring these large volumes of data from the devices to the cloud imposes long transfer delays and (ii) users are sometimes unwilling to share their data due to privacy concerns [7498684].
Federated learning (FedL) is a recently proposed distributed ML technique aiming to overcome these challenges [mcmahan2017communication, konevcny2016federated]. Under FedL, devices train models on their local datasets, typically by means of gradient descent, and a server periodically aggregates the parameters of local models to form a global model. This global model is then transferred back to the devices for the next round of local updates, as depicted in Fig. 1. In conventional FedL, each device processes its own collected data and operates independently within an aggregation period. This will become problematic in terms of upstream device communication and local device processing requirements, however, as its implementations scale to networks consisting of millions of heterogeneous wireless devices [niknam2020federated, hosseinalipour2020federated].
At the same time, device-to-device (D2D) communications that are becoming part of 5G and IoT can enable local offloading of data processing from resource-hungry to resource-rich devices [9155510]. Additionally, we can expect that for particular applications, the datasets collected across devices will contain varying degrees of similarity, e.g., images gathered by UAVs conducting surveillance over the same area [9084352, kairouz2019advances]. Processing similar data distributions at multiple devices adds overhead to FedL and thus presents an opportunity for efficiency improvement.
Motivated by this, we develop a novel methodology for smart device sampling with data offloading in FedL. Specifically, we formulate a joint sampling and data offloading optimization problem where devices expected to maximize contribution to model training are sampled for training participation, while devices that are not selected may transfer data to those that are. This data offloading is performed according to estimated data dissimilarities between nodes, which are updated as transfers are observed. We show that our methodology yields superior model performance to conventional FedL while significantly reducing network resource utilization. In our model, motivated by paradigms such as fog learning [hosseinalipour2020federated, hosseinalipour2020multi, tu2020network], data offloading only occurs among trusted devices; devices that have privacy concerns are exempt from data offloading.
I-A Related Work
To improve the communication efficiency of FedL, recent works have focused on efficient encoding designs to reduce parameter transmission sizes [9155479, sattler2019robust], optimizing the frequency of global aggregations [8486403, wang2019adaptive], and device sampling [ji2020dynamic, nguyen2020fast]. Our work falls into the third category. In this regard, most works have assumed a static or uniform device selection strategy, e.g., [wang2019adaptive, hosseinalipour2020multi, yang2019energy, tran2019federated, tu2020network, konevcny2016federated, sahu2018convergence, reisizadeh2020fedpaq, ji2020dynamic, nguyen2020fast], where the main server chooses a subset of devices either uniformly at random or according to a predetermined sampling distribution. There is also an emerging line of work on device sampling based on wireless channel characteristics, specifically in cellular networks [8851249, shi2019device, xia2020multi, ren2020scheduling, he2020importance]. By contrast, we develop a sampling technique that adapts to the heterogeneity of device resources and overlaps across local data distributions that are key characteristics of contemporary wireless edge networks. Also, we study device sampling based on the utility of device data distributions. Specifically, when compared to the limited literature on device sampling by each device’s instantaneous contributions to global updates [9155494], we introduce a novel perspective based on device data similarities. Our methodology exploits the proliferation of D2D communications at the wireless edge [tu2020network, hosseinalipour2020multi], to diversify each selected device’s local data via D2D offloading. Our work thus considers the novel problems of sampling and D2D offloading for FedL jointly, and leads to new analytical convergence bounds and algorithms used by implementations.
It is worth mentioning two parallel lines of work in FedL that consider relationships between node data distributions. One is on fairness [williamson2019fairness], in which the objective is to train the ML model without biasing the result towards any one device’s distribution, e.g., [mohri2019agnostic, li2019fair]. Another leverages transfer learning techniques [pan2009survey] to build models across data parties (e.g., companies or enterprises) that possess partial overlaps [li2019fedmd, liu2018secure, gao2019privacy]. Our work is focused on a fundamentally different objective, i.e., network resource efficiency optimization.
I-B Motivating Toy Example
Consider Fig. 2, wherein five heterogeneous smart cars communicate with an edge server to train an object detection model. Due to limited bandwidth, the server can only exploit out of the cars to conduct FedL training, but needs to train a model representative of the entire dataset within this network. The computational capability of each car, i.e., the number of processed datapoints in one aggregation period, is shown next to it, and the edge weights in the data similarity graph capture the similarity between the local data of the cars. Rather than using statistical distance metrics [liese2006divergences], which are hard to compute in this distributed scenario, the data similarities could be estimated by commute routes and geographical proximity [jeske2013floating]. Also, in D2D-enabled environments, nodes can exchange small data samples with trusted neighbors to calculate similarities locally and report them to the server.
In Fig. 2, if the server samples the cars with the highest computational capabilities, i.e., and , the sampling is inefficient due to the high data similarity between them. Also, if it samples those with the lowest similarity, i.e., and , the local models will be based on low computational capabilities, which will often result in a low accuracy (and could be catastrophic in this vehicular scenario). Optimal sampling of the cars considering both data similarities and computational capabilities is thus critical to the operation of FedL.
We take this one step further to consider how D2D offloading can lead to augmented local distributions of sampled cars. The node sampling must consider the neighborhoods of different cars and the capability of data offloading in those neighborhoods: D2D is cheaper in terms of resource utilization among those that are close in proximity, for example. The feasible offloading topology in Fig. 2 is represented by the data offloading graph. Given ’s high processing capability and data dissimilarity with neighboring cars and , sampling in a D2D-optimized solution can yield a composite of all three cars’ distributions. The purpose of this paper is to model these relationships for a general wireless edge network and optimize the resulting sampling and offloading configurations.
I-C Outline and Summary of Contributions

We formulate the joint sampling and D2D offloading optimization problem for maximizing FedL model accuracy subject to realistic network resource constraints (Sec. II).

Our theoretical analysis of the offloading subproblem for a fixed sampling strategy yields a new upper bound on the convergence of FedL under an arbitrary data sampling strategy (Sec. III). Using this bound, we derive an efficient sequential convex optimizer for the offloading strategy.

We propose a novel ML-based methodology that learns the desired combination of sampling and resulting offloading (Sec. IV). We encapsulate the network structure and offloading scheme into model features and learn a mapping to the sampling strategy that maximizes expected FedL accuracy.

We evaluate our methodology through experiments on real-world ML tasks with network parameters from our testbed of wireless IoT devices (Sec. V). Our results demonstrate model accuracies that exceed FedL trained on all devices with significant reductions in processing requirements.
II System and Optimization Model
In this section, we formulate the joint sampling and offloading optimization (Sec. II-D). We first introduce our edge device (Sec. II-A), network (Sec. II-B), and ML (Sec. II-C) models.
II-A Edge Device Model
We consider a set of devices connected to a server, and time span for model training. Each device possesses a data processing capacity , which limits the number of datapoints it can process for training at time , and a unit data processing cost . Intuitively, , are related to the total CPU cycles, memory (RAM), and power available at device [morabito2018legiot]. These factors are heterogeneous and time-varying, e.g., as battery power fluctuates and as each device becomes occupied with other tasks. Additionally, for each , we define as the data transmit budget, and as the unit data transmission cost to device . Intuitively, , are dependent on factors such as the available bandwidth and the channel interference conditions. For example, devices that are closer in proximity would be expected to have lower .
Due to resource constraints, the server selects a set of devices to participate in FedL training. Some devices may be stragglers, i.e., possessing insufficient to participate in training, but nonetheless gather data. Different from most works, our methodology will seek to leverage the datasets captured by nodes in the unsampled set via local D2D communications with nodes in the sampling set .
We denote the dataset at device for the specific ML application by . is the initial data at , which evolves as offloading takes place. Henceforth, we use calligraphic font (e.g., ) to denote a set, and noncalligraphic (e.g., ) to denote its cardinality. Each data point is represented as , where is a feature vector of features, and is the target label.
II-B Network Topology and Data Similarity Model
We consider a time-varying network graph , among the set of nodes to represent the available D2D topology. Here, denotes the set of edges or connections between the nodes, where if node is able/willing to transfer data in D2D mode to node at time , depending on, e.g., the trust between the devices, and whether the devices are D2D-enabled. For instance, smart home peripherals can likely transfer data to their owner’s smartphone, while certain smart cars in the vehicular network in Fig. 2 may be unwilling to communicate. We capture these potential D2D relationships using the adjacency matrix , where if , and otherwise.
We define as the fraction of node ’s data offloaded to node at time . To optimize this, we are interested in the similarity among local datasets. We define the similarity matrix among the nodes at time , where . Higher values of imply a higher dataset similarity between nodes and , and thus less offloading benefit. In practice, neither the server nor the devices have exact knowledge of the local data distributions. To this end, we consider a probabilistic interpretation of similarity where is defined based on the probability that a data point sampled i.i.d. from is “similar” to at least one data point . Two datapoints and are considered similar if (i) they have the same label , and (ii) their feature vectors satisfy , where can be application-specific. Due to device dataset heterogeneity, in general, data similarity will not be symmetric.
For , is defined from according to the offloading behavior, as we will explain in Sec. II-D. Estimates of the initial , i.e., before any offloading takes place, can be obtained in several ways that avoid transferring large volumes of data. We assume that device will broadcast a random sample of to its neighbor , which can then estimate by determining the fraction in this sample that are similar to a . To capture both the node connectivity and data similarity jointly for offloading, we also define the connectivity-similarity matrix , and , where represents the Hadamard product.
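The sample-based similarity estimate described above can be sketched in a few lines. This is only an illustrative sketch: the function names, the Euclidean distance, and the threshold `eps` are assumptions introduced here, and the connectivity masking is shown simply as an element-wise (Hadamard) product of a 0/1 adjacency matrix with the similarity matrix.

```python
import math
import random

def is_similar(p, q, eps):
    """Datapoints are (features, label) pairs. They count as similar when the
    labels match and the feature distance is at most eps (Euclidean distance
    is an assumption; the threshold is application-specific)."""
    (xp, yp), (xq, yq) = p, q
    return yp == yq and math.dist(xp, xq) <= eps

def estimate_similarity(data_i, data_j, sample_size, eps, rng):
    """Fraction of a random sample of device i's data that is similar to at
    least one datapoint held by device j. The estimate is directional."""
    sample = rng.sample(data_i, min(sample_size, len(data_i)))
    if not sample:
        return 0.0
    hits = sum(any(is_similar(p, q, eps) for q in data_j) for p in sample)
    return hits / len(sample)

def hadamard(a, b):
    """Element-wise product of two equally sized matrices (lists of rows)."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

rng = random.Random(0)
d0 = [((0.0, 0.0), 1), ((1.0, 1.0), 2)]   # toy dataset at device 0
d1 = [((0.1, 0.0), 1)]                    # toy dataset at device 1
lam_01 = estimate_similarity(d0, d1, sample_size=2, eps=0.5, rng=rng)
lam_10 = estimate_similarity(d1, d0, sample_size=2, eps=0.5, rng=rng)
```

In this toy case only half of device 0's data has a counterpart at device 1, while all of device 1's data has a counterpart at device 0, which illustrates why the similarity is generally not symmetric.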
II-C Distributed Machine Learning Model
The learning objective of FedL is to train a global ML model parameterized by a vector (e.g., weights in a neural network) using the devices participating in training. Formally, for , each sampled device is concerned with updating its model parameter vector on its local dataset . The local loss at device is defined as
(1) 
where denotes the corresponding loss (e.g., squared error for a regression problem, crossentropy for a classification problem [wang2019adaptive]) of each datapoint . Each device minimizes its local loss sequentially via gradient descent:
(2) 
where is the step size and is the average gradient over the local dataset . With periodicity , the server performs a weighted average of , :
(3) 
where denotes the th aggregation, , , and denotes the total data located at node between and . The server then synchronizes the sampled devices: , .
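As a rough illustration of the local update and the periodic aggregation, the following sketch applies one gradient-descent step and then forms a data-weighted average of the sampled devices' parameters. The names `local_step`, `aggregate`, and `data_counts` are hypothetical, and weighting each device by its datapoint count is the assumed reading of the weighted average described above.

```python
def local_step(w, grad, eta):
    """One gradient-descent update of a device's parameter vector."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]

def aggregate(local_params, data_counts):
    """Data-weighted average of the sampled devices' parameter vectors."""
    total = sum(data_counts)
    dim = len(local_params[0])
    return [
        sum(n * w[k] for w, n in zip(local_params, data_counts)) / total
        for k in range(dim)
    ]

# Toy example with two sampled devices and a 2-dimensional model.
w_a = local_step([1.0, 0.0], grad=[2.0, -2.0], eta=0.5)
w_b = [4.0, 1.0]
w_global = aggregate([w_a, w_b], data_counts=[30, 10])
# After aggregation, every sampled device would be reset to w_global.
```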
Since we are concerned with the performance of the global parameter , we define as the weighted average of as in (3) for each time , though it is only computed at the server when . The global loss that we seek to optimize considers the loss over all the datapoints in the network:
(4) 
where denotes the multiset of the datasets of all the devices at time , and .
II-D Joint Sampling and Offloading Optimization
The goal of our optimization is to select (i) the subset of devices to sample from a total budget of and (ii) the offloading ratios between the devices to minimize the loss associated with . We consider a time average for the objective, as devices may rely on intermediate aggregations for real-time inferences. For the variables, we define the binary vector to represent device sampling status, i.e., if then , otherwise , and matrix to represent the offloading ratios at time . The resulting optimization problem is as follows:
(5)  
subject to  
(6)  
(7)  
(8)  
(9)  
(10)  
(11)  
(12)  
(13)  
(14)  
(15)  
(16)  
(17) 
The data at sampled devices, i.e., for , changes over time in (6) based on the total received data for device . is determined in (7) by scaling the data transmissions from to device according to the similarity. In response to the data offloading, the connectivity-similarity matrix is updated in (8). Together, (7) and (8) capture similarity-aware D2D offloading, which we explain further in the following paragraph. Next, (9)-(11) ensure that our D2D offloading solution adheres to device receive capacities , data processing limits , and D2D limits . (12) ensures total offloaded data by device does not exceed its local dataset size. Through (13)-(15), offloading only occurs between single-hop D2D neighbors from to . (16) maintains compliance with the desired sampling size, i.e., .
Similarity-aware D2D offloading: The amount of raw data device receives from is . Ideally, device will receive data that is dissimilar to . However, neither nor have full knowledge of each other’s datasets in this distributed scenario (nor does the server). Therefore, data offloading in is conducted through an i.i.d. selection of data points from to send to . The estimated overlapping data that arrives at is , and the resulting useful data is . We therefore adjust by the effective fraction of data offloaded from to , resulting in (8). In particular, when transfers all of its data to (i.e., ), becomes 1, preventing further data offloading from to according to (7). Imposing these constraints promotes data diversity among the sampled nodes through offloading.
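The bookkeeping for a single transfer can be sketched as below. The exact expressions are not reproduced in this text, so the three formulas in the code are assumptions chosen to match the stated behavior: the useful data is the dissimilar share of what is received, and the similarity reaches 1 once a device has sent all of its data. The names `offload_step`, `d_i`, `phi_ij`, and `lam_ij` are hypothetical.

```python
def offload_step(d_i, phi_ij, lam_ij):
    """Bookkeeping for one transfer from device i to device j.

    Assumed forms (illustrative only, not taken from the text):
      raw      = phi_ij * d_i                     data received by j
      useful   = (1 - lam_ij) * raw               part not overlapping j's data
      lam_next = lam_ij + (1 - lam_ij) * phi_ij   updated similarity
    """
    raw = phi_ij * d_i
    useful = (1.0 - lam_ij) * raw
    lam_next = lam_ij + (1.0 - lam_ij) * phi_ij
    return raw, useful, lam_next

# Half of 100 datapoints offloaded at similarity 0.2.
raw, useful, lam_next = offload_step(100, 0.5, 0.2)
# Offloading everything drives the similarity to 1, blocking further transfers.
_, _, lam_full = offload_step(100, 1.0, 0.2)
```

Under these assumed forms, repeated transfers between the same pair yield diminishing useful data, which is the qualitative effect the constraints are meant to capture.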
Solution overview: Problem faces two major challenges: (i) it requires a concrete model of the loss function with respect to the datasets, which is, in general, intractable for deep learning models [goodfellow2016deep], and (ii) even if the loss function is known, the coupled sampling and offloading procedures make this problem an NP-hard mixed integer programming problem. To overcome this, we will first consider the offloading subproblem for a fixed sampling strategy, and develop a sequential convex programming method to solve it in Sec. III. Then, we will integrate this into a graph convolutional network (GCN)-based methodology that learns the relationship between the network properties, sampling strategy (with its corresponding offloading), and the resulting FedL model accuracy in Sec. IV. An overall flowchart of our methodology is given in Fig. 3.
III Developing the Offloading Optimizer
In this section, we study the offloading optimization subproblem of . Our theoretical analysis of (5) under common assumptions will yield an efficient approximation of the objective in terms of the offloading variables (Sec. III-B). We will then develop our solver for the resulting optimization (Sec. III-C).
III-A Definitions and Assumptions
To aid our theoretical analysis of FedL, similar to [wang2019adaptive], we will consider a hypothetical ML training process that has access to the entire dataset at each time instance. The parameter vector for this centralized model is trained as follows: (i) at each global aggregation , is synchronized with , i.e., , and (ii) in between global aggregations, is trained based on gradient descent iterations to minimize the global loss .
Definition 1 (Difference between sampled and unsampled gradients).
We define the instantaneous difference between , the gradient with respect to the full dataset across the network, and , the gradient with respect to the sampled dataset, as:
(18) 
where is the scaled sum of gradients on the sampled datasets, and is the total data across the sampled devices.
Definition 2 (Difference between sampled and unsampled gradients).
We define as the upper bound between the gradient computed on for and at time :
(19) 
We also make the following standard assumptions [wang2019adaptive, tu2020network] on the loss function for the ML model being trained:
Assumption 1.
We assume is convex with respect to , L-Lipschitz, i.e., , and smooth, i.e., , .
Despite these assumptions, we will show in Sec. V that our results still obtain significant improvements in practice for neural networks which do not obey the above assumptions.
III-B Upper Bound on Convergence
For convergence analysis, we assume that devices only offload the same data once, and assume that recipient nodes always keep received data. This must be done to ensure that the optimal global model parameters remain constant throughout time. The following theorem gives an upper bound on the difference between the parameters of sampled FedL and those from the centralized learning, i.e., , over time:
Theorem 1 (Upper bound on the difference between sampled FedL and centralized learning).
Assuming , the upper bound on the difference between and within the local update period before the th global aggregation, , is given by:
(20) 
where , and
(21) 
Proof.
See Appendix A. ∎
Through , Theorem 1 establishes a relationship between the difference in model parameters and the datapoints in the sampled set . Using this, we obtain an upper bound on the difference between our and the global minimizer of model loss :
Corollary 1 (Upper bound on the difference between sampled FedL and the optimal).
The difference of the loss induced by compared to the loss induced by for , is given by:
(22)  
where , , and .
Proof.
See Appendix B. ∎
As our ultimate goal is an expression of (5) in terms of the data at each node, we consider the relationship between and , which is clearly nonconvex through . Since (see Appendix B), (22) can be approximated using the first two terms of its Taylor series:
(23) 
At each time instant, the first term in the right hand side (RHS) of (23) is a constant. Thus, under this approximation, the RHS of (22) becomes proportional to , which is in turn a function of . The final step is to bound the expression for , and thus their weighted sum , in terms of the , .
Proposition 1 (Upper bound on the difference between local gradients).
The difference in gradient with respect to a sampled device dataset vs. the full dataset satisfies:
(24) 
where , , is a constant independent of , and
(25) 
Proof.
See Appendix C. ∎
The above proposition relates each to the number of instantaneous data points available at device .
III-C Offloading as a Sequential Convex Optimization
Using the result of (23) to replace the RHS of (22) implies that the objective function in (5) is proportional to , where is defined in (21) as a sum-of-ratios of . Considering as the objective in problem yields the sum-of-ratios problem in fractional programming [schaible2003fractional]. Existing solvers for the sum-of-ratios fractional programming problem (e.g., [kuno2002branch]) scale to roughly ten ratios, which corresponds to ten devices in our case. The problem for contemporary large-scale networks, which may have hundreds of edge devices [8373692], therefore cannot be solved accurately or in a time-sensitive manner with these solvers. Motivated by this, we approximate . Using this with (24), we obtain the following approximation for (5):
(26) 
where term is due to sampling and term is the statistical error from the central limit theorem. Thus, for a known binary vector (i.e., a known ) that satisfies (16), we arrive at the following optimization problem for the D2D data offloading:
Since the number of datapoints at the unsampled devices is fixed for all time, can be expressed as , where is a constant. Consequently, both the coefficient of in term and the entirety of term in (26) are decreasing functions of the quantity of data at sampled devices . Furthermore, given , through (6) and (7), both terms and are convex with respect to the offloading variables in Problem . The only remaining challenge is then to obtain , which we consider next.
Sequential gradient approximation: Obtaining requires the knowledge of real-time gradients, , , which are unknown a priori. Furthermore, the gradients of the devices are only observed at the global aggregation time instances . Motivated by this, we approximate for , , using the gradients observed at the most recent global aggregation, i.e., , on which we perform a sequence of corrective approximations. Specifically, since the average loss is convex, is expected to decrease over time. We assume that this decrease occurs linearly and approximate the real-time gradient using the previously observed gradient at the server as , , , where the scaling factor is readjusted after every global aggregation . Through the readjustment procedure, the server receives the gradients and computes the scaling factor for each aggregation period as .
Given the aforementioned characteristics of terms and in (26), our proposed iterative approximation of , and the fact that the constraints of are all affine at each time instance, we can solve this problem as a sequence of convex optimization problems over time. For this, we employ the CVXPY convex optimization software [diamond2016cvxpy].
IV Smart Device Sampling with D2D Offloading
We now turn to the sampling decisions in problem , which must be coupled with the offloading solution to . After explaining the rationale for our GCN-based approach (Sec. IV-A), we will detail our training procedure encoding the network characteristics (Sec. IV-B). Finally, we will develop an iterative procedure for selecting the sample set (Sec. IV-C).
IV-A Rationale and Overview of GCN Sampling Approach
Sampling the optimal subset of nodes from a resource-constrained network to maximize a utility function (in our case, minimizing the ML loss) has some similarity to the 0-1 knapsack problem [martello2000new]. In this combinatorial optimization problem, a set of weights and values for items are given, where each item can be either added or left out to maximize the value of the items within the knapsack subject to a weight capacity. Analogously, our sampling problem aims to maximize FedL accuracy while adhering to a sampling budget . Strategies for the knapsack problem become unsuitable here because the value that each device provides to FedL is difficult to quantify: it depends on the ML loss function, the gradient descent procedure, and the D2D relationships from Sec. III.
To address these complexities, we propose a (separate) ML technique to model the relationship between network characteristics, the sampling set, and the resulting FedL model quality. Specifically, we develop a sampling technique based on active filtering of a Graph Convolutional Network (GCN) [kipf2016semi, lee2020fast]. In a GCN, the learning procedure consists of sequentially feeding an input (the network graph) through a series of graph convolution [wu2020comprehensive] layers, which generalize the traditional convolution operation into non-Euclidean spaces and thus capture connections among nodes in different neighborhoods.
Our methodology is depicted in Fig. 4. GCNs excel at graph-based classification tasks, as they learn over the intrinsic graph structure. However, GCNs by themselves have performance issues when there are multiple good candidates for the classification problem [li2018combinatorial]. This holds for our large-scale network scenario, as many high-performing sets of sampling candidates can be expected. The data offloading scheme adds another important dimension: a sampled node may perform poorly when considered in isolation, but it may have high processing capacities and be connected to unsampled nodes with large quantities of local data and high transfer limits . We address these issues by (i) incorporating the solution from Sec. III into the GCN training procedure, and (ii) proposing sampling GCN-branch, a network-based postprocessing technique that maps the GCN output to a sampling set by considering the underlying connectivity-similarity matrix.
IV-B GCN Architecture and Training Procedure
We consider a GCN function with two inputs: (i) , a matrix of node features, and (ii) , the augmented connectivity-similarity matrix. The feature vector for each node is defined as , forming the rows of , and the augmented connectivity-similarity matrix is defined as , where denotes the identity matrix [wu2020comprehensive]. The GCN consists of two graph convolutional layers separated by a rectified linear unit (ReLU) activation layer [kipf2016semi], as depicted in Fig. 4. The outputs of each layer are defined as:
(27) 
where is the degree matrix of , denotes the trainable weights for the th layer, and represents ReLU activation. Note that , , and , where is the dimension of the second layer. Finally, log-softmax activation is applied to to convert the results into a vector of probabilities, i.e., , representing the likelihood of each node belonging to the sampled set.
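A forward pass through such a two-layer GCN can be sketched as follows. The sketch assumes the standard symmetric normalization of [kipf2016semi], uses row sums of the augmented matrix as node degrees, and assumes the second layer has output width 1 so that the log-softmax is taken across nodes; these choices and all function names are illustrative assumptions rather than the exact architecture.

```python
import math

def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(a_tilde):
    """Symmetric normalization D^{-1/2} A~ D^{-1/2}; degrees are row sums of A~,
    which are positive because A~ includes the identity (self-loops)."""
    n = len(a_tilde)
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def log_softmax(v):
    """Numerically stable log-softmax over a vector."""
    m = max(v)
    log_z = m + math.log(sum(math.exp(x - m) for x in v))
    return [x - log_z for x in v]

def gcn_forward(x, a_tilde, w0, w1):
    """Two graph-convolution layers with ReLU in between, then log-softmax
    across nodes (assumes w1 maps the hidden features to a single column)."""
    a_hat = normalize_adjacency(a_tilde)
    h1 = relu(matmul(a_hat, matmul(x, w0)))
    h2 = matmul(a_hat, matmul(h1, w1))
    return log_softmax([row[0] for row in h2])

# Toy 3-node graph with 2 features per node.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a_tilde = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
w1 = [[1.0], [-1.0]]
log_probs = gcn_forward(x, a_tilde, w0, w1)
```

Exponentiating `log_probs` gives one value per node that sums to 1, which can then be read as a relative preference for sampling each node.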
GCN training procedure: To train the GCN weights, we generate a set of sample network and node data realizations with the properties from Sec. II-A and II-B. For each realization, we calculate the matrices and corresponding to the inputs of the GCN. Then, for each candidate sampling allocation (with ), we solve from Sec. III to obtain the offloading scheme, and then determine the loss of FedL resulting from model training and D2D offloading. Among these, we choose the that yields the smallest objective to be the target GCN output. The collection of form the training samples for the GCN.
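The label-generation step for one realization can be sketched as an exhaustive search over candidate sampled sets. Here `evaluate_loss` is a hypothetical stand-in for the full pipeline of solving the offloading problem and training FedL, and the toy loss below is purely illustrative.

```python
from itertools import combinations

def best_sampling_set(num_devices, budget, evaluate_loss):
    """Score every candidate sampled set of size `budget` and return a 0/1
    indicator vector for the lowest-loss set (the GCN training target)."""
    best = min(combinations(range(num_devices), budget), key=evaluate_loss)
    return [1 if i in best else 0 for i in range(num_devices)]

# Hypothetical stand-in: pretend the loss shrinks with total capacity.
capacity = [5, 9, 2, 7]

def toy_loss(chosen):
    return 1.0 / sum(capacity[i] for i in chosen)

target = best_sampling_set(num_devices=4, budget=2, evaluate_loss=toy_loss)
```

Because every candidate set is scored, this approach is only practical for the small training realizations, which is why the learned weights are later reused on larger networks.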
As the number of devices increases, the number of choices that will be considered for the sampled set increases combinatorially as . An advantage of this GCN procedure is that it is network-size independent: once trained on a set of realizations for tolerable-sized values of , the graph convolutional layer weights , , can be applied to the desired network of arbitrary size . Our obtained performance results in Sec. V verify this experimentally.
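To give a feel for this growth, the short sketch below counts candidate sampled sets, assuming the count is the binomial coefficient of the number of devices and the sampling budget. The function name and the example sizes are illustrative assumptions.

```python
import math

def num_candidate_sets(num_devices, budget):
    """Number of distinct sampled sets of size `budget` (binomial coefficient)."""
    return math.comb(num_devices, budget)

small = num_candidate_sets(10, 3)     # tractable to enumerate for training
larger = num_candidate_sets(100, 5)   # already tens of millions of candidates
```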
IV-C Offloading-Aware Smart Device Sampling
Given any network graph, our procedure must solve the sampling problem at the point of FedL initialization, i.e., . With the trained GCN in hand, we obtain and for the target network and calculate , . Given this output, our sampling GCN-branch algorithm populates the set as follows. Let be the subset of nodes in the 98th percentile of initial data quantity. Starting with , the first node is added according to , where . In this way, the first node added is the device with highest GCN probability among the largest data generation nodes. To choose subsequent sampled nodes, the algorithm performs a recursive branch-based search on the initial connectivity-similarity matrix for nodes with the highest sampling probabilities and the least aggregate data similarity to the previously sampled nodes. Formally, we choose the th node addition as , where , with denoting the neighbor nodes of within the 98th percentile of data dissimilarity to (i.e., based on ). In this way, our branch algorithm relies on the GCN to decide which branch the sampling scheme will follow given its current sampled nodes (visualization in Fig. 4), so that subsequent selections are more likely to contain nodes with (i) different data distributions while (ii) leading to new neighborhoods that can contribute to the current set. Once the sampled set is determined, the offloading is scheduled for per the solver for from Sec. III.
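The branch-style selection can be sketched as a greedy loop. This is a simplified sketch under stated assumptions: candidates at each step are the most dissimilar neighbors of the most recently sampled node only, the percentile cut is expressed through a hypothetical `top_frac` parameter, and a fallback to any unsampled node is added for the case where no neighbor is available. None of these details are claimed to match the exact algorithm.

```python
import math

def gcn_branch(probs, data_sizes, adj, sim, budget, top_frac=0.02):
    """Greedy branch-style selection of `budget` nodes.

    probs: GCN output per node; adj[i][j] is truthy if i and j are D2D
    neighbors; sim[i][j] is the estimated similarity of i's data to j's
    (lower means more dissimilar).
    """
    n = len(probs)
    k = max(1, math.ceil(top_frac * n))
    # Seed: highest-probability node among the largest-data nodes.
    largest = sorted(range(n), key=lambda i: data_sizes[i], reverse=True)[:k]
    sampled = [max(largest, key=lambda i: probs[i])]
    while len(sampled) < budget:
        last = sampled[-1]
        nbrs = [j for j in range(n) if adj[last][j] and j not in sampled]
        # Keep only the most dissimilar neighbors of the last sampled node.
        nbrs.sort(key=lambda j: sim[last][j])
        cands = nbrs[:max(1, math.ceil(top_frac * len(nbrs)))]
        if not cands:  # fallback (assumption): consider any unsampled node
            cands = [j for j in range(n) if j not in sampled]
        if not cands:
            break
        sampled.append(max(cands, key=lambda j: probs[j]))
    return sampled

# Toy 5-node path-like graph; top_frac is enlarged so the cut is non-trivial.
probs = [0.1, 0.4, 0.25, 0.15, 0.1]
data_sizes = [50, 10, 40, 30, 20]
adj = [[0, 1, 1, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 1, 0],
       [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]]
sim = [[0.5] * 5 for _ in range(5)]
sim[2][0], sim[2][3] = 0.9, 0.1
chosen = gcn_branch(probs, data_sizes, adj, sim, budget=3, top_frac=0.5)
```

Note that node 1 has the highest probability in this toy example but is never chosen, since it is neither among the largest-data nodes nor a dissimilar neighbor along the branch; this is the intended effect of conditioning the GCN output on the graph structure.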
Summary of methodology: Fig. 3 summarizes our methodology developed in Sec. III and IV for solving . The sequential convex optimization for offloading (Sec. III) is embedded within the GCN-based sampling procedure (Sec. IV). Once the model is trained on sample network realizations, it is applied to the target network to generate the and for FedL.
V Experimental Evaluation
V-A Setup and Experimental Procedure
V-A1 Network characteristics via wireless testbed
We employed our IoT testbed in Fig. 5 to obtain device and communication characteristics. It consists of Jetson Nano, Coral Dev Board TPU, and UP Squared AI Edge boards configured to operate in D2D mode. We used Dstat [dstat] to collect the device resources and power consumption. We map the measured computing resources (in CPU cycles and RAM) and the corresponding power consumption (in mW) at devices to the costs and capacities in our model by calculating the Gateway Performance Efficiency Factor (GPEF) [morabito2018legiot]. Specifically, to determine the processing costs , we measured the GPEF of the devices running gradient iterations on the MNIST dataset [yann]. For the processing capacities , we pushed the devices to 100% load and measured the GPEF. We initialized the devices at 25%-75% loads, and treated the available remaining capacity as the receive buffer parameter .
For the transmission costs, we measured GPEF spent on D2D offloading over WiFi. Our WiFi links, when only devoted to D2D offloading, consistently saturated at 12 Mbps. To simulate the effect of external tasks, we limit available bandwidth for D2D to 1, 6, and 9 Mbps. We then calculated the transmission resource budget for devices as transfer limits , and modelled unit transfer costs as normalized D2D latency.
V-A2 Datasets and large-scale network generation
For FedL training, we use the MNIST and Fashion-MNIST (FMNIST) [fmnist] image classification datasets. We consider a CNN predictor composed of two convolutional layers with ReLU activation and dropout. The devices perform rounds of gradient descent with a learning rate . Following [tu2020network], we generate network topologies with to devices using Erdős–Rényi graphs with link formation probability (i.e., ) of . To produce local datasets across the nodes that are both overlapping and non-i.i.d., the datapoints at each node are chosen uniformly at random with replacement from datapoints among three labels (i.e., image classes). Differentiating the labels between devices captures dataset heterogeneity (i.e., from different devices collecting different labels). The number of initial datapoints at each device follows a normal distribution with mean and variance . We further estimate the initial similarity weights based on the procedure discussed in Sec. II-B.
Sampling Method  Devices (MNIST)  Devices (FMNIST)
600  700  800  600  700  800
Smart  8  9  11  8  5  7 
Random  14  17  
Heuristic  17  14  17  7  
Smart w/ offload  4  4  6  6  4  5 
Random w/ offload  11  14  10  14  12  13 
Heuristic w/ offload  15  9  11  5 
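The topology and local dataset generation described in this subsection can be illustrated with a minimal sketch. It assumes only what the text states (an Erdős–Rényi graph, three labels per device, sampling with replacement, normally distributed dataset sizes). The function names, the network size, the link probability, the pool of datapoint indices, and the mean and standard deviation are all hypothetical placeholders, not the values used in the experiments.

```python
import random

def erdos_renyi(num_devices, link_prob, rng):
    """Return an undirected G(n, p) graph as a dict of neighbor sets."""
    neighbors = {i: set() for i in range(num_devices)}
    for i in range(num_devices):
        for j in range(i + 1, num_devices):
            if rng.random() < link_prob:  # each link forms independently
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

def local_dataset(pool_by_label, all_labels, mean_size, std_size, rng,
                  labels_per_device=3):
    """Draw one device's dataset from three randomly chosen labels."""
    labels = rng.sample(all_labels, labels_per_device)
    pool = [idx for lab in labels for idx in pool_by_label[lab]]
    # Dataset size is normally distributed; clamp to at least one datapoint.
    size = max(1, round(rng.gauss(mean_size, std_size)))
    # Sampling with replacement, so datasets can overlap across devices.
    return labels, rng.choices(pool, k=size)

rng = random.Random(0)
# Hypothetical pool: 10 labels with 100 datapoint indices each.
pool_by_label = {lab: list(range(lab * 100, (lab + 1) * 100)) for lab in range(10)}
graph = erdos_renyi(num_devices=20, link_prob=0.1, rng=rng)
datasets = [local_dataset(pool_by_label, list(range(10)), 50, 5, rng)
            for _ in graph]
```

Note that `random.gauss` takes a standard deviation, so a variance taken from the text would need a square root before being passed in.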
For the GCN-based sampling procedure, we train the model on small network realizations of ten devices. We consider sampling budgets of to , with thousands of training samples in each case. We save the resulting graph convolutional layer weights and for each choice of and reapply them on the larger target networks.
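Reusing weights learned on ten-device networks is possible because a graph convolutional layer's weight matrix is sized by the feature dimensions, not by the number of nodes. The sketch below illustrates this with the widely used symmetric-normalized layer form, ReLU(Â H W); this form, the weight values, and the toy graphs are assumptions for illustration and may differ from the architecture used in this work.

```python
import math

def normalized_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2} for a dense 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    hidden = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]

# Hypothetical 2-input, 3-output weight matrix, standing in for trained weights.
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]

# Small graph: a 3-node path.
small_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
small_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Larger graph: a 6-node ring, processed with the same weights.
large_adj = [[1 if abs(i - j) in (1, 5) else 0 for j in range(6)] for i in range(6)]
large_feat = [[float(i % 2), 1.0] for i in range(6)]

small_out = gcn_layer(small_adj, small_feat, weights)
large_out = gcn_layer(large_adj, large_feat, weights)
```

Only the adjacency and feature matrices change between the two calls; the weight matrix is shared, which is what allows a model fit on small realizations to be applied to larger target networks.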
V-B Results and Discussion
In the following experiments, we compare our methodology to several baseline sampling and offloading schemes. The three sampling strategies considered are smart, random, and heuristic. Smart sampling refers to our proposed method, random sampling averages the performance over five randomly sampled combinations of devices, and heuristic sampling selects the devices with the highest processing capacities. Each of the three sampling schemes is run both with and without offloading. For smart sampling, our offloading methodology is used. For random sampling, we perform random offloading. For heuristic sampling, we perform a greedy offloading that maximizes the number of received datapoints for the device. A baseline of FedL with no sampling (i.e., all nodes active) and no offloading is also included.
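The two baseline sampling strategies can be sketched as follows. This is a minimal illustration, assuming a dictionary of per-device processing capacities and a caller-supplied `evaluate` function that scores a sampled set; the function names, device names, capacity values, and the stand-in scoring function are hypothetical.

```python
import random

def heuristic_sample(capacities, budget):
    """Pick the `budget` devices with the largest processing capacity."""
    ranked = sorted(capacities, key=capacities.get, reverse=True)
    return set(ranked[:budget])

def random_sample_mean(devices, budget, evaluate, trials=5, rng=None):
    """Average a performance metric over `trials` random device subsets."""
    rng = rng or random.Random()
    scores = [evaluate(set(rng.sample(devices, budget))) for _ in range(trials)]
    return sum(scores) / trials

# Hypothetical processing capacities for four devices.
capacities = {"a": 3.0, "b": 9.5, "c": 1.2, "d": 7.1}

chosen = heuristic_sample(capacities, budget=2)

# Stand-in metric: total capacity of the sampled set. In the experiments
# the metric would be the FedL accuracy obtained with that set.
mean_score = random_sample_mean(
    list(capacities), 2,
    evaluate=lambda sampled: sum(capacities[d] for d in sampled),
    rng=random.Random(1),
)
```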
V-B1 Model accuracy
Figs. 6 and 7 show FedL accuracy for both datasets in six different combinations ( and ). Overall, we see that the final accuracy obtained by our smart with (w/) offloading scheme outperforms all of the other methods except for FedL in two cases (Figs. 6a and 6d). The comparison with FedL using all nodes is notable because that baseline leverages every device in the network, while our sampling uses at most 5% of them. Our smart data offloading methodology outperforms FedL for two main reasons: (i) it minimizes data skew resulting from unbalanced label frequencies, and (ii) it ensures higher quality of local datasets at sampled nodes, which reduces the bias caused by multiple local gradient descent steps. Without offloading, the improvement obtained over the heuristic and random sampling strategies consistently exceeds 20% for MNIST and 10% for FMNIST, which shows that sampling optimization still leads to considerable improvements when D2D is disabled. On the other hand, our method with offloading obtains a substantial improvement over no offloading in most cases, whereas these differences are smaller for the heuristic and random methods. This emphasizes the importance of designing the sampling and offloading schemes for FedL jointly.
V-B2 Model convergence speed
We next compare the convergence speed of our methodology to the other schemes in terms of the number of global aggregations needed to reach a certain percentage of the final accuracy of FedL with all nodes. Table I compares the convergence speeds on MNIST (to reach 85%) and FMNIST (to reach 65%), respectively, for and . Overall, we see that our joint sampling and offloading methodology trains significantly faster than the other methods, by 40% on average for MNIST and 50% for FMNIST. Enabling offloading is also seen to improve the convergence speed of each sampling scheme; in fact, without offloading, several baseline cases fail to reach the given percentage of the FedL baseline.
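The convergence metric described above can be computed from a per-aggregation accuracy trace, as in the following minimal sketch. The function name and all accuracy values are made up for illustration; they are not measurements from the experiments.

```python
def rounds_to_reach(accuracy_per_round, baseline_final_accuracy, fraction):
    """Return the first global aggregation (1-indexed) whose accuracy reaches
    `fraction` of the baseline's final accuracy, or None if it never does."""
    target = fraction * baseline_final_accuracy
    for round_idx, acc in enumerate(accuracy_per_round, start=1):
        if acc >= target:
            return round_idx
    return None

# Hypothetical final accuracy of FedL with all nodes.
baseline_final = 0.90

# Hypothetical accuracy traces for two schemes.
curve_a = [0.30, 0.55, 0.70, 0.78, 0.82]
curve_b = [0.20, 0.35, 0.45, 0.50, 0.52]

rounds_a = rounds_to_reach(curve_a, baseline_final, 0.85)
rounds_b = rounds_to_reach(curve_b, baseline_final, 0.85)
```

Here `rounds_a` is 4, since 0.78 is the first value at or above the target of roughly 0.765, while `rounds_b` is `None`, which corresponds to a scheme that fails to reach the given percentage of the FedL baseline.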
V-B3 Resource utilization
Finally, we compare the resource utilization of the different schemes in terms of the total data processed across nodes in the network. Fig. 8 gives the results for each dataset, comparing the datapoints processed by methods with and without offloading to reach within a percentage of an arbitrary reference accuracy (60%). We see that smart sampling (with or without offloading) outperforms the other schemes in all cases, which highlights the computational efficiency obtained by our method. As the accuracy level increases, our methodology consistently requires fewer datapoints than the other methods (on average 40% fewer), emphasizing its ability to filter out duplicate data.
These experiments demonstrate that our joint optimization method outperforms the baselines in terms of model accuracy, convergence speed, and resource utilization.
VI Conclusion and Future Work
In this paper, we developed a novel methodology to solve the joint sampling and D2D offloading optimization problem for FedL. Our theoretical analysis of the offloading subproblem produced new convergence bounds for FedL, and led to a sequential convex optimization solver. We then developed a GCN-based algorithm that determines the sampling strategy by learning the relationships between the network properties, the offloading topology, the sampling set, and the FedL accuracy. Our implementations using real-world datasets and IoT measurements from our testbed demonstrated that our methodology obtains significant improvements in terms of datapoints processed, training speed, and resulting model accuracy compared to several other algorithms, including FedL using all devices. Future investigations will consider the integration of realistic network characteristics into FedL.
Appendix A
Since , , we get:
(28) 
We simplify (A) through the following steps:
(29) 
where results from expanding and applying the triangle inequality repeatedly, follows from using the smoothness of the loss function and the triangle inequality, applies Lemma 3 from [wang2019adaptive], and uses the expanded form of in (21). We then rearrange (A):
(30)  
Since when resynchronization occurs at , , we express as:
(31) 
Appendix B
We first define . Since is Lipschitz, we apply the result of Theorem 1, and Lemmas 2 and 6 from [wang2019adaptive] to obtain: