I Introduction
With the advent of novel powerful computing platforms, alongside the availability of large-scale datasets, machine learning (ML), and particularly deep learning, has gained significant interest in recent years
[1]. Such developments have also contributed to the invention of more advanced ML architectures and more efficient training mechanisms [2], which have resulted in state-of-the-art performance in many domains, such as computer vision
[3, 4] and healthcare [5]. More recently, however, there has been an increasing awareness among consumers of services driven by ML models regarding the privacy of their data. Depending on how sensitive the data type is or how often it is collected, each user has their own privacy concerns and preferences [6]. Such trends have coincided with the proliferation of mobile computing solutions, which provide devices, such as smart-home devices, cell phones, laptops, and drones, with strong computation capabilities [7].
These societal and technical trends have given rise to paradigms such as federated and decentralized learning, where the data generated by each device stays on the device to protect its privacy [8, 9]. To compensate, (part of) the computation is also shifted to be performed locally at the end-user devices. It has been shown that in many cases, distributing the learning process over different nodes incurs negligible performance loss compared to centralized training approaches [10].
However, one major bottleneck in all the aforementioned paradigms is the communication network between the learning nodes. As the data points generated by each node differ from those in the rest of the network, the nodes need to communicate with each other periodically so that they all converge to the same model, rather than diverging to completely different models. If the communication between the nodes in the network induces sizeable delays, it can significantly lengthen the convergence time across the network, as it can totally dominate the computation delay at the learning nodes.
This phenomenon has motivated a massive body of recent work on dealing with communication delays in federated and decentralized learning. In [11], a setting with a single server and multiple worker nodes is considered, where at each iteration, a subset of worker nodes is selected, either by the server or by the worker nodes themselves, to send their gradients to the server. In [12], a simple network of multiple worker nodes is considered, over which they can all exchange their computation results with a fixed amount of delay, in conjunction with a server which aggregates all the results and sends updated parameters back to the worker nodes. In [13], gossiping algorithms and convergence guarantees are provided for decentralized optimization with compressed communication. In [14], it is shown how the specific connectivity of the communication network topology among learning nodes affects the speed of convergence. In [15], convergence results are derived for a combination of quantization, sparsification, and local computation in a distributed computation setting with a single master and multiple worker nodes. In [16], a deadline-based approach for mini-batch gradient computation at each computing node is proposed, such that the mini-batch size is adaptive to the computation capabilities of each node, hence making the scheme robust to stragglers.
Most of the above works deal with an abstract model for the communication network among the learning nodes. One particularly interesting communication paradigm to consider is wireless communication, especially as operators around the world roll out their 5G network infrastructure. Some recent works have considered wireless constraints, mostly in the context of federated learning [17, 18, 19, 20, 21].
In this paper, we consider the decentralized learning scenario over a network of learning nodes connected through a shared wireless medium. Considering the nature of wireless networks, in which nodes in proximity can communicate with each other more efficiently while interfering with concurrent transmissions, we attempt to characterize the communication delay for exchanging gradients among the learning nodes over the wireless network topology. In particular, we consider a setting similar to [14], where at each time, a set of non-interfering gradient exchanges is scheduled to happen simultaneously. Using the results on the optimality of treating interference as noise in interference networks [22], we present an algorithm for gradient exchanges in wireless decentralized learning akin to the information-theoretic link scheduling that was proposed in [23] for the case of device-to-device networks.
We utilize tools from random geometric graph theory to characterize the asymptotic communication latency for exchanging gradients in the aforementioned decentralized framework. In particular, we consider a network of learning nodes located within a circle of radius , where each node exchanges gradients with its neighboring nodes within a distance of itself, where is a variable that controls the density of the gradient exchange topology. This threshold distance needs to decrease with , as the entire network needs to remain connected to guarantee the convergence of the decentralized learning algorithm. We show that as , the communication latency scales as , increasing with the number of users and decreasing with . This result provides insight into how much communication time is needed in a wireless decentralized learning scenario, where more gradient exchanges lead to longer communication latencies, but faster convergence rates.
II System Model
Consider a wireless network consisting of nodes dropped uniformly at random within a circular area of radius . Assume that each node has access to a set of data points
, and the goal is to minimize a global loss function
, defined over a set of optimization parameters , using the overall dataset across the network as
where is the local loss function at node , and is the stochastic loss function for sample given model parameters
. In order to solve this problem, decentralized stochastic gradient descent (SGD) can be utilized to minimize the objective function in an iterative fashion. In decentralized SGD, the system is run over multiple iterations, where at each iteration, each node performs a local computation of the gradient of the objective function with respect to the set of optimization parameters
over (a mini-batch of) its local dataset, following which the gradients are exchanged among nodes prior to the beginning of the next iteration. Due to path-loss and fading effects in wireless communications, nodes can communicate with their closer neighbors more easily than with farther ones. Therefore, we define the communication graph as the network topology which dictates how nodes exchange gradients with their neighboring nodes, and we model it as an undirected random geometric graph (RGG) , where is the set of all nodes in the network, and for every , where , if and only if , where denotes the distance between nodes and , and is the threshold distance for gradient exchange; i.e., two nodes can exchange their gradients with each other if and only if they are located within a distance of at most .
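As a concrete illustration, the RGG construction above can be sketched as follows; the function name, the uniform disk sampling routine, and the numeric choices are illustrative assumptions rather than the paper's exact setup:

```python
import numpy as np

def build_communication_graph(n, radius, r_threshold, seed=0):
    """Drop n nodes uniformly at random in a disk of the given radius and
    connect every pair within r_threshold of each other (an undirected RGG)."""
    rng = np.random.default_rng(seed)
    # Uniform sampling over the disk's area: sqrt-transformed radii, uniform angles.
    r = radius * np.sqrt(rng.uniform(size=n))
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    pos = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    # Pairwise distances and the edge set of the communication graph.
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if dists[i, j] <= r_threshold]
    return pos, dists, edges
```

Sampling the radial coordinate as the square root of a uniform variable is what makes the drop uniform over the disk's area rather than over its radius.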
However, activating multiple gradient exchanges over the wireless channel at the same time will lead to interference, which can significantly reduce the network performance in terms of throughput and, therefore, the communication delay. To capture the interference among concurrent wireless transmissions, we also define a conflict graph . In this graph, each vertex represents a communication link in the original communication graph, i.e., . Moreover, there is an edge between two vertices in if their activations are in conflict; i.e., if transmitting data (i.e., gradients) on those links at the same time causes them to strongly interfere with each other. Since the level of interference also depends on the distance between transmitting/receiving nodes, we introduce a conflict distance , where for two vertices , there is an edge between and , i.e., , if and only if
which implies that at least one node in is within the conflict distance of . Note that for the case of , , implying that there is a conflict between and for any two neighbors of node in the original communication graph. This means that a node cannot communicate with two nodes at the same time (i.e., half-duplex and single frequency band constraints).
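The conflict-graph construction just described can be sketched as follows; the pairwise-minimum-distance rule mirrors the definition above, and the names are illustrative:

```python
def build_conflict_graph(dists, links, conflict_distance):
    """Vertices of the conflict graph are the links of the communication
    graph; two links conflict when any endpoint of one lies within the
    conflict distance of any endpoint of the other."""
    conflicts = set()
    for a in range(len(links)):
        for b in range(a + 1, len(links)):
            i, j = links[a]
            k, l = links[b]
            # Minimum cross-distance between the two links' endpoints.
            if min(dists[i][k], dists[i][l],
                   dists[j][k], dists[j][l]) <= conflict_distance:
                conflicts.add((a, b))
    return conflicts
```

Note that two links sharing a node always conflict, since the distance of a node to itself is zero; this captures the half-duplex and single frequency band constraints automatically.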
Given the above definitions, our goal is to determine the asymptotic behavior of the normalized gradient exchange latency (as ), which is defined as the delay for completing the exchange of 1 bit of gradients on all links of the communication graph. Assuming that the communication delay in the network dominates the gradient computation delay at each node, the normalized gradient exchange latency characterizes the wall-clock run time per iteration of decentralized SGD on a wireless network of learning nodes.
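The per-iteration structure described above (local gradient computation followed by gradient exchange over the communication graph) can be sketched as follows; the uniform averaging of neighbors' gradients is an illustrative choice, not necessarily the exact update rule used here:

```python
def decentralized_sgd_step(params, grads, neighbors, lr=0.1):
    """One iteration sketch: each node computes a local gradient, receives
    its neighbors' gradients over the communication graph, and updates its
    own parameters with the (uniformly) averaged gradient."""
    n = len(params)
    new_params = []
    for i in range(n):
        # Gradients received from neighbors, plus the node's own gradient.
        received = [grads[i]] + [grads[j] for j in neighbors[i]]
        avg_grad = sum(received) / len(received)
        new_params.append(params[i] - lr * avg_grad)
    return new_params
```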
II-A Wireless Communication Model
We assume each node is equipped with a single transmit/receive antenna, and all transmissions happen in a synchronous time-slotted manner on a single frequency band. We restrict the transmission strategies to an on/off pattern: at each time slot, a node either transmits a message to another node with full power or stays completely silent. We use as a transmission status indicator of node at time slot ; i.e., if and only if node is transmitting with full power at time slot . On the receiver side, we adopt the simple and practical scheme of treating interference as noise (TIN), where each node decodes its desired message while treating the interference from all other concurrent transmissions as noise. Letting
denote the noise variance, the rate achieved on a link from node
to node at time can be written as
(1)
where denotes the channel gain on the link between nodes and . In this paper, we adopt a single-slope path-loss model for the channel gains, where the channel gain at distance can be written as
where is the reference channel gain at a distance of , and denotes the path-loss exponent. This implies that the achievable rate in (1) can be written as
where
denotes the signal-to-noise ratio (SNR) at a distance of .

III Forming the Communication and Conflict Graphs
The communication network topology needs to be carefully designed, as decentralized SGD will not converge if the gradient exchange communication graph is disconnected [14]. We resort to the following lemma, which provides a sufficient condition for connectivity of random geometric graphs.
Lemma 1 (Corollary 3.1 in [24]).
In an RGG with nodes and a threshold distance of
, the graph is connected with probability one (as
) if , where . (Throughout this paper, we use the shorthand notation to denote the natural logarithm operation .)

In light of Lemma 1, for the communication graph, we set the gradient exchange threshold distance as
(2) 
which decreases as the number of nodes increases so as to satisfy the condition in Lemma 1, hence maintaining the connectivity of the entire graph.
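Connectivity under this kind of shrinking threshold can be checked empirically. The sketch below assumes a threshold scaling proportional to sqrt(log n / n), the critical scaling of [24] for a fixed-area network; the constant is left as a free parameter, since the exact choice in (2) is not reproduced above:

```python
import math

def connectivity_threshold(n, R, c=1.0):
    """Threshold distance scaling ~ R * sqrt(c * log(n) / n); the constant
    c is a free parameter here, as the exact constant in (2) differs."""
    return R * math.sqrt(c * math.log(n) / n)

def is_connected(n, edges):
    """Check connectivity of the communication graph by graph search."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n
```

As expected from Lemma 1, the threshold distance shrinks as the number of nodes grows, while connectivity of any candidate topology can be verified directly by search.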
Now, to build the conflict graph, we use the following result, derived in [22], for the approximate information-theoretic optimality of TIN in wireless networks.
Theorem 1 (Theorem 4 in [22]).
Consider a wireless network with transmitter-receiver pairs , where denotes the signal-to-noise ratio between and , and denotes the interference-to-noise ratio between and . Then, under the following condition,
TIN achieves the entire information-theoretic capacity region of the network (as defined in [22]) to within a gap of per dimension.
Theorem 1 immediately leads to the following corollary.
Corollary 1.
In a network with transmitter-receiver pairs, if the minimum SNR and the maximum INR across the whole network (denoted by and , respectively) satisfy , then TIN is information-theoretically optimal to within a gap of per dimension.
As mentioned in Section II, the received power at distance can be written as . Hence, given the RGG nature of the communication and conflict graphs, we can bound the SNR and INR values across the network as
(3)  
(4) 
Therefore, (3) and (4), together with Corollary 1, imply that a sufficient condition for the optimality of TIN for exchanging the gradients is
Thus, to guarantee the optimality of TIN, while having the sparsest conflict graph, we set the conflict distance as
(5) 
IV Main Result
In this section, we present our main result on the time needed for exchanging gradients over the communication graph as follows.
Theorem 2.
For a sufficiently large network of learning nodes (), the normalized gradient exchange latency satisfies
(6) 
Remark 1.
Theorem 2 implies that the normalized gradient exchange latency can be upper-bounded in an order-wise fashion (for ) as
(7) 
Theorem 2 characterizes an achievable normalized gradient exchange latency over the communication graph. Figure 1 demonstrates how this latency changes with and for the case where nodes are dropped within a circular area of radius m, the transmit power is dBm, the noise power spectral density is dBm/Hz, the bandwidth is MHz, the path-loss exponent is equal to , and the reference channel gain is set to . As demonstrated by (6) and its order-wise approximation in (7), as well as Figure 1, the delay of exchanging gradients over all links of the communication graph monotonically increases with , which is expected, as increasing the network size, while keeping the communication graph connected, requires an increasing number of gradient exchanges among neighboring nodes.
On the other hand, the latency decreases (approximately) exponentially with . As per (2), determines the threshold distance for gradient exchange among adjacent nodes; increasing will reduce the number of neighbors with which each node exchanges gradients, and this provides a significant saving in terms of communication latency. Note that this comes at the expense of a slower convergence rate for the global loss function, as it will take longer for each node to gain access to the gradients from datasets available at farther nodes.
V Achievable Scheme
In this section, we prove our main result in Theorem 2 by providing an achievable scheme for gradient exchange on all links in the communication graph and characterizing an upper bound on its normalized gradient exchange latency.
Given the communication and conflict graphs, the nodes can exchange gradients with their neighbors in the communication graph as long as their exchanges are non-conflicting; i.e., there is no edge between them in the conflict graph. This leads to the notion of independent sets on the conflict graph, where each independent set contains a set of vertices with no edge between any two of them. This is closely related to the notion of information-theoretic independent sets as defined in [23] for device-to-device communication networks. It is also analogous to the concept of matchings on the communication topology as considered in [14], where now the interference between active communication links is also taken into account.
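One simple way to extract such independent sets is greedy coloring of the conflict graph (the same device used in the proof of Lemma 3 below); a minimal sketch:

```python
def greedy_independent_sets(num_links, conflicts):
    """Greedy coloring of the conflict graph: links receiving the same
    color form an independent set and can be activated simultaneously.
    Greedy coloring uses at most (max degree + 1) colors."""
    adj = {v: set() for v in range(num_links)}
    for a, b in conflicts:
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    for v in range(num_links):
        # Smallest color not used by any already-colored conflicting link.
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    sets = {}
    for v, c in color.items():
        sets.setdefault(c, []).append(v)
    return list(sets.values())
```

Time-sharing across the resulting sets then yields a valid gradient exchange schedule, since no two links within a set conflict.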
We start with the following lemma, in which we characterize a lower bound on the symmetric rate within an independent set of the conflict graph, defined as the rate that can be achieved simultaneously by all the corresponding active links in the communication graph.
Lemma 2.
For any independent set in , the symmetric rate is lower-bounded by
(8) 
Proof.
For every vertex , the achievable rate on the corresponding link from node to node in can be written as
(9)  
(10) 
where (9) follows from the fact that link is present in the communication graph, hence the distance satisfies , while the link between nodes and is not present in the conflict graph, implying that . Moreover, (10) follows from the definition of in (5), and from the fact that as , the interference grows larger than the noise, i.e., . As all nodes are able to achieve this communication rate, the proof is complete. ∎
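As a numerical companion to this lemma, the sketch below evaluates TIN rates for a set of simultaneously active links under the single-slope path-loss model; the transmit power, noise level, and path-loss constants are placeholder assumptions, since the exact values in (1) are not reproduced above:

```python
import math

def tin_rates(positions, active_links, power=1.0, noise=1e-3,
              g0=1.0, d0=1.0, alpha=3.0):
    """TIN rate (bits/s/Hz) on each active link, treating all other
    concurrent transmissions as noise, under a single-slope path-loss
    model g(d) = g0 * (d / d0)**(-alpha)."""
    def gain(i, j):
        d = max(math.dist(positions[i], positions[j]), 1e-9)
        return g0 * (d / d0) ** (-alpha)
    rates = []
    for tx, rx in active_links:
        signal = power * gain(tx, rx)
        # Interference from every other concurrently active transmitter.
        interference = sum(power * gain(t2, rx)
                           for (t2, _) in active_links if t2 != tx)
        rates.append(math.log2(1.0 + signal / (noise + interference)))
    return rates
```

Activating a second, distant link strictly lowers the rate of the first, illustrating why only non-conflicting links are scheduled together.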
Next, we present the following lemma, which provides an upper bound on the chromatic number of the conflict graph.
Lemma 3.
The chromatic number of the conflict graph can be asymptotically upper-bounded by
Proof.
Considering each vertex in the conflict graph, its degree can be upper-bounded as
(11) 
where is the maximum degree of a random geometric graph with nodes and threshold distance , and is the maximum degree of , which is a random geometric graph with nodes and threshold distance . As per equation (4) in [25], (11) can be upper-bounded by
(12) 
where denotes the clique number of a random geometric graph with nodes and threshold distance , defined as the size of the largest clique in the graph, i.e., a maximal subset of vertices in which every two vertices are connected.
Now, we can leverage the bounds in the following theorem from [26] on the clique number of random geometric graphs to upper bound (12).
Theorem 3 (Theorem 1.2 in [26]).
For a dimensional random geometric graph with nodes and threshold distance , if , then its clique number, denoted by , satisfies
where is the unit ball in and is the maximum density of the distribution of nodes in . For Euclidean distance in
and uniform distribution of nodes within a circle of radius
, and .

For the graph , we have . Given the fact that , we can invoke Theorem 3 to (almost surely) continue (12) as
(13) 
Furthermore, for the graph , we have , and since , we can again use Theorem 3 to continue (13) as
(14) 
Using a greedy coloring algorithm on the conflict graph, its chromatic number can be upper-bounded by , where is the maximum degree of the vertices in . Combined with (14), this completes the proof. ∎
With Lemmas 2 and 3 in hand, we now proceed to prove Theorem 2. Suppose that we have a proper coloring of the conflict graph with colors, where the independent set corresponding to each color is denoted by . Then, assuming that all independent sets use time-sharing to exchange the gradients, we can bound the normalized gradient exchange latency as
(15) 
Now, we can leverage Lemma 2 to upper bound (15) as
(16) 
where is defined as
(17) 
It can be shown that is concave in for (see Appendix A). Therefore, using Jensen’s inequality, we can upper-bound (16) as
(18) 
Now, note that is equal to the total number of vertices in the conflict graph, or equivalently the number of edges in the communication graph; i.e.,
Proposition A.1 in [27] suggests that the average degree of a 2-dimensional random geometric graph with nodes dropped uniformly at random within a circular area of radius and a threshold distance of asymptotically converges to . Therefore, we have
which together with (18) leads to
(19) 
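The average-degree fact from [27] used in this step can be sanity-checked by simulation. Ignoring boundary effects, the expected degree is roughly the number of other nodes times the area ratio (r/R)^2; the following hedged sketch estimates the average degree by Monte Carlo:

```python
import numpy as np

def average_degree(n, R, r, trials=20, seed=0):
    """Monte Carlo estimate of the average degree of an RGG with n nodes
    uniform in a disk of radius R and connection threshold r. Ignoring
    boundary effects, the expectation is roughly (n - 1) * (r / R)**2,
    i.e., the area ratio pi*r**2 / (pi*R**2) times the number of other
    nodes (an approximation, not the paper's exact expression)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        rad = R * np.sqrt(rng.uniform(size=n))
        ang = rng.uniform(0.0, 2.0 * np.pi, size=n)
        pts = np.stack([rad * np.cos(ang), rad * np.sin(ang)], axis=1)
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
        total += ((d <= r).sum() - n) / n  # subtract the n self-pairs
    return total / trials
```

For thresholds small relative to R, the empirical value sits slightly below the area-ratio prediction because nodes near the boundary see fewer neighbors.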
Appendix A Proof of Concavity of in (17) for
Letting , we can write the first derivative of as
which leads to the second derivative of as
(20) 
We can write the derivative in the numerator of (20) as
(21) 
Appendix B Proof of Monotonicity of the Bound in (19)
References
 [1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
 [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

 [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 [4] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 3104–3112.
 [5] M. K. Leung, H. Y. Xiong, L. J. Lee, and B. J. Frey, “Deep learning of the tissue-regulated splicing code,” Bioinformatics, vol. 30, no. 12, pp. i121–i129, 2014.
 [6] P. E. Naeini, S. Bhagavatula, H. Habib, M. Degeling, L. Bauer, L. F. Cranor, and N. Sadeh, “Privacy expectations and preferences in an IoT world,” in Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017). Santa Clara, CA: USENIX Association, Jul. 2017, pp. 399–412. [Online]. Available: https://www.usenix.org/conference/soups2017/technical-sessions/presentation/naeini
 [7] J. Poushter et al., “Smartphone ownership and internet usage continues to climb in emerging economies,” Pew Research Center, vol. 22, pp. 1–44, 2016.
 [8] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv preprint arXiv:1602.05629, 2016.
 [9] M. Kamp, L. Adilova, J. Sicking, F. Hüger, P. Schlicht, T. Wirtz, and S. Wrobel, “Efficient decentralized deep learning by dynamic model averaging,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 393–409.
 [10] Y. Zhang, J. C. Duchi, and M. J. Wainwright, “Communication-efficient algorithms for statistical optimization,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3321–3363, 2013.
 [11] T. Chen, G. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communication-efficient distributed learning,” in Advances in Neural Information Processing Systems, 2018, pp. 5050–5060.
 [12] K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee, “Optimal algorithms for non-smooth distributed optimization in networks,” in Advances in Neural Information Processing Systems, 2018, pp. 2740–2749.
 [13] A. Koloskova, S. U. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” arXiv preprint arXiv:1902.00340, 2019.
 [14] J. Wang, A. K. Sahu, Z. Yang, G. Joshi, and S. Kar, “MATCHA: Speeding up decentralized SGD via matching decomposition sampling,” arXiv preprint arXiv:1905.09435, 2019.
 [15] D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-SGD: Distributed SGD with quantization, sparsification, and local computations,” arXiv preprint arXiv:1906.02367, 2019.
 [16] A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, 2019, pp. 8386–8397.
 [17] M. M. Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” arXiv preprint arXiv:1901.00844, 2019.
 [18] J.H. Ahn, O. Simeone, and J. Kang, “Wireless federated distillation for distributed edge learning with heterogeneous data,” in 2019 IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE, 2019, pp. 1–6.
 [19] Q. Zeng, Y. Du, K. K. Leung, and K. Huang, “Energy-efficient radio resource allocation for federated edge learning,” arXiv preprint arXiv:1907.06040, 2019.
 [20] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, “Scheduling policies for federated learning in wireless networks,” IEEE Transactions on Communications, 2019.
 [21] M. M. Amiri, D. Gunduz, S. R. Kulkarni, and H. V. Poor, “Update aware device scheduling for federated learning at the wireless edge,” arXiv preprint arXiv:2001.10402, 2020.
 [22] C. Geng, N. Naderializadeh, A. S. Avestimehr, and S. A. Jafar, “On the optimality of treating interference as noise,” IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1753–1767, 2015.
 [23] N. Naderializadeh and A. S. Avestimehr, “ITLinQ: A new approach for spectrum sharing in device-to-device communication systems,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 6, pp. 1139–1151, 2014.
 [24] P. Gupta and P. R. Kumar, “Critical power for asymptotic connectivity in wireless networks,” in Stochastic analysis, control, optimization and applications. Springer, 1999, pp. 547–566.
 [25] L. Decreusefond, P. Martins, and A. Vergne, “Clique number of random geometric graphs,” 2013, working paper or preprint. [Online]. Available: https://hal.archives-ouvertes.fr/hal-00864303
 [26] C. McDiarmid and T. Müller, “On the chromatic number of random geometric graphs,” Combinatorica, vol. 31, no. 4, pp. 423–488, 2011.
 [27] T. Müller, “Two-point concentration in random geometric graphs,” Combinatorica, vol. 28, no. 5, p. 529, 2008.