Wireless Communications for Collaborative Federated Learning in the Internet of Things

06/03/2020 ∙ by Mingzhe Chen, et al. ∙ Princeton University

Internet of Things (IoT) services will use machine learning tools to efficiently analyze various types of data collected by IoT devices for inference, autonomy, and control purposes. However, due to resource constraints and privacy challenges, edge IoT devices may not be able to transmit their collected data to a central controller for training machine learning models. To overcome this challenge, federated learning (FL) has been proposed as a means for enabling edge devices to train a shared machine learning model without data exchanges, thus reducing communication overhead and preserving data privacy. However, Google's seminal FL algorithm requires all devices to be directly connected with a central controller, which significantly limits its application scenarios. In this context, this paper introduces a novel FL framework, called collaborative FL (CFL), which enables edge devices to implement FL with less reliance on a central controller. The fundamentals of this framework are developed and a number of communication techniques are then proposed so as to improve the performance of CFL. To this end, an overview of centralized learning, Google's seminal FL, and CFL is first presented. For each type of learning, the basic architecture as well as its advantages, drawbacks, and usage conditions are introduced. Then, four CFL performance metrics are presented and a suite of communication techniques spanning network formation, device scheduling, mobility management, and coding is introduced to optimize the performance of CFL. For each technique, future research opportunities are also discussed. In a nutshell, this article showcases how the proposed CFL framework can be effectively implemented at the edge of large-scale wireless systems such as the Internet of Things.


I Introduction

Machine learning (ML) is witnessing unprecedented interest from the wireless community [3], driven by recent breakthroughs in deep learning, the rise of smart devices, and the wide availability of data. ML use cases for wireless networks range from data analysis and prediction to wireless environment monitoring as well as network control and optimization. However, centralized ML requires edge devices to transmit their collected data to a central controller for learning. In practical deployments of ML, such as in Internet of Things (IoT) systems, due to privacy issues and stringent resource (e.g., bandwidth and transmit power) constraints, edge IoT devices may not be able or willing to share their collected data with other devices or a central controller. For example, a wearable device can collect medical data from a given user, but the user may not be willing to share such private data with other users. To enable edge IoT devices to train a shared ML model without data exchange, federated learning was proposed by Google in [2].

Federated learning (FL) is a distributed implementation of ML in which IoT devices perform on-device ML model training while exchanging only ML model parameters with a central controller so as to collaboratively find a shared optimal ML model. Keeping the data at the IoT devices not only preserves privacy but may also reduce network traffic congestion. Due to these unique features of FL, a number of existing works, as summarized in [11, 7, 12, 8], studied the use of FL for the optimization of wireless network performance.

In practice, to implement FL over IoT networks, edge devices must repeatedly transmit their trained ML models to a central controller via wireless links for the ML model update. Due to limited wireless resources such as bandwidth, in a system such as the IoT, only a subset of devices can participate in FL. Meanwhile, the ML models transmitted from IoT devices to a central controller (e.g., a base station) are subject to errors and delays caused by the wireless channel, which affect the learning performance. Therefore, it is necessary to optimize the wireless network so as to improve the FL performance, as pointed out in [14, 4, 1]. This emerging “communications for FL” research area is the key focus of this work.

Recently, a number of surveys and tutorials related to FL over wireless networks appeared in [11, 7, 12, 8] and [15]. First, the works in [11, 7, 12, 8] looked at the use of FL for communications, rather than the impact of wireless networking on FL. Moreover, all prior works in [11, 7, 12, 8] and [15] focused on the original FL developed by Google in [2] (called original FL hereinafter), which requires all edge IoT devices to transmit their ML models to a central controller for the ML model update. Hence, these existing surveys did not consider implementations of FL with less or even no reliance on a central controller. Furthermore, they did not analyze how to use wireless communication techniques to optimize the FL performance.

The main contribution of this article is to introduce a novel FL framework, dubbed collaborative FL, that combines collaborative learning [5] with federated learning so as to enable edge devices to engage in FL without connecting to a central controller. To introduce this new framework, we first provide a detailed overview of centralized learning (CL), original FL (OFL), and collaborative FL (CFL), and summarize their advantages, drawbacks, and usage conditions in Section II. Then, in Section III, we introduce four important performance metrics to quantify the CFL performance over IoT systems. Next, in Section IV, we introduce several important communication techniques, spanning network formation, device scheduling, mobility management, and coding, to optimize these CFL performance metrics. For each communication technique, we present the motivation for optimizing the CFL performance along with an illustrative example and future research opportunities. Conclusions are drawn in Section V.

II Preliminaries and Overview

| | Advantages | Drawbacks | Usage Conditions |
| --- | --- | --- | --- |
| CL | Ability to find a globally optimal ML model. Imperfect wireless transmission has a minor impact on ML model training. Better performance for ML models with non-convex loss functions compared to FL. | Private data must be shared with a central controller such as a BS or cloud. Significant overhead for data collection. Difficult to implement for resource- and energy-limited edge devices such as IoT devices. Large delays due to long-range transmission to a remote cloud or BS. | Each device must be willing to share its private data. Ample computational resources and energy available for ML training. All devices can transmit data to the BS. |
| OFL | Privacy-preserving framework. Devices can learn a common ML task in a distributed manner. Ability to train ML models at the device level. | Imperfect wireless transmission affects the ML model training process. Number of users (and their data) that can perform FL is limited. | All devices must be able to transmit FL model parameters to a controller or aggregator (e.g., a BS). All devices must be able to receive the FL model parameters from the BS. All devices must have a direct and reliable wireless connection to the BS. Devices can locally train ML models (at the edge). |
| CFL | Privacy-preserving framework. Ability to include more training data samples compared to OFL. Amenability for implementation in large-scale systems (e.g., IoT) because CFL can accommodate more devices in the FL process compared to OFL. | Imperfect wireless transmission affects the ML model training process. Lower convergence speed compared to OFL. The ML model of each device at convergence may be different since each device connects to a subset of devices. | A reliable communication link can be formed between any two devices that need to use CFL. Each device can locally train its ML model and aggregate the local FL models received from its associated devices. |

TABLE I: Summary of the Advantages, Drawbacks, and Usage Conditions of ML over Wireless Networks.

In this section, we introduce the basic architectures and differences between CL, OFL, and CFL.

II-1 Centralized Learning

(a) Architecture of CL
(b) Architecture of OFL
(c) Architecture of CFL
Fig. 1: Architectures of centralized learning, original FL, and collaborative FL.

As shown in Fig. 1(a), CL needs only one ML model, located at a base station (BS) or IoT cloud, which acts as a central controller. All devices must connect and send their data to the BS for training this ML model. Then, the BS will transmit the trained ML model to all devices. Hence, CL only requires the BS to communicate with all devices once so as to collect all devices’ datasets.

Table I summarizes the advantages, disadvantages, and usage conditions of CL. The key advantage of CL is that it enables the BS or cloud to directly find a globally optimal ML model that minimizes the learning loss function value. Since the entire training process is completed by the BS, the ML training will not be affected by wireless network performance. However, imperfect wireless transmissions may introduce errors to the data used for training. Moreover, CL requires devices to transmit their collected data to the BS, which leads to information leakage. In addition, significant overhead and resources are needed at the network and device levels to execute CL.

II-2 Original Federated Learning

To maintain privacy, Google’s OFL framework allows edge devices to cooperatively train a shared ML model without transmitting their data. In OFL, both the devices and the BS own an ML model with the same architecture, as shown in Fig. 1(b). OFL is trained via an iterative learning process. First, all devices use their local data to train their local ML models and transmit the trained models to the BS. Then, the BS aggregates the received ML models, generates a new aggregate ML model, and transmits it back to all devices. Hereinafter, the ML model trained by an edge device is called the local FL model while the ML model generated by the BS is called the global FL model. At convergence, the global FL model will be equal to all local FL models, which means that the devices have found a shared FL model and the local FL model at convergence can be used to analyze all devices’ datasets.
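
To make the OFL training loop concrete, the following sketch shows one possible instantiation: each device runs a few GD steps on its own data, and the BS averages the received models with weights proportional to each device's number of samples (a FedAvg-style rule). The linear least-squares loss, learning rate, and random data are illustrative assumptions, not the exact algorithm of [2].

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """Device-side training: a few GD steps on a linear least-squares
    loss (an illustrative choice of model and loss)."""
    w = w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of 0.5*||Xw - y||^2 / n
        w -= lr * grad
    return w

def bs_aggregate(local_models, sample_counts):
    """BS-side aggregation: average the local FL models, weighted by the
    number of training samples each device holds (FedAvg-style rule)."""
    weights = np.array(sample_counts) / sum(sample_counts)
    return sum(a * w for a, w in zip(weights, local_models))

# Toy setup: 4 devices, each holding its own data shard.
rng = np.random.default_rng(0)
dim = 3
devices = [(rng.normal(size=(50, dim)), rng.normal(size=50)) for _ in range(4)]

w_global = np.zeros(dim)
for _ in range(20):  # global FL model updates
    local_models = [local_update(w_global, X, y) for X, y in devices]
    w_global = bs_aggregate(local_models, [len(y) for _, y in devices])
```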

The advantages, disadvantages, and conditions for use of OFL are summarized in Table I. The key advantage of OFL is that it preserves data privacy and can be implemented over devices with less overhead than centralized ML. However, OFL still requires all devices to transmit their local FL model parameters to a BS. Hence, imperfect and dynamic wireless transmission will significantly impact the convergence time and the performance of OFL.

II-3 Collaborative Federated Learning

OFL requires all devices to send their local FL models to a BS. However, in practical IoT systems, devices may not be able to connect to the BS due to energy limitations or a potentially high transmission delay. To overcome this challenge and facilitate the use of FL in real-world IoT systems, we propose the concept of CFL, using which devices can engage in FL without connecting to a BS or a cloud.

In CFL, devices that cannot connect to the BS directly can associate with neighboring devices. For example, as shown in Fig. 1(b), in OFL, a device may be unable to connect to the BS and perform FL due to a potentially high transmission delay. In CFL, as shown in Fig. 1(c), such a device can instead connect to its closest device to perform FL. CFL is also trained iteratively. First, each device transmits its trained local FL model to its connected devices or the BS. Then, the BS generates the global FL model and transmits it to its associated devices. Finally, each device updates its local FL model based on the local FL models received from other devices or the BS. In OFL, each device only trains its local FL model using gradient descent (GD) methods while the BS aggregates the local FL models. However, in CFL, each device must both aggregate the local FL models received from other devices and train its own local FL model.
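
The per-device CFL update described above can be sketched as follows: each device first averages the local FL models received from its neighbors and then takes one local GD step. The uniform mixing weights, the quadratic per-device losses, and the three-device path topology are toy assumptions used only to show the mechanics.

```python
import numpy as np

def cfl_step(models, adjacency, grads, lr=0.1):
    """One CFL iteration (sketch): each device i aggregates the local FL
    models of its neighbors (uniform weights, including itself) and then
    takes one local GD step on its own loss."""
    n = len(models)
    new_models = []
    for i in range(n):
        neighbors = [j for j in range(n) if adjacency[i][j]] + [i]
        mixed = sum(models[j] for j in neighbors) / len(neighbors)
        new_models.append(mixed - lr * grads[i](mixed))
    return new_models

# Path topology over three devices (0 -- 1 -- 2), with no BS involved.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
targets = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
grads = [lambda w, t=t: w - t for t in targets]  # gradient of 0.5*||w - t||^2
models = [np.zeros(1) for _ in range(3)]
for _ in range(100):
    models = cfl_step(models, adj, grads)
# The models drift toward an approximate consensus near the mean target (2.0).
```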

(a) Simulation system
(b) Simulation result
Fig. 2: Simulation system and result showing the performance of CFL and OFL. In this figure, the red digits indicate the distances between adjacent devices.

To show the difference between CFL and OFL, we implemented a preliminary simulation for a network having one BS and six devices, as shown in Fig. 2(a). The local FL model of each device consists of a shallow feedforward neural network with 50 neurons. The MNIST dataset [9] is used for training the local FL models, and each device has 500 data samples. OFL is used for comparison. The maximum time allowed for FL model parameter transmission is set to 0.23 s.

Fig. 2(b) shows how the identification accuracy changes over time and demonstrates that CFL outperforms OFL. This is because, in OFL, only four devices can participate in FL, while the other two devices have a transmission delay larger than 0.23 s. Since CFL allows devices to connect to other devices and the transmission delay between any two neighboring devices is smaller than 0.23 s, all six devices can participate. In fact, CFL can also reduce the energy consumption of a distant device since that device only needs to transmit its ML model parameters to a neighboring device instead of the BS.

Table I summarizes the advantages, disadvantages, and usage conditions of CFL. The key advantage of CFL is that it enables the devices to perform FL without transmitting their local FL models to the BS, as shown in Fig. 3. Given this overview of CL, OFL, and CFL, we make the following remarks:

  • Choosing between CL and FL depends on: a) the willingness of devices to share data, b) the ML model data size, and c) the size of the data collected by each device. For example, when devices agree to share their data and the size of the collected data is smaller than the ML model data size, CL is recommended.

  • Choosing between OFL or CFL depends on: a) whether the BS performs FL and b) the connection and transmission delay between devices and the BS. For example, if all IoT devices need to implement FL without the BS, then CFL is more suitable.

  • OFL can be considered as a special case of CFL. In a network, if each device connects to all other devices, CFL is equivalent to OFL, as illustrated in the sketch below.
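
As a quick numerical sanity check of this last remark (assuming equal aggregation weights), one uniform CFL mixing step over a complete graph gives every device exactly the global average that the BS would compute in OFL:

```python
import numpy as np

# Three local FL models (illustrative values).
models = [np.array([1.0]), np.array([2.0]), np.array([6.0])]

# OFL: the BS averages all local FL models (equal weights assumed).
ofl_global = sum(models) / len(models)

# CFL on a complete graph: every device averages over all devices, so
# each device ends up with the same model the BS would have produced.
cfl_models = [sum(models) / len(models) for _ in models]

assert all(np.allclose(m, ofl_global) for m in cfl_models)  # 3.0 each
```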

III Performance of CFL over Wireless Networks

| Metric | Wireless Factors | Effects on FL | Suggested Solutions |
| --- | --- | --- | --- |
| Loss function value | Limited wireless resources, e.g., bandwidth and computational resources. | Number of devices that can perform FL at each iteration is limited. | Probabilistic user scheduling. Over-the-air techniques allowing devices to aggregate local FL models over wireless transmission. Optimized network formation. |
| | Limited transmit power. | Errors in local FL models. | Channel coding and decoding. Intelligent retransmission. |
| Convergence time | Limited wireless resources, e.g., bandwidth, energy, and transmit power. | More time used for local FL model parameter transmission. | Coding and decoding of the FL model. FL model parameter prediction. Over-the-air techniques. Optimized network formation. |
| | Limited computational resources. | Number of local FL model updates at each CFL iteration is limited. | Use of more global FL model updates. Partial local FL model training. |
| Energy consumption | Limited wireless resources, e.g., bandwidth. | More energy used for local FL model transmission. | Channel coding. Optimized network formation. |
| | Wireless channel conditions. | More energy used for local FL model transmission. | Use of more local FL model updates. |
| Reliability | Limited transmit power. | Errors in local FL models. | Channel coding. Improved device connection policy. Use of more local FL model updates. Optimized network formation. |

TABLE II: Summary of the Wireless Factors that Affect the Performance Metrics and Suggested Solutions.

We now introduce four key metrics for assessing the performance of CFL over wireless networks: a) loss function value, b) convergence time, c) energy consumption, and d) reliability.

III-1 Loss Function Value

An FL loss function is an objective function that devices try to minimize by adjusting their ML model parameters. For different learning tasks, the loss function will be different. The loss function value is used to evaluate the performance of CFL, and the purpose of CFL training is to find an ML model that minimizes it. The FL loss function depends on the local FL models of all the participating devices. Hence, when those models are transmitted over wireless links, they experience transmission errors and delays which can negatively impact the loss function during training. Meanwhile, due to limited energy and computing resources, only a subset of devices can engage in CFL, which decreases the total number of data samples used for training the local FL models and increases the loss function value. Table II summarizes the wireless factors that affect the FL loss function along with suggested solutions.

III-2 Convergence Time

For CFL, the convergence time has three components: a) FL model parameter transmission delay, b) time needed by each device to train its local FL model, and c) number of iterations that FL needs to converge (i.e., the number of global FL model updates). The FL model parameter transmission delay depends on the data size of the FL model parameters and the data rate of the wireless link. The time used to train each device’s local FL model depends on the FL model data size, the computational resources of each device, and the number of iterations (called number of local FL model updates hereinafter) that each device uses to train its local FL model (using GD) at each FL iteration. Note that as the number of local FL model updates increases, the number of global FL model updates decreases. The number of global FL model updates also depends on the limited spectrum resources that restrict the number of devices that engage in FL. Table II summarizes the wireless factors that affect the convergence time and the suggested solutions.
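
As a rough illustration of how these three components combine, the sketch below computes a back-of-the-envelope convergence time. All parameter names and values are illustrative assumptions; a more faithful model would, for instance, take the maximum transmission delay over all scheduled devices.

```python
def fl_convergence_time(model_bits, rate_bps, t_local_update,
                        local_updates, global_updates):
    """Rough convergence time (sketch): each global round costs one model
    upload plus `local_updates` local GD passes. Straggler effects are
    ignored here for simplicity."""
    per_round = model_bits / rate_bps + local_updates * t_local_update
    return global_updates * per_round

# E.g., a 1 MB model over a 2 Mbit/s link, 5 local updates of 0.1 s each,
# and 100 global rounds (all values illustrative):
t = fl_convergence_time(8e6, 2e6, 0.1, 5, 100)  # = 100 * (4.0 + 0.5) = 450 s
```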

III-3 Energy Consumption

The energy consumption needed for training a CFL algorithm consists of four components: a) local FL model transmission, b) local FL model update, c) global FL model transmission, and d) global FL model aggregation. In particular, each device will spend energy for local FL model transmission and update while the BS needs to spend energy for global FL model transmission and aggregation. A tradeoff exists between the energy consumption of the local FL model update and the transmission energy. The energy consumption of CFL depends on the FL model data size, the distance between the BS and the devices, the convergence time requirement, and the target loss function value. Table II summarizes the wireless factors that affect the energy consumption along with suggested solutions.
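
The per-device energy breakdown above can likewise be sketched as a simple accounting model; the power figures below and the omission of the BS-side costs are illustrative assumptions.

```python
def cfl_device_energy(rounds, tx_power_w, model_bits, rate_bps,
                      compute_power_w, t_local_update, local_updates):
    """Per-device training energy (sketch): transmission energy is transmit
    power times airtime; computation energy is compute power times local
    training time. BS-side transmission/aggregation energy is omitted."""
    e_tx = tx_power_w * model_bits / rate_bps            # per-round upload
    e_cmp = compute_power_w * local_updates * t_local_update
    return rounds * (e_tx + e_cmp)

# More local updates per round can reduce the number of uploads needed,
# which is the local-update vs. transmission energy tradeoff noted above.
e = cfl_device_energy(rounds=100, tx_power_w=0.2, model_bits=8e6,
                      rate_bps=2e6, compute_power_w=1.0,
                      t_local_update=0.1, local_updates=5)
# = 100 * (0.2*4.0 + 1.0*0.5) = 130 J
```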

III-4 Reliability

For CFL, we can define reliability as the probability that a CFL algorithm achieves a target FL loss function value. At each CFL iteration, erroneous local FL models caused by imperfect wireless transmission must be discarded by the devices. Hence, the number of local FL models used to generate the global FL model decreases, thus increasing both the CFL convergence time and the loss function value. As a result, a CFL algorithm may not be able to achieve a target FL loss function value due to imperfect wireless transmissions, and the reliability of CFL depends on the wireless channel conditions. As the transmit power of each device increases, the number of erroneous local FL models decreases, thus increasing CFL reliability. Table II summarizes the wireless factors that affect the reliability and the suggested solutions.
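
This reliability notion lends itself to Monte Carlo estimation: run the training many times with random packet errors and count the fraction of runs that reach the target loss. The toy three-device CFL run below, with quadratic losses and a fixed packet error probability standing in for the effect of transmit power, is an assumption-laden sketch rather than the simulation setup used in this article.

```python
import numpy as np

def run_cfl(rng, p_err, rounds=50, lr=0.2):
    """Toy CFL run over 3 devices on a path (0 -- 1 -- 2): each received
    neighbor model is dropped with packet error probability p_err.
    Returns the shared FL objective averaged over the final models."""
    targets = [1.0, 2.0, 3.0]
    nbrs = {0: [1], 1: [0, 2], 2: [1]}
    w = [0.0, 0.0, 0.0]
    for _ in range(rounds):
        new_w = []
        for i in range(3):
            received = [w[j] for j in nbrs[i] if rng.random() > p_err]
            mixed = (w[i] + sum(received)) / (1 + len(received))
            new_w.append(mixed - lr * (mixed - targets[i]))
        w = new_w
    f = lambda x: sum(0.5 * (x - t) ** 2 for t in targets) / len(targets)
    return sum(f(wi) for wi in w) / len(w)

def estimate_reliability(target_loss, p_err, trials=200):
    """Fraction of training runs whose final loss meets the target: the
    reliability metric defined above, estimated by Monte Carlo."""
    rng = np.random.default_rng(0)
    return sum(run_cfl(rng, p_err) <= target_loss
               for _ in range(trials)) / trials

# A higher packet error probability (e.g., lower transmit power) tends to
# lower the estimated reliability:
print(estimate_reliability(target_loss=0.45, p_err=0.1),
      estimate_reliability(target_loss=0.45, p_err=0.6))
```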

(a) Grid topology
(b) Path topology
(c) Complete topology
(d) Star topology
Fig. 3: Number of iterations needed to converge for different CFL algorithms with different topologies. The quantity shown is the upper bound on the number of iterations that a CFL algorithm needs to converge; it depends on the number of devices that perform the FL algorithm, the target accuracy (i.e., the allowed gap between the optimal FL model and the FL model at convergence), the upper bound of the gradient of the loss function, the initial local FL model of each device, and the optimal local FL model at convergence.

IV Communication Techniques for Collaborative Federated Learning

We now overview key techniques that can be used to improve the performance of CFL over wireless networks.

IV-A Network Formation

The first fundamental step towards deploying CFL is to analyze the process of network formation using which devices can connect to one another to engage in a CFL task. In CFL, devices can form different network topologies. For example, IoT devices can form a grid topology for CFL, as shown in Fig. 3(a). Naturally, the training complexity and the FL convergence time directly depend on the formed topology. Hence, for any given network scenario, it will be interesting to investigate the optimal CFL network topology using the metrics of Section III.

Fig. 3 shows the upper bound on the number of iterations needed for CFL convergence, derived under the assumption that each device updates its local FL model using the Lazy Metropolis method together with GD [10]. Fig. 3 shows that, as the number of links per device increases, the number of iterations decreases, because having more links increases the frequency of local FL model sharing.
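
The topology trend in Fig. 3 can be reproduced qualitatively with a short script: build the Lazy Metropolis mixing matrix of [10] for each six-device topology and compare the spectral gaps, since a larger gap loosely corresponds to fewer iterations needed to converge. The spectral-gap proxy is a simplification of the actual bound, and the graphs below are small stand-ins for those of Fig. 3.

```python
import numpy as np

def lazy_metropolis(adj):
    """Lazy Metropolis mixing matrix (as in [10]): Metropolis weights
    W_ij = 1 / (1 + max(deg_i, deg_j)) on edges, then W_lazy = (I + W)/2."""
    n = len(adj)
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return (np.eye(n) + W) / 2

def spectral_gap(W):
    """1 minus the second-largest eigenvalue magnitude: a larger gap means
    faster mixing, i.e., fewer iterations to reach consensus."""
    eig = np.sort(np.abs(np.linalg.eigvals(W)))
    return 1.0 - eig[-2]

n = 6
path = np.diag(np.ones(n - 1), 1); path = path + path.T
complete = np.ones((n, n)) - np.eye(n)
star = np.zeros((n, n)); star[0, 1:] = star[1:, 0] = 1
grid = np.zeros((n, n))   # 2 x 3 grid
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]:
    grid[i, j] = grid[j, i] = 1

for name, adj in [("path", path), ("star", star),
                  ("grid", grid), ("complete", complete)]:
    print(name, round(spectral_gap(lazy_metropolis(adj)), 3))
# The complete graph has the largest gap, matching the trend in Fig. 3 that
# more links per device reduce the number of iterations needed to converge.
```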

Clearly, CFL yields interesting network formation research questions, as follows:

  • Optimal CFL network formation: The optimal CFL network topology depends on the CFL performance metrics being optimized. Therefore, a fundamental CFL question is that of network formation: How can the devices interact to form an optimal network topology that maximizes the various CFL performance metrics and tradeoffs? To find the optimal CFL network topology, the first step is to define a proper utility function that jointly considers multiple dependent CFL performance metrics and the network topology. Given this utility function, one must then develop network formation algorithms to optimize it. Both centralized and distributed solutions can be developed. Centralized solutions, such as search-based algorithms, may be able to find the globally optimal network topology. However, implementing centralized solutions requires information from all devices, such as locations or wireless channel conditions, which is impractical for a large-scale and dynamic IoT system. For distributed solutions, one can adopt a game-theoretic approach, particularly using network formation games [6]. In network formation games, each device is seen as an individual agent whose goal is to form a graph with neighboring devices so as to optimize the CFL performance metrics. The CFL performance (e.g., utility) depends on the entire graph and the decisions of all agents, which makes the use of game theory suitable. One unique feature of the CFL network formation game is that it can be dynamic and require far-sighted decision making, an angle that has only been studied in limited prior works, as discussed in [6].

  • Network formation with asynchronous training: Under asynchronous FL training, IoT devices will update and transmit their local FL models at different time slots. Due to limited computing and wireless resources, each device may not want to wait until it has received the local FL models of all of its associated devices before updating its own local FL model. Using asynchronous training can increase the local FL model update frequency and the data rate of each device, which reduces the convergence time. In asynchronous training, the number of devices that need to transmit their local FL models is time-varying. Hence, the network topology must be adapted to the changes in the number of devices that must transmit local FL models. Here, one must determine the frequency with which the network topology must be updated according to the number of participating devices. Note that each network topology update will change the wireless resource allocation and device association schemes so as to improve CFL performance metrics such as convergence time. However, network topology updates will also introduce communication overhead, such as network state information sharing.

  • Network formation with partial network information: In an actual IoT system, each device may not completely know the network architecture, device locations, and network composition. Due to this limited information, the number of devices that each device can connect to is limited, and hence devices may not be able to form a network topology that satisfies the CFL usage conditions (see Table I). Therefore, there is a need to investigate globally optimal network formation for IoT devices with partial information. Since most existing complexity results related to network formation (e.g., see [10]) assume that each device has complete information, they cannot be used for devices with partial network information. Meanwhile, due to partial network information, devices may form several disconnected small device groups. Hence, a multi-layer network formation scheme must be designed. For example, in the first layer, devices exchange their local FL model parameters within their own groups, while in the second layer the local FL model parameters are exchanged across groups. The designed scheme must balance the communication overhead and training complexity among the multiple layers.

IV-B Device Scheduling

Due to energy constraints and wireless resource limitations, the number of devices that can engage in CFL is limited. Hence, an IoT device may update its local FL model using the local FL models of only a subset of devices, thus decreasing the CFL convergence time. Therefore, it is necessary to find an optimal device scheduling policy that determines the frequency with which, and the iterations at which, each device engages in CFL so as to optimize the CFL performance metrics.
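
A minimal sketch of one such policy, the probabilistic user scheduling listed in Table II: at each iteration, a bandwidth-limited number of devices is sampled with probabilities proportional to an importance score. Using the local sample count as the score is an illustrative assumption.

```python
import numpy as np

def schedule_devices(importance, k, rng):
    """Sample k devices without replacement, with probability proportional
    to an importance score (here, a simple proxy such as the number of
    local training samples)."""
    p = np.asarray(importance, dtype=float)
    p /= p.sum()
    return rng.choice(len(p), size=k, replace=False, p=p)

rng = np.random.default_rng(0)
samples_per_device = [500, 200, 800, 100, 400, 300]
chosen = schedule_devices(samples_per_device, k=3, rng=rng)
# Only the scheduled subset trains and transmits in this iteration;
# data-rich devices are picked more often on average.
```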

Device scheduling plays an important role in training CFL, and it also raises several interesting research problems:

  • Data importance-aware device scheduling: In CFL, the contribution of each device’s dataset to the update of a local FL model can be seen as the data importance of that device’s dataset. The data importance of each device depends on the number of training data samples and the data distribution. For instance, if a device has a large number of training data samples, its local FL model will be allocated a large weight within the local FL model update. Since only a subset of devices can perform FL at each iteration, it is necessary to design data importance-aware device scheduling policies to improve the convergence speed. In particular, one must first build a data importance model that jointly considers the number of training data samples, the data distribution, and the data uniqueness. Meanwhile, in CFL, devices cannot share data and, hence, each device may not be able to directly know the data importance of other devices. Therefore, there is a need to find a method that learns the data importance of other devices from their transmitted local FL model parameters. In addition, one must determine the frequency of the local FL model updates for devices with different data importance. Note that increasing the update frequency of the devices with high data importance can improve the convergence speed, but it also increases the loss function value.

  • Device scheduling for multiple FL tasks: In a wireless network, a device may perform multiple FL algorithms simultaneously. Therefore, it will be interesting to design a device scheduling policy that enables devices to efficiently train multiple FL models and transmit the trained FL models to other devices simultaneously. Since each FL task has its specific convergence time requirement and target loss function value, the developed device scheduling policy must determine which FL model must be trained first and which FL model must be transmitted first so as to satisfy the requirements of each FL task. Moreover, since the convergence time of each FL task is different, the designed scheduling policy must be adapted to the changes in the number of incomplete FL tasks.

  • Device scheduling and network formation for mobile devices: In an IoT system, several devices, such as cars and drones, are mobile. The connections among devices and the wireless network performance will change with device mobility, thus affecting the CFL performance. Meanwhile, device mobility will increase the frequency with which devices change their connections, thus slowing down the CFL training process. Therefore, it is necessary to study device scheduling and network formation for mobile devices. In OFL, devices transmit their local FL models to a static BS. However, in CFL, mobile devices must transmit their local FL models to other mobile devices. Hence, the devices’ locations and connections are correlated in space (i.e., between two connected devices) and time (i.e., between time slots). For example, for two devices moving in parallel, although their locations change, the distance between them remains constant; as a result, the change in their locations will not increase the local FL model transmission delay. Therefore, one must first build a model to capture the effect of the spatio-temporal correlation of device locations and connections on the FL performance metrics. Then, one must investigate how to use this spatio-temporal correlation to optimize the device scheduling and network topology policies as well as the frequency of changing these policies.

(a) CFL scenario used for our simulations.
(b) Simulation result
Fig. 4: Simulation scenario and result for source coding based FL.

IV-C Coding

During the CFL training process, source coding, channel coding, and gradient coding can be used to improve the FL performance. Source coding compresses the high-dimensional FL model parameters so that they can be represented by a small number of bits, hence reducing the FL parameter transmission delay. Channel coding protects the transmitted FL model parameters against wireless noise and interference, thus reducing the packet error rate and improving CFL reliability. Gradient coding encodes the GD parameters of ML algorithms so as to improve the ML performance.

In this regard, a quantization-based source coding method was proposed in [13] for reducing the data size of the local FL models that are transmitted over wireless links. The coding and decoding procedure is shown in Fig. 4(a). Here, we use the quantization-based coding method of [13] for CFL and implement a CFL algorithm for handwritten digit identification. All simulation settings are similar to those in [13].
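
To illustrate the idea, the sketch below applies plain uniform quantization to a model vector and measures the reconstruction error at different bit widths. The scheme of [13] is more elaborate than this stand-in, so the code only demonstrates the generic bit-rate versus accuracy tradeoff.

```python
import numpy as np

def quantize(w, n_bits):
    """Uniform n-bit quantization of a model vector (sketch). Returns the
    integer codes plus the (scale, offset) needed for decoding."""
    lo, hi = w.min(), w.max()
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((w - lo) / scale).astype(np.uint32)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Reconstruct an approximate model vector from the integer codes."""
    return codes * scale + lo

w = np.random.default_rng(0).normal(size=1000)
for n in (2, 4, 8):
    codes, scale, lo = quantize(w, n)
    err = np.abs(dequantize(codes, scale, lo) - w).max()
    print(n, "bits -> max element error", round(err, 4))
# Fewer bits shrink the payload each device transmits, at the cost of a
# larger gap between the model before coding and the model after coding.
```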

Fig. 4(b) shows how the accuracy of a handwritten digit identification learning task changes with the number of iterations. In Fig. 4(b), the quantization-based CFL algorithm uses a small number of bits to represent each element of the local FL model vector. Fig. 4(b) shows that the quantization-based CFL algorithm with a larger number of quantization bits can achieve almost the same performance as the CFL algorithm without coding. Since the quantization-based CFL algorithm uses only a few bits to represent each element of the local FL model vector, its transmission delay significantly decreases. From Fig. 4(b), we can also see that the quantization-based CFL algorithm with more quantization bits achieves better performance than the one with fewer quantization bits. This is because coding makes the local FL model after coding differ from the FL model before coding. As the number of bits used to represent the local FL model decreases, the difference between the FL model after coding and the FL model before coding increases, thus degrading the identification accuracy.

Obviously, source, channel, and gradient coding can significantly improve CFL performance. However, a number of research questions still exist:

  • Heterogeneous source coding design: In an IoT system, the wireless transmission link characteristics of each device will be different (e.g., different data rates). To efficiently use wireless resources for FL model transmission, each device may encode its local FL model using a different number of bits or a different coding technique. This type of coding scheme is called heterogeneous source coding; a sketch of one such bit-allocation rule is given after this list. For example, some devices can use 15 bits to represent their local FL models while another can use 7 bits. Heterogeneous source coding can significantly reduce the coding energy consumption and decrease the loss function value. However, in CFL, a device must transmit its local FL model to multiple devices. Therefore, one must determine the number of local FL models that each device must encode and the number of bits used to encode each of them. For example, if a given device must transmit its local FL model to three devices, it can encode one local FL model and transmit it to all three devices, or it can encode two or three local FL models with different numbers of bits and transmit them to the three devices individually.

  • Gradient coding for avoiding stragglers: Due to limited wireless resources, an IoT system may include devices with extremely high transmission or computation delays. Such devices (called stragglers) may not be able to complete the local FL model transmission within the time duration required by the system. If a network has a large number of stragglers, the number of devices that can perform CFL will significantly decrease. Therefore, there is a need to design gradient coding schemes that address the straggler problem. However, traditional gradient coding methods require devices to share their datasets with other devices so as to compensate for stragglers; hence, they cannot be used for CFL, which does not allow devices to share their data. One must therefore investigate novel gradient coding schemes that work without data sharing.
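
As a sketch of the heterogeneous source coding idea raised in the first item above, the rule below assigns each outgoing link the largest per-element bit width whose total payload still meets a common transmission deadline at that link's data rate. The deadline-driven rule and all values are illustrative assumptions.

```python
import numpy as np

def bits_per_link(rates_bps, n_elements, deadline_s, max_bits=16):
    """Heterogeneous source coding (sketch): per-element bit width for each
    outgoing link, chosen so the quantized model (n_elements values) can be
    sent within deadline_s at that link's data rate."""
    budget_bits = np.asarray(rates_bps) * deadline_s  # bits sendable per link
    return np.clip((budget_bits // n_elements).astype(int), 1, max_bits)

# A device with three neighbors at different data rates (illustrative):
print(bits_per_link([4e6, 1e6, 0.5e6], n_elements=100_000, deadline_s=0.25))
# -> [10  2  1]: faster links get more bits per element, slower links fewer.
```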

V Conclusion

This article proposed a novel wireless FL framework, called collaborative FL, and introduced the challenges and opportunities of using wireless communication techniques to optimize CFL performance. The introduced wireless techniques provide guidance for reliably deploying CFL across edge IoT devices, and the discussed research opportunities identify important open problems that must be considered when designing and deploying CFL for IoT systems. We expect that the proposed CFL framework will fundamentally change the original FL architecture, allowing it to be deployed in several future applications such as mobile keyboard prediction, IoT device identification and monitoring, and extreme event detection for autonomous vehicles.

References

  • [1] M. M. Amiri and D. Gunduz, “Federated learning over wireless fading channels,” IEEE Transactions on Wireless Communications, to appear, 2020.
  • [2] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. M. Kiddon, J. Konecny, S. Mazzocchi, B. McMahan, T. V. Overveldt, D. Petrou, D. Ramage, and J. Roselander, “Towards federated learning at scale: System design,” in Proc. of Systems and Machine Learning Conference, Stanford, CA, USA, 2019.
  • [3] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, “Artificial neural networks-based machine learning for wireless networks: A tutorial,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3039–3071, 2019.
  • [4] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” arXiv preprint arXiv:1909.07972, 2019.
  • [5] A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal, “GADMM: Fast and communication efficient framework for distributed machine learning,” arXiv preprint arXiv:1909.00047, 2019.
  • [6] Z. Han, D. Niyato, W. Saad, T. Başar, and A. Hjørungnes, Game Theory in Wireless and Communication Networks: Theory, Models, and Applications. Cambridge University Press, 2012.
  • [7] J. Kang, Z. Xiong, D. Niyato, Y. Zou, Y. Zhang, and M. Guizani, “Reliable federated learning for mobile networks,” IEEE Wireless Communications, to appear, 2020.
  • [8] L. U. Khan, N. H. Tran, S. R. Pandey, W. Saad, Z. Han, M. N. H. Nguyen, and C. S. Hong, “Federated learning for edge networks: Resource optimization and incentive mechanism,” arXiv preprint arXiv:1911.05642, 2019.
  • [9] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/.
  • [10] A. Nedic, A. Olshevsky, and M. G. Rabbat, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, May 2018.
  • [11] S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities and challenges,” arXiv preprint arXiv:1908.06847, 2019.
  • [12] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, Nov. 2019.
  • [13] N. Shlezinger, M. Chen, Y. C. Eldar, H. V. Poor, and S. Cui, “Federated learning with quantization constraints,” in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
  • [14] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [15] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Communications Magazine, vol. 58, no. 1, pp. 19–25, Jan. 2020.