
From Federated Learning to Fog Learning: Towards Large-Scale Distributed Machine Learning in Heterogeneous Wireless Networks

Contemporary network architectures are pushing computing tasks from the cloud towards the network edge, leveraging the increased processing capabilities of edge devices to meet rising user demands. Of particular importance are machine learning (ML) tasks, which are becoming ubiquitous in networked applications ranging from content recommendation systems to intelligent vehicular communications. Federated learning has emerged recently as a technique for training ML models by leveraging processing capabilities across the nodes that collect the data. There are several challenges with employing federated learning at the edge, however, due to the significant heterogeneity in compute and communication capabilities that exist across devices. To address this, we advocate a new learning paradigm called fog learning which will intelligently distribute ML model training across the fog, the continuum of nodes from edge devices to cloud servers. Fog learning is inherently a multi-stage learning framework that breaks down the aggregations of heterogeneous local models across several layers and can leverage data offloading within each layer. Its hybrid learning paradigm transforms star network topologies used for parameter transfers in federated learning to more distributed topologies. We also discuss several open research directions for fog learning.





I Introduction

The modern era has witnessed an explosion in the number of intelligent wireless devices capable of connecting to the Internet and forming ad-hoc networks. The improved processing capabilities of these devices, coupled with rising user demands for compute-intensive tasks, have motivated fog computing, an emerging architecture that aims to migrate a significant amount of data-intensive processing from centralized datacenters in the cloud towards the distributed network edge [1]. Machine learning (ML) tasks in particular have attracted a lot of recent attention in networking applications, given their potential to provide fast and autonomous decision-making.

ML techniques generally require large datasets for model training, especially in the newer category of deep learning. This data is generated at end user devices as they interact with applications, and then traditionally is transferred to a central location, typically a datacenter, which carries out the model training. Consider, for example, automated facial recognition carried out by social media platforms today: when a user uploads a photo, a prediction is made of who is in the image by applying a model that was trained over billions of samples at a datacenter. The user’s feedback on this prediction (e.g., whether it is correct or not) will then inform further model training at the datacenter.

Centralized ML model training is prohibitive in many emerging network applications, however. Movement coordination among unmanned aerial vehicles (UAVs), process optimization in smart factories, and object recognition in virtual reality are just a few of today’s popular tasks that generate large volumes of distributed data needed for model learning while having stringent latency requirements. In particular, transferring data samples from the edge to the cloud has the following drawbacks in these contemporary applications:

  1. For battery-limited devices such as smartphones, UAVs, and wireless sensors, uplink data offloading can consume prohibitive amounts of energy.

  2. The round trip time of data transfer, model training/updating, and decision making can be prohibitively long.

  3. In privacy-sensitive applications (e.g., health monitoring from wearables), end users may not be willing to share their raw data.

These limitations have motivated work on spreading ML model training and inference through networks of devices. In particular, the federated learning technique has received significant attention recently: it only requires the transfer of models between worker devices and a central agent periodically, which has advantages in terms of privacy, latency, and communication demand [2].

I-A Federated Learning

The standard operation of federated learning is depicted in Fig. 1. To train an ML model (e.g., a neural network) of interest, two steps are repeated in sequence: (1) local learning, in which each worker device updates the parameters of the ML model (e.g., weights on neurons) using its own locally collected dataset, and (2) global aggregation, in which a main server determines the new global model from the local updates and synchronizes the devices with this aggregated version. The local learning at each device typically consists of rapid gradient descent iterations to update the model. The global aggregation is typically an averaging of the local parameters, which may be weighted depending on the perceived quality of each device’s update [3].

In general, multiple gradient descent iterations may be employed in-between each global aggregation. Reducing the frequency of these aggregations reduces the upstream and downstream communication demands placed on the network. A key property of federated learning is that the data itself is never transferred between the devices and the server, which further reduces communication demands, and mitigates privacy concerns associated with data sharing. Even though each device trains on only a subset of the full dataset, model qualities resulting from federated learning have been observed to be close to centralized training in practice [4, 5].
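The two repeated steps can be illustrated with a minimal sketch. It assumes a one-parameter least-squares model, synthetic device datasets, and aggregation weights equal to local dataset sizes; all function names and constants are hypothetical choices made for illustration, not details taken from the cited works.

```python
import random

def local_update(w, data, lr=0.2, steps=5):
    """Local learning: a few gradient descent steps on squared error for y ~ w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def global_aggregate(local_ws, weights):
    """Global aggregation: weighted average of the local parameters."""
    return sum(w * a for w, a in zip(local_ws, weights)) / sum(weights)

def federated_train(datasets, rounds=30):
    w = 0.0  # initial global model
    sizes = [len(d) for d in datasets]  # assumed weighting: local dataset size
    for _ in range(rounds):
        # Each device starts from the current global model; raw data stays local.
        local_ws = [local_update(w, d) for d in datasets]
        # Only the learned parameters travel to the server.
        w = global_aggregate(local_ws, sizes)
    return w

def make_device_data(rng, n):
    """Hypothetical device dataset: n noisy samples of y = 3x."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + rng.gauss(0, 0.1)) for x in xs]

rng = random.Random(0)
datasets = [make_device_data(rng, n) for n in (20, 50, 80)]
w_global = federated_train(datasets)
print(f"learned w = {w_global:.2f}")
```

Raising `steps` while lowering `rounds` trades more local computation for fewer global aggregations, which is the communication trade-off described above.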

Fig. 1: Left: A schematic of the conventional federated learning network architecture. User devices perform local (typically gradient descent) updates of the current global model parameters, and send their learned parameters to the main server. The main server then aggregates these locally trained models into a new global model and sends it back to the devices. The learning network resembles a star topology. Right: An abstract model of data flow in federated learning, summarizing the two main steps.

The standard implementation of federated learning described in Fig. 1, however, would cause performance issues in realistic fog computing environments. In the rest of this section, we outline the key considerations for developing network-aware techniques for distributing ML tasks, and initial works that have attempted to address them in the networking and ML research communities.

I-B Design Considerations for Network-Aware ML

I-B1 Communication heterogeneity

Most of the devices engaged in ML at the edge – cellular phones, smart vehicles, wireless sensors, UAVs, etc. – are wireless and mobile, with significant heterogeneity in their communication abilities. Channel qualities will change over time as devices move through the network and present varying forms of interference to each other’s channels. The devices themselves also have varying transmission powers with which they can communicate model updates. As the achievable uplink and downlink data rates of the system will vary for each node over time, they must be taken into consideration in the design of distributed ML techniques.

These heterogeneous communication characteristics have motivated a few recent studies on federated learning for wireless networks [6, 7, 8]. Additionally, they have motivated studies on communication-efficient federated learning, through the techniques of quantization (i.e., compressing model updates prior to transmission) [9] and sparsification (i.e., transmitting only some elements of the parameter vectors).
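As a rough illustration of these two compression ideas, the sketch below applies uniform scalar quantization and top-k magnitude sparsification to a small update vector. Both are assumed, simplified variants chosen for illustration; the cited works may use different schemes, and all function names are hypothetical.

```python
def quantize(vec, bits=8):
    """Uniformly quantize each entry to one of 2**bits levels over [min, max]."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in vec]  # small integers to transmit
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Receiver side: map integer codes back to approximate values."""
    return [lo + c * scale for c in codes]

def sparsify_top_k(vec, k):
    """Keep only the k largest-magnitude entries as (index, value) pairs."""
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return sorted((i, vec[i]) for i in idx)

def densify(pairs, length):
    """Receiver side: rebuild a full vector, treating dropped entries as zero."""
    out = [0.0] * length
    for i, v in pairs:
        out[i] = v
    return out

update = [0.5, -0.01, 0.02, -1.2, 0.003]  # hypothetical model update
codes, lo, scale = quantize(update, bits=8)
approx = dequantize(codes, lo, scale)
sparse = sparsify_top_k(update, k=2)
print(codes, sparse)
```

Quantization bounds the per-entry error by half a quantization step, whereas sparsification sends fewer entries at full precision; the two can also be combined.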


I-B2 Computation/storage heterogeneity

Wireless edge devices also exhibit heterogeneity in their computation and storage capabilities, due both to intrinsic differences in processing equipment and varying availabilities of their resources. Thus, the time required to perform a single local update will vary from one device to another, leading to variable response times and straggler issues when these delays become prohibitively long.

In the context of federated learning, this has motivated studying the effects of device compute delays and the existence of stragglers on the time required to train models [7, 11]. Methods that have been proposed to resolve these effects include coded federated learning [12], where coding techniques are used to offload part of the computations from the devices to the server, and intelligent selection of device training participation [13]. Techniques for mitigating compute limitations have also been studied more generally, e.g., through model compression [14].

I-B3 Privacy and security

Although federated learning eliminates the need to transmit raw data over the network, it is possible for sensitive information to be leaked through reverse engineering of model parameters [4]. This can be problematic in networking applications with strict privacy concerns, like medical diagnostics. This has motivated investigations into adapting well-known privacy and security-preservation techniques – such as differential privacy and homomorphic encryption – to federated learning; see [15] and references therein.

I-B4 Joint performance metrics

The performance of an ML task is typically measured through the convergence speed and the accuracy of the resulting model. In network-aware ML, the previous three design considerations suggest other performance metrics that must be considered: the communication and computation resources expended, and the privacy/security guarantees of the resulting technique. Unfortunately, these objectives tend to compete with one another: for example, a wireless network device processing more gradient updates may improve resulting model quality, but requires more energy consumption, which can degrade device performance if the battery is currently low. Thus, techniques for network-aware ML must consider a joint optimization among the objectives of (i) minimizing network resource costs, (ii) maximizing resulting model quality, and (iii) maximizing privacy/security, with different importance assigned to each objective depending on the application. A few recent works [5, 4] have investigated tradeoffs between the first two objectives.

Fig. 2: A schematic of potential model aggregation stages for a large-scale machine learning task in network-aware learning. The main server (here depicted on the U.S. east coast) aggregates parameter updates from multiple cloud servers. Before reaching these cloud servers, local models trained by edge devices go through multiple layers of aggregations.

II Motivating a New Architecture for Network-Aware Learning

Several aspects of network-aware ML outlined in Section I-B are not addressed by federated learning. In this section, we will explain these limitations, which motivate a new paradigm for distributed ML.

II-A Federated Learning: Limitations in Fog Environments

Consider training and managing an ML task over a large-scale fog network consisting of millions/billions of devices geographically distributed across the world. We face the following key limitations using federated learning as the solution:

II-A1 Multi-layer nature of large-scale learning

Under federated learning, global aggregations would be performed at the main datacenter. When smartphones, smart vehicles, or other connected edge devices perform their local updates, their cellular base stations (BSs), road side units (RSUs), or analogous access points cannot directly transfer these learned parameters to the main server, which will be in a datacenter located possibly thousands of miles away. Instead, one pragmatic approach would be to consider multiple aggregations at different scales, e.g., edge servers in localities, cities, states, and countries, before finally reaching the datacenter. Similarly, for a team of data-gathering UAVs in an area with no cellular coverage, the local learning parameters may first be aggregated by a team of miniature UAVs, then multiple heavier UAVs, and then a high altitude platform (HAP). The HAP would transmit the aggregated models to an edge server through a backhaul network. Once at the edge server, these parameters could traverse the aforementioned hierarchy to reach the main server.

This potential multi-layer network structure for model aggregation is depicted in Fig. 2. To optimize network resources, the frequency of aggregations/synchronizations would likely decrease moving up the hierarchy, to prevent communication of marginal updates. However, if one device experiences an abrupt change in its local model from a changing environment, the delay in propagating this update through the hierarchy must be considered too.

II-A2 Overloading heterogeneous network resources

Another challenge is that current cellular BSs and RSUs are not designed to handle model uploads from large numbers of active devices simultaneously. Training deep neural networks with federated learning can require participation from many active devices, as high complexity models require large datasets [14]. Moreover, given the heterogeneity of edge resources, each participating device may only be capable of processing a small set of samples for a high dimensional model. This calls for a learning architecture over the hierarchy in Fig. 2 that optimizes the choice of devices participating in training based on current network conditions.

II-A3 User incentivization

Many ML applications rely on voluntary user participation for model training. Large resource demands on user devices will make them less willing to use ML applications based on federated learning. It is possible to develop incentives in the form of economic considerations, e.g., offering discounted service for willingness to provide more compute resources. Nevertheless, consecutive uplink transmissions to a BS, UAV, or HAP will result in a large sum power consumption across the devices. A network-aware learning technique could incorporate intelligent device sampling to reduce the uplink transmissions required from any single end device. Providing reasonable guarantees on battery requirements will make users more willing to participate.

II-A4 Strict privacy assumptions

Federated learning guarantees that each device’s local dataset is never transferred over the network. While this is important in privacy-sensitive applications, in many cases users may be willing to share portions of their datasets for ML training, which can be useful when there is a combination of resource-hungry and resource-rich devices. For example, a smart car attempting to train an object classifier with a limited on-board processor is likely willing to offload its sensor data to a more powerful car to expedite the training process if the channel/interference conditions are reasonable. This calls for a learning framework which can exploit wider ranges of privacy restrictions.

II-B From Federated to Fog Learning

Given these limitations, we propose a new learning paradigm called fog learning for distributing ML through large-scale heterogeneous networks. As opposed to federated learning which is based on a star topology of device-server interactions, fog learning will additionally enable intelligent data and parameter offloading between devices in a distributed topology. This hybrid learning paradigm will exploit the multi-layer structure of fog networks to optimize performance in the presence of heterogeneous network resources.

III Fog Learning: A Multi-layer Hybrid Learning Paradigm

In this section, we define fog learning. We discuss its multi-layer structure, hybrid learning characteristics, and how these address the design considerations from Sec. I-B.

III-A Multi-layer Network Architecture

Fog learning is a multi-layer learning structure over a fog network. In this structure, similar to conventional federated learning, the main server conducts global aggregations. However, the end users are not directly connected to the main server: instead, the local models learned by edge devices may traverse multiple layers of aggregations before reaching the main server. Aggregations at each layer provide dimensionality reduction, reducing the size of the data being transmitted upstream. Synchronizations at each layer also provide agile responses to any changes in local data distributions.

To see the motivation for dimensionality reduction, consider that any ML model is represented as a vector of its model parameters. For a deep neural network, this vector can have millions of entries [14], where each element requires a certain number of quantization bits for storage and transfer. Depending on the quantization method, then, this parameter vector could require anywhere from a few megabytes to gigabytes, in extreme cases approaching the size of the training data itself. For the hierarchical network structure depicted in Fig. 2, consecutive transmissions of these vectors from millions of edge devices to the main server would lead to large delays, overloaded network infrastructure, and high communication costs.
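A back-of-the-envelope calculation makes these sizes concrete. The parameter count, device count, and bit widths below are hypothetical values picked only to show the arithmetic.

```python
def vector_size_bytes(num_params, bits_per_param):
    """Bytes needed to send one model parameter vector."""
    return num_params * bits_per_param // 8

NUM_PARAMS = 10_000_000   # hypothetical model size (entries in the vector)
NUM_DEVICES = 1_000_000   # hypothetical number of edge devices

for bits in (8, 16, 32):
    per_upload_mb = vector_size_bytes(NUM_PARAMS, bits) / 1e6
    total_tb = per_upload_mb * NUM_DEVICES / 1e6
    print(f"{bits:2d}-bit entries: {per_upload_mb:.0f} MB per upload, "
          f"{total_tb:.0f} TB per round if every device uploads directly")
```

Under these assumptions a single round of direct uploads already reaches tens of terabytes, which is the load that intermediate aggregation is meant to avoid.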

In fog learning, we apply the federated learning aggregation/synchronization to the parameter vectors at each layer of the topology. After each aggregation, due to the associative characteristic of the summation, the size of the resulting vector to be transmitted upstream is the same as any one of the input vectors. In particular, aggregating the learning parameters of $N$ nodes results in a factor of $N$ reduction in the number of bits required on the upstream. We illustrate this principle in Fig. 3. For instance, each miniature UAV in Fig. 2 can aggregate its associated devices’ parameters and send the resulting vector to the UAVs located in the upper layer. The same can be done at small/medium UAVs, HAPs, BSs/RSUs, and edge servers.
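The layer-by-layer aggregation can be sketched as follows, assuming each node forwards a mean vector together with the number of devices it summarizes. The `aggregate` helper and the toy clusters are hypothetical; the point is only that the two-stage result matches a single flat average while each upstream message keeps the length of one input vector.

```python
def aggregate(children):
    """Combine (mean_vector, device_count) pairs into one pair of the same vector length."""
    total = sum(count for _, count in children)
    length = len(children[0][0])
    mean = [sum(vec[j] * count for vec, count in children) / total
            for j in range(length)]
    return mean, total

# Hypothetical two-cluster topology: each end device reports (parameters, 1).
cluster_a = [([1.0, 2.0], 1), ([3.0, 4.0], 1), ([5.0, 6.0], 1)]
cluster_b = [([7.0, 8.0], 1)]

mid_a = aggregate(cluster_a)  # one vector upstream instead of three
mid_b = aggregate(cluster_b)
global_model, num_devices = aggregate([mid_a, mid_b])

# Reference: a single flat average over all end devices.
flat_model, _ = aggregate(cluster_a + cluster_b)
print(global_model, flat_model, num_devices)
```

Without the device counts, the main server would weight both clusters equally and the result would no longer equal the average over all end devices.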

III-B Hybrid Learning: Vertical and Horizontal Communications

The learning architecture in Fig. 3 follows a vertical communication structure, where model parameters are passed only upstream and downstream between the network layers. Fog learning takes this one step further to allow for horizontal communications between devices in the same layer, so long as communication constraints are met.

Fig. 3: High-level illustration of the dimensionality reduction concept provided by multi-layer aggregations in a fog topology. The original learning parameter vectors at the end devices all have the same length. The data transmitted upstream from each middle node has this same length, reduced from the total input size to the node by a factor of the number of inputs. If each middle node also indicates the number of devices its aggregation is based on (which is a negligible overhead), the main server can compute the final average aggregation.

Peer-to-peer (P2P) networking has long been an area of research, offering on-demand establishment of connectivity and eliminating the requirement of a central module to facilitate communication between peers. Contemporary communication technologies like 5G and the Internet of Things (IoT) are enabling direct device-to-device (D2D) communications between wireless edge nodes, which is motivating peer-to-peer intelligence in fog computing [1]. For example, there is a well-developed body of literature on D2D communication protocols for ad-hoc networks, such as MANETs, VANETs, UAV networks, and wireless sensor networks. Also, the multi-peer connectivity framework offered via newer generation iPhone devices is expected to have a large impact on their communication scheme via D2D.

Fig. 4: Network representation of fog learning. The root of the tree corresponds to the main server, the leaves of the tree correspond to the end devices, and the nodes in-between correspond to different intermediate devices (e.g., BSs, UAVs, and edge servers). The nodes belonging to the same layer and the same horizontal rectangle form clusters capable of D2D communications. The patterned rectangles correspond to those clusters that choose to engage in D2D and distributedly learn their model aggregation. The parent node of such clusters can then sample just one (or a tiny fraction) of its children nodes to obtain the aggregated model. Each yellow block represents a local learning block, the top nodes of which have a certain clock for transmitting data vertically upward for global aggregations.

Considering again the hierarchical fog structure in Fig. 2, fog learning would intelligently cluster the devices in the bottom-most layer (inside the two green rectangles) such that the devices in each cluster have the potential to form a wireless ad-hoc network for parameter sharing and/or data offloading. Similarly, the upper layers will be clustered such that the computing nodes in each layer are capable of forming an ad-hoc network for parameter sharing, in some cases via low-latency wired connections (e.g., multiple local edge servers connected via optical fibers in a metropolitan area) and in other cases over the air (e.g., UAVs).

In Fig. 4, we represent this fog learning network structure as a logical tree graph, the leaves of which are the edge devices and the root of which is the main server. Fog learning thus becomes a hybrid learning platform that utilizes horizontal communications among nodes belonging to the same layer of the network tree model. In the following, we discuss a general approach for D2D communications at various network layers. Then, we discuss a data offloading strategy that can be utilized in the bottom-most layer of the network.

III-B1 Distributed aggregations through horizontal communications

With D2D sharing of their models enabled, the nodes inside the engaged clusters are capable of computing the aggregation of their locally-trained parameters in a distributed manner, through message passing and consensus formation. This approach eliminates the need for the parent node to compute the aggregation, and can be implemented at all the network layers, which has energy efficiency advantages (discussed further in Sec. III-C). At the bottom-most layer, the datasets of the devices remain local, as in federated learning. In leveraging such horizontal communications, the conventional star topology used in federated learning (see Fig. 1) is transformed to a distributed fog learning topology.
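One standard way to realize such message passing is linear average consensus, sketched below for a hypothetical four-device ring. The step size, number of rounds, and topology are assumptions for illustration; the text does not prescribe a specific consensus protocol.

```python
def consensus_average(values, neighbors, rounds=50, step=0.3):
    """Each node repeatedly nudges its vector towards those of its D2D neighbors.

    values:    list of parameter vectors, one per node
    neighbors: dict mapping node index -> list of neighbor indices (undirected links)
    step:      mixing weight; assumed smaller than 1 / (maximum node degree)
    """
    x = [list(v) for v in values]
    for _ in range(rounds):
        new_x = []
        for i, xi in enumerate(x):
            updated = list(xi)
            for j in neighbors[i]:
                for d in range(len(xi)):
                    updated[d] += step * (x[j][d] - xi[d])
            new_x.append(updated)
        x = new_x  # synchronous update: all nodes use last round's values
    return x

# Hypothetical cluster: four devices connected in a ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
local_models = [[1.0, 0.0], [2.0, 4.0], [3.0, 0.0], [6.0, 4.0]]
agreed = consensus_average(local_models, ring)
print(agreed[0])
```

Because the links are symmetric, the cluster-wide average is preserved at every round, so every node converges to the same value the parent would have computed. Any single node can then report it upstream.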

III-B2 Dataset offloading under milder privacy concerns

In addition to sharing learning parameters, the proposed D2D communication scheme can also be used for partial dataset offloading among trusted edge devices, for applications with milder privacy concerns. In Fig. 4, this is only applicable at the bottom-most layer of the tree, where the data is collected. This approach is useful in the presence of heterogeneous computation resources within a cluster, where resource abundant nodes can perform more gradient updates (discussed further in Sec. III-C). Our recent work [4] studied the improvement in network resource costs that D2D offloading can provide in distributed ML training, and found up to 50% reductions are possible compared with the case of pure local data processing in federated learning.

III-C Performance Advantages of Fog Learning

Referring to the design considerations from Sec. I-B, the following are the key advantages of fog learning:

III-C1 Reducing network traffic

Fog learning employs local aggregations of ML model parameters at different levels of the topology, providing an upstream dimensionality reduction. This results in significantly reduced network traffic between different network layers (by a factor of the number of devices in each cluster). Reducing data transfer requirements over long distances decreases latency and communication costs.

III-C2 Network power savings

Horizontal D2D communications allow node clusters to distributedly discover their aggregated models. Thus, the parent node of the cluster can choose one device (or a few if errors due to noise are a concern) to upload the aggregated value. Decreasing the number of uplink transmissions by an order of magnitude will reduce energy consumption significantly. For instance, in a cellular network, continuous communication with the BS drains a smartphone’s battery rapidly. With D2D enabled, rather than uploading to the BS at each aggregation, the devices could engage in short-range, low power communications, and only one device will need to transmit the result to the BS. This single device could be selected by the BS intelligently (e.g., one that is nearby and requires lower transmit power). Instead of selecting one device, it would also be possible to employ a diversity technique where each device in a cluster engages in short, simultaneous uplink transmissions of only a fraction of the parameters.
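The last option can be sketched as a simple partition of the agreed vector. This assumes every device in the cluster already holds the same aggregated vector after D2D exchange and that the parent knows the slice boundaries; `split_for_uplink` and `reassemble` are hypothetical helpers.

```python
def split_for_uplink(vector, num_devices):
    """Assign each device a contiguous slice of the cluster's aggregated vector."""
    base, extra = divmod(len(vector), num_devices)
    slices, start = [], 0
    for d in range(num_devices):
        end = start + base + (1 if d < extra else 0)
        slices.append((start, vector[start:end]))  # (offset, payload) sent by device d
        start = end
    return slices

def reassemble(slices, length):
    """Parent side: rebuild the full vector from the per-device slices."""
    out = [None] * length
    for start, chunk in slices:
        out[start:start + len(chunk)] = chunk
    return out

aggregated = [float(i) for i in range(10)]  # hypothetical agreed-upon vector
uplinks = split_for_uplink(aggregated, num_devices=3)
print([len(chunk) for _, chunk in uplinks])
recovered = reassemble(uplinks, len(aggregated))
```

Each device transmits roughly one third of the entries here, which spreads the uplink energy cost across the cluster instead of concentrating it on one node.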

III-C3 Efficient spectrum usage

Devices in a cluster engaged in D2D communications can operate in the out-band mode, which does not require utilizing the licensed spectrum of a cellular BS or a vehicular RSU.

III-C4 Adaptation to device entry/exit

In wireless mobile environments, devices may enter/exit a local cluster rapidly. When a device enters a D2D enabled cluster, it can join the learning process quickly through acquisition of the current model parameters from a neighboring node, without the requirement of communicating with nodes in the upper layer, e.g., BSs, RSUs, or UAVs. On the other hand, when a device exits, it can transfer its model and/or data to a local, resource-abundant peer so that its locally updated model and local data distribution are not lost. This capability, along with the fact that devices in different clusters can perform learning in parallel, can be described as parallel successive learning: nodes can inherit partially-trained models and continue refining the parameters with newly collected data.

III-C5 Learning with heterogeneity

Channel conditions between an edge device and its parent node will vary over time. For example, in suburban areas and on interstate roads, communication with the parent BSs or RSUs may not even be possible for long periods of time. With D2D communication enabled, end users in each cluster can form a “micro-fog” network and keep performing their local updates and distributed aggregations until upstream communication conditions improve. Also, if a device moves to a new D2D-enabled cluster, it will begin sharing its learning parameters with the devices in the arriving cluster. The heterogeneity between data distributions across clusters will alleviate the potential of one cluster overfitting to its own data distribution.

III-C6 Mitigating straggler effects

Datasets of devices with lower uplink communication qualities (i.e., low data rates, long delays, significant channel fading and loss) can be transferred to neighboring devices with better channel conditions for more efficient communication of updates to the parent node. In addition, datasets of devices with lower computational capabilities can be offloaded to those with more idle resources, which will enhance the overall model learning speed.
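A simple way to picture compute-driven offloading is to redistribute samples in proportion to processing rate, so that all devices finish a local pass at about the same time. The sketch below is an assumed heuristic that ignores D2D transfer cost, channel quality, and privacy restrictions; the rates and sample counts are hypothetical.

```python
def balance_by_offloading(samples, rates):
    """Return per-device sample counts proportional to processing rate.

    samples[i]: number of samples initially held by device i
    rates[i]:   samples device i can process per second
    """
    total, total_rate = sum(samples), sum(rates)
    target = [total * r // total_rate for r in rates]
    # Rounding down can leave a few samples unassigned; give them to the fastest devices.
    leftover = total - sum(target)
    fastest = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)
    for i in fastest[:leftover]:
        target[i] += 1
    return target

samples = [300, 300, 300]  # hypothetical initial dataset sizes
rates = [10, 30, 50]       # hypothetical processing rates (samples per second)

before = [n / r for n, r in zip(samples, rates)]  # per-device pass time, no offloading
target = balance_by_offloading(samples, rates)
after = [n / r for n, r in zip(target, rates)]
print(before, target, after)
```

In this toy case the slowest device sheds two thirds of its samples, and the time for the cluster to complete a local pass is set by a balanced 10 s instead of the straggler's 30 s.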

III-C7 Leveraging passive devices’ datasets

Certain edge devices may possess valuable data for the ML task but not be engaged in the training process (e.g., due to processing limitations). With D2D-enabled offloading, these passive datasets can be transferred to neighboring active nodes to improve learned model quality.

III-C8 Faster convergence in fewer global aggregations

By mitigating the effect of stragglers and enabling more distributed processing on heterogeneous datasets, the global model in Fig. 4 can be trained faster and with fewer costly global aggregations.

III-D Key Innovations in Fog Learning

The following five key innovations summarize how fog learning will satisfy the design considerations for network-aware learning in Sec. I-B:

  • It will establish multi-stage hierarchical machine learning through space.

  • It will constitute a migration from star to distributed learning topologies.

  • It will employ agile network-aware management of heterogeneous nodes and channels.

  • Its task distribution will be based on multi-objective network optimization of latency, cost, and privacy metrics.

  • It will enable parallel successive learning for rapid refinement of locally trained models.

IV Open Research Directions

Fog learning is an emerging paradigm with several open research questions for the innovations in Sec. III-D. In the following, we outline eight key directions of future research:

IV-1 Optimizing horizontal and vertical communication tradeoffs

Performing aggregations via horizontal D2D communication in device clusters may be more resource-efficient, but can also incur more delay compared with the case of vertical aggregations. This delay is a function of data rates among the edge devices, channel qualities, rounds of D2D communication required to compute the final value, and other factors. Given the benefits of D2D communications discussed in Sec. III-C, quantifying the trade-offs in a concrete mathematical framework and deciding which clusters of devices are suitable to perform the D2D communications deserves further investigation.

IV-2 Effect of error propagation through a multi-layer structure

Due to communication imperfections and time-varying network topologies, horizontal parameter aggregations of devices in clusters may be noisy versions of the true aggregated values. Such noise will then be propagated and potentially amplified in transmission to upper layers. Modeling these errors, their propagation among different layers, and their cumulative effect on the convergence speed and the accuracy of the training is an interesting future direction.

IV-3 Intelligent cluster sampling

To reduce power consumption and network traffic, the main server in Fig. 4 can perform cluster sampling, in which only the end devices from certain clusters engage in model training at each round of global aggregation. Intelligent sampling strategies require considering the characteristics of nodes in different learning layers and the error propagation model between layers. Also, if nodes in the upper layers have mobile capabilities, this direction prompts the idea of network reconfiguration from one global aggregation to another. For instance, instead of deploying a dedicated set of UAVs for data collection from each cluster of edge devices, a limited set of UAVs can be utilized, and the optimal trajectory can be obtained to enable the desired cluster sampling.
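As a baseline against which intelligent strategies could be compared, the sketch below samples clusters uniformly at random in each global round. Uniform sampling, the cluster names, and the sampling fraction are placeholders assumed for illustration; an intelligent scheme would replace the random draw with a network-aware selection.

```python
import random

def sample_clusters(clusters, fraction, rng):
    """Pick a random subset of clusters to train in this global aggregation round."""
    k = max(1, round(fraction * len(clusters)))
    return rng.sample(sorted(clusters), k)

clusters = ["cluster_a", "cluster_b", "cluster_c", "cluster_d"]  # hypothetical names
rng = random.Random(1)
for round_id in range(3):
    active = sample_clusters(clusters, fraction=0.5, rng=rng)
    print(round_id, active)
```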

IV-4 Parallel learning with fewer global aggregations

The devices located in different layers of the network tree graph can form different learning blocks (see Fig. 4), which can be used to further decrease the network traffic and the required number of global aggregations. In each block, the head (top-most) node(s) have a certain frequency for vertical communication. In-between, they can conduct multiple rounds of in-block learning and parameter updates. Studying the trade-offs between the aggregation frequencies of different learning blocks, the accuracy of the model, the number of devices per block, and the convergence speed of the training is an open direction.
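The interplay of aggregation frequencies can be pictured with a small schedule generator. The layer names and periods are hypothetical, and the sketch only enumerates when each layer would aggregate; it does not model accuracy or convergence.

```python
def aggregation_schedule(periods, horizon):
    """Map each iteration index to the layers that aggregate at that iteration.

    periods[layer]: number of local iterations between aggregations at that layer
    """
    events = {}
    for t in range(1, horizon + 1):
        layers = [layer for layer, p in periods.items() if t % p == 0]
        if layers:
            events[t] = layers
    return events

# Hypothetical periods: aggregation becomes less frequent higher in the hierarchy.
periods = {"cluster": 2, "edge_server": 4, "main_server": 8}
schedule = aggregation_schedule(periods, horizon=8)
for t, layers in schedule.items():
    print(t, layers)
```

With these periods, clusters aggregate four times for every global aggregation at the main server, which is the kind of ratio the trade-off study would tune.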

IV-5 Heterogeneous fog network modeling

A comprehensive model of the interplay between fog network parameters (e.g., trust levels between users, D2D channel quality variations between devices, vertical communication quality variations, heterogeneous data quality of different devices, and heterogeneous compute capabilities) can lead to further optimization of fog learning in the presence of heterogeneous nodes and links. Modeling and quantifying each of these parameters and designing adequate offloading schemes are of particular interest.

IV-6 Smart data sharing

End users can offload different parts of their datasets to different peers. When acting as helper nodes, edge devices with greater compute power can send out requests for specific samples that their datasets lack (e.g., those associated with less common labels) to maximize the resulting data processing benefit. This will increase the quality of the active devices’ dataset distributions and thus improve the global model convergence speed.
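One simple way a helper node could decide which samples to request is sketched below. The rule used here, requesting enough samples to bring every label up to a minimum count, is an assumption chosen for illustration, as are the names `labels_to_request`, `serve_request`, and `min_count`.

```python
from collections import Counter


def labels_to_request(local_labels, all_labels, min_count):
    """Return how many extra samples of each under-represented label the helper wants."""
    counts = Counter(local_labels)
    return {
        label: min_count - counts[label]
        for label in all_labels
        if counts[label] < min_count
    }


def serve_request(peer_dataset, request):
    """Peer side: share samples whose labels are requested, up to the requested amounts."""
    remaining = dict(request)
    shared = []
    for sample, label in peer_dataset:
        if remaining.get(label, 0) > 0:
            shared.append((sample, label))
            remaining[label] -= 1
    return shared


if __name__ == "__main__":
    helper_labels = [0, 0, 0, 1]  # the helper has no samples of label 2
    request = labels_to_request(helper_labels, all_labels=[0, 1, 2], min_count=2)
    peer_data = [("a", 2), ("b", 0), ("c", 1), ("d", 2), ("e", 2)]
    print(request)
    print(serve_request(peer_data, request))
```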

IV-7 Incentivizing end users

Proper incentive mechanisms are needed to persuade devices to participate in model training. The incentives should consider the parameters of the local datasets (e.g., data quality) and the device’s network-related parameters (e.g., speed of data offloading and computational capabilities). In the near future, multiple industries may be interested in large-scale ML model training; the competitive nature of the market needs to be further investigated and encapsulated into the analysis.

IV-8 Dynamic network mobility models

D2D dataset offloading can only occur when the mobile devices happen to be in a certain vicinity, i.e., when a reasonable communication channel can be established between them. Accurate mobility models of the edge devices could reveal pertinent information regarding the anticipated duration of contact, the estimated frequency of contact within a certain time interval, the data distributions of the contacting devices, and so forth. This information could be used to develop mobility-aware dataset offloading mechanisms.
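As an example of the contact statistics mentioned above, the sketch below extracts contact durations from two position traces. It assumes time-aligned 2-D traces sampled at a common rate and a fixed distance threshold as a stand-in for "a reasonable communication channel"; the function name and the `comm_range` parameter are hypothetical.

```python
import math


def contact_durations(trace_a, trace_b, comm_range):
    """Lengths (in time steps) of consecutive periods when two devices are in range."""
    durations = []
    current = 0
    for (xa, ya), (xb, yb) in zip(trace_a, trace_b):
        if math.hypot(xa - xb, ya - yb) <= comm_range:
            current += 1
        elif current:
            # The contact just ended; record its length.
            durations.append(current)
            current = 0
    if current:
        durations.append(current)
    return durations


if __name__ == "__main__":
    device_a = [(0.0, 0.0)] * 6
    device_b = [(5.0, 0.0), (1.0, 0.0), (0.5, 0.0), (5.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    contacts = contact_durations(device_a, device_b, comm_range=1.0)
    print("contacts:", len(contacts), "durations:", contacts)
```

The number of entries gives the contact frequency over the observed interval, and the entries themselves bound how much data could be offloaded during each contact.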

V Conclusion

In this article, we motivated, proposed, and defined fog learning, a new paradigm for distributing machine learning model training through large-scale networks of heterogeneous devices. We demonstrated that fog learning is inherently a multi-layer hierarchical learning framework that can significantly reduce network resource costs and model training times through multiple rounds of model aggregations at different layers of the hierarchy. We also introduced the hybrid property of fog learning, which combines horizontal device-to-device communications between nodes in the same layer with vertical communications up the hierarchy. Further, we discussed the distributed topology and multi-objective optimization nature of fog learning that make it network-aware. Finally, we discussed the unique advantages fog learning provides in contemporary fog computing settings and identified key open research directions.


  • [1] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, 2016.
  • [2] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
  • [3] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-Edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning,” IEEE Network, vol. 33, no. 5, pp. 156–165, 2019.
  • [4] Y. Tu, Y. Ruan, S. Wang, S. Wagle, C. G. Brinton, and C. Joe-Wong, “Network-aware optimization of distributed learning for fog computing,” in Proc. INFOCOM, 2020.
  • [5] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE JSAC, vol. 37, no. 6, pp. 1205–1221, 2019.
  • [6] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” arXiv preprint arXiv:1909.07972, 2019.
  • [7] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in Proc. INFOCOM, 2019, pp. 1387–1395.
  • [8] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 491–506, 2020.
  • [9] A. Elgabli, J. Park, A. S. Bedi, M. Bennis, and V. Aggarwal, “Q-GADMM: Quantized group ADMM for communication efficient decentralized machine learning,” in Proc. IEEE ICASSP, 2020, pp. 8876–8880.
  • [10] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, “SparCML: High-performance sparse communication for machine learning,” in International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–15.
  • [11] D. Ye, R. Yu, M. Pan, and Z. Han, “Federated learning in vehicular edge computing: A selective model aggregation approach,” IEEE Access, vol. 8, pp. 23920–23935, 2020.
  • [12] S. Dhakal, S. Prakash, Y. Yona, S. Talwar, and N. Himayat, “Coded federated learning,” in IEEE GLOBECOM Workshop, 2019, pp. 1–6.
  • [13] K. Bonawitz et al., “Towards federated learning at scale: System design,” SysML, 2019.
  • [14] W. Wang, Y. Sun, B. Eriksson, W. Wang, and V. Aggarwal, “Wide compression: Tensor ring nets,” in Proc. IEEE CVPR, 2018, pp. 9329–9338.
  • [15] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, and B. Thorne, “Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption,” arXiv preprint arXiv:1711.10677, 2017.