Hierarchical Federated Learning with Privacy

06/10/2022
by   Varun Chandrasekaran, et al.

Federated learning (FL), where data remains at the federated clients, and where only gradient updates are shared with a central aggregator, was assumed to be private. Recent work demonstrates that adversaries with gradient-level access can mount successful inference and reconstruction attacks. In such settings, differentially private (DP) learning is known to provide resilience. However, approaches used in the status quo (central and local DP) introduce disparate utility vs. privacy trade-offs. In this work, we take the first step towards mitigating such trade-offs through hierarchical FL (HFL). We demonstrate that by introducing a new intermediary level where calibrated DP noise can be added, better privacy vs. utility trade-offs can be obtained; we term this hierarchical DP (HDP). Our experiments with 3 different datasets (commonly used as benchmarks for FL) suggest that HDP produces models that are as accurate as those obtained using central DP, where noise is added at a central aggregator. Such an approach also provides resilience against inference adversaries comparable to the local DP case, where noise is added at the federated clients.


I Introduction

Federated learning (FL) is a paradigm that enables distributed learning. Federated clients learn locally and share the gradient updates associated with this local learning with a central aggregator, which performs a global model update. Apart from enabling scalability, FL is assumed to provide additional privacy benefits [mcmahan2017communication]: the data used for training never leaves the data owner (the federated clients, in this case), unlike in traditional machine learning (ML) settings where a centralized party must have access to data from all participating entities. However, recent work [melis2019exploiting, boenisch2021curious, geiping2020inverting, yin2021see, zhu2019federated, pasquini2021eluding] has shown that such learning algorithms are susceptible to attacks where the private training data can be reconstructed by observing intermediary calculations of the learning algorithm.

Differential privacy (DP) [dwork2014algorithmic] is a formal construct that provides guarantees on the privacy leakage of various mechanisms, including stochastic gradient descent [abadi2016deep, chaudhuri2011differentially, 10.1145/3035918.3064047], which is commonly used for empirical risk minimization (a component of learning approaches such as FL). It achieves this by adding calibrated noise at various stages of the mechanism. In conventional (i.e., non-FL) settings, it has been shown that DP learning algorithms alleviate the aforementioned attacks [song2021systematic, geyer2017differentially, naseri2020toward]. However, they introduce an inherent trade-off between protecting the privacy of the data used for model training and the utility of the model learned, often measured through test accuracy.

In FL, until now, DP guarantees have been investigated at two levels. At the central level (i.e., central aggregator), DP noise can be added to the aggregated gradient update [konevcny2016federated] (i.e., central DP or CDP). However, this assumes that the clients trust the central aggregator to not perform (malicious) inference on their data. Alternatively, DP noise can be added at the client locally (i.e., local DP or LDP) [kasiviswanathan2011can, wei2020federated, zhao2020local]. This assumes that the clients do not trust any actor in the entire FL ecosystem. While both these approaches provide safeguards against the aforementioned privacy adversary (under different trust models), they introduce disparate privacy vs. utility (typically measured through model performance) trade-offs. In this work, we aim to understand if a compromise can be reached through an intermediary approach, by introducing hierarchies in the learning process, while still providing formal DP guarantees. We call this hierarchical DP (or HDP).

The hierarchies we envision are deviations from the status quo of FL: clients form, or belong to, zones; updates from clients within a zone are aggregated by a super-node (an elected/chosen client within the zone), and the updates from zones (i.e., super-nodes) are aggregated at the central aggregator. While such a hierarchy may seem contrived, observe that such hierarchies are omnipresent in everyday computing systems (such as telecommunication networks, the internet infrastructure, etc.). One example is the existence of compute-enabled routers which can act as super-nodes within a home environment (i.e., trust in one's peers). Furthermore, the onset of edge computing has resulted in the development of extensive, compute-capable edge base stations [liu2016paradrop], which can also act as super-nodes for all clients within a particular geographic zone (e.g., trust in members of the same region or country). These "naturally occurring" clusters of clients (e.g., devices in the same household or office network, or gaming consoles joining and participating in the same P2P gaming server) have trust embedded in their formation and incentives [marti2006taxonomy].

In this paper, we take the first step in analyzing how hierarchies are beneficial with respect to privacy, and what key technical challenges need to be solved. The first challenge is constructing such a hierarchy. We analyze different approaches that reflect the aforementioned scenarios, and discuss the trade-offs in § IV-A. The second challenge involves modifying the standard approaches to FL to ensure that the newly proposed HFL approach learns the same model as in the status quo. We discuss this in § IV-B. The third challenge involves formalizing an algorithm (to provide DP) to be used in an HFL ecosystem. Determining the exact mechanism has direct implications for the privacy vs. utility trade-off, and provides avenues for privacy amplification, which we discuss in § IV-C. We generalize the approach by considering scenarios of simultaneous needs for LDP, HDP and CDP in § V. Our final challenge is understanding the various threats faced by our new constructions. We first validate that the hierarchical construction, when combined with DP learning, is robust to reconstruction adversaries. We also analyze faulty behavior at an intra- and inter-zonal granularity, and describe the information leakage to adversaries in such settings, even when an adversarial client is elected as a super-node.

We highlight the contributions of our work below:

  • We are the first to propose the notion of hierarchical DP, a consequence of adding noise in the newly proposed hierarchical FL setting.

  • We identify an opportunity for privacy amplification: the natural composability of the Gaussian mechanism provides more privacy at the central aggregator level when noise is added at the super-node.

  • We analyze prior attacks in the FL setting, and comment on their applicability in the context of hierarchical FL. In particular, we look into adversaries at the super-node level and discuss potential adversarial inferences they may perform, such as data inference and reconstruction attacks. We show that hierarchical FL (with DP) can thwart such attacks.

  • We experimentally study the privacy vs. utility trade-off on different datasets and setups. We demonstrate that the (privacy and utility) performance of hierarchical FL sits between local and central DP. Surprisingly, we observe that the utility benefits obtained through our proposal are very close to that obtained by learning a model with central DP.

II Background & Related Work

We introduce different concepts used throughout the paper.

Federated Learning (FL) involves learning with many federated clients in a decentralized manner [konevcny2016federated]. At each federated round, the central aggregator shares its weights with all federated clients, each of which has a private dataset it is unwilling to divulge. At any given point in time, only a fraction of clients may be online; this fraction is termed the user selection probability. Each client performs local training for a single step (FedSGD [shokri2015privacy]) or multiple epochs (FedAvg [mcmahan2017communication]) on its private dataset and shares the update (obtained using stochastic gradient descent) with the central aggregator. The aggregator gathers all updates and performs an aggregation operation on them. Then, the central aggregator updates its weights and shares these updated weights with all federated clients to be used as the starting point for the next federated round. FL is assumed to provide privacy by design (since the data is not shared with any central aggregator). However, recent work has shown that by observing the client updates, an adversary may infer sensitive information about the private training data [melis2019exploiting, geiping2020inverting, boenisch2021curious, wang2019beyond, zhu2019federated, bhowmick2018protection, yin2021see]. Additionally, the convergence of the learning algorithm used by the central aggregator (and overall robustness) is subject to both (a) client availability [li2018federated, rajput2019detox, chen2018draco], and (b) uniformity of the data distributed among the clients [zhao2018federated]. In this work, we do not focus on these two problems, but rather on the privacy problems associated with FL, and how they can be mitigated with differential privacy.
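To make the round structure concrete, the following is a minimal sketch of one FedAvg-style round in plain NumPy; the linear model, squared-error loss, and client data are hypothetical stand-ins, not the models or datasets used later in the paper.

# Minimal sketch of one FedAvg-style round (hypothetical model/data; illustrative only).
import numpy as np

def local_update(weights, data, labels, lr=0.02, epochs=5):
    # Run a few epochs of plain SGD on a linear model and return the weight delta.
    w = weights.copy()
    for _ in range(epochs):
        for x, y in zip(data, labels):
            grad = (x @ w - y) * x      # squared-error gradient for one example
            w -= lr * grad
    return w - weights                  # the client shares only its update

def fedavg_round(global_weights, clients):
    # clients: list of (data, labels) tuples; returns the new global weights.
    sizes = np.array([len(d) for d, _ in clients], dtype=float)
    updates = [local_update(global_weights, d, y) for d, y in clients]
    # Weighted average of client updates, weights proportional to dataset size.
    avg_update = sum(s * u for s, u in zip(sizes / sizes.sum(), updates))
    return global_weights + avg_update

# Toy usage: 4 clients, 20 examples each, 5 features, 3 federated rounds.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(4)]
theta = np.zeros(5)
for _ in range(3):
    theta = fedavg_round(theta, clients)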

Differential Privacy (DP) was proposed by Dwork et al. [dwork2014algorithmic]. Let ε be a positive real number, and M be a randomized algorithm that takes a dataset as input. The algorithm M is said to provide ε-DP if, for all datasets D and D′ that differ on a single element, and all subsets S of the outcomes of running M:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S],

where the probability is over the randomness of the algorithm M. ε is also known as the privacy budget. DP is achieved by adding noise chosen from a specific distribution to provide the aforementioned indistinguishability guarantee [geng2015staircase, balle2018improving].

Local & Central DP: The traditional model defined earlier, also known as the central DP (CDP) model, implicitly assumes the existence of a trusted entity that does not deviate from the protocol specification and adds the calibrated noise to provide the DP guarantee. To alleviate these strong trust assumptions, local DP (LDP) [kasiviswanathan2011can] assumes that each data contributor adds the noise locally, i.e., in-situ. Such a mechanism has limited knowledge of the overall function being computed on all the data, and overestimates the amount of noise required to provide privacy. The relationship between LDP and CDP depends on the mechanism used to achieve DP. For example, with the Laplacian mechanism [dwork2014algorithmic], a mechanism that provides ε-LDP at each client also provides ε-CDP.

Private Machine Learning: In their seminal work, Chaudhuri et al. [chaudhuri2011differentially] perturb the outputs of empirical risk minimization (ERM) mechanisms. They proceed to introduce the notion of objective perturbation, which remains a cornerstone of DP learning. Abadi et al. [abadi2016deep] extend the notion of objective perturbation by proposing a DP variant of stochastic gradient descent, where noise is added to each gradient update calculated. They also provide a tight analysis of the privacy budget (ε) through the moments accountant. We point the curious reader to the work of Jayaraman et al. for more details on how DP learning is implemented in practice [jayaraman2019evaluating, zhao2020privacy-utility]. Orthogonal to the approaches involving DP are those that utilize techniques from cryptography, notably multi-party computation (or MPC) [gilad2016cryptonets, wagh2019securenn, zheng2019helen, mishra2020delphi], to provide data privacy.

FL with DP: McMahan et al. [mcmahan2017learning] propose the first approach where DP can be combined with FL to provide formal privacy guarantees. Similar ideas are proposed in the work of Geyer et al. [geyer2017differentially]. In both these settings, however, the central aggregator is able to observe either noise-free or noisy gradients from the corresponding clients. To break this connection, Bonawitz et al. [bonawitz2016practical, bonawitz2017practical] propose the notion of secure aggregation, a variant of MPC which provides the central aggregator only an aggregated view of all (noisy or non-noisy) gradients from the clients. Truex et al. [truex2019hybrid] combine advances from LDP and MPC (through threshold cryptography schemes) into a hybrid scheme that provides better privacy. Truex et al. [truex2020ldp] also propose an approach using LDP. However, they formulate this based on an alternative DP definition.

Hierarchical FL (HFL): Abad et al. [abad2020hierarchical] propose a mechanism to ensure communication-efficient and coordinated learning in the context of HFL. Here, the notion of hierarchies stems from the presence of clients communicating with small base stations (or cellular towers) which act as intermediaries, and which further communicate with macro base stations (or the central aggregator). Similarly, Yuan et al. [yuan2020hierarchical] also propose a new protocol to optimize for communication efficiency in the LAN-WAN setting. This form of HFL is significantly different from that proposed by Briggs et al. [briggs2020federated], which aims to segregate clusters of similar clients that can be independently trained on heterogeneous models. Across these prior works, the actors are consistent: there are the federated clients as in the status quo. However, they interact with intermediary entities (such as base stations in the work of Abad et al. [abad2020hierarchical]), and these intermediary entities consolidate information (at the behest of a subset of the clients) for the central aggregator(s). It is important to note that prior work focuses on improving the scalability/communication efficiency of FL through the introduction of hierarchies, whereas our contribution is to study the implications for privacy.

Fig. 1: An architectural overview of the approach. Different scenarios can be assumed to be present: 1) Clients in a zone that trust the central aggregator to add noise (CDP). 2) Clients in a zone that trust their intermediary super-node for noise addition (HDP). 3) Clients in a zone that do not trust anybody else and therefore add noise themselves (LDP). 4) Clients in a zone that may or may not trust their intermediary super-node and choose (or not) to add noise (LDP+HDP). For pure HDP, locally learnt gradients (from level 0) are perturbed with noise to provide formal DP guarantees at level 1. This ensures that the gradients viewed by the central aggregator (at level 2) are less noisy, potentially resulting in a more performant model. Adversarial nodes can be present either at level 0 (clients) or level 1 (super-node).

III Approach Overview

Recall from § II that FL requires a set of online federated clients per round to collaborate with the central aggregator to jointly learn a model. Thus, in the status quo, there are a total of two types of actors, at two conceptual levels of a hierarchy, i.e., client(s) at level 0 and the server at level 1. In our proposal, we assume the existence of three main types of actors, at three different (conceptual) levels of the hierarchy (as noted in Figure 1).

Explanation of Figure 1: At level 0 (i.e., the lowest level), we have the federated clients who hold private data, as in the status quo. At level 2 (i.e., the highest level), we have the central aggregator, again as in the status quo. The key difference lies in the middle: at level 1 (i.e., the intermediate level), we introduce a new entity called the super-node. The super-node is responsible for processing requests from the online clients in a particular region (or zone). The presence of level 1 introduces a hierarchical approach to performing FL. An added feature in our proposal revolves around the selection of the super-node: it can either (a) be elected from a pool of its peers, and thus lie within the same trust boundary/region as its peers, or (b) be chosen as an entity in a trust region different from both the clients and the central aggregator, but in geographic proximity to the clients chosen in the federated round. Note that there can be many intermediary layers in practice, but for simplicity, we stick to one through the remainder of the paper.

Impact on Privacy: In the status quo, formal privacy guarantees in the FL setting are provided through the usage of DP. In turn, this is obtained through two mechanisms: noise addition at the central aggregator (i.e., CDP), or noise addition locally at each federated client (i.e., LDP). Empirical evidence suggests that the CDP mechanism will result in a final model with higher accuracy. However, it makes a crucial assumption: the central aggregator (where the noise is added) is trustworthy. The LDP mechanism removes this trust assumption at the expense of greater noise addition at each client. To provide the best of both worlds, we propose hierarchical DP (or HDP). Based on the construction described above, clients aggregate their (zonal-level) updates through the super-nodes, where calibrated noise is added, to provide the desired DP guarantee. Through the remainder of the paper, we will describe modifications made to achieve HDP.

1:  Parameters:
    a. user selection probability q
    b. per-user example cap ŵ
    c. noise scale z
    d. UserUpdate (for FedAvg or FedSGD)
    e. ClipFn (FlatClip or PerLayerClip)
    f. (remaining parameters as in [mcmahan2017learning])
2:  Procedure:
3:  Initialize model θ^0
4:  w_k ← min(n_k/ŵ, 1) for all users k
5:  for each round t = 0, 1, 2, … do
6:     for each zone j do
7:        C_j^t ← (sample users in zone j with probability q)
8:        W_j ← Σ_{k ∈ C_j^t} w_k
9:        for each user k ∈ C_j^t in parallel do
10:          Δ_k^{t+1} ← UserUpdate(k, θ^t, ClipFn)
11:       end for
12:       Δ_j^{t+1} ← Σ_{k ∈ C_j^t} w_k Δ_k^{t+1} / (q W_j)
13:       S ← (bound on ‖Δ_k^{t+1}‖ for ClipFn)
14:       σ_j ← z S / (q W_j)
15:       θ^{t+1} ← θ^{t+1} + Δ_j^{t+1} + N(0, I σ_j²)   (with θ^{t+1} initialized to θ^t)
16:    end for
17: end for
Algorithm 1: Modified FL mechanism with zonal privacy. Detailed information about the parameters and sub-procedures is provided in [mcmahan2017learning].

Algorithm for Hierarchical FL with DP: We summarize our approach in Algorithm 1, which is heavily based on the approach taken by McMahan et al. [mcmahan2017learning], with some key modifications to incorporate hierarchical FL for zones. From line 6, observe that gradient aggregation first occurs at a zonal level. This in turn leads to a re-calibration of the various parameters (lines 7-14) required to provide the DP guarantee (for more details on the various forms of clipping that can be employed, refer to the original work [mcmahan2017learning]). Finally, the noised zonal gradients are averaged and aggregated in line 15. Note that in scenarios where uniform weighting is applied (i.e., the contribution of each client is the same), W_j is simply the number of online clients in the zone. Additionally, the standard deviation of the noise to be added (σ_j in line 14) is calculated as a function of the online clients per zone (as noted by the denominator term q·W_j), and not the total fraction of clients across all zones, as it is re-calibrated based on the selected clients per round per zone.

The proposed approach calculates the sensitivity based on the clipping bound that is calculated across all clients across all zones; obtaining this information in practice would require additional communication between the super-nodes. Thus, this is approximated through the existence of a global clipping bound S such that ‖Δ_k‖ ≤ S for every client k.
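As a concrete illustration of lines 7-15, the sketch below (NumPy; the parameter values and array shapes are illustrative assumptions, not the paper's configuration) clips each client update to a global bound, averages within a zone, and adds Gaussian noise whose standard deviation scales with the clipping bound divided by the number of online clients in that zone; the central aggregator then combines the already-noised zonal contributions.

# Sketch of the zonal (HDP) noise-addition step; values are illustrative.
import numpy as np

def clip(update, clip_norm):
    # FlatClip-style clipping: project the update into an L2 ball of radius clip_norm.
    norm = np.linalg.norm(update)
    return update if norm == 0 else update * min(1.0, clip_norm / norm)

def zonal_aggregate(zone_updates, clip_norm=1.0, noise_scale=1.0, rng=None):
    # Average the clipped updates of one zone and add calibrated Gaussian noise.
    rng = rng or np.random.default_rng()
    m = len(zone_updates)
    avg = sum(clip(u, clip_norm) for u in zone_updates) / m
    # The per-zone average has sensitivity clip_norm / m, so the noise std follows
    # sigma = noise_scale * clip_norm / m (cf. line 14 of Algorithm 1).
    sigma = noise_scale * clip_norm / m
    return avg + rng.normal(scale=sigma, size=avg.shape)

def hierarchical_round(zones, **kwargs):
    # zones: list of zones, each a list of client update vectors (NumPy arrays).
    zonal = [zonal_aggregate(z, **kwargs) for z in zones]
    # The central aggregator combines the (already noised) zonal averages.
    return sum(zonal) / len(zonal)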

Desiderata: HDP should have the following properties:

  1. Trust assumptions: Obtaining HDP requires selecting super-nodes, and how these are selected determines the trust assumptions needed. Ideally, we should not add any unreasonable trust assumptions (refer § IV-A).

  2. Algorithmic correctness: In the absence of noise addition, the models learnt using HFL must be the same as in the status quo (refer § IV-B).

  3. Utility: HDP should neither significantly degrade the privacy provided by LDP, nor the utility benefits provided by CDP. In particular, HDP should provide advantageous privacy vs. utility trade-offs (refer § IV-C).

III-A Threat Model

We assume that a minority of the online clients are adversarial. The adversaries in our setup are honest-but-curious, i.e., they do not deviate from protocol specifications. This is the threat model commonly followed in most prior work [melis2019exploiting, geiping2020inverting, zhu2020deep, wang2019beyond, yin2021see]. Their sole goal is to infer information about the training data of a target client (or group of clients). The only knowledge these adversaries are privy to is the gradient updates shared during training. In particular, the adversaries we consider are able to view:

  1. any (gradient) updates they generate.

  2. the joint update shared from the central aggregator.

  3. the update generated at the intermediary level (i.e., at the super-node), iff the adversary is located at said level.

Recent work [boenisch2021curious, pasquini2021eluding] formalizes active adversaries against the FL ecosystem whose aim is to subvert different parts of the protocol. In both cases, effective noise addition can minimize attacker efficacy.

Adversarial Goals. We now describe the two attacks that we consider in our work. These attacks enable the adversary to: (a) identify if a particular data-point (or a particular attribute within a data-point) was used during training (a dataset comprises many data-points, each of which comprises numerous attributes), or (b) try to reconstruct the data used for training by another client by observing the gradients shared. These adversaries can utilize: (a) data inference [melis2019exploiting], where an adversarial client can calculate the update added to the (central) model by calculating the difference between its model state at the previous iteration and the current iteration (received from the central aggregator), and using knowledge of its own update to identify the aggregated update of all other clients. With this knowledge, this client can infer if a particular data-point (or attribute) was used or not (by one of the other clients). This approach can further be extended to the adversary attempting to obtain the value of a particular attribute used. These adversaries can also use (b) data reconstruction [geiping2020inverting, yin2021see, zhu2020deep], where adversaries solve a modified optimization problem that allows them to reconstruct an input (used to obtain a particular gradient). While theoretically sound for a batch size of 1, empirical results demonstrate unsuccessful reconstruction for larger batch sizes [geiping2020inverting, yin2021see].
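The arithmetic behind the data-inference observation is simple. The toy sketch below (synthetic updates, hypothetical sizes) shows how an adversarial client can isolate the aggregate contribution of all other clients from two consecutive global models and its own update; with DP noise in place, the recovered quantity is correspondingly noisy.

# Toy illustration of the observation behind data-inference attacks: from two
# consecutive global models and its own update, an adversarial client recovers
# the aggregate update of all other clients (exactly, when no DP noise is added).
import numpy as np

n_clients = 10
rng = np.random.default_rng(1)
theta_prev = rng.normal(size=100)                        # global model at round t
updates = [rng.normal(size=100) for _ in range(n_clients)]
theta_new = theta_prev + sum(updates) / n_clients        # global model at round t+1

adversary_update = updates[0]
others_aggregate = (theta_new - theta_prev) * n_clients - adversary_update
assert np.allclose(others_aggregate, sum(updates[1:]))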

Note that the reconstruction attack captures all effects of the inference attack; once the adversary has access to the exact data used for training, it can also infer whether the data-point possesses specific attributes (or not). In our evaluation in § VII, we measure privacy through the privacy expenditure (ε), discuss under what circumstances data reconstruction may succeed, and evaluate hierarchical FL (with DP) under the stronger reconstruction attack. While such an attack is computationally more expensive than the inference attacks proposed in the literature, it is more realistic; inference attacks assume that the adversary knows what it is looking for (i.e., they make strong assumptions about adversarial knowledge), but this is often not the case.

IV Designing Hierarchies

As noted in § III, our primary objective is to design hierarchies to understand if they induce advantageous privacy vs. utility trade-offs. However, the first problem encountered in designing hierarchies is selecting the super-nodes. This is an important problem as the super-node is responsible for adding the noise on behalf of all the federated clients it obtains the updates from. Since we assume an honest-but-curious model, all participants (including the selected super-nodes) do not deviate from the protocol. However, if clients are to share sensitive information (such as the updates produced as part of local learning), then they must trust the super-nodes in order to do so. Below, we describe why such trust may exist, and what these clients can do to amplify their trust (and minimize the exposure of their sensitive information).

IV-A Choosing Super-Nodes

1. Exploiting Inherent Hierarchies: Hierarchies already exist in communication networks in the status quo [petrek2001large]; this information can be used to select super-nodes. For example, hierarchies are introduced by the onset of edge computing, where edge-based base stations [messous2017computation, liu2016paradrop] serve as an intermediary between the federated clients and the central aggregator. Traditionally, the manufacturers of the hardware components placed at the different conceptual levels are different [gueta2019sbft]. This allows us to assume that the federated clients, super-nodes, and central aggregator belong to different trust regions. However, protocols are run in software which is often proprietary to the party learning the model, which in this case is the actor at level 2. In such scenarios, it is unclear if the super-node and the aggregator lie in different trust regions. One could envision that the protocol is shared with a different party, and its execution is verified using techniques from cryptography [setty2012making]; this may alleviate some of the concerns associated with trust.

2. Elections: Another approach is to elect the super-node, i.e., the super-node is one of the federated clients. This, too, ensures that the super-node is from a different trust region in comparison to the central aggregator. First, federated clients are grouped into zones. For example, the grouping can be based on (a) geography, (b) compute capabilities, or (c) some form of structured or unstructured peer-to-peer overlay organization [lua2005survey-p2poverlays]. Then, a distributed election protocol is run to select a super-node for the particular zone. Methods for this problem can be drawn from past literature that studied it in the context of super-peer selection in P2P networks, using different assumptions, metrics, etc. [lo2005super-peer-selection1, mahdy2007super-peer-selection2, li2019super-peer-selection3]. In our setting, such a process assumes that (a) the federated clients are aware of others who are active in that particular zone for the particular round, and (b) they can communicate the outcome of the election between each other. While such assumptions introduce additional overheads in terms of communication, they are not unrealistic, as we will see soon. One key requirement in our setting is that the choice/election of the super-node is randomized, i.e., a particular client in a zone has bounded probability of being elected as the super-node, and each of the clients is equally likely to be elected. The elected super-node is "in power" for the duration it is online (and selected, across multiple learning rounds). Once the super-node goes down, a new one is elected to replace it. However, recall that a small fraction of online clients (who are potential super-nodes) are malicious, and can subvert the election protocol through collusion. To this end, different election protocols and assumptions provide different guarantees. For example, to obtain conventional (crash) fault tolerance in a zone, a majority of the clients in that zone must be honest [castro2002practical]; similarly, to obtain Byzantine fault tolerance, more than two-thirds of the clients in the zone must be honest [castro2002practical].
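One way to realize such a randomized, equal-probability election is to derive the choice deterministically from a seed that all zone members agree on (e.g., via a commit-reveal round or a randomness beacon); the sketch below assumes the seed-agreement step and only shows the deterministic selection that every honest client can reproduce.

# Minimal sketch of a uniformly random super-node election within a zone.
# How the shared seed is agreed upon (commit-reveal, beacon, ...) is assumed, not implemented.
import hashlib

def elect_super_node(online_client_ids, shared_seed, round_number):
    # Every honest client evaluates this deterministically and obtains the same result.
    digest = hashlib.sha256(f"{shared_seed}:{round_number}".encode()).digest()
    index = int.from_bytes(digest, "big") % len(online_client_ids)
    return sorted(online_client_ids)[index]   # sort so all clients index the same list

# Example: a zone with five online clients electing a super-node for round 7.
print(elect_super_node(["c3", "c7", "c1", "c9", "c4"], shared_seed="zone-12", round_number=7))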

Formally, let us assume that there are a total of Z zones and each zone z has N_z federated clients, such that Σ_z N_z = N. All online clients in that particular round elect the super-node, i.e., one of the online clients is elected as the super-node. Note, however, that the elected super-node has a complete view of the gradients from individual clients, and these gradients may not be masked through the addition of the noise needed for DP. To this end, we advocate for the usage of secure aggregation [bonawitz2017practical, bonawitz2016practical] protocols (within zones) to ensure that the super-node only gets an aggregate view of the gradients from individual clients. While recent work suggests that an actively malicious super-node (or central aggregator) can violate privacy guarantees provided by secure aggregation [boenisch2021curious, pasquini2021eluding], this is out of the scope of our threat model (as we will explain in § VII-B, such attacks can also be thwarted if the clients utilize LDP protocols). The work of Bonawitz et al. [bonawitz2016practical] suggests that (a) federated clients are aware of others who are online at the particular round to participate in secure aggregation protocols (similar to what is needed for distributed election protocols), and (b) the communication cost associated with secure aggregation is quadratic in the number of participants; both of these requirements can also be used to enable election protocols (as discussed earlier). In our setting, secure aggregation needs to be applied in all zones, and prior work suggests that secure aggregation protocols are practical (i.e., introduce a tolerable time delay) only for a small number of clients [bonawitz2017practical]. This suggests that care must be taken in ensuring that the number of (online) clients per zone is also small. How this is achieved can be determined by future research.

Desiderata for Super-nodes: We have described procedures to select super-nodes from the federated clients. Now, we describe desirable properties that the super-node should possess:

  1. Compute capable: The super-node should be capable of performing aggregation operations using the gradients it receives from the federated clients.

  2. Synchronization: Federated clients may be online at different times, and their communication with the super-node can be bottlenecked due to various network-related issues [awerbuch1985complexity]. The super-node should be capable of handling such distributed synchronization issues [arvind1994probabilistic, silberschatz1979communication].

  3. Persistence: Elected super-nodes are privy to sensitive client information for the entire process. While this may result in a malicious client getting access to user data, the probability of this event is bounded, and the ill-effects can be minimized if the clients utilize LDP and secure aggregation to mask the shared updates. A naive alternative would be to elect a new super-node for every round. However, this would ensure that all the elected super-nodes are privy to sensitive user information (aggregated and noised, or otherwise).

IV-B Algorithmic Correctness

In § III, we discuss how intermediary super-nodes can be used to add noise on behalf of all online nodes in its corresponding zone. Thus far, we have discussed how these super-nodes are selected. However, despite the presence of the intermediary level, algorithmic correctness must be preserved, i.e., the aggregator has to ensure that the final value that it obtains (and propagates) is the same as that obtained in the baseline (e.g., FedAvg [mcmahan2017communication]) scenario.

To this end, we generalize the construction described earlier as follows: there are N federated clients, and a total of Z zones (each with its own super-node). Let us assume that each zone has an equal number of clients, i.e., N/Z per zone. While this simplifying assumption enables easier calculation, it can be relaxed in practice. The federated clients within each zone clip their gradients (as is commonly done [mcmahan2017learning]), and set the clipping norm to a global constant, i.e., all participants (at any level of the hierarchy) are aware of the clipping threshold. Recall that nodes in a zone utilize secure aggregation to share an aggregated view of their gradients with the super-node. Thus, each of the super-nodes receives an update from the corresponding online clients in its zone. These super-nodes then forward these updates to the central aggregator. In the baseline case, this intermediary level does not exist (and consequently, such intermediary-level forwarding does not exist); noise addition (should the approach utilize CDP) occurs at the central aggregator.

To achieve correctness (i.e., ensure that the protocol is consistent with the baseline), the super-node can (a) apply the noise required for DP directly to the aggregated (summed) gradients from its zone (and averaging by the total number of online clients for that round will occur at the central aggregator), or (b) average the gradients first (by the number of online clients in the zone) before adding noise (we assume that the secure aggregation protocol returns just the sum, and not the average; the protocol can easily be modified to return the average as well). Observe that the sensitivity in case (b) will be lower than that of case (a), because of the division required for averaging. Consequently, we advocate for averaging at the super-node, following which noise is added.
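The sensitivity argument can be checked numerically: swapping one client's clipped contribution moves the zonal sum by up to twice the clipping bound, but moves the zonal average by only a 1/m fraction of that amount (the bound S, the zone size m, and the dimensions below are arbitrary illustrative values).

# Numeric check: changing one clipped client update perturbs the zonal average
# by exactly 1/m of the perturbation to the zonal sum, so case (b) needs less noise.
import numpy as np

S, m = 1.0, 20                                     # clipping bound, clients in the zone
rng = np.random.default_rng(2)

def clip(u, bound=S):
    norm = np.linalg.norm(u)
    return u if norm == 0 else u * min(1.0, bound / norm)

updates = [clip(rng.normal(size=50)) for _ in range(m)]
neighbour = list(updates)
neighbour[0] = clip(rng.normal(size=50))           # swap one client's contribution

diff_sum = np.linalg.norm(sum(updates) - sum(neighbour))
diff_avg = np.linalg.norm(np.mean(updates, axis=0) - np.mean(neighbour, axis=0))
print(diff_sum, diff_avg, diff_sum / m)            # diff_avg equals diff_sum / m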

IV-C Privacy Benefits

Having established that averaging before adding noise at the super-nodes is the right strategy (§ IV-B), we now discuss the benefits of such noise addition for utility. We begin by introducing preliminaries of the Gaussian mechanism. Note that the formalism below is borrowed from the work of Dwork and Roth [dwork2014algorithmic].

1. ℓ_p-sensitivity: The ℓ_p-sensitivity of a function f is:

Δ_p(f) = max ‖f(D) − f(D′)‖_p over all D, D′ ∈ 𝒟 that differ in a single element,

where ‖·‖_p denotes the p-norm and 𝒟 denotes some universe (such as the universe of all databases).

2. Gaussian Mechanism (GM): Let the privacy budget ε ∈ (0, 1). For some constant c² > 2 ln(1.25/δ), the GM with parameter σ ≥ c·Δ₂(f)/ε is (ε, δ)-DP.

In the following lemma, we will describe how there exists a connection between LDP and CDP, if the GM is used to provide DP properties. Note that this lemma is discussed in the absence of an intermediary level, for simplicity in explanation. Extending this to include the presence of super-nodes (or an intermediary level) is trivial.

Lemma: If each of the N (online) federated clients obtains ε-LDP using the GM, the central aggregator obtains ε′-DP, where ε′ = ε/√N.

Proof: Assume the existence of a global sensitivity bound S. Assume the existence of N online clients, each of which utilizes the GM to independently add noise to its gradient g_k (for k ∈ {1, …, N}), to achieve ε-LDP guarantees. Thus, each client k utilizes a constant c_k which satisfies the above condition (in the definition of the GM), and samples noise from a distribution with standard deviation σ_k ≥ c_k·S/ε, i.e., each client returns the following value:

ĝ_k = g_k + N(0, σ_k² I),

where σ_k ≥ c_k·S/ε.

The central aggregator computes ĝ = Σ_k w_k·ĝ_k (e.g., as required by FedAvg [konevcny2016federated]). Note that w_k·ĝ_k is the gradient calculated by client k multiplied by the size of client k's dataset, and then divided by the total dataset size of all online clients; with equally sized datasets, w_k = 1/N.

We know that the sum of two independent Gaussian variables is Gaussian. Extending this, we know that Σ_k w_k·N(0, σ_k² I) = N(0, σ̄² I), since the variances of independent Gaussians add (where σ̄² = Σ_k w_k² σ_k²). We also know that, for weights w_k = 1/N, the sensitivity of the aggregate ĝ with respect to any single client's data is S/N.

Setting σ_k = σ and c_k = c for all k, we observe that σ̄ = σ/√N, so the ratio of sensitivity to noise at the aggregator is (S/N)/(σ/√N) = (1/√N)·(S/σ); hence the central aggregator is (ε/√N)-DP if each client is ε-LDP.
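A quick numeric sanity check of the lemma (illustrative values; not part of our experimental setup): the average of N independently Gaussian-noised contributions carries noise of standard deviation σ/√N, while the sensitivity of the average with respect to any single client shrinks by 1/N, which together yield the 1/√N reduction in the effective central budget.

# Empirical check of the amplification in the lemma.
import numpy as np

N, sigma = 50, 2.0
rng = np.random.default_rng(3)

# Noise carried by the average of N independently noised values.
noise_of_average = rng.normal(scale=sigma, size=(100_000, N)).mean(axis=1)
print(round(noise_of_average.std(), 3), round(sigma / np.sqrt(N), 3))   # both ~0.283

eps_local = 1.0
eps_central = eps_local / np.sqrt(N)   # per the lemma (GM noise, uniform weights)
print(round(eps_central, 3))           # ~0.141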

Privacy Amplification: We can observe that by utilizing the GM for LDP as described above, the central aggregator is able to obtain a more private aggregation of the individual gradients than if it were to use the same parameters (ε, δ) directly. Such a result is commonly known as privacy amplification in the DP literature, and in this case is a natural consequence of using the GM to provide LDP guarantees. Amplification is a technique where one amplifies a weak secret into a strong one, i.e., the provided DP guarantee is stronger, with comparably less noise added. Amplification is highly beneficial for utility (since less noise is added, yet strong privacy is obtained). Prior works observe such amplification effects through subsampling [balle2018privacy] (as used by the accountant to calculate tight bounds for the privacy expenditure of algorithms like DP-SGD), and more recently through approaches such as random check-ins [balle2020privacy].

Implications of Amplification: The aforementioned lemma can be generalized to multiple levels of a hierarchy. If level 0 utilizes the GM to obtain ε-LDP, then the intermediary level, i.e., level 1 (which aggregates the responses from m₀ level-0 clients), obtains (ε/√m₀)-DP. Level 2, which aggregates the responses from m₁ level-1 nodes, obtains (ε/√(m₀·m₁))-DP, and so on, all the way up to level L (which aggregates the responses from the nodes at level L−1), which obtains (ε/√(m₀·m₁·…·m_{L−1}))-DP. Observe that the privacy (a) is dependent on the lowest level where noise is added (which in this case is level 0), and (b) becomes better as the number of levels increases (assuming m_l > 1 for all l). In § VII, we study the implications of such a mechanism on the utility of the learned classifier.

Amplification from Shuffling [bittau2017prochlo, erlingsson2019amplification]: In the status quo (i.e., no hierarchies), the central aggregator is able to directly see which client shares a gradient (noisy or otherwise) with it. Shuffling is a protocol where this view is broken; the central aggregator no longer knows which client shares which gradient. Shuffling is often achieved through a shuffler, which can be instantiated by a mixnet, for example. Feldman et al. [feldman2022hiding] analyze the shuffling protocol, and quantify the privacy amplification it induces. Recall that the clients utilize a secure aggregation protocol (which utilizes some notion of shuffling) to share an aggregated gradient with the super-node (ergo, ensuring that the super-node does not have a direct view of the gradients of individual clients). If the clients were to utilize an LDP protocol (for reasons we will explain in § V), then amplification by shuffling would also apply (and averaging post aggregation is a post-processing step). For clients that utilize an LDP protocol before shuffling, the authors argue that this induces an additional amplification factor, resulting in a combination of both shuffling-induced amplification and the amplification induced due to the sum of GMs.

Note: Feldman et al. [feldman2022hiding] prove their amplification result for a general DP mechanism. However, the amplification result we show in the lemma is a consequence of using the GM for DP. If the participants decide to use a different mechanism, then the lemma does not hold.

Consequence on Hierarchical FL: Observe that the amplification result allows for stronger privacy (measured at the central aggregator) when noise is added at a lower level of the hierarchy (be it clients, super-nodes, or both). Note that the stage where the noise is added dictates the denominator term in the privacy budget (assuming amplification induced only by the sum of GMs): √N if noise is added at each of the N clients, and √Z if noise is added at each of the Z super-nodes. Since Z < N, the privacy budget when the noise is added at the super-nodes is larger than when it is added at the client level. Intuitively, this suggests that a classifier learnt using HDP is going to be more utilitarian than one learnt with LDP (as the amount of noise added is lower). In § VII, we empirically validate this claim.

V Generalization

Config | Aggregator adds noise | Super-node adds noise | Client adds noise | Privacy (at aggregator)
C1 | – | – | ✓ | ε/√N
C2 | – | ✓ | – | ε/√Z
C3 | – | ✓ | ✓ | composed (see text)
C4 | ✓ | – | – | ε
C5 | ✓ | – | ✓ | composed (see text)
C6 | ✓ | ✓ | – | composed (see text)
C7 | ✓ | ✓ | ✓ | composed (see text)
TABLE I: Different configurations that are possible with our proposed hierarchical scheme. We explain when these cases are realizable in § V.

In this section, we provide a generalization of the technique discussed thus far, with the corresponding privacy budgets. Assume a total of Z zones (as defined earlier) such that each zone z has N_z clients (and Σ_z N_z = N). To make the discussion easier, let us assume N_z = N/Z (for all z), and that the total number of online clients per round is N.

C1 corresponds to the setting of pure LDP, when all clients add the required noise to obtain LDP. From the central aggregator's perspective, the overall privacy is ε/√N. This is visualized in Figure 1, towards the far right (in the absence of secure aggregation).

C2 corresponds to the setting of pure HDP, when all super-nodes add the required noise to obtain HDP. From the central aggregator's perspective, the overall privacy is ε/√Z. This is visualized in Figure 1, in both zones 2 and 3 (in the absence of secure aggregation).

C3 is realizable when a fraction of all online clients choose to add the required noise for LDP, but the remainder do not. To ensure that the remaining clients receive privacy, the super-nodes corresponding to these clients need to add noise to ensure DP. To simplify the math, let us assume that the clients corresponding to k super-nodes choose to utilize LDP and the remaining Z − k super-nodes provide HDP. If each zone has N/Z clients, then at each of the k super-nodes, we observe (ε/√(N/Z))-DP. From the central aggregator's perspective, the total privacy is the sum of two terms, where the first term is the contribution of the zones whose clients achieve LDP, and the second term is the contribution of the Z − k super-nodes that provide HDP.

C4 corresponds to the setting of pure CDP when only the central aggregator adds noise to obtain CDP. This is visualized in Figure 1, towards the far left.

C5 corresponds to the setting where a fraction of the clients do not trust the central aggregator and consequently wish to achieve LDP. To provide privacy to the remainder, the central aggregator also adds noise. Thus, the total privacy budget from the central aggregator's perspective is the sum of two terms, where the first term is due to the fraction of clients which achieves LDP, and the second is due to the noise added centrally.

C6 is similar to C5, in that only a fraction of super-nodes add noise to achieve HDP. To provide privacy to the remainder, the central aggregator also adds noise. Thus, the total privacy budget from the central aggregator's perspective is the sum of two terms, where the first term is due to the fraction of super-nodes which achieves HDP, and the second is due to the noise added centrally.

C7 corresponds to the scenario where (a) clients in some zones do not trust any entity and wish to achieve LDP, (b) clients in some other zones trust their super-nodes, and thus those super-nodes achieve HDP, and (c) for the clients in the remaining zones, the central aggregator is responsible for providing privacy. Thus, from the central aggregator's perspective, the total privacy is a sum of terms, where the first term is due to (a), the second term is due to (b), and the remaining contribution is due to the noise added centrally for (c).

Note: All the calculations above are in the absence of the amplification due to secure aggregation. Should it be considered, the only change would be a removal of the square root over all terms.
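Under the Gaussian-mechanism assumption of § IV-C (and, as in the note above, ignoring secure-aggregation amplification), the budgets of the unmixed configurations can be computed with a small helper; the mixed configurations compose the corresponding terms as described above. The numeric values below are illustrative.

# Central-level privacy budgets for the unmixed configurations (GM assumption,
# no secure-aggregation amplification), following the lemma of Section IV-C.
import math

def central_epsilon(eps, num_noise_sources):
    # eps: budget at each entity that adds GM noise; returns the budget at the aggregator.
    return eps / math.sqrt(num_noise_sources)

eps, N, Z = 1.0, 1000, 10          # illustrative values
print("C1 (pure LDP):", central_epsilon(eps, N))   # eps / sqrt(N)
print("C2 (pure HDP):", central_epsilon(eps, Z))   # eps / sqrt(Z)
print("C4 (pure CDP):", eps)                       # noise added once, centrally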

VI Implementation

We implement our proposed approach using TensorFlow. In particular, we use a combination of tensorflow-federated v0.16.1 to provide the components required for FL, and tensorflow-privacy to provide the machinery required for private learning (https://github.com/tensorflow/federated/tree/v0.16.1 and https://github.com/tensorflow/privacy). Such a framework allows us to calibrate the clipping norm and the noise multiplier [abadi2016deep]; modifications to either have implications for both accuracy and privacy. To ensure correct accounting, we modify the accounting libraries in tensorflow-privacy to accurately reflect the sampling probability and number of iterations (in this case, rounds).
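For reference, the sampling probability and round count can be fed to the RDP accountant shipped with tensorflow-privacy; the sketch below uses the module path and function names from the releases contemporary with tensorflow-federated v0.16.x (newer versions expose a different accounting API), and the numeric values are illustrative rather than our exact configuration.

# Sketch of epsilon accounting with the RDP accountant from tensorflow-privacy.
from tensorflow_privacy.privacy.analysis.rdp_accountant import compute_rdp, get_privacy_spent

q = 100 / 3383           # per-round user sampling probability (illustrative)
noise_multiplier = 1.0   # the noise scale z
rounds = 200             # number of federated rounds
delta = 1e-5

orders = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))
rdp = compute_rdp(q=q, noise_multiplier=noise_multiplier, steps=rounds, orders=orders)
eps, _, _ = get_privacy_spent(orders, rdp, target_delta=delta)
print(f"epsilon spent after {rounds} rounds: {eps:.2f}")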

To evaluate the efficacy of our approach, we consider the datasets listed in Table II; only the EMNIST dataset is modified to exhibit the non-i.i.d. property that is commonly associated with FL. Note that the objective of our evaluation is to understand the advantageous utility vs. privacy trade-offs introduced by our scheme. To this end, the degree to which the data distribution across clients is non-i.i.d. is not an important confounding factor. The evaluation we will discuss can be considered an average-case evaluation of the proposed hierarchical DP approach.

Dataset | # Data-points | # Classes | # Clients
EMNIST [cohen2017emnist] | 382705 | 10 | 3383
CIFAR-10 [cifar10] | 60000 | 10 | 500
CIFAR-100 [cifar100] | 60000 | 100 | 500
TABLE II: Dataset characteristics.

The datasets follow a standard 80:20 split (80% of the data is used for training, and the remainder is used for validation). All our experiments were executed on a server with 2 NVIDIA Titan XP GPUs, 128 GB RAM, and 48 CPU cores running Ubuntu 20.04.2. Due to computational constraints, and issues in tensorflow-federated related to GPU execution (https://github.com/tensorflow/federated/issues/832), our experimental setup is conservative; we only perform a single run of each configuration and are unable to report error bars. However, we replicate the ecosystem chosen in prior work [mcmahan2017learning].

For all experiments (unless explicitly specified otherwise), we choose a setup where a fixed number of users is randomly sampled per federated round (this amounts to different values of the sampling/user-selection probability for different datasets). FedAvg [mcmahan2017communication] is used as the algorithm of choice, and each client performs local training for 5 epochs. FL is performed for 200 rounds. All participating clients use SGD as the learning algorithm, with a learning rate of 0.02 (this includes clients that are super-nodes). The server's learning rate is set to 1. These parameters were chosen based on the prescribed guidelines from the authors of tensorflow-federated and from prior work [mcmahan2017communication, mcmahan2017learning].
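For reference, the configuration just described amounts to the following values; the per-round client count is left as a placeholder since, as noted above, it is dataset-dependent.

# Training configuration used for the experiments (values as described above).
CONFIG = {
    "clients_per_round": None,    # placeholder: dataset-dependent, see text
    "federated_rounds": 200,
    "local_epochs": 5,
    "aggregation": "FedAvg",
    "client_optimizer": "SGD",
    "client_learning_rate": 0.02,
    "server_learning_rate": 1.0,
}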

We reiterate that the objective of our evaluation is centered around demonstrating the benefits of a hierarchical approach to learning in terms of its impact on the privacy vs. utility trade-off. To this end, we choose representative architectures from prior work for the datasets we consider [mcmahan2017communication, mcmahan2017learning]. In particular, we consider (a) a shallow DNN with one hidden layer for EMNIST, and (b) 2 convolution layers followed by 2 fully connected layers for both CIFAR-10 and CIFAR-100. We will release all code used for our experiments on request.

VII Evaluation

Having described our evaluation setup in § VI, we now focus on the results. Through our evaluation, we wish to answer the following questions:

  1. Does the proposed hierarchical scheme provide advantageous privacy vs. utility trade-offs in comparison to the baseline scenarios of using (a) LDP and/or (b) CDP? (Recall that LDP provides more privacy, while CDP is known to provide a more utilitarian model [DBLP:journals/corr/abs-2102-05975].)

  2. Does the proposed hierarchical scheme create a new attack surface for an adversary wishing to perform data (or membership) inference?

Recall from § IV-C that such a distributed learning framework (with the GM for DP) leads to privacy amplification (i.e., the privacy observed at higher levels of the hierarchy differs from that observed at lower levels). Thus, to ensure a fair comparison, we only report the privacy budget (ε) calculated at the central aggregator level in our results. Our results suggest that:

  1. The theoretical intuition is corroborated by empirical measurements of utility. As expected, the proposed approach provides an advantageous trade-off between utility and privacy, and outperforms the LDP baseline in scenarios where the number of super-nodes (Z) is smaller than the number of online clients (N).

  2. Hierarchical FL is also more resilient to data reconstruction attacks [zhu2020deep]. In the scenario where the super-node is adversarial, it is able to perform perfect reconstruction (under some assumptions about the batch size of data used for client-side learning, and/or in the absence of secure aggregation). However, if a client is adversarial, we observe that reconstruction efforts are less successful than in the scenario with central DP.

VII-A Privacy vs. Utility Trade-Off

We first train without DP to get an estimate of the ℓ₂ norm of the gradients. This helps us estimate the clipping norm (S) to be used during DP training. We also observe the training duration (i.e., the exact epoch number) at which the validation accuracy saturates. Once this is obtained, we configure the noise multiplier (z) so as to enable DP training, and log the privacy expenditure (ε) at the end of training. Note that our training results, configurations, datasets and models are similar to those of prior work in this area [mcmahan2017learning].
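One simple way to carry out this estimation (an assumption on our part, not necessarily the exact heuristic used) is to log the update norms from the non-private run and pick a high percentile as the clipping bound:

# Sketch: choose the clipping norm S from update norms observed in a non-private run
# (synthetic norm values are used here purely for illustration).
import numpy as np

observed_update_norms = np.abs(np.random.default_rng(4).normal(loc=0.8, scale=0.2, size=500))
clip_norm = float(np.percentile(observed_update_norms, 90))   # e.g., the 90th percentile
print(f"chosen clipping norm S = {clip_norm:.2f}")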

Fig. 2: We plot the validation accuracy of training with differential privacy for 3 scenarios: (i) central DP or CDP (in orange), (ii) local DP or LDP (in green), and (iii) our proposed hierarchical DP or HDP (in red), in comparison to training without privacy (in blue). This is done across the three datasets: (a) EMNIST, (b) CIFAR-10, and (c) CIFAR-100. Observe that HDP achieves better utility than the LDP scenario, and is often close to the CDP case across datasets.

Privacy vs. Utility: Observe that different strategies of applying DP noise (i.e., central vs. local vs. hierarchical) result in differing values of privacy expenditure. To ensure a fair comparison of privacy vs. utility, we convert the privacy expenditure in all settings to the equivalent central DP privacy expenditure (using the formulation presented in § IV-C). Note that the values we report are the ones obtained without the amplification due to shuffling; secure aggregation is known to be communication inefficient [bonawitz2016practical, bonawitz2017practical], and we wish to provide insight on the privacy vs. utility trade-offs in its absence as an average-case estimate. We then plot the validation accuracy as a function of training duration for all the datasets we consider. We also consider a scenario without any form of DP training enabled; this is our baseline. Note that for the hierarchical DP setting, we consider a scenario with Z super-nodes (Z < N). The results are plotted in Figure 2.

Observe that learning simple tasks such as EMNIST (refer Figure 2(a)) with DP is achievable in all three settings; prior work has demonstrated that learning with privacy for EMNIST can be as performant as learning without privacy [mcmahan2017learning], i.e., the accuracy degradation induced by LDP in comparison to CDP is minimal. However, as expected, HDP provides advantageous trade-offs in terms of privacy (i.e., the privacy budget is lower than CDP for comparable accuracy).

The results are more interesting for complex datasets such as CIFAR-10 and CIFAR-100. First, observe that for the particular configuration we choose, the baseline accuracies for the two CIFAR versions are comparable to those achieved by McMahan et al. (refer Fig. 4 in [mcmahan2017learning]). Training with CDP degrades this accuracy further. However, as expected, HDP is able to provide advantageous trade-offs in terms of privacy and accuracy. Note that CIFAR-100 is naively a more complex learning task than CIFAR-10, and yet (a) HDP achieves similar utility to CDP in this setting, and (b) much higher utility than LDP.

Finally, Table III contains corresponding values of the privacy budget () achieved at the end of training. Observe that the privacy budget calculated at the end of HDP training is in-between that of the CDP and LDP case. In fact, with nearly a 67% decrease in privacy expenditure, HDP is able to achieve nearly the same accuracy as the CDP case across all 3 datasets we consider.

Dataset LDP HDP CDP
EMNIST 0.30 0.96 3.06
CIFAR-10 2.48 7.48 24.80
CIFAR-100 2.48 7.48 24.80
TABLE III: Privacy expenditure (ε) across datasets and DP methods. Note that the values for CIFAR-10 and CIFAR-100 are the same because the sampling probability (q) and noise multiplier (z) are the same in both settings, and these datasets have the same number of data-points and are trained for the same duration.

The Influence of Z: From our analysis, we can see that increasing the number of super-nodes (Z) makes the scheme more private (the privacy budget is inversely proportional to the square root of the number of parties where the noise is being added; in the case of HDP, this is √Z). To better understand this hyperparameter, we consider an experimental setting where we vary the number of online clients per round (N) across 100, 200, and 300. Across these 3 settings, we vary the value of Z across all datasets. All other hyperparameters were kept the same as in the earlier experiment. We then measure the validation accuracy, and vary the configurations of N and Z to obtain the privacy expenditure. We plot the relationship between the fully trained model's validation accuracy and the privacy budget expended to achieve it in Figure 3. Across all datasets we observe a common trend: the validation accuracy increases as the privacy expenditure does. For the datasets we consider, increasing the value of N does not increase the validation accuracy substantially. This is not indicative of a more general trend; one would assume that increasing the number of participants would result in a more accurate model.

Take-away: The proposed hierarchical approach provides advantageous privacy vs. utility trade-offs in comparison to both CDP- and LDP-based approaches. Additionally, increasing the value of Z provides better privacy, which in turn leads to lower utility.

Fig. 3: We plot the validation accuracy as a function of privacy (ε), obtained by varying the value of Z, for three settings of the number of online clients N (in blue, orange, and green, respectively) using HDP, across the three datasets: (a) EMNIST, (b) CIFAR-10, and (c) CIFAR-100. Observe that the utility obtained by HDP improves with increasing privacy budget ε.

VII-B Attacks on Federated Learning

We wish to understand if the proposed hierarchical scheme introduces a new attack surface: through the introduction of adversarial entities at the super-node level. In this subsection, we will discuss (a) the capability of attackers in the status quo, (b) if the success of attacks in the status quo increases in the new hierarchical scheme, and (c) if any new attacks are possible due to the introduction of hierarchies.

Dataset LDP HDP CDP No DP
EMNIST 0.571 0.599 0.62 0.68
CIFAR-10 0.423 0.452 0.494 0.596
CIFAR-100 0.447 0.46 0.474 0.564
TABLE IV: Efficacy of data reconstruction: Observe that reconstruction is least effective when LDP is used for DP training; HDP provides the next best resilience.

Level 0 Adversaries: Here, we consider scenarios where the adversaries are at level 0, and aim to perform data reconstruction (of a particular local client). Recall that in FL, the central aggregator adds the aggregated (client- or super-node-calculated) gradients to its own weights before propagating new updates for the next round. Based on the generalization proposed in § V, one can observe that the noise addition required to provide DP can occur at (one or all of) three levels: (a) the clients themselves add DP noise (i.e., a pure LDP scheme), (b) the central aggregator adds noise (i.e., a pure CDP scheme), and (c) the super-nodes add noise (i.e., an HDP scheme). Regardless of which level adds the noise, we assume that noise addition is performed and a malicious client (at level 0) receives the noisy update from the central aggregator, and wishes to use this information to enable data reconstruction. To do so, the client is able to subtract its contribution from the aggregated weight update shared, and run reconstruction attacks using the remaining information (the strategy proposed by Melis et al. [melis2019exploiting]). To perform reconstruction, we implement the attack proposed by Zhu et al. [zhu2020deep], using the source code provided by the authors, for the datasets and models considered in the earlier section. We measure the reconstruction capabilities using the Learned Perceptual Image Patch Similarity (LPIPS) metric [zhang2018unreasonable] (larger values are more realistic). As prior work notes, such reconstruction attacks are largely ineffective when large batch sizes are used for local learning.

To simplify the setup, and highlight the merit of our scheme, we consider a simple setup with a small number of clients in a single zone, where one of these clients is adversarial and wishes to learn the data of another client, using a small batch size. The results are presented in Table IV. Observe that the HDP scheme provides better resilience than CDP, but is worse than the LDP scheme. This is explained by the privacy guarantees provided by HDP (which, again, lie between those of LDP and CDP).

Level 1 adversaries: Privacy attacks at the super-node level arise because the adversary has complete purview of the gradients from each client in its zone. While we advocate for the use of secure aggregation to alleviate this issue, we discuss the outcomes if such a protocol cannot be deployed. In such situations, as depicted in Figure 1, the super-node has direct access to gradients from individual clients and can mount data reconstruction attacks (or determine membership). We describe what may happen in such scenarios:

  • In the pure LDP setting (i.e., case 1 from Table I), federated clients add noise to the gradients shared upstream. Thus, the super-node adversary has a noisy view of each of the per-client gradients it receives. In such settings, attacks such as the one by Melis et al. [melis2019exploiting] are rendered ineffective (as this attack requires exact gradient knowledge), and data reconstruction attacks become less effective, as these attacks rely on correct gradient information for reconstruction (refer to Table IV); a minimal sketch of this client-level noise addition appears after this list.

  • In scenarios where federated clients do not add noise (i.e., case 2 from Table I), and the super-node is required to perform noise addition (as in the case of HDP), data reconstruction attacks are possible if the super-node is malicious, as it has direct purview of the gradients from the local clients. However, if secure aggregation is utilized, the malicious super-node is provided only an aggregated gradient; the efficacy of these attacks is then reduced [geiping2020inverting, yin2021see], i.e., the reconstruction is not of high fidelity (refer to the No DP column in Table IV).
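As referenced in the first bullet, the following is a minimal sketch of the client-level noise addition assumed there: the update is clipped to bound its sensitivity, and Gaussian noise is added before anything is shared upstream. The clipping norm and noise multiplier are placeholders, not the values used in our experiments.

```python
# Minimal sketch of client-level (LDP-style) noise addition; illustrative only.
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip the flattened update to `clip_norm` (L2) and add Gaussian noise."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

# Any upstream observer (super-node or central aggregator) now only sees the
# noisy update, which is what degrades gradient-matching reconstruction attacks.
noisy_update = privatize_update(np.random.randn(1000))
```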

To make level 1 adversaries less effective, (a) either client-level noise needs to be added, or (b) an aggregated gradient (via secure aggregation) needs to be shared with the super-node. Local DP (i.e., client-level noise addition) is computationally efficient, and its impact on utility is well understood (i.e., it induces an unfavorable utility vs. privacy trade-off). Protocols like secure aggregation, on the other hand, do not harm the utility of the model (i.e., the utility of the model is the same with and without secure aggregation). However, secure aggregation greatly increases the communication cost associated with the protocol; each client needs to communicate with all other clients (for each aggregation, in the worst case), leading to communication complexity that is quadratic in the number of clients. We measured the time required to perform secure aggregation using the tff.learning.secure_aggregator module available as part of tensorflow-federated, for the models described earlier. Per round, we needed between 2 and 30 seconds (the time increases as a function of the model size). These numbers are consistent with those reported in prior work [bonawitz2016practical, bonawitz2017practical].
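To illustrate where the quadratic cost comes from, the toy sketch below implements additive pairwise masking over plain NumPy arrays. It is not the tensorflow-federated implementation (which follows Bonawitz et al. and additionally handles key agreement, finite-field arithmetic, and dropouts); it only shows that masks shared by every client pair cancel in the sum, so the aggregator learns the exact total without seeing any individual update.

```python
# Toy additive-masking secure aggregation; illustrative only.
import numpy as np

def secure_aggregate(client_updates, seed=0):
    """Return the sum of client updates without exposing any single update.

    Each unordered pair (i, j) shares a random mask: client i adds it,
    client j subtracts it, so all masks cancel in the sum. The number of
    shared masks is C*(C-1)/2, i.e., quadratic in the number of clients.
    """
    rng = np.random.default_rng(seed)
    masked = [u.astype(np.float64) for u in client_updates]
    for i in range(len(masked)):
        for j in range(i + 1, len(masked)):
            r = rng.normal(size=masked[i].shape)
            masked[i] = masked[i] + r
            masked[j] = masked[j] - r
    # The aggregator only ever sees `masked`; each entry is hidden behind
    # random masks it does not know, yet the sum equals the true total.
    return sum(masked)

updates = [np.ones(4) * k for k in range(1, 4)]  # three toy client updates
assert np.allclose(secure_aggregate(updates), sum(updates))
```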

Level 2 adversaries: If the central aggregator is malicious, the setting is analogous to that of the level 1 adversary. Both LDP and HDP, in conjunction with secure aggregation, can reduce such an adversary's ability to perform data reconstruction. However, there is a strategic advantage to performing secure aggregation at the super-node level. In a simple setting, assume the online clients are partitioned evenly across the zones, so that each super-node aggregates only over the clients in its own zone. It is then clear that the communication cost associated with secure aggregation is lower in this setting: it is quadratic in the number of clients per zone, not in the total number of online clients, as the example below illustrates.
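As a purely hypothetical illustration (the client and zone counts below are placeholders, not our experimental configuration), the snippet compares the number of pairwise mask exchanges required by flat versus per-zone secure aggregation:

```python
def pairwise_exchanges(num_clients):
    # One shared mask per unordered client pair: C*(C-1)/2 exchanges.
    return num_clients * (num_clients - 1) // 2

total_clients, num_zones = 300, 3
flat = pairwise_exchanges(total_clients)                             # 44,850
zoned = num_zones * pairwise_exchanges(total_clients // num_zones)   # 3 * 4,950 = 14,850
print(flat, zoned)  # the zoned cost scales with the per-zone count, not the total
```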

Take-away: In addition to secure aggregation, (a) to defeat level 0 adversaries, either LDP or HDP suffices; (b) to defeat level 1 adversaries, LDP suffices; and (c) to defeat level 2 adversaries, LDP or HDP suffices. However, HDP provides better utility than LDP, and is preferred wherever possible.

Note: We consider passive adversaries which aim to perform reconstruction, as done in prior work. More recent work [boenisch2021curious, pasquini2021eluding] considers actively malicious actors that can provide inconsistent information to different participants of the FL ecosystem to facilitate more efficient data reconstruction (without being influenced by the batch size). However, the efficacy of such attacks is reduced when secure aggregation is combined with noise addition (required to provide DP).

VIII Discussion & Open Questions

Influence of the number of clients: The privacy accounting in federated learning stems from multiple factors, primary among which is the sampling probability; this determines the number of clients that are online in each federated round. Contrary to our expectation, our experimental results suggest that increasing the total number of clients does not have a strong effect on the accuracy of the final model learnt. We conjecture that this may be the case due to the i.i.d. distribution of data associated with both the CIFAR-10 and CIFAR-100 datasets. While EMNIST is non-i.i.d., the learning task itself is too simple to produce a substantial utility difference across the three settings we consider. However, a small sampling probability greatly improves the privacy of the model learnt (i.e., yields a small privacy expenditure). We leave a more in-depth analysis of the influence of the client count and the sampling probability for future work.
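As a hedged illustration of this effect, the sketch below treats each client as one "example" for client-level accounting and queries the compute_dp_sgd_privacy helper from tensorflow_privacy (its exact module path varies across library versions); the clients-per-round, noise multiplier, round count, and delta values are arbitrary placeholders. A larger client population at a fixed number of sampled clients per round yields a smaller sampling probability, and hence a smaller privacy expenditure.

```python
# Sketch: client-level accounting with an off-the-shelf DP-SGD accountant.
from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import (
    compute_dp_sgd_privacy,
)

clients_per_round = 30
noise_multiplier = 1.1
rounds = 200
delta = 1e-5

for total_clients in (100, 200, 300):
    q = clients_per_round / total_clients  # sampling probability
    epochs = rounds * q                    # equivalent passes over the client population
    eps, _ = compute_dp_sgd_privacy(
        n=total_clients,
        batch_size=clients_per_round,
        noise_multiplier=noise_multiplier,
        epochs=epochs,
        delta=delta,
    )
    print(f"clients={total_clients}, q={q:.2f}, epsilon={eps:.2f}")
```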

Privacy Amplification: New approaches for privacy amplification (such as randomized check-ins [balle2020privacy]) can easily be incorporated with our scheme; depending on the level at which amplification is applied, we will obtain a different amplification factor. Oftentimes, amplification is a byproduct of composing a process that provides randomization with the actual function that needs to be made differentially private. In our scheme, such a randomization effect is observed through secure aggregation (which enables shuffling). In the scenario where each zone contains only a single client, there is no amplification, as the central aggregator views individual client gradients. However, when a zone contains multiple clients, observe that the central aggregator is only provided an aggregate view (of all gradients from a particular zone). Determining whether other sources of amplification exist is left to future work.

Information Leakage: If the number of clients per zone exceeds the number of federated rounds, then in expectation, each client will be elected super-node at most once. A more detailed understanding is needed to ascertain how much information is leaked by exposing a client to aggregate gradient information, whether it serves as a super-node only once or across multiple rounds.

Runtime: In our work, we measure the privacy vs. utility trade-offs of the proposed hierarchical ecosystem, assuming that elections and secure aggregation are implemented using state-of-the-art approaches. Microbenchmarks reporting the run-time of the overall scheme, highlighting the time taken for both elections and secure aggregation as a function of both the number of clients and the number of zones, would help establish the practicality of the scheme.

Practical Topologies & Data Distribution: In our current work, we prototype the proposal using simple topologies where each super-node is responsible for the same number of clients. One can theoretically show that varying the number of clients per super-node will not influence the privacy guarantees of the approach. However, this will have an impact on the utility of the final model learnt.

Ix Conclusion

In our work, we propose an approach for hierarchical federated learning based on super-node election from among the federated clients. We also propose extensions that allow it to be retrofitted to provide differential privacy guarantees. Our experiments suggest that the proposed approach provides more advantageous privacy vs. utility trade-offs than status quo approaches, while providing resilience to inference adversaries. In future work, we hope to analyze how the randomization involved in super-node election can be composed with the amplification provided by the Gaussian mechanism to provide stronger privacy guarantees.

References