
Data Leakage in Tabular Federated Learning

10/04/2022 · by Mark Vero, et al. · ETH Zurich

While federated learning (FL) promises to preserve privacy in distributed training of deep learning models, recent work in the image and NLP domains showed that training updates leak private data of participating clients. At the same time, most high-stakes applications of FL (e.g., legal and financial) use tabular data. Compared to the NLP and image domains, reconstruction of tabular data poses several unique challenges: (i) categorical features introduce a significantly more difficult mixed discrete-continuous optimization problem, (ii) the mix of categorical and continuous features causes high variance in the final reconstructions, and (iii) structured data makes it difficult for the adversary to judge reconstruction quality. In this work, we tackle these challenges and propose the first comprehensive reconstruction attack on tabular data, called TabLeak. TabLeak is based on three key ingredients: (i) a softmax structural prior, implicitly converting the mixed discrete-continuous optimization problem into an easier fully continuous one, (ii) a way to reduce the variance of our reconstructions through a pooled ensembling scheme exploiting the structure of tabular data, and (iii) an entropy measure which can successfully assess reconstruction quality. Our experimental evaluation demonstrates the effectiveness of TabLeak, establishing a new state-of-the-art on four popular tabular datasets. For instance, on the Adult dataset, we improve attack accuracy by 10% at batch size 32 and further obtain non-trivial reconstructions for batch sizes as large as 128. Our findings are important as they show that FL on tabular data, which often carries high privacy risks, is highly vulnerable.


1 Introduction

Federated Learning (FL) (McMahan et al., 2017) has emerged as the most prominent approach to training machine learning models collaboratively without requiring the sensitive data of different parties to be sent to a single centralized location. While prior work has examined privacy leakage in federated learning in the context of computer vision (Zhu et al., 2019; Geiping et al., 2020; Yin et al., 2021) and natural language processing (Dimitrov et al., 2022a; Gupta et al., 2022; Deng et al., 2021), many applications of FL rely on large tabular datasets that include highly sensitive personal data such as financial information and health status (Borisov et al., 2021; Rieke et al., 2020; Long et al., 2021). However, no prior work has studied the issue of privacy leakage in the context of tabular data, a cause of concern for public institutions, which have recently launched a competition (https://petsprizechallenges.com/) with a 1.6 million USD prize to develop privacy-preserving FL solutions for fraud detection and infection risk prediction, both tasks on tabular data.

Key challenges

Leakage attacks often rely on solving optimization problems whose solutions are the desired sensitive data points. Unlike other data types, tabular data poses unique challenges to solving these problems because: (i) the reconstruction is a solution to a mixed discrete-continuous optimization problem, in contrast to other domains where the problem is fully continuous (pixels for images and embeddings for text), (ii) there is high variance in the final reconstructions because, uniquely to tabular data, discrete changes in the categorical features significantly change the optimization trajectory, and (iii) assessing the quality of reconstructions is harder compared to images and text, e.g., determining whether a person with the given reconstructed characteristics exists is difficult. Together, these challenges imply that it is difficult to make existing attacks work on tabular data.

This work

In this work, we propose the first comprehensive leakage attack on tabular data in the FL setting, addressing the previously mentioned challenges. We provide an overview of our approach in Fig. 1, showing the reconstruction of a client's private training data point from the corresponding training update received by the server. In Step 1, we create several separate optimization problems, each assigning different initial values to the optimization variables, representing our reconstruction of the client's one-hot encoded data. To address the first challenge of tabular data leakage, we transform the mixed discrete-continuous optimization problem into a fully continuous one by passing our current reconstructions through a per-feature softmax at every step. Using the softmaxed data, we take a gradient step to minimize the reconstruction loss, which compares the received client update with a simulated client update computed on the current reconstruction. In Step 2, we reduce the variance of the final reconstruction by performing pooling over the different solutions, thus tackling the second challenge. In Step 3, we address the challenge of assessing the fidelity of our reconstructions. We rely on the observation that, often, when our proposed reconstructions agree they also match the true client data. We measure the agreement using entropy. In the example above, we see that the features sex and age produced a low-entropy distribution; therefore, we assign high confidence to these results (green arrows). In contrast, the reconstruction of the feature race receives a low confidence rating (orange arrow); rightfully so, as the reconstruction is incorrect.

We implemented our approach in an end-to-end attack called TabLeak and evaluated it on several tabular datasets. Our attack is highly effective: it can obtain non-trivial reconstructions for batch sizes as large as 128, and on many practically relevant batch sizes such as 32, it improved reconstruction accuracy by up to 10% compared to the baseline. Overall, our findings show that FL is highly vulnerable when applied to tabular data.

Main contributions

Our main contributions are:

  • Novel insights enabling efficient attacks on FL with tabular data: using softmax to make the optimization problem fully continuous, ensembling to reduce the variance, and entropy to assess the reconstructions.

  • An implementation of our approach in an end-to-end tool called TabLeak.

  • An extensive experimental evaluation demonstrating the effectiveness of TabLeak at reconstructing sensitive client data on several popular tabular datasets.

Figure 1: Overview of TabLeak. Our approach transforms the optimization problem into a fully continuous one by optimizing continuous versions of the discrete features, obtained by applying softmax (Step 1, middle boxes), resulting in candidate solutions (Step 1, bottom). Then, we pool together an ensemble of different solutions obtained from the optimization to reduce the variance of the reconstruction (Step 2). Finally, we assess the quality of the reconstruction by computing the entropy from the feature distributions in the ensemble (Step 3).

2 Background and Related Work

In this section, we provide the necessary technical background for our work, introduce the notation used throughout the paper, and present the related work in this field.

Federated Learning

Federated Learning (FL) is a training protocol developed to facilitate the distributed training of a parametric model while preserving the privacy of the data at its source (McMahan et al., 2017). Formally, we have a parametric function $f_\theta$, where $\theta$ are the (network) parameters. Given a dataset $D = \bigcup_{i=1}^{M} D_i$ as the union of the private datasets $D_i$ of $M$ clients, we now wish to find a $\theta^\star$ such that:

$$\theta^\star \in \operatorname*{arg\,min}_{\theta} \; \sum_{i=1}^{M} \sum_{(x, y) \in D_i} \mathcal{L}(f_\theta(x), y) \qquad (1)$$

in a distributed manner, i.e., without collecting the dataset in a central database. McMahan et al. (2017) propose two training algorithms, FedSGD (a similar algorithm was also proposed by Shokri and Shmatikov (2015)) and FedAvg, which allow for the distributed training of $f_\theta$ while keeping the data partitions $D_i$ at the client sources. The two protocols differ in how the clients compute their local updates in each step of training. In FedSGD, each client calculates the update gradient with respect to a randomly selected batch of their own data and shares the resulting gradient with the server. In FedAvg, the clients conduct a few epochs of local training on their own data before sharing their resulting parameters with the server. After the server has received the gradients/parameters from the clients, it aggregates them, updates the model, and broadcasts it to the clients. In each case, this process is repeated until convergence, where FedAvg usually requires fewer rounds of communication.
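To make the FedSGD update concrete, below is a minimal PyTorch sketch of a single communication round; the helper name fedsgd_round and the one-batch-per-client setup are our simplifying assumptions, not part of the protocol specification:

import torch

def fedsgd_round(model, client_batches, loss_fn, lr=0.1):
    # Each client computes the gradient on one randomly selected local batch
    # and shares it with the server.
    grads = []
    for x, y in client_batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append([p.grad.clone() for p in model.parameters()])
    # The server aggregates the shared gradients, updates the model, and
    # broadcasts the new parameters back to the clients.
    with torch.no_grad():
        for i, p in enumerate(model.parameters()):
            p -= lr * torch.stack([g[i] for g in grads]).mean(dim=0)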

Data Leakage Attacks

Although FL was designed with the goal of preserving the privacy of clients’ data, recent work has uncovered substantial vulnerabilities. Melis et al. (2019) first presented how one can infer certain properties of the clients’ private data in FL. Later, Zhu et al. (2019) demonstrated that an honest but curious server can use the current state of the model and the client gradients to reconstruct the clients’ data, breaking the main privacy promise of Federated Learning. Under this threat model, there has been extensive research on designing tailored attacks for images (Geiping et al., 2020; Geng et al., 2021; Huang et al., 2021; Jin et al., 2021; Balunovic et al., 2022; Yin et al., 2021; Zhao et al., 2020; Jeon et al., 2021; Dimitrov et al., 2022b) and natural language (Deng et al., 2021; Dimitrov et al., 2022a; Gupta et al., 2022). However, no prior work has comprehensively dealt with tabular data, despite its significance in real-world high-stakes applications (Borisov et al., 2021). Some works also consider a threat scenario where the malicious server is allowed to change the model or the updates communicated to the clients (Wen et al., 2022; Fowl et al., 2022); but in this work we focus on the honest-but-curious setting.

In training with FedSGD, given the model $f_\theta$ at an iteration $t$ and the gradient $g$ shared by some client, we solve the following optimization problem to retrieve the client's private data:

$$\hat{x} \in \operatorname*{arg\,min}_{x} \; \mathcal{E}\big(\nabla_\theta \mathcal{L}(f_\theta(x), y),\, g\big) + \lambda\, \mathcal{R}(x) \qquad (2)$$

where in Eq. 2 we denote the gradient matching loss as $\mathcal{E}$ and $\mathcal{R}$ is an optional regularizer for the reconstruction. The work of Zhu et al. (2019) used the mean square error for $\mathcal{E}$, on which Geiping et al. (2020) improved using the cosine similarity loss.

Zhao et al. (2020) first demonstrated that the private labels $y$ can be estimated before solving Eq. 2, reducing the complexity of Eq. 2 and improving the attack results. Their method was later extended to batches by Yin et al. (2021) and refined by Geng et al. (2021). Eq. 2 is typically solved using continuous optimization tools such as L-BFGS (Liu and Nocedal, 1989) and Adam (Kingma and Ba, 2014). Although analytical approaches exist, they do not generalize to batches with more than a single data point (Zhu and Blaschko, 2021).

Depending on the data domain, distinct tailored alterations to Eq. 2 have been proposed in the literature, e.g., using the total variation regularizer for images (Geiping et al., 2020) and exploiting pre-trained language models in language tasks (Dimitrov et al., 2022a; Gupta et al., 2022). These mostly non-transferable domain-specific solutions are necessary, as each domain poses unique challenges. Our work is the first to identify and tackle the key challenges of data leakage in the tabular domain.

Mixed Type Tabular Data

Mixed type tabular data is a data type commonly used in the health, economic, and social sciences, which entail high-stakes privacy-critical applications (Borisov et al., 2021). Here, data is collected in a table of feature columns that are mostly human-interpretable, e.g., the age, nationality, and occupation of an individual. We formalize tabular data as follows. Let $x$ be one line of data, containing $K$ discrete (categorical) features and $U$ continuous (numerical) features, i.e., $x = (x^D_1, \dots, x^D_K, x^C_1, \dots, x^C_U)$, where the $i$-th discrete feature $x^D_i$ takes values in a finite set of cardinality $q_i$ and the continuous features are real-valued. For the purpose of deep neural network training, the categorical features are often encoded in a numerical vector. We denote the encoded data batch or line as $x^{enc}$, where we preserve the continuous features and encode the categorical features by a one-hot encoding. The one-hot encoding of the $i$-th discrete feature is a vector of length $q_i$ that has a one at the position marking the encoded category, while all other entries are zeros. We obtain the represented category by taking the argmax of the corresponding group of entries (the projection used to obtain a reconstruction in the original feature space). Using the described encoding, one line of data translates to a vector containing $\sum_{i=1}^{K} q_i + U$ entries.
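As a concrete illustration of this encoding scheme, the following NumPy sketch encodes one line of mixed-type data and projects an encoded vector back to the original feature space; the function names are ours:

import numpy as np

def encode_line(x_disc, x_cont, cardinalities):
    # One-hot encode each categorical value (an index in {0, ..., q_i - 1})
    # and append the continuous features unchanged.
    parts = []
    for value, q in zip(x_disc, cardinalities):
        one_hot = np.zeros(q)
        one_hot[value] = 1.0
        parts.append(one_hot)
    parts.append(np.asarray(x_cont, dtype=float))
    return np.concatenate(parts)  # length: sum_i q_i + U

def project_line(z, cardinalities, n_cont):
    # Invert the encoding: argmax per one-hot group, continuous part kept.
    x_disc, offset = [], 0
    for q in cardinalities:
        x_disc.append(int(np.argmax(z[offset:offset + q])))
        offset += q
    return x_disc, z[offset:offset + n_cont]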

3 Tabular Leakage

In this section, we briefly summarize the key challenges of tabular leakage, and then present our solutions to these challenges, as well as our end-to-end attack, in the subsequent subsections.

Key Challenges

We now list the three key challenges that we address in our work: (i) the presence of both categorical and continuous features in tabular data requires the attacker to solve a significantly harder mixed discrete-continuous optimization problem (addressed in Sec. 3.1), (ii) the large distance between the encodings of the categorical features introduces high variance into the leakage problem (addressed in Sec. 3.2), and (iii) in contrast to images and text, it is hard for an adversary to assess the quality of the reconstructed data in the tabular domain, as most reconstructions may be projected to credible input data points (we address this via an uncertainty quantification scheme in Sec. 3.3).

3.1 The Softmax Structural Prior

We now discuss our solution to challenge (i) – we introduce the softmax structural prior, which turns the hard mixed discrete-continuous optimization problem into a fully continuous one. This drastically reduces its complexity, while still facilitating the recovery of correct discrete structures.

To start, notice that the recovery of one-hot encodings can be enforced by ensuring that all entries of the recovered vector are either zero or one, and that exactly one of the entries equals one. However, these constraints enforce integer properties, i.e., they are non-differentiable and cannot be used in combination with the powerful continuous optimization tools used for gradient leakage attacks. Relaxing the integer constraint by allowing the reconstructed entries to take real values in $[0, 1]$, we are still left with a constrained optimization problem not well suited for popular continuous optimization tools such as Adam (Kingma and Ba, 2014). Therefore, we are looking for a method that implicitly enforces the constraints introduced above.

Let $z_i$ be our approximate intermediate solution for the true one-hot encoded data $x^{enc}_i$ at some optimization step. Then, we are looking for a differentiable function $\sigma: \mathbb{R}^{q_i} \to \mathbb{R}^{q_i}$ such that:

$$\sigma(z_i)_j \in [0, 1] \;\; \text{for all } j \in \{1, \dots, q_i\}, \qquad \sum_{j=1}^{q_i} \sigma(z_i)_j = 1. \qquad (3)$$

Notice that the two conditions in Eq. 3 can be fulfilled by applying a softmax to $z_i$, i.e., define:

$$\sigma(z_i)_j = \frac{\exp((z_i)_j)}{\sum_{l=1}^{q_i} \exp((z_i)_l)}. \qquad (4)$$

Note that it is easy to show that Eq. 4 fulfills both conditions in Eq. 3 and that it is differentiable. Putting this together, in each round of optimization we will have the following approximation of the true data point: the softmaxed categorical groups concatenated with the continuous entries, $\sigma(z) = (\sigma(z_1), \dots, \sigma(z_K), z^C)$. In order to preserve notational simplicity, we write $\sigma(z)$ to mean the application of the softmax to each group of entries representing a given categorical variable separately.
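In code, the softmax structural prior amounts to applying a softmax to each one-hot group of the encoded vector while leaving the continuous entries untouched; below is a minimal PyTorch sketch for a single encoded line (the function name is ours):

import torch

def softmax_structural_prior(z, cardinalities, n_cont):
    # Apply a softmax to every group of entries that represents one
    # categorical feature; pass the continuous entries through unchanged.
    parts, offset = [], 0
    for q in cardinalities:
        parts.append(torch.softmax(z[offset:offset + q], dim=0))
        offset += q
    parts.append(z[offset:offset + n_cont])
    return torch.cat(parts)  # fully differentiable in z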

3.2 Pooled Ensembling

As mentioned earlier, the mix of categorical and continuous features introduces further variance into the difficult reconstruction problem, which already has multiple local minima and high sensitivity to initialization (Zhu and Blaschko, 2021) (challenge (ii)). Concretely, as the one-hot encodings of the categorical features are orthogonal to each other, a change in the encoded category can drastically change the optimization trajectory. We alleviate this problem by adapting an established method of variance reduction in noisy processes (Hastie et al., 2009): we run several independent optimization processes with different initializations and ensemble their results through feature-wise pooling.

Note that the features in tabular data are tied to a fixed position in the recovered data vector; thereby we can combine independent reconstructions to obtain an improved and more robust final estimate of the true data by applying feature-wise pooling. Formally, we run $N$ independent rounds of optimization with different initializations, recovering potentially different reconstructions $z^{(1)}, \dots, z^{(N)}$. Then, we obtain a final estimate of the true encoded data, denoted as $\bar{z}$, by pooling them:

$$\bar{z} = \text{pool}\big(z^{(1)}, \dots, z^{(N)}\big), \qquad (5)$$

where the pool operation can be any permutation invariant mapping that maps to the same structure as its inputs. In our attack, we use median pooling for both continuous and categorical features.

Figure 2: Maximum similarity matching of a sample from the collection of reconstructions to the best-loss sample.

Notice that because a batch-gradient is invariant to permutations of the datapoints in the corresponding batch, when reconstructing from such a gradient we may retrieve the batch-points in a different order at every optimization instance. Hence, we need to reorder each batch such that their lines match each other, and only then can we conduct the pooling. We reorder by first selecting the sample that produced the best reconstruction loss at the end of the optimization, together with its projection. Then, we match the lines of every other sample in the collection against this best-loss sample. Concretely, we calculate the similarity (described in detail in Sec. 4) between each pair of lines of the best-loss sample and another sample in the collection, and find the maximum similarity reordering of the lines with the help of bipartite matching solved by the Hungarian algorithm (Kuhn, 1955). This process is depicted in Fig. 2, and a code sketch of the matching and pooling step follows below. Repeating this for each sample, we reorder the entire collection with respect to the best-loss sample, effectively reversing the permutation differences in the independent reconstructions. Therefore, after this process we can directly apply feature-wise pooling for each line over the collection.
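The matching and pooling step can be sketched as follows, using SciPy's Hungarian algorithm implementation; here samples is a list of (batch_size, dim) arrays that includes the best-loss sample itself, and similarity is the line-wise similarity score of Sec. 4 (the function signature is our assumption):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_pool(samples, best, similarity):
    reordered = []
    for s in samples:
        # cost[i, j] is the negative similarity between line i of the
        # best-loss sample and line j of the current sample.
        cost = -np.array([[similarity(b_line, s_line) for s_line in s]
                          for b_line in best])
        _, cols = linear_sum_assignment(cost)  # Hungarian algorithm
        reordered.append(s[cols])              # align lines to `best`
    # Feature-wise median over the ensemble of aligned reconstructions.
    return np.median(np.stack(reordered), axis=0)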

3.3 Entropy-based Uncertainty Estimation

We now address challenge (iii) above. To recap, it is significantly harder for an adversary to assess the quality of an obtained reconstruction when it comes to tabular data, as almost any reconstruction may constitute a credible datapoint when projected back to the mixed discrete-continuous space. Note that this challenge does not arise as prominently in the image (or text) domain, because by looking at a picture one can easily judge whether it is just noise or an actual image. To address this issue, we propose to estimate the reconstruction uncertainty by looking at the level of agreement over a certain feature across different reconstructions. Concretely, given a collection of reconstructions as in Sec. 3.2, we can observe the distribution of each feature over the reconstructions. Intuitively, if this distribution is "peaky", i.e., concentrates the mass heavily on a certain value, then we can assume that the feature has been reconstructed correctly, whereas if there is high disagreement between the reconstructed samples, we can assume that this feature's recovered final value should not be trusted. We can quantify this by measuring the entropy of the feature distributions induced by the recovered samples.

Categorical Features

Let $p_i(v)$ be the relative frequency with which the projected reconstructions of the $i$-th discrete feature take the value $v$ in the ensemble. Then, we can calculate the normalized entropy of the feature as $H_i = -\frac{1}{\log q_i} \sum_{v} p_i(v) \log p_i(v)$. Note that the normalization allows for comparing features with supports of different size, i.e., it ensures that $H_i \in [0, 1]$, as $H(X) \le \log |\mathrm{supp}(X)|$ for any discrete random variable $X$ of finite support.

Continuous Features

In the case of the continuous features, we calculate the entropy assuming that the errors of the reconstructed samples follow a Gaussian distribution. As such, we first estimate the sample variance $\hat{\sigma}_j^2$ for the $j$-th continuous feature and then plug it in to calculate the differential entropy of the corresponding Gaussian: $H_j = \frac{1}{2} \log(2 \pi e \hat{\sigma}_j^2)$. As this approach is universal over all continuous features, it is enough to simply scale the features themselves to make their entropies comparable. For example, this can be achieved by working only with standardized features.

Note that as the categorical and the continuous features are fundamentally different from an information theoretic perspective, we have no robust means to combine them in a way that would allow for equal treatment. Therefore, when assessing the credibility of recovered features, we will always distinguish between categorical and continuous features.
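For illustration, the two entropy measures can be computed as in the following NumPy sketch; projected holds the $N$ projected categories of one discrete feature across the ensemble, and values the $N$ reconstructed values of one (standardized) continuous feature:

import numpy as np

def categorical_entropy(projected, cardinality):
    # Normalized entropy; assumes cardinality >= 2 so that log(q_i) > 0.
    _, counts = np.unique(projected, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(cardinality))

def continuous_entropy(values):
    # Differential entropy of a Gaussian fitted to the ensemble values;
    # a zero-variance (fully agreeing) ensemble yields -inf.
    var = np.var(values, ddof=1)
    return float(0.5 * np.log(2.0 * np.pi * np.e * var))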

3.4 Combined Attack

1: function SingleInversion(Neural Network: f, Client Gradient: g, Reconstructed Labels: y_hat, Initial Reconstruction: z_0, Iterations: k_max, N. of Discrete Features: K)
2:     for k in 0 .. k_max - 1 do
3:         for i in 1 .. K do
4:             sigma(z_k)_i ← softmax((z_k)_i)
5:         end for
6:         z_{k+1} ← OptimizerStep(z_k, ∇_z E(g, ∇_θ L(f(sigma(z_k)), y_hat)))
7:     end for
8:     return z_{k_max}
9: end function
10:
11: function TabLeak(Neural Network: f, Client Gradient: g, Reconstructed Labels: y_hat, Ensemble Size: N, Iterations: k_max, N. of Discrete Features: K)
12:     z_0^(1), ..., z_0^(N) ← InitReconstructions(N)
13:     for j in 1 .. N do
14:         z^(j) ← SingleInversion(f, g, y_hat, z_0^(j), k_max, K)
15:     end for
16:     j* ← argmin_j E(g, ∇_θ L(f(sigma(z^(j))), y_hat))
17:     z_bar ← MatchAndPool(z^(1), ..., z^(N); z^(j*))
18:     H_D, H_C ← CalculateEntropy(z^(1), ..., z^(N))
19:     x_bar ← Project(z_bar)
20:     return x_bar, H_D, H_C
21: end function
Algorithm 1: Our combined attack against training by FedSGD

Now we provide the description of our end-to-end attack, TabLeak. Following Geiping et al. (2020), we use the cosine similarity loss as our reconstruction loss, defined as:

$$\mathcal{E}\big(g, \sigma(z)\big) = 1 - \frac{\big\langle g,\; \nabla_\theta \mathcal{L}(f_\theta(\sigma(z)), \hat{y}) \big\rangle}{\|g\|_2 \; \big\|\nabla_\theta \mathcal{L}(f_\theta(\sigma(z)), \hat{y})\big\|_2}, \qquad (6)$$

where $g$ is the gradient shared by the client, computed on the true data, $\hat{y}$ are the labels reconstructed beforehand, and we optimize for $z$. Our algorithm is shown in Alg. 1. First, we reconstruct the labels using the label reconstruction method of Geng et al. (2021) and provide them as an input to our attack. Then, we initialize $N$ independent dummy samples for an ensemble of size $N$ (line 12). Starting from each initial sample we optimize independently (lines 13-15) via the SingleInversion function. In each optimization step, we apply the softmax structural prior of Sec. 3.1 and let the optimizer differentiate through it (line 4). After the optimization processes have converged or have reached the maximum number of allowed iterations $k_{max}$, we identify the sample producing the best reconstruction loss (line 16). Using this sample, we match and median pool to obtain the final encoded reconstruction in line 17, as described in Sec. 3.2. Finally, we return the projected reconstruction and the corresponding feature-entropies $H_D$ and $H_C$, quantifying the uncertainty in the reconstruction.
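The following PyTorch sketch condenses the SingleInversion routine of Alg. 1, reusing the softmax_structural_prior helper sketched in Sec. 3.1; the unbatched setting and the helper names are our simplifying assumptions, and the matching, pooling, and entropy steps proceed as in the earlier sketches:

import torch

def cosine_loss(sim_grads, client_grads):
    # 1 - cosine similarity between the simulated and received gradients.
    a = torch.cat([g.flatten() for g in sim_grads])
    b = torch.cat([g.flatten() for g in client_grads])
    return 1.0 - torch.dot(a, b) / (a.norm() * b.norm())

def single_inversion(model, loss_fn, client_grads, y_hat, z_0,
                     n_iter, cardinalities, n_cont, lr=0.1):
    z = z_0.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        x = softmax_structural_prior(z, cardinalities, n_cont)  # Sec. 3.1
        sim_grads = torch.autograd.grad(loss_fn(model(x), y_hat),
                                        model.parameters(), create_graph=True)
        cosine_loss(sim_grads, client_grads).backward()
        z.grad = z.grad.sign()  # sign-reduced update, as in Geiping et al.
        opt.step()
    return z.detach()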

4 Experimental evaluation

In this section, we first detail the evaluation metric we used to assess the obtained reconstructions. We then briefly explain our experimental setup. Next, we evaluate our attack in various settings against baseline methods, establishing a new state-of-the-art. Finally, we demonstrate the effectiveness of our entropy-based uncertainty quantification method.

Evaluation Metric

As no prior work on tabular data leakage exists, we propose our own metric for measuring the accuracy of a tabular reconstruction, inspired by the 0-1 loss and allowing the joint treatment of categorical and continuous features. For a reconstruction $\hat{x}$, we define the accuracy metric as:

$$\text{Acc}(\hat{x}, x) = \frac{1}{K + U} \left( \sum_{i=1}^{K} \mathbb{1}\{\hat{x}^D_i = x^D_i\} + \sum_{j=1}^{U} \mathbb{1}\{|\hat{x}^C_j - x^C_j| < b_j\} \right), \qquad (7)$$

where $x$ is the ground truth and the $b_j$ are constants determining how close the reconstructed continuous features have to be to the original value in order to be considered successfully leaked. We provide more details on our metric in App. A and experiments with additional metrics in Sec. C.3.
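A minimal implementation of this metric for a single line of data, with the tolerances $b_j$ passed as tol:

def reconstruction_accuracy(x_hat_disc, x_disc, x_hat_cont, x_cont, tol):
    # Exact match for categorical features; match within a per-feature
    # tolerance for continuous features (Eq. 7).
    hits = sum(a == b for a, b in zip(x_hat_disc, x_disc))
    hits += sum(abs(a - b) < t for a, b, t in zip(x_hat_cont, x_cont, tol))
    return hits / (len(x_disc) + len(x_cont))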

Baselines

We consider two main baselines:

(i) Random Baseline: does not use the gradient updates and simply samples reconstructions at random from the per-feature marginals of the input dataset. Due to the structure of tabular datasets, we can easily estimate the marginal distribution of each feature: for the categorical features this can be done by simple counting, and for the continuous features by defining a binning scheme with equally spaced bins between the lower and upper bounds of the feature (a sketch of this sampling procedure is given after this paragraph). Although this baseline is usually not realizable in practice (as it assumes prior knowledge of the marginals), it helps us calibrate our metric, as performing below this baseline signals that no information is being extracted from the client updates. Note that because both the selection of a batch and the random baseline represent sampling from the (approximate) data generating distribution, the random baseline monotonically approaches perfect accuracy with increasing batch size.

(ii) Cosine Baseline: based on the work of Geiping et al. (2020), who established a strong attack for images. We transfer their attack to tabular data by removing the total variation prior used for images. Note that most competitive attacks on images and text reduce to this baseline when their domain-specific elements are removed, making it a reasonable choice for benchmarking a new domain.
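As referenced above, the random baseline can be sketched as follows, assuming the dataset is available as a pandas DataFrame with object-typed categorical columns (a simplifying assumption on our part):

import numpy as np

def random_baseline(df, n_bins=50):
    # Sample one reconstruction per feature from its empirical marginal.
    sample = {}
    for col in df.columns:
        if df[col].dtype == object:  # categorical: sample by counting
            vals, counts = np.unique(df[col], return_counts=True)
            sample[col] = np.random.choice(vals, p=counts / counts.sum())
        else:                        # continuous: equally spaced bins
            counts, edges = np.histogram(df[col], bins=n_bins)
            idx = np.random.choice(n_bins, p=counts / counts.sum())
            sample[col] = np.random.uniform(edges[idx], edges[idx + 1])
    return sample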

Experimental Setup

For all attacks, we use the Adam optimizer (Kingma and Ba, 2014) with a constant learning rate (no schedule) for a fixed number of iterations to perform the optimization in Alg. 1. In line with Geiping et al. (2020), we modify the update step of the optimizer by reducing the update gradient to its element-wise sign. The neural network we attack is a fully connected neural network with two hidden layers. We conducted our experiments on four popular mixed-type tabular binary classification datasets: the Adult census dataset (Dua and Graff, 2017), the German Credit dataset (Dua and Graff, 2017), the Lawschool Admissions dataset (Wightman, 2017), and the Health Heritage dataset from Kaggle (source: https://www.kaggle.com/c/hhp). Due to space constraints, we report here only our results on the Adult dataset and refer the reader to App. D for full results on all four datasets. Finally, for all reported numbers below, we attack a neural network at initialization and estimate the mean and standard deviation of each reported metric on 50 different batches. For experiments with varying network sizes and attacks against provable defenses, please see App. C. For further details on the experimental setup of each experiment, we refer the reader to App. B.

General Results against FedSGD

In Tab. 1 we present the results of our strong attack TabLeak against FedSGD training, together with two ablation experiments, each removing either the pooling component (no pooling) or the softmax component (no softmax). We compare our results to the baselines introduced above over a range of batch sizes, once assuming knowledge of the true labels (top) and once using labels reconstructed by the method of Geng et al. (2021) (bottom). Notice that the noisy label reconstruction only influences the results at lower batch sizes, and manifests itself mostly in a higher variance of the results. It is also worth noting that for small batch sizes (see App. D) all attacks can recover almost all the data, exposing a trivial vulnerability of FL on tabular data. For larger batch sizes, even up to 128, TabLeak can recover a significant portion of the client's private data, well above random guessing, while the baseline Cosine attack fails to do so, demonstrating the necessity of a domain-tailored attack. In a later paragraph, we show how we can further improve our reconstruction at this batch size and extract subsets of features with high accuracy using the entropy. Further, the results of the ablation attacks demonstrate the effectiveness of each attack component, both providing a non-trivial improvement over the baseline attack that is preserved when combined in our strongest attack. Demonstrating generalization beyond Adult, we include our results on the German Credit, Lawschool Admissions, and Health Heritage datasets in Sec. D.1, where we also outperform the Cosine baseline attack by a clear margin on each dataset.

Label | Batch Size | TabLeak | TabLeak (no pooling) | TabLeak (no softmax) | Cosine | Random
True
Rec.
Table 1: The mean inversion accuracy [%] and standard deviation of different methods over varying batch sizes with given true labels (True) and with reconstructed labels (Rec.) on the Adult dataset.

Categorical vs. Continuous Features

An interesting effect of having mixed-type features in the data is that the reconstruction success clearly differs by feature type. As we can observe in Fig. 3, the continuous features yield a markedly lower accuracy than the categorical features at the same batch size. We suggest that this is due to the nature of categorical features and how they are encoded. While trying to match the gradients by optimizing the reconstruction, getting the categorical features right has a much greater effect on the gradient alignment, as, when encoded, they take up the majority of the data vector. Also, when reconstructing a one-hot encoded categorical feature, we only have to retrieve the location of the maximum in a vector of length $q_i$, whereas for the successful reconstruction of a continuous feature we have to retrieve its value correctly up to a small error. Therefore, especially when the optimization process is aware of the structure of the encoding scheme (e.g., by using the softmax structural prior), categorical features are much easier to reconstruct. This poses a critical privacy risk in tabular federated learning, as sensitive features are often categorical, e.g., gender or race.

Figure 3: The inversion accuracy on the Adult dataset over varying batch size separated for discrete (D) and continuous (C) features.

Federated Averaging

In training with FedAvg (McMahan et al., 2017), participating clients conduct local training for several updates before communicating their new parameters to the server. Note that the more local updates the clients conduct, the harder a reconstruction attack becomes, making leakage attacks against FedAvg more challenging. Although this training method is of significantly higher practical importance than FedSGD, most prior work does not evaluate against it. Building on the work of Dimitrov et al. (2022b) (for details please see App. B and the work of Dimitrov et al. (2022b)), we evaluate our combined attack and the cosine baseline in the setting of Federated Averaging. We present our results of retrieving a client dataset of fixed size over a varying number of local batches and epochs on the Adult dataset in Tab. 2, while assuming full knowledge of the true labels. We observe that while our combined attack significantly outperforms the random baseline even for large numbers of local updates, the baseline attack fails to consistently do so whenever the local training is longer than one epoch. As FedAvg with tabular data is of high practical relevance, our results, which highlight its vulnerability, are concerning. We show further details of the experimental setup and results on the other datasets in App. B and App. D, respectively.

TabLeak Cosine
#batches 1 epoch 5 epochs 10 epochs 1 epoch 5 epochs 10 epochs
1
2
4
Table 2: Mean and standard deviation of the inversion accuracy [%] on FedAvg with a fixed local dataset size on the Adult dataset.
Batch Size | 8 | 16 | 32 | 64 | 128
Acc. (categorical features)
Acc. (continuous features)
Table 3: The mean accuracy [%] and entropies with the corresponding standard deviations over batch sizes of the categorical (top) and continuous (bottom) features.

Assessing Reconstructions via Entropy

Entropy Bucket | Accuracy [%] | Data [%] (categorical features)
0.0-0.2
0.2-0.4
0.4-0.6
0.6-0.8
0.8-1.0
Overall
Random
Table 4: The mean accuracy [%] and the share of data [%] in each entropy bucket for batch size 128 on the Adult dataset.

We now investigate how an adversary can use the entropy (introduced in Sec. 3.3) to assess the quality of their reconstructions. In Tab. 3 we show the mean and standard deviation of the accuracy and the entropy of both the discrete and the continuous features over increasing batch sizes after reconstructing with TabLeak. We observe an increase in the mean entropy with increasing batch size, mirroring the decrease in accuracy of the reconstructed batches. Hence, an attacker can gauge the global effectiveness of their attack by looking at the retrieved entropies, without having to compare their results to the ground truth.

We now look at a single batch of size 128 and put each categorical feature into a bucket based on its reconstruction entropy after attacking with TabLeak. In Tab. 4 we present our results, showing that features falling into the lower entropy buckets (0.0-0.2 and 0.2-0.4) inside a batch are reconstructed significantly more accurately than the overall batch. Note that this bucketing can be done without knowledge of the ground truth, yet the adversary can concretely identify the high-fidelity features in their noisy reconstruction. This shows that even for reconstructions of large batches that seem to contain little-to-no information (close to the random baseline), an adversary can still extract subsets of the data with high accuracy. Tables containing both feature types on all four datasets can be found in Sec. D.4, supporting analogous conclusions.

5 Conclusion

In this work we presented TabLeak, the first data leakage attack on tabular data in the setting of federated learning (FL), obtaining state-of-the-art results against both popular FL training protocols in the tabular domain. As tabular data is ubiquitous in privacy-critical high-stakes applications, our results raise important concerns regarding practical systems currently using FL. Therefore, we advocate for further research on the defenses necessary to mitigate such privacy leaks.

6 Ethics Statement

As tabular data is often used in high-stakes applications and may contain sensitive data of natural or legal persons, confidential treatment is critical. This work presents an attack algorithm in the tabular data domain that enables an FL server to steal the private data of its clients in industry-relevant scenarios, rendering such applications potentially unsafe.

We believe that exposing vulnerabilities of both recently proposed and widely adopted systems, where privacy is a concern, can benefit the development of adequate safety mechanisms against malicious actors. In particular, this view is shared by the governmental institutions of the United States of America and the United Kingdom that jointly supported the launch of a competition (https://petsprizechallenges.com/) aimed at advancing the privacy of FL in the tabular domain, encouraging the participation of both teams developing defenses and attacks. Also, as our experiments in Sec. C.1 show, existing techniques can help mitigate the privacy threat, hence we encourage practitioners to make use of them.

References

Appendix A Accuracy Metric

To ease the understanding, we start by repeating our accuracy metric here, where we measure the reconstruction accuracy between the retrieved sample $\hat{x}$ and the ground truth $x$ as:

$$\text{Acc}(\hat{x}, x) = \frac{1}{K + U} \left( \sum_{i=1}^{K} \mathbb{1}\{\hat{x}^D_i = x^D_i\} + \sum_{j=1}^{U} \mathbb{1}\{|\hat{x}^C_j - x^C_j| < b_j\} \right). \qquad (8)$$

Note that the binary treatment of continuous features in our accuracy metric enables the combined measurement of the accuracy on both the discrete and the continuous features. From an intuitive point of view, this measure closely resembles how one would judge the correctness of numerical guesses. For example, when guessing someone's age, one would deem a guess good if it is within a few years of the true value, but a guess much further off would be qualitatively incorrect. In order to facilitate the scalability of our experiments, we chose the error-tolerance-bounds by taking the global standard deviation of the given continuous feature and multiplying it by a fixed constant $C$, used for all our experiments. Note that for a Gaussian random variable with mean $\mu$ and variance $\sigma^2$, the probability of falling within $\pm C\sigma$ of the mean is $2\Phi(C) - 1$. For our metric this means that, assuming a Gaussian zero-mean error in the reconstruction around the true value, we accept our reconstruction as privacy leakage as long as we fall into the corresponding error-probability range around the correct value. In Tab. 5 we list the tolerance bounds for the continuous features of the Adult dataset produced by this method. We remark that we fixed our metric parameters before conducting any experiments, and did not adjust them based on any obtained results. Note also that in App. C we provide results where the continuous feature reconstruction accuracy is measured using the commonly used regression metric of root mean squared error (RMSE), where TabLeak also achieves the best results, signaling that the success of our method is independent of our chosen metric.

feature age fnlwgt education-num capital-gain capital-loss hours-per-week
tolerance 4.2 33699 0.8 2395 129 3.8
Table 5: Resulting tolerance bounds on the Adult dataset for the constant $C$ used in all our experiments.
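A sketch of how such bounds can be derived from the training data; the multiplier c stands for the fixed constant $C$ discussed above, whose concrete value we leave parameterized here, and a pandas DataFrame input is our assumption:

import numpy as np

def tolerance_bounds(df, continuous_features, c):
    # Per-feature tolerance: c times the global standard deviation (App. A).
    return {f: c * float(np.std(df[f].to_numpy())) for f in continuous_features}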

Appendix B Further Experimental Details

Here we give an extended description of the experimental details provided in Sec. 4. For all attacks, we use the Adam optimizer (Kingma and Ba, 2014) with a constant learning rate (no schedule) for a fixed number of iterations. We chose the learning rate based on our experiments with the baseline attack, where it performed best. In line with Geiping et al. (2020), we modify the update step of the optimizer by reducing the update gradient to its element-wise sign. We attack a fully connected neural network with two hidden layers at initialization. However, we provide a network-size ablation in Fig. 8, where we evaluate our attack against the baseline method for different network architectures. For each reported metric we conduct independent runs on 50 different batches to estimate their statistics. For all FedSGD experiments we clamp the continuous features to their valid ranges before measuring the reconstruction accuracy, both for our attacks and the baseline methods. We ran each of our experiments on single cores of an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz.

Federated Averaging Experiments

For the experiments attacking the FedAvg training algorithm, we fix the clients' local dataset size and conduct an attack after local training on the initialized network described above. We use the FedAvg attack-framework of Dimitrov et al. (2022b), where for each local training epoch we initialize an independent mini-dataset matching the size of the client dataset and simulate the local training of the client. At each reconstruction update, we use the mean squared error between the different epoch data means as the permutation invariant epoch prior required by the framework, ensuring the consistency of the reconstructed dataset. For the full technical details, please refer to the manuscript of Dimitrov et al. (2022b). For choosing the prior parameter, we conduct a line-search on each setup and attack method pair individually and pick the value providing the best results. Further, to reduce computational overhead, we reduce the ensemble size of TabLeak for these experiments on all datasets.

Appendix C Further Experiments

In this appendix, we present three further experiments:

  • Results of attacking neural networks defended using differentially private noisy gradients in Sec. C.1.

  • Ablation study on the impacts of the neural network’s size on the reconstruction difficulty in Sec. C.2.

  • Measuring the Root Mean Squared Error (RMSE) of the reconstruction of continuous features in Sec. C.3.

C.1 Attack against Gaussian DP

Differential privacy (DP) has recently gained popularity as a way to prevent privacy violations in FL (Abadi et al., 2016; Zhu et al., 2019). Unlike empirical defenses, which are often broken by specifically crafted adversaries (Balunovic et al., 2022), DP provides guarantees on the amount of data leaked by an FL model, in terms of the magnitude of the random noise the clients add to their gradients prior to sharing them with the server (Abadi et al., 2016; Zhu et al., 2019). Naturally, DP methods trade model accuracy for privacy, since larger noise results in worse models that are more private. In this subsection, we evaluate TabLeak and the Cosine baseline against DP-defended gradient updates, where zero-mean Gaussian noise of varying standard deviation is added to the client gradients. We present our results on the Adult, German Credit, Lawschool Admissions, and Health Heritage datasets in Fig. 4, Fig. 5, Fig. 6, and Fig. 7, respectively. Although both methods are affected by the defense, our method consistently produces better reconstructions than the baseline method. However, for high noise levels and larger batch sizes both attacks break, advocating for the use of DP defenses in tabular FL to prevent the high vulnerability exposed by this work.

Figure 4: Mean and standard deviation accuracy [%] curves over batch size at varying Gaussian noise level added to the client gradients for differential privacy on the Adult dataset.
Figure 5: Mean and standard deviation accuracy [%] curves over batch size at varying Gaussian noise level added to the client gradients for differential privacy on the German Credit dataset.
Figure 6: Mean and standard deviation accuracy [%] curves over batch size at varying Gaussian noise level added to the client gradients for differential privacy on the Lawschool Admissions dataset.
Figure 7: Mean and standard deviation accuracy [%] curves over batch size at varying Gaussian noise level added to the client gradients for differential privacy on the Health Heritage dataset.

C.2 Varying Network Size

To understand the effect the choice of network has on the obtained reconstruction results, we defined additional fully connected networks, two smaller and two bigger ones, to evaluate TabLeak on. Concretely, we examined the following five architectures for our attack:

  1. a single hidden layer with neurons,

  2. a single hidden layer with neurons,

  3. two hidden layers with neurons each (network used in main body),

  4. three hidden layers with neurons each,

  5. and three hidden layers with neurons each.

We attack the above networks, aiming to reconstruct a batch of fixed size. We plot the accuracy of TabLeak and the cosine baseline as a function of the number of parameters in the network in Fig. 8 for all four datasets. We can observe that with an increasing number of parameters in the network, the reconstruction accuracy significantly increases on all datasets, rather surprisingly allowing for near-perfect reconstruction of even a large batch in some cases. Observe that at both ends of the presented parameter scale the differences between the methods degrade, i.e., they either both converge to near-perfect reconstruction (large networks) or to random guessing (small networks). Therefore, the choice of our network for conducting the experiments was instructive in examining the differences between the methods.

Adult
German Credit
Lawschool Admissions
Health Heritage
Figure 8: Mean attack accuracy curves with standard deviation for a fixed batch size over varying network size (measured in number of parameters, #Params, log scale) on all four datasets. We mark the network used for our other experiments with a dashed vertical line.

C.3 Continuous Feature Reconstruction Measured by RMSE

In order to examine the potential influence of our choice of reconstruction metric on the obtained results, we additionally measured the reconstruction quality of the continuous features with the widely used Root Mean Squared Error (RMSE) metric. Concretely, we calculate the RMSE between the continuous features of our reconstruction $\hat{x}$ and the ground truth $x$ in a batch of size $n$ as:

$$\text{RMSE}(\hat{x}, x) = \sqrt{\frac{1}{n\,U} \sum_{b=1}^{n} \sum_{j=1}^{U} \big(\hat{x}^C_{b,j} - x^C_{b,j}\big)^2}. \qquad (9)$$

As our results in Fig. 9 demonstrate, TabLeak achieves significantly lower RMSE than the Cosine baseline on large batch sizes, for all four datasets examined. This indicates that the strong results obtained by TabLeak in the rest of the paper are not a consequence of our evaluation metric.
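For completeness, a one-line NumPy sketch of Eq. 9 for (batch_size, U) arrays of continuous features:

import numpy as np

def continuous_rmse(x_hat_cont, x_cont):
    # Root mean squared error over all continuous entries of the batch.
    diff = np.asarray(x_hat_cont) - np.asarray(x_cont)
    return float(np.sqrt(np.mean(diff ** 2)))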

Adult
German Credit
Lawschool Admissions
Health Heritage
Figure 9: The mean and standard deviation of the Root Mean Square Error (RMSE) of the reconstructions of the continuous features on all four datasets over batch sizes.

Appendix D All Main Results

In this appendix, we include all the results presented in the main part of this paper for the Adult dataset, alongside the corresponding additional results on the German Credit, Lawschool Admissions, and Health Heritage datasets.

D.1 Full FedSGD Results on all Datasets

In Tab. 6, Tab. 7, Tab. 8, and Tab. 9 we provide the full attack results of our method compared to the Cosine and random baselines on the Adult, German Credit, Lawschool Admissions, and Health Heritage datasets, respectively. Looking at the results for all datasets, we can confirm the observations made in Sec. 4, i.e., (i) lower batch sizes are vulnerable to any non-trivial attack, (ii) not knowing the ground truth labels does not significantly disadvantage the attacker at larger batch sizes, and (iii) TabLeak provides a strong improvement over the baselines for practically relevant batch sizes on all datasets examined.

Label | Batch Size | TabLeak | TabLeak (no pooling) | TabLeak (no softmax) | Cosine | Random
True
Rec.
Table 6: The mean inversion accuracy [%] and standard deviation of different methods over varying batch sizes with given true labels (top) and with reconstructed labels (bottom) on the Adult dataset.
Label | Batch Size | TabLeak | TabLeak (no pooling) | TabLeak (no softmax) | Cosine | Random
True
Rec.
Table 7: The mean inversion accuracy [%] and standard deviation of different methods over varying batch sizes with given true labels (top) and with reconstructed labels (bottom) on the German Credit dataset.
Label | Batch Size | TabLeak | TabLeak (no pooling) | TabLeak (no softmax) | Cosine | Random
True
Rec.
Table 8: The mean inversion accuracy [%] and standard deviation of different methods over varying batch sizes with given true labels (top) and with reconstructed labels (bottom) on the Lawschool Admissions dataset.
Label | Batch Size | TabLeak | TabLeak (no pooling) | TabLeak (no softmax) | Cosine | Random
True
Rec.
Table 9: The mean inversion accuracy [%] and standard deviation of different methods over varying batch sizes with given true labels (top) and with reconstructed labels (bottom) on the Health Heritage dataset.

D.2 Categorical vs. Continuous Features on all Datasets

In Fig. 10, we compare the reconstruction accuracy of the continuous and the discrete features on all four datasets. We confirm our observation from Fig. 3 in the main text that a strong dichotomy between continuous and discrete feature reconstruction accuracy exists on all four datasets.

Adult
German Credit
Lawschool Admissions
Health Heritage
Figure 10: Mean reconstruction accuracy curves with corresponding standard deviations over varying batch size, separately for the discrete and the continuous features on all four datasets.

D.3 Federated Averaging Results on all Datasets

In Tab. 10, Tab. 11, Tab. 12, and Tab. 13 we present our results on attacking the clients in FedAvg training on the Adult, German Credit, Lawschool Admissions, and Health Heritage datasets, respectively. We described the details of the experiment in App. B above. Confirming the conclusions drawn in the main part of this manuscript, we observe that TabLeak achieves non-trivial reconstruction accuracy across all settings, even for large numbers of local updates, while the baseline attack often fails to outperform random guessing when the number of local updates is increased.

TabLeak Cosine
#batches 1 epoch 5 epochs 10 epochs 1 epoch 5 epochs 10 epochs
1
2
4
Table 10: Mean and standard deviation of the inversion accuracy [%] with a fixed local dataset size in FedAvg training on the Adult dataset.
TabLeak Cosine
#batches 1 epoch 5 epochs 10 epochs 1 epoch 5 epochs 10 epochs
1
2
4
Table 11: Mean and standard deviation of the inversion accuracy [%] with a fixed local dataset size in FedAvg training on the German Credit dataset.
TabLeak Cosine
#batches 1 epoch 5 epochs 10 epochs 1 epoch 5 epochs 10 epochs
1
2
4
Table 12: Mean and standard deviation of the inversion accuracy [%] with a fixed local dataset size in FedAvg training on the Lawschool Admissions dataset.
TabLeak Cosine
#batches 1 epoch 5 epochs 10 epochs 1 epoch 5 epochs 10 epochs
1
2
4
Table 13: Mean and standard deviation of the inversion accuracy [%] with a fixed local dataset size in FedAvg training on the Health Heritage dataset.

D.4 Full Results on Entropy on all Datasets

In Tab. 14, Tab. 15, Tab. 16, and Tab. 17 we provide the mean and standard deviation of the reconstruction accuracy and the entropy of the continuous and the categorical features over increasing batch size when attacking with TabLeak on the four datasets. In support of Sec. 4, on all datasets we observe a trend of increasing entropy and decreasing reconstruction accuracy as the batch size is increased, providing the attacker with a signal about their overall reconstruction success.

To generalize our results on the local information contained in the entropy, we show the mean reconstruction accuracy of both the discrete and the continuous features when bucketing them based on their entropy in a batch of size 128 in Tab. 18, Tab. 19, Tab. 20, and Tab. 21 for all four datasets, respectively. We can see that with the help of this bucketing, we can identify subsets of the reconstructed features that have been retrieved with a (sometimes significantly, e.g., up to 24%) higher accuracy than the overall batch.

Discrete Continuous
Accuracy Entropy Accuracy Entropy
1
2
4
8
16
32
64
128
Table 14: The mean accuracy [%] and entropies with the corresponding standard deviations over batch sizes of the categorical and the continuous features on the Adult dataset.
Discrete Continuous
Accuracy Entropy Accuracy Entropy
1
2
4
8
16
32
64
128
Table 15: The mean accuracy [%] and entropies with the corresponding standard deviations over batch sizes of the categorical and the continuous features on the German Credit dataset.
Discrete Continuous
Accuracy Entropy Accuracy Entropy
1
2
4
8
16
32
64
128
Table 16: The mean accuracy [%] and entropies with the corresponding standard deviations over batch sizes of the categorical and the continuous features on the Lawschool Admissions dataset.
Discrete Continuous
Accuracy Entropy Accuracy Entropy
1
2
4
8
16
32
64