1 Introduction
Federated Learning (McMahan et al., 2017)
(FL) has emerged as the most prominent approach to training machine learning models collaboratively without requiring sensitive data of different parties to be sent to a single centralized location. While prior work has examined privacy leakage in federated learning in the context of computer vision
(Zhu et al., 2019; Geiping et al., 2020; Yin et al., 2021) and natural language processing
(Dimitrov et al., 2022a; Gupta et al., 2022; Deng et al., 2021), many applications of FL rely on large tabular datasets that include highly sensitive personal data such as financial information and health status (Borisov et al., 2021; Rieke et al., 2020; Long et al., 2021). However, no prior work has studied the issue of privacy leakage in the context of tabular data, a cause of concern for public institutions, which have recently launched a competition (https://petsprizechallenges.com/) with a 1.6 million USD prize to develop privacy-preserving FL solutions for fraud detection and infection risk prediction, both being tabular datasets.

Key challenges
Leakage attacks often rely on solving optimization problems whose solutions are the desired sensitive data points. Unlike other data types, tabular data poses unique challenges to solving these problems because: (i) the reconstruction is a solution to a mixed discrete-continuous optimization problem, in contrast to other domains where the problem is fully continuous (pixels for images and embeddings for text), (ii) there is high variance in the final reconstructions because, uniquely to tabular data, discrete changes in the categorical features significantly change the optimization trajectory, and (iii) assessing the quality of reconstructions is harder compared to images and text, e.g., determining whether a person with given reconstructed characteristics exists is difficult. Together, these challenges imply that it is difficult to make existing attacks work on tabular data.
This work
In this work, we propose the first comprehensive leakage attack on tabular data in the FL setting, addressing the previously mentioned challenges. We provide an overview of our approach in Fig. 1, showing the reconstruction of a client's private training data point from the corresponding training update received by the server. In Step 1, we create several separate optimization problems, each assigning different initial values to the optimization variables, which represent our reconstruction of the client's one-hot encoded data. To address the first challenge of tabular data leakage, we transform the mixed discrete-continuous optimization problem into a fully continuous one by passing our current reconstructions through a per-feature softmax at every step. Using the softmaxed data, we take a gradient step to minimize the reconstruction loss, which compares the received client update with a simulated client update. In Step 2, we reduce the variance of the final reconstruction by performing pooling over the different solutions, thus tackling the second challenge. In Step 3, we address the challenge of assessing the fidelity of our reconstructions. We rely on the observation that, often, when our proposed reconstructions agree they also match the true client data. We measure the agreement using entropy. In the example above, we see that the features sex and age produced a low-entropy distribution; therefore, we assign high confidence to these results (green arrows). In contrast, the reconstruction of the feature race receives a low confidence rating (orange arrow); rightfully so, as the reconstruction is incorrect.

We implemented our approach in an end-to-end attack called TabLeak and evaluated it on several tabular datasets. Our attack is highly effective: it can obtain non-trivial reconstructions for batch sizes as large as 128, and on many practically relevant batch sizes such as 32, it improved reconstruction accuracy by up to 10% compared to the baseline. Overall, our findings show that FL is highly vulnerable when applied to tabular data.
Main contributions
Our main contributions are:

Novel insights enabling efficient attacks on FL with tabular data: using softmax to make the optimization problem fully continuous, ensembling to reduce the variance, and entropy to assess the reconstructions.

An implementation of our approach in an end-to-end tool called TabLeak.

Extensive experimental evaluation, demonstrating the effectiveness of TabLeak at reconstructing sensitive client data on several popular tabular datasets.
2 Background and Related Work
In this section, we provide the necessary technical background for our work, introduce the notation used throughout the paper, and present the related work in this field.
Federated Learning
Federated Learning (FL) is a training protocol developed to facilitate the distributed training of a parametric model while preserving the privacy of the data at source
(McMahan et al., 2017). Formally, we have a parametric function $f_\theta \colon \mathcal{X} \to \mathcal{Y}$, where $\theta$ are the (network) parameters. Given a dataset $\mathcal{D} = \bigcup_{i=1}^{m} \mathcal{D}_i$ as the union of the private datasets of $m$ clients, we now wish to find a $\theta^\ast$ such that:

$$\theta^\ast = \arg\min_{\theta} \sum_{i=1}^{m} \sum_{(x, y) \in \mathcal{D}_i} \mathcal{L}\big(f_\theta(x), y\big) \qquad (1)$$
in a distributed manner, i.e., without collecting the dataset in a central database. McMahan et al. (2017) propose two training algorithms: FedSGD (a similar algorithm was also proposed by Shokri and Shmatikov (2015)) and FedAvg, that allow for the distributed training of $f_\theta$, while keeping the data partitions
at client sources. The two protocols differ in how the clients compute their local updates in each step of training. In FedSGD, each client calculates the update gradient with respect to a randomly selected batch of their own data and shares the resulting gradient with the server. In FedAvg, the clients conduct a few epochs of local training on their own data before sharing their resulting parameters with the server. After the server has received the gradients/parameters from the clients, it aggregates them, updates the model, and broadcasts it to the clients. In each case, this process is repeated until convergence, where FedAvg usually requires fewer rounds of communication.
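For illustration, one FedSGD round can be sketched as follows. This is a minimal simulation with a hypothetical linear model and squared-error loss, not the paper's setup: each client computes a gradient on its private batch, and the server averages the gradients and takes one global step.

```python
import numpy as np

def local_gradient(theta, X, y):
    """Client side: gradient of the mean squared error of a linear model."""
    residual = X @ theta - y
    return 2 * X.T @ residual / len(y)

def fedsgd_round(theta, client_data, lr=0.1):
    """Server side: average the client gradients and take one global step."""
    grads = [local_gradient(theta, X, y) for X, y in client_data]
    return theta - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):  # three clients, each holding a private batch
    X = rng.normal(size=(8, 3))
    clients.append((X, X @ theta_true))

theta = np.zeros(3)
for _ in range(500):
    theta = fedsgd_round(theta, clients)
# theta now approximates theta_true without any client sharing raw data
```

Note that the clients only ever transmit gradients; it is exactly these shared gradients that the leakage attacks discussed next exploit.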
Data Leakage Attacks
Although FL was designed with the goal of preserving the privacy of clients’ data, recent work has uncovered substantial vulnerabilities. Melis et al. (2019) first presented how one can infer certain properties of the clients’ private data in FL. Later, Zhu et al. (2019) demonstrated that an honest-but-curious server can use the current state of the model and the client gradients to reconstruct the clients’ data, breaking the main privacy promise of Federated Learning. Under this threat model, there has been extensive research on designing tailored attacks for images (Geiping et al., 2020; Geng et al., 2021; Huang et al., 2021; Jin et al., 2021; Balunovic et al., 2022; Yin et al., 2021; Zhao et al., 2020; Jeon et al., 2021; Dimitrov et al., 2022b) and natural language (Deng et al., 2021; Dimitrov et al., 2022a; Gupta et al., 2022). However, no prior work has comprehensively dealt with tabular data, despite its significance in real-world high-stakes applications (Borisov et al., 2021). Some works also consider a threat scenario where the malicious server is allowed to change the model or the updates communicated to the clients (Wen et al., 2022; Fowl et al., 2022); but in this work we focus on the honest-but-curious setting.
In training with FedSGD, given the model $f_\theta$ at some iteration and the shared gradient $g$ of some client, we solve the following optimization problem to retrieve the client’s private data:

$$\hat{x} = \arg\min_{x} \; \mathcal{E}\big(g, \, \nabla_\theta \mathcal{L}(f_\theta(x), y)\big) + \alpha \, \mathcal{R}(x) \qquad (2)$$

where in Eq. 2 we denote the gradient matching loss as $\mathcal{E}$ and $\mathcal{R}$ is an optional regularizer for the reconstruction. The work of Zhu et al. (2019) used the mean squared error for $\mathcal{E}$, on which Geiping et al. (2020) improved using the cosine similarity loss.
Zhao et al. (2020) first demonstrated that the private labels can be estimated before solving Eq. 2, reducing the complexity of Eq. 2 and improving the attack results. Their method was later extended to batches by Yin et al. (2021) and refined by Geng et al. (2021). Eq. 2 is typically solved using continuous optimization tools such as L-BFGS (Liu and Nocedal, 1989) and Adam (Kingma and Ba, 2014). Although analytical approaches exist, they do not generalize to batches with more than a single data point (Zhu and Blaschko, 2021). Depending on the data domain, distinct tailored alterations to Eq. 2 have been proposed in the literature, e.g., using the total variation regularizer for images (Geiping et al., 2020) and exploiting pre-trained language models in language tasks (Dimitrov et al., 2022a; Gupta et al., 2022). These mostly non-transferable domain-specific solutions are necessary, as each domain poses unique challenges. Our work is the first to identify and tackle the key challenges to data leakage in the tabular domain.

Mixed Type Tabular Data
Mixed type tabular data is commonly used in the health, economic, and social sciences, which entail high-stakes privacy-critical applications (Borisov et al., 2021). Here, data is collected in a table of feature columns, mostly human-interpretable, e.g., the age, nationality, and occupation of an individual. We formalize tabular data as follows. Let $x$ be one line of data, containing $K$ discrete (categorical) features and $L$ continuous (numerical) features, where the $i$-th discrete feature takes values in a finite set of cardinality $C_i$. For the purpose of deep neural network training, the categorical features are often encoded in a numerical vector. We denote the encoded data batch or line as $x_{\text{enc}}$, where we preserve the continuous features and encode the categorical features by a one-hot encoding. The one-hot encoding of the $i$-th discrete feature is a vector of length $C_i$ that has a one at the position marking the encoded category, while all other entries are zeros. We obtain the represented category by taking the argmax of this vector (projection to obtain $x$ from $x_{\text{enc}}$). Using the described encoding, one line of data translates to a vector containing $\sum_{i=1}^{K} C_i + L$ entries.

3 Tabular Leakage
In this section, we briefly summarize the key challenges in tabular leakage, present our solutions to these challenges in the subsequent subsections, and describe our end-to-end attack.
Key Challenges
We now list the three key challenges that we address in our work: (i) the presence of both categorical and continuous features in tabular data requires the attacker to solve a significantly harder mixed discrete-continuous optimization problem (addressed in Sec. 3.1), (ii) the large distance between the encodings of the categorical features introduces high variance into the leakage problem (addressed in Sec. 3.2), and (iii) in contrast to images and text, it is hard for an adversary to assess the quality of the reconstructed data in the tabular domain, as most reconstructions may be projected to credible input data points (we address this via an uncertainty quantification scheme in Sec. 3.3).
3.1 The Softmax Structural Prior
We now discuss our solution to challenge (i): we introduce the softmax structural prior, which turns the hard mixed discrete-continuous optimization problem into a fully continuous one. This drastically reduces its complexity, while still facilitating the recovery of correct discrete structures.
To start, notice that the recovery of one-hot encodings can be enforced by ensuring that all entries of the recovered vector are either zero or one, and exactly one of the entries equals one. However, these constraints enforce integer properties, i.e., they are non-differentiable and cannot be used in combination with the powerful continuous optimization tools used for gradient leakage attacks. Relaxing the integer constraint by allowing the reconstructed entries to take real values in $[0, 1]$, we are still left with a constrained optimization problem not well suited for popular continuous optimization tools, such as Adam (Kingma and Ba, 2014). Therefore, we are looking for a method that implicitly enforces the constraints introduced above.
Let $z \in \mathbb{R}^{C_i}$ be our approximate intermediate solution for the true one-hot encoded feature at some optimization step. Then, we are looking for a differentiable function $f \colon \mathbb{R}^{C_i} \to \mathbb{R}^{C_i}$ such that:

$$f(z)_j \in [0, 1] \;\; \forall j \qquad \text{and} \qquad \sum_{j=1}^{C_i} f(z)_j = 1 \qquad (3)$$

Notice that the two conditions in Eq. 3 can be fulfilled by applying a softmax to $z$, i.e., define:

$$f(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{C_i} e^{z_k}} \qquad (4)$$
Note that it is easy to show that Eq. 4 fulfills both conditions in Eq. 3 and that it is differentiable. Putting this together, in each round of optimization we approximate the true data point by keeping the continuous features as-is and passing the categorical entries through the softmax. In order to preserve notational simplicity, we write $\text{softmax}(z)$ to mean the application of the softmax to each group of entries representing a given categorical feature separately.
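The per-group softmax described above admits a short sketch. The feature layout below (two categorical groups and one continuous column) is purely illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def apply_structural_prior(z, cat_groups, num_cont):
    """Map a free reconstruction vector z into the relaxed one-hot domain.

    cat_groups: sizes C_i of the one-hot groups occupying the leading
    entries of z; the trailing num_cont entries are continuous features
    and pass through unchanged.
    """
    out, start = [], 0
    for size in cat_groups:
        out.append(softmax(z[start:start + size]))
        start += size
    out.append(z[start:start + num_cont])
    return np.concatenate(out)

# two categorical features (3 and 2 categories) plus one continuous feature
z = np.array([2.0, -1.0, 0.5, 4.0, 4.0, 0.73])
x = apply_structural_prior(z, cat_groups=[3, 2], num_cont=1)
```

Each categorical group of `x` now lies on the probability simplex, so projecting it back to a concrete category is a simple argmax per group.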
3.2 Pooled Ensembling
As mentioned earlier, the mix of categorical and continuous features introduces further variance into the difficult reconstruction problem, which already has multiple local minima and high sensitivity to initialization (Zhu and Blaschko, 2021) (challenge (ii)). Concretely, as the one-hot encodings of the categorical features are orthogonal to each other, a change in the encoded category can drastically change the optimization trajectory. We alleviate this problem by adapting an established method of variance reduction in noisy processes (Hastie et al., 2009), i.e., we run independent optimization processes with different initializations and ensemble their results through feature-wise pooling.
Note that the features in tabular data are tied to a certain position in the recovered data vector; thereby, we can combine independent reconstructions to obtain an improved and more robust final estimate of the true data by applying feature-wise pooling. Formally, we run $N$ independent rounds of optimization with different initializations, recovering potentially different reconstructions $\hat{x}^{(1)}, \dots, \hat{x}^{(N)}$. Then, we obtain a final estimate of the true encoded data, denoted as $\bar{x}$, by pooling them:

$$\bar{x} = \text{pool}\big(\hat{x}^{(1)}, \dots, \hat{x}^{(N)}\big) \qquad (5)$$

where the $\text{pool}$ operation can be any permutation-invariant mapping that maps to the same structure as its inputs. In our attack, we use median pooling for both continuous and categorical features.
Notice that because a batch gradient is invariant to permutations of the data points in the corresponding batch, when reconstructing from such a gradient we may retrieve the batch points in a different order in every optimization instance. Hence, we need to reorder each batch such that their lines match each other; only then can we conduct the pooling. We reorder by first selecting the sample that produced the best reconstruction loss at the end of optimization, together with its projection. Then, we match the lines of every other sample in the collection with respect to this best sample. Concretely, we calculate the similarity (described in detail in Sec. 4) between each pair of lines of the best sample and another sample in the collection, and find the maximum-similarity reordering of the lines with the help of bipartite matching, solved by the Hungarian algorithm (Kuhn, 1955). This process is depicted in Fig. 2. Repeating this for each sample, we reorder the entire collection with respect to the best-loss sample, effectively reversing the permutation differences in the independent reconstructions. After this process, we can directly apply feature-wise pooling for each line over the collection.
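The reorder-then-pool step can be sketched as follows. This is a simplified illustration on fully categorical binary rows, using per-entry agreement as the similarity; the paper's actual similarity measure is described in Sec. 4:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def reorder(best, other):
    """Permute the rows of `other` to best match the rows of `best`,
    via maximum-similarity bipartite matching (Hungarian algorithm)."""
    # similarity = fraction of agreeing entries between each pair of rows
    sim = (best[:, None, :] == other[None, :, :]).mean(axis=2)
    _, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    return other[cols]

def ensemble(samples, best_idx):
    """Align every reconstruction to the best-loss one, then median-pool."""
    best = samples[best_idx]
    aligned = np.stack([reorder(best, s) for s in samples])
    return np.median(aligned, axis=0)  # feature-wise pooling

# three reconstructions of the same 3-row batch, in different row orders
a = np.array([[0, 1, 1], [1, 0, 0], [1, 1, 0]])
b = a[[2, 0, 1]]             # same rows, permuted
c = a[[1, 2, 0]].copy()
c[0, 0] = 0                  # one noisy entry in the third reconstruction
pooled = ensemble([a, b, c], best_idx=0)
```

In this toy example the median pooling votes out the single noisy entry, so `pooled` recovers the batch exactly, which is the variance-reduction effect the ensembling aims for.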
3.3 Entropybased Uncertainty Estimation
We now address challenge (iii) above. To recap, it is significantly harder for an adversary to assess the quality of an obtained reconstruction when it comes to tabular data, as almost any reconstruction may constitute a credible data point when projected back to the mixed discrete-continuous space. Note that this challenge does not arise as prominently in the image (or text) domain, because by looking at a picture one can easily judge if it is just noise or an actual image. To address this issue, we propose to estimate the reconstruction uncertainty by looking at the level of agreement over a certain feature between different reconstructions. Concretely, given a collection of reconstructions as in Sec. 3.2, we can observe the distribution of each feature over the reconstructions. Intuitively, if this distribution is "peaky", i.e., it concentrates the mass heavily on a certain value, then we can assume that the feature has been reconstructed correctly, whereas if there is high disagreement between the reconstructed samples, we can assume that this feature’s recovered final value should not be trusted. We can quantify this by measuring the entropy of the feature distributions induced by the recovered samples.
Categorical Features
Let $p_j$ be the relative frequency of value $j$ among the projected reconstructions of the $i$-th discrete feature in the ensemble. Then, we can calculate the normalized entropy of the feature as $\bar{H}_i = -\frac{1}{\log C_i} \sum_{j} p_j \log p_j$. Note that the normalization allows for comparing features with supports of different size, i.e., it ensures that $\bar{H}_i \in [0, 1]$, as the entropy of any discrete random variable of finite support is at most the logarithm of its support size.

Continuous Features
In case of the continuous features, we calculate the entropy assuming that the errors of the reconstructed samples follow a Gaussian distribution. As such, we first estimate the sample variance $\hat{\sigma}_i^2$ of the $i$-th continuous feature and then plug it in to calculate the entropy of the corresponding Gaussian: $H_i = \frac{1}{2} \log(2 \pi e \hat{\sigma}_i^2)$. As this approach is universal over all continuous features, it is enough to simply scale the features themselves to make their entropies comparable. For example, this can be achieved by working only with standardized features.

Note that as the categorical and the continuous features are fundamentally different from an information-theoretic perspective, we have no robust means to combine them in a way that would allow for equal treatment. Therefore, when assessing the credibility of recovered features, we will always distinguish between categorical and continuous features.
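Both entropy measures admit a direct implementation. The following sketch uses hypothetical ensemble values; the categorical entropy is normalized by the log of the feature cardinality, the continuous one is the differential entropy of a fitted Gaussian:

```python
import numpy as np

def categorical_entropy(values, cardinality):
    """Normalized entropy of a categorical feature over ensemble reconstructions."""
    counts = np.bincount(values, minlength=cardinality)
    p = counts[counts > 0] / len(values)  # relative frequencies of observed values
    return float(-(p * np.log(p)).sum() / np.log(cardinality))

def continuous_entropy(values):
    """Entropy of a Gaussian fitted to the ensemble reconstructions."""
    var = np.var(values, ddof=1)  # sample variance
    return float(0.5 * np.log(2 * np.pi * np.e * var))

# full agreement across the ensemble -> entropy 0 -> high confidence
agree = categorical_entropy(np.array([2, 2, 2, 2]), cardinality=5)
# maximal disagreement -> entropy close to 1 -> low confidence
disagree = categorical_entropy(np.array([0, 1, 2, 3]), cardinality=5)
```

The normalization by `log(cardinality)` is what makes entropies of features with different numbers of categories comparable; for continuous features, comparability instead comes from standardizing the features beforehand.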
3.4 Combined Attack
Now we provide the description of our end-to-end attack, TabLeak. Following Geiping et al. (2020), we use the cosine similarity loss as our reconstruction loss, defined as:

$$\mathcal{E}(x) = 1 - \frac{\big\langle \nabla_\theta \mathcal{L}(f_\theta(x), \hat{y}), \; \nabla_\theta \mathcal{L}(f_\theta(x^\ast), y^\ast) \big\rangle}{\big\| \nabla_\theta \mathcal{L}(f_\theta(x), \hat{y}) \big\|_2 \, \big\| \nabla_\theta \mathcal{L}(f_\theta(x^\ast), y^\ast) \big\|_2} \qquad (6)$$

where $(x^\ast, y^\ast)$ are the true data and labels, $\hat{y}$ are the labels reconstructed beforehand, and we optimize for $x$. Our algorithm is shown in Alg. 1. First, we reconstruct the labels using the label reconstruction method of Geng et al. (2021) and provide them as an input to our attack. Then, we initialize $N$ independent dummy samples for an ensemble of size $N$ (line 12). Starting from each initial sample, we optimize independently (lines 13-15) via the SingleInversion function. In each optimization step, we apply the softmax structural prior of Sec. 3.1 and let the optimizer differentiate through it (line 4). After the optimization processes have converged or have reached the maximum number of allowed iterations, we identify the sample producing the best reconstruction loss (line 16). Using this sample, we match and median-pool to obtain the final encoded reconstruction in line 17, as described in Sec. 3.2. Finally, we return the projected reconstruction and the corresponding feature entropies, quantifying the uncertainty in the reconstruction.
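The reconstruction loss of Eq. 6 reduces to one minus the cosine similarity between the flattened simulated and received gradients. A minimal numpy sketch with hypothetical gradient values:

```python
import numpy as np

def cosine_reconstruction_loss(sim_grads, true_grads):
    """1 - cosine similarity between simulated and received client gradients.

    Both arguments are lists of per-parameter gradient arrays, which are
    flattened and concatenated before comparison.
    """
    g1 = np.concatenate([g.ravel() for g in sim_grads])
    g2 = np.concatenate([g.ravel() for g in true_grads])
    return 1.0 - float(g1 @ g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))

# hypothetical per-layer gradients of a two-parameter-group network
g_true = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.5, -0.5])]
g_same = [2.0 * g for g in g_true]   # same direction, different scale
g_opp = [-g for g in g_true]         # opposite direction

loss_same = cosine_reconstruction_loss(g_same, g_true)
loss_opp = cosine_reconstruction_loss(g_opp, g_true)
```

Note that the loss is invariant to the overall gradient scale and only penalizes the direction mismatch, which is a key property of the cosine formulation compared to the mean squared error.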
4 Experimental evaluation
In this section, we first detail the evaluation metric we used to assess the obtained reconstructions. We then briefly explain our experimental setup. Next, we evaluate our attack in various settings against baseline methods, establishing a new state-of-the-art. Finally, we demonstrate the effectiveness of our entropy-based uncertainty quantification method.
Evaluation Metric
As no prior work on tabular data leakage exists, we propose a metric for measuring the accuracy of tabular reconstructions, inspired by the 0-1 loss and allowing the joint treatment of categorical and continuous features. For a reconstruction $\hat{x}$, we define the accuracy metric as:

$$\text{acc}(\hat{x}, x) = \frac{1}{K + L} \left( \sum_{i=1}^{K} \mathbb{1}\big[\hat{x}_i = x_i\big] + \sum_{j=1}^{L} \mathbb{1}\big[\,|\hat{x}_j - x_j| < \epsilon_j\big] \right) \qquad (7)$$

where $x$ is the ground truth and the $\epsilon_j$ are constants determining how close the reconstructed continuous features have to be to the original value in order to be considered successfully leaked. We provide more details on our metric in App. A and experiments with additional metrics in Sec. C.3.
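The metric can be sketched as follows; the feature layout and tolerances below are hypothetical:

```python
import numpy as np

def tabular_accuracy(rec, true, cat_mask, eps):
    """0-1-style accuracy: exact match for categorical features,
    match within a per-feature tolerance eps for continuous ones."""
    cat_hits = rec[cat_mask] == true[cat_mask]
    cont_hits = np.abs(rec[~cat_mask] - true[~cat_mask]) < eps[~cat_mask]
    return (cat_hits.sum() + cont_hits.sum()) / len(true)

# two (already projected) categorical features, then two continuous ones
true = np.array([1.0, 0.0, 37.0, 4500.0])
rec = np.array([1.0, 2.0, 36.5, 9000.0])
cat_mask = np.array([True, True, False, False])
eps = np.array([0.0, 0.0, 2.0, 100.0])  # tolerances for the continuous columns
acc = tabular_accuracy(rec, true, cat_mask, eps)
```

Here one categorical feature matches exactly and one continuous feature lands within its tolerance, so half of the features count as successfully leaked.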
Baselines
We consider two main baselines. (i) The Random Baseline does not use the gradient updates and simply samples reconstructions from the per-feature marginals of the input dataset. Due to the structure of tabular datasets, we can easily estimate the marginal distribution of each feature: for the categorical features this can be done by simple counting, and for the continuous features by defining a binning scheme with equally spaced bins between the lower and upper bounds of the feature. Although this baseline is usually not realizable in practice (as it assumes prior knowledge of the marginals), it helps us calibrate our metric, as performing below this baseline signals that no information is being extracted from the client updates. Note that because both the selection of a batch and the random baseline represent sampling from the (approximate) data-generating distribution, the random baseline monotonically approaches perfect accuracy with increasing batch size. (ii) The Cosine Baseline is based on the work of Geiping et al. (2020), who established a strong attack for images. We transfer their attack to tabular data by removing the total variation prior used for images. Note that most competitive attacks on images and text reduce to this baseline when their domain-specific elements are removed; therefore, it is a reasonable choice for benchmarking a new domain.
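For illustration, the Random Baseline can be sketched as follows; the dataset, column layout, and bin count are hypothetical:

```python
import numpy as np

def fit_marginals(data, cat_mask, n_bins=10):
    """Estimate per-feature marginals: value counts for categorical columns,
    equal-width histograms for continuous ones."""
    marginals = []
    for j in range(data.shape[1]):
        col = data[:, j]
        if cat_mask[j]:
            vals, counts = np.unique(col, return_counts=True)
            marginals.append(("cat", vals, counts / counts.sum()))
        else:
            counts, edges = np.histogram(col, bins=n_bins)
            marginals.append(("cont", edges, counts / counts.sum()))
    return marginals

def sample_baseline(marginals, n, rng):
    """Draw a 'reconstruction' of n rows from the estimated marginals."""
    cols = []
    for kind, support, p in marginals:
        if kind == "cat":
            cols.append(rng.choice(support, size=n, p=p))
        else:
            bins = rng.choice(len(p), size=n, p=p)
            # sample uniformly within the chosen histogram bin
            cols.append(rng.uniform(support[bins], support[bins + 1]))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
# one categorical column (3 categories) and one continuous column
data = np.stack([rng.integers(0, 3, 1000), rng.normal(50, 10, 1000)], axis=1)
cat_mask = [True, False]
guess = sample_baseline(fit_marginals(data, cat_mask), n=5, rng=rng)
```

Each sampled row is statistically plausible but carries no information from the client update, which is exactly what makes this baseline a useful floor for the accuracy metric.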
Experimental Setup
For all attacks, we use the Adam optimizer (Kingma and Ba, 2014) without a learning rate schedule to perform the optimization in Alg. 1. In line with Geiping et al. (2020), we modify the update step of the optimizer by reducing the update gradient to its element-wise sign. The neural network we attack is a fully connected neural network with two hidden layers. We conducted our experiments on four popular mixed-type tabular binary classification datasets: the Adult census dataset (Dua and Graff, 2017), the German Credit dataset (Dua and Graff, 2017), the Lawschool Admission dataset (Wightman, 2017), and the Health Heritage dataset from Kaggle (source: https://www.kaggle.com/c/hhp). Due to space constraints, here we report only our results on the Adult dataset, and refer the reader to App. D for full results on all four datasets. Finally, for all reported numbers below, we attack a neural network at initialization and estimate the mean and standard deviation of each reported metric over several different batches. For experiments with varying network sizes and attacks against provable defenses, please see App. C. For further details on the experimental setup of each experiment, we refer the reader to App. B.

General Results against FedSGD
In Tab. 1 we present the results of our strong attack TabLeak against FedSGD training, together with two ablation experiments, each time removing either the pooling (no pooling) or the softmax component (no softmax). We compare our results to the baselines introduced above over a range of batch sizes, once assuming knowledge of the true labels (top) and once using labels reconstructed by the method of Geng et al. (2021) (bottom). Notice that the noisy label reconstruction only influences the results for lower batch sizes, and manifests itself mostly in a higher variance of the results. It is also worth noting that for small batch sizes (see App. D), all attacks can recover almost all the data, exposing a trivial vulnerability of FL on tabular data. In case of larger batch sizes, even up to 128, TabLeak can recover a significant portion of the client’s private data, well above random guessing, while the baseline Cosine attack fails to do so, demonstrating the necessity of a domain-tailored attack. In a later paragraph, we show how we can further improve our reconstruction on this batch size and extract subsets of features with high accuracy using the entropy. Further, the results of the ablation attacks demonstrate the effectiveness of each attack component, both providing a non-trivial improvement over the baseline attack that is preserved when combined in our strongest attack. Demonstrating generalization beyond Adult, we include our results on the German Credit, Lawschool Admissions, and Health Heritage datasets in Sec. D.1, where we also clearly outperform the Cosine baseline attack on each dataset.
[Table 1: reconstruction accuracy of TabLeak, TabLeak (no pooling), TabLeak (no softmax), and the Cosine and Random baselines on Adult, over varying batch sizes, with true (top) and reconstructed (bottom) labels.]
Categorical vs. Continuous Features
An interesting effect of having mixed-type features in the data is that the reconstruction success clearly differs by feature type. As we can observe in Fig. 3, the continuous features produce a notably lower accuracy than the categorical features for the same batch size. We suggest that this is due to the nature of categorical features and how they are encoded. While trying to match the gradients by optimizing the reconstruction, having the correct categorical features will have a much greater effect on the gradient alignment, as, when encoded, they take up the majority of the data vector. Also, when reconstructing a one-hot encoded categorical feature, we only have to retrieve the location of the maximum in the encoded vector, whereas for the successful reconstruction of a continuous feature we have to retrieve its value correctly up to a small error. Therefore, especially when the optimization process is aware of the structure of the encoding scheme (e.g., by using the softmax structural prior), categorical features are much easier to reconstruct. This poses a critical privacy risk in tabular federated learning, as sensitive features are often categorical, e.g., gender or race.
Federated Averaging
In training with FedAvg (McMahan et al., 2017), participating clients conduct local training for several updates before communicating their new parameters to the server. Note that the more local updates are conducted by the clients, the harder a reconstruction attack becomes, making leakage attacks against FedAvg more challenging. Although this training method is of significantly higher practical importance than FedSGD, most prior work does not evaluate against it. Building upon the work of Dimitrov et al. (2022b) (for details please see App. B and the work of Dimitrov et al. (2022b)), we evaluate our combined attack and the cosine baseline in the setting of Federated Averaging. We present our results of retrieving a client dataset over a varying number of local batches and epochs on the Adult dataset in Tab. 2, while assuming full knowledge of the true labels. We observe that while our combined attack significantly outperforms the random baseline even for many local updates, the baseline attack fails to consistently do so whenever the local training is longer than one epoch. As FedAvg with tabular data is of high practical relevance, our results highlighting its vulnerability are concerning. We show further details of the experimental setup and results on other datasets in App. B and App. D, respectively.
[Table 2: FedAvg reconstruction accuracy of TabLeak and the Cosine baseline on Adult, over 1, 2, and 4 local batches and 1, 5, and 10 local epochs.]
[Table 3: reconstruction accuracy and entropy over batch sizes 8, 16, 32, 64, and 128.]
Assessing Reconstructions via Entropy
[Table 4: reconstruction accuracy and share of data of the categorical features per entropy bucket (0.0-0.2 through 0.8-1.0), together with the overall and random baseline accuracies.]
We now investigate how an adversary can use the entropy (introduced in Sec. 3.3) to assess the quality of their reconstructions. In Tab. 3 we show the mean and standard deviation of the accuracy and the entropy of both the discrete and the continuous features over increasing batch sizes after reconstructing with TabLeak. We observe an increase in the mean entropy over the increasing batch sizes, corresponding to the accuracy decrease in the reconstructed batches. Hence, an attacker can understand the global effectiveness of their attack by looking at the retrieved entropies, without having to compare their results to the ground truth.
We now look at a single batch and put each categorical feature into a bucket based on its reconstruction entropy after attacking with TabLeak. In Tab. 4 we present our results, showing that features falling into the lower entropy buckets (0.0-0.2 and 0.2-0.4) inside a batch are reconstructed significantly more accurately than the overall batch. Note that this bucketing can be done without knowledge of the ground truth, yet the adversary can concretely identify the high-fidelity features in their noisy reconstruction. This shows that even for reconstructions of large batches that seem to contain little-to-no information (close to the random baseline), an adversary can still extract subsets of the data with high accuracy. Tables containing both feature types on all four datasets can be found in Sec. D.4, providing analogous conclusions.

5 Conclusion
In this work we presented TabLeak, the first data leakage attack on tabular data in the setting of federated learning (FL), obtaining state-of-the-art results against both popular FL training protocols in the tabular domain. As tabular data is ubiquitous in privacy-critical, high-stakes applications, our results raise important concerns regarding practical systems currently using FL. Therefore, we advocate for further research on the defenses necessary to mitigate such privacy leaks.
6 Ethics Statement
As tabular data is often used in high-stakes applications and may contain sensitive data of natural or legal persons, confidential treatment is critical. This work presents an attack algorithm in the tabular data domain that enables an FL server to steal the private data of its clients in industry-relevant scenarios, rendering such applications potentially unsafe.
We believe that exposing vulnerabilities of both recently proposed and widely adopted systems, where privacy is a concern, can benefit the development of adequate safety mechanisms against malicious actors. In particular, this view is shared by the governmental institutions of the United States of America and the United Kingdom that jointly supported the launch of a competition (https://petsprizechallenges.com/) aimed at advancing the privacy of FL in the tabular domain, encouraging the participation of both teams developing defenses and attacks. Also, as our experiments in Sec. C.1 show, existing techniques can help mitigate the privacy threat, hence we encourage practitioners to make use of them.
References
 Abadi et al. (2016) M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 308–318. [Online]. Available: https://doi.org/10.1145/2976749.2978318
 Balunovic et al. (2022) M. Balunovic, D. I. Dimitrov, R. Staab, and M. Vechev, “Bayesian framework for gradient leakage,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=f2lrIbGx3x7
 Borisov et al. (2021) V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci, “Deep neural networks and tabular data: A survey,” CoRR, vol. abs/2110.01889, 2021. [Online]. Available: https://arxiv.org/abs/2110.01889
 Deng et al. (2021) J. Deng, Y. Wang, J. Li, C. Wang, C. Shang, H. Liu, S. Rajasekaran, and C. Ding, “Tag: Gradient attack on transformer-based language models,” in EMNLP (Findings), 2021, pp. 3600–3610. [Online]. Available: https://aclanthology.org/2021.findings-emnlp.305
 Dimitrov et al. (2022a) D. I. Dimitrov, M. Balunović, N. Jovanović, and M. Vechev, “Lamp: Extracting text from gradients with language model priors,” 2022.
 Dimitrov et al. (2022b) D. I. Dimitrov, M. Balunović, N. Konstantinov, and M. Vechev, “Data leakage in federated averaging,” 2022. [Online]. Available: https://arxiv.org/abs/2206.12395
 Dua and Graff (2017) D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
 Fowl et al. (2022) L. H. Fowl, J. Geiping, W. Czaja, M. Goldblum, and T. Goldstein, “Robbing the fed: Directly obtaining private data in federated learning with modified models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=fwzUgo0FM9v
 Geiping et al. (2020) J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, “Inverting gradients - how easy is it to break privacy in federated learning?” pp. 16 937–16 947, 2020.
 Geng et al. (2021) J. Geng, Y. Mou, F. Li, Q. Li, O. Beyan, S. Decker, and C. Rong, “Towards general deep leakage in federated learning,” 2021.
 Gupta et al. (2022) S. Gupta, Y. Huang, Z. Zhong, T. Gao, K. Li, and D. Chen, “Recovering private text in federated learning of language models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.08514
 Hastie et al. (2009) T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ser. Springer series in statistics. Springer, 2009. [Online]. Available: https://books.google.ch/books?id=eBSgoAEACAAJ
 Huang et al. (2021) Y. Huang, S. Gupta, Z. Song, K. Li, and S. Arora, “Evaluating gradient inversion attacks and defenses in federated learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 7232–7241, 2021.
 Jeon et al. (2021) J. Jeon, K. Lee, S. Oh, J. Ok et al., “Gradient inversion with generative image prior,” pp. 29 898–29 908, 2021.
 Jin et al. (2021) X. Jin, P.-Y. Chen, C.-Y. Hsu, C.-M. Yu, and T. Chen, “CAFE: Catastrophic data leakage in vertical federated learning,” pp. 994–1006, 2021.
 Kingma and Ba (2014) D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 12 2014.
 Kuhn (1955) H. W. Kuhn, “The Hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1–2, pp. 83–97, 1955. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109
 Liu and Nocedal (1989) D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Mathematical Programming, vol. 45, pp. 503–528, 1989.
 Long et al. (2021) G. Long, Y. Tan, J. Jiang, and C. Zhang, “Federated learning for open banking,” 2021. [Online]. Available: https://arxiv.org/abs/2108.10749
 McMahan et al. (2017) B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” pp. 1273–1282, 2017.
 Melis et al. (2019) L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, “Exploiting unintended feature leakage in collaborative learning,” in 2019 IEEE symposium on security and privacy (SP). IEEE, 2019, pp. 691–706.
 Rieke et al. (2020) N. Rieke, J. Hancox, W. Li, F. Milletarì, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, S. Ourselin, M. Sheller, R. M. Summers, A. Trask, D. Xu, M. Baust, and M. J. Cardoso, “The future of digital health with federated learning,” npj Digital Medicine, vol. 3, no. 1, Sep. 2020. [Online]. Available: https://doi.org/10.1038/s41746-020-00323-1
 Shokri and Shmatikov (2015) R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 1310–1321. [Online]. Available: https://doi.org/10.1145/2810103.2813687
 Wen et al. (2022) Y. Wen, J. A. Geiping, L. Fowl, M. Goldblum, and T. Goldstein, “Fishing for user data in large-batch federated learning via gradient magnification,” in Proceedings of the 39th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 17–23 Jul 2022, pp. 23 668–23 684. [Online]. Available: https://proceedings.mlr.press/v162/wen22a.html
 Wightman (2017) F. L. Wightman, “LSAC national longitudinal bar passage study,” 2017.
 Yin et al. (2021) H. Yin, A. Mallya, A. Vahdat, J. M. Alvarez, J. Kautz, and P. Molchanov, “See through gradients: Image batch recovery via GradInversion,” pp. 16 337–16 346, 2021.
 Zhao et al. (2020) B. Zhao, K. R. Mopuri, and H. Bilen, “iDLG: Improved deep leakage from gradients,” 2020. [Online]. Available: https://arxiv.org/abs/2001.02610
 Zhu and Blaschko (2021) J. Zhu and M. B. Blaschko, “R-GAP: Recursive gradient attack on privacy,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=RSU17UoKfJF
 Zhu et al. (2019) L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” 2019.
Appendix A Accuracy Metric
To ease the understanding, we start by restating our accuracy metric here, where we measure the reconstruction accuracy between the retrieved sample $\bar{x}$ and the ground truth sample $x$ over its $K$ features as:

(8) $\text{acc}(\bar{x}, x) = \frac{1}{K}\sum_{k=1}^{K} a_k, \qquad a_k = \begin{cases} \mathbb{1}\left[\bar{x}_k = x_k\right] & \text{if feature } k \text{ is discrete,} \\ \mathbb{1}\left[|\bar{x}_k - x_k| \le \epsilon_k\right] & \text{if feature } k \text{ is continuous,} \end{cases}$

where $\epsilon_k$ denotes the error tolerance bound of continuous feature $k$.
Note that the binary treatment of continuous features in our accuracy metric enables the combined measurement of the accuracy on both the discrete and the continuous features. Intuitively, this measure closely resembles how one would judge the correctness of numerical guesses. For example, when guessing a person's age, one would deem a guess good if it lies within a few years of the true value, while guesses far outside this range would both be qualitatively incorrect. To facilitate the scalability of our experiments, we chose the error tolerance bounds based on the global standard deviation of the given continuous feature, scaled by a fixed constant used across all our experiments. Note that for a Gaussian random variable $X$ with mean $\mu$ and variance $\sigma^2$, the probability mass within a fixed multiple of $\sigma$ around $\mu$ is constant, i.e., $P[|X - \mu| \le k\sigma] = \text{erf}(k/\sqrt{2})$. For our metric this means that, assuming zero-mean Gaussian error in the reconstruction around the true value, we accept a reconstruction as privacy leakage as long as it falls into a fixed error-probability range around the correct value. In
Tab. 5 we list the tolerance bounds for the continuous features of the Adult dataset produced by this method. We remark that we fixed our metric parameters before conducting any experiments, and did not adjust them based on any obtained results. Note also that in App. C we provide results where the continuous feature reconstruction accuracy is measured using the commonly used regression metric of root mean squared error (RMSE), where TabLeak also achieves the best results, signaling that the success of our method is independent of our chosen metric.

Table 5: Error tolerance bounds for the continuous features of the Adult dataset.

feature    age   fnlwgt   education-num   capital-gain   capital-loss   hours-per-week
tolerance  4.2   33699    0.8             2395           129            3.8
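To make the metric concrete, the tolerance-based accuracy can be sketched as follows; the function name and the dictionary-based sample encoding are our illustrative assumptions, with the continuous tolerances taken from Tab. 5.

```python
# Tolerance bounds for Adult's continuous features, as listed in Tab. 5.
TOLERANCE = {
    "age": 4.2, "fnlwgt": 33699, "education-num": 0.8,
    "capital-gain": 2395, "capital-loss": 129, "hours-per-week": 3.8,
}

def reconstruction_accuracy(reconstructed, ground_truth, tolerance=TOLERANCE):
    """Fraction of correctly reconstructed features: exact match for
    discrete features, match within the tolerance bound for continuous."""
    correct = 0
    for name, true_val in ground_truth.items():
        rec_val = reconstructed[name]
        if name in tolerance:  # continuous feature: binary tolerance check
            correct += abs(rec_val - true_val) <= tolerance[name]
        else:                  # discrete feature: exact match required
            correct += rec_val == true_val
    return correct / len(ground_truth)
```

For instance, a reconstructed age of 42 against a true age of 39 counts as correct (within the 4.2-year tolerance), while an hours-per-week guess that is off by 10 does not.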
Appendix B Further Experimental Details
Here we give an extended description of the experimental details provided in Sec. 4. For all attacks, we use the Adam optimizer (Kingma and Ba, 2014) without a learning rate schedule; we chose the learning rate based on our experiments on the baseline attack, where it performed best. In line with Geiping et al. (2020), we modify the update step of the optimizer by reducing the update gradient to its element-wise sign. We attack a fully connected neural network with two hidden layers, taken at initialization. Additionally, we provide a network-size ablation in Fig. 8, where we evaluate our attack against the baseline method for different network architectures. For each reported metric we conduct independent runs on 50 different batches to estimate their statistics. For all FedSGD experiments we clamp the continuous features to their valid ranges before measuring the reconstruction accuracy, both for our attacks and the baseline methods. We ran each of our experiments on single cores of an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz.
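The sign-reduced Adam step can be sketched as follows; the state layout and hyperparameter values are our assumptions, and the only substantive point is the `np.sign` reduction applied to the gradient before the usual Adam moment updates.

```python
import numpy as np

def signed_adam_step(param, grad, state, lr=0.01, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update where the incoming gradient is first reduced to its
    element-wise sign, following Geiping et al. (2020)."""
    g = np.sign(grad)  # discard gradient magnitudes, keep only directions
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * g
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * g ** 2
    m_hat = state["m"] / (1 - betas[0] ** state["t"])  # bias correction
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)
```

In the attack loop, this step would replace the standard optimizer update on the candidate reconstruction.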
Federated Averaging Experiments
For experiments on attacking the FedAvg training algorithm, we fix the clients’ local dataset size and conduct an attack after local training on the initialized network described above. We use the FedAvg attack framework of Dimitrov et al. (2022b), where for each local training epoch we initialize an independent mini-dataset matching the size of the client dataset, and simulate the local training of the client. At each reconstruction update, we use the mean squared error between the data means of the different epochs (as defined in Dimitrov et al. (2022b)) as the permutation-invariant epoch prior required by the framework, ensuring the consistency of the reconstructed dataset. For the full technical details, please refer to the manuscript of Dimitrov et al. (2022b). For choosing the prior parameter, we conduct a line-search on each setup and attack method pair individually, and pick the value providing the best results. Further, to reduce the computational overhead, we reduce the ensemble size of TabLeak for these experiments on all datasets.
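The permutation-invariant epoch prior can be sketched as follows; the function and variable names are ours, we pair consecutive epochs purely for illustration, and the exact formulation is given in Dimitrov et al. (2022b).

```python
import numpy as np

def epoch_mean_prior(epoch_datasets):
    """Permutation-invariant prior: mean squared error between the feature
    means of the mini-datasets reconstructed for consecutive local epochs.
    Each element of epoch_datasets is a (dataset_size, n_features) array;
    taking means makes the prior invariant to sample ordering."""
    means = [ds.mean(axis=0) for ds in epoch_datasets]
    mse_terms = [np.mean((a - b) ** 2) for a, b in zip(means, means[1:])]
    return float(np.mean(mse_terms)) if mse_terms else 0.0
```

Because only the per-epoch means enter the prior, shuffling the samples within an epoch's mini-dataset leaves its value unchanged.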
Appendix C Further Experiments
In this section, we present three further experiments:

Results of attacking neural networks defended using differentially private noisy gradients in Sec. C.1.

Ablation study on the impacts of the neural network’s size on the reconstruction difficulty in Sec. C.2.

Measuring the Root Mean Squared Error (RMSE) of the reconstruction of continuous features in Sec. C.3.
C.1 Attack against Gaussian DP
Differential privacy (DP) has recently gained popularity as a way to prevent privacy violations in FL (Abadi et al., 2016; Zhu et al., 2019). Unlike empirical defenses, which are often broken by specifically crafted adversaries (Balunovic et al., 2022), DP provides guarantees on the amount of data leaked by an FL model, in terms of the magnitude of random noise the clients add to their gradients prior to sharing them with the server (Abadi et al., 2016; Zhu et al., 2019). Naturally, DP methods balance privacy concerns against the accuracy of the produced model, since larger noise results in more private but less accurate models. In this subsection, we evaluate TabLeak and the Cosine baseline against DP-defended gradient updates, where zero-mean Gaussian noise of increasing standard deviation is added to the client gradients. We present our results on the Adult, German Credit, Lawschool Admissions, and Health Heritage datasets in Fig. 4, Fig. 5, Fig. 6, and Fig. 7, respectively. Although both methods are affected by the defense, our method consistently produces better reconstructions than the baseline. However, at the highest noise level and larger batch sizes both attacks break, advocating for the use of DP defenses in tabular FL to prevent the high vulnerability exposed by this work.
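On the client side, the defense amounts to perturbing the gradient before sharing it; below is a minimal sketch assuming a flattened gradient vector. The norm-clipping step is part of the standard DP-SGD recipe of Abadi et al. (2016), while the experiments in this section only add zero-mean Gaussian noise of a fixed standard deviation.

```python
import numpy as np

def dp_noisy_gradient(grad, sigma, clip_norm=1.0, rng=None):
    """Clip the gradient to a maximum L2 norm, then add zero-mean Gaussian
    noise with standard deviation sigma before sharing with the server."""
    rng = rng if rng is not None else np.random.default_rng()
    scale = min(1.0, clip_norm / (np.linalg.norm(grad) + 1e-12))
    return grad * scale + rng.normal(0.0, sigma, size=grad.shape)
```

Larger `sigma` degrades the gradient signal available to the attacker, which is exactly the privacy-utility trade-off discussed above.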
C.2 Varying Network Size
To understand the effect that the choice of network has on the obtained reconstruction results, we defined four additional fully connected networks, two smaller and two larger, on which to evaluate TabLeak. Concretely, we examined the following five architectures for our attack:

a single hidden layer with neurons,

a single hidden layer with neurons,

two hidden layers with neurons each (network used in main body),

three hidden layers with neurons each,

and three hidden layers with neurons each.
We attack the above networks, aiming to reconstruct a single batch. We plot the accuracy of TabLeak and the Cosine baseline as a function of the number of parameters in the network in Fig. 8 for all four datasets. We observe that with an increasing number of parameters, the reconstruction accuracy significantly increases on all datasets, rather surprisingly allowing for near-perfect reconstruction of large batches in some cases. Observe that at both ends of the presented parameter scale the differences between the methods diminish, i.e., they either both converge to near-perfect reconstruction (large networks) or to random guessing (small networks). Therefore, the choice of our network for conducting the experiments was instructive in examining the differences between the methods.
C.3 Continuous Feature Reconstruction Measured by RMSE
In order to examine the potential influence of our choice of reconstruction metric on the obtained results, we additionally measured the reconstruction quality of continuous features using the widely used root mean squared error (RMSE) metric. Concretely, for a batch of $n$ samples we calculate the RMSE between the $K_{\text{cont}}$ continuous features of our reconstruction $\bar{x}$ and the ground truth $x$ as:

(9) $\text{RMSE}(\bar{x}, x) = \sqrt{\frac{1}{n\,K_{\text{cont}}} \sum_{i=1}^{n} \sum_{k=1}^{K_{\text{cont}}} \left(\bar{x}_{i,k} - x_{i,k}\right)^2}$
As our results in Fig. 9 demonstrate, TabLeak achieves significantly lower RMSE than the Cosine baseline on large batch sizes, for all four datasets examined. This indicates that the strong results obtained by TabLeak in the rest of the paper are not a consequence of our evaluation metric.
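The computation behind Eq. (9) is a standard batch RMSE over the continuous columns; a minimal sketch, with variable names of our choosing:

```python
import numpy as np

def continuous_rmse(reconstruction, ground_truth):
    """RMSE over the continuous feature columns of a batch; both inputs
    are (batch_size, n_continuous_features) arrays."""
    return float(np.sqrt(np.mean((reconstruction - ground_truth) ** 2)))
```

Lower values indicate closer reconstructions, so under this metric a stronger attack yields a smaller score.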
Appendix D All Main Results
In this section, we include all the results presented in the main part of this paper for the Adult dataset, alongside the corresponding additional results on the German Credit, Lawschool Admissions, and Health Heritage datasets.
D.1 Full FedSGD Results on all Datasets
In Tab. 6, Tab. 7, Tab. 8, and Tab. 9 we provide the full attack results of our method compared to the Cosine and random baselines on the Adult, German Credit, Lawschool Admissions, and Health Heritage datasets, respectively. Looking at the results for all datasets, we can confirm the observations made in Sec. 4, i.e., (i) lower batch sizes are vulnerable to any non-trivial attack, (ii) not knowing the ground truth labels does not significantly disadvantage the attacker at larger batch sizes, and (iii) TabLeak provides a strong improvement over the baselines at practically relevant batch sizes on all datasets examined.
Table 6: Full FedSGD results on the Adult dataset.

Label  Batch Size  TabLeak  TabLeak (no pooling)  TabLeak (no softmax)  Cosine  Random
True
Rec.

Table 7: Full FedSGD results on the German Credit dataset.

Label  Batch Size  TabLeak  TabLeak (no pooling)  TabLeak (no softmax)  Cosine  Random
True
Rec.

Table 8: Full FedSGD results on the Lawschool Admissions dataset.

Label  Batch Size  TabLeak  TabLeak (no pooling)  TabLeak (no softmax)  Cosine  Random
True
Rec.

Table 9: Full FedSGD results on the Health Heritage dataset.

Label  Batch Size  TabLeak  TabLeak (no pooling)  TabLeak (no softmax)  Cosine  Random
True
Rec.
D.2 Categorical vs. Continuous Features on all Datasets
In Fig. 10, we compare the reconstruction accuracy of the continuous and the discrete features on all four datasets. We confirm our observations, shown in Fig. 3 in the main text, that a strong dichotomy between continuous and discrete feature reconstruction accuracy exists on all four datasets.
D.3 Federated Averaging Results on all Datasets
In Tab. 10, Tab. 11, Tab. 12, and Tab. 13 we present our results on attacking the clients in FedAvg training on the Adult, German Credit, Lawschool Admissions, and Health Heritage datasets, respectively. We described the details of the experiment in App. B above. Confirming the conclusions drawn in the main part of this manuscript, we observe that TabLeak achieves non-trivial reconstruction accuracy over all settings, even for large numbers of updates, while the baseline attack often fails to outperform random guessing when the number of local updates is increased.
Table 10: FedAvg attack results on the Adult dataset.

#batches  TabLeak (1 epoch)  TabLeak (5 epochs)  TabLeak (10 epochs)  Cosine (1 epoch)  Cosine (5 epochs)  Cosine (10 epochs)
1
2
4

Table 11: FedAvg attack results on the German Credit dataset.

#batches  TabLeak (1 epoch)  TabLeak (5 epochs)  TabLeak (10 epochs)  Cosine (1 epoch)  Cosine (5 epochs)  Cosine (10 epochs)
1
2
4

Table 12: FedAvg attack results on the Lawschool Admissions dataset.

#batches  TabLeak (1 epoch)  TabLeak (5 epochs)  TabLeak (10 epochs)  Cosine (1 epoch)  Cosine (5 epochs)  Cosine (10 epochs)
1
2
4

Table 13: FedAvg attack results on the Health Heritage dataset.

#batches  TabLeak (1 epoch)  TabLeak (5 epochs)  TabLeak (10 epochs)  Cosine (1 epoch)  Cosine (5 epochs)  Cosine (10 epochs)
1
2
4
D.4 Full Results on Entropy on all Datasets
In Tab. 14, Tab. 15, Tab. 16, and Tab. 17 we provide the mean and standard deviation of the reconstruction accuracy and the entropy of the continuous and the categorical features over increasing batch size when attacking with TabLeak on the four datasets. In support of Sec. 4, on all datasets we observe a trend of increasing entropy with decreasing reconstruction accuracy as the batch size is increased, thus providing a signal to the attacker about their overall reconstruction success.

To generalize our results on the local information contained in the entropy, we show the mean reconstruction accuracy of both the discrete and the continuous features when bucketing them based on their entropy in Tab. 18, Tab. 19, Tab. 20, and Tab. 21 for all four datasets, respectively. We can see that with the help of this bucketing, we can identify subsets of the reconstructed features that have been retrieved with a (sometimes significantly, e.g., up to 24%) higher accuracy than the overall batch.
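The bucketing described above can be sketched as follows; the quantile-based bucket edges and all names are our illustrative choices.

```python
import numpy as np

def accuracy_by_entropy_bucket(entropies, correct, n_buckets=4):
    """Bucket reconstructed features by their ensemble entropy and return
    the mean reconstruction accuracy per bucket; low-entropy buckets are
    expected to contain the more reliably reconstructed features."""
    edges = np.quantile(entropies, np.linspace(0.0, 1.0, n_buckets + 1))
    accs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (entropies >= lo) & (entropies <= hi)
        accs.append(float(correct[mask].mean()) if mask.any() else float("nan"))
    return accs
```

An attacker can thus inspect the low-entropy bucket first, since those features are the ones most likely to have been recovered correctly.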
Table 14: Reconstruction accuracy and entropy of the discrete and continuous features over batch size (TabLeak).

Batch Size  Discrete Accuracy  Discrete Entropy  Continuous Accuracy  Continuous Entropy
1
2
4
8
16
32
64
128

Table 15: Reconstruction accuracy and entropy of the discrete and continuous features over batch size (TabLeak).

Batch Size  Discrete Accuracy  Discrete Entropy  Continuous Accuracy  Continuous Entropy
1
2
4
8
16
32
64
128

Table 16: Reconstruction accuracy and entropy of the discrete and continuous features over batch size (TabLeak).

Batch Size  Discrete Accuracy  Discrete Entropy  Continuous Accuracy  Continuous Entropy
1
2
4
8
16
32
64
128

Table 17: Reconstruction accuracy and entropy of the discrete and continuous features over batch size (TabLeak).

Batch Size  Discrete Accuracy  Discrete Entropy  Continuous Accuracy  Continuous Entropy
1
2
4
8
16
32
64