## 1 Introduction

Representation learning, by training deep neural networks as feature extractors to generate compact embedding vectors from images, is a fundamental component in computer vision. Metric learning, a kind of representation learning using supervised data, has been widely applied to image recognition, clustering, and retrieval (weinberger2009distance; schroff2015facenet; weyand2020google). Machine learning models have the capacity to memorize training data (carlini2019secret; carlini2021extracting), leading to privacy risks when the models are deployed. Privacy risk can also be audited by membership inference attacks (shokri2017membership; carlini2022membership), i.e., detecting whether certain data was used to train a model and potentially exposing users' usage behaviors. Defending against such risks is a critical responsibility when training on privacy-sensitive data.

Differential Privacy (DP) (dwork2006calibrating) is an extensively used quantifiable measurement of privacy risk, now generally accepted as a standard notion of privacy in both industry and government (apple; ding2017collecting; bureau2021disclosure; dpftrl_blogpost).
Applied to machine learning, DP requires a training procedure with explicit randomness, and guarantees that the distribution over output models is quantifiably similar given a certain scope of change to the training dataset.
A DP guarantee with respect to the addition and/or removal of a single arbitrary training example is known as *example-level DP*, which provides plausible deniability (in the binary hypothesis testing sense of kairouz2015composition) that any single example (e.g., image) occurred in the training dataset. If we instead consider how the distribution of output models changes if the data (including even the number of examples) from any single user changes arbitrarily, we have *user-level DP* (dwork2010differential). This ensures model training is quantifiably insensitive to all of the data from any one user, and hence it is impossible to tell with high confidence whether a user has participated in training. This guarantee can be exponentially stronger than example-level DP if one user may contribute many examples to training.

Recently, DP-SGD (abadi2016deep) (essentially, SGD with the additional steps of clipping each individual gradient to have a maximum norm, and adding correspondingly calibrated noise) has been used to achieve example-level DP for relatively large models in language modeling and image classification tasks (li2021large; yu2021differentially; anil2021large; de2022unlocking; kurakin2022toward), often utilizing techniques like large batch training and pretraining on public data. DP-SGD can be modified to guarantee user-level DP; the resulting algorithm is often combined with federated learning algorithms and called DP-FedAvg (mcmahan18learning). User-level DP has only been studied for small on-device models that have fewer than 10 million parameters (mcmahan18learning; kairouz21practical; ramaswamy2020training).
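The parenthetical description of DP-SGD above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used in the cited works; the function names `clip_to_norm` and `dp_sgd_step`, and the choice to represent parameters and gradients as flat lists, are assumptions made for this example.

```python
import math
import random


def clip_to_norm(vec, clip_norm):
    """Scale vec down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update on a flat parameter list (illustrative only)."""
    # Clip every per-example gradient before it is aggregated.
    clipped = [clip_to_norm(g, clip_norm) for g in per_example_grads]
    batch_size = len(clipped)
    new_params = []
    for j, p in enumerate(params):
        total = sum(g[j] for g in clipped)
        # Gaussian noise with std proportional to the clipping norm.
        total += rng.gauss(0.0, noise_multiplier * clip_norm)
        new_params.append(p - lr * total / batch_size)
    return new_params
```

With `noise_multiplier=0.0` the step reduces to SGD on clipped gradients, which is a convenient sanity check; the privacy guarantee itself comes from a separate accounting step that this sketch does not cover.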

We consider user-level DP for relatively large models in representation learning with supervised data. In our setting, similar to federated learning (FL), the data are user-partitioned; but in contrast to decentralized FL, we are primarily motivated by centralized data that benefit from access to richer computation resources and the ability to form virtual clients at random.
Throughout this work, we use *user* as the basic unit of data partitioning and the granularity for privacy; a user owns their (image) data, and *class*, *identity* and *label* are used interchangeably for the supervised information.

Though typically each user only contributes images for a small number of classes, the combined class space from the union of all users can be very large, which proves challenging for existing DP algorithms. When the model size is fixed independent of the number of users, and at a relatively small scale with a few million parameters, previous methods can only achieve strong user-level DP when millions of users are available (mcmahan18learning; kairouz21practical; ramaswamy2020training). In contrast, when considering learning embedding models for applications like facial images, the number of classes (and hence, the total model size) can grow linearly with the number of users, and so simply scaling up to larger datasets with more users no longer ensures that good privacy-utility trade-offs can be achieved. For example, in a standard multi-class training paradigm, with one million users we expect the final dense layer for prediction alone to have on the order of one million times the embedding dimension in trainable parameters. Further, the fact that most users will only have examples from a small number of classes implies that gradients are approximately sparse, whereas DP-SGD requires the addition of dense noise to the full gradient, leading to a poor signal-to-noise ratio in the updates. Hence, existing methods can easily fail on the problems we consider.

We propose DP-FedEmb to train embedding models with user-level differential privacy; fig. 1 provides a high-level overview. DP-FedEmb combines public pretraining, virtual clients, local fine-tuning, and partial aggregation to achieve strong privacy-utility trade-offs. The key to the approach is partitioning the model into a backbone network that generates embeddings, and a classification softmax head specific to the classes in the training data. In each training round, users are grouped into virtual clients and initialized from the global backbone. A local randomly-initialized softmax head layer is added for the limited number of classes on the virtual client, and the complete local model is fine-tuned in order to produce an update to the backbone. The local head parameters are not included in the private aggregation and hence require no noise addition. This is in contrast to existing methods like DP-FedAvg/DP-SGD, which would add noise to all parameters including the softmax head. The backbone updates are clipped to a maximum L2 norm, aggregated across virtual clients, and combined with appropriate DP noise. At this point, the noised update is the output of a DP mechanism and satisfies the corresponding DP guarantee. This update is then applied to the global backbone, which inherits the DP guarantee, and passed to the next round of training. DP-FedEmb significantly improves the scalability of DP training for embedding models, as only the parameters of the backbone network are privatized and released, and the size of this portion of the model does not grow with the number of users. Pretraining the backbone network on public data to learn general visual representations before applying DP-FedEmb for more privacy sensitive tasks further improves performance.

We demonstrate the superior performance of DP-FedEmb for embedding models through experiments on datasets with a moderate number of users and classes (DigiFace and its DigiFace10K subset, where users correspond to identities; the Google Landmarks Dataset with 1262 users and 2028 classes; and iNaturalist with 9275 users and 1203 classes). We also show that relatively strong privacy guarantees of single-digit $\varepsilon$ can be achieved while maintaining strong utility if millions of users can participate in training. To our knowledge, this is the first report of training a commonly used large vision model, ResNet-50, with non-negligible noise for user-level DP.

## 2 Related work

#### Differential Privacy (DP),

introduced by (dwork2006calibrating), is a formal mathematical notion of privacy protection. Formally, two datasets $D$ and $D'$ are said to be neighboring if they differ in at most one entry. A randomized mechanism $\mathcal{M}$ is said to be $(\varepsilon, \delta)$-differentially private if $\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta$ for all neighboring $D$ and $D'$ and all measurable output sets $S$. We refer to this original definition as sample-level DP, and several variations have been proposed, including Rényi DP (RDP) (mironov2017renyi), Privacy Loss Distribution (PLD) (koskela2020tight; doroshenko2022connect), and zero-concentrated DP (zCDP) (bun2016concentrated).

A formal definition of user-level DP is introduced in (dwork2010differential), where the unit of privacy protection is extended from a single entry (in the original sample-level DP) to every entry that belongs to the same user. The dependence of the utility on the number of users and the number of samples per user has been studied for various tasks: empirical risk minimization and mean estimation (levy2021learning), estimating discrete distributions (liu2020learning), and PAC learning (ghazi2021user). Extensions to heterogeneous users in sample size have been studied in (amin2019bounding; epasto2020smoothly). User-level DP is particularly useful in federated learning, where the natural unit of privacy is a user (i.e., a client) (geyer2017differentially; mcmahan2018general) and standard private training algorithms respect user-level DP (mcmahan18learning). The privacy-utility trade-off for user-level DP is investigated in (jain2021differentially) for personalization of regression problems.

#### Representation learning

and metric learning are active research directions in computer vision. Recently, a lot of progress has been made towards representation learning with large-scale unsupervised data (chen2020simple; he2022masked; zbontar2021barlow; grill2020bootstrap). However, we consider representation learning with supervised data as it is widely used for downstream tasks like face recognition and clustering (schroff2015facenet; taigman2014deepface), person re-identification (wu2016personnet), and landmark recognition (weyand2020google), which can significantly benefit from privacy protection. Two technical frameworks are often used in supervised representation learning tasks due to the large output space: the triplet loss and its variants with hard negative mining (schroff2015facenet), and multi-class training with proxy weights (taigman2014deepface; deng2019arcface; wen2021sphereface2). We propose DP-FedEmb based on the multi-class approach for the following reasons: the two approaches can achieve similar performance when trained with large-scale data (musgrave2020metric); multi-class training is simple and flexible, and can be more efficient when less data are touched every iteration; and negative sampling can incur non-trivial computational and privacy cost. To the best of our knowledge, differentially private models have not been trained for large-scale representation learning.

#### Federated learning

is an active research topic primarily designed for learning from decentralized data (kairouz2019advances; wang2021fieldguide). We propose DP-FedEmb based on federated learning algorithms as they are suitable for user-level DP. User-level DP can be achieved in federated learning by variants of DP-FedAvg (geyer2017differentially; mcmahan18learning; kairouz21practical; ramaswamy2020training; andrew2019differentially); these previous works train relatively small models for image classification and language modeling tasks. Closest to DP-FedEmb is singhal2021federated, which proposed federated reconstruction that performs partially local training for personalization; dong2022spherefed fixed the softmax and trained the feature extractor before calibrating for image classification tasks; waghmare2022efficient modified sampled softmax for large output spaces in federated learning. These works are designed for learning with decentralized data, and do not consider differential privacy. meng2022improving use differential privacy on proxy vectors to mitigate the privacy concerns when clients exchange weight vectors of identities for federated training, which is different from our motivation of training a differentially private model that will not memorize a specific user's data.

## 3 Learning Embedding Models

### 3.1 Problem formulation and centralized training

We learn a *backbone* network that outputs an embedding vector for an input image. The backbone network is trained on paired examples of an image and its class. The training dataset is naturally partitioned by users, i.e., it is the union of the per-user datasets.

We adopt the popular multi-class training framework for embedding models, where a proxy weight vector is learnt for each class. The union of the proxy weight vectors is called the *head* of the network. Given a training image-class pair, logits are computed by taking the inner product between the embedding vector of the image and each proxy weight vector in the network head. This is effectively passing the embedding through a dense network layer, parameterized by the head, without the bias terms. With a supervised training loss, such as the cross-entropy loss, the training loss over all image-class pairs is minimized with respect to the parameters of both the backbone and the head; we refer to this objective as (1).

In (1), the inner product between the embedding and the head is overloaded to denote the set of inner products with each proxy weight vector in the head. Typically, variants of the gradient descent method are used to solve the optimization in (1). In each iteration, the average gradient is computed on a sampled minibatch of data, and is then used to update the parameters of the backbone and the head. Furthermore, when the output space, i.e., the number of classes, is very large, sampled softmax is often applied (jean2014using), where only a subset of the proxy weights sampled from the head is used in each training iteration.
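One plausible way to write objective (1) is sketched below. The symbols are assumed notation rather than the original typesetting: $\theta$ for the backbone parameters, $f_\theta$ for the backbone, $W = \{w_c\}_{c \in \mathcal{C}}$ for the head over the class set $\mathcal{C}$, $\ell$ for the training loss, and $\mathcal{D}$ for the training set. The normalization by $|\mathcal{D}|$ is likewise an assumption.

```latex
\min_{\theta,\, W} \; \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}}
  \ell\big(\langle f_\theta(x), W \rangle,\, y\big),
\qquad
\langle f_\theta(x), W \rangle := \big(\langle f_\theta(x), w_c \rangle\big)_{c \in \mathcal{C}}
```

The second expression spells out the overloaded inner product: the logits are the vector of inner products between the embedding and every proxy weight vector.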

### 3.2 User-level DP and DP-FedAvg

To achieve user-level DP, we control the sensitivity of each user and add corresponding noise for anonymization. To effectively control the sensitivity, it is important to understand and account for the contributions of each user in the model updates; hence it is convenient to consider the data at a granularity of users instead of individual samples. Grouping together each user's data, the objective (1) can be rewritten as a two-level sum, first over users and then over each user's image-class pairs; we refer to this user-grouped objective as (2).
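One plausible way to write the user-grouped objective (2) is sketched below. All symbols are assumed notation rather than the original typesetting: $\mathcal{U}$ is the set of users, $\mathcal{D}_u$ the data of user $u$, $f_\theta$ the backbone with parameters $\theta$, $W$ the head, and $\ell$ the training loss; the normalization is also an assumption.

```latex
\min_{\theta,\, W} \; \frac{1}{|\mathcal{D}|} \sum_{u \in \mathcal{U}} \; \sum_{(x, y) \in \mathcal{D}_u}
  \ell\big(\langle f_\theta(x), W \rangle,\, y\big),
\qquad
\mathcal{D} = \bigcup_{u \in \mathcal{U}} \mathcal{D}_u
```

Because the per-user datasets partition the training set, this is the same quantity as a single sum over all image-class pairs, only regrouped so that each user's contribution appears as one inner term.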

The above two-level sum objective is often found in federated learning (wang2021fieldguide), and can be optimized by the (generalized) FedAvg algorithm (mcmahan2017fedavg; reddi2021adaptive). In generalized FedAvg, each round starts with the server broadcasting the global model to a subset of clients. Each client then updates its local model parameters by ClientOpt on its private data, and sends back the resulting updates (model deltas) to the server. The model deltas from the sampled clients are then aggregated and used by ServerOpt to obtain the global model for the next round.
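The round structure of generalized FedAvg described above can be illustrated with a small sketch. It assumes plain SGD without momentum as ServerOpt and uses a toy quadratic loss for ClientOpt; the names `fedavg_round` and `local_sgd_toward_mean` are hypothetical and chosen only for this example.

```python
def fedavg_round(global_params, client_datasets, client_update, server_lr=1.0):
    """One round of generalized FedAvg with an SGD ServerOpt (no momentum)."""
    deltas = []
    for data in client_datasets:
        # Broadcast: every client starts from a copy of the global model.
        local_params = client_update(list(global_params), data)
        deltas.append([l - g for l, g in zip(local_params, global_params)])
    # Aggregate the model deltas and apply them on the server.
    dim = len(global_params)
    avg_delta = [sum(d[j] for d in deltas) / len(deltas) for j in range(dim)]
    return [g + server_lr * a for g, a in zip(global_params, avg_delta)]


def local_sgd_toward_mean(params, data, lr=0.5, steps=1):
    """Toy ClientOpt: SGD on 0.5 * ||params - x||^2 averaged over local points x."""
    for _ in range(steps):
        grad = [sum(p - x[j] for x in data) / len(data) for j, p in enumerate(params)]
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

For instance, `fedavg_round([0.0], [[[2.0]], [[4.0]]], local_sgd_toward_mean)` moves the single global parameter halfway toward the mean of the two clients' data in one round.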

The generalized FedAvg algorithm can be extended for user-level DP by clipping the model deltas and adding noise proportional to the sensitivity (mcmahan18learning; geyer2017differentially). We can use either independent Gaussian noise (mcmahan18learning), or correlated noise that can achieve a comparable privacy-utility trade-off without relying on the assumption of sampling (kairouz21practical). The two variants are effectively applying DP-SGD (abadi2016deep) or DP-FTRL (kairouz21practical) as ServerOpt in the generalized FedAvg framework. Unlike the cross-device FL setting where sampling is extremely hard, it is possible to control user sampling in the datacenter and use DP-SGD. However, DP-FTRL makes it possible to handle the online setting where the user data are streamed instead of collected, and its guarantee can be accounted for in zCDP (bun2016concentrated), the notion reported by the US Census Bureau (bureau2021disclosure). A complete description of DP-FedAvg for training a backbone network to generate image embeddings is in algorithm 1.

In addition to the flexibility of generalized FedAvg for user-level DP, there are a few side effects of FedAvg that make it particularly effective for differentially private training. The model deltas are computed based on data for each user before clipping in DP, which can potentially reduce the bias introduced by clipping. de2022unlocking suggested that averaging gradients from augmented data before clipping can improve training for example-level DP; computing model deltas from user data for user-level DP can be considered a natural extension that improves the bias-variance trade-off. The communication efficiency of FedAvg, which leads to infrequent aggregation and model release, is also desirable for DP training. The local model updates on private client data introduce no additional privacy cost, and only the communication rounds between clients and server have to be accounted for in DP. Though the theoretical advantages of FedAvg are only proved under certain assumptions (woodworth2020local; wang2022unreasonable), FedAvg with local updates can achieve communication efficiency and fast convergence in various practical applications (wang2021fieldguide).

### 3.3 Proposed DP-FedEmb method

While generalized DP-FedAvg can be applied to train a backbone network that generates embeddings from images, there are challenges that significantly affect the efficiency and feasibility of the method. We propose DP-FedEmb with a few key features: virtual clients, partial aggregation, local fine-tuning, public pretraining, and parameter freezing. Details of DP-FedEmb are provided in algorithm 2.

Virtual clients.
Data heterogeneity is one of the key problems in federated optimization (wang2021fieldguide). When we train embedding models in the multi-class framework, the class space can be very large and each user may only observe a limited number of classes. In the extreme case, when training embedding models on facial images (taigman2014deepface; schroff2015facenet), each user may only have images for their own identity. This significantly limits the advantage of local updates due to client drift (karimireddy2020scaffold), and even with specialized regularization like (yu2020federated), FedSGD (mcmahan2017fedavg) with frequent aggregation and model release has to be used instead of FedAvg. It is challenging to use some specialized techniques for handling data heterogeneity (wang2021fieldguide) in DP training. Instead, we propose a simple yet effective approach: randomly grouping the data of sampled users into *virtual clients*.

Unlike the cross-device FL setting where the on-device data of users cannot be exchanged, virtual clients are feasible for user data in the datacenter. For user-level DP, it is important to guarantee that a user is not included in two virtual clients in a single round, analogous to microbatches for DP-SGD and example-level DP (abadi2016deep; mcmahan2018general). The granularity of the DP definition slightly changes under the virtual clients setting: the adjacent dataset for DP is based on virtual clients (a group of users) instead of a single user, which has conceptually stronger privacy guarantees. Virtual clients also control the interpolation between federated training and centralized training: when all users are grouped into a single virtual client, federated training is equivalent to centralized training, which removes heterogeneity but is challenging for the DP mechanism. Virtual clients are used for both the baseline DP-FedAvg and the proposed DP-FedEmb method.
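The grouping of sampled users into virtual clients can be sketched as a random partition. This is an illustrative reading of the description above rather than the released code; `make_virtual_clients` and its parameters are hypothetical names. Because the groups are slices of one shuffled list, no user can appear in two virtual clients within the same round.

```python
import random


def make_virtual_clients(sampled_user_ids, users_per_virtual_client, rng):
    """Randomly partition the sampled users into disjoint virtual clients."""
    users = list(sampled_user_ids)
    rng.shuffle(users)
    # Consecutive slices of the shuffled list are disjoint by construction.
    return [
        users[i:i + users_per_virtual_client]
        for i in range(0, len(users), users_per_virtual_client)
    ]
```

Note that the last group may be smaller than the others when the number of sampled users is not a multiple of the group size; whether to keep or drop such a remainder is a design choice this sketch leaves open.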

Partial aggregation and local fine-tuning. Another challenge is the number of parameters in DP training. A common ResNet-50 backbone has tens of millions of parameters. However, the parameter size of the head can grow linearly with the number of classes. Taking FaceNet (taigman2014deepface; schroff2015facenet) as an example again, the head, with one proxy weight vector per identity, can easily outgrow the backbone for millions of identities in real-world applications. Sampled softmax (jean2014using; waghmare2022efficient) can be applied to improve training efficiency. However, as both backbone and head are shared among users and need to be privatized by adding noise during training, the combined parameter size of the backbone and head will significantly affect the privacy-utility trade-off, which cannot be mitigated by sampled softmax.
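The linear growth of the head can be made concrete with a short calculation. The helper names below are hypothetical, and any numbers passed to them are placeholders chosen by the reader, not values reported in this work.

```python
def head_parameter_count(num_classes, embedding_dim):
    """Weights in a bias-free dense head: one proxy vector per class."""
    return num_classes * embedding_dim


def head_to_backbone_ratio(num_classes, embedding_dim, backbone_params):
    """Size of the head relative to a backbone of fixed size."""
    return head_parameter_count(num_classes, embedding_dim) / backbone_params
```

Doubling the number of classes doubles the head size, while the backbone size stays fixed, so the ratio grows without bound as more identities are added.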

In DP-FedEmb, inspired by federated reconstruction (singhal2021federated) and DP personalization (jain2021differentially), we only aggregate and privatize the backbone network, which is used in inference and has a fixed parameter size that does not grow with the number of classes. A local head is randomly initialized and updated on each virtual client. A fine-tuning approach is adopted for local updates, where different learning rates are used for the backbone and the head. When combined with virtual clients, the partial aggregation and local fine-tuning approach can be interpreted in various ways: each virtual client performs transfer learning given a shared backbone network for representation learning; the data of each class are their own positive samples as well as negative samples for the other classes on the same virtual client; and the local head is significantly smaller than the head for all classes, which is effectively user-based sampling for softmax.

Public pretraining. The parameter size of the backbone to be privatized can still be large after applying partial aggregation and local fine-tuning with virtual clients, e.g., for ResNet-50. Inspired by recent research on applying DP-SGD for example-level DP on large language modeling (li2021large; yu2021differentially) and image classification (de2022unlocking; kurakin2022toward), we use a model pretrained on public images to initialize the DP training of the backbone network. There is a relatively clear distinction between the public and private domains for our task: we use public images collected from open webpages for pretraining, and then privately train on users' data collected in a datacenter.
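The partial aggregation and local fine-tuning steps described above can be illustrated on a deliberately tiny model. In this sketch the backbone is a single scalar, the embedding is that scalar times the input, and the head holds one scalar proxy weight per local class; these simplifications, and the names `virtual_client_delta` and `private_backbone_update`, are assumptions for illustration and do not reproduce algorithm 2.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def virtual_client_delta(backbone, examples, num_local_classes,
                         lr_backbone, lr_head, epochs, rng):
    """Fine-tune locally with a fresh head; return only the backbone delta."""
    b = backbone
    # The head is randomly initialized on the virtual client and never shared.
    head = [rng.gauss(0.0, 0.1) for _ in range(num_local_classes)]
    for _ in range(epochs):
        for x, y in examples:
            emb = b * x
            probs = softmax([w * emb for w in head])
            dlogits = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
            grad_b = sum(d * w for d, w in zip(dlogits, head)) * x
            grad_head = [d * emb for d in dlogits]
            b -= lr_backbone * grad_b  # separate learning rate for the backbone
            head = [w - lr_head * g for w, g in zip(head, grad_head)]
    return b - backbone  # the local head is discarded here


def private_backbone_update(backbone, deltas, clip_norm, noise_multiplier,
                            server_lr, rng):
    """Clip each backbone delta, sum, add Gaussian noise, average, and apply."""
    clipped = [max(-clip_norm, min(clip_norm, d)) for d in deltas]  # scalar L2 clip
    noisy_sum = sum(clipped) + rng.gauss(0.0, noise_multiplier * clip_norm)
    return backbone + server_lr * noisy_sum / len(deltas)
```

Only the value returned by `virtual_client_delta` reaches the aggregation step, so noise is added to the backbone update alone and the head contributes nothing to the number of privatized parameters.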

Parameter freezing. Neural networks are known to be overparameterized, and not all weights are equally important (zhang2019layers; frankle2021training). Freezing some of the parameters to be non-trainable has been shown to be effective when the privacy budget is small (sidahmed2021efficient), especially when combined with public pretraining for large models (yu2021differentially; de2022unlocking). For backbone convolutional neural networks with normalization layers, we experiment with training the parameters of all normalization layers and some of the convolutional kernels. However, freezing is found to be less efficient in our setting, which performs representation learning, instead of classification, for a moderate size model in the high-utility-moderate-noise regime.

DP mechanism and hyperparameters. Similar to generalized DP-FedAvg, we perform clipping for model deltas and add noise for aggregated updates. The clip norm is estimated by adaptive clipping (andrew2019differentially) in the parameter tuning stage. For DP-FedEmb, we perform extensive ablation studies on several configurations in section 4. Differentially private hyperparameter tuning (papernot2021hyperparameter) is an active research topic out of the scope of this paper, and automating hyperparameter tuning is important future work. Either independent Gaussian noise as in DP-SGD (mcmahan18learning) or tree-based correlated noise as in DP-FTRL (kairouz21practical) can be added. Under the same noise multiplier, DP-FedEmb achieves the same privacy bound as DP-FedAvg with virtual clients, while the utility can be significantly improved for training a backbone network with a large head.

## 4 Experiments

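| Dataset | Train users | Train classes | Train images | Validation users | Validation classes | Validation images | Test users | Test classes | Test images |
|---|---|---|---|---|---|---|---|---|---|
| DigiFace | | | | | | | | | |
| DigiFace10K | - | - | | | | | | | |
| EMNIST | - | | | | | | | | |
| GLD | - | - | | | | | | | |
| iNat | - | - | | | | | | | |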

Table 1: The statistics of simulation datasets. The Google Landmarks Dataset (GLD) and iNaturalist (iNat) dataset are preprocessed by TensorFlow Federated (gldv2; inat). For the training of EMNIST, we use images of class in the union of the train and test clients in the TensorFlow Federated dataset (emnist); and use images of class in the test clients for validation. The shape of an image is for DigiFace/DigiFace10K, for GLD and iNat, and for EMNIST.

We conduct experiments to train image-to-embedding backbone networks with user-level DP. We use the DigiFace dataset (bae2023digiface1m) of synthetic faces based on ethical and responsible development considerations, and have verified that the conclusions on DigiFace are very similar to conclusions generated from experiments on natural facial images. We randomly split the DigiFace dataset of identities and into subsets of identities with images for training, identities with images for validation, and identities with images for testing. We extensively use a smaller training set, DigiFace10K, which contains the training identities of 72 images sampled from the DigiFace training set. We also run experiments on public datasets of natural images: EMNIST, Google Landmarks Dataset (GLD) and iNaturalist (iNat) dataset. The statistics of these datasets are summarized in table 1.

In our setting, each user holds only the images of their own identities, i.e., user-level DP is also identity-level DP. We use ResNet-50 (he2016deep) and MobileNetV2 (sandler2018mobilenetv2; hsu2020federated; wang2021fieldguide) as backbone networks, replace batch normalization (ioffe2015batch) with group normalization (wu2018group), and use a multi-class framework with a large softmax head to train the backbone. The dimension of the embeddings is fixed for all experiments.

We evaluate the performance of the backbone network based on predicting identity matches from the distance between two image embeddings. By varying a threshold on the pairwise similarity, a recall versus false accept rate (FAR) curve on the test data can be generated for a trained model. A scalar value of recall at a fixed FAR is often reported. The privacy guarantees are computed by either using Rényi differential privacy (RDP) (mironov2017renyi) and converting to $(\varepsilon, \delta)$-DP by (canonne2020discrete); or privacy loss distribution (PLD) (koskela2020tight; doroshenko2022connect) implemented in (pldlib); or DP-FTRL accounting without restart (kairouz21practical).
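The recall-at-fixed-FAR metric described above can be sketched as follows. This is one reasonable way to compute it from pairwise similarities and match labels, not necessarily the evaluation code used for the reported numbers; the function name `recall_at_far` and the tie-handling rule (accept only pairs strictly above the threshold) are assumptions.

```python
def recall_at_far(similarities, is_match, target_far):
    """Recall on matching pairs at a threshold whose false accept rate <= target_far."""
    pos = [s for s, m in zip(similarities, is_match) if m]
    neg = sorted((s for s, m in zip(similarities, is_match) if not m), reverse=True)
    if not pos or not neg:
        raise ValueError("need at least one matching and one non-matching pair")
    # Allow at most k non-matching pairs to be (falsely) accepted.
    k = int(target_far * len(neg))
    # Accept pairs scoring strictly above the (k+1)-th largest non-matching score.
    threshold = neg[k] if k < len(neg) else float("-inf")
    return sum(s > threshold for s in pos) / len(pos)
```

Sweeping `target_far` over a range of values traces out the recall versus FAR curve; reporting the value at a single small FAR gives the scalar summary.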

We compare the proposed DP-FedEmb with the non-private oracle performance of centralized training, and with the baseline method DP-FedAvg. We tune the learning rate with learning rate scheduling for standard centralized training. The centralized baseline is provided as an oracle for non-private training performance. We exclude tricks like data augmentation for both centralized and federated training, as the goal is not achieving state-of-the-art performance. Virtual clients are used to improve DP-FedAvg performance, and the same tuning strategy is applied for DP-FedEmb and DP-FedAvg. In most of the experiments, unless otherwise specified, we fix the hyperparameters in the federated setting and only tune the learning rates; the backbone networks are pretrained on classifying the classes of ImageNet (russakovsky2015imagenet); both ClientOpt and ServerOpt are SGD optimizers with momentum; and more details are provided in section 4.3. Code is being released at https://github.com/google-research/federated/tree/master/dp_visual_embeddings.

Remark on privacy accounting. The accounting and differential privacy definition of DP-FedEmb depend on the DP mechanism applied in noise addition. If independent Gaussian noise similar to DP-SGD (abadi2016deep; mcmahan18learning) is used in algorithm 2, we adopt the add-or-remove notion for the DP definition (dwork2014algorithmic; vadhan2017complexity). If tree-based noise similar to DP-FTRL (kairouz21practical) is used in algorithm 2, we adopt the add-or-remove with special element notion in (kairouz21practical) for the DP definition. We use the implementation in (pldlib) for RDP and PLD accounting for DP-FedEmb, and use the open-sourced implementation by (kairouz21practical) for DP-FTRL-FedEmb. Privacy amplification by sampling is leveraged for DP-FedEmb accounting. Note that the theory of amplification by sampling gives tighter guarantees when using Poisson sampling (abadi2016deep), while shuffling of data is used in most if not all DP training for deep learning. We use uniform sampling in simulation, which is close to, but not exactly the same as, Poisson sampling. Closing this gap between theory and practice is beyond the scope of this paper. We plan to release the accounting code.
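The difference between the two sampling schemes mentioned in the remark can be made concrete. The sketch below assumes that "uniform sampling" means drawing a fixed number of distinct users per round, which is one common reading; both function names are hypothetical.

```python
import random


def poisson_sample(user_ids, sampling_prob, rng):
    """Include each user independently with probability sampling_prob."""
    return [u for u in user_ids if rng.random() < sampling_prob]


def uniform_sample(user_ids, num_users, rng):
    """Draw a fixed number of distinct users uniformly at random."""
    return rng.sample(list(user_ids), num_users)
```

Under Poisson sampling the round size is random with mean `sampling_prob * len(user_ids)`, whereas the fixed-size scheme always returns exactly `num_users` users; this is the sense in which the two are close but not identical.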

### 4.1 Privacy-utility-computation trade-off

We study the privacy-utility-computation trade-offs of training ResNet-50 on DigiFace10K in fig. 2 and fig. 3. Figure 1(a) shows the privacy-utility trade-off. Without adding noise, the federated algorithms (DP-)FedAvg and (DP-)FedEmb can achieve even better results than the non-private centralized baseline, which is consistent with recent empirical and theoretical justifications that FedAvg is more accurate when learning representations (collins2022maml; collins2022fedavg). When the same noise multiplier is used, i.e., under same privacy budget, DP-FedEmb outperforms DP-FedAvg; and the margin increases when increasing the noise. We can observe the advantage of DP-FedEmb over DP-FedAvg even if we only have a small number of identities in DigiFace: the size of the head is , only 4.6% of the backbone ResNet-50 with parameters. For large scale data with 10 million identities, the size of the head can grow to , which is much larger than the backbone networks, and DP-FedAvg can easily fail in such settings.

It can be difficult to achieve a strong formal differential privacy bound without significantly hurting utility for DigiFace10K, which only has a small number of total users. We consider the practical setting of more available users, and study the privacy-computation trade-off in figs. 2(c), 2(b), 1(d) and 1(c) based on extrapolation. A key hypothesis following (mcmahan18learning; kairouz21practical) is used: utility (Recall@FAR) is non-decreasing when simultaneously increasing the number of clients per round and noise multiplier. The hypothesis is based on the fact that the signal-to-noise ratio is non-decreasing when linearly increasing the number of clients per round and noise multiplier, and has been verified in practice (ramaswamy2020training; dpftrl_blogpost).
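The signal-to-noise argument behind this hypothesis can be written out for the case of independent Gaussian noise. The notation is assumed for illustration: $n$ is the number of clients per round, $C$ the clip norm, $z$ the noise multiplier, and $\bar{\Delta}_i$ the clipped model delta of client $i$.

```latex
\tilde{\Delta} = \frac{1}{n}\Big(\sum_{i=1}^{n} \bar{\Delta}_i + \mathcal{N}\big(0,\, z^2 C^2 I\big)\Big)
\;\;\Longrightarrow\;\;
\text{noise std per coordinate of } \tilde{\Delta} = \frac{zC}{n},
\qquad
(n, z) \mapsto (kn, kz) \;\Rightarrow\; \frac{(kz)\,C}{kn} = \frac{zC}{n}.
```

Scaling the number of clients per round and the noise multiplier by the same factor $k$ therefore leaves the noise on the averaged update unchanged, while the privacy bound is recomputed by the accountant with the larger noise multiplier.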

We first choose the noise multiplier that is within a drop of recall@FAR= compared to centralized non-private training in figs. 2(a) and 1(a), based on simulation that samples virtual clients that each has users per round: recall@FAR= is for running DP-FedAvg with noise multiplier for rounds, for running DP-FedAvg with noise multiplier for rounds, for running DP-FedEmb with noise multiplier for rounds, for running DP-FedEmb with noise multiplier for rounds, and for running DP-FTRL-FedEmb with noise multiplier for rounds. Then we linearly increase the number of users sampled and use the increased noise multiplier in RDP accounting to compute privacy bound given to generate figs. 2(b), 1(d) and 1(c), and compute zCDP for fig. 2(c). Comparing curves of r400 and r800, training longer with larger noise is more effective than training shorter with smaller noise. Figure 1(c) suggests users per round is enough for DP-FedEmb to achieve single digit epsilon if users are available, while users per round are needed if users are available in fig. 1(d). users is the number of users per round in our current simulation, which can be achieved by training with computing resources for longer. In fig. 2(b), there is a crossover point when using DP-FTRL versus DP-SGD for DP-FedEmb, and DP-FTRL is more effective for relatively large privacy . Figure 2(c) shows that DP-FTRL-FedEmb can achieve zCDP smaller than , as used by US Census Bureau (bureau2021disclosure), when users per round and total users are available.
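For readers who want to relate zCDP values to $(\varepsilon, \delta)$, the textbook relations from (bun2016concentrated) can be sketched as below. This is a simplified illustration that assumes every round is a plain Gaussian mechanism with full participation; it ignores amplification by sampling and the tree-based noise of DP-FTRL, so it is not the accounting used to produce the reported bounds. The function names are hypothetical.

```python
import math


def zcdp_of_gaussian_rounds(noise_multiplier, num_rounds):
    """rho for composing num_rounds Gaussian mechanisms: each adds 1 / (2 z^2)."""
    return num_rounds / (2.0 * noise_multiplier ** 2)


def zcdp_to_epsilon(rho, delta):
    """Standard (not tight) conversion from rho-zCDP to (epsilon, delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))
```

The conversion shows why zCDP composes conveniently: the parameter adds up linearly across rounds, and the resulting $\varepsilon$ grows roughly with the square root of the number of rounds when the per-round contribution is small.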

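| Algorithm | Hyperparameters: Noise | Hyperparameters: SerLR | Privacy (10M users): RDP-$\varepsilon$ | Privacy (10M users): PLD-$\varepsilon$ | Privacy (10M users): zCDP | Recall@FAR: Validation | Recall@FAR: Test |
|---|---|---|---|---|---|---|---|
| Centralized | | | | | | | |
| DP-FedAvg | - | | | | | | |
| DP-FedEmb | - | | | | | | |
| DP-FTRL-FedEmb | - | | | | | | |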

### 4.2 Model evaluation

In table 2, we summarize the quantitative results from the privacy-utility-computation trade-off analysis in section 4.1. Each experiment is run three times to compute the mean and standard deviation. For similar recall@FAR on the DigiFace validation set, DP-FedEmb achieves a stronger privacy guarantee than the baseline DP-FedAvg, and the advantage of DP-FedEmb is expected to be more pronounced if a head for a larger number of identities is used for training. With the user population assumed in table 2 and enough users per round in training, single-digit $\varepsilon$ and a small zCDP value can be achieved while recall@FAR remains close to that of non-private centralized training. In addition to validation performance, the private models also perform well on the left-out test dataset. PLD accounting is tighter than RDP accounting, and future improvement on privacy accounting can help further improve the privacy guarantees. Figure 4 presents the training curves and ROC curves for comparing DP-FedEmb and DP-FedAvg under the same privacy budget. DP-FedEmb outperforms DP-FedAvg in all training rounds, and trains a stronger private model with better recall at different false accept rates.

### 4.3 Ablation study

We primarily use MobileNetV2 for ablation studies on DigiFace10K for two reasons: first, MobileNetV2 is smaller and faster to train in experiments; second, it lets us test the generalization of DP-FedEmb and avoid overfitting to ResNet-50.

Parameter freezing. In figs. 2(a) and 1(a), we notice that the utility measured by recall@FAR= can decrease faster when increasing the noise multiplier than observed for models in previous work (kairouz21practical). After using DP-FedEmb to reduce the number of parameters to be noised, the ResNet-50 backbone still has parameters, which is the language model in (kairouz21practical) that has parameters. We explore freezing parameters and the alternative model architecture MobileNetV2 of parameters. We train the parameters of all normalization layers, and gradually freeze the convolutional kernels from lower level to higher level (with or without the input convolutional layers) to generate fig. 4(a). For image-to-embedding models, recall@FAR= increases linearly with the number of trainable parameters, which is different from the observation that models are redundant for image classification (frankle2021training; sidahmed2021efficient). Additionally training the input convolutional layers (cattan2022fine) is more efficient than only training the higher levels of the network. In fig. 4(b), we freeze the convolutional kernels of the two intermediate groups of residual blocks (out of the total four groups) in ResNet-50, which leads to a backbone network of parameters. The partially frozen model is inferior to the full model for small-medium noise, and only effective in the low-utility regime of large noise .

For similar parameter size, MobileNetV2 outperforms ResNet-50 with frozen parameters, and fig. 4(c) shows the privacy-utility trade-off. Recall@FAR= of DP-FedEmb-r800 on MobileNetV2 only drops from to when noise is added, while ResNet-50 drops from to . However, ResNet-50 still outperforms MobileNetV2 by a large margin in the high-utility regime. DP-FedAvg is worse than DP-FedEmb when noise is added.

Public pretraining. Even though the input image size of DigiFace is , different from the ImageNet pretraining image size of , the pretrained scale-invariant backbone can consistently improve the performance by under the same noise level, as shown in fig. 4(c). Comparing curves of round 400 and round 800, the gain of public pretraining is larger when trained with a smaller number of rounds. We also pretrain a few different MobileNetV2 models on ImageNet by varying the total training epochs, and summarize the results in fig. 4(d). Though the private fine-tuning utility is not linearly increasing with pretraining accuracy, there seems to be a general positive correlation: better pretrained models can lead to better private models, except for one outlier where an inferior pretrained model causes difficulty in training. Without private training, the recall@FAR= of these pretrained models on DigiFace (with ImageNet validation accuracy) is smaller than . Due to the domain difference, the utility of the pretrained model on DigiFace can be low, and it may not be consistent with the accuracy on ImageNet. For example, a pretrained MobileNetV2 can achieve accuracy on ImageNet while only recall@FAR= on DigiFace, but it can boost the recall@FAR= of training with DP-FedEmb and noise, from for round and for round to and , respectively. Finally, pretraining may not always help. For example, when pretraining from the (preprocessed) Google Landmark (GLD) dataset, the final recall@FAR= can be worse than without pretraining.

Federated settings. In the above experiments, we fix important hyperparameters for the federated setting: users per virtual client is , virtual clients per round is , examples per client is capped at , the head learning rate (LR) scale is , the buffer size for data shuffling on clients is , and the batch size for local SGD is . In fig. 6, instead of tuning these hyperparameters in advance, we conduct a study on these hyperparameters to understand DP-FedEmb. Among these hyperparameters, figs. 5(d) and 5(a) suggest virtual clients and head LR scale are particularly important for DP-FedEmb to be on par with non-private centralized training, which is an important contribution of this work. Figures 5(c), 5(b) and 5(a) suggest that users per virtual client, clients per round, and examples per client only need to be large enough, subject to privacy considerations and available computation resources. The head LR scale has a large tuning range between 50 and 500 in fig. 5(d). Recall@FAR= is not sensitive to shuffle buffer size in fig. 5(e). The model utility can potentially be improved if we further tune the client batch size, as suggested by fig. 5(f). We fixed the learning rate and other hyperparameters while varying one of the hyperparameters in the ablation study. How to automate tuning, especially tuning with differential privacy guarantees, is important future work.
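To make the federated hyperparameters above concrete, the following is a minimal sketch of how virtual clients could be assembled in a data center simulation. It is written under stated assumptions and is not our implementation: the function name `build_virtual_clients`, the dictionary layout, the random grouping of users, and keeping a final smaller group are all hypothetical choices. Each user is placed in exactly one virtual client, and the pooled examples are shuffled and then capped.

```python
import random


def build_virtual_clients(user_examples, users_per_client, max_examples, seed=0):
    """Group users into virtual clients and cap the examples per client.

    user_examples: dict mapping a user id to that user's list of examples.
    Returns a list of dicts with the grouped user ids and (user id, example) pairs.
    """
    rng = random.Random(seed)
    user_ids = sorted(user_examples)
    rng.shuffle(user_ids)  # assumption: users are grouped uniformly at random

    clients = []
    for start in range(0, len(user_ids), users_per_client):
        group = user_ids[start:start + users_per_client]
        # Pool the examples of all users in the group, tagged with the user id,
        # which can serve as the class label for a local classification head.
        pooled = [(uid, ex) for uid in group for ex in user_examples[uid]]
        rng.shuffle(pooled)
        clients.append({"users": group, "examples": pooled[:max_examples]})
    return clients


# Hypothetical toy data: 10 users with 3 examples each.
toy = {uid: [f"img_{uid}_{k}" for k in range(3)] for uid in range(10)}
for client in build_virtual_clients(toy, users_per_client=4, max_examples=5):
    print(client["users"], len(client["examples"]))
```

Sampling virtual clients per round then amounts to drawing from the returned list, so users per virtual client and virtual clients per round jointly determine the number of users per round.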

Learning rate (LR). For experiments in sections 4.2 and 4.1, server and client learning rates are first tuned for non-private federated training with adaptive clipping of quantile (andrew2019differentially). Then the server learning rate is tuned when adding noise for private training, while the estimated clip norm and client learning rate are fixed. The tuning range of the learning rates is . Figures 6(b) and 6(a) suggest the optimal learning rates are similar for training 400 rounds or 800 rounds in non-private training. We then fix the client learning rate to be and use the estimated clip norm for MobileNetV2 in fig. 6(c), and observe that the fixed clip results are very similar to the adaptive clipping results. The best server learning rate for private training with noise can be smaller than for non-private training, where the difference is even more notable for the larger model ResNet-50 in fig. 6(d).

Variants of DP-FedEmb. We use local fine-tuning with different learning rates to update backbone and head parameters. An alternative is to reconstruct the head first before fine-tuning the backbone (singhal2021federated; kumar2022fine). We empirically find that head reconstruction can only achieve performance similar to the proposed fine-tuning when the backbone network receives the same number of updates or more, and hence use local fine-tuning for efficiency. It is also possible to use a binary or triplet loss within a virtual client. In our preliminary results, these losses achieve inferior results compared to DP-FedEmb, which uses the multi-class cross-entropy loss. We leave other improvements such as the ArcFace loss (deng2019arcface) as future work.
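The adaptive clipping used in the learning rate study above can be illustrated with a minimal sketch of the geometric quantile update from (andrew2019differentially). This is the non-private form: the fraction of updates below the clip norm is used directly without noise, and the function name, the step size `lr`, and the toy norms are hypothetical rather than values from our experiments.

```python
import math


def update_clip_norm(clip_norm, update_norms, target_quantile, lr=0.2):
    """One geometric update of the clip norm toward a target quantile.

    The clip norm grows when fewer than `target_quantile` of the update norms
    fall below it, and shrinks when more do.
    """
    below = sum(1 for norm in update_norms if norm <= clip_norm) / len(update_norms)
    return clip_norm * math.exp(-lr * (below - target_quantile))


# Hypothetical per-client update norms in one round, reused across rounds.
norms = [float(k) for k in range(1, 101)]
clip = 1.0
for _ in range(500):
    clip = update_clip_norm(clip, norms, target_quantile=0.5)
print(clip)  # settles near the median of the norms
```

Once the estimate has stabilized, it can be frozen and reused as a fixed clip norm, which is the setting compared against adaptive clipping in fig. 6(c).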

### 4.4 Additional results

Dataset | Algorithm | Hyperparameters | Recall@FAR= / | ||||

Noise | SerLR | CliLR | Clip | Approx | AllPair | ||

DigiFace | DP-FedEmb | - | |||||

DP-FedAvg | - | ||||||

EMNIST | DP-FedEmb | ||||||

DP-FedAvg | |||||||

GLD | DP-FedEmb | 1 | |||||

DP-FedAvg | |||||||

iNat | DP-FedEmb | ||||||

DP-FedAvg |

We run additional experiments of ResNet-50 on the larger DigiFace of users. Because a lot of users in DigiFace have only images, we set each virtual client to contain users, and use samples per virtual client. We sample virtual clients per round, and a relatively large noise in PLD accounting can achieve without extrapolation. The utility measured by recall@FAR= is shown in table 3. Though the utilities of both methods are significantly degraded by the large noise, DP-FedEmb is much better than the DP-FedAvg baseline because the noise is only added to the backbone of parameters instead of the backbone plus head of parameters. For EMNIST, we train the embedding model on images of class and test on images of class . Using a relatively large noise multiplier for rounds, and sampling users per virtual client and virtual clients per round, can be achieved given users. A small network with two convolutional layers similar to LeNet (lecun1998gradient) is used as the backbone network, and no pretrained model is used for initialization. We provide results on EMNIST primarily for reproducibility, as the scale of EMNIST is smaller than the other datasets used in this paper.

In table 3, we also conduct experiments with MobileNetV2 on the Google Landmark Dataset (GLD) (weyand2020google; hsu2020federated; gldv2) and the iNaturalist (iNat) dataset (liu2015faceattributes; hsu2020federated; inat) to demonstrate the generalization of DP-FedEmb. We use a public model pretrained on ImageNet, and additionally report an approximate recall@FAR computed from pairwise similarity within minibatches, which is easy to reproduce and consistent with the all-pair recall@FAR. We fix the hyperparameters for the federated settings, and compare the performance of DP-FedEmb and DP-FedAvg under the same privacy budget (noise multiplier). Since each user already has multiple classes in GLD, we use a smaller number of users, , in each virtual client. We also use a smaller number of virtual clients per round, , for fast experiments and strong sampling effect. Recall@FAR= of DP-FedEmb and DP-FedAvg with small noise multiplier on GLD outperforms centralized training. For iNat, we use even fewer users per virtual client, four, and train for only rounds, and use a relatively large noise multiplier to get a single-digit DP guarantee for users. We report recall@FAR= instead of recall@FAR= for the challenging iNat task. In all experiments, DP-FedEmb consistently outperforms DP-FedAvg.
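As an illustration of the metric reported throughout, the following is a minimal sketch of recall at a fixed false accept rate computed from pairwise cosine similarity. It reflects the common definition (choose the similarity threshold so that at most the target fraction of different-class pairs is accepted, then measure the fraction of same-class pairs accepted) and is not our evaluation code. The function names are hypothetical, and the sketch assumes there is at least one same-class pair and one different-class pair. Applying it to each minibatch and averaging would be one way to obtain an approximate value, while applying it to the whole evaluation set corresponds to the all-pair variant.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def recall_at_far(embeddings, labels, far):
    """Recall over same-class pairs at a false accept rate of at most `far`."""
    pos, neg = [], []
    for i, j in combinations(range(len(labels)), 2):
        score = cosine(embeddings[i], embeddings[j])
        (pos if labels[i] == labels[j] else neg).append(score)

    # Accept a pair when its score is strictly above the threshold, so that
    # at most floor(far * len(neg)) different-class pairs are accepted.
    neg.sort(reverse=True)
    allowed = int(far * len(neg))
    threshold = neg[allowed] if allowed < len(neg) else -math.inf
    return sum(score > threshold for score in pos) / len(pos)


# Hypothetical toy embeddings with two well-separated classes.
emb = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
print(recall_at_far(emb, [0, 0, 1, 1], far=0.0))
```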

## 5 Conclusion

This paper presented DP-FedEmb for training embedding models with user-level differential privacy. We show how practical utility with strong privacy guarantees can be achieved in the data center, thanks to key algorithm design choices around the construction of virtual clients and the selection of what information is shared among users. Our experiments validate that this approach improves the privacy-utility trade-off over vanilla DP-FedAvg for supervised representation learning. Though strong formal DP bounds at practical levels of utility could only be achieved when millions of users participate in training, DP-FedEmb is designed to be exceptionally scalable when model size and class space increase with the number of users. DP-FedEmb can also be applied to decentralized FL when each real client contains multiple classes, which can possibly reduce the necessity of virtual clients. Finally, DP is a worst-case guarantee that can be improved by both algorithmic design and advanced accounting methods; the non-negligible noise we added for the small-scale datasets in experiments is ready to be empirically audited for privacy.

### Acknowledgement

The authors would like to thank Zachary Garrett, Keith Rush, and the TFF team for simulation support; Viral Carpenter and Janel Thamkul for support through the internal review process; Jun Xie and Lior Shapira for early discussion; and Peter Kairouz for early feedback.