1 Introduction
End-to-end automatic speech recognition (ASR), which directly transcribes speech to text without predefined alignments, has been steadily gaining popularity over conventional pipeline frameworks. State-of-the-art ASR models have achieved human parity in conversational speech recognition (Xiong et al., 2017). Training such models often requires a large amount of user-spoken utterances. In the speech domain, training data includes audio and transcripts of utterances, which can directly expose sensitive information, or make it possible to leak attributes such as gender, dialect, or identity of the speaker.
In distributed frameworks such as Federated Learning (FL) (McMahan & Ramage, 2017), model training is performed on mobile devices, with only gradients transmitted over the network, allowing training over large populations (Bonawitz et al., 2019) while ensuring raw data remains on-device. Many works have shown the competitive performance of FL-trained models on sequential modeling tasks like keyboard prediction (Hard et al., 2018) and keyword spotting (Leroy et al., 2019; Hard et al., 2020), as well as in speech recognition (Dimitriadis et al., 2020; Guliani et al., 2020).
A recent line of work (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020) has focused on demonstrating leakage of information about training data from the gradients used in model training. At a high level, these works aim to reconstruct training samples by designing optimization methods that construct objects whose gradient matches the observed gradient. For instance, a number of existing methods have been shown to successfully reconstruct images used for training image classification models. As we discuss later (in Section 3.3), there are fundamental challenges, like variable-sized inputs/outputs, that render such methods inapplicable in the speech domain.
In this work, we study information leakage from gradients in ASR model training. In particular, we design a method to reveal the speaker identity of a training utterance from a model gradient computed using that utterance. Given that ASR models can have training utterances and transcripts of arbitrary lengths, for computational efficiency and to avoid potential false positives, we assume that the transcript and the length of the training utterance are known. We start by designing Hessian-Free Gradients Matching (HFGM), a technique to reconstruct speech features used in computing gradients for training ASR models. Our HFGM technique eliminates the need for second-order derivatives (i.e., the Hessian) of the loss function, which were required in prior works (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020) and can be expensive to compute. Next, our method uses the reconstructed features and a speaker identification model to uniquely identify the speaker from a list of speakers by comparing speaker embeddings.
To our knowledge, this is the first method in the speech domain that can be used for revealing information about training samples from gradients. We demonstrate the efficacy of our method by conducting experiments using the LibriSpeech data set (Panayotov et al., 2015) and the DeepSpeech (Hannun et al., 2014) model architecture. We find that our method is successful in revealing the speaker identity with 34% top-1 accuracy (51% top-5 accuracy) among 2.5k speakers.
We also study the effect of two standard training techniques, namely Differentially Private Stochastic Gradient Descent (DP-SGD) (Bassily et al., 2014; Abadi et al., 2016) and Dropout (Srivastava et al., 2014), on the success of our method. DP-SGD is the state-of-the-art technique for training deep neural networks with Differential Privacy (DP) (Dwork et al., 2006a;b) guarantees. Intuitively, DP prevents an adversary from confidently making any conclusions about whether any particular sample was used to train a model, even with access to the model and arbitrary external side information. While we demonstrate that using DP-SGD can mitigate the success of our method, we find (in line with prior works (Abadi et al., 2016; McMahan et al., 2018; Thakkar et al., 2020)) that training large models using DP-SGD can significantly affect model utility. The well-known technique of Dropout (Hinton et al., 2012; Srivastava et al., 2014), which randomly drops hidden units of a neural network during training, is commonly employed to avoid overfitting in large models. We show that using dropout can reduce the speaker identification accuracy of our method to 0% top-1 (0.5% top-5), without compromising the utility of the trained model.
We make the following contributions:

We design the first method to reveal the speaker identity of an utterance used in ASR model training, with access to only a gradient computed using the utterance. We achieve this via Hessian-Free Gradients Matching, an input reconstruction technique that operates without needing second-order derivatives of the loss function.

We empirically demonstrate the effectiveness of our method, using the DeepSpeech model architecture, in revealing speaker identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech data set. To spur further research, we provide an open-source implementation of our experimental framework (https://github.com/googleinterns/deepspeech-reconstruction).
We study the effect of two standard training techniques, DP-SGD and Dropout, on the success of our method. We empirically demonstrate that using dropout can reduce the success of our method to 0% top-1 (0.5% top-5) accuracy.
We conclude by exploring the effectiveness of our method in two more complex regimes, where instead of individual gradients, the method can access 1) only the average gradient from a mini-batch of samples, and 2) an update comprising multiple gradient descent steps using a training sample. We demonstrate that in both of the above settings, our method reveals speaker identity with non-trivial accuracy, whereas training with dropout is effective in reducing its success.
2 Background
In this section, we start by revisiting the idea of Gradients Matching (GM) (Zhu et al., 2019), which has been successfully applied to reconstruct images from gradients of an image recognition model. We then introduce end-to-end ASR models and a popular architecture using the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006). We also provide some background on zeroth-order optimization, specifically the direct search algorithm, and on speaker identification models with triplet loss, which form critical components of our proposed method.
Gradients Matching & Deep Leakage from Gradients (DLG) Algorithm
DLG was introduced by Zhu et al. (2019) as a method to reconstruct an input $x$ and output $y$ given a model gradient $\nabla_\theta \mathcal{L}(x, y)$, where $\mathcal{L}$ denotes the loss function and $\theta$ denotes the model parameters (when it is clear from the context that the gradient is w.r.t. the model parameters $\theta$, we just denote it by $\nabla \mathcal{L}(x, y)$). The algorithm attempts to find an input-output pair $(x', y')$ whose gradient matches $\nabla \mathcal{L}(x, y)$. The general idea is also referred to as Gradients Matching (GM). A dummy input $x'$ and a dummy label $y'$ are fed into the model to get dummy gradients $\nabla \mathcal{L}(x', y')$. Reconstructed objects are obtained by minimizing the Euclidean distance between the dummy gradients and the client update:

$$x^*, y^* = \arg\min_{x', y'} \big\| \nabla \mathcal{L}(x', y') - \nabla \mathcal{L}(x, y) \big\|^2 \qquad (1)$$
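As a toy illustration of Eq. (1), the sketch below matches gradients of a tiny linear model with a finite-difference descent loop; the model, dimensions, and optimizer are our own stand-ins for illustration, not the setup of Zhu et al. (2019):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # model parameters, known to the observer
x_true = rng.normal(size=3)   # private training input
y = rng.normal(size=2)        # training label (assumed known here, for simplicity)

def grad_wrt_W(x):
    """Gradient of the squared loss ||Wx - y||^2 with respect to W."""
    return 2.0 * np.outer(W @ x - y, x)

g_observed = grad_wrt_W(x_true)  # the gradient shared during training

def matching_loss(x):
    """Euclidean gradient-matching objective of Eq. (1), with y fixed."""
    return float(np.sum((grad_wrt_W(x) - g_observed) ** 2))

# Minimize the matching loss over a dummy input with finite-difference
# gradient descent plus backtracking (a simple stand-in for L-BFGS/Adam).
x_dummy = rng.normal(size=3)
init_loss = matching_loss(x_dummy)
eps = 1e-6
for _ in range(500):
    fd_grad = np.array([
        (matching_loss(x_dummy + eps * e) - matching_loss(x_dummy - eps * e)) / (2 * eps)
        for e in np.eye(3)])
    step = 0.1
    while step > 1e-10 and matching_loss(x_dummy - step * fd_grad) >= matching_loss(x_dummy):
        step *= 0.5                       # backtrack until the step decreases the loss
    if matching_loss(x_dummy - step * fd_grad) < matching_loss(x_dummy):
        x_dummy = x_dummy - step * fd_grad
```

The matching loss is exactly zero at the true input, and the descent loop monotonically drives the dummy input toward a gradient match.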
Geiping et al. (2020) provide an extension of DLG that works with larger images and trained models. They adopt the cosine similarity (also shown in (4)) to measure the gradients distance, which matches only the direction of the gradients; a total-variation regularization term is also added. They assume that the training labels are known and formulate the reconstruction as finding $x^*$ to minimize:

$$x^* = \arg\min_{x'} \; 1 - \frac{\langle \nabla \mathcal{L}(x', y), \nabla \mathcal{L}(x, y) \rangle}{\| \nabla \mathcal{L}(x', y) \| \, \| \nabla \mathcal{L}(x, y) \|} + \alpha \, \mathrm{TV}(x') \qquad (2)$$

Both first-order and second-order optimization methods can be used to optimize (1) and (2). While first-order techniques, such as Stochastic Gradient Descent (SGD) and the Adam optimizer, are usually faster to compute, second-order methods like L-BFGS can better escape slow convergence paths (Battiti, 1992). An observation about these objectives is the presence of the first-order derivatives $\nabla \mathcal{L}(x', y')$, which requires second-order differentiation w.r.t. $x'$ to compute the gradients of the matching objective. These second-order gradients can usually be derived with auto-differentiation, commonly implemented in deep learning frameworks.
End-to-end ASR with CTC loss
Recently, end-to-end models have achieved superior performance while being simpler than traditional pipeline models (Graves et al., 2006; Graves, 2012; Chorowski et al., 2014, 2015; Chan et al., 2016; Xiong et al., 2017). Two major lines of end-to-end ASR architectures are based on connectionist temporal classification (CTC) (Graves et al., 2006) and the attention-based encoder-decoder mechanism (Chorowski et al., 2014).
We focus on models with CTC loss in this work. CTC models are able to align network inputs of length $T$ with label sequences of length $U \le T$ without predefined alignments. The set of labels $L$ is extended with a blank label, denoted as "$-$", to form $L' = L \cup \{-\}$. A label sequence can be mapped, with a one-to-many mapping $B^{-1}$, to CTC paths (Graves et al., 2006); for example, "aab" can be mapped to both "a$-$ab" and "aa$-$abb".
A neural network $h$, usually including one or several bidirectional RNNs to model frame dependencies, is used to map a sequence of $d$-dimensional speech features $x \in \mathbb{R}^{T \times d}$ to per-frame distributions $h_t(\cdot)$ over $L'$. The probability of a CTC path $\pi \in L'^T$ is defined as $p(\pi \mid x) = \prod_{t=1}^{T} h_t(\pi_t)$. The likelihood of a label sequence $y$ follows as the sum over probabilities of all CTC paths:

$$p(y \mid x) = \sum_{\pi \in B^{-1}(y)} p(\pi \mid x) \qquad (3)$$

The model is optimized by maximizing this likelihood, i.e., minimizing $\mathcal{L} = -\log p(y \mid x)$. During inference, a greedy search or beam search is conducted to find a sequence $y^*$ that maximizes $p(y^* \mid x)$.
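The collapsing mapping $B$ and the path sum of Eq. (3) can be illustrated by brute force on a toy three-frame example; the per-frame distributions below are made up for illustration:

```python
import itertools

BLANK = "-"

def collapse(path):
    """CTC collapsing function B: merge repeated symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)

def ctc_likelihood(probs, label):
    """Sum of probabilities of all CTC paths collapsing to `label` (Eq. (3)).
    probs[t][s] is the per-frame probability of symbol s at frame t."""
    symbols = list(probs[0].keys())
    total = 0.0
    for path in itertools.product(symbols, repeat=len(probs)):
        if collapse(path) == label:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total

# Three frames over the alphabet {a, b} extended with the blank.
probs = [{"a": 0.6, "b": 0.1, "-": 0.3},
         {"a": 0.2, "b": 0.5, "-": 0.3},
         {"a": 0.1, "b": 0.6, "-": 0.3}]
```

Since every path maps to exactly one label sequence, the likelihoods of all label sequences sum to one; real CTC implementations compute Eq. (3) with dynamic programming rather than this exponential enumeration.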
Zeroth-order Optimization
Zeroth-order optimization is the process of minimizing an objective given access only to the objective values at chosen inputs. The standard approach to zeroth-order optimization is to estimate the gradients (Flaxman et al., 2005). However, gradient estimation suffers from high variance due to non-robust local minima or highly non-smooth objectives (Golovin et al., 2019). A direct search algorithm (also known as pattern search, derivative-free search, or black-box search), which samples a direction vector and moves along it if the objective improves, performs well in several settings, such as reinforcement learning (Mania et al., 2018). Additionally, direct search with a binary search on the step size, referred to as Gradientless Descent (Golovin et al., 2019), is proven to be fast for high-dimensional zeroth-order optimization.
Speaker Identification with Triplet Loss
Recent approaches (Snyder et al., 2018; Chung et al., 2018) formulate speaker identification as learning speaker-discriminative embeddings of an utterance. The embeddings are extracted from variable-length acoustic segments via a deep neural network. The multi-class cross-entropy loss (Snyder et al., 2017, 2018) and the triplet loss (Zhang & Koishida, 2017; Li et al., 2017; Chung et al., 2018) are two common approaches to train the embeddings. In this work, we adopt the triplet loss, which operates on pairs of embeddings, trying to minimize the distance between embeddings from the same speaker and maximize the distance to embeddings from negative samples.
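A minimal sketch of the triplet loss on a single (anchor, positive, negative) embedding triple; the squared-distance form and the margin value are illustrative conventions, not the exact formulation of the cited works:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on one (anchor, positive, negative) triple: pull
    same-speaker embeddings together, push different-speaker ones apart."""
    d_pos = np.sum((anchor - positive) ** 2)   # distance to same-speaker embedding
    d_neg = np.sum((anchor - negative) ** 2)   # distance to other-speaker embedding
    return max(0.0, d_pos - d_neg + margin)    # hinge: zero once separated by margin
```

The loss vanishes once the negative is farther from the anchor than the positive by at least the margin.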
3 A Method to Reveal Speaker Identity
In this section, we describe our method to reveal the speaker identity of a training sample $(x, y)$ given its model gradient $\nabla \mathcal{L}(x, y)$. Here, $x \in \mathbb{R}^{T \times d}$ denotes the input speech features created from the training utterance, $T$ is the length of the input, $d$ is the dimension of the input speech features, $y$ is the output label sequence, and $\mathcal{L}$ denotes the training loss function. We split our method into two phases: (1) using Hessian-Free Gradients Matching to reconstruct the input speech features (reconstruction phase), and (2) identifying the speaker from the reconstructed speech features (inference phase). Figure 1 provides an illustration of our method.
3.1 Reconstruction Phase: Hessian-Free Gradients Matching (HFGM)
Given access to a gradient $\nabla \mathcal{L}(x, y)$, generally one would like to find a pair of speech features $x'$ and transcript $y'$ such that $\nabla \mathcal{L}(x', y') = \nabla \mathcal{L}(x, y)$. However, ASR models are typically sequence-to-sequence models that can map arbitrary-length speech features to arbitrary-length transcripts. With no additional information, the number of values $y'$ can take grows exponentially with its length, and searching through them can incur a prohibitive computational cost. Moreover, there can be many false positives, i.e., pairs $(x', y') \neq (x, y)$ such that $\nabla \mathcal{L}(x', y') = \nabla \mathcal{L}(x, y)$. To circumvent this issue and make our problem simpler, we assume the transcript $y$ of the training utterance is given. (Note that the transcript could be a common phrase, e.g., "play music". Our objective is to identify the speaker of an utterance regardless of the contents of its transcript.) Even though $y$ is given, there could exist multiple speech features of lengths different from $T$ whose gradients match $\nabla \mathcal{L}(x, y)$. Thus, we also assume the input length $T$ is given. Note that even if the transcript and the length of the input speech features are known, revealing the identity of the speaker can still result in a significant breach of privacy. Designing efficient reconstruction methods that operate without these assumptions is an interesting direction, which we leave for future work. (Using the experimental setup in Section 4, we provide some preliminary results in Appendix B.1 on reconstruction i) without knowledge of the length of the input speech features, and ii) with knowledge of only the transcript length. We find that while reconstruction can succeed even with only a good estimate of the input length, it fails with no knowledge of the transcript.)
Now, we define the reconstruction task as constructing an $x'$ such that $\nabla \mathcal{L}(x', y)$ is close to the observed gradient $\nabla \mathcal{L}(x, y)$. Following Geiping et al. (2020), we choose the cosine distance as our measure of closeness, and formulate our optimization problem as finding $x^*$ s.t.:

$$x^* = \arg\min_{x'} \; \mathcal{D}(x') = 1 - \frac{\langle \nabla \mathcal{L}(x', y), \nabla \mathcal{L}(x, y) \rangle}{\| \nabla \mathcal{L}(x', y) \| \, \| \nabla \mathcal{L}(x, y) \|} \qquad (4)$$
To solve this non-convex optimization problem using gradient-based methods, as in prior work (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020), we would need to compute the second-order derivatives of the loss function $\mathcal{L}$. The loss function of interest here is the CTC loss, which is commonly used in end-to-end ASR systems. However, computing the second derivative of the CTC loss involves backpropagating twice through a dynamic programming algorithm, which we found to be intractable. (Additionally, the second derivatives of the CTC loss are not implemented in common deep learning frameworks like TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019).) To tackle this challenge, and also to address a broader family of loss functions, we adopt a zeroth-order optimization algorithm. We use a direct search approach (Section 2) called HFGM (Algorithm 1). We initialize $x'$ with uniformly random values. At each iteration, we sample random unit vectors and apply them to $x'$. The value $\mathcal{D}$ is evaluated at each of these points. We choose only the vectors that lower $\mathcal{D}$, sum these up, and apply the sum with a learning rate. We repeat the process until we reach the convergence criteria.
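A simplified sketch of the direct-search idea behind HFGM, applied to a toy quadratic objective standing in for the matching distance $\mathcal{D}$; the accept-if-improved guard, the fallback to the best single direction, and all constants are our own simplifications of Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def hfgm_direct_search(objective, x0, n_dirs=16, lr=0.5, min_lr=1e-3, iters=400):
    """Direct search in the spirit of Algorithm 1: probe random unit
    vectors, keep the ones that lower the objective, and apply their
    scaled sum; shrink the step size when no direction helps."""
    x = x0.copy()
    best = objective(x)
    for _ in range(iters):
        dirs = rng.normal(size=(n_dirs, x.size))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit vectors
        improving = [u for u in dirs if objective(x + lr * u) < best]
        if improving:
            cand = x + lr * np.sum(improving, axis=0)
            if objective(cand) >= best:                       # summed step overshot:
                cand = min((x + lr * u for u in improving), key=objective)
            x, best = cand, objective(cand)
            continue
        lr *= 0.5                                             # no progress: shrink step
        if lr < min_lr:
            break
    return x

# Toy stand-in for the gradient-matching distance D of Eq. (4).
target = rng.normal(size=8)
f = lambda x: float(np.sum((x - target) ** 2))
x0 = rng.normal(size=8)
x_rec = hfgm_direct_search(f, x0)
```

No derivative of the objective is ever taken, which is the point: the same loop applies to losses whose second derivatives are intractable.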
3.2 Inference Phase: Revealing Speaker Identity
In the second part of our method, we use the reconstructed speech features $x^*$ to identify the speaker of the utterance from a list of possible speakers.
We train a speaker identification model, which uses the same speech features as our ASR model, on some corpus. We assume access to some public utterances from each possible speaker. We use the speaker identification model to create embeddings for each speaker from their public utterances. We then take the reconstructed speech features $x^*$, create an embedding using the speaker identification model, and compare it with the embeddings for each speaker. If the method is successful, the embedding created from $x^*$ is closer to the embeddings of the speaker of the utterance than to those of the other speakers.
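The inference phase reduces to ranking enrolled speakers by the average similarity of their embeddings to the reconstruction's embedding; a minimal sketch with made-up 2-dimensional embeddings (real speaker embeddings are of course much higher-dimensional):

```python
import numpy as np

def identify_speaker(rec_embedding, enrolled):
    """Rank enrolled speakers by average cosine similarity between the
    reconstruction's embedding and each speaker's utterance embeddings."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {spk: np.mean([cos(rec_embedding, e) for e in embs])
              for spk, embs in enrolled.items()}
    return sorted(scores, key=scores.get, reverse=True)  # best match first

# Two enrolled speakers, two public-utterance embeddings each (illustrative).
enrolled = {
    "spk_a": [np.array([1.0, 0.0]), np.array([0.9, 0.1])],
    "spk_b": [np.array([0.0, 1.0]), np.array([0.1, 0.9])],
}
ranking = identify_speaker(np.array([0.8, 0.2]), enrolled)
```

The reconstruction's embedding here points toward speaker A's cluster, so A is ranked first.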
3.3 Comparison with Related Prior Works
Our work differs from the related prior works (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020) in a few ways.
Input features: The input to ASR models is typically not the raw audio but speech features, computed from the raw audio using a series of lossy transformations. In image recognition models, the input to the model is typically the raw pixel values. While prior works on image recognition models demonstrate a breach of privacy by directly reconstructing the input to the model, we incorporate an additional inference phase where we use the reconstructed input to reveal the identity of the speaker.
Variable-sized inputs and outputs: We focus on ASR models, which have variable-sized inputs and outputs; image recognition models have fixed-size inputs and outputs.
CTC Loss: The models we focus on use the CTC loss instead of the cross-entropy loss. CTC loss is significantly more complex, requiring a dynamic programming algorithm to compute the value of the loss function and its derivatives.
4 Experiments
In this section, we provide empirical results on the effectiveness of our method. We start by describing our setup.
Model Architectures
Following prior work (Carlini & Wagner, 2018), we choose the DeepSpeech (Hannun et al., 2014) model architecture for our experiments. The model consists of three feed-forward layers, followed by a single bidirectional LSTM layer, and two feed-forward layers producing softmax probabilities for the CTC loss. DeepSpeech uses a character-based CTC loss: the output is a sequence of characters. The input is a sequence of 26-dimensional mel-frequency cepstral coefficient (MFCC) features. MFCCs are a popular speech feature, derived by mapping the power of the result of a Fourier transform to the mel scale, and then performing a discrete cosine transformation. We use Mozilla's implementation of DeepSpeech (https://github.com/mozilla/DeepSpeech). We conduct our experiments using randomly initialized weights for the model. For the inference phase, we follow Li et al. (2017) to train a text-independent speaker identification model on 26-dimensional normalized MFCCs, similar to the speech features used as inputs to DeepSpeech.
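The MFCC pipeline described above (Fourier power spectrum, mel filterbank, log, DCT) can be written out roughly as follows for a single frame; the constants and the triangular-filter construction are illustrative, not DeepSpeech's exact front end:

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_mels=26):
    """Toy MFCC computation for one audio frame:
    power spectrum -> triangular mel filterbank -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2                  # power spectrum
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):                                  # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(fbank @ power + 1e-10)                  # log mel energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * np.outer(n, n + 0.5))      # DCT-II basis
    return dct @ log_mel                                     # 26 coefficients

rng = np.random.default_rng(0)
coeffs = mfcc_frame(rng.normal(size=512))
```

The lossiness of this pipeline (power spectrum discards phase, the filterbank pools frequencies) is why the inference phase compares embeddings rather than trying to invert the features back to audio.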
Dataset
We choose the LibriSpeech ASR corpus (Panayotov et al., 2015), a large-scale benchmark speech data set, for our experiments. The data set contains pairs of audio and transcript, along with speaker attributes such as gender and identity. For training the speaker identification model, we first combine the dev-{clean/other}, test-{clean/other}, and train-{clean-100/clean-360/other-500} sets to obtain 300k utterances from 2,484 speakers, and use the first 5 utterances of each speaker for training.
For the reconstruction phase, we trim the leading and trailing silences, based on the intensity of every 10ms chunk, from each utterance in the remaining combined test set. Next, we randomly sample a total of 600 utterances: 100 for each 0.5s interval of audio length between 1s and 4s. The average audio length in our sampled set is 2.5 seconds, and the average transcript length is 40.6 characters. The male:female ratio in the sampled utterances is 1.1:1.
Implementation Details
In this section, all the experiments consider the scenario of revealing speaker identity from a single gradient computed using a single utterance. For computational efficiency, we match gradients only for the last layer (60k parameters). Note that matching lower layers may increase the reconstruction quality.
Each dummy input in our reconstruction is initialized with uniformly random values. When performing direct search, we sample 128 unit vectors per iteration, each of which updates only a single frame. We set the step size to 1, and halve it after every 2.5k iterations in which the loss does not decrease by more than 5%. We stop the reconstruction when the step size reaches 0.125. We run each reconstruction on a single Tesla V100 GPU. The reconstruction time depends on the length of the inputs, ranging from 3 to 6 hours.
Evaluation Metrics
To evaluate our reconstruction, we use the Mean Absolute Error (MAE) to measure the distance of the reconstructed normalized MFCCs from those of the original utterance. During inference, the similarity scores of a reconstructed object's embedding with each of the 5 available utterances' embeddings in the training data are averaged and ranked to identify the speaker. We use Top-1 Accuracy, Top-5 Accuracy, and Mean Reciprocal Rank (MRR) to evaluate the speaker identity leakage. In experiments where we use alternate training methods, the Word Error Rate (WER) is used to evaluate the quality of the trained ASR models (an N-gram language model is trained separately on a large text corpus and used during inference).
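The ranking metrics can be computed directly from per-utterance speaker rankings; a small self-contained sketch with made-up rankings:

```python
def topk_accuracy(rankings, truths, k):
    """Fraction of utterances whose true speaker is within the top-k of its ranking."""
    return sum(1 for r, t in zip(rankings, truths) if t in r[:k]) / len(truths)

def mean_reciprocal_rank(rankings, truths):
    """Average over utterances of 1 / (1-based rank of the true speaker)."""
    return sum(1.0 / (r.index(t) + 1) for r, t in zip(rankings, truths)) / len(truths)

# Illustrative rankings for three utterances over three candidate speakers.
rankings = [["s1", "s2", "s3"], ["s2", "s3", "s1"], ["s3", "s1", "s2"]]
truths = ["s1", "s3", "s2"]
```

Here the true speakers sit at ranks 1, 2, and 3 respectively, so MRR is (1 + 1/2 + 1/3)/3.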
4.1 Empirical Results
Now, we present the results of using our method from Section 3 to reveal speaker identity from 600 individual gradients, each computed using a unique utterance from our sampled set. In Figure 2, we plot the results by audio length in intervals of 0.5s from 1-4s. First, we show the average MAE of the reconstructed MFCCs, where we observe that the average MAE increases monotonically with the audio length. Note that the dimensionality of the optimization problem increases linearly with the audio length. Next, we plot the top-1 and top-5 accuracy of the speaker identity from the reconstructed MFCCs. Notice that even for the longest 3.5-4s utterances under consideration, the top-1 accuracy is 24%, and the top-5 accuracy is 37%. For comparison, we also plot the performance of our speaker identification model on the original utterances. For shorter utterances, the top-1 accuracy from the reconstructed MFCCs is almost identical to that from the original utterances.
Table 1 shows the overall values of the average MAE, top-1 accuracy, top-5 accuracy, and MRR of speaker identification from the original and reconstructed speech features. We see that while speaker identification from the original utterances results in 42% top-1 (57% top-5) accuracy, the same from the reconstructed features is 34% top-1 (51% top-5), providing 81% (89.5%) relative performance.
                MAE   Top-1  Top-5  MRR
Original        0.00  42.0   57.0   0.554
Reconstructed   0.25  34.0   51.0   0.419
4.2 Training with DP-SGD
Now, we study the effect of training with the popular technique of Differentially Private Stochastic Gradient Descent (DP-SGD) on the success of our method. At a high level, each gradient gets clipped to a fixed norm bound $C$, and zero-mean Gaussian noise of standard deviation $\sigma$ is added to provide a (local) DP guarantee for each sample. Due to space constraints, we defer the formal definition of Differential Privacy (DP) (Dwork et al., 2006a;b), and a pseudocode of DP-SGD, to Appendix A.3. Using only clipping has been shown in prior works (Carlini et al., 2019; Thakkar et al., 2020) to be effective in mitigating unintended memorization in language models. However, the optimization in our method (Equation 4) uses the cosine distance as the loss, thus rendering clipping alone ineffective. Since using DP-SGD for training large models has been shown (Abadi et al., 2016; McMahan et al., 2018; Thakkar et al., 2020) to affect model utility, our first objective is to find the least $\sigma$ s.t. the top-1 accuracy of speaker identification is near 0%. For our experiments, we set the clip norm $C$ based on observing that gradients had norm at least 100. (Due to the cosine loss used in our optimization, as long as a gradient gets clipped, the value of $C$ will not have any effect on the success of the method, or on the DP guarantee via DP-SGD.) We provide the evaluation metrics for increasing noise levels in Figure 2, and observe that the largest noise level is effective in reducing the top-1 accuracy of speaker identification to below 1%.

                     MAE   Top-1  Top-5  MRR    WER (clean)  WER (other)
Baseline             0.25  34.0   51.0   0.419  10.5         28.4
DP-SGD, smallest σ   0.34  21.5   34.8   0.284  14.9         37.6
DP-SGD, middle σ     0.54  2.3    5.8    0.049  15.4         39.4
DP-SGD, largest σ    0.63  0.5    1.7    0.021  19.6         45.3
We also provide the overall evaluation metrics when using DP-SGD in Table 2. We see that for the largest noise level, the top-1 accuracy of speaker identification is 0.5%. Further, we also provide the WER of models trained using DP-SGD (Google, 2019) with a batch size of 16. We see that the WER of models trained via DP-SGD, even for the smallest noise level, is significantly increased compared to our baseline training. Note that for the levels of noise presented here, the bounds for (local/central) DP will be near-vacuous. However, improving the privacy-utility trade-offs of DP-SGD is beyond the scope of this work.
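The two DP-SGD steps discussed above, per-sample clipping followed by Gaussian noise, can be sketched as follows; this is a hypothetical helper for illustration, not the production DP-SGD implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dpsgd_privatize(per_sample_grads, clip_norm, noise_std):
    """Clip each per-sample gradient to L2 norm `clip_norm`, average,
    and add zero-mean Gaussian noise (the two DP-SGD steps)."""
    clipped = [g * min(1.0, clip_norm / np.linalg.norm(g)) for g in per_sample_grads]
    avg = np.mean(clipped, axis=0)
    return avg + rng.normal(scale=noise_std, size=avg.shape)
```

Note that clipping rescales a gradient without changing its direction, which is why clipping alone does not defeat a cosine-distance matching objective; only the added noise perturbs the direction.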
4.3 Training with Dropout
Dropout (Srivastava et al., 2014) has been adopted in training deep neural networks as an efficient way to prevent overfitting to the training data. The key idea of unit dropout is to randomly drop model units during training. While prior work (Wei et al., 2020) has mentioned dropout in the context of information leakage from gradients, it does not provide any empirical evidence of the effect of training with dropout on such leakage.
The dropout mask is deducible from gradients if dropping a unit completely disables a part of the network (e.g., a feed-forward neural network), or if dropout is applied directly to the weights (Wan et al., 2013). When parameters are shared in the network, e.g., in a fully-connected layer operating frame-wise on a sequence of speech features, each part of the output typically uses an i.i.d. random dropout mask, making it difficult to infer the dropout masks from a gradient.

                 MAE   Top-1  Top-5  MRR    WER (clean)  WER (other)
Baseline         0.25  34.0   51.0   0.419  10.5         28.4
Dropout (r1)     0.59  0.8    2.0    0.019  11.9         28.2
Dropout (r2)     0.72  0.0    0.5    0.006  9.2          25.6
Dropout (r3)     0.81  0.1    0.3    0.005  9.5          27.1
Table 3 shows the reconstruction quality and model error rates for different dropout rates. Even for the lowest dropout rate of 0.1, we see that the top-1 accuracy of speaker identification drops to near 0%. At the same time, we observe that for models trained with dropout, the WER is comparable to (or sometimes even lower than) the baseline training. We defer the plots of the results grouped by audio length to Appendix B.
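The i.i.d. per-frame masks discussed in this section can be sketched as follows; the inverted-dropout scaling is a standard convention, and the shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def framewise_unit_dropout(hidden, rate):
    """Apply an independent unit-dropout mask to each frame of a [T, H]
    hidden sequence (inverted dropout, as applied at training time)."""
    mask = (rng.random(hidden.shape) >= rate).astype(hidden.dtype)
    return hidden * mask / (1.0 - rate), mask

hidden = np.ones((50, 32))                      # 50 frames, 32 hidden units
dropped, mask = framewise_unit_dropout(hidden, rate=0.5)
```

Because every frame draws a fresh mask, the weight gradient of a frame-wise shared layer aggregates contributions under many different masks, which is what makes the masks hard to infer from a single gradient.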
Visualizing Reconstructed Features
In Figure 3, we provide two examples of spectrograms from the reconstruction of a short and a long utterance. For the long utterance, even though the MAE of the reconstruction is high and the speaker identification system fails to identify the speaker, the reconstructed audio pattern is visibly similar to the original. For comparison, we also provide spectrograms from reconstructions of the same utterances under DP-SGD training, and under a dropout rate of 0.1.
5 Additional Experiments
The experiments in Section 4 focused on revealing speaker identity using our method on a single gradient from a single utterance. In distributed settings like FL, model training is performed under more complex settings. In this section, we conduct experiments to evaluate the success of our method on two natural extensions of the setting in Section 4: 1) gradients from a batch of utterances are averaged before being shared, and 2) multiple update steps are performed using a single utterance, and the final model update is shared. We demonstrate that in both of the settings above, our method can reveal speaker identity with non-trivial accuracy. Further, we show that using dropout for training reduces the limited success of the method in both settings. All the experiments in this section are conducted using the 200 utterances of audio length 1-2s (from the 600 utterances sampled for the experiments in Section 4).
5.1 Averaged Gradients from Batches
In this section, we study the performance of our method in revealing speaker identities from an averaged gradient computed using a batch of utterances. In the reconstruction phase, our objective function (4) does not change; however, we instead try to reconstruct the whole batch of inputs $x'_1, \ldots, x'_B$, where $x'_i \in \mathbb{R}^{T_i \times d}$ for $i \in \{1, \ldots, B\}$. Here, $B$ is the number of samples in the batch, and $T_i$ is the length of input $x_i$. For computational efficiency, we update only a single sample per iteration of our optimization. We provide a pseudocode for the variant of Algorithm 1 adapted to this setting in Appendix C.1.
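A toy version of the batch setting, again on a linear model rather than an ASR model: the averaged gradient is matched under the cosine distance of Eq. (4), and the distance is zero exactly at the true batch. Model, dimensions, and batch size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # model parameters, known to the observer

def grad_wrt_W(x, y):
    """Per-sample gradient of ||Wx - y||^2 with respect to W."""
    return 2.0 * np.outer(W @ x - y, x)

batch_x = rng.normal(size=(4, 3))   # private batch of inputs
batch_y = rng.normal(size=(4, 2))   # labels (assumed known)
g_avg = np.mean([grad_wrt_W(x, y) for x, y in zip(batch_x, batch_y)], axis=0)

def batch_matching_distance(dummy_x):
    """Cosine distance between the averaged dummy gradient and the observed
    averaged gradient; the batch variant optimizes all dummies jointly."""
    g = np.mean([grad_wrt_W(x, y) for x, y in zip(dummy_x, batch_y)], axis=0)
    return 1.0 - float(np.sum(g * g_avg)) / (np.linalg.norm(g) * np.linalg.norm(g_avg))
```

Only the average is observed, so the optimizer must disentangle the per-sample contributions, which is why accuracy degrades as the batch size grows.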
We conduct our experiments for batch sizes 2, 4, and 8. For each batch size, the 200 utterances are sorted by audio length and grouped into batches. We provide the results in Table 4, comparing them with the results (batch size 1) on the same 200 utterances from Section 4.1. We see that while the speaker identification accuracy decreases with increasing batch size, the top-1 accuracy is still as high as 19% for batch size 4. An experiment on the effect of training with a dropout rate of 0.1 shows that reconstruction of batch size 2 from dropped-out gradients reduces the accuracy to 1% top-1 (4% top-5), compared to 2% top-1 (4% top-5) on the same set of utterances in Section 4.3.
              MAE   Top-1  Top-5  MRR
Original      0.00  42.0   57.0   0.490
Batch size 1  0.14  40.0   55.0   0.470
Batch size 2  0.21  37.0   54.0   0.451
Batch size 4  0.37  19.0   31.0   0.249
Batch size 8  0.48  5.0    11.0   0.084
5.2 MultiStep Updates from a Sample
Now, we study the success of our method in revealing speaker identities from an update comprising multiple update steps using a single utterance. We conduct our experiments for 2-step and 8-step updates with a fixed learning rate. For computational efficiency, we reduce the number of unit vectors sampled to 8 (as opposed to 128 in the experiments in Sections 4 and 5.1) in each iteration of our zeroth-order optimization.
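A toy version of the multi-step setting: the observer matches the total model delta produced by several local SGD steps rather than a single gradient. The model, loss, and learning rate below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
w0 = rng.normal(size=3)      # initial model weights sent to the client
x_true = rng.normal(size=3)  # private training input
y = 1.0                      # label (assumed known)
lr = 0.1                     # illustrative client learning rate

def local_update(x, steps):
    """Run `steps` SGD steps on the loss (w.x - y)^2 starting from w0,
    and return the resulting model update (delta)."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * 2.0 * (w @ x - y) * x
    return w - w0

delta_observed = local_update(x_true, steps=2)  # shared 2-step update

def update_matching_distance(x):
    """Cosine distance between the simulated multi-step update for a dummy
    input x and the observed update."""
    d = local_update(x, steps=2)
    return 1.0 - float(d @ delta_observed) / (np.linalg.norm(d) * np.linalg.norm(delta_observed))
```

Each candidate evaluation now replays the whole local training run, so reconstruction cost grows with the number of steps, consistent with the increased time/computation noted below.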
Table 5 shows the results of our experiment, comparing them with the same (1-step) results from Section 4.1. Since the optimization for multi-step reconstruction is different, the results are not directly comparable to those of the single-step setting. We see that even though the time/computation taken for reconstruction may increase with the number of steps, the success of our method in revealing speaker identity is still as high as 24% top-1 accuracy for 8-step updates. Using dropout in training is still effective: a dropout rate of 0.1 reduces the accuracy to 2% top-1 (3.5% top-5).
          MAE   Top-1  Top-5  MRR
Original  0.00  42.0   57.0   0.490
1-step    0.14  40.0   55.0   0.470
2-step    0.33  26.5   39.5   0.333
8-step    0.33  24.5   39.0   0.321

6 Related Work
While we provide background (in Section 2) on the DLG method (Zhu et al., 2019) and a comparison with our method (in Section 3.3), there have been follow-up works (Geiping et al., 2020; Wei et al., 2020; Zhao et al., 2020) showing high-fidelity image and label reconstruction from gradients under different settings. Revealing information about training data from gradients has also been shown via membership and property leakage (Shokri et al., 2017; Song & Shmatikov, 2019; Melis et al., 2019). There is a growing line of work on revealing information from trained models. For instance, Fredrikson et al. (2015) demonstrate vulnerabilities to model inversion attacks. Other works (Carlini et al., 2019; Thakkar et al., 2020) quantify the unintended memorization in trained models, and study the effect of DP-SGD in mitigating such memorization.
Regarding the use of standard training techniques to reduce information leakage from model training: while gradient compression and sparsification have been claimed (Zhu et al., 2019) to provide protection, it has been shown (Wei et al., 2020) that reconstruction attacks can succeed with non-trivial accuracy in spite of gradient compression. There also exist strategies that require changes to the model inputs or architecture for protection, e.g., TextHide (Huang et al., 2020a) and InstaHide (Huang et al., 2020b). For real-world deployments of distributed training, there also exist protocols like Secure Aggregation (Bonawitz et al., 2017) which make it difficult for any adversary to access raw individual gradients.
Acknowledgements
The authors would like to thank Nicholas Carlini, Andrew Hard, Ronny Huang, Khe Chai Sim, and our colleagues in Google Research for their helpful support of this work, and comments towards improving the paper.
References

Abadi et al. (2015)
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado,
G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp,
A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M.,
Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C.,
Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.,
Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P.,
Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
URL http://tensorflow.org/. Software available from tensorflow.org.
 Abadi et al. (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318, 2016.
 Bassily et al. (2014) Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proc. of the 2014 IEEE 55th Annual Symp. on Foundations of Computer Science (FOCS), pp. 464–473, 2014.
 Battiti (1992) Battiti, R. First- and second-order methods for learning: between steepest descent and Newton's method. Neural Computation, 4(2):141–166, 1992.
 Bonawitz et al. (2017) Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-preserving machine learning. In Proc. of the 2017 ACM Conf. on Computer and Communications Security (CCS), pp. 1175–1191, 2017.
 Bonawitz et al. (2019) Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V., Kiddon, C., Konečnỳ, J., Mazzocchi, S., McMahan, H. B., et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
 Carlini & Wagner (2018) Carlini, N. and Wagner, D. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. IEEE, 2018.
 Carlini et al. (2019) Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., and Song, D. The secret sharer: Evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284. USENIX Association, August 2019. ISBN 9781939133069.
 Chan et al. (2016) Chan, W., Jaitly, N., Le, Q., and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964, 2016. doi: 10.1109/ICASSP.2016.7472621.
 Chorowski et al. (2014) Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. End-to-end continuous speech recognition using attention-based recurrent NN: First results. In NIPS 2014 Workshop on Deep Learning, December 2014, 2014.
 Chorowski et al. (2015) Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, volume 28, pp. 577–585, 2015.
 Chung et al. (2018) Chung, J. S., Nagrani, A., and Zisserman, A. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.
 Dimitriadis et al. (2020) Dimitriadis, D., Kumatani, K., Gmyr, R., Gaur, Y., and Eskimez, S. E. A federated approach in training acoustic models. In Proc. Interspeech, 2020.
 Dwork et al. (2006a) Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology—EUROCRYPT, pp. 486–503, 2006a.
 Dwork et al. (2006b) Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proc. of the Third Conf. on Theory of Cryptography (TCC), pp. 265–284, 2006b.
 Flaxman et al. (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '05, pp. 385–394, USA, 2005. Society for Industrial and Applied Mathematics. ISBN 0898715857.
 Fredrikson et al. (2015) Fredrikson, M., Jha, S., and Ristenpart, T. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333, 2015.
 Geiping et al. (2020) Geiping, J., Bauermeister, H., Dröge, H., and Moeller, M. Inverting gradients–how easy is it to break privacy in federated learning? arXiv preprint arXiv:2003.14053, 2020.
 Golovin et al. (2019) Golovin, D., Karro, J., Kochanski, G., Lee, C., Song, X., et al. Gradientless descent: High-dimensional zeroth-order optimization. arXiv preprint arXiv:1911.06317, 2019.
 Google (2019) Google. TensorFlow Privacy. https://github.com/tensorflow/privacy, 2019.
 Graves (2012) Graves, A. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

Graves et al. (2006)
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks.
In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, 2006.
 Guliani et al. (2020) Guliani, D., Beaufays, F., and Motta, G. Training speech recognition models with federated learning: A quality/cost framework. arXiv preprint arXiv:2010.15965, 2020.
 Hannun et al. (2014) Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
 Hard et al. (2018) Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
 Hard et al. (2020) Hard, A., Partridge, K., Nguyen, C., Subrahmanya, N., Shah, A., Zhu, P., Moreno, I. L., and Mathews, R. Training keyword spotting models on non-IID data with federated learning. arXiv preprint arXiv:2005.10406, 2020.
 Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 Huang et al. (2020a) Huang, Y., Song, Z., Chen, D., Li, K., and Arora, S. TextHide: Tackling data privacy in language understanding tasks. arXiv preprint arXiv:2010.06053, 2020a.
 Huang et al. (2020b) Huang, Y., Song, Z., Li, K., and Arora, S. InstaHide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning, pp. 4507–4518. PMLR, 2020b.
 Leroy et al. (2019) Leroy, D., Coucke, A., Lavril, T., Gisselbrecht, T., and Dureau, J. Federated learning for keyword spotting. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6341–6345. IEEE, 2019.
 Li et al. (2017) Li, C., Ma, X., Jiang, B., Li, X., Zhang, X., Liu, X., Cao, Y., Kannan, A., and Zhu, Z. Deep Speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304, 650, 2017.
 Mania et al. (2018) Mania, H., Guy, A., and Recht, B. Simple random search of static linear policies is competitive for reinforcement learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31, pp. 1800–1809. Curran Associates, Inc., 2018.
 McMahan & Ramage (2017) McMahan, B. and Ramage, D. Federated learning: Collaborative machine learning without centralized training data. Google Research Blog, 3, 2017.
 McMahan et al. (2018) McMahan, B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. In International Conference on Learning Representations (ICLR), 2018.
 Melis et al. (2019) Melis, L., Song, C., De Cristofaro, E., and Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 691–706. IEEE, 2019.
 Panayotov et al. (2015) Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015. URL http://www.openslr.org/12.
 Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
 Shokri et al. (2017) Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. IEEE, 2017.
 Snyder et al. (2017) Snyder, D., Garcia-Romero, D., Povey, D., and Khudanpur, S. Deep neural network embeddings for text-independent speaker verification. In Interspeech, pp. 999–1003, 2017.
 Snyder et al. (2018) Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. IEEE, 2018.
 Song & Shmatikov (2019) Song, C. and Shmatikov, V. Auditing data provenance in text-generation models. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 196–206, 2019.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 Thakkar et al. (2020) Thakkar, O., Ramaswamy, S., Mathews, R., and Beaufays, F. Understanding unintended memorization in federated learning. arXiv preprint arXiv:2006.07490, 2020.
 Wan et al. (2013) Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. PMLR, 2013.
 Wei et al. (2020) Wei, W., Liu, L., Loper, M., Chow, K.H., Gursoy, M. E., Truex, S., and Wu, Y. A framework for evaluating gradient leakage attacks in federated learning. arXiv preprint arXiv:2004.10397, 2020.
 Xiong et al. (2017) Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., Yu, D., and Zweig, G. Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2410–2423, 2017. doi: 10.1109/TASLP.2017.2756440.
 Zhang & Koishida (2017) Zhang, C. and Koishida, K. End-to-end text-independent speaker verification with triplet loss on short utterances. In Interspeech, pp. 1487–1491, 2017.
 Zhao et al. (2020) Zhao, B., Mopuri, K. R., and Bilen, H. iDLG: Improved deep leakage from gradients. arXiv preprint arXiv:2001.02610, 2020.
 Zhu et al. (2019) Zhu, L., Liu, Z., and Han, S. Deep leakage from gradients. In Advances in Neural Information Processing Systems, pp. 14774–14784, 2019.
Appendix A Background
A.1 DeepSpeech
We present details of the DeepSpeech (Hannun et al., 2014) model. The model consists of three feedforward layers, followed by a single bidirectional LSTM layer, and two feedforward layers to produce softmax probabilities for the CTC loss. The layers and the number of parameters at each layer are shown in Table 6. Note that we only match gradients on the last layer, which has only 0.1M parameters.
Layer  Type  No. parameters 

1  feedforward  1.0m 
2  feedforward  4.2m 
3  feedforward  4.2m 
4  bidirectional lstm  33.6m 
5  feedforward  4.2m 
6  feedforward  0.1m 
total  47.3m 
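As a rough illustration of last-layer matching, the sketch below (NumPy, with hypothetical shapes, and per-frame cross-entropy standing in for the CTC loss actually used by DeepSpeech) scores a candidate input by the squared distance between the final-layer gradient it induces and the observed gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def last_layer_grad(h, y, W):
    """Cross-entropy gradient w.r.t. the final-layer weights W,
    given hidden activations h (T x d) and per-frame labels y (T,)."""
    p = softmax(h @ W)              # per-frame class probabilities
    p[np.arange(len(y)), y] -= 1.0  # dL/dlogits for cross-entropy
    return h.T @ p / len(y)

def matching_loss(g_candidate, g_observed):
    """Squared distance between candidate and observed last-layer gradients."""
    return float(np.sum((g_candidate - g_observed) ** 2))
```

Restricting the match to the 0.1M-parameter output layer keeps each evaluation of this distance cheap relative to matching all 47.3M parameters.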
A.2 Deep Speaker
The Deep Speaker (Li et al., 2017) model adopts a deep residual CNN (ResCNN) architecture to extract acoustic features from utterances. These per-frame features are averaged to produce utterance-level speaker embeddings. The ResCNN consists of four stacked residual blocks (ResBlocks) with a stride of 2. The numbers of CNN filters are 64, 128, 256, and 512, respectively. The total number of parameters is 24M.
Deep Speaker is trained with triplet loss, which takes three samples as input: an anchor $a_i$, a positive sample $p_i$ (from the same speaker), and a negative sample $n_i$ (from another speaker). The loss over $N$ samplings is defined as
$$L = \sum_{i=1}^{N} \left[ s_i^{an} - s_i^{ap} + \alpha \right]^{+},$$
where $s_i^{an} = \cos(a_i, n_i)$ is the cosine similarity between the anchor $a_i$ and the negative sample $n_i$, $s_i^{ap} = \cos(a_i, p_i)$ is the cosine similarity between the anchor $a_i$ and the positive sample $p_i$, from the $i$-th sampling, and $\alpha$ is the minimum margin between these cosine similarities, which is set to 0.1.
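A minimal sketch of this objective (NumPy, with made-up embedding vectors) could look like:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchors, positives, negatives, alpha=0.1):
    """Sum over samplings of the hinge [cos(a, n) - cos(a, p) + alpha]^+ :
    zero once each positive is at least `alpha` closer to its anchor
    (in cosine similarity) than the corresponding negative."""
    return sum(max(cos(a, n) - cos(a, p) + alpha, 0.0)
               for a, p, n in zip(anchors, positives, negatives))
```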
A.3 DP-SGD
For completeness, we start by providing a definition of the notion of Differential Privacy (Dwork et al., 2006b, a). We will refer to a pair of datasets $D, D'$ as neighbors if $D'$ can be obtained by the addition or removal of one sample from $D$.
Definition A.1 (Differential privacy (Dwork et al., 2006b, a))
A randomized algorithm $M$ is $(\epsilon, \delta)$-differentially private if, for any pair of neighboring datasets $D$ and $D'$, and for all events $S$ in the output range of $M$, we have
$$\Pr[M(D) \in S] \le e^{\epsilon} \cdot \Pr[M(D') \in S] + \delta,$$
where the probability is taken over the random coins of $M$.
Now, we provide pseudocode for DP-SGD (Abadi et al., 2016).
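As a hedged sketch of the clip-and-noise aggregation at the heart of DP-SGD (NumPy; function and parameter names are illustrative, not those of the pseudocode):

```python
import numpy as np

def dpsgd_noisy_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD aggregation step: clip each per-example gradient to
    L2 norm `clip_norm`, average, then add Gaussian noise calibrated to
    the clipping bound (the per-example sensitivity)."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0,
                       noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean.shape)
    return mean + noise
```

The clipping bounds each example's influence on the update; the noise standard deviation then scales with `noise_multiplier * clip_norm` to provide the differential privacy guarantee.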
Appendix B Additional Experiments and Omitted Details
B.1 Reconstruction without Assumptions
We set up two experiments to explore the necessity of the two assumptions of known input length and transcript for our reconstruction method.
Reconstruction without Knowledge of Input Length
The first assumption for our reconstruction method (Section 3.1) is that the length of the input speech features is known; this is required to set up the search space for the optimization problem (4). We show that reconstruction is still possible without the exact input length. In the experiments below, 20 random utterances are chosen from the 1–2s bucket whose speakers are correctly identified in the top-5 in Section 4.1. The average length of these utterances is 74.35 frames (approximately 1.5s). Table 7 shows reconstruction results when estimated lengths differ from the original lengths by small fixed offsets, and when they are double/half of the original lengths. It can be seen that the speaker identity can still be revealed given a good estimate of the input length. For the same amount of absolute deviation in the estimate, we see that over-estimation provides better results than under-estimation.
Length  Loss ()  Top1  Top5  MRR 
Original  0.04  90  100  0.748 
0.06  60  90  0.706  
0.05  55  95  0.714  
0.20  50  80  0.632  
0.31  45  70  0.580  
0.44  35  55  0.442  
1.43  20  40  0.301  
2.53  0  10  0.048  
145.15  0  0  0.003 
Reconstruction without Contents of the Transcript
Next, we conduct experiments in which our method knows only the length of the transcript, not its contents. For each utterance in the set of 20 utterances from Section B.1, we generate 4 random transcripts and use them to reconstruct speech features. It can be seen from Table 8 that reconstruction is consistently of poor quality (high loss) with a random transcript, suggesting that knowledge of the transcript is important. The poor quality of features reconstructed from an incorrect transcript also suggests that if the attacker has a list of candidates for the transcript (e.g., common phrases, song names, etc.) that includes the original one, a brute-force approach that picks the candidate with the lowest loss can reveal the actual transcript with high confidence.
Transcript  Loss ()  MAE  Top1  Top5  MRR 

Original  0.04  0.12  90  100  
Random 1  79.5  0.78  0  0  0.010 
Random 2  135.5  0.74  0  0  0.013 
Random 3  108.7  0.77  0  0  0.006 
Random 4  101.5  0.78  0  0  0.015 
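The brute-force strategy discussed above is straightforward to sketch; here `reconstruction_loss` is a hypothetical stand-in for a full run of our reconstruction on each candidate, and the mock losses mirror the pattern in Table 8:

```python
def pick_transcript(candidates, reconstruction_loss):
    """Keep the candidate transcript whose reconstruction best matches
    the observed gradient (i.e., has the lowest final matching loss)."""
    return min(candidates, key=reconstruction_loss)

# Mock losses: the true transcript reconstructs with a far lower loss
# than random transcripts of the same length.
mock_losses = {"the original phrase": 0.04,
               "a random transcript": 79.5,
               "another random one": 135.5}
best = pick_transcript(mock_losses, mock_losses.get)
```

The cost of this search grows linearly with the size of the candidate list, since each candidate requires its own reconstruction run.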
B.2 Reconstruction from Dropped-Out Gradients
Figure 4 shows the results of the experiment in Section 4.3, grouped by audio length. Even a dropout rate of 0.1 effectively eliminates the risk of speaker identity leakage.
We also vary the dropout rate and perform reconstruction on a small population of 20 utterances (the first 20 utterances when sorted by length). The results are presented in Figure 5. The speaker identification accuracy drops sharply as the dropout rate increases.
Appendix C Algorithms
We present adapted versions of HFGM for reconstructing from averaged gradients and from multi-step updates.
C.1 HFGM on Averaged Gradients from Batches
In Algorithm 1, a dummy input is randomly initialized at the beginning and fed to the model to compute the loss and gradients at every iteration of the optimization process. When reconstructing a batch from averaged gradients, a dummy batch needs to be optimized. To save computation time, we update only a single sample at each iteration, reusing the loss and gradients of the other samples in the batch to obtain the overall loss and gradients. A variant of Algorithm 1 adapted for this setting is presented as Algorithm 3.
C.2 HFGM on Multi-Step Updates from a Sample
A challenge when applying Algorithm 1 to this setting is that the model parameters change after each local step. Therefore, model updates for the sampled unit vectors cannot be computed in a single batch, but need to be computed separately, and the model needs to be reset to its original parameters before each computation. Algorithm 4 provides a modified version of Algorithm 1 to reconstruct an input from multi-step updates. For efficiency, if multiple vectors are sampled at each iteration, separate versions of the model are stored in the computation graph and processed in parallel.
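In isolation, the unit-vector sampling underlying HFGM can be sketched as a generic zeroth-order update on a gradient-matching loss; the toy `grad_fn` and the hyperparameters below are assumptions for illustration, not the full Algorithm 1 or 4:

```python
import numpy as np

def gm_loss(x, observed_grad, grad_fn):
    """Squared distance between the gradient induced by candidate input x
    and the observed gradient."""
    return float(np.sum((grad_fn(x) - observed_grad) ** 2))

def hfgm_step(x, observed_grad, grad_fn, rng, n_dirs=8, delta=1e-3, lr=0.05):
    """Estimate a descent direction from finite differences of the matching
    loss along random unit vectors, avoiding any second-order derivatives
    (Hessian) of the training loss."""
    base = gm_loss(x, observed_grad, grad_fn)
    direction = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)
        slope = (gm_loss(x + delta * u, observed_grad, grad_fn) - base) / delta
        direction += slope * u
    return x - lr * direction / n_dirs
```

Because each probe only requires a forward evaluation of the matching loss at a perturbed input, this avoids differentiating through the gradient itself, which is what would introduce the Hessian.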
Appendix D Additional Visualizations
Figure 6 shows the spectrograms of some utterances reconstructed in Section 4.1, along with results when reconstructing from a gradient computed with DP-SGD and with Dropout.
We also plot per-frame MAEs at different stages of the optimization process in Figure 7. For long utterances, reconstructions are often of poor quality, with frames in the middle reconstructed especially poorly. This suggests that errors in earlier frames may have affected reconstruction of the middle part, due to the sequential dependencies modeled by the LSTM.