A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It

04/15/2021 ∙ by Trung Dang, et al. ∙ 0

End-to-end Automatic Speech Recognition (ASR) models are commonly trained over spoken utterances using optimization methods like Stochastic Gradient Descent (SGD). In distributed settings like Federated Learning, model training requires transmission of gradients over a network. In this work, we design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient. We propose Hessian-Free Gradients Matching, an input reconstruction technique that operates without second derivatives of the loss function (required in prior works), which can be expensive to compute. We show the effectiveness of our method using the DeepSpeech model architecture, demonstrating that it is possible to reveal the speaker's identity with 34% top-1 accuracy (51% top-5 accuracy). We also study the effect of two well-known techniques, Differentially Private SGD and Dropout, on the success of our method. We show that a dropout rate of 0.2 can reduce the speaker identity accuracy to 0% top-1 (0.5% top-5).




1 Introduction

End-to-end automatic speech recognition (ASR), which directly transcribes speech to text without predefined alignments, has increasingly been gaining popularity over conventional pipeline frameworks. State-of-the-art ASR models have achieved human parity in conversational speech recognition (Xiong et al., 2017). Training such models often requires a large amount of user-spoken utterances. In the speech domain, training data includes audio and transcripts of utterances, which can directly expose sensitive information, or make it possible to leak attributes such as gender, dialect, or identity of the speaker.

In distributed frameworks such as Federated Learning (FL) (McMahan & Ramage, 2017), model training is performed via mobile devices with transmission of gradients over the network, allowing training over large populations (Bonawitz et al., 2019) while ensuring such data remains on-device. Many works have shown the competitive performance of FL-trained models on sequential modeling tasks like keyboard prediction (Hard et al., 2018) and keyword spotting (Leroy et al., 2019; Hard et al., 2020), as well as in speech recognition (Dimitriadis et al., 2020; Guliani et al., 2020).

A recent line of work (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020) has demonstrated that the gradients used in model training can leak information about the training data. At a high level, these works aim to reconstruct training samples by designing optimization methods that construct objects whose gradient matches the observed gradient. For instance, a number of existing methods have been shown to successfully reconstruct images used for training image classification models. As we discuss later (in Section 3.3), there are fundamental challenges, like variable-sized inputs/outputs, that render such methods inapplicable in the speech domain.

In this work, we study information leakage from gradients in ASR model training. In particular, we design a method to reveal the speaker identity of a training utterance from a model gradient computed using the utterance. Given that ASR models can have training utterances and transcripts of arbitrary lengths, for computational efficiency and to avoid potential false positives, we assume that the transcript and the length of the training utterance are known. We start by designing Hessian-Free Gradients Matching (HFGM), a technique to reconstruct speech features used in computing gradients for training ASR models. Our HFGM technique eliminates the need for second-order derivatives (i.e., the Hessian) of the loss function, which were required in prior works (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020), and can be expensive to compute. Next, our method uses the reconstructed features and a speaker identification model to uniquely identify the speaker from a list of speakers by comparing speaker embeddings.

To our knowledge, this is the first method in the speech domain that can be used for revealing information about training samples from gradients. We demonstrate the efficacy of our method by conducting experiments using the LibriSpeech data set (Panayotov et al., 2015) on the DeepSpeech (Hannun et al., 2014) model architecture. We find that our method is successful in revealing the speaker identity with 34% top-1 accuracy (51% top-5 accuracy) among 2.5k speakers.

We also study the effect of two standard training techniques, namely, Differentially Private Stochastic Gradient Descent (DP-SGD) (Bassily et al., 2014; Abadi et al., 2016), and Dropout (Srivastava et al., 2014), on the success of our method. The technique of DP-SGD is the state-of-the-art in training deep neural networks with Differential Privacy (DP) (Dwork et al., 2006b, a) guarantees. Intuitively, DP prevents an adversary from confidently making any conclusions about whether any particular sample was used to train a model, even while having access to the model and arbitrary external side information. While we demonstrate that using DP-SGD can mitigate the success of our method, we find (in line with prior works (Abadi et al., 2016; McMahan et al., 2018; Thakkar et al., 2020)) that training large models using DP-SGD can significantly affect model utility.

The well-known technique of Dropout (Hinton et al., 2012; Srivastava et al., 2014), which randomly drops hidden units of a neural network during training, is commonly employed to avoid overfitting in large models. We show that using dropout can reduce the speaker identity accuracy of our method to 0% top-1 (0.5% top-5), without compromising utility of the trained model.

We make the following contributions:

  1. We design the first method to reveal speaker identity of an utterance in ASR model training, with access to only a gradient computed using the utterance. We achieve this via Hessian-Free Gradients Matching, an input reconstruction technique that operates without needing second derivatives of the loss function.

  2. We empirically demonstrate the effectiveness of our method, using the DeepSpeech model architecture, in revealing speaker identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech data set. To spur further research, we provide an open-source implementation of our experimental framework (https://github.com/googleinterns/deepspeech-reconstruction).

  3. We study the effect of two standard training techniques – DP-SGD and Dropout – on the success of our method. We empirically demonstrate that using dropout can reduce the success of our method to 0% top-1 (0.5% top-5) accuracy.

We conclude by exploring the effectiveness of our method in two complex regimes, where instead of access to individual gradients, the method can access 1) only the average gradient from a mini-batch of samples, and 2) the update comprising multiple gradient descent steps using a training sample. We demonstrate that in both of the above settings, our method reveals speaker identity with non-trivial accuracy, whereas training with dropout is effective in reducing its success.

2 Background

In this section, we start by revisiting the idea of Gradients Matching (GM) (Zhu et al., 2019), which has been successfully applied to reconstruct images from gradients of an image recognition model. We then introduce end-to-end ASR models and a popular architecture using the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006). We also provide some background on zeroth-order optimization, specifically the direct search algorithm, and speaker identification models with triplet loss, which form critical components of our proposed method.

Gradients Matching & Deep Leakage from Gradients (DLG) Algorithm

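DLG was introduced by Zhu et al. (2019) as a method to reconstruct an input $x$ and output $y$ given a model gradient $\nabla_\theta \mathcal{L}(x, y)$, where $\mathcal{L}$ denotes the loss function and $\theta$ denotes the model parameters (when it is clear from the context that the gradient is w.r.t. model parameters $\theta$, we just denote it by $\nabla \mathcal{L}(x, y)$). The algorithm attempts to find an input-output pair $(x^*, y^*)$ such that $\nabla \mathcal{L}(x^*, y^*)$ matches $\nabla \mathcal{L}(x, y)$. The general idea is also referred to as Gradients Matching (GM). A dummy input $x^*$ and a dummy label $y^*$ are fed into the model to get dummy gradients $\nabla \mathcal{L}(x^*, y^*)$. Reconstructed objects are obtained by minimizing the Euclidean distance between the dummy gradients and the client update:

$$\arg\min_{x^*, y^*} \; \left\| \nabla \mathcal{L}(x^*, y^*) - \nabla \mathcal{L}(x, y) \right\|^2 \qquad (1)$$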

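Geiping et al. (2020) provide an extension of DLG that works with larger images and trained models. They adopt the cosine similarity (also shown in (4)) to optimize the gradients distance, which only matches the direction of gradients. A regularization term is also added. They assume that the training labels are known and formulate the reconstruction as finding $x^*$ to minimize

$$\arg\min_{x^*} \; 1 - \frac{\langle \nabla \mathcal{L}(x^*, y), \nabla \mathcal{L}(x, y) \rangle}{\| \nabla \mathcal{L}(x^*, y) \| \, \| \nabla \mathcal{L}(x, y) \|} + \alpha \, \mathrm{TV}(x^*) \qquad (2)$$

where $\mathrm{TV}$ and $\alpha$ are a regularization term (total variation) and its weight in the loss, respectively.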

Both first-order and second-order optimization methods can be used to optimize (1) and (2). While first-order techniques, such as Stochastic Gradient Descent (SGD) and the Adam optimizer, are usually faster to compute, second-order methods like L-BFGS can escape slow convergence paths better (Battiti, 1992). An observation about these objectives is the presence of the first-order derivatives $\nabla \mathcal{L}(x^*, y)$, which requires second-order differentiation of $\mathcal{L}$ to compute the gradients with regards to $x^*$. These second-order gradients can usually be derived with auto-differentiation commonly implemented in deep learning frameworks.
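As a minimal sketch of the two gradient-distance measures discussed above (the squared Euclidean distance used by DLG and the cosine distance adopted by Geiping et al.), the NumPy functions below operate on flattened gradient vectors. The function names and the `eps` stabilizer are illustrative assumptions and are not taken from any of the cited implementations.

```python
import numpy as np


def euclidean_gradient_distance(g_dummy, g_observed):
    """Squared Euclidean distance between two gradients (DLG-style objective)."""
    diff = np.ravel(np.asarray(g_dummy, dtype=float)) - np.ravel(
        np.asarray(g_observed, dtype=float)
    )
    return float(np.sum(diff ** 2))


def cosine_gradient_distance(g_dummy, g_observed, eps=1e-12):
    """1 - cosine similarity; depends only on the direction of the gradients."""
    a = np.ravel(np.asarray(g_dummy, dtype=float))
    b = np.ravel(np.asarray(g_observed, dtype=float))
    # eps (an assumption of this sketch) guards against division by zero.
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Because the cosine distance ignores scale, a gradient and any positive multiple of it are at distance (approximately) zero, whereas their Euclidean distance is not.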

End-to-end ASR with CTC loss

Recently, end-to-end models have achieved superior performance while being simpler than traditional pipeline models (Graves et al., 2006; Graves, 2012; Chorowski et al., 2014, 2015; Chan et al., 2016; Xiong et al., 2017). Two major lines of end-to-end ASR architectures are based on connectionist temporal classification (CTC) (Graves et al., 2006), and the attention-based encoder-decoder mechanism (Chorowski et al., 2014).

We focus on models with CTC loss in this work. CTC models are able to align network inputs $x$ of length $T$ with label sequences $y$ of length $L$ without predefined alignments. The set of labels $\mathcal{V}$ is extended with a blank label, denoted as “-”, to form $\mathcal{V}' = \mathcal{V} \cup \{-\}$. The label sequence can be mapped with a one-to-many mapping $\mathcal{B}^{-1}$ to CTC paths (Graves et al., 2006); for example, “aab” can be mapped to both “a-ab” and “aa-abb”.

A neural network $f_\theta$, usually including one or several bi-directional RNNs to model frame dependencies, is used to map the $d$-dimensional speech features $x \in \mathbb{R}^{T \times d}$ to per-frame posteriors $p_t(\cdot)$ over $\mathcal{V}'$ for $t = 1, \dots, T$. The probability of a CTC path $\pi = (\pi_1, \dots, \pi_T)$ is defined as $P(\pi \mid x) = \prod_{t=1}^{T} p_t(\pi_t)$. The likelihood of a label sequence follows as the sum over probabilities of all CTC paths:

$$P(y \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} P(\pi \mid x) \qquad (3)$$

The model is optimized by maximizing the likelihood of $y$, i.e. minimizing $-\log P(y \mid x)$. During inference, a greedy search or beam search is conducted to find a sequence $\hat{y}$ that maximizes $P(\hat{y} \mid x)$.

Note that the number of terms in (3) grows exponentially with the length of inputs. To optimize the loss, its value and first-order derivatives are computed analytically with a dynamic programming algorithm, namely the CTC forward-backward algorithm (Graves et al., 2006).
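To make the dynamic program concrete, the following sketch computes the CTC likelihood with the standard forward recursion and, for comparison, by brute-force enumeration of all paths. The function names, the choice of index 0 for the blank label, and the array layout (`probs` of shape frames by labels) are assumptions of this sketch rather than details of the DeepSpeech implementation.

```python
import itertools

import numpy as np


def ctc_likelihood(probs, label, blank=0):
    """Likelihood of `label` under per-frame posteriors `probs` (T x V),
    computed with the CTC forward recursion in O(T * len(label)) time."""
    T = probs.shape[0]
    ext = [blank]  # label interleaved with blanks: -, y1, -, y2, ..., -
    for c in label:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    total = alpha[T - 1, S - 1]
    if S > 1:
        total += alpha[T - 1, S - 2]
    return float(total)


def collapse(path, blank=0):
    """Map a CTC path to its label sequence: merge repeats, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out


def ctc_likelihood_bruteforce(probs, label, blank=0):
    """Same quantity by summing over all V**T paths (exponential in T)."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) == list(label):
            total += float(np.prod([probs[t, k] for t, k in enumerate(path)]))
    return total
```

The brute-force version is only feasible for tiny inputs, which is exactly why the forward-backward algorithm is needed in practice.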

Zeroth-order Optimization

Zeroth-order optimization is the process of minimizing an objective, given access only to the objective values at chosen inputs. The standard approach to zeroth-order optimization is to estimate the gradients (Flaxman et al., 2005). However, gradient estimation suffers from high variance due to non-robust local minima or highly non-smooth objectives (Golovin et al., 2019). A direct search algorithm (also known as pattern search, derivative-free search, or black-box search), which samples a vector $v$ and moves to $x + v$ if doing so lowers the objective, performs well in several settings, such as reinforcement learning (Mania et al., 2018). Additionally, a binary search on the step size, referred to as Gradientless Descent (Golovin et al., 2019), is proven to be fast for high-dimensional zeroth-order optimization.
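A direct search of this kind can be sketched in a few lines. The version below is illustrative only: it samples random unit directions, accepts any candidate that lowers the objective, and halves the step size whenever a whole round brings no improvement. That halving rule is a simplification assumed for this sketch and is not the binary search of Gradientless Descent; all names and default values are hypothetical.

```python
import numpy as np


def direct_search(objective, x0, step=1.0, num_samples=16, num_iters=200, seed=0):
    """Minimize `objective` using only function evaluations."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    fx = objective(x)
    for _ in range(num_iters):
        improved = False
        for _ in range(num_samples):
            v = rng.normal(size=x.shape)
            v /= np.linalg.norm(v)  # random unit direction
            candidate = x + step * v
            fc = objective(candidate)
            if fc < fx:  # move only if the objective decreases
                x, fx = candidate, fc
                improved = True
        if not improved:
            step *= 0.5  # assumed schedule: shrink the step when stuck
    return x, fx
```

No derivative of `objective` is ever required, which is the property that matters when the objective itself already contains a gradient.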

Speaker Identification with Triplet Loss

Recent approaches (Snyder et al., 2018; Chung et al., 2018) formulate speaker identification as learning speaker discriminative embeddings for an utterance. The embeddings are extracted from variable-length acoustic segments via a deep neural network. The multi-class cross entropy loss (Snyder et al., 2017, 2018) and the triplet loss (Zhang & Koishida, 2017; Li et al., 2017; Chung et al., 2018) are two common approaches to train the embeddings. In this work, we adopt the triplet loss, which operates on pairs of embeddings, trying to minimize the distance of embeddings from the same speaker, and maximize the distance with other negative samples.
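For illustration, a triplet loss on a single (anchor, positive, negative) triple of embeddings can be written as follows. This sketch assumes squared Euclidean distance and a hypothetical margin of 0.2; the cited speaker identification works may use a different similarity measure (for example cosine similarity) and a different margin.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls same-speaker embeddings together and pushes
    embeddings of other speakers at least `margin` further away."""
    anchor, positive, negative = (
        np.asarray(v, dtype=float) for v in (anchor, positive, negative)
    )
    d_pos = np.sum((anchor - positive) ** 2)  # anchor vs. same speaker
    d_neg = np.sum((anchor - negative) ** 2)  # anchor vs. other speaker
    return float(max(0.0, d_pos - d_neg + margin))
```

The loss is zero once the negative sample is further from the anchor than the positive sample by at least the margin.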

3 A Method to Reveal Speaker Identity

In this section, we describe our method to reveal the speaker identity of a training sample given its model gradient $\nabla \mathcal{L}(x, y)$. Here, $x \in \mathbb{R}^{T \times d}$ denotes the input speech features created from the training utterance, $T$ is the length of input, $d$ is the dimension of the input speech features, $y$ is the output label sequence, and $\mathcal{L}$ denotes the training loss function. We split our method into two phases: (1) using Hessian-Free Gradients Matching to reconstruct the input speech features (reconstruction phase), and (2) identifying the speaker from the reconstructed speech features (inference phase). Figure 1 provides an illustration of our method.

Figure 1: An illustration of our method. (1) A gradient $\nabla \mathcal{L}(x, y)$ is accessible to an attacker. (2) The attacker computes dummy gradients $\nabla \mathcal{L}(x^*, y)$ from a dummy input $x^*$. (3) The attacker compares the gradient received with dummy gradients and repeats to optimize $x^*$. (4) The attacker reveals the identity of the speaker. Notations in red are known to the attacker.

3.1 Reconstruction Phase: Hessian-Free Gradients Matching (HFGM)

Given access to a gradient $\nabla \mathcal{L}(x, y)$, generally one would like to find a pair of speech features and transcript $(x^*, y^*)$ such that $\nabla \mathcal{L}(x^*, y^*) = \nabla \mathcal{L}(x, y)$. However, ASR models are typically sequence-to-sequence models that can map arbitrary length speech features ($x$) to arbitrary length transcripts ($y$). With no additional information, the number of possible values $y^*$ can take is exponential in the label set size, searching through which can incur a prohibitive computational cost. Moreover, there can be many false positives, i.e., pairs $(x', y') \neq (x, y)$ such that $\nabla \mathcal{L}(x', y') = \nabla \mathcal{L}(x, y)$. To circumvent this issue and make our problem simpler, we assume $y$, the transcript of the training utterance, is given. (Footnote 2: Note that the transcript could be a common phrase, e.g., “play music”. Our objective is to identify the speaker of an utterance regardless of the contents of its transcript.) Even though $y$ is given, there could exist speech features $x'$ of a different length $T' \neq T$ such that $\nabla \mathcal{L}(x', y) = \nabla \mathcal{L}(x, y)$. Thus, we also assume $T$ is given. Note that even if the transcript and the length of the input speech features are known, revealing the identity of the speaker can still result in a significant breach of privacy. Designing efficient reconstruction methods that operate without these assumptions is an interesting direction, which we leave for future work. (Footnote 3: Using the experimental setup in Section 4, we provide some preliminary results in Appendix B.1 on reconstruction i) without the knowledge of length of input speech features, and ii) with the knowledge of only the transcript length. We find that while reconstruction can succeed even with a good estimate of the input length, it fails with no knowledge of the transcript.)

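Now, we define the reconstruction task as constructing an $x^*$ such that $\nabla \mathcal{L}(x^*, y)$ is close to the observed gradient $\nabla \mathcal{L}(x, y)$. Following Geiping et al. (2020), we choose cosine distance as our measure of closeness, and formulate our optimization problem to find $x^*$ s.t.:

$$\arg\min_{x^* \in \mathbb{R}^{T \times d}} \; 1 - \frac{\langle \nabla \mathcal{L}(x^*, y), \nabla \mathcal{L}(x, y) \rangle}{\| \nabla \mathcal{L}(x^*, y) \| \, \| \nabla \mathcal{L}(x, y) \|} \qquad (4)$$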

To solve this non-convex optimization problem using gradient-based methods as in prior work (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020), we need to compute the second-order derivatives of the loss function $\mathcal{L}$. The loss function that is of interest here is the CTC loss, which is commonly used in end-to-end ASR systems. However, computing the second derivative of CTC loss involves backpropagating twice through a dynamic programming algorithm, which we found to be intractable. Additionally, the second derivatives of CTC loss are not implemented in common deep learning frameworks like TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019). To tackle this challenge, and also address a broader family of loss functions, we adopt a zeroth-order optimization algorithm.

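  Input: Gradients to match $\nabla \mathcal{L}(x, y)$, gradients distance function $D$, learning rate $\eta$, transcript $y$ and length of speech features $T$. Parameters: number of samplings $K$, number of iterations $N$
  Initialize $x^* \in \mathbb{R}^{T \times d}$ with uniformly random values.
  for $n = 1$ to $N$ do
     Sample unit vectors $v_1, \dots, v_K$
     Set $\Delta \leftarrow 0$
     for $k = 1$ to $K$ do
        if $D(\nabla \mathcal{L}(x^* + v_k, y), \nabla \mathcal{L}(x, y)) < D(\nabla \mathcal{L}(x^*, y), \nabla \mathcal{L}(x, y))$ then
           Add $v_k$ to $\Delta$
        end if
     end for
     Update $x^* \leftarrow x^* + \eta \Delta$
  end for
Algorithm 1 Hessian-Free Gradients Matching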

We use a direct search approach (Section 2) called HFGM (Algorithm 1). We initialize $x^*$ with uniformly random values. At each iteration, we sample $K$ random unit vectors and apply them to $x^*$. The value of the gradients distance $D$ is evaluated at each of these points. We choose only the vectors that lower $D$, sum these up and apply the sum with a learning rate $\eta$. We repeat the process until we reach the convergence criteria.
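The loop described above can be sketched on a toy problem. The code below is not the authors' implementation: it replaces the ASR model and CTC loss with a linear model whose gradient has a closed form (`toy_gradient`), probes candidates at a distance scaled by the learning rate, draws the initial values from [-1, 1], and returns the best iterate seen. These choices, together with all names and default values, are assumptions made to keep the sketch small and self-contained.

```python
import numpy as np


def cosine_distance(a, b, eps=1e-12):
    a, b = np.ravel(a), np.ravel(b)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def toy_gradient(x, W, y):
    """Gradient w.r.t. W of the toy loss 0.5 * ||W @ x - y||^2."""
    return np.outer(W @ x - y, x)


def hfgm(grad_fn, observed_grad, shape, lr=0.05, num_samples=32, num_iters=200, seed=0):
    """Direct-search gradient matching: only first-order gradients of the
    model loss (via grad_fn) are evaluated, never second derivatives."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=shape)  # assumed initialization range
    best_x, best_d = x.copy(), cosine_distance(grad_fn(x), observed_grad)
    for _ in range(num_iters):
        d_x = cosine_distance(grad_fn(x), observed_grad)
        update = np.zeros(shape)
        for _ in range(num_samples):
            v = rng.normal(size=shape)
            v /= np.linalg.norm(v)  # random unit vector
            # Keep only the directions that lower the gradients distance.
            if cosine_distance(grad_fn(x + lr * v), observed_grad) < d_x:
                update += v
        x = x + lr * update  # apply the summed directions
        d_new = cosine_distance(grad_fn(x), observed_grad)
        if d_new < best_d:  # sketch-only safeguard: remember the best iterate
            best_x, best_d = x.copy(), d_new
    return best_x, best_d
```

In the paper's setting, `grad_fn` would correspond to computing the gradient of the CTC loss for a dummy input with the known transcript, and `observed_grad` to the gradient available to the attacker.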

3.2 Inference Phase: Revealing Speaker Identity

In the second part of our method, we use the reconstructed speech features ($x^*$) to identify the speaker of the utterance from a list of possible speakers.

We train a speaker identification model that uses the same speech features as our ASR model on some corpus. We assume that we have access to some public utterances for each possible speaker to identify them. We use the speaker identification model to create embeddings for each speaker from the public utterances. We take the reconstructed speech features ($x^*$), create an embedding using the speaker identification model, and compare it with embeddings for each speaker. If the method is successful, the embedding created from $x^*$ is closer to the embedding for the speaker of the utterance than the other speakers.
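The comparison step can be sketched as a ranking of candidate speakers by the average cosine similarity between the embedding of the reconstructed features and each speaker's enrolled embeddings. The function name, the dictionary layout, and the use of cosine similarity are assumptions of this sketch; the embeddings themselves would come from the speaker identification model.

```python
import numpy as np


def rank_speakers(query_embedding, enrolled_embeddings):
    """Return speaker ids sorted from most to least similar.

    enrolled_embeddings: dict mapping speaker id -> array (n_utterances, dim)
    built from the public utterances of that speaker.
    """
    q = np.asarray(query_embedding, dtype=float)
    q = q / np.linalg.norm(q)
    scores = {}
    for speaker, embs in enrolled_embeddings.items():
        embs = np.asarray(embs, dtype=float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        scores[speaker] = float(np.mean(embs @ q))  # mean cosine similarity
    return sorted(scores, key=scores.get, reverse=True)
```

The position of the true speaker in the returned list is what top-k accuracy and reciprocal rank are computed from.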

3.3 Comparison with Related Prior Works

Our work differs from the related prior works  (Zhu et al., 2019; Geiping et al., 2020; Wei et al., 2020) in a few ways.

Input features: The input to ASR models is typically not the raw audio but speech features which are computed from the raw audio using a series of lossy transformations. In image recognition models, the input to the model is typically the raw pixel values. While prior works on image recognition models demonstrate breach of privacy by directly reconstructing the input to the model, we incorporate an additional inference phase where we use the reconstructed input to reveal the identity of the speaker.

Variable-sized inputs and outputs: We focus on ASR models, which have variable-sized inputs and outputs; image recognition models have fixed size inputs and outputs.

CTC Loss: The models we focus on use CTC loss instead of cross-entropy loss. CTC loss is significantly more complex, requiring a dynamic programming algorithm to compute the value of the loss function and derivatives.

4 Experiments

In this section, we provide empirical results on the effectiveness of our method. We start by describing our setup.

Model Architectures

Following prior work (Carlini & Wagner, 2018), we choose the DeepSpeech (Hannun et al., 2014) model architecture for our experiments. The model consists of three feed-forward layers, followed by a single bi-directional LSTM layer, and two feed-forward layers to produce softmax probabilities for the CTC loss. DeepSpeech uses character-based CTC loss: the output is a sequence of characters. The input is a 26-dimensional mel-frequency cepstral coefficients (MFCCs) feature. MFCCs are a popular speech feature, derived by mapping the power of the result of a Fourier transform to the mel scale, then performing a discrete cosine transformation. We use Mozilla’s implementation of DeepSpeech.


We conduct our experiments using randomly initialized weights for the model. For the inference phase, we follow Li et al. (2017) to train a text-independent speaker identification model on 26-dim normalized MFCCs, similar to the speech feature used for inputs to DeepSpeech.


We choose the LibriSpeech ASR corpus (Panayotov et al., 2015), a large-scale benchmark speech dataset for our experiments. The dataset contains pairs of audio and transcript, along with speaker attributes such as gender and identity. For training the speaker identification model, we first combine all the dev-{clean/other}, test-{clean/other}, train-{clean-100/clean-360/other-500} sets to obtain 300k utterances from 2,484 speakers, and use the first 5 utterances of each speaker for training.

For the reconstruction phase, we trim the leading and trailing silences, based on the intensity of every 10ms chunk, from each utterance in the remaining combined test set. Next, we randomly sample a total of 600 utterances, 100 for each 0.5s interval of audio length between 1s and 4s. The average audio length in our sampled set is 2.5 seconds, and average transcript length is 40.6 characters. The male:female ratio in the sampled utterances is 1.1:1.

Implementation Details

In this section, all the experiments consider the scenario of revealing speaker identity from a single gradient computed using a single utterance. For computational efficiency, we match gradients only for the last layer (60k parameters). Note that matching lower layers may increase the reconstruction quality.

Each dummy input in our reconstruction is initialized with uniformly random values. When performing direct search, we sample 128 unit vectors per iteration, each of which only updates a single frame. We set the step size to 1, and halve it whenever the loss has not decreased by more than 5% over 2.5k iterations. We stop the reconstruction when the step size reaches 0.125. We run each reconstruction on a single Tesla V100 GPU. The reconstruction time depends on the length of inputs, ranging from 3 to 6 hours.

Evaluation Metrics

To evaluate our reconstruction, we use the Mean Absolute Error (MAE) to measure the distance of normalized MFCCs to those of the original utterance. During inference, the similarity scores of a reconstructed object’s embedding with each of 5 available utterances’ embeddings in the training data are averaged and ranked to identify the speaker. We use Top-1 Accuracy, Top-5 Accuracy and Mean Reciprocal Rank (MRR) to evaluate the speaker identity leakage. In experiments where we try to use alternate training methods, the Word Error Rate (WER) is used to evaluate the quality of trained ASR models. (Footnote 6: An N-gram language model is trained separately on a large text corpus and used during inference.)
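The ranking metrics above follow directly from the rank of the true speaker for each utterance. As a small sketch (the function names and the 1-based rank convention are choices made here):

```python
def top_k_accuracy(ranks, k):
    """Fraction of utterances whose true speaker is ranked within the top k.
    `ranks` holds 1-based ranks of the true speaker."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the true speaker."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, with ranks `[1, 2, 6, 1]` the top-1 accuracy is 0.5 and the top-5 accuracy is 0.75.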


Figure 2: MAE and speaker identification accuracy on 600 utterances reconstructed using the actual gradient, and using DP-SGD at different noise levels for training. We also provide the respective accuracy for the original utterances. For short utterances, speaker identification on our reconstructions from the actual gradients almost matches that from the original utterances. For , the top-1 accuracy reduces to 0%.

4.1 Empirical Results

Now, we present the results of using our method from Section 3 to reveal speaker identity from 600 individual gradients, each gradient computed using a unique utterance from our sampled set. In Figure 2, we plot the results by audio length in intervals of 0.5s from 1-4s. First, we show the average MAE of the reconstructed MFCCs, where we observe that the average MAE monotonically increases with the audio length. Note that the dimensionality of the optimization problem increases linearly with the audio length. Next, we plot the top-1 and top-5 accuracy of the speaker identity from the reconstructed MFCCs. Notice that even for the longest 3.5-4s utterances under consideration, the top-1 accuracy is 24%, and the top-5 accuracy is 37%. For comparison, we also plot the performance of our speaker identification model on the original utterances. For shorter utterances, the top-1 accuracy from the reconstructed MFCCs is almost identical to that from the original utterances.

Table 1 shows the overall values of the average MAE, Top-1 accuracy, Top-5 accuracy, and MRR of speaker identification results from the original and reconstructed speech features. We see that while speaker identification from original utterances results in 42% top-1 (57% top-5) accuracy, the same from the reconstructed features is 34% top-1 (51% top-5), providing 81% (89.5%) relative performance.

                MAE    Top-1 (%)   Top-5 (%)   MRR
Original        0.00   42.0        57.0        0.554
Reconstructed   0.25   34.0        51.0        0.419
Table 1: MAE, Top-1 Accuracy, Top-5 Accuracy, and MRR of speaker identification on 600 utterances. The top-1 (top-5) accuracy of speaker identification on reconstructed features is 81% (89.5%) relative to that on original utterances.
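The relative performance figures can be checked directly from the accuracies in Table 1; the variable names below are illustrative only.

```python
original = {"top1": 42.0, "top5": 57.0}
reconstructed = {"top1": 34.0, "top5": 51.0}

# Accuracy on reconstructed features as a percentage of that on originals.
relative = {k: 100.0 * reconstructed[k] / original[k] for k in original}
print({k: round(v, 1) for k, v in relative.items()})
# prints {'top1': 81.0, 'top5': 89.5}
```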

4.2 Training with DP-SGD

Now, we study the effect of training with the popular technique of Differentially Private Stochastic Gradient Descent (DP-SGD) on the success of our method. At a high level, each gradient gets clipped to a fixed $\ell_2$-norm bound $C$, and zero-mean Gaussian noise of standard deviation $\sigma C$ is added to provide a (local) DP guarantee for each sample. Due to space constraints, we defer the formal definition of Differential Privacy (DP) (Dwork et al., 2006b, a), and a pseudo-code of DP-SGD, to Appendix A.3.
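The per-sample privatization step can be sketched as follows. This is an illustration under assumptions, not the paper's pseudo-code from Appendix A.3: the function name is hypothetical, and the noise standard deviation is taken to be the noise multiplier times the clipping bound, which is the convention of Abadi et al. (2016).

```python
import numpy as np


def clip_and_add_noise(grad, clip_norm, noise_multiplier, rng):
    """Clip a gradient to L2 norm `clip_norm`, then add Gaussian noise."""
    grad = np.asarray(grad, dtype=float)
    norm = np.linalg.norm(grad)
    # Scale down only if the gradient is longer than the clipping bound.
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))
    # Assumed convention: noise std = noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return clipped + noise
```

Clipping alone only rescales the gradient, so its direction, and therefore any cosine-based distance to it, is unchanged; it is the added noise that perturbs the direction.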

Using only $\ell_2$-clipping has been shown in prior works (Carlini et al., 2019; Thakkar et al., 2020) to be effective in mitigating unintended memorization in language models. However, the optimization in our method (Equation 4) uses cosine distance as the loss, thus rendering $\ell_2$-clipping alone ineffective. Since using DP-SGD for training large models has been shown (Abadi et al., 2016; McMahan et al., 2018; Thakkar et al., 2020) to affect model utility, our first objective is to find the least $\sigma$ s.t. the top-1 accuracy of speaker identification is 0%. For our experiments, we fix the clipping bound $C$, and provide the evaluation metrics for different noise levels $\sigma$ in Figure 2. (Footnote 7: We observed gradients had norm at least 100, and chose $C$ accordingly. Due to the cosine loss used in our optimization, as long as a gradient gets clipped, the value of $C$ will not have any effect on the success of the method, or on the DP guarantee via DP-SGD.) We observe that a sufficiently high noise level is effective in reducing the top-1 accuracy of speaker identification to 0%.

MAE Top-1 Top-5 MRR WER (clean) WER (other)
Baseline 0.25 34.0 51.0 0.419 10.5 28.4
0.34 21.5 34.8 0.284 14.9 37.6
0.54 2.3 5.8 0.049 15.4 39.4
0.63 0.5 1.7 0.021 19.6 45.3
Table 2: MAE and Speaker identification from reconstruction when using DP-SGD at different noise levels. The noise needed () to get 0% top-1 accuracy almost doubles the final model’s WER.

We also provide the overall evaluation metrics with using DP-SGD in Table 2. We see that for , the top-1 accuracy of speaker identification is 0.5%. Further, we also provide the WER of models trained using DP-SGD (Google, 2019) with a batch size of 16. We see that the WER of models trained via DP-SGD, even for the smallest noise level, is significantly increased compared to our baseline training. Note that for the levels of noise presented here, the bounds for (local/central) DP will be near-vacuous. However, improving the privacy-utility trade-offs for DP-SGD is beyond the scope of this work.

Figure 3: Spectrograms obtained from the original and reconstructed MFCCs, as well as training with DP-SGD , and dropout rate 0.1. The utterance in the first row is “where is my husband” (length: 1.4s, MAE of Reconstructed: 0.09, speaker identified correctly). The utterance in the second row is “i’ll give it to you this time but the next time you want anything you can go below for it” (length: 4.0s, MAE of Reconstructed: 0.31, speaker identified incorrectly). Even though the latter has a bad reconstruction quality, its spectrogram for the reconstructed features is visibly similar to that of the original. For reconstructions on training using DP-SGD, and Dropout, we see that reconstruction quality deteriorates.

4.3 Training with Dropout

Dropout (Srivastava et al., 2014) has been adopted in training deep neural networks as an efficient way to prevent overfitting to the training data. The key idea of unit dropout is to randomly drop model units during training. While prior work (Wei et al., 2020) has mentioned dropout in the context of information leakage from gradients, it does not provide any empirical evidence of the effect of training with dropout on such leakages.

The dropout mask is deducible from gradients if dropping a unit completely disables a part of the network (e.g. a feed-forward neural network), or dropout is applied directly on weights (Wan et al., 2013). When parameters are shared in the network, e.g., a fully-connected layer operating frame-wise on a sequence of speech features, each part of the output typically uses an i.i.d. random dropout mask, making it difficult to infer dropout masks from a gradient.
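The difference between the two cases can be illustrated with the weight gradient of a single masked linear layer. In this sketch (names, shapes, and the omission of the usual rescaling by the keep probability are assumptions), a dropped unit leaves an all-zero row in the gradient when the layer sees one input, but not when the same weights are applied to many frames with independent masks.

```python
import numpy as np


def weight_gradient(inputs, upstream, masks):
    """Gradient w.r.t. W of h_t = mask_t * (W @ x_t), summed over frames.

    inputs: (T, d_in), upstream: (T, d_out) gradients w.r.t. h_t,
    masks: (T, d_out) with entries in {0, 1}.
    """
    return (masks * upstream).T @ inputs


def zero_rows(grad):
    """Boolean vector marking output units whose gradient row is all zero."""
    return np.all(grad == 0.0, axis=1)
```

With a single frame, `zero_rows` recovers the dropped units exactly. With many frames, a row is zero only if the unit happened to be dropped in every frame, so the summed gradient no longer exposes the per-frame masks.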

MAE Top-1 Top-5 MRR WER (clean) WER (other)
0.25 34.0 51.0 0.419 10.5 28.4
0.59 0.8 2.0 0.019 11.9 28.2
0.72 0.0 0.5 0.006 9.2 25.6
0.81 0.1 0.3 0.005 9.5 27.1
Table 3: MAE, Top-1, Top-5, and MRR of speaker identification when reconstructed from dropped-out gradients. Even a dropout rate of 0.1 efficiently prevents the leakage.

Table 3 shows reconstruction quality and training error rates for different dropout rates. Even for the lowest dropout rate of 0.1, we see that the top-1 accuracy of speaker identification is 0%. At the same time, we observe that for models trained with dropout, the WER is comparable to (or sometimes even lower than) that of the baseline training. We defer the plots of the results grouped by audio length to Appendix B.

Visualizing Reconstructed Features

In Figure 3, we provide two examples of spectrograms from the reconstruction of a short and a long utterance. For the long utterance, even though MAE for the reconstruction is high and the speaker identification system fails to identify the speaker, the reconstructed audio pattern is visibly similar to the original audio pattern. For comparison, we also provide spectrograms from reconstructions of the same utterances from DP-SGD training (), and a dropout rate of 0.1.

5 Additional Experiments

The experiments in Section 4 focused on revealing speaker identity using our method on a single gradient from a single utterance. In distributed settings like FL, model training is performed under more complex settings. In this section, we conduct experiments to evaluate the success of our method on two natural extensions of the setting in Section 4: 1) gradients from a batch of utterances are averaged before being shared, and 2) multiple update steps are performed using a single utterance, and the final model update is shared. We demonstrate that in both of the settings above, our method can reveal speaker identity with non-trivial accuracy. Further, we show that using dropout for training reduces the limited success of the method in both the settings. All the experiments in this section are conducted using the 200 utterances of audio length 1-2s (from the 600 sampled utterances for experiments in Section 4).

5.1 Averaged Gradients from Batches

In this section, we study the performance of our method for revealing speaker identities from an averaged gradient computed using a batch of utterances. In the reconstruction phase, our objective function (4) does not change; however, we instead try to reconstruct $x^*_1, \dots, x^*_B$, where $x^*_i \in \mathbb{R}^{T_i \times d}$ for $i \in \{1, \dots, B\}$. Here, $B$ is the number of samples in the batch, and $T_i$ is the length of input $x_i$. For computational efficiency, we only update a single sample per iteration of our optimization. We provide a pseudo-code for the variant of Algorithm 1 adapted to this setting, in Appendix C.1.

We conduct our experiments for batch sizes in $\{2, 4, 8\}$. For each batch size, the 200 utterances are sorted by audio length, and grouped into batches. We provide the results in Table 4, comparing them with the results (batch size 1) on the same 200 utterances in Section 4.1. We see that while speaker identification accuracy decreases with increasing batch sizes, the top-1 accuracy is still as high as 19% for batch size 4. An experiment on the effect of training with a dropout rate of 0.1 shows that reconstruction of batch size 2 from dropped-out gradients reduces the accuracy to 1% top-1 (4% top-5), compared to 2% top-1 (4% top-5) on the same set of utterances in Section 4.3.

| Setting | MAE | Top-1 (%) | Top-5 (%) | MRR |
|---|---|---|---|---|
| Original | 0.00 | 42.0 | 57.0 | 0.490 |
| Batch size 1 | 0.14 | 40.0 | 55.0 | 0.470 |
| Batch size 2 | 0.21 | 37.0 | 54.0 | 0.451 |
| Batch size 4 | 0.37 | 19.0 | 31.0 | 0.249 |
| Batch size 8 | 0.48 | 5.0 | 11.0 | 0.084 |

Table 4: Reconstruction MAE, and top-1 and top-5 speaker identification accuracy, from averaged gradients of a batch. Even with batch size 4, our method is successful with a top-1 accuracy of 19%.
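For readers unfamiliar with the ranking metrics reported in the table, the following sketch shows how top-k accuracy and mean reciprocal rank (MRR) are conventionally computed from the rank of the true speaker. The function names and the score dictionary are hypothetical; the assumption is that each candidate speaker receives a similarity score and a higher score means a better match.

```python
def rank_of_true_speaker(scores, true_speaker):
    """1-based rank of the true speaker when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_speaker) + 1


def top_k_accuracy(ranks, k):
    """Percentage of utterances whose true speaker is ranked within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all utterances."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy ranks for four utterances (not taken from the paper's experiments).
ranks = [1, 2, 6, 1]
top1 = top_k_accuracy(ranks, 1)
top5 = top_k_accuracy(ranks, 5)
mrr = mean_reciprocal_rank(ranks)
```

MRR rewards near misses that top-1 accuracy ignores, which is why it degrades more gradually than top-1 as the batch size grows.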

5.2 Multi-Step Updates from a Sample

Next, we study the success of our method in revealing speaker identities from an update comprising multiple update steps on a single utterance. We conduct our experiments for 2-step and 8-step updates with the learning rate set to . For computational efficiency, we reduce the number of unit vectors sampled in each iteration of our zeroth-order optimization to 8 (as opposed to 128 in the experiments in Sections 4 and 5.1).

Table 5 shows the results of our experiment, comparing them with the single-step (1-step) results from Section 4.1. Since the optimization for multi-step reconstruction is different, the results are not directly comparable with those of the single-step setting. We see that even though the time and computation required for reconstruction may grow with the number of steps, our method still reveals speaker identity with 24.5% top-1 accuracy for 8-step updates. Using dropout in training is still effective: a dropout rate of 0.1 reduces the accuracy to 2% top-1 (3.5% top-5).

| Setting | MAE | Top-1 (%) | Top-5 (%) | MRR |
|---|---|---|---|---|
| Original | 0.00 | 42.0 | 57.0 | 0.490 |
| 1-step | 0.14 | 40.0 | 55.0 | 0.470 |
| 2-step | 0.33 | 26.5 | 39.5 | 0.333 |
| 8-step | 0.33 | 24.5 | 39.0 | 0.321 |

Table 5: Reconstruction MAE, and top-1 and top-5 speaker identification accuracy, from multi-step updates from a single sample. We see that increasing the number of steps from 2 to 8 does not significantly affect the quality of the reconstruction.

6 Related Work

While we provide background (in Section 2) on the DLG method (Zhu et al., 2019) and a comparison with our method (in Section 3.3), there have been follow-up works (Geiping et al., 2020; Wei et al., 2020; Zhao et al., 2020) showing high-fidelity image and label reconstruction from gradients under different settings. Revealing information about training data from gradients has also been shown via membership and property leakage (Shokri et al., 2017; Song & Shmatikov, 2019; Melis et al., 2019). There is also a growing line of work on revealing information from trained models. For instance, Fredrikson et al. (2015) demonstrate vulnerabilities to model inversion attacks. Other works (Carlini et al., 2019; Thakkar et al., 2020) measure the amount of unintended memorization in trained models, and study the effect of DP-SGD in mitigating such memorization.

Regarding the use of standard training techniques to reduce information leakage from model training, gradient compression and sparsification have been claimed (Zhu et al., 2019) to provide protection, but Wei et al. (2020) show that reconstruction attacks can succeed with non-trivial accuracy in spite of gradient compression. There also exist works on designing strategies that require changes to the model inputs or architecture for protection, e.g., TextHide (Huang et al., 2020a) and InstaHide (Huang et al., 2020b). For real-world deployments of distributed training, there also exist protocols like Secure Aggregation (Bonawitz et al., 2017), which make it difficult for any adversary to access raw individual gradients.


The authors would like to thank Nicholas Carlini, Andrew Hard, Ronny Huang, Khe Chai Sim, and our colleagues in Google Research for their helpful support of this work, and comments towards improving the paper.


Appendix A Background

A.1 DeepSpeech

We present details of the DeepSpeech (Hannun et al., 2014) model here. The model consists of three feed-forward layers, followed by a single bi-directional LSTM layer, and two feed-forward layers that produce softmax probabilities for the CTC loss. The layers and the number of parameters at each layer are listed in Table 6. Note that we only use the last layer to match gradients, which has only about 0.1M parameters.

| Layer | Type | No. parameters |
|---|---|---|
| 1 | feed-forward | 1.0M |
| 2 | feed-forward | 4.2M |
| 3 | feed-forward | 4.2M |
| 4 | bi-directional LSTM | 33.6M |
| 5 | feed-forward | 4.2M |
| 6 | feed-forward | 0.1M |
| Total | | 47.3M |

Table 6: Number of parameters at each layer of DeepSpeech. Note that we only match the last layer (Layer 6) during the reconstruction.

A.2 Deep Speaker

The Deep Speaker (Li et al., 2017) model adopts a deep residual CNN (ResCNN) architecture to extract acoustic features from utterances. These per-frame features are averaged to produce utterance-level speaker embeddings. The ResCNN consists of four stacked residual blocks (ResBlocks) with a stride of 2. The numbers of CNN filters are 64, 128, 256, and 512, respectively. The total number of parameters is 24M.

Deep Speaker is trained with Triplet Loss, which takes three samples as input: an anchor $a$, a positive sample $p$ (from the same speaker), and a negative sample $n$ (from another speaker). The loss function of $N$ samplings is defined as

$$L = \sum_{i=1}^{N} \left[ s_i^{an} - s_i^{ap} + \alpha \right]_+$$

where $s_i^{an}$ is the cosine similarity between the anchor $a$ and the negative sample $n$, and $s_i^{ap}$ is the cosine similarity between the anchor $a$ and the positive sample $p$, from the $i$-th sampling, and $[\cdot]_+ = \max(\cdot, 0)$. $\alpha$ is the minimum margin between these cosine similarities, which is set to 0.1.
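As an illustration of the triplet loss described above, here is a minimal sketch operating on plain Python lists as embeddings. The function names are hypothetical, the hinge form (clamping each term at zero) follows the usual formulation of the triplet loss, and the margin default of 0.1 is the value stated in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(triplets, margin=0.1):
    """Sum over (anchor, positive, negative) triplets of max(0, s_an - s_ap + margin)."""
    total = 0.0
    for anchor, positive, negative in triplets:
        s_ap = cosine_similarity(anchor, positive)
        s_an = cosine_similarity(anchor, negative)
        total += max(0.0, s_an - s_ap + margin)
    return total


# A well-separated triplet contributes nothing; a violated one is penalized.
easy = triplet_loss([([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])])
hard = triplet_loss([([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])])
```

The loss is zero once the anchor is closer to the positive than to the negative by at least the margin, so training effort concentrates on triplets that still violate it.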

A.3 DP-SGD

For completeness, we start by providing a definition of the notion of Differential Privacy (Dwork et al., 2006b, a). We will refer to a pair of datasets $D, D'$ as neighbors if $D'$ can be obtained by the addition or removal of one sample from $D$.

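Definition A.1 (Differential privacy (Dwork et al., 2006b, a))

A randomized algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private if, for any pair of neighboring datasets $D$ and $D'$, and for all events $S$ in the output range of $\mathcal{A}$, we have

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta$$

where the probability is taken over the random coins of $\mathcal{A}$.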

Now, we provide a pseudo-code for DP-SGD (Abadi et al., 2016).

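  Input: Dataset $D$ of size $n$, loss function $\ell$. Parameters: mini-batch size $B$, learning rate $\eta$, clip norm bound $C$, per-sample noise scale $\sigma$, total number of iterations $T$
  Initialize model with $\theta_0$ randomly
  for $t = 1, \dots, T$ do
     Sample a random minibatch $B_t$, by independently including each element of $D$ with probability $B/n$
     for each $x_i \in B_t$ do
        Compute gradient $g_t(x_i) = \nabla_\theta \ell(\theta_{t-1}, x_i)$
        Clip each gradient in $\ell_2$ norm to $C$, i.e., $\bar{g}_t(x_i) = g_t(x_i) / \max(1, \|g_t(x_i)\|_2 / C)$
        Add noise of scale $\sigma$ to $\bar{g}_t(x_i)$ to obtain $\tilde{g}_t(x_i)$
     end for
     Compute average noised gradient $\tilde{g}_t$
     Update model $\theta_t = \theta_{t-1} - \eta \tilde{g}_t$
  end for
  Compute privacy cost using Moments Accountant.

Algorithm 2 Differentially Private SGD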
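The steps of Algorithm 2 can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: the function names, the one-parameter model in `toy_grad`, and the choice of i.i.d. Gaussian noise with standard deviation `noise_scale` per coordinate are all assumptions (some formulations scale the noise by the clip norm), and privacy accounting is omitted.

```python
import math
import random


def sample_minibatch(dataset, batch_size, rng):
    """Poisson sampling: include each element independently with prob. batch_size / n."""
    p = batch_size / len(dataset)
    return [sample for sample in dataset if rng.random() < p]


def clip_gradient(grad, clip_norm):
    """Scale the gradient down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(1.0, norm / clip_norm)
    return [g / scale for g in grad]


def dp_sgd_step(theta, batch, grad_fn, lr, clip_norm, noise_scale, rng):
    """One update: clip and noise each per-sample gradient, average, then descend."""
    noised = []
    for sample in batch:
        clipped = clip_gradient(grad_fn(theta, sample), clip_norm)
        # Assumption: Gaussian noise with std noise_scale on every coordinate.
        noised.append([g + rng.gauss(0.0, noise_scale) for g in clipped])
    if not noised:  # an empty Poisson-sampled batch leaves the model unchanged
        return list(theta)
    avg = [sum(column) / len(noised) for column in zip(*noised)]
    return [t - lr * a for t, a in zip(theta, avg)]


def toy_grad(theta, sample):
    """Gradient of 0.5 * (theta[0] * x - y)^2 for a hypothetical one-parameter model."""
    x, y = sample
    return [(theta[0] * x - y) * x]
```

A training loop would call `sample_minibatch` and `dp_sgd_step` once per iteration. Clipping bounds the influence of any single sample, which is what allows the added noise to mask it.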

Appendix B Additional Experiments, and Omitted Details

B.1 Reconstruction without Assumptions

We set up two experiments to explore the necessity of the two assumptions of known input length and transcript for our reconstruction method.

Reconstruction without Knowledge of Input Length

The first assumption for our reconstruction method (Section 3.1) is that the length of the input speech features is known. This is required to set up the search space for the optimization problem (4). We show that reconstruction is still possible without the exact input length. In the experiments below, 20 random utterances are chosen from the 1-2s bucket whose speakers are correctly identified in the top-5 in Section 4.1. The average length of these utterances is 74.35 frames (≈1.5s). Table 7 shows reconstruction results when the estimated lengths differ by , and compared to the original lengths, and when they are double / half of the original lengths. It can be seen that the speaker identity can still be revealed with only a good estimate of the input length rather than its exact value. For the same amount of absolute deviation in the estimation (e.g., and ), we see that the higher estimate provides better results.

| Length | Loss () | Top-1 (%) | Top-5 (%) | MRR |
|---|---|---|---|---|
| Original | 0.04 | 90 | 100 | 0.748 |
| | 0.06 | 60 | 90 | 0.706 |
| | 0.05 | 55 | 95 | 0.714 |
| | 0.20 | 50 | 80 | 0.632 |
| | 0.31 | 45 | 70 | 0.580 |
| | 0.44 | 35 | 55 | 0.442 |
| | 1.43 | 20 | 40 | 0.301 |
| | 2.53 | 0 | 10 | 0.048 |
| | 145.15 | 0 | 0 | 0.003 |

Table 7: Loss (gradients' distance) and speaker identification results with different input lengths on 20 random short utterances whose speakers were correctly identified in the top-5 in Section 4.1. We see that our method succeeds even with only good estimates of the input length.

Reconstruction without Contents of the Transcript

Next, we conduct experiments in which our method knows only the length of the transcript, not its contents. For each utterance in the set of 20 utterances from Section B.1, we generate 4 random transcripts and use them to reconstruct speech features. It can be seen from Table 8 that reconstruction is consistently of poor quality (high loss) with a random transcript, suggesting that knowledge of the transcript is important. The poor quality of features reconstructed from an incorrect transcript also suggests that if the attacker has a list of candidate transcripts (e.g., common phrases, song names, etc.) that includes the original one, a brute-force approach that picks the candidate with the lowest loss can reveal the actual transcript with high confidence.

| Transcript | Loss () | MAE | Top-1 (%) | Top-5 (%) | MRR |
|---|---|---|---|---|---|
| Original | 0.04 | 0.12 | 90 | 100 | |
| Random 1 | 79.5 | 0.78 | 0 | 0 | 0.010 |
| Random 2 | 135.5 | 0.74 | 0 | 0 | 0.013 |
| Random 3 | 108.7 | 0.77 | 0 | 0 | 0.006 |
| Random 4 | 101.5 | 0.78 | 0 | 0 | 0.015 |

Table 8: Loss and speaker identification results when reconstructing from random transcripts of the same length as the original. Reconstruction is consistently of poor quality with a random transcript.
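The brute-force idea mentioned in this subsection, running the reconstruction once per candidate transcript and keeping the one with the lowest final loss, can be sketched as follows. Everything here is hypothetical: `pick_transcript` is an illustrative helper, `reconstruction_loss` stands in for a full reconstruction run, and the candidate phrases and loss values are placeholders chosen only to mimic the gap between a correct and an incorrect transcript.

```python
def pick_transcript(candidates, reconstruction_loss):
    """Return the candidate with the lowest reconstruction loss, plus all losses."""
    losses = {transcript: reconstruction_loss(transcript) for transcript in candidates}
    best = min(losses, key=losses.get)
    return best, losses


# Placeholder losses standing in for the final gradient distance of each run.
toy_losses = {"play music": 0.04, "call home": 79.5, "set alarm": 135.5}
best, losses = pick_transcript(list(toy_losses), toy_losses.get)
```

The cost of this approach grows linearly with the number of candidates, since each one requires a complete reconstruction.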

B.2 Reconstruction from Dropped-Out Gradients

Figure 4: MAE and speaker identification accuracy when reconstructed from dropped-out gradients. Even a dropout rate of 0.1 effectively reduces the speaker identification accuracy to 0.

Figure 4 shows the results of the experiment in Section 4.3, grouped by audio length. Even a dropout rate of 0.1 effectively eliminates the risk of speaker identity leakage.

We also vary the dropout rate and perform reconstruction on a small population of 20 utterances (the first 20 utterances when sorted by length). The results are presented in Figure 5. The speaker identification accuracy drops sharply as the dropout rate increases.

Figure 5: MAE and speaker identification accuracy when dropout rate changes from 0.00 to 0.10. Each point is averaged from 20 short utterances. Speaker identification accuracy drops sharply when increasing the dropout rate.

Appendix C Algorithms

We present adapted versions of HFGM for reconstructing from averaged gradients and from multi-step updates.

C.1 HFGM on Averaged Gradients from Batches

In Algorithm 1, a dummy input is randomly initialized at the beginning and given to the model to compute the loss and gradients at every iteration of the optimization process. When reconstructing a batch from averaged gradients, a dummy batch needs to be optimized. To save computation time, we only update a single sample at each iteration, reusing the loss and gradients of the other samples in the batch to obtain the overall loss and gradients. A variant of Algorithm 1 adapted for this setting is presented as Algorithm 3.

  Input: Gradients to match , gradients distance function , learning rate , transcript , length of speech features . Parameters: number of samplings , number of iterations , batch size
  Initialize , .
  for  to  do
     Sample column unit vectors
     for  to  do
        if  then
           Add to
        end if
     end for
  end for
Algorithm 3 HFGM on averaged gradients from batches
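To make the control flow of this listing concrete, here is a rough sketch on a toy problem. It is not the authors' update rule: it uses a simple greedy random-direction search (accept any sampled unit direction that lowers the gradient distance), a hypothetical linear model with loss 0.5 (w · x)^2 in place of DeepSpeech with CTC, and a squared L2 gradient distance. What it does share with the description above is that only one sample is perturbed per iteration while the cached gradients of the other samples are reused.

```python
import math
import random


def sample_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sample_gradient(w, x):
    """Gradient w.r.t. w of the toy loss 0.5 * (w . x)^2 (stand-in for the real model)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [pred * xi for xi in x]


def average(grads):
    return [sum(column) / len(grads) for column in zip(*grads)]


def distance(grad, target):
    return sum((a - b) ** 2 for a, b in zip(grad, target))


def reconstruct_batch(w, target_grad, batch_size, dim, iters, k, step, rng):
    """Greedy zeroth-order search for a dummy batch whose averaged gradient matches."""
    batch = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]
    grads = [sample_gradient(w, x) for x in batch]  # cached per-sample gradients
    best = distance(average(grads), target_grad)
    history = [best]
    for it in range(iters):
        j = it % batch_size  # only sample j is updated in this iteration
        for _ in range(k):
            v = sample_unit_vector(dim, rng)
            cand = [xi + step * vi for xi, vi in zip(batch[j], v)]
            cand_grads = grads[:j] + [sample_gradient(w, cand)] + grads[j + 1:]
            d = distance(average(cand_grads), target_grad)
            if d < best:  # keep the move only if the gradient distance decreases
                batch[j], grads[j], best = cand, cand_grads[j], d
        history.append(best)
    return batch, history
```

Because only function evaluations of the gradient distance are used, no second derivatives of the loss are needed, which is the property the Hessian-free approach relies on.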

C.2 HFGM on Multi-Step Updates from a Sample

A challenge when applying Algorithm 1 to this setting is the change in model parameters after each local step. Therefore, model updates of sampled unit vectors cannot be computed in batch, but need to be computed separately. The model also needs to be reset to its original parameters before each computation. Algorithm 4 provides a modified version of Algorithm 1 to reconstruct an input from multi-step updates. For efficiency, if vectors are sampled at each iteration, separate versions of the model are stored in the computation graph and processed in parallel.

  Input: Parameter changes to match , gradients distance function , learning rate , transcript and length of speech features . Parameters: number of samplings , number of iterations , number of steps , local learning rate .
  Initialize .
  for  to  do
     Sample unit vectors
     for  to  do
        for  to  do
           Compute and
        end for
        if  then
           Add to
        end if
     end for
  end for
Algorithm 4 HFGM on multi-step updates from a sample
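The key difference from the single-gradient case, namely that each candidate input requires its own multi-step update starting from the original parameters, can be sketched as below. The one-layer model with loss 0.5 (theta · x - y)^2, the function names, and the squared L2 distance are assumptions made for illustration; the candidates are evaluated sequentially here rather than in parallel.

```python
def local_update(theta0, x, y, steps, local_lr):
    """Run `steps` SGD steps on one sample and return the total parameter change."""
    theta = list(theta0)  # reset: every run starts from the original parameters
    for _ in range(steps):
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        theta = [t - local_lr * err * xi for t, xi in zip(theta, x)]
    return [t - t0 for t, t0 in zip(theta, theta0)]


def candidate_distances(theta0, x, y, directions, step, target_delta, steps, local_lr):
    """Distance between the observed update and the update produced by each candidate."""
    results = []
    for v in directions:
        cand = [xi + step * vi for xi, vi in zip(x, v)]
        delta = local_update(theta0, cand, y, steps, local_lr)
        results.append(sum((a - b) ** 2 for a, b in zip(delta, target_delta)))
    return results


# Two local steps on a one-parameter toy model.
delta = local_update([0.0], [1.0], 1.0, steps=2, local_lr=0.5)
```

Since the parameters change after every local step, the update for one candidate cannot be obtained from a single gradient evaluation, so the cost per candidate grows with the number of steps.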

Appendix D Additional Visualizations

Figure 6 shows the spectrograms of some utterances reconstructed in Section 4.1, along with results when reconstructing from a gradient with DP-SGD and Dropout.

We also plot per-frame MAEs at different stages of the optimization process in Figure 7. For long utterances, reconstructions are usually of poor quality, with the frames in the middle being reconstructed worst. This suggests that errors from earlier frames may have affected reconstruction of the middle part, due to the sequential dependencies modeled by the LSTM.

Figure 6: More spectrograms from reconstructions in Section 4.1. From left to right: Original, Reconstructed, DP-SGD with , Dropout 0.1

Figure 7: Per-frame MAEs of reconstructed MFCCs at different stages of the optimization. The x-axis is the frame number and the y-axis is the per-frame MAE. Darker lines mean more iterations have been run. The most blurred red line is the initial state and the blue line is the final result. For long utterances, frames at the two ends are reconstructed better than those in the middle.
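The per-frame MAE plotted in Figure 7 can be computed as in the following sketch, which assumes that the original and reconstructed features are lists of frames with the same number of coefficients per frame. The function names and the toy values are hypothetical.

```python
def per_frame_mae(original, reconstructed):
    """Mean absolute error of each frame, given two equally shaped lists of frames."""
    if len(original) != len(reconstructed):
        raise ValueError("both feature sequences must have the same number of frames")
    return [
        sum(abs(a - b) for a, b in zip(frame_o, frame_r)) / len(frame_o)
        for frame_o, frame_r in zip(original, reconstructed)
    ]


def overall_mae(original, reconstructed):
    """Average of the per-frame MAEs (equals the global MAE for equal-width frames)."""
    frames = per_frame_mae(original, reconstructed)
    return sum(frames) / len(frames)


# Two frames with two coefficients each (toy values).
frame_errors = per_frame_mae([[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 3.0]])
total_error = overall_mae([[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 3.0]])
```

Plotting `frame_errors` against the frame index at several points during the optimization would give curves of the kind described in the caption.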