Privacy attacks for automatic speech recognition acoustic models in a federated learning framework

by Natalia Tomashenko, et al.

This paper investigates methods to effectively retrieve speaker information from personalized speaker-adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models, where a global model is learnt on the server based on the updates received from multiple clients. We propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can provide an equal error rate (EER) of 1-2%.



1 Introduction

Federated learning (FL) for automatic speech recognition (ASR) has recently become an active area of research [4, 5, 19, 10, 6, 31]. To preserve the privacy of the users’ data in the FL framework, the model is updated in a distributed fashion instead of communicating the data directly from clients to a server.

Privacy is one of the major challenges in FL [16, 20]. Sharing model updates, i.e. gradient information, instead of raw user data aims to protect users’ personal data, which are processed locally on devices. However, these updates may still reveal some sensitive information to a server or to a third party [8, 2]. According to recent research, FL has various privacy risks and may be vulnerable to different types of attacks, e.g. membership inference attacks [28] or generative adversarial network (GAN) inference attacks [29]. Techniques to enhance privacy in a FL framework mainly fall into two categories [20]: secure multiparty computation [1] and differential privacy [7]. Encryption methods [21, 25], such as fully homomorphic encryption [25] and secure multiparty computation, perform computation in the encrypted domain. These methods increase computational complexity. In a FL framework, this increase is less significant than in standard centralized training, since only the transmitted parameters need to be encrypted instead of large amounts of data; however, with an increased number of participants, computational complexity becomes a critical issue. Differential privacy methods preserve privacy by adding noise to users’ parameters [7, 30]; however, such solutions may degrade learning performance due to the uncertainty they introduce into the parameters. Alternative methods of privacy protection for speech include deletion methods [3], which are meant for ambient sound analysis, and anonymization [27], which aims to suppress personally identifiable information in the speech signal while keeping all other attributes unchanged. These privacy preservation methods can be combined and integrated in a hybrid fashion into a FL framework.

Despite the recent interest in FL for ASR and other speech-related tasks such as keyword spotting [15, 11], emotion recognition [14], and speaker verification [9], there is a lack of research on the vulnerability of ASR acoustic models (AMs) to privacy attacks in a FL framework. In this work, we take a step in this direction by analyzing the speaker information that can be retrieved from a personalized AM locally updated on the user’s data. To achieve this goal, we developed two privacy attack models that operate directly on the updated model parameters without access to the actual user’s data. Parameters of neural network (NN) personalized AMs contain a wealth of information about the speakers [18]. In this paper, we propose novel methods to efficiently and easily retrieve speaker information from the adapted AMs. The main idea of the proposed methods is to use an external Indicator dataset to analyze the footprint of AMs on this data. Another important contribution of this work is an understanding of how speaker information is distributed in the adapted NN AMs.

This paper is structured as follows. Section 2 briefly introduces the considered FL framework for AM training. Section 3 describes the privacy preservation scenario and proposes two attack models. Experimental evaluation is presented in Section 4. We conclude in Section 5.

2 Federated learning for ASR acoustic models

We consider a classical FL scenario where a global NN AM is trained on a server from the data stored locally on multiple remote devices [16]. The training of the global model is performed under the constraint that the training speech data are stored and processed locally on the user devices (clients), while only model updates are transmitted to the server from each client. The global model is learnt on the server based on the updates received from multiple clients. FL in a distributed network of clients is illustrated in Figure 1. First, an initial global speech recognition AM $M_g$ is distributed to the devices of a group of $N$ users (speakers). Then, the initial global model $M_g$ is run on the device of every user $s_i$ ($i \in 1..N$) and updated locally on the private user data. The updated models $M_{s_i}$ are then transmitted to the server, where they are aggregated to obtain a new global model $M_g^*$. Typically, the personalized updated models are aggregated using federated averaging and its variations [17, 1]. Then, the updated global model $M_g^*$ is shared with the clients. The process restarts and loops until convergence or for a fixed number of rounds. The utility and training efficiency of FL AMs have been successfully studied in recent works [4, 5, 19, 10, 6, 31], and these topics are beyond the scope of the current paper. Instead, we focus on the privacy aspect of this framework.

Figure 1: Federated learning in a distributed network of clients: 1) Download of the global model $M_g$ by clients. 2) Speaker adaptation of $M_g$ on the local devices using user private data. 3) Collection and aggregation of multiple personalized models $M_{s_1}, \ldots, M_{s_N}$ on the server. 4) Sharing the resulting model $M_g^*$ with the clients.
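The round described above can be illustrated with a minimal sketch of weighted federated averaging. It is only a toy illustration under stated assumptions: the parameter container `Params`, the one-step `local_update` standing in for on-device speaker adaptation, and the client data sizes are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def local_update(global_params: Params, gradient: Params, lr: float = 0.1) -> Params:
    """Toy stand-in for on-device adaptation: one gradient step from the global model."""
    return {
        name: [w - lr * g for w, g in zip(weights, gradient[name])]
        for name, weights in global_params.items()
    }


def federated_average(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average the client models, weighting each by its amount of local data."""
    total = float(sum(client_sizes))
    averaged: Params = {}
    for name in client_params[0]:
        dim = len(client_params[0][name])
        averaged[name] = [
            sum(p[name][k] * n for p, n in zip(client_params, client_sizes)) / total
            for k in range(dim)
        ]
    return averaged


# One round: distribute, adapt locally, aggregate on the server.
global_model = {"layer1": [0.0, 1.0], "layer2": [2.0]}
client_grads = [
    {"layer1": [1.0, 0.0], "layer2": [0.0]},
    {"layer1": [0.0, 2.0], "layer2": [4.0]},
]
personalized = [local_update(global_model, g) for g in client_grads]
new_global = federated_average(personalized, client_sizes=[100, 300])
```

The list `personalized` plays the role of the models sent to the server in step 3 of Figure 1, which are exactly the objects an attacker is assumed to observe in the next section.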

3 Attack models

In this section, we describe the privacy preservation scenario and present two attack models.

3.1 Privacy preservation scenario

Privacy preservation is formulated as a game between users who share some data and attackers who access this data or data derived from it and aim to infer information about the users [27]. To protect the user data in FL, no speech data are exchanged between the server and the clients: only model updates are transmitted between the clients and the server (or between some clients). Attackers aim to attack users using information owned by the server, and they can get access to some updated personalized models.

In this work, we assume that an attacker has access to the following data:

  • An initial global NN AM $M_g$;

  • A personalized model $M_s$ of the target speaker $s$ who is enrolled in the FL system. The corresponding personalized model was obtained from the global model $M_g$ by fine-tuning using speaker data. We consider this model as enrollment data for an attacker.

  • Other personalized models of non-target and target speakers: $M_{s_1}, \ldots, M_{s_N}$. We will refer to these models as test trial data.

The attacker’s objective is to conduct an automatic speaker verification (ASV) task by using the enrollment data in the form of the model $M_s$ and the test trial data in the form of the models $M_{s_1}, \ldots, M_{s_N}$.

3.2 Attack models

The proposed approaches are motivated by the hypothesis that we can capture information about the identity of a speaker $s$ from the corresponding speaker-adapted model $M_s$ and the global model $M_g$ by comparing the outputs of these two neural AMs taken from hidden layers on some speech data. We will refer to this speech data as Indicator data. Note that the Indicator data is not related to any test or AM training data and can be chosen arbitrarily from any speakers.

3.2.1 Attack model A1

The ASV task with the proposed attack model is performed in several steps as illustrated in Figure 2.

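Let us denote the set of utterances in the Indicator dataset $\mathbb{I}$ as $u_1, \ldots, u_J$; the sequence of vectors in utterance $u_j$ as $u_j^{(1)}, \ldots, u_j^{(T_j)}$; the set of personalized models as $M_{s_1}, \ldots, M_{s_N}$; and the identifier of a hidden layer in the global or personalized AM as $h$.

  1. For every personalized model $M_{s_i}$ and every utterance $u_j \in \mathbb{I}$, we compute activation values from the layer $h$ for the model pair $M_{s_i}^h(u_j)$ and $M_g^h(u_j)$, where $M_g$ is the global model, and per-frame differences between corresponding outputs:

    $$\Delta_{s_i}^h(u_j^{(t)}) = M_{s_i}^h(u_j^{(t)}) - M_g^h(u_j^{(t)}), \qquad (1)$$

    where $t \in 1..T_j$, $j \in 1..J$.

  2. For each personalized model, we compute mean and standard deviation vectors for $\Delta_{s_i}^h(u_j^{(t)})$ over all speech frames in the Indicator dataset $\mathbb{I}$:

    $$\mu_{s_i}^h = \frac{\sum_{j=1}^{J} \sum_{t=1}^{T_j} \Delta_{s_i}^h(u_j^{(t)})}{\sum_{j=1}^{J} T_j}, \qquad (2)$$

    $$\sigma_{s_i}^h = \left( \frac{\sum_{j=1}^{J} \sum_{t=1}^{T_j} \left( \Delta_{s_i}^h(u_j^{(t)}) - \mu_{s_i}^h \right)^2}{\sum_{j=1}^{J} T_j} \right)^{1/2}, \qquad (3)$$

    where the square and the square root are taken element-wise.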

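  3. For a pair of personalized models $M_{s_i}$ and $M_{s_k}$, we compute a similarity score $\rho$ at hidden level $h$ on the Indicator dataset, based on the $L_2$-normalised Euclidean distance between the corresponding vector pairs for means ($\mu_{s_i}^h$, $\mu_{s_k}^h$) and standard deviations ($\sigma_{s_i}^h$, $\sigma_{s_k}^h$):

    $$\rho(M_{s_i}^h, M_{s_k}^h) = \alpha_\mu \frac{\| \mu_{s_i}^h - \mu_{s_k}^h \|_2}{\| \mu_{s_i}^h \|_2 \, \| \mu_{s_k}^h \|_2} + \alpha_\sigma \frac{\| \sigma_{s_i}^h - \sigma_{s_k}^h \|_2}{\| \sigma_{s_i}^h \|_2 \, \| \sigma_{s_k}^h \|_2}, \qquad (4)$$

    where $\alpha_\mu$, $\alpha_\sigma$ are fixed parameters in all experiments.

  4. Given similarity scores for all model pairs, we can complete a speaker verification task based on these scores.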

Figure 2: Statistic computation for the attack model A1.
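The four steps of attack model A1 can be sketched as follows. This is a minimal illustration and not the authors' implementation. Several things here are assumptions made only for the example: a hidden layer is modelled as a plain function from an input frame to an activation vector, the toy "models" are simple scale-and-shift maps, both weights default to 1.0, and the normalised distance is read as the Euclidean distance divided by the product of the two vector norms. The score behaves like a distance, so a lower value suggests the same speaker.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
LayerFn = Callable[[Vector], Vector]   # frame -> activations of hidden layer h
Stats = Tuple[Vector, Vector]          # (mean vector, std vector)


def footprint_stats(personalized: LayerFn, global_model: LayerFn,
                    indicator: Sequence[Sequence[Vector]]) -> Stats:
    """Steps 1-2: per-frame output differences, then mean and std over all frames."""
    deltas = [
        [p - g for p, g in zip(personalized(frame), global_model(frame))]
        for utterance in indicator
        for frame in utterance
    ]
    n, dim = len(deltas), len(deltas[0])
    mean = [sum(d[k] for d in deltas) / n for k in range(dim)]
    std = [math.sqrt(sum((d[k] - mean[k]) ** 2 for d in deltas) / n) for k in range(dim)]
    return mean, std


def l2(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def normalised_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance divided by the product of the norms (assumed form)."""
    return l2([x - y for x, y in zip(a, b)]) / (l2(a) * l2(b))


def similarity(stats_i: Stats, stats_k: Stats,
               alpha_mu: float = 1.0, alpha_sigma: float = 1.0) -> float:
    """Step 3: weighted sum of the distances between means and between stds."""
    (mu_i, sd_i), (mu_k, sd_k) = stats_i, stats_k
    return (alpha_mu * normalised_distance(mu_i, mu_k)
            + alpha_sigma * normalised_distance(sd_i, sd_k))


def make_layer(scale: float, shift: float) -> LayerFn:
    """Hypothetical toy 'hidden layer' used only to exercise the code."""
    return lambda frame: [scale * x + shift for x in frame]


# Indicator data: two utterances with 2-dimensional frames.
indicator = [[[0.0, 1.0], [1.0, 2.0]], [[2.0, 3.0]]]
global_layer = make_layer(1.0, 0.0)
speaker_a_model_1 = footprint_stats(make_layer(1.5, 0.20), global_layer, indicator)
speaker_a_model_2 = footprint_stats(make_layer(1.5, 0.25), global_layer, indicator)
speaker_b_model = footprint_stats(make_layer(0.5, -0.30), global_layer, indicator)

same_speaker_score = similarity(speaker_a_model_1, speaker_a_model_2)
diff_speaker_score = similarity(speaker_a_model_1, speaker_b_model)
```

In this toy setup the two models of "speaker A" yield a smaller score than the pair from different speakers, which is the property step 4 relies on. Setting `alpha_sigma=0.0` or `alpha_mu=0.0` restricts the score to means only or standard deviations only.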

3.2.2 Attack model A2

For the second attack model, we train a NN model as shown in Figure 3. This NN model uses personalized and global models and the speech Indicator dataset for training. It is trained to predict a speaker identity given the corresponding personalized model. Once the model is trained, we use it at evaluation time to extract speaker embeddings, similarly to x-vectors, and apply probabilistic linear discriminant analysis (PLDA) [26, 13].

As shown in Figure 3, the model consists of two parts (frozen and trained). The outputs of the frozen part are sequences of difference vectors $\Delta_{s_i}^h(u_j)$ computed per utterance $u_j$ of the Indicator data as defined in Formula (1). For every personalized model $M_{s_i}$, we compute $\Delta_{s_i}^h(u_j)$ for all the utterances $u_j$ of the Indicator corpus; then $\Delta_{s_i}^h(u_j)$ is used as input to the second (trained) part of the NN, which comprises several time delay neural network (TDNN) layers [22] and one statistical pooling layer.

Figure 3: Training a speaker embedding extractor for the attack model A2.
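The role of the pooling layer in the trained part can be illustrated with a short sketch. It assumes the usual x-vector-style definition, in which pooling concatenates the per-dimension mean and standard deviation over time. The function name and toy inputs are hypothetical, and the TDNN layers themselves are omitted.

```python
import math
from typing import List


def statistics_pooling(frames: List[List[float]]) -> List[float]:
    """Concatenate the per-dimension mean and standard deviation over time."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    std = [math.sqrt(sum((f[k] - mean[k]) ** 2 for f in frames) / n) for k in range(dim)]
    return mean + std


# Sequences of different lengths map to vectors of the same size (2 * dim).
short_utterance = statistics_pooling([[1.0, 2.0], [3.0, 2.0]])
long_utterance = statistics_pooling([[1.0, 2.0], [3.0, 2.0], [1.0, 2.0], [3.0, 2.0]])
```

Because the pooled vector has a fixed size regardless of utterance length, the layers after pooling can operate on one vector per utterance, which is what makes embedding extraction possible.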

4 Experiments

4.1 Data

The experiments were conducted on the speaker adaptation partition of the TED-LIUM 3 corpus [12]. This publicly available dataset contains TED talks that amount to 452 hours of English speech data from about 2K speakers, sampled at 16 kHz. Similarly to [19], we selected three datasets from the TED-LIUM 3 training dataset: Train-G, Part-1, and Part-2, with disjoint speaker subsets as shown in Table 1. The Indicator dataset was used to train an attack model. It comprises 320 utterances selected from all 32 speakers of the test and development datasets of the TED-LIUM 3 corpus. The speakers in the Indicator dataset are disjoint from the speakers in Train-G, Part-1, and Part-2. For each speaker in the Indicator dataset we select 10 utterances. The size of the Indicator dataset is 32 minutes. The Train-G dataset was used to train an initial global AM $M_g$. Part-1 and Part-2 were used to obtain two sets of personalized models. (Footnote 1: Data partitions and scripts will be available online.)

                                Train-G   Part-1   Part-2   Indicator
Duration, hours                 200       86       73       0.5
Number of speakers              880       736      634      32
Number of personalized models   -         1300     1079     -
Table 1: Dataset statistics

4.2 ASR acoustic models

The ASR AMs have a TDNN model architecture [22] and were trained using the Kaldi speech recognition toolkit [23]. 40-dimensional Mel-frequency cepstral coefficients (MFCCs) without cepstral truncation, appended with 100-dimensional i-vectors, were used as the input to the NNs. Each model has thirteen 512-dimensional hidden layers followed by a softmax layer where 3664 triphone states were used as targets (footnote 2: following the notation from [22], the model configuration can be described as {-1,0,1} × 6 layers; {-3,0,3} × 7 layers). The initial global model $M_g$ was trained using the lattice-free maximum mutual information (LF-MMI) criterion with a 3-fold reduced frame rate as in [24]. Two types of speech data augmentation were applied to the training and adaptation data: speed perturbation (with factors 0.9, 1.0, 1.1) and volume perturbation, as in [22]. Each model has about 13.8M parameters. The initial global model $M_g$ was trained on Train-G. Personalized models $M_{s_i}$ were obtained by fine-tuning all the parameters of $M_g$ on the speakers’ data from Part-1 and Part-2 as described in [19]. For all personalized speaker models, we use approximately the same amount of speech data to perform fine-tuning (speaker adaptation): about 4 minutes per model. For most of the speakers (564 in Part-1, 463 in Part-2) we obtained two different personalized models per speaker on disjoint adaptation subsets; for the remaining speakers we have adaptation data for only one model.
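As a small worked example, the splicing offsets given in footnote 2 can be used to estimate how many input frames influence one output of the top hidden layer. This sketch assumes that the thirteen layers apply the listed offsets one after another and that offsets are counted in input frames. It ignores any frame subsampling, so it is an approximation and not a description of the exact Kaldi configuration.

```python
from typing import List, Tuple


def temporal_context(layer_offsets: List[Tuple[int, ...]]) -> Tuple[int, int]:
    """Sum the largest left and right offsets over all layers."""
    left = sum(-min(offsets) for offsets in layer_offsets)
    right = sum(max(offsets) for offsets in layer_offsets)
    return left, right


# Footnote 2: {-1,0,1} for 6 layers, then {-3,0,3} for 7 layers.
layers = [(-1, 0, 1)] * 6 + [(-3, 0, 3)] * 7
left_context, right_context = temporal_context(layers)
window_frames = left_context + 1 + right_context
```

Under these assumptions each side contributes 6 × 1 + 7 × 3 = 27 frames, giving a window of 55 frames around the current frame.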

4.3 Attack models

We investigate two approaches for attack models: A1 – a simple approach based on the comparative statistical analysis of the NN outputs and the associated similarity score between personalized models, and A2 – a NN-based approach. For the target test trials, we use comparisons between different personalized models of the same speaker (463 in Part-2 and 1027 in Part-1+Part-2), and for the non-target trials we randomly select 10K pairs of models from different speakers in the corresponding dataset.
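The trial construction described above can be sketched as follows. The function and model identifiers are hypothetical, and de-duplicating the randomly drawn non-target pairs is an assumption, since the text does not say whether repeated pairs were allowed.

```python
import itertools
import random
from typing import Dict, List, Tuple

Trial = Tuple[str, str]


def build_trials(models_by_speaker: Dict[str, List[str]], n_nontarget: int,
                 seed: int = 0) -> Tuple[List[Trial], List[Trial]]:
    """Target trials: model pairs of one speaker. Non-target: random cross-speaker pairs."""
    rng = random.Random(seed)
    target = [
        pair
        for models in models_by_speaker.values()
        for pair in itertools.combinations(models, 2)
    ]
    labelled = [(spk, m) for spk, models in models_by_speaker.items() for m in models]
    n_models = len(labelled)
    n_cross_pairs = n_models * (n_models - 1) // 2 - len(target)
    if n_nontarget > n_cross_pairs:
        raise ValueError("not enough cross-speaker pairs for the requested trials")
    nontarget = set()
    while len(nontarget) < n_nontarget:
        (spk_a, model_a), (spk_b, model_b) = rng.sample(labelled, 2)
        if spk_a != spk_b:
            nontarget.add(tuple(sorted((model_a, model_b))))
    return target, sorted(nontarget)


# Toy pool: two speakers with two personalized models each, one speaker with one.
pool = {
    "spk1": ["spk1_a", "spk1_b"],
    "spk2": ["spk2_a", "spk2_b"],
    "spk3": ["spk3_a"],
}
target_trials, nontarget_trials = build_trials(pool, n_nontarget=4)
```

Only speakers with two personalized models contribute a target trial, which is why the number of target trials equals the number of such speakers.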

4.3.1 Attack model A1

The first attack model was applied as described in Section 3.2.1. The weights of the mean and standard-deviation terms in Formula (4) were kept at the same fixed values in all experiments. This model was evaluated on two datasets of personalized models corresponding to Part-2 and the combined Part-1+Part-2 datasets. The Indicator dataset is the same in all experiments.

4.3.2 Attack model A2

For training the attack model A2, we use 1300 personalized speaker models corresponding to 736 unique speakers from Part-1. When we applied the frozen part of the architecture shown in Figure 3 to the 32-minute Indicator dataset for each speaker model in Part-1, we obtained training data amounting to about 693 hours (32 minutes × 1300 models). The trained part of the NN model, illustrated in Figure 3, has a topology similar to a conventional x-vector extractor [26]. However, unlike the standard NN x-vector extractor, which is trained to predict speaker IDs from an input speech segment, our proposed model learns to predict a speaker identity from a part of a speaker’s personalized model. We trained two attack models corresponding to two values of the parameter $h$ – the hidden layer in the ASR neural AMs at which we compute the activations. The values were chosen based on the results for the attack model A1. The output dimension of the frozen part is 512. The frozen part is followed by the trained part, which consists of seven hidden TDNN layers and one statistical pooling layer introduced after the fifth TDNN layer. The output is a softmax layer with targets corresponding to the speakers in the pool of speaker personalized models (the number of unique speakers in Part-1).
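The amount of training data quoted above follows from passing the whole Indicator dataset through every Part-1 personalized model. A quick check of the arithmetic, using only figures stated in Sections 4.1 and 4.3.2 (the variable names are illustrative):

```python
indicator_minutes = 32           # duration of the Indicator dataset
indicator_utterances = 320       # 32 speakers x 10 utterances
num_personalized_models = 1300   # Part-1 models used to train A2

total_hours = indicator_minutes * num_personalized_models / 60
num_training_sequences = indicator_utterances * num_personalized_models

print(f"{total_hours:.0f} hours")  # prints "693 hours"
```

Each (model, utterance) pair yields one input sequence for the trained part, so under this reading there are 416,000 training sequences in total.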

4.4 Results

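The attack models were evaluated in terms of equal error rate (EER). Denoting by $P_\text{fa}(\theta)$ and $P_\text{miss}(\theta)$ the false alarm and miss rates at threshold $\theta$, the EER corresponds to the threshold $\theta_\text{EER}$ at which the two detection error rates are equal, i.e., $\text{EER} = P_\text{fa}(\theta_\text{EER}) = P_\text{miss}(\theta_\text{EER})$.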
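A minimal sketch of how this operating point can be located from two lists of trial scores is given below. It assumes that a higher score means "same speaker", so a distance-like score such as the one used by A1 would have to be negated first. With a finite number of trials the two error rates rarely coincide exactly, so the sketch returns their average at the threshold where they are closest. The function name and the toy scores are illustrative and do not come from the paper.

```python
from typing import List, Tuple


def equal_error_rate(target: List[float], nontarget: List[float]) -> Tuple[float, float]:
    """Return (EER, threshold); a trial is accepted when its score >= threshold."""
    best_gap, best_eer, best_theta = None, None, None
    for theta in sorted(set(target) | set(nontarget)):
        p_miss = sum(s < theta for s in target) / len(target)
        p_fa = sum(s >= theta for s in nontarget) / len(nontarget)
        gap = abs(p_fa - p_miss)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_theta = gap, (p_fa + p_miss) / 2, theta
    return best_eer, best_theta


target_scores = [0.9, 0.8, 0.7, 0.4]      # same-speaker trials
nontarget_scores = [0.6, 0.3, 0.2, 0.1]   # different-speaker trials
eer, threshold = equal_error_rate(target_scores, nontarget_scores)
```

On these toy scores one target trial is rejected and one non-target trial is accepted at the selected threshold, so both error rates equal 25%.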

Results for the attack model A1 are shown in Figure 4 for Part-2 and the combined Part-1 and Part-2 datasets. Speaker information can be captured for all values of the hidden layer $h$ with varying success: on Part-2, the EER ranges from 0.86% (for the first hidden layer) up to 20.51% (for the top hidden layer). For Part-1+Part-2 we observe similar results.

Figure 4: EER, % for the attack model A1 depending on the hidden layer $h$ (in the global model $M_g$ and the personalized models $M_{s_i}$) which was used to compute outputs, evaluated on Part-2 and on the combined Part-1+Part-2 dataset.

Figure 5: EER, % for the attack model A1 depending on the hidden layer $h$, evaluated on the Part-2 dataset. $\mu+\sigma$ – both means and standard deviations were used to compute the similarity score $\rho$; $\mu$ – only means; and $\sigma$ – only standard deviations were used.

To analyze the impact of each component in Formula (4) on the ASV performance, we separately compute the similarity score using either only means ($\alpha_\sigma = 0$, i.e. zero weight on the standard-deviation term) or only standard deviations ($\alpha_\mu = 0$, i.e. zero weight on the mean term). Results on the Part-2 dataset are shown in Figure 5. Black bars correspond to $\alpha_\sigma = 0$, when only means were used to compute similarity scores between personalized models. Blue bars represent results for $\alpha_\mu = 0$, when only standard deviations were used to compute the score. Orange bars correspond to the combined usage of means and standard deviations as in Figure 4 (both weights non-zero). The impact of each component in the sum changes for different hidden layers. When we use only standard deviations, we observe the lowest EER on the first layer. When using only means, the first layer is, on the contrary, one of the least informative for speaker verification. For all other layers, the combination of means and standard deviations provided superior results over the cases when only one of these components was used. This surprising result for the first hidden layer could possibly be explained by the fact that the personalized models incorporate i-vectors in their inputs, so speaker information can be easily learnt at this level of the NN. We plan to investigate this phenomenon in detail in our future research.

We choose two values, $h = 1$ and $h = 5$, which demonstrate promising results for the model A1, and use the corresponding outputs to train two attack models with the configuration A2. The comparative results for the two attack models are presented in Table 2. For $h = 5$, the second attack model provides a significant improvement in performance over the first one and reduces the EER from 7% down to 2%. For $h = 1$, we could not obtain any improvement by training a NN-based attack model: the results for A2 in this case are worse than for the simple approach A1. One explanation for this phenomenon could be the following. The first layers of the AMs provide highly informative features for speaker classification; however, training the proposed NN model on these features results in overfitting, because the training criterion of the NN is speaker accuracy rather than the target EER metric, and the number of targets is relatively small. Hence, the NN overfits to classify the speakers seen in the training dataset.

Attack model   h=1     h=5
A1             0.86    7.11
A2             12.31   1.94
Table 2: EER, % evaluated on Part-2; h is the index of the hidden layer

5 Conclusions

In this work, we focused on the privacy protection problem for ASR AMs trained in a FL framework. We explored to what extent ASR AMs are vulnerable to privacy attacks. We developed two attack models that aim to infer speaker identity from the locally updated personalized models without access to any speech data of the target speakers. One attack model is based on the proposed similarity score between personalized AMs computed on an external Indicator dataset, and the other is a NN model. We demonstrated on the TED-LIUM 3 corpus that both attack models are very effective and can provide an EER of about 1% for the simple attack model A1 and 2% for the NN attack model A2. Another important contribution of this work is the finding that the first layer of personalized AMs contains a large amount of speaker information, which is mainly contained in the standard deviation values computed on the Indicator data. This interesting property of NN adapted AMs also opens new perspectives for ASV, and in future work we plan to use it for developing an efficient ASV system.


  • [1] K. Bonawitz, V. Ivanov, B. Kreuter, et al. (2016) Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482.
  • [2] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song (2019) The secret sharer: evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium, pp. 267–284.
  • [3] A. Cohen-Hadria, M. Cartwright, B. McFee, and J. P. Bello (2019) Voice anonymization in urban sound recordings. In IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6.
  • [4] X. Cui, S. Lu, and B. Kingsbury (2021) Federated acoustic modeling for automatic speech recognition. In ICASSP, pp. 6748–6752.
  • [5] D. Dimitriadis, K. Kumatani, R. Gmyr, et al. (2020) A federated approach in training acoustic models. In Interspeech, pp. 981–985.
  • [6] D. Dimitriadis, K. Kumatani, R. Gmyr, et al. (2020) Federated transfer learning with dynamic gradient aggregation. arXiv preprint arXiv:2008.02452.
  • [7] C. Dwork (2006) Differential privacy. In International Colloquium on Automata, Languages, and Programming.
  • [8] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller (2020) Inverting gradients–how easy is it to break privacy in federated learning? arXiv preprint arXiv:2003.14053.
  • [9] F. Granqvist, M. Seigel, R. van Dalen, Á. Cahill, S. Shum, and M. Paulik (2020) Improving on-device speaker verification using federated learning with privacy. arXiv preprint arXiv:2008.02651.
  • [10] D. Guliani, F. Beaufays, and G. Motta (2021) Training speech recognition models with federated learning: a quality/cost framework. In ICASSP, pp. 3080–3084.
  • [11] A. Hard, K. Partridge, C. Nguyen, N. Subrahmanya, A. Shah, P. Zhu, I. L. Moreno, and R. Mathews (2020) Training keyword spotting models on non-iid data with federated learning. arXiv preprint arXiv:2005.10406.
  • [12] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève (2018) TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer, pp. 198–208.
  • [13] S. Ioffe (2006) Probabilistic linear discriminant analysis. In European Conference on Computer Vision, pp. 531–542.
  • [14] S. Latif, S. Khalifa, R. Rana, and R. Jurdak (2020) Federated learning for speech emotion recognition applications. In 2020 19th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp. 341–342.
  • [15] D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, and J. Dureau (2019) Federated learning for keyword spotting. In ICASSP, pp. 6341–6345.
  • [16] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2020) Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine 37 (3), pp. 50–60.
  • [17] B. McMahan, E. Moore, D. Ramage, et al. (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282.
  • [18] S. Mdhaffar et al. Retrieving speaker information from personalized acoustic models for speech recognition. Submitted to ICASSP.
  • [19] S. Mdhaffar, M. Tommasi, and Y. Estève (2021) Study on acoustic model personalization in a context of collaborative learning constrained by privacy preservation. In Speech and Computer, pp. 426–436. ISBN 978-3-030-87802-3.
  • [20] V. Mothukuri, R. M. Parizi, S. Pouriyeh, Y. Huang, A. Dehghantanha, and G. Srivastava (2021) A survey on security and privacy of federated learning. Future Generation Computer Systems 115, pp. 619–640.
  • [21] M. A. Pathak, B. Raj, S. D. Rane, and P. Smaragdis (2013) Privacy-preserving speech processing: cryptographic and string-matching frameworks show promise. IEEE Signal Processing Magazine 30 (2), pp. 62–74.
  • [22] V. Peddinti, D. Povey, and S. Khudanpur (2015) A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association.
  • [23] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, et al. (2011) The Kaldi speech recognition toolkit. In ASRU.
  • [24] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, et al. (2016) Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Interspeech, pp. 2751–2755.
  • [25] P. Smaragdis and M. Shashanka (2007) A framework for secure speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 15 (4), pp. 1404–1413.
  • [26] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur (2018) X-vectors: robust DNN embeddings for speaker recognition. In ICASSP, pp. 5329–5333.
  • [27] N. Tomashenko, B. M. L. Srivastava, X. Wang, E. Vincent, A. Nautsch, J. Yamagishi, N. Evans, et al. (2020) Introducing the VoicePrivacy initiative. In Interspeech, pp. 1693–1697.
  • [28] S. Truex, L. Liu, M. E. Gursoy, L. Yu, and W. Wei (2019) Demystifying membership inference attacks in machine learning as a service. IEEE Transactions on Services Computing.
  • [29] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi (2019) Beyond inferring class representatives: user-level privacy leakage from federated learning. In IEEE INFOCOM, pp. 2512–2520.
  • [30] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou (2018) Differentially private generative adversarial network. arXiv preprint arXiv:1802.06739.
  • [31] W. Yu, J. Freiwald, S. Tewes, F. Huennemeyer, and D. Kolossa (2021) Federated learning in ASR: not as easy as you think. arXiv preprint arXiv:2109.15108.