Introducing the VoicePrivacy Initiative

by   Natalia Tomashenko, et al.

The VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology by gathering a new community to define the tasks of interest and the evaluation methodology, and benchmarking solutions through a series of challenges. In this paper, we formulate the voice anonymization task selected for the VoicePrivacy 2020 Challenge and describe the datasets used for system development and evaluation. We also present the attack models and the associated objective and subjective evaluation metrics. We introduce two anonymization baselines and report objective evaluation results.


1 Introduction

Recent years have seen mounting calls for the preservation of privacy when treating or storing personal data. This is not least the result of the European General Data Protection Regulation (GDPR). While there is no legal definition of privacy [nautsch2019gdpr], speech data encapsulates a wealth of personal information that can be revealed by listening or by automated systems [Nautsch-PreservingPrivacySpeech-CSL-2019]. This includes, e.g., age, gender, ethnic origin, geographical background, health or emotional state, political orientations, and religious beliefs, among others [COMPRISE_D5.1, p. 62]. In addition, speaker recognition systems can reveal the speaker’s identity. It is thus no surprise that efforts to develop privacy preservation solutions for speech technology are starting to emerge. The VoicePrivacy initiative aims to gather a new community to define the tasks of interest and the evaluation methodology, and to benchmark these solutions through a series of challenges.

Current methods fall into four categories: deletion, encryption, distributed learning, and anonymization. Deletion methods [cohen2019voice, gontier2020privacy] are meant for ambient sound analysis. They delete or obfuscate any overlapping speech to the point where no information about it can be recovered. Encryption methods [pathak2013privacy, smaragdis2007framework], such as fully homomorphic encryption [zhang2019encrypted] and secure multiparty computation [brasser2018voiceguard], support computation upon data in the encrypted domain. They incur significant increases in computational complexity, which require special hardware. Decentralized or federated learning methods aim to learn models from distributed data without accessing it directly [leroy2019federated]. However, the derived data used for learning (e.g., model gradients) may still leak information about the original data [geiping2020inverting].

Anonymization refers to the goal of suppressing personally identifiable attributes of the speech signal, leaving all other attributes intact. [Footnote 1: In the legal community, the term “anonymization” means that this goal has been achieved. Here, it refers to the task to be addressed, even when the method being evaluated has failed. We expect the VoicePrivacy initiative to lead to the definition of new, unambiguous terms.] Past and recent attempts have focused on noise addition [hashimoto2016privacy], speech transformation [qian2017voicemask], voice conversion [jin2009speaker, pobar2014online, bahmaninezhad2018convolutional], speech synthesis [fang2019speaker, han2020voice], or adversarial learning [srivastava2019privacy]. In contrast to the above categories of methods, anonymization appears to be more flexible since it can selectively suppress or retain certain attributes and it can easily be integrated within existing systems. Despite the appeal of anonymization and the urgency to address privacy concerns, a formal definition of anonymization and attacks against it is missing. Furthermore, the level of anonymization offered by existing solutions is unclear and not meaningful because there are no common datasets, protocols and metrics.

For these reasons, the VoicePrivacy 2020 Challenge focuses on the task of speech anonymization. This paper is intended as a general reference about the Challenge for researchers, engineers and privacy professionals. Details for participants are provided in the evaluation plan [tomashenkovoiceprivacy] and on the challenge website.

The paper is structured as follows. The anonymization task and the attack models, the datasets, and the metrics are described in Sections 2, 3, and 4, respectively. The two baseline systems and the corresponding objective evaluation results are presented in Section 5. We conclude in Section 6.

2 Anonymization task and attack models

Privacy preservation is formulated as a game between users who publish some data and attackers who access this data or data derived from it and wish to infer information about the users [qian2018towards, srivastava2019evaluating]. To protect their privacy, the users publish data that contain as little personal information as possible while allowing one or more downstream goals to be achieved. To infer personal information, the attackers may use additional prior knowledge.

Focusing on speech data, a given privacy preservation scenario is specified by: (i) the nature of the data: waveform, features, etc., (ii) the information seen as personal: speaker identity, traits, spoken contents, etc., (iii) the downstream goal(s): human communication, automated processing, model training, etc., (iv)  the data accessed by the attackers: one or more utterances, derived data or model, etc., (v) the attackers’ prior knowledge: previously published data, privacy preservation method applied, etc. Different specifications lead to different privacy preservation methods from the users’ point of view and different attacks from the attackers’ point of view.

2.1 Privacy preservation scenario

VoicePrivacy 2020 considers the following scenario, where the terms “user” and “speaker” are used interchangeably. Speakers want to hide their identity while still allowing all other downstream goals to be achieved. Attackers have access to one or more utterances and want to identify the speakers.

2.2 Anonymization task

To hide his/her identity, each speaker passes his/her utterances through an anonymization system. The resulting anonymized utterances are referred to as trial data. They sound as if they had been uttered by another speaker called pseudo-speaker, which may be an artificial voice not corresponding to any real speaker.

The task of challenge participants is to design this anonymization system. In order to allow all downstream goals to be achieved, this system should: (a) output a speech waveform, (b) hide speaker identity as much as possible, (c) distort other speech characteristics as little as possible, (d) ensure that all trial utterances from a given speaker appear to be uttered by the same pseudo-speaker, while trial utterances from different speakers appear to be uttered by different pseudo-speakers. [Footnote 3: This is akin to “pseudonymization”, which replaces each user’s identifiers by a unique key. We do not use this term here, since it often refers to the distinct case when the identifiers are tabular data and the data controller stores the correspondence table linking users and keys.]

Requirement (c) is assessed via utility metrics: the automatic speech recognition (ASR) decoding error rate obtained with a model trained on original, i.e., unprocessed, data, and subjective speech intelligibility and naturalness (see Section 4). Requirement (d) and additional downstream goals including ASR training will be assessed in a post-evaluation phase (see Section 6).

2.3 Attack models

The attackers have access to: (a) one or more anonymized trial utterances, (b) possibly, original or anonymized enrollment utterances for each speaker. They do not have access to the anonymization system applied by the user. The protection of personal information is assessed via privacy metrics, including objective speaker verifiability and subjective speaker verifiability and linkability. These metrics assume different attack models.

The objective speaker verifiability metrics assume that the attackers have access to a single anonymized trial utterance and several enrollment utterances. Two sets of metrics are used for original vs. anonymized enrollment data (see Section 4.1). In the latter case, we assume that the trial and enrollment utterances of a given speaker have been anonymized using the same system, but the corresponding pseudo-speakers are different.

The subjective speaker verifiability metric (Section 4.2) assumes that the attackers have access to a single anonymized trial utterance and a single original enrollment utterance. Finally, the subjective speaker linkability metric (Section 4.2) assumes that the attackers have access to several anonymized trial utterances.

3 Datasets

Several publicly available corpora are used for the training, development and evaluation of speaker anonymization systems.

3.1 Training set

The training set comprises the 2,800 h VoxCeleb-1,2 speaker verification corpus [nagrani2017voxceleb, chung2018voxceleb2] and 600 h subsets of the LibriSpeech [panayotov2015librispeech] and LibriTTS [zen2019libritts] corpora, which were initially designed for ASR and speech synthesis, respectively. The selected subsets are detailed in Table 1 (top).

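| Set | Corpus | Subset | Female | Male | Total | #Utter. |
|---|---|---|---|---|---|---|
| Training | VoxCeleb-1,2 | | 2,912 | 4,451 | 7,363 | 1,281,762 |
| Training | LibriSpeech | train-clean-100 | 125 | 126 | 251 | 28,539 |
| Training | LibriSpeech | train-other-500 | 564 | 602 | 1,166 | 148,688 |
| Training | LibriTTS | train-clean-100 | 123 | 124 | 247 | 33,236 |
| Training | LibriTTS | train-other-500 | 560 | 600 | 1,160 | 205,044 |
| Development | LibriSpeech dev-clean | Enrollment | 15 | 14 | 29 | 343 |
| Development | LibriSpeech dev-clean | Trial | 20 | 20 | 40 | 1,978 |
| Development | VCTK-dev | Enrollment | 15 | 15 | 30 | 600 |
| Development | VCTK-dev | Trial (common) | 15 | 15 | 30 | 695 |
| Development | VCTK-dev | Trial (different) | 15 | 15 | 30 | 10,677 |
| Evaluation | LibriSpeech test-clean | Enrollment | 16 | 13 | 29 | 438 |
| Evaluation | LibriSpeech test-clean | Trial | 20 | 20 | 40 | 1,496 |
| Evaluation | VCTK-test | Enrollment | 15 | 15 | 30 | 600 |
| Evaluation | VCTK-test | Trial (common) | 15 | 15 | 30 | 70 |
| Evaluation | VCTK-test | Trial (different) | 15 | 15 | 30 | 10,748 |

Table 1: Number of speakers (female, male, total) and utterances in the VoicePrivacy 2020 training, development, and evaluation sets.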

3.2 Development set

The development set comprises LibriSpeech dev-clean and a subset of the VCTK corpus [yamagishi2019cstr] denoted as VCTK-dev (see Table 1, middle). With the above attack models in mind, we split them into trial and enrollment subsets. For LibriSpeech dev-clean, the speakers in the enrollment set are a subset of those in the trial set. For VCTK-dev, we use the same speakers for enrollment and trial and we consider two trial subsets, denoted as common and different. The common trial subset is composed of utterances in the VCTK corpus that are identical for all speakers. This is meant for subjective evaluation of speaker verifiability/linkability in a text-dependent manner. The enrollment and different trial subsets are composed of distinct utterances for all speakers.

3.3 Evaluation set

Similarly, the evaluation set comprises LibriSpeech test-clean and a subset of VCTK called VCTK-test (see Table 1, bottom).

4 Utility and privacy metrics

Following the attack models in Section 2.3, we consider objective and subjective privacy metrics to assess anonymization performance in terms of speaker verifiability and linkability. We also propose objective and subjective utility metrics to assess whether the requirements in Section 2.2 are fulfilled.

4.1 Objective metrics

For objective evaluation, we train two systems to assess speaker verifiability and ASR decoding error. The first system, denoted $ASV_\text{eval}$, is an automatic speaker verification (ASV) system, which produces log-likelihood ratio (LLR) scores. The second system, denoted $ASR_\text{eval}$, is an ASR system whose output is scored in terms of word error rate (WER). Both are trained on LibriSpeech train-clean-360 using Kaldi [povey2011kaldi].

4.1.1 Objective speaker verifiability


The $ASV_\text{eval}$ system for speaker verifiability evaluation relies on x-vector speaker embeddings and probabilistic linear discriminant analysis (PLDA) [snyder2018x]. Three metrics are computed: the equal error rate (EER) and the LLR-based costs $C_\text{llr}$ and $C_\text{llr}^\text{min}$. Denoting by $P_\text{fa}(\theta)$ and $P_\text{miss}(\theta)$ the false alarm and miss rates at threshold $\theta$, the EER corresponds to the threshold $\theta_\text{EER}$ at which the two detection error rates are equal, i.e., $\text{EER} = P_\text{fa}(\theta_\text{EER}) = P_\text{miss}(\theta_\text{EER})$. $C_\text{llr}$ is computed from PLDA scores as defined in [brummer2006application, ramos2008cross]. It can be decomposed into a discrimination loss ($C_\text{llr}^\text{min}$) and a calibration loss ($C_\text{llr} - C_\text{llr}^\text{min}$). $C_\text{llr}^\text{min}$ is estimated by optimal calibration using monotonic transformation of the scores to their empirical LLR values.
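As an illustration of the two simplest metrics, the following minimal sketch computes the EER by a threshold sweep and the LLR-based cost from target and impostor scores. It is not the challenge's evaluation code: the function names `eer` and `cllr` are hypothetical, scores are assumed to be natural-log LLRs, and the calibration-free variant of the cost (which needs an additional monotonic score transformation) is not covered.

```python
import numpy as np


def eer(target_scores, impostor_scores):
    """Equal error rate estimated by sweeping the threshold over all observed scores."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tar, non]))
    # Miss rate: fraction of target trials scored below the threshold.
    p_miss = np.array([(tar < t).mean() for t in thresholds])
    # False alarm rate: fraction of impostor trials scored at or above the threshold.
    p_fa = np.array([(non >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[idx] + p_fa[idx])


def cllr(target_llrs, impostor_llrs):
    """LLR-based cost, assuming the scores are natural-log likelihood ratios."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(impostor_llrs, dtype=float)
    # log2(1 + exp(x)) is evaluated stably as logaddexp(0, x) / ln(2).
    cost_tar = np.mean(np.logaddexp(0.0, -tar)) / np.log(2.0)
    cost_non = np.mean(np.logaddexp(0.0, non)) / np.log(2.0)
    return 0.5 * (cost_tar + cost_non)
```

With this definition, a system that always outputs an LLR of zero has a cost of exactly 1, which is why values above 1 indicate miscalibrated scores rather than a failure to discriminate.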

As shown in Fig. 1, these metrics are computed and compared for: (1) original trial and enrollment data, (2) anonymized trial data and original enrollment data, (3) anonymized trial and enrollment data. The number of target and impostor trials is given in Table 2.

Figure 1: ASV evaluation.
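
| Set | Subset | Trials | Female | Male | Total |
|---|---|---|---|---|---|
| Development | LibriSpeech dev-clean | Target | 704 | 644 | 1,348 |
| Development | LibriSpeech dev-clean | Impostor | 14,566 | 12,796 | 27,362 |
| Development | VCTK-dev | Target (common) | 344 | 351 | 695 |
| Development | VCTK-dev | Target (different) | 1,781 | 2,015 | 3,796 |
| Development | VCTK-dev | Impostor (common) | 4,810 | 4,911 | 9,721 |
| Development | VCTK-dev | Impostor (different) | 13,219 | 12,985 | 26,204 |
| Evaluation | LibriSpeech test-clean | Target | 548 | 449 | 997 |
| Evaluation | LibriSpeech test-clean | Impostor | 11,196 | 9,457 | 20,653 |
| Evaluation | VCTK-test | Target (common) | 346 | 354 | 700 |
| Evaluation | VCTK-test | Target (different) | 1,944 | 1,742 | 3,686 |
| Evaluation | VCTK-test | Impostor (common) | 4,838 | 4,952 | 9,790 |
| Evaluation | VCTK-test | Impostor (different) | 13,056 | 13,258 | 26,314 |

Table 2: Number of speaker verification trials.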

4.1.2 ASR decoding error

$ASR_\text{eval}$ is based on the state-of-the-art Kaldi recipe for LibriSpeech involving a factorized time delay neural network (TDNN-F) acoustic model (AM) [povey2018semi, peddinti2015time] and a trigram language model. As shown in Fig. 2, the (1) original and (2) anonymized trial data are decoded using the provided pretrained model and the corresponding WERs are calculated.
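For readers unfamiliar with the metric, the WER is the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below is a generic single-utterance illustration with a hypothetical function name, not the Kaldi scoring tool; corpus-level WER, as reported in this paper, pools errors and reference words over all utterances before dividing.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent between two whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of one third, i.e., about 33.3%.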

Figure 2: ASR decoding evaluation.

4.2 Subjective metrics

Subjective metrics include speaker verifiability, speaker linkability, speech intelligibility, and speech naturalness. They will be evaluated using listening tests carried out by the organizers.

4.2.1 Subjective speaker verifiability

To evaluate subjective speaker verifiability, listeners are given pairs of one anonymized trial utterance and one distinct original enrollment utterance of the same speaker. Following [lorenzo2018voice], they are instructed to imagine a scenario in which the anonymized sample is from an incoming telephone call, and to rate the similarity between the voice and the original voice using a scale of 1 to 10, where 1 denotes ‘different speakers’ and 10 denotes ‘the same speaker’ with highest confidence. The performance of each anonymization system will be visualized through detection error tradeoff (DET) curves.

4.2.2 Subjective speaker linkability

The second subjective metric assesses speaker linkability, i.e., the ability to cluster several utterances into speakers. Listeners are asked to place a set of anonymized trial utterances from different speakers in a 1- or 2-dimensional space according to speaker similarity. This relies on a graphical interface, where each utterance is represented as a point in space and the distance between two points expresses subjective speaker dissimilarity.

4.2.3 Subjective speech intelligibility

Listeners are also asked to rate the intelligibility of individual samples (anonymized trial utterances or original enrollment utterances) on a scale from 1 (totally unintelligible) to 10 (totally intelligible). The results can be visualized through DET curves.

4.2.4 Subjective speech naturalness

Finally, the naturalness of the anonymized speech will be evaluated on a scale from 1 (totally unnatural) to 10 (totally natural).

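| Dataset | Gender | Enroll | Trial | Dev. EER (%) | Dev. $C_\text{llr}^\text{min}$ | Dev. $C_\text{llr}$ | Test EER (%) | Test $C_\text{llr}^\text{min}$ | Test $C_\text{llr}$ |
|---|---|---|---|---|---|---|---|---|---|
| LibriSpeech | Female | original | original | 8.67 | 0.304 | 42.86 | 7.66 | 0.183 | 26.79 |
| LibriSpeech | Female | original | anonymized | 50.28 | 0.997 | 146.01 | 48.54 | 0.996 | 151.37 |
| LibriSpeech | Female | anonymized | anonymized | 35.09 | 0.876 | 15.19 | 29.74 | 0.797 | 14.00 |
| LibriSpeech | Male | original | original | 1.24 | 0.034 | 14.25 | 1.11 | 0.041 | 15.30 |
| LibriSpeech | Male | original | anonymized | 58.39 | 0.998 | 168.50 | 53.23 | 0.999 | 167.14 |
| LibriSpeech | Male | anonymized | anonymized | 29.66 | 0.806 | 20.08 | 32.52 | 0.835 | 26.54 |
| VCTK (different) | Female | original | original | 2.86 | 0.100 | 1.13 | 4.89 | 0.169 | 1.50 |
| VCTK (different) | Female | original | anonymized | 50.03 | 0.988 | 162.91 | 48.87 | 0.999 | 142.40 |
| VCTK (different) | Female | anonymized | anonymized | 29.48 | 0.814 | 10.24 | 34.21 | 0.884 | 12.33 |
| VCTK (different) | Male | original | original | 1.44 | 0.052 | 1.16 | 2.07 | 0.072 | 1.82 |
| VCTK (different) | Male | original | anonymized | 55.33 | 1.000 | 166.50 | 53.73 | 1.000 | 165.62 |
| VCTK (different) | Male | anonymized | anonymized | 26.10 | 0.756 | 18.81 | 25.83 | 0.743 | 16.31 |

Table 3: Speaker verifiability achieved by the pretrained $ASV_\text{eval}$ model. The primary baseline is used for anonymization.

| Dataset | Anonymization | Dev. WER (%) | Test WER (%) |
|---|---|---|---|
| LibriSpeech | original | 3.83 | 4.14 |
| LibriSpeech | anonymized | 6.50 | 6.77 |
| VCTK (comm.+diff.) | original | 10.79 | 12.81 |
| VCTK (comm.+diff.) | anonymized | 15.50 | 15.53 |

Table 4: ASR decoding error achieved by the pretrained $ASR_\text{eval}$ model. The primary baseline is used for anonymization.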

5 Baseline software and results

Two anonymization baselines are provided. We briefly introduce them and report the corresponding objective results below.

5.1 Anonymization baselines

Figure 3: Primary baseline anonymization system.

The primary baseline shown in Fig. 3 is inspired by [fang2019speaker] and comprises three steps: (1) extraction of x-vector [snyder2018x], pitch (F0) and bottleneck (BN) features; (2) x-vector anonymization; (3) speech synthesis (SS) from the anonymized x-vector and the original F0+BN features. In Step 1, 256-dimensional BN features encoding spoken content are extracted using a TDNN-F ASR AM trained on LibriSpeech train-clean-100 and train-other-500 using Kaldi. A 512-dimensional x-vector encoding the speaker is extracted using a TDNN trained on VoxCeleb-1,2 with Kaldi. In Step 2, for every source x-vector, an anonymized x-vector is computed by finding the $N$ farthest x-vectors in an external pool (LibriTTS train-other-500) according to the PLDA distance, and by averaging $N^*$ randomly selected vectors among them. [Footnote 5: In the baseline, $N$ and $N^*$ are set to fixed values.] In Step 3, an SS AM generates Mel-filterbank features given the anonymized x-vector and the F0+BN features, and a neural source-filter (NSF) waveform model [wang2019neural] outputs a speech signal given the anonymized x-vector, the F0, and the generated Mel-filterbank features. The SS AM and NSF models are both trained on LibriTTS train-clean-100. See [tomashenkovoiceprivacy, srivastava2020baseline] for further details.
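The x-vector anonymization of Step 2 can be sketched as follows. This is only an illustration under stated assumptions, not the baseline implementation: cosine distance stands in for the PLDA distance used by the baseline, and the function name `anonymize_xvector` and the parameters `n_farthest` (number of farthest pool candidates) and `n_average` (number of candidates averaged) are hypothetical.

```python
import numpy as np


def anonymize_xvector(source, pool, n_farthest, n_average, rng):
    """Return a pseudo-speaker x-vector for `source`.

    Assumption: cosine distance replaces the PLDA distance of the baseline.
    """
    source = np.asarray(source, dtype=float)
    pool = np.asarray(pool, dtype=float)
    # Cosine distance between the source x-vector and every pool x-vector.
    similarity = pool @ source / (np.linalg.norm(pool, axis=1) * np.linalg.norm(source))
    distance = 1.0 - similarity
    # Indices of the n_farthest pool vectors (largest distances).
    farthest = np.argsort(distance)[-n_farthest:]
    # Average a random subset of n_average of these candidates.
    chosen = rng.choice(farthest, size=n_average, replace=False)
    return pool[chosen].mean(axis=0)


# Usage with random stand-in data (8 dimensions instead of 512).
rng = np.random.default_rng(0)
pool = rng.standard_normal((50, 8))
source = rng.standard_normal(8)
pseudo = anonymize_xvector(source, pool, n_farthest=20, n_average=10, rng=rng)
```

Because the subset is drawn at random, a real system has to control the random draw per speaker so that all utterances of one speaker map to the same pseudo-speaker; the sketch leaves this to the caller through the `rng` argument.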

The secondary baseline is a simpler, formant-shifting approach provided as additional inspiration [EURECOM+6190].

5.2 Objective evaluation results

Table 3 reports the values of objective speaker verifiability metrics obtained before/after anonymization with the primary baseline. [Footnote 6: Results on VCTK (common) are omitted due to space constraints.] The EER and $C_\text{llr}^\text{min}$ metrics behave similarly, while interpretation of $C_\text{llr}$ is more challenging due to non-calibration. [Footnote 7: In particular, $C_\text{llr} > 1$ is not a problem, since we care more about discrimination metrics than score calibration metrics in the first edition.] We hence focus on the EER below. On all datasets, anonymization of the trial data greatly increases the EER. This shows that the anonymization baseline effectively increases the users’ privacy. The EER estimated with original enrollment data (49 to 58%), which is comparable to or above the chance value (50%), suggests that full anonymization has been achieved. However, anonymized enrollment data result in a much lower EER (26 to 35%), which suggests that F0+BN features retain some information about the original speaker. If the attackers have access to such enrollment data, they will be able to re-identify users almost half of the time. This further demonstrates that failing to define the attack model or assuming a naive attack model leads to a greatly overestimated sense of privacy [srivastava2019evaluating]. Note also that the EER is larger for females than males on average.

Table 4 reports the WER achieved before/after anonymization with the primary baseline. While the absolute WER stays below 7% on LibriSpeech and 16% on VCTK, anonymization incurs a large WER increase of 21 to 70% relative.
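The relative increases quoted above follow directly from the WER values in Table 4. The short check below recomputes them; the dictionary layout and the function name `relative_increase` are illustrative choices only.

```python
# (original WER, anonymized WER) in percent, copied from Table 4.
wer = {
    ("LibriSpeech", "dev"): (3.83, 6.50),
    ("LibriSpeech", "test"): (4.14, 6.77),
    ("VCTK", "dev"): (10.79, 15.50),
    ("VCTK", "test"): (12.81, 15.53),
}


def relative_increase(original, anonymized):
    """Relative WER increase in percent caused by anonymization."""
    return 100.0 * (anonymized - original) / original


for (dataset, split), (original, anonymized) in wer.items():
    print(f"{dataset} {split}: +{relative_increase(original, anonymized):.1f}% relative")
```

The smallest increase (about 21%) is obtained on the VCTK test set and the largest (about 70%) on the LibriSpeech development set.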

The results achieved by the secondary baseline are inferior and detailed in [tomashenkovoiceprivacy]. Overall, there is substantial potential for challenge participants to improve over the two baselines.

6 Conclusions

The VoicePrivacy initiative aims to promote the development of private-by-design speech technology. Our initial event, the VoicePrivacy 2020 Challenge, provides a complete evaluation protocol for voice anonymization systems. We formulated the voice anonymization task as a game between users and attackers, and highlighted three possible attack models. We also designed suitable datasets and evaluation metrics, and we released two open-source baseline voice anonymization systems. Future work includes evaluating and comparing the participants’ systems using objective and subjective metrics, computing alternative objective metrics relating to, e.g., requirement (d) in Section 2.2, and drawing initial conclusions regarding the best anonymization strategies for a given attack model. A revised, stronger evaluation protocol is also expected as an outcome.

In this regard, it is essential to realize that the users’ downstream goals and the attack models listed above are not exhaustive. For instance, beyond ASR decoding, anonymization is extremely useful in the context of anonymized data collection for ASR training [srivastava2019privacy]. It is also known that the EER becomes lower when the attackers generate anonymized training data and retrain their system on this data [srivastava2019evaluating]. In order to assess these aspects, we will ask volunteer participants to share additional data with us and run additional experiments in a post-evaluation phase.

7 Acknowledgment

VoicePrivacy was born at the crossroads of projects VoicePersonae, COMPRISE, and DEEP-PRIVACY. Project HARPOCRATES was designed specifically to support it. The authors acknowledge support by ANR, JST, and the European Union’s Horizon 2020 Research and Innovation Program, and they would like to thank Md Sahidullah and Fuming Fang. Experiments presented in this paper were partially carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations.