1. Introduction
Voice User Interfaces (VUIs) for conversational agents are now common in many services such as banking, call centers, and medical services, in addition to voice assistants (VA) like Amazon Alexa and Google Assistant. Often these services rely on verifying the user through speaker recognition and then using speech recognition techniques to understand a spoken language. Increased reliance on VUI services exposes users to an increasing number of threats to their privacy and security. Speech recordings are a rich source of personal information
(Schuller and Batliner, 1988), and the degree of privacy-sensitive information (e.g., emotion, sex, accent, and ethnicity) captured in these recordings extends beyond what is said (i.e., linguistic) and who says it (i.e., paralinguistic). Security and privacy concerns arise from the potential interception and misuse of this sensitive information sharing through automatic speech processing technology.In the security domain, we can consider an adversary which aims to fool the target model (Papernot et al., 2016). Evasion attacks, also known as adversarial examples, add imperceptible perturbation to the input sample to result in the incorrect prediction of the target models (e.g.,
automatic speech recognition (ASR) and automatic speaker verification (ASV))
(y. Huang et al., 2021; Carlini and Wagner, 2018; Abdullah et al., 2019a, b; Cisse et al., 2017; Yuan et al., 2018; Taori et al., 2019; Qin et al., 2019; Schönherr et al., 2018; Abdoli et al., 2019; Alzantot et al., 2018). A spoofing attack (i.e., replay, synthesis, and voice conversion attacks) is a technique where the imposter speaker’s speech is converted to desired speaker’s speech using signal processing approaches that cause false acceptances to authentication systems (Wu and Li, 2014). In response to these potential attacks, the countermeasures for adversarial/spoofing attacks have been proposed to secure target models against these attacks (Wang et al., 2020; Das et al., 2020; Hemavathi and Kumaraswamy, 2021).In the privacy domain, the adversary aims to obtain private information about the training data or to obtain the model itself (Nasr et al., 2019). Attacks targeting data privacy include, for example, an attacker aiming to determine if the voice of a certain individual was used for training a speaker identification system. In response to these potential attacks, privacy-preserving defenses have been designed to prevent privacy leakage of the raw data. These defenses fall between anonymization and cryptography (Tomashenko et al., 2020). For example, anonymization aims to make the speech input unlinkable, i.e., ensure that no utterance can be linked to its original speaker by altering a raw signal and mapping the identifiable personal characteristics of a given speaker to another identity (Lal Srivastava et al., 2020). Various studies have proposed anonymization methods based on noise addition (Tomashenko et al., 2020), voice conversion (Lal Srivastava et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020), speech synthesis (Qian et al., 2018), and adversarial learning (Srivastava et al., 2019a), considering the speaker identity (Tomashenko et al., 2020) or emotion (Aloufi et al., 2019a) as a sensitive attributes.
Voice conversion (VC) is a technique used to convert paralinguistic information such as gender, speaker identity, and emotions while keeping the linguistic information of a source speech. These technologies were made much more powerful by incorporating deep learning mechanisms. Recently, VC technology has become a key technology in designing privacy-preserving voice analytics solutions to produce convincing mimicry of specific target speaker voices. For example, Srivastava
et al. (Srivastava et al., 2020) designed an anonymization scheme that converts any input speech into that of a random pseudo-speaker. Ho et al. (Ho and Akagi, 2021) propose a speaker identity-controllable framework based on VC technology to mimic voice while continuously controlling speaker individuality. On the contrary, such techniques can enable fooling (spoofing) unprotected speaker authentication systems and therefore might prompt various potential security implications. With the generated spoofed recordings, for instance, an adversary might attack the voice assistant, making it fraudulently respond to identity-based service requests; an insider might attack the VoicePrint-based security system to gain illegitimate access and gain sensitive information; an imposter might call a bank’s contact center by making himself recognized as the victim. Thus, it raises alarm about the feasibility of achieving privacy by applying voice transformation (i.e., anonymization) regarding its security threat in real-time practical applications such as smart-assistance systems.With the increasing use of automatic speaker verification (ASV) in security-sensitive domains (e.g., forensics identification and smart-home), ASV is becoming a new target for attackers. It has been shown that ASV systems can be vulnerable to fooling/spoofing, also referred to as presentation attacks (Wang et al., 2020), since these systems generally are not yet efficient in recognizing voice modifications/variations (i.e., adversarial examples, noisy voice samples, mismatch conditions between enrolling and trails recordings) (Dehak et al., 2010)
. VC technology could be also misused for attacking these systems, and thus spoofing countermeasures (CM) have been proposed and adopted to protect ASV systems. CMs are designed to learn the distinguishing artifacts present in spoofed audio produced by VC from human speech. Spoofing refers to falsifying a speech signal as system input for feature extraction and verification, the objective of which is to improve the reliability of biometric systems by preventing fraudulent access. While the ASV system should reject a zero-effort impostor (
i.e., false attempt), the CMs should detect a valid trial (i.e., genuine speech).While security, privacy, and data protection are often studied independently, there is little understanding on their fundamental interconnections, and the complexity of their relation has not been fully explored. Specifically, in the speech domain, prior work has intensively studied the two domains separately (Carlini and
Wagner, 2018; Abdullah et al., 2019a; Aloufi
et al., 2019a). Thus, it remains unclear what the vulnerability to one domain implies for the other. Revealing such implications is important for developing effective defenses where security and privacy can be co-engineered. It is unclear how the two vectors interact with each other and how their interactions may influence attack dynamics against VUIs systems. Understanding such interactions is critical for building effective defenses. For example, in voice assistance systems, the users need to be verified first using voice-based authentication to gain access to further services (e.g., understanding the user command and responding based on it), assuming that anonymization mechanism is detected to protect user privacy (i.e., hiding sensitive speaker-related information), resulting in modified/synthesized voices that can affect the authentication functionality or be blocked by CMs that may detect it as a spoofed signal. Further, the adversary may exploit such an anonymization tool to mislead the authentication operation. Finally, studying potential attack vectors within a unified framework is essential for assessing and mitigating the broad vulnerabilities of VUIs deployed in practice, in which multiple attacks may be launched simultaneously.
In this paper, we seek to answer the following research questions.
RQ1 – What are the fundamental connections between voice spoofing and voice anonymization?
RQ2 – What are the implications of such mechanisms (e.g., speech synthesis, voice cloning, and voice conversion) for an adversary to optimize attack strategies against VUI-enabled services, and for benign users to protect their privacy?
RQ3 – What are the potential countermeasures to maintain secure and private VUI-based systems?
Our Contribution. In this work we present a step towards answering the key questions above. Answering these key questions is crucial for assessing and mitigating the broad vulnerabilities of VUIs deployed in realistic settings.
RA1 – We use a tandem framework that jointly investigates ASV and CM models performance against two vectors of attacks generated by voice transformation and anonymization mechanisms. With this framework, we show that there exists an intricate duality between the two mechanisms. Specifically, they offer different ways to achieve the same objective.
RA2 – Through empirical studies on benchmark datasets and using both spoofing countermeasures and anonymization techniques, we reveal that the anonymized voices are detected as spoofed attacks, intuitively, leading to confusingly questioning its effectiveness in obtaining privacy-transformed utterances to meet the anonymization purposes. We also provide analytical justification for such effects under a different setting. 111Code and research artefacts. https://github.com/RanyaJumah/EDGY/tree/master/Balancing_Privacy&Security_for_VUI
RA3 – Finally, we demonstrate that to effectively defend against attacks, it is necessary to consider attacks from multiple complementary perspectives (i.e., security and privacy) and carefully account for the effects in applying the mitigating solutions.
To our best knowledge, this work represents the first systematic study of voice spoofing (i.e., for security deceiving) and anonymization (i.e., for privacy protection) within a unified framework. We believe our findings deepen understanding of the vulnerabilities of VUIs in practical settings and shed light on how to develop more effective, secure and private solutions.

2. Voice Transformation and Authentication
2.1. Voice Conversion
Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. A typical voice conversion pipeline includes speech analysis, mapping, and reconstruction modules. Deep learning techniques also transform the way we implement the analysis-mapping-reconstruction pipeline. The concept of embedding in deep learning provides a new way of deriving the intermediate representation, for example, latent code for linguistic content, and speaker embedding for speaker identity. It also makes the disentanglement of speaker from speech content much easier.
2.1.1. Speech Analysis.
From the perspective of speech perception, speaker individuality is characterized at three different levels: segmental, supra-segmental, and linguistic information. The segmental information relates to the short-term feature representations, such as spectrum and instantaneous fundamental frequency (F0). The supra-segmental information describes prosodic features such as duration, tone, stress, rhythm over longer stretches of speech than phonetic units. It is more related to the signal but spanning a longer time than the segmental information. The linguistic information is encoded and expressed through lexical content.
Voice conversion technology is to deal with the segmental and suprasegmental information while keeping the language content unchanged (Vaidya and Sherr, 2019). The speech analyzer decomposes the speech signals of a source speaker into features that represent supra-segmental and segmental information.
2.1.2. Mapping.
The mapping module has taken centre stage in many studies. These techniques can be categorized in different ways, for example, based on the use of training data-parallel vs non-parallel, the type of statistical modeling technique (parametric vs non-parametric), the scope of optimization (frame-level vs utterance level) and the workflow of conversion (direct mapping vs inter-lingual).
The simplest form of voice conversion (i.e.,
mapping) requires parallel data for training and it is capable of one-to-one speaker conversion. Parallel data include the same transcription utterances spoken by the source and target speakers and they are highly expensive to collect. Thus, several studies attempted to use non-parallel data to train voice conversion models. In the case of multiple-speaker voice conversion, one-to-one speaker conversion algorithms may be applied to obtain separately trained models for all possible combinations of speaker pairs. However, this approach becomes impractical as the number of speakers increases. Traditional VC research includes modeling spectral mapping with statistical methods such as Gaussian mixture model (GMM), partial least squares regression, and sparse representation. Recent deep learning approaches such as deep neural network (DNN), recurrent neural network (RNN) and generative adversarial network (GAN) have advanced the state-of-the-art. The mapping module changes them towards the target speaker.
2.1.3. Vocoding.
Speech reconstruction can be seen as an inverse function of speech analysis that operates on the modified parameters and generates an audible speech signal. It works with speech analysis in tandem. A vocoder learns to reconstruct audio waveforms from acoustic features (Oord et al., 2016). Traditionally, the waveform can be vocoded from these acoustic or linguistic features using handcrafted models such as WORLD (Morise et al., 2016), Straight (Kawahara, 2006), and Griffin-Lim (Griffin and Lim, 1984)
. However, the quality of those traditional vocoders was limited by the difficulty in accurately estimating the acoustic features from the speech signal. Neural vocoders such as Wavenet
(Oord et al., 2016) have rapidly become the most commonly used vocoding method for speech synthesis. Although it improved the quality of generated speech, it has significant cost in computation power and data sources, and suffers from poor generalization (Lorenzo-Trueba et al., 2019). To solve this problem, many architectures such as Wave Recurrent Neural Networks (WaveRNN) (Kalchbrenner et al., 2018) have been proposed. WaveRNN combines linear prediction with recurrent neural networks to synthesize neural audio much faster than other neural synthesizers.A vocoder is used to express a speech frame with a set of controllable parameters that can be converted back into a speech waveform. Voice conversion systems only modify the speaker-dependent characteristics of speech, such as fundamental frequency (F0), intonation, intensity, and duration, while carrying over the speaker-independent speech content. The reconstruction module re-synthesizes time-domain speech signals.
2.2. Speaker Verification Techniques
Speaker verification is integral to many security applications. This is to verify the identity of a person from the characteristics of the voice. Contemporary ASV systems involve two processes: offline training (i.e., registration or enrollment) and runtime verification. During the offline training, the ASV system uses speech samples provided by the target speaker to extract certain spectral, prosodic, or other high-level features to create a speaker model. Then, in the runtime verification phase, the receiving voice is verified against the trained speaker model (Wang et al., 2020) and the verification score is compared with a pre-defined threshold. If the score is higher than the threshold, the test is accepted, or rejected otherwise. It is a binary decision task and a verification score is estimated based on the claimed speaker’s model.
2.2.1. Speech Analysis.
Typically, an encoder network extracts frame-level representations from acoustic features (e.g., Mel Frequency Cepstrum Coefficients (MFCCs), filter-banks, or spectrogram). This is followed by a global temporal pooling layer that aggregates the frame-level representation into a single vector per utterance. Finally, a feed-forward classification network processes this single vector to calculate speaker class posteriors (Nagrani
et al., 2017). Typically, in the evaluation phase, the speaker embedding is extracted from the first affine transform after the pooling layer. Different x-vector systems are characterized by different encoder architectures, pooling methods, and training objectives (e.g., softmax, angular softmax, contrastive, and triplet losses) (Villalba et al., 2020a).
Traditional Methods.
Speaker identification was dominated by Gaussian Mixture Models (GMMs) trained on low dimensional feature vectors (Reynolds
et al., 2000). The state-of-the-art involves both the use of joint factor analysis (JFA) based methods which model speaker and channel subspaces separately and i-vectors that attempt to model both subspaces into a single compact, low-dimensional space (Dehak et al., 2010). These systems rely, however, on a low dimensional representation of the audio input, e.g., MFCCs, and thus rapidly degrade in verification performance with real-world noise, and may be lacking in speaker-discriminating features (e.g., pitch information) (Nagrani
et al., 2017).
Deep Learning Methods.
DNN based acoustic models were used instead of the GMM in the i-vector framework (Dehak et al., 2010)
. Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. An alternative approach is to use DNN to extract bottleneck features
(Fu et al., 2014; Liu et al., 2015; Tian et al., 2015) or speaker representations directly (Chen et al., 2015). For example, speaker representations such as d-vector (Variani et al., 2014; Heigold et al., 2016) and RNN/LSTM based sequence-vector (s-vector) (Bhattacharya et al., 2016) have been applied as robust speaker embeddings.2.2.2. Speaker Modeling.
There are two kinds of speaker verification (SV) systems: Text-independent (TI)-SV and Text-dependent (TD)-SV systems. TD-SV assumes cooperative speakers and requires the speaker to speak fixed or spontaneously prompted utterances, whereas TI-SV allows the speaker to speak freely during both enrolment and verification. Both TI-SV and TD-SV systems share the feature extraction techniques while being different in the speaker modeling. However, the text-prompted speaker recognition systems have been the preferred alternative in many practical applications.
TI-SV Modeling.
In the text-independent (TI) mode, there are no constraints on the text. Thus, the enrollment and test utterances may have completely different texts. For such cases, it is more convenient for the users to operate. Text-independent ASV systems are more flexible and are able to accept arbitrary utterances, e.g., different languages, from speakers.
TD-SV Modeling.
In the text-dependent (TD) mode, the user is expected to speak a pre-determined text for both training and test. Due to the prior knowledge (lexical content) of the spoken phrase, TD systems are generally more robust and can achieve good performance. Text-dependent ASV is more widely selected for authentication applications, since it provides higher recognition accuracy with fewer required utterances for verification.
3. Voice Disguise vs. Speaker Authentication Systems
This section presents some insights into different types of spoofing attacks, followed by two distinct measurement estimators: Disguise and Anonymization. We then use a tandem framework combining these two estimators to manage the application of privacy-preserving solutions.
3.1. Generic Attack Model
Security of automatic speaker verification (ASV) systems can be compromised by various spoofing attacks (e.g., speech synthesis and voice conversion). We consider a scenario in which a user seeks to compromise a system or service protected by ASV. It is assumed in these scenarios that the microphone is not controlled by the authentication system and is instead chosen by the user (i.e., post-sensor scenario). An example is that voice spoofing attacks can be used to impersonate a person’s voice for voice assistants like Amazon Alexa or Google Assistant to shop online, send messages, control smart home appliances, and grant undesirable access to personal users’ data such as financial information. Such attacks are not necessarily fraudulent, whereby a user of these services may want to conceal their identity to spoof or trick third parties for privacy preservation purposes. Attacks then take the form of synthetic speech or converted voice, which is presented to the ASV system without acoustic propagation or microphone effects (i.e., logical-access voice spoofing techniques). From the attacker’s perspective, spoofing attacks (i.e., in our case, assuming the anonymization system operates as a spoofing system) can be categorized into non-proactive and adversarial attacks causing potential threats on ASV, spoofing countermeasures, or both (Das et al., 2020), as shown in Figure 2.

3.1.1. Spoofing with Non-proactive Attacks
The attacker lacks a direct optimization target related to the attacked ASV system, as shown in Figure 2 (A). Basically, crafting non-proactive attacks represent ideas or technology originally designed for completely different aims and purposes instead of fooling ASV systems (Das et al., 2020). One example is VC and TTS attacks, which aim at modifying source speaker identity to that of a target speaker, and to produce text in a given target speaker’s voice, respectively. VC and TTS technology takes place as a key concept behind a privacy protection mechanism that anonymizes the speaker identity (Tomashenko et al., 2020). Thus, TTS and VC attacks can compromise the security of ASV systems as a side-purpose rather than its original objective in helping, for example, to give those with conditions like autism the ability to speak naturally (Das et al., 2020). In this paper we focus on the non-proactive type, and consider anonymization objective in fooling ASV as a side effect of voice transformation technology.
3.1.2. Spoofing with Adversarial Attacks
The attacker leverages the information of the attacked ASV system to generate spoofed samples and can use the knowledge of either the attacked ASV or another similar ASV to generate adversarial samples (Das et al., 2020; Villalba et al., 2020b), see Figure 2 (B). Adversarial attacks can be broadly divided into black, grey, and white-box attacks (Goodfellow et al., 2014). In the black-box setting, the adversary’s observation is limited to the system output (e.g., speaker similarity score) and the model parameters and the intermediate steps of the computation are not accessible to the attacker (Nasr et al., 2019). In the grey-box, the attacker has some information such as features of the speakers and their implementation, but not their statistical models (Das et al., 2020). The white-box attacks pose the greatest threat as the attackers have full knowledge of the model under attack including its parameters which are needed for prediction (Nasr et al., 2019; Nakamura et al., 2019). We assume that anonymization tools are designed without considering specific knowledge about ASV used for authenticating either target or non-target users.
3.2. Verification-to-Disguise (V2D) Estimation
V2D is a spoofing detector to discriminate genuine and synthetic speech utterances.
Spoofing countermeasures (CM) are introduced to the ASV systems to protect them from various attacks (Wu et al., 2020b; Liu et al., 2019). High-performance anti-spoofing is used to protect ASV by identifying and filtering spoofing audio that is deliberately generated by text-to-speech, voice conversion, audio replay, etc. Claimed identities are thus only accepted if a test signal attains a countermeasure score lower than its threshold. Therefore, the existing spoofing countermeasure involves the extraction of various parameters of prediction error, aiming to capture the features that will help to differentiate genuine from spoof speech signals. Spoofing countermeasures use particular features that capture the unique aspects of human speech production, under the hypothesis that machines cannot emulate many of the fine-level intricacies of the human speech production mechanism (Das et al., 2020). This could be because of the complexity of the human speech production mechanism, human speech has a greater degree of inconsistency than machine-generated speech, as shown in Figure 3. Typically, for deep-learning based voice spoofing detection models, the speech features (e.g., LFCC or CQCC) are fed into a neural network to calculate an embedding vector for the input utterance. The objective of training this model is to learn an embedding space in which the genuine voices and spoofing voices can be well discriminated. The embedding would be further used for scoring the confidence of whether the utterance belongs to genuine speech or not.
By increasing the performance of spoofing countermeasures in detecting disguised voices, the privacy-based transformation methods may not be sufficient to further protect the users’ privacy. In our evaluation framework, we evaluate the artifact in the transformed voices by these systems using a spoofing countermeasure to indicate the level of artifacts left in the converted speech. In our case, CM tries to detect whether the privacy-transformed utterance will be detected as spoofed or not. Thus, we may have two initial scenarios regarding the privacy protection level offered by the anonymization solution: strong security, indicating that the transformed input detected as spoofed and will be prevented from accessing the ASV system, and weak security, indicating that the transformed input bypassed the spoofing countermeasures and thus might present security issues to the authentication system. This V2D output will be used in a security estimation over the inputs of VUIs systems.

3.3. Verification-to-Anonymization (V2A) Estimation
V2A is a speaker detector to verify whether the given utterance is from the target speaker or not.
VUIs use spoken commands to carry out various actions. Sometimes it is necessary to collect some speech to improve and adapt the assistant’s models to the user’s speech. In this case, an attacker could have access to sensitive user data (e.g., they may observe or infer personal information such as identity, age, and gender that can be easily obtained from these utterances). Thus, the objective is privacy preservation, suppressing critical speaker information from speech. To protect their privacy, users may implement a privacy-preservation tool over their data to minimize the personal information, while allowing one or more downstream goals to be achieved. Recent attempts have focused on speech transformation, voice conversion, and speech synthesis as technologies underpinning these solutions.
Privacy by anonymization has achieved remarkable success in concealing identity to preserve users’ privacy (Qian et al., 2018; Ahmed et al., 2020; Han et al., 2020; Srivastava et al., 2020; Vaidya and Sherr, 2019). Although the primary purpose is to protect privacy, this has successfully misled the verification systems (Srivastava et al., 2020). Speakers want to hide their identity while allowing any desired goal to be potentially achieved. In order to hide his/her identity, benign users pass their utterances through an anonymization system before sharing/publication. The resulting anonymized utterances are called trial utterances. They sound as if they were uttered by another speaker, which we call a pseudo-speaker that may be an artificial voice not corresponding to any real speaker. In our case, ASV tries to detect whether the privacy-transformed utterance is spoken by the target speaker or not. Thus, we may have two initial scenarios without considering if a piece of voice is disguised or not: better privacy, indicating that the transformed input is not linkable to the target-speaker, worst-case privacy, indicating that we can still distinguish the target-speaker of the utterance. This V2A output will be used in privacy estimation over the inputs of VUIs systems.
3.4. Tandem Framework
Despite their apparent variations, spoofed inputs and anonymization tools share the same objective of forcing target authentication systems to misclassify pre-defined inputs (target or not). We will focus on the assessment of tandem systems whereby a V2D (i.e., CM) serves as a ‘gate’ to determine whether a given speech input originates from a genuine user, before passing it to V2A (e.g., ASV system). Assuming that verifying the signal integrity comes before achieving privacy, we envision a cascaded (tandem) system where a spoofing countermeasure system is placed before the authentication system (i.e., regarding the anonymization scenario in this paper), to prevent spoofing attacks from reaching this system, as shown in Figure 4.
To assess the joint performance of V2D and V2A, we adopt a new metric called (minimum) tandem detection cost function (t-DCF) (Fu et al., 2014). A t-DCF has been proposed by ASVspoof 2019 as its primary performance metric with a focus on the spoofing attack prior. The t-DCF is based on statistical detection theory and involves detailed specification of an envisioned application. It is a parameterized cost that makes the modeling assumptions of an envisioned operating environment (application) explicit. A key feature of t-DCF is the assessment of a tandem system while keeping the two subsystems (CM and authentication) isolated from each other and they can be developed independently of each other. Since the nature of spoofing attacks is never known in advance, t-DCF metric, therefore, reflects the cost of decisions in a Bayes/minimum risk sense by combining a fixed cost model with trial priors. Thus, beyond its practice for spoofing countermeasures, the specification of costs and priors tailors the t-DCF metric towards the development of secure and private applications for a range of different configurations.
The desired security-privacy trade-off might specify through detection costs assigned to erroneous system decisions and prior probabilities assigned to the commonality of targets, non-targets, and spoofing attacks. For example, a high-security user authentication application (
e.g., access control) where target users and spoofing attacks are almost equally likely to occur, while non-target users are rare. False acceptances (i.e., whether of non-targets or VC attacks) incur a ten-fold cost relative to false rejections. The higher the t-DCF value, the more detrimental the spoofing attack. The maximum value of 1.0 indicates an attack that renders the tandem system useless. Thus, this trade-off can have three possible results which are: (1) CM bonafide and ASV accept means ‘high privacy’, (2) CM bonafide and ASV reject, which means ‘low privacy’, and (3) CM spoof and ASV accept/reject. Therefore, we want to confirm whether this objective metric can capture such score variations and predict scores for evaluating VUIs robustness and privacy.
4. experiments
In this section, we describe the datasets, neural network architectures, and corresponding attacks & countermeasure settings that we use in our experiments.
4.1. Study Setting
System | VC model | Vocoder |
---|---|---|
P1 | VoicePrivacy Challenge | Neural Source-filter |
P2 | VQVAE | World |
P3 | VQVAEGAN | Parallel WaveGAN |
P4 | CycleVQVAE | ParallelWaveGAN |
P5 | CycleVQVAEGAN | Parallel WaveGAN |
Datasets. To factor out the influence of specific datasets, we primarily use 4 benchmark datasets:
VoxCeleb.
VoxCeleb dataset (Nagrani
et al., 2017) contains over 100,000 utterances for 7325 celebrities, extracted from videos uploaded to YouTube. The speakers span a wide range of different ethnicities, accents, professions and ages. It was curated to facilitate the development of automatic speaker recognition systems. We use it to train and evaluate the authentication system.
VCTK.
VCTK dataset (Yamagishi
et al., 2019) includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences. It was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies. We use it to train and evaluate the anonymization systems.
VCC2020.
VCC2020 dataset (Yi et al., 2020) is based on the Effective Multilingual Interaction in Mobile Environments (EMIME) dataset (Wester, 2010), which is a bilingual database of Finnish/English, German/English, and Mandarin/English data. There are seven male and seven female speakers for each language, English, Finnish, German, and Mandarin, ending up in 56 speakers in total. It uses to train and evaluate the anonymization systems.
ASVspoof2019.
ASVspoof2019 database (Wang et al., 2020) for logical access is based upon a standard multi-speaker speech synthesis database called VCTK (Yamagishi
et al., 2019). Genuine speech is collected from 107 speakers (46 male, 61 female) and with no significant channel or background noise effects. Spoofed speech is generated from the genuine data using a number of different spoofing algorithms. It uses to train and evaluate the spoofing countermeasure systems.
Anonymization.
As VC is a basic technique behind most current state-of-the-art anonymization solutions (i.e., offering identity privacy preservation) (Tomashenko et al., 2020; Qian
et al., 2018; Ahmed
et al., 2020; Han et al., 2020; Champion
et al., 2021; Srivastava et al., 2020; Yoo et al., 2020; Aloufi
et al., 2019b), we implement the following five systems (i.e.,
P1-P5 respectively): ‘VoicePrivacy Baseline’, ‘VQVAE’, ‘VQVAEGAN’, ‘CycleVQVAE’, and ‘CycleVQVAEGAN’. For ‘P1’, we use the baseline implementation for the VoicePrivacy challange. Then, we use an open-source nonparallel VC software named crank
(Kobayashi et al., 2021) to implement various VC systems with different configurations including hierarchical architectures ‘P2’, generative adversarial networks ‘P3’, cyclic architectures ‘P4’, speaker adversarial training ‘P5’, and neural vocoders, as shown in Table 1. Following a typical VC systems pipeline in these systems, several steps such as preparing the dataset, feature extraction, training, and conversion are implemented in order to reconstruct the speech utterance while transforming the speaker identity.Authentication System. We use an x-vector (Snyder et al., 2018) embedding extractor network that was a pre-trained recipe of the Kaldi toolkit (Povey et al., 2011). Training was performed using the speech data collected from 7325 speakers contained in the entire VoxCeleb2 corpus (Nagrani et al., 2017). We extract 512-dimensional x-vectors which are fed to a probabilistic linear discriminant analysis (PLDA). PLDA scoring is used to make a rejection/acceptance decision about the speaker identity.
Tag | Category | Posterior odds ratio (flat prior) |
---|---|---|
0 | 50:50 decision making of the adversary | |
A | adversary better decisions than 50:50 | |
B | one wrong decision in 10 to 100 | |
C | one wrong decision in 100 to 1000 | |
D | one wrong decision in 1000 to 100.000 | |
E | one wrong decision in 100.000 to 1.000.000 | |
F | one wrong decision in at least 1.000.000 |
Spoofing Countermeasures. Following (Wang et al., 2020), we use several countermeasure models based on the light convolutional neural network (LCNN) (Wu et al., 2020a)
. These models trained on the ASVspoof 2019 logical access scenario considering current strategies that deal with input trials of varied length. We consider three network structures: ‘LCNN-trim-pad’, ‘LCNN-attention’, and ‘LCNN-lstm-sum’. The loss function can be either AM-softmax, OC-softmax, sigmoid, or MSE for P2SGrad. For more details, refer to
(Wang et al., 2020). We compare their performance on the ASVspoof2019 logical access (LA) dataset (i.e., as a known attack (AK) e.g., waveform filtering, griffinlim, and spectral filtering) and the anonymized recordings (i.e., as an unknown attack implementing different conversion and vocoder models from those used in the training) across the front ends based on linear frequency cepstral coefficients (LFCCs), linear filter bank coefficients (LFBs), and spectrograms. The LFCC is 60-dimensional extracted from a frame length of 20 ms, a frame shift of 10 ms, a 512-point FFT, a linearly spaced triangle filter bank of 20 channels, and delta plus delta-delta coefficients (i.e., the first dimension replaced by log spectral energy). The LFB has a similar configuration but contains only static coefficients from 60 linear filter-bank channels. The spectrogram configures similarly and has 257 dimensions.Measures. State-of-the-art CM and ASV methods are subsequently utilized to objectively evaluate the impact of voice disguise (i.e., spoofing efficacy) and anonymization level (i.e., privacy protection) by equal error rates (EER) and tandem detection cost function (t-DCF) (Kinnunen et al., 2020).
Spoofing Efficacy.
We measure the attack efficacy by the decision score confidence, which is the probability that the spoofed input belongs to the genuine class as predicted by CMs. We consider the attack successful if the decision score confidence exceeds a threshold.
Privacy Protection. We measure the level of the privacy protection offered by the anonymization solutions by the decision score confidence, which is the probability that the speech input belongs to the target speaker as predicted by ASV. We estimate the protection by anonymization in terms of the average protection afforded to a population and a worst-case to an individual.

4.2. Performance Analysis
AM-softmax | OC-softmax | Sigmoid | P2SGrad | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Feature | NN | KA | P1 | P2 | P3-P5 | KA | P1 | P2 | P3-P5 | KA | P1 | P2 | P3-P5 | KA | P1 | P2 | P3-P5 |
LFB | L-T-P | 5.580 | 0.410 | 0.320 | 0.330 | 5.980 | 2.630 | 0.350 | 0.360 | 7.000 | 2.020 | 0.160 | 0.160 | 6.810 | 2.000 | 0.220 | 0.230 |
L-A | 4.250 | 2.110 | 0.210 | 0.230 | 4.010 | 1.210 | 0.170 | 0.190 | 3.340 | 1.190 | 0.270 | 0.190 | 3.980 | 2.340 | 0.130 | 0.130 | |
L-L-S | 4.230 | 3.170 | 0.220 | 0.230 | 5.810 | 3.640 | 0.230 | 0.230 | 7.040 | 3.210 | 0.260 | 0.260 | 5.060 | 1.060 | 0.290 | 0.290 | |
SPEC | L-T-P | 4.840 | 53.00 | 32.44 | 37.49 | 4.410 | 33.01 | 24.48 | 24.78 | 3.090 | 21.00 | 12.50 | 12.50 | 2.940 | 1.150 | 0.420 | 0.420 |
L-A | 4.020 | 6.230 | 4.600 | 5.730 | 4.050 | 12.08 | 4.780 | 5.080 | 3.920 | 6.310 | 6.250 | 6.250 | 4.720 | 6.040 | 5.270 | 6.020 | |
L-L-S | 3.960 | 14.49 | 6.200 | 11.07 | 2.810 | 1.000 | 0.420 | 0.450 | 3.290 | 3.000 | 1.040 | 0.910 | 2.370 | 1.490 | 0.780 | 1.460 | |
LFCC | L-T-P | 3.040 | 27.06 | 18.74 | 18.76 | 2.930 | 9.320 | 5.530 | 5.920 | 2.500 | 7.120 | 6.250 | 6.250 | 2.310 | 6.170 | 6.020 | 5.660 |
L-A | 2.990 | 8.210 | 6.250 | 7.110 | 2.910 | 7.070 | 6.250 | 6.250 | 3.180 | 6.030 | 5.790 | 5.960 | 2.720 | 6.320 | 6.250 | 5.890 | |
L-L-S | 2.460 | 6.010 | 5.240 | 5.370 | 2.230 | 6.110 | 5.370 | 5.610 | 2.670 | 6.470 | 7.420 | 6.250 | 1.920 | 6.340 | 6.250 | 6.250 |
AM-softmax | OC-softmax | Sigmoid | P2SGrad | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Feature | NN | KA | P1 | P2 | P3-P5 | KA | P1 | P2 | P3-P5 | KA | P1 | P2 | P3-P5 | KA | P1 | P2 | P3-P5 |
LFB | L-T-P | 0.120 | 0.007 | 0.006 | 0.007 | 0.150 | 0.007 | 0.007 | 0.007 | 0.160 | 0.013 | 0.006 | 0.013 | 0.170 | 0.015 | 0.006 | 0.015 |
L-A | 0.110 | 0.006 | 0.006 | 0.006 | 0.110 | 0.020 | 0.023 | 0.013 | 0.060 | 0.014 | 0.012 | 0.014 | 0.080 | 0.014 | 0.013 | 0.013 | |
L-L-S | 0.090 | 0.015 | 0.015 | 0.015 | 0.160 | 0.018 | 0.013 | 0.015 | 0.180 | 0.015 | 0.005 | 0.005 | 0.140 | 0.006 | 0.006 | 0.006 | |
SPEC | L-T-P | 0.120 | 0.492 | 0.440 | 0.510 | 0.123 | 0.680 | 0.580 | 0.670 | 0.085 | 0.184 | 0.160 | 0.210 | 0.085 | 0.016 | 0.008 | 0.008 |
L-A | 0.110 | 0.054 | 0.031 | 0.052 | 0.105 | 0.058 | 0.030 | 0.039 | 0.109 | 0.240 | 0.230 | 0.190 | 0.135 | 0.058 | 0.040 | 0.058 | |
L-L-S | 0.101 | 0.120 | 0.069 | 0.113 | 0.077 | 0.018 | 0.008 | 0.009 | 0.087 | 0.037 | 0.020 | 0.018 | 0.060 | 0.031 | 0.015 | 0.029 | |
LFCC | L-T-P | 0.068 | 0.290 | 0.260 | 0.260 | 0.068 | 0.059 | 0.048 | 0.056 | 0.069 | 0.218 | 0.191 | 0.188 | 0.056 | 0.080 | 0.063 | 0.050 |
L-A | 0.074 | 0.274 | 0.220 | 0.191 | 0.066 | 0.122 | 0.087 | 0.081 | 0.068 | 0.080 | 0.053 | 0.057 | 0.079 | 0.116 | 0.073 | 0.055 | |
L-L-S | 0.057 | 0.059 | 0.042 | 0.044 | 0.064 | 0.051 | 0.044 | 0.049 | 0.064 | 0.180 | 0.127 | 0.110 | 0.052 | 0.169 | 0.144 | 0.150 |
4.2.1. V2D Performance
To evaluate the performance of the spoofing countermeasure system, we use the countermeasure decision score which indicates the similarity of the given utterance with genuine speech. equal error rates (EER) is calculated by setting a threshold on the countermeasure decision score, such that the false alarm rate is equal to the miss rate. A high EER indicates the converted speech to be more human-like speech, whereas a lower EER is the better spoofing countermeasure system at detecting spoofing attacks (i.e., EER is constrained between 0 and 0.5, and values larger than 0.5 indicate decisions worse than random guessing). Then, the t-DCF metric is utilized to assess the influence of CM systems on the reliability of an ASV system. The lower the t-DCF is, the better reliability of ASV is achieved.
In Tables 3 and 4, we summarize and compare the approaches to deal with varied-length input and several loss functions reported in the recent speech anti-spoofing literature. We list the EERs and t-DCFs on the evaluation set of ASVspoof19 (LA) and the output of the five conversion systems. Results on loss functions demonstrate the loss function based on sigmoid and P2SGrad have a competitive performance over unknown attacks which achieved a lower equal error rate of 0.16% and 0.13% compared to the other loss functions.
4.2.2. V2A Performance
A speaker verification system automatically accepts or rejects a claimed identity of a speaker based on a speech sample. Three metrics are estimated: EER and log-likelihood ratio (LLR) scores, (Leeuwen and Brümmer, 2007) (i.e., relates to empirical cross-entropy) and . Denoting by () and () the false alarm and miss rates at threshold , the EER corresponds to the threshold EER at which the two detection error rates are equal, i.e., EER = () = (). Then, we use the output scores within both: tandem (i.e., EER) and ‘ZEBRA’ frameworks (Nautsch et al., 2020) (i.e., all scores). The tandem framework serves as a guide for the optimization of a countermeasure for a given authentication system. ‘ZEBRA’ framework measures the average level of privacy protection afforded by a given privacy-preserving solution for a population and the worst-case privacy disclosure for an individual.
4.2.3. Tandem Framework Performance
By combining spoofing detection scores with ASV scores, adoption of the t-DCF, we evaluate the impact of spoofing and the performance of spoofing countermeasures upon the reliability of ASV. Both the CM and ASV subsystems will make classification errors. The aim is to assess the performance of the tandem system as a whole taking into account not only the detection errors of both subsystems, but assumed prior frequencies. For ‘spoofing performance’ assessment, we consider the ASV and CM results jointly. Minimum normalized t-DCF, defined as: = where is the optimal threshold determined from the giving evaluation dataset using the ground truth. The t-DCF metric has 6 parameters: (i) false alarm and miss costs for both systems, and (ii) prior probabilities of target and spoof trials (with an implied third, nontarget prior). Specifically, VC audio files may be rejected by a CM system if the audio contains detectable artifacts. Even if VC audio files are passed on to the CM system, they may still be rejected by the ASV if their speaker similarity is not close enough to the target speakers, as shown in Fig. 6.
4.2.4. Signal Performance.
Most of the important factors that may clearly affect the performance of speech processing systems, including speaker recognition or even spoofing countermeasure performance are: Issues on Background Noise and Issues on Variability. Background noise is an issue being highlighted by (Zheng and Li, 2017)
as problematic because during training, the speaker often speaks in a clean environment. In contrast, during testing, the speaker speaks in a noisy condition. It disturbs the evaluation test and degrades the performance of the speaker recognition system. Voice variability, also known as session variability, is another factor that may affect the performance of these systems; which can be further classified as intra-variation and inter-variation. Intra-variation occurs due to various factors, such as emotions, rate of utterances, mode of speech, disease, the speaker’s mood, and the emphasis given to the word. Inter-variation exists due to anatomical differences in the speech signals due to different transmission channels, such as the different types of microphones and headphones used during the recording of speech utterances, where speech data is sampled at either 8 kHz, 16 kHz, or 22 kHz. Thus, to avoid the models having a mismatched condition, in our experiments we adopt 16 kHz for the used recordings cross all the training and evaluation systems.
4.3. Implications
4.3.1. Effect I: Security Consequences
To analyze the impact of the privacy-preserving solutions, which rely on anonymity using voice transformation tool, we investigate the following question: to what extent can anonymization solutions result in high spoofing risk for ASV and CM?
Approch. Authentication systems are prone to be intentionally fooled using spoofing attacks (i.e., replay, text-to-speech (TTS), and voice conversion (VC)). In our experiments, we involve both spoofed voices synthesized by modern TTS (i.e., using ASVspoof2019 (LA) evaluation set) and VC models. We used converted voices produced by each of the VC systems (i.e., representive of identity anonymization tools).
Observations. Interestingly, it is noted that despite the difference in the results compared to the known attacks, the countermeasure systems are still able to recognize the converted output (i.e., anonymized) as spoofed inputs. Although identity may be pretended by VC, it might be exploited in inappropriate ways (e.g., deepfake problems such as synthesized fake voice). The reason for this may be that all the conversion mechanisms share the need for vocoders to reconstruct waveforms, and even with the development of such techniques, they still can be distinguished compared to the raw utterances.
Limitations & Future Work. However, the remaining question is about the robustness of such systems under adversarial attacks, are these countermeasures for ASV robust enough to defend against adversarial examples? For the robustness, we want to point out that two important factors may affect the performance of these systems: (1) adversarial inputs and (2) real-world perturbations (i.e., noisy background). Liu (Liu et al., 2019) start to highlight the vulnerability of some of spoofing countermeasures under both white-box and black-box adversarial attacks with the fast gradient sign (FGSM) and the projected gradient descent (PGD) methods. While Chettri et al. in (Chettri et al., 2020) spot the effect of the real-time perturbations on CMs, which could be due to (1) variations within the spoof class e.g., speech synthesizers not present in the training set, (2) within the bonafide class e.g., due to content and speaker, or (3) additional nuisance factors e.g., background noise. Thus, the ideal CM should generalize across environments, speakers, languages, channels, and attacks to allow maximum utility across different applications. It is an important direction to be explored in the future and we seek to include it in our evaluation toward designing secure and private VUIs systems. We seek to use the tandem framework to evaluate a system beyond the ASV (i.e., non-biometric) to add a level of security prior privacy-preserving applications.
4.3.2. Effect II: Privacy Concerns.
To analyze the effectiveness of the voice transformation in designing privacy-preserving solutions, we investigate the following question: does hiding identity offer an ideal solution for privacy protection in VUI-based services?
Approach. Due to privacy concerns, oftentimes such data must be de-identified or anonymized before it is used or shared. Therefore, we measured the level of privacy provided by these solutions. Specifically, according to the privacy goal adopted by these solutions, the extent to which the adversary can obtain the identity of the speaker.
Observations. We use the authentication confidence scores cross x-vectors (i.e., V2A) to calculate empirical cross-entropy (ECE). Then, to quantify levels of privacy preservation, the ZEBRA framework uses ECE value to show: (1) the performance on the population level, as the expected ECE is quantified by integrating out all possible prior beliefs (i.e., 0 bit) for full privacy, and (2) the performance on the individual level, the worst-case strength of evidence is the maximum(absolute(LLR)). As shown in Figure 5, ECEs are presented to simulate all possible prior beliefs (i.e., average/expected performance of the presented privacy-preserving solutions).
Limitations & Future Work. Recently, The VoicePrivacy initiative (Tomashenko et al., 2020) has promoted the development of anonymization methods that aim to suppress personally identifiable information in speech (i.e., speaker identity), while leaving other attributes such as linguistic content intact. Despite the appeal of anonymization techniques and the urgency to address privacy concerns in the speech domain, the current solutions may be useful from singular perspectives and for achieving a specific goals, like hiding the identity of the speaker, but might fail to sufficiently address other scenarios. Additionally, it would be concerning that if a benign user can achieve privacy by a transformation model, that also entails that a malicious user can break security by bypassing the spoofing countermeasure mechanism. Besides the spoofing perspective, Srivastava et al. in (Lal Srivastava et al., 2020) found that when the attacker has complete knowledge of the VC scheme and target speaker mapping, none of the existing VC methods will be able to protect the speaker identity, and they classified the attacks based on the attacker’s knowledge about the anonymization method (i.e., ignorant, informed, and semi-informed).

5. Potential Countermeasures
In this section, we highlight potential defenses considering both security and privacy, and then discuss whether the conflicting between them is necessarily or not.
5.1. Defense
VUI technologies that allow users to speak to interact with their devices are based on accurate speaker and speech recognition to ensure appropriate responsiveness. While VUIs are offering new levels of convenience and changing the user experience, these technologies raise new and important security and privacy concerns (e.g., spoofing attacks (Wu et al., 2015), false activations (Dubois et al., 2020), attributes attack (Aloufi et al., 2020)). Thus, in the following, we discuss potential defenses that can reduce the risk of spoofing attacks while maintaining privacy.
5.1.1. Privacy by Personalization
In many real-world usage scenarios, our voice may often also contains indicators of our identity, mood, emotions, physical and mental wellbeing that may be used to manipulate us and/or shared with third parties. This raises privacy concerns owing to the capture and processing of voice recordings that may involve two or more people, and without their explicit consent. This violates GDPR provisions, for instance. Furthermore, the deep acoustic models used to analyze these recordings may encode more information than needed for the task of interest (i.e., ASR), such as profiles of users’ demographic categories, personal preferences, emotional states, etc., and may therefore significantly compromise their privacy. Current works focus on protecting/anonymizing speaker identity using VC-based mechanisms (Qian et al., 2018; Srivastava et al., 2020). Based on our results in Table 3 and 4, we show the limitation of these techniques in achieving secure VUIs services while maintaining user privacy.
Protecting users’ privacy where speech analysis is concerned continues to be a particularly challenging task. Specifically, privacy-preserving solutions should clearly specify the privacy-utility trade-off in a transparent way: what to protect and what task to achieve. These solutions must be compatible without compromising the security of these systems. This opens a new possibility for on-device personalization of speech processing models, where personalized models are trained on users’ devices. For example, a combination of federated learning and differential privacy has been proposed to develop on-device speaker verification (Paulik et al., 2021; Sim et al., 2019). Adversarial training can be one solution in learning the representation related to the task of interest. Srivastava et al. in (Srivastava et al., 2019b) proposed an on-device encoder to protect the speaker identity using adversarial training to learn representations that perform well in ASR while hiding speaker identity. Likewise, recent applications have suggested the implementation of disentanglement (Aloufi et al., 2020) in learning speech representations can enhance the robustness of speech representations and overcome common speaker recognition issues like spoofing attacks (Peri et al., 2020). We hypothesize that learning of speech representation (i.e., task-specific) on devices yields a desirable model that meeds the needs of individual users, and thus can be achieved in a personalised, privacy-preserving way by fine-tuning a global model using standard optimization methods on data stored locally on a single device overcoming the current need of using VC-based mechanism.
5.1.2. Configurable Privacy
Protecting privacy requires more than hiding speaker information or running on-device ASR. Privacy is subjective, with varying attitudes between users, and which may even depend on the services (and/or service providers) with which these systems communicate. Thus, current solutions may be useful from singular perspectives and for achieving a specific goal, like the identity of the speaker, but might fail to sufficiently address configurable privacy. Recently, Aloufi et al. in (Aloufi et al., 2021) advocate the principle of configurable privacy, emphasizing the importance of enabling different privacy settings for optimizing the privacy-utility trade-off and promoting transparent privacy management practices.
Based on our experiments, the output of the tandem framework (i.e., combining verification and authentication) can be helpful in deciding/controlling where to deploy a privacy-preserving solution. For example, assuming such a service does not require authentication (e.g., sharing on social media platforms), then we still need to verify the input speech as a genuine utterance, but a decision threshold of the ASV can be configured to enable access to a non-target speaker. Or, if authentication is required (e.g., smart assistance) the decision threshold can only accept the target speaker (i.e., the results in Figure 6 assume the latter case). In addition, this tandem framework could be also useful if the data owner and the service provider have an agreement on what privacy-preserving/anonymization mechanism to implement it on the shared data, then such a tandem system should enable the input generated by this mechanism while restricting other inputs.
5.1.3. Online Watermarking
The concept of speech watermarking (i.e., voice signature) has risen to be an efficient and promising solution to safeguard voice signals. It encrypted a user’s personal information into the voice as an inaudible watermark (Hu et al., 2021). For example, fingerprinting the audio sample using the acoustics features and then such fingerprint can be used to securely verify the user of interest. Thus, well-designed voice watermarking can help tandem systems in managing identity security in the voice inputs.
Considering honesty and inviolability as the first step approaching privacy preservation, voice signature is a worthy direction towards delivering voice integrity. However, current speech watermarking is designed for fixed-length offline audio files, e.g.,, meeting recordings, and does not consider the impact of environmental conditions, e.g.,, bitrate variability and background noise (Zhang et al., 2021). Such environmental factors make watermark embedding and retrieval very challenging. Further, they are not designed for real-time speaker recognition systems where input speech is unknown a priori and can be of variable length (Zhang et al., 2021). Therefore, speech watermarking must be extended to address the above challenges (e.g., how and which features to encode) for efficient practical applications.
5.2. Trust vs Trustworthy
Speech is a biometric characteristic of human beings, which can produce distinguishing and repeatable biometric features. Controversy has thus arisen over the risks of privacy and security around it.
5.2.1. Is Conflict a Fundamental Principle?
Privacy as Trust. Should we suppress the speaker-related information including his/her identity for privacy preservation? Speaker-related information typically involves timbre, pitch, speaking rate, and speaking style. With the growth of advanced speech synthesis techniques, it is also easy to build speech synthesis systems (i.e., anonymization) from acquired data and then generate new speech samples which reflect the voice of a pseudo speaker. The genuine user can use the generated utterances for privacy protection against an automatic speaker verification (ASV) system. The hiding of speaker identity is also referred to as speaker anonymization or de-identification. These solutions propose to prevent access to the identity in order not to prevent improper use of it.
Security as Trustworthy. Should we maintain the speaker-related information including his/her identity for security integrity? Voice-based authentication has been implemented in security-sensitive applications (e.g., smart home systems) to enable legitimate access. With the growth of advanced speech synthesis techniques, it is also easy to build speech synthesis systems (i.e., spoofing) from acquired data and then generate new speech samples which reflect the voice of a pseudo speaker. The adversary can use the generated utterances to attack an automatic speaker verification (ASV) system. The hiding of speaker identity is also referred to as speaker spoofing or presentation. These systems are used to gain illegitimate access with claimed identity to services protected by ASV.
We leave the question of deciding whether the trust-trustworthy conflict is fundamental (i.e., how to design the next generation of voice-based applications) as an open question for the research community.
5.2.2. Beyond Voice Analytics.
Our experiments so far focused on the speech processing domain. The development of synthesized techniques (e.g., generative models) are in every domain such as images, videos, etc., these tools have become a tough challenge. Recently, synthesized techniques are proposed in limiting the privacy risks by sharing synthetic data instead of real data in a manner that protects the privacy and preserves data utility (Oprisanu et al., [n.d.]; Yoon et al., 2020). However, such techniques also can advance the development of deepfake techniques, depend on generating synthesized samples to attack the target systems. The problem might be expanding to become a broader question belong to privacy and identity management. Thus, there is an urgent need to develop countermeasures techniques against deepfake consequences.
The need for trustworthy systems that offer end-to-end privacy guarantees is urgent (Rogers et al., 2019). The importance of understanding and accommodating context (i.e., control over deployment/application) is a critical key behind designing privacy-preserving solutions to offer any degree of authenticity and linkability. To be considered privacy-enhancing, such a solution needs to allow the user to choose his required and acceptable degree of anonymization while maintaining the conventional capabilities for identification and authentication.
6. Related Work
In this section, we overview the voice conversion technology in terms of its usage for privacy protection and security concerns against it.
6.1. Voice Conversion
Voice conversion is part of the general field of speech synthesis, where we convert text to speech or changes the properties of speech; for example, voice identity and emotion (Sisman et al., 2021; Aloufi et al., 2019b). VC tools modify speaker-dependent characteristics of the speech signal, such as spectral and prosodic aspects, while maintaining the speaker-independent information (i.e., linguistic). VC enables a wide range of applications including personalized speech synthesis (Veaux et al., 2013; Huang et al., 2020), speaker de-identification (Qian et al., 2018; Tomashenko et al., 2020; Ahmed et al., 2020; Yoo et al., 2020), and voice disguise (Wang et al., 2020).
Preserving Voice Privacy. VC mechanisms show their effectiveness in filtering out the speaker-related voice biometrics present in speech data without altering the linguistic content, thus preserving the usefulness of the shared data while protecting the users’ privacy. Most of the proposed works focus on protecting/anonymizing the speaker identity using these mechanisms (Qian et al., 2018; Srivastava et al., 2019b, 2020). For example, VoiceMask builds upon voice conversion to perturb the speech and then sends the sanitized speech audio to the voice input apps (Qian et al., 2018). Similarly, Srivastava et al. in (Srivastava et al., 2020) propose an x-vector-based anonymization scheme to convert any input voice into a random pseudo-speaker based on the selected gender and region of x-vector space of the target pseudo-speaker. In (Yoo et al., 2020)
it is proposed an algorithm that produces anonymized speeches by adopting many-to-many voice conversion techniques based on variational autoencoders (VAEs) and modifying the speaker identity vectors of the VAE input to anonymize the speech data. Although these VC methods may provide some identity protection against less knowledgeable attackers (
i.e., linkage attacks), they are unable to defend against an attacker that has extensive knowledge of the type of conversion and how it has been applied (Lal Srivastava et al., 2020).Besides the speaker identity, various works have proposed to use VC-based mechanisms to protect a speaker’s gender (Jaiswal and Provost, 2019) or emotions (Aloufi et al., 2019a). Champion et al. in (Champion et al., 2021) propose to alter other paralinguistic features (i.e., F0) and analyze the impact of this modification across gender. They found that the proposed F0 modification always improves pseudonymization, and both sources and target speaker genders affect the performance gain when modifying the F0. In (Aloufi et al., 2019a), an edge-based system is proposed to filter patterns from a user’s voice before sharing it with cloud services for further analysis. Likewise, Vaidya et al. in (Vaidya and Sherr, 2019) introduce an audio sanitizer, a software audio processor that filters and modifies the voice characteristics of the speaker from audio commands before they leave the client device by altering speech features (i.e., the short-term spectral features, spectro-temporal features, and high level features) in these commands.
Voice Spoofing. VC poses a significant security threat wherever the voice is used as an authenticator (Vaidya and Sherr, 2019). VC has recently become one of the most easily accessible techniques to carry out spoofing attacks, presenting a threat to speaker verification systems (ASV). There are at least four major classes of spoofing attacks: impersonation, replay, speech synthesis, and voice conversion (Marcel et al., 2019). The execution of speech synthesis and voice conversion attacks usually requires sophisticated speech technology. Speech synthesis systems can be used to generate entirely artificial speech signals, whereas voice conversion systems operate on natural speech (Wu and Li, 2014). With sufficient training data, both speech synthesis and voice conversion technologies can produce high-quality speech signals that mimic the speech of a specific target speaker and are also highly effective in manipulating ASV systems. Such synthetic speech can be used to spoof the voice authentication systems and gain access to the user’s private resources (e.g., fraud attacks).
The awareness of this threat spawned research on anti-spoofing, including techniques to distinguish between bona fide and spoofed biometric data. Solutions are referred to as spoofing countermeasures or presentation attack detection systems (Wang et al., 2020). For example, in (Zheng et al., 2021), an introduced approach to estimate the restoration function is proposed by minimizing a function of ASV scores to improve the defense against the automatic voice disguise (AVD) conducted by VC-based methods. Therefore, the improved conversion technologies also led to concerns about security and authentication. It is thus desirable to be able to prevent one’s voice from being improperly used with such voice conversion technologies.
6.2. Privacy Exposure
Deep learning has been a driving force in research and practice across speech application domains raising the need to study what causes privacy leaks and under which conditions a model is sensitive to different types of privacy-related attacks. In privacy-related attacks, the goal of an adversary is to gain knowledge that was not intended to be shared. Such knowledge can be about the training data or information about the model, or even extracting information about attributes of the data, such as unintentionally encoded information (Nasr et al., 2019).
Membership Inference Attacks.
In membership inference attacks (MIAs), the attacker aims to identify if a data record was used to train a machine learning model
(Nasr et al., 2019). The attack is driven by the different behaviors of the target model when making predictions on samples within or out of its training set (Chen et al., 2020; Murakonda and Shokri, 2020). Song and Shmatikov (Song and Shmatikov, 2019) discuss the application of user-level membership inference on text generative models, exploiting several top ranked outputs of the model. In the speech domain, Miao et al. in (Miao et al., 2020) examine user-level membership inference (i.e., if this user has any data within target model’s training set) in the problem space of voice services, by designing an audio auditor to verify whether a specific user had unwillingly contributed audio used to train an automatic speech recognition (ASR) model under strict black-box access. Song et al. in (Song et al., 2019) combine the privacy and security domains by utilizing the success accuracy of membership inference attacks in reflecting the information leakage of training algorithms about individual members of the training set.Reconstructing Attacks In a reconstruction attack, the attacker aims to infer attributes of the records in the training set (Nasr et al., 2019; Dwork et al., 2017) by leveraging publicly accessible data that are not explicitly encoded as features or are not correlated to the learning task. ‘Overlearning’ may cause revealing privacy- and bias-sensitive attributes that are not part of the target objective (Song and Shmatikov, 2020). In the speech domain, it is possible to accurately infer a user’s sensitive and private attributes (e.g., their emotion, sex, or health status) from deep acoustic models (e.g., DeepSpeech2). An attacker (e.g., a ‘curious’ service provider) may use an acoustic model trained for speech recognition or speaker verification to learn further sensitive attributes from user input even if not present in its training data (Aloufi et al., 2020). Linkage attacks can be designed depending on the attackers’ knowledge about the anonymization scheme to infer the speaker’s identity (Srivastava et al., 2019a). These types of attributes can lead to a secondary use that may include targeting content, or data brokers might profit from selling this information to other parties such as advertisers and insurance companies, or surveillance agencies may use these attributes to recognize users and track their activities and behaviors.
7. Conclusion
This work represents a step towards understanding the security risks of anonymization tools using a tandem evaluation framework. We show both empirically and analytically that (i) there exist intriguing effects between the two vector domains, (ii) an adversary can exploit these effects to optimize attacks with respect to multiple metrics, and (iii) it requires carefully accounting for such effects in designing effective countermeasures against the potential security and privacy attacks on VUI-based systems. We believe our findings shed light on the inherent vulnerabilities of VUIs deployed under realistic settings. This work also opens a few avenues for further investigation. Devising a unified evaluation framework accounting for both security and privacy may serve as a promising starting point for developing effective countermeasures. The detailed analysis in our paper highlights the importance of thinking about their combination.
References
- (1)
- Abdoli et al. (2019) Sajjad Abdoli, Luiz G Hafemann, Jerome Rony, Ismail Ben Ayed, Patrick Cardinal, and Alessandro L Koerich. 2019. Universal adversarial audio perturbations. arXiv preprint arXiv:1908.03173 (2019).
- Abdullah et al. (2019a) Hadi Abdullah, Washington Garcia, Christian Peeters, Patrick Traynor, Kevin RB Butler, and Joseph Wilson. 2019a. Practical hidden voice attacks against speech and speaker recognition systems. arXiv preprint arXiv:1904.05734 (2019).
- Abdullah et al. (2019b) Hadi Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Logan Blue, Kevin Warren, Anurag Swarnim Yadav, Tom Shrimpton, and Patrick Traynor. 2019b. Hear” no evil”, see” kenansville”: Efficient and transferable black-box attacks on speech recognition and voice identification systems. arXiv preprint arXiv:1910.05262 (2019).
- Ahmed et al. (2020) Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz, and Parmesh Ramanathan. 2020. Preech: A System for Privacy-Preserving Speech Transcription. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 2703–2720. https://www.usenix.org/conference/usenixsecurity20/presentation/ahmed-shimaa
- Aloufi et al. (2019a) Ranya Aloufi, Hamed Haddadi, and David Boyle. 2019a. Emotion Filtering at the Edge. In Proceedings of the 1st Workshop on Machine Learning on Edge in Sensor Systems (New York, NY, USA). Association for Computing Machinery. https://doi.org/10.1145/3362743.3362960
- Aloufi et al. (2019b) Ranya Aloufi, Hamed Haddadi, and David Boyle. 2019b. Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants. arXiv:1908.03632 [cs.CR]
- Aloufi et al. (2020) Ranya Aloufi, Hamed Haddadi, and David Boyle. 2020. Privacy-Preserving Voice Analysis via Disentangled Representations. Association for Computing Machinery, New York, NY, USA, 1–14. https://doi.org/10.1145/3411495.3421355
- Aloufi et al. (2021) Ranya Aloufi, Hamed Haddadi, and David Boyle. 2021. Configurable Privacy-Preserving Automatic Speech Recognition. arXiv preprint arXiv:2104.00766 (2021).
- Alzantot et al. (2018) Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. 2018. Did you hear that? adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554 (2018).
- Bhattacharya et al. (2016) Gautam Bhattacharya, Jahangir Alam, Themos Stafylakis, and Patrick Kenny. 2016. Deep neural network based text-dependent speaker recognition: Preliminary results. In Proc. Odyssey. 2–15.
- Carlini and Wagner (2018) Nicholas Carlini and David Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW). IEEE, 1–7.
- Champion et al. (2021) Pierre Champion, Denis Jouvet, and Anthony Larcher. 2021. A Study of F0 Modification for X-Vector Based Speech Pseudonymization Across Gender. arXiv:2101.08478 [eess.AS]
- Chen et al. (2020) Dingfan Chen, Ning Yu, Yang Zhang, and Mario Fritz. 2020. Gan-leaks: A taxonomy of membership inference attacks against generative models. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 343–362.
- Chen et al. (2015) Nanxin Chen, Yanmin Qian, and Kai Yu. 2015. Multi-task learning for text-dependent speaker verification. In Sixteenth annual conference of the international speech communication association.
- Chettri et al. (2020) Bhusan Chettri, Tomi Kinnunen, and Emmanouil Benetos. 2020. Subband modeling for spoofing detection in automatic speaker verification. arXiv preprint arXiv:2004.01922 (2020).
- Cisse et al. (2017) Moustapha M Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. 2017. Houdini: Fooling deep structured visual and speech recognition models with adversarial examples. Advances in neural information processing systems 30 (2017).
- Das et al. (2020) Rohan Kumar Das, Xiaohai Tian, Tomi Kinnunen, and Haizhou Li. 2020. The Attacker’s Perspective on Automatic Speaker Verification: An Overview. arXiv preprint arXiv:2004.08849 (2020).
- Das et al. (2020) R. K. Das, J. Yang, and H. Li. 2020. Assessing the Scope of Generalized Countermeasures for Anti-Spoofing. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6589–6593. https://doi.org/10.1109/ICASSP40776.2020.9053086
- Dehak et al. (2010) Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. 2010. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19, 4 (2010), 788–798.
- Dubois et al. (2020) Daniel J. Dubois, Roman Kolcun, Anna Maria Mandalari, Muhammad Talha Paracha, David Choffnes, and Hamed Haddadi. 2020. When Speakers Are All Ears: Characterizing Misactivations of IoT Smart Speakers. In Proceedings of the 20th Privacy Enhancing Technologies Symposium (PETS 2020) (Montreal, Canada).
- Dwork et al. (2017) Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. 2017. Exposed! a survey of attacks on private data.(2017). (2017).
-
Fu
et al. (2014)
Tianfan Fu, Yanmin Qian,
Yuan Liu, and Kai Yu.
2014.
Tandem deep features for text-dependent speaker verification. In
Fifteenth Annual Conference of the International Speech Communication Association. - Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
-
Griffin and Lim (1984)
Daniel Griffin and Jae
Lim. 1984.
Signal estimation from modified short-time Fourier transform.
IEEE Transactions on Acoustics, Speech, and Signal Processing 32, 2 (1984), 236–243. - Han et al. (2020) Y. Han, S. Li, Y. Cao, Q. Ma, and M. Yoshikawa. 2020. Voice-Indistinguishability: Protecting Voiceprint In Privacy-Preserving Speech Data Release. In 2020 IEEE International Conference on Multimedia and Expo (ICME). 1–6. https://doi.org/10.1109/ICME46284.2020.9102875
- Heigold et al. (2016) Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer. 2016. End-to-end text-dependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5115–5119.
- Hemavathi and Kumaraswamy (2021) R Hemavathi and R Kumaraswamy. 2021. Voice conversion spoofing detection by exploring artifacts estimates. Multimedia Tools and Applications (2021), 1–20.
- Ho and Akagi (2021) Tuan Vu Ho and Masato Akagi. 2021. Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network. IEEE Access 9 (2021), 47503–47515. https://doi.org/10.1109/ACCESS.2021.3063519
- Hu et al. (2021) Hwai-Tsu Hu, Hsien-Hsin Chou, and Tung-Tsun Lee. 2021. Robust Blind Speech Watermarking via FFT-Based Perceptual Vector Norm Modulation With Frame Self-Synchronization. IEEE Access 9 (2021), 9916–9925. https://doi.org/10.1109/ACCESS.2021.3049525
- Huang et al. (2020) Y. Huang, L. He, W. Wei, W. Gale, J. Li, and Y. Gong. 2020. Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7399–7403. https://doi.org/10.1109/ICASSP40776.2020.9053104
- Jaiswal and Provost (2019) Mimansa Jaiswal and Emily Mower Provost. 2019. Privacy Enhanced Multimodal Neural Representations for Emotion Recognition. arXiv:1910.13212 [cs.LG]
- Kalchbrenner et al. (2018) Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 2410–2419.
- Kawahara (2006) Hideki Kawahara. 2006. STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds. Acoustical science and technology 27, 6 (2006), 349–353.
- Kinnunen et al. (2020) Tomi Kinnunen, Héctor Delgado, Nicholas Evans, Kong Aik Lee, Ville Vestman, Andreas Nautsch, Massimiliano Todisco, Xin Wang, Md Sahidullah, Junichi Yamagishi, et al. 2020. Tandem assessment of spoofing countermeasures and automatic speaker verification: Fundamentals. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 2195–2210.
- Kobayashi et al. (2021) Kazuhiro Kobayashi, Wen-Chin Huang, Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, and Tomoki Toda. 2021. crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder. arXiv:2103.02858 [eess.AS]
- Lal Srivastava et al. (2020) B. M. Lal Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent. 2020. Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2802–2806. https://doi.org/10.1109/ICASSP40776.2020.9053868
- Leeuwen and Brümmer (2007) David A. Leeuwen and Niko Brümmer. 2007. An Introduction to Application-Independent Evaluation of Speaker Recognition Systems. Springer-Verlag, Berlin, Heidelberg, 330–353. https://doi.org/10.1007/978-3-540-74200-5_19
- Liu et al. (2019) Songxiang Liu, Haibin Wu, Hung-yi Lee, and Helen Meng. 2019. Adversarial attacks on spoofing countermeasures of automatic speaker verification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 312–319.
- Liu et al. (2015) Yuan Liu, Yanmin Qian, Nanxin Chen, Tianfan Fu, Ya Zhang, and Kai Yu. 2015. Deep feature for text-dependent speaker verification. Speech Communication 73 (2015), 1–13.
- Lorenzo-Trueba et al. (2019) Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal. 2019. Towards Achieving Robust Universal Neural Vocoding. 181–185. https://doi.org/10.21437/Interspeech.2019-1424
- Marcel et al. (2019) Sébastien Marcel, Mark S Nixon, Julian Fierrez, and Nicholas Evans. 2019. Handbook of biometric anti-spoofing: Presentation attack detection. Springer.
- Miao et al. (2020) Yuantian Miao, Minhui Xue, Chao Chen, Lei Pan, Jun Zhang, Benjamin Zi Hao Zhao, Dali Kaafar, and Yang Xiang. 2020. The Audio Auditor: User-Level Membership Inference in Internet of Things Voice Services. arXiv:1905.07082 [cs.CR]
- Morise et al. (2016) Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. 2016. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE TRANSACTIONS on Information and Systems (2016).
- Murakonda and Shokri (2020) Sasi Kumar Murakonda and Reza Shokri. 2020. Ml privacy meter: Aiding regulatory compliance by quantifying the privacy risks of machine learning. arXiv preprint arXiv:2007.09339 (2020).
- Nagrani et al. (2017) Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, Francisco Lacerda (Ed.). ISCA, 2616–2620.
- Nakamura et al. (2019) Taiki Nakamura, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, and Hiroshi Saruwatari. 2019. V2S attack: building DNN-based voice conversion from automatic speaker verification. arXiv:1908.01454 [cs.SD]
- Nasr et al. (2019) M. Nasr, R. Shokri, and A. Houmansadr. 2019. Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning. In 2019 IEEE Symposium on Security and Privacy (SP). 739–753. https://doi.org/10.1109/SP.2019.00065
- Nautsch et al. (2020) Andreas Nautsch, Jose Patino, N. Tomashenko, Junichi Yamagishi, Paul-Gauthier Noé, Jean-François Bonastre, Massimiliano Todisco, and Nicholas Evans. 2020. The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment. Interspeech 2020 (Oct 2020). https://doi.org/10.21437/interspeech.2020-1815
- Oord et al. (2016) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
- Oprisanu et al. ([n.d.]) Bristena Oprisanu, Adria Gascon, and Emiliano De Cristofaro. [n.d.]. Evaluating Privacy-Preserving Generative Models in the Wild Technical Report. ([n. d.]).
- Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. 2016. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277 (2016).
- Paulik et al. (2021) Matthias Paulik, Matt Seigel, Henry Mason, Dominic Telaar, Joris Kluivers, Rogier van Dalen, Chi Wai Lau, Luke Carlson, Filip Granqvist, Chris Vandevelde, et al. 2021. Federated Evaluation and Tuning for On-Device Personalization: System Design & Applications. arXiv preprint arXiv:2102.08503 (2021).
- Peri et al. (2020) Raghuveer Peri, Haoqi Li, Krishna Somandepalli, Arindam Jati, and Shrikanth Narayanan. 2020. An empirical analysis of information encoded in disentangled neural speaker representations. In Proceedings of Odyssey (Tokyo, Japan).
- Povey et al. (2011) Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.
- Qian et al. (2018) Jianwei Qian, Haohua Du, Jiahui Hou, Linlin Chen, Taeho Jung, and Xiang-Yang Li. 2018. Hidebehind: Enjoy Voice Input with Voiceprint Unclonability and Anonymity. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems (Shenzhen, China). Association for Computing Machinery, 82–94. https://doi.org/10.1145/3274783.3274855
- Qin et al. (2019) Yao Qin, Nicholas Carlini, Garrison Cottrell, Ian Goodfellow, and Colin Raffel. 2019. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In International conference on machine learning. PMLR, 5231–5240.
- Reynolds et al. (2000) Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn. 2000. Speaker verification using adapted Gaussian mixture models. Digital signal processing 10, 1-3 (2000), 19–41.
- Rogers et al. (2019) Jennie Rogers, Johes Bater, Xi He, Ashwin Machanavajjhala, Madhav Suresh, and Xiao Wang. 2019. Privacy changes everything. In Heterogeneous data management, polystores, and analytics for healthcare. Springer, 96–111.
- Schönherr et al. (2018) Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. 2018. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665 (2018).
- Schuller and Batliner (1988) Björn W Schuller and Anton M Batliner. 1988. EMOTION, AFFECT AND PERSONALITY IN SPEECH AND LANGUAGE PROCESSING. (1988).
- Sim et al. (2019) Khe Chai Sim, Petr Zadrazil, and Françoise Beaufays. 2019. An investigation into on-device personalization of end-to-end automatic speech recognition models. arXiv preprint arXiv:1909.06678 (2019).
- Sisman et al. (2021) B. Sisman, J. Yamagishi, S. King, and H. Li. 2021. An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 132–157. https://doi.org/10.1109/TASLP.2020.3038524
- Snyder et al. (2018) David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. 2018. X-vectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5329–5333.
-
Song and
Shmatikov (2019)
Congzheng Song and
Vitaly Shmatikov. 2019.
Auditing data provenance in text-generation models. In
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 196–206. - Song and Shmatikov (2020) Congzheng Song and Vitaly Shmatikov. 2020. Overlearning Reveals Sensitive Attributes. arXiv:1905.11742 [cs.LG]
- Song et al. (2019) Liwei Song, Reza Shokri, and Prateek Mittal. 2019. Privacy risks of securing machine learning models against adversarial examples. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. 241–257.
- Srivastava et al. (2019a) Brij Mohan Lal Srivastava, Aurélien Bellet, Marc Tommasi, and Emmanuel Vincent. 2019a. Privacy-Preserving Adversarial Representation Learning in ASR: Reality or Illusion? Interspeech 2019 (Sep 2019). https://doi.org/10.21437/interspeech.2019-2415
- Srivastava et al. (2019b) Brij Mohan Lal Srivastava, Aurélien Bellet, Marc Tommasi, and Emmanuel Vincent. 2019b. Privacy-preserving adversarial representation learning in ASR: Reality or illusion? arXiv preprint arXiv:1911.04913 (2019).
- Srivastava et al. (2020) Brij Mohan Lal Srivastava, Natalia Tomashenko, Xin Wang, Emmanuel Vincent, Junichi Yamagishi, Mohamed Maouche, Aurélien Bellet, and Marc Tommasi. 2020. Design Choices for X-vector Based Speaker Anonymization. arXiv:2005.08601 [eess.AS]
- Taori et al. (2019) Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. 2019. Targeted adversarial examples for black box audio systems. In 2019 IEEE Security and Privacy Workshops (SPW). IEEE, 15–20.
- Tian et al. (2015) Yao Tian, Meng Cai, Liang He, and Jia Liu. 2015. Investigation of bottleneck features and multilingual deep neural networks for speaker verification. In Sixteenth Annual Conference of the International Speech Communication Association.
- Tomashenko et al. (2020) Natalia Tomashenko, Brij Mohan Lal Srivastava, Xin Wang, Emmanuel Vincent, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Jose Patino, Jean-François Bonastre, Paul-Gauthier Noé, et al. 2020. Introducing the VoicePrivacy initiative. arXiv preprint arXiv:2005.01387 (2020).
- Vaidya and Sherr (2019) T. Vaidya and M. Sherr. 2019. You Talk Too Much: Limiting Privacy Exposure Via Voice Input. In 2019 IEEE Security and Privacy Workshops (SPW). 84–91. https://doi.org/10.1109/SPW.2019.00026
- Variani et al. (2014) Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez. 2014. Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4052–4056.
- Veaux et al. (2013) Christophe Veaux, Junichi Yamagishi, and Simon King. 2013. Towards personalised synthesised voices for individuals with vocal disabilities: Voice banking and reconstruction. In Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies. 107–111.
- Villalba et al. (2020a) Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom, Leibny Paola García-Perera, Fred Richardson, Réda Dehak, et al. 2020a. State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations. Computer Speech & Language 60 (2020), 101026.
- Villalba et al. (2020b) Jesús Villalba, Yuekai Zhang, and Najim Dehak. 2020b. x-Vectors Meet Adversarial Attacks: Benchmarking Adversarial Robustness in Speaker Verification. Proc. Interspeech 2020 (2020), 4233–4237.
- Wang et al. (2020) Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, and Zhen-Hua Ling. 2020. ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech. arXiv:1911.01601 [eess.AS]
- Wester (2010) Mirjam Wester. 2010. The EMIME bilingual database. Technical Report. The University of Edinburgh.
- Wu et al. (2020b) Haibin Wu, Andy T Liu, and Hung-yi Lee. 2020b. Defense for black-box attacks on anti-spoofing models by self-supervised learning. arXiv preprint arXiv:2006.03214 (2020).
- Wu et al. (2020a) Zhenzong Wu, Rohan Kumar Das, Jichen Yang, and Haizhou Li. 2020a. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks. arXiv preprint arXiv:2009.09637 (2020).
- Wu et al. (2015) Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, and Haizhou Li. 2015. Spoofing and countermeasures for speaker verification: A survey. speech communication (2015), 130–153.
- Wu and Li (2014) Zhizheng Wu and Haizhou Li. 2014. Voice conversion versus speaker verification: an overview. APSIPA Transactions on Signal and Information Processing 3 (2014).
- y. Huang et al. (2021) C. y. Huang, Y. Y. Lin, H. y. Lee, and L. s. Lee. 2021. Defending Your Voice: Adversarial Attack on Voice Conversion. In 2021 IEEE Spoken Language Technology Workshop (SLT). 552–559. https://doi.org/10.1109/SLT48900.2021.9383529
- Yamagishi et al. (2019) Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. 2019. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92). https://doi.org/10.7488/ds/2645
- Yi et al. (2020) Zhao Yi, Wen-Chin Huang, Xiaohai Tian, Junichi Yamagishi, Rohan Kumar Das, Tomi Kinnunen, Zhen-Hua Ling, and Tomoki Toda. 2020. Voice Conversion Challenge 2020 –- Intra-lingual semi-parallel and cross-lingual voice conversion –-. In Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. 80–98. https://doi.org/10.21437/VCC_BC.2020-14
- Yoo et al. (2020) I. C. Yoo, K. Lee, S. Leem, H. Oh, B. Ko, and D. Yook. 2020. Speaker Anonymization for Personal Information Protection Using Voice Conversion Techniques. IEEE Access 8 (2020), 198637–198645. https://doi.org/10.1109/ACCESS.2020.3035416
- Yoon et al. (2020) Jinsung Yoon, Lydia N. Drumright, and Mihaela van der Schaar. 2020. Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN). IEEE Journal of Biomedical and Health Informatics 24, 8 (2020), 2378–2388. https://doi.org/10.1109/JBHI.2020.2980262
- Yuan et al. (2018) Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A Gunter. 2018. Commandersong: A systematic approach for practical adversarial voice recognition. In 27th USENIX Security Symposium (USENIX Security 18). 49–64.
- Zhang et al. (2021) Yangyong Zhang, Maliheh Shirvanian, Sunpreet S Arora, Jianwei Huang, and Guofei Gu. 2021. Practical Speech Re-use Prevention in Voice-driven Services. arXiv preprint arXiv:2101.04773 (2021).
- Zheng et al. (2021) L. Zheng, J. Li, M. Sun, X. Zhang, and T. F. Zheng. 2021. When Automatic Voice Disguise Meets Automatic Speaker Verification. IEEE Transactions on Information Forensics and Security 16 (2021), 824–837. https://doi.org/10.1109/TIFS.2020.3023818
- Zheng and Li (2017) Thomas Fang Zheng and Lantian Li. 2017. Robustness-related issues in speaker recognition. Springer.
Comments
There are no comments yet.