A Tandem Framework Balancing Privacy and Security for Voice User Interfaces

by Ranya Aloufi, et al.

Speech synthesis, voice cloning, and voice conversion techniques present severe privacy and security threats to users of voice user interfaces (VUIs). These techniques transform one or more elements of a speech signal, e.g., identity and emotion, while preserving linguistic information. Adversaries may use advanced transformation tools to mount a spoofing attack that presents fraudulent biometrics for a legitimate speaker. Conversely, such techniques have been used to generate privacy-transformed speech by suppressing personally identifiable attributes in the voice signals, achieving anonymization. Prior works have studied the security and privacy vectors separately, which raises the alarm that if a benign user can achieve privacy through a transformation, a malicious user can also break security by bypassing the anti-spoofing mechanism. In this paper, we take a step towards balancing two seemingly conflicting requirements: security and privacy. It remains unclear what the vulnerabilities in one domain imply for the other, and what dynamic interactions exist between them. A better understanding of these aspects is crucial for assessing and mitigating vulnerabilities inherent in VUIs and building effective defenses. Specifically, (i) we investigate the applicability of current voice anonymization methods by deploying a tandem framework that jointly combines anti-spoofing and authentication models, and evaluate the performance of these methods; (ii) examining analytical and empirical evidence, we reveal a duality between the two mechanisms, as they offer different ways to achieve the same objective, and we show that leveraging one vector significantly amplifies the effectiveness of the other; (iii) we demonstrate that to effectively defend against potential attacks on VUIs, it is necessary to investigate the attacks from multiple complementary perspectives (security and privacy).




1. Introduction

Voice User Interfaces (VUIs) for conversational agents are now common in many services such as banking, call centers, and medical services, in addition to voice assistants (VAs) like Amazon Alexa and Google Assistant. These services often verify the user through speaker recognition and then use speech recognition to understand the spoken language. Increased reliance on VUI services exposes users to a growing number of threats to their privacy and security. Speech recordings are a rich source of personal information (Schuller and Batliner, 1988), and the degree of privacy-sensitive information (e.g., emotion, sex, accent, and ethnicity) captured in these recordings extends beyond what is said (i.e., linguistic) and who says it (i.e., paralinguistic). Security and privacy concerns arise from the potential interception and misuse of this sensitive information when it is shared through automatic speech processing technology.

In the security domain, we can consider an adversary that aims to fool the target model (Papernot et al., 2016). Evasion attacks, also known as adversarial examples, add imperceptible perturbations to the input sample to cause incorrect predictions by the target models (e.g., automatic speech recognition (ASR) and automatic speaker verification (ASV)) (y. Huang et al., 2021; Carlini and Wagner, 2018; Abdullah et al., 2019a, b; Cisse et al., 2017; Yuan et al., 2018; Taori et al., 2019; Qin et al., 2019; Schönherr et al., 2018; Abdoli et al., 2019; Alzantot et al., 2018). A spoofing attack (i.e., a replay, synthesis, or voice conversion attack) presents speech that imitates a desired speaker, for example by converting an impostor's speech into the desired speaker's speech with signal processing approaches, in order to cause false acceptances in authentication systems (Wu and Li, 2014). In response, countermeasures have been proposed to secure target models against adversarial and spoofing attacks (Wang et al., 2020; Das et al., 2020; Hemavathi and Kumaraswamy, 2021).

In the privacy domain, the adversary aims to obtain private information about the training data or to obtain the model itself (Nasr et al., 2019). Attacks targeting data privacy include, for example, an attacker aiming to determine whether the voice of a certain individual was used for training a speaker identification system. In response to these potential attacks, privacy-preserving defenses have been designed to prevent privacy leakage from the raw data. These defenses range from anonymization to cryptography (Tomashenko et al., 2020). For example, anonymization aims to make the speech input unlinkable, i.e., to ensure that no utterance can be linked to its original speaker, by altering the raw signal and mapping the identifiable personal characteristics of a given speaker to another identity (Lal Srivastava et al., 2020). Various studies have proposed anonymization methods based on noise addition (Tomashenko et al., 2020), voice conversion (Lal Srivastava et al., 2020; Ahmed et al., 2020; Srivastava et al., 2020), speech synthesis (Qian et al., 2018), and adversarial learning (Srivastava et al., 2019a), considering the speaker identity (Tomashenko et al., 2020) or emotion (Aloufi et al., 2019a) as the sensitive attribute.

Voice conversion (VC) is a technique used to convert paralinguistic information such as gender, speaker identity, and emotion while keeping the linguistic information of the source speech. These technologies have become much more powerful with the incorporation of deep learning. Recently, VC has become a key building block of privacy-preserving voice analytics solutions, since it can produce convincing mimicry of specific target speaker voices. For example, Srivastava et al. (Srivastava et al., 2020) designed an anonymization scheme that converts any input speech into that of a random pseudo-speaker. Ho et al. (Ho and Akagi, 2021) propose a speaker identity-controllable framework based on VC technology to mimic a voice while continuously controlling speaker individuality. Conversely, such techniques can be used to fool (spoof) unprotected speaker authentication systems and therefore carry various potential security implications. With the generated spoofed recordings, for instance, an adversary might attack a voice assistant, making it fraudulently respond to identity-based service requests; an insider might attack a VoicePrint-based security system to gain illegitimate access to sensitive information; an impostor might call a bank's contact center and be recognized as the victim. This raises the alarm about the feasibility of achieving privacy through voice transformation (i.e., anonymization), given the security threat it poses in real-time practical applications such as smart-assistance systems.

With the increasing use of automatic speaker verification (ASV) in security-sensitive domains (e.g., forensic identification and smart homes), ASV is becoming a new target for attackers. It has been shown that ASV systems can be vulnerable to fooling/spoofing, also referred to as presentation attacks (Wang et al., 2020), since these systems are generally not yet robust to voice modifications/variations (i.e., adversarial examples, noisy voice samples, and mismatched conditions between enrollment and trial recordings) (Dehak et al., 2010). VC technology could also be misused for attacking these systems, and thus spoofing countermeasures (CMs) have been proposed and adopted to protect ASV systems. CMs are designed to learn the artifacts that distinguish spoofed audio produced by VC from human speech. Spoofing refers to presenting a falsified speech signal as system input for feature extraction and verification; the objective of CMs is to improve the reliability of biometric systems by preventing such fraudulent access. While the ASV system should reject a zero-effort impostor (i.e., false attempt), the CMs should detect a valid trial (i.e., genuine speech).

While security, privacy, and data protection are often studied independently, there is little understanding of their fundamental interconnections, and the complexity of their relation has not been fully explored. Specifically, in the speech domain, prior work has intensively studied the two domains separately (Carlini and Wagner, 2018; Abdullah et al., 2019a; Aloufi et al., 2019a). Thus, it remains unclear what vulnerability in one domain implies for the other. Revealing such implications is important for developing effective defenses where security and privacy can be co-engineered. It is also unclear how the two vectors interact with each other and how their interactions may influence attack dynamics against VUI systems. Understanding such interactions is critical for building effective defenses. For example, in voice assistance systems, users need to be verified first using voice-based authentication to gain access to further services (e.g., understanding the user command and responding to it). If an anonymization mechanism is deployed to protect user privacy (i.e., hiding sensitive speaker-related information), the resulting modified/synthesized voices can affect the authentication functionality or be blocked by CMs that detect them as spoofed signals. Further, the adversary may exploit such an anonymization tool to mislead the authentication operation. Finally, studying potential attack vectors within a unified framework is essential for assessing and mitigating the broad vulnerabilities of VUIs deployed in practice, in which multiple attacks may be launched simultaneously. In this paper, we seek to answer the following research questions.
RQ1 – What are the fundamental connections between voice spoofing and voice anonymization?
RQ2 – What are the implications of such mechanisms (e.g., speech synthesis, voice cloning, and voice conversion) for an adversary to optimize attack strategies against VUI-enabled services, and for benign users to protect their privacy?
RQ3 – What are the potential countermeasures to maintain secure and private VUI-based systems?

Our Contribution. In this work we present a step towards answering the key questions above. Answering these key questions is crucial for assessing and mitigating the broad vulnerabilities of VUIs deployed in realistic settings.

RA1 – We use a tandem framework that jointly investigates the performance of ASV and CM models against two vectors of attacks generated by voice transformation and anonymization mechanisms. With this framework, we show that there exists an intricate duality between the two mechanisms. Specifically, they offer different ways to achieve the same objective.

RA2 – Through empirical studies on benchmark datasets, using both spoofing countermeasures and anonymization techniques, we reveal that anonymized voices are detected as spoofing attacks, which calls into question their effectiveness as privacy-transformed utterances that meet the anonymization purpose. We also provide analytical justification for such effects under a different setting. (Code and research artefacts: https://github.com/RanyaJumah/EDGY/tree/master/Balancing_Privacy&Security_for_VUI)
RA3 – Finally, we demonstrate that to effectively defend against attacks, it is necessary to consider attacks from multiple complementary perspectives (i.e., security and privacy) and carefully account for the effects in applying the mitigating solutions.

To the best of our knowledge, this work represents the first systematic study of voice spoofing (i.e., deceiving security mechanisms) and anonymization (i.e., protecting privacy) within a unified framework. We believe our findings deepen the understanding of the vulnerabilities of VUIs in practical settings and shed light on how to develop more effective, secure, and private solutions.

Figure 1. Voice conversion pipeline: (1) for training, the speech signals from the source and target speakers are decomposed into features, and feature mapping then modifies these features from the source to the target speaker, resulting in a conversion model; (2) for testing, the output of the conversion model is used as the vocoder's input to regenerate the speech in the target speaker's voice.

2. Voice Transformation and Authentication

2.1. Voice Conversion

Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. A typical voice conversion pipeline includes speech analysis, mapping, and reconstruction modules. Deep learning techniques also transform the way we implement the analysis-mapping-reconstruction pipeline. The concept of embedding in deep learning provides a new way of deriving the intermediate representation, for example, latent code for linguistic content, and speaker embedding for speaker identity. It also makes the disentanglement of speaker from speech content much easier.

2.1.1. Speech Analysis.

From the perspective of speech perception, speaker individuality is characterized at three different levels: segmental, supra-segmental, and linguistic information. The segmental information relates to short-term feature representations, such as the spectrum and instantaneous fundamental frequency (F0). The supra-segmental information describes prosodic features such as duration, tone, stress, and rhythm over longer stretches of speech than phonetic units. It also relates to the signal itself but spans a longer time than the segmental information. The linguistic information is encoded and expressed through lexical content.

Voice conversion technology deals with the segmental and supra-segmental information while keeping the language content unchanged (Vaidya and Sherr, 2019). The speech analyzer decomposes the speech signals of a source speaker into features that represent supra-segmental and segmental information.
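As a minimal illustration of a segmental feature, the sketch below estimates F0 for a single voiced frame by picking the peak of its autocorrelation. This is only one classical analysis technique and is not claimed to be the analyzer used in any system discussed here; the function name, the search range (60 to 400 Hz), and the synthetic sine-wave frame are assumptions made for the example.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak."""
    lag_min = int(sample_rate / f0_max)  # shortest pitch period considered
    lag_max = int(sample_rate / f0_min)  # longest pitch period considered
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation of the frame at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Hypothetical test signal: a pure 200 Hz tone sampled at 8 kHz (50 ms frame).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
```

Real speech would additionally need a voicing decision and some smoothing across frames, which this sketch leaves out.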

2.1.2. Mapping.

The mapping module has taken centre stage in many studies. Mapping techniques can be categorized in different ways, for example, based on the use of training data (parallel vs non-parallel), the type of statistical modeling technique (parametric vs non-parametric), the scope of optimization (frame level vs utterance level), and the workflow of conversion (direct mapping vs inter-lingual).

The simplest form of voice conversion (i.e., mapping) requires parallel data for training and is capable of one-to-one speaker conversion. Parallel data consist of utterances with the same transcription spoken by both the source and target speakers, and they are highly expensive to collect. Thus, several studies have attempted to use non-parallel data to train voice conversion models. In the case of multiple-speaker voice conversion, one-to-one speaker conversion algorithms may be applied to obtain separately trained models for all possible combinations of speaker pairs. However, this approach becomes impractical as the number of speakers increases. Traditional VC research includes modeling the spectral mapping with statistical methods such as Gaussian mixture models (GMMs), partial least squares regression, and sparse representation. Recent deep learning approaches such as deep neural networks (DNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs) have advanced the state of the art. In all cases, the mapping module changes the extracted source features towards those of the target speaker.
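To make the idea of a parallel-data, frame-level mapping concrete, the following sketch fits an independent affine transform per feature dimension by least squares on time-aligned source/target frames. It is a deliberately simple stand-in for the statistical mappings mentioned above (it is not a GMM or a neural model), and the feature values, function names, and the assumption that frames are already aligned are all illustrative.

```python
def fit_affine_mapping(source_frames, target_frames):
    """Fit y ~= a*x + b per feature dimension on aligned parallel frames."""
    n = len(source_frames)
    dims = len(source_frames[0])
    params = []
    for d in range(dims):
        xs = [frame[d] for frame in source_frames]
        ys = [frame[d] for frame in target_frames]
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        a = cov_xy / var_x if var_x > 0 else 1.0  # fall back to identity slope
        params.append((a, mean_y - a * mean_x))
    return params

def convert(frame, params):
    """Move one source frame towards the target speaker's feature space."""
    return [a * x + b for x, (a, b) in zip(frame, params)]

# Hypothetical 2-dimensional parallel features (same utterance, two speakers).
source = [[0.0, 4.0], [1.0, 2.0], [2.0, 0.0], [3.0, -2.0]]
target = [[2 * x0 + 1, 0.5 * x1 - 3] for x0, x1 in source]
params = fit_affine_mapping(source, target)
converted = convert([4.0, 6.0], params)
```

Because one such model is trained per speaker pair, the number of models grows quadratically with the number of speakers, which is the scalability issue noted above.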

2.1.3. Vocoding.

Speech reconstruction can be seen as an inverse function of speech analysis that operates on the modified parameters and generates an audible speech signal. It works in tandem with speech analysis. A vocoder learns to reconstruct audio waveforms from acoustic features (Oord et al., 2016). Traditionally, the waveform can be vocoded from these acoustic or linguistic features using handcrafted models such as WORLD (Morise et al., 2016), STRAIGHT (Kawahara, 2006), and Griffin-Lim (Griffin and Lim, 1984). However, the quality of those traditional vocoders was limited by the difficulty of accurately estimating the acoustic features from the speech signal. Neural vocoders such as WaveNet (Oord et al., 2016) have rapidly become the most commonly used vocoding method for speech synthesis. Although WaveNet improved the quality of generated speech, it incurs a significant cost in computation and training data, and it suffers from poor generalization (Lorenzo-Trueba et al., 2019). To address this problem, many architectures such as Wave Recurrent Neural Networks (WaveRNN) (Kalchbrenner et al., 2018) have been proposed. WaveRNN uses a compact recurrent neural network to synthesize audio much faster than other neural synthesizers.

A vocoder is used to express a speech frame with a set of controllable parameters that can be converted back into a speech waveform. Voice conversion systems only modify the speaker-dependent characteristics of speech, such as fundamental frequency (F0), intonation, intensity, and duration, while carrying over the speaker-independent speech content. The reconstruction module re-synthesizes time-domain speech signals.

2.2. Speaker Verification Techniques

Speaker verification is integral to many security applications: it verifies the identity of a person from the characteristics of their voice. Contemporary ASV systems involve two processes: offline training (i.e., registration or enrollment) and runtime verification. During offline training, the ASV system uses speech samples provided by the target speaker to extract certain spectral, prosodic, or other high-level features and create a speaker model. Then, in the runtime verification phase, the incoming voice is scored against the trained speaker model (Wang et al., 2020), and the verification score is compared with a pre-defined threshold. If the score is higher than the threshold, the trial is accepted; otherwise it is rejected. Verification is thus a binary decision task in which the score is estimated with respect to the claimed speaker's model.
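The enrollment and verification steps described above can be sketched as follows. The sketch assumes that each utterance has already been reduced to a fixed-length embedding and that trials are scored with cosine similarity, which is one common back-end rather than a choice stated here; the embeddings, function names, and the 0.7 threshold are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def enroll(embeddings):
    """Offline phase: average the enrollment embeddings into a speaker model."""
    dims = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dims)]

def verify(speaker_model, test_embedding, threshold):
    """Runtime phase: accept the claimed identity only if score > threshold."""
    score = cosine_similarity(speaker_model, test_embedding)
    return score, score > threshold

# Hypothetical 3-dimensional embeddings, for illustration only.
model = enroll([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0]])
score_same, accept_same = verify(model, [1.0, 0.05, 0.1], threshold=0.7)
score_other, accept_other = verify(model, [0.0, 1.0, 0.0], threshold=0.7)
```

In practice the threshold would be tuned on development data to balance false acceptances against false rejections.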

2.2.1. Speech Analysis.

Typically, an encoder network extracts frame-level representations from acoustic features (e.g., Mel Frequency Cepstrum Coefficients (MFCCs), filter-banks, or spectrogram). This is followed by a global temporal pooling layer that aggregates the frame-level representation into a single vector per utterance. Finally, a feed-forward classification network processes this single vector to calculate speaker class posteriors (Nagrani et al., 2017). Typically, in the evaluation phase, the speaker embedding is extracted from the first affine transform after the pooling layer. Different x-vector systems are characterized by different encoder architectures, pooling methods, and training objectives (e.g., softmax, angular softmax, contrastive, and triplet losses) (Villalba et al., 2020a).
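The role of the temporal pooling layer can be illustrated with mean-and-standard-deviation (statistics) pooling, one widely used option among the pooling methods mentioned above. The sketch assumes frame-level features are given as plain lists; the values and the function name are illustrative only.

```python
import math

def statistics_pooling(frame_features):
    """Aggregate frame-level features into one utterance-level vector.

    Returns the per-dimension means followed by the per-dimension standard
    deviations, so the output size is fixed regardless of utterance length.
    """
    n = len(frame_features)
    dims = len(frame_features[0])
    means = [sum(f[d] for f in frame_features) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frame_features) / n)
        for d in range(dims)
    ]
    return means + stds

# Three hypothetical frames with two features each.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
utterance_vector = statistics_pooling(frames)
```

A five-frame and a five-hundred-frame utterance both map to a vector of the same length, which is what allows the subsequent feed-forward layers to operate on whole utterances.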
Traditional Methods. Speaker identification was dominated by Gaussian Mixture Models (GMMs) trained on low dimensional feature vectors (Reynolds et al., 2000). The state-of-the-art involves both the use of joint factor analysis (JFA) based methods which model speaker and channel subspaces separately and i-vectors that attempt to model both subspaces into a single compact, low-dimensional space (Dehak et al., 2010). These systems rely, however, on a low dimensional representation of the audio input, e.g., MFCCs, and thus rapidly degrade in verification performance with real-world noise, and may be lacking in speaker-discriminating features (e.g., pitch information) (Nagrani et al., 2017).
Deep Learning Methods. DNN-based acoustic models were used instead of the GMM in the i-vector framework (Dehak et al., 2010). Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. An alternative approach is to use a DNN to extract bottleneck features (Fu et al., 2014; Liu et al., 2015; Tian et al., 2015) or speaker representations directly (Chen et al., 2015). For example, speaker representations such as the d-vector (Variani et al., 2014; Heigold et al., 2016) and the RNN/LSTM-based sequence-vector (s-vector) (Bhattacharya et al., 2016) have been applied as robust speaker embeddings.

2.2.2. Speaker Modeling.

There are two kinds of speaker verification (SV) systems: Text-independent (TI)-SV and Text-dependent (TD)-SV systems. TD-SV assumes cooperative speakers and requires the speaker to speak fixed or spontaneously prompted utterances, whereas TI-SV allows the speaker to speak freely during both enrolment and verification. Both TI-SV and TD-SV systems share the feature extraction techniques while being different in the speaker modeling. However, the text-prompted speaker recognition systems have been the preferred alternative in many practical applications.
TI-SV Modeling. In the text-independent (TI) mode, there are no constraints on the text. Thus, the enrollment and test utterances may have completely different texts. For such cases, it is more convenient for the users to operate. Text-independent ASV systems are more flexible and are able to accept arbitrary utterances, e.g., different languages, from speakers.
TD-SV Modeling. In the text-dependent (TD) mode, the user is expected to speak a pre-determined text for both training and test. Due to the prior knowledge (lexical content) of the spoken phrase, TD systems are generally more robust and can achieve good performance. Text-dependent ASV is more widely selected for authentication applications, since it provides higher recognition accuracy with fewer required utterances for verification.

3. Voice Disguise vs. Speaker Authentication Systems

This section presents some insights into different types of spoofing attacks, followed by two distinct measurement estimators: Disguise and Anonymization. We then use a tandem framework combining these two estimators to manage the application of privacy-preserving solutions.

3.1. Generic Attack Model

The security of automatic speaker verification (ASV) systems can be compromised by various spoofing attacks (e.g., speech synthesis and voice conversion). We consider a scenario in which a user seeks to compromise a system or service protected by ASV. It is assumed in these scenarios that the microphone is not controlled by the authentication system and is instead chosen by the user (i.e., a post-sensor scenario). For example, voice spoofing attacks can be used to impersonate a person's voice to voice assistants like Amazon Alexa or Google Assistant in order to shop online, send messages, control smart home appliances, and gain undesirable access to users' personal data such as financial information. Such attacks are not necessarily fraudulent: a user of these services may want to conceal their identity from third parties for privacy preservation purposes. Attacks then take the form of synthetic speech or converted voice, which is presented to the ASV system without acoustic propagation or microphone effects (i.e., logical-access voice spoofing techniques). From the attacker's perspective, spoofing attacks (i.e., in our case, assuming the anonymization system operates as a spoofing system) can be categorized into non-proactive and adversarial attacks that pose potential threats to ASV, spoofing countermeasures, or both (Das et al., 2020), as shown in Figure 2.

Figure 2. Spoofing attacks (A) non-proactive attacks (B) adversarial attacks: using black-box, grey-box and white-box ASV (Das et al., 2020).

3.1.1. Spoofing with Non-proactive Attacks

The attacker lacks a direct optimization target related to the attacked ASV system, as shown in Figure 2 (A). Essentially, non-proactive attacks repurpose ideas or technology originally designed for completely different aims than fooling ASV systems (Das et al., 2020). Examples are VC and TTS attacks, which aim at modifying the source speaker identity to that of a target speaker, and at producing text in a given target speaker's voice, respectively. VC and TTS technology is also a key concept behind privacy protection mechanisms that anonymize the speaker identity (Tomashenko et al., 2020). Thus, TTS and VC attacks can compromise the security of ASV systems as a side effect rather than as their original objective, which may be, for example, to give those with conditions like autism the ability to speak naturally (Das et al., 2020). In this paper, we focus on the non-proactive type, and consider fooling ASV to be a side effect of the voice transformation technology used for anonymization.

3.1.2. Spoofing with Adversarial Attacks

The attacker leverages the information of the attacked ASV system to generate spoofed samples and can use the knowledge of either the attacked ASV or another similar ASV to generate adversarial samples (Das et al., 2020; Villalba et al., 2020b), see Figure 2 (B). Adversarial attacks can be broadly divided into black, grey, and white-box attacks (Goodfellow et al., 2014). In the black-box setting, the adversary’s observation is limited to the system output (e.g., speaker similarity score) and the model parameters and the intermediate steps of the computation are not accessible to the attacker (Nasr et al., 2019). In the grey-box, the attacker has some information such as features of the speakers and their implementation, but not their statistical models (Das et al., 2020). The white-box attacks pose the greatest threat as the attackers have full knowledge of the model under attack including its parameters which are needed for prediction (Nasr et al., 2019; Nakamura et al., 2019). We assume that anonymization tools are designed without considering specific knowledge about ASV used for authenticating either target or non-target users.

3.2. Verification-to-Disguise (V2D) Estimation

V2D is a spoofing detector that discriminates between genuine and synthetic speech utterances.

Spoofing countermeasures (CMs) are introduced to ASV systems to protect them from various attacks (Wu et al., 2020b; Liu et al., 2019). High-performance anti-spoofing protects ASV by identifying and filtering spoofed audio that is deliberately generated by text-to-speech, voice conversion, audio replay, etc. Claimed identities are thus only accepted if a test signal attains a countermeasure score above its threshold, under the convention that higher scores indicate genuine speech. Existing spoofing countermeasures involve the extraction of various parameters, such as those of the prediction error, aiming to capture features that help to differentiate genuine from spoofed speech signals. Spoofing countermeasures use particular features that capture the unique aspects of human speech production, under the hypothesis that machines cannot emulate many of the fine-level intricacies of the human speech production mechanism (Das et al., 2020). Possibly because of the complexity of this mechanism, human speech has a greater degree of inconsistency than machine-generated speech, as shown in Figure 3. Typically, for deep-learning based voice spoofing detection models, the speech features (e.g., LFCC or CQCC) are fed into a neural network to calculate an embedding vector for the input utterance. The objective of training this model is to learn an embedding space in which genuine voices and spoofed voices can be well discriminated. The embedding is then used to score the confidence that the utterance is genuine speech.

As spoofing countermeasures become better at detecting disguised voices, privacy-based transformation methods may no longer be sufficient to protect users' privacy. In our evaluation framework, we use a spoofing countermeasure to indicate the level of artifacts left in the speech converted by these systems. In our case, the CM decides whether the privacy-transformed utterance is spoofed or not. Thus, we may have two initial scenarios regarding the privacy protection level offered by the anonymization solution: strong security, indicating that the transformed input is detected as spoofed and will be prevented from accessing the ASV system, and weak security, indicating that the transformed input bypasses the spoofing countermeasure and thus might present security issues to the authentication system. This V2D output will be used in a security estimation over the inputs of VUI systems.

Figure 3. Spectral envelope (log scale) of the same text utterance ("Robin Williams is very subdued."): genuine speech (raw, left) contains natural transitions, while spoofed speech (anonymized, right) does not.

3.3. Verification-to-Anonymization (V2A) Estimation

V2A is a speaker detector that verifies whether the given utterance is from the target speaker or not.

VUIs use spoken commands to carry out various actions. Sometimes it is necessary to collect some speech to improve and adapt the assistant’s models to the user’s speech. In this case, an attacker could have access to sensitive user data (e.g., they may observe or infer personal information such as identity, age, and gender that can be easily obtained from these utterances). Thus, the objective is privacy preservation, suppressing critical speaker information from speech. To protect their privacy, users may implement a privacy-preservation tool over their data to minimize the personal information, while allowing one or more downstream goals to be achieved. Recent attempts have focused on speech transformation, voice conversion, and speech synthesis as technologies underpinning these solutions.

Privacy by anonymization has achieved remarkable success in concealing identity to preserve users' privacy (Qian et al., 2018; Ahmed et al., 2020; Han et al., 2020; Srivastava et al., 2020; Vaidya and Sherr, 2019). Although its primary purpose is to protect privacy, anonymization has also successfully misled verification systems (Srivastava et al., 2020). Speakers want to hide their identity while still allowing any desired downstream goal to be achieved. To hide their identity, benign users pass their utterances through an anonymization system before sharing or publication. The resulting anonymized utterances are called trial utterances. They sound as if they were uttered by another speaker, which we call a pseudo-speaker and which may be an artificial voice not corresponding to any real speaker. In our case, the ASV system decides whether the privacy-transformed utterance is spoken by the target speaker or not. Thus, we may have two initial scenarios, without considering whether the voice is disguised or not: better privacy, indicating that the transformed input is not linkable to the target speaker, and worst-case privacy, indicating that the target speaker of the utterance can still be identified. This V2A output will be used in a privacy estimation over the inputs of VUI systems.
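The two V2A scenarios can be expressed as a simple rule on the ASV score of an anonymized trial scored against the original speaker's model. The sketch below is only illustrative: it assumes higher ASV scores mean a closer match, and the scores, the 0.7 threshold, and the function names are hypothetical.

```python
def v2a_outcome(asv_score, asv_threshold):
    """Label one anonymized trial scored against the original speaker's model."""
    if asv_score > asv_threshold:
        # ASV still links the utterance to the original (target) speaker.
        return "worst-case privacy"
    return "better privacy"

def linkability_rate(asv_scores, asv_threshold):
    """Fraction of anonymized trials that remain linkable to their speaker."""
    linked = sum(1 for score in asv_scores if score > asv_threshold)
    return linked / len(asv_scores)

# Hypothetical ASV scores for four anonymized utterances of one speaker.
scores = [0.12, 0.81, 0.30, 0.05]
labels = [v2a_outcome(score, 0.7) for score in scores]
rate = linkability_rate(scores, 0.7)
```

A lower linkability rate corresponds to stronger anonymization under this simplified view.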

3.4. Tandem Framework

Despite their apparent variations, spoofed inputs and anonymization tools share the same objective of forcing target authentication systems to misclassify pre-defined inputs (target or not). We will focus on the assessment of tandem systems whereby a V2D (i.e., CM) serves as a ‘gate’ to determine whether a given speech input originates from a genuine user, before passing it to V2A (e.g., ASV system). Assuming that verifying the signal integrity comes before achieving privacy, we envision a cascaded (tandem) system where a spoofing countermeasure system is placed before the authentication system (i.e., regarding the anonymization scenario in this paper), to prevent spoofing attacks from reaching this system, as shown in Figure 4.

To assess the joint performance of V2D and V2A, we adopt the (minimum) tandem detection cost function (t-DCF) (Fu et al., 2014). ASVspoof 2019 adopted the t-DCF as its primary performance metric, with a focus on the spoofing attack prior. The t-DCF is based on statistical detection theory and involves a detailed specification of an envisioned application. It is a parameterized cost that makes the modeling assumptions of an envisioned operating environment (application) explicit. A key feature of the t-DCF is that it assesses a tandem system while keeping the two subsystems (CM and authentication) isolated from each other, so that they can be developed independently. Since the nature of spoofing attacks is never known in advance, the t-DCF reflects the cost of decisions in a Bayes/minimum-risk sense by combining a fixed cost model with trial priors. Thus, beyond its use for spoofing countermeasures, the specification of costs and priors tailors the t-DCF metric towards the development of secure and private applications for a range of different configurations.

The desired security-privacy trade-off can be specified through detection costs assigned to erroneous system decisions and prior probabilities assigned to the commonality of targets, non-targets, and spoofing attacks. For example, consider a high-security user authentication application (e.g., access control) where target users and spoofing attacks are almost equally likely to occur, while non-target users are rare, and where false acceptances (whether of non-targets or VC attacks) incur a ten-fold cost relative to false rejections. The higher the t-DCF value, the more detrimental the spoofing attack. The maximum value of 1.0 indicates an attack that renders the tandem system useless. This trade-off has three possible outcomes: (1) CM bonafide and ASV accept, which means ‘high privacy’, (2) CM bonafide and ASV reject, which means ‘low privacy’, and (3) CM spoof, regardless of whether the ASV accepts or rejects. Therefore, we want to confirm whether this objective metric can capture such score variations and predict scores for evaluating the robustness and privacy of VUIs.
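The cascade and its three outcomes can be sketched as follows. This is a minimal illustration, not the implementation used in our experiments: the function name, the dataclass, and the convention that higher scores mean "more bona fide" or "more like the target" are assumptions made for the example, and the outcome labels are copied from the three cases listed above.

```python
from dataclasses import dataclass


@dataclass
class TandemDecision:
    cm_bonafide: bool  # the CM (V2D) gate judged the input to be genuine speech
    asv_accept: bool   # the ASV (V2A) matched the input to the target speaker
    outcome: str       # label taken from the three cases in the text


def tandem_decide(cm_score, asv_score, cm_threshold, asv_threshold):
    """Cascade a CM gate and an ASV decision (hypothetical helper).

    Assumes higher scores mean 'more bona fide' (CM) and
    'more like the target speaker' (ASV).
    """
    cm_bonafide = cm_score >= cm_threshold
    asv_accept = asv_score >= asv_threshold
    if not cm_bonafide:
        # Case (3): flagged as spoof, whatever the ASV would have decided.
        outcome = "spoof"
    elif asv_accept:
        outcome = "high privacy"  # case (1), label as given in the text
    else:
        outcome = "low privacy"   # case (2), label as given in the text
    return TandemDecision(cm_bonafide, asv_accept, outcome)
```

For instance, `tandem_decide(0.9, 0.2, 0.5, 0.5)` passes the gate but is rejected by the ASV, which corresponds to case (2).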

Figure 4. A tandem system consisting of automatic speaker verification (ASV) and spoofing countermeasure (CM) modules. Each of the CM and ASV systems produces a score that is compared against its own threshold.

4. Experiments

In this section, we describe the datasets, neural network architectures, and corresponding attacks & countermeasure settings that we use in our experiments.

4.1. Study Setting

System | VC model | Vocoder
P1 | VoicePrivacy Challenge | Neural Source-filter
P2 | VQVAE | WORLD
P3 | VQVAEGAN | Parallel WaveGAN
P4 | CycleVQVAE | Parallel WaveGAN
P5 | CycleVQVAEGAN | Parallel WaveGAN
Table 1. Details of the VC systems used to anonymize the speaker identity.

Datasets. To factor out the influence of specific datasets, we primarily use 4 benchmark datasets:
VoxCeleb. VoxCeleb dataset (Nagrani et al., 2017) contains over 100,000 utterances for 7325 celebrities, extracted from videos uploaded to YouTube. The speakers span a wide range of different ethnicities, accents, professions and ages. It was curated to facilitate the development of automatic speaker recognition systems. We use it to train and evaluate the authentication system.
VCTK. VCTK dataset (Yamagishi et al., 2019) includes speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences. It was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies. We use it to train and evaluate the anonymization systems.
VCC2020. VCC2020 dataset (Yi et al., 2020) is based on the Effective Multilingual Interaction in Mobile Environments (EMIME) dataset (Wester, 2010), which is a bilingual database of Finnish/English, German/English, and Mandarin/English data. There are seven male and seven female speakers for each language (English, Finnish, German, and Mandarin), for a total of 56 speakers. We use it to train and evaluate the anonymization systems.
ASVspoof2019. The ASVspoof2019 database (Wang et al., 2020) for logical access is based upon a standard multi-speaker speech synthesis database called VCTK (Yamagishi et al., 2019). Genuine speech is collected from 107 speakers (46 male, 61 female) with no significant channel or background noise effects. Spoofed speech is generated from the genuine data using a number of different spoofing algorithms. We use it to train and evaluate the spoofing countermeasure systems.

Anonymization. As VC is a basic technique behind most current state-of-the-art anonymization solutions (i.e., offering identity privacy preservation) (Tomashenko et al., 2020; Qian et al., 2018; Ahmed et al., 2020; Han et al., 2020; Champion et al., 2021; Srivastava et al., 2020; Yoo et al., 2020; Aloufi et al., 2019b), we implement the following five systems (i.e., P1-P5 respectively): ‘VoicePrivacy Baseline’, ‘VQVAE’, ‘VQVAEGAN’, ‘CycleVQVAE’, and ‘CycleVQVAEGAN’. For ‘P1’, we use the baseline implementation for the VoicePrivacy challenge. Then, we use an open-source nonparallel VC software named crank (Kobayashi et al., 2021) to implement various VC systems with different configurations including hierarchical architectures ‘P2’, generative adversarial networks ‘P3’, cyclic architectures ‘P4’, speaker adversarial training ‘P5’, and neural vocoders, as shown in Table 1. Following the typical VC pipeline, these systems implement several steps, such as dataset preparation, feature extraction, training, and conversion, in order to reconstruct the speech utterance while transforming the speaker identity.

Authentication System. We use an x-vector (Snyder et al., 2018) embedding extractor network taken from a pre-trained recipe of the Kaldi toolkit (Povey et al., 2011). Training was performed using the speech data collected from 7325 speakers contained in the entire VoxCeleb2 corpus (Nagrani et al., 2017). We extract 512-dimensional x-vectors, which are fed to a probabilistic linear discriminant analysis (PLDA) back end. PLDA scoring is used to make a rejection/acceptance decision about the speaker identity.

Tag | Category (posterior odds ratio, flat prior)
0 | 50:50 decision making of the adversary
A | adversary makes better decisions than 50:50
B | one wrong decision in 10 to 100
C | one wrong decision in 100 to 1000
D | one wrong decision in 1000 to 100,000
E | one wrong decision in 100,000 to 1,000,000
F | one wrong decision in at least 1,000,000
Table 2. Categorical tags of worst-case privacy disclosure (Nautsch et al., 2020), based on the decisions made by an adversary: the better the decisions an adversary can make despite the applied privacy preservation, the worse the categorical tag.

Spoofing Countermeasures. Following (Wang et al., 2020), we use several countermeasure models based on the light convolutional neural network (LCNN) (Wu et al., 2020a). These models are trained on the ASVspoof 2019 logical access scenario, considering current strategies that deal with input trials of varied length. We consider three network structures: ‘LCNN-trim-pad’, ‘LCNN-attention’, and ‘LCNN-lstm-sum’. The loss function can be either AM-softmax, OC-softmax, sigmoid, or MSE for P2SGrad. For more details, refer to (Wang et al., 2020). We compare their performance on the ASVspoof2019 logical access (LA) dataset (i.e., as a known attack (KA), e.g., waveform filtering, Griffin-Lim, and spectral filtering) and on the anonymized recordings (i.e., as an unknown attack implementing different conversion and vocoder models from those used in the training) across front ends based on linear frequency cepstral coefficients (LFCCs), linear filter bank coefficients (LFBs), and spectrograms. The LFCC is 60-dimensional, extracted with a frame length of 20 ms, a frame shift of 10 ms, a 512-point FFT, a linearly spaced triangle filter bank of 20 channels, and delta plus delta-delta coefficients (i.e., the first dimension replaced by log spectral energy). The LFB has a similar configuration but contains only static coefficients from 60 linear filter-bank channels. The spectrogram is configured similarly and has 257 dimensions.
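The front-end dimensions quoted above follow from simple arithmetic, which the sketch below works through. It assumes a 16 kHz sampling rate (the rate we adopt for all recordings) and no padding at the utterance edges; the constant names and the `num_frames` helper are hypothetical and only serve the illustration.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz recordings
FRAME_LENGTH_MS = 20
FRAME_SHIFT_MS = 10
N_FFT = 512
N_FILTERS = 20           # linearly spaced triangular filters for the LFCC

frame_length = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000  # samples per frame
frame_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000    # hop between frames
n_bins = N_FFT // 2 + 1      # one-sided spectrum size (spectrogram dimension)
lfcc_dim = N_FILTERS * 3     # static + delta + delta-delta coefficients


def num_frames(n_samples):
    """Number of full frames in an utterance, assuming no padding."""
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // frame_shift
```

Under these assumptions a frame covers 320 samples with a hop of 160 samples, the 512-point FFT yields the 257 spectrogram dimensions, and 20 static coefficients with their deltas and delta-deltas give the 60-dimensional LFCC.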

Measures. State-of-the-art CM and ASV methods are subsequently utilized to objectively evaluate the impact of voice disguise (i.e., spoofing efficacy) and anonymization level (i.e., privacy protection) by equal error rates (EER) and tandem detection cost function (t-DCF) (Kinnunen et al., 2020).
Spoofing Efficacy. We measure the attack efficacy by the decision score confidence, which is the probability that the spoofed input belongs to the genuine class as predicted by the CMs. We consider the attack successful if the decision score confidence exceeds a threshold.

Privacy Protection. We measure the level of the privacy protection offered by the anonymization solutions by the decision score confidence, which is the probability that the speech input belongs to the target speaker as predicted by ASV. We estimate the protection by anonymization in terms of the average protection afforded to a population and a worst-case to an individual.

Figure 5. Privacy analysis of the adopted anonymization systems using ZEBRA profile metrics in form of: system name followed by population, individual, and tag values.

4.2. Performance Analysis

Feature | NN | AM-softmax: KA | P1 | P2 | P3-P5 | OC-softmax: KA | P1 | P2 | P3-P5 | Sigmoid: KA | P1 | P2 | P3-P5 | P2SGrad: KA | P1 | P2 | P3-P5
LFB | L-T-P | 5.580 | 0.410 | 0.320 | 0.330 | 5.980 | 2.630 | 0.350 | 0.360 | 7.000 | 2.020 | 0.160 | 0.160 | 6.810 | 2.000 | 0.220 | 0.230
LFB | L-A | 4.250 | 2.110 | 0.210 | 0.230 | 4.010 | 1.210 | 0.170 | 0.190 | 3.340 | 1.190 | 0.270 | 0.190 | 3.980 | 2.340 | 0.130 | 0.130
LFB | L-L-S | 4.230 | 3.170 | 0.220 | 0.230 | 5.810 | 3.640 | 0.230 | 0.230 | 7.040 | 3.210 | 0.260 | 0.260 | 5.060 | 1.060 | 0.290 | 0.290
SPEC | L-T-P | 4.840 | 53.00 | 32.44 | 37.49 | 4.410 | 33.01 | 24.48 | 24.78 | 3.090 | 21.00 | 12.50 | 12.50 | 2.940 | 1.150 | 0.420 | 0.420
SPEC | L-A | 4.020 | 6.230 | 4.600 | 5.730 | 4.050 | 12.08 | 4.780 | 5.080 | 3.920 | 6.310 | 6.250 | 6.250 | 4.720 | 6.040 | 5.270 | 6.020
SPEC | L-L-S | 3.960 | 14.49 | 6.200 | 11.07 | 2.810 | 1.000 | 0.420 | 0.450 | 3.290 | 3.000 | 1.040 | 0.910 | 2.370 | 1.490 | 0.780 | 1.460
LFCC | L-T-P | 3.040 | 27.06 | 18.74 | 18.76 | 2.930 | 9.320 | 5.530 | 5.920 | 2.500 | 7.120 | 6.250 | 6.250 | 2.310 | 6.170 | 6.020 | 5.660
LFCC | L-A | 2.990 | 8.210 | 6.250 | 7.110 | 2.910 | 7.070 | 6.250 | 6.250 | 3.180 | 6.030 | 5.790 | 5.960 | 2.720 | 6.320 | 6.250 | 5.890
LFCC | L-L-S | 2.460 | 6.010 | 5.240 | 5.370 | 2.230 | 6.110 | 5.370 | 5.610 | 2.670 | 6.470 | 7.420 | 6.250 | 1.920 | 6.340 | 6.250 | 6.250
Table 3. EERs (%) on the voices generated by the transformation systems across CMs with various features, NN architectures, and loss functions (lower EER is better). KA denotes known attacks; L-T-P, L-A, and L-L-S denote LCNN-trim-pad, LCNN-attention, and LCNN-lstm-sum, respectively.
Feature | NN | AM-softmax: KA | P1 | P2 | P3-P5 | OC-softmax: KA | P1 | P2 | P3-P5 | Sigmoid: KA | P1 | P2 | P3-P5 | P2SGrad: KA | P1 | P2 | P3-P5
LFB | L-T-P | 0.120 | 0.007 | 0.006 | 0.007 | 0.150 | 0.007 | 0.007 | 0.007 | 0.160 | 0.013 | 0.006 | 0.013 | 0.170 | 0.015 | 0.006 | 0.015
LFB | L-A | 0.110 | 0.006 | 0.006 | 0.006 | 0.110 | 0.020 | 0.023 | 0.013 | 0.060 | 0.014 | 0.012 | 0.014 | 0.080 | 0.014 | 0.013 | 0.013
LFB | L-L-S | 0.090 | 0.015 | 0.015 | 0.015 | 0.160 | 0.018 | 0.013 | 0.015 | 0.180 | 0.015 | 0.005 | 0.005 | 0.140 | 0.006 | 0.006 | 0.006
SPEC | L-T-P | 0.120 | 0.492 | 0.440 | 0.510 | 0.123 | 0.680 | 0.580 | 0.670 | 0.085 | 0.184 | 0.160 | 0.210 | 0.085 | 0.016 | 0.008 | 0.008
SPEC | L-A | 0.110 | 0.054 | 0.031 | 0.052 | 0.105 | 0.058 | 0.030 | 0.039 | 0.109 | 0.240 | 0.230 | 0.190 | 0.135 | 0.058 | 0.040 | 0.058
SPEC | L-L-S | 0.101 | 0.120 | 0.069 | 0.113 | 0.077 | 0.018 | 0.008 | 0.009 | 0.087 | 0.037 | 0.020 | 0.018 | 0.060 | 0.031 | 0.015 | 0.029
LFCC | L-T-P | 0.068 | 0.290 | 0.260 | 0.260 | 0.068 | 0.059 | 0.048 | 0.056 | 0.069 | 0.218 | 0.191 | 0.188 | 0.056 | 0.080 | 0.063 | 0.050
LFCC | L-A | 0.074 | 0.274 | 0.220 | 0.191 | 0.066 | 0.122 | 0.087 | 0.081 | 0.068 | 0.080 | 0.053 | 0.057 | 0.079 | 0.116 | 0.073 | 0.055
LFCC | L-L-S | 0.057 | 0.059 | 0.042 | 0.044 | 0.064 | 0.051 | 0.044 | 0.049 | 0.064 | 0.180 | 0.127 | 0.110 | 0.052 | 0.169 | 0.144 | 0.150
Table 4. Min t-DCFs (i.e., joint performance with ASV) on the voices generated by the transformation systems across CMs with various features, NN architectures, and loss functions. KA denotes known attacks; L-T-P, L-A, and L-L-S denote LCNN-trim-pad, LCNN-attention, and LCNN-lstm-sum, respectively.

4.2.1. V2D Performance

To evaluate the performance of the spoofing countermeasure system, we use the countermeasure decision score, which indicates the similarity of the given utterance to genuine speech. The equal error rate (EER) is calculated by setting a threshold on the countermeasure decision score such that the false alarm rate is equal to the miss rate. A high EER indicates that the converted speech is more human-like, whereas a lower EER indicates a spoofing countermeasure system that is better at detecting spoofing attacks (i.e., the EER normally lies between 0 and 0.5, and values larger than 0.5 indicate decisions worse than random guessing). Then, the t-DCF metric is used to assess the influence of CM systems on the reliability of an ASV system. The lower the t-DCF, the better the reliability of the ASV.
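The EER described above can be approximated from two lists of countermeasure scores. The following is a minimal sketch rather than the evaluation code used for Tables 3 and 4: the function name is hypothetical, higher scores are assumed to mean "more bona fide", and the threshold sweep is a simple quadratic-time loop that picks the observed score where the two error rates are closest.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for CM scores where higher means bona fide."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for thr in thresholds:
        # Miss: bona fide speech rejected. False alarm: spoofed speech accepted.
        p_miss = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        p_fa = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            # With finitely many scores the two rates may not meet exactly,
            # so report their average at the closest point.
            best = (gap, (p_miss + p_fa) / 2, thr)
    return best[1], best[2]


eer, thr = compute_eer([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
```

In this toy example one bona fide score (0.4) falls below one spoofed score (0.6), so the error rates meet at 25% with a threshold of 0.6.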

In Tables 3 and 4, we summarize and compare the approaches for dealing with varied-length input and several loss functions reported in the recent speech anti-spoofing literature. We list the EERs and t-DCFs on the evaluation set of ASVspoof2019 (LA) and on the output of the five conversion systems. The results show that the sigmoid and P2SGrad loss functions are competitive on unknown attacks, achieving the lowest EERs of 0.16% and 0.13%, respectively, compared to the other loss functions.

4.2.2. V2A Performance

A speaker verification system automatically accepts or rejects a claimed identity of a speaker based on a speech sample. Three metrics are estimated: the EER and two metrics based on log-likelihood ratio (LLR) scores, $C_{\mathrm{llr}}$ (Leeuwen and Brümmer, 2007) (i.e., related to empirical cross-entropy) and $C_{\mathrm{llr}}^{\min}$. Denoting by $P_{\mathrm{fa}}(\theta)$ and $P_{\mathrm{miss}}(\theta)$ the false alarm and miss rates at threshold $\theta$, the EER corresponds to the threshold $\theta_{\mathrm{EER}}$ at which the two detection error rates are equal, i.e., $\mathrm{EER} = P_{\mathrm{fa}}(\theta_{\mathrm{EER}}) = P_{\mathrm{miss}}(\theta_{\mathrm{EER}})$. Then, we use the output scores within both the tandem framework (i.e., EER) and the ‘ZEBRA’ framework (Nautsch et al., 2020) (i.e., all scores). The tandem framework serves as a guide for the optimization of a countermeasure for a given authentication system. The ‘ZEBRA’ framework measures the average level of privacy protection afforded by a given privacy-preserving solution for a population and the worst-case privacy disclosure for an individual.
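The LLR-based cost associated with (Leeuwen and Brümmer, 2007) is commonly written $C_{\mathrm{llr}}$ and averages a logarithmic penalty over target and non-target trials. The sketch below assumes that the ASV scores are calibrated natural-log likelihood ratios; the function name is hypothetical, and the code is an illustration of the standard definition rather than the scoring tool used in our experiments.

```python
import math


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost in bits, for natural-log LLR scores.

    Target trials are penalized for low LLRs and non-target trials for
    high LLRs. Very large magnitudes may overflow math.exp in this sketch.
    """
    tar = sum(math.log2(1.0 + math.exp(-llr)) for llr in target_llrs)
    non = sum(math.log2(1.0 + math.exp(llr)) for llr in nontarget_llrs)
    return 0.5 * (tar / len(target_llrs) + non / len(nontarget_llrs))
```

A system that always outputs an LLR of 0 carries no information and yields a cost of exactly 1 bit, while well-separated and well-calibrated scores drive the cost towards 0.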

4.2.3. Tandem Framework Performance

By combining spoofing detection scores with ASV scores through the t-DCF, we evaluate the impact of spoofing and the performance of spoofing countermeasures upon the reliability of ASV. Both the CM and ASV subsystems will make classification errors. The aim is to assess the performance of the tandem system as a whole, taking into account not only the detection errors of both subsystems but also the assumed prior frequencies. For ‘spoofing performance’ assessment, we consider the ASV and CM results jointly. The minimum normalized t-DCF is defined as $\text{t-DCF}_{\text{norm}}^{\min} = \text{t-DCF}_{\text{norm}}(\tau_{\text{cm}}^{*})$, where $\tau_{\text{cm}}^{*}$ is the optimal CM threshold determined from the given evaluation dataset using the ground truth. The t-DCF metric has 6 parameters: (i) false alarm and miss costs for both systems, and (ii) prior probabilities of target and spoof trials (with an implied third, nontarget prior). Specifically, VC audio files may be rejected by a CM system if the audio contains detectable artifacts. Even if VC audio files pass the CM system, they may still be rejected by the ASV if their speaker similarity is not close enough to the target speakers, as shown in Figure 6.
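As an illustration, the minimum normalized t-DCF can be computed by sweeping the CM threshold while the ASV operating point stays fixed. The sketch below assumes the formulation used in the ASVspoof 2019 evaluation, in which the cost is a weighted sum of the CM miss and false alarm rates; the function name, argument names, and dictionary keys are hypothetical, and the cost values in `ASSUMED_COST` are those commonly reported for ASVspoof 2019, not parameters stated in this paper.

```python
def min_norm_tdcf(bona_cm, spoof_cm, pmiss_asv, pfa_asv, pmiss_spoof_asv, cost):
    """Return (min normalized t-DCF, CM threshold) for a fixed ASV system.

    bona_cm, spoof_cm: CM scores, higher meaning 'more bona fide'.
    pmiss_asv, pfa_asv, pmiss_spoof_asv: ASV target miss rate, non-target
    false alarm rate, and spoof miss rate at the ASV operating point.
    """
    # Weight of CM misses (bona fide speech stopped at the gate).
    c1 = (cost["p_tar"] * (cost["c_miss_cm"] - cost["c_miss_asv"] * pmiss_asv)
          - cost["p_non"] * cost["c_fa_asv"] * pfa_asv)
    # Weight of CM false alarms (spoofs passed on and then accepted by the ASV).
    c2 = cost["c_fa_cm"] * cost["p_spoof"] * (1.0 - pmiss_spoof_asv)
    norm = min(c1, c2)
    if norm <= 0:
        raise ValueError("degenerate cost model: normalizer must be positive")

    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = sorted(set(bona_cm) | set(spoof_cm)) + [float("inf")]
    best_cost, best_thr = None, None
    for thr in thresholds:
        pmiss_cm = sum(s < thr for s in bona_cm) / len(bona_cm)
        pfa_cm = sum(s >= thr for s in spoof_cm) / len(spoof_cm)
        tdcf = (c1 * pmiss_cm + c2 * pfa_cm) / norm
        if best_cost is None or tdcf < best_cost:
            best_cost, best_thr = tdcf, thr
    return best_cost, best_thr


# Assumed cost model (values reported for ASVspoof 2019, not from this paper).
ASSUMED_COST = {"p_tar": 0.9405, "p_non": 0.0095, "p_spoof": 0.05,
                "c_miss_asv": 1, "c_fa_asv": 10, "c_miss_cm": 1, "c_fa_cm": 10}
```

Because the sweep includes the two trivial gates (accept everything and reject everything), the result never exceeds 1.0; a CM whose scores carry no information therefore lands exactly at 1.0, whereas a CM that separates bona fide from spoofed inputs perfectly reaches 0.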

4.2.4. Signal Performance

Two important factors that may clearly affect the performance of speech processing systems, including speaker recognition and spoofing countermeasures, are background noise and voice variability. Background noise is highlighted by (Zheng and Li, 2017) as problematic because during training the speaker often speaks in a clean environment, whereas during testing the speaker speaks in noisy conditions. This disturbs the evaluation and degrades the performance of the speaker recognition system. Voice variability, also known as session variability, is another factor that may affect the performance of these systems; it can be further classified as intra-variation and inter-variation. Intra-variation occurs due to various factors, such as emotions, rate of utterances, mode of speech, disease, the speaker’s mood, and the emphasis given to a word. Inter-variation exists due to anatomical differences and to differences in the speech signals caused by different transmission channels, such as the different types of microphones and headphones used during the recording of speech utterances, where speech data is sampled at either 8 kHz, 16 kHz, or 22 kHz. Thus, to avoid mismatched conditions for the models, in our experiments we adopt 16 kHz for the recordings across all the training and evaluation systems.

4.3. Implications

4.3.1. Effect I: Security Consequences

To analyze the impact of the privacy-preserving solutions, which rely on anonymity achieved through voice transformation tools, we investigate the following question: to what extent can anonymization solutions result in high spoofing risk for ASV and CM?

Approach. Authentication systems are prone to being intentionally fooled by spoofing attacks (i.e., replay, text-to-speech (TTS), and voice conversion (VC)). In our experiments, we include spoofed voices synthesized by both modern TTS (i.e., using the ASVspoof2019 (LA) evaluation set) and VC models. We use converted voices produced by each of the VC systems (i.e., representative of identity anonymization tools).

Observations. Interestingly, despite the difference in the results compared to the known attacks, the countermeasure systems are still able to recognize the converted (i.e., anonymized) output as spoofed input. Although VC may be used to disguise identity, it might also be exploited in inappropriate ways (e.g., deepfake problems such as synthesized fake voice). The reason for this detection may be that all the conversion mechanisms rely on vocoders to reconstruct waveforms, and even with the development of such techniques, their output can still be distinguished from the raw utterances.

Limitations & Future Work. However, a remaining question concerns the robustness of such systems under adversarial attacks: are these countermeasures for ASV robust enough to defend against adversarial examples? Regarding robustness, we want to point out that two important factors may affect the performance of these systems: (1) adversarial inputs and (2) real-world perturbations (i.e., noisy background). Liu et al. (Liu et al., 2019) highlight the vulnerability of some spoofing countermeasures under both white-box and black-box adversarial attacks with the fast gradient sign method (FGSM) and the projected gradient descent (PGD) method. Chettri et al. (Chettri et al., 2020) identify the effect of real-time perturbations on CMs, which could be due to (1) variations within the spoof class, e.g., speech synthesizers not present in the training set, (2) variations within the bonafide class, e.g., due to content and speaker, or (3) additional nuisance factors, e.g., background noise. Thus, the ideal CM should generalize across environments, speakers, languages, channels, and attacks to allow maximum utility across different applications. This is an important direction to be explored in the future, and we seek to include it in our evaluation toward designing secure and private VUI systems. We also seek to use the tandem framework to evaluate systems beyond ASV (i.e., non-biometric) to add a level of security prior to privacy-preserving applications.

4.3.2. Effect II: Privacy Concerns

To analyze the effectiveness of the voice transformation in designing privacy-preserving solutions, we investigate the following question: does hiding identity offer an ideal solution for privacy protection in VUI-based services?

Approach. Due to privacy concerns, such data must often be de-identified or anonymized before it is used or shared. Therefore, we measure the level of privacy provided by these solutions. Specifically, in line with the privacy goal adopted by these solutions, we measure the extent to which the adversary can obtain the identity of the speaker.

Observations. We use the authentication confidence scores across x-vectors (i.e., V2A) to calculate the empirical cross-entropy (ECE). Then, to quantify levels of privacy preservation, the ZEBRA framework uses the ECE value to show: (1) the performance at the population level, where the expected ECE is quantified by integrating out all possible prior beliefs (i.e., 0 bit for full privacy), and (2) the performance at the individual level, where the worst-case strength of evidence is the maximum absolute LLR, i.e., max(|LLR|). As shown in Figure 5, ECEs are presented to simulate all possible prior beliefs (i.e., average/expected performance of the presented privacy-preserving solutions).
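The individual-level (worst-case) measure can be illustrated by mapping the largest absolute LLR to one of the categorical tags of Table 2. The sketch below is a simplified reading of that table, not the ZEBRA reference implementation: it assumes natural-log LLRs, converts them to base-10 odds, and treats each lower bound as inclusive; the function name and the bounds table are assumptions made for the example.

```python
import math

# (lower bound on log10 odds, tag), read off the ranges listed in Table 2.
# Treating the lower bounds as inclusive is an assumption of this sketch.
_TAG_BOUNDS = [(6, "F"), (5, "E"), (3, "D"), (2, "C"), (1, "B")]


def worst_case_tag(llrs):
    """Map max |LLR| (natural log) to (log10 odds, categorical tag)."""
    worst = max(abs(llr) for llr in llrs) / math.log(10)  # nats to log10
    if worst == 0:
        return worst, "0"  # 50:50 decision making of the adversary
    for bound, tag in _TAG_BOUNDS:
        if worst >= bound:
            return worst, tag
    return worst, "A"  # better than 50:50, but fewer than 1 in 10
```

For example, a worst-case LLR of about 3.9 nats corresponds to odds of roughly 50 to 1 and therefore to tag B, whereas a single trial with an LLR of 14 nats is enough to push the whole system to tag F.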

Limitations & Future Work. Recently, the VoicePrivacy initiative (Tomashenko et al., 2020) has promoted the development of anonymization methods that aim to suppress personally identifiable information in speech (i.e., speaker identity), while leaving other attributes such as linguistic content intact. Despite the appeal of anonymization techniques and the urgency to address privacy concerns in the speech domain, the current solutions may be useful from singular perspectives and for achieving a specific goal, like hiding the identity of the speaker, but might fail to sufficiently address other scenarios. Additionally, it is concerning that if a benign user can achieve privacy with a transformation model, a malicious user can also break security by bypassing the spoofing countermeasure mechanism. Besides the spoofing perspective, Srivastava et al. (Lal Srivastava et al., 2020) found that when the attacker has complete knowledge of the VC scheme and target speaker mapping, none of the existing VC methods is able to protect the speaker identity; they classified the attacks based on the attacker’s knowledge of the anonymization method (i.e., ignorant, informed, and semi-informed).

Figure 6. Normalized t-DCF calculated using the t-DCF parameters and ASV errors, with the threshold of each evaluated CM setting set to its optimal value corresponding to perfect calibration over the evaluation sets (i.e., t-DCF as a function of the CM threshold). Above, the performance of a given CM, e.g., ‘LCNN-lstm-sum-P2SGrad’, compared to a bad arbitrary CM.

5. Potential Countermeasures

In this section, we highlight potential defenses considering both security and privacy, and then discuss whether the conflict between them is necessary or not.

5.1. Defense

VUI technologies that allow users to speak to interact with their devices are based on accurate speaker and speech recognition to ensure appropriate responsiveness. While VUIs are offering new levels of convenience and changing the user experience, these technologies raise new and important security and privacy concerns (e.g., spoofing attacks (Wu et al., 2015), false activations (Dubois et al., 2020), attributes attack (Aloufi et al., 2020)). Thus, in the following, we discuss potential defenses that can reduce the risk of spoofing attacks while maintaining privacy.

5.1.1. Privacy by Personalization

In many real-world usage scenarios, our voice may also contain indicators of our identity, mood, emotions, and physical and mental wellbeing that may be used to manipulate us and/or be shared with third parties. This raises privacy concerns owing to the capture and processing of voice recordings that may involve two or more people without their explicit consent. This violates GDPR provisions, for instance. Furthermore, the deep acoustic models used to analyze these recordings may encode more information than needed for the task of interest (i.e., ASR), such as profiles of users’ demographic categories, personal preferences, emotional states, etc., and may therefore significantly compromise their privacy. Current works focus on protecting/anonymizing speaker identity using VC-based mechanisms (Qian et al., 2018; Srivastava et al., 2020). Based on our results in Tables 3 and 4, we show the limitations of these techniques in achieving secure VUI services while maintaining user privacy.

Protecting users’ privacy where speech analysis is concerned continues to be a particularly challenging task. Specifically, privacy-preserving solutions should clearly specify the privacy-utility trade-off in a transparent way: what to protect and what task to achieve. These solutions must also be compatible with these systems without compromising their security. This opens a new possibility for on-device personalization of speech processing models, where personalized models are trained on users’ devices. For example, a combination of federated learning and differential privacy has been proposed to develop on-device speaker verification (Paulik et al., 2021; Sim et al., 2019). Adversarial training can be one solution for learning the representation related to the task of interest. Srivastava et al. (Srivastava et al., 2019b) proposed an on-device encoder to protect the speaker identity, using adversarial training to learn representations that perform well in ASR while hiding speaker identity. Likewise, recent applications have suggested that implementing disentanglement (Aloufi et al., 2020) in learning speech representations can enhance the robustness of these representations and overcome common speaker recognition issues like spoofing attacks (Peri et al., 2020). We hypothesize that learning task-specific speech representations on devices yields a desirable model that meets the needs of individual users, and that this can be achieved in a personalized, privacy-preserving way by fine-tuning a global model using standard optimization methods on data stored locally on a single device, overcoming the current need for VC-based mechanisms.

5.1.2. Configurable Privacy

Protecting privacy requires more than hiding speaker information or running on-device ASR. Privacy is subjective, with attitudes that vary between users and may even depend on the services (and/or service providers) with which these systems communicate. Thus, current solutions may be useful from singular perspectives and for achieving a specific goal, like hiding the identity of the speaker, but might fail to sufficiently address configurable privacy. Recently, Aloufi et al. (Aloufi et al., 2021) advocated the principle of configurable privacy, emphasizing the importance of enabling different privacy settings for optimizing the privacy-utility trade-off and promoting transparent privacy management practices.

Based on our experiments, the output of the tandem framework (i.e., combining verification and authentication) can be helpful in deciding/controlling where to deploy a privacy-preserving solution. For example, if a service does not require authentication (e.g., sharing on social media platforms), we still need to verify that the input speech is a genuine utterance, but the decision threshold of the ASV can be configured to enable access for a non-target speaker. If authentication is required (e.g., smart assistants), the decision threshold can be set to accept only the target speaker (i.e., the results in Figure 6 assume the latter case). In addition, this tandem framework could also be useful if the data owner and the service provider agree on which privacy-preserving/anonymization mechanism to apply to the shared data; such a tandem system should then accept the input generated by this mechanism while restricting other inputs.

5.1.3. Online Watermarking

The concept of speech watermarking (i.e., voice signature) has emerged as an efficient and promising solution to safeguard voice signals. It encrypts a user’s personal information into the voice as an inaudible watermark (Hu et al., 2021). For example, an audio sample can be fingerprinted using its acoustic features, and this fingerprint can then be used to securely verify the user of interest. Thus, well-designed voice watermarking can help tandem systems manage identity security in the voice inputs.

Considering honesty and inviolability as the first step towards privacy preservation, voice signature is a worthy direction towards delivering voice integrity. However, current speech watermarking is designed for fixed-length offline audio files, e.g., meeting recordings, and does not consider the impact of environmental conditions, e.g., bitrate variability and background noise (Zhang et al., 2021). Such environmental factors make watermark embedding and retrieval very challenging. Further, current methods are not designed for real-time speaker recognition systems where input speech is unknown a priori and can be of variable length (Zhang et al., 2021). Therefore, speech watermarking must be extended to address the above challenges (e.g., how and which features to encode) for efficient practical applications.

5.2. Trust vs Trustworthy

Speech is a biometric characteristic of human beings, which can produce distinguishing and repeatable biometric features. Controversy has thus arisen over the risks of privacy and security around it.

5.2.1. Is Conflict a Fundamental Principle?

Privacy as Trust. Should we suppress speaker-related information, including the speaker’s identity, for privacy preservation? Speaker-related information typically involves timbre, pitch, speaking rate, and speaking style. With the growth of advanced speech synthesis techniques, it is also easy to build speech synthesis systems (i.e., anonymization) from acquired data and then generate new speech samples which reflect the voice of a pseudo speaker. The genuine user can use the generated utterances for privacy protection against an automatic speaker verification (ASV) system. The hiding of speaker identity is also referred to as speaker anonymization or de-identification. These solutions propose to prevent access to the identity in order to prevent improper use of it.

Security as Trustworthy. Should we maintain the speaker-related information including his/her identity for security integrity? Voice-based authentication has been implemented in security-sensitive applications (e.g., smart home systems) to enable legitimate access. With the growth of advanced speech synthesis techniques, it is also easy to build speech synthesis systems (i.e., spoofing) from acquired data and then generate new speech samples which reflect the voice of a pseudo speaker. The adversary can use the generated utterances to attack an automatic speaker verification (ASV) system. The hiding of speaker identity is also referred to as speaker spoofing or presentation. These systems are used to gain illegitimate access with claimed identity to services protected by ASV.

We leave the question of deciding whether the trust-trustworthy conflict is fundamental (i.e., how to design the next generation of voice-based applications) as an open question for the research community.

5.2.2. Beyond Voice Analytics

Our experiments so far have focused on the speech processing domain. Synthesis techniques (e.g., generative models) are being developed in every domain, such as images and videos, and these tools have become a tough challenge. Recently, synthesis techniques have been proposed to limit privacy risks by sharing synthetic data instead of real data in a manner that protects privacy and preserves data utility (Oprisanu et al., [n.d.]; Yoon et al., 2020). However, such techniques can also advance the development of deepfake techniques, which depend on generating synthesized samples to attack the target systems. The problem might expand into a broader question of privacy and identity management. Thus, there is an urgent need to develop countermeasure techniques against deepfake consequences.

The need for trustworthy systems that offer end-to-end privacy guarantees is urgent (Rogers et al., 2019). The importance of understanding and accommodating context (i.e., control over deployment/application) is a critical key behind designing privacy-preserving solutions to offer any degree of authenticity and linkability. To be considered privacy-enhancing, such a solution needs to allow the user to choose his required and acceptable degree of anonymization while maintaining the conventional capabilities for identification and authentication.

6. Related Work

In this section, we give an overview of voice conversion technology in terms of its use for privacy protection and the security concerns it raises.

6.1. Voice Conversion

Voice conversion (VC) is part of the general field of speech synthesis, which converts text to speech or changes the properties of speech, for example, voice identity and emotion (Sisman et al., 2021; Aloufi et al., 2019b). VC tools modify speaker-dependent characteristics of the speech signal, such as spectral and prosodic aspects, while maintaining the speaker-independent information (i.e., linguistic). VC enables a wide range of applications including personalized speech synthesis (Veaux et al., 2013; Huang et al., 2020), speaker de-identification (Qian et al., 2018; Tomashenko et al., 2020; Ahmed et al., 2020; Yoo et al., 2020), and voice disguise (Wang et al., 2020).

Preserving Voice Privacy. VC mechanisms have proven effective at filtering out the speaker-related voice biometrics present in speech data without altering the linguistic content, thus preserving the usefulness of the shared data while protecting the users’ privacy. Most of the proposed works focus on protecting/anonymizing the speaker identity using these mechanisms (Qian et al., 2018; Srivastava et al., 2019b, 2020). For example, VoiceMask builds upon voice conversion to perturb the speech and then sends the sanitized speech audio to the voice input apps (Qian et al., 2018). Similarly, Srivastava et al. in (Srivastava et al., 2020) propose an x-vector-based anonymization scheme to convert any input voice into a random pseudo-speaker based on the selected gender and region of x-vector space of the target pseudo-speaker. Yoo et al. in (Yoo et al., 2020) propose an algorithm that produces anonymized speech by adopting many-to-many voice conversion techniques based on variational autoencoders (VAEs) and modifying the speaker identity vectors of the VAE input to anonymize the speech data. Although these VC methods may provide some identity protection against less knowledgeable attackers (i.e., linkage attacks), they are unable to defend against an attacker that has extensive knowledge of the type of conversion and how it has been applied (Lal Srivastava et al., 2020).
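To make the idea of mapping a speaker to a pseudo-speaker in x-vector space concrete, the following is a minimal sketch and not the procedure of any cited work. It assumes that speaker embeddings are plain lists of floats, that a pool of candidate embeddings is available, and that the pseudo-speaker is formed by averaging a random subset of the pool vectors farthest from the source (by cosine distance). The function names, the parameters `n_farthest` and `n_average`, and the omission of gender filtering are all illustrative assumptions.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pseudo_speaker_vector(source, pool, n_farthest=4, n_average=2, seed=0):
    """Build a pseudo-speaker embedding for `source` (hypothetical scheme).

    1. Rank pool embeddings by cosine distance from the source, farthest first.
    2. Keep the `n_farthest` most distant candidates.
    3. Average a random subset of `n_average` of them.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    ranked = sorted(pool, key=lambda v: cosine_distance(source, v), reverse=True)
    candidates = ranked[:n_farthest]
    chosen = rng.sample(candidates, min(n_average, len(candidates)))
    dim = len(source)
    return [sum(v[i] for v in chosen) / len(chosen) for i in range(dim)]


# Toy 2-dimensional embeddings; real x-vectors have hundreds of dimensions.
source = [1.0, 0.0]
pool = [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [-0.8, -0.3], [0.0, 1.0], [0.1, -1.0]]
pseudo = pseudo_speaker_vector(source, pool, n_farthest=2, n_average=2)
```

Because the selection rule is deterministic given the pool and seed, an attacker who knows the rule could in principle reproduce or invert it, which is consistent with the limitation noted above for informed attackers.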

Besides the speaker identity, various works have proposed to use VC-based mechanisms to protect a speaker’s gender (Jaiswal and Provost, 2019) or emotions (Aloufi et al., 2019a). Champion et al. in (Champion et al., 2021) propose to alter other paralinguistic features (i.e., F0) and analyze the impact of this modification across gender. They found that the proposed F0 modification always improves pseudonymization, and that both source and target speaker genders affect the performance gain when modifying the F0. In (Aloufi et al., 2019a), an edge-based system is proposed to filter patterns from a user’s voice before sharing it with cloud services for further analysis. Likewise, Vaidya et al. in (Vaidya and Sherr, 2019) introduce an audio sanitizer, a software audio processor that filters and modifies the voice characteristics of the speaker in audio commands before they leave the client device by altering speech features (i.e., the short-term spectral features, spectro-temporal features, and high-level features) in these commands.

Voice Spoofing. VC poses a significant security threat wherever the voice is used as an authenticator (Vaidya and Sherr, 2019). VC has recently become one of the most easily accessible techniques to carry out spoofing attacks, presenting a threat to automatic speaker verification (ASV) systems. There are at least four major classes of spoofing attacks: impersonation, replay, speech synthesis, and voice conversion (Marcel et al., 2019). The execution of speech synthesis and voice conversion attacks usually requires sophisticated speech technology. Speech synthesis systems can be used to generate entirely artificial speech signals, whereas voice conversion systems operate on natural speech (Wu and Li, 2014). With sufficient training data, both speech synthesis and voice conversion technologies can produce high-quality speech signals that mimic the speech of a specific target speaker and are highly effective in manipulating ASV systems. Such synthetic speech can be used to spoof voice authentication systems and gain access to the user’s private resources (e.g., fraud attacks).

Awareness of this threat has spawned research on anti-spoofing, including techniques to distinguish between bona fide and spoofed biometric data. These solutions are referred to as spoofing countermeasures or presentation attack detection systems (Wang et al., 2020). For example, Zheng et al. in (Zheng et al., 2021) propose an approach that estimates the restoration function by minimizing a function of ASV scores, in order to improve the defense against automatic voice disguise (AVD) conducted by VC-based methods. Improved conversion technologies have thus also led to concerns about security and authentication. It is therefore desirable to be able to prevent one’s voice from being improperly used with such voice conversion technologies.
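As a rough illustration of how a spoofing countermeasure (CM) can be placed in tandem with an ASV system, the sketch below gates acceptance on both scores. It is a simplified cascade under stated assumptions, not the evaluation protocol of this paper: both scores are assumed to be scalars where higher means "more bona fide" and "more likely the claimed speaker" respectively, and the thresholds and the names `TandemThresholds` and `tandem_decision` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TandemThresholds:
    cm: float   # assumed countermeasure threshold (higher score = more bona fide)
    asv: float  # assumed ASV threshold (higher score = closer to claimed speaker)


def tandem_decision(cm_score, asv_score, thresholds):
    """Accept only if the sample passes the CM gate and then the ASV gate."""
    if cm_score < thresholds.cm:
        return "reject: flagged as spoofed"
    if asv_score < thresholds.asv:
        return "reject: speaker mismatch"
    return "accept"


# Hypothetical operating point and scores, for illustration only.
thresholds = TandemThresholds(cm=0.5, asv=0.6)
genuine = tandem_decision(cm_score=0.9, asv_score=0.8, thresholds=thresholds)
converted = tandem_decision(cm_score=0.2, asv_score=0.8, thresholds=thresholds)
impostor = tandem_decision(cm_score=0.9, asv_score=0.3, thresholds=thresholds)
```

Under these assumed scores, a voice-converted utterance is rejected by the CM gate regardless of who produced it, which is one way to see why a transformation applied for privacy and one applied for spoofing can look alike to such a system.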

6.2. Privacy Exposure

Deep learning has been a driving force in research and practice across speech application domains, raising the need to study what causes privacy leaks and under which conditions a model is sensitive to different types of privacy-related attacks. In privacy-related attacks, the goal of an adversary is to gain knowledge that was not intended to be shared. Such knowledge can concern the training data, the model itself, or attributes of the data, such as unintentionally encoded information (Nasr et al., 2019).

Membership Inference Attacks. In membership inference attacks (MIAs), the attacker aims to identify whether a data record was used to train a machine learning model (Nasr et al., 2019). The attack is driven by the different behaviors of the target model when making predictions on samples within or outside its training set (Chen et al., 2020; Murakonda and Shokri, 2020). Song and Shmatikov (Song and Shmatikov, 2019) discuss the application of user-level membership inference to text generative models, exploiting several top-ranked outputs of the model. In the speech domain, Miao et al. in (Miao et al., 2020) examine user-level membership inference (i.e., whether a user has any data within the target model’s training set) in the problem space of voice services, by designing an audio auditor to verify whether a specific user had unwillingly contributed audio used to train an automatic speech recognition (ASR) model under strict black-box access. Song et al. in (Song et al., 2019) combine the privacy and security domains by using the success rate of membership inference attacks to reflect the information leakage of training algorithms about individual members of the training set.
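The observation that a model behaves differently on training and non-training samples can be illustrated with a simple loss-threshold attack. This is a generic sketch, not the method of any work cited above: it assumes the attacker can obtain the model's output probabilities for a record, and has a small calibration set of known members and non-members. The function names and the toy loss values are hypothetical.

```python
import math


def cross_entropy(probs, label):
    """Per-record loss of the target model: -log p(true label)."""
    return -math.log(max(probs[label], 1e-12))


def calibrate_threshold(member_losses, non_member_losses):
    """Pick the loss threshold that best separates known members from non-members.

    Records with loss <= threshold are predicted to be training members,
    since models tend to fit their training data more closely.
    Ties in accuracy keep the smallest threshold.
    """
    candidates = sorted(set(member_losses) | set(non_member_losses))
    total = len(member_losses) + len(non_member_losses)
    best_threshold, best_accuracy = candidates[0], 0.0
    for t in candidates:
        correct = sum(loss <= t for loss in member_losses)
        correct += sum(loss > t for loss in non_member_losses)
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


def infer_membership(loss, threshold):
    """Predict membership for a new record from its loss."""
    return loss <= threshold


# Toy calibration losses (hypothetical values).
member_losses = [0.05, 0.10, 0.20, 0.08]
non_member_losses = [0.90, 1.40, 0.60, 0.15]
threshold, accuracy = calibrate_threshold(member_losses, non_member_losses)
```

The attack accuracy on the calibration set gives a crude measure of how much the model leaks about membership; a value near 0.5 on balanced data would indicate little separation between members and non-members.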

Reconstruction Attacks. In a reconstruction attack, the attacker aims to infer attributes of the records in the training set (Nasr et al., 2019; Dwork et al., 2017) by leveraging publicly accessible data that are not explicitly encoded as features or are not correlated to the learning task. ‘Overlearning’ may reveal privacy- and bias-sensitive attributes that are not part of the target objective (Song and Shmatikov, 2020). In the speech domain, it is possible to accurately infer a user’s sensitive and private attributes (e.g., their emotion, sex, or health status) from deep acoustic models (e.g., DeepSpeech2). An attacker (e.g., a ‘curious’ service provider) may use an acoustic model trained for speech recognition or speaker verification to learn further sensitive attributes from user input even if they are not present in its training data (Aloufi et al., 2020). Linkage attacks can be designed depending on the attackers’ knowledge about the anonymization scheme to infer the speaker’s identity (Srivastava et al., 2019a). These types of attributes can lead to secondary uses: they may be used to target content, data brokers might profit from selling this information to other parties such as advertisers and insurance companies, and surveillance agencies may use these attributes to recognize users and track their activities and behaviors.

7. Conclusion

This work represents a step towards understanding the security risks of anonymization tools using a tandem evaluation framework. We show both empirically and analytically that (i) there exist intriguing effects between the two vector domains, (ii) an adversary can exploit these effects to optimize attacks with respect to multiple metrics, and (iii) designing effective countermeasures against the potential security and privacy attacks on VUI-based systems requires carefully accounting for such effects. We believe our findings shed light on the inherent vulnerabilities of VUIs deployed under realistic settings. This work also opens a few avenues for further investigation. Devising a unified evaluation framework accounting for both security and privacy may serve as a promising starting point for developing effective countermeasures. The detailed analysis in our paper highlights the importance of considering security and privacy in combination.