Voice-controlled IoT devices and smart home assistants have gained huge popularity on our devices and in our households. Intuitive interaction between users and services is enabled by analyzing speech signals. For example, smart assistants (e.g., Google Assistant, Amazon Echo, and Apple Siri) and voice browsing (e.g., Google Search) use Voice User Interfaces (VUIs) to activate the voice assistant to control IoT devices or perform tasks such as browsing the Internet and/or making recommendations. Figure 1 (A) shows an overview of how these systems work. Although devices often suffer from frequent false activations (Dubois et al., 2020), it all begins with some kind of trigger such as ‘Okay, Google’, ‘Alexa’, and ‘Hey, Siri’ to inform the system that speech-based data will be received. Once a voice stream is captured by a device, it outsources analysis to cloud services such as automatic speech recognition (ASR), speaker verification (SV), and natural language processing (NLP), where higher performance is achievable. This frequently involves communicating instructions to other connected devices, appliances, and third-party systems. Finally, text-to-speech services are often employed in order to speak back to the user.
Our voice signal is a rich source of personal and sensitive data. It contains indicators of a variety of emotions, physical and mental health and well-being, etc., and thus raises unprecedented security and privacy concerns where raw data is transmitted to third parties. The signal contains linguistic and paralinguistic information such as age, gender, health status, personality, friendliness, mood, and emotions (Schuller and Batliner, ).
Today, deep learning models are playing a pivotal role in speech signal processing to enable natural and intuitive communication with our smart devices. For example, recent end-to-end (E2E) automatic speech recognition systems rely on an autoencoder architecture as a way of folding the separate acoustic, pronunciation, and language models (AM, PM, LM) of a traditional ASR system into a single neural network (Chan et al., 2016; Chiu et al., 2018; Wang et al., 2019; Hannun et al., 2014), as shown in Figure 1 (B). These models train by ingesting speech spectrograms as alternative frequency-based representations for speech signals and generate text transcriptions. The encoder encodes the input acoustic feature sequence into a vector, which encapsulates the information of its input to help the decoder in predicting the sequence of symbols. Although these models have comparable performance with conventional models (Chiu et al., 2018), they have been designed without considering potential privacy vulnerabilities, given the need to train on real voice data, which contains a significant amount of sensitive information.
Attribute inference attacks aim to reveal individuals’ sensitive attributes (e.g. emotion, gender, health status, etc.) that they did not intend to share. Several privacy violations may arise by obtaining these sensitive data without individuals’ awareness or permission. In this paper, we focus on an adversarial privacy leakage scenario of deep representations for speech processing tasks. In particular, we focus on the probability of inferring sensitive attributes using deep acoustic models that perform different operations like speech-to-text translation or speaker recognition. For example, an attacker may use an acoustic model trained for speech recognition or speaker verification to learn further sensitive attributes from user input even if not present in its training data, as shown in Figure 1 (C). The attacker may use the output of these models to train classifiers to infer private attributes. We can measure an attack’s success as the increase in inference accuracy over random guessing (Yeom et al., 2018), and we find that an attacker can achieve high accuracy in inferring sensitive attributes, ranging from 40% to 99.4%, which is three or four times better than guessing at random, depending on the acoustic conditions of the input. We discuss this further in Section 6.1.
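As a minimal sketch of this success measure, the snippet below compares an attacker's accuracy with the random-guess baseline. It assumes balanced classes, so that random guessing scores 1/(number of classes); the function names and the example numbers are hypothetical and are not taken from the experiments.

```python
def random_guess_accuracy(num_classes: int) -> float:
    """Accuracy of uniform random guessing, assuming balanced classes."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    return 1.0 / num_classes


def attack_advantage(attack_accuracy: float, num_classes: int) -> float:
    """Increase in inference accuracy over random guessing."""
    return attack_accuracy - random_guess_accuracy(num_classes)


def attack_gain(attack_accuracy: float, num_classes: int) -> float:
    """How many times better than random guessing the attacker is."""
    return attack_accuracy / random_guess_accuracy(num_classes)


# Hypothetical example: a four-class attribute inferred with 80% accuracy.
advantage = attack_advantage(0.80, 4)
gain = attack_gain(0.80, 4)
```

Under this reading, a defense succeeds when the advantage drops to zero or below, i.e. the attacker does no better than guessing.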
In order to limit the success of such attacks, we propose a user-driven framework designed to offer a practical defense against attribute inference attacks. A challenge in designing such a framework is to account for individuals’ privacy preferences in sharing their data. Precisely, different users may have differing privacy preferences on the type of analytics that can be done on their data, depending on the devices and services with which they are interacting. For instance, when contacting a health service provider, a user may prefer to share raw data without filtering it, while a user may prefer to filter sensitive data when interacting with advertising companies. To address this challenge, the proposed framework works in two phases. In Phase I, the user adjusts their privacy preference, where each of the preferences is associated with a set of tasks (e.g. speech recognition) that can be performed on the user data. In Phase II, we take advantage of learning disentangled representations (van den Oord et al., 2017) of the observed data to explicitly drive each dimension to reflect independent factors for a particular task.
Finally, we evaluate the proposed framework’s efficiency against this class of attacks using various datasets, which were recorded under different acoustic conditions (IEMOCAP (Busso et al., 2008), RAVDESS (Livingstone and Russo, 2018), SAVEE (Haq et al., 2008), LibriSpeech (Panayotov et al., 2015), and VoxCeleb (Nagrani et al., 2017)) to simulate the real-time environment in which voice recordings are collected. The results show the effectiveness of our proposed framework in reducing the attacker's success rate in identifying sensitive attributes to no better than random guessing.
Contribution. Our contributions can be summarized as follows:
We show the vulnerability of underlying acoustic models used by speech processing tasks under attribute inference attack scenarios. An attacker may exploit the predictions of such models to learn further information about users. We measure the success of these attacks by the increase in inference accuracy over random guessing. We demonstrate the importance of developing privacy-preserving solutions that can run at the edge, i.e. before sharing data with service providers.
We propose and develop a privacy-aware, configurable defence framework against attribute inference attacks. We design it to include users’ privacy preferences in managing the privacy-utility trade-off inherent in data sharing. Precisely, we allow a user to explicitly adjust the disentangled representation of his/her preference, learned by the framework from his/her data. According to our experimental results, we conclude that the controllability enabled by the disentanglement may define a new direction in developing privacy-preserving applications that satisfy the transparency principle.
We experimentally evaluate the proposed framework over various datasets, and the results show the effectiveness of the proposed framework in confronting this type of attack by filtering the sensitive attributes while maintaining high accuracy, i.e. >99%, for the tasks of interest (audio snippets are available at https://tinyurl.com/y932f37m, and our code will be available open source upon this paper’s acceptance).
The paper is organized into nine sections. Following this introduction, we provide a general background about disentanglement in Section 2 and formulate the problem description in Section 3. The proposed defense framework is given in Section 4. Section 5 presents the experimental settings and the attribute inference attack models. We evaluate the experimental results in Section 6, and provide discussion and highlight directions for future work in Section 7. We discuss related work in Section 8 prior to concluding with Section 9.
In this section, we provide a brief overview of the necessary technical background about disentanglement and its models.
2.1. Learning Disentangled Representation
There has been notable recent interest in learning disentangled representations in various domains, such as computer vision(Hadap et al., 2020), ML fairness (Sarhan et al., 2020; Marx et al., 2019), and domain adaptation (Tsai et al., 2019; Peng et al., 2019), as they promise to enhance robustness, interpretability, and generalization to unseen examples on downstream tasks. The overall goal of disentangling is to improve the quality of the latent representations by explicitly separating the underlying factors of the observed data (Kim and Mnih, 2018). For example, in computer vision, there is a variety of tasks that have benefited from disentangled representations like pose-invariant recognition (Reed et al., 2014), attribute transfer via adversarial disentanglement (Zhao et al., 2019), and person re-identification (Eom and Ham, 2019).
There is a growing trend towards learning disentangled representations in the speech domain. A speech signal simultaneously encodes linguistically relevant information, e.g. phonemes, and linguistically irrelevant information such as speaker characteristics. In the case of speech processing, an ideal disentangled representation would be able to separate fine-grained factors (Gong and Poellabauer, 2018) such as speaker identity, noise, recording channels, and prosody, as well as the linguistic content. Thus, disentanglement will allow learning of salient and robust representations from speech that are essential for applications including speech recognition (Park et al., 2019), prosody transfer (Sun et al., 2020; Zhang et al., 2019b), speaker verification (Peri et al., 2020), speech synthesis (Sun et al., 2020; Hu et al., 2020), and voice conversion (Huang et al., 2020), among other applications.
2.2. Disentanglement Models
Most prior works on disentangled representation learning are based on well-established frameworks, such as variational autoencoders (VAEs) (Kingma and Welling, 2013) and generative adversarial networks (GANs) (Goodfellow et al., 2014), for learning disentangled and hierarchical representations. They build on the original objectives of these models and derive regularizations that strengthen the disentanglement in order to learn compact and meaningful representations. These works can be categorized into three groups according to the model they depend on: VAE-based models (Higgins et al., 2017; Lample et al., 2017; Kulkarni et al., 2015), GAN-based models (Chen et al., 2016; Choi et al., 2018; Kim et al., 2017), and combinations of AEs and GANs (Lee et al., 2020; Makhzani et al., 2015; Engel et al., 2017). While extensive progress was made by these prior works in the computer vision domain, little has been done for speech processing.
Learning speech representations that are invariant to variabilities in speakers, language, environments, microphones, etc., is incredibly challenging (Latif et al., 2020). To address this challenge, several variants of VAEs have recently been proposed for learning robust disentangled representations, owing to their generative nature and distribution learning abilities. Hsu et al. (2017) propose the Factorized Hierarchical VAE (FHVAE) model to learn hierarchical representations of sequential data such as speech at different time scales. Their model aims to separate sequence-level from segment-level attributes to capture multi-scale factors in an unsupervised manner. Similarly, Predictive Aux-VAE (Springenberg et al., 2019) was proposed to obtain speech representations at different timescales by disentangling local (content) from global (speaker) information. Although the focus of these works is to raise the efficiency and effectiveness of speech processing applications (e.g. speech recognition, speaker verification, and language translation), in this paper we highlight the benefit of learning disentangled representations to obtain privacy-preserving speech representations, and show how disentanglement can be useful in transparently protecting user privacy.
3. Problem Description
In this section, we present our threat model and explain the goals of the user, the potential attribute inference attacker, and the defender in this context.
Honest data owners provide information to cloud service providers to maximize their utility, but under the assumptions that the sensitive information in their shared data should be protected. They agree on the use of data for a target task, while they do not consent to the performance of additional analysis on their data that may violate their privacy. In the voice control scenario, while users (data owners) may agree to share their voice recordings for speech recognition and accurate execution of their command, they might want to protect their sensitive information (e.g., emotion or health status) such that no secondary inferences are made from the data. For example, Amazon has patented the technology to analyze users’ voices to determine their emotions and/or mental health conditions. This allows understanding speaker commands and responses according to their feeling to provide highly targeted content (Jin and Wang, 2018).
Our attack aims to correctly infer sensitive attributes (e.g., gender, emotion, and health status) about data owners by exploiting a secondary use of the same data collected for the main task. Specifically, the attacker could be any party (e.g., a service provider, advertiser, data broker, or a surveillance agency) which has an interest in data owners’ sensitive attributes. Service providers could use these attributes for targeting content; data brokers might profit from selling this information to other parties such as advertisers and insurance companies; and surveillance agencies may use these attributes to recognize users and track their activities and behaviour. In this paper, we focus on the following question: to what extent can such an attacker infer data owners’ sensitive attributes which they prefer not to share, and to what extent can this be prevented? To answer this, we assume that the attacker has white-box knowledge (i.e. parameters and target model architecture) and a machine learning classifier that is trained on data owners’ data to predict their sensitive attributes.
The goal of the privacy-preserving framework in this paper is to protect the sensitive attributes of data shared from the potential attribute inference attacks launched by a curious attacker. We propose a privacy-aware defense framework controlled by the data owner to filter the raw data at the edge before sharing it with cloud service providers, as shown in Figure 2. The proposed framework works as a bridge between the data owners and the service providers to allow privacy-preserving communication between them. This framework receives the raw data as well as user preferences as auxiliary information, then it uses the user preference to filter out sensitive attributes they want to protect, which would be otherwise contained in their shared data.
In Algorithm 1, we present the overall workflow of the proposed framework to reconstruct the filtered data by learning disentangled representations. We call the proposed framework the Dual-phase Disentangled Filter (DDF). Firstly, the DDF receives its inputs, which are the raw data and the user's privacy preference, chosen from the options provided by the DDF: high, moderate, and low. These options are adjustable according to user preference, which may change for differing application domains, service contexts, etc. A privacy preference is associated with a set of tasks, resulting in a list of tasks that can be performed on the raw data. Phase II begins by checking the contents of the privacy preference list. In the case that it is empty, the user prefers to share his/her data without filtering it. Otherwise, the raw data along with the privacy preference list will be passed to the disentangle module, which starts different branches, and each branch attempts to learn independent information related to a specific task. After the disentanglement, the decoder reconstructs the filtered data by receiving the concatenation of the outputs of the desired branches.
4. Dual-Phase Disentangled Filter
In this section, we describe our approach to include user privacy preferences in the filtering process by learning disentangled representation.
We focus on the setting where the users’ preferences serve as a control signal over a utility-privacy optimization problem. The users’ inclusion can enable them to manage their information flow and potentially make better decisions on sharing their data to reduce privacy concerns. However, the major challenge to adjust this setting is how to learn disentangled and robust representations from the users’ input that reflect their privacy preference. To tackle this challenge, we propose a DDF framework that builds upon VAEs (Kingma and Welling, 2013) to encourage learning these disentangled latent representations and then using users’ preferences to filter out unwanted representations. Inspired by recommender systems, giving users explicit control over the filtering process can enhance explainability and transparency in sharing their data.
In Phase I (Optimization), we categorize users’ preferences into a set of options, which may be based on the application domain (e.g. audio analysis). Each option is associated with a set of tasks. More precisely, when a user specifies a preference option, the tasks associated with that option will achieve high accuracy, while the remaining tasks may have low accuracy. The relation between a preference option and a task is a binary indicator: a value of 1 indicates that the preference explicitly adopts the task, whereas a value of 0 means there is no relation between the two.
In Phase II (Filtering), by leveraging the information-theoretic interpretation of VAEs, we propose an autoencoder architecture with a disentangle module to explicitly decouple the distinct factors in the raw data. Firstly, the disentangle module, which is the key module in the proposed framework, receives a user’s raw data and privacy preference. Based on the preferred option, the disentangle module starts a particular branch for each associated task. Each branch aims to learn task-specific representations, while ignoring task-invariant representations. Then, the outputs of the branches for the target tasks are concatenated to form the disentangle output. Finally, the decoder uses the disentangle output to reconstruct the filtered data.
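The Phase II data flow described above can be sketched in a few lines. This is only an illustration of the control flow (select branches, concatenate, decode), not the framework's implementation: the branch and decoder functions are hypothetical stand-ins for trained networks, and the task names `asr` and `sv` are labels chosen here for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Branch = Callable[[Sequence[float]], Vector]


def ddf_filter(raw: Sequence[float],
               allowed_tasks: Sequence[str],
               branches: Dict[str, Branch],
               decoder: Callable[[Vector], Vector]) -> Vector:
    """Run only the branches of the allowed tasks, then decode their concatenation."""
    if not allowed_tasks:
        # An empty preference list means the user shares the data unfiltered.
        return list(raw)
    latent: Vector = []
    for task in allowed_tasks:
        latent.extend(branches[task](raw))  # task-specific representation
    return decoder(latent)


# Hypothetical stand-ins for trained networks (illustration only).
def toy_content_branch(raw: Sequence[float]) -> Vector:
    return [1.0 if v >= 0 else -1.0 for v in raw]


def toy_speaker_branch(raw: Sequence[float]) -> Vector:
    return [sum(raw) / len(raw)]


def toy_decoder(latent: Vector) -> Vector:
    return list(latent)


BRANCHES: Dict[str, Branch] = {"asr": toy_content_branch, "sv": toy_speaker_branch}
filtered = ddf_filter([0.5, -0.25], ["asr"], BRANCHES, toy_decoder)
```

Because branches that are not selected are never run, their information cannot reach the decoder, which is the property the filtering step relies on.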
4.2. DDF for Speech Representation
Leveraging the multi-scale nature of sequences such as speech, text, and video, distinct factors can be captured at different timescales (Hsu et al., 2017). For example, in speech signals the phonetic content affects the segment level, while the speaker characteristic affects the sequence level. Thus, the speech signal can be disentangled into several independent factors, each of which carries a different type of information. In our context, the idea is to disentangle the factors related to the task we want to compute. We aim to demonstrate the effectiveness of learning disentangled representation in preserving the sensitive attributes in the user data. Besides, this disentanglement can also be beneficial to promote transparency in protecting users’ privacy. Figure 3 presents our use of the disentangled representation to enable users’ control over the data they want to share.
4.2.1. Phase I
We consider three preference options: high , moderate and low . We also suppose there are three main tasks that can be performed on the user data: speech recognition , speaker verification , and others (i.e. emotion and gender recognition) . For each option , we associate a set of tasks . For example, when a user specifies a preference option , the user’s raw data will be used for the , while the rest of the tasks and will get mistaken results. As the relation between the preference option and the task is denoted by , then = 1, whereas and = 0. Similarly, when the user selects a preference option , then and = 1, whereas = 0. For the last preference option , , which means no filter operation will be done over the user’s raw data .
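One way to picture the preference-task relation is as a small binary table. The sketch below rests on an assumed assignment that should be checked against the authors' definitions: the high option keeps only speech recognition, the moderate option keeps speech recognition and speaker verification, and the low option keeps every task (no filtering). The identifiers are hypothetical.

```python
TASKS = ("speech_recognition", "speaker_verification", "others")

# ASSUMPTION: which tasks each option adopts (1) or filters out (0).
RELATION = {
    "high":     {"speech_recognition": 1, "speaker_verification": 0, "others": 0},
    "moderate": {"speech_recognition": 1, "speaker_verification": 1, "others": 0},
    "low":      {"speech_recognition": 1, "speaker_verification": 1, "others": 1},
}


def allowed_tasks(preference: str) -> list:
    """Tasks whose relation to the chosen preference option equals 1."""
    row = RELATION[preference]
    return [task for task in TASKS if row[task] == 1]


def needs_filtering(preference: str) -> bool:
    """Filtering is required whenever at least one task is excluded."""
    return len(allowed_tasks(preference)) < len(TASKS)


for option in ("high", "moderate", "low"):
    print(option, allowed_tasks(option), needs_filtering(option))
```

Keeping the relation in a table rather than in code paths makes it straightforward to add options or tasks for other application domains.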
4.2.2. Phase II
Intuitively, autoencoders use an encoding network to extract a latent representation, which then passes through a decoding network to recover the original data. Autoencoders are trained to minimize the reconstruction error between the encoded-decoded data and the raw data. A VAE is an autoencoder whose encoding distribution is regularized during training to ensure that its latent space captures useful representations and allows the generation of new data. A VAE consists of the following main parts: an encoder network for modelling a posterior distribution $q(z \mid x)$ of discrete latent random variables $z$ given the input data $x$, a prior distribution $p(z)$, and a decoder with a distribution $p(x \mid z)$ over the input data. The training objective decomposes into the reconstruction loss of a standard autoencoder and the Kullback-Leibler (KL) divergence between the prior $p(z)$ and the posterior distribution $q(z \mid x)$. The joint minimization of both losses leads to reasonable reconstruction while reducing the latent space dimension at the same time.
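To make the two-term objective concrete, the following sketch computes a reconstruction loss plus a KL term. It assumes the common Gaussian setting (a diagonal Gaussian posterior and a standard normal prior), where the KL divergence has a closed form; this is a simplification chosen for illustration and differs from the discrete latents discussed in the text. The weight `beta` and all function names are hypothetical.

```python
import math
from typing import Sequence


def reconstruction_loss(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vae_loss(x: Sequence[float], x_hat: Sequence[float],
             mu: Sequence[float], log_var: Sequence[float],
             beta: float = 1.0) -> float:
    """Reconstruction term plus (optionally weighted) KL regularizer."""
    return reconstruction_loss(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


loss = vae_loss([1.0, 2.0], [0.9, 2.1], mu=[0.1, -0.2], log_var=[0.0, 0.0])
```

The KL term is zero only when the posterior equals the prior, so it pulls the encoder away from memorizing inputs while the reconstruction term pulls the other way.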
In the speech domain, there are different variations of VAE that aim to learn disentangled representations (Hsu et al., 2017; Sun et al., 2020) to allow disentangling and controlling different attributes within the speech signal such as speech content, speaker identity, and emotion. Thus, to achieve our goal of learning disentangled representations for privacy preservation purposes, we use different methods to obtain these representations. Details about the implementation of each module are as follows:
Disentangle We intend to disentangle speech representations from the input speech explicitly into several factors that can be used independently for different tasks. To achieve this, we divide the disentangle module into separate branches to force learning diverse types of information (Mathieu et al., 2016). We use a combination of objectives to encourage these different branches to learn task-related factors. Assuming we have two basic tasks, speech recognition and speaker verification, that we want to maintain, we have two branches to learn independent factors for each.
Branch 1 ()
Inspired by the Vector Quantized VAE (VQ-VAE) (van den Oord et al., 2017), we perform vector quantization to extract the phonetic content while being invariant to low-level information. The VQ-VAE model aims to produce a discrete latent space using Vector Quantization (VQ) techniques. During the forward pass, the output of the encoder $z_e(x)$ is mapped to the closest entry in a discrete codebook $e = [e_1, e_2, \dots, e_K]$. Precisely, VQ-VAE finds the nearest codebook entry using Eq.1 and uses it as the quantized representation $z_q(x) = e_k$, which is passed to the decoder as content information.
The transition from $z_e(x)$ to $z_q(x)$ is not differentiable, so its gradient is approximated with a straight-through estimator (et al., 2013). VQ-VAE is trained using a sum of three loss terms (in Eq.2): the negative log-likelihood of the reconstruction, which uses the straight-through estimator to bring the gradient from the decoder to the encoder, and two VQ-related terms, namely the distance from each prototype to its assigned vectors and the commitment cost (van den Oord et al., 2017).
Here, sg(·) denotes the stop-gradient operation that zeros the gradient with respect to its argument during the backward pass.
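For reference, the two relations that the text cites as Eq.1 and Eq.2 take the following form in the original VQ-VAE formulation (van den Oord et al., 2017). This restates that paper's notation and is an assumption about the exact form used here: $z_e(x)$ is the encoder output, $z_q(x)$ its quantized version, $e_j$ the codebook entries, $K$ the codebook size, and $\beta$ the weight of the commitment cost.

```latex
\begin{align}
z_q(x) &= e_k, \qquad k = \arg\min_{j \in \{1,\dots,K\}} \lVert z_e(x) - e_j \rVert_2 \\
\mathcal{L} &= -\log p\bigl(x \mid z_q(x)\bigr)
  + \bigl\lVert \mathrm{sg}[z_e(x)] - e \bigr\rVert_2^2
  + \beta \, \bigl\lVert z_e(x) - \mathrm{sg}[e] \bigr\rVert_2^2
\end{align}
```

The second term moves the codebook entries towards the encoder outputs assigned to them, while the third (commitment) term keeps the encoder outputs close to their chosen entries.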
By using vector quantization as a regularizer, the encoder in this branch is encouraged to extract content-specific representations and discard the invariant representations that the decoder can infer from the information of the other branch for reconstruction purposes. Alternatively, we can use the output of this branch as a speech embedding to train models that use these discrete representations directly to translate from speech to text instead of reconstruction, which may bring a significant improvement in privacy protection when sharing speech data, as shown in Figure 3. For example, similar to VQ-VAE (van den Oord et al., 2017), vq-wav2vec (Baevski et al., 2020) was proposed to quantize the dense representations from the speech segments by implementing either a Gumbel-Softmax or online k-means clustering. The authors then apply well-performing NLP algorithms (e.g. BERT) to these quantized representations and report promising state-of-the-art results in phoneme classification and speech recognition.
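The nearest-codebook lookup at the heart of this branch can be sketched as follows. This is a toy illustration in plain Python, not the trained model: the codebook values are made up, and the commitment weight of 0.25 is the default suggested in the VQ-VAE paper, assumed here rather than taken from this work. Only forward values are computed; the stop-gradient operator affects gradients and has no effect on these numbers.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z_e: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the index and value of the codebook entry closest to z_e."""
    k = min(range(len(codebook)),
            key=lambda j: squared_distance(z_e, codebook[j]))
    return k, list(codebook[k])


def vq_penalties(z_e: Sequence[float], e_k: Sequence[float],
                 beta: float = 0.25) -> Tuple[float, float]:
    """Forward values of the codebook term and the commitment term."""
    distance = squared_distance(z_e, e_k)
    return distance, beta * distance


# Hypothetical three-entry codebook and encoder output.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
index, z_q = quantize([0.9, 1.2], codebook)
codebook_term, commitment_term = vq_penalties([0.9, 1.2], z_q)
```

Since only the index of the chosen entry is passed on, fine-grained variation in the encoder output is discarded, which is what makes the quantized code a compact carrier of content.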
Branch 2 ()
Obtaining a good speaker representation is particularly important in speaker recognition, speaker adaptation, and many other applications, where irrelevant information in the signal should be filtered out. Although speaker recognition systems can vary widely in their design, they share the same objective of finding discriminative representations to maintain high accuracy and robustness in a variety of environments.
The goal of this branch is to learn such speaker representations that preserve user identity. To achieve this, we use two different methods to extract these representations. Firstly, we use a one-hot speaker code (Hojo et al., 2018) to extract the speaker’s representations and then use this code as a global condition for the decoder to reconstruct the speech signal. Alternatively, we use Thin ResNet-34 (Xie et al., 2019) trained using the angular variant learning metric (Chung et al., 2020) to encourage learning discriminative representation. The encoder in this branch will encourage the extraction of speaker-specific representations and discard invariant representations the decoder can infer from information of the other branch for reconstruction. To support our goal of enhancing privacy protection in sharing speech data, we want to point out that the output of this branch can be used independently as a speaker embedding, as shown in Figure 3 for speaker verification application instead of reconstructing.
Decoder In the speech domain, a vocoder learns to reconstruct audio waveforms from acoustic features (Oord et al., 2016), as shown in Figure 4. Traditionally, the waveform can be vocoded from these acoustic or linguistic features using handcrafted models such as WORLD (Morise et al., 2016), STRAIGHT (Kawahara, 2006), and Griffin-Lim (Griffin and Lim, 1984). However, the quality of those traditional vocoders was limited by the difficulty in accurately estimating the acoustic features from the speech signal.
Neural vocoders such as WaveNet (Oord et al., 2016) have rapidly become the most commonly used vocoding method for speech synthesis. Although WaveNet improved the quality of generated speech, it has a significant cost in computation power and data sources, and suffers from poor generalization (Lorenzo-Trueba et al., 2019). To solve this problem, many architectures such as Wave Recurrent Neural Networks (WaveRNN) (Kalchbrenner et al., 2018) have been proposed. WaveRNN combines linear prediction with recurrent neural networks to synthesize neural audio much faster than other neural synthesizers. In our framework, we use WaveRNN as a decoder with a minor change suggested by (Lorenzo-Trueba et al., 2019). The autoregressive component consists of a single forward GRU (hidden size of 896) and a pair of affine layers followed by a softmax layer with 1024 outputs, predicting the 10-bit mu-law samples for a 24 kHz sampling rate. The conditioning network consists of a pair of bi-directional gated recurrent units (GRUs) with a hidden size of 128. The autoregressive component captures the content, while the conditioning component represents the speaker’s characteristics. To achieve our goal of preserving privacy, the quality of generated speech is measured by the extent to which it contains the desired information after the filtering process and removing invariant information.
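The 1024 softmax outputs correspond to the 1024 levels of a 10-bit mu-law code. As a sketch of what such a code looks like, the snippet below applies the standard mu-law companding formula with mu = 1023 to samples in [-1, 1]. It assumes that this standard form is the one used by the decoder, and the function names are hypothetical.

```python
import math

MU = 1023  # 10 bits give 1024 quantization levels (indices 0..1023)


def mulaw_encode(sample: float, mu: int = MU) -> int:
    """Map a waveform sample in [-1, 1] to an integer class index."""
    x = max(-1.0, min(1.0, sample))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1.0) / 2.0 * mu))


def mulaw_decode(index: int, mu: int = MU) -> float:
    """Map a class index back to an approximate waveform sample."""
    compressed = 2.0 * index / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu,
                         compressed)


index = mulaw_encode(0.3)
restored = mulaw_decode(index)
```

The logarithmic spacing gives quiet samples finer resolution than loud ones, which is why a 10-bit categorical output can stand in for a much wider linear range.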
In general, Phase II is designed by taking advantage of the disentanglement in learning independent representations from the input, and then Phase I output is used to determine the outputs of the proposed framework. More clearly, Phase II is intended to accommodate possible preferences assuming that the input is passed on several branches to learn different information according to the specific task of the branch.
In this section, we describe the datasets, inference attack models, and proposed framework settings. We conduct our experiments on a Z8 G4 workstation with Intel (R) Xeon (R) Gold 6148 (2.8 GHz) CPU and 256 GB RAM. The operating system is Ubuntu 18.04. We train all the models using PyTorch(65) on an NVIDIA Quadro RTX 5000 GPU.
We use five real-world datasets recorded for various purposes such as speech recognition, speaker recognition, and emotion recognition. The details of each dataset are as follows:
IEMOCAP. The Interactive Emotional Dyadic Motion Capture dataset (Busso et al., 2008) has 12 hours of audio-visual data from 10 actors, where the recordings follow dialogues between a male and a female actor on both scripted and improvised topics in English. The data was segmented by speaker turn, resulting in 5,255 scripted recordings and 4,784 improvised recordings. It was mainly recorded to facilitate the development of multimodal emotion recognition systems. We use the scripted recordings that were labeled with four emotions: anger, happy, sad, and neutral.
RAVDESS. The Ryerson Audio-Visual Database of Emotional Speech and Song (Livingstone and Russo, 2018) contains 1,440 recordings from 24 actors (12 male and 12 female), vocalizing two lexically-matched statements in a neutral North American accent. It was recorded to facilitate the development of multimodal emotion recognition systems. It includes seven emotions: calm, happy, sad, angry, fearful, surprise, and disgust, as well as neutral expression. We use the entire dataset.
SAVEE. The Surrey Audio-Visual Expressed Emotion database (Haq et al., 2008) consists of phonetically-balanced sentences from the standard TIMIT corpus (an acoustic-phonetic continuous speech dataset) uttered by four English actors, with a total of 480 utterances. It was recorded mainly to facilitate the development of multimodal emotion recognition systems. It contains expressions of seven emotions: calm, happy, sad, angry, fearful, surprise, and disgust, as well as neutral. We use the entire dataset.
LibriSpeech. LibriSpeech (Panayotov et al., 2015) is a large dataset of approximately 1,000 hours of read English speech. It was derived from audiobooks of the LibriVox project, and was recorded to facilitate the development of automatic speech recognition systems. We use the train-clean-100 set.
VoxCeleb. The VoxCeleb dataset (Nagrani et al., 2017) contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube. It was curated to facilitate the development of automatic speaker recognition systems. We use the VoxCeleb2 subset of about 1,200 recordings.
Training and Testing. We split each dataset into 80% for training and 20% for testing.
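A minimal sketch of such an 80/20 split is given below. It assumes a simple shuffled split with a fixed seed for reproducibility; whether the split was random, stratified by label, or separated by speaker is not stated, so this is one possible reading rather than the procedure actually used. The function name and the recording identifiers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence, train_fraction: float = 0.8,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle the items reproducibly and cut them into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical recording identifiers.
recordings = [f"utt_{i:03d}" for i in range(10)]
train_set, test_set = split_dataset(recordings)
```

When the attribute of interest is tied to speaker identity, a speaker-disjoint split would give a stricter estimate of how well an attacker generalizes to unseen users.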
5.2. Attribute Inference Attacks
An attribute inference attack aims to infer sensitive information from users’ recordings. Specifically, an attacker trains a particular classifier that takes the representation extracted from users’ recordings as input and infers sensitive attributes (e.g., emotion and gender).
5.2.1. Target Attributes.
For IEMOCAP and RAVDESS, we consider emotion recognition and binary gender classification as the inference tasks, and train separate models to classify emotion and gender from the entire representation (after extracting these representations from the raw recordings) for each dataset. For LibriSpeech and VoxCeleb, we consider the inference task to be gender, and we train separate models to classify gender from the entire representation for each dataset. For SAVEE, as it contains one gender, we only consider emotion inference. We repeat this setting for each type of attacker classifier (35 models in total).
Below are the details for each attack classifier:
Logistic Regression (LR):
LR is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. For binary classification such as gender recognition, we use a sigmoid function to predict the label (i.e. male or female) from a given representation. For multiclass prediction, we use the softmax function instead of the sigmoid function to normalize the scores of all classes to values between 0 and 1 and return the probability of each class. All models are trained using the stochastic average gradient (SAG) solver for 300 iterations. In this attack, the attacker uses an LR classifier to perform attribute inference attacks.
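The two output functions mentioned above can be illustrated with a short sketch. It shows only how raw scores are turned into probabilities; the scores themselves are made-up values, not outputs of the attack models.

```python
import math

def sigmoid(z):
    """Binary case: map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Multiclass case: map one score per class to probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

gender_prob = sigmoid(0.0)                  # a score of 0 gives an even chance
emotion_probs = softmax([1.0, 2.0, 3.0])    # illustrative scores for three classes
print(gender_prob, emotion_probs)
```

With two classes, `softmax([z, 0.0])[0]` equals `sigmoid(z)`, which is why the sigmoid can be seen as the binary special case.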
Random Forest (RF):
RF is a machine learning classification algorithm that builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by means of voting. All models use 100 estimators, i.e. the number of trees in the forest. In this attack, the attacker uses an RF classifier to perform attribute inference attacks.
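The voting step can be sketched as below. The per-tree predictions are hypothetical labels used only for illustration; some library implementations average per-tree class probabilities rather than counting hard votes.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one utterance.
votes = ["angry", "sad", "angry", "happy", "angry"]
print(majority_vote(votes))
```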
Support Vector Machine (SVM):
SVM is a discriminative classifier that finds a hyperplane in N-dimensional space (N being the number of features) that accurately classifies the data points. All models use a radial basis function (RBF) kernel, which copes well with large numbers of features in the input space, with the ‘scale’ setting for the kernel coefficient gamma. In this attack, the attacker uses an SVM classifier to perform attribute inference attacks.
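As a rough illustration, the RBF kernel and a data-dependent gamma can be computed as follows. We assume here that the gamma setting corresponds to the common heuristic gamma = 1 / (n_features × variance of the data), as used by scikit-learn's `gamma='scale'`; the text does not confirm this, and the toy samples are made up.

```python
import math

def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def scale_gamma(samples):
    """Assumed heuristic: 1 / (n_features * variance over all feature values)."""
    n_features = len(samples[0])
    values = [v for row in samples for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (n_features * variance)

toy_samples = [[0.0, 0.0], [2.0, 2.0]]   # two made-up 2-dimensional representations
gamma = scale_gamma(toy_samples)
print(gamma, rbf_kernel(toy_samples[0], toy_samples[1], gamma))
```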
Neural Network (NN)- Multilayer Perceptron (MLP):
In this attack, the attacker uses a three-layer fully connected neural network (an input layer, a hidden layer with 2048 neurons, and an output layer) to perform attribute inference attacks. All models adopt the rectified linear unit (ReLU) as the activation function. They are trained using the Adam optimizer with a learning rate of 0.001 and a batch size of 200 for 300 iterations. As it is difficult to determine which NN structure an attacker might use, we chose a simple structure that we expect to be sufficient to analyze the information captured in the extracted representations.
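The forward pass of such a network can be sketched in a few lines. This is an untrained illustration only: the input dimension, the random initialization, and the helper names are assumptions, and training with Adam is omitted.

```python
import random

def relu(values):
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    # One fully connected layer: `weights` holds one row per output neuron.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases (initialization scheme is an assumption).
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(x, hidden_layer, output_layer):
    """Input -> hidden layer with ReLU -> output scores (one per class)."""
    hidden = relu(dense(x, *hidden_layer))
    return dense(hidden, *output_layer)

rng = random.Random(0)
n_features, n_hidden, n_classes = 16, 2048, 7   # n_features is hypothetical
hidden_layer = init_layer(n_features, n_hidden, rng)
output_layer = init_layer(n_hidden, n_classes, rng)
scores = mlp_forward([rng.random() for _ in range(n_features)], hidden_layer, output_layer)
print(len(scores))
```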
In advance of training these models, we must first extract the representations from the various datasets using acoustic models pre-trained for speech recognition tasks. We extract the representations from raw audio in the different datasets using the wav2vec model (Schneider et al., 2019), which achieves a 2.43% word error rate (WER) for speech recognition. wav2vec relies on a fully convolutional architecture comprising two networks. The encoder network embeds the audio signal in a latent space, and the context network combines multiple time-steps of the encoder to obtain contextualized representations. We use the model pre-trained on the full 960-hour LibriSpeech training set, with 32.5M parameters. To obtain representations similar to those that may be used in acoustic models, we used only the output of the encoder network. The encoder layers have kernel sizes (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2). The output of the encoder is a low-frequency feature representation that encodes about 30 ms of 16 kHz audio, and the striding results in one representation every 10 ms. We then used these representations to train the attacker classifiers. We also extract the speech representation using another state-of-the-art model, DeepSpeech2 (Amodei et al., 2016), which reported a 6.71% WER. It consists of eleven layers, including several bidirectional recurrent and convolutional layers. The model was trained using the CTC loss function and a stochastic gradient descent (SGD) optimizer with momentum, extended with the Layer-wise Adaptive Rate Clipping (LARC) algorithm. We use the pre-trained model to extract the feature representation from the log-spectrogram of the raw audio waveform. We then used these representations to train the attacker classifiers.
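The 30 ms and 10 ms figures for the wav2vec encoder follow from the stated kernel sizes and strides. The short calculation below applies the standard receptive-field recurrence for stacked convolutions; it assumes no dilation, and the function name is ours.

```python
def receptive_field_and_hop(kernels, strides):
    """Receptive field and total stride (in input samples) of stacked 1-D convolutions."""
    field, jump = 1, 1
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump   # each layer widens the field by (k - 1) * current jump
        jump *= s                 # the spacing between outputs grows by the stride
    return field, jump

field, hop = receptive_field_and_hop((10, 8, 4, 4, 4), (5, 4, 2, 2, 2))
sample_rate = 16000
print(field, hop)                                             # 465 160
print(1000 * field / sample_rate, 1000 * hop / sample_rate)   # 29.0625 10.0
```

Each encoder output therefore covers 465 samples (about 29 ms, i.e. roughly 30 ms), and consecutive outputs are 160 samples (10 ms) apart.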
5.3. Dual-phase Disentangled Filter Setting
Firstly, spectrograms are generated from the raw time-domain waveform sampled at 16 kHz in a sliding-window fashion, using a Hamming window of width 25 ms and a step of 10 ms. For the speech embedding branch, these spectrograms are encoded by an encoder that consists of five residual convnet layers (using 768 units and ReLU activation). Then, the encoder output (latent vectors) passes through vector quantization (with a codebook size of 512) to become a sequence of quantized representations that serves as the speech embedding. For the speaker embedding branch, the generated spectrograms are used as input to the encoder (Thin ResNet-34 (Xie et al., 2019)), which is the same as the original ResNet with 34 layers except that the number of channels in each residual block is cut down to reduce computational cost. Then, self-attentive pooling (SAP) (Cai et al., 2018) is used to aggregate frame-level features into the utterance-level representation that serves as the speaker embedding. The representations from the different branches are then upsampled and concatenated to form the conditioning input to the WaveRNN decoder (note: a one-hot vector representing the speaker can be used as a global condition of the WaveRNN decoder). We train the proposed framework on LibriSpeech, which has multiple speakers and was recorded at a sampling rate of 16 kHz. We use the Adam optimizer with an initial learning rate of 4e-4 and evaluate the performance after 250,000 steps with a batch size of 64 (600,000 steps in total).
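The vector quantization step can be sketched as a nearest-neighbour lookup in the codebook. This is a simplified illustration: we assume squared Euclidean distance as the matching criterion, use a toy codebook of three entries instead of 512, and omit the training of the codebook itself.

```python
def quantize(latents, codebook):
    """Replace each latent vector by its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices, [codebook[i] for i in indices]

toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]     # made-up entries
toy_latents = [[0.2, -0.1], [3.5, 3.9], [0.9, 1.2]]     # made-up encoder outputs
indices, quantized = quantize(toy_latents, toy_codebook)
print(indices)
```

Because every latent vector is mapped to one of a fixed set of entries, fine-grained variation that is not captured by the codebook is discarded, which is the property the framework relies on.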
In this section we evaluate our results in terms of (i) the effectiveness of attribute inference attacks in voice processing using different model architectures and several datasets; and (ii) the efficiency of the proposed framework in defending against this class of attacks in the voice domain.
6.1. Attack Effectiveness
6.1.1. Inference Accuracy.
Since the attacker’s goal is to infer the target attribute, we evaluate an attack using the inference accuracy of the classifier used by the attacker. Precisely, we mean the accuracy of the classifier in inferring sensitive information from the test set relative to the probability of random guessing. Assuming, for example, that the sensitive attribute in question is the user’s emotion, we have seven labelled categories in the available datasets (RAVDESS and SAVEE). The random guess rate for success is therefore around 14%. If we assume that the sensitive attribute is ‘gender’ (e.g. binary male or female), the random guess rate will be 50%. As the models potentially available to the attacker are unknown to us, we measure the success accuracy of various models, trained on various datasets, in inferring the target attribute.
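The chance baselines above can be reproduced with a one-line calculation. The sketch assumes balanced classes, and the accuracy value passed to `times_over_chance` is a hypothetical number chosen for illustration, not a result from Table 1.

```python
def chance_rate(num_classes):
    """Expected accuracy of uniform random guessing over balanced classes."""
    return 1.0 / num_classes

def times_over_chance(accuracy, num_classes):
    """How many times better than random guessing a given accuracy is."""
    return accuracy / chance_rate(num_classes)

print(round(100 * chance_rate(7), 1))    # emotion, seven classes
print(round(100 * chance_rate(2), 1))    # binary gender
print(times_over_chance(0.56, 7))        # hypothetical attacker accuracy of 56%
```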
From Table 1, we see that the inference models have varying performance, ranging from about 40% to 99.4% in successfully inferring different attributes (e.g. emotion and gender). This means that the inference attacks can improve success accuracy to three or four times better than making a random guess. The difference between these percentages reflects the extent to which the attributes relate to each other. For example, gender is more entangled with a speaker’s identity than emotion, thus the attacker’s success rate is higher in identifying speaker gender. Table 3 shows that although there is a reduction in the success rate of an attacker in identifying speaker gender, there is still a slight increase over random guessing in some cases.
Moreover, the diversity of the datasets, which were recorded in different environmental conditions and for diverse purposes, may mimic the differences among the real-world environments in which voice-controlled devices are deployed. We notice that this diversity affects the attack success accuracy, as shown in Table 1. For example, an attacker’s success accuracy in inferring the emotion attribute varies among the three emotional datasets (IEMOCAP, RAVDESS, and SAVEE), and the inference accuracy over RAVDESS is better than over the other datasets due to the good quality of its emotional recordings. Despite these differences, we demonstrate that deep acoustic models can be exposed to the extraction of sensitive attributes from their inputs.
6.1.2. Impact of Acoustic Model Architecture on Attack Success.
We observe that differences in the architecture of acoustic models can help attackers to achieve their objectives. As models extract more accurate deep representations in order to improve the performance of speech processing tasks, the success rate of inferring sensitive attributes from those representations also increases. For example, wav2vec (Schneider et al., 2019) has been developed to extract more powerful representations for speech recognition compared to the DeepSpeech2 (Amodei et al., 2016) model. From Table 1 we can see that representations extracted using wav2vec increase the probability of the attacker inferring sensitive attributes compared with those from the DeepSpeech2 model.
6.2. Defense Efficiency
6.2.1. Disentanglement and Controllability.
We aim to enable users to have control over their data by taking advantage of disentangled representation learning. Thus, we design and implement the proposed framework on the assumption that there are three privacy preference options, namely high, moderate, and low. After training the proposed framework to explicitly learn the disentangled representation from the speech data, it can generate different outputs that reflect the selected privacy preference. With the ‘high’ option, the speech content representation is disentangled from the speaker’s identity.
For this option, the proposed framework can generate two types of output, either the speech embedding or a reconstruction of the speech obtained by concatenating the speech embedding with a synthetic identity. For the moderate option, the proposed framework can generate three types of output, which are the speech embedding, the speaker embedding, or a reconstruction of the speech that preserves the identity of the speaker while filtering out other information (e.g., emotions). Finally, with the ‘low’ option, the proposed framework sends the raw data without any filtering. Figure 5 shows the spectrogram of the reconstructed speech signal for the different options. Moreover, we use word error rate (WER), a common metric of speech recognition performance that counts the word-level differences between two word sequences, to measure the difference in speech recognition between the raw speech signal and the reconstructed one for the different privacy preference options. We find, as shown in Table 2, that there is an insignificant decrease (1%) in speech recognition accuracy. We use the equal error rate (EER) to measure the speaker verification accuracy (for the moderate privacy preference), and we find an almost negligible difference between the raw and reconstructed speech signals for this speaker verification task.
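For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below follows that standard definition; the example sentences are made up and are not taken from the evaluation data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("turn on the kitchen light", "turn on a kitchen light"))
```

In the setting described above, the transcript of the raw signal would play the role of the reference and the transcript of the reconstructed signal that of the hypothesis.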
Learning these disentangled representations not only serves our purpose to protect user privacy, but also is useful in finding robust representations for different speech processing tasks with limited data in the speech domain (Latif et al., 2020).
6.2.2. Privacy Estimation.
The baseline is the inference success from unfiltered representations.
Privacy Preference: High. The output of the framework should reflect this privacy preference by achieving high accuracy in speech recognition while hiding a speaker’s identity. Therefore, we measure the efficiency of the framework in learning a disentangled representation that preserves the speech content and discards the invariant information (i.e. speaker identity, emotion, and gender) by examining an attacker’s success in obtaining sensitive information using this representation. For fair comparison with the baseline inference accuracy, we only use the quantized embedding before concatenating it with a synthetic identity during reconstruction. Figure 6 shows a considerable drop in the inference accuracy after applying vector quantization, one technique for learning such disentangled representations (van den Oord et al., 2017), where the outcome is shown to be in line with guessing at random for all attacker models.
Privacy Preference: Moderate. The output of the framework should reflect this preference by achieving high accuracy in speech recognition while preserving the speaker’s identity. Thus, we measure the efficiency of the framework in learning a disentangled representation that preserves the speech content and speaker identity, and discards the invariant information (i.e. emotion and gender), by examining an attacker’s success in obtaining this sensitive information using the output of the proposed framework for this preference. Figure 7 shows a notable reduction in the inference attacks’ accuracy after reconstruction; the remaining accuracy can be considered a marginal improvement on random guessing. When comparing Figure 6 and Figure 7, we see that the speakers’ representations may still preserve representations related to some sensitive attributes, based on the slight rise in attacker success rate in emotion recognition. Moreover, we notice that the accuracy of gender recognition is higher in some cases (e.g., RF applied to LibriSpeech, and MLP and SVM applied to IEMOCAP), even compared to emotion recognition, which means that gender is closely related to the speaker’s identity representation (i.e. a highly related representation), as shown in Table 3. In future work, we will investigate further disentanglement approaches (e.g., adversarial learning) within the speaker embedding and add constraints as appropriate to try to limit this success. This could also help to address the fact that the various models used in different speech processing applications to extract acoustic features from raw signals outperform one another, e.g. wav2vec (self-supervised) outperforms DeepSpeech2 (supervised) in speech recognition, as mentioned in Sec. 6.1.2.
6.2.3. Prosody Visualization.
The chroma feature (chromagram) is a fast and robust way to visualize audio attributes, and is relatively invariant to changes in vocal tract resonances (Wakefield, 1999). This feature shows the distribution of energy across the twelve pitch classes, where a pitch class groups tones that sound the same but are separated by relative highness or lowness (i.e., by one or more octaves). To compute this feature, the spectrum is first computed on a logarithmic scale, with a selection of the 20 highest dB and a restriction to a frequency range that includes an integer number of octaves. Then, the spectrum energy is redistributed across the different pitch classes (i.e., chromas).
Prosodic features, like pitch, play an essential role in the transmission of vocal emotions (Bulut and Narayanan, 2008). We therefore use chromagram visualization to compare the characteristics of the prosodic features of the raw speech and the reconstructed one. Figure 8 compares the raw speech (angry emotion), the reconstructed speech with identity preserved (calm emotion), and the reconstructed speech with suppressed identity. It is clear that the change in the energy located in each pitch class for each frame reflects the success of the proposed framework in changing the prosodic representation related to the user’s emotion to maintain their privacy.
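The redistribution of spectral energy onto pitch classes can be sketched as follows. This is a simplified illustration that assumes twelve-tone equal temperament with A4 = 440 Hz and takes the spectrum of one frame as (frequency, energy) pairs; the dB selection and frequency-range restriction described above are omitted, and the example values are made up.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, a4_hz=440.0):
    """Index (0 = C, ..., 11 = B) of the pitch class nearest to a frequency."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12

def chroma_vector(spectrum):
    """Sum the energy of one frame's spectral peaks into the twelve pitch classes."""
    chroma = [0.0] * 12
    for freq_hz, energy in spectrum:
        chroma[pitch_class(freq_hz)] += energy
    return chroma

# Made-up frame: two octaves of A plus a middle C.
frame = [(220.0, 1.0), (440.0, 2.0), (261.63, 0.5)]
chroma = chroma_vector(frame)
print(dict(zip(PITCH_CLASSES, chroma)))
```

Tones one octave apart (220 Hz and 440 Hz here) fall into the same bin, which is why the chromagram is insensitive to octave shifts.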
7. Discussion and Future Work
Protecting users’ privacy where speech analysis is concerned continues to be a particularly challenging task. Yet, our experiments and findings indicate that it is possible to achieve a fair level of privacy while maintaining a high level of functionality for speech-based systems. Our results can be extended to shed light on several other questions discussed in this section.
To what extent can speech representation be private? Our experimental evaluation highlights the vulnerability of the underlying acoustic models used by speech processing systems (e.g. ASR systems) to potential attribute inference attacks. We estimate an attacker’s success by running various arbitrary classifiers to measure the extent to which sensitive information can be obtained from a user’s speech data. Based on the results shown in Table 1, we find that such an attacker has the opportunity to extract this information with a much higher degree of accuracy than would be possible by chance. For example, for emotion recognition using the RAVDESS dataset, and assuming that we have seven different emotions, the random guess rate is about 14%, but when using the logistic regression model the success rate is four times greater than this. Moreover, when using, for example, the SVM model, which is a suggested model for analyzing emotions and physical conditions based on the Amazon patent (Jin and Wang, 2018), we observe that its success rate exceeds random guessing by three times. Although these classifiers are not ideal and attackers can improve their strength by using more robust models (e.g. adversarial classifiers), our work aims to demonstrate these vulnerabilities and raise the alarm concerning the need for on-device solutions to sanitize user inputs insofar as possible before sharing them with service providers.
Is a two-phase framework necessary? The controllability enabled by the disentangled representations can help to design new privacy-preserving applications considering users’ privacy preferences. This controllability will allow us to explicitly adjust the disentangled representation to match user privacy preferences. We expect that there are different user privacy preferences for analytics depending on the service providers with which they interact. For example, when users communicate with health service providers, they may prefer to share raw data without any filtering due to the urgent need to provide accurate information to trusted specialists. To accommodate such differences, we design a two-phase framework where the first phase captures user preferences, while the second phase learns disentangled representations to reflect these preferences.
Initially, we suggest three privacy preference options (i.e. high, moderate, and low). Supposing that the user wants to interact with a smart home assistant such as Amazon’s Alexa or Google Home, for the high privacy preference option the default analysis task should be to understand the user’s command and respond to it, without any additional information that would allow secondary processing or re-purposing of the user’s data. For the moderate privacy preference, the default analysis tasks should be speech-to-text and speaker recognition for authentication purposes, whereas the low privacy option allows users to share their data without any alteration. However, these are just some examples of potential preferences, and many more could be included given the diversity of these systems’ users. In future work, we intend to provide users with additional controls depending on the devices and services with which they are interacting.
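One way to picture the first phase is as a small lookup from the selected preference to the analysis tasks it permits. The sketch below is purely illustrative: the enum, the task names, and the `is_task_allowed` helper are hypothetical and are not part of the implemented framework.

```python
from enum import Enum

class PrivacyPreference(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"

# Hypothetical task names; the mapping follows the three options described above.
DEFAULT_TASKS = {
    PrivacyPreference.HIGH: ("speech_recognition",),
    PrivacyPreference.MODERATE: ("speech_recognition", "speaker_verification"),
    PrivacyPreference.LOW: None,  # None: the raw signal is shared without filtering
}

def is_task_allowed(preference, task):
    """Return True if the given analysis task is permitted under the preference."""
    allowed = DEFAULT_TASKS[preference]
    return True if allowed is None else task in allowed

print(is_task_allowed(PrivacyPreference.HIGH, "emotion_recognition"))
print(is_task_allowed(PrivacyPreference.MODERATE, "speaker_verification"))
```

Further preference levels could be added as extra entries in the table without changing the lookup logic.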
Is disentanglement necessary? Speech data has complex distributions and contains crucial information beyond linguistic content, including information contained in background noise and speaker characteristics, among other sources. Training speech processing systems without regard to the impact of these sources of variability affects their performance and effectiveness. For example, only a portion of this information is related to ASR, while the rest can be considered as invariant and can therefore impinge upon the performance of ASR systems. This effect may lead to gender-biased or race-specific systems (Zhang et al., 2019a). Koenecke et al. (Koenecke et al., 2020) examine the racial disparities of five state-of-the-art ASR systems developed by Amazon, Apple, Google, IBM, and Microsoft by transcribing structured interviews conducted with 42 white speakers and 73 black speakers. They found that there are disparities in the underlying acoustic models used by these ASR systems, which do not work equally well for all subgroups of the population. Likewise, the use of disentanglement in learning speaker representations can enhance the robustness of these representations and help with common speaker recognition issues like anti-spoofing (Peri et al., 2020).
Many recent applications have suggested that a disentangled speech representation can improve the interpretability and transferability of the representation in the speech signal (Hsu et al., 2017). Although these applications seek to improve the quality and effectiveness of speech processing systems, disentanglement has not been considered for use in protecting privacy. We observe that the ability of the proposed framework to disentangle these representations allows it to reconstruct different outputs that reflect a variety of privacy preferences. Thus, it can be argued that the separation of these representations will help to develop future privacy-aware solutions between users and service providers. Moreover, learning disentangled representations that reflect users’ preferences can bring enhanced robustness, interpretability, and controllability. We will, in future, seek to combine different techniques like adversarial training (Huang et al., 2020) and Siamese networks (Last et al., 2020) with disentanglement, or add further constraints grounded in information theory, to improve the learning of such disentangled representations from users’ signals.
Can we really do this at the edge? One of the primary reasons for taking an edge computing approach is to filter data locally prior to sending it to the cloud. Local filtering may be used to enhance protection of users’ privacy. For example, an on-device transformation of sensor data was proposed by Malekzadeh et al. (Malekzadeh et al., 2019). They used convolutional autoencoders (CAE) as a sensor data anonymizer to remove user-identifiable features locally and then share the filtered sensor data with specific applications, such as those designed for daily activities monitoring. In this work, we show how urgent it is to develop on-device privacy-preserving solutions for voice inputs by extracting the distinguishing representation from the speech without compromising individual privacy. In earlier versions of this work (Aloufi et al., 2019), we developed a privacy-preserving filter for voice inputs on edge devices to protect private paralinguistic information of a speaker. This filter enables users to protect their sensitive attributes (e.g. emotion) while benefiting from sharing their voice data with cloud-based voice analysis services. We implemented and evaluated the on-device filtering approach using a Raspberry Pi 4 as an example of an edge device, and our experimental results showed that similar performance in protecting sensitive information is attainable at the edge in comparison with cloud-based approaches. Although we showed that it is feasible for such models to be run on edge devices, further work is required to improve their efficiency, particularly with regard to model size and execution time. For example, model execution on a Raspberry Pi 4 takes twice as long (40 seconds) as in the cloud. In this work, our prototype implementation indicates the effectiveness of the proposed framework in reconstructing the speech signal. In addition, there is a decrease in the model size from about 126 MB to 95 MB.
As future work, we aim to significantly reduce the execution time and memory usage of running the proposed framework on edge devices by further optimizing and quantizing the implementation of the model to make it suitable for use in real-time applications.
8. Related Work
Privacy Leakage in Deep Learning. Deep learning models are vulnerable to various inference attacks as they remember information about their training data. Unwanted learning in deep learning models was indicated by (Song et al., 2017; Melis et al., 2019), who demonstrate that models leak detailed information about their training datasets. Likewise, (Carlini et al., 2019) show that generative text models trained on sensitive data can memorize their training data, and that an attacker could extract unique and secret sequences like credit card numbers from these models. Song et al. define “overlearning” in deep learning models as the case where a model trained for a simple objective can be re-purposed for a privacy-violating task (Song and Shmatikov, 2020). Motivated by these previous works, and given the scarcity of works targeting speech processing models, specifically the underlying deep acoustic models, in this paper we demonstrate the privacy leakage of input data from these models.
Other works have focused on protecting against membership inference attacks, which aim to determine whether a given data sample was used in the model’s training (Shokri et al., 2017). Nasr et al. measure the training data privacy leakage of deep learning algorithms by analyzing state-of-the-art models pre-trained on the CIFAR dataset (Nasr et al., 2019). They show that even well-generalized deep models are exposed to white-box membership inference attacks and leak a significant amount of information about their training data. Membership inference attacks are, however, beyond the scope of this paper, although they are worthy of further investigation. We focus instead on the scenario whereby attackers can infer a significant amount of private information by observing the model input even if it is not in the training data.
Attribute inference attacks have been shown to compromise user privacy in various application domains including recommender systems (Jia and Gong, 2018), side-channel attacks (Wei et al., 2018), location inference attacks (Shokri et al., 2012), and property inference attacks (Ganju et al., 2018). In these attacks, an attacker aims to infer the private attributes of the target user from their public data. Ateniese et al. show how an attacker can use access to the parameters of machine learning models such as Hidden Markov Models (HMM) to extract a predicate of the training data (e.g., the accent of the speaker in speech recognition models) (Ateniese et al., 2015). In contrast to their work, we attest that such attacks perform well on the state-of-the-art deep acoustic models underlying speech processing tasks to extract user-specific private attributes.
Privacy Preserving Speech Representation. Learning privacy-preserving representations in speech data is relatively unexplored (Latif et al., 2020). In (Nautsch et al., 2019) Nautsch et al. investigate the importance of developing privacy-preserving technologies to protect speech signals and highlight the need to apply these technologies to protect speakers and speech characterization in recordings. Some recent works have sought to protect speaker identity (Qian et al., 2018), gender identity (Jaiswal and Provost, 2019) and emotion (Aloufi et al., 2019). VoiceMask, for example, was proposed to mitigate the security and privacy risks of voice input on mobile devices by concealing voiceprints (Qian et al., 2018). It aims to strengthen users’ identity privacy by sanitizing the voice signal received from the microphone and then sending the perturbed speech to the voice input apps or the cloud. Moreover, in (Aloufi et al., 2019) an edge-based system is proposed to filter affect patterns from a user’s voice before sharing it with cloud services for further analysis. Unlike other approaches, however, we seek to protect the privacy of multiple user attributes for IoT scenarios that depend on voice input or speech analysis, i.e. sanitizing the speech signal of attributes a user may not wish to share but without decreasing functionality. We also emphasize the importance of learning disentangled speech representation for optimizing the privacy-utility trade-off and promoting privacy in a transparent manner.
Fairness Representation. Fairness in machine learning is another research area related to this work that shares the same methods but where the objective is not to protect privacy. It aims to develop models that are invariant to particular attributes such as demographic information (Madras et al., 2018). In (Edwards and Storkey, 2015) the authors have shown how the adversarial approach can be adapted to the task of removing sensitive information from representations. In (Moyer et al., 2018), Moyer et al. have argued, however, that adversarial training for fairness and invariance is unnecessary, and sometimes produces counterproductive effects. Disentanglement has recently been shown to be useful for learning and evaluating fair machine learning models. Creager et al. proposed a fair representation learning model by disentanglement to achieve subgroup fairness in (Creager et al., 2019). Similarly, Locatello et al. investigated how disentanglement impacts the fairness of general-purpose representations in (Locatello et al., 2019). In (Marx et al., 2019), disentangling influence was presented to learn the influence of such attributes in accomplishing a given task. The authors investigate the importance of a feature’s influence over the model outcomes, taking advantage of disentangled representations. By contrast, our goal is to protect user privacy by preventing attackers from obtaining sensitive information, which is significantly different from the motivation and goals of previous studies.
In this paper, we demonstrated the potential vulnerabilities of underlying acoustic models used by speech processing tasks under attribute inference attacks. We proposed a privacy-aware, configurable framework for optimizing data sharing through voice user interfaces. Our proposed framework works in two phases, where the first phase adjusts privacy preferences and the second filters out sensitive attributes from users’ input data depending on the configured privacy preference. We based our evaluation on empirical results derived from numerous real-world datasets, and show that the proposed framework can effectively defend against this class of attack. Specifically, we can reduce the success rate of inferring private attributes to less than or equal to chance, while providing on average over 99% accuracy in primary tasks.
In the next steps of the work, we intend to focus on extending our framework to be more tunable to provide users with more controls depending on the devices and services with which they are interacting. An interesting direction for future research is to explore new privacy-preserving applications that can be enabled by the interpretability and controllability brought about by disentangled representations.
- Emotion filtering at the edge. In Proceedings of the 1st Workshop on Machine Learning on Edge in Sensor Systems. Cited by: §7, §8.
- Deep speech 2: end-to-end speech recognition in english and mandarin. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 173–182. Cited by: §5.2.3, §6.1.2.
- Hacking smart machines with smarter ones: how to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10 (3), pp. 137–150. Cited by: §8.
- Vq-wav2vec: self-supervised learning of discrete speech representations. In International Conference on Learning Representations. Cited by: §4.2.2.
- Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §4.2.2.
- On the robustness of overall f0-only modifications to the perception of emotions in speech. The Journal of the Acoustical Society of America 123 (6), pp. 4547–4558. Cited by: §6.2.3.
- IEMOCAP: interactive emotional dyadic motion capture database. Language resources and evaluation 42 (4), pp. 335. Cited by: §1, §5.1.
- Exploring the encoding layer and loss function in end-to-end speaker and language recognition system. In Odyssey 2018: The Speaker and Language Recognition Workshop, 26-29 June 2018, Les Sables d’Olonne, France, A. Larcher and J. Bonastre (Eds.), pp. 74–81. Cited by: §5.3.
- The secret sharer: evaluating and testing unintended memorization in neural networks. In Proceedings of the 28th USENIX Conference on Security Symposium, USA. Cited by: §8.
- Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1.
- Infogan: interpretable representation learning by information maximizing generative adversarial nets. Cited by: §2.2.
- State-of-the-art speech recognition with sequence-to-sequence models. pp. 4774–4778. Cited by: §1.
- Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797. Cited by: §2.2.
- In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982. Cited by: §4.2.2.
- Flexibly fair representation learning by disentanglement. arXiv preprint arXiv:1906.02589. Cited by: §8.
- When speakers are all ears: characterizing misactivations of iot smart speakers. In Proceedings of the 20th Privacy Enhancing Technologies Symposium (PETS 2020). Cited by: §1.
- Censoring representations with an adversary. arXiv preprint arXiv:1511.05897. Cited by: §8.
- Latent constraints: learning to generate conditionally from unconditional generative models. arXiv preprint arXiv:1711.05772. Cited by: §2.2.
- Learning disentangled representation for robust person re-identification. In Advances in Neural Information Processing Systems, pp. 5298–5309. Cited by: §2.1.
- Property inference attacks on fully connected neural networks using permutation invariant representations. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 619–633. Cited by: §8.
- Towards learning fine-grained disentangled representations from speech. arXiv preprint arXiv:1808.02939. Cited by: §2.1.
- Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.2.
- Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243. Cited by: §4.2.2.
- Neural face editing with intrinsic image disentangling. Google Patents. Note: US Patent 10,565,758 Cited by: §2.1.
- DeepSpeech: scaling up end-to-end speech recognition. Cited by: §1.
- Audio-visual feature selection and reduction for emotion classification. In Proc. Int. Conf. on Auditory-Visual Speech Processing (AVSP’08), Tangalooma, Australia. Cited by: §1, §5.1.
- Beta-vae: learning basic visual concepts with a constrained variational framework. ICLR 2 (5), pp. 6. Cited by: §2.2.
- DNN-based speech synthesis using speaker codes. IEICE TRANSACTIONS on Information and Systems 101 (2), pp. 462–472. Cited by: §4.2.2.
- Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pp. 1878–1889. Cited by: §2.2, §4.2.2, §4.2, §7.
- Unsupervised style and content separation by minimizing mutual information for speech synthesis. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3267–3271. Cited by: §2.1.
- Unsupervised representation disentanglement using cross domain features and adversarial learning in variational autoencoder based voice conversion. IEEE Transactions on Emerging Topics in Computational Intelligence. Cited by: §2.1, §7.
- Privacy enhanced multimodal neural representations for emotion recognition. arXiv preprint arXiv:1910.13212. Cited by: §8.
- AttriGuard: a practical defense against attribute inference attacks via adversarial machine learning. In 27th USENIX Security Symposium (USENIX Security 18). Cited by: §8.
- Voice-based determination of physical and emotional characteristics of users. Cited by: §3.1, §7.
- Efficient neural audio synthesis. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 2410–2419. Cited by: Figure 4, §4.2.2.
- STRAIGHT, exploitation of the other aspect of vocoder: perceptually isomorphic decomposition of speech sounds. Acoustical science and technology 27 (6), pp. 349–353. Cited by: §4.2.2.
- Disentangling by factorising. In International Conference on Machine Learning, pp. 2649–2658. Cited by: §2.1.
- Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1857–1865. Cited by: §2.2.
- Auto-encoding variational bayes. Cited by: §2.2, §4.1.
- Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences 117 (14), pp. 7684–7689. Cited by: §7.
- Deep convolutional inverse graphics network. In Advances in neural information processing systems, pp. 2539–2547. Cited by: §2.2.
- Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pp. 5967–5976. Cited by: §2.2.
- Unsupervised feature learning for speech using correspondence and siamese networks. IEEE Signal Processing Letters. Cited by: §7.
- Deep representation learning in speech processing: challenges, recent advances, and future trends. arXiv preprint arXiv:2001.00378. Cited by: §2.2, §6.2.1, §8.
- High-fidelity synthesis with disentangled representation. arXiv preprint arXiv:2001.04296. Cited by: §2.2.
- The ryerson audio-visual database of emotional speech and song (ravdess): a dynamic, multimodal set of facial and vocal expressions in north american english. PloS one 13 (5). Cited by: §1, §5.1.
- On the fairness of disentangled representations. In Advances in Neural Information Processing Systems, pp. 14584–14597. Cited by: §8.
- Towards achieving robust universal neural vocoding. pp. 181–185. Cited by: §4.2.2.
- Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pp. 3384–3393. Cited by: §8.
- Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §2.2.
- Mobile sensor data anonymization. In Proceedings of the International Conference on Internet of Things Design and Implementation, pp. 49–58. Cited by: §7.
- Disentangling influence: using disentangled representations to audit model predictions. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 4496–4506. Cited by: §2.1, §8.
- Disentangling factors of variation in deep representation using adversarial training. In Advances in neural information processing systems, pp. 5040–5048. Cited by: §4.2.2.
- Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 691–706. Cited by: §8.
- WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE TRANSACTIONS on Information and Systems. Cited by: §4.2.2.
- Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pp. 9084–9093. Cited by: §8.
- VoxCeleb: A large-scale speaker identification dataset. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, F. Lacerda (Ed.), pp. 2616–2620. Cited by: §1, §5.1.
- Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019. Cited by: §8.
- Preserving privacy in speaker and speech characterisation. Comput. Speech Lang. 58, pp. 441–480. Cited by: §8.
- Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §4.2.2, §4.2.2.
- Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §1, §5.1.
- Unsupervised speech domain adaptation based on disentangled representation learning for robust speech recognition. arXiv preprint arXiv:1904.06086. Cited by: §2.1.
- Domain agnostic learning with disentangled representations. In ICML, Cited by: §2.1.
- An empirical analysis of information encoded in disentangled neural speaker representations. In Proceedings of Odyssey, Cited by: §2.1, §7.
- (2018-01-01) (Website). Cited by: §5.
- Hidebehind: enjoy voice input with voiceprint unclonability and anonymity. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, pp. 82–94. Cited by: §8.
- Learning to disentangle factors of variation with manifold interaction. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML’14, pp. II–1431–II–1439. Cited by: §2.1.
- Fairness by learning orthogonal disentangled representations. Cited by: §2.1.
- Wav2vec: unsupervised pre-training for speech recognition. In INTERSPEECH, Cited by: §5.2.3, §6.1.2.
- Emotion, affect and personality in speech and language processing. Cited by: §1.
- Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §8.
- Protecting location privacy: optimal strategy against localization attacks. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, pp. 617–627. Cited by: §8.
- Machine learning models that remember too much. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 587–601. Cited by: §8.
- Overlearning reveals sensitive attributes. In International Conference on Learning Representations, Cited by: §8.
- Predictive auxiliary variational autoencoder for representation learning of global speech characteristics. Proc. Interspeech 2019, pp. 934–938. Cited by: §2.2.
- Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6264–6268. Cited by: §2.1, §4.2.2.
- Domain adaptation for structured output via disentangled representations. Google Patents. Note: US Patent App. 16/400,376 Cited by: §2.1.
- Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315. Cited by: §1, §4.2.2, §4.2.2, §6.2.2.
- Chromagram visualization of the singing voice. In International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, Cited by: §6.2.3.
- End-to-end anchored speech recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1.
- I know what you see: power side-channel attack on convolutional neural network accelerators. In Proceedings of the 34th Annual Computer Security Applications Conference, pp. 393–406. Cited by: §8.
- Utterance-level aggregation for speaker recognition in the wild. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5791–5795. Cited by: §4.2.2, §5.3.
- Privacy risk in machine learning: analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282. Cited by: §1.
- Group retention when using machine learning in sequential decision making: the interplay between user dynamics and fairness. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 15269–15278. Cited by: §7.
- Learning latent representations for style control and transfer in end-to-end speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6945–6949. Cited by: §2.1.
- Look across elapse: disentangled representation learning and photorealistic cross-age face synthesis for age-invariant face recognition. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp. 9251–9258. Cited by: §2.1.