With the emergence of pervasive voice assistants [lopez2017alexa, kepuska2018next] such as Amazon Alexa, Apple's Siri and Google Home, voice has become one of the most widespread forms of human-machine interaction. In this context, the speech signal is sent from the user device to a cloud-based service, where automatic speech recognition (ASR) and natural language understanding are performed to address the user request (see, e.g., https://cloud.google.com/speech-to-text/). While recent studies have identified security vulnerabilities in these devices [lei2017insecurity, chung2017alexa], such studies tend to overshadow more important privacy risks with long-term impact. Indeed, state-of-the-art speech processing algorithms can infer not only the spoken contents from the speech signal, but also, to a great extent, the speaker's identity [reynolds1995speaker], intention [gu2017speech, hellbernd2016prosody, ballmer2013speech, stolcke1998dialog], gender [zeng2006robust, kotti2008gender], emotional state [el2011survey, ververidis2004automatic, kwon2003emotion], pathological condition [dibazar2002feature, umapathy2005feature, schuller2013interspeech], personality [schuller2013computational, schuller2015survey] and cultural [sekiyama1997cultural, vinciarelli2009social] attributes. These algorithms require just a few tens of hours of training data to achieve reasonable accuracy, and such data is easier than ever to collect via voice assistants. The dissemination of voice signals in large data centers thereby poses severe long-term privacy threats to users.
These privacy issues have received little attention so far. The most prominent studies rely on homomorphic encryption and bit string comparison [pathak2012privacy, glackin2017privacy]. While these methods provide strong cryptographic guarantees, they come at a large computational overhead and can hardly be applied to state-of-the-art end-to-end deep neural network based systems.
An alternative software architecture is to pre-process voice data on the device so as to remove some personal information before sending it to web services. Although this does not rule out all possible risks, a change of representation of the voice signal can help limit unsolicited uses of the data. In this paper, we investigate how much of a user's identity is encoded in speech representations built for ASR. To this end, we conduct closed- and open-set speaker recognition experiments. The closed-set experiment refers to a classification setting where all test speakers are known at training time. In contrast, the open-set experiment (a.k.a. speaker verification) measures the capability of an attacker to discriminate between speakers in a more realistic setting where the test speakers are not known beforehand. We implement the attacker with the state-of-the-art x-vector speaker recognition technique [snyder2018x].
The representations of speech we consider in our work are given by the encoder output of end-to-end deep encoder-decoder architectures trained for ASR. Such architectures are natural in our privacy-aware context, as they correspond to encoding speech on the user device and decoding in the cloud. Our baseline network follows the ESPnet architecture [watanabe2018espnet], with one encoder and two decoders: one based on connectionist temporal classification (CTC) and the other on an attention mechanism. Inspired by [feutry2018learning], we further propose to extend the network with a speaker-adversarial branch so as to learn representations that perform well in ASR while hiding the speaker identity.
Several papers have recently proposed to use adversarial training to improve ASR performance by making the learned representations invariant to various conditions. While invariance to general forms of acoustic variability has been studied [serdyuk2016invariant], some work specifically targets speaker invariance [tsuchiya2018speaker, meng2018speaker]. Interestingly, there is no general consensus on whether it is more appropriate to use speaker classification in an adversarial or a multi-task manner, despite the fact that these two strategies implement opposite means (i.e., encouraging representations to be speaker-invariant or speaker-specific). This question was studied in [adi2018reverse], in which the authors conclude that both approaches only provide minor improvements in terms of ASR performance. Their speaker classification experiments also show that the baseline system already tends to learn speaker-invariant features. However, they did not run speaker verification experiments and hence did not assess the suitability of these features for the goal of anonymization.
In contrast to these studies, which aim to improve ASR performance, our goal is to assess the potential benefit of adversarial training for concealing speaker identity in the context of privacy-friendly ASR. Our contributions are the following. First, we combine CTC, attention and adversarial learning within an end-to-end ASR framework. Second, we design a rigorous protocol to quantify speaker identity in ASR representations through a series of closed-set classification and open-set verification experiments. Third, we run these experiments on the Librispeech corpus [panayotov2015librispeech] and show that this framework dramatically reduces speaker classification accuracy, but does not increase speaker verification error. We suggest several possible reasons behind this disparity.
2 Proposed model
We start by describing the ASR model we use as a baseline, before introducing our speaker-adversarial network.
2.1 Baseline ASR model
We use the end-to-end ASR framework presented in [watanabe2017hybrid] as the baseline architecture. It is composed of three sub-networks: an encoder which transforms the input sequence of speech feature vectors $x$ into a new representation $z$, and two decoders that predict the character sequence $y$ from $z$. We assume that these networks have already been trained using data previously collected by the service provider (which may be public data, opt-in user data, etc.). Then, in the deployment phase of the system that we envision, the encoder would run on the user device and the resulting representation $z$ would be sent to the cloud for decoding.
The first decoder is based on CTC and the second on an attention mechanism. As argued in [watanabe2017hybrid], attention works well in most cases because it does not assume conditional independence between the output labels (unlike CTC). However, it is so flexible that it allows non-sequential alignments which are undesirable in the case of ASR. Hence, CTC acts as a regularizer to prune such misaligned hypotheses. We denote by $\theta_e$ the parameters of the encoder, and by $\theta_{ctc}$ and $\theta_{att}$ the parameters of the CTC and attention decoders, respectively. The model is trained in an end-to-end fashion by minimizing an objective function which combines the losses $\mathcal{L}_{ctc}$ and $\mathcal{L}_{att}$ from both decoder branches:
$$\mathcal{L}_{asr}(\theta_e, \theta_{ctc}, \theta_{att}) = \alpha\,\mathcal{L}_{ctc}(\theta_e, \theta_{ctc}) + (1-\alpha)\,\mathcal{L}_{att}(\theta_e, \theta_{att}),$$
with $\alpha \in [0,1]$ a trade-off parameter between the two decoders.
We now formally describe the two losses $\mathcal{L}_{ctc}$ and $\mathcal{L}_{att}$. We denote each sample in the dataset as $(x, y, s)$, where $x = (x_1, \dots, x_T)$ is the sequence of acoustic feature frames, $y = (y_1, \dots, y_L)$ is the sequence of characters in the transcription, and $s$ is the speaker label. In the case of CTC, several intermediate label sequences $\pi = (\pi_1, \dots, \pi_T)$ of length $T$ are created by repeating characters and inserting a special blank label to mark character boundaries. Let $\mathcal{B}^{-1}(y)$ be the set of all such intermediate label sequences. The CTC loss is computed as $\mathcal{L}_{ctc} = -\log p(y|x)$ where $p(y|x) = \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi|x)$. This sum is computed by assuming conditional independence given $x$, hence $p(\pi|x) = \prod_{t=1}^{T} p(\pi_t|x)$. The attention branch does not require an intermediate label representation and conditional independence is not assumed, hence the loss is simply computed as $\mathcal{L}_{att} = -\sum_{l=1}^{L} \log p(y_l \mid x, y_{1:l-1})$.
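The marginalization over intermediate label sequences in the CTC loss can be computed efficiently with the standard forward (dynamic programming) recursion rather than by enumerating all sequences. A minimal NumPy sketch of this equivalence, using an illustrative 3-symbol vocabulary and toy frame posteriors (none of these values come from the paper):

```python
import itertools
import numpy as np

BLANK = 0

def collapse(pi):
    """B(pi): collapse repeated labels, then remove blanks."""
    out, prev = [], None
    for p in pi:
        if p != prev:
            out.append(p)
        prev = p
    return tuple(l for l in out if l != BLANK)

def ctc_prob_bruteforce(probs, y):
    """p(y|x) by summing over all length-T label sequences pi with B(pi) = y."""
    T, V = probs.shape
    total = 0.0
    for pi in itertools.product(range(V), repeat=T):
        if collapse(pi) == tuple(y):
            total += np.prod([probs[t, pi[t]] for t in range(T)])
    return total

def ctc_prob_forward(probs, y):
    """Same quantity via the CTC forward recursion (dynamic programming)."""
    T = probs.shape[0]
    ext = [BLANK]                      # extended sequence: blanks around labels
    for l in y:
        ext += [l, BLANK]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            # skip transition allowed only between distinct non-blank labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # end in the final label or the final blank
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

# Toy frame posteriors over {blank, 'a', 'b'} for T = 4 frames.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7],
                  [0.7, 0.1, 0.2]])
y = [1, 2]  # "ab"
assert np.isclose(ctc_prob_bruteforce(probs, y), ctc_prob_forward(probs, y))
```

The brute-force sum is exponential in $T$, while the forward recursion is $O(T \cdot L)$, which is what makes CTC training practical.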
2.2 Speaker-adversarial model
In order to encourage the network to learn representations that are not only good for ASR but also hide the speaker identity, we propose to extend the above architecture with what we call a speaker-adversarial branch. This branch models an adversary which attempts to infer the speaker identity $s$ from the encoded representation $z$. We denote by $\theta_{adv}$ the parameters of the speaker-adversarial branch. Given the encoder parameters $\theta_e$, the goal of the adversary is to find $\theta_{adv}$ that minimizes the loss $\mathcal{L}_{adv}(\theta_e, \theta_{adv})$. Our new model is then trained in an end-to-end manner by optimizing the following min-max objective:
$$\min_{\theta_e, \theta_{ctc}, \theta_{att}} \; \max_{\theta_{adv}} \; \mathcal{L}_{asr}(\theta_e, \theta_{ctc}, \theta_{att}) - \lambda\,\mathcal{L}_{adv}(\theta_e, \theta_{adv}),$$
where $\lambda \geq 0$ is a trade-off parameter between the ASR objective and the speaker-adversarial objective. The baseline network can be recovered by setting $\lambda = 0$. Note that the max part of the objective corresponds to the adversary, which controls only the speaker-adversarial parameters $\theta_{adv}$. The goal of the speaker-adversarial branch is to act as a "good adversary" and produce useful gradients for removing speaker identity information from the encoded representation $z$. In practice, we use a gradient reversal layer [ganin2016domain]
between the encoder and the speaker-adversarial branch so that the whole network can be trained end-to-end via backpropagation. We refer to Fig. 1 for an illustration of the full architecture.
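A gradient reversal layer acts as the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass, so the encoder is updated to increase the adversary's loss. A minimal framework-free sketch of this behavior (the class name and scalar interface are illustrative, not taken from any specific toolkit):

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""

    def __init__(self, lambd):
        self.lambd = lambd

    def forward(self, z):
        # The encoded representation passes through unchanged.
        return z

    def backward(self, grad_from_adversary):
        # The adversary's gradient is flipped (and scaled), so the encoder
        # receives an update direction that *increases* the adversary's loss.
        return -self.lambd * grad_from_adversary

grl = GradientReversal(lambd=0.5)
assert grl.forward(3.0) == 3.0      # unchanged activation
assert grl.backward(2.0) == -1.0    # reversed, scaled gradient
```

In an autodiff framework this is typically implemented as a custom operator, which lets the full min-max objective be optimized with ordinary gradient descent.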
3 Experimental evaluation
3.1 Datasets
We use the Librispeech corpus [panayotov2015librispeech] for all the experiments, with different subsets for ASR training, adversarial training, and speaker verification. For the sake of clarity we refer to them as data-full, data-adv, and data-spkv, respectively (see Table 1). The data-full set is almost the original Librispeech corpus (train-960 for training, dev-clean and dev-other for validation, and test-clean and test-other for test), except that utterances with more than 3,000 frames or more than 400 characters have been removed from train-960 for faster training.
The data-adv set is a 100 h subset of train-960, obtained by removing long utterances from the original Librispeech train-100 set as above. Since the speakers in the original train/dev/test splits are disjoint, data-adv is itself split into three subsets in order to perform closed-set speaker identification experiments. There are 251 speakers in data-adv: we assign 2 utterances per speaker to each of test-adv and dev-adv. The remaining utterances are used for training and referred to as train-adv.
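The per-speaker split described above can be sketched as follows, holding out 2 utterances per speaker for each of test-adv and dev-adv (the helper name and utterance IDs are made up for illustration):

```python
def split_data_adv(utts_by_speaker, n_held_out=2):
    """Split each speaker's utterances into train/dev/test subsets.

    utts_by_speaker: dict mapping speaker ID -> list of utterance IDs.
    Each of test-adv and dev-adv receives n_held_out utterances per speaker;
    the remainder goes to train-adv, so every speaker appears in all subsets.
    """
    train, dev, test = {}, {}, {}
    for spk, utts in utts_by_speaker.items():
        test[spk] = utts[:n_held_out]
        dev[spk] = utts[n_held_out:2 * n_held_out]
        train[spk] = utts[2 * n_held_out:]
    return train, dev, test

# Toy example with 2 speakers and 6 utterances each.
data = {"spk1": [f"spk1-{i}" for i in range(6)],
        "spk2": [f"spk2-{i}" for i in range(6)]}
train, dev, test = split_data_adv(data)
assert all(len(test[s]) == 2 and len(dev[s]) == 2 for s in data)
assert all(len(train[s]) == 2 for s in data)
```

Keeping every speaker in all three subsets is what makes the identification experiment closed-set.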
For speaker verification with x-vectors [snyder2018x], we use data-spkv, which is again derived from data-full. The train-960 subset was augmented with room impulse responses, isotropic and point-source noises [ko2017study], as well as music and speech [musan2015], as per the standard sre16 recipe for training x-vectors [snyder2018x] from the Kaldi toolkit [povey2011kaldi], which we adapted to Librispeech. This increased the amount of data by a factor of 4. A subset of the augmented data containing 373,985 utterances was used to train the x-vector representation and another subset containing 422,491 utterances to train the probabilistic linear discriminant analysis (PLDA) backend. These subsets are referred to as train-spkv and train-plda, respectively. For evaluation, we built an enrollment set (test-clean-enroll) and a trial set (test-clean-trial) from the test-clean data. Out of the 40 speakers in test-clean, 29 were selected based on sufficient data availability. For each speaker, we selected a 1 min subset (after speech activity detection) for enrollment and used the rest for trials. The details of the trials are given in Table 2.
[Table 1: statistics of the data subsets (dataset, data split, # utts, duration (h))]
[Table 2: details of the trials. # Genuine trials: 449 / 548; # Impostor trials: 9,457 / 11,196]
3.2 Evaluation metrics
For all tested systems, we measure ASR performance in terms of the word error rate (WER), and we assess the amount of speaker identity information in the encoded speech representation in terms of both speaker classification accuracy (ACC) and speaker verification equal error rate (EER). The WER is reported on the test-clean set. The ACC measures how well speakers can be discriminated in a closed-set setting, i.e., when the speakers are known at training time. It is evaluated over the test-adv set using the same classifier architecture as the speaker-adversarial branch of the proposed model (see Section 2). As opposed to the ACC, the EER measures how well the representations hide the speaker identity for unknown speakers, in an open-set scenario. It reflects the process of confirming whether a person is actually who the attacker suspects they are. It is evaluated over the trial set (see Table 2) using the x-vector-PLDA system.
The ACC and the EER are computed for different representations: the baseline filterbank features, the representation encoded by the network trained for ASR only (corresponding to $\lambda = 0$), as well as those obtained with the speaker-adversarial approach (corresponding to $\lambda > 0$).
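The EER is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). A small sketch computing it from verification scores by sweeping thresholds (the score values below are illustrative, not the paper's results):

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return the EER given similarity scores for genuine and impostor trials.

    Sweeps all observed scores as thresholds and returns the average of the
    false acceptance rate (FAR) and false rejection rate (FRR) at the
    threshold where the two rates are closest.
    """
    best_gap, best_eer = float("inf"), None
    for thr in sorted(set(genuine_scores) | set(impostor_scores)):
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Perfectly separated scores -> EER = 0.
assert equal_error_rate([0.9, 0.8], [0.1, 0.2]) == 0.0
# One genuine trial below and one impostor trial above the best threshold -> EER = 0.25.
assert equal_error_rate([0.8, 0.7, 0.6, 0.4], [0.5, 0.3, 0.2, 0.1]) == 0.25
```

A lower EER means the attacker verifies speakers more easily, so from a privacy standpoint a higher EER is better.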
3.3 Network architecture and training
For all experiments, we use the ESPnet [watanabe2018espnet] toolkit which implements the hybrid CTC/attention architecture [watanabe2017hybrid]. The input features are 80-dimensional mel-scale filterbank coefficients with pitch and energy features, totalling 84 features per frame. The encoder
is composed of a VGG-like convolutional neural network (CNN) block followed by 5 bidirectional long short-term memory (LSTM) layers with 1,024 units. The VGG block contains 4 convolutional layers followed by max pooling, with the numbers of feature maps following the default ESPnet VGG configuration. The attention-based decoder consists of location-aware attention [chorowski2015attention] with 10 convolutional channels of size 100 each, followed by 2 LSTM layers with 1,024 units. The CTC loss is computed over several possible label sequences using dynamic programming. In all experiments, the trade-off parameter $\alpha$ between the two decoder losses is set to 0.5. We train a single-layer recurrent neural network language model (RNNLM) with 1,024 hidden units on the train-960 transcriptions and use it to rescore the ASR hypotheses. The resulting WER is very close to the state of the art [zeghidour2018fully] when training on train-960. Finally, we implemented the speaker-adversarial
branch as 3 bidirectional LSTM layers with 512 units followed by a softmax layer with 251 outputs corresponding to the 251 speakers in data-adv. The adversarial loss is summed across all vectors in the encoded sequence. The speaker label is duplicated to match the length $T'$ of this sequence, which is smaller than the input length $T$ due to the subsampling performed within the encoder. Due to this subsampling, as well as to the use of bidirectional LSTM layers within the encoder and the speaker-adversarial branch, the frame-level adversarial loss approximates well an utterance-level speaker loss that would be computed from a fixed-size utterance-level representation, while being easier to train.
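The frame-level adversarial loss described above duplicates the utterance's speaker label across the $T'$ encoder output frames and sums the per-frame cross-entropy. A NumPy sketch of this computation (shapes and values are illustrative):

```python
import numpy as np

def frame_level_speaker_loss(logits, speaker_id):
    """Sum of per-frame cross-entropy losses against a duplicated speaker label.

    logits: array of shape (T_enc, n_speakers), one score vector per encoded frame.
    speaker_id: single integer label, duplicated across all T_enc frames.
    """
    # Numerically stable log-softmax, computed independently per frame.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Duplicate the utterance-level speaker label over the frame axis.
    labels = np.full(logits.shape[0], speaker_id)
    return -log_probs[np.arange(logits.shape[0]), labels].sum()

# Uniform logits over 4 speakers and 5 frames: each frame contributes log(4).
loss = frame_level_speaker_loss(np.zeros((5, 4)), speaker_id=2)
assert np.isclose(loss, 5 * np.log(4))
```

Because every frame carries the same label, the summed loss behaves like an utterance-level speaker loss while remaining a simple per-frame objective.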
In all experiments, we start by pre-training the ASR branch for 10 epochs over data-full, and then the speaker-adversarial branch for 15 epochs on data-adv in order to obtain a strong adversary on the pre-trained encoded representations. Then, due to time constraints, all networks are fine-tuned on data-adv: we run 15 epochs of adversarial training (which reduces to simple ASR training when $\lambda = 0$). As a result, the WER is comparable to that typically achieved by end-to-end methods trained on the train-100 subset of Librispeech rather than the full train-960 set. Finally, freezing the resulting encoder, we further fine-tune the speaker-adversarial branch alone for 5 epochs to make sure that the reported ACC reflects the performance of a well-trained adversary.
The encoder network contains 133.5M parameters. To encode a 10 s audio file, it performs about $1.1 \times 10^{12}$ arithmetic operations, which can be executed in parallel in 17.6 s on a 40-core CPU or in 149 ms on a single Tesla P100 GPU.
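These timings imply an effective throughput of roughly 62 G ops/s on the CPU and about 7.4 T ops/s on the GPU, i.e., a GPU speedup of about 118x:

```python
ops = 1.1e12          # arithmetic operations to encode a 10 s audio file
cpu_seconds = 17.6    # 40-core CPU
gpu_seconds = 0.149   # single Tesla P100 GPU

cpu_throughput = ops / cpu_seconds   # operations per second on the CPU
gpu_throughput = ops / gpu_seconds   # operations per second on the GPU
speedup = cpu_seconds / gpu_seconds  # GPU speedup over the CPU

assert abs(cpu_throughput / 1e9 - 62.5) < 0.1    # ~62.5 G ops/s
assert abs(gpu_throughput / 1e12 - 7.38) < 0.01  # ~7.4 T ops/s
assert 118 < speedup < 119                       # ~118x faster on the GPU
```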
3.4 Results and discussion
We train our speaker-adversarial network for three values of $\lambda$ ranging from 0.5 to 2, leading to three encoded representations $z_\lambda$. Recall that $\lambda = 0$ corresponds to the baseline ASR system, as it ignores the speaker-adversarial branch. Table 3 summarizes the results.
The first column presents the ACC and EER obtained with the input filterbank features, which are consistent with the numbers reported in the literature. As expected, speaker identification and verification can be performed with very high accuracy on these features. Using the representation encoded by the network trained for ASR only already provides a significant privacy gain: the ACC is divided by 2 and the EER is multiplied by 4, which suggests that a reasonable amount of speaker information is removed during ASR training. Nevertheless, this representation still contains some speaker identity information.
More interestingly, our results clearly show that adversarial training drastically reduces speaker identification performance but not verification performance. Indeed, contrary to the speaker-invariance claims of several previous studies, we observe that verification performance actually improves after adversarial training. This exhibits a possible limitation in the generalization of adversarial training to unseen speakers and hence calls for further investigation. The disparity between classification and verification performance might stem from the fact that the speaker-adversarial branch does not inherently perform verification and hence is not optimized against that task. It might also be attributed to the representation capacity of that branch, to the number of speakers seen during adversarial training, and/or to the exact range of $\lambda$ needed for generalizable anonymization. These factors of variation open several avenues for future experiments.
We also notice that the WER stays reasonably low, stabilizing around 12.5% as $\lambda$ increases from 0.5 to 2, which is just 1.6% absolute above the baseline ($\lambda = 0$).
We evaluate whether utterances from the same speaker stay in the same neighborhood or are scattered across the representation space. We compute t-SNE embeddings of the x-vector representations of 20 utterances for each of 10 speakers (5 male, 5 female), shown in Figure 2. When using filterbanks, we observe well-clustered utterances. The clusters break down when training the x-vectors on the representation learned for ASR only. For the x-vectors trained on the adversarial representations, the clusters start to re-emerge. The silhouette scores of the x-vectors extracted from the different representations are consistent with the observed EER values.
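The silhouette score quantifies how well the x-vector clusters match the speaker labels: for each point, it compares the mean intra-speaker distance $a$ to the smallest mean distance $b$ to another speaker's utterances, via $s = (b - a)/\max(a, b)$. A NumPy sketch on toy 2-D points (the real scores would be computed on the x-vectors themselves):

```python
import numpy as np

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (labels: one cluster ID per point)."""
    points, labels = np.asarray(points, float), np.asarray(labels)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i in range(len(points)):
        same = (labels == labels[i])
        same[i] = False
        a = dists[i, same].mean()                 # mean intra-cluster distance
        b = min(dists[i, labels == c].mean()      # mean distance to nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated "speakers" -> score close to 1; shuffled labels -> much lower.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
assert silhouette_score(pts, [0, 0, 1, 1]) > 0.9
assert silhouette_score(pts, [0, 1, 0, 1]) < 0.1
```

Scores near 1 thus indicate tight, well-separated speaker clusters (bad for privacy), while scores near 0 or below indicate that speaker identity is poorly clustered in the representation.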
4 Conclusions and future work
We proposed to combine CTC and attention losses with a speaker-adversarial loss within an end-to-end framework, with the goal of learning privacy-preserving representations for ASR. Such representations could be safely transmitted to cloud services for decoding. We investigated the level of speaker anonymization achieved by adversarial training through closed-set speaker classification and open-set speaker verification measures. Adversarial training dramatically reduces the closed-set classification accuracy, seemingly indicating a high level of anonymization. However, this observation is not matched by the open-set verification results, which correspond to the realistic scenario of an attacker trying to confirm the identity of a suspected speaker. We therefore conclude that adversarial training does not immediately generalize to produce anonymous representations of speech. We hypothesize that this disparity may be attributed to the representation capacity of the adversarial branch, the size of the training set, the formulation of the adversarial loss, and/or the value of the trade-off parameter with the ASR loss. As future work, we plan to modify the speaker-adversarial branch to inherently optimize for verification instead of classification, and to ascertain the impact of these experimental choices over different datasets, including languages not seen in training.
This work was supported in part by the European Union’s Horizon 2020 Research and Innovation Program under Grant Agreement No. 825081 COMPRISE (https://project.inria.fr/comprise/) and by the French National Research Agency under project DEEP-PRIVACY (ANR-18-CE23-0018). Experiments were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations. The authors would like to thank Md Sahidullah for providing the speaker verification data split.