Speech technology has developed rapidly in recent years, with applications such as virtual assistants that receive speech commands as input. On the one hand, this technology makes our lives more convenient. On the other hand, it raises privacy issues because speech encapsulates personally identifiable information (PII) such as age, gender, health, and emotional state. This issue has been escalating due to the increasing availability of advanced speech synthesis technology: with only a few speech samples spoken by a target speaker, a new utterance in that speaker's voice can easily be cloned.
To preserve speech privacy, speaker anonymization or de-identification is applied as a preprocessing step before speech is publicly distributed. Research on speaker anonymization is a relatively new area that emerged from efforts to suppress speaker identity in order to fool speaker identification systems. In prior studies, the KAL diphone voice was used for voice transformation [3, 4] and could successfully fool earlier speaker identification systems based on Gaussian mixture models (GMMs). In addition, another approach based on cepstral frequency warping has been used for speech transformation for the purpose of de-identification.
Although a few solutions have been proposed for anonymization, no formal definition of the task was in place, which caused ambiguity in assessing the level of anonymization performance. Hence, the VoicePrivacy 2020 Challenge (VP2020) was initiated with the goal of fostering the development of speaker anonymization techniques by defining the task, metrics, scenarios, and a benchmark of initial solutions (baselines). In this challenge, an anonymization system suppresses PII while maintaining the intelligibility and naturalness of the speech signal. Two baseline systems were provided: (1) a primary baseline based on state-of-the-art x-vector embeddings and neural waveform modelling, and (2) a secondary baseline based on the manipulation of pole locations derived by linear predictive coding (LPC) using the McAdams coefficient.
As a result of VP2020, several systems have been proposed for speaker anonymization. Most of these systems focused on improving the primary baseline [9, 10, 11], since it achieved superior results compared to the secondary baseline. In this paper, we focus on improving the security of the secondary baseline, which is much less complex than the primary baseline and requires no training data. Our proposed method is based on a speech watermarking approach. With speech watermarking, it is possible to imperceptibly embed a secret message (also called a watermark) that can be used to verify the authenticity of the speech. We speculate that by embedding a watermark in anonymized speech, we should be able to confirm the originality of the speech or use it to prevent speech spoofing.
In Section 2 of this paper, we provide a brief overview of related studies. Section 3 introduces our proposed method. Our experimental setting, evaluations, and results are reported in Section 4. We conclude in Section 5 with a brief summary and mention of future work.
2 McAdams coefficient-based Speaker Anonymization
As one solution for preserving the privacy of speech data, VP2020 was initiated with the goal of fostering the development of speaker anonymization techniques. Such techniques are ideally expected to suppress the leakage of PII while maintaining the linguistic information of the speech signal. In this challenge, four requirements were set for a speaker anonymization technique: (1) the output must be a speech waveform, (2) it must maximize the suppression of speaker individuality information, (3) it must preserve speech naturalness and intelligibility, and (4) it must ensure that the voices of different speakers remain distinguishable. The challenge also provided two baseline systems (https://github.com/Voice-Privacy-Challenge/Voice-Privacy-Challenge-2020), datasets, and protocols for evaluating speaker anonymization performance.
The primary baseline system was developed on the basis of x-vector embeddings and a neural source-filter waveform model. The idea behind this system is to modify the speaker identity information (x-vector) after disentangling it from the speech content (phoneme posteriorgram and the fundamental frequency). The secondary baseline system (shown in Fig. 1) was developed on the basis of modifications to the McAdams coefficient. The McAdams coefficient is related to the adjustment of harmonic frequency distributions, which affects the perception of timbre. Although the results of the secondary baseline were not as good as those of the primary baseline, it requires no training data and is much less complex.
The McAdams coefficient proposed in  is a parameter derived on the basis of the additive synthesis method in music signal processing. This method is applied to timbre generation by resynthesizing multiple harmonic cosinusoidal oscillations, as

$s(t) = \sum_{k=1}^{K} A_k \cos(2\pi f_0 k^{\alpha} t + \phi_k),$   (1)

where $s(t)$ is the signal synthesized by combining harmonic cosinusoidal oscillations (an inverse Fourier series), $k$ is the harmonic index, $f_0$ is the fundamental frequency, $A_k$ is the amplitude, $\phi_k$ is the phase, and $\alpha$ is the McAdams coefficient. Prior work on speaker anonymization has shown that the McAdams coefficient can transform the spectral envelope of speech signals and affect timbre perception. To examine this in more detail, we show the spectral envelopes of the frequency response for various McAdams coefficients in Fig. 2. We can see that the farther the McAdams coefficient is from that of the original speech ($\alpha = 1$), the greater the shift of the spectral envelope. The degree of this shift affects our perception of the speech formants.
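To make the role of the McAdams coefficient concrete, the harmonic-frequency warping $f_k = f_0 k^{\alpha}$ implied by the additive synthesis above can be sketched as follows (a minimal illustration; the fundamental frequency of 100 Hz and the number of harmonics are arbitrary choices):

```python
import numpy as np

def mcadams_harmonics(f0, n_harmonics, alpha):
    """Harmonic frequencies f0 * k**alpha; alpha = 1 leaves the
    harmonic series unchanged (the original speech)."""
    k = np.arange(1, n_harmonics + 1, dtype=float)
    return f0 * k ** alpha

original = mcadams_harmonics(100.0, 5, 1.0)  # [100, 200, 300, 400, 500]
shifted = mcadams_harmonics(100.0, 5, 0.8)   # upper harmonics pulled toward f0
```

With $\alpha < 1$ the upper harmonics are compressed downward, which is the spectral-envelope shift visible in Fig. 2.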
3 Proposed Method
In this study, we propose a method based on the secondary baseline to enhance the security of anonymized speech by using a watermarking approach. Speech watermarking aims to protect the security of a speech signal by imperceptibly embedding within it a particular message, such as a signature that indicates the speech's ownership. Speech watermarking should fulfill at least three requirements: inaudibility (not perceivable by the human auditory system), blindness (detection without the availability of the original signal), and robustness against common signal processing operations. The trade-off between inaudibility and robustness has been the most pressing issue in existing speech watermarking techniques. In the present work, we investigate the effectiveness of using a watermarking approach for a McAdams coefficient-based speech anonymization method with regard to these requirements.
Generally, speech watermarking consists of two main processes: embedding and detection. Figure 3 shows the block diagram of our embedding process. As the first step, we generated the anonymized signals from the original signal using two different McAdams coefficients ($\alpha_0$ and $\alpha_1$). The anonymization procedure follows the steps in Fig. 1.
First, the original speech was segmented into frames whose length depends on the watermarking payload. Each speech frame was then analyzed using linear predictive coding (LPC) with an order of 20 ($p = 20$). The mathematical form of LPC is characterized by the following difference equation:

$s(n) = \sum_{i=1}^{p} a_i s(n-i) + e(n),$   (2)

where $s(n)$ is the speech signal, $a_i$ is the filter coefficient of the $i$-th order, $p$ is the maximum order of the prediction (in this study $p = 20$), and $e(n)$ is the prediction error. The corresponding transfer function $H(z)$ for Eq. (2) is represented by a twentieth-order all-pole autoregressive filter, which is given by:

$H(z) = \dfrac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}}.$   (3)
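As an illustration of how the LP coefficients can be estimated from a frame, the following is a minimal Levinson-Durbin recursion (an assumption on our part; the document does not specify which LPC solver is used, and the order-1 demo signal is ours):

```python
import numpy as np

def lpc(x, order):
    """Estimate LP coefficients via the Levinson-Durbin recursion.
    Returns the polynomial A(z) = [1, -a_1, ..., -a_p], where the a_i
    are the prediction coefficients in s(n) ~ sum_i a_i * s(n-i)."""
    n = len(x)
    # autocorrelation r[0..order]
    r = np.array([np.dot(x[:n - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                 # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k             # remaining prediction error
    return a

# A decaying exponential obeys s(n) = 0.5 * s(n-1) exactly,
# so a first-order analysis recovers A(z) ~ [1, -0.5].
coeffs = lpc(0.5 ** np.arange(50), 1)
```

In this study the analysis would use `order=20`, matching Eq. (3).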
From the LPC analysis, we obtained the linear prediction coefficients (LP coefficients) and residuals. The LP coefficients ($a_i$) were then used to derive the pole positions. The derived poles comprised complex poles (poles with non-zero imaginary parts) and real poles (poles with zero imaginary parts). Each complex pole's angle $\phi$ was shifted to $\phi^{\alpha}$, rotating the pole either clockwise or counter-clockwise depending on the McAdams coefficient. The effect of this angle shifting on the spectral envelope is shown in Fig. 2. After the modification by the McAdams coefficient, the modified complex poles and the unchanged real poles were converted back to LP coefficients. The anonymized speech frame was resynthesized from these LP coefficients and the original residuals.
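The pole modification described above can be sketched as follows (an illustrative implementation that warps each complex pole's angle $\phi$ to $\phi^{\alpha}$ while leaving real poles and all pole magnitudes untouched; the polynomial layout is an assumption):

```python
import numpy as np

def mcadams_poles(lpc_poly, alpha):
    """Shift each complex pole's angle phi to phi**alpha, keep real
    poles and pole magnitudes unchanged, and convert back to the
    polynomial coefficients of the all-pole filter.
    lpc_poly is the denominator polynomial [1, c1, ..., cp]."""
    poles = np.roots(lpc_poly)
    modified = []
    for p in poles:
        if abs(p.imag) > 1e-12:               # complex pole: warp its angle
            phi = np.angle(p)
            new_phi = np.sign(phi) * abs(phi) ** alpha
            modified.append(abs(p) * np.exp(1j * new_phi))
        else:                                  # real pole: leave as-is
            modified.append(p)
    # conjugate pairs stay paired, so the rebuilt polynomial is real
    return np.real(np.poly(modified))
```

Because conjugate pole pairs are warped symmetrically, the resynthesized filter remains real-valued and stable (pole magnitudes are preserved).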
After obtaining the anonymized signals, we normalized them to a similar relative loudness (a fixed target peak level in decibels relative to full scale (dBFS)) and range of frequency components (using a bandpass filter (BPF)). The cut-off frequencies of the BPF were 125 Hz and 4 kHz. Finally, we constructed the watermarked signal ($s_w$) by frame-by-frame concatenation of the two anonymized signals, selecting one or the other according to the watermark bit-stream.
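The frame-by-frame construction of the watermarked signal can be sketched as follows (a simplified view that omits the loudness normalization and BPF steps; the names `anon0`/`anon1` for the two anonymized signals are our own):

```python
import numpy as np

def embed_bits(anon0, anon1, bits, frame_len):
    """Concatenate, frame by frame, the alpha0-anonymized signal for
    bit 0 and the alpha1-anonymized signal for bit 1."""
    frames = [
        (anon1 if b else anon0)[i * frame_len:(i + 1) * frame_len]
        for i, b in enumerate(bits)
    ]
    return np.concatenate(frames)
```

The payload therefore fixes the frame length: at 4 bps and a 16 kHz sampling rate, each frame is 4000 samples.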
We found that the anonymized signals produced by different McAdams coefficients carried different amounts of power, specifically in the lower frequency components: a higher McAdams coefficient results in higher power at low frequencies. On the basis of this characteristic, we determined a power threshold for the blind detection process (shown in Fig. 4). Detection was conducted by comparing the power spectrum, obtained by the fast Fourier transform (FFT) of the watermarked signal ($s_w$) in a specific frequency range, with the designated threshold.
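A minimal sketch of this blind detection follows (the band edges and the use of a relative power ratio are our assumptions; the method only specifies thresholding low-frequency power):

```python
import numpy as np

def detect_bits(watermarked, frame_len, fs, threshold, band=(125.0, 500.0)):
    """For each frame, compare the fraction of FFT power inside a
    low-frequency band against a threshold; the higher McAdams
    coefficient (bit 1) yields higher low-frequency power."""
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    bits = []
    for start in range(0, len(watermarked) - frame_len + 1, frame_len):
        spec = np.abs(np.fft.rfft(watermarked[start:start + frame_len])) ** 2
        bits.append(1 if spec[mask].sum() / spec.sum() > threshold else 0)
    return bits
```

Because detection needs only the watermarked signal and a fixed threshold, the scheme satisfies the blindness requirement.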
4 Evaluation
This section describes the experimental setting in our study and reports the objective evaluation results.
4.1 Experimental Setting
We used the LibriSpeech and VCTK datasets (development and testing sets) that were provided in VP2020. LibriSpeech is an English speech corpus designed for automatic speech recognition (ASR) research, sampled at 16 kHz. VCTK is an English speech corpus that contains 109 native speakers with various accents, designed for text-to-speech (TTS) research and sampled at 48 kHz. The development and testing data of both datasets in VP2020 consist of more than 20,000 utterances from almost 200 speakers. The sampling rate of the speech data was set to 16 kHz. As with the secondary baseline, our proposed method requires no training data. The McAdams coefficient used to represent bit "0" was 0.6 ($\alpha_0 = 0.6$) and that for bit "1" was 0.8 ($\alpha_1 = 0.8$). We evaluated the speaker verifiability of our proposed method by using a pretrained automatic speaker verification system (ASVeval) and the intelligibility by using a pretrained automatic speech recognition system (ASReval), following the VP2020 protocol. The suggested payload for a reasonable audio watermark is 72 bits per 30 seconds (around 2–3 bps). In this study, we used a 4-bps watermarked signal to evaluate the anonymization performance.
For speech watermarking, we also evaluated the speech quality and robustness of our method with a total of 100 randomly selected utterances from the LibriSpeech and VCTK datasets. Since the original signal was not available, we used MOSNet, a pretrained mean opinion score (MOS) predictor: an objective, deep learning-based tool for predicting human MOS ratings in a voice conversion system. Subsequently, we evaluated the robustness of our proposed method by calculating the bit error rate (BER) and balanced F1-score during normal (no attack) operation as well as under several signal processing operations, including noise addition, resampling, requantization, compression, and speech codecs. We also examined the security level by calculating the false acceptance rate (FAR) and false rejection rate (FRR). The maximum acceptable BER threshold used as the robustness indication follows the IHC evaluation criteria. We embedded random binary streams with payloads of 2, 4, 8, 16, and 32 bps and varied the detection threshold from lower to higher payloads (0.15, 0.09, 0.05, 0.02, and 0.01, respectively). Due to space limitations, the results reported in this paper are mainly mean values.
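The BER and F1-score used in the robustness evaluation can be computed as follows (a standard formulation; the exact definition of the "balanced" F1-score in our evaluation tooling may differ):

```python
import numpy as np

def ber(sent, detected):
    """Fraction of embedded bits that were detected incorrectly."""
    sent, detected = np.asarray(sent), np.asarray(detected)
    return float(np.mean(sent != detected))

def f1_score(sent, detected):
    """F1 on the positive class (bit 1): harmonic mean of
    precision and recall over the detected bit-stream."""
    sent, detected = np.asarray(sent), np.asarray(detected)
    tp = int(np.sum((sent == 1) & (detected == 1)))
    fp = int(np.sum((sent == 0) & (detected == 1)))
    fn = int(np.sum((sent == 1) & (detected == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```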
Our intent with this evaluation is mainly to investigate the effectiveness and reliability of the proposed method in anonymizing the PII of the speaker and watermarking the speech. For speaker anonymization, we conducted our evaluation using ASVeval to check the speaker verifiability performance in three cases: original enrolls and original trials (o-o), original enrolls and anonymized trials (o-a), and anonymized enrolls and anonymized trials (a-a). The results are shown in Fig. 5, where the top and bottom graphs show the results for female and male utterances, respectively. As we can see, the proposed method (P) improved the speaker dissimilarity in comparison to the secondary baseline in VP2020 (B2): the EER increased significantly to more than 35% in the o-a case and to more than 20% in the a-a case for both female and male utterances.
Table 1: MOSNet scores of the original, anonymized, and proposed (watermarked) signals.
|Signal|Payload (bps)|MOS|
|original|-|3.15 ± 0.49|
|anonymized|-|2.70 ± 0.18|
|proposed method|2|2.73 ± 0.20|
|proposed method|4|2.73 ± 0.21|
|proposed method|8|2.70 ± 0.19|
|proposed method|16|2.67 ± 0.18|
|proposed method|32|2.60 ± 0.18|
For speech intelligibility, we conducted an objective evaluation using ASReval trained on original data and on anonymized data. The results are shown in Fig. 6, where the blue and orange bars indicate the results obtained from ASReval trained on original and anonymized data, respectively. In contrast to the speaker verifiability, our proposed method caused a noticeable degradation in intelligibility, especially for the VCTK dataset (the WER increased from 11.79% to 54.78%). This degradation mainly occurred due to the anonymization method, as we obtained an almost similar average WER with the stochastic approach proposed in prior work. The farther the McAdams coefficient is shifted, the higher the WER. To improve the intelligibility, we could retrain ASReval using the anonymized data. For example, the WER shown by the blue bar for "VCTK (P)" could be reduced to 11.91%.
Inaudibility and robustness are essential requirements in speech watermarking. To evaluate the inaudibility of the proposed method, we used MOSNet to derive the MOS of the watermarked signal. Table 1 shows the MOSNet results of the original signal, the anonymized signal, and the output signal of our proposed method with various payloads. We can see that there was a speech quality degradation (MOS degraded from 3.15 to 2.70) caused by the McAdams coefficient-based anonymization method, while in contrast, the proposed method could maintain a similar MOS even with a relatively high payload.
For the robustness test, we examined nine cases: no attack (normal), addition of white Gaussian noise (AWGN), downsampling to 8 kHz (resample-8), upsampling to 24 kHz (resample-24), bit compression to 8 bits (requantize-8), bit extension to 24 bits (requantize-24), MP3 compression (MP3), flash video format (flv), and the G.723.1 codec. For AWGN processing, the signal-to-noise ratio was 40 dB. The bitrate range for MP3 compression was 220–260 kbps, and the bitrate of the G.723.1 codec was 5.3 kbps with the algebraic code-excited linear prediction (ACELP) algorithm. As the results show, our proposed method was robust against all operations (the BER was similar to the normal case) except for conversion to the flv video codec. These results demonstrate that our method is suitable for watermarking purposes, since the BER for 4 bps satisfied the robustness criteria. The results also suggest that the security level indicated by the FAR, FRR, and F1-score is adequate for payloads up to 4 bps.
In summary, our proposed method significantly improved the speaker dissimilarity measured by ASVeval. Although intelligibility was degraded, the WER could be reduced by simply retraining ASReval. The evaluation of speech watermarking also confirmed that our proposed method resulted in an inaudible and robust watermark. As an example, Fig. 9 shows an original image watermarked into the speech together with the detected image after several operations. We can easily perceive the 2021 logo in the detected watermarks for almost all processing (the BER is approximately 7%).
5 Conclusion and Future Work
In this paper, we proposed a technique to improve the security of McAdams coefficient-based speaker anonymization by using a watermarking approach. By adding a watermark to the anonymized signal, we can ensure the originality of the speech (one of the anti-spoofing countermeasures). The performance of the proposed method was evaluated objectively based on the standards in speaker anonymization (the VP2020 protocol) and watermarking. The results showed that our method is suitable for watermarking, as it satisfies the blindness, inaudibility, and robustness requirements. In addition, it significantly improved the speaker dissimilarity in ASVeval. The limitation of our method compared to the baseline system is that it degrades speech intelligibility, although this can be resolved by retraining ASReval with the anonymized speech.
In spite of these promising results, we acknowledge that the evaluation, particularly with regard to human perception (e.g., a subjective speech quality test), might not be sufficient. As future work, we will explore more suitable ways to evaluate the watermarking technique for speaker anonymization. We also intend to optimize the proposed method, as its detection rate and payload are relatively limited.
This work was supported by a Grant-in-Aid for Scientific Research (B) (No. 17H01761), JSPS KAKENHI Grant (No. 20J20580), the Fund for the Promotion of Joint International Research (Fostering Joint International Research (B)) (No. 20KK0233), and the KDDI Foundation (Research Grant Program).
-  N. Tomashenko, B. M. L. Srivastava, X. Wang, E. Vincent, A. Nautsch, J. Yamagishi, N. Evans, J. Patino, J.-F. Bonastre, P.-G. Noé, and M. Todisco, “The VoicePrivacy 2020 Challenge evaluation plan,” 2020.
-  S. Ö. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” CoRR, vol. abs/1802.06006, 2018. [Online]. Available: http://arxiv.org/abs/1802.06006
-  Q. Jin, A. R. Toth, A. W. Black, and T. Schultz, “Is voice transformation a threat to speaker identification?” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 4845–4848.
-  Q. Jin, A. R. Toth, T. Schultz, and A. W. Black, “Speaker de-identification via voice transformation,” in 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009, pp. 529–533.
-  C. Magariños, P. Lopez-Otero, L. Docio-Fernandez, E. Rodriguez-Banga, D. Erro, and C. Garcia-Mateo, “Reversible speaker de-identification using pre-trained transformation functions,” Computer Speech & Language, vol. 46, pp. 36–52, 2017.
-  F. Fang, X. Z. Wang, J. Yamagishi, I. Echizen, M. Todisco, N. Evans, and J.-F. Bonastre, “Speaker anonymization using x-vector and neural waveform models,” 2019.
-  J. Patino, N. Tomashenko, M. Todisco, A. Nautsch, and N. Evans, “Speaker anonymisation using the McAdams coefficient,” arXiv preprint arXiv:2011.01130, 2020.
-  “The VoicePrivacy 2020 Challenge, challenge setup and results,” 2020.
-  C. O. Mawalim, K. Galajit, J. Karnjana, and M. Unoki, “X-vector singular value modification and statistical-based decomposition with ensemble regression modeling for speaker anonymization system,” in Interspeech 2020. ISCA, 2020, pp. 1703–1707. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-1887
-  H. Turner, G. Lovisotto, and I. Martinovic, “Speaker anonymization with distribution-preserving x-vector generation for the VoicePrivacy challenge 2020.” [Online]. Available: https://www.voiceprivacychallenge.org/docs/Oxford.pdf
-  F. M. Ezpinoza-Cuadros, J. M. Perero-Codosero, J. Anton-Martin, and L. A. Hernandez-Gomez, “Speaker de-identification system using autoencoders and adversarial training.” [Online]. Available: https://www.voiceprivacychallenge.org/docs/Sigma.pdf
-  G. Hua, J. Huang, Y. Q. Shi, J. Goh, and V. L. Thing, “Twenty years of digital audio watermarking—a comprehensive review,” Signal Processing, vol. 128, pp. 222–242, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0165168416300263
-  S. McAdams, “Spectral fusion, spectral parsing and the formation of auditory images,” Ph.D. thesis, Stanford University, 1984.
-  C. Dodge and T. A. Jerse, “Computer music: Synthesis, composition, and performance,” 1997.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE ICASSP, 2015, pp. 5206–5210.
-  C. Veaux, J. Yamagishi, and K. Macdonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2017.
-  STEP2001, “News release, final selection of technology toward the global spread of digital audio watermarks,” 2001. [Online]. Available: http://www.jasrac.or.jp/ejhp/release/2001/0629.html
-  C.-C. Lo, S.-W. Fu, W.-C. Huang, X. Wang, J. Yamagishi, Y. Tsao, and H.-M. Wang, “MOSNet: Deep learning based objective assessment for voice conversion,” 2019.
-  IHC Committee, “IHC evaluation criteria and competition.” [Online]. Available: https://www.ieice.org/iss/emm/ihc/IHC_criteriaVer6.pdf