Noise-robust voice conversion with domain adversarial training

01/26/2022
by   Hongqiang Du, et al.
0

Voice conversion has made great progress in the past few years under studio-quality test scenarios in terms of speech quality and speaker similarity. However, in real applications, test speech from the source or target speaker can be corrupted by various environment noises, which seriously degrade speech quality and speaker similarity. In this paper, we propose a novel encoder-decoder based noise-robust voice conversion framework, which consists of a speaker encoder, a content encoder, a decoder, and two domain adversarial neural networks. Specifically, we integrate the technique of disentangling speaker and content representations with domain adversarial training. Domain adversarial training forces the speaker representations and content representations extracted by the speaker encoder and content encoder from clean and noisy speech into the same spaces, respectively. In this way, the learned speaker and content representations are noise-invariant, and the two noise-invariant representations can be taken as input by the decoder to predict the clean converted spectrum. The experimental results demonstrate that our proposed method can synthesize clean converted speech under noisy test scenarios, where the source and target speech can be corrupted by noise types either seen or unseen during training. Additionally, both speech quality and speaker similarity are improved.


1 Introduction

Voice conversion (VC) is a technique to transform the speech signal of a source speaker to sound like that of a target speaker without changing the linguistic content Mohammadi and Kain (2017). This technique has many applications, including voice morphing, emotion conversion, speech enhancement Mouchtaris et al. (2004), movie dubbing as well as other entertainment applications.

Voice conversion has taken some major strides in terms of speech quality and speaker similarity. Various approaches have been proposed, such as Gaussian mixture model (GMM) based methods Benisty and Malah (2011); Stylianou et al. (1998); Toda et al. (2007), frequency warping approaches Erro et al. (2009); Godoy et al. (2011); Tian et al. (2015), exemplar based methods Takashima et al. (2012); Wu et al. (2014); Tian et al. (2017), and neural network based methods Sun et al. (2015); Hsu et al. (2016, 2017a); Kaneko and Kameoka (2018b, a); Kameoka et al. (2018); Tanaka et al. (2019); Zhang et al. (2019); Du et al. (2021); Wang et al. (2021b). Recently, disentangling speaker and linguistic content representations based on deep learning for voice conversion Chou and Lee (2019); Qian et al. (2019); Du and Xie (2021); Wang et al. (2021a) has received much attention. Compared with conventional methods, the source and target speakers used at evaluation time are not required to be seen during training. Disentangling approaches achieve good performance in terms of speech quality and speaker similarity when evaluated under a clean scenario, where both source and target speech are clean.

Despite recent progress, the test speech from the source and target speaker can be corrupted by various environment noises in real applications. Recent studies Takashima et al. (2012); Huang et al. (2021) show that many popular voice conversion frameworks, including the disentangling methods AdaIN-VC Chou and Lee (2019), AUTOVC Qian et al. (2019), and DGAN-VC Chou et al. (2018), suffer serious degradation in speech quality and speaker similarity under noisy conditions. Corpora recorded in studio settings differ greatly from real-world testing conditions. In noisy conditions, linguistic content, speaker identity, and noise are intermingled in the speech signal, and how these factors compose the speech signal and impact each other is far from clear Li et al. (2018). Due to the complex nature of the noise corruption process, the linguistic content representations and speaker representations from clean speech and the corresponding noise-corrupted speech have different distributions Sun et al. (2017); Wang et al. (2018).

There have been a few techniques to address this problem in voice conversion. Non-negative matrix factorization (NMF) Takashima et al. (2012, 2014); Aihara et al. (2015) assumes that speech can be expressed with exemplars and corresponding weights. NMF builds a dictionary consisting of corresponding exemplars from source speech and target speech; the converted spectrum is reconstructed with target exemplars and the weights picked for the source exemplars. However, this method only considers background noise in the source speech, and the target speech must be clean during evaluation. Recently, Hsu et al. (2017b) proposed to learn disentangled latent representations by exploring a hierarchical latent space that encodes different attributes into latent segment variables and latent sequence variables. At run-time, by replacing the latent sequence variable of a noisy utterance with that of a clean utterance, this framework is able to synthesize denoised converted speech. However, this method also requires that the target speech is clean. Therefore, it remains a challenge to generate clean converted speech under the complex noisy conditions of real applications, where both source and target speech are corrupted by various noises.

To build a voice conversion system that is robust to complex noisy conditions, a straightforward idea is to use speech enhancement as a pre-processing module Valentini-Botinhao et al. (2016) to obtain denoised speech for downstream tasks. However, the inevitable distortion in the denoised speech leads to clear quality deterioration in the synthesized speech, a conclusion further confirmed in Yang et al. (2020). Additionally, this approach strongly relies on prior knowledge of the noise Sekkate et al. (2019), which limits its applications. Hsu et al. (2019) explored adversarial training for disentangling the speaker attribute from the noise attribute; this technique can independently control the speaker identity and background noise in the generated speech. Learning noise-invariant features is another successful attempt. Domain adversarial training (DAT) is a popular method to extract domain-invariant representations. A domain adversarial neural network consists of a feature extractor, a gradient reversal layer (GRL) Ganin et al. (2016), a task classifier, and a domain classifier. The feature extractor extracts a representation that is discriminative for the task classifier but, with the help of the GRL, indiscriminate to the domain classifier; such a representation is referred to as domain-invariant. Recently, due to its easy implementation and strong performance, DAT has been applied to speech recognition Sun et al. (2017); Shinohara (2016), speaker verification Wang et al. (2018); Tu et al. (2019), speech enhancement Liao et al. (2018), and wake-up word detection Lim et al. (2020).

Building on the success of prior studies on disentangling content and speaker representations and on domain adversarial training, in this paper we propose a novel encoder-decoder based noise-robust voice conversion method. AdaIN-VC Chou and Lee (2019) is a successful end-to-end disentangling approach, which consists of a speaker encoder, a content encoder, and a decoder that are jointly optimized. Compared with GAN based Kaneko and Kameoka (2018a); Kameoka et al. (2018) and sequence-to-sequence based Tanaka et al. (2019); Zhang et al. (2019) methods, AdaIN-VC can perform non-parallel Hsu et al. (2016) any-to-any voice conversion. Moreover, Huang et al. (2021) confirmed that AdaIN-VC is more robust than DGAN-VC and AUTOVC in noisy conditions. Therefore, we use AdaIN-VC as a case study. Based on the framework design of AdaIN-VC, we extend the content encoder and speaker encoder each with a gradient reversal layer (GRL) and a domain classifier. Note that we do not need a phone-related or speaker-related task classifier to extract the content and speaker representations. The gradient reversal layer degrades the performance of the domain classifier; as a result, the gap between representations extracted from the noisy domain and the clean domain is also reduced. Domain adversarial training thus makes the learned content and speaker representations from noisy and clean speech noise-invariant. In this way, the decoder can synthesize clean converted speech from the noise-invariant content representation of the source speech and the speaker representation of the target speech when either source or target speech is corrupted by noise at run-time.

The main contributions are as follows. First, we learn noise-invariant content and speaker representations for voice conversion in an unsupervised manner. Second, whereas existing works consider only the case in which the source speech is corrupted by noise at run-time, this paper represents a new effort to approach real-life conditions in which the utterances from both the source and the target speaker may be corrupted by various noises. Third, we find that our proposed method is robust to different noise types; even for unseen noise types, our system can synthesize clean converted speech.

The rest of the paper is organized as follows. In Section 2, we briefly discuss the related work, including disentangling content and speaker representations for voice conversion, and domain adversarial training. In Section 3, we introduce our proposed noise-robust voice conversion method. In Section 4, we introduce the experiment setups. In Section 5, we report and analyze the experimental results. We conclude this paper in Section 6.

2 Related Work

In this section, we will give an overview of disentangling content and speaker representations for voice conversion, and domain adversarial training, to set the stage for our study.

2.1 Disentangling content and speaker representations for voice conversion

The speech signal can be factorized into a speaker representation and a content representation; meanwhile, the speech signal can also be recovered from these two explanatory factors of variation Chou and Lee (2019). Voice conversion requires maintaining the content information while changing the speaker information in an utterance. Figure 1 shows the run-time process of a voice conversion framework based on disentangling content and speaker representations. The speaker encoder takes the spectrum of the target speech as input and outputs an utterance-level speaker representation. The content encoder takes the spectrum of the source speech as input and outputs a frame-level linguistic content representation. The decoder utilizes the concatenated latent representations to reconstruct the converted spectrum. In this way, we can control the speaker identity in the generated speech independently. The conversion process is formulated as follows:

x̂ = D(E_s(x_t), E_c(x_s)),    (1)

where x̂ is the converted spectrum, x_s and x_t are the source and target spectra, E_s and E_c denote the speaker encoder and content encoder, and D denotes the decoder.

AdaIN-VC Chou and Lee (2019) adopts the same framework structure as Figure 1 for voice conversion. By carefully designing the neural networks of the speaker encoder and content encoder, the speaker and content representations can be extracted successfully.

Speaker identity is time-independent and barely changes within one utterance, hence it is static Chou and Lee (2019). To disentangle the speaker representation from the spectrum, the speaker encoder first uses 1D convolution layers to obtain frame-level speaker features. It then adopts average pooling layers to aggregate the frame-level speaker information into an utterance-level speaker representation Okabe et al. (2018); Tang et al. (2019).
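The pooling idea can be made concrete with a short PyTorch sketch. This is a simplified stand-in rather than AdaIN-VC's exact speaker encoder (which uses a ConvBank block, a ResNet, and a dense block, as described in Section 4.2); the 256 dimensional mel input and 128 dimensional embedding follow Section 4.2, while the hidden sizes and kernel widths are illustrative assumptions.

```python
import torch
from torch import nn

class SpeakerEncoder(nn.Module):
    """1D convolutions over frames, then average pooling over time to obtain
    an utterance-level speaker embedding. Layer sizes are illustrative."""

    def __init__(self, n_mels=256, hidden=512, spk_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, spk_dim)

    def forward(self, mel):      # mel: (batch, n_mels, frames)
        h = self.convs(mel)      # frame-level speaker features
        h = h.mean(dim=2)        # average pooling over frames
        return self.proj(h)      # utterance-level embedding: (batch, spk_dim)
```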

Figure 1: Run-time process of voice conversion framework based on disentangling content and speaker representations. The speaker encoder extracts speaker representation from target speech. The content encoder extracts content representation from source speech. The decoder takes content and speaker representations as input to reconstruct converted spectrum.

Linguistic content changes dramatically among frames in one utterance, hence it is dynamic Chou and Lee (2019). It is important for the content encoder not to memorize the input spectrum but to encode it in a semantic way Mor et al. (2018). The content encoder compresses the spectrum into a bottleneck representation, which means that a part of the information, e.g. speaker information, is lost during the compression process, while the remaining information should provide the decoder with sufficient information for perfect reconstruction Qian et al. (2019). To further remove speaker-related information, the content encoder also utilizes the instance normalization technique Ulyanov et al. (2016). Finally, a multivariate content representation is sampled from the bottleneck representation, which corresponds to semantically meaningful factors of variation of the observations (e.g., linguistic content) Hsu et al. (2016); Locatello et al. (2019).
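As an illustration of the bottleneck-plus-instance-normalization idea, here is a minimal PyTorch sketch; it is not AdaIN-VC's exact content encoder, and the layer sizes, kernel widths, and the per-frame Gaussian parameterization are our assumptions.

```python
import torch
from torch import nn

class ContentEncoder(nn.Module):
    """Bottleneck content encoder: instance normalization strips utterance-level
    (speaker) statistics, and a VAE-style sampled bottleneck discourages the
    encoder from memorizing the input. Sizes are illustrative."""

    def __init__(self, n_mels=256, hidden=512, content_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.InstanceNorm1d(hidden),   # removes speaker-related statistics
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.InstanceNorm1d(hidden),
            nn.ReLU(),
        )
        self.to_mu = nn.Conv1d(hidden, content_dim, kernel_size=1)
        self.to_logvar = nn.Conv1d(hidden, content_dim, kernel_size=1)

    def forward(self, mel):                                   # mel: (batch, n_mels, frames)
        h = self.convs(mel)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar                                  # frame-level content representation
```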

To sum up, disentangling content and speaker representations for voice conversion requires that the speaker encoder and content encoder are designed to discard unnecessary information while preserving relevant information for the task of interest. The decoder attempts to reconstruct the spectrum from the compressed representations extracted by the content encoder and speaker encoder.

2.2 Domain adversarial training

Domain adversarial training (DAT) is designed to learn a domain-invariant representation by reducing the bias present in data from different domains Ganin et al. (2016). The framework of a domain adversarial neural network is shown in Figure 2; it consists of a feature extractor, a gradient reversal layer (GRL), a task classifier, and a domain classifier. The key component is the gradient reversal layer: a domain-invariant representation is learned by inserting a GRL between the feature extractor and the domain classifier, while the task classifier receives ordinary gradients. The GRL can be formulated as Eq. (2):

Figure 2: The framework of domain adversarial neural network. The network consists of four components: a feature extractor, a gradient reversal layer, a task classifier, and a domain classifier.
R_λ(x) = x (forward pass),    dR_λ(x)/dx = −λ I (backward pass),    (2)

where x is the input and λ is a scaling factor for the gradient. In the forward pass, the GRL acts as an identity layer that leaves the input unchanged. In the backward pass, the gradient is multiplied by a negative scalar and propagated back to the shared feature extractor layers. The whole network is optimized to minimize the error of the task classifier while maximizing the error of the domain classifier with the help of the GRL Sun et al. (2017); Wang et al. (2018); Ganin et al. (2016). A low performance of the domain classifier indicates that the gap between the learned representations from different domains is reduced. As a result, the extracted representations from different domains are projected into the same feature subspace and have very similar distributions Wang et al. (2018): they are indistinguishable to the domain classifier but discriminative for the task classifier. Downstream tasks will therefore not take domain information into consideration and will be robust to domain variation.
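For concreteness, the GRL of Eq. (2) can be written in a few lines of PyTorch. This is a generic sketch rather than the authors' code; the default scaling factor of 0.1 follows the value reported later in Section 4.2.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lamb * grad_output, None

class GRL(nn.Module):
    def __init__(self, lamb=0.1):
        super().__init__()
        self.lamb = lamb

    def forward(self, x):
        return GradReverse.apply(x, self.lamb)
```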

3 Noise-robust voice conversion with DAT

In this section, we will introduce our proposed noise-robust voice conversion method, which integrates disentangling content and speaker representations technique, and domain adversarial training technique.

3.1 Disentangling noise-invariant content and speaker representations with DAT

Domain adversarial training has been successfully applied to speech recognition Shinohara (2016); Sun et al. (2017) and speaker verification Wang et al. (2018). For speech recognition, the feature extractor extracts a domain-invariant content representation by minimizing the loss of the phoneme classifier while maximizing the loss of the domain classifier. For speaker verification, the feature extractor extracts a domain-invariant speaker representation by minimizing the loss of the speaker classifier while maximizing the loss of the domain classifier. In our work, by contrast, we extract noise-invariant speaker and content representations in an unsupervised manner; we do not explicitly add a phoneme classifier or a speaker classifier.

Previous work Hsu et al. (2019) adopts noise augmentation and domain adversarial training in text-to-speech (TTS). Our work substantially differs from that work in the following aspects: (1) VC is related to but different from TTS: for TTS, the input is text and the output is a spectrum, whereas for VC, both input and output are spectra. (2) The work of Hsu et al. (2019) adopts noise augmentation and domain adversarial training mainly for extracting a noise-invariant speaker embedding, while our work focuses on extracting both a noise-invariant speaker embedding and a noise-invariant content embedding. (3) To extract noise-invariant representations, we only use domain classifiers, while the previous work uses both task and domain classifiers. Furthermore, domain adversarial training is applied separately to each element of the content representation in our work. (4) In the previous work, noise and speaker are correlated, and the speech of one speaker is augmented with random SNRs to make SNRs less discriminative about speakers; in our work, we consider a more practical situation in which the speech of one speaker is corrupted with various noise types at random SNRs, making the learned representations robust against different noise types and SNRs. Next, we introduce our proposed method.

Figure 3: The diagram of our proposed noise-robust voice conversion method. The framework consists of a speaker encoder E_s, a content encoder E_c, two gradient reversal layers with two domain classifiers C_s and C_c (one for the speaker encoder and one for the content encoder), and a decoder D.

We denote the training speech dataset as {(x_i, d_i)}_{i=1}^{N}, where x_i is the i-th data sample and d_i is the corresponding domain label; d_i = 0 indicates that x_i comes from the clean domain, and d_i = 1 indicates that x_i comes from the noisy domain. Note that we do not use explicit noise types, e.g. street or cafe, as noisy domain labels. We use this binary domain label to make the extracted representations robust to different noise types and SNRs. The data from the clean domain and the noisy domain are paired. Our goal is to utilize these clean and noisy data to learn noise-invariant linguistic content and speaker representations for noise-robust voice conversion.

The framework of our proposed noise-robust voice conversion method is shown in Figure 3. It consists of the following components: a speaker encoder E_s, a content encoder E_c, two gradient reversal layers with two domain classifiers C_s and C_c (for the speaker encoder and content encoder, respectively), and a decoder D, each with its own trainable parameters. The whole network shown in Figure 3 can be formulated as follows:

z_s = E_s(x),    (3)
z_c = E_c(x),    (4)
d̂_s = C_s(GRL(z_s)),    (5)
d̂_c = C_c(GRL(z_c)),    (6)
x̂ = D(z_s, z_c),    (7)

where the input x of the speaker encoder and content encoder may be a noisy or clean utterance randomly selected from the training set, z_s and z_c are the extracted noise-invariant speaker and content representations, respectively, and d̂_s and d̂_c are the outputs of the two domain classifiers. The output x̂ of the decoder is the clean converted spectrum. Our proposed model reconstructs clean speech from noisy speech, which makes the model act like a denoising autoencoder Lu et al. (2013); Shivakumar and Georgiou (2016).
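The forward pass of Eqs. (3)-(7) can be sketched as follows, reusing the GRL and encoder sketches from Section 2. The module names, the per-frame application of the content-branch domain classifier, and the use of plain linear layers for the classifiers are our assumptions; the paper specifies a dense layer followed by a softmax, which is folded here into the cross-entropy loss.

```python
from torch import nn

class NoiseRobustVC(nn.Module):
    """Both encoders feed the decoder directly (Eq. 7) and also feed a GRL plus a
    binary clean-vs-noisy domain classifier each (Eqs. 5-6)."""

    def __init__(self, spk_enc, content_enc, decoder,
                 spk_dim=128, content_dim=128, lamb=0.1):
        super().__init__()
        self.spk_enc, self.content_enc, self.decoder = spk_enc, content_enc, decoder
        self.grl_s, self.grl_c = GRL(lamb), GRL(lamb)
        self.dom_s = nn.Linear(spk_dim, 2)       # speaker-branch domain classifier (logits)
        self.dom_c = nn.Linear(content_dim, 2)   # content-branch domain classifier (logits)

    def forward(self, mel):                                    # mel: (batch, n_mels, frames)
        z_s = self.spk_enc(mel)                                # Eq. (3): utterance-level speaker rep.
        z_c, mu, logvar = self.content_enc(mel)                # Eq. (4): frame-level content rep.
        d_s = self.dom_s(self.grl_s(z_s))                      # Eq. (5): domain logits, (batch, 2)
        d_c = self.dom_c(self.grl_c(z_c).transpose(1, 2))      # Eq. (6): per-frame domain logits
        mel_hat = self.decoder(z_s, z_c)                       # Eq. (7): clean (converted) spectrum
        return mel_hat, d_s, d_c, mu, logvar
```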

The proposed framework for noise-robust voice conversion consists of two steps: conversion model training and run-time conversion. The forward pass of the training process is shown in Figure 3. On the one hand, the speaker encoder and content encoder extract the speaker representation and content representation from clean or noisy speech. The learned representations from clean and noisy speech have different distributions. To make the representations z_s and z_c noise-invariant, two gradient reversal layers take z_s and z_c as input, respectively, and their outputs are fed into the two domain classifiers. As there are two domains, the domain classifiers are binary classifiers. On the other hand, the decoder takes the speaker and content representations to reconstruct the clean spectrum. The backward pass of the training process is also shown in Figure 3. L_rec, L_kl, L_ds, and L_dc are the four loss functions; their details are introduced in Section 3.2. The parameters of the speaker encoder are updated by L_rec and L_ds. The parameters of the content encoder are updated by L_rec, L_kl, and L_dc. The gradients in the two gradient reversal layers are multiplied by a negative scalar and propagated back to the content encoder and speaker encoder, respectively. In this way, the learned content and speaker representations increase the error of the two domain classifiers. The gap between the distributions of representations from noisy speech and clean speech is reduced, which helps to project the speaker and content representations into a noise-invariant subspace, respectively. In this subspace, the learned speaker and content representations satisfy the following condition:

P(z_s | d = 0) = P(z_s | d = 1),    P(z_c | d = 0) = P(z_c | d = 1).    (8)

In this way, the domain information is not included in the learned representations. Therefore, the decoder can take noise-invariant speaker and content representations as input to reconstruct the clean speech.

During the run-time process, only the speaker encoder, content encoder, and decoder are used. The inputs of the content encoder and speaker encoder come from the source speaker and target speaker, respectively. The learned noise-invariant speaker representation from the target speaker and the content representation from the source speaker are then fed into the decoder to obtain the clean converted output. As the speech from the source and target speaker may be clean or noisy, our proposed noise-robust voice conversion method covers four scenarios: (1) both source speech and target speech are clean (SC-TC); (2) source speech is clean while target speech is noisy (SC-TN); (3) source speech is noisy while target speech is clean (SN-TC); (4) both source and target speech are noisy (SN-TN).
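A sketch of the run-time conversion step, assuming the NoiseRobustVC module sketched above; whether the posterior mean or a sample of the content representation is used at inference is our choice, as the paper does not specify it.

```python
import torch

@torch.no_grad()
def convert(model, src_mel, tgt_mel):
    """Content from the (possibly noisy) source utterance, speaker identity from the
    (possibly noisy) target utterance. Only the two encoders and the decoder are used;
    the GRLs and domain classifiers are dropped at run-time."""
    z_s = model.spk_enc(tgt_mel)             # noise-invariant speaker representation
    _, mu, _ = model.content_enc(src_mel)    # noise-invariant content representation (posterior mean)
    return model.decoder(z_s, mu)            # clean converted mel spectrogram
```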

3.2 Loss functions

Our proposed noise-robust voice conversion framework is jointly optimized. The overall loss function is a linear combination of a reconstruction loss, a Kullback-Leibler (KL) loss, and two domain classification losses. During training, we seek to minimize the reconstruction loss and KL loss while maximizing the two domain classification losses:

L = λ_rec L_rec + λ_kl L_kl − λ_ds L_ds − λ_dc L_dc,    (9)

where λ_rec, λ_kl, λ_ds, and λ_dc are hyper-parameters that balance the different losses. The first term L_rec is the reconstruction loss, a mean absolute error between the reconstructed spectrum and the clean spectrum. The second term L_kl is the Kullback-Leibler (KL) divergence between the content representation's posterior q(z_c | x) and its prior p(z_c). The prior is assumed to be a centered isotropic multivariate Gaussian N(0, I), where I is the identity matrix. Minimizing the KL loss encourages the content encoder to learn a linguistic content representation Hsu et al. (2016). Minimizing the reconstruction loss encourages the decoder to learn to reconstruct the input from the content and speaker representations.

The third and fourth terms, L_ds and L_dc, are the two domain classification losses. As we want to make the content and speaker representations noise-invariant, the learned representations z_s and z_c should make the well-trained domain classifiers C_s and C_c fail to distinguish which domain a representation comes from. To achieve this, the two domain classification losses L_ds and L_dc are maximized with respect to the encoders.

Overall, by using the four loss functions, the learned content and speaker representations retain enough relevant information for perfect reconstruction while being robust to domain variations.
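The sketch below computes the objective of Eq. (9) with the hyper-parameter values reported in Section 4.2. Note that the domain cross-entropy terms enter with positive weights here: the GRL inside the model reverses their gradients with respect to the encoders, which realizes the maximization in Eq. (9) for the encoders while still letting the domain classifiers minimize their own loss.

```python
import torch
import torch.nn.functional as F

def total_loss(mel_hat, mel_clean, mu, logvar, d_s, d_c, dom_label,
               lam_rec=10.0, lam_kl=0.5, lam_ds=0.1, lam_dc=0.1):
    """dom_label: (batch,) long tensor, 0 = clean domain, 1 = noisy domain.
    d_s: (batch, 2) speaker-branch logits; d_c: (batch, frames, 2) content-branch logits."""
    rec = F.l1_loss(mel_hat, mel_clean)                              # mean absolute error
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL(q(z_c|x) || N(0, I)), per-element mean
    ds = F.cross_entropy(d_s, dom_label)                             # speaker-branch domain loss
    dc = F.cross_entropy(d_c.reshape(-1, 2),                         # content-branch domain loss,
                         dom_label.repeat_interleave(d_c.size(1)))   # applied per frame
    return lam_rec * rec + lam_kl * kl + lam_ds * ds + lam_dc * dc
```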

4 Experimental setups

To validate our proposed method, we develop three proposed systems with different configurations, in which domain adversarial training is applied only to the speaker encoder, only to the content encoder, or to both the speaker encoder and the content encoder, respectively. We then select our best system and compare it with baseline systems under four test scenarios: (1) both source speech and target speech are clean (SC-TC); (2) source speech is clean and target speech is noisy (SC-TN); (3) source speech is noisy while target speech is clean (SN-TC); (4) both source and target speech are noisy (SN-TN).

4.1 Database and feature extraction

The CSTR-VCTK database Veaux et al. (2017) is a clean multi-speaker corpus, which contains 44 hours of speech from 109 speakers. A noise corpus from the CHiME4 challenge Vincent et al. (2017), which contains about 8.5 hours of background noise recorded in four locations (bus, cafe, pedestrian area, and street), was used to simulate noisy speech. Three-fourths of the noise corpus were used for training and the rest for testing. Each utterance from the VCTK corpus was corrupted with the four noise types at a random signal-to-noise ratio (SNR) ranging from 5 dB to 20 dB. The clean and augmented speech together were used as the training dataset to train the voice conversion model.
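The corruption step can be illustrated with a short NumPy sketch that rescales a noise clip so the mixture reaches a target SNR; this is an illustrative implementation, not the authors' data simulation script.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, rng=np.random):
    """Corrupt a clean waveform with a noise clip at the given SNR in dB."""
    if len(noise) < len(clean):                       # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    start = rng.randint(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. a random SNR in [5, 20] dB, as used to build the noisy training data
# noisy_wav = mix_at_snr(clean_wav, noise_wav, snr_db=np.random.uniform(5, 20))
```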

Voice conversion experiments were carried out on the CMU-ARCTIC database Kominek and Black (2004). We selected two female and two male speakers and conducted the following conversion pairs: female to male, female to female, male to male, and male to female. Usually, a small number of utterances Tian et al. (2017) per speaker are used for testing. In this paper, 30 utterances from each conversion pair were used for evaluation; 120 converted utterances were synthesized for the four speaker pairs in total. All audio files were downsampled to 16 kHz. To test the performance of our proposed method, we conducted experiments under seen and unseen noise conditions. For the seen noise conditions, we used the remaining noise clips from CHiME-4 to simulate noisy test speech at 5 dB, 10 dB, 15 dB, and 20 dB SNR. For the unseen noise conditions, we selected babble and hfchannel noises from the NOISEX-92 corpus Varga and Steeneken (1993) and added them to the test speech at 5 dB, 10 dB, 15 dB, and 20 dB SNR. A noisy speech corpus, CSTR-NOISE Botinhao et al. (2016), was used to test the different systems under real noisy conditions; its noise conditions include cafe, restaurant, car, kitchen, and meeting room. We selected two female and two male speakers, and 30 utterances of each speaker were used for evaluation.

The mel spectrogram is a compact representation of the audio signal. Librosa McFee et al. (2015) was employed to extract 256 dimensional mel spectrograms from clean and noisy speech with a 50 ms frame length and a 12.5 ms frame shift. The neural vocoder Parallel WaveGAN Yamamoto et al. (2020) was used to synthesize the converted speech.
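A sketch of this feature extraction step with librosa; the 16 kHz sampling rate, 256 mel bins, 50 ms window, and 12.5 ms shift follow the text, while the FFT size and the use of a log scale are our assumptions.

```python
import librosa
import numpy as np

def extract_mel(path, sr=16000, n_mels=256):
    wav, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=1024,                  # assumption: next power of two above the 800-sample window
        win_length=int(0.050 * sr),  # 50 ms frame length
        hop_length=int(0.0125 * sr), # 12.5 ms frame shift
        n_mels=n_mels,
    )
    return np.log(mel + 1e-9)        # (n_mels, frames)
```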

4.2 System architectures

The details of the baseline systems and the proposed noise-robust voice conversion systems are introduced as follows. All of the following methods perform any-to-any voice conversion. As NMF is a traditional one-to-one voice conversion method, we did not use it as a baseline.

  1. Baseline systems

    1. VAE-C-C: AdaIN-VC Chou and Lee (2019) system takes 256 dimensional mel spectrogram from clean speech as input and the output is 256 dimensional clean mel spectrogram.

    2. VAE-CN-C: VAE-CN-C has the same setting as VAE-C-C except that it takes 256 dimensional mel spectrogram from clean or noisy speech as input and the output is 256 dimensional clean mel spectrogram.

    3. VAE-CD-C: VAE-CD-C has the same setting as VAE-CN-C except that this system takes mel spectrogram from clean or denoised speech as input. For this denoising baseline system, we use the state-of-the-art speech enhancement model named DCCRN Xie (2020) to get denoised speech.

    4. FHVAE-CN-CN: FHVAE-CN-CN Hsu et al. (2017b) takes 256 dimensional mel spectrogram from clean or noisy speech as input and the output is 256 dimensional clean or noisy mel spectrogram. Note that this system only works in one kind of noisy scenario where the source speech is noisy and target speech is clean.

  2. Proposed systems

    1. VAEDC-CN-C: This is one of our proposed noise-robust voice conversion systems. Domain adversarial training is only used for the content encoder to extract noise-invariant content representation. This system takes 256 dimensional mel spectrogram from clean or noisy speech as input and the output is 256 dimensional clean mel spectrogram.

    2. VAEDS-CN-C: VAEDS-CN-C has the same setting as VAEDC-CN-C except that domain adversarial training is only used for the speaker encoder to extract noise-invariant speaker representation.

    3. VAED-CN-C: VAED-CN-C has the same setting as VAEDC-CN-C except that domain adversarial training is used for both content encoder and speaker encoder to extract noise-invariant content and speaker representations.

The AdaIN-VC system adopts an encoder-decoder framework. We make some adjustments to the speaker encoder and decoder to improve performance. The speaker encoder consists of a ConvBank block Chou and Lee (2019), a residual network (ResNet), and a dense block Chou and Lee (2019). The content encoder follows the original configuration. The speaker encoder and content encoder take a 256 dimensional mel spectrogram as input and output 128 dimensional speaker and content representations, respectively. To further improve speech quality, an auto-regressive technique is used in the decoder. The domain adversarial neural network consists of a gradient reversal layer, a dense layer, and a softmax layer. The whole network is optimized with the Adam optimizer Kingma and Ba (2014). The GRL scaling factor λ is set to 0.1. The hyper-parameters λ_rec, λ_kl, λ_ds, and λ_dc are set to 10, 0.5, 0.1, and 0.1, respectively.

The speech enhancement model DCCRN Xie (2020) ranked first in the real-time track and second in the non-real-time track in terms of Mean Opinion Score (MOS). We used an internal DCCRN model. The noisy speech was simulated with dynamic mixing during model training, and the total data seen by DCCRN was over 2000 hours. For the noise-robust voice conversion model FHVAE, we used the original neural network configuration Hsu et al. (2017b).

4.3 Evaluation metrics

Both objective and subjective evaluations were conducted to evaluate the systems.

4.3.1 Objective evaluation

Mel-cepstral distortion (MCD) Toda et al. (2007) was employed to measure the spectral distortion. The Euclidean distance between the 40 dimensional mel cepstral coefficients (MCC) of the converted speech and the target speech is calculated with the MCD formula. Given a speech frame, the MCD is defined as

MCD = (10 / ln 10) √( 2 Σ_{d=1}^{D} (c_d^conv − c_d^tar)² ),    (10)

where c_d^conv and c_d^tar are the d-th coefficients of the converted and target mel cepstra, and D is the dimension of the MCC. A lower MCD indicates smaller distortion.
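A NumPy sketch of Eq. (10); it assumes the converted and target MCC sequences have already been time-aligned (e.g., by dynamic time warping) and that the energy coefficient has been excluded upstream.

```python
import numpy as np

def mcd(mcc_converted, mcc_target):
    """Mel-cepstral distortion between two aligned MCC sequences of shape (frames, D)."""
    diff = mcc_converted - mcc_target
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))   # average over frames, in dB
```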

Note that MCD is an indirect measurement, which is not directly related to speech quality Machado and Queiroz (2010). As the durations of the converted speech and the target speech are different, we cannot adopt the perceptual evaluation of speech quality (PESQ) Rix et al. (2001) to measure speech quality. The results of the subjective evaluation reflect the actual perceptual quality of the generated speech.

Word error rate (WER) evaluated by automatic speech recognition (ASR) indicates intelligibility. The ASR model was Conformer-based Gulati et al. (2020) and trained on the 960-hour LibriSpeech corpus.

4.3.2 Subjective evaluation

For subjective evaluation, we evaluated speech quality and speaker similarity using AB and ABX preference tests, respectively. For the AB test, A and B represent the randomly selected samples, where A and B have the same linguistic content. Listeners are asked to select a sample with better quality from A and B. For the ABX test, X refers to the reference sample of the target, A and B represent the converted samples randomly selected from the proposed and baseline method. Then listeners are asked to choose the sample closer to the reference sample or no preference in terms of speaker similarity.

We also assessed speech naturalness with a mean opinion score (MOS) test, where each listener is required to give an opinion score on a five-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). All the converted samples were used for the listening tests. For each listener, we randomly selected 30 samples, so different listeners may listen to different samples. 20 listeners participated in all listening tests.

5 Experimental results and analysis

In this section, we report the experimental results to verify the effectiveness of the proposed method under different test scenarios.

5.1 Disentangling noise-invariant speaker and content representations

First, we compared the MCD scores of the proposed systems under noisy scenarios to select the best proposed system. Figure 4 (a) compares the average MCD between VAEDS-CN-C and VAED-CN-C under the SN-TC scenario to verify that domain adversarial training for the content encoder is necessary. VAED-CN-C outperforms VAEDS-CN-C significantly at different SNR levels when the source speech is corrupted by noise. Figure 4 (b) compares the average MCD between VAEDC-CN-C and VAED-CN-C under the SC-TN scenario to verify that domain adversarial training for the speaker encoder is necessary. VAED-CN-C outperforms VAEDC-CN-C significantly at different SNR levels when the target speech is corrupted by noise. VAED-CN-C achieves the best performance among the proposed systems, which confirms that domain adversarial training for both the content encoder and the speaker encoder is necessary.

Figure 4: Comparison of average MCD of proposed systems with and without domain adversarial training for the content encoder and speaker encoder under noisy scenarios.

Figure 5: An example of visualizing speaker representations and content representations from the male to female conversion pair under 5dB SNR of cafe noise, where the speaker representations and content representations are trained without and with domain adversarial training, respectively. (a) the speaker representations of VAEDC-CN-C are trained without DAT, (b) the speaker representations of VAED-CN-C are trained with DAT, (c) the content representations of VAEDS-CN-C are trained without DAT, (d) the content representations of VAED-CN-C are trained with DAT.
| Scenario | System | Street MCD (dB) | Street WER (%) | Cafe MCD (dB) | Cafe WER (%) | Babble MCD (dB) | Babble WER (%) | Hfchannel MCD (dB) | Hfchannel WER (%) |
|---|---|---|---|---|---|---|---|---|---|
| SC-TN | VAE-CN-C | 11.69 / 11.44 | 12.12 / 11.13 | 11.56 / 11.44 | 12.71 / 11.63 | 11.46 / 11.37 | 13.81 / 12.62 | 11.52 / 11.43 | 13.35 / 12.14 |
| SC-TN | VAE-CD-C | 10.65 / 10.53 | 11.56 / 10.91 | 10.67 / 10.52 | 11.97 / 11.13 | 10.69 / 10.56 | 12.58 / 11.49 | 10.69 / 10.52 | 12.25 / 11.37 |
| SC-TN | VAED-CN-C | 9.71 / 9.53 | 11.13 / 10.41 | 9.66 / 9.53 | 11.29 / 10.36 | 9.62 / 9.51 | 11.79 / 10.87 | 9.61 / 9.47 | 11.49 / 10.53 |
| SN-TC | VAE-CN-C | 11.95 / 11.59 | 38.52 / 13.63 | 11.98 / 11.57 | 44.56 / 14.56 | 11.90 / 11.58 | 47.23 / 15.22 | 11.82 / 11.58 | 48.58 / 16.18 |
| SN-TC | VAE-CD-C | 10.75 / 10.62 | 18.83 / 12.11 | 10.87 / 10.65 | 22.35 / 12.16 | 10.87 / 10.64 | 24.98 / 13.28 | 10.94 / 10.67 | 25.28 / 14.34 |
| SN-TC | FHVAE-CN-CN | 11.40 / 10.37 | 22.43 / 12.50 | 11.95 / 10.83 | 29.83 / 14.35 | 10.90 / 10.41 | 40.15 / 13.30 | 11.98 / 10.90 | 51.32 / 27.16 |
| SN-TC | VAED-CN-C | 9.75 / 9.42 | 19.95 / 12.43 | 9.91 / 9.45 | 24.21 / 12.18 | 9.68 / 9.47 | 26.54 / 13.62 | 9.70 / 9.43 | 24.61 / 15.56 |
| SN-TN | VAE-CN-C | 12.06 / 11.49 | 40.45 / 14.52 | 11.88 / 11.47 | 44.87 / 15.43 | 11.84 / 11.41 | 47.45 / 15.18 | 11.81 / 11.48 | 48.84 / 16.35 |
| SN-TN | VAE-CD-C | 10.82 / 10.61 | 18.91 / 12.71 | 10.94 / 10.67 | 22.15 / 12.28 | 10.95 / 10.70 | 25.21 / 13.21 | 10.96 / 10.74 | 26.71 / 13.15 |
| SN-TN | VAED-CN-C | 9.88 / 9.50 | 19.79 / 12.81 | 10.06 / 9.51 | 24.39 / 12.33 | 9.83 / 9.51 | 26.38 / 13.51 | 9.82 / 9.45 | 25.21 / 15.25 |
Table 1: A comparison of average MCD and WER between the proposed noise-robust voice conversion system and the baseline systems under different noisy test scenarios. Each cell lists the 5 dB / 20 dB SNR results. At run-time, SC-TN indicates that source speech is clean and target speech is noisy, SN-TC that source speech is noisy and target speech is clean, and SN-TN that both source and target speech are noisy.

Then we show an example of the speaker representations and content representations from the male-to-female conversion pair under 5dB SNR of cafe noise. The representations are projected to 2D using t-distributed stochastic neighbor embedding (t-SNE) Hinton (2008). Figure 5 (a) shows the speaker representations extracted by VAEDC-CN-C from clean and noisy target speech. The speaker encoder of VAEDC-CN-C takes clean or noisy speech as input, and the output of the decoder is always clean speech; hence the speaker encoder learns to project the speaker representations extracted from clean and noisy speech into the same subspace, and the speaker representations from the two domains partially overlap. However, due to the limited capability of this projection, the speaker representations of clean and noisy speech from the same speaker still have different distributions when the speaker encoder is trained without domain adversarial training. Figure 5 (b) shows that the speaker representations extracted by VAED-CN-C have similar distributions with the help of DAT for the speaker encoder. Figure 5 (c) shows that the content representations extracted by VAEDS-CN-C from clean and noisy source speech belong to different distributions. With the help of domain adversarial training for the content encoder of VAED-CN-C, the content representations overlap, as shown in Figure 5 (d). We observe that the proposed system VAED-CN-C successfully learns to extract noise-invariant speaker and content representations with domain adversarial training. Note that we also observed some mis-alignments in the contours of the distributions of the content and speaker representations extracted by VAED-CN-C, which indicates that there is still room for improvement.
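Such a visualization can be produced with scikit-learn's t-SNE, as in the sketch below; reps_clean and reps_noisy are placeholders for embeddings extracted by the trained encoders, and the perplexity value is an arbitrary choice.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(reps_clean, reps_noisy):
    """Project clean- and noisy-domain representations to 2D and overlay them."""
    reps = np.concatenate([reps_clean, reps_noisy], axis=0)
    proj = TSNE(n_components=2, perplexity=30).fit_transform(reps)
    n = len(reps_clean)
    plt.scatter(proj[:n, 0], proj[:n, 1], s=8, label="clean")
    plt.scatter(proj[n:, 0], proj[n:, 1], s=8, label="noisy")
    plt.legend()
    plt.show()
```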

With the empirical observations above, we use VAED-CN-C in the rest of the experiments and compare it with baseline systems.

5.2 Objective evaluation

The objective evaluation results of the different noise-robust voice conversion methods under three noisy test scenarios are shown in Table 1. Due to limited space, we report MCD and WER results at different SNR levels for two seen noise types (street and cafe) and two unseen noise types (babble and hfchannel).

Under the SC-TN noisy scenario, noise is present only in the target speech, which affects the learned speaker representations. We first evaluated system performance under seen noise types. As the SNR increases from 5 dB to 20 dB, the MCD and WER scores of all systems decrease for street and cafe noises. Under the same noise type and SNR level, VAE-CN-C gets the highest MCD and WER scores, i.e., it performs worst among all the systems: the speaker representations extracted from clean and noisy speech have different distributions while the output of the decoder is clean speech, and this mismatch degrades the performance of the decoder. By using the speech enhancement model DCCRN as a pre-processing module to obtain de-noised speech, VAE-CD-C performs consistently better than VAE-CN-C. By learning noise-invariant speaker representations, which reduce the gap between the distributions of the different domains, our proposed system VAED-CN-C consistently outperforms the baseline systems at different SNR levels of street and cafe noise. According to WER, VAED-CN-C consistently performs better than VAE-CN-C and VAE-CD-C. We then evaluated system performance under unseen noise types. As the SNR level increases, the MCD and WER scores decrease for all systems. Additionally, VAED-CN-C consistently outperforms the baseline systems at all evaluated SNR levels in terms of MCD and WER. Finally, we compared system performance between seen and unseen noise types. VAE-CN-C and VAE-CD-C perform worse at all evaluated SNR levels for unseen noise types than for seen noise types in terms of WER, whereas our proposed VAED-CN-C performs more stably across seen and unseen noise types.

Under the SN-TC noisy scenario, noise is present only in the source speech, which affects the learned content representations. We first evaluated system performance under seen noise types. When the SNR level increases from 5 dB to 20 dB, the MCD and WER scores of all systems decrease under street and cafe noisy conditions. Compared at the same SNR level, VAE-CN-C performs worst among all the systems, because the content representations extracted from the clean and noisy domains have different distributions, which degrades the performance of the decoder. VAE-CD-C outperforms VAE-CN-C at all evaluated SNR levels because it takes denoised speech as input. Since the content representation of FHVAE is noise-invariant Hsu et al. (2017b), FHVAE-CN-CN performs better than VAE-CN-C. The speech enhancement model DCCRN introduces some processing artifacts, whereas domain adversarial training projects the representations from the clean and noisy domains into the same subspace and thus mitigates the deterioration; as a result, our proposed system VAED-CN-C performs better than VAE-CD-C in terms of MCD. According to WER, VAE-CD-C performs best, and VAED-CN-C is slightly worse than VAE-CD-C at different SNR levels, since there are still mis-alignments in the contours of the distributions of the extracted content representations, as shown in Figure 5. We then evaluated system performance under unseen noise types. VAED-CN-C consistently outperforms VAE-CN-C, VAE-CD-C, and FHVAE-CN-CN at different SNR levels in terms of MCD. According to WER, VAE-CD-C performs best, while VAED-CN-C outperforms VAE-CN-C and FHVAE-CN-CN significantly. Finally, we compared system performance between seen and unseen noise types. VAE-CN-C, VAE-CD-C, and VAED-CN-C perform similarly for seen and unseen noise types in terms of MCD, while FHVAE-CN-CN performs worse under the hfchannel noisy condition. All systems perform worse under unseen noise types than under seen noise types according to WER.

Under the SN-TN noisy scenario, noise is present in both the source and target speech, which affects the learned content and speaker representations at the same time. First, we compared our proposed system with the baseline systems. At all evaluated SNR levels of seen and unseen noise types, VAED-CN-C consistently outperforms VAE-CN-C and VAE-CD-C in terms of MCD. According to WER, VAED-CN-C performs better than VAE-CN-C, while performing slightly worse than VAE-CD-C. We then compared system performance between seen and unseen noise types. All systems perform similarly for seen and unseen noise types in terms of MCD, and all systems perform worse under unseen noise types than under seen noise types according to WER.

We also compared all the systems under the clean scenario, where both source and target speech are clean. The MCD and WER scores are shown in Table 2. As the decoder of VAE-C-C only takes representations from clean speech as input, while the decoder of VAE-CN-C takes representations from clean or noisy speech that have different distributions, the performance of VAE-CN-C seriously degrades compared with VAE-C-C. By using de-noised speech, VAE-CD-C performs better than VAE-CN-C. As FHVAE-CN-CN is able to disentangle noise from the content representation, it also performs better than VAE-CN-C. Our proposed method VAED-CN-C makes the speaker and content representations extracted from noisy and clean speech indistinguishable, and thus achieves better performance in terms of MCD. According to WER, VAED-CN-C outperforms VAE-CN-C and VAE-CD-C, while performing worse than FHVAE-CN-CN.

| Scenario | System | MCD (dB) | WER (%) |
|---|---|---|---|
| SC-TC | VAE-C-C | 9.39 | 8.93 |
| SC-TC | VAE-CN-C | 11.32 | 10.74 |
| SC-TC | VAE-CD-C | 10.45 | 9.81 |
| SC-TC | FHVAE-CN-CN | 10.34 | 8.62 |
| SC-TC | VAED-CN-C | 9.47 | 9.25 |
Table 2: A comparison of MCD and WER between proposed system and baseline systems under clean scenario. At run-time, SC-TC indicates that both source speech and target speech are clean.

Finally, Figure 6 shows the spectrograms of the speech converted from a male speaker to a female speaker by the systems VAE-CN-C, VAE-CD-C, and VAED-CN-C under the SN-TN scenario with 5dB SNR of cafe noise. Figure 6 (a) shows that the speech converted by VAE-CN-C is corrupted by noise and the high-frequency part is lost. Figure 6 (b) suggests that the speech converted by VAE-CD-C is clean; however, DCCRN introduces extra distortion. Figure 6 (c) shows that our proposed method effectively reduces the noise from the noisy input speech while minimizing speech distortion.

Figure 6: Spectrogram for a sentence “Famine had been my great ally”, converted from male speaker to female speaker under SN-TN scenario with 5dB SNR of cafe noise, and generated by three different systems (a) VAE-CN-C, (b) VAE-CD-C, (c) VAED-CN-C.

In summary, learning noise-invariant representations is effective to predict clean converted spectrum under different noise types and SNR levels.

5.3 Subjective evaluation

Figure 7 reports the AB listening tests for speech quality under six types of noise with 5dB SNR. First, we would like to confirm that domain adversarial training helps to improve speech quality under noisy scenarios with a low SNR level. Figure 7 (a) suggests that our proposed VAED-CN-C significantly outperforms VAE-CN-C under 5dB SNR. We then compared VAED-CN-C with the noise-robust voice conversion baseline systems. Figure 7 (b) shows that the preference scores of VAED-CN-C and VAE-CD-C fall into each other's confidence intervals, which means they are not significantly different. Figure 7 (c) shows that VAED-CN-C clearly outperforms FHVAE-CN-CN.

Figure 7: Speech quality preference tests under 5dB SNR of six types of noise with 95% confidence intervals for (a) VAE-CN-C vs VAED-CN-C, (b) VAE-CD-C vs VAED-CN-C, (c) FHVAE-CN-CN vs VAED-CN-C.

Figure 8 reports the AB listening tests for speech quality under six types of noise with 20dB SNR. First, we would like to confirm that domain adversarial training helps to improve speech quality under noisy scenarios with high SNR level. Figure 8 (a) suggests that our proposed VAED-CN-C significantly outperforms VAE-CN-C under 20dB SNR. Then we compared VAED-CN-C with other noise-robust voice conversion systems. Figure 8 (b) and (c) show that VAED-CN-C outperforms VAE-CD-C and FHVAE-CN-CN in terms of speech quality.

Figure 8: Speech quality preference tests under 20dB SNR of six types of noise with 95% confidence intervals for (a) VAE-CN-C vs VAED-CN-C, (b) VAE-CD-C vs VAED-CN-C, (c) FHVAE-CN-CN vs VAED-CN-C.

Figure 9 reports the AB listening tests for speech quality under clean scenario. Figure 9 (a) and (b) suggest that VAED-CN-C outperforms VAE-CN-C and VAE-CD-C. Figure 9 (c) shows that FHVAE-CN-CN performs better than VAED-CN-C.

Figure 9: Speech quality preference tests under clean scenario with 95% confidence intervals for (a) VAE-CN-C vs VAED-CN-C, (b) VAE-CD-C vs VAED-CN-C, (c) FHVAE-CN-CN vs VAED-CN-C.

The speaker similarity ABX tests under six types of noise with 5dB SNR are presented in Figure 10. First, we would like to confirm that domain adversarial training helps to improve speaker similarity under noisy scenario with low SNR level. Figure 10 (a) suggests our proposed VAED-CN-C significantly outperforms VAE-CN-C. Then we compared VAED-CN-C with other noise-robust voice conversion systems. Figure 10 (b) and (c) show that VAED-CN-C outperforms VAE-CD-C and FHVAE-CN-CN, respectively.

Figure 10: Similarity preference tests under 5dB SNR of six types of noise with 95% confidence intervals for (a) VAE-CN-C vs VAED-CN-C, (b) VAE-CD-C vs VAED-CN-C, (c) FHVAE-CN-CN vs VAED-CN-C.

The speaker similarity ABX tests under six types of noise with 20dB SNR are presented in Figure 11. Figure 11 (a) suggests our proposed VAED-CN-C significantly outperforms VAE-CN-C under 20dB SNR level. Comparing with other noise-robust voice conversion systems, Figure 11 (b) and (c) show that VAED-CN-C outperforms VAE-CD-C and FHVAE-CN-CN, respectively.

Figure 11: Similarity preference tests under 20dB SNR of six noise types with 95% confidence intervals for (a) VAE-CN-C vs VAED-CN-C, (b) VAE-CD-C vs VAED-CN-C, (c) FHVAE-CN-CN vs VAED-CN-C.

The speaker similarity ABX tests under clean scenario are presented in Figure 12. Figure 12 (a) and (b) suggest our proposed VAED-CN-C clearly outperforms VAE-CN-C and VAE-CD-C. Figure 12 (c) shows that FHVAE-CN-CN performs slightly worse than VAED-CN-C.

Figure 12: Similarity preference tests under clean scenario with 95% confidence intervals for (a) VAE-CN-C vs VAED-CN-C, (b) VAE-CD-C vs VAED-CN-C, (c) FHVAE-CN-CN vs VAED-CN-C.

Figure 13: Mean opinion score listening tests among different systems under six types of noise with 5dB SNR. Error bar represents 95% confidence interval.

Figure 14: Mean opinion score listening tests among different systems under six types of noise with 20dB SNR. Error bar represents 95% confidence interval.

Figure 15: Mean opinion score listening tests among different systems under clean scenario. Error bar represents 95% confidence interval.

We evaluated the naturalness of speech under the noisy scenarios with 5dB SNR and 20dB SNR, and under the clean scenario, through MOS tests. Figure 13 shows the MOS results under the noisy scenario with 5dB SNR. It suggests that VAE-CD-C, FHVAE-CN-CN, and VAED-CN-C outperform VAE-CN-C. Furthermore, VAED-CN-C performs slightly worse than VAE-CD-C and better than FHVAE-CN-CN. Figure 14 shows the MOS results under the noisy scenario with 20dB SNR. VAE-CD-C, FHVAE-CN-CN, and VAED-CN-C consistently outperform VAE-CN-C, and our proposed VAED-CN-C achieves the best MOS score. Figure 15 shows the MOS results under the clean scenario. We observed that VAED-CN-C performs better than VAE-CD-C while slightly worse than FHVAE-CN-CN. The synthesized samples can be found at https://dhqadg.github.io/noise-robust/.

Figure 16: Speech quality preference tests under real noisy scenario with 95% confidence intervals for (a) VAE-CN-C vs VAED-CN-C, (b) VAE-CD-C vs VAED-CN-C, (c) FHVAE-CN-CN vs VAED-CN-C.

Finally, Figure 16 reports the AB listening tests for speech quality under real noisy scenario. Figure 16 (a) validates that our proposed VAED-CN-C significantly outperforms VAE-CN-C. Figure 16 (b) shows VAED-CN-C performs slightly worse than VAE-CD-C. Figure 16 (c) shows that VAED-CN-C outperforms FHVAE-CN-CN.

6 Conclusion and future work

In this paper, we propose a novel noise-robust voice conversion framework. This framework can synthesize clean converted speech under complex noisy conditions, where both the source and target speech at run-time can be corrupted by seen or unseen noise types. Specifically, based on an encoder-decoder framework, we integrate the technique of disentangling speaker and content representations with domain adversarial training. Domain adversarial training forces the speaker and content representations extracted by the speaker encoder and content encoder from clean and noisy speech into the same spaces, making them noise-invariant. Therefore, the noise-invariant representations can be taken as input by the decoder to predict the clean converted spectrum. The experimental results demonstrate that our proposed method can synthesize clean converted speech under complex noisy conditions, and that speech quality and speaker similarity are improved compared with the baseline systems. In the future, we will continue to focus on the neural network architecture to make it work better under noisy conditions with low SNR levels.

7 Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2020AAA0108600).

This work was also supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-GC-2019-002) and (AISG Award No: AISG-100E-2018-006), and its National Robotics Programme (Grant No. 192 25 00054), and in part by RIE2020 Advanced Manufacturing and Engineering Programmatic Grants A1687b0033, and A18A2b0046.

References

  • R. Aihara, T. Fujii, T. Nakashika, T. Takiguchi, and Y. Ariki (2015) Small-parallel exemplar-based voice conversion in noisy environments using affine non-negative matrix factorization. EURASIP Journal on Audio, Speech, and Music Processing 2015 (1), pp. 1–9. Cited by: §1.
  • H. Benisty and D. Malah (2011) Voice conversion using gmm with enhanced global variance. In Twelfth Annual Conference of the International Speech Communication Association. Cited by: §1.
  • C. V. Botinhao, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In 9th ISCA Speech Synthesis Workshop, pp. 159–165. Cited by: §4.1.
  • J. Chou and H. Lee (2019) One-shot voice conversion by separating speaker and content representations with instance normalization. Proc. Interspeech 2019, pp. 664–668. Cited by: §1, §1, §1, §2.1, §2.1, §2.1, §2.1, item 1a, §4.2.
  • J. Chou, C. Yeh, H. Lee, and L. Lee (2018) Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. Proc. Interspeech 2018, pp. 501–505. Cited by: §1.
  • H. Du, X. Tian, L. Xie, and H. Li (2021) Optimizing voice conversion network with cycle consistency loss of speaker identity. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 507–513. Cited by: §1.
  • H. Du and L. Xie (2021) Improving robustness of one-shot voice conversion with deep discriminative speaker encoder. arXiv preprint arXiv:2106.10406. Cited by: §1.
  • D. Erro, A. Moreno, and A. Bonafonte (2009) Voice conversion based on weighted frequency warping. IEEE Transactions on Audio, Speech, and Language Processing 18 (5), pp. 922–931. Cited by: §1.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1, §2.2.
  • E. Godoy, O. Rosec, and T. Chonavel (2011) Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora. IEEE Transactions on Audio, Speech, and Language Processing 20 (4), pp. 1313–1323. Cited by: §1.
  • A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al. (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. Cited by: §4.3.1.
  • G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9 (2605), pp. 2579–2605. Cited by: §5.1.
  • C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang (2016) Voice conversion from non-parallel corpora using variational auto-encoder. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–6. Cited by: §1, §1, §2.1, §3.2.
  • C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang (2017a) Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849. Cited by: §1.
  • W. Hsu, Y. Zhang, and J. Glass (2017b) Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pp. 1878–1889. Cited by: §1, item 1d, §4.2, §5.2.
  • W. Hsu, Y. Zhang, R. J. Weiss, Y. Chung, Y. Wang, Y. Wu, and J. Glass (2019) Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5901–5905. Cited by: §1, §3.1.
  • T. Huang, J. Lin, and H. Lee (2021) How far are we from robust voice conversion: a survey. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 514–521. Cited by: §1, §1.
  • H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo (2018) StarGAN-VC: non-parallel many-to-many voice conversion using star generative adversarial networks. In IEEE Spoken Language Technology Workshop (SLT), pp. 266–273. Cited by: §1, §1.
  • T. Kaneko and H. Kameoka (2018a) CycleGAN-VC: non-parallel voice conversion using cycle-consistent adversarial networks. In 2018 26th European Signal Processing Conference (EUSIPCO), pp. 2100–2104. Cited by: §1, §1.
  • T. Kaneko and H. Kameoka (2018b) Parallel-data-free voice conversion using cycle-consistent adversarial networks. In 26th European Signal Processing Conference (EUSIPCO), pp. 2114–2118. Cited by: §1.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • J. Kominek and A. W. Black (2004) The CMU ARCTIC speech databases. In Fifth ISCA Workshop on Speech Synthesis. Cited by: §4.1.
  • L. Li, D. Wang, Y. Chen, Y. Shi, Z. Tang, and T. F. Zheng (2018) Deep factorization for speech signal. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5094–5098. Cited by: §1.
  • C. Liao, Y. Tsao, H. Lee, and H. Wang (2018) Noise adaptive speech enhancement using domain adversarial training. arXiv preprint arXiv:1807.07501. Cited by: §1.
  • H. Lim, Y. Kim, and H. Kim (2020) Cross-informed domain adversarial training for noise-robust wake-up word detection. IEEE Signal Processing Letters 27, pp. 1769–1773. Cited by: §1.
  • F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem (2019) Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pp. 4114–4124. Cited by: §2.1.
  • X. Lu, Y. Tsao, S. Matsuda, and C. Hori (2013) Speech enhancement based on deep denoising autoencoder. In Interspeech, Vol. 2013, pp. 436–440. Cited by: §3.1.
  • A. F. Machado and M. Queiroz (2010) Voice conversion: a critical survey. Proc. Sound and Music Computing (SMC), pp. 1–8. Cited by: §4.3.1.
  • B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto (2015) Librosa: audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Vol. 8, pp. 18–25. Cited by: §4.1.
  • S. H. Mohammadi and A. Kain (2017) An overview of voice conversion systems. Speech Communication 88, pp. 65–82. Cited by: §1.
  • N. Mor, L. Wolf, A. Polyak, and Y. Taigman (2018) A universal music translation network. arXiv preprint arXiv:1805.07848. Cited by: §2.1.
  • A. Mouchtaris, J. Van der Spiegel, and P. Mueller (2004) A spectral conversion approach to the iterative Wiener filter for speech enhancement. In IEEE International Conference on Multimedia and Expo (ICME), Vol. 3, pp. 1971–1974. Cited by: §1.
  • K. Okabe, T. Koshinaka, and K. Shinoda (2018) Attentive statistics pooling for deep speaker embedding. Proc. Interspeech 2018, pp. 2252–2256. Cited by: §2.1.
  • K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson (2019) AutoVC: zero-shot voice style transfer with only autoencoder loss. In International Conference on Machine Learning, pp. 5210–5219. Cited by: §1, §1, §2.1.
  • A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra (2001) Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2, pp. 749–752. Cited by: §4.3.1.
  • S. Sekkate, M. Khalil, A. Adib, and S. Ben Jebara (2019) An investigation of a feature-level fusion for noisy speech emotion recognition. Computers 8 (4), pp. 91. Cited by: §1.
  • Y. Shinohara (2016) Adversarial multi-task learning of deep neural networks for robust speech recognition. In Interspeech, pp. 2369–2372. Cited by: §1, §3.1.
  • P. G. Shivakumar and P. G. Georgiou (2016) Perception optimized deep denoising autoencoders for speech enhancement. In Interspeech, pp. 3743–3747. Cited by: §3.1.
  • Y. Stylianou, O. Cappé, and E. Moulines (1998) Continuous probabilistic transform for voice conversion. IEEE Transactions on speech and audio processing 6 (2), pp. 131–142. Cited by: §1.
  • L. Sun, S. Kang, K. Li, and H. Meng (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4869–4873. Cited by: §1.
  • S. Sun, B. Zhang, L. Xie, and Y. Zhang (2017) An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing 257, pp. 79–87. Cited by: §1, §1, §2.2, §3.1.
  • R. Takashima, R. Aihara, T. Takiguchi, and Y. Ariki (2014) Noise-robust voice conversion based on sparse spectral mapping using non-negative matrix factorization. IEICE TRANSACTIONS on Information and Systems 97 (6), pp. 1411–1418. Cited by: §1.
  • R. Takashima, T. Takiguchi, and Y. Ariki (2012) Exemplar-based voice conversion in noisy environment. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 313–317. Cited by: §1, §1, §1.
  • K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo (2019) AttS2S-VC: sequence-to-sequence voice conversion with attention and context preservation mechanisms. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6805–6809. Cited by: §1, §1.
  • Y. Tang, G. Ding, J. Huang, X. He, and B. Zhou (2019) Deep speaker embedding learning with multi-level pooling for text-independent speaker verification. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6116–6120. Cited by: §2.1.
  • X. Tian, S. W. Lee, Z. Wu, E. S. Chng, and H. Li (2017) An exemplar-based approach to frequency warping for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1863–1876. Cited by: §1, §4.1.
  • X. Tian, Z. Wu, S. W. Lee, N. Q. Hy, E. S. Chng, and M. Dong (2015) Sparse representation for frequency warping based voice conversion. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4235–4239. Cited by: §1.
  • T. Toda, A. W. Black, and K. Tokuda (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing 15 (8), pp. 2222–2235. Cited by: §1, §4.3.1.
  • Y. Tu, M. Mak, and J. Chien (2019) Variational domain adversarial learning for speaker verification. In Interspeech, pp. 4315–4319. Cited by: §1.
  • D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.1.
  • C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi (2016) Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp. 146–152. Cited by: §1.
  • A. Varga and H. J. Steeneken (1993) Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12 (3), pp. 247–251. Cited by: §4.1.
  • C. Veaux, J. Yamagishi, K. MacDonald, et al. (2017) CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR). Cited by: §4.1.
  • E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer (2017) An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech & Language 46, pp. 535–557. Cited by: §4.1.
  • Q. Wang, W. Rao, S. Sun, L. Xie, E. S. Chng, and H. Li (2018) Unsupervised domain adaptation via domain adversarial training for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4889–4893. Cited by: §1, §1, §2.2, §3.1.
  • Z. Wang, Q. Xie, T. Li, H. Du, L. Xie, P. Zhu, and M. Bi (2021a) One-shot voice conversion for style transfer based on speaker adaptation. arXiv preprint arXiv:2111.12277. Cited by: §1.
  • Z. Wang, X. Zhou, F. Yang, T. Li, H. Du, L. Xie, W. Gan, H. Chen, and H. Li (2021b) Enriching source style transfer in recognition-synthesis based non-parallel voice conversion. arXiv preprint arXiv:2106.08741. Cited by: §1.
  • Z. Wu, T. Virtanen, E. S. Chng, and H. Li (2014) Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (10), pp. 1506–1521. Cited by: §1.
  • Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie (2020) DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement. In Interspeech. Cited by: item 1c, §4.2.
  • R. Yamamoto, E. Song, and J. Kim (2020) Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In International Conference on Acoustics, Speech and Signal Processing, pp. 6199–6203. Cited by: §4.1.
  • S. Yang, Y. Wang, and L. Xie (2020) Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise. IEEE Signal Processing Letters 27, pp. 1730–1734. Cited by: §1.
  • J. Zhang, Z. Ling, L. Liu, Y. Jiang, and L. Dai (2019) Sequence-to-sequence acoustic modeling for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (3), pp. 631–644. Cited by: §1, §1.