AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

11/09/2018 ∙ by Kou Tanaka, et al.

This paper describes a method based on sequence-to-sequence (Seq2Seq) learning with attention and context preservation mechanisms for voice conversion (VC) tasks. Seq2Seq learning has achieved outstanding results in numerous sequence-modeling tasks such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and the proposed context preservation losses, 2) allows not only spectral envelopes but also fundamental frequency contours and durations of speech to be converted, 3) requires no context information such as phoneme labels, and 4) requires no time-aligned source and target speech data in advance. In our experiment, the proposed VC framework can be trained in only one day using a single NVIDIA Tesla K80 GPU, while the quality of the synthesized speech is higher than that of speech converted by Gaussian mixture model-based VC and comparable to that of speech generated by recurrent neural network-based text-to-speech synthesis, which can be regarded as an upper limit on VC performance.

1 Introduction

Voice conversion (VC) systems aim to convert para-/non-linguistic information included in a given speech waveform while preserving its linguistic information. VC has been applied to various tasks, such as speaker conversion [1, 2, 3] for impersonating or hiding a speaker’s identity, speaking aids [4, 5] for overcoming speech impairments, style conversion [6, 7] for controlling speaking styles including emotion, and pronunciation/accent conversion [8, 9] in language learning.

A popular form of VC is a statistical one based on a Gaussian mixture model (GMM) [10]; it requires time-aligned parallel data of the source and target speech for training the conversion models. Among frameworks requiring time-aligned parallel data, other researchers have proposed exemplar-based VCs using non-negative matrix factorization (NMF) [11, 12] and neural network (NN)-based VCs using restricted Boltzmann machines [13, 14], feed-forward NNs [15, 16], recurrent NNs [17, 18], variational autoencoders [19, 20], and generative adversarial nets [9]. On the other hand, frameworks requiring no parallel data, called parallel-data-free VCs, have been proposed [1, 3] to avoid the time-consuming job of recording speech for parallel data collection. Notably, the drawbacks of these VCs are that they require a large number of transcripts and/or have difficulty converting the durations of the source speech.

Recently, sequence-to-sequence (Seq2Seq) learning [21, 22] has proved outstanding at various research tasks such as text-to-speech synthesis (TTS) [23, 24, 25] and automatic speech recognition (ASR) [26, 27]. The early Seq2Seq model [21] has encoder and decoder architectures for mapping an input sequence to an encoded representation that the decoder network uses to generate an output sequence. To select critical information from the encoded representation in accordance with the output sequence representation, later Seq2Seq models [22, 28] introduced an attention mechanism. The key advantages of the Seq2Seq learning approach are the ability to train a single end-to-end model directly on the source and target sequences and the capacity to handle input and output sequences of different lengths. In particular, we expect that the Seq2Seq model makes it possible to convert not only the acoustic features but also the durations of the source speech to those of the target speech. Moreover, Seq2Seq learning is extensible to semi-supervised learning [29], where it can avoid the time-consuming task of collecting parallel data. In a supervised learning task, Seq2Seq learning requires parallel data of the source and target sequences rather than time-aligned parallel data. By exploiting dual learning [30, 31], Seq2Seq learning can be trained with a small amount of parallel data and a large amount of non-parallel data.

In this paper, we propose a Seq2Seq-based VC with attention and context preservation mechanisms. (Audio samples can be accessed on our web page: http://www.kecl.ntt.co.jp/people/tanaka.ko/projects/atts2svc/attention_based_seq2seq_voice_conversion.html.) Our contributions are as follows:

  • Our VC method makes it possible to stabilize and accelerate the training procedure by considering guided attention and context preservation losses.

  • It makes it possible to convert not only spectral envelopes but also fundamental frequency contours and durations of the speech.

  • It requires no context information such as phoneme labels, unlike [32, 33], which introduced Seq2Seq models that use context information.

  • It requires no time-aligned source or target speech data in advance.

We conducted an experiment demonstrating that the quality of the synthesized speech generated by our VC framework is higher than that of speech generated by the conventional GMM-based VC and comparable to that of speech generated by recurrent NN-based TTS, in terms of both naturalness and speaker similarity. Note that the proposed model was trained in only one day, using a single NVIDIA Tesla K80 GPU.

2 Conventional VC

2.1 Frame/Sequence-based VC

There are two types of frame/sequence-based VC: VCs requiring parallel data [34, 10, 35] and parallel-data-free VCs [1, 3]. The first framework has different procedures for training and conversion, as shown in Fig. 1a. The conversion procedure does not have a time warping function, even though the training procedure includes a time-alignment step to handle source and target sequences having different lengths. The second framework is a parallel-data-free VC that does not require parallel source and target speech data. To realize parallel-data-free VC, the second framework uses context information [36, 37], adaptation techniques [38, 39], a pre-constructed speaker space [40, 41], or cycle consistency [1, 3]. Although these VC techniques have various training procedures, their conversion procedures do not involve a time warping function. Consequently, the frame/sequence-based VC frameworks do not allow us to convert the durations and the acoustic features of the source speech at the same time. In contrast, our model allows both the acoustic features and the durations to be converted at the same time.
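To make the time-alignment step concrete, the following is a minimal dynamic time warping (DTW) sketch in NumPy that aligns a source and a target acoustic feature sequence of different lengths. The function name `dtw_align` and the per-frame Euclidean distance are illustrative assumptions, not the exact alignment procedure used in [34, 10, 35].

```python
import numpy as np

def dtw_align(src, trg):
    """Align two feature sequences (frames x dims) with dynamic time warping.

    Returns index pairs (i, j) so that src[i] and trg[j] form time-aligned
    parallel frames, as required for training frame-based conversion models.
    """
    n, m = len(src), len(trg)
    dist = np.linalg.norm(src[:, None, :] - trg[None, :, :], axis=-1)  # (n, m) frame distances

    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])

    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Example: align a 100-frame source with a 120-frame target (40-dim features).
rng = np.random.default_rng(0)
pairs = dtw_align(rng.standard_normal((100, 40)), rng.standard_normal((120, 40)))
```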

2.2 Seq2Seq-based VC

In contrast to the frame/sequence-based VC frameworks, Seq2Seq-based VC frameworks make it possible to convert not only the acoustic features but also the durations of the source speech. Most Seq2Seq-based VCs consist of ASR and TTS modules, which are trainable with pairs of speech and transcripts rather than pairs of source and target speech. The ASR module converts the acoustic feature sequence of the source speech into a sequence of context information such as phoneme labels and context posterior probabilities [32, 37], and the TTS module generates the acoustic feature sequence of the desired speech from the sequence of context information. One approach to changing the duration of the source speech uses a re-generation method that first converts the acoustic features of the source speech into text symbols and then generates the duration information from those symbols; namely, the duration information is erased once and re-generated. Another approach [32, 33] involves Seq2Seq learning, as shown in Fig. 1b. In this approach, the context posterior probability sequence of the source speech, including the duration information, is directly converted into a context posterior probability sequence of the desired speech, including the duration information. Both approaches work well if ASR performs robustly and accurately enough, but they require a large number of transcripts to train each module. In contrast, our model does not use any transcript.

Figure 1: System overviews of conventional VC: a) frame/sequence-based VC using parallel data (see Sec. 2.1) and b) Seq2Seq-based VC (see Sec. 2.2). “CPPs” and “BiLSTM” denote context posterior probabilities and bidirectional LSTM, respectively.

3 AttS2S-VC

Our method consists of 1) the four basic components of the Seq2Seq model and 2) two additional components forming a context preservation mechanism. The four basic components are a source encoder, a target encoder, a target autoregressive (AR) decoder, and an attention mechanism. The two additional components are a source decoder and another target decoder that preserve the linguistic information of the source speech. Figure 2 shows an overview of the system.

3.1 Seq2Seq Model with Attention Mechanism

Let us use $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_M)$ to denote sequences of acoustic features of the source and target speech, respectively. The source encoder network $\mathrm{SrcEnc}$ and target encoder network $\mathrm{TrgEnc}$ encode the input sequences $X$ and $Y$ to the embeddings $K$ and $Q$, as follows:

$K = \mathrm{SrcEnc}(X)$,   (1)
$Q = \mathrm{TrgEnc}(Y)$.   (2)

In order to accurately predict the output sequence, [22, 28] introduced an attention mechanism. At each time frame $m$ of the embeddings $Q$, the attention mechanism gives a probability distribution that describes the relationship between the given time-frame feature $q_m$ and the embeddings $K$. Consequently, the attention matrix $A$ can be written as

$e_{n,m} = \mathrm{score}(k_n, q_m)$,   (3)
$a_{n,m} = \dfrac{\exp(e_{n,m})}{\sum_{n'=1}^{N} \exp(e_{n',m})}$,   (4)

where $\mathrm{score}(\cdot,\cdot)$ indicates a function described by feed-forward NNs and $a_{n,m}$ is an element $(n, m)$ of the attention matrix $A$.

A seed $R$ of the target AR decoder is obtained by considering the long-range temporal dependencies between the source and target sequences as follows:

$R = A^{\top} K$.   (5)

As the name implies, the target AR decoder involves all previous outputs of itself. Hence, the input of the target AR decoder is combined with the seed $R$ and the embeddings $Q$. The output $\hat{Y}$ of the Seq2Seq model is obtained through the target AR decoder $\mathrm{TrgDec}$,

$\hat{Y} = \mathrm{TrgDec}(R, Q)$.   (6)

Finally, we minimize the objective function of Seq2Seq learning:

$\mathcal{L}_{\mathrm{s2s}} = \| \hat{Y} - Y \|$.   (7)
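As a concrete illustration of Eqs. (1)-(7), the following NumPy sketch computes the embeddings, the attention matrix, the seed, and the Seq2Seq loss for one utterance pair. The linear maps standing in for the encoder and decoder networks, the dot-product score (the paper describes a feed-forward score function), and the L1 loss are assumptions made for the sketch; the autoregressive feedback of the decoder is also omitted for brevity.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, M, D, H = 100, 120, 40, 64          # source/target frames, feature dim, embedding dim

X = rng.standard_normal((N, D))        # source acoustic features
Y = rng.standard_normal((M, D))        # target acoustic features

# Stand-ins for the source/target encoders (Eqs. (1)-(2)): here, fixed linear maps.
W_src, W_trg = rng.standard_normal((D, H)), rng.standard_normal((D, H))
K = X @ W_src                          # source embeddings, (N, H)
Q = Y @ W_trg                          # target embeddings, (M, H)

# Attention scores and matrix (Eqs. (3)-(4)); a dot-product score is assumed here.
E = K @ Q.T                            # e_{n,m}, shape (N, M)
A = softmax(E, axis=0)                 # a_{n,m}: distribution over source frames per target frame

# Seed for the target AR decoder (Eq. (5)): attention-weighted sum of source embeddings.
R = A.T @ K                            # (M, H)

# Stand-in for the target decoder (Eq. (6)): maps the combined [R; Q] back to acoustic features.
W_dec = rng.standard_normal((2 * H, D))
Y_hat = np.concatenate([R, Q], axis=1) @ W_dec

# Seq2Seq objective (Eq. (7)); an L1 reconstruction loss is assumed.
L_s2s = np.abs(Y_hat - Y).mean()
print(L_s2s)
```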
Figure 2: System overview of the proposed VC. The solid lines indicate the training and conversion procedures. The dashed lines indicate calculations of the differences during training. Green boxes and red boxes denote the original components and the proposed additional components, respectively.

3.2 Stabilizing and Accelerating Training Procedure

3.2.1 Guided Attention Loss

To accelerate the training of an attention module, [25] introduced a guided attention loss. Generally speaking, most speech signal processing applications, such as ASR, TTS, and VC, are time-incremental algorithms. It is natural to assume that the time frame of the source speech waveform progresses nearly linearly with respect to the time frame of the target speech waveform, i.e., $n/N \simeq m/M$, where $n$ and $m$ index the source and target frames. Therefore, the attention matrix $A$ should be nearly diagonal. A penalty matrix $W$ is designed as follows:

$w_{n,m} = 1 - \exp\!\left(-\dfrac{(n/N - m/M)^2}{2g^2}\right)$,   (8)

where $g$ controls how close $W$ is to a diagonal matrix. The guided attention loss is defined as

$\mathcal{L}_{\mathrm{ga}} = \dfrac{1}{NM} \sum_{n,m} (A \odot W)_{n,m}$,   (9)

where $\odot$ indicates an element-wise product.
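A minimal NumPy sketch of the penalty matrix and guided attention loss in Eqs. (8)-(9); the default width g = 0.4 is illustrative.

```python
import numpy as np

def guided_attention_loss(A, g=0.4):
    """Penalize attention mass far from the diagonal (Eqs. (8)-(9)).

    A: (N, M) attention matrix with source frames on axis 0, target frames on axis 1.
    g: width parameter; smaller g enforces a sharper diagonal.
    """
    N, M = A.shape
    n = np.arange(N)[:, None] / N      # normalized source frame positions
    m = np.arange(M)[None, :] / M      # normalized target frame positions
    W = 1.0 - np.exp(-((n - m) ** 2) / (2.0 * g ** 2))   # penalty matrix, Eq. (8)
    return (A * W).mean()              # element-wise product, averaged over all (n, m)

# Example with a random attention matrix normalized over source frames.
rng = np.random.default_rng(0)
A = rng.random((100, 120))
A /= A.sum(axis=0, keepdims=True)
print(guided_attention_loss(A))
```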

3.2.2 Context Preservation Loss

To stabilize the training procedure, we propose a context preservation loss. In preliminary experiments, we found that the training procedure sometimes failed even when it took the guided attention loss into account (see the speech samples on our web page). In particular, the converted speech sounded like randomly generated speech or speech repeating several phonemes. One possible reason is that minimizing the objective function sometimes turns the target AR decoder into a network that merely reconstructs the input of the target encoder, because we use the ground-truth target $Y$ rather than the predicted output as the input of the target encoder during training. As a result, the source encoder is not required to control the output of the target AR decoder or to preserve the context information of the source speech.

To make the source encoder meaningful, we introduce two additional networks to the original Seq2Seq model as a context preservation mechanism. One is a source decoder $\mathrm{SrcDec}$ for reconstructing the source speech $X$ from the embeddings $K$. The other is a target decoder $\mathrm{TrgDec'}$ for predicting the target speech $Y$ from the seed $R$:

$\tilde{X} = \mathrm{SrcDec}(K)$,   (10)
$\tilde{Y} = \mathrm{TrgDec'}(R)$.   (11)

From another point of view, the source decoder helps the source encoder to preserve the linguistic information of the source speech $X$, while the target decoder helps the source encoder to encode the source speech into the shared space of the source and target speech. Note that in the preliminary experiments, the target decoder was more important than the source decoder. The full objective function of our model is formulated as

$\mathcal{L} = \mathcal{L}_{\mathrm{s2s}} + \lambda_{\mathrm{ga}} \mathcal{L}_{\mathrm{ga}} + \lambda_{\mathrm{cp}} \left( \| \tilde{X} - X \| + \| \tilde{Y} - Y \| \right)$,   (12)

where $\lambda_{\mathrm{cp}}$ controls the weight of the context preservation loss and $\lambda_{\mathrm{ga}}$ that of the guided attention loss.
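The sketch below assembles the full objective of Eq. (12) from the pieces above, with linear stand-in decoders for the context preservation terms of Eqs. (10)-(11); the L1 losses and the default weight values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def full_objective(X, Y, K, R, Y_hat, A, W_srcdec, W_trgdec,
                   g=0.4, lambda_ga=10000.0, lambda_cp=10.0):
    """Full training objective of Eq. (12): Seq2Seq + guided attention + context preservation."""
    # Seq2Seq loss (Eq. (7)); an L1 loss is assumed.
    L_s2s = np.abs(Y_hat - Y).mean()

    # Guided attention loss (Eqs. (8)-(9)).
    N, M = A.shape
    n = np.arange(N)[:, None] / N
    m = np.arange(M)[None, :] / M
    L_ga = (A * (1.0 - np.exp(-((n - m) ** 2) / (2.0 * g ** 2)))).mean()

    # Context preservation losses (Eqs. (10)-(11)): reconstruct X from K and predict Y from R,
    # here with linear stand-in decoders.
    X_tilde = K @ W_srcdec
    Y_tilde = R @ W_trgdec
    L_cp = np.abs(X_tilde - X).mean() + np.abs(Y_tilde - Y).mean()

    return L_s2s + lambda_ga * L_ga + lambda_cp * L_cp

# Example with random placeholders matching the shapes of the earlier sketch.
rng = np.random.default_rng(0)
N, M, D, H = 100, 120, 40, 64
X, Y = rng.standard_normal((N, D)), rng.standard_normal((M, D))
K, R = rng.standard_normal((N, H)), rng.standard_normal((M, H))
Y_hat, A = rng.standard_normal((M, D)), np.full((N, M), 1.0 / N)
print(full_objective(X, Y, K, R, Y_hat, A,
                     rng.standard_normal((H, D)), rng.standard_normal((H, D))))
```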

4 Experiments

4.1 Experimental Conditions

Datasets: We used the CMU Arctic database [42], consisting of utterances by two male speakers (rms and bdl) and two female speakers (clb and slt). To train the models, we used about 1,000 sentences (about 50 minutes of speech) from each speaker. To evaluate the performance, we used 132 sentences from each speaker. The sampling rate of the speech signals was 16 kHz. We treated rms and clb as source speakers and bdl and slt as target speakers. For the evaluations, we used intra-gender pairs (rms-bdl and clb-slt) and cross-gender pairs (rms-slt and clb-bdl). Note that we trained the conversion models for each speaker pair independently.

Baseline system 1 (GMM-VC-wGV): We used a GMM-based VC method [10] as the baseline for the frame/sequence-based VC described in Sec. 2.1. To train the conversion models, we used an open-source VC toolkit, sprocket [43], with its default settings, except for the F0 ranges and power thresholds. Note that global variance (GV) [10] was also considered.

Baseline system 2 (LSTM-TTS): By assuming that the ASR module and the encoder part of the encoder-decoder module in [32] work perfectly, we can focus on the TTS module. Therefore, we used an LSTM-based TTS method as the baseline for the Seq2Seq-based VC described in Sec. 2.2. The contextual features used as input were 416-dimensional linguistic features obtained using the default question set of the open-source TTS toolkit Merlin [44]. From the speech data, 60 Mel-cepstral coefficients, logarithmic F0, and coded aperiodicities were extracted every 5 ms with the WORLD analysis system [45]. As the duration model, we stacked three LSTMs with 256 cells followed by a linear projection. As the acoustic model, we stacked three bidirectional LSTMs with 256 cells followed by a linear projection.
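For reference, the acoustic features described above (60 Mel-cepstral coefficients, logarithmic F0, and coded aperiodicities at a 5-ms frame shift) can be extracted roughly as in the sketch below, assuming the pyworld and pysptk Python wrappers; analysis settings such as the all-pass constant alpha are assumptions, not values reported in the paper.

```python
import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

def extract_world_features(wav_path, fs_expected=16000, frame_period=5.0,
                           mcep_order=59, alpha=0.41):
    """Extract Mel-cepstra, log F0, and coded aperiodicity with WORLD (assumed settings)."""
    x, fs = sf.read(wav_path)
    assert fs == fs_expected
    x = x.astype(np.float64)                                  # WORLD expects float64

    f0, t = pw.harvest(x, fs, frame_period=frame_period)      # F0 contour every 5 ms
    sp = pw.cheaptrick(x, f0, t, fs)                          # spectral envelope
    ap = pw.d4c(x, f0, t, fs)                                 # aperiodicity

    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)    # 60 Mel-cepstral coefficients
    log_f0 = np.zeros_like(f0)
    voiced = f0 > 0
    log_f0[voiced] = np.log(f0[voiced])                       # log F0, unvoiced frames left at 0
    coded_ap = pw.code_aperiodicity(ap, fs)                   # coded aperiodicities

    return mcep, log_f0, coded_ap
```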

Proposed system (Proposed): Inspired by Tacotron [23], we used the architecture described in open Tacotron [46]. Note that we replaced all ReLU activations [47] with the gating mechanism of gated linear units [48]. Although the proposed method worked well not only for acoustic features of the WORLD vocoder but also for raw spectral features, we chose to use the acoustic features of the WORLD vocoder to match the experimental conditions of LSTM-TTS.

Note that the target AR decoder also generated stop tokens. As the additional source decoder and target decoder networks, we used the same architectures as in the source encoder. The hyperparameters $g$, $\lambda_{\mathrm{ga}}$, and $\lambda_{\mathrm{cp}}$ were 0.4, 10,000, and 10, respectively. The batch size, number of epochs, and reduction factor [49] were 32, 1,000, and 5, respectively. We used the Adam optimizer [50] and varied the learning rate over the course of training [51].
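The learning-rate schedule of [51] is the warmup-then-inverse-square-root-decay rule sketched below; the model dimension and warmup steps shown are illustrative defaults, not values reported in the paper.

```python
def noam_learning_rate(step, d_model=256, warmup_steps=4000, scale=1.0):
    """Learning-rate schedule of [51]: linear warmup followed by inverse-sqrt decay."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the rate rises during warmup and decays afterwards.
for s in (100, 4000, 40000):
    print(s, noam_learning_rate(s))
```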

4.2 Experimental Results

Figure 3: Results of preference tests of naturalness (upper) and speaker similarity (lower).

As shown in Fig. 3, we conducted two subjective evaluations: preference tests on naturalness and on speaker similarity. The number of listeners was 15, and each listener evaluated 80 trials, consisting of 10 randomly selected speech samples × 4 intra-/cross-gender speaker pairs × 2 comparisons (vs. GMM-VC-wGV and vs. LSTM-TTS).

The evaluations indicated that Proposed outperformed GMM-VC-wGV in terms of both naturalness and speaker similarity. This is because our method makes it possible to convert not only the acoustic features but also the durations of speech. In contrast, baseline system 1 forces the conversion while preserving the durations of the source speech. Consequently, source durations that do not match those of the target speech increase the conversion errors.

Moreover, Proposed was comparable to LSTM-TTS. This result demonstrates that our method makes it possible to learn the key components for changing the individuality of the speaker while preserving the linguistic information. Notably, our model was trained without any transcript while  [32, 33] used a large number of transcripts.

5 Conclusions

We proposed a method based on Seq2Seq learning with attention and context preservation mechanisms for VC tasks. Experimental results demonstrated that the proposed method outperformed the conventional GMM-based VC and was comparable to LSTM-based TTS. Extending the proposed method so that it can be used in semi-supervised learning tasks is ongoing work. Note that since we have also been developing a convolutional version of the proposed method [52] in parallel, we will conduct further evaluations and report the results.

Acknowledgements: This work was supported by a grant from the Japan Society for the Promotion of Science (JSPS KAKENHI 17H01763).

References