Machine Speech Chain with One-shot Speaker Adaptation

03/28/2018 ∙ by Andros Tjandra, et al.

In previous work, we developed a closed-loop speech chain model based on deep learning, in which the architecture enabled the automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components to mutually improve their performance. This was accomplished by the two parts teaching each other using both labeled and unlabeled data. This approach could significantly improve model performance within a single-speaker speech dataset, but only a slight increase could be gained in multi-speaker tasks. Furthermore, the model is still unable to handle unseen speakers. In this paper, we present a new speech chain mechanism by integrating a speaker recognition model inside the loop. We also propose extending the capability of TTS to handle unseen speakers by implementing one-shot speaker adaptation. This enables TTS to mimic voice characteristics from one speaker to another with only a one-shot speaker sample, even from a text without any speaker information. In the speech chain loop mechanism, ASR also benefits from the ability to further learn an arbitrary speaker's characteristics from the generated speech waveform, resulting in a significant improvement in the recognition rate.




1 Introduction

In human communication, a closed-loop speech chain mechanism has a critical auditory feedback mechanism from the speaker’s mouth to her ear [1]. In other words, the hearing process is critical not only for the listener but also for the speaker. By simultaneously listening and speaking, the speaker can monitor the volume, articulation, and general comprehensibility of her speech. Inspired by such a mechanism, we previously constructed a machine speech chain [2] based on deep learning. This architecture enabled ASR and TTS to mutually improve their performance by teaching each other.

One of the advantages of using a machine speech chain is the ability to train a model based on the concatenation of both labeled and unlabeled data. For supervised training with labeled data (speech-text pair data), both ASR and TTS models can be trained independently by minimizing the loss of their predicted target sequence and the ground truth sequence. However, for unsupervised training with unlabeled or unpaired data (speech only or text only), the two models need to support each other through a connection. Our experimental results reveal that such a connection enabled ASR and TTS to further improve their performance by using only unpaired data. Although this technique could provide a significant improvement in model performance within a single-speaker speech dataset, only a slight increase could be gained in multi-speaker tasks.

Difficulties arise due to the fundamental differences in the ASR and TTS mechanisms. The ASR task is to “extract” data from a large amount of information and only retain the spoken content (many-to-one mapping). On the other hand, the TTS task aims to “generate” data from compact text information into a generated speech waveform with an arbitrary speaker’s characteristics and speaking style (one-to-many mapping). The imbalanced amounts of information contained inside the text and speech causes information loss inside the speech-chain and hinders us in perfectly reconstructing the original speech. To enable the TTS system to mimic the voices of different speakers, we previously only added speaker information via a speaker’s identity by one-hot encoding. However, this is not a practical solution because we are still unable to handle unseen speakers.

In this paper, we propose a new approach to handle voice characteristics from an unknown speaker and minimize the information loss between speech and text inside the speech chain loop. First, we integrate a speaker recognition system into the speech chain loop. Second, we extend the capability of TTS to handle the unseen speaker using one-shot speaker adaptation. This enables TTS to mimic voice characteristics from one speaker to another with only a one-shot speaker sample, even from text without any speaker information. In the speech chain loop mechanism, ASR also benefits from furthering learning an arbitrary speaker’s characteristics from the generated speech waveform. We evaluated our proposed model with the well-known Wall Street Journal corpus, consisting of multi-speaker speech utterances that are often used as an ASR benchmark test set. Our new speech mechanism is able to handle unseen speakers and improve the performance of both the ASR and TTS models.

2 Machine Speech Chain Framework

Figure 1: Overview of the proposed machine speech chain architecture with speaker recognition: (a) overall architecture; (b) unrolled process with only speech utterances and no text transcription (speech → [ASR, SPKREC] → [text + speaker vector] → TTS → speech); (c) unrolled process with only text and no corresponding speech utterance ([text + speaker vector sampled via SPKREC] → TTS → speech → ASR → text). Note: the grayed box is the original speech chain mechanism.

Figure 1 illustrates the new speech chain mechanism. Similar to the earlier version, it consists of a sequence-to-sequence ASR [3, 4], a sequence-to-sequence TTS [5], and a loop connection from ASR to TTS and from TTS to ASR. The key idea is to jointly train the ASR and TTS models. The difference is that in this new version, we integrate a speaker recognition model inside the loop, as illustrated in Fig. 1(a). As mentioned above, we can train our model on the concatenation of both labeled (paired) and unlabeled (unpaired) data. In the following, we describe the learning process.

  1. Paired speech-text dataset (see Fig. 1(a)): Given speech utterances x and the corresponding text transcriptions y from the paired dataset, both the ASR and TTS models can be trained independently. Here, we can train the ASR by calculating the ASR loss directly with teacher-forcing. For TTS training, we generate a speaker-embedding vector z with SPKREC, integrate the speaker information into the TTS, and calculate the TTS loss via teacher-forcing.

  2. Unpaired speech data only (see Fig. 1(b)): Given only the speech utterances x from the unpaired dataset, ASR generates the text transcription ŷ (with greedy or beam-search decoding), and SPKREC provides a speaker-embedding vector z. TTS then reconstructs the speech waveform x̂, given the generated text ŷ and the original speaker vector z, via teacher-forcing. After that, we calculate the loss between x and x̂.

  3. Unpaired text data only (see Fig. 1(c)): Given only the text transcription y from the unpaired dataset, we sample a speech utterance from the available dataset and generate a speaker vector z̃ from SPKREC. Then, the TTS generates the speech utterance x̂ with greedy decoding, while the ASR reconstructs the text ŷ, given the generated speech x̂, via teacher-forcing. After that, we calculate the loss between y and ŷ.

We combine all of the losses together and update both the ASR and TTS models:

L = α (L_ASR^P + L_TTS^P) + β (L_ASR^U + L_TTS^U)

where α and β are hyper-parameters that scale the loss between the supervised (paired) and unsupervised (unpaired) terms.
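As a concrete illustration, the weighted combination of supervised (paired) and unsupervised (unpaired) losses can be sketched in plain Python. The helper name `combined_loss` and the symbols `alpha`/`beta` for the scaling hyper-parameters are our own illustrative choices, not the paper's notation; in practice the four loss terms would be framework tensors rather than floats.

```python
def combined_loss(loss_asr_p, loss_tts_p, loss_asr_u, loss_tts_u,
                  alpha=1.0, beta=0.5):
    """Hypothetical sketch of the speech-chain objective.

    alpha scales the supervised (paired) ASR+TTS losses,
    beta scales the unsupervised (unpaired) reconstruction losses.
    """
    return alpha * (loss_asr_p + loss_tts_p) + beta * (loss_asr_u + loss_tts_u)
```

The same expression works unchanged on PyTorch tensors, so a single backward pass through the combined scalar updates both models jointly.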

3 Sequence-to-Sequence ASR

A sequence-to-sequence [6] architecture is a type of neural network that directly models the conditional probability p(y|x) between two sequences x and y. For an ASR model, we assume the source sequence x is a sequence of speech features (e.g., Mel-spectrogram, MFCC) and the target sequence y is a sequence of graphemes or phonemes.

The encoder reads the source speech sequence x, forwards it through several layers (e.g., LSTM [7]/GRU [8], convolution), and extracts a high-level feature representation h^e for the decoder. The decoder is an autoregressive model that produces the current output y_t conditioned on the previous output y_{t-1} and the encoder states h^e. To bridge the information between the decoder states h^d and the encoder states h^e, we use an attention mechanism [9] to calculate the alignment probability a_t and then the expected context vector c_t. Finally, the decoder predicts the target sequence probability p(y_t | y_{<t}, x). In the training stage, we optimized the ASR by minimizing the negative log-likelihood loss function:

L_ASR = −Σ_t log p(y_t | y_{<t}, x)

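To make the attention pooling and the training objective concrete, here is a minimal plain-Python sketch of content-based attention (softmax over alignment scores, then an expected context vector) and the per-step negative log-likelihood under teacher-forcing. Function names are hypothetical, and a real system would operate on batched tensors with learned score functions.

```python
import math

def attention_context(scores, enc_states):
    """Softmax the alignment scores a_t, then return the expected
    context vector c_t = sum_s a_t[s] * h_e[s]."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    dim = len(enc_states[0])
    return [sum(p * h[d] for p, h in zip(probs, enc_states))
            for d in range(dim)]

def asr_nll_loss(step_probs):
    """Negative log-likelihood over decoder steps.

    step_probs[t] is p(y_t | y_<t, x): the probability the decoder
    assigns to the ground-truth symbol at step t (teacher-forcing).
    """
    return -sum(math.log(p) for p in step_probs)
```

With uniform scores the context vector is simply the mean of the encoder states, and a perfectly confident decoder (probability 1 at every step) yields zero loss.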
4 Sequence-to-Sequence TTS with One-shot Speaker Adaptation

A parametric TTS can be formulated as a sequence-to-sequence model where the source sequence x is a text utterance with length S, and the target sequence y is a sequence of speech features with length T. Our model objective is to maximize p(y|x) w.r.t. the TTS parameters. We build our model upon the basic structure of the “Tacotron” TTS [5] and “DeepSpeaker” [10] models.

The original Tacotron is a single speaker TTS system based on a sequence-to-sequence model. Given a text utterance, Tacotron produces the Mel-spectrogram and the linear spectrogram followed by the Griffin-Lim algorithm to recover the phase and reconstruct the speech signal. However, the original model is not designed to incorporate speaker identity or to generate speech from different speakers.

On the other hand, DeepSpeaker is a deep neural speaker-embedding system (here denoted as “SPKREC”). Given a sequence of speech features x, DeepSpeaker generates an L2-normalized continuous embedding vector z. If two utterances x_1 and x_2 are spoken by the same speaker, the trained DeepSpeaker model will produce embeddings z_1 and z_2 that are close to each other; otherwise, the generated embeddings z_1 and z_2 will be far from each other. By combining Tacotron with DeepSpeaker, we can perform “one-shot” speaker adaptation by conditioning Tacotron on the fixed-size continuous vector generated by DeepSpeaker from a single speech utterance of any speaker.
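The "close vs. far" behavior of L2-normalized embeddings can be illustrated in a few lines of plain Python. DeepSpeaker itself is a deep network; this sketch (with hypothetical helper names) only shows the normalization and the similarity computation applied to its outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm, as DeepSpeaker does to its output."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """For L2-normalized embeddings this reduces to a dot product:
    ~1 for same-speaker pairs, lower for different-speaker pairs."""
    return sum(x * y for x, y in zip(a, b))
```

Conditioning the TTS on such a unit vector means any single utterance from an unseen speaker yields a usable fixed-size speaker representation.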

Figure 2: Proposed model: sequence-to-sequence TTS (Tacotron) + speaker information via neural speaker embedding (DeepSpeaker).

Here, we adopt both systems by modifying the original Tacotron TTS model to integrate the DeepSpeaker model. Figure 2 illustrates our proposed model. In the encoder module, the character embedding maps a sequence of characters into continuous vectors. The continuous vectors are then projected by two fully connected (FC) layers with the LReLU [11] function. We pass the results to a CBHG module (1-D convolution bank + highway + bidirectional GRU) with K=8 (1 to 8) different filter sizes. The final output from the CBHG module represents the high-level information h^e from the input text.

On the decoder side, we have an autoregressive decoder that produces the current output Mel-spectrogram y_t^M given the previous output y_{t-1}^M, the encoder context vector c_t, and the speaker-embedding vector z. First, at time-step t, the previous input y_{t-1}^M is projected by two FC layers with LReLU. Then, to inform our decoder which speaker style should be produced, we feed the corresponding speech utterance into SPKREC and generate the speaker-embedding vector z. This speaker embedding is generated using only one utterance of the target speaker; thus, it is called “one-shot” speaker adaptation. After that, we integrate the speaker vector with a linear projection and sum it with the last output from the FC layer. Then, we apply two LSTM layers to generate the current decoder state h_t^d. To retrieve the relevant information between the current decoder state and the entire encoder state, we calculate the attention probability a_t and the expected context vector c_t. Then, we concatenate the decoder state h_t^d, context vector c_t, and projected speaker embedding into a single vector, followed by two fully connected layers to produce the current time-step Mel-spectrogram output y_t^M. Finally, all predicted Mel-spectrogram outputs are projected into a CBHG module to invert the Mel-spectrogram into a linear spectrogram y^R. Additionally, we have an end-of-speech prediction module to predict when the speech is finished. It reads the predicted Mel-spectrogram y_t^M and the context vector c_t, followed by an FC layer and a sigmoid function, to produce a scalar b_t ∈ [0, 1].
In the training stage, we optimized our proposed model by minimizing the following loss function:

L_TTS = γ1 (‖y^M − ŷ^M‖² + ‖y^R − ŷ^R‖²) + γ2 BCE(b, b̂) + γ3 (1 − cos(z, ẑ))   (Eq. 5)

where γ1, γ2, γ3 are our sub-loss hyper-parameters; y^M, y^R, b, and z are the ground-truth Mel-spectrogram, linear spectrogram, end-of-speech label, and speaker-embedding vector from real speech data, respectively; ŷ^M, ŷ^R, and b̂ are the predicted Mel-spectrogram, linear spectrogram, and end-of-speech label, respectively; and ẑ is the speaker-embedding vector predicted from the Tacotron output. Here, L_TTS consists of three different loss terms: the first term of Eq. 5 applies an L2 norm-squared error between the ground-truth and predicted speech as a regression task; the second term applies binary cross-entropy for end-of-speech prediction as a classification task; and the third term applies the cosine distance between the ground-truth speaker embedding z and the predicted speaker embedding ẑ, a common metric for measuring the similarity between two vectors. Furthermore, by minimizing this loss, we also minimize the global loss of speaker style [12, 13].
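A scalar sketch of a composite loss of this shape (L2 regression + binary cross-entropy + cosine distance) is given below. The weight names `g1`–`g3` and all helper names are illustrative, and real training would use batched spectrogram tensors rather than flat lists.

```python
import math

def l2_squared(y_true, y_pred):
    # regression term: sum of squared differences
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))

def bce(targets, probs):
    # binary cross-entropy for end-of-speech prediction
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(targets, probs))

def cosine_distance(u, v):
    # 1 - cos(u, v); zero when the two embeddings point the same way
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def tts_loss(mel_t, mel_p, lin_t, lin_p, eos_t, eos_p, spk_t, spk_p,
             g1=1.0, g2=1.0, g3=1.0):
    """Weighted sum of the three sub-losses (illustrative weights)."""
    return (g1 * (l2_squared(mel_t, mel_p) + l2_squared(lin_t, lin_p))
            + g2 * bce(eos_t, eos_p)
            + g3 * cosine_distance(spk_t, spk_p))
```

Note that the speaker-style term only needs direction agreement between the two embeddings, which is why cosine distance rather than an L2 penalty is the natural choice for L2-normalized vectors.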

5 Experiment

5.1 Corpus Dataset

In this study, we run our experiment on the Wall Street Journal CSR Corpus [14]. The complete data are contained in the SI284 (SI84+SI200) dataset. We follow the standard Kaldi [15] s5 recipe to split the training set, development set, and test set. To reformulate the speech chain as a semi-supervised learning method, we prepare SI84 and SI200 as the paired and unpaired training sets, respectively. SI84 consists of 7138 utterances (about 16 hours of speech) spoken by 83 speakers, and SI200 consists of 30,180 utterances (about 66 hours) spoken by 200 speakers (without any overlap with the speakers of SI84). We use “dev93” to denote the development set and “eval92” for the test set.

5.2 Feature and Text Representation

All raw speech waveforms are represented at a 16-kHz sampling rate. We extracted two different sets of features. First, we applied pre-emphasis (0.97) to the raw waveform, and then we extracted the log-linear spectrogram with a 50-ms window length, 12.5-ms step size, and 2048-point short-time Fourier transform (STFT) using the Librosa package [16]. Second, we extracted the log-Mel spectrogram with an 80 Mel-scale filterbank. For our TTS model, we used the log-linear and log-Mel spectrograms for the first and second outputs, respectively. For our ASR and DeepSpeaker models, we used the log-Mel spectrogram as the encoder input.
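The paper extracts these features with Librosa; the NumPy sketch below reimplements the pre-emphasis filter and the log-linear spectrogram directly so it is self-contained. The Hann window and the epsilon floor are our assumptions, not stated in the paper.

```python
import numpy as np

SR = 16000               # 16-kHz sampling rate
N_FFT = 2048             # 2048-point STFT
WIN = int(0.050 * SR)    # 50-ms window  -> 800 samples
HOP = int(0.0125 * SR)   # 12.5-ms step  -> 200 samples

def preemphasis(wav, coef=0.97):
    # y[t] = x[t] - coef * x[t-1], first sample passed through
    return np.append(wav[0], wav[1:] - coef * wav[:-1])

def log_linear_spectrogram(wav, eps=1e-10):
    """Frame the pre-emphasized waveform, window it, and take the
    log-magnitude of a 2048-point STFT (assumed Hann window)."""
    wav = preemphasis(wav)
    n_frames = 1 + (len(wav) - WIN) // HOP
    frames = np.stack([wav[i * HOP: i * HOP + WIN] for i in range(n_frames)])
    frames = frames * np.hanning(WIN)
    spec = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))
    return np.log(spec + eps)
```

One second of 16-kHz audio yields 77 frames of 1025 frequency bins; the log-Mel variant would further project each frame through an 80-band Mel filterbank.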

The text utterances were tokenized as characters and mapped into a 33-character set: 26 alphabetic letters (a-z), 3 punctuation marks (' . -), and 4 special tags <noise>, <spc>, <s>, and </s> as the noise, space, start-of-sequence, and end-of-sequence tokens, respectively. Both the ASR input and TTS output shared the same text representation.
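A possible layout of this 33-symbol vocabulary is sketched below. The paper fixes the symbol set but not the index order, so the ordering, tag spellings, and function name here are assumptions for illustration.

```python
# Hypothetical vocabulary layout: 4 special tags + 26 letters + 3 punctuation
SPECIALS = ["<noise>", "<spc>", "<s>", "</s>"]
CHARS = list("abcdefghijklmnopqrstuvwxyz") + list("'.-")
VOCAB = SPECIALS + CHARS                      # 33 symbols in total
CHAR2ID = {c: i for i, c in enumerate(VOCAB)}

def tokenize(text):
    """Map a lowercase sentence to ids, wrapping it in <s> ... </s>
    and replacing spaces with the <spc> token."""
    ids = [CHAR2ID["<s>"]]
    for ch in text.lower():
        ids.append(CHAR2ID["<spc>"] if ch == " " else CHAR2ID[ch])
    ids.append(CHAR2ID["</s>"])
    return ids
```

Because ASR output and TTS input share this representation, the same table serves both directions of the speech chain loop.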

5.3 Model Details

For the ASR model, we used a standard sequence-to-sequence model with an attention module. On the encoder side, the input log-Mel spectrogram features were processed by 3 bidirectional LSTMs (Bi-LSTM) with 256 hidden units for each LSTM (512 total hidden units per Bi-LSTM). To reduce memory consumption and processing time, we used hierarchical sub-sampling [17, 3] on all three Bi-LSTM layers and thus reduced the sequence length by a factor of 8. On the decoder side, we projected the one-hot encoding of the previous character into a 256-dim continuous vector with an embedding matrix, followed by one unidirectional LSTM with 512 hidden units. For the attention module, we used standard content-based attention [9]. In the decoding phase, the transcription was generated by beam-search decoding (size = 5), and we normalized the log-likelihood score by dividing it by the hypothesis length to prevent the decoder from favoring shorter transcriptions. We did not use any language model or lexicon dictionary in this work.
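The length normalization used during beam search can be sketched in a few lines (the helper name is hypothetical). Dividing the total log-likelihood by the hypothesis length removes the bias toward short outputs that raw summed log-probabilities create.

```python
def length_normalized_score(log_probs):
    """Average per-symbol log-likelihood of one beam hypothesis.

    Without this, adding a symbol always lowers the summed score,
    so the search would systematically favor shorter transcriptions.
    """
    return sum(log_probs) / len(log_probs)
```

In the example below, the raw sum prefers the one-symbol hypothesis, while the normalized score correctly prefers the longer, per-symbol-better one.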

For the TTS model, we used the proposed TTS explained in Sec. 4. The hyperparameters for the basic structure were generally the same as those of the original Tacotron, except that we replaced ReLU with the LReLU function. For the CBHG module, we used 8 filter banks instead of 16 to reduce GPU memory consumption. On the decoder side, we deployed two LSTMs instead of a GRU, with 256 hidden units. At each time-step, our model generated 4 consecutive frames to reduce the number of steps in the decoding process. For the sub-loss hyper-parameters in Eq. 5, we set .

For the speaker recognition model, we used the DeepSpeaker model and followed the original hyper-parameters from the previous paper. However, our DeepSpeaker was trained only on the WSJ SI84 set with 83 unique speakers. Thus, the model is expected to generalize effectively across all remaining unseen speakers to assist the TTS and speech chain training. We used the Adam optimizer, with separate learning rates for the ASR and TTS models and for the DeepSpeaker model. All of our models in this paper are implemented with PyTorch.


6 Experimental Results

Model CER (%)
Supervised training:
WSJ train_si84 (paired) Baseline
Att Enc-Dec [19] 17.01
Att Enc-Dec [20] 17.68
Att Enc-Dec (ours) 17.35
Supervised training:
WSJ train_si284 (paired) Upperbound
Att Enc-Dec [19] 8.17
Att Enc-Dec [20] 7.69
Att Enc-Dec (ours) 7.12
Semi-supervised training:
WSJ train_si84 (paired) + train_si200 (unpaired)
Label propagation (greedy) 17.52
Label propagation (beam=5) 14.58
Proposed speech chain (Sec. 2) 9.86
Table 1: Character error rate (CER, %) comparison between the results of supervised learning and those of semi-supervised learning methods, evaluated on the test_eval92 set
Model L2-norm
Supervised training:
WSJ train_si84 (paired) Baseline
Proposed Tacotron (Sec. 4) (ours) 1.036
Supervised training:
WSJ train_si284 (paired) Upperbound
Proposed Tacotron (Sec. 4) (ours) 0.836
Semi-supervised training:
WSJ train_si84 (paired) + train_si200 (unpaired)
Proposed speech chain (Sec. 2 + Sec. 4) 0.886
Table 2: L2-norm squared on the log-Mel spectrogram, comparing supervised learning and semi-supervised learning methods, evaluated on the test_eval92 set. Note: We did not include the standard Tacotron (without SPKREC) in the table since it cannot output various target speakers.

Table 1 shows the ASR results from multiple scenarios evaluated on eval92. In the first block, we trained our baseline model using paired samples from the SI84 set only, and we achieved 17.35% CER. In the second block, we trained our model with the paired data of the full WSJ SI284 set, and we achieved 7.12% CER as our upper-bound performance. In the last block, we trained our model with a semi-supervised learning approach using SI84 as paired data and SI200 as unpaired data. For comparison with other models trained with semi-supervised learning, we carried out a label-propagation [21] strategy to “generate” the ground truth for the unlabeled speech dataset, and the model with beam size 5 successfully reduced the CER to 14.58%. Nevertheless, our proposed speech-chain model achieved a significant improvement over all baselines (paired only and label propagation) with 9.86% CER, close to the upper-bound result.

Table 2 shows the TTS results from multiple scenarios evaluated on eval92. We calculate the difference with the L2-norm squared between the ground-truth and the predicted log-Mel spectrogram. We observed trends similar to the ASR results: semi-supervised training with the speech chain method improved significantly over the baseline and came close to the upper-bound result.

7 Related Works

While single-speaker TTS has achieved high-quality results [5, 22], speaker adaptation remains a challenging task for TTS systems. As discussed in [23], adaptation techniques for neural networks fall into three classes: feature-space transformation, auxiliary-feature augmentation, and model-based adaptation. Wu et al. [24] performed systematic speaker adaptation for DNN-based speech synthesis at different levels. First, i-vector features [25] representing speaker identity were augmented at the input level. Then, they performed model adaptation using learning hidden unit contributions at the middle level, based on speaker-dependent parameters [23]. Finally, feature-space transformations were applied at the output level, where the parameters are transformed to mimic the target speaker’s voice with a joint-density Gaussian mixture model (JD-GMM).

Our adaptation approach falls into a similar category as the augmentation of auxiliary features such as i-vectors. In our case, however, we utilize DeepSpeaker [10], which is trained to minimize the distance between embedding pairs from the same speaker and maximize the distance between pairs from different speakers. It has been shown to provide better performance on speaker recognition tasks compared to i-vectors. Furthermore, instead of focusing the speaker adaptation task only on TTS, we integrate all end-to-end models, including ASR, TTS, and DeepSpeaker, into a machine speech chain loop.

8 Conclusion

In this paper, we introduced a new speech chain mechanism that integrates a speaker recognition model inside the loop. The proposed system eliminates the downside of our previous speech chain, which was unable to incorporate data from unseen speakers. We also extended the capability of TTS to generate speech from unseen speakers by implementing one-shot speaker adaptation; thus, the TTS can generate speech with similar voice characteristics from only a one-shot speaker example. Inside the speech chain loop, the ASR also gains new data from the combination of a text sentence and an arbitrary voice characteristic. Our results show that after deploying the speech-chain loop, the ASR system achieved a significant improvement compared to the baseline (supervised training only) and another semi-supervised technique (label propagation). Following a similar trend, the TTS system also improved compared to the baseline (supervised training only).

9 Acknowledgment

Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101 and JP17K00237.