Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding

05/18/2020 ∙ by Seungwoo Choi, et al. ∙ Hyperconnect, Inc.

On account of growing demands for personalization, the need is emerging for a so-called few-shot TTS system that clones speakers from only a small amount of data. To address this issue, we propose Attentron, a few-shot TTS model that clones voices of speakers unseen during training. It introduces two special encoders, each serving a different purpose. A fine-grained encoder extracts variable-length style information via an attention mechanism, and a coarse-grained encoder greatly stabilizes the speech synthesis, avoiding unintelligible gibberish even when synthesizing speech of unseen speakers. In addition, the model can scale out to an arbitrary number of reference audios to improve the quality of the synthesized speech. According to our experiments, including a human evaluation, the proposed model significantly outperforms state-of-the-art models when generating speech for unseen speakers in terms of speaker similarity and quality.




1 Introduction

Recent successes of deep learning methods for speech synthesis enabled text-to-speech (TTS) systems to synthesize realistic and natural speech [skerry2018towards, ping2018clarinet, shen2018natural]. Beyond this capability, there have been growing demands for personalization, putting pressure on modern TTS systems to generate customized voices of high quality. Conventional multi-speaker TTS systems [Park2019MultiSpeakerES, gibiansky2017deep] require a substantial amount of data merely to model the speakers observed during training. Unfortunately, many personalized applications can only afford a handful of reference data (e.g., restoring communication ability to people who lost their voice). Such needs call for a speaker cloning system, so-called few-shot TTS, that can function with only a few reference audios.

To implement few-shot TTS, previous studies suggested a speaker adaptation process [kons2019high, chen2018sample, bollepalli2019lombard, deng2018modeling] that pre-trains models on a large dataset of many speakers and then continues training on a small dataset of target speakers. Such methods, however, require at least a few minutes of audio samples together with an additional fine-tuning process. Therefore, they are less attractive in in-the-wild scenarios where immediate voice cloning of arbitrary new speakers is unavoidable.

Some prior approaches predicted a speaker embedding from speech to clone unseen speakers without fine-tuning, using a speaker encoder jointly trained with the TTS model [nachmani2018fitting, arik2018neural] or a model individually trained for a speaker verification task [hu2019neural, cooper2019zero, jia2018transfer]. However, it is challenging to produce a single embedding that represents every utterance characteristic, including the speaker identity and speaking style. In fact, it was reported that a single embedding works poorly if the reference speech is shorter than the target speech [wang2018style].

To tackle this problem, previous studies proposed several specialized embeddings, each with a different set of responsibilities to represent diverse speech attributes (e.g., speaking style, prosody, and noise) [wang2018style, stanton2018predicting, hsu2018hierarchical, hsu2019disentangling, battenberg2019effective, chen2019cross] or switched over to a variable-length embedding to maintain temporal information [lee2019robust, sun2020fully, sun2020generating]. However, most of them focused not on cloning a target speaker but on controlling the disentangled attributes [wang2018style, stanton2018predicting, hsu2018hierarchical, hsu2019disentangling, battenberg2019effective, sun2020fully, sun2020generating], and some of them could not synthesize speech of an unseen speaker due to their architectural limitations [lee2019robust]. In addition, there have been few attempts to utilize multiple reference audios to enhance the speech quality [arik2018neural], even though several utterances of a target speaker are available during inference in a real-world scenario.

In this paper, we propose Attentron, a novel few-shot TTS architecture for unseen speakers. It consists of a fine-grained encoder, which includes an attention mechanism to extract detailed styles from multiple reference audios, and a coarse-grained encoder, which extracts overall information of speech and helps to stabilize the output. Our contributions are as follows:


  • We utilize two speech encoders, the fine-grained encoder and the coarse-grained encoder, to enable multi-speaker TTS to clone unseen target speakers with only a few reference audios. The encoders generate a variable-length embedding and a global embedding, respectively, and utilize them to condition the TTS system.

  • We propose an attention mechanism that finds only the relevant positions among the audio frames of the multiple references. It allows the model to take any number of the reference audios, and the quality improves with more reference audios.

  • We compare the proposed model with state-of-the-art methods for multi-speaker TTS that can clone unseen speakers. Our experimental results, including human evaluations, show that the proposed model outperforms the state-of-the-art methods. Footnote 1: Samples available at

2 Attentron Architecture

Figure 1: Overall architecture of Attentron.

Figure 1 illustrates the overall architecture of the proposed few-shot TTS for unseen speakers, named Attentron. It is based on Tacotron 2 [shen2018natural], which takes a text sequence as an input and generates a sequence of mel spectrogram frames as an output. What differentiates our model is the presence of two additional encoders, the coarse-grained encoder and the fine-grained encoder. In Attentron, the fine-grained encoder extracts detailed style from multiple (i.e., N in Figure 1) reference audios, and the coarse-grained encoder extracts overall information of speech, thereby enabling high-quality few-shot TTS for unseen targets. In our experiments, we use the WaveRNN [kalchbrenner2018efficient] vocoder.

To synthesize speech, the coarse-grained encoder first generates a global embedding, a single vector that aggregates the temporal dimension of the input audios. We then broadcast-concatenate it with the text encoding generated by the Tacotron encoder, which forms the input sequence for the Tacotron decoder; this sequence has the same length as the text encoding. Equipped with the proposed attention mechanism, the fine-grained encoder extracts a variable-length embedding that maintains the temporal dimension and feeds it to the Tacotron decoder while the decoder synthesizes the spectrogram frames autoregressively. Finally, the vocoder converts the spectrogram into audio.
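As a minimal sketch of the broadcast-concatenation step described above, the snippet below tiles one global vector along the text axis and appends it to every text position. The function name `broadcast_concat` and the 512-dim text encoding are assumptions made for illustration; only the 256-dim global embedding size is taken from Section 3.1.

```python
import numpy as np

def broadcast_concat(text_encoding, global_embedding):
    """Append the same global vector to every position of the text encoding."""
    num_positions = text_encoding.shape[0]
    # Repeat the single vector along the text axis without copying it.
    tiled = np.broadcast_to(global_embedding, (num_positions, global_embedding.shape[0]))
    return np.concatenate([text_encoding, tiled], axis=-1)

rng = np.random.default_rng(0)
text_encoding = rng.normal(size=(37, 512))   # 37 characters, 512-dim encoding (assumed size)
global_embedding = rng.normal(size=256)      # 256-dim global embedding (Section 3.1)

decoder_input = broadcast_concat(text_encoding, global_embedding)
print(decoder_input.shape)  # (37, 768)
```

The sequence length is unchanged, so the decoder attention over text positions works as in a plain Tacotron 2 setup while every position also carries the global embedding.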

2.1 Fine-grained Encoder and Attention Mechanism

Figure 2 illustrates the composition of the proposed fine-grained encoder and how the attention mechanism works. It aims to (1) make good use of multiple reference audios, (2) utilize a variable-length embedding, not just a single global embedding, to maintain detailed information, and (3) leverage features close to the raw reference audio for better generalization. We use scaled dot-product attention [vaswani2017attention], which allows the model to attend to the most relevant frames in the reference audio spectrograms.

The inputs to the fine-grained encoder are $N$ random reference audios spoken by the same speaker as the target audio. We first convert the reference audios to mel spectrograms, padded to match the maximum spectrogram length, $T_S$. The converted reference spectrograms, $S \in \mathbb{R}^{N \times T_S \times n_{mel}}$, are passed to two convolutional layers followed by two bidirectional LSTM layers to give the reference embeddings, $E$, where $n_{mel}$ is the number of mel bins. Given the hidden state of the decoder LSTM at the $t$-th decoding step, $h_t$, the attention query $Q_t$, key $K$, value $V$, and the $t$-th component of the variable-length embedding, $e_t$, are calculated as follows:

$$Q_t = h_t W^Q, \qquad K = f(E)\, W^K, \qquad V = f(S)\, W^V, \qquad e_t = \mathrm{softmax}\!\left(\frac{Q_t K^\top}{\sqrt{d_a}}\right) V,$$

where $f$ is a flattening function that merges the reference and time axes, $d_a$ is the dimension of the attention key and query, and all $W$ are linear projection matrices. Each component of the variable-length embedding, $e_t$, is concatenated with $h_t$ and fed into a fully-connected layer. The above process iterates autoregressively, once per decoding step, until the spectrogram generation is complete (Figure 2 depicts these iterations).
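The attention step described above can be sketched for a single decoding step as follows. This is an illustration under stated assumptions, not the authors' implementation: the function and variable names are hypothetical, the 1024-dim decoder state is an assumed size, padding frames are not masked, and the projection matrices are random rather than learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = np.exp(x - x.max())
    return shifted / shifted.sum()

def fine_grained_step(h_t, ref_emb, ref_spec, w_q, w_k, w_v):
    """One decoding step of scaled dot-product attention over all reference frames.

    h_t:      (d_h,)               decoder LSTM hidden state
    ref_emb:  (n_refs, T, d_e)     conv + BiLSTM embeddings of the references
    ref_spec: (n_refs, T, n_mel)   padded reference mel spectrograms
    """
    n_refs, n_frames, _ = ref_emb.shape
    # Flatten the reference and time axes so every frame of every audio is a candidate.
    flat_emb = ref_emb.reshape(n_refs * n_frames, -1)
    flat_spec = ref_spec.reshape(n_refs * n_frames, -1)
    query = h_t @ w_q              # (d_a,)
    keys = flat_emb @ w_k          # (n_refs * T, d_a)
    values = flat_spec @ w_v       # (n_refs * T, d_v), a single linear layer on the spectrograms
    weights = softmax(keys @ query / np.sqrt(query.shape[0]))
    return weights @ values, weights

rng = np.random.default_rng(0)
n_refs, n_frames = 3, 40
h_t = rng.normal(size=1024)                            # assumed decoder state size
ref_emb = rng.normal(size=(n_refs, n_frames, 512))     # 2 x 256 BiLSTM cells
ref_spec = rng.normal(size=(n_refs, n_frames, 80))     # 80 mel bins
w_q = rng.normal(size=(1024, 256)) * 0.01
w_k = rng.normal(size=(512, 256)) * 0.01
w_v = rng.normal(size=(80, 256)) * 0.01

e_t, weights = fine_grained_step(h_t, ref_emb, ref_spec, w_q, w_k, w_v)
print(e_t.shape, weights.shape)  # (256,) (120,)
```

Because the frames of all references are flattened into one axis, the same code accepts any number of reference audios; only the length of `weights` changes.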

Note that the reference spectrograms are passed through only one fully-connected layer to generate the attention values (marked ① in Figure 2). The intuition is that the more layers follow, the more prone the model becomes to overfitting to the speakers in the training data. Further analysis is given in Section 4.3.

Figure 2: Details of proposed attention mechanism.

2.2 Coarse-grained Encoder

Utterances by the same speaker have different characteristics, such as emotion and prosody, even with the same transcript. Thus, synthesizing speech only from the input text may be unstable, as it is a non-deterministic one-to-many problem by nature. To stabilize it, the coarse-grained encoder is designed to generate a global embedding, which includes overall information of the target speech. By giving the outline of the desired output, it narrows down the range of the output speech and brings the task close to a one-to-one problem. The previous approaches [wang2018style, stanton2018predicting, hsu2018hierarchical, hsu2019disentangling, battenberg2019effective, sun2020fully, sun2020generating] aimed to control the characteristics of the output speech through an encoder. Consequently, they were based on complex architectures or accompanied by additional training objectives. In contrast, our encoder focuses only on stabilizing the synthesis, and, accordingly, we use a relatively simple network that generates a global embedding without any additional loss function.

The coarse-grained encoder is based on the encoder architecture proposed in [hsu2018hierarchical], which has two convolutional layers followed by two bidirectional LSTM layers. It ends with an average pooling layer that generates a global vector. Note that we use the target audio as the input during training and reference audios spoken by the target speaker at inference time. During inference, the last layer averages the embeddings from multiple reference audios to obtain a single embedding.
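The pooling behaviour described above can be sketched as follows. The text does not specify whether frames from all references are pooled together or each audio is averaged first, so this sketch assumes the latter (a mean over time per audio, then a mean over audios); the function name is hypothetical.

```python
import numpy as np

def pool_global_embedding(frame_embeddings_per_audio):
    """Average over time within each audio, then over the reference audios.

    frame_embeddings_per_audio: list of arrays, each of shape (frames_i, dim).
    """
    per_audio = [frames.mean(axis=0) for frames in frame_embeddings_per_audio]
    return np.mean(per_audio, axis=0)

# Two toy "reference audios" of different lengths with constant frame embeddings.
short_audio = np.full((10, 4), 1.0)
long_audio = np.full((20, 4), 3.0)

global_embedding = pool_global_embedding([short_audio, long_audio])
print(global_embedding)  # [2. 2. 2. 2.]
```

With a single input audio the function reduces to plain average pooling over time, which matches the training-time setting where only the target audio is fed to the encoder.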

3 Experimental Method

3.1 Experimental Setup

Datasets. We trained every model with the warm-start method [cooper2019zero]. We used LJSpeech [ito2017lj] for the pre-training phase and VCTK [veaux2016superseded] for the multi-speaker training phase. Three VCTK speakers were entirely excluded due to their missing or inadequate data. To evaluate objective metrics on unseen speakers, we held out eight VCTK speakers (four men and four women, each of whom has about 400 utterances) during training. We also split the data of the remaining 98 VCTK speakers into training and validation sets for seen speaker evaluation; the validation set contained 10 utterances for most speakers.

Pre-processing. We used a character sequence as the input to the Tacotron encoder, where each character is represented as a 512-dim character embedding. We downsampled the audio to 16 kHz and trimmed leading and trailing silence using librosa [mcfee2015librosa]. We generated a spectrogram with a 2048-point Fourier transform using a Hann window with a 16 ms shift and a 64 ms length. Finally, we converted it to a mel spectrogram with 80 frequency bins spanning from 125 Hz to 7.6 kHz.
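The pre-processing settings above translate into sample counts as sketched below. The helper names are hypothetical, and the librosa call relies on library defaults (trim threshold, power spectrogram, no log compression) that the text does not specify, so treat those as assumptions rather than the authors' exact pipeline.

```python
def stft_params(sample_rate=16_000, shift_ms=16, length_ms=64):
    """Convert the window shift and length from milliseconds to samples."""
    return {
        "n_fft": 2048,
        "hop_length": sample_rate * shift_ms // 1000,
        "win_length": sample_rate * length_ms // 1000,
        "window": "hann",
    }

def mel_spectrogram(path):
    """Load, resample, trim, and convert one audio file (requires librosa)."""
    import librosa  # imported lazily so the arithmetic above runs without it

    audio, sample_rate = librosa.load(path, sr=16_000)   # resample to 16 kHz
    audio, _ = librosa.effects.trim(audio)               # default silence threshold (assumed)
    return librosa.feature.melspectrogram(
        y=audio, sr=sample_rate, n_mels=80, fmin=125, fmax=7600,
        **stft_params(sample_rate),
    )

params = stft_params()
print(params["hop_length"], params["win_length"])  # 256 1024
```

At 16 kHz, the 16 ms shift is 256 samples and the 64 ms window is 1024 samples, so each 1024-sample frame is zero-padded to the 2048-point transform.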

Implementation. In the coarse-grained and fine-grained encoders, the convolutional layers have 512 channels with kernels, and the bidirectional LSTM layers have 256 cells for each direction. The sizes of the global embedding and the variable-length embedding are both 256. The dimension of the attention key and query in the fine-grained encoder, $d_a$, is 256. We follow the details described in [shen2018natural] to implement the Tacotron 2 backbone.

Training. We used the Adam optimizer [kingma2014adam] with , , , and a weight decay value of . We minimized reconstruction loss, as described in [shen2018natural]. We pre-trained each model on LJSpeech data for 30k steps with the initial learning rate of decaying to at 20k steps. Then, we resumed training the models on VCTK data for 70k steps with the learning rate of decaying to at 50k steps. The batch size was 16 for the whole training process.

3.2 Evaluation Method

Models. We refer to our proposed model as Attentron and to the baselines, which can synthesize speech of speakers unseen during training, as LDE [cooper2019zero] and GMVAE [hsu2018hierarchical].


  • LDE(a) transfers well-trained speaker embedding space into multi-speaker TTS. We utilized the official implementation of the speaker verification model to extract embeddings and built a non-learnable speaker lookup table. We trained multi-speaker Tacotron with this table. The parenthesized parameter, a, is the number of audios utilized to extract a speaker embedding.

  • GMVAE(a) is a controllable TTS model based on the variational autoencoder (VAE) framework. Since an official implementation is absent, we made an honest attempt to reproduce the results. To stabilize the training process, we adopted KL annealing proposed in [zhang2019learning]. The parenthesized parameter, a, is the number of audios fed into the encoder during inference.

  • Attentron(a-b). The first parameter in the parenthesis, a, represents the number of input audios fed into the fine-grained encoder during training. The latter parameter, b, corresponds to the number of the audios fed into both the fine-grained encoder and the coarse-grained encoder during inference.

Metrics. We utilize the following metrics to evaluate models:


  • MCD-DTW compares compatibility between the spectra of two audio sequences. Since sequences are not aligned, dynamic time warping is performed prior to comparison [kubichek1993mel].

  • Speaker similarity evaluates how much the synthesized speech resembles the target speaker. We extracted x-vectors from the synthesized speech as well as the actual speech of the target speaker and measured cosine similarity between them using the deep learning package named Resemblyzer [wan2018generalized]. The value ranges from 0 to 1.

  • Attention collapse count adds up significant intelligibility errors [he2019robust]. Without well-formed decoder attention, the output sounds incomprehensible and never ends on time. We report an attention collapse when the number of the generated spectrogram frames exceeds the pre-defined threshold ( in our experiment).
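To make the MCD-DTW metric from the list above concrete, here is a small sketch that aligns two feature sequences with dynamic time warping and averages the mel-cepstral distortion along the alignment path. The normalization by path length and the use of all coefficients are assumptions; implementations differ on these conventions, and the feature extraction itself is out of scope here.

```python
import numpy as np

def mcd_dtw(ref, syn):
    """Mean mel-cepstral distortion (dB) along the DTW path.

    ref, syn: arrays of shape (frames, coefficients); lengths may differ.
    """
    scale = 10.0 / np.log(10.0) * np.sqrt(2.0)
    diff = ref[:, None, :] - syn[None, :, :]
    dist = scale * np.sqrt((diff ** 2).sum(axis=-1))   # pairwise frame distances

    n, m = dist.shape
    cost = np.full((n + 1, m + 1), np.inf)   # accumulated distance
    steps = np.zeros((n + 1, m + 1), dtype=int)   # path length so far
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Best predecessor among diagonal, vertical, and horizontal moves.
            prev_cost, prev_steps = min(
                (cost[i - 1, j - 1], steps[i - 1, j - 1]),
                (cost[i - 1, j], steps[i - 1, j]),
                (cost[i, j - 1], steps[i, j - 1]),
            )
            cost[i, j] = dist[i - 1, j - 1] + prev_cost
            steps[i, j] = prev_steps + 1
    return cost[n, m] / steps[n, m]

rng = np.random.default_rng(0)
reference = rng.normal(size=(5, 13))
stretched = np.repeat(reference, 2, axis=0)   # same content, twice as slow

print(mcd_dtw(reference, stretched))  # 0.0
```

The stretched copy scores zero because the warping path absorbs the difference in speaking rate, which is the reason for applying DTW before comparing unaligned sequences.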

User study. We performed user tests to evaluate human preference with respect to naturalness and similarity. We used the mean opinion score (MOS) to rate user preference on a scale of 1-5, 5 being the best. To evaluate similarity, we utilized the subjective MOS evaluation method proposed by Jia et al. [jia2018transfer], which pairs each synthesized speech with the reference speech. We used Amazon's Mechanical Turk to collect subjective evaluations from crowd-sourced native speakers. We selected eight seen speakers and eight unseen speakers (four men and four women in each group) from VCTK data, and randomly selected 10 fixed examples from the evaluation data. The samples were presented at a fixed sampling rate of 16 kHz, and at least 20 evaluators participated in each experiment.
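As an illustration of how a MOS and its 95% confidence interval can be computed from raw ratings, the sketch below uses a normal approximation (mean ± 1.96 standard errors). The text does not state which interval estimator was used, so this choice and the toy ratings are assumptions.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return the mean opinion score and the half-width of its 95% interval."""
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Toy ratings on the 1-5 scale, not data from the study.
ratings = [3, 5] * 50
mos, half_width = mos_with_ci(ratings)
print(f"{mos:.2f} ± {half_width:.2f}")
```

The half-width shrinks with the square root of the number of ratings, so collecting more evaluations per condition tightens the interval.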

4 Experimental Result

4.1 Evaluation Result

Model           | Seen speaker                                   | Unseen speaker
                | MCD ↓  Sim ↑  Fail ↓  Nat-MOS ↑    Sim-MOS ↑   | MCD ↓  Sim ↑  Fail ↓  Nat-MOS ↑    Sim-MOS ↑
Groundtruth     | -      -      -       4.13 ± 0.04  4.81 ± 0.02 | -      -      -       4.15 ± 0.04  4.83 ± 0.02
LDE(1)          | 12.85  0.731  43      3.56 ± 0.06  3.11 ± 0.07 | 14.38  0.677  82      3.75 ± 0.05  2.88 ± 0.07
GMVAE(1)        | 12.51  0.774  23      3.61 ± 0.05  3.22 ± 0.06 | 13.94  0.686  59      3.76 ± 0.05  3.17 ± 0.06
Attentron(1-1)  | 12.25  0.784  9       3.63 ± 0.05  3.33 ± 0.06 | 13.20  0.731  25      3.86 ± 0.05  3.30 ± 0.06
LDE(8)          | 11.49  0.796  24      3.73 ± 0.05  3.48 ± 0.06 | 13.50  0.709  39      3.91 ± 0.05  3.17 ± 0.06
GMVAE(8)        | 11.34  0.798  0       3.72 ± 0.05  3.40 ± 0.05 | 13.11  0.698  1       3.88 ± 0.04  3.27 ± 0.06
Attentron(8-8)  | 10.99  0.812  0       3.76 ± 0.05  3.60 ± 0.05 | 11.67  0.788  0       3.97 ± 0.04  3.57 ± 0.05

Table 1: Evaluation result. MCD-DTW (MCD), speaker similarity (Sim), and attention collapse count (Fail) are the objective metrics acquired from 960 and 3222 utterances of seen speakers and unseen speakers, respectively. The subjective metrics, naturalness MOS (Nat-MOS) and similarity MOS (Sim-MOS), are presented with 95% confidence intervals. Upward/downward pointing arrows correspond to metrics that are better when the values are higher/lower. Bold values correspond to the best values of each metric.

Table 1 shows the objective evaluation results: MCD-DTW, speaker similarity, and attention collapse count. While there is no remarkable difference for seen speakers, the proposed models significantly outperform the baseline models when synthesizing speech for unseen speakers. Since [hsu2018hierarchical] does not suggest how to make use of multiple reference audios in the GMVAE model, we apply $N$-shot inference to it by averaging out the embeddings, for a fair comparison.

The large improvement in the unseen speaker similarity scores, 7.9% and 6.6% when comparing Attentron(1-1) to LDE(1) and GMVAE(1), respectively, indicates that the proposed model clones an unseen target speaker's voice much more closely than the others. In addition, the lower attention collapse counts of Attentron(1-1), i.e., 9 for seen speakers and 25 for unseen speakers, show that our model outperforms the baselines in speech quality. The baselines occasionally fail to form a proper attention alignment, resulting in unintelligible output speech, which is probably due to the absence of a coarse-grained encoder in LDE(1) and the training difficulty of the VAE in GMVAE(1).

Furthermore, the strength of the proposed model to clone unseen speakers is more evident while utilizing eight reference audios as an input. Attentron(8-8) significantly improves MCD-DTW and speaker similarity of unseen speakers, 11.6% and 7.9%, respectively, compared to Attentron(1-1), whereas LDE(8) and GMVAE(8) improve slightly compared to their 1-shot versions. It suggests that our model benefits from using multiple reference inputs more effectively than the baselines, and Attentron(8-8) learns where to attend among many reference audios during training and exploits them during inference.
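The relative improvements quoted in this section can be recomputed from the unseen-speaker columns of Table 1, as sketched below. The helper name is hypothetical; the recomputed values land within about 0.1 percentage points of the quoted figures, and the small gaps are presumably due to rounding in the tabulated scores.

```python
def relative_change(new, old):
    """Relative change in percent, measured against the old value."""
    return 100.0 * (new - old) / old

# Unseen-speaker values copied from Table 1.
sim = {"LDE(1)": 0.677, "GMVAE(1)": 0.686, "Attentron(1-1)": 0.731, "Attentron(8-8)": 0.788}
mcd = {"Attentron(1-1)": 13.20, "Attentron(8-8)": 11.67}

sim_vs_lde = relative_change(sim["Attentron(1-1)"], sim["LDE(1)"])
sim_vs_gmvae = relative_change(sim["Attentron(1-1)"], sim["GMVAE(1)"])
mcd_8_vs_1 = relative_change(mcd["Attentron(8-8)"], mcd["Attentron(1-1)"])
sim_8_vs_1 = relative_change(sim["Attentron(8-8)"], sim["Attentron(1-1)"])

for name, value in [
    ("Sim, Attentron(1-1) vs LDE(1)", sim_vs_lde),
    ("Sim, Attentron(1-1) vs GMVAE(1)", sim_vs_gmvae),
    ("MCD, Attentron(8-8) vs Attentron(1-1)", mcd_8_vs_1),
    ("Sim, Attentron(8-8) vs Attentron(1-1)", sim_8_vs_1),
]:
    print(f"{name}: {value:+.2f}%")
```

A negative change for MCD is an improvement, since lower distortion is better, whereas similarity improves with a positive change.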

4.2 User Study

The MOS of naturalness and speaker similarity are listed in Table 1. The proposed model slightly improves the naturalness compared to the baselines when the same number of reference audios is used. The subjective metrics are consistent with the objective metrics, showing that the proposed method achieves a substantial improvement in unseen speaker similarity without sacrificing the naturalness of speech. We find that the naturalness MOS on unseen speakers is higher than on seen speakers, by approximately 0.2 points for every model. We conjecture that this is because the naturalness scored by users varies with the randomly sampled target speakers in the dataset, who have different recording environments and speaking styles.

4.3 Ablation Study

Model            | Seen speaker   | Unseen speaker
                 | MCD ↓  Sim ↑   | MCD ↓  Sim ↑
Attentron(8-1)   | 12.94  0.769   | 13.26  0.753
Attentron(1-8)   | 11.04  0.809   | 12.19  0.749
w/o CE           | 12.64  0.748   | 13.47  0.738
w/o FE           | 10.98  0.819   | 12.20  0.757
Average pooling  | 10.71  0.817   | 11.88  0.751
Self-attention   | 10.62  0.821   | 11.86  0.755
Encoded value    | 10.76  0.819   | 11.96  0.754

Table 2: Ablation study verifying the impact of utilizing multiple reference inputs, the coarse-grained encoder, the fine-grained encoder, and leveraging features near the raw reference audio.

Table 2 shows the result of the ablation study. We evaluate key components of the proposed model by tweaking Attentron(8-8). First, we further analyze the impact of multiple reference inputs. Either utilizing multiple inputs during training (denoted as Attentron(8-1)) or inference (denoted as Attentron(1-8)) improves unseen speaker similarity compared to utilizing single reference input. In addition, the result of Attentron(8-8) is better than that of Attentron(1-8) or Attentron(8-1). It suggests that the proposed model maximizes benefits from using multiple reference inputs by utilizing them during both training and inference.

Next, we address the impact of the coarse-grained encoder. Without the coarse-grained encoder (denoted as w/o CE), the model loses the stability needed to generate intelligible speech, with 35 and 143 attention collapse counts for seen speakers and unseen speakers, respectively. Consequently, it shows worse MCD-DTW and speaker similarity than Attentron(8-8). Note that the attention collapse column is omitted from the table for clarity.

Next, we examine the fine-grained encoder in more detail. To verify the effectiveness of the variable-length embedding, we remove the fine-grained encoder (denoted as w/o FE) or use a single embedding. The single embedding is obtained by replacing the attention module with an average pooling (denoted as Average pooling) or a self-attention (denoted as Self-attention) [arik2018neural]. These three cases make it hard to extract sufficient information from the reference audios to clone unseen speakers, and thus the unseen speaker metrics are inferior.

We also test encoding the attention value by replacing the single fully-connected layer with two convolutional layers followed by two bidirectional LSTM layers (denoted as Encoded value). The seen speaker similarity of Encoded value and Attentron(8-8) are comparable. The unseen speaker similarity of Encoded value, however, decreases significantly compared to Attentron(8-8), from 0.788 to 0.754. We consider that manipulating the reference audio makes the model vulnerable to overfitting to seen speakers, which causes a large decline in the unseen speaker similarity. It may induce the model to memorize the voice during training rather than imitate it on-the-fly.

5 Related Works

Previous studies led to decent results for few-shot TTS, which, however, require an additional fine-tuning process [kons2019high, chen2018sample, bollepalli2019lombard, deng2018modeling, arik2018neural]. To avoid it, some approaches [Park2019MultiSpeakerES, nachmani2018fitting, arik2018neural, hu2019neural, cooper2019zero, jia2018transfer, chen2019cross] jointly or individually trained a speaker encoder generating a global embedding and utilized it to condition the TTS model. However, conditioning the TTS model using only the global embedding is not sufficient to clone unseen speakers.

Similar to our work, some studies made use of a variable-length embedding that maintains temporal information [lee2019robust, sun2020generating]. Lee and Kim [lee2019robust] introduced an attention module to obtain a variable-length embedding. However, their work mainly focused on fine-grained control of prosody, and it cannot clone unseen speakers since it utilizes a speaker lookup table that supports only seen speakers. Sun et al. [sun2020generating] introduced a fine-grained VAE model. It also aimed to provide finer-level interpretations of prosody control and suggested an autoregressive prior, differing from our interest in cloning unseen speakers with a few samples.

6 Conclusion

We proposed a novel multi-speaker TTS architecture for cloning unseen speakers with a few samples. It exploits two types of embeddings, the variable-length embedding and the global embedding, generated by the fine-grained and coarse-grained encoders, respectively. In addition, the fine-grained encoder extracts proper characteristics from relevant positions in the reference audios and enables us to utilize an arbitrary number of audios. By these means, it achieved high-quality synthesized speech for unseen speakers. Our experimental results, including human evaluation, showed the advantage of the proposed architecture in terms of speaker similarity and speech quality.

It would be interesting to explore methods to control the prosody of the proposed few-shot TTS system, and we leave it for future work. Another avenue of future work is cloning unseen speakers with a few samples in a voice conversion task.