A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

by Hieu-Thi Luong, et al.

By representing speaker characteristics as a single fixed-length vector extracted solely from speech, we can train a neural multi-speaker speech synthesis model by conditioning the model on those vectors. This model can also be adapted to unseen speakers regardless of whether the transcript of the adaptation data is available or not. However, this setup restricts the speaker component to just a single bias vector, which in turn limits the performance of the adaptation process. In this study, we propose a novel speech synthesis model, which can be adapted to unseen speakers by fine-tuning part of or all of the network using either transcribed or untranscribed speech. Our methodology essentially consists of two steps: first, we split the conventional acoustic model into a speaker-independent (SI) linguistic encoder and a speaker-adaptive (SA) acoustic decoder; second, we train an auxiliary acoustic encoder that can be used as a substitute for the linguistic encoder whenever linguistic features are unobtainable. The results of objective and subjective evaluations show that adaptation using either transcribed or untranscribed speech with our methodology achieved a reasonable level of performance with an extremely limited amount of data and greatly improved performance with more data. Surprisingly, adaptation with untranscribed speech surpassed the transcribed counterpart in the subjective test, which reveals the limitations of the conventional acoustic model and hints at potential directions for improvements.




I Introduction

Thanks to recent advances in sample-by-sample waveform generation methods [1, 2] and end-to-end models [3, 4], text-to-speech (TTS) has achieved outstanding performance, with the generated speech being indistinguishable from a recording under certain conditions [5]. Due to this development, many speech-synthesis researchers have moved on to more challenging tasks, speaker adaptation being one of them [6, 7]. Speaker adaptation for speech synthesis is the task of creating a new voice for a TTS system by adjusting the parameters of an initial model. It is not a new topic but a well-researched one, especially for HMM-based acoustic models of speech synthesis [8] and speech recognition [9]. Maximum likelihood linear regression (MLLR) [10, 11] and constrained MLLR [12] are popular adaptation techniques for HMM-based systems that apply some form of linear transformation to the Gaussian distributions of the initial model. As pointed out in earlier work, many factors besides the type of speaker transformation affect the performance of the adapted models, such as the state of the initial model and the estimation criteria.

For neural speech synthesis, training a speaker-adaptive model by conditioning on a low-dimensional speaker vector is a popular method in both multi-speaker modelling [13] and speaker adaptation [14]. The Deep Voice 3 model [15] adds a speaker embedding to multiple parts of the network in order to train a multi-speaker TTS model for thousands of speakers. Arik et al. [16] used a jointly trained speaker encoder network to extract a speaker embedding of unseen speakers, while Jia et al. [17] used a separate speaker verification network. The Voiceloop model [6] jointly trains a speaker embedding with the acoustic model and can adapt to unseen speakers by using both the speech and transcriptions of the target speakers. Nachmani et al. [18] replaced the jointly trained speaker embedding of Voiceloop with a speaker embedding obtained solely from acoustic features so that the model could adapt using untranscribed speech. There are many reasons for performing speaker adaptation instead of conventional training, for instance reducing the speaker footprint [19] and quickly adapting to new speakers [16]. But the most important reason is its potential to handle unrefined adaptation data, whether the data is insufficient in quantity [7] or unreliable in quality, such as noisy speech [20], an incorrect transcript, or no transcript at all [18].

Here, we propose a multimodal speech synthesis architecture that can adapt to unseen speakers by using either transcribed or untranscribed speech (see footnote 1). In either case, backpropagation is used to fine-tune part or all of the network. Simultaneously, we investigate multiple strategies for modeling speaker transformations. The rest of the paper is organized as follows: Section II systematically reviews the related work on speaker adaptation for neural acoustic models. Section III describes our factorized strategies for modeling the speaker transformation, while Section IV explains the methodology of training and using the multimodal architecture to perform adaptation with transcribed or untranscribed speech. Section V gives details about the experiments. Section VI shows the results of objective and subjective evaluations, and Section VII concludes with our findings.

Footnote 1: We presented a proof-of-concept for using a multimodal architecture to perform speaker adaptation with untranscribed speech in [21]. A preliminary study on utilizing scaling and bias codes for adaptation using transcribed speech was published in [22]. The current paper combines and extends the methodologies of these two papers into a comprehensive study on speaker adaptation for speech synthesis.

II Related work on adapting neural acoustic models to unseen speakers

Speaker adaptation involves tuning the initial acoustic model using the data of unseen speakers. In speech recognition, speaker adaptation makes the model perform better on the data of unseen speakers, while in speech synthesis it allows a model to synthesize the voices of new speakers. A deep neural network (DNN) is a multilayer perceptron with many non-linear hidden layers stacked on top of each other [23]. For speech synthesis, a typical neural acoustic model is trained to map a text representation (e.g., linguistic features) to a speech representation (e.g., acoustic features); this mapping is reversed for speech recognition. A simple feedforward hidden layer can be defined as follows:

$$h_l = f(W_l h_{l-1} + c_l) \tag{1}$$

where $h_l$ is the output of the $l$-th hidden layer. Assuming all hidden layers have the same number $n$ of hidden units, the parameters of the $l$-th hidden layer are a weight matrix $W_l \in \mathbb{R}^{n \times n}$ and a bias vector $c_l \in \mathbb{R}^{n}$. $f(\cdot)$ is an element-wise activation function, with non-linear functions being the most common type. The speaker-dependent layer, or speaker layer, is one whose parameters have been trained on the data of one specific speaker:

$$h_l = f(W_l^{(k)} h_{l-1} + c_l^{(k)}) \tag{2}$$

where $h_l$ represents the output of the speaker layer, with parameters $W_l^{(k)}$ and $c_l^{(k)}$ depending on the $k$-th speaker. The conventional single-speaker speech synthesis model is essentially a neural network with all of its layers trained on the data of a single speaker.

Training or fine-tuning the entire neural network is a simple and straightforward approach to obtaining a speech synthesis model for a target speaker. However, it is vulnerable to overfitting when the target speaker has a limited amount of data, as there are too many parameters to adjust. Many adaptation techniques have been proposed to overcome this problem. Below, we systematically review them according to three factors: 1) the components used to model the speaker characteristics; 2) the speaker awareness (or unawareness) of the initial model; and 3) the ability to perform adaptation using untranscribed speech.

II-A Speaker component

The speaker component is the most crucial aspect of a speaker adaptation methodology, as it directly affects the speaker footprint and the performance of the adapted model. Here, we could adapt either the entire neural network [24] or all but the output layer [25] of a pre-trained model. However, as mentioned above, this approach is vulnerable to overfitting, so techniques like regularization [26] or early stopping [7] are often introduced in the adaptation stage. Instead of using regularization, the number of adaptable parameters could be reduced as a way to prevent overfitting. The speaker component can be reduced to just one [27] or a few layers [28]. These layers can be further factorized [29] to discourage the adapted model of the target speaker from straying too far from the initial state. Below, we categorize these factorized methods on the basis of the type of transformation they model within a single token layer of the neural network:

II-A1 Speaker layer

We can use entire layers, with both weights and biases, as the speaker components. Equation 2 describes a simple speaker layer approach. Usually, these speaker layers are strategically placed at the input [30], at the output [31], or in between the hidden layers [28], depending on the task at hand. The weights and biases can be factorized in various ways to further reduce the speaker footprint [29, 28].

II-A2 Speaker weight

Many approaches use only the layer weights as the speaker components. For example, the singular value decomposition (SVD) bottleneck [19] factorizes the full matrix into a product of several low-rank matrices to reduce the speaker footprint:

$$W_l^{(k)} \approx U_l D_l^{(k)} V_l^{\top} \tag{3}$$

where $U_l \in \mathbb{R}^{n \times r}$, $D_l^{(k)} \in \mathbb{R}^{r \times r}$ and $V_l \in \mathbb{R}^{n \times r}$. By setting $r \ll n$, the speaker-specific parameters $D_l^{(k)}$ become far less numerous than those of the square matrix $W_l^{(k)}$. Similarly, in the cluster adaptive training (CAT) method proposed by Tan et al. [32], the speaker weight is estimated by interpolating between several canonical matrices, and hence the interpolation coefficients $\lambda_{l,i}^{(k)}$ are the speaker-specific parameters:

$$W_l^{(k)} = \sum_{i=1}^{m} \lambda_{l,i}^{(k)} W_{l,i} \tag{4}$$

where $W_l^{(k)}$ depends on the canonical set $\{W_{l,1}, \dots, W_{l,m}\}$. Factorized hidden layer (FHL) [33] exercises a similar concept, modeling the speaker weight as a subspace over a finite set of canonical matrices.
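To make the footprint argument concrete, the following is a minimal sketch rather than the implementation of any cited method. It assumes a layer with n units, a speaker-specific r x r bottleneck matrix, and a small set of canonical matrices; the function names `svd_bottleneck_footprint` and `cat_weight` are hypothetical.

```python
def svd_bottleneck_footprint(n, r):
    """Per-speaker parameter counts: full n x n weight vs. an r x r bottleneck.

    Assumes the outer low-rank factors are shared across speakers.
    """
    return n * n, r * r


def cat_weight(canonical, coeffs):
    """Interpolate canonical matrices with speaker-specific coefficients."""
    rows, cols = len(canonical[0]), len(canonical[0][0])
    return [
        [sum(lam * mat[i][j] for lam, mat in zip(coeffs, canonical))
         for j in range(cols)]
        for i in range(rows)
    ]


# Illustrative sizes only: a 256-unit layer with a 32 x 32 bottleneck.
full, reduced = svd_bottleneck_footprint(256, 32)
print(full, reduced)

# Two toy canonical matrices; only the two coefficients are speaker-specific.
canonical_set = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
print(cat_weight(canonical_set, [0.75, 0.25]))
```

In both cases the per-speaker footprint is the small object (the bottleneck matrix or the coefficient vector), while the large matrices are shared by all speakers.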

II-A3 Speaker scaling

Scaling is the most common type of linear transformation used on its own to model the speaker transformation. For example, learning hidden unit contribution (LHUC) [34] uses a speaker-dependent vector $a_l^{(k)}$ to adjust the output of the hidden layers:

$$h_l = a_l^{(k)} \odot f(W_l h_{l-1} + c_l) \tag{5}$$

From the perspective of the next hidden layer, $a_l^{(k)}$ is basically a diagonal scaling matrix:

$$W_{l+1} h_l = W_{l+1}\,\mathrm{diag}(a_l^{(k)})\,f(W_l h_{l-1} + c_l) \tag{6}$$

where $\mathrm{diag}(\cdot)$ is the operation of changing a vector into a diagonal matrix. Restricting the speaker transformation to just scaling reduces the speaker footprint and prevents the adapted model from deviating too far from the initial state. Just like the speaker weight, speaker scaling can be factorized further using the subspace approach. Samarakoon et al. [35] proposed subspace LHUC, in which $a_l^{(k)}$ is projected from a vector $s^{(k)}$ of arbitrary size by using an SI matrix $W_l^{a}$:

$$a_l^{(k)} = W_l^{a} s^{(k)} \tag{7}$$

Other variations of subspace speaker scaling are investigated in [22] and [36].
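The remark that a speaker scaling vector acts as a diagonal matrix from the perspective of the next layer can be checked numerically. The sketch below uses toy values and hypothetical helper names (`matvec`, `scale_then_project`, `project_with_diag`); it is an illustration of the algebra, not code from the LHUC papers.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def scale_then_project(W_next, a, h):
    """Scale the hidden output element-wise, then apply the next layer's weight."""
    return matvec(W_next, [ai * hi for ai, hi in zip(a, h)])


def project_with_diag(W_next, a, h):
    """Fold the scaling into the weight instead: (W_next diag(a)) h."""
    W_scaled = [[w * ai for w, ai in zip(row, a)] for row in W_next]
    return matvec(W_scaled, h)


W_next = [[1.0, 2.0], [3.0, 4.0]]   # toy weight of the next layer
a = [0.5, 2.0]                      # toy speaker scaling vector
h = [4.0, 1.0]                      # toy unscaled hidden output

print(scale_then_project(W_next, a, h))
print(project_with_diag(W_next, a, h))
```

Both calls give the same vector, which is why element-wise scaling of a layer output can be read as a diagonal speaker matrix sitting next to the following weight.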

II-A4 Speaker bias

The layer bias has also proven to be an effective speaker-specific parameter when used on its own [13, 37]. In practice, to model a speaker bias, we augment the input or hidden layer(s) with a one-hot vector representing the speaker [38]:

$$h_l = f(W_l h_{l-1} + c_l + b_l^{(k)}) \tag{8}$$

where $b_l^{(k)}$ is a speaker-specific bias projected from the speaker one-hot vector. We could factorize the speaker bias further by using a continuous vector $s^{(k)}$ to represent the speaker instead of the discrete one-hot vector. Abdel-Hamid et al. [39] jointly train the speaker embedding with the acoustic model, whereas Saon et al. [40] use an i-vector obtained from an external system to represent the speaker:

$$h_l = f(W_l h_{l-1} + c_l + W_l^{b} s^{(k)}) \tag{9}$$

where $W_l^{b}$ is a subspace matrix that projects an arbitrary-sized vector to the speaker bias.
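The relation between the one-hot and the continuous speaker representation can be sketched as follows. The projection matrix, its values, and the names `one_hot` and `project` are illustrative assumptions, not taken from the cited systems.

```python
def one_hot(index, size):
    """One-hot vector identifying one of `size` training speakers."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(B, s):
    """Project a speaker vector s to a layer-sized speaker bias."""
    return [sum(b * x for b, x in zip(row, s)) for row in B]


# Toy projection for a 2-unit layer and 3 training speakers.
B = [[1.0, 2.0, 4.0],
     [0.5, 0.25, 8.0]]

# A one-hot input simply selects one column: a separate bias per speaker.
print(project(B, one_hot(1, 3)))

# A continuous vector mixes columns, so unseen speakers can be placed
# anywhere in the space spanned by the projection.
print(project(B, [0.5, 0.5, 0.0]))
```

With a one-hot input the bias of a new speaker has no defined column, whereas a continuous speaker vector only requires estimating the vector itself.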

II-A5 Combinations

By categorizing the adaptation methods based on the type of transformation within a single layer, we obtain a comprehensive overview of the nature of the speaker components. However, the hierarchical nature of the neural network adds another aspect to speaker modeling, as these speaker components can be used together at one or multiple layers. For example, as the speaker scaling and bias complement each other, certain representations of them are utilized for speaker modeling in [22] and [36]. Similarly, the subspace variations of speaker weight and bias are investigated in [33]. In [14] and [41], the effects of combining several types of speaker components are investigated. It is difficult to provide an absolute answer about what the best setup for adaptation is, as the performance depends heavily on the network architecture, the quality of the speech data, the ratio between the speaker footprint and the amount of adaptation data, and the training conditions of the initial model.

II-B Speaker-awareness of the initial model

Besides the speaker component, the state of the initial model also affects the performance of the adapted model [8]. More specifically, we can classify the initial model as either speaker-aware or speaker-unaware. A speaker-unaware model is trained without information about the speaker. This sort of model includes the conventional SI model of speech recognition and the single-speaker [42, 24] or average-voice model [43] of speech synthesis. A speaker-aware model is trained with the speaker components integrated into the initial model. For speech recognition, this is generally known as speaker-adaptive training (SAT) [40]. For speech synthesis, it is the multi-speaker model [13].

Most speaker components reviewed in Section II-A can be used for both speaker-aware and speaker-unaware setups. For example, Fan et al. [27] train a multi-speaker model with a speaker output layer, capable of adapting to an unseen speaker by fine-tuning a new layer for the target. Meanwhile, Huang et al. [28] add new layers for unseen speakers on top of a pretrained speaker-unaware single-speaker model. Similarly, in the LHUC method, Swietojanski et al. [34] proposed adding speaker parameters on top of the SI model for adaptation. By contrast, in a more recent publication, they introduced SAT-LHUC [44], which adds LHUC parameters right from the training stage. The speaker awareness or unawareness of the initial model does not change the structure of the adapted model, but it changes the representation learned by the hidden layers. Training a speaker-aware model encourages the model to disentangle speaker characteristics (style) from the linguistic information (content), which in turn would help the adaptation [44].

II-C Adaptation using untranscribed speech

As a neural network is trained with the backpropagation algorithm [45], we can use backpropagation to fine-tune part of or all of the model in the adaptation stage as well if both input features and output features of the target domain are available. Therefore, it is straightforward to adapt the acoustic model to an unseen speaker when the adaptation data is transcribed speech, whereas it becomes trickier when we have untranscribed speech. In automatic speech recognition (ASR), adaptation using speech and text is referred to as supervised adaptation while using only speech is referred to as unsupervised adaptation [26, 46]; we will adopt this terminology for speech synthesis in the rest of this paper. The common unsupervised adaptation approach for ASR is the two-pass adaptation: an SI ASR model is used to obtain the text label; then speech and the prediction label are used to perform unsupervised adaptation with backpropagation just like in the supervised counterpart [26, 25]. As both supervised and unsupervised adaptation are based on backpropagation, we could use any type of speaker component reviewed in Section II-A.

In the case of speech synthesis, a common method of unsupervised speaker adaptation is to assume that the characteristics of the $k$-th speaker can be represented by a single fixed-length vector $s^{(k)}$ extracted solely from speech. The speaker-adaptive model is then trained by using $s^{(k)}$ as a bias code:

$$h_l = f(W_l h_{l-1} + c_l + W_l^{b} s^{(k)}) \tag{10}$$

To perform adaptation for unseen speakers, we simply extract $s^{(k)}$ from the speech of the target by using an external system. This approach is sometimes referred to as one-shot learning [47], as it does not involve an optimization loop with backpropagation. For example, Wu et al. used i-vectors as the bias code [14], while Doddipatla et al. [48] and Jia et al. used d-vectors [17]. Tjandra et al. [47] jointly trained a DeepSpeaker network with the acoustic model, which is used to extract the speaker vector. By restricting the speaker component to a single bias code, we also restrict the performance of the adaptation. This leads to a gap in performance between the seen and unseen speakers [37, 17]. To overcome this limitation, we proposed the multimodal speech synthesis architecture in our previous study [21], which allows both supervised and unsupervised speaker adaptation to be conducted using the backpropagation algorithm. In this paper, we improve upon this methodology.

III Speaker components for modelling speaker-adaptive speech synthesis systems

III-A Speaker scaling and bias for speaker-adaptive modeling and adapting to unseen speakers

We conducted a preliminary study on using scaling and bias codes in [22]. The results showed that having both components does improve the performance of speaker adaptation. However, it does not seem to yield further improvement when more data become available [22]. To address this limitation, we can increase the number of adaptable parameters by allowing each layer to have its own speaker scaling and bias:

$$h_l = f(\mathrm{diag}(a_l^{(k)})\,W_l h_{l-1} + c_l + b_l^{(k)}) \tag{11}$$

where $a_l^{(k)}$ and $b_l^{(k)}$ are the speaker-specific scaling and bias at the $l$-th layer. Compared with LHUC [34] described by Equation 6, our scaling operation is placed on the other side of the layer weight. There is no significance in this choice other than that we want both $a_l^{(k)}$ and $b_l^{(k)}$ to inherit the number of units of the $l$-th host layer. Moreover, to prevent overfitting while still having speaker components in multiple layers, we can apply the subspace approach to factorize the speaker scaling and bias into scaling and bias codes:

$$a_l^{(k)} = W_l^{a} s_l^{a,(k)} \tag{12}$$

$$b_l^{(k)} = W_l^{b} s_l^{b,(k)} \tag{13}$$

where $s_l^{a,(k)}$ and $s_l^{b,(k)}$ are the scaling and bias codes of the $l$-th layer and have arbitrary sizes $m_a$ and $m_b$. These codes are projected into a speaker scaling $a_l^{(k)}$ and bias $b_l^{(k)}$ by using the SI matrices $W_l^{a}$ and $W_l^{b}$, trained with the data of multiple speakers in the training stage. This reduces the number of parameters to be fine-tuned in the adaptation stage.
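A minimal sketch of one feedforward layer with scaling and bias codes is given below. It assumes the scaling is applied to the output of the layer weight before the activation, as described above; the function `sa_layer`, its argument names, and the toy values are hypothetical, and `tanh` is only a placeholder activation.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sa_layer(W, c, W_a, W_b, s_a, s_b, h_prev, f=math.tanh):
    """Speaker-adaptive layer driven by small scaling and bias codes.

    W, c     : speaker-independent weight and bias of the host layer
    W_a, W_b : speaker-independent projections for the two codes
    s_a, s_b : speaker-specific scaling and bias codes (the only
               parameters that would be tuned for a new speaker)
    """
    a = matvec(W_a, s_a)          # speaker scaling, one value per unit
    b = matvec(W_b, s_b)          # speaker bias, one value per unit
    pre = matvec(W, h_prev)       # output of the layer weight
    return [f(ai * pi + ci + bi) for ai, pi, ci, bi in zip(a, pre, c, b)]


# Toy 2-unit layer with one-dimensional codes and an identity activation.
out = sa_layer(
    W=[[1.0, 0.0], [0.0, 1.0]], c=[0.0, 0.0],
    W_a=[[2.0], [1.0]], W_b=[[1.0], [0.0]],
    s_a=[1.0], s_b=[0.5],
    h_prev=[3.0, 4.0], f=lambda v: v,
)
print(out)
```

In this sketch the per-speaker footprint of a layer is the combined length of the two codes, independent of the number of units in the host layer.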

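Next, we extend the definitions of the speaker scaling and bias to gated convolution layers [49], which we use in our experiments to capture the temporal context in the time domain and help the information flow [50]:

$$h_l = \tanh\big(\mathrm{diag}(a_l^{f,(k)})\,(W_l^{f} * h_{l-1}) + c_l^{f} + b_l^{f,(k)}\big) \odot \sigma\big(\mathrm{diag}(a_l^{g,(k)})\,(W_l^{g} * h_{l-1}) + c_l^{g} + b_l^{g,(k)}\big) \tag{14}$$

where $*$ denotes convolution and the filter $f$ and gate $g$ have their own scaling vectors $a_l^{f,(k)}$, $a_l^{g,(k)}$ and bias vectors $b_l^{f,(k)}$, $b_l^{g,(k)}$. Just as in the case of the feedforward layer, we can factorize the speaker vectors into smaller speaker scaling and bias codes:

$$a_l^{f,(k)} = W_l^{a,f} s_l^{a,(k)} \quad \text{and} \quad a_l^{g,(k)} = W_l^{a,g} s_l^{a,(k)} \tag{15}$$

$$b_l^{f,(k)} = W_l^{b,f} s_l^{b,(k)} \quad \text{and} \quad b_l^{g,(k)} = W_l^{b,g} s_l^{b,(k)} \tag{16}$$

where the scaling code $s_l^{a,(k)}$ and bias code $s_l^{b,(k)}$ are shared between the filter and the gate.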

III-B Adapting to unseen speakers by fine-tuning entire speaker-adaptive network

By adding a speaker scaling along with the speaker bias, we can model a more sophisticated transformation than with the speaker bias alone. However, it is still restrictive compared with using a speaker layer. Recent studies [7, 51] have shown that fine-tuning the entire network along with the speaker embedding is better than fine-tuning just the speaker embedding. Given an SA layer with the speaker bias code defined by Equation 10, fine-tuning the SA layer yields an adapted layer defined as follows:

$$h_l = f(W_l^{(k)} h_{l-1} + c_l^{(k)} + W_l^{b,(k)} s^{(k)}) \tag{17}$$

where all parameters of the layer now depend on the target speaker. However, it is redundant to model a separate speaker bias here, as a single bias vector can do the same job.

Based on the above observations, we propose a similar adaptation strategy in which we fine-tune entire layers of an initial SA network after first removing all speaker components. The final adapted layer has the structure described by Equation 2. Liu et al. [52] used a similar strategy to adapt a multi-speaker WaveNet vocoder [53] to unseen speakers with limited data. We hypothesize that a speaker-aware model with all speaker-specific parameters stripped is a good initialization for the adaptation.

IV Multimodal architecture for unsupervised speaker adaptation

Fig. 1: Blueprint of proposed multimodal architecture used in the experiments. The layers that can potentially contain speaker components are marked with a yellow identifier tag. Layers with numbers in the middle are dilated convolution layers; the number indicates the dilation rate.

We proposed a novel method for unsupervised speaker adaptation in our previous publication [21]. The main idea is to split the conventional acoustic model into an SI linguistic encoder and SA common layers and then train an auxiliary speech encoder to be used as a substitute for the linguistic encoder when linguistic features are unobtainable, so that the adaptation can still be conducted with backpropagation. That method has several limitations: we have to use the waveform as the input of the speech encoder, as the network tends to ignore the speaker embedding when we use acoustic features, and the quality of the generated speech is still low in general. In this paper, we refine the methodology proposed in [21]. The enhanced multimodal architecture is illustrated in Fig. 1, with the three modules renamed as the linguistic encoder, acoustic encoder, and acoustic decoder. The biggest change is that the encoders no longer output a deterministic latent variable, namely a latent linguistic embedding (LLE), but a density function of it. The idea is inspired by the mixture density network (MDN) [54, 55] and the variational autoencoder (VAE) [56]. By modeling the density function of the LLE, we encourage the network to learn a continuous latent space for it.

Fig. 2: Different modes of the multimodal speaker-adaptive acoustic architecture: (a) training, (b) inference, (c) supervised adaptation, (d) unsupervised adaptation. A dashed border indicates modules with trainable parameters, while a bold solid border indicates modules with immutable parameters.

IV-A Kullback-Leibler divergence bound multimodal speech synthesis system

A conventional acoustic model is a function that transforms linguistic features $x$ into acoustic features $y$. Given the LLE $z$, the acoustic decoder is a transform function $\mathrm{ADec}$ that maps $z$ to $y$. $\mathrm{ADec}$ is defined by its parameters $\phi^{\mathrm{SI}}$ and $\phi^{\mathrm{SD},(k)}$, which are the speaker-independent and speaker-dependent parameters, respectively.

In the training stage, $\phi^{\mathrm{SI}}$ is trained to focus on the common mapping between the linguistic information and the acoustic output, shared among all speakers, while $\phi^{\mathrm{SD},(k)}$ is trained to focus on the unique characteristics of each training speaker. $\phi^{\mathrm{SD},(k)}$ can be a speaker scaling, a speaker bias, or any of the speaker components discussed in Section III-A. Our model assumes that $z$ contains no information about the speaker, so the acoustic decoder has to depend on $\phi^{\mathrm{SD},(k)}$ in order to reconstruct the speaker characteristics for the acoustic feature output:

$$\hat{y} = \mathrm{ADec}(z;\, \phi^{\mathrm{SI}}, \phi^{\mathrm{SD},(k)})$$

The linguistic encoder encodes a deterministic linguistic feature $x$ to the continuous latent representation $z$. The linguistic encoder is a neural network $\mathrm{LEnc}$ defined by its parameters $\phi^{L}$. To imbue the latent space of $z$ with a continuous nature, the output of the linguistic encoder is modeled with a location-scale distribution inspired by the VAE. By stacking the linguistic encoder and acoustic decoder, we obtain a complete TTS network with a deterministic linguistic input and a target acoustic output (see footnote 2):

$$z^{L} \sim \mathrm{LEnc}(x;\, \phi^{L}), \qquad \hat{y}^{L} = \mathrm{ADec}(z^{L};\, \phi^{\mathrm{SI}}, \phi^{\mathrm{SD},(k)})$$

Footnote 2: In our implementation, the acoustic decoder only outputs the mean value instead of the density function, to simplify the setup.

The TTS stack can be trained with backpropagation by minimizing the mean square error between the network output and the target acoustic feature.

When the adaptation data does not include a transcript, we use the acoustic encoder as a substitute for the linguistic encoder. The acoustic encoder is a function $\mathrm{AEnc}$, with parameters $\phi^{A}$, that transforms the acoustic feature $y$ into the latent variable $z$ by stripping unnecessary information (i.e., speaker characteristics) and retaining linguistic information. The latent output of the acoustic encoder may be used by the acoustic decoder to reconstruct the acoustic feature as follows:

$$z^{A} \sim \mathrm{AEnc}(y;\, \phi^{A}), \qquad \hat{y}^{A} = \mathrm{ADec}(z^{A};\, \phi^{\mathrm{SI}}, \phi^{\mathrm{SD},(k)})$$

We refer to the combined network of the acoustic encoder and acoustic decoder as a speech-to-speech (STS) stack. The STS stack is used to adapt the acoustic decoder to unseen speakers when the adaptation data is untranscribed speech.

The key challenge of our proposal is to train a latent variable that satisfies all of the assumptions made. Previously [21], we introduced the joint-goal and tied-layers training methods for this purpose and obtained promising results. In this paper, we modify the tied-layers training method for the enhanced architecture. More precisely, instead of using cosine or Euclidean distance functions to measure the difference between two latent samples, $z^{L}$ from the linguistic encoder and $z^{A}$ from the acoustic encoder, we use the Kullback-Leibler (KL) divergence to measure the information lost when using the density function of $z^{A}$ to approximate the density function of $z^{L}$. By using the Gaussian as the probability density function, we can calculate the KL divergence between the two in closed form [57] (see footnote 3). Modeling the LLE as a latent variable and using the KL divergence as the tied-layer loss are the two most important modifications to our previous publication [21].

Footnote 3: In our implementation, since we further assume a Gaussian with a diagonal covariance matrix, similar to the VAE [56], we calculate the KLD of each element of the LLE independently, take the average, and then use it as the loss function.
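The closed-form KL divergence between two diagonal Gaussians, computed per element and then averaged, can be sketched as below. The formula is the standard one for univariate Gaussians; which encoder's density plays the role of `p` and which of `q` is an assumption of this sketch, as is the function name `kl_diag_gaussian`.

```python
import math


def kl_diag_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Average over dimensions of KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)).

    Each dimension is treated as an independent univariate Gaussian, so the
    per-element divergence is
        log(sigma_q / sigma_p)
        + (sigma_p^2 + (mu_p - mu_q)^2) / (2 * sigma_q^2) - 1/2
    """
    terms = [
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q)
    ]
    return sum(terms) / len(terms)


# Identical densities lose no information.
print(kl_diag_gaussian([0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]))

# The divergence is not symmetric in its two arguments.
print(kl_diag_gaussian([0.0], [1.0], [0.0], [2.0]))
print(kl_diag_gaussian([0.0], [2.0], [0.0], [1.0]))
```

Unlike a cosine or Euclidean distance between two sampled vectors, this quantity compares the means and the standard deviations of the two encoder outputs, and its asymmetry means the direction of the approximation matters.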

IV-B Different modes of the speaker-adaptive multimodal architecture

The purpose of the multimodal architecture is to let us use the model in different modes to solve the different problems at hand. For speaker-adaptive speech synthesis, we need four main modes, as illustrated in Fig. 2: training, inference, supervised adaptation and unsupervised adaptation.

IV-B1 Training

This is the initial mode of the model, in which we need to jointly train all modules to learn a good representation for the tasks involved. We used the tied-layers training method, previously proposed in [21], to optimize all parameters (those of the linguistic encoder, those of the acoustic encoder, and the speaker-independent and speaker-dependent parameters of the acoustic decoder for every training speaker) by minimizing a loss that combines two terms. The first term is the TTS loss, calculated as the distortion between the output of the TTS stack and the target acoustic features. The second term is the tied-layer loss, which is the KL divergence between the output of the linguistic encoder and that of the acoustic encoder.

With this setup, the linguistic encoder and acoustic decoder are trained with a typical TTS acoustic model objective, while the acoustic encoder is trained to approximate the linguistic encoder so that it can be used as a substitute. By combining the losses and jointly training all modules, we encourage the network to find the optimal representation for all criteria.

Set | Train utterances (each speaker / total) | Valid utterances (each speaker / total) | Test utterances (each speaker / total) | Speakers (male / female / total)
jp.base | 148 / 34713 | 3 / 705 | - / - | 51 / 184 / 235
jp.target.5 | 5 / 100 | 3 / 60 | 10 / 200 | 10 / 10 / 20
jp.target.25 | 25 / 500 | | |
jp.target.100 | 100 / 2000 | | |
TABLE I: Japanese speech corpus used in the experiments.

IV-B2 Inference

As our main task is speech synthesis, in inference mode we use the TTS stack to transform linguistic features into acoustic features with the voice of the desired speaker by using the corresponding speaker component. As a side note, when the STS stack is used for inference, the model acts as a many-to-many voice conversion system; we leave such an investigation for future work.

IV-B3 Supervised adaptation

We perform speaker adaptation when we want the model to be able to generate speech in the voice of an unseen speaker. When both the speech and transcript are available, we can adapt the model by using the TTS stack to optimize the parameters of the acoustic decoder, as illustrated in Fig. 2(c). In the case of adapting only the speaker components, as described in Section III-A, we train a new set of speaker components for the unseen speaker while keeping the other parameters unchanged. In the case of fine-tuning the entire acoustic decoder, as described in Section III-B, we first remove all speaker components from the acoustic decoder and then fine-tune the remaining parameters to the target speaker. In either case, the adapted parameters are obtained by minimizing the distortion between the output of the TTS stack and the natural features.

As supervised adaptation and the inference mode use the same TTS stack, supervised adaptation is expected to perform better than unsupervised adaptation.

IV-B4 Unsupervised adaptation

When a transcript does not exist, we can perform unsupervised adaptation. It is conducted in a similar manner to the supervised one, with one difference: the acoustic encoder is used as a substitute for the linguistic encoder, so we do not have to rely on text. The parameters of the acoustic decoder are optimized to minimize the distortion between the output of the STS stack and the natural features.

As we optimize the acoustic decoder using the STS stack in unsupervised adaptation mode but then use the TTS stack in inference mode, there is a mismatch between adaptation and inference.

V Experimental Conditions

Fig. 3: Temporal contexts captured using a stack of non-overlapping dilated convolution layers.

V-A Datasets

We used an in-house multi-speaker Japanese dataset to train the initial multi-speaker model and to conduct speaker adaptation. Table I shows the details of the data usage. The setup is similar to our previous study on scaling and bias codes [22] with a slight adjustment to the amount of data used for adaptation. The objective results are calculated on 200 utterances from twenty speakers included in the jp.target test set. One should note that the data used to train the initial model jp.base is gender-imbalanced with more female speakers than male.

V-B Acoustic model configuration

Our acoustic networks contain two types of layer: feedforward and dilated convolution. Most feedforward layers use a non-linear activation function, but the last hidden layer and the output layer of the acoustic decoder use a linear function instead. The dilated convolution layer is a variation of a time delay neural network (TDNN) [58, 59]. Our version is most similar to the one used in the WaveNet model [1], except that it does not have the causal part. We use blocks of dilated convolution layers to capture both left and right non-overlapping contexts by setting the dilation rates to 1, 3, 9 and 27, in that order, as illustrated in Fig. 3. Two types of gated unit are used with the convolution layer. The first type (Fig. 4(a)) is used in the linguistic encoder and has a residual and a skip output, with trainable weights for both. The second type (Fig. 4(b)) is used in the acoustic encoder and acoustic decoder and only has a residual output. Optionally, the layers of the acoustic decoder can contain speaker components like speaker scaling and bias, as defined in Equation 14.
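Why dilation rates of 1, 3, 9 and 27 give non-overlapping contexts can be illustrated with a small counting exercise. The sketch below assumes a kernel width of 3 with symmetric (non-causal) taps, which is not stated explicitly in the text; the function name `context_paths` is hypothetical.

```python
from collections import Counter


def context_paths(dilations, kernel_size=3):
    """For each input frame offset, count the paths through the stack that reach it."""
    half = kernel_size // 2
    paths = Counter({0: 1})  # start from one output frame at offset 0
    for dilation in dilations:
        expanded = Counter()
        for offset, count in paths.items():
            for tap in range(-half, half + 1):
                expanded[offset + tap * dilation] += count
        paths = expanded
    return paths


stack = context_paths([1, 3, 9, 27])
print(len(stack), min(stack), max(stack))   # 81 -40 40
print(set(stack.values()))                  # {1}

# Without dilation, four layers see far fewer frames and revisit them often.
plain = context_paths([1, 1, 1, 1])
print(len(plain), max(plain.values()))
```

Under this assumption, one block covers 81 consecutive frames (40 on each side of the current frame), and every frame is reached through exactly one path, which is one way to read "non-overlapping".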

Fig. 4: Gated units of convolution layers used in the experiments: (a) linguistic encoder, (b) acoustic en(de)coder.

Fig. 1 is the blueprint of our configuration; the structure of the network is designed to be representative and convenient for testing our hypothesis. Each module is a standalone network, and all modules share a similar structure. The input is transformed to a higher representation with two nonlinear hidden layers. The subsequent block(s) of convolution layers are used to capture temporal context. One last hidden layer is added before the hidden representation is transformed to the desired output. In the case of the linguistic and acoustic encoders, the output is a density function of the LLE; therefore, given the latent representation $h$ of the last hidden layer, the output layers of the encoders transform $h$ into the mean $\mu$ and standard deviation $\sigma$ of the density function of the LLE $z$:

$$\mu = W_{\mu} h + b_{\mu}$$

$$\sigma = \exp(W_{\sigma} h + b_{\sigma})$$

The exponential function is used to make sure the standard deviation will always receive a positive value. We then apply the reparameterization trick to make the network differentiable:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

All hidden layers of the linguistic and acoustic encoders have 128 units, while the hidden layers of the acoustic decoder have 256 units. The LLE was set to be a 64-dimensional feature.
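The encoder output layer described above (a mean, a standard deviation obtained through an exponential, and a reparameterized sample) can be sketched as follows. This is an illustration in plain Python, not our implementation: the weight names, the random initialization and the single-frame input are assumptions, and only the dimensions (128 hidden units, 64-dimensional LLE) come from the text.

```python
import math
import random


def affine(h, weights, bias):
    """Return W h + b, with `weights` stored as one row per output unit."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]


def sample_lle(h, w_mu, b_mu, w_sigma, b_sigma, rng):
    """Map a hidden vector to (sample, mean, standard deviation)."""
    mu = affine(h, w_mu, b_mu)
    # exp() keeps every standard deviation strictly positive.
    sigma = [math.exp(a) for a in affine(h, w_sigma, b_sigma)]
    # Reparameterization: the noise is drawn independently of the
    # parameters, so the sample is a differentiable function of mu, sigma.
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, mu, sigma


def random_matrix(rows, cols, rng, scale=0.05):
    """Hypothetical initializer used only for this illustration."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
HIDDEN_DIM, LLE_DIM = 128, 64  # sizes stated in the text
h = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
w_mu = random_matrix(LLE_DIM, HIDDEN_DIM, rng)
w_sigma = random_matrix(LLE_DIM, HIDDEN_DIM, rng)
b_mu = [0.0] * LLE_DIM
b_sigma = [0.0] * LLE_DIM
z, mu, sigma = sample_lle(h, w_mu, b_mu, w_sigma, b_sigma, rng)
print(len(z), min(sigma) > 0.0)
```

In an actual training setup the same computation would be expressed in an automatic differentiation framework so that gradients flow through the mean and standard deviation.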

V-C WaveNet vocoder configuration

We trained a speaker-independent WaveNet vocoder [53] using the jp.base training set. We then used the model to generate speech for the target unseen speakers. While fine-tuning the WaveNet vocoder [52, 60] to the target unseen speaker is reported to improve performance, we decided to use a single SI WaveNet vocoder in all of the experiments and focus on evaluating the performance of the acoustic model. WaveNet is trained to model a 16 kHz waveform that is quantized with 10-bit μ-law. The network contains 40 dilated causal layers conditioned on a mel-spectrogram. We kept the setup for WaveNet simple and similar to that of the original study [1].
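For readers unfamiliar with the waveform quantization mentioned above, the sketch below shows standard μ-law companding with 1024 levels. It assumes samples normalized to [-1, 1] and μ = 2^10 - 1; the exact rounding convention used for our vocoder is not stated in the text, so this one is illustrative.

```python
import math

MU = 2 ** 10 - 1  # 1023, i.e. 1024 quantization levels (10 bits)


def mulaw_encode(x, mu=MU):
    """Compand a sample in [-1, 1] and map it to an integer in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1.0) / 2.0 * mu + 0.5)


def mulaw_decode(index, mu=MU):
    """Invert mulaw_encode up to quantization error."""
    y = 2.0 * index / mu - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


sample = 0.3
index = mulaw_encode(sample)
restored = mulaw_decode(index)
print(index, restored)
```

The logarithmic companding spends more of the 1024 levels on low-amplitude samples, which is why a categorical output of this size is usually considered sufficient for 16 kHz speech.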

V-D Feature pre-processing

V-D1 Linguistic features

We used standard linguistic features of Japanese speech synthesis. The features contained quinphone contexts, word part-of-speech tags, pitch accent types of the accent phrases, interrogative phrase marks, and other structural information such as the position of the mora in a word, accent phrases, and utterances. We aligned the linguistic features with the acoustic features by using an external system [37]. The linguistic features were then concatenated with duration information into a 389-dimensional vector.

V-D2 Acoustic features

We simplified our setup by using an 80-dimensional mel-spectrogram as the acoustic feature, in contrast to our previous study [22], where we used multiple types of vocoder features. The features are extracted from a 25 ms window shifted in steps of 5 ms over the speech waveform. The WaveNet vocoder is used to synthesize the speech waveform from the mel-spectrogram feature. With this setup, we could use a mean square error metric for both the training loss and the objective evaluation. For the objective evaluation, we removed silence frames, indicated by the linguistic features, before calculating the mean square error in order to obtain results more focused on speech regions.
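The objective measure described above can be sketched as a mean square error computed only over non-silent frames. The function name and the boolean silence mask are assumptions for illustration; in our setup the mask would be derived from the linguistic features.

```python
def masked_mse(natural, generated, is_silence):
    """Mean square error over all feature dimensions of non-silent frames.

    natural, generated: lists of frames, each frame a list of floats.
    is_silence: one boolean per frame (True means the frame is skipped).
    """
    total, count = 0.0, 0
    for nat, gen, silent in zip(natural, generated, is_silence):
        if silent:
            continue
        for a, b in zip(nat, gen):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("no speech frames left after removing silence")
    return total / count


# Toy example with 2-dimensional frames; the first frame is silence.
natural = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
generated = [[5.0, 5.0], [1.0, 3.0], [2.0, 2.0]]
silence = [True, False, False]
print(masked_mse(natural, generated, silence))
```

Skipping silence matters because silent frames are easy to predict and would otherwise pull the average error down regardless of how well the speech regions are modeled.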

V-E Training and adapting optimization

The initial multi-speaker models were trained to minimize the designated loss in Equation 25, with the tied-layers factor set to 0.25 for all strategies. Training stopped naturally after 5 epochs without any improvement on the validation set or was forcefully stopped at the 128th epoch; in either case, the best epoch was used as the final model. In practice, all training stopped naturally by about the thirtieth epoch. The speaker adaptation followed a similar scheme, but only a certain part of the acoustic decoder was optimized instead of the entire network. For the adaptation with five utterances, many strategies were forcefully stopped at the 128th epoch. For the adaptations with 25 and 100 utterances, the adaptation usually converged after 5 epochs without further improvements.
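The stopping rule above (a patience of 5 epochs, a hard limit of 128 epochs, and selection of the best epoch) can be sketched as follows. The function takes a precomputed sequence of validation losses as a stand-in for real training, which is an assumption made to keep the example self-contained.

```python
def select_best_epoch(validation_losses, patience=5, max_epochs=128):
    """Return (best_epoch, epochs_run), both 1-indexed.

    Stops once `patience` consecutive epochs bring no improvement on the
    validation loss, or after `max_epochs` epochs, whichever comes first.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, epochs_run


# Converges: the best loss is at epoch 3, so training stops at epoch 8.
converged = select_best_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74, 0.73, 0.6])
# Keeps improving: training is cut off at the 128th epoch.
capped = select_best_epoch([1.0 / (i + 1) for i in range(200)])
print(converged, capped)
```

The second case mirrors the five-utterance adaptation runs reported above, where the validation loss kept improving slowly and the epoch limit was reached.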

Fig. 5: Objective evaluations of supervised and unsupervised speaker adaptation of multiple strategies utilizing speaker scaling and bias.

VI Evaluation and Discussion

VI-A Baseline objective evaluations of the conventional multi-speaker and supervised adaptation tasks

Strategy   Layer   Speaker bias   Mean square error
                                  5 utts    25 utts   100 utts
MU-A1b     A1      128            0.555     -         0.533
MU-A1B     A1      full           0.553     -         0.518
AD-A1b     A1      128            0.584     0.560     0.553
AD-A1bB    A1      256            0.578     0.555     0.545
AD-A1B     A1      full           0.578     0.553     0.544
TABLE II: Objective evaluation of the baseline system for multi-speaker and supervised adaptation tasks under various strategies.

We first evaluate several baselines without the proposed elements. These baseline systems only contain the linguistic encoder and the acoustic decoder, where the linguistic encoder outputs a deterministic latent variable instead of a density function. We add either a bias code or a speaker bias to the A1 layer (Fig. 1) to represent the conventional speaker codes approach. The baseline strategies we investigate are shown in Table II for the multi-speaker task and the speaker adaptation task. MU models are trained using the training data of jp.base combined with either jp.target.5 or jp.target.100. On the other hand, the initial model of the speaker adaptation task AD is trained using the training data of jp.base; adaptation is then performed by optimizing the speaker components for unseen speakers with jp.target.{5,25,100} data.

From the table, we can first see that there is little difference between the two multi-speaker strategies MU-A1b and MU-A1B when the amount of data is limited. However, MU-A1B shows a greater improvement when more data become available. Similarly, for the speaker adaptation task, there is little difference between AD-A1bB and AD-A1B, and both show slightly better performance than AD-A1b. The multi-speaker MU-A1B strategy is better than its speaker adaptation counterpart AD-A1B, but speaker adaptation is faster and more convenient to conduct; the same conclusion was reached in our previous studies [37, 22].

VI-B Preliminary objective evaluations on supervised and unsupervised adaptation with speaker scaling and bias

Strategy   Layer(s)   Number of speaker parameters
                      Speaker bias   Speaker scale   Total
A1b        A1         128            0               128x1
A1B        A1         full           0               256x1
A3a        A3         128            128             256x1
A3A        A3         full           full            512x1
B1b        B1         128            0               128x1
B1B        B1         full           0               512x1
B8a        B8         128            128             256x1
B8A        B8         full           full            1024x1
Bab        B[1-8]     64             0               64x8
BaB        B[1-8]     full           0               512x8
Baa        B[1-8]     64             64              128x8
BaA        B[1-8]     full           full            1024x8
TABLE III: Adaptation strategies utilizing speaker scaling and bias

We evaluated our methods described in Section IV for both supervised and unsupervised speaker adaptation. We also investigated multiple strategies for modeling the speaker transformation, as shown in Table III. Here, A-type is a strategy in which the speaker component is injected at a feedforward layer. A1 represents the conventional speaker embedding, while A3 represents the best strategy in our preliminary study [22] with both scaling and bias codes at a layer near the output. B-type is a strategy in which the speaker components are injected at the convolution layers. Ba is where all eight convolution layers of the acoustic decoder have their own speaker components.

Figure 5 shows the results of the objective evaluations of the strategies listed in Table III for both supervised and unsupervised adaptation tasks.

VI-B1 Comparison with the baseline

The supervised adaptation of A1b is slightly worse than AD-A1b, and the same goes for A1B and AD-A1B. Comparing between different amounts of data, we conclude that the degradation in performance is an acceptable trade-off for the ability to perform unsupervised speaker adaptation.

VI-B2 Supervised and unsupervised adaptation

For the A-type strategies, the unsupervised adaptation consistently improves when the supervised adaptation improves, which validates our method. The best strategy identified so far, A3a, is as good as the supervised adaptation baseline AD-A1B in both supervised and unsupervised tasks.

VI-B3 Speaker scaling and bias at residual gated layer

We trained a couple of strategies with speaker scaling and bias placed at layer B1 or B8. Figure 5 shows that these strategies performed as well as the A-type strategies. This confirms that speaker scaling and bias can be used at either feedforward or convolution layers.

VI-B4 Speaker components at multiple layers

The Ba strategies have speaker scaling and bias at all eight residual convolution layers. BaB obtained the best results of all those evaluated, without showing any overfitting in the five-utterance adaptation case.

(a) Female speaker
(b) Male speaker
Fig. 6: Objective evaluation of adaptation for male speaker. Mean square error between the natural and generated feature.
(a) Female Speaker
(b) Male Speaker
Fig. 7: Subjective results of the quality and similarity tests. Samples were generated from six strategies for supervised and unsupervised adaptation using either 5 or 250 utterances. The error bar indicates the 95% confidence interval.

VI-C Focused objective and subjective evaluations and strategies of adapting the entire network

Here, we evaluate the adaptation performance for just two target speakers, one male and one female, who have more speech data. The initial models used in the previous section are reused to adapt to these two target speakers: A1B is used as the new baseline, A3a is a strategy with both scaling and bias codes, while BaB and Baa are those that obtained the best objective evaluations in the previous section. Finally, two new strategies are introduced for the method described in Section III-B. For these strategies, we first remove the speaker components from the pretrained BaB and Baa models; then we adapt the remaining parameters of the acoustic decoder to the target speaker.

VI-C1 Objective evaluation

The objective evaluations are shown in Fig. 6; the results are calculated from 100 test utterances of each speaker. The number of utterances used for adaptation ranged from 5 to 1000. We can see that A3a is still slightly better than A1b at most data points. Among the legacy strategies, BaB benefits the most from the increase in data. Of the two new strategies, one surprisingly outperforms all others, while the other shows poor results when the amount of data is limited. The pattern is consistent between male and female speakers. For the unsupervised adaptation task, the same new strategy also seems to be the best. However, adding more adaptation data seems to worsen the objective results. There is still a gap between the supervised and unsupervised adaptation performance within each strategy, but the best proposed strategy surpasses the baseline A1B in both supervised and unsupervised adaptation tasks.

VI-C2 Subjective evaluation

(a) Female speaker
(b) Male speaker
Fig. 8: Detailed results of similarity evaluations for selected strategies. Subscripts on the strategy names distinguish supervised from unsupervised adaptation.

We conducted subjective surveys on the supervised and unsupervised adaptation tasks. To reduce the number of systems that the participants had to evaluate, we only used models adapted with 5 and 250 utterances (speech samples are available online at https://nii-yamagishilab.github.io/sample-tts-unified-adaptation/). The SI WaveNet vocoder was used to synthesize waveforms from the generated mel-spectrogram. A copy synthesis system was also included as a reference. For the quality test, participants were asked to judge the quality of a sample on a 5-point mean opinion score (MOS) scale. For the similarity test, participants were asked to judge the similarity between a generated sample and the recorded sample in a 4-point scale MOS test, where 1 means different (sure), 2 means different (not sure), 3 means the same (not sure) and 4 means the same (sure). One session consisted of 25 quality and 25 similarity questions, with one question for each system. The final results were calculated from only those sessions in which all 50 questions were answered. Each session contained samples of either the female or the male speaker. We collected in total 500 sets for the female speaker and 497 sets for the male speaker from a total of 198 paid participants, who each did ten sessions at most.
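As an illustration of how listener scores can be summarized, the sketch below computes a mean opinion score and the half-width of a 95% confidence interval. The normal approximation with z = 1.96 is an assumption; the text does not specify how the intervals in our figures were obtained.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Uses the sample standard deviation and a normal approximation, which
    is reasonable when each system receives several hundred ratings.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


# Hypothetical ratings for one system on the 5-point quality scale.
ratings = [1, 2, 3, 4, 5]
mean, half_width = mos_with_ci(ratings)
print(mean, round(half_width, 3))
```

With roughly 500 ratings per system, as collected here, the interval narrows by a factor of ten compared with this five-rating toy example for the same score spread.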

The mean values of the quality and similarity tests are shown in Fig. 7. Several inter-speaker and inter-strategy trends can be observed: the quality for the male speaker is lower than for the female speaker; when more data becomes available, the similarity score increases for most strategies, while the quality sometimes decreases; strategies utilizing both speaker scaling and bias got worse results than those utilizing only speaker bias, despite their better objective results; and the supervised and unsupervised adaptation strategies gave similar results. Generally speaking, one of the new strategies that adapt the remaining acoustic decoder parameters is the best strategy for both supervised and unsupervised adaptation tasks, especially in the 250-utterance case. The most surprising outcome is that the unsupervised adaptation of this strategy outperforms its supervised counterpart, even though the objective results indicated the opposite. Figure 8 shows the details of the similarity test for A1B, BaB and this new strategy. Its unsupervised adaptation using 250 utterances has the most positive results for both male and female speakers.

VII Conclusion

We systematically reviewed the methodology of speaker adaptation for speech synthesis systems and pointed out the remaining limitations. We then proposed a unified framework for conducting supervised and unsupervised adaptation with backpropagation. Our method can use different types of speaker components to model the speaker transformation instead of assuming that the speaker characteristics can be represented by a single fixed-length vector. Furthermore, this approach allows us to fine-tune the entire acoustic decoder even if the adaptation data do not include transcriptions. The results of the experiments suggest that, given a good initial factorized model, fine-tuning the entire acoustic decoder yields the best performance for both supervised and unsupervised adaptation.

Interestingly, the unsupervised adapted model turned out to be significantly better than its supervised counterpart in the subjective test. Our hypothesis is that element-wise metrics like the mean square error might not reflect human perception. A similar conclusion has been suggested in other studies involving speech [61] and image generation [56]. Incorporating a generative adversarial network (GAN) [61] into the architecture is a popular way to address this issue.


This work was partially supported by a JST CREST Grant (JPMJCR18A6, VoicePersonae project), Japan, and MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan. We are grateful to Dr. Erica Cooper for helpful comments.


  • [1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [2] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “Samplernn: An unconditional end-to-end neural audio generation model,” arXiv preprint arXiv:1612.07837, 2016.
  • [3] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Proc. INTERSPEECH, 2017, pp. 4006–4010.
  • [4] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2wav: End-to-end speech synthesis,” in Proc. ICLR Workshop, 2017.
  • [5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions,” in Proc. ICASSP, 2018, pp. 4779–4783.
  • [6] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voiceloop: Voice fitting and synthesis via a phonological loop,” arXiv preprint arXiv:1707.06588, 2017.
  • [7] Y. Chen, Y. Assael, B. Shillingford, D. Budden, S. Reed, H. Zen, Q. Wang, L. C. Cobo, A. Trask, B. Laurie, C. Gulcehre, A. van den Oord, O. Vinyals, and N. de Freitas, “Sample efficient adaptive text-to-speech,” arXiv preprint arXiv:1809.10460, 2018.
  • [8] J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai, “Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm,” IEEE Trans. Audio, Speech, Language Process., vol. 17, no. 1, pp. 66–83, 2009.
  • [9] M. J. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer speech & language, vol. 12, no. 2, pp. 75–98, 1998.
  • [10] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden markov models,” Computer speech & language, vol. 9, no. 2, pp. 171–185, 1995.
  • [11] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, “Speaker adaptation for HMM-based speech synthesis system using mllr,” in Proc. SSW, 1998, pp. 273–276.
  • [12] V. V. Digalakis, D. Rtischev, and L. G. Neumeyer, “Speaker adaptation using constrained estimation of gaussian mixtures,” IEEE Trans. Speech Audio Process., vol. 3, no. 5, pp. 357–366, 1995.
  • [13] Y. Zhao, D. Saito, and N. Minematsu, “Speaker representations for speaker adaptation in multiple speakers’ blstm-rnn-based speech synthesis,” in Proc. INTERSPEECH, 2016, pp. 2268–2272.
  • [14] Z. Wu, P. Swietojanski, C. Veaux, S. Renals, and S. King, “A study of speaker adaptation for DNN-based speech synthesis,” in Proc. INTERSPEECH, 2015, pp. 879–883.
  • [15] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
  • [16] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Proc. NIPS, 2018, pp. 10 040–10 050.
  • [17] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” arXiv preprint arXiv:1806.04558, 2018.
  • [18] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” arXiv preprint arXiv:1802.06984, 2018.
  • [19] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network,” in Proc. ICASSP, 2014, pp. 6359–6363.
  • [20] S. Takaki, Y. Nishimura, and J. Yamagishi, “Unsupervised speaker adaptation for DNN-based speech synthesis using input codes,” in Proc. APSIPA, 2018, pp. 649–658.
  • [21] H.-T. Luong and J. Yamagishi, “Multimodal speech synthesis architecture for unsupervised speaker adaptation,” in Proc. INTERSPEECH, 2018, pp. 2494–2498.
  • [22] ——, “Scaling and bias codes for modeling speaker–adaptive DNN–based speech synthesis systems,” in Proc. SLT, 2018, pp. 610–617.
  • [23] Y. Bengio, “Learning deep architectures for AI,” Foundations and trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
  • [24] Z. Kons, S. Shechtman, A. Sorin, R. Hoory, C. Rabinovitz, and E. Da Silva Morais, “Neural tts voice conversion,” in Proc. SLT, 2018, pp. 290–296.
  • [25] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation for end-to-end ctc models,” in Proc. SLT, 2018, pp. 542–549.
  • [26] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “Kl-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. ICASSP, 2013, pp. 7893–7897.
  • [27] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis,” in Proc. ICASSP, 2015, pp. 4475–4479.
  • [28] Z. Huang, H. Lu, M. Lei, and Z. Yan, “Linear networks based speaker adaptation for speech synthesis,” in Proc. ICASSP, 2018, pp. 5319–5323.
  • [29] Y. Zhao, J. Li, and Y. Gong, “Low-rank plus diagonal adaptation for deep neural networks,” in Proc. ICASSP, 2016, pp. 5005–5009.
  • [30] J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and T. Robinson, “Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system,” in Proc. EUROSPEECH, 1995, pp. 2171–2174.
  • [31] B. Li and H. Zen, “Multi-language multi-speaker acoustic modeling for lstm-rnn based statistical parametric speech synthesis.” in Proc. INTERSPEECH, 2016, pp. 2468–2472.
  • [32] T. Tan, Y. Qian, M. Yin, Y. Zhuang, and K. Yu, “Cluster adaptive training for deep neural network,” in Proc. ICASSP, 2015, pp. 4325–4329.
  • [33] L. Samarakoon and K. C. Sim, “Factorized hidden layer adaptation for deep neural network based acoustic modeling,” IEEE/ACM Trans. Audio, Speech, Language Process, vol. 24, no. 12, pp. 2241–2250, 2016.
  • [34] P. Swietojanski and S. Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in Proc. SLT, 2014, pp. 171–176.
  • [35] L. Samarakoon and K. C. Sim, “Subspace lhuc for fast adaptation of deep neural network acoustic models.” in Proc. INTERSPEECH, 2016, pp. 1593–1597.
  • [36] X. Cui, V. Goel, and G. Saon, “Embedding-based speaker adaptive training of deep neural networks,” in Proc. INTERSPEECH, 2017, pp. 122–126.
  • [37] H.-T. Luong, S. Takaki, G. E. Henter, and J. Yamagishi, “Adapting and controlling DNN-based speech synthesis using input codes,” in Proc. ICASSP, 2017, pp. 4905–4909.
  • [38] N. Hojo, Y. Ijima, and H. Mizuno, “DNN-based speech synthesis using speaker codes,” IEICE T. Inf. Syst., vol. 101, no. 2, pp. 462–472, 2018.
  • [39] O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in Proc. ICASSP, 2013, pp. 7942–7946.
  • [40] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors.” in Proc. ASRU, 2013, pp. 55–59.
  • [41] S. Takaki, S. Kim, and J. Yamagishi, “Speaker adaptation of various components in deep neural network based speech synthesis,” in Proc. SSW, 2016, pp. 153–159.
  • [42] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP, 2013, pp. 7962–7966.
  • [43] A. Gutkin, L. Ha, M. Jansche, K. Pipatsrisawat, and R. Sproat, “Tts for low resource languages: A bangla synthesizer,” in Proc. LREC, 2016, pp. 2005–2010.
  • [44] P. Swietojanski and S. Renals, “Sat-lhuc: Speaker adaptive training for learning hidden unit contributions.” in Proc. ICASSP, 2016, pp. 5010–5014.
  • [45] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
  • [46] H. Liao, “Speaker adaptation of context dependent deep neural networks,” in Proc. ICASSP, 2013, pp. 7947–7951.
  • [47] A. Tjandra, S. Sakti, and S. Nakamura, “Machine speech chain with one-shot speaker adaptation,” in Proc. INTERSPEECH, 2018, pp. 887–891.
  • [48] R. Doddipatla, N. Braunschweiler, and R. Maia, “Speaker adaptation in DNN-based speech synthesis using d-vectors,” in Proc. INTERSPEECH, 2017, pp. 3404–3408.
  • [49] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves et al., “Conditional image generation with pixelcnn decoders,” in Proc. NIPS, 2016, pp. 4790–4798.
  • [50] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proc. NIPS, 2015, pp. 2377–2385.
  • [51] Y. Deng, L. He, and F. Soong, “Modeling multi-speaker latent space to improve neural tts: Quick enrolling new speaker and enhancing premium voice,” arXiv preprint arXiv:1812.05253v2, 2018.
  • [52] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, “Wavenet vocoder with limited training data for voice conversion,” in Proc. INTERSPEECH, 2018, pp. 1983–1987.
  • [53] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, “An investigation of multi-speaker training for wavenet vocoder,” in Proc. ASRU, 2017, pp. 712–718.
  • [54] C. M. Bishop, “Mixture density networks,” Dept. of Computer Science and Applied Mathematics, Aston University, Tech. Rep., 1994.
  • [55] H. Zen and A. Senior, “Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis,” in Proc. ICASSP, 2014, pp. 3844–3848.
  • [56] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” arXiv preprint arXiv:1512.09300, 2015.
  • [57] W. Ping, K. Peng, and J. Chen, “Clarinet: Parallel wave generation in end-to-end text-to-speech,” arXiv preprint arXiv:1807.07281, 2018.
  • [58] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 3, pp. 328–339, 1989.
  • [59] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. INTERSPEECH, 2015, pp. 3214–3218.
  • [60] B. Sisman, M. Zhang, and H. Li, “A voice conversion framework with tandem feature sparse representation and speaker-adapted wavenet vocoder,” in Proc. INTERSPEECH, 2018, pp. 1978–1982.
  • [61] Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial networks,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 1, pp. 84–96, 2018.