Towards Universal Neural Vocoding with a Multi-band Excited WaveNet

10/07/2021, by Axel Roebel et al.

This paper introduces the Multi-Band Excited WaveNet, a neural vocoder for speaking and singing voices. It aims to advance the state of the art towards a universal neural vocoder, that is, a model that can generate voice signals from arbitrary mel spectrograms extracted from voice signals. Following the success of the DDSP model and the development of the recently proposed excitation vocoders, we propose a vocoder structure consisting of multiple specialized DNNs combined with dedicated signal processing components. All components are implemented as differentiable operators and therefore allow joint optimization of the model parameters. To prove the capacity of the model to reproduce high quality voice signals we evaluate it on single and multi speaker/singer datasets. We conduct a subjective evaluation demonstrating that the models support a wide range of domain variations (unseen voices, languages, expressivity), achieving perceptual quality that compares with a state of the art universal neural vocoder while using significantly smaller training datasets and significantly fewer parameters. We also demonstrate remaining limits of the universality of neural vocoders, e.g., the creation of saturated singing voices.







1 Introduction

The introduction of the WaveNet [5] has demonstrated that DNNs can be trained to produce high quality speech signals when conditioned on a mel spectrogram. This result has triggered numerous research activities aiming to reduce the high computational demands of the original WaveNet or to reduce the size of the required training data [6, 7, 8, 9, 10]. Recently, the research focus has been extended from single speaker models to multi speaker models or even universal neural vocoders [1, 2, 11], that is, vocoders that support arbitrary speakers, languages and expressivity. An important motivation for constructing a universal neural vocoder is the simplification of the process of creating new voices for TTS systems. An interesting line of research in this context are models that incorporate prior information about the speech signal into the generator [12, 13, 4, 14]. These models, in the following denoted as excitation networks, simplify the task of the generator by splitting the vocal tract filter (VTF) off into a dedicated unit. Instead of generating the speech signal, the generator is then used only to produce the excitation signal. However, only one of these models [4] takes a mel spectrogram as input; the others use more classical vocoder parameters such as F0, line spectral frequencies, and a voiced/unvoiced flag. The idea of introducing domain knowledge into the model seems particularly interesting. It is in line with the recent DDSP [3] framework for music synthesis, which replaces part of the generator by a sinusoidal model and uses the DNN only to control the parameters of the sinusoidal model. The main disadvantage of using classical vocoder parameters for conditioning is the fact that these parameters are deeply entangled. Disentangling a set of heterogeneous vocoder parameters seems significantly more difficult than disentangling, for example, the speaker identity from the mel spectrogram.
This is due to the fact that the mel spectrogram is a homogeneous representation similar to images, and therefore techniques for attribute manipulation that have proven useful for image manipulation (notably disentanglement) can be applied with only minor changes. Consequently, research on voice attribute manipulation, such as speaker identity conversion [15], rhythm and F0 conversion [16, 17], gender conversion [18], and speaker normalization [19], generally starts from a (mel) spectral representation of the voice signal. In a companion paper that demonstrates high quality singing voice transposition over multiple octaves [20], manipulating the mel spectrogram and resynthesizing with a neural vocoder has proven highly effective.

These experiences motivate our research into extending the range of voice signals supported by neural vocoders. The present paper discusses especially the case of speech and singing signals and presents a new neural vocoder with significantly better support for singing than existing models. To achieve this goal we introduce two novelties that are the central contributions of the present research:

  • To improve the signal quality as well as to ease the use of the vocoder in practical applications, we replace the approximate VTF estimation from the mel spectrogram proposed in [4] by a small separate model that predicts the cepstral coefficients of the VTF.

  • To facilitate the generation of the long and stable quasi periodic oscillations that are crucial for singing, we simplify the task of the excitation generator by splitting it into a small DNN that predicts the F0 contour from the input mel spectrogram and a differentiable wavetable generator that produces the corresponding excitation. The subsequent WaveNet then operates without recursion and only has the task of shaping the given periodic pulses in accordance with the conditioning mel spectrogram.

The rest of the paper is organized as follows. In section 2 we introduce the various components of the model and put them into the context of existing work. In section 3 we describe the model topology; in section 4 we describe the datasets and discuss our experimental results.

2 Model components

The present section discusses the structure of the proposed neural vocoder, which we denote as Multi-Band Excited WaveNet, and notably its relations with existing work. The fundamental idea of the present work follows and extends the arguments in [12, 13, 4, 14]: the excitation networks aim to simplify the task of the WaveNet generator by removing the VTF from the generator output. Similar to [4] we use the mel spectrogram to represent the acoustic features. The following sections describe the proposed contributions in more detail.

2.1 VTF generation

[4] proposes to recover an approximate all-pole representation of the VTF by first converting the log amplitudes in the mel spectrogram to linear amplitudes and then applying the pseudo-inverse of the mel filter bank to recover an approximation of the linear amplitude spectrum, from which an all-pole model of the envelope can be obtained. It is well known, however, that all-pole estimation from harmonic spectra is subject to systematic errors [21]. To counter these systematic errors, generating the VTF by means of an auxiliary DNN seems a preferable option. Here we propose to use an auxiliary DNN that predicts a cepstral representation of the VTF. This prediction is cheap because it is performed frame-wise and therefore operates at a low sample rate. We limit the model to predicting causal cepstral coefficients, so that the resulting VTF will be minimum phase [22].

Whether we use an all-pole or a cepstral representation of the VTF, in both cases predicting the VTF from the mel spectrogram raises the question of the gain coefficients. If we do not constrain the VTF we create a gain ambiguity, because any gain in the VTF can be compensated by an inverse gain in the excitation generator. We therefore force the cepstral model to have zero gain.
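The construction of a minimum-phase VTF from causal cepstral coefficients under the zero-gain constraint can be sketched as follows. This is a minimal numpy sketch; the function name and FFT size are illustrative and not taken from the paper.

```python
import numpy as np

def minimum_phase_vtf(causal_cepstrum, n_fft=1024):
    """Build a minimum-phase VTF spectrum from predicted causal cepstral coefficients.

    Forcing c[0] = 0 removes the gain ambiguity between the VTF and the
    excitation generator: the mean of the log-magnitude spectrum is then zero.
    A spectrum exp(sum_n c[n] e^{-iwn}) with c[n] = 0 for n < 0 is minimum
    phase by construction.
    """
    c = np.zeros(n_fft)
    c[0] = 0.0                          # zero gain constraint
    k = len(causal_cepstrum)
    c[1:1 + k] = causal_cepstrum        # causal (n > 0) coefficients only
    return np.exp(np.fft.rfft(c))       # complex minimum-phase spectrum
```

Because the whole chain consists of an FFT and an exponential, it is differentiable and the cepstral predictor can be trained jointly with the rest of the model.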

2.2 Wavetable based excitation generation

The existing excitation networks all use a WaveNet, or more generally a DNN, to create the quasi periodic excitation. In our experiments, notably for singing signals, we noticed that generating a stable quasi periodic excitation is a difficult problem for the generator. For our neural vocoder we therefore decided to create a quasi periodic excitation directly and pass it together with a white noise signal through the WaveNet. Given that the F0 contour is already correct, the WaveNet then only serves to create the appropriate pulse form and the balance between the deterministic and stochastic components.

Predicting the F0 from the mel spectrogram has turned out to be rather simple. For the generation of the quasi periodic excitation we decided to use wavetable synthesis, aiming to create an excitation that is approximately white so that all harmonics are already present. The problem here is that the number of harmonics depends on the F0. To handle the time varying number of harmonics we create N wavetables for different F0 ranges (N=5 for the present study). The band limited pulses for each of the wavetables are generated in the spectral domain depending on the F0 range, such that no aliasing takes place even for the maximum F0 for which a wavetable entry will be used. The wavetable syntheses are always performed in parallel, and the final excitation is generated by linearly interpolating between only two of the available wavetables. The whole process is implemented to run on the GPU. The wavetable positions that need to be sampled to ensure the correct F0 contour are


p(n) = ( Σ_{k≤n} L·F0(k)/F_s ) mod L        (1)

where L is the size of the wavetable, F0(k) the F0 contour in Hz, and F_s the samplerate in Hz. Because p(n) is a continuous variable, the values to be taken from the wavetable will not fall onto the grid of values stored in the tables. For the gradient to pass through the wavetable into the F0 predictor it is important that the positions are not quantized! Instead, the values to be output from the wavetable need to be linearly interpolated.
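The fractional wavetable read with linear interpolation can be sketched in a few lines. This is a minimal numpy sketch under the assumptions stated in the text (cumulative phase in table positions, non-quantized read positions); the paper's actual implementation is a differentiable GPU operator, and the function name is illustrative.

```python
import numpy as np

def wavetable_read(table, f0, fs):
    """Read a wavetable at positions following a time-varying F0 contour.

    table: one period of the band-limited pulse, length L
    f0:    F0 contour in Hz, one value per output sample
    fs:    samplerate in Hz
    The fractional positions are linearly interpolated rather than quantized,
    which is what lets gradients flow back into the F0 predictor in the
    differentiable version of this operator.
    """
    L = len(table)
    # cumulative phase in table positions: p(n) = (sum_k L*f0[k]/fs) mod L
    pos = np.cumsum(L * f0 / fs) % L
    i0 = np.floor(pos).astype(int)
    frac = pos - i0
    i1 = (i0 + 1) % L                   # wrap around the table
    return (1.0 - frac) * table[i0] + frac * table[i1]
```

With a single-cycle sine table and a constant F0 the output is simply a sinusoid at that F0, which makes the operator easy to verify in isolation.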

Figure 1: MBExWN schematic generator: Green boxes are DNN models, yellow boxes are differentiable operators, red ovals are losses. The numbers below the boxes specify the output dimensions of the box in the format time x channels (batch dimension not shown).

3 Model topology

The overall schematic diagram of the MBExWN generator is shown in fig. 1. The diagram represents the flow of a single frame of a mel spectrogram with 80 channels. Each block displays the output dimensions it would produce for this single spectral frame. The signal flow is discussed in more detail below.

The input mel spectrogram enters three subnets. First, the F0 subnet produces an F0 sequence with upsampling factor 100. The sequence of layers of the F0 predictor is specified in terms of layer type (C: Conv1D, L: linear upsampling) followed by a short parameter specification. The Conv1D layer parameters are given as kernel size x number of filters, optionally followed by an upsampling factor. If given, the upsampling is performed by reshaping channels into the time dimension. As an example, consider the layer specification C:3x240x2. This would be implemented by a Conv1D layer with kernel size 3 and 240 filters, followed by a reshape operation that upsamples by a factor of 2 by folding every other channel into the time direction, yielding 120 channels. The linear interpolation layer L is a Conv1D layer with precomputed parameters that performs upsampling; its only parameter is the upsampling factor.
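The channel-to-time reshape upsampling can be illustrated as follows. This numpy sketch shows only the reshape step (the preceding Conv1D is omitted), and follows the "every other channel" interleaving described above; the function name is illustrative.

```python
import numpy as np

def reshape_upsample(x, factor):
    """Fold channels into the time dimension, e.g. (T, 240) -> (2T, 120).

    Channel c*factor + j of time step t becomes channel c of time step
    t*factor + j, i.e. every other channel (for factor 2) is folded into
    the time direction.
    """
    T, C = x.shape
    assert C % factor == 0
    return (x.reshape(T, C // factor, factor)
             .transpose(0, 2, 1)
             .reshape(T * factor, C // factor))
```

A tiny example: a (2, 4) input with factor 2 becomes a (4, 2) output whose even/odd time steps hold the even/odd input channels.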

The F0 net specification is then as follows: C:3x150, C:3x300x2, C:5x150, C:3x120, C:3x600x5, C:1x120, C:3x500x5, C:1x100, C:3x50, L:2

The activation functions in the F0 predictor are all ReLU and are situated after each convolutional layer. The only exception is the last layer, which uses a soft sigmoid as activation function. The output vector is then offset and scaled to the desired F0 range; in the present model this range is 45Hz-1400Hz. After this operation the F0 contour passes through the wavetable generator described in section 2.2. There follows a reshape operation and the concatenation of a white noise signal, duplicating the size of the excitation signal. The basic excitation signal then enters the pulse shaping WaveNet. This WaveNet follows the classical configuration using gated tanh activations and kernel size 3. It consists of 2 blocks of 5 layers, having 240 or 320 channels for single or multi voice models respectively.

The PostNet is a single Conv1D layer that reduces the channel size from 30 to 15 to adapt the WaveNet output for the subsequent PQMF [10] synthesis filter with 15 bands.

The VTF predictor is again a CNN with the specification: C:3x400, C:1x600, C:1x400, C:1x400, C:1x160.

Activation functions are ReLU after all but the last convolutional layer. The final layer has no activation function and passes directly into a real valued FFT operator to produce a minimum phase spectral envelope [22].

The VTF and the excitation signal produced by the PostNet are multiplied in the spectral domain to produce the final speech signal. The STFT parameters are copied from those used for creating the mel spectrogram.
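Assuming frame-wise processing, the spectral-domain multiplication can be sketched as below. This is a hypothetical numpy helper; frame segmentation, windowing and overlap-add are omitted, and the names are not from the paper.

```python
import numpy as np

def apply_vtf(excitation_frames, vtf_spectra, n_fft):
    """Apply the VTF to windowed excitation frames in the spectral domain.

    excitation_frames: (n_frames, frame_len) windowed time-domain frames
    vtf_spectra:       (n_frames, n_fft//2 + 1) complex minimum-phase VTFs
    Returns the filtered frames, ready for overlap-add resynthesis.
    """
    E = np.fft.rfft(excitation_frames, n=n_fft, axis=-1)
    return np.fft.irfft(E * vtf_spectra, n=n_fft, axis=-1)
```

Since FFT, complex multiplication and inverse FFT are all differentiable, this final filtering stage fits into the joint optimization of the whole generator.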

4 Experiments

For the following experiments we used 4 databases. The first is the LJSpeech single speaker dataset [23], denoted as LJ in the following. The second, denoted as SP, is a multi speaker dataset composed of the VCTK [24], PTDB [25] and AttHack [26] datasets. The SP dataset contains approximately 45h of speech recorded from 150 speakers. For singing voice experiments we used a single singer dataset containing a Greek Byzantine singer [27], denoted as DI, and for the multi singer model a database composed of the NUS [28], SVDB [29], PJS [30], JVS [31] and Tohoku [32] datasets, as well as an internal dataset composed of 2 pop and 6 classical singers. This dataset contains about 27h of singing recordings from 136 singers and will be denoted as SI.

All database recordings were resampled to 24kHz. All voice files were annotated automatically with F0 contours using the FCN estimator [33]. We employ the noise separation algorithm described in [34] to separate deterministic and noise components and calculate the noise/total energy balance over 4 periods of the F0. We annotate segments with more than 50% of the energy in the noise component as unvoiced.
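The voiced/unvoiced decision rule described above amounts to a simple threshold on the per-segment energy balance. A minimal sketch, assuming the noise and total energies per segment have already been computed by the separation algorithm of [34]:

```python
import numpy as np

def annotate_unvoiced(noise_energy, total_energy, thresh=0.5):
    """Mark segments as unvoiced when the noise component carries more than
    `thresh` of the total energy (measured over 4 periods of the F0).

    noise_energy, total_energy: per-segment energies (same shape).
    Returns a boolean array, True where the segment is annotated unvoiced.
    """
    return noise_energy / np.maximum(total_energy, 1e-12) > thresh
```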

For the optimization we use the Adam optimizer [35]. The learning rate and decay rate parameters differ between training without and with the discriminator. The batch size is always 40, and the segment length is approximately 200ms.

As objective functions we use the following losses. The first is the F0 prediction loss given by

L_F0 = (1/|V|) Σ_{t∈V} ( F0(t) − F̂0(t) )²        (2)

where F0(t) is the target F0, F̂0(t) the predicted value at time sample position t, and V the set of points that are annotated as voiced and further than 50ms away from a voiced/unvoiced boundary. For these unambiguously voiced sections the F0 predictor can be optimized using only the prediction error.
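The masked F0 loss over unambiguously voiced samples can be sketched as follows. This is a numpy sketch; the squared-error form and function name are assumptions, as the paper's exact norm is not reproduced here.

```python
import numpy as np

def f0_loss(f0_true, f0_pred, voiced_mask):
    """Masked F0 prediction loss over unambiguously voiced samples.

    voiced_mask selects points annotated as voiced and more than 50 ms away
    from a voiced/unvoiced boundary; only these contribute to the loss.
    """
    v = voiced_mask.astype(bool)
    if not np.any(v):
        return 0.0
    return float(np.mean((f0_true[v] - f0_pred[v]) ** 2))
```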

The second loss is a multi resolution spectral reconstruction loss similar to [2]. It is composed of two terms, the first calculated as normalized linear magnitude differences and the second as log amplitude differences:

L_S = ‖S − Ŝ‖_F / ‖S‖_F + (1/(N·M)) ‖log S − log Ŝ‖_1        (3)

Here S and Ŝ are the magnitudes of the STFTs of the target and generated signals, and N and M are the number of frames and the number of bins in the STFT matrices.

The final reconstruction loss is then the mean of the reconstruction losses obtained for the different resolutions,

L_R = (1/R) Σ_{r=1}^{R} L_S^{(r)}        (4)

where r runs over the R resolutions. For the following experiments we used STFTs with a set of window sizes and corresponding hop sizes in seconds.
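The multi resolution reconstruction loss can be sketched as below. This is a numpy sketch under stated assumptions: the spectral-convergence plus mean log-magnitude form follows the description in the text, but the exact norms, normalizations and epsilon handling are illustrative, not the paper's implementation.

```python
import numpy as np

def spectral_loss(S, S_hat, eps=1e-7):
    """Single-resolution loss: normalized linear magnitude difference
    (spectral convergence) plus mean log-magnitude difference.

    S, S_hat: (n_frames, n_bins) STFT magnitudes of target/generated signals.
    """
    sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + eps)
    lm = np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))
    return sc + lm

def multi_resolution_loss(mags_true, mags_pred):
    """Mean of the single-resolution losses over all STFT resolutions."""
    return float(np.mean([spectral_loss(S, Sh)
                          for S, Sh in zip(mags_true, mags_pred)]))
```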

The reconstruction loss is used as objective function for the pulse shaping WaveNet and for the F0 predictor around voiced/unvoiced boundaries (more precisely, within 50ms of these boundaries within voiced segments, and within 20ms of these boundaries in unvoiced segments). In these transition areas we expect the F0 annotation to be less reliable and not sufficient to achieve optimal resynthesis performance. Therefore, here we optimize the F0 predictor as part of the generator.

Finally, when training with the discriminator loss we use exactly the same discriminator configuration and loss as [2]. The only difference is that we use only 2 discriminator rates: the first operating at the original samplerate and the second after average pooling by a factor of 4. We motivate the decision to drop the last discriminator by the fact that the stability of the periodic oscillations is already ensured by the excitation model, and the discriminator is therefore only needed to evaluate the pulse form and the balance between deterministic and stochastic signal components.

For each model we first pretrain the F0 prediction model over 100k batches using only eq. (2) as objective function. Pretraining the F0 predictor reliably achieves prediction errors below 3Hz; we will not discuss these further. As a next step we pretrain the full generator, starting with the pretrained F0 model loaded. Pretraining of the generator runs for 200k batches. To create a generator without adversarial loss we continue training the generator for a further 400k iterations. When training with adversarial loss we load the discriminator after the 200k training steps for the generator and train with the discriminator for a further 800k batches.

4.1 Perceptual Evaluation

In the following section we compare different models and configurations. We denote these using a code structured as TXC. Here T is the model type, using the following shortcuts: MW: the multi band excited WaveNet introduced here, MMG: multi band melgan from [10], UMG: universal melgan vocoder from [2]. The further codes are used only for the MW models, where X is either U for multi voice (universal) models or S for single voice models. Finally, C is a sequence of letters representing specific components missing from the full model; here only a single letter is used. The letter v in the last position indicates that the model does not use a dedicated module for VTF prediction. For the MW model we have trained two multi voice models: a singing model trained with the full pitch range available in the SI dataset, and a speech model trained on the SP dataset. In the tables below, the singing model is used when singing data is treated; equivalently, the speech model is used for speech. During the perceptual evaluation we also used the speech model for those singing signals that stay in a pitch range that was part of our speech databases. For these special cases we denote the model as MWU.

We first summarize results about pretraining. Pretraining the F0 models on any of the datasets converges reliably to an F0 prediction error around 2Hz for singing and around 2.5Hz for speech. Pretraining the generator achieves spectral reconstruction errors in the order of 3dB for singing and 3.2dB for speech. The reconstruction error on the mel spectrogram is even smaller, generally around 2dB. Listening to the generated sounds reveals a constant buzz in many of the noise sections. The main problem here are residual pulses that are not sufficiently suppressed by the pulse forming WaveNet. To solve this problem we use the time domain discriminators proposed originally in [36].

4.2 Perceptual tests

We have conducted a perceptual test evaluating the perceived quality of the selected MWU and MWS models trained on multi and single voice databases. We use seen and unseen speakers, languages, expressivities, as well as singing styles. For these tests we use three baselines. We used an open source multi-band melgan implementation and trained it for 1M iterations on the DI and LJ datasets. Further, we downloaded original samples together with resynthesized results of the Universal MelGAN model [2] and used these as a baseline for the multi voice models. Each of the tests has been conducted by 42 participants, consisting of audio and music professionals working at or with IRCAM and native English speakers recruited via the Prolific online platform. Demo sounds are available online.

In contrast to perceptual tests performed in other studies, our main interest is the perceptually transparent resynthesis of the original speech signal. Therefore we chose to perform a MUSHRA test containing the reference signal and a group of resynthesized signals that the participants can play as they like. The task given was to concentrate on any differences that might be perceived between the original and the resynthesis and to rate the perceived differences on a scale from 0 to 100 with categories imperceptible (80-100), perceptible, not annoying (60-80), slightly annoying (40-60), annoying (20-40), and very annoying (0-20). Results are listed in table 1. The first column indicates the source the data is taken from. The column marked HREF represents a hidden reference (a copy of the reference) for which we expect and observe an evaluation around 90 in all cases. In the following column we find the MBExWN models trained on a multi voice dataset, and in the subsequent columns various baselines.

In the upper part of the table we find the perceptual evaluation of singing data. In the first line the evaluation data of the SI dataset is used. The result of the MBExWN model trained on the singing voice dataset is equivalent to the result of the hidden reference. In the second line we compare the multi singer model with dedicated single singer models trained on the DI dataset as well as with the multi speaker model. Note that the singer DI is part of neither SI nor SP. The best result here is obtained by the single singer MBExWN model, which is trained exclusively on that singer. Both multi voice models, whether trained on singing or on speech, achieve a rating in the highest category of the MUSHRA test. The model achieving the worst performance is the multi-band melgan, also trained on DI. The explanation here are instabilities in the synthesized F0 trajectory; sometimes the model even changes phonemes. This confirms our observation that existing neural vocoders tend to have problems with stable periodic oscillations. In the last two lines we use pop and metal solo singing as out of domain singing. Both multi voice models degrade somewhat for pop singing and produce an annoying quality for saturated metal singing. The model structure will need to change to support the characteristic subharmonics of saturated voices.

In the lower part of the table we find results for speech data. In the first line we see the multi speaker model evaluated on its validation data. The model remains in the range of the highest MUSHRA quality. A likely reason for the slight reduction are frequent pop noises in the speech signals, which the model does not reproduce perceptually transparently. In the second line we find the evaluation of the multi speaker model on an unseen speaker, compared to the MBExWN and MMG single speaker models trained on the evaluated speaker. The MWS model is rated best, but the differences are not statistically significant. The last model in this test is an MBExWN multi speaker model configured without the VTF component. This model is clearly perceived as less transparent, which validates the VTF as part of the model structure. In the last line we compare on the data retrieved from the UMG demo site. These samples are considered out of domain (expressivity, language, speakers) for both the UMG and the MWU model. Comparing the two shows that the MWU model is ranked higher than the UMG model, but this difference may not be significant.

Singing Models/Singing Data
SI        90 (4.2)   90 (3.8)   -          -          -
DI        89 (4.6)   83 (5.8)   89 (3.8)   66 (8.4)   80 (6.5)
Pop-sing  88 (4.7)   71 (8.4)   -          -          76 (7.8)
Met-sing  91 (2.8)   57 (9.9)   -          -          55 (8.8)
Speech Data/Speech Models
SP        92 (2.8)   85 (7.3)   -          -          -
LJ        90 (3.6)   84 (6.1)   85 (5.1)   83 (5.6)   77 (6.9)
UMG_V     92 (5.1)   84 (6.3)   79 (8.7)
Table 1: Perceptual evaluation of the perceived difference between original and resynthesis for different models and conditions.

4.3 Complexity

The multi speaker model with 320 WaveNet channels has about 10M parameters and achieves an inference speed of 50 kSamples/s when running on a single core of an Intel i7 laptop CPU. On an NVidia V100 GPU the inference rate is 2.4 MSamples/s. These numbers compare favourably with the universal melgan [2], which has 90M parameters and achieves an inference speed of 860 kSamples/s on an NVidia V100 GPU.

5 Conclusions

In this paper we have presented MBExWN, a new neural vocoder with an externally excited WaveNet as source. A perceptual test has shown that the proposed model achieves near transparent quality even for out of domain data. The signal quality degrades when the model is confronted with rough and saturated voices; further research will be conducted to solve these cases.

5.1 Acknowledgements

We would like to thank Won Jang for sharing information and materials related to the universal melgan.


  • [1] Jaime Lorenzo-Trueba et al., “Towards Achieving Robust Universal Neural Vocoding,” in Interspeech 2019. Sept. 2019, pp. 181–185.
  • [2] Won Jang et al., “Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains,” ArXiv201109631 Cs Eess, Mar. 2021.
  • [3] Jesse Engel et al., “DDSP: Differentiable Digital Signal Processing,” in Proc. ICLR, Sept. 2019.
  • [4] Lauri Juvela et al., “GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-Spectrogram,” in Interspeech 2019. 2019, pp. 694–698.
  • [5] Jonathan Shen et al., “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions,” in ICASSP, Apr. 2018, pp. 4779–4783.
  • [6] Aaron van den Oord et al., “Parallel WaveNet: Fast High-Fidelity Speech Synthesis,” ArXiv E-Prints, p. arXiv:1711.10433, Nov. 2017.
  • [7] R. Prenger et al., “Waveglow: A Flow-based Generative Network for Speech Synthesis,” in ICASSP, 2019, pp. 3617–3621.
  • [8] Xin Wang et al., “Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 402–415, 2020.
  • [9] Ryuichi Yamamoto et al., “Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram,” in ICASSP, May 2020, pp. 6199–6203.
  • [10] Geng Yang et al., “Multi-Band Melgan: Faster Waveform Generation For High-Quality Text-To-Speech,” in 2021 IEEE Spoken Language Technology Workshop (SLT), Jan. 2021, pp. 492–498.
  • [11] Yunlong Jiao et al., “Universal Neural Vocoding with Parallel Wavenet,” in ICASSP, June 2021, pp. 6044–6048.
  • [12] Eunwoo Song et al., “ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems,” in 2019 Proc 27th EUSIPCO. 2019, pp. 1–5, IEEE.
  • [13] Lauri Juvela et al., “GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 6, pp. 1019–1030, June 2019.
  • [14] Suhyeon Oh et al., “ExcitGlow: Improving a WaveGlow-based Neural Vocoder with Linear Prediction Analysis,” in APSIPA ASC. 2020, pp. 831–836, IEEE.
  • [15] Jing-Xuan Zhang et al., “Non-Parallel Sequence-to-Sequence Voice Conversion With Disentangled Linguistic and Speaker Representations,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 28, pp. 540–552, 2020.
  • [16] Kaizhi Qian et al., “Unsupervised Speech Decomposition via Triple Information Bottleneck,” in Proceedings of the 37th ICML. Nov. 2020, pp. 7836–7846, PMLR.
  • [17] Kaizhi Qian et al., “Unsupervised Speech Decomposition via Triple Information Bottleneck,” in Proceedings of the 37th ICML. Nov. 2020, pp. 7836–7846, PMLR.
  • [18] Laurent Benaroya et al., “Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations,” ArXiv210712346 Cs Eess, July 2021.
  • [19] Fadi Biadsy et al., “Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation,” in INTERSPEECH, 2019.
  • [21] A. El-Jaroudi and J. Makhoul, “Discrete All-Pole Modeling,” IEEE Trans. Signal Process., vol. 39, no. 2, pp. 411–423, 1991.
  • [22] J. O. Smith, Spectral Audio Signal Processing, W3K Publishing, 2011.
  • [23] Keith Ito and Linda Johnson, “The LJ Speech Dataset,”, 2017.
  • [24] Junichi Yamagishi et al., “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92),”, Nov. 2019.
  • [25] Gregor Pirker et al., “A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario,” in Interspeech, 2011, pp. 1509–1512.
  • [26] Clément Le Moine and Nicolas Obin, “Att-HACK: An Expressive Speech Database with Social Attitudes,” in Speech Prosody, Tokyo, Japan, May 2020.
  • [27] N. Grammalidis et al., “The i-Treasures Intangible Cultural Heritage dataset,” in Proceedings of the 3rd International Symposium on Movement and Computing, New York, NY, USA, July 2016, MOCO ’16, pp. 1–8.
  • [28] Zhiyan Duan et al., “The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech,” in 2013 APSIPA ASC. 2013, pp. 1–9, IEEE.
  • [29] Liliya Tsirulnik and Shlomo Dubnov, “Singing Voice Database,” in International Conference on Speech and Computer. 2019, pp. 501–509, Springer.
  • [30] Junya Koguchi et al., “PJS: Phoneme-balanced Japanese singing-voice corpus,” in APSIPA ASC. 2020, pp. 487–491, IEEE.
  • [31] Hiroki Tamaru et al., “JVS-MuSiC: Japanese multispeaker singing-voice corpus,” ArXiv Prepr. ArXiv200107044, 2020.
  • [32] Itsuki Ogawa et al., “Tohoku Kiritan singing database: A singing database for statistical parametric singing synthesis using Japanese pop songs,” Acoust. Sci. Technol., vol. 42, no. 3, pp. 140–145, 2021.
  • [33] Luc Ardaillon et al., “Fully-Convolutional Network for Pitch Estimation of Speech Signals,” in Interspeech 2019. Sept. 2019, pp. 2005–2009, ISCA.
  • [34] Stefan Huber et al., “On glottal source shape parameter transformation using a novel deterministic and stochastic speech analysis and synthesis system,” in Proc InterSpeech, 2015.
  • [35] Diederik P. Kingma et al., “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations, 2015.
  • [36] Kundan Kumar et al., “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” in Advances in Neural Information Processing Systems, 2019, vol. 32, pp. 14910–14921.