Modern end-to-end speech synthesis models mostly consist of two stages: (1) transforming character embeddings to acoustic features such as mel-spectrograms (Ping et al., 2017; Shen et al., 2018; Vasquez and Lewis, 2019; Ren et al., 2019), and (2) synthesizing time-domain waveforms from the derived acoustic features (Oord et al., 2016, 2017; Ping et al., 2018; Kim et al., 2018; Prenger et al., 2019). In various end-to-end speech synthesis models, the WaveNet vocoder conditioned on mel-spectrograms is employed for the second stage to generate high-fidelity raw audio (Ping et al., 2017; Shen et al., 2018). However, samples cannot be obtained in real-time with the WaveNet vocoder due to its autoregressive nature. To enable fast sampling, flow-based generative models have recently attracted attention in the field of speech synthesis (Oord et al., 2017; Kim et al., 2018; Ping et al., 2018; Prenger et al., 2019).
Parallel WaveNet (Oord et al., 2017) and ClariNet (Ping et al., 2018) are based on the inverse autoregressive flow (IAF) (Kingma et al., 2016), which takes advantage of the inverse transformation of an autoregressive function. Although IAF allows a parallel sampling procedure, it is not suitable for training parallel WaveNet or ClariNet directly under the maximum likelihood criterion. Instead, parallel WaveNet and ClariNet are trained through probability density distillation, which requires a well-trained teacher network. Additional hand-engineered objective functions are also needed to stabilize the training procedure and to produce high-quality audio.
FloWaveNet (Kim et al., 2018) and WaveGlow (Prenger et al., 2019) are based on the affine coupling layer of Real NVP (Dinh et al., 2016). The affine coupling layer provides a simple inverse transformation and a tractable Jacobian determinant. Unlike the IAF-based flow models, both the inference and sampling processes are parallelizable, so these models can be trained under the maximum likelihood criterion without any auxiliary loss terms. However, Real NVP-based models require many flow steps to perform density estimation accurately because the affine coupling layer is too simple and inflexible. In this respect, FloWaveNet and WaveGlow are inherently memory-inefficient models.
Recently, Chen et al. (2018) introduced a new technique in which the hidden units are assumed to vary continuously in time. In this framework, the continuous-time dynamics of the hidden units and of their probability densities are described by deep neural networks. A continuous normalizing flow (CNF) is specified by these two dynamics using the instantaneous change of variables formula. In contrast to discrete normalizing flows, the CNF does not impose any restrictions on its architecture and allows the use of a highly flexible function for the flow transformation, as shown in Figure 1. In light of these advantages, we propose a novel generative flow for speech synthesis called WaveNODE. By replacing the discrete normalizing flows with a CNF, WaveNODE generates high-fidelity waveforms from the corresponding mel-spectrograms with far fewer parameters than the conventional flow-based models.
We propose WaveNODE which takes advantage of the CNF for speech synthesis. WaveNODE is capable of generating high-fidelity waveforms from mel-spectrograms with a few flow steps. Moreover, WaveNODE does not require a teacher network or additional loss terms for training. The overall structure of the proposed WaveNODE is shown in Figure 2. We describe the CNF and the hierarchical architecture of WaveNODE in the next subsections.
2.1 Continuous Normalizing Flow
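In Neural ODEs (Chen et al., 2018), the continuous dynamics of time-varying hidden units $\mathbf{z}(t)$ are parameterized using an ODE

$$\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t; \theta), \tag{1}$$

where $f$ is implemented by a neural network with parameters $\theta$. Here, $t$ represents the time-step of solving the ODE, not the temporal axis of waveforms. Applying Eq. (1), the change in log-density of $\mathbf{z}(t)$ follows a differential equation given by

$$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\mathrm{Tr}\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right), \tag{2}$$

which is called the instantaneous change of variables formula. To avoid the $\mathcal{O}(D^2)$ cost of computing the trace operation, where $D$ is the dimensionality of $\mathbf{z}(t)$, an unbiased estimate of Eq. (2) can be derived using Hutchinson's trace estimator (Hutchinson, 1990) as follows:

$$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\boldsymbol{\epsilon}^{\top} \frac{\partial f}{\partial \mathbf{z}(t)} \boldsymbol{\epsilon}\right], \tag{3}$$

where $\boldsymbol{\epsilon}$ is a noise vector drawn from $p(\boldsymbol{\epsilon})$ such that $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\mathrm{Cov}(\boldsymbol{\epsilon}) = I$ (Grathwohl et al., 2018). Eq. (3) enables a fast computation of Eq. (2) since the vector-Jacobian product $\boldsymbol{\epsilon}^{\top} \frac{\partial f}{\partial \mathbf{z}(t)}$ can be computed by reverse-mode automatic differentiation (Paszke et al., 2019) without explicitly writing out the Jacobian matrix. In this way, we obtain an unbiased stochastic estimator of the dynamics of the log-likelihood with $\mathcal{O}(D)$ cost.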
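The trace estimator described above can be illustrated on a small fixed matrix. The following is a minimal sketch, not the paper's implementation: the matrix `A` is a hypothetical stand-in for the Jacobian of the dynamics, `matvec` stands in for the product that automatic differentiation would supply, and Rademacher noise is assumed as one distribution with zero mean and identity covariance.

```python
import random

def hutchinson_trace(matvec, dim, num_samples, rng):
    """Estimate Tr(A) as the average of eps^T (A eps) over Rademacher noise vectors."""
    total = 0.0
    for _ in range(num_samples):
        # Rademacher noise: zero mean, identity covariance.
        eps = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        a_eps = matvec(eps)
        total += sum(e * v for e, v in zip(eps, a_eps))
    return total / num_samples

# Hypothetical 3x3 "Jacobian"; in a CNF this matrix is never formed explicitly.
A = [[2.0, 0.5, 0.0],
     [0.1, -1.0, 0.3],
     [0.0, 0.2, 0.5]]

def matvec(v):
    """Matrix-vector product A v, standing in for an autodiff product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

exact = sum(A[i][i] for i in range(3))
estimate = hutchinson_trace(matvec, 3, 20000, random.Random(0))
print(exact, estimate)
```

Each sample needs only one product with the matrix, which is why the estimator avoids the cost of building the full Jacobian.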
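To train a generative model based on the CNF, we first assume that a latent variable $\mathbf{z}$ follows a simple distribution $p(\mathbf{z})$. Next, let $f$ be the dynamics of $\mathbf{z}(t)$ with the initial value $\mathbf{z}(t_0) = \mathbf{z}$. Given a datapoint $\mathbf{x}$, the corresponding latent variable is obtained by solving the following ODE:

$$\mathbf{z} = \mathbf{z}(t_0) = \mathbf{x} + \int_{t_1}^{t_0} f(\mathbf{z}(t), t; \theta)\, dt, \tag{4}$$

where the final value $\mathbf{z}(t_1) = \mathbf{x}$. The log-likelihood of $\mathbf{x}$ can also be determined by solving another ODE

$$\log p(\mathbf{x}) = \log p(\mathbf{z}) - \int_{t_0}^{t_1} \boldsymbol{\epsilon}^{\top} \frac{\partial f}{\partial \mathbf{z}(t)} \boldsymbol{\epsilon}\, dt, \tag{5}$$

where $\boldsymbol{\epsilon}$ is a noise vector sampled from $p(\boldsymbol{\epsilon})$. Since $\mathbf{z}(t)$ varies along the vector field $f$, the sampling process can be performed by simply reversing the time interval in Eq. (4) as follows:

$$\mathbf{x} = \mathbf{z}(t_1) = \mathbf{z}(t_0) + \int_{t_0}^{t_1} f(\mathbf{z}(t), t; \theta)\, dt. \tag{6}$$

Unlike the conventional normalizing flows, $f$ is not required to be invertible or to have a tractable Jacobian. Hence, arbitrary functions can be employed for $f$.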
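The inference and sampling directions can be checked on a toy example. The sketch below makes several assumptions that are not in the paper: the dynamics are a hypothetical linear map `f(z, t) = A z` (so the trace of the Jacobian is known exactly and no noise estimate is needed), the solver is a fixed-step RK4 rather than a black-box adaptive solver, and the latent prior is a standard Gaussian.

```python
import math

A = [[-0.5, 1.0],
     [-1.0, -0.3]]           # hypothetical linear dynamics f(z, t) = A z
TRACE_A = A[0][0] + A[1][1]  # Tr(df/dz), constant for linear dynamics

def f(z, t):
    return [A[0][0] * z[0] + A[0][1] * z[1],
            A[1][0] * z[0] + A[1][1] * z[1]]

def rk4_solve(z, t_start, t_end, steps=200):
    """Integrate dz/dt = f(z, t) from t_start to t_end with fixed-step RK4."""
    h = (t_end - t_start) / steps  # negative when integrating backwards in time
    t = t_start
    for _ in range(steps):
        k1 = f(z, t)
        k2 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k1)], t + 0.5 * h)
        k3 = f([zi + 0.5 * h * ki for zi, ki in zip(z, k2)], t + 0.5 * h)
        k4 = f([zi + h * ki for zi, ki in zip(z, k3)], t + h)
        z = [zi + h / 6.0 * (a + 2 * b + 2 * c + d)
             for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]
        t += h
    return z

def standard_normal_logpdf(z):
    return sum(-0.5 * zi * zi - 0.5 * math.log(2 * math.pi) for zi in z)

t0, t1 = 0.0, 1.0
x = [0.8, -0.4]
z = rk4_solve(x, t1, t0)      # inference: datapoint -> latent, time runs t1 -> t0
# Exact trace integral used here in place of the stochastic estimate.
log_px = standard_normal_logpdf(z) - TRACE_A * (t1 - t0)
x_rec = rk4_solve(z, t0, t1)  # sampling: latent -> datapoint, time runs t0 -> t1
print(z, log_px, x_rec)
```

Running the same solver with the time interval reversed recovers the original datapoint up to numerical error, without any analytic inverse of `f`.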
2.2 CNF Layer
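Since we are interested in retrieving a time-domain signal from its mel-spectrogram, we apply the CNF framework to estimate the conditional distribution of audio samples. Given a mel-spectrogram $\mathbf{h}$ upsampled to the full time resolution, Eq. (1) and Eq. (2) can be extended to a conditional formulation as follows:

$$\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \mathbf{h}; \theta),$$

$$\frac{\partial \log p(\mathbf{z}(t) \mid \mathbf{h})}{\partial t} = -\mathrm{Tr}\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right).$$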
To capture the long-term dependency between audio samples, WaveNODE adopts a non-causal dilated convolutional network similar to the WaveGlow (Prenger et al., 2019) architecture for $f$. Let $W$ and $V$ be convolutional layers, and $U$ be a linear projection. The activation function of the layers used for $f$ is defined as

$$\tanh\left(W_f * \mathbf{z} + V_f * \mathbf{h} + U_f t\right) \odot \sigma\left(W_g * \mathbf{z} + V_g * \mathbf{h} + U_g t\right),$$

where $*$ denotes the convolution operator, $\odot$ denotes element-wise multiplication, $\sigma$ is the sigmoid function, and the subscripts $f$ and $g$ denote filter and gate, respectively. Note that WaveNODE uses $t$ as a global condition via broadcasting and $\mathbf{h}$ as a local condition. The CNF layer receives the initial value $\mathbf{z}(t_0)$, the time interval $[t_0, t_1]$, and the condition $\mathbf{h}$ as inputs. Using a black-box ODE solver, the CNF layer outputs the final value $\mathbf{z}(t_1)$ and the change in log-likelihood.
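The gated unit with filter and gate branches can be sketched for a single-channel sequence as follows. This is only an illustration under simplifying assumptions: one channel instead of many, scalar weights for the local condition and for the time input, and hypothetical parameter names (`w_f`, `w_g`, `v_f`, `v_g`, `u_f`, `u_g`) that do not come from the paper's code.

```python
import math

def dilated_conv1d(x, kernel, dilation):
    """Non-causal 'same' 1-D convolution of a single-channel sequence, zero padded."""
    half = len(kernel) // 2  # assumes an odd kernel size
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = n + (k - half) * dilation  # looks both backwards and forwards
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_unit(z, h, t, params, dilation):
    """tanh(filter branch) * sigmoid(gate branch).
    h is the local condition (one value per time step);
    the scalar t is a global condition broadcast over all time steps."""
    filt = dilated_conv1d(z, params["w_f"], dilation)
    gate = dilated_conv1d(z, params["w_g"], dilation)
    return [math.tanh(fv + params["v_f"] * hn + params["u_f"] * t)
            * sigmoid(gv + params["v_g"] * hn + params["u_g"] * t)
            for fv, gv, hn in zip(filt, gate, h)]

# Hypothetical weights and inputs, for illustration only.
params = {"w_f": [0.2, 0.5, -0.1], "w_g": [0.3, -0.2, 0.4],
          "v_f": 0.1, "v_g": -0.1, "u_f": 0.05, "u_g": 0.02}
z = [0.1 * n for n in range(8)]
h = [1.0] * 8
out = gated_unit(z, h, 0.5, params, dilation=3)
print(out)
```

Because the convolution is non-causal, each output sample depends on both past and future input samples, which is permissible here since the flow processes the whole waveform in parallel.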
2.3 Squeeze Layer
A squeeze layer rearranges an input tensor of shape ($C \times T$) to form an output tensor of shape ($sC \times T/s$), where $s$ is a scale factor. By increasing the number of channels by $s$ times, the squeeze layer enlarges the receptive field of the convolutional networks linearly. This also helps each NODE block to focus on different temporal dependencies.
Since speech signals have a very high temporal resolution, it may not be desirable to directly use a mono audio signal of shape ($1 \times T$) as input to a convolutional network. To deal with this, WaveNODE employs an initial squeeze layer that transforms an input tensor of shape ($1 \times T$) into an output tensor of shape ($s_0 \times T/s_0$) given an initial scale factor $s_0$. This is equivalent to using a squeeze layer with a correspondingly larger scale factor in the first NODE block.
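The rearrangement performed by a squeeze layer can be sketched on nested lists. The exact interleaving order below (consecutive time steps of one channel become neighbouring channels) is an assumption for illustration, and `squeeze` and `unsqueeze` are hypothetical helper names.

```python
def squeeze(x, s):
    """(C, T) -> (s*C, T/s): move s consecutive time steps of each channel into s channels."""
    channels, length = len(x), len(x[0])
    assert length % s == 0, "time length must be divisible by the scale factor"
    return [[x[c][t * s + k] for t in range(length // s)]
            for c in range(channels) for k in range(s)]

def unsqueeze(y, s):
    """Inverse of squeeze: (s*C, T/s) -> (C, T)."""
    channels, length = len(y) // s, len(y[0])
    return [[y[c * s + k][t] for t in range(length) for k in range(s)]
            for c in range(channels)]

audio = [[0, 1, 2, 3, 4, 5, 6, 7]]   # mono signal: 1 channel, 8 samples
squeezed = squeeze(audio, 4)          # 4 channels, 2 samples each
restored = unsqueeze(squeezed, 4)
print(squeezed)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The operation only permutes samples, so it is trivially invertible and contributes nothing to the change in log-likelihood.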
2.4 Norm Layer
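WaveNODE incorporates a norm layer to alleviate the difficulties that arise when training deep neural networks. In this work, we consider two variants of batch normalization (Ioffe and Szegedy, 2015).
2.4.1 Actnorm
With trainable parameters $\mathbf{s}$ and $\mathbf{b}$ ($\in \mathbb{R}^{C}$), actnorm (Kingma and Dhariwal, 2018) performs a per-channel affine transformation on the input $\mathbf{x}$:

$$\mathbf{y}_i = s_i \cdot \mathbf{x}_i + b_i,$$

where $s_i$ and $b_i$ are the $i$-th elements of $\mathbf{s}$ and $\mathbf{b}$, respectively, and $\mathbf{x}_i$ is the $i$-th channel vector of $\mathbf{x}$. The parameters $\mathbf{s}$ and $\mathbf{b}$ are initialized to normalize the pre-actnorm activations given an arbitrary initial batch (i.e., data-dependent initialization). The change in log-likelihood obtained by passing through the actnorm layer can be computed as

$$\Delta \log p = T \sum_{i=1}^{C} \log |s_i|, \tag{13}$$

where $T$ is the temporal length of $\mathbf{x}$.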
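A per-channel affine layer with data-dependent initialization can be sketched as below. This is a minimal illustration, assuming the scale-then-shift form of the transformation and a log-determinant equal to the time length times the sum of log absolute scales; the function names and the small variance floor `1e-6` are choices made here, not taken from the paper.

```python
import math

def actnorm_init(batch):
    """Data-dependent init: per-channel scale and bias that give zero mean and
    unit variance on `batch` (a list of examples, each C channels of T samples)."""
    num_channels = len(batch[0])
    scale, bias = [], []
    for c in range(num_channels):
        values = [v for example in batch for v in example[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        s = 1.0 / math.sqrt(var + 1e-6)  # small floor avoids division by zero
        scale.append(s)
        bias.append(-mean * s)
    return scale, bias

def actnorm_forward(x, scale, bias):
    """Apply y_i = s_i * x_i + b_i per channel.
    Also returns log|det J| = T * sum_i log|s_i| for one example of length T."""
    y = [[s * v + b for v in channel] for channel, s, b in zip(x, scale, bias)]
    logdet = len(x[0]) * sum(math.log(abs(s)) for s in scale)
    return y, logdet

# Hypothetical initial batch: 2 examples, 2 channels, 4 time steps.
batch = [[[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]],
         [[0.0, 5.0, 2.0, 3.0], [11.0, 9.0, 10.0, 10.0]]]
scale, bias = actnorm_init(batch)
outputs = [actnorm_forward(example, scale, bias) for example in batch]
print(scale, bias, outputs[0][1])
```

After initialization the parameters are trained like any other weights, so the normalization no longer depends on batch statistics.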
2.4.2 Moving Batch Normalization
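Moving batch normalization (MBN) exploits running averages instead of the current batch statistics as given by

$$\mathbf{y}_i = \frac{\mathbf{x}_i - \mu_i}{\sigma_i},$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the running averages of the mean and standard deviation for each channel ($\in \mathbb{R}^{C}$), and the subscript $i$ denotes the $i$-th channel component of a variable. Similar to Eq. (13), the change in log-likelihood can be computed as

$$\Delta \log p = -T \sum_{i=1}^{C} \log \sigma_i.$$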
2.5 NODE Block
Similar to FloWaveNet (Kim et al., 2018), WaveNODE stacks several NODE blocks and factors out half of the feature channels at a few selected NODE blocks. The factored-out channels are assumed to be Gaussian, with mean and variance computed by a density estimation layer (DE layer) that takes the remaining channels as inputs. These statistics are used to evaluate the likelihood of the factored-out channels. The remaining channels are passed through deeper NODE blocks and are finally transformed into standard Gaussian noise.
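The factor-out step can be sketched as a channel split followed by a Gaussian log-likelihood evaluation. Everything specific below is an assumption for illustration: `toy_de_layer` is a hypothetical stand-in for the DE layer (the paper uses a small network), and which half of the channels is factored out is chosen arbitrarily.

```python
import math

def factor_out(x):
    """Split the channels in half: (factored-out half, remaining half)."""
    half = len(x) // 2
    return x[:half], x[half:]

def toy_de_layer(remaining):
    """Hypothetical stand-in for the DE layer: predicts a per-time-step mean
    from the remaining channels and a unit standard deviation (log_std = 0)."""
    length = len(remaining[0])
    mean = [sum(ch[t] for ch in remaining) / len(remaining) for t in range(length)]
    return mean, [0.0] * length

def gaussian_log_likelihood(values, means, log_stds):
    """Sum over elements of log N(v; mean, exp(log_std)^2)."""
    total = 0.0
    for v, m, ls in zip(values, means, log_stds):
        total += -0.5 * math.log(2 * math.pi) - ls - 0.5 * ((v - m) / math.exp(ls)) ** 2
    return total

# Hypothetical features: 4 channels, 3 time steps.
x = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, 0.0], [-0.3, 0.4, 0.2]]
factored, remaining = factor_out(x)
mean, log_std = toy_de_layer(remaining)
log_lik = sum(gaussian_log_likelihood(ch, mean, log_std) for ch in factored)
print(log_lik)
```

The log-likelihood of the factored-out channels is added to the training objective, while only the remaining channels continue through the deeper blocks.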
We compared WaveNODE with the Griffin-Lim algorithm (Griffin and Lim, 1984) and various neural vocoders. All the neural vocoder models were trained for 7 days on a single NVIDIA RTX 2080Ti GPU. For the subjective evaluation of audio fidelity, we performed a 5-scale mean opinion score (MOS) test with 33 audio examples per model and 27 participants. The audio examples were randomly selected from the test set. Each participant listened to the audio examples played in random order and evaluated the audio quality. Confidence intervals of MOS were calculated using the method proposed in Ribeiro et al. (2011). To encourage reproducibility, we provide the code for WaveNODE (https://github.com/ANLGBOY/WaveNODE/) and the audio samples used in the experiments (https://wavenode-example.github.io/). We also describe the configuration of the other vocoders in Appendix A.
WaveNODE has 4 NODE blocks, each of which consists of a CNF layer, an actnorm layer, and a squeeze layer with scale factor $s = 2$. For the CNF layer, WaveNODE employs a 4-layer non-causal WaveNet with kernel size 3, where the channels of the residual and skip connections are set to 128. Since WaveNODE stacks only a few NODE blocks, we set the base of dilation to 3 to increase the receptive field. At the end of the second NODE block, half of the feature channels are factored out and their likelihood is estimated via a DE layer with a 2-layer network. The initial scale factor is set to 4. For upsampling mel-spectrograms, a single layer of transposed 1D convolution is incorporated.
4.1 Audio Fidelity and Conditional Log-Likelihood
We report the results of model comparison on a 5-scale MOS and conditional log-likelihoods (CLL) in Table 1. In both the MOS and the CLL tests, the performance of the WaveNet vocoder was the best among all vocoders used in the experiments. Among the flow-based vocoder models, WaveGlow gained the highest scores for both MOS and CLL of 4.17 and 4.501, respectively. The MOS score of WaveNODE was between the scores of WaveGlow and FloWaveNet. Note that WaveGlow and FloWaveNet have a much larger number of parameters than WaveNODE. To verify that only WaveNODE is capable of generating high-fidelity audio with a few flow steps, we also evaluated the performance of the compressed models of WaveGlow and FloWaveNet. As shown in Table 1, the compact models of WaveGlow and FloWaveNet received relatively poor MOS scores of 1.75 and 1.22, respectively. The results demonstrate that the ability of WaveNODE to use constraint-free functions for flow transformation allows the implementation of a memory-efficient vocoder that is capable of generating high-fidelity waveforms.
4.2 Synthesis Speed
WaveNODE can control the synthesis speed by tuning the accuracy of the black-box ODE solver. When solving an ODE, the solver estimates its error and adjusts the step size to keep the error below a user-set tolerance. To study the effect of changing the solver accuracy, we tested the synthesis speed on a single RTX 2080Ti GPU. More specifically, we divided the total number of generated sample points by the total time, measured the number of function evaluations (NFE), and evaluated the audio quality while varying the tolerance, which had been held fixed during training. The middle and right graphs in Figure 3 show that the synthesis speed increased steadily as the tolerance was raised on a logarithmic scale. The audio quality, on the other hand, was little affected up to a certain tolerance and dropped sharply beyond it. The results suggest that WaveNODE can increase the sampling speed to some extent without seriously degrading the audio fidelity.
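The relation between solver tolerance and NFE can be illustrated with a small adaptive solver. This sketch swaps in a simple embedded Euler/Heun pair on a hypothetical scalar ODE; it is not the solver used in the experiments, and the dynamics, safety factor, and step-size limits are illustrative choices.

```python
import math

def adaptive_heun(f, z, t0, t1, tol):
    """Integrate dz/dt = f(z, t) with an embedded Euler/Heun pair and step-size control.
    Returns (z(t1), number of function evaluations)."""
    t, h, nfe = t0, (t1 - t0) / 10.0, 0
    while t < t1 - 1e-12:
        h = min(h, t1 - t)               # do not step past the end point
        k1 = f(z, t)
        k2 = f(z + h * k1, t + h)
        nfe += 2
        euler = z + h * k1               # first-order solution
        heun = z + 0.5 * h * (k1 + k2)   # second-order solution
        err = abs(heun - euler)          # local error estimate
        if err <= tol:                   # accept the step
            t, z = t + h, heun
        # Grow or shrink the step from the error estimate (safety factor 0.9).
        h *= min(5.0, max(0.2, 0.9 * math.sqrt(tol / max(err, 1e-16))))
    return z, nfe

def dynamics(z, t):
    """Hypothetical scalar dynamics standing in for the neural network f."""
    return -2.0 * z + math.sin(5.0 * t)

results = {tol: adaptive_heun(dynamics, 1.0, 0.0, 1.0, tol)
           for tol in (1e-6, 1e-4, 1e-2)}
for tol, (value, nfe) in results.items():
    print(tol, value, nfe)
```

Looser tolerances let the solver take larger steps, so fewer function evaluations are needed at the cost of a less accurate solution, which mirrors the speed and quality trade-off described above.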
To compare the synthesis speed of the various neural vocoders, we counted the number of samples generated per second and report the results in Table 2. The test tolerance of WaveNODE was held fixed for this comparison. While WaveNet achieved the best result in the MOS test, it showed the worst synthesis speed because it generates one sample point at a time (i.e., ancestral sampling). In contrast, WaveNODE generated 51K samples/sec, which is much faster than WaveNet owing to the parallel sampling process of the flow operation. This suggests that WaveNODE is capable of generating audio samples in real-time even though it has to solve complex ODEs in every flow operation. FloWaveNet and WaveGlow achieved the fastest sampling speeds since these models do not need to solve ODEs.
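Whether a given throughput counts as real-time depends on the audio sample rate. The short calculation below assumes a sample rate of 22,050 Hz, which is an assumption made here for illustration and is not stated in this section; a ratio above 1 means audio is generated faster than it plays back.

```python
def real_time_factor(samples_per_second, sample_rate_hz):
    """Ratio of generation throughput to playback rate (> 1 is faster than real-time)."""
    return samples_per_second / sample_rate_hz

ASSUMED_SAMPLE_RATE_HZ = 22050  # assumption, not taken from the text
rtf = real_time_factor(51_000, ASSUMED_SAMPLE_RATE_HZ)
print(round(rtf, 2))
```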
4.3 Type of Norm Layer
[Table: Model, Batch size, Epoch, Iteration, CLL, MOS, Training time]
In Table 3, we report the performance of WaveNODE models with different norm layers. The results show that a norm layer helps WaveNODE achieve good performance when adopting a multi-scale architecture, unlike WaveGlow, which can be fully trained without batch normalization. Popular CNF-based models often employ an MBN layer for stable training when constructing a deep architecture (Grathwohl et al., 2018; Yang et al., 2019). In our experiments, WaveNODE showed the best results in terms of both MOS and CLL when employing an actnorm layer rather than the MBN layer. This suggests that CNF-based models for speech data can be trained more efficiently by exploiting the actnorm layer.
4.4 Analysis of Training Progress
One of the major drawbacks shared by CNF-based models is the computational cost of the black-box ODE solver. The ODE solver creates a deep computational graph since it iteratively finds the solutions to complex ODEs specified by neural networks. Because of this large amount of computation, each mini-batch takes a long time to process in CNF-based models. Table 4 shows that the processing time per mini-batch in WaveNODE was significantly longer than in the conventional flow-based models. While WaveGlow went through 240 epochs in 7 days, WaveNODE completed only 46 epochs during the same period. Interestingly, WaveNODE nevertheless showed decent performance with fewer training steps. This suggests that WaveNODE is trained more efficiently per mini-batch due to the flexible functions used for flow transformation. We also tested whether reducing the batch size to increase the number of iterations could improve the performance of WaveNODE, but the resulting performance was degraded, as shown in Table 4.
It has been reported that the NFE in CNF-based models increases as training progresses (Chen et al., 2018; Grathwohl et al., 2018). We observed a similar phenomenon when training WaveNODE and report the overall trend of the NFE consumed for inference in Figure 4. The main reason for this phenomenon is that the ODEs become more complex as the model learns to estimate the conditional distribution of waveforms more accurately. Since the NFE directly affects the time taken to process a mini-batch, we plan to investigate how to prevent the NFE from increasing in future work.
4.5 Analysis of Dilation
Unlike the conventional flow-based models, WaveNODE is composed of only a few flow steps, which might result in a small receptive field. To capture the long-range temporal dependencies in audio signals, we set the base of dilation to 3 in WaveNODE for the previous experiments. To verify the effect of dilation, we trained the WaveNODE model with a dilation of $d^{i}$ at the $i$-th layer, where $d$ is the base of dilation, and evaluated the performance in terms of MOS and CLL. As shown in Table 5, the dilation was indeed critical to the quality of the audio samples generated by WaveNODE. We found that WaveNODE produces a trembling sound when the dilation is a multiple of 2 due to the narrow receptive field.
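The effect of the dilation base on the receptive field can be computed directly. The sketch below covers a single stack of dilated convolutions only, assumes that layer indices start at 0, and ignores the additional widening from squeeze layers and stacked blocks; the kernel size 3 and depth 4 follow the configuration described earlier.

```python
def receptive_field(kernel_size, num_layers, base):
    """Receptive field (in samples) of stacked dilated convolutions
    with dilation base**i at layer i, for i = 0 .. num_layers - 1."""
    return 1 + (kernel_size - 1) * sum(base ** i for i in range(num_layers))

for base in (1, 2, 3):
    print(base, receptive_field(kernel_size=3, num_layers=4, base=base))
# base 1 -> 9, base 2 -> 31, base 3 -> 81
```

With only four layers, moving from base 2 to base 3 widens the receptive field of one network from 31 to 81 samples.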
In this work, we presented a novel generative model, WaveNODE, which leverages a CNF for speech synthesis. We successfully applied the CNF framework to high-dimensional data (e.g., audio) without any additional loss term. In the experiments, we demonstrated that WaveNODE achieves comparable performance with fewer flow steps than the conventional flow-based models. We also verified that WaveNODE is able to synthesize audio samples in real-time owing to its parallel sampling process. We believe that the application of CNFs to speech synthesis can be further developed and refined to produce more realistic waveforms.
This work was supported by Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-IT1701-04.
- Chen et al. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
- Dinh et al. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
- Grathwohl et al. (2018). FFJORD: free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.
- Griffin and Lim (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243.
- Hutchinson (1990). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation 19 (2), pp. 433–450.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Ito (2017). The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/
- Kim et al. (2018). FloWaveNet: a generative flow for raw audio. arXiv preprint arXiv:1811.02155.
- Kingma and Dhariwal (2018). Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
- Kingma et al. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
- Oord et al. (2016). WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
- Oord et al. (2017). Parallel WaveNet: fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- Ping et al. (2018). ClariNet: parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.
- Ping et al. (2017). Deep Voice 3: scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654.
- Prenger et al. (2019). WaveGlow: a flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621.
- Ren et al. (2019). FastSpeech: fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263.
- Ribeiro et al. (2011). CrowdMOS: an approach for crowdsourcing mean opinion score studies. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2416–2419.
- Shen et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783.
- Vasquez and Lewis (2019). MelNet: a generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083.
- Yang et al. (2019). PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4541–4550.
Appendix A Model Configuration
The Griffin-Lim algorithm (Griffin and Lim, 1984) estimates the signal from its modified short-time Fourier transform (STFT) magnitude in an iterative way. For the experiments, we first approximated the STFT magnitude from the mel-spectrogram and applied the Griffin-Lim algorithm with 32 iterations for time-domain conversion.
We trained an autoregressive WaveNet whose output is a single Gaussian distribution. It has been shown that a single Gaussian WaveNet is capable of modeling raw waveforms without degradation compared to WaveNet models with a mixture distribution (Ping et al., 2018). We stacked 2 dilated residual blocks of 10 layers with kernel size 2 and set the number of hidden units in both residual and skip connections to 128. To upsample the mel-spectrograms from frame-level to sample-level resolution, we employed two layers of transposed 2D convolution with one leaky ReLU activation.
FloWaveNet (Kim et al., 2018) consists of 8 context blocks, each of which contains 6 flow operations. For the affine coupling layers, FloWaveNet employs a 2-layer non-causal WaveNet with kernel size 3. FloWaveNet uses 256 channels for residual and skip connections. Also, FloWaveNet factors out half of the feature channels after 4 context blocks and uses another 2-layer network to estimate the distribution of factored-out channels. FloWaveNet incorporates the same upsampling module described in Section A.2.
For the experiments, we trained the compact version of FloWaveNet as well as the original model. We used 4 context blocks with 3 flow operations and factored out half of the feature channels after 2 context blocks for the compact model.
WaveGlow (Prenger et al., 2019) is composed of 12 blocks, each of which contains an affine coupling layer and an invertible 1x1 convolution. WaveGlow employs 8-layer non-causal WaveNet networks for the affine coupling layers and one layer of transposed 1D convolution for upsampling mel-spectrograms. WaveGlow factors out 2 of the feature channels after every 4 blocks.
In addition to the original WaveGlow network, we also trained a compressed model for the experiments. For the compressed model, we stacked 9 blocks, used 4-layer networks, and factored out 2 of the feature channels after every 3 blocks.