1 Introduction
Modern end-to-end speech synthesis models mostly consist of two stages: (1) transforming character embeddings to acoustic features such as mel-spectrograms (Ping et al., 2017; Shen et al., 2018; Vasquez and Lewis, 2019; Ren et al., 2019), and (2) synthesizing time-domain waveforms from the derived acoustic features (Oord et al., 2016, 2017; Ping et al., 2018; Kim et al., 2018; Prenger et al., 2019). In various end-to-end speech synthesis models, the WaveNet vocoder conditioned on mel-spectrograms is employed for the second stage to generate high-fidelity raw audio (Ping et al., 2017; Shen et al., 2018). However, samples cannot be obtained in real time with the WaveNet vocoder due to its autoregressive nature. To enable fast sampling, flow-based generative models have recently attracted attention in the field of speech synthesis (Oord et al., 2017; Kim et al., 2018; Ping et al., 2018; Prenger et al., 2019).
In order to generate audio samples in real time, parallel WaveNet (Oord et al., 2017) and ClariNet (Ping et al., 2018) employ inverse autoregressive flow (IAF) (Kingma et al., 2016), which takes advantage of the inverse transformation of an autoregressive function. Although IAF allows a parallel sampling procedure, it is not suitable for training parallel WaveNet or ClariNet directly according to the maximum likelihood criterion. Instead, parallel WaveNet and ClariNet are trained through probability density distillation, which requires a well-trained teacher network. Additional hand-engineered objective functions are also needed to stabilize the training procedure and to produce high-quality audio.
FloWaveNet (Kim et al., 2018) and WaveGlow (Prenger et al., 2019) adopt an affine coupling layer, which was originally proposed in Real NVP (Dinh et al., 2016). The affine coupling layer provides a simple inverse transformation and a tractable Jacobian determinant. Unlike IAF-based flow models, both the inference and sampling processes are parallelizable, so these models can be trained according to the maximum likelihood criterion without any auxiliary loss terms. However, Real NVP-based models require a large number of flow steps to perform density estimation accurately, as the affine coupling layer is too simple and inflexible. In this respect, FloWaveNet and WaveGlow are inherently memory-inefficient models.
Recently, Chen et al. (2018) introduced a new technique in which the hidden units are assumed to vary continuously in time. In this framework, the continuous-time dynamics of the hidden units and their probability densities are described by deep neural networks. A continuous normalizing flow (CNF) is specified by these two dynamics using the instantaneous change of variables formula. Contrary to discrete normalizing flows, the CNF does not impose any restrictions on its architecture and allows the use of quite flexible functions for flow transformation, as shown in Figure 1. In light of these advantages of the CNF, we propose a novel generative flow for speech synthesis called WaveNODE. By replacing discrete normalizing flows with a CNF, WaveNODE generates high-fidelity waveforms from the corresponding mel-spectrograms with far fewer parameters than conventional flow-based models.

2 WaveNODE
We propose WaveNODE, which takes advantage of the CNF for speech synthesis. WaveNODE is capable of generating high-fidelity waveforms from mel-spectrograms with a few flow steps. Moreover, WaveNODE does not require a teacher network or additional loss terms for training. The overall structure of the proposed WaveNODE is shown in Figure 2. We describe the CNF and the hierarchical architecture of WaveNODE in the next subsections.
2.1 Continuous Normalizing Flow
In Neural ODEs (Chen et al., 2018), the continuous dynamics of time-varying hidden units $z(t)$ are parameterized using an ODE

\frac{\partial z(t)}{\partial t} = f(z(t), t; \theta)    (1)

where $f$ is implemented by a neural network. Here, $t$ represents the time step of solving the ODE, not the temporal axis of waveforms. Applying Eq. (1), the change in log-density of $z(t)$ follows a differential equation given by

\frac{\partial \log p(z(t))}{\partial t} = -\mathrm{Tr}\left(\frac{\partial f}{\partial z(t)}\right)    (2)
which is called the instantaneous change of variables formula. To avoid the $\mathcal{O}(D^2)$ cost of computing the trace operation, an unbiased estimate of Eq. (2) can be derived using Hutchinson's trace estimator (Hutchinson, 1990) as follows:

\frac{\partial \log p(z(t))}{\partial t} = -\mathbb{E}_{p(\epsilon)}\left[\epsilon^{\top} \frac{\partial f}{\partial z(t)} \epsilon\right]    (3)
where $\epsilon$ is a noise vector drawn from $p(\epsilon)$ such that $\mathbb{E}[\epsilon] = 0$ and $\mathrm{Cov}(\epsilon) = I$ (Grathwohl et al., 2018). Eq. (3) enables a fast computation of Eq. (2) since the vector-Jacobian product $\epsilon^{\top} \frac{\partial f}{\partial z(t)}$ can be efficiently calculated via deep learning libraries (e.g., PyTorch (Paszke et al., 2019)) without explicitly writing out the Jacobian matrix. By choosing $p(\epsilon)$ appropriately (e.g., a standard Gaussian or Rademacher distribution), we obtain an unbiased stochastic estimator of the dynamics of the log-likelihood with $\mathcal{O}(D)$ cost.

To train a generative model based on the CNF, we first assume that a latent variable $z$ follows a simple distribution $p(z)$. Next, let $z(t)$ denote the dynamics of the hidden state with the initial value $z(t_0) = x$. Given a datapoint $x$, the corresponding latent variable $z$ is obtained by solving the following ODE:
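As a concrete illustration of the estimator in Eq. (3), the following NumPy sketch estimates the trace of a Jacobian with Hutchinson's method, using Rademacher noise (one valid choice of $p(\epsilon)$). A linear map $f(z) = Az$ stands in for the neural network so the exact trace is available for comparison; all sizes are illustrative.

```python
# Hutchinson's trace estimator: tr(J) = E[eps^T J eps] for zero-mean,
# identity-covariance noise eps. Here J is the Jacobian of f(z) = A z,
# i.e., simply the matrix A, so the exact trace is known.
import numpy as np

rng = np.random.default_rng(0)
d = 8
A = rng.standard_normal((d, d))   # Jacobian of the toy dynamics f(z) = A z

def hutchinson_trace(jac_vec_prod, dim, n_samples=10000, rng=rng):
    total = 0.0
    for _ in range(n_samples):
        eps = rng.choice([-1.0, 1.0], size=dim)   # Rademacher: E[eps]=0, Cov=I
        total += eps @ jac_vec_prod(eps)          # eps^T (J eps), one vjp each
    return total / n_samples

est = hutchinson_trace(lambda v: A @ v, d)
exact = np.trace(A)
print(abs(est - exact))   # small stochastic error
```

In a CNF, `jac_vec_prod` would be computed by automatic differentiation (one backward pass per sample), which is what keeps the cost linear in the dimension.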
z = z(t_0) + \int_{t_0}^{t_1} f(z(t), t)\, dt    (4)

where the final value $z(t_1) = z$. The log-likelihood of $x$ can also be determined by solving another ODE

\log p(x) = \log p(z) + \int_{t_0}^{t_1} \mathbb{E}_{p(\epsilon)}\left[\epsilon^{\top} \frac{\partial f}{\partial z(t)} \epsilon\right] dt    (5)

where $\epsilon$ is a noise vector sampled from $\mathcal{N}(0, I)$. Since $z(t)$ varies along the vector field $f$, the sampling process can be performed by simply reversing the time interval in Eq. (4) as follows:

x = z + \int_{t_1}^{t_0} f(z(t), t)\, dt    (6)

Unlike conventional normalizing flows, $f$ is not required to be invertible or to have a tractable Jacobian. Hence, an arbitrary function can be employed for $f$.
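The forward/reverse symmetry of Eqs. (4) and (6) can be sketched numerically: the same vector field is integrated forward to encode a datapoint and backward to decode it, with no explicit inverse. In this toy NumPy example, linear dynamics and a fixed-step RK4 integrator stand in for the neural network and the black-box adaptive solver.

```python
# Encode x -> z by integrating dz/dt = f(z, t) from t0 to t1 (Eq. 4),
# then decode z -> x by integrating over the reversed interval (Eq. 6).
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = 0.5 * rng.standard_normal((d, d))
f = lambda z, t: A @ z            # toy dynamics; t is unused in the linear case

def odeint_rk4(f, z0, t0, t1, n_steps=200):
    """Classic fixed-step Runge-Kutta 4 integration from t0 to t1."""
    dt = (t1 - t0) / n_steps
    z, t = z0.copy(), t0
    for _ in range(n_steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = f(z + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = f(z + dt * k3, t + dt)
        z = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return z

x = rng.standard_normal(d)
z = odeint_rk4(f, x, 0.0, 1.0)       # inference: x -> z
x_rec = odeint_rk4(f, z, 1.0, 0.0)   # sampling: z -> x, same f, reversed time
print(np.max(np.abs(x - x_rec)))     # near zero: no explicit inverse needed
```

The recovery error here is only due to discretization; an adaptive solver controls it with a tolerance parameter instead of a fixed step count.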
2.2 CNF Layer
Since we are interested in retrieving a time-domain signal from its mel-spectrogram, we apply the CNF framework to estimate the conditional distribution of audio samples. Given a mel-spectrogram $c$ upsampled to the full time resolution, Eq. (1) and Eq. (2) can be extended to a conditional formulation as follows:

\frac{\partial z(t)}{\partial t} = f(z(t), c, t)    (7)

\frac{\partial \log p(z(t) \mid c)}{\partial t} = -\mathrm{Tr}\left(\frac{\partial f}{\partial z(t)}\right)    (8)
To capture the long-term dependency between audio samples, WaveNODE adopts a non-causal dilated convolutional network similar to the WaveGlow (Prenger et al., 2019) architecture for $f$. Let $W$ and $V$ be convolutional layers, and $U$ be a linear projection. The gated activation used for $f$ is defined as

h_{f} = W_{f} * z(t) + V_{f} * c + U_{f}\, t    (9)

h_{g} = W_{g} * z(t) + V_{g} * c + U_{g}\, t    (10)

z_{\mathrm{out}} = \tanh(h_{f}) \odot \sigma(h_{g})    (11)

where $*$ denotes a convolution operator, and the subscripts $f$ and $g$ denote filter and gate, respectively. Note that WaveNODE uses $t$ as a global condition via broadcasting and $c$ as a local condition. The CNF layer receives the initial value $z(t_0)$, the time interval $[t_0, t_1]$, and the condition $c$ as inputs. Using a black-box ODE solver, the CNF layer outputs the final value $z(t_1)$ and the change in log-likelihood.
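The gated activation in Eqs. (9)-(11) can be sketched in PyTorch as follows. This is a hedged illustration, not the paper's exact module: layer widths, the padding scheme, and how the scalar ODE time is projected and broadcast are all assumptions.

```python
# Gated dilated convolution conditioned on an upsampled mel-spectrogram c
# (local condition) and the ODE time t (global condition, broadcast over
# the temporal axis). W/V are convolutions, U a linear projection of t.
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    def __init__(self, channels, mel_channels, dilation):
        super().__init__()
        pad = dilation  # keeps length with kernel size 3 (non-causal)
        self.W = nn.Conv1d(channels, 2 * channels, 3, padding=pad, dilation=dilation)
        self.V = nn.Conv1d(mel_channels, 2 * channels, 1)
        self.U = nn.Linear(1, 2 * channels)

    def forward(self, z, c, t):
        # h = W*z + V*c + U(t); filter/gate halves stacked along channels
        h = self.W(z) + self.V(c) + self.U(t.view(1, 1)).view(1, -1, 1)
        h_f, h_g = h.chunk(2, dim=1)
        return torch.tanh(h_f) * torch.sigmoid(h_g)   # Eq. (11)

layer = GatedConv(channels=16, mel_channels=80, dilation=3)
z = torch.randn(1, 16, 64)            # (batch, channels, time)
c = torch.randn(1, 80, 64)            # mel-spectrogram upsampled to full rate
out = layer(z, c, torch.tensor(0.5))  # t broadcast as a global condition
print(out.shape)                      # torch.Size([1, 16, 64])
```

Inside a CNF layer, a stack of such gated convolutions would be evaluated repeatedly by the ODE solver at different values of t.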
2.3 Squeeze Layer
A squeeze layer rearranges an input tensor of shape $(C, T)$ into an output tensor of shape $(sC, T/s)$, where $s$ is a scale factor. By increasing the number of channels by a factor of $s$, the squeeze layer enlarges the receptive field of the convolutional networks linearly. This also helps each NODE block focus on different temporal dependencies.

Since speech signals have a very high temporal resolution, it may not be desirable to directly use a mono audio signal of shape $(1, T)$ as input to a convolutional network. To deal with this, WaveNODE employs an initial squeeze layer to transform an input tensor of shape $(1, T)$ into an output tensor of shape $(h, T/h)$ given an initial scale factor $h$. This is equivalent to using a squeeze layer with scale factor $h$ in the first NODE block.
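The squeeze operation above is a pure reshape. A minimal NumPy sketch, where the exact channel ordering is an implementation choice rather than something specified by the paper:

```python
# Squeeze: fold s consecutive time steps into the channel axis,
# (C, T) -> (sC, T/s); unsqueeze is the exact inverse.
import numpy as np

def squeeze(x, s):
    c, t = x.shape
    assert t % s == 0, "time length must be divisible by the scale factor"
    # (C, T) -> (C, T/s, s) -> (s, C, T/s) -> (sC, T/s)
    return x.reshape(c, t // s, s).transpose(2, 0, 1).reshape(s * c, t // s)

def unsqueeze(y, s):
    sc, t = y.shape
    c = sc // s
    # (sC, T/s) -> (s, C, T/s) -> (C, T/s, s) -> (C, T)
    return y.reshape(s, c, t).transpose(1, 2, 0).reshape(c, t * s)

x = np.arange(24.0).reshape(2, 12)   # (C=2, T=12)
y = squeeze(x, 2)                    # (4, 6)
x_rec = unsqueeze(y, 2)              # round-trips exactly
print(y.shape)
```

Because it is lossless and invertible, the squeeze layer contributes no change in log-likelihood.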
2.4 Norm Layer
WaveNODE incorporates a norm layer to alleviate the difficulties that arise when training deep neural networks. In this work, we consider two variants of batch normalization (Ioffe and Szegedy, 2015).

2.4.1 Actnorm
With trainable parameters $s$ and $b$ ($s, b \in \mathbb{R}^{C}$), actnorm (Kingma and Dhariwal, 2018) performs a per-channel affine transformation on $z$:

z'_{i} = s_{i} \odot z_{i} + b_{i}    (12)

where $s_{i}$ and $b_{i}$ are the $i$-th elements of $s$ and $b$, respectively, and $z_{i}$ is the $i$-th channel vector of $z$. The parameters $s$ and $b$ are initialized to normalize the pre-actnorm activations given an arbitrary initial batch (i.e., data-dependent initialization). The change in log-likelihood obtained by passing through the actnorm layer can be computed as

\Delta \log p = T \sum_{i} \log |s_{i}|    (13)
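A small NumPy sketch of the actnorm idea in Eqs. (12)-(13), including the data-dependent initialization; the class shape and exact initialization details are assumptions for illustration:

```python
# Actnorm: per-channel scale/bias, initialized from a first batch so the
# post-layer activations start with zero mean and unit variance per channel.
import numpy as np

class ActNorm:
    def __init__(self, first_batch):                  # first_batch: (C, T)
        mu = first_batch.mean(axis=1, keepdims=True)
        std = first_batch.std(axis=1, keepdims=True)
        self.s = 1.0 / std       # trainable after initialization
        self.b = -mu / std

    def forward(self, z):                             # z: (C, T)
        z_out = self.s * z + self.b                   # Eq. (12)
        log_det = z.shape[1] * np.sum(np.log(np.abs(self.s)))  # Eq. (13)
        return z_out, log_det

rng = np.random.default_rng(2)
batch = 3.0 * rng.standard_normal((8, 100)) + 1.0
norm = ActNorm(batch)
out, log_det = norm.forward(batch)
print(out.mean(), out.std())   # approximately 0 and 1 right after init
```

After initialization, `s` and `b` are updated by gradient descent like any other parameters; only the first batch is treated specially.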
2.4.2 Moving Batch Normalization
Moving batch normalization (MBN) exploits running averages instead of the current batch statistics, as given by

z'_{i} = \frac{z_{i} - \tilde{\mu}_{i}}{\tilde{\sigma}_{i}}    (14)

where $\tilde{\mu}$ and $\tilde{\sigma}$ are the running averages of the mean and standard deviation for each channel ($\tilde{\mu}, \tilde{\sigma} \in \mathbb{R}^{C}$), and the subscript $i$ denotes the $i$-th channel component. Similar to Eq. (13), the change in log-likelihood can be computed as

\Delta \log p = -T \sum_{i} \log \tilde{\sigma}_{i}    (15)
2.5 NODE Block
The NODE block is the primary component of WaveNODE; it consists of a squeeze layer, a norm layer, and a CNF layer, as shown in Figure 2. Similar to FloWaveNet (Kim et al., 2018), WaveNODE stacks several NODE blocks and factors out half of the feature channels at a selected few NODE blocks. The factored-out channels are assumed to be Gaussian, whose mean and variance are computed via a density estimation layer (DE layer) using the remaining channels as inputs. These statistics are used to evaluate the likelihood of the factored-out channels. The remaining channels are passed through deeper NODE blocks and are ultimately transformed into standard Gaussian noise.
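The factor-out step can be sketched as follows. This NumPy illustration uses a fixed toy function in place of the learned 2-layer DE network, and the diagonal-Gaussian parameterization (mean and log-variance) is an assumption:

```python
# Multi-scale factor-out: split off half the channels, score them under a
# diagonal Gaussian whose parameters are predicted from the kept channels.
import numpy as np

def factor_out(z, de_layer):
    c = z.shape[0] // 2
    z_keep, z_out = z[:c], z[c:]
    mean, log_var = de_layer(z_keep)   # "DE layer": statistics from kept half
    # diagonal Gaussian log-likelihood of the factored-out channels
    ll = -0.5 * np.sum(log_var + np.log(2 * np.pi)
                       + (z_out - mean) ** 2 / np.exp(log_var))
    return z_keep, ll

toy_de = lambda h: (0.1 * h, np.zeros_like(h))   # stand-in for a 2-layer net
z = np.random.default_rng(3).standard_normal((8, 32))
z_keep, ll = factor_out(z, toy_de)
print(z_keep.shape)
```

The accumulated log-likelihoods of all factored-out channels are added to the likelihood of the final standard-Gaussian channels to form the training objective.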
3 Experiments
In order to evaluate the performance of WaveNODE, we conducted a set of experiments using the LJ Speech dataset (Ito, 2017) with the Griffin-Lim algorithm (Griffin and Lim, 1984) and various neural vocoders. All the neural vocoder models were trained for 7 days on a single NVIDIA RTX 2080 Ti GPU. For the subjective evaluation of audio fidelity, we performed a 5-scale mean opinion score (MOS) test with 33 audio examples per model and 27 participants. The audio examples were randomly selected from the test set. Each participant listened to the audio examples played in random order and evaluated the audio quality. Confidence intervals of the MOS were calculated using the method proposed in Ribeiro et al. (2011). To encourage reproducibility, we release the code for WaveNODE (https://github.com/ANLGBOY/WaveNODE/) and the audio samples used in the experiments (https://wavenodeexample.github.io/). We also describe the configuration of the other vocoders in Appendix A.

3.1 WaveNODE
WaveNODE has 4 NODE blocks, each of which consists of a CNF layer, an actnorm layer, and a squeeze layer with scale factor 2. For the CNF layer, WaveNODE employs a 4-layer non-causal WaveNet with kernel size 3, where the channels of the residual and skip connections are set to 128. Since WaveNODE stacks only a few NODE blocks, we set the base of dilation to 3 to increase the receptive field. At the end of the second NODE block, half of the feature channels are factored out and their likelihood is estimated via a DE layer with a 2-layer network. The initial scale factor is set to 4. For upsampling mel-spectrograms, a single layer of transposed 1D convolution is incorporated.
4 Results
4.1 Audio Fidelity and Conditional LogLikelihood
Model         #Params   CLL     MOS
Ground Truth  -         -       4.84±0.06
Griffin-Lim   -         -       2.82±0.26
WaveNet       4.8M      4.616   4.48±0.16
WaveGlow      87.9M     4.501   4.17±0.15
WaveGlow      17.1M     4.366   1.75±0.20
FloWaveNet    182.6M    4.449   2.99±0.19
FloWaveNet    18.6M     4.249   1.22±0.18
WaveNODE      16.2M     4.497   3.53±0.18
We report the results of the model comparison on the 5-scale MOS and conditional log-likelihood (CLL) in Table 1. In both the MOS and CLL tests, the performance of the WaveNet vocoder was the best among all vocoders used in the experiments. Among the flow-based vocoder models, WaveGlow attained the highest MOS and CLL scores of 4.17 and 4.501, respectively. The MOS score of WaveNODE fell between those of WaveGlow and FloWaveNet. Note that WaveGlow and FloWaveNet have far more parameters than WaveNODE. To verify that only WaveNODE is capable of generating high-fidelity audio with a few flow steps, we also evaluated the performance of compressed versions of WaveGlow and FloWaveNet. As shown in Table 1, the compact models of WaveGlow and FloWaveNet received relatively poor MOS scores of 1.75 and 1.22, respectively. These results demonstrate that WaveNODE's ability to use constraint-free functions for flow transformation allows the implementation of a memory-efficient vocoder capable of generating high-fidelity waveforms.
4.2 Synthesis Speed
Model        Samples/sec
WaveNet      56
WaveGlow     328,690
FloWaveNet   320,062
WaveNODE     51,045
WaveNODE is able to control the synthesis speed by tuning the accuracy of the black-box ODE solver. When solving an ODE, the solver predicts the errors and adjusts its step size to keep the errors below a user-set tolerance. To study the effect of changing the accuracy of the ODE solver, we tested the synthesis speed on a single RTX 2080 Ti GPU. More specifically, we divided the total number of generated sample points by the total time, measured the number of function evaluations (NFE), and evaluated the audio quality while modifying the tolerance, which had been fixed during training. The middle and right graphs in Figure 3 show that the synthesis speed increased steadily as higher tolerances were allowed on a logarithmic scale. On the other hand, the audio quality was little affected up to a certain tolerance, but dropped sharply once the tolerance was raised beyond that point. These results suggest that WaveNODE can increase the sampling speed to some extent without seriously degrading the audio fidelity.
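The tolerance/NFE trade-off described above can be illustrated with a toy adaptive integrator. This NumPy sketch uses simple step-doubling Euler on linear dynamics, not the actual black-box solver used in the paper, so the numbers are only qualitative:

```python
# Adaptive step-size control: accept a step when the estimated local error
# is below the tolerance; a looser tolerance yields larger steps and fewer
# function evaluations (NFE).
import numpy as np

def adaptive_euler(f, z0, t0, t1, tol):
    z, t, dt, nfe = z0, t0, (t1 - t0) / 100, 0
    while t1 - t > 1e-12:
        dt = min(dt, t1 - t)
        k = f(z); nfe += 1
        full = z + dt * k                  # one Euler step of size dt
        half = z + dt / 2 * k              # two Euler steps of size dt/2
        half = half + dt / 2 * f(half); nfe += 1
        err = np.max(np.abs(full - half))  # step-doubling error estimate
        if err < tol:
            z, t = half, t + dt            # accept and grow the step
            dt *= 1.5
        else:
            dt *= 0.5                      # reject and retry with smaller step
    return z, nfe

f = lambda z: -z                           # simple decaying dynamics
z0 = np.array([1.0])
_, nfe_tight = adaptive_euler(f, z0, 0.0, 5.0, tol=1e-6)
_, nfe_loose = adaptive_euler(f, z0, 0.0, 5.0, tol=1e-2)
print(nfe_tight, nfe_loose)                # tighter tolerance costs more NFE
```

Raising the tolerance at sampling time therefore trades solver accuracy (and, past some point, audio fidelity) for speed, exactly the knob explored in Figure 3.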
To compare the synthesis speed of the various neural vocoders, we counted the number of samples generated per second and report the results in Table 2. The test tolerance of WaveNODE was held fixed for this comparison. While WaveNet achieved the best result in the MOS test, it showed the worst synthesis speed since it generates one sample point at a time (i.e., ancestral sampling). On the other hand, WaveNODE generated 51K samples per second, far faster than WaveNet, owing to the parallel sampling process of the flow operation. This suggests that WaveNODE is capable of generating audio samples in real time even though it has to solve complex ODEs in every flow operation. The sampling speeds of FloWaveNet and WaveGlow were the fastest since these models do not need to solve ODEs.
4.3 Type of Norm Layer
Model      Norm layer   CLL     MOS
WaveNODE   Actnorm      4.497   3.53±0.18
WaveNODE   MBN          4.460   3.36±0.17
WaveNODE   None         4.457   3.22±0.17
Model        Batch size   Epoch   Iteration   CLL     MOS         Training time
WaveGlow     8            240     354K        4.501   4.17±0.15   7 days
FloWaveNet   2            138     814K        4.449   2.99±0.19   7 days
WaveNODE     20           46      27K         4.497   3.53±0.18   7 days
WaveNODE     4            26      77K         4.452   2.91±0.17   7 days
In Table 3, we report the performance of WaveNODE models with different norm layers. The results show that a norm layer helps WaveNODE achieve good performance when adopting a multi-scale architecture, unlike WaveGlow, which can be fully trained without batch normalization. Popular CNF-based models often employ an MBN layer for stable training when constructing a deep architecture (Grathwohl et al., 2018; Yang et al., 2019). In our experiments, WaveNODE showed the best results in terms of both MOS and CLL when employing an actnorm layer rather than an MBN layer. This implies that CNF-based models for speech data can be trained more efficiently by exploiting the actnorm layer.
4.4 Analysis of Training Progress
One of the major drawbacks shared by CNF-based models is the computational cost of the black-box ODE solver. The ODE solver creates a deep computational graph since it iteratively finds the solutions to complex ODEs specified by neural networks. Due to this large amount of computation, minibatches in CNF-based models take a long time to process. Table 4 shows that the processing time per minibatch in WaveNODE was significantly longer than in the conventional flow-based models. While WaveGlow went through 240 epochs in 7 days, WaveNODE performed only 46 epochs during the same period. Interestingly, WaveNODE nonetheless showed decent performance with fewer training steps. This implies that WaveNODE is trained more efficiently per minibatch, owing to the flexible functions used for flow transformation. We also tested whether reducing the batch size to increase the number of iterations could improve the performance of WaveNODE, but the resulting performance was degraded, as shown in Table 4.
It has been reported that the NFE in CNF-based models increases as training progresses (Chen et al., 2018; Grathwohl et al., 2018). We observed a similar phenomenon when training WaveNODE and report the overall trend of the NFE consumed during inference in Figure 4. The main reason for this phenomenon is that the ODEs become more complicated as the model learns to estimate the conditional distribution of waveforms accurately. Since the NFE directly affects the time taken to process a minibatch, we plan to investigate how to prevent the increase of NFE in future work.
4.5 Analysis of Dilation
Model      Dilation base   CLL     MOS
WaveNODE   3               4.497   3.53±0.18
WaveNODE   2               4.408   3.17±0.17
WaveNODE is composed of only a few flow steps, unlike conventional flow-based models, which might result in a small receptive field. In order to capture the long-range temporal dependencies in audio signals, we set the base of dilation to 3 in WaveNODE for the previous experiments. To verify the effect of dilation, we also trained a WaveNODE model with a dilation of $2^{i}$ at the $i$-th layer and evaluated its performance in terms of MOS and CLL. Indeed, the dilation was critical to the quality of audio samples generated by WaveNODE, as shown in Table 5. We found that WaveNODE produces a trembling sound when the dilation is a multiple of 2 due to the narrow receptive field.
5 Conclusion
In this work, we presented a novel generative model, WaveNODE, which leverages a CNF for speech synthesis. We successfully applied the CNF framework to high-dimensional data (i.e., audio) without any additional loss terms. In the experiments, we demonstrated that WaveNODE achieves comparable performance with fewer flow steps than conventional flow-based models. We also verified that WaveNODE is able to synthesize audio samples in real time thanks to its parallel sampling process. We believe that applying CNFs to speech synthesis can be further developed and refined to produce more realistic waveforms.
Acknowledgments
This work was supported by Samsung Research Funding Center of Samsung Electronics under Project Number SRFCIT170104.
References
R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583.
L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud (2018). FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.
D. Griffin and J. Lim (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), pp. 236–243.
M. F. Hutchinson (1990). A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics – Simulation and Computation, 19(2), pp. 433–450.
S. Ioffe and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
K. Ito (2017). The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/
S. Kim, S. Lee, J. Song, J. Kim, and S. Yoon (2018). FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155.
D. P. Kingma and P. Dhariwal (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224.
D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
A. van den Oord et al. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
A. van den Oord et al. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
A. Paszke et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
W. Ping, K. Peng, and J. Chen (2018). ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.
W. Ping et al. (2017). Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654.
R. Prenger, R. Valle, and B. Catanzaro (2019). WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621.
Y. Ren et al. (2019). FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263.
F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer (2011). CrowdMOS: An approach for crowdsourcing mean opinion score studies. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2416–2419.
J. Shen et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783.
S. Vasquez and M. Lewis (2019). MelNet: A generative model for audio in the frequency domain. arXiv preprint arXiv:1906.01083.
G. Yang et al. (2019). PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4541–4550.
Appendix A Model Configuration
A.1 Griffin-Lim
The Griffin-Lim algorithm (Griffin and Lim, 1984) estimates a signal from its modified short-time Fourier transform (STFT) magnitude in an iterative way. For the experiments, we first approximated the STFT magnitude from the mel-spectrogram and then applied the Griffin-Lim algorithm with 32 iterations for time-domain conversion.
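The Griffin-Lim iteration can be sketched with SciPy as below. This is a minimal illustration, not the paper's implementation: it starts directly from an STFT magnitude (the mel-to-linear approximation step is omitted), and the window/hop settings are assumptions.

```python
# Griffin-Lim: alternately enforce the known STFT magnitude and re-project
# through ISTFT/STFT until the phase estimate becomes self-consistent.
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, fs=16000, nperseg=256):
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, fs=fs, nperseg=nperseg)
        _, _, spec = stft(x, fs=fs, nperseg=nperseg)
        n = min(mag.shape[1], spec.shape[1])   # guard frame-count mismatch
        new_phase = np.ones(mag.shape, dtype=complex)
        new_phase[:, :n] = np.exp(1j * np.angle(spec[:, :n]))
        phase = new_phase
    _, x = istft(mag * phase, fs=fs, nperseg=nperseg)
    return x

fs = 16000
t = np.arange(fs // 4) / fs                       # 0.25 s test tone
_, _, spec = stft(np.sin(2 * np.pi * 440 * t), fs=fs, nperseg=256)
rec = griffin_lim(np.abs(spec), n_iter=32)        # waveform from magnitude only
print(rec.shape)
```

More iterations reduce the spectral inconsistency of the estimated phase, at the cost of proportionally more ISTFT/STFT passes.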
A.2 WaveNet
We trained an autoregressive WaveNet whose output is a single Gaussian distribution. It has been shown that a single-Gaussian WaveNet is capable of modeling raw waveforms without degradation compared to WaveNet models with mixture distributions (Ping et al., 2018). We stacked 2 dilated residual blocks of 10 layers with kernel size 2 and set the number of hidden units in both the residual and skip connections to 128. To upsample the mel-spectrograms from frame-level to sample-level resolution, we employed two layers of transposed 2D convolution with one leaky ReLU activation.
A.3 FloWaveNet
FloWaveNet (Kim et al., 2018) consists of 8 context blocks, each of which contains 6 flow operations. For the affine coupling layers, FloWaveNet employs a 2-layer non-causal WaveNet with kernel size 3 and uses 256 channels for the residual and skip connections. FloWaveNet also factors out half of the feature channels after 4 context blocks and uses another 2-layer network to estimate the distribution of the factored-out channels. FloWaveNet incorporates the same upsampling module described in Section A.2.

For the experiments, we trained a compact version of FloWaveNet as well as the original model. For the compact model, we used 4 context blocks with 3 flow operations each and factored out half of the feature channels after 2 context blocks.
A.4 WaveGlow
WaveGlow (Prenger et al., 2019) is composed of 12 blocks, each of which contains an affine coupling layer and an invertible 1x1 convolution. WaveGlow employs 8-layer non-causal WaveNet networks for the affine coupling layers and one layer of transposed 1D convolution for upsampling mel-spectrograms. WaveGlow factors out 2 of the feature channels after every 4 blocks.

In addition to the original WaveGlow network, we also trained a compressed model for the experiments. For the compressed model, we stacked 9 blocks, used 4-layer networks, and factored out 2 of the feature channels after every 3 blocks.