Speech manipulation algorithms that modify the fundamental frequency and duration of speech are essential for a variety of speech editing applications, such as audio-visual synchronization, prosody editing, auto-tuning, and voice conversion. General-purpose audio editing software such as Pro Tools and Adobe Audition contains algorithms for pitch-shifting. However, these algorithms can alter the timbre of speech so that it sounds unnatural. This motivates the development of natural-sounding pitch-shifting algorithms tailored to speech.
Related speech manipulation algorithms include both digital signal processing (DSP) approaches and neural networks. DSP-based approaches include TD-PSOLA, WORLD, and STRAIGHT. These methods benefit from fast inference and accurate control, but often degrade the signal with noticeable artifacts. Prior studies have shown TD-PSOLA to be preferable to WORLD for speech manipulation, and WORLD has been shown to be significantly preferable to STRAIGHT for speech resynthesis.
Prior methods in neural pitch-shifting include Pitch-Shifting WaveNet (PS-WaveNet), Quasi-Periodic Parallel WaveGAN (QP-PWG), Unified Source-Filter GAN (uSFGAN), and Hider-Finder-Combiner (HFC). PS-WaveNet is too computationally expensive for our use case of real-time interactive editing. QP-PWG and uSFGAN exhibit constant-ratio pitch-shifting quality on par with WORLD, but with worse accuracy. HFC demonstrates variable-ratio pitch-shifting performance with worse accuracy than WORLD, and its subjective quality is significantly degraded by noise induced during vocoding. Only QP-PWG and uSFGAN support multiple speakers, but neither demonstrates an ability to perform variable-ratio pitch-shifting. Further, none of these works propose methods for time-stretching. In addition, our results indicate that WORLD can perform substantially more accurate pitch-shifting than previous studies have reported [13, 31, 34, 30]. We hypothesize that prior methods performed improper interpolation of WORLD parameters or evaluated pitch error in unvoiced regions.
Neural vocoders are deep neural networks that convert acoustic features (e.g., a mel-spectrogram) to a waveform. Using a neural vocoder, we can perform speech manipulation by encoding speech audio as acoustic features, modifying these acoustic features, and then vocoding to produce a new waveform. Recent neural vocoders include WaveGlow, Parallel WaveGAN, Neural Source Filter (NSF), and LPCNet. None of these methods address pitch-shifting or time-stretching except LPCNet, which has been informally shown to be able to perform time-stretching. However, no evaluation of time-stretching performance is provided. LPCNet resembles a source-filter model, which decouples the residual (pitch and noise) and spectral (timbre) structure. Source-filter models are usually capable of pitch-shifting speech, exhibiting more natural timbre than traditional phase vocoders. However, our work demonstrates that, without modification, LPCNet does not perform accurate pitch-shifting (Figure 1). We hypothesize this is due to three issues: (1) limitations in the pitch representation used in LPCNet, (2) insufficient disentanglement between pitch and acoustic features, and (3) a lack of training data for very high- and low-pitched speech. Kons et al. sidestep these limitations by generating the input parameters using a separate neural network. However, their approach necessitates training multiple neural networks and does not generalize to unseen speakers without speaker adaptation.
Rather than sidestepping these limitations, as Kons et al. do, we directly address them to create Controllable LPCNet (CLPCNet), which significantly improves the synthesis quality and pitch-shifting performance of LPCNet. In our objective evaluation, we show that CLPCNet performs both constant- and variable-ratio pitch-shifting and time-stretching with high accuracy on unseen speakers and datasets. In our subjective evaluation, we show that the quality of pitch-shifting and time-stretching with CLPCNet meets or exceeds competitive DSP-based methods. CLPCNet also substantially improves the quality of speech vocoding compared to LPCNet, and permits simultaneous speech coding and speech manipulation. Code is available under an open-source license at https://github.com/maxrmorrison/clpcnet. Audio examples are available at https://main.d3ee4zjxcj59ad.amplifyapp.com/.
LPCNet is a neural vocoder that models each sample of a speech signal as the sum of a deterministic term (the prediction) and a stochastic term (the excitation). The prediction is computed via linear predictive coding (LPC), where LPC coefficients are derived from Bark-frequency cepstral coefficients (BFCCs). LPCNet autoregressively predicts the parameters of a categorical distribution over 8-bit mu-law-encoded excitation values.
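The decomposition of a sample into prediction plus excitation can be illustrated with a minimal sketch. It assumes the textbook mu-law companding formula with mu = 255 and a simple mapping of the companded value onto integer codes 0 to 255; the actual LPCNet implementation may scale and round differently. The function names and the example LPC coefficients are hypothetical.

```python
import math

MU = 255  # 8-bit mu-law companding constant


def mulaw_encode(x):
    """Map a sample in [-1, 1] to an integer code in [0, 255] (textbook form)."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((compressed + 1.0) / 2.0 * MU))


def mulaw_decode(code):
    """Invert mulaw_encode up to quantization error."""
    compressed = 2.0 * code / MU - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(MU)) / MU
    return math.copysign(magnitude, compressed)


def lpc_prediction(history, coefficients):
    """Linear prediction sum_k a_k * s[t - k]; history[-1] is the newest sample."""
    return sum(a * s for a, s in zip(coefficients, reversed(history)))


# One synthesis step: sample = prediction (deterministic) + excitation (stochastic).
history = [0.1, 0.2]          # previous samples (illustrative values)
coefficients = [1.8, -0.9]    # illustrative LPC coefficients, not from the paper
excitation_code = mulaw_encode(0.05)  # stands in for a code sampled by the network
sample = lpc_prediction(history, coefficients) + mulaw_decode(excitation_code)
```

Because the network only has to model the excitation, the 256 mu-law codes cover a much smaller dynamic range than the raw waveform would require.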
LPCNet consists of two subnetworks: the frame-rate and sample-rate networks. The frame-rate network consists of a pitch embedding layer followed by two 1D convolution layers with tanh activations and two dense layers with tanh activations. The sample-rate network consists of an embedding layer for sample-rate features followed by two gated recurrent units (GRUs) with sigmoid and softmax activations, respectively. The frame-rate network takes as input the YIN pitch, pitch correlation (henceforth referred to as periodicity), and 18-dimensional BFCCs with a hop size of 10 milliseconds and produces a 128-dimensional embedding for each frame. The sample-rate network takes four inputs: (1) the previously generated excitation, (2) the previous sample value, (3) the current prediction value (see previous paragraph), and (4) the output of the last layer of the frame-rate network after nearest-neighbor upsampling.
The time resolutions of the sample-rate and frame-rate networks are related by an upsampling factor $k$ (e.g., $k = 160$ for a hop size of 10 milliseconds at 16 kHz); for every frame processed by the frame-rate network, the sample-rate network produces $k$ samples without overlap between frames. LPCNet can perform time-stretching by varying the hop size on a per-frame basis. For example, if a phoneme is spoken for 100 milliseconds (10 frames), we can stretch the phoneme to 200 milliseconds by decoding twice as many samples from each frame.
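The per-frame hop size idea can be sketched as a small bookkeeping routine. This is an illustration only: the function name is hypothetical, the ratio is assumed to mean output duration divided by input duration, and the default of 160 samples per frame assumes a 10 millisecond hop at 16 kHz.

```python
def samples_per_frame(ratios, hop_size=160):
    """Return how many samples to decode from each frame.

    ratios[i] is the stretch ratio of frame i (assumed convention:
    output duration / input duration). Fractional sample counts are
    accumulated so that rounding error does not drift over time.
    """
    counts = []
    exact_total = 0.0  # ideal (fractional) number of samples so far
    produced = 0       # integer number of samples assigned so far
    for ratio in ratios:
        exact_total += hop_size * ratio
        target = round(exact_total)
        counts.append(target - produced)
        produced = target
    return counts


# Stretch a 100 ms phoneme (10 frames) to 200 ms.
stretched = samples_per_frame([2.0] * 10)
print(sum(stretched) / 16000)  # duration in seconds -> 0.2
```

Passing a different ratio for each frame gives variable-ratio time-stretching with the same routine.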
3 Controllable LPCNet
While LPCNet achieves competitive audio quality and time-stretching performance, it is unable to perform accurate pitch-shifting (see Figure 1). Below, we elaborate on our hypothesis of the three issues prohibiting pitch-shifting in LPCNet (see Section 1) and propose solutions to these issues. In addition, we propose a simplification of the sampling procedure of LPCNet (see Section 3.3).
3.1 Pitch representation
We identify two issues with the pitch representation used in LPCNet. First, pitch values are encoded as the number of samples per period. This design makes pitch bins perceptually uneven; higher frequencies are coarsely sampled, with some bin widths exceeding 50 cents. Given 8-bit quantization at a sample rate of 16 kHz, the minimum representable frequency is 63 Hz, which prohibits modeling very low-pitched voices. We propose a quantization of the frequency range 50-550 Hz that is equally spaced in base-2 log-scale, which makes the width of each bin 16.3 cents.
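The bin widths quoted above can be checked with a short calculation. The sketch below assumes 256 bins whose first and last centers sit at 50 Hz and 550 Hz (255 intervals), which reproduces the 16.3 cent figure; the exact bin placement in CLPCNet may differ, and the function names are hypothetical.

```python
import math

FMIN, FMAX, BINS = 50.0, 550.0, 256  # assumed: bin centers span 50-550 Hz


def cents(f_high, f_low):
    """Interval between two frequencies in cents."""
    return 1200.0 * math.log2(f_high / f_low)


def log_bin_width_cents():
    """Width of each bin when bins are equally spaced in log2 frequency."""
    return cents(FMAX, FMIN) / (BINS - 1)


def pitch_to_bin(frequency):
    """Quantize a frequency in Hz to the nearest log-spaced bin index."""
    frequency = min(max(frequency, FMIN), FMAX)
    position = math.log2(frequency / FMIN) / math.log2(FMAX / FMIN)
    return round((BINS - 1) * position)


def bin_to_pitch(index):
    """Center frequency in Hz of a log-spaced bin."""
    return FMIN * (FMAX / FMIN) ** (index / (BINS - 1))


def period_bin_width_cents(period):
    """Width in cents between periods of `period` and `period + 1` samples."""
    return cents(period + 1, period)


print(round(log_bin_width_cents(), 1))         # 16.3
print(round(period_bin_width_cents(29), 1))    # 58.7, near 550 Hz at 16 kHz
print(round(16000 / 255))                      # 63, lowest frequency with an 8-bit period
```

In the period-based representation the bin width grows with frequency, whereas the log-spaced bins have the same perceptual width everywhere in the range.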
Second, the YIN pitch and periodicity exhibit significant noise, which harms the performance of LPCNet. Therefore, we use CREPE (specifically torchcrepe) to extract the pitch and periodicity. CREPE outputs a distribution over quantized pitch values over time. We apply Viterbi decoding to extract a smooth pitch trajectory, which reduces half and double frequency errors. We dither the extracted pitch with random noise drawn from a triangular distribution centered at zero, with width equal to two CREPE pitch bins (i.e., 40 cents). This reduces quantization error without increasing the noise floor. Our CREPE periodicity measure is the sequence of probabilities associated with the pitch bins selected by Viterbi decoding. CREPE normalizes each frame of input audio, making it invariant to amplitude. This causes low-bit noise to be labeled as periodic during silent regions. We avoid this by setting the periodicity to zero in frames where the A-weighted loudness is less than -60 dB, relative to a reference of 20 dB. Our periodicity measure has a correlation of .82 with the periodicity measure of YIN, and visual inspection indicates that our representation contains significantly less noise (see companion website). This indicates that CREPE learns a representation of speech periodicity at least as good as autocorrelation-based methods.
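The dithering and silence gating steps can be sketched as follows. This is not the torchcrepe implementation: the function names are hypothetical, the per-frame A-weighted loudness values are assumed to be computed elsewhere and already expressed relative to the reference level, and the dither is applied in cents to a pitch value in Hz.

```python
import math
import random

CREPE_BIN_CENTS = 20.0  # CREPE pitch bins are 20 cents wide


def dither_pitch(pitch_hz, rng):
    """Add triangular noise with a total width of two bins (40 cents)."""
    offset_cents = rng.triangular(-CREPE_BIN_CENTS, CREPE_BIN_CENTS, 0.0)
    return pitch_hz * 2.0 ** (offset_cents / 1200.0)


def gate_periodicity(periodicity, loudness_db, threshold_db=-60.0):
    """Zero the periodicity in frames quieter than the loudness threshold."""
    return [
        0.0 if loudness < threshold_db else value
        for value, loudness in zip(periodicity, loudness_db)
    ]


rng = random.Random(0)  # seeded only to make the example repeatable
dithered = [dither_pitch(200.0, rng) for _ in range(5)]
gated = gate_periodicity([0.9, 0.8, 0.7], [-70.0, -30.0, -61.0])  # [0.0, 0.8, 0.0]
```

The triangular shape matters: it is the distribution of the sum of two uniform variables, which is the standard choice for dither that decorrelates quantization error from the signal.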
3.2 Data augmentation
We perform training with a much larger dataset than the original LPCNet (see Section 4.1). For this reason, we omit the original data augmentation, which includes random biquad filtering, volume augmentation, and noise injection. Instead, we propose a novel augmentation to improve pitch-shifting performance.
High-accuracy pitch-shifting with a neural network requires that the input pitch representation is disentangled from other features (e.g., the BFCCs). In addition, values close to 50 or 550 Hz are rarely found within speech datasets, which prohibits the network from learning to pitch-shift to these values. We propose a resampling data augmentation to better disentangle pitch features from the BFCCs and allow pitch-shifting of speech to pitch values not seen in the training data. Let $r(x, s_1, s_2)$ be a function that resamples a signal $x$ from sampling rate $s_1$ to sampling rate $s_2$. Given speech signal $x$ with original sampling rate $s_o$, target sampling rate $s_t$, and constant pitch shift factor $k$, we augment training data with $r(x, k s_o, s_t)$ for a set of values of $k$. Performing pitch-shifting at the original sampling rate ensures that we do not lose high-frequency information when downsampling. This resampling method significantly modifies the speech formants. We hypothesize that this encourages the model to disentangle pitch from the representation of formants within the BFCCs.
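A minimal sketch of resampling-based pitch augmentation follows. It rests on two assumptions that are not confirmed by the text: that the shift is realized by reinterpreting a signal recorded at the original rate as if it had been sampled at the shift factor times that rate, and that linear interpolation is an acceptable stand-in for a band-limited resampler in an illustration. The function names and the example shift factor are hypothetical.

```python
import math


def resample(signal, rate_in, rate_out):
    """Resample with linear interpolation (a stand-in for a proper
    band-limited resampler, used here only for illustration)."""
    n_out = int(len(signal) * rate_out / rate_in)
    output = []
    for n in range(n_out):
        position = n * rate_in / rate_out
        index = min(int(position), len(signal) - 1)
        following = min(index + 1, len(signal) - 1)
        fraction = position - index
        output.append((1.0 - fraction) * signal[index] + fraction * signal[following])
    return output


def pitch_shift_augment(signal, factor, rate_original, rate_target):
    """Shift pitch (and formants) by `factor` while converting to rate_target.

    Treating the signal as if it had been sampled at factor * rate_original
    scales every frequency by `factor`; duration is divided by `factor`.
    """
    return resample(signal, factor * rate_original, rate_target)


# A 100 Hz tone at 48 kHz becomes a 200 Hz tone at 16 kHz for factor 2.
tone = [math.sin(2 * math.pi * 100 * n / 48000 + 0.3) for n in range(4800)]
shifted = pitch_shift_augment(tone, 2, 48000, 16000)
```

Because the whole spectrum is scaled, formants move together with the pitch, which is the property the augmentation relies on to decorrelate pitch from the spectral envelope.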
3.3 Sampling excitation values
The original LPCNet samples excitation values with sampling temperature dependent on the periodicity. We instead use a constant sampling temperature of 1, which we find performs equivalently when the amount of training data is sufficiently large. We retain the thresholding of the distribution at small values. Let $p_i$ for $i \in \{1, \dots, 256\}$ be the predicted 256-dimensional categorical distribution over mu-law-encoded excitation values. Let $\tilde{p}_i = \max(p_i - T, 0)$, where $T$ is a constant threshold. We sample excitations from the categorical distribution $\tilde{p} / \sum_i \tilde{p}_i$. We choose $T$ to maximize the F1 score of the voiced/unvoiced decision.
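The thresholded sampling step can be sketched as below. The sketch assumes the subtract-and-clip form of thresholding followed by renormalization; the threshold value in the example is purely illustrative and is not the value used for CLPCNet. The function names and the fallback for an all-zero result are hypothetical.

```python
import random


def threshold_distribution(probabilities, threshold):
    """Subtract a constant from each probability, clip at zero, renormalize."""
    clipped = [max(p - threshold, 0.0) for p in probabilities]
    total = sum(clipped)
    if total == 0.0:
        # Assumed fallback: keep the original distribution if nothing survives.
        return list(probabilities)
    return [c / total for c in clipped]


def sample_excitation(probabilities, threshold, rng):
    """Draw one mu-law code at temperature 1 from the thresholded distribution."""
    distribution = threshold_distribution(probabilities, threshold)
    return rng.choices(range(len(distribution)), weights=distribution, k=1)[0]


rng = random.Random(0)
toy = [0.5, 0.497, 0.003]  # three codes instead of 256, for readability
code = sample_excitation(toy, threshold=0.01, rng=rng)  # never returns index 2
```

Removing the low-probability tail prevents rare, large excitation values from being sampled, while leaving the shape of the high-probability region nearly unchanged.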
We design our evaluation to test two hypotheses: (1) CLPCNet allows users to perform pitch-shifting and time-stretching of speech with high accuracy suitable for prosody modification, and (2) the subjective quality of pitch-shifting and time-stretching with CLPCNet meets or exceeds that of TD-PSOLA and WORLD, two competitive DSP-based methods that have not been outperformed by existing neural methods. We use the Python psola and pyworld packages as baselines for TD-PSOLA and WORLD, respectively. We use CREPE to extract the pitch contours used to control CLPCNet, TD-PSOLA, and WORLD. We also compare to the original LPCNet model (using checkpoint lpcnet20h_384_10_G16_80.h5 in the public LPCNet implementation). In our objective evaluation, we ablate each of our three proposed improvements to LPCNet, corresponding to sections 3.1 through 3.3. For all tables, ↑ means higher is better and ↓ means lower is better.
We use the VCTK dataset for training. We train on 100 speakers, withholding four male and four female speakers for unseen speaker evaluation. To evaluate on unseen utterances by speakers seen during training, we set aside four utterances per speaker from four female and four male speakers in the training data. We use microphone 2, which contains less distortion and noise. To test the robustness of CLPCNet to unseen recording conditions, we perform additional evaluation on the clean partition of the DAPS dataset as well as the RAVDESS dataset. RAVDESS contains a significant amount of reverb, which we remove using HiFi-GAN.
We resample all audio to 16 kHz and apply a 5th-order Butterworth high-pass filter with a 65 Hz cutoff to remove the 50 Hz hum in VCTK. This filter is shallow enough for CLPCNet to perform accurate pitch-shifting below the cutoff (e.g., see Figure 1). We apply a preemphasis filter with a coefficient of .85, followed by a limiter to prevent clipping. CREPE pitch is extracted from the audio prior to preemphasis. As in the original LPCNet, YIN pitch is extracted after preemphasis. We found that peak normalization to 0.8 or 1.0 as well as LUFS normalization all harmed performance, but without normalization, examples with low peak amplitude have artifacts in voiced regions. Therefore, we normalize utterances with a peak amplitude less than 0.2 to have a peak amplitude of 0.4.
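Two of the preprocessing steps above are simple enough to sketch directly. The preemphasis filter is assumed to be the usual first-order difference y[n] = x[n] - a x[n-1], and the conditional normalization follows the thresholds given in the text. The function names are hypothetical, and the high-pass filter and limiter are omitted.

```python
def preemphasis(signal, coefficient=0.85):
    """First-order preemphasis: y[n] = x[n] - coefficient * x[n - 1]."""
    if not signal:
        return []
    filtered = [signal[0]]
    for n in range(1, len(signal)):
        filtered.append(signal[n] - coefficient * signal[n - 1])
    return filtered


def normalize_quiet(signal, minimum_peak=0.2, target_peak=0.4):
    """Scale only utterances whose peak amplitude is below minimum_peak."""
    peak = max((abs(sample) for sample in signal), default=0.0)
    if 0.0 < peak < minimum_peak:
        return [sample * target_peak / peak for sample in signal]
    return list(signal)


quiet = normalize_quiet([0.1, -0.05, 0.02])  # peak 0.1 -> rescaled to peak 0.4
loud = normalize_quiet([0.5, -0.3])          # peak 0.5 -> left unchanged
```

Leaving louder utterances untouched avoids the degradation observed with global peak or LUFS normalization while still removing the artifacts seen in very quiet examples.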
We train CLPCNet for 45 million steps with a batch size of 64. The number of steps was selected to maximize the F1 score of voiced frame classification. Each item in the batch contains a random slice of 15 frames of BFCCs and pitch features and the corresponding 2400 excitation, prediction, and sample features. We use the AMSGrad  optimizer with a learning rate of and weight decay of to minimize the cross entropy loss between the predicted and ground truth excitations. We omit sparsifying the GRU weights, which does not harm quality when the dataset is sufficiently large.
4.3 Objective evaluation
We report objective metrics to measure the ability of CLPCNet to perform constant- and variable-ratio pitch-shifting. We omit objective evaluation of time-stretching, as the generated audio is precisely sample-aligned by construction. Both LPCNet and CLPCNet take as input a pitch contour (see Section 2.1). To measure pitch accuracy, we replace the input pitch with a target pitch and compare the pitch of the synthesized audio to the target pitch. We report three objective pitch metrics: (1) RMS, the root-mean-square of the pitch error in cents within frames where both the target and synthesized speech are classified as voiced, (2) F1, the F1 score of the binary voiced/unvoiced decision, and (3) GPE, the gross pitch error, defined as the fraction of voiced pitch values with pitch error greater than a fixed threshold in cents. For constant-ratio pitch-shifting, we evaluate these metrics using ratios of .71, 1 (unmodified), and 1.41.
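The three metrics can be sketched in a few lines. This is an illustration rather than the evaluation code used for the reported numbers: unvoiced frames are represented as `None`, GPE is assumed to be computed over frames that are voiced in both contours, and the GPE threshold is left as a required argument because its value is not stated here (the 50 cents in the example is only illustrative).

```python
import math


def pitch_metrics(target_hz, predicted_hz, gpe_threshold_cents):
    """Return (rms_cents, f1, gpe) for two frame-aligned pitch contours.

    Each contour is a list with a frequency in Hz for voiced frames and
    None for unvoiced frames.
    """
    true_pos = false_pos = false_neg = 0
    errors = []  # pitch error in cents where both contours are voiced
    for target, predicted in zip(target_hz, predicted_hz):
        if target is not None and predicted is not None:
            true_pos += 1
            errors.append(1200.0 * math.log2(predicted / target))
        elif predicted is not None:
            false_pos += 1
        elif target is not None:
            false_neg += 1
    denominator = 2 * true_pos + false_pos + false_neg
    f1 = 2 * true_pos / denominator if denominator else 0.0
    if not errors:
        return None, f1, None
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    gpe = sum(abs(e) > gpe_threshold_cents for e in errors) / len(errors)
    return rms, f1, gpe


target = [100.0, 200.0, None, 100.0]
predicted = [100.0, 400.0, 100.0, None]
rms, f1, gpe = pitch_metrics(target, predicted, gpe_threshold_cents=50.0)
```

Restricting the RMS to frames voiced in both contours is what keeps unvoiced regions from inflating the pitch error, which is one of the pitfalls hypothesized for earlier evaluations of WORLD.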
We perform objective evaluation of variable-ratio pitch-shifting on RAVDESS. RAVDESS contains an English speech dataset with 24 speakers saying two sentences with many different, expressive prosodies. We select pairs of utterances where the same speaker says the same sentence with different pitch and phoneme durations. We use pitch-shifting and time-stretching to make one utterance in a pair have the pitch and phoneme durations of the other. We use the pitch of the target utterance as ground truth for evaluation. We create 5 pairs each from 20 speakers, for a total of 199 pairs (one speaker only produced 4 pairs) from 277 unique utterances. Given the multimodality of English prosody, this is a suitable prosody transfer task, producing pitch-shifting ratios between .4 and 2.5 and time-stretching ratios between .25 and 4.
We perform objective evaluation of constant-ratio pitch-shifting on VCTK, DAPS, and RAVDESS. For VCTK, we use four utterances from eight seen speakers and four utterances from eight unseen speakers. For DAPS, we use ten utterances from ten speakers. For RAVDESS, we use 100 utterances randomly selected among the 277 used for variable-ratio evaluation.
We perform three ablations to evaluate our methods proposed in sections 3.1-3.3. For section 3.1, we use the YIN pitch and non-uniform bin spacing of the original LPCNet. We set all pitch values less than 63 Hz to 63 Hz, as this representation cannot represent frequencies below this point. We use YIN to evaluate the pitch accuracy of this ablation. For section 3.2, we remove our proposed resampling augmentation. For section 3.3, we remove the distribution threshold (i.e., we set the threshold to zero) to demonstrate the importance of this parameter.
4.4 Subjective evaluation
We report the results of subjective experiments designed to evaluate the ability of CLPCNet to perform both constant- and variable-ratio pitch-shifting and time-stretching. All experiments are mean opinion score (MOS) tests conducted on Amazon Mechanical Turk with a scale from 1 (worst) to 5 (best). For all experiments, we test the audio quality using five conditions: (1) the original audio, (2) TD-PSOLA, (3) WORLD, (4) LPCNet, and (5) CLPCNet. We perform variable-ratio evaluation on the RAVDESS dataset and constant-ratio evaluation on DAPS, using the same examples as in our objective evaluation. We evaluate pitch-shifting at constant ratios of .67, .80, 1, 1.25, and 1.5. We evaluate time-stretching at constant ratios of .50, .71, 1, 1.41, and 2.
|Condition||Ratio||F1 ↑||RMS ↓||GPE ↓|
|- pitch (3.1)||1.00||.923||75.3||.140|
|- augmentation (3.2)||1.00||.938||25.0||.049|
|- sampling (3.3)||1.00||.596||17.1||.028|
|VCTK (unseen)||1.00||.925||17.6||.024|
In our objective evaluation, we find that CLPCNet significantly improves the F1 and RMS of pitch-shifting (see Tables 1 and 3) compared to the original LPCNet. The pitch-shifting accuracy is lower than that of TD-PSOLA or WORLD. However, prior studies have shown that humans do not register variations in pitch less than 150 cents as a distinct prosody, making CLPCNet suitable for prosody editing. Our ablations highlight crucial design decisions for high-quality pitch-shifting, but do not completely explain the performance gap between CLPCNet and LPCNet. The remaining gap is due to increasing the amount of training data, removing the original data augmentation, and removing the sparsity constraint on the GRU.
While more direct comparison is needed, our F1 scores on clean speech data are substantially higher than those reported for previous state-of-the-art neural methods such as PS-WaveNet, QP-PWG, and uSFGAN, and our RMS is comparable or better. Note that a higher F1 score makes having a low RMS more difficult, as more voiced frames are being evaluated (e.g., compare CLPCNet with and without the sampling ablation in Table 1).
We analyze the results of our subjective evaluation (Figure 2) using two-sided $t$-tests with a $p$-value of 0.05. We find that CLPCNet outperforms LPCNet on all conditions. CLPCNet outperforms WORLD on constant-ratio time-stretching for ratios less than one, as well as for constant-ratio pitch-shifting for ratios less than 1.5. WORLD outperforms CLPCNet only for pitch-shifting with a ratio of 1.5. TD-PSOLA outperforms CLPCNet for time-stretching with a ratio of 0.71 and pitch-shifting with a ratio of 1.5. However, TD-PSOLA is non-parametric, and cannot be used for, e.g., speech coding or vocoding. CLPCNet outperforms all conditions on variable-ratio pitch-shifting and time-stretching (e.g., prosody editing), but does not match the quality of the original recording.
Modern speech editing software necessitates high-quality, natural-sounding speech manipulation. In this paper, we introduce CLPCNet, an improved LPCNet vocoder that makes significant progress towards this goal. In objective evaluation, we show that CLPCNet exhibits pitch-shifting accuracy suitable for speech prosody editing. In subjective evaluation, we show that the quality of pitch-shifting and time-stretching with CLPCNet is comparable to or better than that of LPCNet, TD-PSOLA, and WORLD.
- (2007) Implementation of realtime straight speech manipulation system: report on its first implementation. Acoustical science and technology.
- (2015) A simple limiter in python. GitHub. Note: https://gist.github.com/bastibe/747283c55aad66404046
- YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America 111 (4), pp. 1917–1930.
- (1973) The viterbi algorithm. Proceedings of the IEEE 61 (3), pp. 268–278.
- (2021) Pyworld. GitHub. Note: https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder
- (2018) CREPE: a convolutional representation for pitch estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 161–165.
- (2019) High quality, lightweight and adaptable tts using lpcnet. arXiv preprint arXiv:1905.00590.
- (2018) The ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in north american english. PloS one 13 (5), pp. e0196391.
- (1975) Linear prediction: a tutorial review. Proceedings of the IEEE 63 (4), pp. 561–580.
- (1936) Tentative standards for sound level meters. Electrical Engineering 55 (3), pp. 260–263.
- (2016) WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems.
- (2018) Sound quality comparison among high-quality vocoders by using re-synthesized speech. Acoustical Science and Technology 39 (3), pp. 263–265.
- (2020) Controllable neural prosody synthesis. In Interspeech.
- (2020) Psola. GitHub. Note: https://github.com/maxrmorrison/psola
- (2020) Torchcrepe. GitHub. Note: https://github.com/maxrmorrison/torchcrepe
- (1990) Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech communication.
- (2014) Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges. IEEE Signal Processing Letters 22 (8), pp. 1006–1010.
- (2019) Waveglow: a flow-based generative network for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- (2001) Discrete-time speech signal processing: principles and practice.
- (2019) On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.
- (1985) On the relation between pitch excursion size and prominence. Journal of phonetics 13 (3), pp. 299–308.
- HiFi-gan: high-fidelity denoising and dereverberation based on speech deep features in adversarial networks. In Interspeech.
- (2015) Algorithms to measure audio programme loudness and true-peak audio level. Recommendation ITU-R BS.1770-4.
- (2019) LPCNet: improving neural speech synthesis through linear prediction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- LPCNet: DSP-boosted neural speech synthesis. Note: https://jmvalin.ca/demo/lpcnet/ Accessed: 2020-08-24
- (2019) LPCNet. GitHub. Note: https://github.com/mozilla/LPCNet
- (2018) A hybrid dsp/deep learning approach to real-time full-band speech enhancement. In International Workshop on Multimedia Signal Processing (MMSP).
- (1987) Dither in digital audio. Journal of the Audio Engineering Society 35 (12), pp. 966–975.
- (2019) Neural source-filter-based waveform model for statistical parametric speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5916–5920.
- (2020) Hider-finder-combiner: an adversarial architecture for general speech signal modification. In Interspeech.
- Quasi-periodic parallel wavegan: a non-autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 792–806.
- (2019) CSTR VCTK corpus: english multi-speaker corpus for CSTR voice cloning toolkit (version 0.92).
- Parallel wavegan: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203.
- (2021) Unified source-filter gan: unified source-filter network based on factorization of quasi-periodic parallel wavegan. arXiv preprint arXiv:2104.04668.