Singing voice synthesis (SVS) is the task of generating a natural singing voice from a given musical score. With the development of various deep generative models, research on synthesizing high-quality singing voices has grown rapidly [chen2020hifisinger, zhuang2021litesing, hono2021sinsy, liu2021diffsinger]. As the performance of SVS improves, the technology is increasingly applied to the production of actual music content [huang2020ai]. Accordingly, SVS systems that can control various musical expressions according to the user’s intention are drawing more attention.
There are two challenging problems in building an SVS system that can easily control various expressions. First, it is difficult to build datasets in which various musical expressions are annotated. Unlike information such as pitch and lyrics, which is relatively easy to label, expressive elements such as breathing, intensity, and singing techniques related to pitch contour are more expensive and time-consuming to label. Second, the more information the user has to enter into the SVS system initially, the more burdensome the system becomes to use. In general, an SVS system takes a MIDI pitch sequence and lyrics as input. Although many additional input parameters for musical expression (e.g., breath, intensity, vibrato parameters) can be given to the system, this is inconvenient for the user because the number of parameters that must be specified at the frame level grows with every new song. Therefore, to build an SVS system that is easy to use and can control various expressions while dealing with these problems, 1) expression elements should be learned in a self-supervised manner, and 2) the parameters for detailed expressions should be generated automatically during the initial generation stage but should remain editable so that the singing can be resynthesized if desired.
To this end, we propose an SVS system capable of controlling expression along with two novel methods. First, to model a variety of unlabeled style representations, we introduce a Local Style Token (LST) module that captures styles in a self-supervised manner from given text and pitch information based on [wang2018style]. Unlike [wang2018style], however, we do not design the model to infer a single global style vector from a reference signal but rather predict frame-wise style tokens that change over time.
Second, to take control over the f0 contour, we introduce a Dual-path Pitch Encoder (DPE) that can selectively use MIDI pitch and f0 contour as input. In the training process, the output of the two encoders is randomly selected to produce the same result. In the generation process, we can resynthesize the singing by freely controlling the f0 contour extracted from the results generated by the MIDI pitch. Through the quantitative and qualitative evaluation, we confirmed that the proposed system allows free control of expressions such as breathing, intensity, and detailed f0 techniques while producing a high-quality singing voice.
The main contributions of this study are as follows:
We propose a content-driven local style token module that can model various musical expressions in singing, such as intensity and breathing, trained in a self-supervised manner.
We propose a dual-path pitch encoder that takes either MIDI pitch sequence or f0 contour, allowing the users to control pitch at a coarse or fine level of one’s choice.
2 Proposed System
Our proposed SVS model extends the model of [lee2020disentangling] with a local style token (LST) module and a dual-path pitch encoder (DPE) to model and control various expressions. Based on the source-filter theory [chiba1958vowel], the acoustic model of our SVS system is an auto-regressive model with two decoders that generate the filter and the source signal, respectively. The filter is modeled from the text, and the source from the pitch and the previous acoustic feature; the singer embedding and the LST are used together as conditions in both cases. Specifically, the acoustic feature is generated from the outputs of the filter decoder and the source decoder, each of which takes the encoder outputs of its inputs together with the corresponding local style token sequence. Finally, the generated acoustic feature is converted to a waveform through a vocoder. The entire network is trained with an acoustic feature reconstruction loss and an adversarial loss; an overview of the acoustic model structure is shown in Fig. 1.
2.1 Local style token module
We introduce a local style token module to model singing expressions that are not specified in a music score. We assume that musical expression elements in singing can be inferred from a given input text and pitch sequence and that these elements should exist in a time-varying form. To achieve this goal, we modify the attention mechanism proposed in [wang2018style] and retrieve a local style token sequence by referencing the input contents such as pitch and text. We first introduce a style encoder consisting of stacked 1d-CNN layers with gated linear units [dauphin2017language], which produces a query sequence from the text (or pitch) embedding sequence and the singer embedding sequence.
The query has the same length as the input sequence, and the singer embedding sequence is obtained by broadcasting the singer encoder output vector along that length. Because the LST module operates in exactly the same way on the text and pitch sides, we describe it once for both.
Then, randomly initialized trainable style keys and values are introduced, and the style score is computed from the query and the style keys.
Finally, we obtain the LST sequence via matrix multiplication between the style score and the style values. In the inference stage, the LST sequence predicted from the input contents may be used as it is, or the style score can be modified as desired at the frame level to control the musical expression of the singing voice, as shown in Fig. 2. An overview of the LST module is illustrated in Fig. 1-(b).
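The frame-wise token lookup described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes scaled dot-product attention with a softmax over the tokens, and the function names, shapes, and the `score_scale` argument are hypothetical. The random arrays stand in for the style encoder output and the trainable keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_style_tokens(query, keys, values, score_scale=None):
    """query: (T, d) frame-wise queries; keys, values: (N, d) style tokens.

    Returns the (T, N) style score and the (T, d) LST sequence.
    score_scale is an optional (T, N) multiplier applied to the score at
    inference time to strengthen or weaken individual tokens per frame.
    """
    d = query.shape[-1]
    # Assumed score function: scaled dot product followed by softmax.
    score = softmax(query @ keys.T / np.sqrt(d), axis=-1)
    if score_scale is not None:
        score = score * score_scale
    lst = score @ values  # matrix multiplication of score and values
    return score, lst

rng = np.random.default_rng(0)
T, N, d = 6, 4, 8                  # frames, style tokens, channels (toy sizes)
query = rng.normal(size=(T, d))    # stand-in for the style encoder output
keys = rng.normal(size=(N, d))     # trainable in a real model
values = rng.normal(size=(N, d))   # trainable in a real model

score, lst = local_style_tokens(query, keys, values)

# Double the contribution of token 0 on the last three frames only.
scale = np.ones((T, N))
scale[3:, 0] = 2.0
_, lst_edited = local_style_tokens(query, keys, values, score_scale=scale)
```

Because each frame has its own row of the style score, an edit to one frame range leaves the LST of all other frames unchanged, which is what makes local control possible.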
2.2 Dual-path pitch encoder
Another essential factor that determines the expressiveness of singing is the f0 contour. Modeling a natural f0 contour from a MIDI pitch sequence is an important research area in SVS [ohishi2012stochastic, wada2018sequential, kameoka2013generative]. However, a natural f0 contour must be carefully determined by referencing not only MIDI pitch information but also text and singer information [lee2012study, ikemiya2014transferring].
Meanwhile, the model proposed in [lee2020disentangling] produces a spectrogram with a natural f0 that implicitly reflects the inputs to the system, such as the singer, the MIDI pitch sequence, and the lyrics. Inspired by this, we aim to design a model that can use both MIDI pitch and f0 contour as inputs instead of building an additional model that predicts the f0 contour explicitly. To this end, we propose a dual-path pitch encoder consisting of two encoders with the same structure, one for MIDI pitch and one for the f0 contour, and we train it by randomly selecting one of the two pitch representations. This way, we can generate a singing voice with a natural f0 contour in the initial generation using the MIDI pitch input. If we want to further control pitch techniques such as vibrato or portamento, we can modify the f0 contour directly and resynthesize the singing using the f0 encoder, as shown in Fig. 2.
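The routing between the two pitch paths can be sketched as below. This is only an illustration under stated assumptions: the function name, the `use_f0` flag, and the toy encoders are hypothetical stand-ins (the actual encoders are 1d-CNN stacks), and the default selection probability of 0.5 follows the 50% split reported in the experiments.

```python
import numpy as np

def dual_path_pitch_encoding(midi_pitch, f0, midi_encoder, f0_encoder,
                             rng=None, use_f0=None, p_f0=0.5):
    """Route one of the two pitch representations to its encoder.

    use_f0=None  -> training: pick the f0 path with probability p_f0.
    use_f0=False -> initial generation from the score (MIDI pitch).
    use_f0=True  -> resynthesis from an extracted or edited f0 contour.
    """
    if use_f0 is None:
        use_f0 = rng.random() < p_f0
    if use_f0:
        return f0_encoder(f0), "f0"
    return midi_encoder(midi_pitch), "midi"

# Toy encoders that map both inputs to the same (T, 8) shape.
emb = np.random.default_rng(1).normal(size=(128, 8))
w = np.random.default_rng(2).normal(size=(1, 8))

def midi_encoder(notes):
    return emb[notes]                    # embedding lookup per frame

def f0_encoder(hz):
    return np.log(hz)[:, None] @ w       # assumes all frames are voiced

midi_pitch = np.array([60, 60, 62, 62])
f0 = np.array([261.6, 262.0, 293.0, 294.1])

rng = np.random.default_rng(0)
paths = [dual_path_pitch_encoding(midi_pitch, f0, midi_encoder, f0_encoder, rng=rng)[1]
         for _ in range(1000)]
f0_fraction = paths.count("f0") / len(paths)
```

Since both paths feed the same downstream decoder and are trained against the same target, the decoder cannot tell which path produced its input, which is what allows the paths to be swapped at generation time.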
2.3 Bandwidth extension vocoder
We used a HiFiGAN vocoder [kong2020hifi] to convert the generated acoustic feature into a waveform. Interestingly, we found that the HiFiGAN vocoder can perform bandwidth extension and waveform generation simultaneously. That is, we trained the vocoder to convert a 22.05 kHz acoustic feature generated by the acoustic model into a 44.1 kHz waveform, which yields a higher-quality sound source without having to train an acoustic model that generates a 44.1 kHz acoustic feature.
We used an internal dataset of 1,150 singing voices from 88 female and 66 male singers for training. Each singing voice is paired with a manually annotated musical score. The singing voices were sampled at 44.1 kHz, and the notes were annotated for each syllable. Phoneme-level annotation of the lyrics was done by assigning one frame each to the onset and the coda, and the remaining frames to the vowel, as proposed in [lee2019adversarially].
The models were trained in the same way as in [lee2020disentangling] except for the newly proposed LST and DPE. The number of style tokens was set to 4, and training was conducted using either the MIDI pitch sequence or the f0 contour, each with 50% probability. We used WORLD [morise2016world] to extract the f0 contour from the singing voices. The encoder and decoder structure of the model is the same as in [lee2020disentangling] except that all highway convolutional units [srivastava2015training] have been replaced with GLUs. The structure of the style encoder is shown in Fig. 1-(b), and the structure of the f0 encoder is the same as that of the MIDI pitch encoder except for the input channel of the first 1d-CNN layer. As the acoustic feature, we used a 128-dimensional mel spectrogram extracted from the 22.05 kHz waveform with a window size of 1024 and a hop length of 256. The acoustic model was trained using the adversarial loss proposed by [lee2019adversarially] along with an L1 loss.
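As a small worked calculation, the feature settings above determine the frame rate of the acoustic feature and, if the vocoder keeps that frame rate while emitting 44.1 kHz audio, the number of output samples it must produce per frame. The variable names are illustrative, and the frame-rate-preserving assumption is ours rather than something stated in the text.

```python
from fractions import Fraction

feature_sr = 22050   # sampling rate used to extract the mel spectrogram (Hz)
hop_length = 256     # hop length in samples at feature_sr
output_sr = 44100    # sampling rate of the vocoder output (Hz)

# Frames per second of the acoustic feature.
frame_rate = Fraction(feature_sr, hop_length)

# Output samples per feature frame if the frame rate is preserved.
samples_per_frame = Fraction(output_sr) / frame_rate

print(float(frame_rate), samples_per_frame)
```

Under this assumption each mel frame corresponds to twice as many waveform samples as the hop length, which is the sense in which the vocoder also performs bandwidth extension.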
We trained three models for comparative experiments to see whether LST and DPE help improve controllability while generating high-quality singing voices. The three models are as follows: 1) The Single model has only a MIDI pitch encoder and no f0 encoder. Its f0 contour is controlled by modifying the initially generated singing voice with the WORLD vocoder [morise2016world]. 2) The Dual model has a DPE that includes both an f0 encoder and a MIDI pitch sequence encoder. 3) The DualLST model incorporates both the DPE and the LST module.
3.3 Quantitative evaluation
To examine the generation quality and controllability of the proposed model, we organized two test sets for listening evaluations. The first test set consisted of singing voices generated with MIDI pitch and text input. Ten male and ten female singers were randomly selected to generate ten musical verses. To account for degradation from the vocoders, we reconstructed the waveforms from ground truth acoustic features using the vocoders and included them in the listening evaluations. For the second test set, we conducted a pitch shift experiment to verify whether it is possible to control the f0 contour extracted from the initially generated results. We randomly selected 60 musical scores and initially generated them using the MIDI pitch input. Then, after raising or lowering each of the 60 f0 contours extracted from the generation results by two semitones, we resynthesized them to generate pitch-shifted sound sources. The listening evaluations were conducted through Amazon Mechanical Turk, and 10 participants per sound source evaluated the overall naturalness and pitch naturalness of each sound source. The evaluation results for the two tests are shown in Table 1 and Table 2, respectively. (Audio samples: https://tinyurl.com/cpyfbt6h)
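The two-semitone shift used in the second test corresponds to multiplying the f0 contour by a fixed ratio in equal temperament. The sketch below shows this under the assumption that unvoiced frames are marked with an f0 of 0, as WORLD does; the function name and the toy contour are illustrative.

```python
import numpy as np

def shift_semitones(f0_hz, semitones):
    """Shift an f0 contour in Hz by a number of equal-tempered semitones.

    Unvoiced frames, marked here by f0 == 0, stay at 0 because the shift
    is a pure multiplication.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    return f0_hz * 2.0 ** (semitones / 12.0)

f0 = np.array([0.0, 220.0, 221.5, 246.9, 0.0])  # toy contour in Hz
up = shift_semitones(f0, +2)
down = shift_semitones(f0, -2)
```

Because the shift is multiplicative, fine structure such as vibrato keeps its depth in cents, and shifting up and then down by the same amount recovers the original contour.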
[Table (listening evaluation results): columns are Model, overall naturalness, and pitch naturalness.]
Table 1 shows that all of the models obtained naturalness results that did not differ significantly from reconstructed singing voices by the vocoder. Using DPE resulted in a slightly lower naturalness score than the Single model, which seems to be the result of the ambiguity that occurred in producing the same result from two different types of inputs. However, this difference was not statistically significant, so we claim that the DPE model can control the f0 contour while maintaining the naturalness similar to the baseline model.
Table 2 shows that, when the pitch is shifted, using the DPE produces higher-quality sound sources than using the parametric vocoder capable of f0 control. The Dual and DualLST models showed better naturalness than the Single model in all cases. Although we performed a global pitch shift in the experiment to maintain the temporal context of the song, the pitch shift can also be applied to local sections. The proposed model can naturally reflect various pitch techniques such as vibrato, attack, and release, as described in Section 3.5.
3.4 Dual-path reconstruction analysis
A reconstruction analysis was performed to quantitatively confirm whether the DPE faithfully reflects the input pitch information. First, we generated 150 audio samples from randomly selected singers and phrases. Then, we extracted f0 contours from the generated samples and resynthesized the samples from the extracted f0 contours. Finally, we measured Mel Cepstral Distortion (MCD), f0-RMSE, and V/UV error rates between the initially generated and the reconstructed samples. The results in Table 3 show that the difference between the initially generated sample and the reconstructed sample is small, indicating that the proposed DPE module works as intended. We also report the error between the ground truth audio sample and its reconstruction by the HiFiGAN vocoder, shown as Recon in Table 3. The difference between the initially generated and the reconstructed samples is sufficiently small even when compared to this vocoder reconstruction error. In particular, adding the LST module lowered the error measures in every case. This implies that the LST module not only helps control singing expressions but also helps generate better output with an accurate f0 contour.
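For reference, the three measures can be computed as in the sketch below. It uses common definitions rather than the exact settings of this study, which are not given here: MCD excludes the 0th coefficient and is averaged over frames, f0-RMSE is taken in Hz over frames voiced in both contours, and unvoiced frames are assumed to have an f0 of 0. The inputs are assumed to be already time-aligned, and all names and toy values are illustrative.

```python
import numpy as np

def mel_cepstral_distortion(mc_a, mc_b):
    """Mean MCD in dB between two aligned (T, D) mel-cepstrum sequences.

    The 0th (energy) coefficient is excluded, as is common practice.
    """
    diff = mc_a[:, 1:] - mc_b[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

def f0_rmse(f0_a, f0_b):
    """RMSE in Hz over frames that are voiced (f0 > 0) in both contours."""
    voiced = (f0_a > 0) & (f0_b > 0)
    return float(np.sqrt(np.mean((f0_a[voiced] - f0_b[voiced]) ** 2)))

def vuv_error_rate(f0_a, f0_b):
    """Fraction of frames whose voiced/unvoiced decisions disagree."""
    return float(np.mean((f0_a > 0) != (f0_b > 0)))

# Toy data: initially generated vs. reconstructed sample.
f0_gen = np.array([0.0, 220.0, 222.0, 0.0])
f0_rec = np.array([0.0, 223.0, 218.0, 230.0])
mc_gen = np.zeros((4, 5))
mc_rec = mc_gen.copy()
mc_rec[:, 0] += 1.0  # differs only in the excluded 0th coefficient

mcd = mel_cepstral_distortion(mc_gen, mc_rec)
rmse = f0_rmse(f0_gen, f0_rec)
vuv = vuv_error_rate(f0_gen, f0_rec)
```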
3.5 Qualitative analysis
To qualitatively examine the controllability of the proposed methods, we tried various style modifications by manipulating the initial LST sequence and f0 contour. (Note: the role of each style token is randomly permuted for every experiment. In addition, although breath and intensity were always captured by the style tokens on the text side, no meaningful token was found on the pitch side.)
Breath control The first noticeable style captured by the style tokens was a token activated in the breathing sections of the phrases. We named it the breath token. By scaling the style score of the breath token from half to double its original value, we found that we can easily control the intensity of breathing sounds, as shown in Fig. 3-(a)-bottom.
Intensity control We also found that one of the tokens captures the intensity of the singing voice. We named it the intensity token. Here, the term intensity refers to the energy and the timbre of singing that result from the vocalization method, such as falsetto or chest voice. As shown in Fig. 3-(a)-top, we can change the intensity of the singing voice, which is shown explicitly by the energy contour. Taking advantage of this, we can control musical expressions such as crescendo or decrescendo by linearly increasing or decreasing the attention score of the intensity token.
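The crescendo and decrescendo edits amount to multiplying one column of the style score by a linear gain ramp. The following sketch illustrates this on a toy score matrix; the function name, the token index, and the gain range are hypothetical, and a real style score would come from the LST module rather than being uniform.

```python
import numpy as np

def ramp_token_score(score, token, start_gain, end_gain):
    """Scale one token's column of a (T, N) style score by a linear ramp."""
    edited = score.copy()
    gains = np.linspace(start_gain, end_gain, num=score.shape[0])
    edited[:, token] *= gains
    return edited

T, N = 5, 4
score = np.full((T, N), 0.25)  # toy uniform style score
intensity_token = 2            # hypothetical index; roles are permuted per run

crescendo = ramp_token_score(score, intensity_token, 0.5, 2.0)
decrescendo = ramp_token_score(score, intensity_token, 2.0, 0.5)
```

A constant gain (equal start and end values) gives the half-to-double scaling described for the breath token.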
f0 contour control We can easily control the f0 contour using the DPE with simple operations such as reducing its variance (flatten), adding sinusoidal values (vibrato), and adding or subtracting small values at the onset and offset positions of a note (attack/release, up/down), as shown in Fig. 3-(b). Note that, compared to directly adjusting the quantized MIDI pitch, f0 contour control makes it possible to reflect more detailed expressions in units of Hz.
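These operations are simple array manipulations on the extracted f0 contour. The sketch below gives one possible form of each, working additively in Hz; the function names, default parameter values, and toy contour are assumptions, the frame rate is derived from the 22.05 kHz sampling rate and hop length of 256 reported in the experiments, and unvoiced frames are assumed to have an f0 of 0.

```python
import numpy as np

def flatten(f0, amount=0.5):
    """Shrink deviations from the mean voiced f0 (amount=1 gives a flat line)."""
    out = f0.copy()
    voiced = f0 > 0
    mean = f0[voiced].mean()
    out[voiced] = mean + (1.0 - amount) * (f0[voiced] - mean)
    return out

def add_vibrato(f0, frame_rate, rate_hz=5.5, depth_hz=4.0):
    """Add a sinusoid of the given rate and depth (both in Hz) to voiced frames."""
    t = np.arange(len(f0)) / frame_rate
    return np.where(f0 > 0, f0 + depth_hz * np.sin(2.0 * np.pi * rate_hz * t), 0.0)

def bend_onset(f0, n_frames, start_offset_hz):
    """Attack: start offset from the note and glide back over n_frames."""
    offset = np.zeros_like(f0)
    offset[:n_frames] = np.linspace(start_offset_hz, 0.0, num=n_frames)
    return np.where(f0 > 0, f0 + offset, 0.0)

frame_rate = 22050 / 256                     # frames per second
f0 = 220.0 + 3.0 * np.sin(np.arange(40))     # toy voiced note around 220 Hz

flat = flatten(f0, amount=0.5)
vib = add_vibrato(f0, frame_rate)
att = bend_onset(f0, n_frames=8, start_offset_hz=-10.0)
```

A release can be produced the same way by applying the ramp to the last frames of the note instead of the first ones.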
4 Related work
Recently, there has been growing interest in SVS systems that can reflect musical expression. A method of explicitly modeling information that can be extracted directly from the vocal signal, such as pitch curves, energy, and V/UV, was proposed in [zhuang2021litesing]. [hono2021sinsy] proposed a method to interpret the music score more naturally by introducing a module that predicts the difference between the actual singing and the score. Efforts to create natural pitch contours have also been made in various ways, such as directly predicting f0 from note sequences [wada2018sequential, ohishi2012stochastic, lee2012study] or predicting the variables of parametric f0 contours [bonada2020hybrid]. Despite these efforts to improve the expressive power of SVS systems, to our knowledge there has been no study yet that allows users to control singing style elements that cannot be extracted directly from the signal.
We proposed a local style token module and a dual-path pitch encoder to design an SVS system capable of modeling and controlling various musical expressions. We confirmed that the LST sequence predicted from the contents can be controlled to modify expressions such as intensity and breathing. The f0 contour can be controlled through the DPE to express various singing techniques related to pitch. Listening evaluations showed that the proposed model can generate a high-quality singing voice while reflecting the users’ intentions.