
U-Singer: Multi-Singer Singing Voice Synthesizer that Controls Emotional Intensity

03/02/2022
by   Sungjae Kim, et al.
Handong Global University

We propose U-Singer, the first multi-singer emotional singing voice synthesizer that expresses various levels of emotional intensity. While synthesizing singing voices according to the lyrics, pitch, and duration of the music score, U-Singer reflects singer characteristics and emotional intensity by adding variances in pitch, energy, and phoneme duration according to the singer ID and emotional intensity. Representing all attributes by conditional residual embeddings in a single unified embedding space, U-Singer controls mutually correlated style attributes while minimizing interference. Additionally, we apply emotion embedding interpolation and extrapolation techniques that lead the model to learn a linear embedding space and allow it to express emotional intensity levels not included in the training data. In experiments, U-Singer synthesized high-fidelity singing voices reflecting the singer ID and emotional intensity. Visualizations of the unified embedding space show that U-Singer estimates variations in pitch and energy that are highly correlated with the singer ID and emotional intensity level. Audio samples are presented at https://u-singer.github.io.



1 Introduction

The singing voice synthesis (SVS) system is a generative model that synthesizes singing voices from the lyrics, note pitch, and note duration of the music score. Similar to text-to-speech (TTS), SVS converts the lyrics into spectrograms that represent their pronunciation. However, SVS has an additional restriction: the output voice should follow the note pitch and note duration. In the past, most SVS systems were based on traditional methods such as the concatenative method or hidden Markov models (HMM) [1, 2, 3]. In recent years, end-to-end neural SVS has been actively studied [4, 5, 6, 7]. Neural SVS exhibits the flexibility to effectively express various singing styles.

Reflecting singer ID and emotion is important for synthesizing natural and expressive singing voices. In particular, the intensity of emotion, as well as its type, is crucial to conveying the feeling of a song. However, there are few studies on expressing emotions in SVS, and they are based on traditional methods such as HMM [8]. To the best of our knowledge, there is no existing deep learning-based SVS model that expresses emotions of varying intensities [9]. In the TTS field, many studies have been conducted to express types of emotions [10, 11, 12, 13, 14, 15, 16], but there are few studies on expressing the intensity of emotion [17, 16].

The main challenge in speaker/singer ID and emotion control is disentangling them from other style attributes such as pitch and rhythm. In particular, expressing emotion in SVS is more challenging than in TTS because the SVS model is restricted by the note pitch and note duration. Previous work has shown that F0 contour changes, the power envelope, and the spectral sequence are important for expressing emotions in singing voices [18, 19, 20]. In addition, [21] shows that the level and variation of loudness and the variation of F0 have a significant effect on emotional expression. Therefore, to effectively express emotions in a singing voice, it is essential to precisely model the variations in pitch and energy according to the type and intensity of emotion. However, it is challenging to generate subtle changes in pitch and energy while producing accurate pronunciation and following the note pitch and note duration. Multi-singer emotional SVS is even more difficult, as it has to disentangle and control each singer's timbre separately, in addition to the above attributes.

In this paper, we propose U-Singer, the first deep learning-based multi-singer emotional SVS model that effectively expresses the type and intensity of emotion. Following FastSpeech2 [22], U-Singer synthesizes spectrograms in a non-autoregressive manner. Based on UniTTS [23], a recently developed multi-speaker emotional TTS model, U-Singer represents multiple style attributes in a unified embedding space. Since U-Singer represents each attribute by a residual embedding conditioned on the preceding attributes, it avoids interference between style attributes. More importantly, it predicts fine variations of pitch, energy, and phoneme duration conditioned on the singer ID and emotion, while accurately following the note pitch and note duration. U-Singer applies emotion interpolation and extrapolation to learn a continuous embedding space and to express emotional intensities that are not included in the training data. Furthermore, we developed the ASPP-Transformer, which combines atrous spatial pyramid pooling (ASPP) [24] and the Transformer [25]. The ASPP-Transformer incorporates broad context while focusing on local details, thereby improving the fidelity of singing voices with large variations in phoneme duration.

To train U-Singer, we collected 12.32 hours of Korean singing voices, including the voices of four singers and seven emotional intensity levels: three levels of happiness, three levels of sadness, and neutral. Although the training data includes only these seven intensity levels, U-Singer can express emotion intensities not included in the training data through the proposed emotion embedding interpolation and extrapolation techniques. In experiments, U-Singer synthesized singing voices of multiple singers while controlling emotional intensity. The visualization results demonstrate that U-Singer learns embeddings highly correlated with singer ID and emotional intensity, and that its pitch and energy predictors produce embeddings that vary with the singer ID and emotional intensity.

Our main contributions include 1) the first deep learning-based multi-singer emotional SVS that controls emotional intensity, 2) residual pitch and duration predictors that add variance to the note pitch and note duration according to the singer ID and emotional intensity, 3) emotion embedding interpolation and extrapolation to express emotional intensities not in the training data, and 4) the ASPP-Transformer, which incorporates broad context while focusing on local details of singing voices.

2 Related Work

Single-singer SVS The SVS models developed in the early days of deep learning have a structure similar to the conventional SVS system but replace some modules with deep neural networks [26]. [27, 28, 29] propose deep learning-based end-to-end SVS models composed of an encoder and an attention-based autoregressive decoder. In addition, [29] improves voice quality by applying an adversarial loss. In the speech synthesis field, [30] proposes a novel non-autoregressive TTS model, FastSpeech, that resolves the skipping and repeating issues of autoregressive models. Inspired by FastSpeech, many SVS models have adopted the non-autoregressive method based on the feed-forward transformer (FFT) [5, 6, 7]. They are composed of an encoder that extracts high-level embeddings from the lyrics, note pitch, and note duration; a length regulator combined with a duration predictor that aligns the phoneme sequence to the spectrogram frames; and a decoder that synthesizes spectrograms from the aligned phoneme embeddings. There are also SVS models based on GANs [31, 29, 32] and diffusion models [33].

Multi-singer SVS In the speech synthesis field, many multi-speaker TTS models reflect speaker ID by feeding a fixed-size speaker embedding into the decoder [34, 35]. Most multi-singer SVS models represent singer characteristics in similar ways [31, 36, 37]. However, learning these characteristics requires sufficient data for each singer. A few studies learn singer characteristics from a small number of samples. [38] learns the characteristics of each singer by adapting a pre-trained SVS model to a target singer. [36, 37] present zero-shot style adaptation methods that apply reference encoders to extract singer embeddings from reference audio. In particular, [37] applies multiple reference encoders and multi-head attention to reflect singer characteristics more effectively. On the other hand, [39] proposes a method to disentangle timbre from singing style and control them separately.

Emotion modeling in SVS Most previous studies on emotional expression in SVS are based on conventional approaches. [18] analyzes the effect of the F0 contour on emotional expression in the singing voice and proposes an F0 control module to reflect four types of F0 dynamics: overshoot, vibrato, preparation, and fine fluctuation. [8] proposes an SVS system that controls the acoustic parameters that affect emotional expression using hidden semi-Markov models (HSMM). [19] and [20] analyze the effect of the F0 contour, amplitude envelope, and spectral sequence on emotional expression. [21, 40] report that the variation of F0 and duration significantly affects recognized emotion. In TTS, previous work has controlled emotion through emotion embeddings [41]. However, to the best of our knowledge, there is no previous deep learning-based end-to-end SVS system that controls the type and intensity of emotion [9].

Multiple style modeling using a unified embedding space A multi-singer emotional SVS must control multiple attributes, such as singer ID, emotional intensity, pitch, energy, and rhythm, together. However, it is challenging to disentangle multiple attributes. In particular, overlapping attributes, such as speaker ID and emotion, often interfere with each other, reducing fidelity and controllability. To address this issue, [23] proposes a unified embedding space that represents all style attributes while avoiding interference, as shown in Figure 1. U-Singer applies the unified embedding space to predict pitch, energy, and duration according to singer ID and emotional intensity, avoiding interference.

Figure 1: Conceptual figure of unified embedding space. U-Singer applies the unified embedding space to predict pitch, energy, and duration according to singer ID and emotional intensity, avoiding interference.

3 Multi-singer Emotional Singing Voice Synthesis based on Unified Embedding Space

(a) Overall Architecture
(b) Variance Adaptor
(c) ASPP-Transformer
Figure 2: The architecture of U-Singer.

3.1 Overall Architecture

Figure 2 illustrates the structure of U-Singer. It takes lyrics (a phoneme sequence), note pitch, and note duration as input. First, it retrieves the low-level embeddings of the phoneme and note pitch from their embedding tables. [5] and [6] combine phoneme and pitch embeddings by element-wise addition, whereas we combine them by concatenation followed by a linear layer. The linear layer provides a more general form and exhibited higher fidelity than element-wise addition in our preliminary experiments. The lyric-pitch encoder converts the combined embedding into a high-level representation and passes it to the variance adaptor.
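As an illustration, the following is a minimal PyTorch-style sketch of the concatenation-plus-linear fusion; the module name and the 384-dimensional embedding size (taken from Table 3) are assumptions for illustration, not the paper's exact implementation.

import torch
import torch.nn as nn

class LyricPitchEmbedding(nn.Module):
    def __init__(self, n_phonemes, n_pitches, dim=384):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)  # phoneme lookup table
        self.pitch_emb = nn.Embedding(n_pitches, dim)     # note-pitch lookup table
        self.fuse = nn.Linear(2 * dim, dim)               # concat -> linear (more general than addition)

    def forward(self, phoneme_ids, pitch_ids):
        # phoneme_ids, pitch_ids: (batch, seq_len) integer indices
        x = torch.cat([self.phoneme_emb(phoneme_ids), self.pitch_emb(pitch_ids)], dim=-1)
        return self.fuse(x)                               # (batch, seq_len, dim)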

The variance adaptor adds variance information to the input embedding h_0, as shown in Figure 2(b). Following [23], our variance adaptor represents all attributes in a unified embedding space together with the phoneme. The variance adaptor consists of a collection of per-attribute predictors and encoders. It adds the style attributes a_1, ..., a_4 sequentially, where a_1, a_2, a_3, and a_4 are the singer ID, emotional intensity, pitch, and energy, respectively. While the attributes are applied to the phoneme, the embedding moves along the path h_0, h_1, h_2, ..., h_4, where h_i is a joint embedding that represents the phoneme with the first i attributes applied. In Figure 2(b), the vertical arrows show the joint embeddings that accumulate the residual attribute embeddings.

U-Singer represents each attribute by a residual embedding, which is the vector distance between the embeddings before and after applying the attribute, i.e., e_i = h_i − h_{i−1}. Since the residual embedding is conditional on the previous attributes, it represents the attribute embedding adapted to the preceding attributes. U-Singer learns the residual embedding with residual encoders that take as input the previous joint embedding h_{i−1} as well as the attribute label a_i. The arrows from the vertical line to the attribute encoders in Figure 2(b) represent h_{i−1} being transmitted to the attribute encoders. Note that the duration predictor takes h_2 as input because duration is affected by a_1 (singer ID) and a_2 (emotion) but is independent of a_3 (pitch) and a_4 (energy). We learn the residual encoders through knowledge distillation from a style encoder based on global style tokens (GST). For more details, refer to [23].
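A minimal sketch of this sequential residual accumulation, assuming a simple one-layer convolutional encoder; the internals of the actual attribute encoders differ (see Table 3), so this only illustrates the conditioning scheme h_i = h_{i−1} + e_i.

import torch
import torch.nn as nn

class ResidualAttributeEncoder(nn.Module):
    """Predicts a residual e_i conditioned on the previous joint embedding h_{i-1}."""
    def __init__(self, n_labels, dim=384):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels, dim)
        self.net = nn.Sequential(nn.Conv1d(2 * dim, dim, kernel_size=1), nn.ReLU())

    def forward(self, h_prev, label_ids):
        # h_prev: (batch, seq_len, dim); label_ids: (batch,) attribute labels (e.g., singer ID)
        lab = self.label_emb(label_ids).unsqueeze(1).expand_as(h_prev)
        e_i = self.net(torch.cat([h_prev, lab], dim=-1).transpose(1, 2)).transpose(1, 2)
        return h_prev + e_i   # h_i = h_{i-1} + e_i

Applying a singer encoder and then an emotion encoder in this way yields the path h_0 -> h_1 -> h_2 described above.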

The variance adaptor outputs the joint embedding h_4. The length regulator aligns the joint embeddings to Mel-frames by duplicating them according to the duration of each phoneme. Finally, the decoder converts the aligned joint embeddings to the Mel-spectrogram. The lyric-pitch encoder and the decoder are composed of FFT blocks, while the attribute encoders and predictors are composed of CNNs. The detailed structure of the modules is presented in Section A.1.

3.2 Residual Pitch and Duration Predictors

Residual Pitch Predictor In general, the pitch and phoneme duration of a singing voice follow the note pitch and note duration. Additionally, they accumulate acoustic variation depending on the singer ID as well as the type and intensity of emotion [42, 21, 43, 40]. Since the singer ID and emotional intensity are mainly expressed through fine variations of pitch, U-Singer predicts the residual between the note pitch and the pitch of the singing voice with a residual pitch predictor, as in Equation (1), where p_eff, p_note, and PP denote the effective pitch, note pitch, and residual pitch predictor, respectively.

p_eff = p_note + PP(h_2)    (1)

As shown in Figure 2(b), we add the output of the residual pitch predictor to the note pitch to compute the effective pitch. We train the pitch predictor by minimizing the distance between the effective pitch and the pitch of the training sample, thereby leading the predictor to produce the residual pitch from the joint embedding h_2, which encodes the phoneme, singer ID, and emotional intensity. The predictor also refers to the note pitch because the initial phoneme embedding encodes the phoneme together with its note pitch, as shown in Figure 2(a). The experimental results in Section 4.2.2 demonstrate that the residual pitch predictor and encoder produce different vibrato according to emotional intensity.
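A sketch of the residual pitch prediction of Equation (1), assuming a small convolutional predictor with the variance-predictor sizes from Table 3; the effective pitch is the note pitch plus the predicted residual.

import torch
import torch.nn as nn

class ResidualPitchPredictor(nn.Module):
    def __init__(self, dim=384, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2), nn.ReLU())
        self.proj = nn.Linear(dim, 1)

    def forward(self, h, note_pitch):
        # h: (batch, seq_len, dim) joint embedding encoding phoneme, singer ID, and emotion
        # note_pitch: (batch, seq_len) pitch from the music score
        r = self.proj(self.net(h.transpose(1, 2)).transpose(1, 2)).squeeze(-1)
        return note_pitch + r   # p_eff = p_note + PP(h), as in Eq. (1)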

A few previous studies have predicted residual pitch instead of absolute pitch [5, 6]. However, our work differs from them in the following aspects: First, while the purpose of the previous works is to attenuate the off-pitch issue, we predict residual pitch to control pitch variation according to the singer ID and emotional intensity, and thereby to synthesize expressive singing voices. Second, we predict the residual pitch by a different method and use it in a different way from the previous works.

Residual Duration Predictor In SVS, the accumulation of duration prediction errors is fatal to maintaining synchronization with instruments or other voice parts. However, it is challenging to predict accurate phoneme durations while controlling multiple attributes. To minimize duration prediction error, we utilize the note duration by applying a residual duration predictor.

Similar to the residual pitch predictor, the residual duration predictor takes the joint embedding h_2 as input and outputs the residual duration, which is added to the note duration to compute the effective duration, as in Equation (2), where d_eff, d_note, and DP denote the effective duration, note duration, and residual duration predictor, respectively. Additionally, the residual duration predictor receives an embedding of the note duration because h_2 does not carry any information about the note duration of the phoneme.

d_eff = d_note + DP(h_2, Emb(d_note))    (2)
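A corresponding sketch of Equation (2); the note-duration embedding scheme (a lookup table over frame counts) is an assumption chosen to reflect that h_2 carries no note-duration information.

import torch
import torch.nn as nn

class ResidualDurationPredictor(nn.Module):
    def __init__(self, dim=384, max_frames=512):
        super().__init__()
        self.dur_emb = nn.Embedding(max_frames, dim)  # embedding of the note duration (in frames)
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h, note_duration):
        # h: (batch, seq_len, dim); note_duration: (batch, seq_len) integer frame counts
        d = self.dur_emb(note_duration)
        r = self.net(torch.cat([h, d], dim=-1)).squeeze(-1)
        return note_duration + r   # d_eff = d_note + DP(h, Emb(d_note)), as in Eq. (2)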

Figure 3 demonstrates the effect of our residual duration predictor for a song that is 15.28 seconds long, compared with XiaoiceSing [5], which predicts absolute phoneme durations. [5] improves prediction accuracy by applying a syllable duration loss in addition to the phoneme duration loss. The middle row of Figure 3 exhibits the durations predicted by U-Singer, which are significantly more accurate than the predictions of [5] displayed in the bottom row.

Figure 3: Duration error comparison between our method and XiaoiceSing's method. Our duration predictor predicts the residual duration d_eff − d_note, while XiaoiceSing's duration predictor directly predicts the absolute duration and applies a syllable duration loss.

3.3 Emotion Embedding Interpolation and Extrapolation

Figure 4: Emotion embedding interpolation and extrapolation. U-Singer learns the emotion embeddings of intermediate intensity levels by interpolation.

In previous work, many emotional TTS models represent emotion types through embedding vectors [44, 16, 45, 23]. U-Singer predicts residual emotion embeddings by a combination of an embedding table and a residual encoder, following [23]. The embedding table learns the mean embedding of each emotional intensity level, from which the influence of the singer ID is removed by normalization [23, 46], while the residual encoder adapts the chosen embedding to the phoneme and previously applied attributes, as e_emo = R_emo(h_1, T(emo)), where e_emo, T(emo), and R_emo are the embedding of emotion emo, the entry of the embedding table for emo, and the residual emotion encoder, respectively.

We attempted two methods to learn multiple emotional intensity levels: level-wise embeddings and emotion interpolation. The former learns separate embeddings for each emotional intensity level, while the latter learns only one embedding for each emotion type and represents the other intensity levels by embedding interpolation, as shown in Figure 4. Our training data contains seven emotional intensity levels: neutral, three levels of happiness (happy-1 to happy-3, from weakest to strongest), and three levels of sadness (sad-1 to sad-3). Therefore, U-Singer learns seven separate embeddings with the level-wise embedding table. With emotion interpolation, however, it learns only three embeddings, one each for neutral, the strongest happiness (happy-3), and the strongest sadness (sad-3), and computes the embeddings of the intermediate intensity levels by interpolation as e_{happy-k} = (k/3) e_{happy-3} + (1 − k/3) e_{neutral} and e_{sad-k} = (k/3) e_{sad-3} + (1 − k/3) e_{neutral} for k = 1, 2.

Emotion interpolation has multiple advantages over level-wise emotion embeddings: First, it enables the model to express intermediate intensity levels that are not in the training data. Second, it enables emotion extrapolation, with an interpolation coefficient greater than 1 at synthesis time, to produce emotional intensities beyond those in the training data. Our demo page presents audio samples produced with such extrapolated coefficients. Third, applying emotion interpolation during training leads the model to learn a linear embedding space, as shown in Figure 5(b).
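A minimal sketch of the interpolation/extrapolation rule; the coefficient values are illustrative, assuming linear mixing between the learned neutral and strongest-emotion embeddings.

import torch

def emotion_embedding(e_neutral: torch.Tensor, e_strong: torch.Tensor, alpha: float) -> torch.Tensor:
    """alpha = 0 -> neutral, alpha = 1 -> strongest trained level,
    0 < alpha < 1 -> interpolation, alpha > 1 -> extrapolation."""
    return (1.0 - alpha) * e_neutral + alpha * e_strong

With three trained intensity levels, for example, level k (k = 1, 2, 3) corresponds to alpha = k / 3, and alpha > 1 pushes the intensity beyond the training data.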

A previous study [16] also applies emotion interpolation to mix different types of emotion. However, while they only apply emotion interpolation in synthesis, we apply it in both synthesis and training. Moreover, in synthesis, we not only interpolate emotions but also extrapolate them.

3.4 ASPP-Transformer

The variation of phoneme durations in singing voices is substantially higher than in ordinary speech. To synthesize high-fidelity singing voices, the model requires a sufficiently large receptive field. Meanwhile, to effectively learn fine-grained acoustic features, which are important for expressing emotion [21], the model should also capture local details. The FFT block is composed of a self-attention sub-layer and a convolution sub-layer. Previous work has shown that convolution refers to a limited context [47] and that self-attention requires a large amount of data to train sufficiently [48, 49]. Naively enlarging the convolution filter drastically increases the computational cost and the number of parameters and introduces a risk of overfitting.

To tackle this challenge, we extended the FFT block by replacing its convolution with atrous spatial pyramid pooling (ASPP) [24], as shown in Figure 2(c). We call the new building block the ASPP-Transformer. The ASPP-Transformer inherits the advantage of ASPP: it can refer to a broad context with a small amount of computation and a small number of parameters. To focus on local neighborhoods while incorporating a broad context, we assign a larger number of channels to the filters with low atrous rates.
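A sketch of an ASPP-style convolution sub-layer that could replace the FFT block's convolution; the dilation rates [1, 3, 5, 7] and channel split [768, 384, 192, 192] follow Table 3, while the exact wiring of the full ASPP-Transformer block (attention sub-layer, normalization, etc.) is omitted and assumed.

import torch
import torch.nn as nn

class ASPPConv1d(nn.Module):
    def __init__(self, dim=384, kernel=9, rates=(1, 3, 5, 7), channels=(768, 384, 192, 192)):
        super().__init__()
        # more channels are assigned to the low-rate (local) branches
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, c, kernel, dilation=r, padding=r * (kernel - 1) // 2)
            for r, c in zip(rates, channels))
        self.proj = nn.Conv1d(sum(channels), dim, kernel_size=1)

    def forward(self, x):
        # x: (batch, seq_len, dim); residual connection as in an FFT block
        y = x.transpose(1, 2)
        y = torch.cat([torch.relu(b(y)) for b in self.branches], dim=1)
        return x + self.proj(y).transpose(1, 2)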

3.5 Training

U-Singer produces the Mel-spectrogram of a singing voice from the lyrics, note pitch, note duration, singer ID, and emotional intensity. We train U-Singer by minimizing the reconstruction loss between the ground-truth and synthesized Mel-spectrograms. We also minimize the reconstruction losses between the predicted pitch, energy, and duration and those of the training sample to improve the accuracy of the predictors. In addition, we add an adversarial loss to alleviate the over-smoothing issue, following [4, 7, 6]. The total loss combines these losses by a weighted sum, as in Equation (3), where L_mel, L_p, L_e, and L_d are the reconstruction losses for the Mel-spectrogram, pitch, energy, and duration, L_adv is the adversarial loss, and λ_mel, λ_p, λ_e, λ_d, and λ_adv are their weights. For details of our training procedure, see Section A.3.

L_total = λ_mel L_mel + λ_p L_p + λ_e L_e + λ_d L_d + λ_adv L_adv    (3)
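A one-line sketch of the weighted sum in Equation (3); the weight values are placeholders, as the paper does not list them here.

def total_loss(l_mel, l_pitch, l_energy, l_dur, l_adv,
               w_mel=1.0, w_pitch=1.0, w_energy=1.0, w_dur=1.0, w_adv=1.0):
    # L_total: weighted reconstruction losses plus the adversarial loss
    return (w_mel * l_mel + w_pitch * l_pitch + w_energy * l_energy
            + w_dur * l_dur + w_adv * l_adv)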

3.6 Data Collection

To collect a singing voice dataset, we selected 69 Korean pop songs and categorized them into happy and sad songs. We asked two professional singers (one male and one female) and two amateur singers (one male and one female) to sing the selected songs four times, each at a different emotional intensity level: neutral, happy-1, happy-2, and happy-3 for the happy songs, and neutral, sad-1, sad-2, and sad-3 for the sad songs.

The biggest challenges were defining the guideline for each emotional intensity level and guiding the singers to follow the guidelines while singing. We consulted a vocal trainer to establish the guideline for each emotional intensity level. Then, we asked the professional singers to sing the selected songs multiple times at different emotional intensity levels following the guideline. We collected one hour of reference singing voice samples from the professional singers. With the guideline and the reference singing voices, we guided the amateur singers to sing at different emotional intensity levels. In this way, we collected 11.23 hours of singing voice samples from the amateur singers. Since it is challenging for an amateur singer to consistently follow the guidelines, our dataset contains samples with unstable pitch and inaccurate pronunciation.

4 Experiments

4.1 Experimental Settings

Dataset For the experiments, we combined our internal dataset, described in Section 3.6, with 2.12 hours of Korean singing voices from the Children's Song Dataset (CSD) [50]. The combined dataset consists of 7,672 singing voice samples, 5-10 seconds long, sung by five singers. We used 7,120 samples for training and 552 samples for testing. Since the samples in the CSD dataset do not have emotion labels, we labeled all of them as 'neutral'.

Baseline models Because U-Singer is the first multi-singer emotional SVS model, there is no existing baseline model for a fair comparison. Therefore, we built a baseline by extending FastSpeech2 [22] to an SVS model and adding a speaker embedding table and an emotion embedding table, as shown in Section A.2. We call this SVS model 'extended FastSpeech2' (EXT.FS2). EXT.FS2 differs from U-Singer in two aspects: 1) the attribute encoders of EXT.FS2 predict attribute embeddings independently of the previous attributes, while those of U-Singer predict conditional residual embeddings, and 2) EXT.FS2 applies standard FFT blocks, while U-Singer applies ASPP-Transformer blocks in the decoder.

In addition, we built two more SVS models derived from U-Singer for the ablation study, namely ABS.PITCH and KERNEL13. We designed ABS.PITCH to evaluate the effectiveness of the residual pitch predictor described in Section 3.2. ABS.PITCH is identical to U-Singer except for one difference: ABS.PITCH predicts the absolute pitch directly, while U-Singer predicts the residual pitch p_eff − p_note, as described in Section 3.2. On the other hand, KERNEL13 was designed to test the effectiveness of the ASPP-Transformer block. KERNEL13 has the same architecture as U-Singer but applies the standard FFT block with a large convolution kernel of width 13.

Environment and Hyperparameters We trained each SVS model on a single RTX 3090 GPU with 24 GB of memory for 1.5 days. We used the Adam optimizer with a learning rate of 0.001 and a batch size of 16. A detailed description of the hyperparameters is presented in Appendix A.

4.2 Experimental Results

4.2.1 Quantitative Evaluation

Method       Pronunciation Accuracy   Sound Quality   Naturalness
G.T.         4.65 ± 0.24              4.43 ± 0.28     4.59 ± 0.32
EXT.FS2      3.61 ± 0.37              3.74 ± 0.33     3.74 ± 0.39
ABS.PITCH    3.80 ± 0.37              3.54 ± 0.29     3.67 ± 0.47
KERNEL13     3.50 ± 0.40              3.52 ± 0.31     3.61 ± 0.42
U-Singer     4.31 ± 0.36              4.35 ± 0.34     3.93 ± 0.38

Table 1: MOS evaluation (higher is better) with 95% confidence intervals

Method       Singer Similarity (MOS)   Emotion Type Accuracy   Emotional Intensity Accuracy
G.T.         4.54 ± 0.34               89.81%                  79.63%
EXT.FS2      2.56 ± 0.49               93.52%                  73.15%
ABS.PITCH    3.13 ± 0.64               91.67%                  83.33%
KERNEL13     2.89 ± 0.53               91.67%                  80.56%
U-Singer     3.43 ± 0.61               96.30%                  95.37%

Table 2: Singer similarity (MOS) and accuracy of emotion type and intensity

Audio Quality We evaluated the performance of the SVS models by MOS tests. We measured overall audio quality by pronunciation accuracy, sound quality, and naturalness, and expressiveness by singer similarity, emotion type accuracy, and emotion intensity accuracy. Table 1 presents the test results from 18 subjects. (Many previous studies in the TTS and SVS fields listed in our references report MOS results evaluated by 10-20 raters [23, 22, 30, 5, 6].) U-Singer exhibited the highest pronunciation accuracy, sound quality, and naturalness among the SVS models. The differences in pronunciation accuracy and sound quality were substantial, but the difference in naturalness was less significant than for the other two metrics.

Expressiveness Regarding expressiveness, U-Singer exhibited remarkably better results than the other models. The singer similarity score of U-Singer was 3.43, substantially higher than the 2.56 of the EXT.FS2 model. We conducted two experiments to evaluate the emotion type and intensity expression performance: we first asked the subjects to identify the emotion type of the singing voice samples, and then asked them to find the sample with the stronger emotional intensity between two randomly chosen samples with different emotional intensities but the same emotion type.

Table 2 displays the results. U-Singer showed the highest accuracy in both tests. All SVS models exhibited high emotion type accuracy; among them, the KERNEL13 model, which does not apply the ASPP-Transformer, and the ABS.PITCH model, which does not apply the residual pitch predictor, showed the lowest emotion type accuracy. In particular, the emotional intensity accuracy of the EXT.FS2 model, which does not apply the unified embedding space, was significantly lower than those of the other models. The scores of the ABS.PITCH and KERNEL13 models suggest that the proposed residual pitch predictor and ASPP-Transformer are effective in improving the emotion expression performance of the SVS model.

Duration prediction We evaluated the effectiveness of the residual duration predictor by measuring the duration prediction error on the test samples as Err = (1/N) Σ_{i=1}^{N} |Σ_{j=1}^{M_i} d̂_{i,j} − Σ_{j=1}^{M_i} d_{i,j}| / Σ_{j=1}^{M_i} d_{i,j}, where N, M_i, d̂_{i,j}, and d_{i,j} denote the number of samples, the number of phonemes in each sample, the effective duration predicted by the model, and the note duration, respectively. Please note that Err measures the duration error at the sample level rather than at the phoneme level.

We compared our residual duration predictor with an absolute duration predictor that applies both phoneme-level and syllable-level losses, following XiaoiceSing [5]. In the experiment, the error of our model was 1.8%, while that of the XiaoiceSing-style predictor was 5.4%, which demonstrates that the residual duration predictor is effective in reducing duration prediction error and thereby maintaining singing speed.
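For clarity, a small sketch of the sample-level duration error as we read it from the definition above (per-sample relative error between summed effective durations and summed note durations, averaged over samples):

def duration_error(pred_durations, note_durations):
    # pred_durations, note_durations: per-sample lists of phoneme durations (in frames)
    errors = []
    for pred, note in zip(pred_durations, note_durations):
        total_pred, total_note = sum(pred), sum(note)
        errors.append(abs(total_pred - total_note) / total_note)
    return sum(errors) / len(errors)   # sample-level, not phoneme-level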

4.2.2 Visualization Results

(a) Singer Embedding
(b) Emotion Embedding
Figure 5: Visualization of U-Singer's residual embeddings. We visualized the residual singer and emotion embeddings in a 2-dimensional space using principal component analysis (PCA). We manually drew the lines for the convenience of the reader.

Embedding visualization U-Singer represents each style attribute by a residual embedding e_i, following UniTTS [23]. We visualized the distribution of the residual embeddings using principal component analysis (PCA). Figure 5 displays the distribution of the residual singer (emotion) embeddings colored by singer (emotion) label. The singer and emotion embeddings are well clustered according to their labels, which suggests that the proposed method learns attribute embeddings highly correlated with the corresponding attribute labels. In particular, because of the emotion interpolation during training, the center coordinates of the emotional intensity levels show a linear arrangement, as indicated by the blue (sad) and red (happy) lines in Figure 5(b).

(a) Pitch embeddings learned by conventional encoder
(b) Energy embeddings learned by conventional encoder
(c) Pitch embeddings learned by residual encoder
(d) Energy embeddings learned by residual encoder
Figure 6: Visualization of pitch and energy embeddings obtained by the conventional encoders of EXT.FS2 (a, b) and the residual encoders of U-Singer (c, d).

Conditional residual embeddings To evaluate the effectiveness of our conditional residual encoder in expressing emotional intensity, we compared the attribute embeddings learned by the conditional residual encoder and by a conventional attribute encoder. Our encoder predicts a residual embedding conditioned on the previous attributes, while the conventional encoder predicts an attribute embedding independently of the other attributes. Figure 6 visualizes the embeddings: (a) and (b) display the pitch and energy embeddings learned by the conventional encoders of the EXT.FS2 model, while (c) and (d) show the embeddings learned by our encoders. In (a) and (b), the embeddings are almost independent of emotional intensity, while the embeddings in (c) and (d) show a substantial correlation with emotional intensity.

(a) ABS.PITCH
(b) U-Singer
Figure 7: F0 contour of singing voices synthesized with different emotional intensities.

F0 contour and phoneme durations varying with emotional intensity We analyzed the effectiveness of the proposed residual pitch predictor and encoder by visualizing the pitch contours of the singing voices synthesized by U-Singer and by the ABS.PITCH model. Figure 7 displays the visualization results. From the top, each row displays the pitch contour synthesized with one of the four emotion labels, from neutral to the strongest intensity level. U-Singer controlled the strength of vibrato according to the emotional intensity level, as shown in (b), while ABS.PITCH produced pitch contours that were not significantly affected by the emotional intensity label, as shown in (a). Figure 8 displays the spectrograms synthesized by U-Singer with different emotional intensities. It shows that U-Singer predicted pitch and phoneme duration differently according to the emotional intensity level, thereby expressing emotion through the variation of prosodic attributes.

Figure 8: Mel-Spectrograms synthesized with different emotional intensities.

5 Conclusion

In this paper, we proposed U-Singer, the first multi-singer emotional singing voice synthesizer that controls emotional intensity based on a unified embedding space. U-Singer synthesizes singing voices from the lyrics, note pitch, and note duration while controlling multiple attributes such as singer ID and emotional intensity. It represents emotional intensity by controlling fine variations of pitch, energy, and phoneme duration. Additionally, we proposed novel emotion interpolation and extrapolation techniques as well as the ASPP-Transformer. In experiments, U-Singer synthesized singing voices reflecting the specified emotional intensity level. We also presented multiple visualization results that confirm the effectiveness of the proposed method in expressing the type and intensity of emotions.

References

  • [1] Michael Macon, Leslie Jensen-Link, E Bryan George, James Oliverio, and Mark Clements. Concatenation-based midi-to-singing voice synthesis. In Audio Engineering Society Convention 103. Audio Engineering Society, 1997.
  • [2] Keijiro Saino, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee, and Keiichi Tokuda. An hmm-based singing voice synthesis system. In 9th International Conference on Spoken Language Processing, 2006.
  • [3] Hideki Kenmochi and Hayato Ohshita. Vocaloid-commercial singing synthesizer based on sample concatenation. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), volume 2007, pages 4010–4011, 2007.
  • [4] Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, and Kyogu Lee. Adversarially trained end-to-end korean singing voice synthesis system. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), volume 2019, pages 2588–2592, 2019.
  • [5] Peiling Lu, Jie Wu, Jian Luan, Xu Tan, and Li Zhou. Xiaoicesing: A high-quality and integrated singing voice synthesis system. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), volume 2020, pages 1306–1310, 2020.
  • [6] Jiawei Chen, Xu Tan, Jian Luan, Tao Qin, and Tie-Yan Liu. Hifisinger: Towards high-fidelity neural singing voice synthesis. arXiv preprint arXiv:2009.01776, 2020.
  • [7] Gyeong-Hoon Lee, Tae-Woo Kim, Hanbin Bae, Min-Ji Lee, Young-Ik Kim, and Hoon-Young Cho. N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), volume 2021, pages 1589–1593, 2021.
  • [8] Younsung Park, Sungrack Yun, and Chang D Yoo. Parametric emotional singing voice synthesis. In ICASSP 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4814–4817. IEEE, 2010.
  • [9] Yin-Ping Cho, Fu-Rong Yang, Yung-Chuan Chang, Ching-Ting Cheng, Xiao-Han Wang, and Yi-Wen Liu. A survey on recent deep learning-driven singing voice synthesis systems. In 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), pages 319–323. IEEE, 2021.
  • [10] Florian Eyben, Sabine Buchholz, Norbert Braunschweiler, Javier Latorre, Vincent Wan, Mark JF Gales, and Kate Knill. Unsupervised clustering of emotion and voice styles for expressive tts. In ICASSP 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4009–4012. IEEE, 2012.
  • [11] Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, and Rif A Saurous. Uncovering latent style factors for expressive speech synthesis. arXiv preprint arXiv:1711.00520, 2017.
  • [12] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ-Skerry Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pages 5180–5189. Proceedings of Machine Learning Research, 2018.

  • [13] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron Weiss, Rob Clark, and Rif A Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pages 4693–4702. Proceedings of Machine Learning Research, 2018.
  • [14] Vincent Wan, Chun-An Chan, Tom Kenter, Jakub Vit, and Rob Clark. Chive: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5806–5815. Proceedings of Machine Learning Research, 2019.
  • [15] Rafael Valle, Jason Li, Ryan Prenger, and Bryan Catanzaro. Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6189–6193. IEEE, 2020.
  • [16] Se-Yun Um, Sangshin Oh, Kyungguen Byun, Inseon Jang, ChungHyun Ahn, and Hong-Goo Kang. Emotional speech synthesis with rich and granularized control. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7254–7258. IEEE, 2020.
  • [17] Bastian Schnell and Philip N Garner. Improving emotional tts with an emotion intensity input from unsupervised extraction. In Proceedings of 11th ISCA Speech Synthesis Workshop (SSW 11), pages 60–65, 2021.
  • [18] Takeshi Saitou, Masashi Unoki, and Masato Akagi. Extraction of f0 dynamic characteristics and development of f0 control model in singing voice. In Proceedings of International Community For Auditory Display, ICAD, pages 0–3, 2002.
  • [19] Yawen Xue, Yasuhiro Hamada, and Masato Akagi. Emotional speech synthesis system based on a three-layered model using a dimensional approach. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 505–514. IEEE, 2015.
  • [20] Thi-Hao Nguyen and Masato Akagi. Synthesis of expressive singing voice by f0, amplitude envelope and spectral feature conversion. In 2018 RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP2018). Research Institute of Signal Processing, Japan, 2018.
  • [21] Klaus R Scherer, Johan Sundberg, Bernardino Fantini, Stéphanie Trznadel, and Florian Eyben. The expression of emotion in the singing voice: Acoustic patterns in vocal performance. The Journal of the Acoustical Society of America, 142(4):1805–1815, 2017.
  • [22] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
  • [23] Minsu Kang, Sungjae Kim, and Injung Kim. Unitts: Residual learning of unified embedding space for speech style control. arXiv preprint arXiv:2106.11171, 2021.
  • [24] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, pages 5998–6008. Neural information processing systems foundation, 2017.
  • [26] Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Singing voice synthesis based on deep neural networks. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), volume 2016, pages 2478–2482, 2016.
  • [27] Kazuhiro Nakamura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Singing voice synthesis based on convolutional neural networks. arXiv preprint arXiv:1904.06868, 2019.
  • [28] Juntae Kim, Heejin Choi, Jinuk Park, Minsoo Hahn, Sangjin Kim, and Jong-Jin Kim. Korean singing voice synthesis system based on an lstm recurrent neural network. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), volume 2018, pages 1551–1555, 2018.
  • [29] Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda. Singing voice synthesis based on generative adversarial networks. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6955–6959. IEEE, 2019.
  • [30] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems (NeurIPS), volume 32. Neural information processing systems foundation, 2019.
  • [31] Pritish Chandna, Merlijn Blaauw, Jordi Bonada, and Emilia Gómez. Wgansing: A multi-voice singing voice synthesizer based on the wasserstein-gan. In 2019 27th European Signal Processing Conference (EUSIPCO), pages 1–5. IEEE, 2019.
  • [32] Shreeviknesh Sankaran, Sukavanan Nanjundan, and G Paavai Anand. Anyone gan sing. arXiv preprint arXiv:2102.11058, 2021.
  • [33] Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Peng Liu, and Zhou Zhao. Diffsinger: Diffusion acoustic model for singing voice synthesis. arXiv preprint arXiv:2105.02446, 2021.
  • [34] Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems (NeurIPS), volume 30. Neural information processing systems foundation, 2017.
  • [35] Mingjian Chen, Xu Tan, Yi Ren, Jin Xu, Hao Sun, Sheng Zhao, Tao Qin, and Tie-Yan Liu. Multispeech: Multi-speaker text to speech with transformer. arXiv preprint arXiv:2006.04664, 2020.
  • [36] Liqiang Zhang, Chengzhu Yu, Heng Lu, Chao Weng, Chunlei Zhang, Yusong Wu, Xiang Xie, Zijin Li, and Dong Yu. Durian-sc: Duration informed attention network based singing voice conversion system. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), volume 2020, pages 1231–1235, 2020.
  • [37] Shoutong Wang, Jinglin Liu, Yi Ren, Zhen Wang, Changliang Xu, and Zhou Zhao. Mr-svs: Singing voice synthesis with multi-reference encoder. arXiv preprint arXiv:2201.03864, 2022.
  • [38] Merlijn Blaauw, Jordi Bonada, and Ryunosuke Daido. Data efficient voice cloning for neural singing synthesis. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6840–6844. IEEE, 2019.
  • [39] Juheon Lee, Hyeong-Seok Choi, Junghyun Koo, and Kyogu Lee. Disentangling timbre and singing style with multi-singer singing synthesis system. In ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7224–7228. IEEE, 2020.
  • [40] Tua Hakanpää, Teija Waaramaa, and Anne-Maria Laukkanen. Training the vocal expression of emotions in singing: Effects of including acoustic research-based elements in the regular singing training of acting students. Journal of Voice, 2021.
  • [41] Tao Li, Shan Yang, Liumeng Xue, and Lei Xie. Controllable emotion transfer for end-to-end speech synthesis. In 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–5. IEEE, 2021.
  • [42] Preety Goswami and Makarand Velankar. Study paper for timbre identification in sound. International Journal of Engineering Research and Technology (IJERT), 2(10), 2013.
  • [43] Kuan-Yi Kang, Yi-Wen Liu, and Hsin-Min Wang. Influences of prosodic feature replacement on the perceived singing voice identity. In Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019), pages 296–309. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), 2019.
  • [44] Chunhui Lu, Xue Wen, Ruolan Liu, and Xiao Chen. Multi-speaker emotional speech synthesis with fine-grained prosody modeling. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5729–5733. IEEE, 2021.
  • [45] Pengfei Wu, Zhenhua Ling, Lijuan Liu, Yuan Jiang, Hongchuan Wu, and Lirong Dai. End-to-end emotional speech synthesis using style tokens and semi-supervised training. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 623–627. IEEE, 2019.
  • [46] Younggun Lee and Taesu Kim. Robust and fine-grained prosody control of end-to-end speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5911–5915. IEEE, 2019.
  • [47] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [48] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS), volume 34. Neural information processing systems foundation, 2021.
  • [49] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803, 2021.
  • [50] Soonbeom Choi, Wonil Kim, Saebyul Park, Sangeon Yong, and Juhan Nam. Children’s song dataset for singing voice research. In The 21th International Society for Music Information Retrieval Conference (ISMIR). International Society for Music Information Retrieval, 2020.

Appendix A Detailed Model Architectures

Input data pre-processing The model takes a score consisting of lyrics, note pitch, and note duration as input, which are pre-processed in the following steps. The lyrics are converted to a phoneme sequence using the rule-based grapheme-to-phoneme algorithm g2pk (https://github.com/Kyubyong/g2pK), the note pitch is converted to a pitch index based on the MIDI standard, and the note duration is converted to the number of Mel-spectrogram frames. Finally, each sequence is mapped into dense vectors using an embedding lookup table.
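A sketch of this pre-processing pipeline; the sample rate and hop length are assumptions (they are not given in the paper), and the g2pk call follows the library's documented interface, which returns pronunciation-level Hangul rather than a final phoneme list.

from g2pk import G2p

SAMPLE_RATE = 22050   # assumed
HOP_LENGTH = 256      # assumed

g2p = G2p()

def preprocess(lyrics, midi_pitches, note_durations_sec):
    phonemes = g2p(lyrics)                                  # grapheme-to-phoneme conversion of the lyrics
    pitch_indices = list(midi_pitches)                      # note pitch as MIDI note numbers
    frame_durations = [round(d * SAMPLE_RATE / HOP_LENGTH)  # note duration -> Mel-frame count
                       for d in note_durations_sec]
    return phonemes, pitch_indices, frame_durations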

Model Configuration U-Singer is largely composed of the encoder, variance adaptor, decoder, and discriminator. The encoder is composed of six blocks, each consisting of multi-head self-attention and two convolution layers. The variance adaptor consists of per-attribute predictors with two convolution layers and per-attribute encoders with one convolution layer. The decoder consists of six blocks built on the proposed ASPP-Transformer. The discriminator was constructed following the SF-GAN of [6]; each of the three discriminators has three 2-D convolution layers and one linear projection layer.

Phoneme embedding: dimension 384
Lyric-pitch encoder: layers 6; hidden dim. 384; Conv1D kernel 9; Conv1D filter size 1536
Mel-spectrogram decoder: layers 6; hidden dim. 384; Conv1D kernel 9; Conv1D kernel dilation rates [1, 3, 5, 7]; Conv1D filter sizes [768, 384, 192, 192]; attention heads 2
Encoder/decoder: attention heads 2; dropout 0.2
Variance predictor: layers 2; Conv1D kernel 3; Conv1D filter size 384; dropout 0.5
Reference encoder: layers 6; Conv2D kernel (3, 3); Conv2D filter sizes (32, 32, 64, 64, 128, 128); Conv2D stride (2, 2); hidden dim. of GRU 192
Style token layer: tokens 10; token dim. 48; attention hidden dim. 384; attention heads 8
Prosody encoder: Conv1D kernel 3; Conv1D filter size 384; dropout 0.5
Discriminator: layers 3; Conv2D kernel (9, 9); Conv2D filter sizes (1, 64, 64, 64, 64, 64); Conv2D stride (1, 1)
Total number of parameters: 101M

Table 3: The hyperparameters of U-Singer

A.1 Overall Architecture

Figure 9: Detailed version of the overall architecture

A.2 Baseline

Figure 10: Detailed version of the baseline

A.3 Training Algorithm

Following UniTTS [23], the training procedure of U-Singer is divided into two phases:
1. Train the generator with the style encoder activated and the singer encoder, emotion encoder, and discriminator deactivated.
2. Train all modules with the style encoder deactivated:
 (1) Train the singer and emotion embedding tables by distilling knowledge from the trained style encoder.
 (2) Freezing the singer and emotion embedding tables, train all modules with the style encoder deactivated.

  Input: lyrics, note pitch, note duration, and a reference Mel-spectrogram of singer s and emotion e.
  Initialize: a lookup table (LUT) for each input except the reference Mel-spectrogram.
  for a predefined number of iterations do
     Retrieve the low-level embeddings of the phonemes and note pitches from the LUTs.
     Extract the high-level embedding with the lyric-pitch encoder.
     Add the style embedding extracted from the reference Mel-spectrogram.
     Predict the phoneme durations.
     Predict the pitch and add its embedding to the joint embedding.
     Predict the energy and add its embedding to the joint embedding.
     Align the joint embeddings by duplicating them according to the durations.
     Synthesize a Mel-spectrogram with the decoder.
     Compute the reconstruction losses for the Mel-spectrogram, pitch, energy, and duration against the ground truth, and back-propagate.
  end for
Algorithm 1: Training Phase #1 — train the generator with the reference (GST) style encoder.
  Input: reference Mel-spectrograms, each labeled with a singer ID s and an emotion ID e.
  Knowledge distillation from the trained style encoder:
  for each singer s do
     distill the singer embedding of s into the singer embedding table.
  end for
  for each emotion type e do
     distill the emotion embedding of e into the emotion embedding table.
  end for
  Deactivate the style encoder.
  Input: lyrics, note pitch, note duration.
  Initialize: a lookup table for each input.
  for a predefined number of iterations do
     Retrieve the low-level embeddings of the phonemes and note pitches from the LUTs.
     Extract the high-level embedding with the lyric-pitch encoder.
     Add the singer embeddings.
     Add the emotion embeddings.
     Predict the phoneme durations.
     Predict the pitch and add its embedding to the joint embedding.
     Predict the energy and add its embedding to the joint embedding.
     Align the joint embeddings by duplicating them according to the durations.
     Synthesize a Mel-spectrogram with the decoder.
     Compute the generator loss (reconstruction and adversarial terms) and back-propagate.
     Compute the discriminator loss and back-propagate.
  end for
Algorithm 2: Training Phase #2 — train U-Singer.

Appendix B Dataset Detail

The entire dataset we used is shown in Table 4 below. We trained on a total of 13.49 hours of data, including the CSD dataset [50], and conducted the evaluation on 0.95 hours of test data. Table 5 shows the ratio of professional and amateur singers among the samples.

          Number of Samples        Length (Hours)
Singer    Train       Test         Train     Test
F0        1,082       71           1.99      0.13
F1        236         0            0.57      0
M1        219         0            0.52      0
F2        3,073       247          5.54      0.47
M2        2,510       234          4.87      0.35
Total     7,120       552          13.49     0.95

Table 4: Dataset specification
          Number of Samples   Hours   Ratio
Pro       1,608               3.21    21%
Amateur   6,064               11.23   79%
Total     7,672               14.44   100%

Table 5: Dataset specification (professional/amateur ratio)

Appendix C MOS Evaluation Interface

Figure 11: The interface of MOS evaluation
