JVS-MuSiC: Japanese multispeaker singing-voice corpus

01/20/2020 ∙ by Hiroki Tamaru, et al. ∙ The University of Tokyo

Thanks to developments in machine learning techniques, it has become possible to synthesize high-quality singing voices of a single singer. An open multispeaker singing-voice corpus would further accelerate research in singing-voice synthesis. However, conventional singing-voice corpora consist only of the singing voices of a single singer. We designed a Japanese multispeaker singing-voice corpus called "JVS-MuSiC" with the aim of analyzing and synthesizing a variety of voices. The corpus consists of 100 singers' recordings of the same song, Katatsumuri, which is a Japanese children's song. It also includes another song that is different for each singer. In this paper, we describe the design of the corpus and experimental analyses using JVS-MuSiC. We investigated two relationships: 1) between the similarity of singing voices and the perceptual oneness of unison singing voices and 2) between the similarity of singing voices and that of speech. The results suggest that 1) there is a positive and moderate correlation between singing-voice similarity and the oneness of unison and that 2) the correlation between singing-voice similarity and speech similarity is weak. This corpus is freely available online.

1 Introduction

Thanks to developments in machine learning techniques, it has become possible to synthesize high-quality singing voices of a single singer [nishimura2016singing, blaauw2017neural]. An open multispeaker singing-voice corpus would further accelerate research in singing-voice synthesis. However, conventional singing-voice corpora [hts, jsutsong] consist only of the singing voices of a single singer. We designed a Japanese multispeaker singing-voice corpus called “JVS-MuSiC” with the aim of analyzing and synthesizing a variety of voices. The corpus consists of 100 singers’ recordings of the same song, Katatsumuri, which is a Japanese children’s song. It also includes another song that is different for each singer. In this paper, we describe the design of the corpus and experimental analyses using JVS-MuSiC. We investigated two relationships: 1) between the similarity of singing voices and the perceptual oneness of unison singing voices and 2) between the similarity of singing voices and that of speech. The results suggest that 1) there is a positive and moderate correlation between singing-voice similarity and the oneness of unison and that 2) the correlation between singing-voice similarity and speech similarity is weak. This corpus is freely available online.

2 Current Japanese singing-voice corpora

Existing Japanese singing-voice corpora each contain the voice of only a single singer. The singing-voice corpus included in the demo of HTS [hts] consists of 31 Japanese children’s songs sung by a single female singer. JSUT-song [jsutsong] consists of 25 songs from the HTS demo, sung by a different female singer. There is also a database of synthesized singing voices, Tohoku Kiritan’s singing-voice database [kiritan], which consists of 50 songs sung by Tohoku Kiritan, a female character of VOICEROID.

3 Design of JVS-MuSiC

The main purpose of JVS-MuSiC is to cover various singers’ singing voices to enable the analysis and synthesis of the personality of singing voices. The 100 singers are the same as those of the JVS corpus [takamichi19jvs]; thus, it is also possible to investigate the relationship between speech and singing voices. The following sections describe how we designed the corpus.

3.1 Structures

The directory structure of the corpus is listed below. The singer name is formatted as jvs[SPKR_ID], where SPKR_ID is the speaker ID in the range 1 through 100, zero-padded to three digits (e.g., jvs001).

.
  jvs001
    song_common
      wav
        raw.wav
        modified.wav
        modified_grouped.wav
      mpd
        modified.mpd
        modified_grouped.mpd
    song_unique
      wav
        raw.wav
  jvs002
  …
  jvs100
  similarity
    similarity_{name of group}.csv
  oneness
    oneness_{name of group}.csv
  singer_info.txt
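As an illustration of the layout, the following minimal sketch builds the paths of the three versions of the common song for one singer. The helper names and the root directory name "jvs_music" are hypothetical and not part of the corpus; only the directory and file names come from the listing.

```python
from pathlib import PurePosixPath


def singer_dir(speaker_id: int) -> str:
    """Return the directory name for a speaker ID in 1..100, e.g. 'jvs001'."""
    if not 1 <= speaker_id <= 100:
        raise ValueError("speaker ID must be in the range 1 through 100")
    return f"jvs{speaker_id:03d}"


def common_song_paths(root: str, speaker_id: int) -> dict:
    """Map each version name to the path of the common-song wav file."""
    wav_dir = PurePosixPath(root) / singer_dir(speaker_id) / "song_common" / "wav"
    return {
        name: str(wav_dir / f"{name}.wav")
        for name in ("raw", "modified", "modified_grouped")
    }


# "jvs_music" is an assumed root directory name used only for this example.
paths = common_song_paths("jvs_music", 1)
print(paths["raw"])  # jvs_music/jvs001/song_common/wav/raw.wav
```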

3.1.1 Raw voice (raw.wav)

Recorded wav files of Katatsumuri and singer-dependent songs are stored in song_common/wav/ and song_unique/wav/, respectively. Each singer sang Katatsumuri in his or her favorite key and tempo; thus, the key and tempo vary among singers. The key and tempo are not completely consistent in each recording because the singers did not sing along with an example recording or a guide melody sound.

3.1.2 Modified voices

There are two versions of modified voices.

Figure 1: Scatter plot of key and tempo.
  • modified.wav
    We determined the closest key and tempo for each singer and modified the raw voices using the singing-voice modification software Melodyne [melodyne], so that they sound as if the singer had sung accurately in that key and tempo. Figure 1 shows the distribution of the keys and tempos of the modified voices. There is a four-beat silence at the beginning of each file. The key and tempo labels of the modified voices are stored in mpd_label.txt, which consists of the speaker ID, gender, tempo (BPM: beats per minute), and key number (how many semitones higher than the lowest one).

  • modified_grouped.wav
    To facilitate analysis, we created a grouped version of the modified wav files by dividing the 100 singers into six groups (three for each gender) by key. We used Melodyne to unify the key within each group and the tempo across all groups (100 BPM). We chose Melodyne for this purpose because, in our experience, it changes voice timbre less than a method based on time stretching and resampling. Table 1 lists the keys and numbers of singers for the groups.

    Group           Key   # of singers
    Male-low        B     17
    Male-middle     D     16
    Male-high       E     16
    Female-low      A     17
    Female-middle   B     17
    Female-high     D     17
    Table 1: Description of singer groups
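The arithmetic behind unifying key and tempo can be sketched as follows. This is not how Melodyne works internally; it only shows the quantities involved, assuming equal temperament and key numbers expressed as semitone offsets. The function names and the example singer (key number 3, 120 BPM, moved to a group key number of 5) are hypothetical.

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift of the given number of semitones
    (equal temperament is assumed)."""
    return 2.0 ** (semitones / 12.0)


def duration_scale(source_bpm: float, target_bpm: float = 100.0) -> float:
    """Factor by which a recording's duration changes when its tempo is
    moved from source_bpm to target_bpm."""
    return source_bpm / target_bpm


def beats_to_seconds(beats: float, bpm: float = 100.0) -> float:
    """Length in seconds of a number of beats at the given tempo."""
    return beats * 60.0 / bpm


# Hypothetical singer: key number 3 at 120 BPM, moved to key number 5 at 100 BPM.
shift = 5 - 3
print(round(semitone_ratio(shift), 4))  # 1.1225
print(duration_scale(120.0))            # 1.2
print(beats_to_seconds(8))              # 4.8
```

At the unified tempo of 100 BPM, eight beats last 4.8 sec and 16 beats last 9.6 sec, which matches the sample lengths used in Section 4.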

3.1.3 MPD file (*.mpd)

modified.mpd and modified_grouped.mpd are Melodyne project documents that were used to create the corresponding modified voices. We used Melodyne 4 Assistant [melodyne] to modify singing voices. The editor was not a professional engineer, but had some experience in music production including vocal editing as an amateur. We now describe the modification steps.

  1. Determination of key and tempo

    We determined the closest key and tempo for each singing voice by listening to the voice and looking at the graphical user interface.

  2. Editing pitch

    We used the correct pitch macro to modify the pitch of musical notes. We set both pitch center and pitch drift to 100%, then manually checked and modified all notes with perceptual naturalness in mind.

  3. Editing time

    We used the time tool to manually modify the onset and offset of all musical notes considering the perceptual naturalness.

3.1.4 Similarity and oneness matrices (*.csv)

These csv files are the similarity and oneness matrices of the experimental analysis mentioned in Section 4.

3.2 Recording

We hired 100 native Japanese professional speakers: 49 males and 51 females. Their voices were recorded in a recording studio, and each speaker's recording was completed within one day. The recordings were controlled by a professional sound director. The voices were originally sampled at 48 kHz and downsampled to 24 kHz with the Speech Signal Processing Toolkit [sptk], and the 16-bit/sample RIFF WAV format was used. These settings are the same as those in the JVS corpus [takamichi19jvs]. The total duration of the 100 files of the common song was 49 min and 23 sec, and that of the singer-dependent songs was 88 min and 3 sec.
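A quick way to confirm that a file matches the stated format (24 kHz, 16 bits per sample) is sketched below with Python's standard wave module. The function name is hypothetical, and the demo writes a one-second silent mono file only so that the example runs on its own; the number of channels is an assumption, since it is not stated above.

```python
import os
import tempfile
import wave


def wav_matches_corpus_format(path, rate_hz=24000, bytes_per_sample=2):
    """Return (matches, duration_sec) for a RIFF WAV file.

    matches is True when the sampling rate is rate_hz and the sample
    width is bytes_per_sample (2 bytes = 16 bits per sample).
    """
    with wave.open(path, "rb") as wf:
        matches = (wf.getframerate() == rate_hz
                   and wf.getsampwidth() == bytes_per_sample)
        duration_sec = wf.getnframes() / wf.getframerate()
    return matches, duration_sec


# Demo on a synthetic file: one second of silence, mono (assumed), 24 kHz, 16-bit.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "raw.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)
        wf.writeframes(b"\x00\x00" * 24000)
    ok, seconds = wav_matches_corpus_format(demo_path)
    print(ok, seconds)  # True 1.0
```

Summing the returned durations over the 100 common-song files should reproduce the reported total of 49 min 23 sec, an average of about 29.6 sec per singer.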

4 Experimental analysis

4.1 Experimental conditions

We evaluated the inter-speaker similarity and oneness of unison for many pairs of singers. We used the grouped files for two reasons: the key should be unified when annotating similarity and must be unified when producing unison voices, and too much pitch shifting may cause artifacts and changes in voice timbre. For annotating perceptual similarity scores, we used 9.6-sec samples, which were made by concatenating two singers’ voice samples of the same 4.8-sec (eight-beat) musical phrase. We followed Saito et al.’s study [saito19perceptual], in which each listener scored the perceptual similarity of each pair of speakers on a scale ranging from “completely different” to “very similar”. For the evaluation of the oneness of unison, we used 9.6-sec (16-beat) samples of two singers’ unison singing voices. The unison voices were obtained by mixing the two singers’ voices; when mixing, we balanced the volume of the two voices by equalizing their mean squared amplitudes. The listeners scored the separateness of unison (i.e., how much the two singers’ voices were heard separately rather than as a united one) from 1 (heard as one) to 5 (heard separately). The oneness of unison was obtained by inverting the sign of the separateness of unison. A final score for each speaker pair was obtained by averaging the listeners’ scores. For analysis, we normalized the measured values into the range (0, 1). We used the crowdsourcing platform “Lancers” [lancers] and gathered ten participants for each pair of singers.
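The mixing and scoring steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say which voice is rescaled or how the normalization is done, so the sketch rescales the second voice, averages the two signals to limit clipping, and uses min-max normalization. All function names are hypothetical.

```python
def mean_square(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_unison(voice_a, voice_b):
    """Mix two voices after equalizing their mean squared amplitudes.

    Assumption: voice_b is rescaled to match voice_a, and the two
    signals are averaged.
    """
    n = min(len(voice_a), len(voice_b))
    a, b = voice_a[:n], voice_b[:n]
    ms_b = mean_square(b)
    if ms_b == 0:
        raise ValueError("voice_b is silent; cannot equalize levels")
    gain = (mean_square(a) / ms_b) ** 0.5
    return [0.5 * (x + gain * y) for x, y in zip(a, b)]


def oneness_from_separateness(listener_scores):
    """Average the listeners' separateness scores (1..5) and invert the sign."""
    return -sum(listener_scores) / len(listener_scores)


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1
    (assumed normalization; the exact method is not specified in the text)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot normalize")
    return [(v - lo) / (hi - lo) for v in values]


# Toy example: three singer pairs, each scored by a few listeners.
pair_scores = [[1, 2, 1], [3, 3, 4], [5, 4, 5]]
oneness = [oneness_from_separateness(s) for s in pair_scores]
print(min_max_normalize(oneness))
```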

4.2 Correlation between similarity and oneness of unison

Figure 2 is the scatter plot of average similarity and oneness of unison for all 784 pairs (4 × 136 + 2 × 120, counting every pair of singers within each group of Table 1). The correlation coefficient is 0.45 and the p-value is […]. Table 2 shows the group-wise results, which suggest that, to some extent, a pair of singers with similar voices produces a united unison voice, which is often considered to sound beautiful.
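The count of 784 pairs follows from the group sizes in Table 1 if pairs are formed only within a group, as the sketch below checks. It also includes a plain Pearson correlation as one way to compute a correlation coefficient between two score lists; the text does not state which coefficient was used, so Pearson is an assumption, and the function name is hypothetical.

```python
from math import comb, sqrt

# Group sizes taken from Table 1.
GROUP_SIZES = {
    "Male-low": 17, "Male-middle": 16, "Male-high": 16,
    "Female-low": 17, "Female-middle": 17, "Female-high": 17,
}

# Number of unordered singer pairs within each group, summed over groups.
total_pairs = sum(comb(n, 2) for n in GROUP_SIZES.values())
print(total_pairs)  # 784


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
```

With the per-pair similarity and oneness scores loaded from the csv files, pearson_r(similarity, oneness) would give the kind of coefficient reported here.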

Figure 2: Scatter plot of average similarity and oneness of unison. Correlation coefficient is 0.45 and p-value is […].

Table 2: Correlation coefficients and p-values for each group (Male-low, Male-middle, Male-high, Female-low, Female-middle, Female-high) and for all pairs

4.3 Correlation between singing-voice similarity and speech similarity

Figure 3: Scatter plot of average similarity of singing voice and speech. Correlation coefficient is 0.17 and p-value is […].

Figure 3 is the scatter plot of the average similarity of singing voices and the similarity of speech [takamichi19jvs] for the 784 pairs. The correlation coefficient is 0.17 and the p-value is […]. Although part of this low correlation may be due to the change in voice timbre caused by pitch shifting, this result suggests that a person’s singing voice and speech are fundamentally different or that listeners perceive singing voices and speech in different manners.

5 Conclusion

We introduced a Japanese multispeaker singing-voice corpus called JVS-MuSiC. This corpus was designed for multispeaker singing-voice analysis and synthesis and can be used for research in unison singing voices. We analyzed the relationship between the similarity of singer pairs and unison singing voices. The experimental results suggest that there is a positive and moderate correlation between the similarity and oneness of unison. We also compared singing-voice similarity to speech similarity and found that the correlation between them was weak.

The similarity and oneness matrices are licensed under CC BY-SA 4.0. The audio data and MPD files may be used for:

  • Research by academic institutions

  • Non-commercial research, including research conducted within commercial organizations

  • Personal use, including blog posts.

Our project page at https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music describes the terms for commercial use.

Acknowledgments: Part of this work was supported by the SECOM Science and Technology Foundation and the GAP foundation program of the University of Tokyo.

References