JVS corpus: free Japanese multi-speaker voice corpus

08/17/2019 ∙ by Shinnosuke Takamichi, et al. ∙ The University of Tokyo

Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered by a single speaker, for end-to-end text-to-speech synthesis. For more general use in speech synthesis research, e.g., voice conversion and multi-speaker modeling, in this paper, we construct the JVS corpus, which contains voice data of 100 speakers in three styles (normal, whisper, and falsetto). The corpus contains 30 hours of voice data including 22 hours of parallel normal voices. This paper describes how we designed the corpus and summarizes the specifications. The corpus is available at our project page.




1 Introduction

Thanks to developments in deep learning techniques, speech has become an active research target [hinton12dnnasr, oord16wavenet, takamichi17moment, saito18advss]. Nowadays, speech synthesis, e.g., text-to-speech, singing voice synthesis, voice conversion, and speech coding, is becoming a machine learning task. Easily accessible voice corpora not only accelerate speech-related research but also improve the reproducibility of studies. In 2017, we released a large-scale Japanese speech corpus, named the JSUT corpus [sonobe17jsut], for end-to-end text-to-speech synthesis. The corpus includes 10 hours of reading-style speech data uttered by a single native Japanese speaker and covers all pronunciations of daily-use characters and individual readings in Japanese [joyokanji]. Since Oct. 2017, the project page [jsut_corpus] has been accessed more than 6,000 times (75% from Japan and 25% from other countries) from more than 60 countries. We believe that the JSUT corpus has become one of the most used Japanese corpora for modern speech synthesis research [ueno19multispeakerend2endtts, luo19waveletf0feature].

For more general use in speech-related research, this paper introduces a new Japanese voice corpus, named the JVS (Japanese versatile speech) corpus. The corpus is designed to benefit many types of users, as follows.

  • High-quality format: The audio files are sampled at 24 kHz, encoded at 16 bit, and formatted in RIFF WAV.

  • High-quality recording: The recordings were controlled by a professional sound director and done in a recording studio.

  • Many speakers: The corpus includes 100 native Japanese speakers, all of whom are professionals, e.g., voice actors and actresses.

  • Many styles: Each speaker utters not only normal speech but also whisper and falsetto voices.

  • Large in scale: In total, the corpus contains 30 hours of voice data.

  • Parallel/non-parallel utterances: Each speaker utters parallel, i.e., common among speakers, and non-parallel, i.e., completely different among speakers, utterances.

  • Many tags: The corpus includes not only voice data but also transcriptions, gender information, F0 ranges, speaker similarity, and phoneme alignments.

  • Free for research: The corpus is free to use for research in academic institutions and commercial companies.

  • Easily accessible: The corpus is freely downloadable online.

The next section describes how we designed the corpus.

2 Corpus design

The corpus consists of the following four sub-corpora. Their names are formatted as [NAME][NUM_UTT]. [NUM_UTT] indicates the number of utterances per speaker.

  • parallel100: 100 parallel normal (reading-style) utterances

  • nonpara30: 30 non-parallel normal utterances

  • whisper10: 10 whisper utterances

  • falsetto10: 10 falsetto utterances
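The naming convention above can be illustrated with a short sketch. It assumes the lowercase names used in the directory listing and splits each name into its [NAME] and [NUM_UTT] parts; the helper name `split_name` is hypothetical and not part of the corpus.

```python
import re

# Sub-corpus names as given in the text, formatted as [NAME][NUM_UTT].
SUB_CORPORA = ["parallel100", "nonpara30", "whisper10", "falsetto10"]


def split_name(sub_corpus: str) -> tuple[str, int]:
    """Split a name such as 'parallel100' into ('parallel', 100)."""
    match = re.fullmatch(r"([a-z]+)(\d+)", sub_corpus)
    if match is None:
        raise ValueError(f"unexpected sub-corpus name: {sub_corpus}")
    return match.group(1), int(match.group(2))


# Utterances per speaker, summed over the four sub-corpora.
utterances_per_speaker = sum(split_name(name)[1] for name in SUB_CORPORA)
print(utterances_per_speaker)        # 150
print(utterances_per_speaker * 100)  # 15000 across 100 speakers
```

Under this convention, each speaker contributes 150 utterances, i.e., 15,000 utterances over the 100 speakers.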

The directory structure of the corpus is listed below. The speaker name is formatted as jvs[SPKR_ID], where [SPKR_ID] is the three-digit speaker ID ranging from 001 through 100.

jvs001
    parallel100
        wav24kHz16bit
        lab
        transcripts_utf8.txt
    nonpara30
        wav24kHz16bit
        lab
        transcripts_utf8.txt
    whisper10
        wav24kHz16bit
        transcripts_utf8.txt
    falsetto10
        wav24kHz16bit
        transcripts_utf8.txt
jvs002
…
jvs100
speaker_similarity_male.csv
speaker_similarity_female.csv
duration.txt
gender_f0range.txt
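As a rough sketch of how this layout might be addressed in code, the snippet below builds the entries listed for one speaker and one sub-corpus. The root directory name `jvs_corpus` and the helper names are assumptions made for illustration; the file names inside `wav24kHz16bit` and `lab` are not specified in the text and are therefore not modeled.

```python
from pathlib import PurePosixPath

# Only the two normal-voice sub-corpora are listed with a "lab" entry.
SUB_CORPORA_WITH_LAB = ("parallel100", "nonpara30")
SUB_CORPORA = SUB_CORPORA_WITH_LAB + ("whisper10", "falsetto10")


def speaker_name(speaker_id: int) -> str:
    """Format a speaker ID (1 through 100) as jvs[SPKR_ID], e.g. 'jvs001'."""
    if not 1 <= speaker_id <= 100:
        raise ValueError("speaker ID must be in the range 1 through 100")
    return f"jvs{speaker_id:03d}"


def sub_corpus_entries(root: str, speaker_id: int, sub_corpus: str) -> list[PurePosixPath]:
    """Return the entries listed for one speaker's sub-corpus."""
    if sub_corpus not in SUB_CORPORA:
        raise ValueError(f"unknown sub-corpus: {sub_corpus}")
    base = PurePosixPath(root) / speaker_name(speaker_id) / sub_corpus
    entries = [base / "wav24kHz16bit"]
    if sub_corpus in SUB_CORPORA_WITH_LAB:
        entries.append(base / "lab")
    entries.append(base / "transcripts_utf8.txt")
    return entries


# "jvs_corpus" is a hypothetical root directory name.
for entry in sub_corpus_entries("jvs_corpus", 1, "parallel100"):
    print(entry)
```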

2.1 Sub-corpora

This section describes how we designed the four sub-corpora.

2.1.1 parallel100

Parallel voices, i.e., utterances that are common among speakers, are used for voice conversion [toda07_MLVC, stylianou88], speaker factorization [lu13factor], multi-speaker modeling [ueno19multispeakerend2endtts], and so on. We used the 100 phonetically balanced sentences of the sub-corpus “voiceactress100” of the JSUT corpus [sonobe17jsut] and had the speakers utter them. (The original sentences are included in the Voice Actress Corpus [voiceactresscorpus]; the version included in the JSUT corpus has commas added at the phrase break positions.) This sub-corpus contains not only the audio files but also the transcriptions (stored in “parallel100/transcripts_utf8.txt”) and phoneme alignments (stored in “parallel100/lab”).

2.1.2 nonpara30

Using non-parallel voices, i.e., utterances that are completely different among speakers, is more challenging but more realistic than using parallel voices. The sentences to be uttered were randomly selected from the JSUT corpus, excluding its sub-corpus “voiceactress100.” Each speaker uttered 30 sentences that differ among speakers. This sub-corpus also includes transcriptions and phoneme alignments. Note that, unlike those of the sub-corpus “parallel100,” the sentences are not phonetically balanced.

2.1.3 whisper10

Whispering is used to communicate quietly, e.g., to convey secret information without being overheard. Analysis [ito05whisperanalysis], synthesis [petrushin10whispertts], recognition [jou05whisperrecognition], and conversion [toda12bodyconductedvc] of whispered voices have the potential to augment our silent-speech communication. The first five sentences of this sub-corpus are the same as those of the sub-corpus “parallel100,” and they are parallel among speakers. The remaining five sentences are the same as those of the sub-corpus “nonpara30,” and they are non-parallel among speakers. Consequently, all ten utterances per speaker are parallel between whispered and normal voices.

2.1.4 falsetto10

Falsetto is a vocal register occupying a range higher than that of normal voices. The physiology of falsetto is different from that of normal voices [childers91vocalqualityfactor], and the analysis and synthesis of falsetto remain open tasks for signal processing-based vocoders. The first five sentences of this sub-corpus are the same as those of the sub-corpus “parallel100.” The remaining five sentences are the same as those of the sub-corpus “nonpara30” but different from those of the sub-corpus “whisper10.” Consequently, five utterances are parallel among speakers, ten are parallel between normal voice and falsetto, and five are parallel between whisper and falsetto.
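The overlap counts stated for one speaker can be checked with a small set-based sketch. The sentence identifiers below are hypothetical placeholders: the text does not say which five parallel or non-parallel sentences are reused, so the sketch simply assumes that whisper10 and falsetto10 share the same five parallel sentences and take disjoint sentences from nonpara30.

```python
# Hypothetical sentence IDs for a single speaker.
parallel100 = {f"P{i:03d}" for i in range(1, 101)}
nonpara30 = {f"N{i:02d}" for i in range(1, 31)}
normal = parallel100 | nonpara30

# Assumed selection: the same five parallel sentences for both styles,
# and two disjoint groups of five non-parallel sentences.
shared_parallel = {f"P{i:03d}" for i in range(1, 6)}
whisper10 = shared_parallel | {f"N{i:02d}" for i in range(1, 6)}
falsetto10 = shared_parallel | {f"N{i:02d}" for i in range(6, 11)}

print(len(whisper10 & normal))       # 10: whisper vs. normal
print(len(falsetto10 & normal))      # 10: falsetto vs. normal
print(len(falsetto10 & parallel100)) # 5: parallel among speakers
print(len(whisper10 & falsetto10))   # 5: whisper vs. falsetto
```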

2.2 Tags

This section describes some of the annotation results.

  • F0 range (gender_f0range.txt): Typical pitch extractors, e.g., [kawahara99, morise16world, reaper], search for F0 within a preset range, and this setting critically affects the extracted results. This corpus contains a manually annotated F0 range for each speaker's normal voice.

  • Speaker similarity (speaker_similarity_*.csv): Perceptual similarity between speakers is useful for selecting speakers (or models) [lanchantin14mavm] and modeling speaker space [saito19perceptual]. This corpus contains perceptual similarity scores between all pairs of speakers of each gender.

  • Duration (duration.txt): Duration, i.e., data size, and speech rate are also included. Phoneme-level duration is calculated from the results of phoneme alignments.
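To make the duration tag concrete, the following sketch derives phoneme-level durations and a speech rate from an alignment. The in-memory representation (start time, end time, phoneme label, with times in seconds), the silence labels, and the definition of speech rate as non-silent phonemes per second are all assumptions for illustration; the text does not specify the label file format or how speech rate is computed.

```python
from collections import defaultdict

# Assumed labels for silence and pauses; the actual label set may differ.
SILENCE_LABELS = {"sil", "pau"}


def phoneme_durations(alignment):
    """Collect the durations (in seconds) observed for each phoneme label."""
    durations = defaultdict(list)
    for start, end, phoneme in alignment:
        durations[phoneme].append(end - start)
    return dict(durations)


def speech_rate(alignment):
    """Non-silent phonemes per second of non-silent speech."""
    spoken = [(start, end) for start, end, p in alignment if p not in SILENCE_LABELS]
    total = sum(end - start for start, end in spoken)
    return len(spoken) / total if total > 0 else 0.0


# Toy alignment with made-up timings.
example = [
    (0.0, 0.25, "sil"),
    (0.25, 0.5, "k"),
    (0.5, 0.75, "o"),
    (0.75, 1.0, "N"),
    (1.0, 1.25, "sil"),
]
print(phoneme_durations(example)["o"])  # [0.25]
print(speech_rate(example))             # 4.0
```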

3 Results of data collection

Sub-corpus                     Minimum [min.]   Average [min.]   Maximum [min.]   Total (100 speakers) [hour]
parallel100 (100 utterances)   10.11 (jvs020)   13.11            18.24 (jvs084)   22
nonpara30 (30 utterances)      2.12 (jvs099)    2.62             3.86 (jvs036)    4.4
whisper10 (10 utterances)      0.95 (jvs045)    1.24             1.69 (jvs018)    2.0
falsetto10 (10 utterances)     0.90 (jvs045)    1.18             1.61 (jvs035)    2.0
Total                          -                -                -                30.4
Table 1: Speaker-wise duration statistics. Silence parts were included to calculate these values.

3.1 Corpus specs

We hired 100 native Japanese professional speakers: 49 male and 51 female. Their voices were recorded in a recording studio, and the recording for each speaker was completed within one day. The recordings were controlled by a professional sound director. The voices were originally sampled at 48 kHz and downsampled to 24 kHz by SPTK [sptk]. The 16-bit/sample RIFF WAV format was used. Sentences (transcriptions) were encoded in UTF-8, and commas were added between breath groups. The full-context and monophone labels were automatically generated by Open JTalk [ojtalk]. The phoneme alignments were automatically generated by Julius [lee01julius]. F0 ranges were manually annotated in accordance with hands-on voice conversion [toda19vchandson], and the WORLD vocoder [morise16world, morise16d4c] was used for F0 extraction.

For annotating perceptual similarity scores, we followed Saito et al.’s study [saito19perceptual] and used Lancers [lancers], a well-known crowdsourcing service in Japan. Each listener scored the perceptual similarity for each pair of speakers on a scale ranging from “completely different” to “very similar.” Ten different listeners scored each speaker pair, and the final score for each pair was obtained by averaging the listeners’ scores. In total, 1,000 listeners participated.
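The averaging step in the similarity annotation can be sketched as follows. The rating tuples, the speaker names, and the score values are made up for illustration and do not reflect the actual scale or data; the only step taken from the text is that the final score for a pair is the mean of its listeners' scores.

```python
from collections import defaultdict
from statistics import mean


def average_scores(ratings):
    """Average listener scores per unordered speaker pair.

    `ratings` is an iterable of (speaker_a, speaker_b, score) tuples,
    one tuple per listener judgment.
    """
    by_pair = defaultdict(list)
    for speaker_a, speaker_b, score in ratings:
        # Sort the names so that (a, b) and (b, a) map to the same pair.
        by_pair[tuple(sorted((speaker_a, speaker_b)))].append(score)
    return {pair: mean(scores) for pair, scores in by_pair.items()}


# Made-up judgments from several hypothetical listeners.
ratings = [
    ("jvs001", "jvs002", 1.0),
    ("jvs002", "jvs001", 3.0),
    ("jvs001", "jvs003", 2.0),
    ("jvs001", "jvs003", 2.0),
]
final = average_scores(ratings)
print(final[("jvs001", "jvs002")])  # 2.0
print(final[("jvs001", "jvs003")])  # 2.0
```

Keying on the sorted pair keeps the resulting score table symmetric regardless of the order in which a pair was presented.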

3.2 Analysis

3.2.1 Duration

Table 1 lists the statistics for speaker-wise duration. This corpus contains 26 hours of normal voices and 4 hours of other-style voices. On average, each speaker uttered approximately 15.7 minutes of normal voices, 1.24 minutes of whispered voices, and 1.18 minutes of falsetto. In the sub-corpus “parallel100,” the transcriptions were common among speakers, but the durations differed considerably; speaker “jvs084” took about 1.8 times as long as speaker “jvs020.”
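The figures quoted in this paragraph follow directly from Table 1, as the short check below shows. The variable names are ours; the numbers are copied from the table.

```python
# Values copied from Table 1.
average_min = {"parallel100": 13.11, "nonpara30": 2.62, "whisper10": 1.24, "falsetto10": 1.18}
total_hours = {"parallel100": 22.0, "nonpara30": 4.4, "whisper10": 2.0, "falsetto10": 2.0}

# Average normal-voice duration per speaker (parallel100 + nonpara30).
normal_min = average_min["parallel100"] + average_min["nonpara30"]
print(round(normal_min, 2))  # 15.73

# Hours of normal voices, other-style voices, and the whole corpus.
normal_hours = total_hours["parallel100"] + total_hours["nonpara30"]
other_hours = total_hours["whisper10"] + total_hours["falsetto10"]
print(round(normal_hours, 1))                # 26.4
print(round(other_hours, 1))                 # 4.0
print(round(normal_hours + other_hours, 1))  # 30.4

# Ratio of the longest to the shortest parallel100 recording (jvs084 vs. jvs020).
ratio = 18.24 / 10.11
print(round(ratio, 1))  # 1.8
```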

3.2.2 Perceptual speaker similarity

Fig. 1 shows matrices of perceptual similarity scores. For example, the most similar pair was “jvs019” and “jvs096,” and the speaker most dissimilar from the other speakers was “jvs010.”
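A minimal sketch of how such findings might be read off a similarity matrix is given below. The speaker names and scores are made up, and the criterion for "most dissimilar from the other speakers" (lowest mean similarity to all others) is an assumption; the text does not state the criterion that was used.

```python
from itertools import combinations
from statistics import mean

# Hypothetical speakers and made-up pairwise similarity scores.
speakers = ["spk_a", "spk_b", "spk_c", "spk_d"]
scores = {
    ("spk_a", "spk_b"): 2.5,
    ("spk_a", "spk_c"): 1.0,
    ("spk_a", "spk_d"): 0.5,
    ("spk_b", "spk_c"): 1.5,
    ("spk_b", "spk_d"): 0.5,
    ("spk_c", "spk_d"): 1.0,
}


def score(a, b):
    """Look up the symmetric similarity score of an unordered pair."""
    return scores[tuple(sorted((a, b)))]


def mean_similarity(speaker):
    """Mean similarity of one speaker to all other speakers."""
    return mean(score(speaker, other) for other in speakers if other != speaker)


most_similar_pair = max(combinations(speakers, 2), key=lambda pair: score(*pair))
most_dissimilar_speaker = min(speakers, key=mean_similarity)

print(most_similar_pair)        # ('spk_a', 'spk_b')
print(most_dissimilar_speaker)  # spk_d
```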

4 Conclusion

In this paper, we constructed a corpus named the JVS corpus. The corpus was designed for speech-related research using multi-speaker and multi-style voices. Text data of the corpus is licensed as shown in the LICENCE file in the JSUT corpus [jsut_corpus]. The tags are licensed with CC BY-SA 4.0. The audio data may be used for

  • Research by academic institutions

  • Non-commercial research, including research conducted within commercial organizations

  • Personal use, including blog posts.

Our project page at https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus describes the terms for commercial use.

5 Acknowledgements

Part of this work was supported by the GAP foundation program of the University of Tokyo and the MIC/SCOPE #182103104.

Figure 1: (a) Speaker similarity matrix of 51 Japanese female speakers and (b) its sub-matrix, obtained by large-scale subjective scoring.