Thanks to developments in deep learning techniques, speech has become an active research target [hinton12dnnasr, oord16wavenet, takamichi17moment, saito18advss]. Nowadays, speech synthesis, e.g., text-to-speech, singing voice synthesis, voice conversion, and speech coding, is becoming a machine learning task. Easily accessible voice corpora not only accelerate speech-related research but also improve the reproducibility of a study. In 2017, we released a large-scale Japanese speech corpus, named the JSUT corpus [sonobe17jsut], for end-to-end text-to-speech synthesis. The corpus included 10 hours of reading-style speech data uttered by a single native Japanese speaker and covered all pronunciations of daily-use Japanese characters and their individual readings [joyokanji]. Since Oct. 2017, the project page [jsut_corpus] has been accessed more than 6,000 times (75% from Japan and 25% from abroad) from more than 60 countries. We believe that the JSUT corpus has become one of the most widely used Japanese corpora for modern speech synthesis research [ueno19multispeakerend2endtts, luo19waveletf0feature].
Towards more general speech-related research purposes, this paper introduces a new Japanese voice corpus, named the JVS (Japanese versatile speech) corpus. The corpus is designed to benefit many types of users, as follows.
High-quality format: The audio files are sampled at 24 kHz, encoded at 16 bit, and formatted in RIFF WAV.
High-quality recording: The recordings were controlled by a professional sound director and done in a recording studio.
Many speakers: The corpus includes 100 native Japanese speakers, and all of the speakers are professionals, e.g., voice actors/actresses.
Many styles: Each speaker utters not only normal speech but also whispered and falsetto voices.
Large in scale: In total, the corpus contains 30 hours of voice data.
Parallel/non-parallel utterances: Each speaker utters parallel, i.e., common among speakers, and non-parallel, i.e., completely different among speakers, utterances.
Many tags: The corpus includes not only voice data but also transcriptions, gender information, F0 ranges, speaker similarity scores, and phoneme alignments.
Free for research: The corpus is free to use for research in academic institutions and commercial companies.
Easily accessible: The corpus is freely downloadable online.
The next section describes how we designed the corpus.
2 Corpus design
The corpus consists of the following four sub-corpora. Their names are formatted as [NAME][NUM_UTT], where [NUM_UTT] indicates the number of utterances per speaker.
parallel100: 100 parallel normal (reading-style) utterances
nonpara30: 30 non-parallel normal utterances
whisper10: 10 whisper utterances
falsetto10: 10 falsetto utterances
The directory structure of the corpus is listed below. The speaker name is formatted as jvs[SPKR_ID], where [SPKR_ID] indicates the speaker ID in the range of 1 through 100.

jvs001/
    parallel100/
        wav24kHz16bit/
        lab/
        transcripts_utf8.txt
    nonpara30/
        wav24kHz16bit/
        lab/
        transcripts_utf8.txt
    whisper10/
        wav24kHz16bit/
        transcripts_utf8.txt
    falsetto10/
        wav24kHz16bit/
        transcripts_utf8.txt
jvs002/
…
jvs100/
speaker_similarity_male.csv
speaker_similarity_female.csv
duration.txt
gender_f0range.txt
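Given this layout, a corpus traversal can be sketched as follows. This is a minimal example, not part of the corpus distribution; it assumes that each line of transcripts_utf8.txt holds an utterance ID and a sentence separated by a colon, so adjust the separator to the actual file format.

```python
from pathlib import Path

def list_utterances(corpus_root, subcorpus="parallel100"):
    """Collect (speaker, wav_path, transcript) triples for one sub-corpus.

    Assumes the directory layout described above, with transcripts stored
    one per line as "<utterance_id>:<sentence>" (the colon separator is an
    assumption; adjust it to the actual file format).
    """
    triples = []
    for spkr_dir in sorted(Path(corpus_root).glob("jvs[0-9][0-9][0-9]")):
        tr_file = spkr_dir / subcorpus / "transcripts_utf8.txt"
        if not tr_file.exists():
            continue
        transcripts = {}
        for line in tr_file.read_text(encoding="utf-8").splitlines():
            utt_id, _, sentence = line.partition(":")
            transcripts[utt_id] = sentence
        wav_dir = spkr_dir / subcorpus / "wav24kHz16bit"
        for wav in sorted(wav_dir.glob("*.wav")):
            triples.append((spkr_dir.name, wav, transcripts.get(wav.stem, "")))
    return triples
```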
This section describes how we designed the four sub-corpora.
Parallel voices, i.e., utterances that are common among speakers, are used for voice conversion [toda07_MLVC, stylianou88], speaker factorization [lu13factor], multi-speaker modeling [ueno19multispeakerend2endtts], and so on. We used the 100 phonetically balanced sentences of the sub-corpus “voiceactress100” of the JSUT corpus [sonobe17jsut] (the original sentences are included in the Voice Actress Corpus [voiceactresscorpus]; the version in the JSUT corpus has commas added at the phrase-break positions), and we had the speakers utter these sentences. This sub-corpus contains not only the audio files but also the transcriptions (stored in “parallel100/transcripts_utf8.txt”) and phoneme alignments (stored in “parallel100/lab”).
The use of non-parallel voices, i.e., utterances that are completely different among speakers, is a more challenging but more realistic situation than that of parallel voices. The sentences to be uttered were randomly selected from the JSUT corpus, excluding its sub-corpus “voiceactress100.” Each speaker uttered 30 utterances that differ among speakers. This sub-corpus also includes transcriptions and phoneme alignments. Note that the sentences are not phonetically balanced, unlike those of the sub-corpus “parallel100.”
Whispering is used to communicate quietly, i.e., to convey secret information without being overheard. Analysis [ito05whisperanalysis], synthesis [petrushin10whispertts], recognition [jou05whisperrecognition], and conversion [toda12bodyconductedvc] of whispered voices have the potential to augment our silent-speech communication. The first five sentences of this sub-corpus are the same as those of the sub-corpus “parallel100,” and they are parallel among speakers. The remaining five sentences are the same as those of the sub-corpus “nonpara30,” and they are non-parallel among speakers. Namely, all ten utterances per speaker are parallel between the whispered and normal voices.
Falsetto is a vocal register occupying a range higher than that of the normal voice. The physiology of falsetto differs from that of normal voices [childers91vocalqualityfactor], and the analysis and synthesis of falsetto remain open tasks for signal-processing-based vocoders. The first five sentences of this sub-corpus are the same as those of the sub-corpus “parallel100.” The remaining five sentences are the same as those of the sub-corpus “nonpara30” but different from those of the sub-corpus “whisper10.” Namely, five utterances are parallel among speakers, ten are parallel between the normal and falsetto voices, and five are parallel between the whispered and falsetto voices.
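The parallel subsets across styles described above can be recovered programmatically. The sketch below assumes that parallel utterances share a file name (utterance ID) across sub-corpora; this naming convention is an assumption, so adjust the matching rule if the actual IDs differ.

```python
from pathlib import Path

def style_pairs(speaker_dir, style="whisper10", normal=("parallel100", "nonpara30")):
    """Pair each whispered/falsetto wav with the same speaker's normal-voice wav.

    Assumes parallel utterances share a file name (utterance ID) across
    sub-corpora; adjust the matching rule if the actual IDs differ.
    """
    speaker_dir = Path(speaker_dir)
    normal_wavs = {}
    for sub in normal:
        for wav in (speaker_dir / sub / "wav24kHz16bit").glob("*.wav"):
            normal_wavs[wav.stem] = wav
    pairs = []
    for wav in sorted((speaker_dir / style / "wav24kHz16bit").glob("*.wav")):
        if wav.stem in normal_wavs:
            pairs.append((normal_wavs[wav.stem], wav))
    return pairs
```

Such pairs are the usual starting point for, e.g., whisper-to-normal voice conversion experiments on this corpus.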
This section describes some of the annotation results.
F0 range (gender_f0range.txt): Typical pitch extractors, e.g., [kawahara99, morise16world, reaper], have a search range for F0, and this setting is critical to the F0 values ultimately obtained for the voices. This corpus contains manually annotated F0 ranges per speaker for his/her normal voices.
Speaker similarity (speaker_similarity_*.csv): Perceptual similarity between speakers is useful for selecting speakers (or models) [lanchantin14mavm] and modeling speaker space [saito19perceptual]. This corpus contains perceptual similarity scores between all pairs of speakers of each gender.
Duration (duration.txt): Duration, i.e., data size, and speech rate are also included. Phoneme-level durations are calculated from the phoneme alignment results.
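As a rough sketch of how two of these tag files might be consumed: the whitespace-separated column layout of gender_f0range.txt and the "start end phoneme" alignment format with times in seconds are assumptions about the files, so adjust them to the actual formats.

```python
def load_f0_ranges(text):
    """Parse a gender_f0range.txt-style listing into {speaker: (gender, lo, hi)}.

    The whitespace-separated "speaker gender min max" layout is an assumption
    about the file format; adjust the column order if needed.
    """
    ranges = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        spkr, gender, lo, hi = parts
        try:
            ranges[spkr] = (gender, float(lo), float(hi))
        except ValueError:
            continue  # skip a header row with non-numeric columns
    return ranges

def utterance_stats(lab_text, silence=frozenset({"sil", "pau", "sp"})):
    """Compute total duration and phonemes-per-second from an alignment file.

    Assumes one "start end phoneme" triple per line with times in seconds
    (rescale if the labels use HTK's 100 ns units); the silence labels are
    an assumption as well.
    """
    entries = []
    for line in lab_text.splitlines():
        start, end, phoneme = line.split()
        entries.append((float(start), float(end), phoneme))
    total = entries[-1][1] - entries[0][0]
    spoken = [e for e in entries if e[2] not in silence]
    rate = len(spoken) / total if total > 0 else 0.0
    return total, rate
```

The (lo, hi) pair returned for a speaker can then be handed to a pitch extractor's search-range options, matching the purpose of the annotation described above.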
3 Results of data collection
Table 1: Speaker-wise duration of each sub-corpus.

| Sub-corpus | Minimum [min.] | Average [min.] | Maximum [min.] | Total (100 speakers) [hour] |
|---|---|---|---|---|
| parallel100 (100 utterances) | 10.11 (jvs020) | 13.11 | 18.24 (jvs084) | 22 |
| nonpara30 (30 utterances) | 2.12 (jvs099) | 2.62 | 3.86 (jvs036) | 4.4 |
| whisper10 (10 utterances) | 0.95 (jvs045) | 1.24 | 1.69 (jvs018) | 2.0 |
| falsetto10 (10 utterances) | 0.90 (jvs045) | 1.18 | 1.61 (jvs035) | 2.0 |
3.1 Corpus specs
We hired 100 professional native Japanese speakers: 49 male and 51 female. Their voices were recorded in a recording studio, and the recording for each speaker was completed within one day. The recordings were supervised by a professional sound director. The voices were originally sampled at 48 kHz and downsampled to 24 kHz with SPTK [sptk]. The 16-bit/sample RIFF WAV format was used. Sentences (transcriptions) were encoded in UTF-8. The full-context and monophone labels were automatically generated by Open JTalk [ojtalk], and the phoneme alignments were automatically generated by Julius [lee01julius]. F0 ranges were manually annotated in accordance with the hands-on voice conversion tutorial [toda19vchandson]. The WORLD vocoder [morise16world, morise16d4c] was used to extract F0. Commas were added between breath groups. For annotating the perceptual similarity scores, we followed Saito et al.'s study [saito19perceptual] and used Lancers [lancers], a well-known crowdsourcing service in Japan. Each listener scored the perceptual similarity of each pair of speakers on a scale ranging from “completely different” to “very similar.” The final score for each speaker pair was obtained by averaging the listeners' scores. Ten different listeners scored each speaker pair, and 1,000 listeners participated in total.
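The averaging step described above can be sketched as follows; the (speaker_a, speaker_b, score) triple format is an assumed representation of the raw crowdsourced ratings, not the corpus's actual file format.

```python
from collections import defaultdict
from statistics import mean

def average_pair_scores(ratings):
    """Average multiple listeners' similarity ratings per speaker pair.

    `ratings` is an iterable of (speaker_a, speaker_b, score) triples; each
    unordered pair is rated by several listeners, and the final score is
    their mean, mirroring the annotation procedure described above.
    """
    buckets = defaultdict(list)
    for a, b, score in ratings:
        buckets[frozenset((a, b))].append(score)  # order-insensitive key
    return {tuple(sorted(pair)): mean(scores) for pair, scores in buckets.items()}
```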
Table 1 lists the statistics of the speaker-wise duration. The corpus contains 26 hours of normal voices and 4 hours of other-style voices. Each speaker uttered approximately 15.7 minutes of normal voices, 1.24 minutes of whispered voices, and 1.18 minutes of falsetto voices. In the sub-corpus “parallel100,” the transcription was common among speakers, but the duration varied widely; speaker “jvs084” spoke 1.8 times slower than speaker “jvs020.”
3.2.2 Perceptual speaker similarity
Fig. 1 shows the matrices of the perceptual similarity scores. For example, the most similar pair was “jvs019” and “jvs096,” and the speaker most dissimilar from all of the other speakers was “jvs010.”
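Given pairwise scores such as those in speaker_similarity_*.csv, such extremes can be located as in this sketch; the {(a, b): score} dictionary format is an assumption about how the CSV has been parsed.

```python
def most_similar_pair(sim):
    """Return the highest-scoring speaker pair from {(a, b): score}."""
    return max(sim, key=sim.get)

def most_dissimilar_speaker(sim):
    """Return the speaker whose average similarity to all others is lowest."""
    totals, counts = {}, {}
    for (a, b), score in sim.items():
        for spkr in (a, b):
            totals[spkr] = totals.get(spkr, 0.0) + score
            counts[spkr] = counts.get(spkr, 0) + 1
    return min(totals, key=lambda spkr: totals[spkr] / counts[spkr])
```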
In this paper, we constructed a corpus named the JVS corpus, designed for speech-related research using multi-speaker and multi-style voices. The text data of the corpus is licensed as shown in the LICENCE file of the JSUT corpus [jsut_corpus]. The tags are licensed under CC BY-SA 4.0. The audio data may be used for:
Research by academic institutions
Non-commercial research, including research conducted within commercial organizations
Personal use, including blog posts.
Our project page at https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus describes the terms for commercial use.
Part of this work was supported by the GAP foundation program of the University of Tokyo and the MIC/SCOPE #182103104.