Text-to-speech (TTS) synthesis has achieved human-quality speech [oord16wavenet, wang17tacotron, saito18advss] in very limited tasks (e.g., reading-style speech synthesis from short-form sentences in some resource-rich languages). Both open-source code and open speech corpora drive open innovation in speech-based technologies. Since 2017, we have released high-quality, large-scale Japanese speech corpora: the JSUT and JSUT-song corpora [sonobe17jsut, jsutsong_corpus] for speaking-/singing-voice synthesis, and the JVS and JVS-MuSiC corpora [takamichi19jvs, tamaru20jvsmusic] for multi-speaker/singer modeling. Open projects [neutrino, watanabe18espnet, hayashi20espnettts, nnsvs] developed by third parties provide synthesis engines and machine learning recipes that use our corpora.
Building on the success of reading-style speech synthesis from short-form sentences, we aim to design two challenging tasks for delivering information to humans: 1) duration-constrained text-to-speech summarization and 2) speaking-style simplification. The former summarizes text in spoken language to a desired duration, and the latter synthesizes speech that is intelligible to non-native speakers. These tasks help provide information under constraints of time or language proficiency. They are challenging because the required speech characteristics are far from those of basic reading-style speech.
For these tasks, we developed a new Japanese speech corpus, JSSS (pronounced “j-triple-s”). The corpus comprises speech data and its transcriptions. We recorded speech in high-quality settings: studio recording, an uncompressed audio format, and a well-experienced native speaker. We also recorded speech of short- and long-form sentences as an optional task. The corpus contains eight hours of high-quality speech data and is available on our project page [jsss_corpus]. The following sections describe how we designed the corpus.
2 Corpus design
Our corpus consists of the following four sub-corpora.
summarization: 125 utterances for duration-constrained text-to-speech summarization
simplification: 184 short utterances spoken in slow, intelligible style
short-form: 3284 short utterances spoken with read style
long-form: 168 long utterances spoken with read style
The directory structure of the corpus is listed below. [SUB_DIR_NAME] indicates the sub-directory described in the following sections.

summarization/
    wav24kHz16bit/
    original_utf8/
    transcript_utf8/
simplification/
    wav24kHz16bit/
    transcript_utf8.txt
    hiragana_utf8.txt
short-form/
    [SUB_DIR_NAME]/
        wav24kHz16bit/
        transcript_utf8.txt
    …
long-form/
    [SUB_DIR_NAME]/
        wav24kHz16bit/
        original_utf8/
        transcript_utf8/
    …
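The top-level layout above can be checked programmatically after downloading the corpus; the following is a minimal sketch (the function and constant names are our own, illustrative choices, not corpus tooling).

```python
from pathlib import Path

# Top-level sub-corpora of JSSS, per the directory listing above.
SUB_CORPORA = ["summarization", "simplification", "short-form", "long-form"]

def missing_sub_corpora(root):
    """Return the names of expected top-level sub-corpora absent under `root`."""
    root = Path(root)
    return [sub for sub in SUB_CORPORA if not (root / sub).is_dir()]
```

For example, `missing_sub_corpora("jsss_ver1")` returns an empty list for a complete download.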
2.1 Duration-constrained text-to-speech summarization

Automatic text summarization generates a short, coherent summary of given text [dorr03hedgetrimmer, banko00statisticalsumamrization], shortening it while retaining its important content. Text-length-constrained summarization [makino19textsummarizationlength] is a variant with practical applications; it abstractively summarizes text to fit a device that displays the summary [saito20lengthcontrollablesummarization]. In contrast to such textual length constraints, we address a speech length constraint. Namely, we propose a new task named speech-length-constrained, or duration-constrained, text-to-speech summarization: abstractively summarizing text in spoken language to fit a desired speech duration.
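As an illustration of the duration constraint, one can estimate whether a candidate summary fits a target speech duration from its mora count. The speaking rate below is an illustrative assumption for this sketch, not a statistic of the corpus.

```python
# Hypothetical average Japanese speaking rate (morae per second) used only
# for this illustration; the corpus does not prescribe a rate.
MORAE_PER_SEC = 8.0

def fits_duration(n_morae, target_sec, tolerance=0.1):
    """Check whether a summary of `n_morae` morae would fit the target
    speech duration within +/- `tolerance` (a fraction of the target)."""
    estimated_sec = n_morae / MORAE_PER_SEC
    return abs(estimated_sec - target_sec) <= tolerance * target_sec
```

Under this assumption, a 30-second summary would need roughly 240 morae.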
We recorded speech for this task. The texts to be summarized were web news articles, which we saved in original_utf8/*.txt. Our speaker summarized each text and uttered the summary to fit a duration chosen in advance; the chosen durations were around 30 and 60 sec per text. We did not set time limits for recording, and the speaker could re-record as many times as needed. After the recording, we first manually transcribed the speech. We then manually added punctuation at the phrase breaks and added sentence-level time alignment, as shown below. We saved the transcription in transcript_utf8/*.txt.
00.000 16.006 株式会社ベネッセコーポレーションが、20歳から40歳の既婚女性に対して、今年を表す漢字は、というアンケートを行った結果、1位には、おかしい、チェンジという意味を持つ、変が選ばれました。
17.240 26.460 更に、来年の漢字では、1位が明るい、2位楽しい、3位幸せと、ポジティブな文字が続きました。
27.613 32.426 来年こそは明るく楽しく幸せな1年にとの願いが感じられます。

(English translation)
00.000 16.006 Benesse Corporation conducted a survey of married women between the ages of 20 and 40 to find out what the Chinese character for this year would be, and the number one choice was 変, which means “funny” or “change.”
17.240 26.460 Furthermore, positive characters are listed in the Chinese characters for “next year.” First place is 明るい (bright), second place is 楽しい (fun), and third place is 幸せ (happy).
27.613 32.426 I feel hopeful that next year will be bright, fun, and happy.
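The alignment lines above follow a simple "start end sentence" format, with times in seconds. A minimal parser sketch (the function name is our own, not part of the corpus tooling):

```python
def parse_transcript(lines):
    """Parse 'start end sentence' lines from a transcript_utf8/*.txt file
    into a list of (start_sec, end_sec, sentence) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        start, end, sentence = line.split(maxsplit=2)
        entries.append((float(start), float(end), sentence))
    return entries
```

Reading a transcript file with `parse_transcript(open(path, encoding="utf-8"))` yields one tuple per aligned sentence.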
2.2 Speaking-style simplification

Given the effect the global pandemic has had on Japan in 2020, a pressing question is how to convey emergency and lifeline information to the approximately three million foreign residents living in Japan [immigrationjapan]. The Immigration Services Agency of Japan and the Agency for Cultural Affairs reported that many foreign residents prefer simple Japanese to English for information services [simplejapaneseguideline]. “Simple Japanese” speech differs greatly from standard reading-style speech: the sentences use everyday phrases with a limited vocabulary and are uttered in a slow, intelligible style [shibata07nhkreport]. Text simplification with lexical constraints [nishihara19textsimplification] can potentially simplify the vocabulary of a text. Here, in contrast, we deal with speaking-style simplification, which aims to synthesize speech in a slow, intelligible style. We therefore instructed a speaker on the speaking style and recorded speech of simple, pre-designed sentences. An example is below.
おおきい じしんが おきました
cf.) There was a big earthquake.
In this sub-corpus, we saved the text in transcript_utf8.txt and a manual conversion into hiragana (the Japanese syllabary, which indicates pronunciation) in hiragana_utf8.txt.
2.3 Short-form utterances

Synthesizing an isolated short utterance is a basic TTS task. To enable building a basic TTS system, we recorded speech data of short-form sentence utterances. We prepared three subsets (corresponding to [SUB_DIR_NAME]) from the JSUT corpus [sonobe17jsut] as follows.
voiceactress100: phoneme-balanced minimal set
onomatopee300: mid-sized set that includes Japanese onomatopoeias
basic5000: large-sized set
After the recording, we manually added punctuation at the phrase breaks. Note that these positions differ from those in the original data stored in the JSUT corpus.
2.4 Long-form utterances

When uttering a long-form passage that consists of multiple sentences, human speakers usually insert phrase breaks at word transitions that have no punctuation. These breaks play an important role in producing listenable and expressive speech, and synthesizing such speech is more challenging than synthesizing basic short utterances. To construct a corpus for this task, we recorded speech of Wikipedia articles [wikipedia] (corresponding to [SUB_DIR_NAME]). Our speaker uttered the articles paragraph by paragraph, excluding tables, figures, and their captions. After the recording, we manually added punctuation at the phrase breaks and added sentence-level time alignment, as shown below.
0.347 13.144 香川県において、うどんは地元で特に好まれている料理であり、一人あたりの消費量も、日本全国の都道府県別統計においても、第1位である。
14.574 32.954 料理等に地域名を冠してブランド化する地域ブランドの1つとしても、観光客の増加、うどん生産量の増加、知名度注目度の上昇などの効果をもたらし、地域ブランド成功例の筆頭に挙げられる。
0.347 13.144 In Kagawa Prefecture, udon is a particularly popular local dish, and the amount consumed per person is also the highest in Japan in terms of prefectural statistics.
14.574 32.954 This is one of the most successful examples of a regional brand using the name of a region as a brand for food and other items, resulting in an increase in the number of tourists, an increase in the amount of udon produced, and an increase in name recognition.
In this sub-corpus, we saved the original text in original_utf8/*.txt and the transcriptions in transcript_utf8/*.txt. Note that punctuation in the transcribed text was inserted at the phrase breaks, so its positions differ from those in the original text.
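Given such sentence-level alignments, individual sentences can be excerpted from the 16-bit WAV files. The following is a minimal sketch using only the Python standard library (the function name and usage are our own, not part of the corpus tooling).

```python
import wave

def cut_segment(in_path, out_path, start_sec, end_sec):
    """Excerpt [start_sec, end_sec) from a 16-bit mono RIFF WAV file
    (24 kHz in JSSS) and write it to `out_path`."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        src.setpos(int(start_sec * rate))
        frames = src.readframes(int((end_sec - start_sec) * rate))
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)  # nframes is corrected automatically on close
        dst.writeframes(frames)
```

For example, `cut_segment("utt.wav", "sent1.wav", 0.347, 13.144)` extracts the first aligned sentence of the long-form example above.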
3 Results of data collection
Table 1: Number of utterances and durations for each sub-corpus. Columns: Sub-corpus, Style, #utterances, Duration [hour], Duration / utt. [sec].
We hired a female native Japanese speaker who is not a professional speaker but has had voice training. We recorded her voice in an anechoic room at the University of Tokyo using an iPad mini with a mounted SHURE MV88A-A microphone. The first author directed the recording. The voice was originally sampled at 48 kHz and downsampled to 24 kHz with SPTK [sptk]. We recorded in 24-bit/sample RIFF WAV format and encoded the data in 16-bit/sample format. Sentences (transcriptions) were encoded in UTF-8. For duration-constrained text-to-speech summarization, we used the Livedoor News Corpus [livedoornewscorpus] as the original text to be summarized. For speaking-style simplification, we followed text and speaking-style instructions provided by Hirosaki University (downloaded from http://human.cc.hirosaki-u.ac.jp/kokugo.html; they are no longer available because the laboratory closed in 2020). For long-form utterances, we used three featured Japanese Wikipedia articles: Sanuki udon (wheat-flour noodles of Japanese cuisine), Masakazu Katsura (a Japanese manga artist), and Washington, D.C. (the capital of the United States).
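The 24-bit to 16-bit encoding step can be illustrated by dropping the least-significant byte of each little-endian sample. This is a minimal sketch of one possible conversion (simple truncation, no rounding or dithering), not necessarily the exact procedure used for the corpus.

```python
def pcm24_to_pcm16(data):
    """Reduce little-endian 24-bit PCM bytes to 16-bit PCM by truncating
    the least-significant byte of each 3-byte sample."""
    assert len(data) % 3 == 0, "24-bit PCM requires 3 bytes per sample"
    out = bytearray()
    for i in range(0, len(data), 3):
        out += data[i + 1:i + 3]  # keep the two high-order bytes
    return bytes(out)
```

Truncation maps each 3-byte sample to 2 bytes, so the output is exactly two-thirds the input length.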
Table 1 lists the number of utterances and the durations for each sub-corpus. “Simplification” and “short-form” consist of short utterances of approximately 5 seconds each. “Summarization” and “long-form” consist of utterances of approximately 50 seconds each, roughly 10 times longer than the short utterances. The total duration is approximately 8 hours, slightly shorter than that of our previous corpus [sonobe17jsut] designed for end-to-end TTS.
4 Conclusion

In this paper, we constructed the JSSS speech corpus, designed for text-to-speech summarization, speaking-style simplification, and short-/long-form TTS synthesis.
The text files are licensed as below.
summarization/ … CC BY-ND 2.1 [livedoornewscorpus]
simplification/ … No commercial use
short-form/ … CC BY-SA 4.0, etc. [jsut_corpus]
long-form/ … CC BY-SA 4.0
The speech files may be used for
Research by academic institutions
Non-commercial research, including research conducted within commercial organizations
Personal use, including blog posts.
Part of this work was supported by the GAP foundation program of the University of Tokyo. We thank RONDHUIT for allowing us to distribute the text-to-speech summarization sub-corpus. We thank Mr. Takaaki Saeki, Mr. Yota Ueda, and Mr. Taiki Nakamura of the University of Tokyo for their help.