JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

10/28/2017, by Ryosuke Sonobe et al.

Thanks to improvements in machine learning techniques, including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies plays an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we present a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcription and covers all of the main pronunciations of daily-use Japanese characters. We describe how we designed and analyzed the corpus. The corpus is freely available online.

1 Introduction

Thanks to developments in deep learning techniques, studies on speech have accelerated [1, 2, 3, 4]. In particular, in speech-to-text and text-to-speech research, end-to-end conversion from speech to text or from text to speech is an actively targeted task. Some studies on speech synthesis have reported methods that use no linguistic knowledge, i.e., no intermediate representations such as phonemes, for English, Spanish, and German [5, 6, 7]. However, natural language processing for Japanese is known to be more difficult, e.g., in semantic parsing and grapheme-to-phoneme conversion [8]. We expect that a freely available Japanese speech corpus would accelerate related research such as end-to-end speech synthesis. However, existing corpora, e.g., [9], do not serve this purpose.

In this paper, we describe the results of constructing a free, large-scale Japanese speech corpus named the “JSUT (Japanese speech corpus of Saruwatari Laboratory, the University of Tokyo) corpus.” The corpus is designed to cover all pronunciations of daily-use Japanese characters and individual readings, which are not captured by conventional intermediate representations such as phonemes and prosody. It also includes utterances from different domains, such as loanword, travel-domain, and precedent (court decision) utterances. We recorded 10 hours of speech data read by a native Japanese speaker and analyzed its linguistic and speech statistics. The corpus, including the Japanese text and speech data, is freely available online [10].

2 Corpus design

2.1 Structures

To accelerate end-to-end research, the main purpose of the JSUT corpus is to cover all of the main pronunciations of daily-use Japanese characters, rather than to cover intermediate representations such as phonemes. The corpus includes the following nine sub-corpora. Each sub-corpus is named in the format [NAME][NUMBER], where [NUMBER] indicates the number of utterances in the sub-corpus (see the parsing sketch after the list):

  • basic5000: utterances covering all of the main pronunciations of daily-use Japanese characters.

  • countersuffix26: utterances including individual readings of counter suffixes.

  • loanword128: utterances including loanwords, e.g., verbs or nouns.

  • utparaphrase512: utterances for which a word or phrase of a piece of text is replaced with its paraphrase.

  • voiceactress100: para-speech for a free corpus of Japanese voice actresses [11].

  • onomatopee300: utterances including famous Japanese onomatopee (onomatopoeia).

  • repeat500: repeatedly spoken utterances.

  • travel1000: travel-domain utterances.

  • precedent130: precedent-domain utterances.
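
To make the naming convention concrete, the following minimal Python sketch (ours, not part of the corpus distribution) splits each sub-corpus name into its [NAME] and [NUMBER] parts and tallies the total number of utterances:

    import re

    def parse_subcorpus_name(name):
        """Split a sub-corpus name such as 'basic5000' into ('basic', 5000),
        where the trailing number is the sub-corpus's utterance count."""
        match = re.fullmatch(r"([a-z]+)(\d+)", name)
        if match is None:
            raise ValueError("unexpected sub-corpus name: " + name)
        return match.group(1), int(match.group(2))

    SUBCORPORA = ["basic5000", "countersuffix26", "loanword128",
                  "utparaphrase512", "voiceactress100", "onomatopee300",
                  "repeat500", "travel1000", "precedent130"]

    total = 0
    for name in SUBCORPORA:
        label, count = parse_subcorpus_name(name)
        total += count
        print(f"{label}: {count} utterances")
    print(f"total: {total} utterances")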

2.2 Components

We describe how we designed the nine sub-corpora below.

2.2.1 basic5000

This is the main sub-corpus of the JSUT corpus. In Japanese, 2136 kanji characters (kanji are the logographic characters used in the modern Japanese writing system) are officially defined as daily-use characters [12], and each character has individual pronunciations consisting of its kunyomi (native Japanese readings) and onyomi (Chinese-derived readings). For example, “一” (“one” in English) is pronounced “ichi,” “itsu,” “hito,” and “hito(tsu).” We collected 5000 sentences from Wikipedia [13] and the TANAKA corpus [14] so that all pronunciations of the daily-use kanji characters would be covered. Since some of the pronunciations could not be found in these corpora, we manually created additional sentences to cover the remaining readings.
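
As a rough illustration of how such reading coverage can be checked, the sketch below extracts per-word katakana readings with MeCab [22] and tests a hypothetical set of target (kanji, reading) pairs. The feature-field index assumes an IPADIC-style dictionary, and the containment heuristic is our assumption, not the authors' actual selection procedure:

    import MeCab  # pip install mecab-python3 plus an IPADIC-style dictionary

    def word_readings(sentence, tagger):
        """Yield (surface, katakana reading) pairs for each word.
        IPADIC-style dictionaries put the reading in feature field 7."""
        node = tagger.parseToNode(sentence)
        while node:
            fields = node.feature.split(",")
            if node.surface and len(fields) > 7:
                yield node.surface, fields[7]
            node = node.next

    # Hypothetical coverage targets; the real design derives them from the
    # official daily-use kanji list [12]. Illustrative entries only.
    needed = {("一", "イチ"), ("一", "ヒト")}

    tagger = MeCab.Tagger()
    pairs = list(word_readings("一つだけ選んだ。", tagger))  # "I chose only one."
    # Heuristic: a target is covered if some word contains the kanji and
    # the word's reading contains the target reading.
    covered = {(k, r) for (k, r) in needed
               if any(k in s and r in rd for s, rd in pairs)}
    print("still uncovered:", needed - covered)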

2.2.2 countersuffix26

In Japanese, numerals cannot quantify nouns by themselves, and the pronunciation of a numeral changes depending on its counter suffix. For example, “二” (“two” in English) is pronounced “ni” with the suffix “個” (ko) and “futa” with the suffix “つ” (tsu). We crowdsourced 26 sentences including such counter suffixes.

2.2.3 loanword128

Japanese sentences spoken daily contain many loanwords used, e.g., as verbs and nouns; for example, “ググる (guguru)” is a verb meaning “to Google,” and “ディズニー (dyizunii)” means “Disney.” The pronunciation and accent of loanwords are an interesting problem in spoken language processing [15]. We crowdsourced such words and sentences. We also collected sentences from Wikipedia that included pronunciations not found in the modern Japanese system, for example, sentences containing Japanese-accented foreign proper names.

2.2.4 utparaphrase512

Paraphrasing, e.g., lexical simplification, is a technique that replaces a word or phrase in a sentence with another expression [16, 17]. It can support the comprehension of a wide range of listeners in speech communication. The SNOW E4 corpus [17, 18] includes sentences together with lists of their paraphrased words. We chose one paraphrased word per sentence and constructed 256 original sentences and their 256 paraphrased counterparts, i.e., 512 sentences in total.

2.2.5 voiceactress100

The Voice Actress Corpus [11] is a free speech corpus of professional Japanese voice actresses that includes not only neutral but also emotional voices. Collecting parallel speech (para-speech) for this corpus is very helpful for building attractive and emotional speech synthesis systems. We used sentences from this corpus and manually modified their pause positions.

2.2.6 onomatopee300

Onomatopee (onomatopoeia) plays an important role in connecting speech and non-speech sounds in nature, and Japanese is rich in onomatopoeic words. We crowdsourced 300 sentences, each containing a distinctive onomatopoeic word.

2.2.7 repeat500

Human speech production is not deterministic, i.e., speech waveforms always differ even if we try to reproduce the same linguistic and para-linguistic information. Takamichi et al. [3] proposed moment-matching-network-based speech synthesis, which synthesizes speech with natural randomness within the same contexts. To quantify this randomness, we recorded utterances spoken repeatedly by a single speaker: the speaker uttered each of 100 sentences from the Voice Actress Corpus five times.

2.2.8 travel1000 and precedent130

We further constructed sentences whose domains differed from those of the above corpora. 1000 travel-domain sentences were collected from the English-Japanese Translation Alignment Data [19]. Also, 138 copyright-free precedent (court decision) sentences were collected from [20]. The words and phrases of the precedent sentences differed significantly from those of the above corpora, but some sentences were too difficult to read; we therefore manually removed or modified such sentences to make reading easier, leaving the 130 utterances of the sub-corpus.

3 Results of data collection

3.1 Corpus specs

We hired a female native Japanese speaker and recorded her voice in our anechoic room. She was not a professional speaker but had experience in voice work. The recordings were made in February, March, September, and October of 2017, for a few hours each day; the speaker operated our recording system herself. The speech data was sampled at 48 kHz and stored in the 16 bit/sample RIFF WAV format. We used Lancers [21], a crowdsourcing service, to collect several kinds of Japanese sentences. The total duration was 10 hours, including small amounts of non-speech regions. Sentences (transcriptions) were encoded in UTF-8.

The distributed corpus includes the UTF-8-encoded sentences, the 48-kHz speech data, and recording information. Because the recording period was comparatively long and the objective scores varied among recording days as shown below, the recording information indicates on which day each utterance was recorded. The power of the speech data was normalized, but we basically made no additional modifications. Commas were inserted between breath groups in the transcriptions; their positions were manually annotated.
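
As a minimal usage sketch under the stated format (48-kHz, 16 bit/sample RIFF WAV; UTF-8 transcriptions), the code below pairs each utterance's text with its waveform. The assumed layout (a transcript_utf8.txt with “id:text” lines next to a wav/ directory, under a directory such as jsut_ver1/basic5000) is our reading of the distribution and should be checked against the downloaded corpus:

    import wave
    from pathlib import Path

    def load_subcorpus(root):
        """Yield (utterance id, text, raw PCM bytes) for one sub-corpus.
        Assumes root/transcript_utf8.txt with 'id:text' lines and
        root/wav/<id>.wav files; adjust to the actual layout."""
        root = Path(root)
        with open(root / "transcript_utf8.txt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                utt_id, text = line.rstrip("\n").split(":", 1)
                with wave.open(str(root / "wav" / (utt_id + ".wav")), "rb") as w:
                    assert w.getframerate() == 48000  # 48-kHz sampling
                    assert w.getsampwidth() == 2      # 16 bit/sample
                    pcm = w.readframes(w.getnframes())
                yield utt_id, text, pcm

    for utt_id, text, pcm in load_subcorpus("jsut_ver1/basic5000"):
        print(utt_id, len(pcm) // 2, "samples:", text)
        break  # show the first utterance only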

3.2 Analysis

We analyzed the linguistic and speech statistics of the constructed corpus. Note that, to shorten the computation time, not all of the data was used for the analysis. First, we counted the number of moras (sub-syllables) and words within one utterance by using MeCab [22] and NEologd [23, 24]. Utterance length is an important factor in speech synthesis using sequence-to-sequence mechanisms [25, 26]. Fig. 1 and Fig. 2 show histograms of the mora and word counts, respectively. As can be seen, the corpus includes a variety of lengths, from short utterances (a few words and moras) to long utterances (up to 70 words and 133 moras; see Figs. 1 and 2).
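
The sketch below shows one way to reproduce such counts with MeCab [22] (optionally pointing it at the NEologd dictionary [23, 24] via the -d option). Approximating moras by counting katakana characters in the reading while skipping the small glide/vowel kana is a common heuristic, not necessarily the authors' exact procedure:

    import MeCab  # pip install mecab-python3; pass '-d <path>' to use NEologd

    SMALL_KANA = set("ャュョァィゥェォ")  # these merge into the preceding mora

    def count_words_and_moras(sentence, tagger):
        """Return (word count, approximate mora count) for one utterance."""
        words, moras = 0, 0
        node = tagger.parseToNode(sentence)
        while node:
            if node.surface:
                words += 1
                fields = node.feature.split(",")
                # IPADIC-style dictionaries: katakana reading is field 7.
                reading = fields[7] if len(fields) > 7 else node.surface
                moras += sum(1 for ch in reading if ch not in SMALL_KANA)
            node = node.next
        return words, moras

    tagger = MeCab.Tagger()
    print(count_words_and_moras("今日は良い天気です。", tagger))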

Next, we analyzed the changes in speech statistics across recording days. Speech data recorded over a long period shows objective and subjective differences among recording days [27]. The mean of log F0 was calculated for each recording day; F0 was extracted by using the WORLD analysis-synthesis system [28]. Fig. 3 shows the result. There was no particular tendency in the first half of the recordings, but the log F0 increased in the second half of the recording days.
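
A hedged sketch of this per-day statistic, assuming the pyworld binding of WORLD [28] and the soundfile package for I/O; the grouping of WAV files by recording day is hypothetical and would in practice come from the recording information distributed with the corpus:

    import numpy as np
    import pyworld          # pip install pyworld (Python binding of WORLD [28])
    import soundfile as sf  # pip install soundfile

    def mean_log_f0(wav_path):
        """Mean of log F0 over voiced frames, extracted with WORLD's harvest."""
        x, fs = sf.read(wav_path)      # float64 mono waveform, 48 kHz
        f0, _ = pyworld.harvest(x, fs)
        voiced = f0[f0 > 0]            # unvoiced frames are returned as 0
        return float(np.mean(np.log(voiced)))

    # Hypothetical bucketing of utterances by recording day.
    by_day = {"day01": ["BASIC5000_0001.wav"],
              "day02": ["BASIC5000_0101.wav"]}
    for day, paths in sorted(by_day.items()):
        print(day, np.mean([mean_log_f0(p) for p in paths]))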

Figure 1: Histogram of number of moras (sub-syllables) in one utterance. Minimum, mean, and maximum values are 7, 37.14, and 133, respectively.
Figure 2: Histogram of number of words in one utterance. Minimum, mean, and maximum values are 2, 18.03, and 70, respectively.
Figure 3: Mean of log-scaled F0 for each recording day. The ordinal numbers on the x-axis count the recording days from the first; for example, “5th” denotes the fifth recording day, i.e., four recording days after the first.

4 Conclusion

In this paper, we constructed a free, large-scale Japanese speech corpus (the JSUT corpus) for end-to-end speech synthesis research. The corpus was designed to cover all pronunciations of the daily-use kanji characters of Japanese and to include sentences from several domains. The corpus may be used for research by academic institutions and for non-commercial research, including research conducted within commercial organizations.

Acknowledgements: Part of this work was supported by the SECOM Science and Technology Foundation. We thank Dr. Masahiro Mizukami of the Nara Institute of Science and Technology for the fruitful discussion on the paraphrase corpus, Assistant Prof. Kazuhide Yamamoto of the Nagaoka University of Technology and Tomoyuki Kajiwara of the Tokyo Metropolitan University for the use of the SNOW E4 corpus, and the person in charge of the Voice Actress Corpus for the use of their corpus.

References

  • [1] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [2] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [3] S. Takamichi, T. Koriyama, and H. Saruwatari, “Sampling-based speech parameter generation using moment-matching networks,” in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 2017.
  • [4] Y. Saito, S. Takamichi, and H. Saruwatari, “Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis,” in Proc. ICASSP, New Orleans, U.S.A., Mar. 2017.
  • [5] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
  • [6] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” in International Conference on Learning Representations (Workshop Track), Apr. 2017.
  • [7] O. Watts, “Unsupervised learning for text-to-speech synthesis,” Ph.D. thesis, University of Edinburgh, 2012.
  • [8] K. Kubo, S. Sakti, G. Neubig, T. Toda, and S. Nakamura, “Narrow adaptive regularization of weights for grapheme-to-phoneme conversion,” in Proc. ICASSP, Florence, Italy, May 2014.
  • [9] M. Abe, Y. Sagisaka, T. Umeda, and H. Kuwabara, “ATR technical report,” no. TR-I-0166M, 1990.
  • [10] “JSUT: Japanese speech corpus of Saruwatari Laboratory, the University of Tokyo,” https://sites.google.com/site/shinnosuketakamichi/publication/jsut.
  • [11] y_benjo and MagnesiumRibbon, “Voice-actress corpus,” http://voice-statistics.github.io/.
  • [12] Agency for Cultural Affairs, Government of Japan, “List of daily-use kanji,” http://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/kijun/naikaku/kanji/index.html, 2010.
  • [13] “Wikipedia,” https://ja.wikipedia.org/.
  • [14] Y. Tanaka, “Compilation of a multilingual parallel corpus,” in Proc. Pacling2001, 2001.
  • [15] H. Kubozono, “Where does loanword prosody come from?: A case study of Japanese loanword accent,” Lingua, vol. 116, no. 7, pp. 1140–1170, 2006.
  • [16] M. Moku, K. Yamamoto, and A. Makabi, “Automatic easy Japanese translation for information accessibility of foreigners,” in Proc. Workshop on Speech and Language Processing Tools in Education, 2012, pp. 85–90.
  • [17] T. Kajiwara and K. Yamamoto, “Evaluation dataset and system for Japanese lexical simplification,” in Proc. ACL-IJCNLP 2015 Student Research Workshop, Beijing, China, July 2015, pp. 35–40.
  • [18] “SNOW E4: Evaluation data set of Japanese lexical simplification,” http://www.jnlp.org/SNOW/E4, 2010.
  • [19] M. Utiyama and M. Takahashi, “English-Japanese translation alignment data,” http://www2.nict.go.jp/astrec-att/member/mutiyama/align/index.html, 2003.
  • [20] “Courts in Japan,” http://www.courts.go.jp/app/hanrei_jp/search1.
  • [21] “Lancers,” http://www.lancers.jp.
  • [22] T. Kudo, K. Yamamoto, and Y. Matsumoto, “Applying conditional random fields to Japanese morphological analysis,” in Proc. EMNLP, Barcelona, Spain, Jul. 2004, pp. 230–237.
  • [23] T. Sato, T. Hashimoto, and M. Okumura, “Implementation of a word segmentation dictionary called mecab-ipadic-NEologd and study on how to use it effectively for information retrieval (in Japanese),” in Proceedings of the Twenty-third Annual Meeting of the Association for Natural Language Processing, 2017, pp. NLP2017–B6–1.
  • [24] T. Sato, “Neologism dictionary based on the language resources on the Web for MeCab,” 2015.
  • [25] W. Wang, S. Xu, and B. Xu, “First step towards end-to-end parametric TTS synthesis: Generating spectral parameters with neural attention,” in Proc. INTERSPEECH, San Francisco, U.S.A., Sep. 2016, pp. 2243–2247.
  • [26] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, “Voice conversion using sequence-to-sequence learning of context posterior probabilities,” in Proc. INTERSPEECH, Stockholm, Sweden, Aug. 2017, pp. 1268–1272.
  • [27] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda, “XIMERA: A new TTS from ATR based on corpus-based technologies,” in Proc. SSW5, Pittsburgh, U.S.A., June 2004, pp. 179–184.
  • [28] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877–1884, 2016.