CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment

08/07/2020 ∙ by Si-Ioi Ng, et al. ∙ The Chinese University of Hong Kong

This paper describes the design and development of CUCHILD, a large-scale Cantonese corpus of child speech. The corpus contains spoken words collected from 1,986 child speakers aged from 3 to 6 years old. The speech materials include 130 words of 1 to 4 syllables in length. The speakers cover both typically developing (TD) children and children with speech disorder. The intended use of the corpus is to support scientific and clinical research, as well as technology development related to child speech assessment. The design of the corpus, including selection of words, participant recruitment, data acquisition process, and data pre-processing, is described in detail. The results of acoustical analysis are presented to illustrate the properties of child speech. Potential applications of the corpus in automatic speech recognition, phonological error detection and speaker diarization are also discussed.




1 Introduction

Speech is one of the most common media of human communication. Natural speech sounds can be captured and recorded as acoustic signals for subsequent analysis. Recorded speech data stored in a structured database are known as a speech corpus. The speech data contain information about the acoustic properties of speech, the linguistic usage of the language concerned, and the characteristics of speakers and recording conditions. With a sufficient amount of speech data, statistical analysis can be performed to investigate and understand the properties of speech from different perspectives. Statistical modeling of speech data also allows the development of a wide range of speech technologies and applications. In short, speech data play an important role in multi-disciplinary research on speech communication.

The development of a specific speech technology needs to consider the target speakers and choose a suitable speech corpus according to the nature and scope of the intended applications. Nowadays, mainstream speech technologies are built mainly with adult speech data. They often show significantly degraded performance on child speakers, who account for a large part of the population. The performance degradation is due to the many differences between adult and child speech. While abundant resources of well-annotated adult speech data are available and continually accumulating in the public domain, corpora of child speech are far less common. Part of this issue comes from the privacy-related concerns of parents and the difficulties in data collection due to the limited attention span of child subjects. Despite the challenges, a number of child speech corpora have been developed over the years. Examples are the OGI Kids Speech corpus [shobaki2000ogi], the University of Colorado’s Kids’ Speech Corpus [cole2006university] and the CID children’s speech corpus [lee1999acoustics]. These corpora target healthy child/adolescent speakers whose ages range from to . They are useful resources to support both speech technology development [yeung2018difficulties] and acoustical analysis [lee1999acoustics].

In population studies of speech acquisition, younger children tend to make more mistakes in producing target words [to2013population]. The mistakes are caused by their underdeveloped vocal tract and motor skills for producing speech sounds, as well as their still-developing phonological abilities. This implies that, when a large-scale collection of child speech data is carried out, there is a high chance that speech errors will be included. Incorporating erroneous speech in the design of a corpus, with detailed annotation of the relevant errors, opens a new way for error analysis and for the development of new systems targeting problems of child speech acquisition. There have been a few related research works on corpus development in recent years, e.g., [kothalkar2018fusing][ramteke2019nitk]. A large part of these corpora is free from speech errors and could be used to support general speech technology development.

In this paper, we present the CUCHILD child speech corpus, which is the outcome of close collaboration between the Department of Electronic Engineering and the Department of Otorhinolaryngology, Head and Neck Surgery of the Chinese University of Hong Kong. The primary goal of this effort is to provide data resources to support acoustic analysis and identification of children with speech sound disorder, targeting native Cantonese-speaking children aged 3 to 6 years. The speech data would also be useful for research on automatic speech recognition, speaker diarization and other speech technologies.

The rest of the paper is organised as follows. Section 2 introduces the background of Hong Kong Cantonese and describes the details of the CUCHILD corpus design. Section 3 describes the results of spectral and duration analysis of Cantonese vowels produced by children. Section 4 discusses the potential applications of CUCHILD, followed by a short conclusion in Section 5.

2 Design of Corpus

2.1 Hong Kong Cantonese

Cantonese, which is a traditional prestige variety of the Yue Chinese Dialect group, is a major Chinese dialect widely spoken by about 68 million native speakers in Hong Kong, Macau, Guangdong and Guangxi Provinces of Mainland China, as well as overseas Chinese communities. It is a monosyllabic and tonal language. Each Chinese character is pronounced as a single syllable carrying a lexical tone. A Cantonese syllable can be divided into an onset and a rime. The onset is a consonant while the rime can contain a nucleus or a nucleus followed by a coda. The nucleus can be a vowel or a diphthong and the coda is a final consonant. There are initial consonants, vowels, diphthongs, final consonants and distinct lexical tones (plus allotones). The tones are characterised by different pitches and duration patterns. Present-day Cantonese uses over legitimate base syllables. If tone difference is taken into account, the number of distinct syllables exceeds [bauer2011modern][lee2002spoken].

Age (years;months) 3;0-3;11 4;0-4;11 5;0-5;11 6;0-6;11
Table 1: Number of participants in different age groups.
Districts Hong Kong Island New Territories Kowloon
Table 2: Distribution of participants/ kindergartens in different districts.

2.2 Participants

The speech samples in the CUCHILD corpus were collected from 1,986 Hong Kong pre-school children (1,006 female, 980 male, age 3;0 to 6;11) during the period from February 2017 to January 2018. All speakers use Hong Kong Cantonese as their first language (L1). The children were students in grades K1 to K3, recruited via mainstream local kindergartens that use Cantonese as the medium of teaching. Children from special child care centres were not included. Parental consent was obtained for each participating child. Information on age and gender was collected and is summarized in Table 1. Seventeen kindergartens from different districts of Hong Kong participated in the study; their distribution is presented in Table 2.

2.3 Recording session setup

Each participant was seen individually in a separate area inside the kindergarten. The child sat face-to-face with a research assistant, and a mini-game setting was used to engage the child's attention. A digital recorder (TASCAM DR-44WL) was located at - centimeters in front of the child's mouth. As environmental noise such as reverberation, school bells and people walking around was unavoidable, the gain of the recorder was adjusted on a best-effort basis to keep the background noise level below - dB (relative to the maximum input level). The sampling rate was set to kHz with two-channel stereo recording.

For each child subject, the recording contains an interactive conversation between the child and the research assistant. The research assistants were student clinicians from speech therapy programmes in local universities. A technician, who was a student with an engineering background, was responsible for monitoring the operation of the recording devices. As children lose concentration easily, sufficient break time was allowed during the session. With previous experience of working with children, the research assistants were able to engage the participants with the mini-game and elicit the targeted verbal outputs during the sessions. The majority of the participants were co-operative in the recording process.

Each recording session consisted of two major parts involving three stimulus booklets. In the first part, a single-word articulation test, namely the Hong Kong Cantonese Articulation Test (HKCAT) [cheung2006hong], was used to obtain information about the child’s speech sound ability at the single-word level. In the second part, the subject was asked to read aloud two stimulus booklets with pictures that illustrate 130 Cantonese words (223 syllables) of 1 to 4 syllables.

2.4 Composition of stimuli

Hong Kong Cantonese Articulation Test (HKCAT) is a standardized single word articulation test commonly used by qualified speech therapists in Hong Kong. It provides information about the speech sound inventory, speech sound errors and patterns of the participant. All research assistants had received proper training on the use of HKCAT and transcription of Cantonese speech sounds. The procedure during data collection was monitored by the supervisor, who is a qualified speech therapist with more than 10 years of clinical experience in working with children with speech sound disorders. The results of the HKCAT were instantly transcribed on recording forms by the research assistants.

After HKCAT, the child subject was asked to name the 130 Cantonese words one by one. When a subject failed to name a picture, the research assistant would provide a direct model for the child to repeat and imitate. The target words were selected with consideration of their age appropriateness and are illustrated with child-friendly colorful drawings. Samples of the stimuli are shown in Figure 1(a)-(d). The words were selected to elicit speech samples covering all Cantonese phonemes in words of different lengths, with different syllable structures (CV, CVV, CVC) and at different syllable positions. The initial consonants include plosives, affricates, nasals, fricatives, approximants and lateral approximants. The initial consonant [n] was not included as it is commonly regarded as an allophone of [l] in Hong Kong Cantonese. Seven long vowels, four short vowels, eleven diphthongs, six final consonants and six lexical tones are all covered in the 223 syllables. The list of phonemes is summarized in Table 3.

Initial consonants: p ph t th k kh kw kwh ts tsh m N f s h w j l
Long vowels: a: i: E: œ: O: u: y:
Short vowels: 5 I 8 U
Diphthongs: ai ei 5i ui Oi au 5u iu ou 8y Eu
Final consonants: -p -t -k -m -n -N
Lexical tones: High-level, Mid-rising, Mid-level, Mid-falling, Low-rising, Low-level
Table 3: Cantonese phonemes included in CUCHILD
Figure 1: Samples of stimuli: Cantonese words with 1-4 syllables. (a) "thO:N25" (Candy) (b) "fUN55 si:n33" (Fan) (c) "hO:n33 pou25 pa:u55" (Hamburger) (d) "tshi:u55 kh5p55 si:33 tshœ:N21" (Supermarket)
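The onset/nucleus/coda/tone structure from Section 2.1 and the inventory of Table 3 can be illustrated with a small parser for transcriptions written in the style of Figure 1. This is only a sketch and not part of the corpus tooling: the function names are hypothetical, the tone is assumed to be the final two digits of each syllable, and longest-prefix matching is one simple way to separate symbols such as "t", "th", "ts" and "tsh".

```python
import re

# Inventory transcribed from Table 3 (ASCII-style phonetic symbols).
ONSETS = ["p", "ph", "t", "th", "k", "kh", "kw", "kwh", "ts", "tsh",
          "m", "N", "f", "s", "h", "w", "j", "l"]
LONG_VOWELS = ["a:", "i:", "E:", "œ:", "O:", "u:", "y:"]
SHORT_VOWELS = ["5", "I", "8", "U"]
DIPHTHONGS = ["ai", "ei", "5i", "ui", "Oi", "au", "5u", "iu", "ou", "8y", "Eu"]
CODAS = ["p", "t", "k", "m", "n", "N"]
NUCLEI = LONG_VOWELS + DIPHTHONGS + SHORT_VOWELS


def _longest_prefix(text, candidates):
    """Return the longest candidate that `text` starts with, or ''."""
    matches = [c for c in candidates if text.startswith(c)]
    return max(matches, key=len) if matches else ""


def parse_syllable(syllable):
    """Split one transcribed syllable into onset, nucleus, coda and tone.

    Assumption: the last two characters are the tone contour digits,
    which also keeps the vowel symbols "5" and "8" out of the tone.
    """
    match = re.fullmatch(r"(.+)(\d\d)", syllable)
    if match is None:
        raise ValueError(f"no two-digit tone found in {syllable!r}")
    body, tone = match.groups()
    onset = _longest_prefix(body, ONSETS)  # the onset is optional
    rest = body[len(onset):]
    nucleus = _longest_prefix(rest, NUCLEI)
    coda = rest[len(nucleus):]
    if not nucleus or (coda and coda not in CODAS):
        raise ValueError(f"cannot parse {syllable!r} with the Table 3 symbols")
    return {"onset": onset, "nucleus": nucleus, "coda": coda, "tone": tone}


def parse_word(word):
    """Parse a word given as space-separated syllables."""
    return [parse_syllable(s) for s in word.split()]


if __name__ == "__main__":
    for syl in parse_word("fUN55 si:n33"):
        print(syl)
```

The parser is deliberately strict: syllables in Figure 1 that write a diphthong with a length mark, such as "pa:u55", do not match the Table 3 symbols literally and raise a `ValueError` rather than being guessed at.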

2.5 Pre-processing of collected speech data

After the speech samples were collected, the HKCAT results charted by the research assistants were validated by the supervisor, partially onsite and entirely at the laboratory with reference to the audio and audio-visual recordings. A comparison of the screening results obtained from face-to-face assessment, audio recordings and audio-visual recordings found no significant difference in HKCAT scores among the different modes of judgement [ng2018comparison]. The HKCAT scores provide important information about the children’s speech inventory, speech sound errors and patterns, and serve as a reference transcription of the collected speech data with the 223 syllables in CUCHILD. Age-appropriate errors made by typically developing children, age-inappropriate phonological processes produced by children with suspected speech sound disorders, and articulation errors are all included. The reference transcription is used to categorise the collected speech data into accurate pronunciation and expected erroneous speech, both from typically developing (TD) children, and unexpected erroneous speech from children with disordered speech (DS). Thus, in addition to the full coverage of all Cantonese phonemes, the speech data in CUCHILD give a spectrum of phonological processes and articulation errors that are typically or atypically found in Cantonese-speaking children at age 3;0 to 6;11. The pre-processed information and manual annotation of the speech data can be used as training data for speech recognition systems as well as for the other proposed functions and applications.

3 Acoustical Analysis

Figure 2: Formant analysis of the Cantonese vowels over different age group, illustrated by F1-F2 scatter plots: (a) Age between - ; (b) Age between - ; (c) Age between - ; (d) Age between - . The legend (from top to bottom) represents the vowels [i: u: œ: y: O: E: ].

Acoustical analysis of child speech aims to provide a better understanding of developmental changes in acoustic patterns. Clinically, the findings can provide a reference for each speaker group in assessment. In this section, we measure the fundamental frequency (F0) and the first two formants (F1, F2) of the Cantonese long vowels [i: y: E: œ: a: O: u:] using automatic F0 and formant tracking algorithms. The long vowels are taken from monosyllabic words with the syllable structure (C)V:. The effect of lexical tone is not considered in this study.

Age (years;months) 3;0-3;11 4;0-4;11 5;0-5;11 6;0-6;11
Table 4: Number of speakers used in acoustic analysis.

A subset of speech data is selected from TD speakers, as summarized in Table 4. The audio signals are down-sampled from kHz to kHz and converted to single-channel signals. Each target word in the recording is manually segmented and transcribed by trained research assistants using the software Wavesurfer [sjolander2000wavesurfer]. To locate the vowel segments for subsequent analysis, forced alignment is applied to the speech data with a GMM-HMM triphone acoustic model. The acoustic model is trained with -dimensional Mel-frequency cepstral coefficients (MFCC) and their first- and second-order derivatives, which are extracted every ms with a ms Hamming window. Linear discriminant analysis (LDA), semi-tied covariance (STC) transform and feature-space maximum likelihood linear regression (fMLLR) are also applied in the triphone model training [duda2012pattern][gales1999semi][gales1998maximum]. The acoustic modeling and forced alignment are implemented using the Kaldi speech recognition toolkit [povey2011kaldi]. Vowel segments shorter than ms are not included in the analysis. F0 and formant frequencies are estimated by Praat using the auto-correlation method and linear predictive analysis with Burg’s algorithm, respectively [boersma1993accurate][andersen1974calculation][boersma2018praat][jadoul2018introducing].
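The first- and second-order derivatives appended to the MFCC vectors are commonly computed with a regression over neighbouring frames. The sketch below shows that standard formula in plain Python. It illustrates the idea only and is not the feature pipeline used for the corpus: the function names, the window of 2 frames on each side, and the choice to repeat the first and last frame as padding are all assumptions.

```python
def deltas(frames, window=2):
    """Regression-based time derivative of a sequence of feature vectors.

    delta[t] = sum_{n=1..W} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..W} n^2)
    Frames beyond the edges are replaced by the first/last frame (assumed).
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out


def add_deltas(frames, window=2):
    """Append first- and second-order derivatives to each static vector."""
    d1 = deltas(frames, window)
    d2 = deltas(d1, window)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]


if __name__ == "__main__":
    # Toy one-dimensional "cepstra" that rise by 1 per frame.
    ramp = [[float(t)] for t in range(6)]
    for row in add_deltas(ramp):
        print(row)
```

On a linear ramp the interior first-order values equal the slope, which is a quick sanity check on the formula; each output vector is three times the static dimension.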

Child speech is known to have higher F0 and formant frequencies than adult speech. The wide spacing of harmonic peaks makes the analysis more difficult [kent2018static]. To avoid erroneous estimation of formant frequencies, the ceiling values of formant frequencies for the front vowels [i: y: E: œ:], the central vowel [a:] and the back vowels [O: u:] are empirically set to Hz, Hz and Hz, respectively. We allow a maximum of five formants (F1 - F5) to be estimated in each analysis frame. For F0 estimation, the pitch floor is set to Hz.

Each vowel segment consists of a number of analysis frames, from each of which F0 and formant frequencies can be extracted. The median values over all frames are used to represent the whole segment. The mean F0 values of male and female speakers are shown in Figure 3. As age increases, the child speakers of both genders show a declining trend in F0. Boys generally have lower F0 than girls, but the difference is very small. At age of , boys have a mean F0 of Hz whereas the mean F0 of girls is Hz. At age of , the mean F0 values of boys and girls are Hz and Hz respectively.
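The two-level aggregation described above (median over the frames of a segment, then mean over the segments of a speaker group) can be sketched as follows. The function names are hypothetical, unvoiced or undefined frames are assumed to be represented by `None`, and the numbers in the usage example are invented for illustration and are not measurements from the corpus.

```python
from collections import defaultdict
from statistics import mean, median


def segment_value(frame_values):
    """Median over the frames of one vowel segment.

    Frames where the tracker returned no value are assumed to be None
    and are skipped; a segment with no usable frame yields None.
    """
    usable = [v for v in frame_values if v is not None]
    return median(usable) if usable else None


def group_means(segments):
    """Mean of the per-segment medians for each (age group, gender).

    `segments` is an iterable of (age_group, gender, frame_values).
    """
    groups = defaultdict(list)
    for age_group, gender, frame_values in segments:
        value = segment_value(frame_values)
        if value is not None:
            groups[(age_group, gender)].append(value)
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    # Invented F0 tracks in Hz, for illustration only.
    toy = [
        ("3;0-3;11", "F", [300.0, 310.0, 320.0]),
        ("3;0-3;11", "F", [290.0, 295.0, None]),
        ("3;0-3;11", "M", [280.0, None, 900.0, 285.0]),
    ]
    print(group_means(toy))
```

Using the median within a segment keeps a single spurious frame, such as the 900 Hz value in the toy data, from dominating the segment-level estimate.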

Figure 3: Results of fundamental frequency analysis of different age groups and genders

Estimation of formant frequencies is prone to errors; in particular, closely spaced formants may not be identified separately. A data cleansing procedure is applied to make the statistical analysis more meaningful. The estimated raw values for each formant (F1 - F3) are grouped according to vowel identity, age and gender. For each group, the mean and standard deviation are computed. Any measured value deviating by standard deviation from the mean is removed. The F1-F2 plots for different age ranges are shown in Figure 2. Different vowels are marked in different colors. The vowel ellipses are drawn to represent the % confidence interval. It is known that the F1 value is closely related to the height of the tongue, whereas F2 is determined mainly by the frontness and backness of the tongue body. The mean values of F1 and F2, as well as the mean duration of the five vowels [i: E: a: O: u:] for the age groups of and are shown in Table 5. Comparing the two age groups, there is a decreasing trend in the F1 values of all vowels as age increases. A similar observation applies to F2 except for [u:]. The vowel duration by children of age is slightly longer than those of age .
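The cleansing step and the vowel ellipses can be sketched in a few lines of Python. Everything here is illustrative rather than the procedure actually used: the function names are hypothetical, the rejection threshold `k` and the confidence `level` are left as parameters, the sample standard deviation is assumed, and the ellipse is taken to be the region holding a fraction `level` of a bivariate Gaussian fitted to the F1-F2 points, which is one common reading of a confidence ellipse.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def remove_outliers(records, k):
    """Drop values more than k standard deviations from their group mean.

    `records` is a list of (vowel, age_group, gender, value) tuples for one
    formant; groups are formed by (vowel, age_group, gender).
    """
    groups = defaultdict(list)
    for vowel, age_group, gender, value in records:
        groups[(vowel, age_group, gender)].append(value)
    # A standard deviation needs at least two values per group.
    stats = {key: (mean(v), stdev(v)) for key, v in groups.items() if len(v) >= 2}
    kept = []
    for record in records:
        key, value = record[:3], record[3]
        if key not in stats:
            kept.append(record)
            continue
        mu, sd = stats[key]
        if abs(value - mu) <= k * sd:
            kept.append(record)
    return kept


def confidence_ellipse(f1, f2, level):
    """Centre, semi-axes (major, minor) and major-axis angle in radians.

    The angle is measured from the F1 axis. For two dimensions the
    chi-square quantile has the closed form -2 * ln(1 - level).
    """
    n = len(f1)
    m1, m2 = mean(f1), mean(f2)
    s11 = sum((a - m1) ** 2 for a in f1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in f2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(f1, f2)) / (n - 1)
    # Eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (s11 + s22) / 2.0
    root = math.sqrt(((s11 - s22) / 2.0) ** 2 + s12 ** 2)
    lam_major, lam_minor = half_trace + root, max(half_trace - root, 0.0)
    scale = -2.0 * math.log(1.0 - level)
    angle = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    return (m1, m2), (math.sqrt(scale * lam_major), math.sqrt(scale * lam_minor)), angle
```

Because a single extreme value also inflates the group standard deviation, a pass of this kind removes only gross tracking errors; whether to repeat it is a design choice the text does not specify.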

Vowel Age / Age
F1 (Hz) F2 (Hz) Duration (s)
[i:] / / /
[E:] / / /
[a:] / / /
[O:] / / /
[u:] / / /
Table 5: Formant values and duration of 5 long vowels [i: E: a: O: u:] which are commonly used to represent the vowel loop.

4 Applications of CUCHILD

4.1 Speech recognition and speaker diarization

In automatic speech recognition (ASR), the high diversity of acoustic properties and the limited language proficiency in child speech explain why statistical models trained on adult speech do not apply well to child speech. Child speech data are therefore necessary for the development of ASR systems for child users. The CUCHILD corpus is expected to address the issue by providing a large amount of child speech data. Speaker diarization (SD), which aims to solve the "who speaks when" problem, is another important research topic with practical significance. Currently, SD systems are commonly trained on adult speech. The spontaneity and phonetic variation in child speech make the extraction of speaker information difficult [xie2019multi]. A high-performance SD system for child speech is expected to bring benefits in several respects. For instance, an SD system can be used to analyze adult-child interaction and extract the child's speech from a conversation [kothalkar2019tagging]. The extracted child speech data can be used as training data for ASR system development [wang2018study] or to support the development of clinical assessment tools [shahin2019automatic]. In addition, the analysis of adult-child speech interaction would be helpful for understanding children’s typical or atypical social behaviours [hansen2019speech].

4.2 Detection of speech sound errors

Speech Sound Disorder (SSD) is diagnosed when a child shows difficulties in the acquisition, production and perception of speech, and makes pronunciation errors that do not match the normal variation expected for his/her age [WinNT]. Poor speech sound production skills are found to have significant impacts on social, emotional and academic development [hitchcock2015social], and are associated with lower literacy outcomes [overby2012preliteracy][lewis2011literacy] and a greater likelihood of reading disorders [peterson2009influences]. With a large amount of child speech data, automatic detection of phonological and articulation errors becomes feasible using machine learning. Automatic detection tools are expected to accelerate the screening of children who are at risk of SSD, thus enabling early identification and intervention. In the long run, early intervention can have a positive impact on children's development, and thus reduce the load on current healthcare services for children with special education needs. CUCHILD includes recordings of accurate production and expected erroneous speech produced by TD children, as well as unexpected erroneous speech produced by children with disordered speech. It is designed to support the development and evaluation of such detection systems. Relevant works can be found in


4.3 Developmental studies on children

Children’s acquisition of speech sounds can be investigated by large-scale population studies, as in [to2013population][so1995acquisition]. Subject-level statistics of articulation test results describe the overall picture of phonological acquisition and indicate the developmental error patterns. These studies often place heavy demands on manpower and professional costs, and take a long time for data collection, validation and drawing conclusions from the results. Alternatively, child speech can be collected and analysed based on the acoustic signal. The signal captures rich linguistic and speaker information. Findings from studies of acoustic features can bring new insight into the developmental changes of child speech, and can inspire new approaches to differentiating atypical from healthy speech with automated systems. CUCHILD serves these purposes and supports studies of the acoustic properties of pre-school child speech.

5 Conclusion

In this paper, a large-scale child speech corpus, CUCHILD, with Cantonese speech sounds collected from 1,986 children aged from 3;0 to 6;11, is presented. The corpus includes recordings of speech sounds collected from both typically developing children and children with disordered speech when reading 130 Cantonese words of 1 to 4 syllables. All initial consonants, vowels, diphthongs, final consonants and lexical tones of Cantonese are covered in the corpus. Acoustical analysis of a subset of the speech samples, including the measurement of fundamental frequency and the first three formants, was illustrated in this paper. Future work with the corpus, including child speech recognition, speaker diarization, detection of speech sound errors and further spectral analysis, is suggested and remains to be investigated.

6 Acknowledgements

This research was partially supported by a direct grant and a Research Sustainability Fund from the Research Committee of the Chinese University of Hong Kong, as well as the financial support by the Hear Talk Foundation under the project titled "Speech Analysis for Cantonese Speaking Children".