Automatic Detection of Phonological Errors in Child Speech Using Siamese Recurrent Autoencoder

08/07/2020 ∙ by Si-Ioi Ng, et al. ∙ The Chinese University of Hong Kong

Speech sound disorder (SSD) refers to the developmental disorder in which children encounter persistent difficulties in correctly pronouncing words. Assessment of SSD has relied largely on trained speech and language pathologists (SLPs). With the increasing demand for and long-lasting shortage of SLPs, automated assessment of speech disorder is a highly desirable way of assisting clinical work. This paper describes a study on automatic detection of phonological errors in Cantonese speech of kindergarten children, based on a newly collected large speech corpus. The proposed approach to speech error detection uses a Siamese recurrent autoencoder, which is trained to learn the similarity and discrepancy between phone segments in the embedding space. Training of the model requires only speech data from typically developing (TD) children. To distinguish disordered speech from typical speech, the cosine distance between the embeddings of the test segment and the reference segment is computed. Different model architectures and training strategies are investigated. Results on detecting the 6 most common consonant errors demonstrate satisfactory performance of the proposed model, with average precision values from 0.82 to 0.93.



1 Introduction

Children who suffer from speech sound disorder (SSD) commit persistent errors in producing certain speech sounds after the expected age of acquisition. Untreated children with SSD may experience social and academic difficulties, which impact their personal growth in the long term. Currently, clinical assessment of SSD is carried out by qualified speech and language pathologists (SLPs) based on perceptual evaluation. The assessment can take various forms, including articulation tests, conversation, storytelling, etc. The result of each form of test reveals the severity and details of specific speech sound developmental problems. The assessment criteria are established and validated by experts. Timely diagnosis of SSD is crucial to effective treatment and rehabilitation. This is, however, hindered by the significant global shortage of SLPs. Methods of automatically detecting speech sound errors are highly desirable to reduce the pressure on SLPs and benefit a large population of patients.

Child SSD detection is the task of distinguishing abnormal speech sound production from typical production based on acoustic speech signals. Possible approaches include template matching, statistical modeling and automatic speech recognition (ASR). Given a limited amount of speech data, Yeung et al. [yeung2017predicting] investigated an exemplar-based approach to evaluating English rhotic sounds in child speech. With a good amount of data for statistical modeling, Dudy et al. [dudy2015pronunciation][dudy2018automatic] improved the goodness of pronunciation (GOP) [witt2000phone] measure for pronunciation analysis in disordered child speech. Phonetic knowledge about common realizations of target phonemes was applied in the analysis. Similar approaches of knowledge incorporation were found in other works. In [shahin2014comparison], assessment of childhood apraxia of speech (CAS) was performed using a constrained lattice in an ASR system. The lattice has the advantage that the type of mispronunciation can go beyond a binary decision. For each target word, the lattice was created according to expected mispronunciation rules. This approach was further extended in [ward2016automated], where phonological error patterns were identified via fine-tuning of state transition weights between the correct and mispronounced phone sequences of a target word. However, mispronunciations are unpredictable in nature, which challenges such systems because they rely on prior knowledge about the errors concerned.

In recent years, fixed-dimension representations of speech have been applied widely to speech modeling and classification problems. Such a representation encodes the information of variable-length speech segments in low-dimension vectors, which allow different segments to be compared and analyzed in the same embedding space. The similarity between segments can be evaluated by Euclidean distance, cosine distance, or other distance measures. Many approaches have been proposed for extracting embeddings from speech. In the present study, the use of a sequence-to-sequence auto-encoder (AE) is investigated. It is a neural network model with an encoder-decoder architecture. The encoder converts the input sequence into a low-dimension embedding, while the decoder aims to reconstruct from the embedding an output sequence that is the same as or closely related to the input. Applications of the sequence-to-sequence AE include unsupervised spoken term discovery, query-by-example spoken term detection and speaker verification.


A common type of child SSD can be described as the desired phone, typically a consonant, being substituted by another phone. In this study, detection of such phonological errors is formulated as the problem of pairwise contrast between relevant phone segments, based on the embedding representations generated by an AE model. In terms of the network architecture, the AE is combined with a Siamese network, which is jointly trained to contrast the phone segments in the embedding space. Different model setups are evaluated first on test data of “artificial” substitution errors. Subsequently the proposed approach is applied to detect real phonological errors produced by children with SSD.

2 Background & Speech Database

2.1 Speech acquisition by Cantonese-speaking children

The present study is focused on Cantonese, a major Chinese dialect that is widely spoken in Hong Kong, Macau, Guangdong and Guangxi Provinces of Mainland China, as well as overseas Chinese communities. Cantonese is a monosyllabic and tonal language. Each Chinese character is pronounced as a single syllable carrying a lexical tone. A Cantonese syllable can be divided into an Initial part and a Final part. The Initial is a consonant while the Final could be a diphthong or comprise a vowel nucleus followed by a consonant coda (final consonant). There are a total of consonants, vowels and diphthongs in Cantonese. The present-day Cantonese uses over legitimate syllables (Initial-Final combinations). If the tone difference is taken into account, the number of distinct syllables exceeds [bauer2011modern][lee2002spoken]. In this study, we focus on Cantonese spoken in Hong Kong. The target group of speakers is pre-school children in Hong Kong.

In [so1995acquisition], So and Dodd examined speech sounds of typically developing (TD) and Cantonese-speaking pre-school children. It was shown that children were able to acquire tones, most of the vowels and diphthongs by the age of ; (years;months). The acquisition of final consonants and initial consonants was achieved by the age of ; and ; respectively. To et al. [to2013population] investigated acquisition of Hong Kong Cantonese by children aged ; to ;. The study revealed a longer time required for speech sound acquisition. Vowels and diphthongs were acquired by ; and ; respectively, and all initial consonants were acquired by ;. In the process of speech sound acquisition, children may try to simplify a target speech sound by substituting it with other sounds. This is mainly due to the undeveloped motor skills for speech sound production. TD children gradually stop using the substitution sounds and return to typical pronunciation as they grow up. Nonetheless, some children persist in making the substitution errors beyond the expected age of acquisition. The symptoms are referred to as phonological disorder, and children with the disorder are recommended to seek treatment offered by SLPs.

Age (years;months) 3;0-3;11 4;0-4;11 5;0-5;11 6;0-6;11
Male, healthy
Female, healthy
Male, atypical
Female, atypical
Table 1: Statistics of speakers in available speech data

2.2 Child Speech Database: CUCHILD

A Cantonese child speech corpus named CUCHILD is used in the present study [ng2020cuchild]. The corpus contains speech data collected from kindergarten children (aged ;-;) in Hong Kong. All speakers use Cantonese as their first language (L1). CUCHILD is designed to support acoustic modeling of Cantonese child speech and research on automatic assessment of SSD [wang2018study][ng2018automated]. The speech material consists of a total of Cantonese words of to syllables in length, covering the consonants and most commonly used vowels. Speech recording was carried out in classrooms provided by the kindergartens. A digital recorder was located at - centimeters in front of the children’s mouth. Yet environmental noise such as reverberation, school bells, people walking around, etc. was unavoidable. To minimize effects of background noise, the gain and the position of recorders were adjusted manually. Child speech was elicited via a picture naming task. Each word was also accompanied by a pictorial illustration. A research assistant showed the pictures one by one and guided the child to speak the intended words.

All participants were assessed with the Hong Kong Cantonese Articulation Test (HKCAT) [cheung2006hong]. The HKCAT is a standardized test for children which reflects the severity of developmental delay and the types of speech sound errors. Among all participants, children were found to have SSD.

The speech data were collected recently, and detailed data processing and annotation work is still ongoing. The present study makes use of a subset of the whole corpus, which covers the recordings from child speakers. The data were manually annotated and segmented into child speech and research assistants’ speech. Spoken words manifesting SSD were labelled manually by SLPs. The syllable-level orthographic transcriptions were manually verified. Table 1 summarizes the speaker information in our dataset.

Figure 1: Speech sound disorder (SSD) detection system.

3 The Proposed System of SSD Detection

3.1 SSD detection system

In clinical assessment of SSD, the child is guided to speak a list of test words. The responsible speech pathologist observes the speech production and decides if the child makes errors on specific parts of the words. The judgement depends highly on the clinician’s experience in differentiating atypical speech sounds from typical ones.

Towards automated assessment of SSD, the proposed system aims to determine whether a phonological error occurs in a test speech segment. The test segment contains a specific phoneme as part of a test word spoken by the child. To detect the error, we may choose one or multiple reference segments of the expected speech sound to compare with the test segment in a pairwise manner. Using multiple reference segments is preferred as they can represent the deviation of the expected speech sound. As illustrated in Figure 1, the comparison is in an embedding space and all embeddings are extracted by the encoder obtained from the trained Siamese RAE model. The cosine distance is computed for each pair of embeddings. The binary decision is based on a pre-defined threshold. If the score is above the threshold, the test segment is classified as typical pronunciation. Otherwise it is a disordered pronunciation.
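The pairwise comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `cosine_distance` and `is_typical`, the toy two-dimensional embeddings and the threshold value are all hypothetical, and the "score" is assumed here to be the mean cosine similarity (one minus the cosine distance), so that a score above the threshold corresponds to typical pronunciation.

```python
import math

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def is_typical(test_emb, reference_embs, threshold):
    """Compare a test embedding with several references of the expected phone.

    Assumption: the score is the mean cosine similarity (1 - mean distance);
    a score above the threshold is treated as typical pronunciation.
    """
    distances = [cosine_distance(test_emb, ref) for ref in reference_embs]
    score = 1.0 - sum(distances) / len(distances)
    return score > threshold

# Toy reference embeddings for one target phone (hypothetical values).
refs = [[1.0, 0.0], [0.9, 0.1]]
print(is_typical([1.0, 0.05], refs, threshold=0.8))  # close to the references
print(is_typical([0.0, 1.0], refs, threshold=0.8))   # far from the references
```

Averaging over several references, rather than relying on a single one, reflects the preference stated above for covering the natural variation of the expected speech sound.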

The present study is focused on a set of initial consonants in Cantonese, which are considered as reliable markers for child speech acquisition. Details of the model are described in the following sections.

3.2 Recurrent autoencoder

A recurrent autoencoder (RAE) model is used to generate a compact representation of a phone segment. This representation is referred to as the embedding. The RAE converts variable-length phone segments into fixed-dimensional embedding vectors, on which a distance or similarity measure can be applied straightforwardly. The RAE has three components: an encoder, a linear layer and a decoder. The encoder receives an input sequence. The hidden state of the encoder’s last layer is passed to the linear layer, which generates the embedding. The embedding is then passed to the decoder to construct the output sequence. The RAE is trained such that the embedding is adequate for reconstructing a certain type of target output. One common choice is to make the target output sequence equal to the input sequence. This can be achieved by minimizing the mean squared error (MSE) loss in the training of the encoder and decoder networks. For the input sequence $X = (x_1, x_2, \dots, x_T)$, the MSE loss is given as,

$$L_{\mathrm{MSE}}(X) = \frac{1}{T} \sum_{t=1}^{T} \lVert x_t - f_t(h) \rVert^2$$

where $f_t(h)$ refers to the decoder output at time step $t$ and $h$ denotes the last hidden layer output of the encoder, while $T$ is the length of the input sequence.
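As a small numerical illustration of the reconstruction objective, the sketch below computes an MSE between a target frame sequence and a decoder output of the same length. The function name `sequence_mse` and the toy frames are hypothetical, and the loss is assumed to be averaged over time steps only (library implementations often average over feature dimensions as well).

```python
def sequence_mse(target, decoded):
    """Mean over time steps of the squared Euclidean error between frames.

    Assumption: averaging is over time steps only, not feature dimensions.
    """
    if len(target) != len(decoded):
        raise ValueError("sequences must have the same number of frames")
    total = 0.0
    for x_t, y_t in zip(target, decoded):
        total += sum((a - b) ** 2 for a, b in zip(x_t, y_t))
    return total / len(target)

# Two frames of two features each; only the last feature is mis-reconstructed.
x = [[0.0, 1.0], [2.0, 3.0]]
y = [[0.0, 1.0], [2.0, 5.0]]
print(sequence_mse(x, y))  # (0 + 4) / 2 = 2.0
```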

The RAE model is also commonly applied with a weakened input-output relation, i.e., without requiring the decoder to perform exact recovery of the input sequence. Such a design aims at sharing mutual information between non-identical but closely related training segments [kamper2019truly]. This type of RAE is known as the correspondence RAE (Cor-RAE). In this work, the Cor-RAE model is trained using speech segments carrying the same phoneme. For a pair of segments and from the same phoneme category, the MSE loss for the training of Cor-RAE is,


where and denote the lengths of and respectively.

Figure 2: Siamese network architecture.

3.3 Siamese recurrent autoencoder

As discussed earlier, the task of phonological error detection is formulated as a process of contrasting a test segment against the target phonemes. This process is realized with a Siamese network. It consists of two identical neural networks with shared parameters, which process two input representations in parallel. By inserting the Siamese loss in training, the network parameters are optimized to learn the similarity between the input representations. Our implementation of the Siamese RAE follows the work in [zhu2018siamese], where the loss is computed with a pair of embeddings extracted from the RAE, as shown in Figure 2.

Two types of Siamese loss are considered and compared in this work. The first one is the contrastive loss, which is expressed as,


where , and are the pair of embeddings representing two input speech segments. Both embeddings are generated by the encoder in the Siamese RAE.

The other type of loss function is the triplet loss defined as,


The loss function involves three embeddings as input, which include an anchor , a positive sample and a negative sample . and are the cosine distances of the anchor-positive pair and the anchor-negative pair respectively, and .
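To make the two similarity losses concrete, the following sketch implements one common margin-based form of each, operating on precomputed cosine distances. These forms are assumptions for illustration: the contrastive loss here penalises the distance of same-class pairs and the margin shortfall of different-class pairs, and the triplet loss is the usual hinge on the anchor-positive and anchor-negative distances. The exact expressions and margin values used in this work may differ.

```python
def contrastive_loss(distance, same_class, margin):
    """Assumed form: pull same-class pairs together, push others past the margin."""
    if same_class:
        return distance
    return max(0.0, margin - distance)

def triplet_loss(d_ap, d_an, margin):
    """Assumed hinge form: the negative should be farther than the positive by the margin."""
    return max(0.0, d_ap - d_an + margin)

# A well-separated triplet incurs no loss; a poorly separated one does.
print(triplet_loss(d_ap=0.1, d_an=0.9, margin=0.4))
print(triplet_loss(d_ap=0.5, d_an=0.6, margin=0.4))
```

The triplet form constrains only the relative ordering of the two distances, whereas the contrastive form constrains each pair against an absolute margin.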

The overall objective function for Siamese RAE training combines the MSE loss and the contrastive/triplet loss as,


where is a scalar weight to balance the reconstruction loss and similarity loss.

4 Experiments and Results

4.1 Data pre-processing

Consonant segments in TD and atypical child speech were extracted automatically by forced alignment with GMM-HMM triphone models. The triphone models were trained with speech data from TD children of age - among the speakers as summarized in Table 1. TD children in this age range are expected to make few mistakes in speech production and their speech is considered to be free of SSD problems. Acoustic features for GMM-HMM training consist of -dimensional Mel-frequency cepstral coefficients (MFCC) and their first- and second-order derivatives extracted every second. For triphone model training, linear discriminant analysis (LDA), semi-tied covariance (STC) transform and feature space Maximum Likelihood Linear Regression (fMLLR) were applied [duda2012pattern][gales1999semi][gales1998maximum]. With a basic syllable pronunciation dictionary, an error rate of % was achieved on the task of free-loop syllable recognition with test speech from unseen TD children in the same age range. Forced alignment was applied to the speech data shown in Table 1 according to the canonical pronunciations of the test words. Feature extraction, acoustic model training and forced alignment were all carried out with the Kaldi speech recognition toolkit [povey2011kaldi]. As a result, a pool of consonant segments was extracted and divided into different subsets as shown in Table 2.

Name of subset    Clinical group     Age range      No. of segments
Training          TD                 5;0 - 6;11
Reference         TD                 5;0 - 6;11
Development       TD                 5;0 - 6;11
Test1             TD                 3;0 - 4;11
Test2             TD & Disordered    3;0 - 6;11     &
Table 2: Summary of data (phone segments) for training and evaluation of the RAE model.

4.2 Training of the Siamese RAE

In this study, gated recurrent units (GRU) are adopted as the recurrent neural network architecture in the Siamese RAE [audhkhasi2017end]. The input representations are dimensional filter-bank features, with mean and variance being globally normalized. The training phone segments are paired randomly. A training target ’ ’ is assigned to the pairs of same-class segments , and ’’ assigned to pairs of segments from different classes. The encoder and decoder networks both consist of hidden layers and hidden units. The embeddings are L2-normalized. Both the Siamese RAE and Siamese Cor-RAE models are trained by the Adam optimizer [kingma2014adam] with a batch size of , a learning rate of , weight decay of and for epochs. Training of the Siamese Cor-RAE starts with a pre-trained standard Siamese RAE model. The margin is for the contrastive loss and for the triplet loss. A loss weight of is applied to both loss functions. The training processes are implemented with PyTorch.
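The random pairing of training segments can be illustrated with the sketch below. It is only a sketch under stated assumptions: the target values `SAME = 1` and `DIFFERENT = 0`, the function name `make_training_pairs`, the `(segment_id, phone_label)` representation and the fixed random seed are hypothetical choices, not details taken from the training setup described above.

```python
import random

# Assumed target values for same-class and different-class pairs.
SAME, DIFFERENT = 1, 0

def make_training_pairs(segments, pairs_per_segment, seed=0):
    """Pair each segment with randomly chosen other segments.

    segments: list of (segment_id, phone_label) tuples.
    Returns a list of (segment_id, other_id, target) tuples.
    """
    rng = random.Random(seed)
    pairs = []
    for i, (seg_id, phone) in enumerate(segments):
        others = segments[:i] + segments[i + 1:]  # never pair a segment with itself
        for other_id, other_phone in rng.sample(others, pairs_per_segment):
            target = SAME if phone == other_phone else DIFFERENT
            pairs.append((seg_id, other_id, target))
    return pairs

toy_segments = [("s1", "f"), ("s2", "f"), ("s3", "k"), ("s4", "s")]
for pair in make_training_pairs(toy_segments, pairs_per_segment=2):
    print(pair)
```

The `pairs_per_segment` argument controls how many pairs each training segment contributes, which is the quantity varied when different training strategies are compared.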


Figure 3: Performance on the development set.

Different embedding sizes, loss functions (contrastive vs. triplet) and Siamese RAE model designs are evaluated on the development set. The evaluation is carried out on the same-different discriminability task as described in [carlin2011rapid]. Given a pair of test segments , and are declared to contain the same phoneme if the embedding distance , where is the decision threshold. Each development segment is randomly paired with a segment from the reference dataset. The segment pairs assigned with ’ ’ are regarded as artificial substitution errors, in which the target phone is substituted by another phone. The cosine distance is computed for each segment pair. The average precision (AP) is used as the evaluation metric of system performance. The value of AP is obtained from the precision-recall (PR) curve, which portrays the system performance across varying decision thresholds. The results in terms of AP are shown in Figure 3. Overall, using the triplet loss and bi-directional network structure (BiGRU-MSE_T) leads to better performance on the same-different task. In the following experiments, this setting with an embedding size of is used.
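For readers unfamiliar with the metric, the sketch below computes a non-interpolated average precision from a list of scores and binary labels: the precision is evaluated at the rank of every positive example and the values are averaged. The function name `average_precision` and the toy data are hypothetical, and it is assumed that a higher score indicates the positive class (for example, the cosine distance when substitution errors are treated as positives). Tied scores are not handled specially.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision values at each positive's rank.

    scores: higher means more likely positive (assumption).
    labels: 1 for positive, 0 for negative.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at this recall point
    return ap / n_pos

# Positives ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

Because every rank position corresponds to one decision threshold, this single number summarises the precision-recall trade-off without fixing a threshold in advance.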

4.3 Performance evaluation on artificial errors

In this part, the Test1 dataset in Table 2 is used to evaluate the performance of the Siamese RAE and Siamese Cor-RAE. Each test segment in Test1 is paired up with a segment randomly selected from the reference set. The results in terms of AP are reported in Figure 4. In the figure we also compare different training strategies in which each segment in the training dataset is used to form , and training pairs with other training segments. It can be seen that the conventional Siamese RAE consistently outperforms the Siamese Cor-RAE. The number of training pairs has a noticeable impact on the performance level. The results imply that it is beneficial to use more training pairs. However, a large number of training pairs would not yield further improvement, in particular for the Siamese Cor-RAE. It seems that learning the mutual information shared across short segments of the same phoneme in the Siamese Cor-RAE does not work as well as it does with sub-word and word level speech units [kamper2019truly][renshaw2015comparison]. This could be caused by the short duration of the speech units or by the hyperparameter settings of the model. More work is required to draw definitive conclusions.

Figure 4: Performance on artificial errors.

4.4 Performance evaluation on real errors

The Siamese RAE with Bi-GRU trained with training pairs, i.e., the best performing model shown in Figure 4, is evaluated on the task of detecting real phonological errors with the Test2 dataset. The test data cover the most common error patterns, which concern the Cantonese consonants /f/, /k/, /s/, /kh/, /th/ and /ph/. The errors arise from the phonological processes of stopping (e.g. /f/ to /p/), fronting (e.g. /k/ to /t/), deaspiration (e.g. /kh/ to /k/) and affrication (e.g. /s/ to /ts/), etc. They are caused mainly by incorrect place or manner of articulation. It should be noted that children with SSD often have incomplete speech sound inventories and their errors are not limited to these common patterns.

Consonant Error Pattern No. of Consonant Segments {Disordered, Typical, Reference} AP
Affrication, Stopping
Deaspiration, Fronting
Deaspiration, Backing
Table 3: AP on real consonant errors.

Each test segment is compared with all reference segments carrying the same consonant. The average cosine distance is computed. The results in terms of AP are shown in Table 3. The highest AP is achieved on the detection of the atypical aspirated consonant /kh/, while the performance of detecting the unaspirated consonant /k/ is the worst among all test patterns. This suggests that unaspirated consonants may not be reliably detected in automatic assessment of SSD. It was noted that the Siamese Cor-RAE did not yield good performance. The use of the correspondence model for SSD detection with higher-level speech units (e.g. syllable or word level) will be investigated in our future work.

5 Conclusion

An approach to automatic detection of phonological errors in child speech has been investigated and evaluated with both artificial and real speech sound errors. It has been shown that the proposed Siamese recurrent autoencoder model is able to learn compact representations from variable-length speech segments, which are effective in distinguishing erroneous segments from correct ones. Specifically, for the most common consonant errors in Cantonese, the achieved values of average precision range from 0.82 to 0.93. These results reveal the good potential of applying the proposed approach to automatic assessment of speech sound disorder in real-world settings. Future work will include the incorporation of clinical knowledge in the model design and the discovery of domain knowledge through acoustical analysis of child speech.