Speech disfluencies are inconsistencies and interruptions in the flow of otherwise normal speech. Of these speech impediments, stuttering is one of the most prominent, affecting over 70 million people, about one percent of the global population. Between 5% and 10% of children stutter at some point in their childhood, with a quarter of these children maintaining their stutters throughout their entire lives. Common therapy methods often involve helping the patient monitor and maintain awareness of their speaking patterns in order to correct them. Moreover, therapeutic success rates have been reported to be over 80%, especially when stuttering is detected and dealt with in early stages. Accordingly, with the recent advances in machine learning, deep learning, and language/speech processing techniques, developing smart and interactive tools for detection and therapy is now a real possibility.
In addition to interactive therapy purposes, other applications can be realized for automated stutter recognition. Fluent speech is crucial and influential in presentations such as talks and business communications. There are currently a number of applications available to assist speakers in monitoring and improving their presentation skills. For example, monitoring of features such as volume, rate of speech, and intonation, among others, has been explored in this context. However, detection and quantification of stutters has not yet been fully explored for such applications.
Despite the many potential applications for automated stutter detection, little research has been done in this area. This is partially because detecting and classifying the type and location of stutters can be a difficult problem, especially when factoring in variables such as gender, speech rate, accent, and phone-realization. Existing works in the area mostly rely on automatic speech recognition (ASR) to first convert audio signals to text, and then utilize language models to detect and identify the stutters. While this approach has proven effective and achieved promising results, the reliance on ASR can be both a potential source of error and an unnecessary additional computational step.
In this paper, we propose a model that directly utilizes audio speech signals to detect and classify stutters, skipping the ASR step and the need for language models. Our method uses spectrogram features to train a deep neural network with residual layers followed by bidirectional long short-term memory (Bi-LSTM) units to learn to locate and identify different types of stutters. The overview of our method is presented in Figure 1. Our experiments show the effectiveness of our approach in generalizing across multiple classes of stutters while maintaining high accuracy and strong consistency between classes.
2 Related Work
Early studies on the topic focused on the feasibility of stutter differentiation, with training and testing often being performed on a small set of specific stuttered words. For example, a hidden Markov model (HMM) was used to create a stutter recognition assistance tool. Testing results averaged 96% and 90% accuracy on human and artificially generated stuttered speech samples, respectively, for a single pre-determined word.
A number of assumptions are often made in order to simplify the problem of disfluency detection. For example, as different disfluencies vary heavily by nature, proposed solutions often tackle one single type of stutter (such as interjections, prolongations, or repetitions) at a time. In one study, for instance, sound repetition stutters were accurately detected on a small set of trained words. Another common assumption used for simplification has been to remove under-represented subject classes (for example, based on gender or age).
As ASR and natural language processing (NLP) have evolved greatly in recent years, such methods have become increasingly popular for the problem of stutter classification and recognition. One such method incorporated annotations from speech-language pathologists into a word lattice model, improving the baseline method by a relative 7.5%. Another model used Bi-LSTMs with conditional random fields (CRFs) to achieve an average F-score of 85.9% across all stutter types. The current state-of-the-art stutter classification method uses task-oriented finite state transducer (FST) lattices to detect repetition stutters with an average 37% miss rate across 4 different types of stutters.
3 Proposed Network
Our proposed method first generates spectrogram feature vectors from the audio clips. The spectrograms are then passed through a deep residual neural network, mapping the spectrogram matrices to a linear vector. These are then passed through a bidirectional LSTM to learn the extracted feature embeddings for different types of stutters. The different steps of our proposed pipeline are described in the following subsections.
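As an illustration of the first step, a magnitude spectrogram can be computed with a short-time Fourier transform. The sketch below is only a minimal, dependency-free example: the function name `spectrogram`, the frame length, the hop size, and the Hann window are assumptions made for illustration and are not the settings used in this work. A practical implementation would use an FFT routine from a signal processing library rather than the naive DFT shown here.

```python
import cmath
import math


def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of magnitude spectra, one per Hann-windowed frame.

    frame_len and hop are illustrative values (assumptions), not the
    parameters used in the paper.
    """
    # Symmetric Hann window to reduce spectral leakage.
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT over the non-negative frequency bins only.
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Example: a pure tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
```

Each row of `spec` corresponds to one time frame and each column to one frequency bin, which is the matrix layout a convolutional network would consume.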
3.1 Feature extraction
3.2 Feature embedding layers
We utilize a residual network in our model in order to effectively learn the stutter-specific features while avoiding issues such as the vanishing gradient problem. The use of this type of network also allows a deep architecture (a depth of 18 convolution layers) without overfitting, especially when considering the relatively small size of the dataset. Moreover, architectures with residual components have recently shown considerable promise in speech analysis. In our proposed solution, each group of 3 convolutional layers is referred to as a convolutional block. Figure 21 presents the hyperparameters of our network. The detection task for each stutter is formulated as a binary problem, with the same architecture mentioned being used for every disfluency type.
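For reference, the standard residual learning formulation (as introduced in the deep residual learning work listed in the references) can be written as follows. This is the general textbook form rather than a specification of the exact blocks used here; the symbols $\mathbf{x}$, $\mathbf{y}$, $\mathcal{F}$ and $W_i$ are introduced only for illustration.

```latex
\[
  \mathbf{y} = \mathcal{F}\bigl(\mathbf{x}, \{W_i\}\bigr) + \mathbf{x},
  \qquad
  \frac{\partial \mathbf{y}}{\partial \mathbf{x}}
    = \frac{\partial \mathcal{F}}{\partial \mathbf{x}} + I
\]
```

Here $\mathcal{F}$ denotes the stacked convolutional layers inside a block and $\mathbf{x}$ the block input carried by the shortcut connection. The identity term $I$ in the Jacobian gives the gradient a direct path to earlier layers, which is why such blocks mitigate vanishing gradients in deep stacks.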
3.3 Recurrent layers
The learned feature embeddings are provided to 2 recurrent layers, each consisting of 512 bidirectional LSTM units. We utilize LSTM layers as they have proven effective in classification when dealing with short sequential data, and are a popular approach in speech and NLP. In the context of the problem at hand, most stutters tend to be quick and last only a fraction of the 4-second audio clip that they are contained in. Therefore, the LSTM layers do not suffer from memory issues. Lastly, the use of bidirectional LSTMs allows the model to learn both past and future embeddings, providing further context for our problem. Dropout rates of 0.2 and 0.4 are utilized after each recurrent layer.
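The way a bidirectional layer combines past and future context can be summarized with the usual bidirectional recurrence. The notation below is generic and introduced only for illustration: $x_t$ is the embedding at time step $t$, and $\mathrm{LSTM}_{\mathrm{fwd}}$ and $\mathrm{LSTM}_{\mathrm{bwd}}$ are two independent LSTM cells.

```latex
\begin{align*}
  \overrightarrow{h}_t &= \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x_t, \overrightarrow{h}_{t-1}\bigr), \\
  \overleftarrow{h}_t  &= \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x_t, \overleftarrow{h}_{t+1}\bigr), \\
  h_t &= \bigl[\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,\bigr]
\end{align*}
```

The forward state summarizes everything before time $t$ and the backward state everything after it, so the concatenated $h_t$ sees the whole clip. If the 512 units are counted per direction, $h_t$ would be 1024-dimensional; the text does not state which convention applies.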
4 Experiment Setup and Results
4.1 Data and annotation
Speech samples were collected from the University College London’s Archive of Stuttered Speech (UCLASS) Release One dataset, created by the Division of Psychology and Language Sciences within the university. The dataset contains samples of monologues from 139 participants, ranging between 8 and 18 years of age, with known stuttered speech impediments of different severity. Of these recordings, 25 unique participants were used due to the availability of their orthographic transcriptions of the monologues.
[Table: columns Module, Input Spectrogram, Output Size; rows not recovered]
Forced time-alignment was used on the audio and transcriptions to generate a timestamp for each word and stutter spoken. The stutter annotation approach is similar to previously used methods [29, 13]. We then manually annotated each recording for one of 7 stutter disfluencies: sound repetition, part-word repetition, word repetition, phrase repetition, revision, interjection, or prolongation. A description of each type of stutter can be found in Table 2. We leave out part-word repetition disfluencies as the dataset contained only a few samples of such stutters, preventing our deep learning approach from properly learning the classification task. Each monologue recording was segmented into 4-second samples, totalling 800 labeled audio clips.
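The segmentation and labeling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the annotation tooling used in this work: the function names `segment_and_label` and `binary_targets` are hypothetical, a clip is assumed to carry a stutter label whenever an annotated event overlaps it in time, and a trailing clip shorter than 4 seconds is assumed to be dropped.

```python
def segment_and_label(duration, events, clip_seconds=4.0):
    """Cut a recording into fixed-length clips and attach stutter labels.

    duration: length of the recording in seconds.
    events: iterable of (start_sec, end_sec, stutter_type) annotations,
        e.g. obtained from forced alignment plus manual annotation.
    Returns a list of (clip_start, clip_end, labels) tuples, where labels
    is the set of stutter types that overlap the clip.
    """
    # Assumption: a trailing partial clip is discarded.
    n_clips = int(duration // clip_seconds)
    clips = []
    for i in range(n_clips):
        clip_start = i * clip_seconds
        clip_end = clip_start + clip_seconds
        labels = {
            kind
            for start, end, kind in events
            if start < clip_end and end > clip_start  # any temporal overlap
        }
        clips.append((clip_start, clip_end, labels))
    return clips


def binary_targets(clips, stutter_type):
    """Build the 0/1 targets for one stutter type (one binary task per type)."""
    return [1 if stutter_type in labels else 0 for _, _, labels in clips]


# Example with made-up timestamps.
example_events = [
    (1.0, 1.5, "sound_repetition"),
    (3.8, 4.3, "interjection"),
]
example_clips = segment_and_label(10.0, example_events)
interjection_targets = binary_targets(example_clips, "interjection")
```

Producing one target vector per stutter type matches the binary formulation of the detection task, where the same architecture is trained separately for every disfluency type.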
|Label||Stutter Disfluency||Description||Example|
| ||Sound Repetition||Repetition of any phoneme||th-th-this|
|W||Word Repetition||Repetition of any word||why why|
|PH||Phrase Repetition||Repetition of multiple successive words||I know I know that|
|R||Revision||Repetition of thought, rephrased mid-sentence||I think that- I believe that|
|I||Interjection||Fabricated word or sounds, added to stall for time||um, uh|
|PR||Prolongation||Prolonged sounds||whoooooo is it|
4.2 Implementation details
over 30 epochs, with minimal improvement in results seen in following epochs. A root mean square propagation (RMSProp) optimizer was used, as well as the softmax loss function. An Nvidia 1080 Ti GPU was used to perform the training.
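For readers unfamiliar with RMSProp, the update rule can be illustrated on a scalar parameter. The sketch below shows the general algorithm only; the learning rate, decay factor `rho`, and `eps` are illustrative assumptions and not the hyperparameters used for training, and the quadratic objective is a toy stand-in for the network loss.

```python
import math


def rmsprop_step(param, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    """Apply one RMSProp update to a scalar parameter.

    Returns the updated parameter and the updated running average of
    squared gradients. lr, rho and eps are illustrative defaults.
    """
    # Exponential moving average of the squared gradient.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad * grad
    # Scale the step by the root of that average.
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq


def minimise_quadratic(x0=1.0, steps=2000):
    """Minimise the toy objective f(x) = x^2 with RMSProp."""
    x, avg_sq = x0, 0.0
    for _ in range(steps):
        grad = 2.0 * x  # derivative of x^2
        x, avg_sq = rmsprop_step(x, grad, avg_sq)
    return x
```

Dividing by the running root mean square keeps the effective step size roughly constant regardless of gradient magnitude, which is the property that makes the optimizer convenient for recurrent models.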
|Alharbi et al. ||Word Lat.||60||–||0||–||25||–||0||–|
|Paper||Method||Ave. MR||Ave. Acc|
|Alharbi et al. ||Word Lat.||37%||–|
To rigorously test our proposed model, leave-one-subject-out (LOSO) cross validation was used: the model was trained on the speech of 24 of the UCLASS participants, while the last subject’s audio was used for testing. The process was repeated 25 times, testing the model on a different subject every time. For this dataset, accuracy (Acc) and miss rate (MR) values have been reported in prior work. Lastly, it should be noted that we target 6 categories of stutter disfluencies, as opposed to most prior work where fewer classes are considered.
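The LOSO protocol described above can be sketched as a simple fold generator. The function name `leave_one_subject_out` and the dictionary layout (subject identifier mapped to that subject's clips) are assumptions made for illustration.

```python
def leave_one_subject_out(clips_by_subject):
    """Yield (held_out_subject, train_clips, test_clips) for every subject.

    clips_by_subject: dict mapping a subject identifier to the list of
    clips recorded from that subject.
    """
    for held_out in sorted(clips_by_subject):
        # Train on every clip that does not belong to the held-out subject.
        train = [
            clip
            for subject, clips in clips_by_subject.items()
            if subject != held_out
            for clip in clips
        ]
        # Test only on the held-out subject's clips.
        test = list(clips_by_subject[held_out])
        yield held_out, train, test


# Example with 25 hypothetical subjects and placeholder clip names.
example_data = {
    f"subject_{i:02d}": [f"subject_{i:02d}_clip_{j}" for j in range(3)]
    for i in range(25)
}
folds = list(leave_one_subject_out(example_data))
```

Splitting by subject rather than by clip ensures that no speaker appears in both the training and test sets, so the reported scores reflect generalization to unseen speakers.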
4.4 Performance and Comparison
The results of our experiments on the UCLASS dataset are summarized in Table 3, where we compare our method to that of Alharbi et al. Additionally, to evaluate the need for bidirectional LSTM as opposed to a unidirectional LSTM, we compare our results to a baseline model where a ResNet with LSTM is used instead of our proposed model. The table shows that our method outperforms the state-of-the-art in detection of sound repetition and revisions by considerable margins (improvements of 41.90% and 22.14%, respectively).
The statistical language models and task-oriented word lattices used in other methods rely heavily on generating strong orthographic transcriptions for each speaker. As a result, while these methods struggle with sub-word stutters such as sound repetition or revision, they perform well for word repetition or prolongation. This can be observed in Table 3, as Alharbi et al. perform better than our method by a small margin (3.2%) for word repetition. Additionally, Alharbi et al. achieve a lower miss rate than ours for detection of prolongation (5.92%). Since our method relies on spectrogram features as opposed to a language model, some longer utterances can exceed the four-second windows or suffer from alignment issues, causing those stutters to be misclassified. Hence, our approach produces slightly more false negatives in classification of prolongation.
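The link between false negatives and miss rate can be made concrete with a short sketch. It assumes the standard definitions, which are not spelled out in the text: miss rate is the fraction of true stutter clips that the detector fails to flag, FN / (FN + TP), and accuracy is the fraction of all clips classified correctly. The function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def miss_rate(y_true, y_pred):
    """Fraction of actual stutters that were missed: FN / (FN + TP)."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    positives = tp + fn
    return fn / positives if positives else 0.0


def accuracy(y_true, y_pred):
    """Fraction of all clips classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0
```

Under these definitions, every additional false negative raises the miss rate for that stutter type even when overall accuracy remains high.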
As shown in Table 3, our classifier is able to identify interjections with an accuracy of 81.4%. Many other works on stutter classification tend to avoid interjection disfluencies as a class, since interjection stutters tend to be more diverse and lack the consistency of repetition and prolongation stutters, making them more difficult to classify. While other works such as [17, 18] were able to robustly detect interjections with mel-frequency cepstral coefficients (MFCC), small subsets of the UCLASS dataset were used, preventing us from performing a fair comparison to our model.
Our proposed method and the baseline approach (using LSTM instead of bidirectional LSTM) perform similarly, with the bidirectional LSTM achieving slightly better or similar results for every class. This lack of significant difference between the two LSTM variations is most likely due to the fact that the feature embeddings learned using the ResNet portion of our pipeline are quite robust, accurately capturing the information required to represent different stutters. While the difference between bidirectional and unidirectional LSTMs is marginal, we opt to use the Bi-LSTM approach as the additional computational cost for Bi-LSTM is not significant.
Table 4 presents the average performance of our model compared to Alharbi et al. and the baseline approach. It can be observed that our model achieves a 26.97% lower miss rate on the UCLASS dataset than the previous state-of-the-art. Moreover, as previously shown, our method slightly outperforms the unidirectional LSTM baseline when averaged across all stutter types. Lastly, Figure 3 shows the performance of our method for different stutter types against different training epochs. It can be seen that after approximately 20 epochs, our model reaches a steady state, indicating stable learning of disfluency-related features throughout the learning phase.
5 Conclusion and Future Work
We present a method for detection and classification of different types of stutter disfluencies. Our model utilizes a residual network and bidirectional LSTM units trained using input spectrogram features calculated from labeled audio segments of stuttered speech. Six classes of stutter were considered in this paper: sound repetition, word repetition, phrase repetition, revision, interjection, and prolongation. Investigations show that our method performs robustly across all classes, with very high average accuracy and low average miss rate, achieving state-of-the-art results with a significant improvement over the previous state-of-the-art for stutter detection.
In future work, building upon the current model, we will conduct research on multi-class learning of different stutter disfluencies. As multiple stutter types may occur at once (e.g. ’I went to uh to to uh to’), this approach may result in more robust classification of stutters.
The authors would like to thank Prof. Jim Hamilton for his support and valuable feedback throughout the course of this work.
- [1] (2017) Detecting stuttering events in transcripts of children’s speech. International Conference on Statistical Language and Speech Processing, pp. 217–228.
- [2] (2018) A lightly supervised approach to detect stuttering in children’s speech. INTERSPEECH, pp. 3433–3437.
- [3] (2009) Automatic detection of prolongations and repetitions using LPCC. International Conference for Technical Postgraduates, pp. 1–4.
- [4] (2015) Keras.
- [5] (2019) Stuttering Foundation website. https://www.stutteringhelp.org
- [6] (2014) Stuttering: An Integrated Approach to Its Nature and Treatment. 4th edition, Lippincott Williams & Wilkins.
- [7] (2019) A deep neural network for short-segment speaker recognition. INTERSPEECH, pp. 2878–2882.
- [8] (2006) Automatic alignment and error correction of human generated transcripts for long speech recordings. Ninth International Conference on Spoken Language Processing.
- [9] (2016) Deep residual learning for image recognition. pp. 770–778.
- [10] (2016) Using clinician annotations to improve automatic speech recognition of stuttered speech. INTERSPEECH, pp. 2651–2655.
- [11] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- [12] (2009) The University College London Archive of Stuttered Speech (UCLASS). Journal of Speech, Language, and Hearing Research 52, pp. 556–569.
- [13] (2011) Speech disfluency types of fluent and stuttering individuals: age effects. International Journal of Phoniatrics, Speech Therapy and Communication Pathology 63.
- [14] (2019) An approach for objective assessment of stuttered speech using MFCC features. Digital Signal Processing Journal 9, pp. 19–24.
- [15] (1998) Robust speech recognition using the modulation spectrogram. Speech Communication 25, pp. 117–132.
- [16] (2003) On the lateralization of emotional prosody: an event-related functional MR investigation. Brain and Language 86 (3), pp. 366–376.
- [17] (2016) Automatic segmentation and classification of dysfluencies in stuttering speech. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, pp. 1–6.
- [18] (2017) LP-hillbert transform based MFCC for effective discrimination of stuttering dysfluencies. International Conference on Wireless Communications, Signal Processing and Networking, pp. 2561–2565.
- [19] (2000) Why communication is important: a rationale for the centrality of the study of communication. Journal of the Association for Communication Administration 29, pp. 1–25.
- [20] (2019) Stuttering. https://www.nidcd.nih.gov
- [21] (2016) A review of challenges in automatic speech recognition. International Journal of Computer Applications 151 (3), pp. 23–26.
- [22] (2004) How effective is therapy for childhood stuttering? Dissecting and reinterpreting the evidence in light of spontaneous recovery rates. International Journal of Language & Communication Disorders 40, pp. 359–374.
- [23] (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
- [24] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- [25] (2013) Computer-based speech analysis in stutter. Applied Computer Science 9 (2), pp. 34–42.
- [26] (2007) Application of Malay speech technology in Malay speech therapy assistance tools. International Conference on Intelligent and Advanced Systems, pp. 330–334.
- [27] (2017) RoboCOP: a robotic coach for oral presentations. ACM Conference on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (2), pp. 27.
- [28] (2018) The Microsoft 2017 conversational speech recognition system. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5934–5938.
- [29] (1999) Early childhood stuttering I: persistency and recovery rates. Journal of Speech, Language, and Hearing Research 42.
- [30] (1997) Clinical measurement of stuttering behaviors. Contemporary Issues in Communication Science and Disorders 24.
- [31] (2016) Disfluency detection using a bidirectional LSTM. INTERSPEECH, pp. 2523–2527.