Dysfluencies in speech such as sound repetitions, word repetitions, and blocks are common amongst everyone and are especially prevalent in people who stutter. Frequent occurrences can make social interactions challenging and limit an individual’s ability to communicate with ubiquitous speech technology including Alexa, Siri, and Cortana [4, 5, 28, 25, 6]. In this work we investigate the ability to automatically detect dysfluencies, which may be valuable for clinical assessment or development of accessible speech recognition technology.
This problem is challenging because there are many variations in how a given individual expresses each dysfluency type, in the patterns of dysfluencies between users, and even in how the situation or environment affects their speech. For example, an individual may stutter when conversing but not while reading aloud; when talking with a teacher but not a friend; or when stressed before an exam but not in every day-to-day interaction. The speech pathology community has spent decades characterizing these behaviors and developing diagnostic tools and strategies to mitigate them [27, 23, 26, 12]. However, there has been limited success in applying these learnings to speech recognition technology, where individuals may be frequently cut off or have their speech inaccurately transcribed.
A major bottleneck in this area is that dysfluency datasets tend to be small and have few or inconsistent annotations that were not designed for work on speech recognition tasks. Kourkounakis et al. [15] used 800 speech clips (53 minutes) with custom annotations to detect dysfluencies from 25 children who stutter using the UCLASS dataset [9]. Riad et al. [22] performed a similar task using 1429 utterances from 22 adults who stutter with the recent FluencyBank [21] dataset. Bayer et al. [3] collected a 3.5 hour German dataset with 37 speakers and developed a model for automated stuttering severity assessment. Unfortunately, none of the annotations from these efforts have been released. A core contribution of our paper is the introduction of the Stuttering Events in Podcasts (SEP-28k) dataset, which contains 28k annotated clips (23 hours) of speech curated from public podcasts. We have released these along with annotations for 4k clips (3.5 hours) from FluencyBank targeted at stuttering event detection.
Table 1: Annotation labels, their definitions, and how often each label occurs in SEP-28k and FluencyBank.
|Label|Definition|SEP-28k|FluencyBank|
|---|---|---|---|
|Block|Gasps for air or stuttered pauses|12.0%|10.3%|
|Prolongation|Elongated syllable “M[mmm]ommy”|10.0%|8.1%|
|Sound Repetition|Repeated syllables “I [pr-pr-pr-]prepared dinner”|8.3%|13.3%|
|Word/Phrase Repetition|“I made [made] dinner”|9.8%|10.4%|
|No dysfluencies|Affirmation that there are no discernible dysfluencies|56.9%|54.1%|
|Interjection|Filler words, e.g., “um,” “uh,” & “you know”|21.2%|27.3%|
|Natural pause|A pause in speech (not as part of a stutter event)|8.5%|2.7%|
|Unintelligible|It is difficult to understand the speech|3.7%|3.0%|
|Unsure|An annotator was unsure of their response|0.1%|0.4%|
|No Speech|The clip is silent or only contains background noise|1.1%|-|
|Poor Audio Quality|There are microphone or other quality issues|2.1%|-|
|Music|Music is playing in the background|1.1%|-|
The focus of this paper is on detection of five stuttering event types: Blocks, Prolongations, Sound Repetitions, Word/Phrase Repetitions, and Interjections. Existing work has explored this problem using traditional signal processing techniques [8, 13, 7], language modeling (LM) [22, 1, 11, 2, 17], and acoustic modeling (AM) [17, 15]. Each approach has been shown to be effective at identifying one or two event types, typically on data from a small number of users. Prolongations, or extended sounds, have been detected using short-window autocorrelations and low-level acoustic models. Word/phrase repetitions, if they are well articulated, are easily detected using LM-based approaches, with two caveats: single-syllable words such as in the phrase “I-I-I am” will often be smoothed into “I am” by the underlying acoustic model, and phrases like “I am [am]” may be pruned because the LM has never seen the word “am” repeated before. This behavior is acceptable for speech recognition but problematic for stuttering event analysis. Arjun et al. addressed this repetition problem by segmenting pairs of subsequent words and analyzing correlations in their spectral features. Interjections, including “um”, “uh”, “you know” and other filler words, are perhaps the easiest type to recognize with a language model if well articulated. Blocks, or gasps/pauses typically within or between words, are difficult to detect because the gasp for breath or pause is often inaudible. Sound repetitions are also challenging because syllables may vary in duration, count, style, and articulation (e.g., “[moh-muh-mm]-ommy”).
2.1 Stuttering Events in Podcasts (SEP-28k)
We manually curated a set of podcasts, many of which contain speech from people who stutter talking with other people who stutter, using a two-step process. Shows were initially selected by searching metadata from a podcast search engine with terms related to dysfluencies such as stutter, speech disorder, and stammer. This resulted in approximately 40 shows and hundreds of hours of audio. Many of these were about speech disorders but did not contain high rates of speech from people who stutter. After culling the data, we extracted clips from 385 episodes across 8 shows. Specific show names and links to each episode can be found in the dataset repository.
We extracted 40 to 250 segments per episode for a total of 28,177 clips. Dysfluency events are more likely to occur shortly before, during, or after a pause, so we used a voice activity detector to extract 3-second intervals near pauses. We varied where we sampled each interval with respect to a breakpoint to capture a more representative set of dysfluencies.
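The sampling idea can be sketched roughly as follows. This is a minimal illustration and not the released extraction pipeline: it assumes a hypothetical per-frame voice activity mask (one boolean per 10 ms frame), and the function names `find_pause_starts` and `sample_clips`, the minimum pause length, and the random-offset scheme are all assumptions made for the example.

```python
import random

FRAME_SEC = 0.01   # assumed hop of the voice activity mask (10 ms)
CLIP_SEC = 3.0     # clip length stated in the text


def find_pause_starts(vad_mask, min_pause_frames=20):
    """Return frame indices where a run of at least min_pause_frames
    non-speech frames begins (min_pause_frames is an assumed value)."""
    starts, run_start = [], None
    # A trailing True sentinel closes a pause that runs to the end of the mask.
    for i, is_speech in enumerate(list(vad_mask) + [True]):
        if not is_speech and run_start is None:
            run_start = i
        elif is_speech and run_start is not None:
            if i - run_start >= min_pause_frames:
                starts.append(run_start)
            run_start = None
    return starts


def sample_clips(vad_mask, rng=None):
    """Pick one 3-second (start, end) interval, in seconds, near each pause.

    Each clip is shifted by a random offset so the pause may land near the
    start, middle, or end of the clip (the offset scheme is an assumption).
    """
    rng = rng or random.Random(0)
    total_sec = len(vad_mask) * FRAME_SEC
    if total_sec < CLIP_SEC:
        return []
    clips = []
    for frame in find_pause_starts(vad_mask):
        pause_sec = frame * FRAME_SEC
        start = pause_sec - rng.uniform(0.0, CLIP_SEC)
        # Keep the clip inside the recording.
        start = max(0.0, min(start, total_sec - CLIP_SEC))
        clips.append((start, start + CLIP_SEC))
    return clips
```

Randomizing the offset is one simple way to vary where the pause falls within each clip; other breakpoint schemes would fit the description equally well.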
We used all of the FluencyBank [21] interview data, which contains recordings from 32 adults who stutter. As with Riad et al. [22], we found that the temporal alignment for some of the provided transcriptions and dysfluency annotations was inaccurate, so we ignored these and used the same process as SEP-28k to annotate 4,144 clips (3.5 hours).
Annotating stuttering data is difficult because of ambiguity in what constitutes stuttering for a given individual. Repetitions, for example, can occur during stuttering events or when an individual wants to emphasize a word or phrase. Speech may be unintelligible, which makes it challenging to identify how a word was stuttered. We annotated our data using a variant of time-interval based assessment [26] in which audio recordings are broken into 3-second clips and annotated with binary labels as defined in Table 1. A clip may contain multiple stuttering event types along with non-dysfluency labels such as natural pause and unintelligible speech. SEP-28k was also annotated with no speech, poor audio quality, and music labels to identify issues specific to this medium.
Clips were annotated by at least three people who were not clinicians but received training via written descriptions, examples, and audio clips on how to best identify each dysfluency. We measured Fleiss' kappa inter-annotator agreement and found that word repetitions, interjections, sound repetitions, and no dysfluencies were more consistent (0.62, 0.57, 0.40, 0.39), whereas blocks and prolongations had only fair or slight agreement (0.25, 0.11). Blocks can be difficult to assess from audio alone; clinicians often rely on physical signs of gasping for air when making this assessment. As such, results that rely on the block labels should be treated as more speculative.
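For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Fleiss' kappa for a single binary label. It assumes every clip was rated by the same number of annotators (exactly three in the usage below), which is a simplification of the "at least three" setting described here; the function name and input layout are illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of annotators who assigned clip i to
    category j (for a binary label: [n_absent, n_present]). Every clip
    must have the same total number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each clip needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Overall proportion of ratings falling in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement for each clip.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)
```

With three annotators per clip, a row such as `[1, 2]` means one annotator marked the event absent and two marked it present.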
2.4 Evaluation & Metrics
We use F1 score and Equal Error Rate to evaluate dysfluency detection where each annotation constitutes a binary label. F1 is the harmonic mean of precision ($P$) and recall ($R$): $F_1 = \frac{2PR}{P + R}$. Equal Error Rate (EER) is the point in the Receiver Operating Characteristic (ROC) curve where the false acceptance rate is equal to the false rejection rate and reflects how well the two classes are separated. The lower the EER, the better the performance of the model. We report results for each label individually and as a combined “Any” label which includes all five stutter types.
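Both metrics can be computed with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the EER is approximated by sweeping thresholds over the observed scores and taking the point where the two error rates are closest, rather than by interpolating the ROC curve.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = dysfluency present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def equal_error_rate(y_true, scores):
    """Approximate EER from binary labels and real-valued scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    best_gap, eer = float("inf"), None
    for thr in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= thr for s in neg) / len(neg)  # negatives accepted
        frr = sum(s < thr for s in pos) / len(pos)   # positives rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

F1 needs hard decisions, so scores must be thresholded first, whereas EER is computed directly from the scores.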
SEP-28k is partitioned into three splits containing 25k samples for training, 2k for validation, and 1k for testing. FluencyBank is partitioned across the 32 individuals in the dataset: 26 individuals (3.6k clips) for training, 3 (500 clips) for validation, and 3 (500 clips) for testing. We encourage others to explore alternative splits to tease out differences between speakers, podcasts, or other analyses.
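A speaker-disjoint partition like the FluencyBank one can be sketched as below. The clip representation (`(clip_id, speaker_id)` tuples), the function name, and the seeded shuffle used to choose held-out speakers are assumptions for illustration; the source does not say how speakers were assigned to splits.

```python
import random


def split_by_speaker(clips, n_val=3, n_test=3, seed=0):
    """Partition (clip_id, speaker_id) pairs so that no speaker appears
    in more than one split. Held-out speakers are chosen by a seeded
    shuffle, which is an assumption of this sketch."""
    speakers = sorted({speaker for _, speaker in clips})
    if len(speakers) <= n_val + n_test:
        raise ValueError("not enough speakers for the requested split")
    random.Random(seed).shuffle(speakers)
    test_speakers = set(speakers[:n_test])
    val_speakers = set(speakers[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for clip in clips:
        speaker = clip[1]
        if speaker in test_speakers:
            splits["test"].append(clip)
        elif speaker in val_speakers:
            splits["val"].append(clip)
        else:
            splits["train"].append(clip)
    return splits
```

Changing `seed`, or grouping by podcast instead of speaker, is one way to explore the alternative splits suggested above.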
Our approach takes an audio clip, extracts acoustic features per frame, applies a temporal model, and outputs a single set of clip-level dysfluency labels. We investigated baselines inspired by the dysfluency model of Kourkounakis et al. [15], along with alternative input features, model architectures, and loss functions.
3.1 Acoustic Features
Our baseline input is a set of 40-dimensional mel-filterbank energy features. We use frequency cut-offs at 0 and 8000 Hz, a 25 ms window, and a sample rate of 100 Hz (one frame every 10 ms). We compare with three additional feature types:
- Pitch (3 dim): pitch, pitch-delta, and voicing features;
- Articulatory features [18, 19, 24];
- Phoneme probabilities [10, 20].
Pitch, voicing, and articulatory features encode voice quality and often change across dysfluency events. We hypothesize these may improve detection of blocks or gasps. Phoneme probabilities may make it easier to identify sound repetitions where the same phoneme fires multiple times in a row.
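The baseline filterbank input can be approximated with the sketch below. It assumes the cut-offs are in Hz, the window is 25 ms, and the rate of 100 corresponds to a 10 ms frame hop. The 16 kHz audio sample rate, Hamming window, 512-point FFT, HTK-style mel formula, and function names are assumptions not stated in the text.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels=40, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_energies(signal, sr=16000, win_sec=0.025, hop_sec=0.010,
                     n_mels=40, n_fft=512):
    """Return log mel-filterbank energies with shape (n_frames, n_mels)."""
    signal = np.asarray(signal, dtype=np.float64)
    win, hop = int(round(win_sec * sr)), int(round(hop_sec * sr))
    if len(signal) < win:
        raise ValueError("signal is shorter than one analysis window")
    n_frames = 1 + (len(signal) - win) // hop
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energy + 1e-10)  # small floor avoids log(0)
```

Under these assumptions a 3-second clip yields roughly 300 frames of 40 values each, which is the sequence the temporal model consumes.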
3.2 Model Architectures
The baseline stutter detection model consists of a single-layer LSTM network. An improved model, which we refer to as ConvLSTM, adds convolutional layers per feature type and learns how the features should be weighted, as shown in Figure 2. Feature maps from the convolution layer are combined after batch normalization and fed to the LSTM layer. The temporal convolution size was set to 3 frames for one feature type and to 5 frames for the remaining features. We use unidirectional recurrent networks where the final state is fed into the per-clip classifier. Both models have two output branches: a fluent/dysfluent prediction and a soft prediction for each of the five event types.
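To make the data flow concrete, the following is a forward-pass sketch of a ConvLSTM-style model in NumPy with randomly initialized weights. It is not the authors' implementation. The ReLU activation, the per-clip normalization standing in for batch normalization, the weighted sum used to combine feature maps, the sigmoid outputs, the choice of which feature gets the 3-frame kernel, and all names are assumptions; only the 64-unit width, the kernel sizes of 3 and 5, the unidirectional LSTM, and the two output branches come from the text.

```python
import numpy as np

HIDDEN = 64    # layer width reported in the training details
N_EVENTS = 5   # five stuttering event types


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def temporal_conv(x, kernel):
    """x: (T, D_in); kernel: (K, D_in, D_out), K odd. Zero-padded so the
    output keeps T frames. ReLU is an assumed activation."""
    k = kernel.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.stack([np.einsum("kd,kdo->o", xp[t:t + k], kernel)
                    for t in range(x.shape[0])])
    return np.maximum(out, 0.0)


def lstm_final_state(x, w_in, w_rec, bias):
    """Unidirectional LSTM over x: (T, D); returns the final hidden state."""
    h, c = np.zeros(HIDDEN), np.zeros(HIDDEN)
    for x_t in x:
        z = x_t @ w_in + h @ w_rec + bias
        i, f, o = (sigmoid(z[:HIDDEN]), sigmoid(z[HIDDEN:2 * HIDDEN]),
                   sigmoid(z[2 * HIDDEN:3 * HIDDEN]))
        g = np.tanh(z[3 * HIDDEN:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h


def init_params(feature_dims, kernel_sizes, seed=0):
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        "conv": {n: s * rng.standard_normal((kernel_sizes[n], d, HIDDEN))
                 for n, d in feature_dims.items()},
        # One scalar weight per feature type (learned in a real model).
        "mix": {n: 1.0 / len(feature_dims) for n in feature_dims},
        "lstm": (s * rng.standard_normal((HIDDEN, 4 * HIDDEN)),
                 s * rng.standard_normal((HIDDEN, 4 * HIDDEN)),
                 np.zeros(4 * HIDDEN)),
        "w_dysfluent": s * rng.standard_normal(HIDDEN),
        "w_events": s * rng.standard_normal((HIDDEN, N_EVENTS)),
    }


def conv_lstm_forward(features, params):
    """features: dict name -> (T, D_name). Returns (p_dysfluent, event_scores)."""
    maps = []
    for name, x in features.items():
        fmap = temporal_conv(x, params["conv"][name])
        fmap = (fmap - fmap.mean(axis=0)) / (fmap.std(axis=0) + 1e-5)
        maps.append(params["mix"][name] * fmap)
    fused = np.sum(maps, axis=0)                      # (T, HIDDEN)
    h = lstm_final_state(fused, *params["lstm"])
    return sigmoid(h @ params["w_dysfluent"]), sigmoid(h @ params["w_events"])
```

A framework such as PyTorch or JAX would be the natural choice for training; the plain NumPy version is used here only to keep the tensor shapes explicit.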
3.3 Loss functions
The baseline model has a single cross-entropy loss term. Our improved models are trained with a multi-task objective where the fluent/dysfluent branch has a weighted cross-entropy term with focal loss [16] and the per-dysfluency branch has a concordance correlation coefficient (CCC) loss using the inter-annotator agreement for each clip.
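The two loss terms can be written down compactly. The sketch below is illustrative: it assumes the CCC target for a clip is the fraction of annotators who marked the event, uses focal loss defaults (gamma = 2, alpha = 0.25) that are common in the focal loss literature but not stated here, and leaves out how the two terms are weighted against each other, which the text does not specify.

```python
import math


def ccc_loss(pred, target):
    """1 - concordance correlation coefficient between two sequences.

    target is assumed to hold per-clip annotator agreement in [0, 1],
    e.g. 2/3 when two of three annotators marked the event.
    """
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        return 0.0  # identical constant sequences: perfect concordance
    return 1.0 - 2.0 * cov / denom


def focal_loss(prob, label, gamma=2.0, alpha=0.25):
    """Mean binary focal loss; prob is the predicted P(dysfluent).

    gamma down-weights easy examples and alpha weights the positive
    class (both values are assumptions in this sketch).
    """
    eps = 1e-7
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(prob)
```

Unlike a per-clip cross-entropy, the CCC term is computed over a batch of clips, so it rewards predictions that track the level of annotator agreement rather than a single hard label.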
Models were trained with a mini-batch size of 256, using the Adam optimizer, with an initial learning rate of 0.01. Early stopping was used based on cross-validation error. Networks had 64 neurons in recurrent and embedding layers.
4 Experiments & Analysis
4.1 Model Design
Table 2 compares performance across feature and architecture types. Spectral features with pitch generally perform well and, with the improved model, achieve the best performance when articulatory signals are added. This improvement matches our intuition that variation in intonation and articulation coincides with dysfluent speech. The phoneme-based models perform worst, despite their ability to extract features one might think would be useful for sound repetitions. The ConvLSTM and CCC loss moderately improve F1, likely because this loss explicitly encodes annotator uncertainty.
Table 3 shows performance per-dysfluency type. Performance is worse for Blocks and Word Repetitions. These dysfluencies tend to last longer in time and have more variation in expression, which may contribute to the lower performance. Interjections and prolongations tend to have less variability and are easier to detect. SEP-28k performance is consistently worse than FluencyBank, likely given the larger variety of individuals and speaking styles.
[Table: results for the Baseline (LSTM, XEnt) and Improved (ConvLSTM, CCC) models]
4.2 Data Quantity & Type
The central hypothesis for this work was that existing datasets are too small and contain too few participants for training effective dysfluency detection models. This is corroborated by the results in Figure 3, which show performance on SEP-28k and FluencyBank while training on different subsets. In the best case, there is a 24% relative F1 improvement on FluencyBank when training on all 25k SEP-28k training samples compared to the 3k FluencyBank set. Even training on only 5k SEP-28k clips outperforms training on FluencyBank by 16% F1. This could be because there are a larger number of users in the dataset and the data contains more variability in speaking styles. As expected, performance on SEP-28k is worst when training on FluencyBank and increases with larger numbers of training samples.
We introduced SEP-28k, which contains over an order of magnitude more annotations than existing public datasets, and added new annotations to FluencyBank. These annotations can be used for many tasks, so we encourage others to explore the data, labels, and splits in ways beyond what is described here. Future work should explore alternative approaches, e.g., using language models, which may improve performance for some dysfluency types that are more difficult to detect. Lastly, while dysfluencies are most common in those who stutter, future work should address how they can be detected from people with other speech disorders, such as dysarthria, which may be characterized differently.
Acknowledgment: Thanks to Lauren Tooley for countless discussions on the clinical aspects of stuttering.
[1] (2018) A lightly supervised approach to detect stuttering in children’s speech. In Interspeech.
[2] (2017) Automatic recognition of children’s read speech for stuttering application. WOCCI.
[3] (2020) Towards automated assessment of stuttering and stuttering therapy. In International Conference on Text, Speech, and Dialogue.
[4] (2018) Accessible voice interfaces. In CSCW.
[5] (2020) Speech diversity and speech interfaces: considering an inclusive future through stammering. In Conversational User Interfaces.
[6] (2018-10) When Alexa can’t understand you. Slate (online).
[7] (2003) Intelligent processing of stuttered speech. Journal of Intelligent Information Systems.
[8] (2018) Speech recognition and correction of a stuttered speech. In ICACCI.
[9] (2009) The UCLASS archive of stuttered speech. J. Speech Lang. Hear. Res.
[10] (2019) Sequence-to-sequence speech recognition with time-depth separable convolutions. In Interspeech.
[11] Using clinician annotations to improve automatic speech recognition of stuttered speech. In Interspeech.
[12] (1993) Time-interval measurement of stuttering: systematic replication of Ingham, Cordes, and Gow (1993). Journal of Speech, Language, and Hearing Research.
[13] (2020) Automatic correction of stutter in disfluent speech. CoCoNet.
[14] (2020) Sense and accessibility: understanding people with physical disabilities’ experiences with sensing systems. In ACM ASSETS.
[15] Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory. In ICASSP.
[16] (2017) Focal loss for dense object detection. In ICCV.
[17] (2016) Gaussian mixture model based classification of stuttering dysfluencies. Journal of Intelligent Systems.
[18] (2019) Leveraging acoustic cues and paralinguistic embeddings to detect expression from voice. In ICASSP.
[19] Retrieving tract variables from acoustics: a comparison of different machine learning strategies. IEEE Journal of Selected Topics in Signal Processing.
[20] (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP.
[21] (2018) Fluency bank: a new resource for fluency research and practice. Journal of Fluency Disorders.
[22] (2020) Identification of primary and collateral tracks in stuttered speech. In LREC.
[23] (2009) SSI-4 stuttering severity instrument fourth edition. Austin, TX: Pro-Ed.
[24] (2017) Noise robust acoustic to articulatory speech inversion. In Interspeech.
[25] (2020-04) Stammering accessibility and testing for voice assistants & devices. Personal blog (online).
[26] (2015) Event- and interval-based measurement of stuttering: a review. IJLCD.
[27] (1983) The nature of stuttering (2nd ed.). Applied Psycholinguistics.
[28] (2020-01) For people who stutter, the convenience of voice assistant technology remains out of reach. USA Today (online).