SEP-28k: A Dataset for Stuttering Event Detection From Podcasts With People Who Stutter

02/24/2021, by Colin Lea et al.

The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work, we introduce Stuttering Events in Podcasts (SEP-28k), a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter. We benchmark a set of acoustic models on SEP-28k and the public FluencyBank dataset and highlight how simply increasing the amount of training data improves relative detection performance by 28% and 24% F1 on each. Annotations from over 32k clips across both datasets will be publicly released.




1 Introduction

Dysfluencies in speech such as sound repetitions, word repetitions, and blocks are common amongst everyone and are especially prevalent in people who stutter. Frequent occurrences can make social interactions challenging and limit an individual’s ability to communicate with ubiquitous speech technology including Alexa, Siri, and Cortana [4, 5, 28, 25, 6]. In this work we investigate the ability to automatically detect dysfluencies, which may be valuable for clinical assessment or development of accessible speech recognition technology.

Figure 1: Speech from someone who stutters may contain events including sound repetitions (orange), interjections (blue), blocks/pauses (green), or other events that make speech recognition challenging.

This problem is challenging because there are many variations in how a given individual expresses each dysfluency type, in the patterns of dysfluencies between users, and even how the situation or environment affects their speech. For example, an individual may stutter when conversing but not while reading aloud; when talking with a teacher but not a friend; or when stressed before an exam but not in every day-to-day interaction. The speech pathology community has spent decades characterizing, developing diagnosis tools, and developing strategies to mitigate these behaviors [27, 23, 26, 12], however, there has been limited success in taking these learnings and applying them to speech recognition technology, where individuals may be frequently cut off or have their speech inaccurately transcribed.

A major bottleneck in this area is that dysfluency datasets tend to be small and have few or inconsistent annotations that were not designed with speech recognition tasks in mind. Kourkounakis et al. [15] used 800 speech clips (53 minutes) with custom annotations to detect dysfluencies from 25 children who stutter using the UCLASS dataset [9]. Riad et al. [22] performed a similar task using 1429 utterances from 22 adults who stutter with the recent FluencyBank [21] dataset. Bayerl et al. [3] collected a 3.5 hour German dataset with 37 speakers and developed a model for automated stuttering severity assessment. Unfortunately, none of the annotations from these efforts have been released. A core contribution of our paper is the introduction of the Stuttering Events in Podcasts (SEP-28k) dataset, which contains 28k annotated clips (23 hours) of speech curated from public podcasts. We have released these along with annotations for 4k clips (3.5 hours) from FluencyBank targeted at stuttering event detection.

Stuttering Labels Definition SEP-28k FluencyBank
Block Gasps for air or stuttered pauses 12.0% 10.3%
Prolongation Elongated syllable “M[mmm]ommy” 10.0% 8.1%
Sound Repetition Repeated syllables “I [pr-pr-pr-]prepared dinner” 8.3% 13.3%
Word/Phrase Repetition “I made [made] dinner” 9.8% 10.4%
Interjection Filler words e.g., “um,” “uh,” & “you know” 21.2% 27.3%
No dysfluencies Affirmation that there are no discernible dysfluencies 56.9% 54.1%
Non-dysfluent Labels
Natural pause A pause in speech (not as part of a stutter event) 8.5% 2.7%
Unintelligible It is difficult to understand the speech 3.7% 3.0%
Unsure An annotator was unsure of their response 0.1% 0.4%
No Speech The clip is silent or only contains background noise 1.1% -
Poor Audio Quality There are microphone or other quality issues 2.1% -
Music Music is playing in the background 1.1% -
Table 1: Distribution of annotations in each dataset where at least two of three annotators applied a given label.

The focus of this paper is the detection of five stuttering event types: Blocks, Prolongations, Sound Repetitions, Word/Phrase Repetitions, and Interjections. Existing work has explored this problem using traditional signal processing techniques [8, 13, 7], language modeling (LM) [22, 1, 11, 2, 17], and acoustic modeling (AM) [17, 15]. Each approach has been shown to be effective at identifying one or two event types, typically on data from a small number of users. Prolongations, or extended sounds, have been detected using short-window autocorrelations [13] and low-level acoustic models [15]. Word/phrase repetitions, if they are well articulated, are easily detected using LM-based approaches [11], with the caveat that single-syllable words such as in the phrase “I-I-I am” will often be smoothed into “I am” by the underlying acoustic model, and phrases like “I am [am]” may be pruned because the LM has never seen the word “am” repeated before. This is acceptable for speech recognition but problematic for stuttering event analysis. Arjun et al. [13] addressed this repetition problem by segmenting pairs of subsequent words and analyzing correlations in their spectral features. Interjections, including “um,” “uh,” “you know,” and other filler words, are perhaps the easiest type to recognize with a language model if well articulated. Blocks, or gasps/pauses typically within or between words, are difficult to detect because the gasp for breath or pause is often inaudible. Sound repetitions are also challenging because syllables may vary in duration, count, style, and articulation (e.g., “[moh-muh-mm]-ommy”).

Efforts in HCI have sought out an understanding of speech recognition needs for users with speech impairments, which is critical for framing problems like ours [4, 5, 14].

2 Data

2.1 Stuttering Events in Podcasts (SEP-28k)

We manually curated a set of podcasts, many of which contain speech from people who stutter talking with other people who stutter, using a two-step process. Shows were initially selected by searching metadata from a podcast search engine with terms related to dysfluencies such as stutter, speech disorder, and stammer. This resulted in approximately 40 shows and hundreds of hours of audio. Many of these were about speech disorders but did not contain high rates of speech from people who stutter. After culling down the data we extracted clips from 385 episodes across 8 shows. Specific show names and links to each episode can be found in the dataset repository.

We extracted between 40 and 250 segments per episode for a total of 28,177 clips. Dysfluency events are more likely to occur shortly before, during, or after a pause, so we used a voice activity detector to extract 3-second intervals near pauses. We varied where we sampled each interval with respect to a breakpoint to capture a more representative set of dysfluencies.
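The sampling strategy above can be sketched with a simple energy-based voice activity detector. The paper does not specify which VAD was used, so the thresholds, function names, and jittering scheme here are illustrative:

```python
import numpy as np

def find_pauses(audio, sr, frame_ms=25, hop_ms=10, thresh_db=-40.0, min_pause_ms=150):
    """Return sample indices of pause midpoints via a toy energy-based VAD."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(audio) - frame) // hop)
    energy_db = np.array([
        10 * np.log10(np.mean(audio[i * hop:i * hop + frame] ** 2) + 1e-10)
        for i in range(n_frames)
    ])
    silent = energy_db < thresh_db
    pauses, start = [], None
    for i, s in enumerate(silent):
        if s and start is None:
            start = i                       # pause begins
        elif not s and start is not None:
            if (i - start) * hop_ms >= min_pause_ms:
                pauses.append(((start + i) // 2) * hop)  # pause midpoint
            start = None
    return pauses

def clips_near_pauses(audio, sr, pauses, clip_s=3.0, jitter_s=0.5, seed=0):
    """Sample one 3-second clip near each pause, jittering the window position."""
    rng = np.random.default_rng(seed)
    n = int(clip_s * sr)
    clips = []
    for p in pauses:
        offset = int(rng.uniform(-jitter_s, jitter_s) * sr)
        start = min(max(0, p - n // 2 + offset), max(0, len(audio) - n))
        clips.append(audio[start:start + n])
    return clips
```

The jitter corresponds to varying where each interval is sampled relative to the breakpoint, so a dysfluency can land anywhere inside the 3-second window rather than always at its center.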

2.2 FluencyBank

We used all of the FluencyBank [21] interview data, which contains recordings from 32 adults who stutter. As with Riad et al. [22], we found that the temporal alignment of some of the provided transcriptions and dysfluency annotations was inaccurate, so we ignored these and used the same process as for SEP-28k to annotate 4,144 clips (3.5 hours).

2.3 Annotations

Annotating stuttering data is difficult because of ambiguity in what constitutes stuttering for a given individual. Repetitions, for example, can occur during stuttering events or when an individual wants to emphasize a word or phrase. Speech may also be unintelligible, which makes it challenging to identify how a word was stuttered. We annotated our data using a variant of time-interval-based assessment [26] in which audio recordings are broken into 3-second clips and annotated with binary labels as defined in Table 1. A clip may contain multiple stuttering event types along with non-dysfluency labels such as natural pause and unintelligible speech. SEP-28k was also annotated with no speech, poor audio quality, and music labels to identify issues specific to this medium.
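The aggregation in Table 1, where a label counts when at least two of three annotators applied it, reduces to a majority vote over binary annotations. A toy sketch with hypothetical votes for a single label:

```python
import numpy as np

# Hypothetical per-clip votes from three annotators for one label
# (e.g. "Interjection"); 1 means the annotator applied the label.
votes = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
])

# A clip carries the label when at least 2 of 3 annotators agree.
majority = votes.sum(axis=1) >= 2
rate = majority.mean()  # fraction of clips with the label, as in Table 1
```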

Clips were annotated by at least three people who received training, via written descriptions, examples, and audio clips, on how to best identify each dysfluency, but who were not clinicians. We measured inter-annotator agreement with Fleiss' kappa and found word repetitions, interjections, sound repetitions, and no dysfluencies were more consistent (0.62, 0.57, 0.40, 0.39), while blocks and prolongations had only fair or slight agreement (0.25, 0.11). Blocks can be difficult to assess from audio alone; clinicians often rely on physical signs of gasping for air when making this assessment. As such, results using the block labels should be treated as more speculative.
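Fleiss' kappa can be computed directly from per-clip category counts. A minimal implementation of the standard formula (not the authors' code):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts[i, j] = number of raters assigning clip i to category j."""
    counts = np.asarray(counts, dtype=float)
    n, _ = counts.shape
    r = counts[0].sum()                       # raters per clip (assumed constant)
    p_j = counts.sum(axis=0) / (n * r)        # marginal category proportions
    P_i = ((counts ** 2).sum(axis=1) - r) / (r * (r - 1))  # per-clip agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum() # observed vs. chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

For a binary label with three raters, each row is simply (votes for, votes against), e.g. `[2, 1]` when two of three annotators marked a block.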

2.4 Evaluation & Metrics

We use F1 score and Equal Error Rate (EER) to evaluate dysfluency detection, where each annotation constitutes a binary label. F1 is the harmonic mean of precision (P) and recall (R): F1 = 2PR / (P + R). EER is the point on the Receiver Operating Characteristic (ROC) curve where the false acceptance rate equals the false rejection rate and reflects how well the two classes are separated; the lower the EER, the better the model. We report results for each label individually and as a combined “Any” label which includes all five stutter types.
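A minimal way to compute EER from clip-level scores is to sweep thresholds and take the point where the two error rates cross; the paper does not describe its exact implementation, so this is an illustrative sketch:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: sweep thresholds; return the point where the false acceptance
    rate (negatives accepted) best matches the false rejection rate
    (positives rejected)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    best, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        pred = scores >= t
        far = np.mean(pred[labels == 0])      # false acceptance rate
        frr = np.mean(~pred[labels == 1])     # false rejection rate
        if abs(far - frr) < best:
            best, eer = abs(far - frr), (far + frr) / 2
    return eer
```

Perfectly separated scores give an EER of 0, while a detector at chance gives 0.5.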

SEP-28k is partitioned into three splits containing 25k samples for training, 2k for validation, and 1k for testing. FluencyBank is partitioned across the 32 individuals in the dataset: 26 individuals (3.6k clips) for training, 3 (500 clips) for validation, and 3 (500 clips) for testing. We encourage others to explore alternative splits to tease out differences between speakers, podcasts, or other analyses.

Figure 2: Multi-feature acoustic stutter detection model

3 Methods

Our approach takes an audio clip, extracts acoustic features per frame, applies a temporal model, and outputs a single set of clip-level dysfluency labels. We investigated baselines inspired by the dysfluency model in [15] as well as alternative input features, model architectures, and loss functions.

3.1 Acoustic Features

Our baseline input is a set of 40-dimensional mel-filterbank energy features. We use frequency cut-offs at 0 and 8000 Hz, a 25 ms window, and a frame rate of 100 Hz. We compare with three additional feature types:

  • Pitch (3 dim): pitch, pitch-delta, and voicing features;

  • Articulatory (8 dim): articulatory features in the form of vocal-tract constriction variables [24]. These define the degree and location of constriction actions within the human vocal tract [24, 19], as implemented in [18];

  • Phoneme (41 dim): phoneme probabilities extracted from an acoustic model trained on LibriSpeech [20] using a Time-Depth Separable CNN architecture [10].

Pitch, voicing, and articulatory features encode voice quality and often change across dysfluency events. We hypothesize these may improve detection of blocks or gasps. Phoneme probabilities may make it easier to identify sound repetitions where the same phoneme fires multiple times in a row.
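The 40-dimensional log-mel baseline features can be sketched with a standard triangular mel filterbank. Parameters follow the text (0 to 8000 Hz cut-offs, 25 ms window, 100 frames per second); the 16 kHz sample rate and 10 ms hop are assumptions consistent with those numbers:

```python
import numpy as np

def mel_filterbank(n_mels=40, n_fft=400, sr=16000, fmin=0.0, fmax=8000.0):
    """Triangular mel filters mapping an (n_fft//2 + 1)-bin power spectrum
    to n_mels bands between fmin and fmax."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising slope
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_features(audio, sr=16000, n_fft=400, hop=160, n_mels=40):
    """25 ms window (400 samples at 16 kHz) with a 10 ms hop -> 100 frames/s."""
    fb = mel_filterbank(n_mels, n_fft, sr)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    feats = np.empty((n_frames, n_mels))
    for t in range(n_frames):
        frame = audio[t * hop:t * hop + n_fft] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats[t] = np.log(fb @ power + 1e-10)  # log mel-filterbank energies
    return feats
```

A 3-second clip at these settings yields roughly 300 frames of 40 features each, which is the per-clip input to the temporal models below.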

3.2 Model Architectures

The baseline stutter detection model consists of a single-layer LSTM network; an improved model adds convolutional layers per feature type and learns how the features should be weighted, as shown in Figure 2. We refer to the latter as ConvLSTM. Feature maps from the convolution layer are combined after batch normalization and fed to the LSTM layer. The temporal convolution kernel size was set to 3 frames for one feature type and 5 frames for the remaining features. We use unidirectional recurrent networks where the final state is fed into the per-clip classifier. Both models have two output branches: a fluent/dysfluent prediction and a soft prediction for each of the five event types.
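A PyTorch sketch of this architecture, under stated assumptions: the feature dimensions follow Section 3.1 (mel=40, pitch=3, articulatory=8, phoneme=41), but the assignment of the 3-frame kernel to the phoneme stream, the softmax feature weighting, and the head shapes are guesses, since the paper does not fully specify them:

```python
import torch
import torch.nn as nn

class ConvLSTMStutterNet(nn.Module):
    """Sketch of a ConvLSTM detector: one temporal conv per feature type,
    batch normalization, learned feature weighting, a unidirectional LSTM,
    and two output branches (fluent/dysfluent + five per-event scores)."""

    def __init__(self, feat_dims=(40, 3, 8, 41), hidden=64, n_events=5):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv1d(d, hidden, kernel_size=k, padding=k // 2),
                          nn.BatchNorm1d(hidden), nn.ReLU())
            for d, k in zip(feat_dims, (5, 5, 5, 3)))
        self.feat_weights = nn.Parameter(torch.ones(len(feat_dims)))
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fluency_head = nn.Linear(hidden, 2)       # fluent vs. dysfluent
        self.event_head = nn.Linear(hidden, n_events)  # soft per-event scores

    def forward(self, feats):
        # feats: list of (batch, time, dim) tensors, one per feature type
        w = torch.softmax(self.feat_weights, dim=0)
        mixed = sum(w[i] * conv(f.transpose(1, 2)).transpose(1, 2)
                    for i, (conv, f) in enumerate(zip(self.convs, feats)))
        _, (h, _) = self.lstm(mixed)   # final LSTM state -> per-clip classifier
        h = h[-1]
        return self.fluency_head(h), self.event_head(h)
```

Using the final hidden state rather than per-frame outputs matches the paper's clip-level labeling: each 3-second clip receives one set of predictions regardless of where the event occurs inside it.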

3.3 Loss functions

The baseline model has a single cross-entropy loss term. Our improved models are trained with a multi-task objective where the fluent/dysfluent branch has a weighted cross-entropy term with focal loss [16] and the per-dysfluency branch has a concordance correlation coefficient (CCC) loss using the inter-annotator agreement for each clip.
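The two loss ingredients can be sketched directly: the concordance correlation coefficient (the branch loss would be 1 − CCC against per-clip agreement rates) and the focal-loss modulation factor from [16]. These are the standard formulas, not the authors' code:

```python
import numpy as np

def ccc(pred, target):
    """Concordance correlation coefficient between predicted event scores
    and inter-annotator agreement rates (both in [0, 1])."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mx, my = pred.mean(), target.mean()
    vx, vy = pred.var(), target.var()
    cov = ((pred - mx) * (target - my)).mean()
    # Penalizes both decorrelation and mean/scale mismatch.
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def focal_weight(p_correct, gamma=2.0):
    """Focal-loss modulation [16]: the cross-entropy term is scaled by
    (1 - p)^gamma, down-weighting examples the model already gets right."""
    return (1.0 - p_correct) ** gamma
```

Unlike a hard 0/1 target, the CCC target keeps the annotators' disagreement: a clip two of three annotators marked as a block trains toward 2/3, not 1.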

Models were trained with a mini-batch size of 256, using the Adam optimizer, with an initial learning rate of 0.01. Early stopping was used based on cross-validation error. Networks had 64 neurons in recurrent and embedding layers.

4 Experiments & Analysis

4.1 Model Design

Table 2 compares performance across feature and architecture types. Spectral features with pitch generally perform well, and the improved model achieves its best performance when articulatory signals are added. This improvement matches our intuition that variation in intonation and articulation coincides with dysfluent speech. The phoneme-based models perform worst, despite their ability to extract features one might expect to be useful for sound repetitions. The ConvLSTM and CCC loss moderately improve F1, likely because this loss explicitly encodes annotator uncertainty.

Table 3 shows performance per dysfluency type. Performance is worse for Blocks and Word Repetitions. These dysfluencies tend to last longer and vary more in expression, which may contribute to the lower performance. Interjections and prolongations tend to have less variability and are easier to detect. SEP-28k performance is consistently worse than FluencyBank performance, likely due to the larger variety of individuals and speaking styles.

WA F1 EER
Baseline (LSTM, XEnt)
74.6 74.8 24.7
77.7 75.8 23.8
81.6 81.8 18.0
81.8 80.1 19.0
Improved (ConvLSTM, CCC)
80.8 80.2 17.1
83.0 81.9 16.1
83.4 82.7 16.9
83.6 83.6 16.9
Table 2: Weighted Accuracy (WA), F1 score, and Equal Error Rate (EER) for each model and feature set on FluencyBank (eval).
SEP-28k Bl Pro Snd Wd Int Any
Random 13.7 12.8 9.5 4.3 13.6 46.0
Baseline (STL) 54.9 65.4 57.2 60.7 64.9 61.5
Baseline (MTL) 56.4 65.1 60.5 56.2 69.5 64.5
Improved 55.9 68.5 63.2 60.4 71.3 66.8
FluencyBank Bl Pro Snd Wd Int Any
Random 12.9 10.7 28.2 10.3 31.7 31.7
Baseline (STL) 58.6 63.2 60.8 61.8 57.2 73.2
Baseline (MTL) 54.6 67.6 74.2 55.8 75.0 74.8
Improved 56.8 67.9 74.3 59.3 82.6 80.8
Table 3: F1 score per dysfluency type with a baseline LSTM model (XEnt loss) trained using single- or multi-task learning (STL, MTL) and the Improved ConvLSTM model (CCC loss). Bl=Block, Pro=Prolongation, Snd=Sound Repetition, Wd=Word Repetition, Int=Interjection

4.2 Data Quantity & Type

The central hypothesis of this work was that existing datasets are too small and contain too few participants for training effective dysfluency detection models. This is corroborated by the results in Figure 3, which shows performance on SEP-28k and FluencyBank while training on different subsets. In the best case, there is a 24% relative F1 improvement on FluencyBank when training on all 25k SEP-28k training samples compared to the 3k FluencyBank set. Even using only 5k SEP-28k clips already outperforms training on FluencyBank by a relative 16% F1. This could be because there are a larger number of speakers in the dataset and the data contains more variability in speaking styles. As expected, performance on SEP-28k is worst when training on FluencyBank and increases with larger numbers of training samples.

Figure 3: Test performance when training models only on FluencyBank clips or subsets of clips from SEP-28k.

5 Conclusion

We introduced SEP-28k, which contains over an order of magnitude more annotations than existing public datasets, and added new annotations to FluencyBank. These annotations can be used for many tasks, so we encourage others to explore the data, labels, and splits in ways beyond what is described here. Future work should explore alternative approaches, e.g., using language models, which may improve performance for some dysfluency types that are more difficult to detect. Lastly, while dysfluencies are most common in people who stutter, future work should address how they can be detected in people with other speech disorders, such as dysarthria, where they may be characterized differently.

Acknowledgment: Thanks to Lauren Tooley for countless discussions on the clinical aspects of stuttering.


  • [1] S. Alharbi, M. Hasan, A. J. Simons, S. Brumfitt, and P. Green (2018) A lightly supervised approach to detect stuttering in children’s speech. In Interspeech, Cited by: §1.
  • [2] S. Alharbi, A.J.H. Simons, S. Brumfitt, and P.D. Green (2017) Automatic recognition of children’s read speech for stuttering application. WOCCI. Cited by: §1.
  • [3] S. Bayerl, F. Hönig, J. Reister, and K. Riedhammer (2020) Towards automated assessment of stuttering and stuttering therapy. In International Conference on Text, Speech, and Dialogue, Cited by: §1.
  • [4] R. Brewer, L. Findlater, J. Kaye, W. Lasecki, C. Munteanu, and A. Weber (2018) Accessible voice interfaces. In CSCW, Cited by: §1, §1.
  • [5] L. Clark, B. Cowan, A. Roper, S. Lindsay, and O. Sheers (2020) Speech diversity and speech interfaces: considering an inclusive future through stammering. In Conversational User Interfaces, Cited by: §1, §1.
  • [6] M. Corcoran (2018-10) When alexa can’t understand you. Note: Slate (online) Cited by: §1.
  • [7] A. Czyzewski, A. Kaczmarek, and B. Kostek (2003) Intelligent processing of stuttered speech. Journal of Intelligent Information Systems. Cited by: §1.
  • [8] A. Dash, N. Subramani, T. Manjunath, V. Yaragarala, and S. Tripathi (2018) Speech recognition and correction of a stuttered speech. In ICACCI, Cited by: §1.
  • [9] S. Davis, P. Howell, and J. Bartrip (2009) The UCLASS archive of stuttered speech. J. Speech Lang. Hear. Res. Cited by: §1.
  • [10] A. Hannun, A. Lee, Q. Xu, and R. Collobert (2019) Sequence-to-sequence speech recognition with time-depth separable convolutions. In Interspeech, Cited by: 3rd item.
  • [11] P. Heeman, R. Lunsford, A. McMillin, and J. Yaruss (2016) Using clinician annotations to improve automatic speech recognition of stuttered speech. In Interspeech, Cited by: §1.
  • [12] R. Ingham, A. Cordes, and P. Finn (1993) Time-interval measurement of stuttering: systematic replication of Ingham, Cordes, and Gow (1993). Journal of Speech, Language, and Hearing Research. Cited by: §1.
  • [13] A. K N, K. S, K. D, P. Chanda, and S. Tripathi (2020) Automatic correction of stutter in disfluent speech. CoCoNet. Cited by: §1.
  • [14] S. Kane, A. Guo, and M. Morris (2020) Sense and accessibility: understanding people with physical disabilities’ experiences with sensing systems. In ACM ASSETS, Cited by: §1.
  • [15] T. Kourkounakis, A. Hajavi, and A. Etemad (2020) Detecting multiple speech disfluencies using a deep residual network with bidirectional long short-term memory. In ICASSP, Cited by: §1, §1, §3.
  • [16] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §3.3.
  • [17] P. Mahesha and D.S. Vinod (2016) Gaussian mixture model based classification of stuttering dysfluencies. Journal of Intelligent Systems. Cited by: §1.
  • [18] V. Mitra, S. Booker, E. Marchi, D. Farrar, U. Peitz, B. Cheng, E. Teves, A. Mehta, and D. Naik (2019) Leveraging acoustic cues and paralinguistic embeddings to detect expression from voice. In ICASSP, Cited by: 2nd item.
  • [19] V. Mitra, H. Nam, C. Y. Espy-Wilson, E. Saltzman, and L. Goldstein (2010) Retrieving tract variables from acoustics: a comparison of different machine learning strategies. IEEE Journal of Selected Topics in Signal Processing. Cited by: 2nd item.
  • [20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP, Cited by: 3rd item.
  • [21] N. Ratner and B. MacWhinney (2018) Fluency bank: a new resource for fluency research and practice. Journal of Fluency Disorders. Cited by: §1, §2.2.
  • [22] R. Riad, A. Bachoud-Lévi, F. Rudzicz, and E. Dupoux (2020) Identification of primary and collateral tracks in stuttered speech. In LREC, Cited by: §1, §1, §2.2.
  • [23] G. Riley (2009) SSI-4 stuttering severity instrument fourth edition. Austin, TX: Pro-Ed. Cited by: §1.
  • [24] N. Seneviratne (2017) Noise robust acoustic to articulatory speech inversion. InterSpeech. Cited by: 2nd item.
  • [25] P. Soundararajan (2020-04) Stammering accessibility and testing for voice assistants & devices. Note: Personal Blog (online) Cited by: §1.
  • [26] A. Valente, L. Jesus, A. Hall, and M. Leahy (2015) Event-and interval-based measurement of stuttering: a review. IJLCD. Cited by: §1, §2.3.
  • [27] C. Van Riper and K. S. Harris (1983) The nature of stuttering (2nd ed.). Applied Psycholinguistics. Cited by: §1.
  • [28] K. Wheeler (2020-01) For people who stutter, the convenience of voice assistant technology remains out of reach. Note: USA Today (online) Cited by: §1.