E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

by W. Ronny Huang, et al.

Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text with negligible extra computation. In experiments on real-world long-form audio (YouTube) with lengths of up to 30 minutes, we demonstrate 8.5% relative WER improvement and a 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.





1 Introduction

Streaming End-to-end (E2E) models for ASR have achieved low word error rates (WERs) for short to medium length utterances of up to a few minutes long [liu2021exploiting, sainath2021cascadedlm]. However, E2E models have high WERs and suffer from deletion error on long-form utterances of tens of minutes to hours long [chiu2021rnn, lu2021input, wang2022vadoi]. Such utterances are found in tasks like meetings, lectures, and video captions.

A common practice for processing long-form utterances is to first segment the audio upstream with a separate voice activity detector (VAD). Whenever the VAD detects a long silence, it splits the audio at that location into two segments [ramirez2007voice, yoshimura2020end], which are then processed independently by the E2E model. At each segment boundary, the beam search finalizes the top hypothesis by discarding all other hypotheses. This introduces more diversity into the beam search by occasionally clearing away stale hypotheses and making room for new ones, ultimately improving WER by seeing more potentially correct hypotheses. Maintaining beam diversity is particularly important for E2E models which are typically decoded with small beams.

Despite its crucial role in segmenting audio and regimenting the beam search, very little attention has been paid to improving the end-of-segment prediction task [ali2018innovative, hou2020segment]. Current segmenters suffer from high latency because the VAD, by design, must wait through a long silence before deciding to segment. This delays subsequent functions like rescoring [sainath2019two] or prefetching [chang2020prefetch] that must wait for the hypotheses to be finalized. Improving the latency is important because it can improve user experience by making smart assistants more responsive via faster prefetching, or by helping dictation or captioning apps reduce the amount of “flickering” due to switching between top hypotheses. Current segmenters also suffer from high segmentation error because the VAD bases its decision purely on the audio and not the decoded text [li2021long], which can contain semantic clues as to when to segment. Improving segmentation correctness is important because it can improve WER. As a motivating example, consider two segmentations of the spoken audio below, where the segment boundary is denoted by “|”.

Audio: “Shaq… dunks—game over!”

S1:  shaq dunks | game over

S2:  shaq | dunks game over

An ideal segmenter would give S1 because it separates the speech into semantically consistent chunks. However, a VAD segmenter would give S2 because the speaker pauses after saying “Shaq…”. Importantly, S2’s suboptimal segmentation can lead to a word error if “shack” was the top hypothesis during finalization. This is because by the time “dunks” is spoken, there is no more opportunity to revise “shack” to “shaq” due to the finalization. On the other hand, not segmenting at all would lead to bloating the beam with no diversity in the hypotheses, which could also induce word errors [prabhavalkar2021less].

A related problem exists for end-of-query (EOQ) prediction, or endpointing, which historically also used audio-based VAD or EOQ detectors [shannon2017improved]. Recently, WER and latency gains have been achieved by combining endpointing and ASR into a single E2E model that is jointly optimized on both tasks, allowing them to share acoustic and semantic information [maas2018combining, Shuoyiin19, li2020towards, hwang2020end, lu2022endpoint].

Taking inspiration from the E2E endpointing work above, we now introduce E2E Segmenter, an E2E model jointly optimized on both end-of-segment detection and ASR tasks. A central challenge to end-to-end segmenting is that unlike the end-of-query label which indisputably belongs at the end of the transcript, there is no ground truth for where end-of-segment labels ought to be—making supervised training difficult. We address this challenge by proposing a novel end-of-segment annotation scheme based on modeling hesitations and word timings. To avoid degrading wordpiece prediction, we also introduce a new joint layer in the RNN-T architecture that independently predicts the end-of-segment token while leveraging shared acoustic and semantic features. Compared to the VAD baseline, E2E segmenter achieves quality improvements of up to 8.5% WER relative while simultaneously reducing 50th percentile latency by 250 ms on the YouTube captioning task.

2 Method

The primary job of the segmenter is to send segment boundary signals to the beam search in a streaming fashion. Upon receiving this signal, the beam search finalizes the top hypothesis, clears the beam, resets the encoder state, and passes the top-hypothesis decoder state to the new segment. The decision of when to send the segment boundary signal is conventionally made by an upstream VAD model; but here, the signal is produced by the decoder itself whenever the top hypothesis in the beam search predicts it has reached the end-of-segment with confidence above a threshold. We now discuss how the E2E model is designed to perform this end-of-segment (<eos>) prediction task.
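As an illustrative sketch, the boundary-signal handling described above might look like the following (class and field names are invented for illustration; the production beam search also resets the encoder state, which is not modeled here):

```python
# Hypothetical sketch of segment-boundary handling in a streaming beam search.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    tokens: list = field(default_factory=list)
    score: float = 0.0            # log posterior of this hypothesis
    decoder_state: object = None  # prediction-network state

class StreamingBeam:
    def __init__(self):
        self.beam = [Hypothesis()]
        self.finalized = []       # transcript finalized so far

    def on_segment_boundary(self):
        """Finalize the top hypothesis, clear the beam, and carry the
        top hypothesis' decoder state into the new segment."""
        top = max(self.beam, key=lambda h: h.score)
        self.finalized.extend(top.tokens)
        # New segment: all other hypotheses are discarded; only the
        # winner's decoder state is passed along.
        self.beam = [Hypothesis(decoder_state=top.decoder_state)]
```

Whether the boundary signal comes from an upstream VAD or from the decoder's own <eos> prediction, the beam-search side of the contract is the same.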

2.1 End-of-segment joint layer

Figure 1 illustrates our architecture, which is similar to that in [chang2021turn]. The original RNN-T wordpiece joint network is a shallow, single layer of the RNN-T model that fuses both acoustic (from the encoder) and linguistic (from the prediction network) sources of information and emits token posteriors. A natural way of conferring the end-of-segment prediction task to the RNN-T decoder would be to assign the joint layer an additional output logit representing <eos>, as is done for endpointing [Shuoyiin19, li2020towards], but we found in pilot experiments that this interferes with wordpiece decoding and hurts WER. Instead, to decouple wordpiece prediction from end-of-segment prediction, we add a second joint layer—the end-of-segment joint layer—that emits an <eos> posterior, i.e.


$P(\texttt{<eos>} \mid \mathbf{x}_{1:t}, y_{1:u})$,

where $\mathbf{x}_t$ is the $t$-th audio frame and $y_u$ is the $u$-th decoded token in the beam. The end-of-segment joint layer is identical in structure to the wordpiece joint layer, containing all wordpieces as logits. Standard wordpiece training of the RNN-T model with the wordpiece joint layer first occurs; then the end-of-segment joint layer is initialized with the same weights as the wordpiece joint layer and fine-tuned on the training data with <eos> prediction included. During inference, the wordpiece joint layer is used for wordpiece prediction while the end-of-segment joint layer is used for end-of-segment prediction.
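A minimal numpy sketch of the two joint layers follows. Dimensions match those reported in §3.2, but everything else (names, the lack of a nonlinearity, the absence of HAT factorization) is an illustrative simplification, not the paper's implementation:

```python
# Sketch: wordpiece joint layer + end-of-segment joint layer sharing
# the same encoder/prediction-network features.
import numpy as np

rng = np.random.default_rng(0)
D_ENC, D_PRED = 512, 640          # encoder / prediction-network dims (Sec. 3.2)
V = 4096 + 2                      # wordpieces + blank + <eos> (illustrative)
EOS_ID = V - 1

W_wp = rng.standard_normal((D_ENC + D_PRED, V)) * 0.01  # wordpiece joint
W_eos = W_wp.copy()  # <eos> joint is initialized from the wordpiece joint,
                     # then fine-tuned separately (frozen here for the sketch)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def joint(enc_t, pred_u):
    """Return (wordpiece posterior, <eos> posterior) for one (t, u) pair."""
    feat = np.concatenate([enc_t, pred_u])
    wp_post = softmax(feat @ W_wp)              # used for wordpiece decoding
    eos_post = softmax(feat @ W_eos)[EOS_ID]    # P(<eos> | x_{1:t}, y_{1:u})
    return wp_post, eos_post
```

Because the second joint layer has under 1M parameters and reuses the existing features, the extra inference cost is negligible, as noted in the abstract.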

Figure 1: RNN-T with additional joint layer for emitting the end-of-segment posterior.

2.2 End-of-segment annotation

While the architecture now allows for emitting the <eos> token, how can we train the model to emit it at the appropriate time? What patterns from the audio or text data can be used as supervision for when an <eos> ought to occur? Human annotation is expensive and inconsistent—it is not even clear in principle where best to insert segment boundaries. Thus, we opt for a heuristic-based, weak supervision approach where <eos> ground truth labels are automatically inserted into the training transcripts based on the rules shown in Table 1.

These heuristics include rules for inserting an <eos> when there is a long silence (1.2 s) or at the end of the utterance. To eliminate common mis-insertions, we also specify two exceptions for patterns where the model might otherwise insert <eos>, but which are in fact places where the speaker is likely not finished with the sentence; refer to Figure 2 for an example. Specifically, these are silences following lengthened words (heyyy) or filler words (um), which signal speaker hesitation. We identify as lengthened words those with a phoneme duration exceeding 5 times the standard deviation, and we use an in-house model to detect filler words. Implementing these heuristics required obtaining silence, word, and phoneme timings by running a forced alignment model on all audio-text pairs in the training set.

              When this happens…                               It’s likely because…
Rule 1        Long silence between words                       Speaker finished → <eos>
Rule 2        Silence following last word                      Speaker finished → <eos>
Exception 1   Silence following lengthened word (e.g. heyyy)   Speaker not finished
Exception 2   Silence following filler word (e.g. um)          Speaker not finished

Table 1: Rules and exceptions for inserting <eos> annotations.

Figure 2: Example of <eos> annotation. “sil” = silence.
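The annotation procedure in Table 1 can be sketched as below, assuming forced-alignment word timings are available. The predicates `is_lengthened` and `is_filler` stand in for the paper's phoneme-duration test and in-house filler-word detector, and inserting <eos> unconditionally after the last word is a simplification:

```python
# Sketch of the <eos> weak-supervision annotation heuristics (Table 1).
LONG_SILENCE = 1.2  # seconds of silence between words (Rule 1)

def annotate_eos(words, is_lengthened, is_filler):
    """words: list of (word, start_sec, end_sec) from forced alignment.
    Returns transcript tokens with <eos> inserted per the rules above."""
    out = []
    for i, (w, start, end) in enumerate(words):
        out.append(w)
        if i == len(words) - 1:
            out.append("<eos>")                   # Rule 2: end of utterance
        else:
            gap = words[i + 1][1] - end           # silence before next word
            hesitating = is_lengthened(w) or is_filler(w)  # Exceptions 1-2
            if gap >= LONG_SILENCE and not hesitating:     # Rule 1
                out.append("<eos>")
    return out
```

On the abstract's example, a long pause after the filler "um" correctly receives no <eos>, so "set an alarm for... 5 o'clock" stays one segment.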

2.3 FastEmit training

Now that the model emits <eos> correctly, we wish to make it emit quickly. After all, one of the advantages of E2E segmenting is that it does not need to wait a fixed silence duration before emitting <eos> like the VAD. Therefore, we train our model with the FastEmit regularization term [yu21fastemit] which encourages each token to be emitted as soon as sufficient context is available. During inference, the FastEmit-trained model can emit <eos> sooner than the silence duration required to insert that token during the ground truth annotation procedure (Table 1, Rule 1).

3 Setup

Test set    # Utt.   Tot. words   Tot. length   50th     75th
YT_LONG       77      207,191       22.2 h      14.8 m   30 m
YT_SHORT     105       84,862        9.0 h       6.3 m   7.4 m

Table 2: Length statistics of YouTube test sets (50th/75th percentile utterance lengths).

                           YT_LONG                                     YT_SHORT
Segmenter         WER    EOS50  EOS75  # Seg.  # State     WER    EOS50  EOS75  # Seg.  # State
(a)
B1: Fixed-10s    20.05     -      -     99.8    5496      13.63     -      -     33.9    3965
B2: Fixed-20s    18.22     -      -     51.3    5753      11.57     -      -     17.5    4048
B3: VAD          18.16    260    490    95.2    8396      11.46    460    660    28.3    5689
E1: E2E (best)   17.05    130    353    56.2    8098      10.49    210    365    18.2    5672
   E1 vs. B3     -6.1%   -130   -137                      -8.5%   -250   -265
(b)
E2: E2E-0.0      17.60    180    550    16.4    7244      11.10    510    855     5.8    5416
E3: E2E-0.5      17.38    180    550    18.7    7958      10.91    460    800     6.8    5705
E4: E2E-1.0      17.19    220    535    26.3    8513      10.68    390    745     9.6    5771
E5: E2E-1.5      17.09    195    445    38.6    8366      10.63    280    395    13.3    5700
E6: E2E-2.0      17.05    130    353    56.2    8098      10.49    210    365    18.2    5672
E7: E2E-2.5      17.08    100    355    80.7    7917      10.49    200    345    25.7    5371
E8: E2E-3.0      17.06    100    415   116.3    7362      10.58    180    230    35.6    5039
E9: E2E-3.5      17.22     90    280   178.5    6926      10.56    180    450    49.8    4790
E10: E2E-4.0     17.48     75    210   300.0    6237      10.58    180    380    74.7    4569
E11: E2E-4.5     17.76     90    255   477.6    5410      10.78    140    280   119.1    3975

Table 3: (a) Main results. (b) End-of-segment threshold ablation study. Naming convention is E2E-{eos_threshold_value}.

3.1 Dataset

YouTube videos cover many domains (TV shows, sports, conversations, etc.) and are often very long [narayanan2019recognizing], making YouTube captioning an ideal task for our long-form study. Thus, we evaluate on two standard YouTube testsets used in [Soltau2017, chiu2019comparison, chiu2021rnn]: YT_LONG is sampled from YouTube video-on-demand and YT_SHORT is sampled from Google Preferred channels on YouTube. Table 2 shows their length statistics.

The training set, identical to that in [sainath2020streaming], is a sample of Google traffic from multiple domains such as voice search, farfield, telephony, and YouTube, making up about 300M utterances with 400k hours of audio. All utterances are anonymized and hand-transcribed, with the exception of YouTube being semi-supervised [liao2013large]. Note the YouTube utterances used for training are cut into small chunks no more than about 20 seconds long. The data is diversified via multi-style training [kim2017mtr], random down-sampling from 16 to 8 kHz [li2012improving], and SpecAug [Park2019].

3.2 Model

Our RNN-T model is similar to the first-pass network of [sainath2021cascadedlm]. The encoder is a streaming 12-layer, 512 dimensional Conformer encoder with causal convolution kernels of size 15 and 8 left-context self-attention heads. The decoder consists of a stateless prediction network [Rami21] with output dimension 640. The joint layers (both wordpiece and end-of-segment) are single layers which take as input the concatenation of encoder and prediction network features. In total, the model has 140M parameters, of which less than 1M are due to the additional end-of-segment joint layer. The model emits 4096 wordpieces, with the blank token factored out with HAT factorization [variani2020hybrid]. Model training minimizes the RNN-T and MWER losses [prabhavalkar2018minimum]. We also add the FastEmit regularization term [yu21fastemit] with a weight of 5e-3. The optimizer was Adam. A transformer learning rate schedule [Vaswani17] with peak learning rate of 1.8e-3 and 32k steps of warm-up is used, along with exponential-moving-average-stabilized gradient updates. All models are implemented in Lingvo [shen2019lingvo] and trained on 64 TPU chips with a global batch size of 4096 for 500k steps.

3.3 Beam search

We use a frame-synchronous beam search with a beam size of 8 and a pruning threshold of 5; i.e. partial hypotheses with negative log posterior exceeding that of the top hypothesis by 5 are removed. At each frame, we apply a breadth-first search for possible expansions similar to [tripathi2019monotonic], ignoring any expansion with a negative log posterior of 5 or greater, and limiting the search depth to 10 expansions. The production streaming client we run on has a maximum segment duration of 65 seconds before it forces a finalization.
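The pruning rule described above can be sketched as follows (a hypothetical illustration working in negative log posterior space; the real search also handles the breadth-first expansion and depth limit, omitted here):

```python
# Sketch of beam pruning: drop partial hypotheses whose negative log
# posterior exceeds that of the best hypothesis by more than `threshold`.
def prune(hyps, threshold=5.0):
    """hyps: list of (tokens, neg_log_posterior). Returns survivors."""
    best = min(nlp for _, nlp in hyps)
    return [(t, nlp) for t, nlp in hyps if nlp - best <= threshold]
```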

3.4 Voice activity detector

Our pipeline contains a lightweight voice activity detector [zazo2016feature]

upstream of the E2E model that classifies each frame as silence or speech in a streaming fashion. Whenever it detects 0.2 seconds of continued silence, it sends a segment boundary signal forcing the beam search to reset encoder state and discard all except the top hypothesis. VAD-based segment finalization is turned on only in our baselines; it is turned off for all E2E segmenter experiments.
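The VAD's trigger logic amounts to a silence-run counter over the frame-wise speech/silence stream. A sketch, assuming 10 ms frames (the frame rate is an assumption, not stated in the paper):

```python
# Sketch of the VAD boundary trigger: fire after 0.2 s of continued silence.
FRAME_SEC = 0.01    # assumed 10 ms frames
TRIGGER_SEC = 0.2   # silence required before signaling a boundary

def vad_boundaries(is_speech_frames):
    """Yield frame indices at which the VAD signals a segment boundary."""
    needed = int(TRIGGER_SEC / FRAME_SEC)
    run = 0
    for i, speech in enumerate(is_speech_frames):
        run = 0 if speech else run + 1
        if run == needed:       # fires exactly once per silence stretch
            yield i
```

This is exactly the waiting-through-silence behavior that gives the VAD segmenter its latency floor, which the E2E segmenter avoids.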

4 Results

In Table 3, we run the ASR pipeline with different segmenters on YT_LONG and YT_SHORT. Other than the segmenter, all aspects of the ASR pipeline are identical. We track the following metrics for each experiment:


  • WER: Word error rate—measure of overall ASR quality.

  • EOS50, EOS75: End-of-segment latency in milliseconds, i.e., how long after the end of speech the transcript is finalized. Since the only segment boundary that can be considered ground truth is the one at the end of the utterance, we measure the time difference from the end of the last word (whose timing is determined by forced alignment) to the last segment boundary, averaged across utterances. We report the 50th and 75th percentile EOS latencies. Anomalous latencies below -0.5 s or exceeding 2 s are left out of the percentile calculation.

  • # Segment: Average number of segments for each utterance.

  • # State: Average number of model states in the beam search for each utterance. This metric, used also in [prabhavalkar2021less], is equivalent to the number of joint network forward passes and is thus a measure of the beam search efficiency.
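The EOS latency metric above can be sketched as follows (function names and the nearest-rank percentile convention are illustrative assumptions):

```python
# Sketch of the EOS latency metric: time from end of last word to last
# segment boundary, with anomalies outside (-0.5 s, 2 s) excluded before
# computing percentiles.
def eos_latency_ms(last_word_end_sec, last_boundary_sec):
    return (last_boundary_sec - last_word_end_sec) * 1000.0

def eos_percentiles(latencies_ms, pcts=(50, 75)):
    """Nearest-rank percentiles over the filtered latency list."""
    kept = sorted(l for l in latencies_ms if -500 <= l <= 2000)
    out = []
    for p in pcts:
        k = max(0, min(len(kept) - 1, round(p / 100 * (len(kept) - 1))))
        out.append(kept[k])
    return out
```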

4.1 Main results

In Table 3a, we first show that the quality of the segmentation matters by presenting two baseline fixed-interval segmenters, B1 and B2, and comparing them to the VAD segmenter, B3. The VAD segmentation is determined by silences and achieves better WER than the fixed-length segmenters, which do not depend on any features.

Next we pick E1, our best E2E segmenter (operating point determined in §4.2), and compare it against B3, the VAD segmenter. E1 outperforms B3 by 6.1% WER relative on YT_LONG and 8.5% on YT_SHORT, highlighting the segmenter’s ability to improve overall quality. E1 also finalizes the segments faster than the VAD by 130/137 ms on YT_LONG, measured at 50th/75th percentiles, and by 250/265 ms on YT_SHORT. These improvements are within perceivable range for user experience.

E1 also achieves slightly better beam search efficiency (a lower number of states), which may be because its hypotheses are more stable, obviating the need for many joint expansions. The number of segments in the VAD and E2E runs is similar to those in the two fixed-length segmenters (B1-B2), indicating that it is segmentation correctness, rather than the number of segments, that drives the improvement.

4.2 <eos> threshold ablation study

Table 3b shows an ablation study on the <eos> threshold. When the <eos> negative log-posterior from the model falls below the <eos> threshold, the segment is finalized. Higher thresholds finalize more aggressively, leading to lower latency and more segments, but at the cost of more segmentation errors, e.g., finalizing in the middle of a sentence. Conversely, lower thresholds may not finalize as often as is needed to maintain beam diversity. The sweet spot for WER occurs—for both test sets—at a threshold of 2.0, and we pick that as our operating point.
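The finalization rule described above is a one-line decision; a sketch, with the operating point chosen in this section (E6):

```python
# Sketch of the <eos> finalization rule: finalize when the <eos>
# negative log-posterior falls below the threshold.
import math

def should_finalize(eos_posterior, threshold=2.0):
    """threshold=2.0 is the operating point chosen from the ablation."""
    return -math.log(eos_posterior) < threshold
```

A higher threshold admits less-confident <eos> predictions, reproducing the more-aggressive-finalization behavior of E7-E11.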

4.3 Utterance length dependence

In Figure 3, we evaluate the per-example WER-relative between E1 and B3 as a function of utterance length. We define utterance length here as the number of words in the ground truth transcript rather than the audio duration (though they are correlated) because it corresponds more closely to the beam search lattice length. This allows us to analyze whether our WER gains are limited to long-form utterances. Surprisingly, WER-relative is rather invariant to utterance length for both test sets, even for utterances with a few hundred words (a few minutes). This suggests that E2E segmentation can be applied more widely to medium-form utterances as well.

Figure 3: Per-example WER-relative of E2E (E1) to VAD (B3) segmenters versus utterance length. Lower is better.

4.4 Results with frame filtering

In Table 4, we evaluate the VAD and E2E segmenters with frame filtering turned on. Like VAD-based finalization, frame filtering starts when the VAD detects 0.2 seconds of silence, jettisoning forthcoming frames until speech is detected again. This is a practical measure to save computation for on-device deployments, because it prevents silence frames from being unnecessarily processed by the expensive E2E model. Segmenting and frame filtering are conventionally tightly coupled; when the VAD decides to segment, it simultaneously kicks off frame filtering, ensuring that the segmentation decision has access to all the audio frames. Replacing the segmenting with an E2E model requires an assessment of how it interacts with the VAD-controlled frame filtering, since segmenting may now happen before or after frame filtering begins.

A first observation is that frame filtering increases absolute WER by around 2% compared to no frame filtering due to the reduced acoustic context (see B4 vs. B3 and E12 vs. E1). However, E2E still prevails over VAD by about 3.1% WER relative (E12 vs. B4) and 120 ms EOS50 latency. It also achieves better beam search efficiency (a lower number of states), which is aligned with frame filtering’s goal of reducing computational load. Compared to no frame filtering (E1), E12’s number of segments decreases from 56.2 to 28.1. This is because the model, though trained with FastEmit, still needs to see some silence in order to confidently predict end-of-segment, and overly aggressive frame filtering prevents that silence from being seen.

The frame filtering can be gradually reduced by increasing its margin, or the additional silence time beyond 0.2 seconds the VAD must detect before initializing frame filtering. As the margin is increased in E13-E20, the WER converges towards its value without frame filtering (17.05%), at the cost of slightly increasing computation. The EOS50 latency is also reduced; investigating the cause of this is a point of future work. This table suggests that a trade-off between quality and beam search efficiency must be made in resource-constrained situations with frame filtering.

Segmenter        WER     EOS50   # Seg   # State
B4: VAD-0s      20.59     260     94.4    7108
E12: E2E-0s     19.94     140     28.1    6799
   E12 vs. B4   -3.1%    -120
E13: E2E-1s     19.55      85     33.0    6834
E14: E2E-2s     19.21      85     41.9    6776
E15: E2E-4s     18.71      80     48.9    6743
E16: E2E-8s     18.05     105     53.1    6981
E17: E2E-16s    17.52      90     56.2    7050
E18: E2E-32s    17.18      90     56.8    7023
E19: E2E-64s    17.12      90     56.3    7228
E20: E2E-128s   17.12      90     56.8    7058

Table 4: Segmenting with frame filtering for YT_LONG. YT_SHORT results are similar and not displayed for brevity. Naming convention is {segmenter}-{margin_length}.
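The margin mechanism studied above can be sketched as follows (a hypothetical illustration; the 10 ms frame rate and function names are assumptions, not from the paper):

```python
# Sketch of VAD frame filtering with a margin: frames are dropped only
# after 0.2 s + margin of continued silence.
FRAME_SEC = 0.01   # assumed 10 ms frames
BASE_SEC = 0.2     # base silence window before filtering starts

def filter_frames(frames, is_speech, margin_sec=0.0):
    """Return the frames that survive filtering."""
    needed = int((BASE_SEC + margin_sec) / FRAME_SEC)
    kept, run = [], 0
    for f, speech in zip(frames, is_speech):
        run = 0 if speech else run + 1
        if run <= needed:      # drop frames once silence exceeds the window
            kept.append(f)
    return kept
```

Increasing `margin_sec` lets more silence frames through, which is why the E13-E20 configurations recover WER at the cost of extra computation.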

5 Conclusion

Our work presents a way to improve streaming long-form audio decoding by replacing the VAD-based segmenter with an E2E model. We proposed an E2E architecture that predicts segment boundaries and provided an automatic end-of-segment data annotation strategy required for learning that task in an end-to-end fashion. Our results demonstrate significant WER and end-of-segment latency improvements compared to a VAD baseline on a long-form YouTube captioning task.