Streaming End-to-end (E2E) models for ASR have achieved low word error rates (WERs) for short to medium length utterances of up to a few minutes long [liu2021exploiting, sainath2021cascadedlm]. However, E2E models have high WERs and suffer from deletion errors on long-form utterances of tens of minutes to hours long [chiu2021rnn, lu2021input, wang2022vadoi]. Such utterances are found in tasks like meetings, lectures, and video captioning.
A common practice for processing long-form utterances is to first segment the audio upstream with a separate voice activity detector (VAD). Whenever the VAD detects a long silence, it splits the audio at that location into two segments [ramirez2007voice, yoshimura2020end], which are then processed independently by the E2E model. At each segment boundary, the beam search finalizes the top hypothesis by discarding all other hypotheses. This introduces more diversity into the beam search by occasionally clearing away stale hypotheses and making room for new ones, which ultimately improves WER by exposing the search to more potentially correct hypotheses. Maintaining beam diversity is particularly important for E2E models, which are typically decoded with small beams.
Despite its crucial role in segmenting audio and regimenting the beam search, very little attention has been paid to improving the end-of-segment prediction task [ali2018innovative, hou2020segment]. Current segmenters suffer from high latency because the VAD, by design, must wait through a long silence before deciding to segment. This delays subsequent functions like rescoring [sainath2019two] or prefetching [chang2020prefetch] that must wait for the hypotheses to be finalized. Improving the latency is important because it can improve user experience by making smart assistants more responsive via faster prefetching, or by helping dictation or captioning apps reduce the amount of “flickering” due to switching between top hypotheses. Current segmenters also suffer from high segmentation error because the VAD bases its decision purely on the audio and not the decoded text [li2021long], which can contain semantic clues as to when to segment. Improving segmentation correctness is important because it can improve WER. As a motivating example, consider two segmentations of the spoken audio below. Note the segment boundary is denoted by “|”.
Audio: “Shaq… dunks—game over!”
S1: shaq dunks | game over
S2: shaq | dunks game over
An ideal segmenter would give S1 because it separates the speech into semantically consistent chunks. However, a VAD segmenter would give S2 because the speaker pauses after saying “Shaq…”. Importantly, S2’s suboptimal segmentation can lead to a word error if “shack” was the top hypothesis during finalization. This is because by the time “dunks” is spoken, there is no more opportunity to revise “shack” to “shaq” due to the finalization. On the other hand, not segmenting at all would lead to bloating the beam with no diversity in the hypotheses, which could also induce word errors [prabhavalkar2021less].
A related problem exists for end-of-query (EOQ) prediction, or endpointing, which historically also used audio-based VAD or EOQ detectors [shannon2017improved]. Recently, WER and latency gains have been achieved by combining endpointing and ASR into a single E2E model that is jointly optimized on both tasks, allowing them to share acoustic and semantic information [maas2018combining, Shuoyiin19, li2020towards, hwang2020end, lu2022endpoint].
Taking inspiration from the E2E endpointing work above, we now introduce E2E Segmenter, an E2E model jointly optimized on both end-of-segment detection and ASR tasks. A central challenge to end-to-end segmenting is that unlike the end-of-query label which indisputably belongs at the end of the transcript, there is no ground truth for where end-of-segment labels ought to be—making supervised training difficult. We address this challenge by proposing a novel end-of-segment annotation scheme based on modeling hesitations and word timings. To avoid degrading wordpiece prediction, we also introduce a new joint layer in the RNN-T architecture that independently predicts the end-of-segment token while leveraging shared acoustic and semantic features. Compared to the VAD baseline, E2E segmenter achieves quality improvements of up to 8.5% WER relative while simultaneously reducing 50th percentile latency by 250 ms on the YouTube captioning task.
The primary job of the segmenter is to send segment boundary signals to the beam search in a streaming fashion. Upon receiving this signal, the beam search finalizes the top hypothesis, clears the beam, resets the encoder state, and passes the top-hypothesis decoder state to the new segment. The decision of when to send the segment boundary signal is conventionally made by an upstream VAD model; but here, the signal is produced by the decoder itself whenever the top hypothesis in the beam search predicts it has reached the end-of-segment with confidence above a threshold. We now discuss how the E2E model is designed to perform this end-of-segment (<eos>) prediction task.
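As a concrete illustration, the segment-boundary handling described above might look like the following minimal sketch. All class and function names here are hypothetical, not the production client's API; the threshold value is the operating point discussed later in Section 4.2.

```python
from dataclasses import dataclass
from typing import List

EOS_NLL_THRESHOLD = 2.0  # finalize when the <eos> negative log-posterior falls below this


@dataclass
class Hypothesis:
    text: str
    decoder_state: tuple  # stand-in for the prediction-network state


finalized: List[str] = []  # transcript pieces emitted so far


class DummyEncoder:
    """Stand-in for the streaming encoder; only its state reset matters here."""
    def __init__(self):
        self.state = "running"

    def reset_state(self):
        self.state = "fresh"


def maybe_finalize_segment(beam, encoder, eos_nll):
    """If the top hypothesis predicts <eos> confidently (low negative
    log-posterior), finalize it, clear the beam, reset the encoder state,
    and carry the top hypothesis' decoder state into the new segment."""
    if eos_nll >= EOS_NLL_THRESHOLD:
        return beam                        # not confident; keep decoding as-is
    top = beam[0]                          # beam is sorted best-first
    finalized.append(top.text)             # finalize only the top hypothesis
    encoder.reset_state()                  # fresh acoustic context for the new segment
    return [Hypothesis(text="", decoder_state=top.decoder_state)]


# Usage: a confident <eos> (nll 0.5 < 2.0) triggers finalization.
beam = [Hypothesis("shaq dunks", ("s1",)), Hypothesis("shack dunks", ("s2",))]
enc = DummyEncoder()
beam = maybe_finalize_segment(beam, enc, eos_nll=0.5)
```

Note that only the top hypothesis' decoder state survives into the new segment; all competing hypotheses are discarded, which is exactly what makes boundary placement matter for WER.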
2.1 End-of-segment joint layer
Figure 1 illustrates our architecture, which is similar to that in [chang2021turn]. The original RNN-T wordpiece joint network is a shallow, single layer of the RNN-T model that fuses both acoustic (from the encoder) and linguistic (from the prediction network) sources of information and emits token posteriors. A natural way of conferring the end-of-segment prediction task to the RNN-T decoder would be to assign the joint layer an additional output logit representing <eos>, as is done for endpointing [Shuoyiin19, li2020towards], but we found in pilot experiments that this interferes with wordpiece decoding and hurts WER. Instead, to decouple wordpiece prediction from end-of-segment prediction, we add a second joint layer—the end-of-segment joint layer—that emits an <eos> posterior, i.e. P(<eos> | x_1, ..., x_i, y_1, ..., y_i), where x_i is the i-th audio frame and y_i is the i-th decoded token in the beam. The end-of-segment joint layer is identical in structure to the wordpiece joint layer, containing all wordpieces as logits. Standard wordpiece training of the RNN-T model with the wordpiece joint layer occurs first; then the end-of-segment joint layer is initialized with the same weights as the wordpiece joint layer and fine-tuned on the training data with <eos> prediction included. During inference, the wordpiece joint layer is used for wordpiece prediction while the end-of-segment joint layer is used for end-of-segment prediction.
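The two-joint-layer arrangement and its initialization can be sketched as follows, assuming for simplicity that each joint is a single linear layer over the concatenated encoder and prediction-network features. The class and variable names are illustrative, not the paper's implementation.

```python
import copy
import random


class JointLayer:
    """A single feedforward joint: concat(encoder, prediction) -> logits."""
    def __init__(self, in_dim, vocab_size, rng):
        self.w = [[rng.gauss(0, 0.1) for _ in range(in_dim)]
                  for _ in range(vocab_size)]

    def __call__(self, enc, pred):
        x = enc + pred  # concatenation of the two feature vectors
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]


rng = random.Random(0)
VOCAB = 6            # toy wordpiece inventory; last index stands in for <eos>
EOS = VOCAB - 1
wp_joint = JointLayer(8, VOCAB, rng)

# Two-stage recipe from the text: train the wordpiece joint first, then
# clone its weights into the end-of-segment joint and fine-tune that copy
# on <eos>-annotated targets, leaving the wordpiece joint untouched.
eos_joint = copy.deepcopy(wp_joint)

enc_feat = [0.3] * 4      # stand-in encoder features
pred_feat = [-0.2] * 4    # stand-in prediction-network features

wp_logits = wp_joint(enc_feat, pred_feat)    # used for wordpiece decoding
eos_logits = eos_joint(enc_feat, pred_feat)  # only its <eos> logit is read off
```

Immediately after cloning, the two joints produce identical logits; fine-tuning then specializes the copy for <eos> without perturbing wordpiece prediction.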
2.2 End-of-segment annotation
While the architecture now allows for emitting the <eos> token, how can we train the model to emit it at the appropriate time? What patterns from the audio or text data can be used as supervision for when an <eos> ought to occur? Human annotation is expensive and inconsistent—it is not even clear in principle where best to insert segment boundaries. Thus, we opt for a heuristic-based, weak supervision approach where <eos> ground truth labels are automatically inserted into the training transcripts based on the rules shown in Table 1.
These heuristics include rules for inserting an <eos> when there is a long silence (1.2 s) or at the end of the utterance. To eliminate common mis-insertions, we also specify two exceptions for patterns where the model might otherwise insert <eos>, but which are in fact places where the speaker is likely not finished with the sentence; refer to Figure 2 for an example. Specifically, these include silences following lengthened words (heyyy) or filler words (um), which signal speaker hesitation. We identify as lengthened words those with a phoneme duration exceeding 5 times the standard deviation, and we use an in-house model to detect filler words. Implementing these heuristics required obtaining silence, word, and phoneme timings by running a forced alignment model on all audio-text pairs in the training set.
| Rule 1 | Long silence between words | Speaker finished | <eos> |
| Rule 2 | Silence following last word | Speaker finished | <eos> |
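A toy version of these annotation heuristics, assuming forced-alignment word timings and precomputed filler/lengthened-word flags are available (all field names are illustrative, not the paper's data format), might look like:

```python
def annotate_eos(words, long_silence=1.2):
    """Insert <eos> after long inter-word silences (Rule 1) and at the
    utterance end (Rule 2), except after hesitations, i.e. filler or
    lengthened words (the two exceptions in the text)."""
    out = []
    for i, w in enumerate(words):
        out.append(w["text"])
        hesitation = w["is_filler"] or w["is_lengthened"]
        last = i == len(words) - 1
        gap = float("inf") if last else words[i + 1]["start"] - w["end"]
        if not hesitation and (last or gap >= long_silence):
            out.append("<eos>")
    return out


words = [
    {"text": "heyyy", "start": 0.0, "end": 0.8, "is_filler": False, "is_lengthened": True},
    {"text": "shaq",  "start": 2.5, "end": 2.9, "is_filler": False, "is_lengthened": False},
    {"text": "dunks", "start": 4.5, "end": 5.0, "is_filler": False, "is_lengthened": False},
]
# Lengthened "heyyy" suppresses <eos> despite its 1.7 s pause; the 1.6 s
# pause after "shaq" and the utterance end both receive <eos>.
print(annotate_eos(words))  # → ['heyyy', 'shaq', '<eos>', 'dunks', '<eos>']
```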
2.3 FastEmit training
Now that the model emits <eos> correctly, we wish to make it emit quickly. After all, one of the advantages of E2E segmenting is that it does not need to wait a fixed silence duration before emitting <eos> like the VAD. Therefore, we train our model with the FastEmit regularization term [yu21fastemit] which encourages each token to be emitted as soon as sufficient context is available. During inference, the FastEmit-trained model can emit <eos> sooner than the silence duration required to insert that token during the ground truth annotation procedure (Table 1, Rule 1).
| Test set | # Utt. | Tot. words | Tot. length | 50th | 75th |
YouTube videos cover many domains (TV shows, sports, conversations, etc.) and are often very long [narayanan2019recognizing], making YouTube captioning an ideal task for our long-form study. Thus, we evaluate on two standard YouTube test sets used in [Soltau2017, chiu2019comparison, chiu2021rnn]: YT_LONG is sampled from YouTube video-on-demand and YT_SHORT is sampled from Google Preferred channels on YouTube. Table 2 shows their length statistics.
The training set, identical to that in [sainath2020streaming], is a sample of Google traffic from multiple domains such as voice search, farfield, telephony, and YouTube, making up about 300M utterances with 400k hours of audio. All utterances are anonymized and hand-transcribed, with the exception of YouTube being semi-supervised [liao2013large]. Note the YouTube utterances used for training are cut into small chunks no more than about 20 seconds long. The data is diversified via multi-style training [kim2017mtr], random down-sampling from 16 to 8 kHz [li2012improving], and SpecAug [Park2019].
Our RNN-T model is similar to the first-pass network of [sainath2021cascadedlm]. The encoder is a streaming 12-layer, 512-dimensional Conformer encoder with causal convolution kernels of size 15 and 8 left-context self-attention heads. The decoder consists of a stateless prediction network [Rami21] with output dimension 640. The joint layers (both wordpiece and end-of-segment) are single layers that take as input the concatenation of encoder and prediction network features. In total, the model has 140M parameters, of which less than 1M are due to the additional end-of-segment joint layer. The model emits 4096 wordpieces, with the blank token factored out via HAT factorization [variani2020hybrid]. Model training minimizes the RNN-T and MWER losses [prabhavalkar2018minimum]. We also add the FastEmit regularization term [yu21fastemit] with a weight of 5e-3. The optimizer is Adam. A transformer learning rate schedule [Vaswani17] with a peak learning rate of 1.8e-3 and 32k warm-up steps is used, along with exponential-moving-average-stabilized gradient updates. All models are implemented in Lingvo [shen2019lingvo] and trained on 64 TPU chips with a global batch size of 4096 for 500k steps.
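The learning rate schedule can be written in terms of the peak rate and warm-up steps given above. The parameterization below is the standard transformer schedule [Vaswani17] re-expressed around its peak, not necessarily the authors' exact implementation:

```python
import math

PEAK_LR = 1.8e-3   # peak learning rate from the text
WARMUP = 32_000    # warm-up steps from the text


def transformer_lr(step, peak=PEAK_LR, warmup=WARMUP):
    """Linear warm-up to `peak` at `warmup` steps, then 1/sqrt(step) decay;
    this is the Vaswani et al. schedule with the model-dimension constant
    folded into the peak rate."""
    step = max(step, 1)
    return peak * min(step / warmup, math.sqrt(warmup / step))
```

For example, the rate reaches 1.8e-3 exactly at step 32k and decays to half that value by step 128k.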
3.3 Beam search
We use a frame-synchronous beam search with a beam size of 8 and a pruning threshold of 5; i.e. partial hypotheses with negative log posterior exceeding that of the top hypothesis by 5 are removed. At each frame, we apply a breadth-first search for possible expansions similar to [tripathi2019monotonic], ignoring any expansion with a negative log posterior of 5 or greater, and limiting the search depth to 10 expansions. The production streaming client we run on has a maximum segment duration of 65 seconds before it forces a finalization.
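The beam and threshold pruning described above can be captured in a minimal sketch; hypothesis representation and names are illustrative:

```python
def prune_beam(hyps, beam_size=8, threshold=5.0):
    """Keep at most `beam_size` hypotheses, then drop any whose negative
    log posterior exceeds the best hypothesis' by more than `threshold`.
    `hyps` is a list of (text, negative_log_posterior) pairs."""
    hyps = sorted(hyps, key=lambda h: h[1])[:beam_size]  # best first
    best = hyps[0][1]
    return [h for h in hyps if h[1] - best <= threshold]


hyps = [("shaq dunks", 2.0), ("shack dunks", 3.5), ("shock dunks", 9.0)]
print(prune_beam(hyps))  # → [('shaq dunks', 2.0), ('shack dunks', 3.5)]
```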
3.4 Voice activity detector
Our pipeline contains a lightweight voice activity detector [zazo2016feature]
upstream of the E2E model that classifies each frame as silence or speech in a streaming fashion. Whenever it detects 0.2 seconds of continuous silence, it sends a segment boundary signal forcing the beam search to reset the encoder state and discard all hypotheses except the top one. VAD-based segment finalization is turned on only in our baselines; it is turned off for all E2E segmenter experiments.
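A frame-level sketch of this VAD-triggered segmentation, assuming a hypothetical 10 ms frame rate (the actual frame rate is not stated here):

```python
FRAME_MS = 10            # assumed frame duration; illustrative
VAD_SILENCE_MS = 200     # 0.2 s of continuous silence triggers a boundary


def vad_boundaries(frame_is_speech):
    """Streaming VAD segmenter sketch: emit a segment boundary signal
    whenever 0.2 s of uninterrupted silence has accumulated. Fires once
    per silence stretch."""
    boundaries, silence_ms = [], 0
    for i, speech in enumerate(frame_is_speech):
        silence_ms = 0 if speech else silence_ms + FRAME_MS
        if silence_ms == VAD_SILENCE_MS:
            boundaries.append(i)
    return boundaries


# 30 speech frames followed by 25 silence frames: the boundary fires at
# frame index 49, i.e. 0.2 s into the silence.
frames = [True] * 30 + [False] * 25
print(vad_boundaries(frames))  # → [49]
```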
In Table 3, we run the ASR pipeline with different segmenters on YT_LONG and YT_SHORT. Other than the segmenter, all aspects of the ASR pipeline are identical. We track the following metrics for each experiment:
WER: Word error rate—measure of overall ASR quality.
EOS50, EOS75: End-of-segment latency in milliseconds, i.e., how long after the end of speech the transcription gets finalized. Since the only segment boundary that can be considered ground truth is the one at the end of the utterance, we measure the time difference from the end of the last word (whose timing is determined by forced alignment) to the last segment boundary, averaged across utterances. We report the 50th and 75th percentile EOS latencies. Anomalous latencies below -0.5 s or exceeding 2 s are excluded from the percentile calculation.
# Segment: Average number of segments for each utterance.
# State: Average number of model states in the beam search for each utterance. This metric, used also in [prabhavalkar2021less], is equivalent to the number of joint network forward passes and is thus a measure of the beam search efficiency.
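The EOS50/EOS75 metric defined above could be computed along these lines; the nearest-rank percentile estimator here is an assumption, as the paper does not specify one:

```python
import math


def eos_latency_percentiles(last_word_end, last_boundary, pcts=(50, 75)):
    """Per-utterance EOS latency: time of the last segment boundary minus
    the end time of the last word (forced-alignment timing), in seconds.
    Latencies below -0.5 s or above 2 s are excluded as anomalies, as in
    the text."""
    lats = sorted(b - w for w, b in zip(last_word_end, last_boundary))
    lats = [l for l in lats if -0.5 <= l <= 2.0]
    # nearest-rank percentile over the surviving latencies
    return [lats[min(len(lats) - 1, math.ceil(p / 100 * len(lats)) - 1)]
            for p in pcts]
```

For example, with last-word end times [10, 20, 30, 40] s and boundary times [10.1, 20.3, 30.2, 45.0] s, the 5 s outlier is excluded and the 50th/75th percentiles are taken over the remaining three latencies.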
4.1 Main results
In Table 3a, we first show that the quality of the segmentation matters by presenting two fixed-interval baseline segmenters, B1 and B2, and comparing them to the VAD segmenter, B3. The VAD segmentation is determined by silences and achieves better WER than the fixed-length segmenters, which do not depend on the input at all.
Next we pick E1, our best E2E segmenter (operating point determined in §4.2), and compare it against B3, the VAD segmenter. E1 outperforms B3 by 6.1% WER relative on YT_LONG and 8.5% on YT_SHORT, highlighting the segmenter’s ability to improve overall quality. E1 also finalizes the segments faster than the VAD by 130/137 ms on YT_LONG, measured at 50th/75th percentiles, and by 250/265 ms on YT_SHORT. These improvements are within perceivable range for user experience.
E1 also achieves slightly better beam search efficiency (a lower number of states), which may be because its hypotheses are more stable, obviating the need for many joint expansions. The number of segments for the VAD and E2E segmenters is similar to that of the two fixed-length segmenters (B1-B2), indicating that the gains come from segmentation correctness rather than from the number of segments.
4.2 <eos> threshold ablation study
Table 3b shows an ablation study on the <eos> threshold. When the <eos> negative log-posterior from the model falls below the <eos> threshold, the segment is finalized. Higher thresholds finalize more aggressively, leading to lower latency and more segments, but at the cost of more segmentation errors, e.g., finalizing in the middle of a sentence. Conversely, lower thresholds may not finalize as often as is needed to maintain beam diversity. The sweet spot for WER occurs—for both test sets—at a threshold of 2.0, and we pick that as our operating point.
4.3 Utterance length dependence
In Figure 3, we evaluate the per-example WER-relative between E1 and B3 as a function of utterance length. We define utterance length here as the number of words in the ground truth transcript rather than the audio duration (though they are correlated) because it corresponds more closely to the beam search lattice length. This allows us to analyze whether our WER gains are limited to long-form utterances. Surprisingly, the WER-relative is largely invariant to utterance length for both test sets, even for utterances with only a few hundred words (a few minutes). This suggests that E2E segmentation can be applied more widely to medium-form utterances as well.
4.4 Results with frame filtering
In Table 4, we evaluate the VAD and E2E segmenters with frame filtering turned on. Like VAD-based finalization, frame filtering starts when the VAD detects 0.2 seconds of silence, jettisoning forthcoming frames until speech is detected again. This is a practical measure to save computation for on-device deployments, because it prevents silence frames from being unnecessarily processed by the expensive E2E model. Segmenting and frame filtering are conventionally tightly coupled; when the VAD decides to segment, it simultaneously kicks off frame filtering, ensuring that the segmentation decision has access to all the audio frames. Replacing the segmenting with an E2E model requires an assessment of how it interacts with the VAD-controlled frame filtering, since segmenting may now happen before or after frame filtering begins.
A first observation is that frame filtering increases absolute WER by around 2% compared to no frame filtering due to the reduced acoustic context (see B4 vs. B3 and E12 vs. E1). However, E2E still prevails over VAD by about 3.1% WER relative (E12 vs. B4) and 120 ms EOS50 latency. It also achieves better beam search efficiency (a lower number of states), which is aligned with frame filtering’s goal of reducing computational load. Compared to no frame filtering (E1), E12’s number of segments decreases from 56.1 to 28.1. This is because the model, though trained with FastEmit, still needs to see some silence in order to confidently predict end-of-segment, and overly aggressive frame filtering prevents that silence from being seen.
The frame filtering can be gradually reduced by increasing its margin, i.e., the additional silence time beyond 0.2 seconds the VAD must detect before initiating frame filtering. As the margin is increased in E13-E20, the WER converges towards its value without frame filtering (17.05%), at the cost of slightly increased computation. The EOS50 latency is also reduced; investigating the cause of this is a point of future work. This table suggests that a trade-off between quality and beam search efficiency must be made in resource-constrained situations with frame filtering.
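The margin mechanism can be sketched as follows, again assuming a hypothetical 10 ms frame rate; function and variable names are illustrative:

```python
FRAME_MS = 10
VAD_SILENCE_MS = 200      # filtering normally starts after 0.2 s of silence


def filter_frames(frame_is_speech, margin_ms=0):
    """Frame-filtering sketch: after 0.2 s of silence plus an extra
    `margin_ms` of slack, drop subsequent silence frames until speech
    resumes. Returns the indices of frames that are kept (i.e., sent to
    the E2E model)."""
    kept, silence_ms = [], 0
    for i, speech in enumerate(frame_is_speech):
        silence_ms = 0 if speech else silence_ms + FRAME_MS
        if speech or silence_ms <= VAD_SILENCE_MS + margin_ms:
            kept.append(i)
    return kept


frames = [True] * 5 + [False] * 100 + [True] * 5   # 1 s silence gap
no_margin = filter_frames(frames)                  # drops most of the silence
with_margin = filter_frames(frames, 1000)          # 1 s margin keeps it all
```

A larger margin lets the E2E segmenter see the silence it needs to emit <eos>, at the cost of processing more frames.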
| Segmenter | WER | EOS50 (ms) | # Seg | # State |
| E12 vs. B4 | -3.1% | -120 | | |
| E13: E2E-1s | 19.55 | 85 | 33.0 | 6834 |
| E14: E2E-2s | 19.21 | 85 | 41.9 | 6776 |
| E15: E2E-4s | 18.71 | 80 | 48.9 | 6743 |
| E17: E2E-16s | 17.52 | 90 | 56.2 | 7050 |
| E18: E2E-32s | 17.18 | 90 | 56.8 | 7023 |
| E19: E2E-64s | 17.12 | 90 | 56.3 | 7228 |
| E20: E2E-128s | 17.12 | 90 | 56.8 | 7058 |
Our work presents a way to improve streaming long-form audio decoding by replacing the VAD-based segmenter with an E2E model. We proposed an E2E architecture that predicts segment boundaries and provided an automatic end-of-segment data annotation strategy required for learning that task in an end-to-end fashion. Our results demonstrate significant WER and end-of-segment latency improvements compared to a VAD baseline on a long-form YouTube captioning task.