Speaker-attributed automatic speech recognition (ASR) of natural meetings has been a highly challenging task since the early 2000s, when the NIST Rich Transcription Evaluation series [FiscusEtAl:rt07] started. Systems developed in the early days yielded high error rates, especially when distant microphones were used as input. However, with the rapid progress in conversational speech transcription [Xiong16, Saon17], far-field speech recognition [Yoshioka15b, Du16, Li17, Li18], and speaker identification and diarization [Zhang18, Sell18], accurate meeting transcription from a distance seems to be within reach, especially with microphone arrays. In addition to microphone array setups, single-microphone systems have also been evaluated.
Using multiple asynchronous audio capturing devices, such as mobile phones and laptops, adds another dimension to the task. On the one hand, the use of spatially distributed microphones allows us to measure acoustic events at different points in a room. Therefore, meeting attendees may want to lend their own recording devices to the transcription system, i.e., they can put their devices on the table and connect them to a server to improve transcription quality. On the other hand, while there are several pioneering studies [Araki18], it is unclear what the best strategies are for consolidating multiple asynchronous audio streams and to what extent they work for natural meetings in online and offline setups.
In this paper, we investigate a meeting transcription architecture based on asynchronous distant microphones by combining both front-end and back-end techniques. The resulting system is analyzed through experiments on real-world meeting recordings. Our proposed system is designed to generate word recognition results in real time and then provide improved speaker-attributed transcriptions with limited latency.
In addition to the end-to-end system analysis, we make the following specific contributions: we examine the idea of “leave-one-out beamforming” in the asynchronous multi-microphone setup. This method was proposed to benefit from both beamforming and system combination approaches but tested only with synchronized signals [Stolcke11]. The computational cost required for calculating multiple beamformers can be reduced by taking advantage of the properties of spatial covariance matrices. We investigate a similar diversity-preserving strategy for acoustic model fusion. Further, we describe system combination schemes that take account of both word recognition and speaker attribution. Finally, we show results based on incremental ROVER that processes the ASR and diarization outputs with low latency.
2 Task and System Overview
We record a meeting with audio capturing devices, such as cell phones, tablets, and laptops. The devices can be randomly placed at any locations in a conference room. The acoustic signal picked up by each device is transmitted to a common server. The server then generates a speaker-attributed transcription of the meeting conversation in real time as it receives the signals from the devices. In this paper, we assume that all meeting attendees have enrolled in the system and have provided their voiceprints for speaker identification.
Figure 1 shows the processing flow of the proposed system. The input signals received by the server are misaligned for various reasons such as clock drift on each recording device, differences in on-device signal processing, packet generation, and signal transmission channels. As shown in Fig. 1, the audio stream alignment module constantly corrects the inter-channel signal misalignments. This is followed by a beamforming module, which receives the time-aligned audio signals and yields enhanced signals. In this paper, we deal with the case where the number of enhanced signals equals the number of input signals, while this is not a requirement. Each enhanced signal is fed to a speech recognition module to produce a real-time transcription as well as n-best recognition hypotheses with word-level time marks. The diarization module then generates a speaker label sequence for each of the segments detected by the ASR decoder by utilizing both word-level time marks and speaker embeddings extracted from the enhanced audio. Eventually, the speaker labels and word hypotheses are gathered by the system combination module to yield a final speaker-attributed word transcription.
We have settled on this architecture based on several considerations. First, it supports both beamforming and later-stage system combination approaches, which are found to be beneficial together [Stolcke11]. Second, we perform diarization after speech recognition, unlike in many previous systems. Since diarization typically has a longer algorithmic delay, this ordering allows preliminary recognition results to be displayed in real time. Finally, system combination, coming last, is designed to merge and benefit both word recognition and speaker attribution.
3 System Components
3.1 Audio Stream Alignment
The audio stream alignment module picks one of the input streams as a reference and aligns each of the other signals to the reference signal. To align a signal to the reference one, we first detect the time lag between the two signals and then adjust the non-reference signal. For this purpose, we have two variable-length first-in-first-out buffers for each stream: one for time lag detection, one for output generation. After a few seconds (2 s in our experiments), we extract as many samples from the output buffer as those pushed to the buffer. These samples are given to the downstream modules.
At $T$-second intervals, we calculate the cross-correlation coefficients between the two signals stored in the time-lag detection buffer and pick the sample lag, $\tau$, that maximizes the cross-correlation value. We decimate the samples in the non-reference stream’s output buffer by $\tau$ if $\tau > 0$. Otherwise, we increase the number of samples by $|\tau|$. This can be done with resampling. The time-lag detection buffer can then be refreshed.
At the beginning of the alignment processing, we may calculate the cross-correlation more frequently (e.g., every 1 s) until we find a significant peak in the cross-correlation sequence. This ‘global’ time lag can also be used to adjust the output wait time. In an online client-server setting, the global time lag is small. When we apply the system to offline independent recordings, the global time lag can be on the order of minutes. In this case, we may use a sliding window to first obtain an approximate estimate of the global time lag and then fine-tune the estimate by using the sample-level cross-correlation as described above.
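The lag detection and compensation steps can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the `max_lag` search bound, the sign convention (a positive lag means the non-reference stream is delayed), and the use of linear interpolation as the resampler are all assumptions made for the example.

```python
import numpy as np


def estimate_lag(reference, other, max_lag):
    """Return the lag (in samples) of `other` relative to `reference`.

    Assumed convention: a positive value means `other` is delayed.
    """
    ref = reference - np.mean(reference)
    oth = other - np.mean(other)
    # Full cross-correlation; entry k corresponds to lag k - (len(ref) - 1).
    xcorr = np.correlate(oth, ref, mode="full")
    lags = np.arange(-(len(ref) - 1), len(oth))
    keep = np.abs(lags) <= max_lag  # restrict the search to plausible lags
    return int(lags[keep][np.argmax(xcorr[keep])])


def compensate(block, lag):
    """Resample `block` so that it loses `lag` samples (lag > 0) or gains |lag| samples.

    Linear interpolation stands in for a proper resampler here.
    """
    target_len = len(block) - lag
    old_t = np.linspace(0.0, 1.0, num=len(block))
    new_t = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_t, old_t, block)
```

`np.correlate` has quadratic cost in the buffer length; for buffers of several seconds an FFT-based cross-correlation would likely be preferable.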
3.2 Blind beamforming
For beamforming, we adopt a mask-based blind processing approach [Heymann16, Higuchi16]. This approach was shown to perform as well as carefully designed beamformers that utilize array geometry information [Boeddeker18].
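Mask-based blind beamforming. Assuming $M$ microphones to be available, an enhanced short time Fourier transform (STFT) coefficient, $y_{t,f}$, can be computed as an inner product of the $M$-dimensional beamformer coefficient vector, $\mathbf{w}_f$, and an input multi-channel STFT coefficient vector, $\mathbf{x}_{t,f}$, i.e., $y_{t,f} = \mathbf{w}_f^{\mathsf{H}} \mathbf{x}_{t,f}$, where subscripts $t$ and $f$ denote a time frame index and a frequency bin index, respectively. In one formulation, the beamformer coefficient vector is estimated with a minimum variance distortionless response (MVDR) principle as
$$\mathbf{w}_f = \frac{\boldsymbol{\Phi}_{N,f}^{-1} \boldsymbol{\Phi}_{S,f}}{\mathrm{tr}\left(\boldsymbol{\Phi}_{N,f}^{-1} \boldsymbol{\Phi}_{S,f}\right)} \mathbf{e}, \tag{1}$$
where $\boldsymbol{\Phi}_{X,f}$ denotes a spatial covariance matrix ($X = S$ for speech; $X = N$ for noise) while $\mathbf{e}$ is a one-hot unit vector which has 1 at a reference microphone position, which may be chosen based on a maximum signal-to-noise ratio (SNR) principle [Erdogan16]. The speech and noise spatial covariance matrices are estimated using spectral masks.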
In our experiments, a neural network trained to minimize the mean squared error between clean and enhanced log-mel features was used [Boeddeker18]. The spectral masks were estimated for every 1 s batch, and the beamformer coefficients were updated accordingly.
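As an illustration of mask-based MVDR beamforming with a one-hot reference vector, the sketch below estimates mask-weighted spatial covariance matrices and applies the trace-normalized MVDR solution. The function names, array layouts, and the diagonal loading term are assumptions for this example; the mask estimator itself (the neural network) is not shown and the masks are taken as given.

```python
import numpy as np


def spatial_covariance(stft, mask):
    """Mask-weighted spatial covariance for each frequency bin.

    stft: complex array of shape (frames, freqs, mics).
    mask: real array of shape (frames, freqs) with values in [0, 1].
    Returns an array of shape (freqs, mics, mics).
    """
    weighted = stft * mask[..., None]
    cov = np.einsum("tfm,tfn->fmn", weighted, stft.conj())
    return cov / np.maximum(mask.sum(axis=0), 1e-10)[:, None, None]


def mvdr_weights(phi_s, phi_n, ref_mic, diag_load=1e-6):
    """Compute w_f = (Phi_N^{-1} Phi_S / tr(Phi_N^{-1} Phi_S)) e for every f.

    `e` is the one-hot vector selecting microphone `ref_mic`.
    Diagonal loading is an assumption added here for numerical stability.
    """
    num_mics = phi_n.shape[-1]
    loaded = phi_n + diag_load * np.eye(num_mics)
    ratio = np.linalg.solve(loaded, phi_s)          # Phi_N^{-1} Phi_S, (F, M, M)
    trace = np.trace(ratio, axis1=-2, axis2=-1)     # (F,)
    return ratio[..., ref_mic] / trace[:, None]     # column ref_mic, (F, M)


def beamform(stft, weights):
    """Apply y_{t,f} = w_f^H x_{t,f}; returns shape (frames, freqs)."""
    return np.einsum("fm,tfm->tf", weights.conj(), stft)
```

Calling `mvdr_weights` once per value of `ref_mic` yields one beamformer per reference microphone while reusing the same covariance estimates.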
Strategies for generating multiple different outputs. System combination relies on errors being partly uncorrelated among inputs. For this reason, [Stolcke11] suggested manipulating early-fusion approaches to keep the outputs as decorrelated as possible, specifically using a leave-one-out approach to beamforming. Two such schemes are investigated in this work.
In the first scheme, called the all-channel approach, we rotate the position of the 1 in the unit vector of Eqn. (1) from the first element to the last to create $M$ different beamformer coefficient vectors, where $M$ is the number of microphones. A potential drawback of this approach is that the beamformer outputs might not retain enough diversity among different channels because they are still based on the same input signals.
The second, “leave-one-out”, scheme forms an acoustic beam by using $M-1$ channels while varying the left-out microphone in a round-robin manner. This scheme requires $M$ different $(M-1)$-dimensional noise spatial covariance matrices to be inverted in order to calculate $M$ beamformers based on Eqn. (1). It can be shown that all the inverse spatial covariance matrices of size $(M-1) \times (M-1)$ can be derived from a shared $M$-dimensional inverse spatial covariance matrix by utilizing the matrix inversion properties of block and permutation matrices. Therefore, the two schemes can be run with similar computational cost.
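One way to obtain the leave-one-out inverses from a single full-size inverse is the standard block-matrix (Schur complement) identity: if $\mathbf{B} = \mathbf{A}^{-1}$, the inverse of $\mathbf{A}$ with row and column $k$ removed equals the corresponding submatrix of $\mathbf{B}$ minus a rank-one correction. The sketch below illustrates this identity; whether it matches the exact derivation used in the system is an assumption, and the function name is hypothetical.

```python
import numpy as np


def leave_one_out_inverses(full_inverse):
    """Derive all (M-1) x (M-1) inverses from one M x M inverse.

    full_inverse: B = A^{-1}, shape (M, M).
    Returns a list whose k-th entry is the inverse of A with row and
    column k deleted, computed without any further matrix inversion.
    """
    num_mics = full_inverse.shape[0]
    results = []
    for k in range(num_mics):
        keep = np.delete(np.arange(num_mics), k)
        b_kept = full_inverse[np.ix_(keep, keep)]   # B without row/column k
        col = full_inverse[keep, k][:, None]        # k-th column of B, (M-1, 1)
        row = full_inverse[k, keep][None, :]        # k-th row of B, (1, M-1)
        # Rank-one correction from the block-matrix inversion identity.
        results.append(b_kept - (col @ row) / full_inverse[k, k])
    return results
```

Each leave-one-out inverse then costs only a rank-one update on top of the single shared inversion.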
3.3 Speech recognition
The speech recognition module converts an incoming audio signal to an n-best list with word-level time marks. In the experiments reported later, we used a conventional hybrid speech recognition system, consisting of a latency-controlled bidirectional long short-term memory (LSTM) acoustic model (AM) [Xue17], an n-gram language model (LM), an LSTM rescoring LM, and a weighted finite state transducer (WFST) decoder. Our AM was trained on 33K hours of in-house audio data, including close-talking, distant-microphone, and artificially noise-corrupted speech. Decoding was performed with a trigram LM. Whenever a silence segment longer than 300 ms was detected, the decoder generated an n-best list, which was rescored with the LSTM-LM.
3.4 Speaker diarization
Given a speech region detected by the speech recognition module, speaker diarization assigns a person label to each word in the top recognition hypothesis. We adopt an approach consisting of three steps: d-vector generation, segmentation, and speaker identification. With our decoder configuration, each incoming speech region typically contains up to 20 words.
The d-vector generation step calculates speaker embeddings [Variani14] for every fixed time interval (320 ms in our system). We trained a ResNet-style embedding extraction network [He16] on the VoxCeleb Corpus [VoxCeleb] to generate 128-dimensional d-vectors.
The speaker segmentation step decomposes the received word sequence into speaker-homogeneous subsegments. This is performed with an agglomerative clustering approach [Gauvain98, Tranter06] by using the d-vectors as observed samples. Initially, every single word comprises a unique subsegment. For every neighboring subsegment pair, the degree of proximity between the two subsegments is estimated in the embedding space. The closest pair is then merged to form a new subsegment. The proximity is defined as the cosine similarity between the mean d-vectors. This process is repeated until the cosine similarity drops below a threshold (0.15 in our experiments).
Finally, a speaker label is assigned to each subsegment. In this paper, we assume that a list of meeting attendees is available. For each subsegment, a segment-level embedding is computed by averaging the d-vectors over the subsegment. Likewise, the embedding of each speaker is pre-computed from enrollment audio samples, which were around 30 s long. The speaker label that gives the highest cosine similarity to the subsegment embedding is selected.
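The segmentation and identification steps can be sketched as below. This is an illustrative reading of the description, not the system's code: the function names are hypothetical, and it is assumed that the d-vectors overlapping each word's time span have already been collected into one array per word.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def segment_words(word_dvectors, threshold=0.15):
    """Merge neighboring words into speaker-homogeneous subsegments.

    word_dvectors: list of arrays, each of shape (n_i, D), holding the
    d-vectors assigned to one word (an assumed preprocessing step).
    Returns a list of subsegments, each a list of word indices.
    """
    segments = [[i] for i in range(len(word_dvectors))]

    def mean_vec(seg):
        return np.concatenate([word_dvectors[i] for i in seg]).mean(axis=0)

    while len(segments) > 1:
        sims = [cosine(mean_vec(segments[j]), mean_vec(segments[j + 1]))
                for j in range(len(segments) - 1)]
        best = int(np.argmax(sims))
        if sims[best] < threshold:
            break  # the closest neighboring pair is no longer similar enough
        segments[best:best + 2] = [segments[best] + segments[best + 1]]
    return segments


def identify(segment_dvectors, enrolled):
    """Pick the enrolled speaker closest to the subsegment's mean d-vector.

    enrolled: dict mapping a speaker name to its enrollment embedding.
    """
    embedding = segment_dvectors.mean(axis=0)
    return max(enrolled, key=lambda name: cosine(embedding, enrolled[name]))
```

Only neighboring subsegments are ever merged, so the word order within the speech region is preserved.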
3.5 System combination
System combination consolidates the multiple speaker-attributed ASR results to produce a final transcription result. ROVER [Fiscus97] and confusion network combination (CNC) [StolckeEtAl:nist2000, EvermannWoodland:nist2000] are two popular system combination approaches. The goal of this step is to combine evidence from all channels, after beamforming, for both word and speaker recognition. As discussed in Section 3.4, a speaker label is assigned to every word based on the acoustics of the available audio streams. For purposes of ROVER, the speaker identities are encoded as audio channel numbers. Then, they are submitted to the NIST ROVER algorithm [Fiscus97] along with the word hypotheses, which combines them by aligning words based on dynamic programming and their time marks and extracting the words with the highest vote count. We have modified the interface to the ROVER algorithm in such a way that this process can be invoked online, as new speaker-attributed word hypotheses become available from the diarization module, by using a sliding window shared across streams. Due to misalignment between different decoder outputs, some words may appear twice. We run a simple filter removing the duplicates.
For CNC-based system combination, we devised an alternative algorithm that currently operates in batch mode. On each channel, for each speech segment, the decoder generates n-best lists, which are aligned into confusion networks (CNs). The speaker recognition output from each channel is also encoded as a CN, using special tags for the speaker identities, interspersed with 1-best word hypotheses. We modified the CN algorithms in SRILM [srilm:icslp2002] to support aligning word and speaker CNs, and augmented the usual minimal edit distance objective function with a time-misalignment penalty. The end result of the modified CNC is that n-best word hypotheses from all channels are merged with the speaker information, and the speakers and words with highest combined posteriors can be decoded jointly.
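To make the voting idea concrete, the sketch below shows only the final voting step of a ROVER-style combination, applied to slots that are assumed to be already aligned across channels. The dynamic-programming alignment, time marks, confidence weighting, sliding window, and duplicate filter are all omitted, and the data layout and function names are hypothetical.

```python
from collections import Counter


def vote(slot):
    """Pick the majority word in one aligned slot, then its majority speaker.

    slot: list of (word, speaker) pairs, one per channel; a channel that
    hypothesizes no word in this slot contributes (None, None).
    Returns (word, speaker), or None if "no word" wins the vote.
    """
    word = Counter(w for w, _ in slot).most_common(1)[0][0]
    if word is None:
        return None
    # Speaker votes are counted only among channels agreeing on the word.
    speakers = Counter(s for w, s in slot if w == word)
    return word, speakers.most_common(1)[0][0]


def combine(aligned_slots):
    """Apply the vote to every slot and drop slots where no word wins."""
    results = (vote(slot) for slot in aligned_slots)
    return [r for r in results if r is not None]
```

Ties are broken by whichever candidate was seen first, which is an arbitrary choice made for this example.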
3.6 Acoustic model combination
In addition to the channel-fusion approaches described above, i.e., beamforming and system combination, it is also possible to combine frame-level senone posterior probabilities from multiple streams before ASR decoding [TibrewalaHermansky:icassp97]. While this approach is not integrated into the end-to-end system yet, we have investigated the effectiveness of senone-level AM fusion, with strategies aimed at increasing the diversity of the output results for later processing with ROVER or CNC.
The baseline results (first row) in Table 1 use senone posteriors from a single channel, produced by the AM and used as input to the decoder. Next, the sum and max of senone posteriors across channels are investigated. This results in a single word hypothesis stream, with ROVER/CNC combining speaker hypotheses only. Similar to the leave-one-out strategy for beamforming, we can preserve diversity by sampling from the channels, followed by hypothesis combination. In the last two rows of Table 1, we present results with 6-out-of-7 senone fusion (resulting in 7 different channel subsets), and 3-out-of-7 fusion with 35 possible outputs. In the latter case, we sample 7 of the 35 possible outputs to reduce computation. Either way, the 7 resulting decoding outputs are routed to system combination as before.
|Max 6 of 7||23.8||27.2||22.1||26.7|
|Max 3 of 7||24.2||26.8||22.3||26.9|
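The subset-based senone fusion described above can be sketched as follows. The function names are hypothetical, and two details are assumptions not stated in the text: the fused scores are renormalized to sum to one per frame, and the 7 subsets in the 3-out-of-7 case are drawn uniformly at random.

```python
import itertools

import numpy as np


def fuse(posteriors, channels, mode="max"):
    """Fuse frame-level senone posteriors over a subset of channels.

    posteriors: array of shape (channels, frames, senones).
    channels: iterable of channel indices to include.
    Returns an array of shape (frames, senones), renormalized per frame.
    """
    subset = posteriors[list(channels)]
    fused = subset.max(axis=0) if mode == "max" else subset.sum(axis=0)
    return fused / fused.sum(axis=-1, keepdims=True)


def channel_subsets(num_channels, size, limit=None, seed=0):
    """Enumerate channel subsets of a given size, optionally sampling `limit` of them."""
    subsets = list(itertools.combinations(range(num_channels), size))
    if limit is not None and limit < len(subsets):
        rng = np.random.default_rng(seed)
        picks = rng.choice(len(subsets), size=limit, replace=False)
        subsets = [subsets[i] for i in sorted(picks)]
    return subsets
```

With seven channels, `channel_subsets(7, 6)` gives the 7 leave-one-out subsets, and `channel_subsets(7, 3, limit=7)` samples 7 of the 35 three-channel subsets; each fused stream would then be decoded separately before system combination.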
4 Experiments and Results
4.1 Data and metrics
We conducted a series of experiments to analyze the performance of the system described so far. We recorded five internal meetings; three meetings were recorded with seven independent consumer devices, four of which were iOS devices and three based on Android. All devices were different products. The other two meetings were recorded with a seven-channel circular microphone array. For these meetings, we did not make use of the fact that the signals were synchronous and let the signals through the entire pipeline including the audio stream alignment module. Those meetings took place in several different rooms and lasted for 30 minutes to one hour each, with three to eleven participants per meeting. The meetings were neither scripted nor staged; the participants conducted normal work discussions and were familiar with each other. Partly as a result, about 10% of all speech occurred in overlap with at least one other speaker. Reference transcriptions were created by professional transcribers based on both close-talking and far-field recordings.
The system outputs were scored with NIST’s scoring toolkit [Fiscus06] to calculate both standard, speaker-agnostic word error rates (WERs) and speaker-attributed WERs (SAWERs). For the latter, a word is counted as correct only if both the word label and its speaker are identified correctly. Note that these metrics count overlapped speech like any other speech. Since our system, at present, does not attempt to separate overlapping speech, the error rate has a floor of about 10%.
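The difference between the two metrics can be illustrated with a plain edit-distance computation in which SAWER treats each (word, speaker) pair as a single token. This is a simplified sketch, not the NIST toolkit's procedure, which additionally handles time marks, multiple speakers, and overlap; the function name and the toy data are made up for the example.

```python
def error_rate(reference, hypothesis):
    """Levenshtein distance between token sequences, divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Toy example: all words are right, but one word has the wrong speaker.
ref = [("hello", "A"), ("there", "A"), ("hi", "B")]
hyp = [("hello", "A"), ("there", "B"), ("hi", "B")]

wer = error_rate([w for w, _ in ref], [w for w, _ in hyp])  # words only
sawer = error_rate(ref, hyp)                                # word and speaker
```

In this toy case the word-level error rate is zero while the speaker-attributed error rate is one third, since the misattributed word counts as an error.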
|(real time)||All channels||24.8||30.8|
|Leave one out||24.9||30.9|
|Leave one out||24.2||27.2|
|Leave one out||22.3||26.7|
|IHM + reference diarization||14.4||14.4|
4.2 Speech transcription accuracy
Table 2 shows the results for various configurations. For the systems that do not perform any form of system combination, seven different results were obtained, each corresponding to a different one of the microphones, and the averages are reported in the table. As a best case condition, and to calibrate the difficulty of the distant-microphone task, the final table row gives results for individual head-mounted microphones (IHM), with reference speaker segmentation.
The best system, combining beamforming and CNC, achieved substantial improvement over the single-microphone system. The WER and SAWER relative gains were 17.4% and 22.4%, respectively. Taking the IHM scenario as a floor, the gaps in WER and SAWER between the single-microphone system and IHM were reduced by 37% and 39%, respectively.
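The two kinds of gain can be checked with a short calculation. It assumes that the best system corresponds to the 22.3% WER and 26.7% SAWER entries and that the IHM floor is 14.4% for both metrics; the single-microphone error rates $e_{\text{SDM}}$ are back-computed here from the stated relative gains rather than read from the table.

```latex
\begin{align*}
g_{\text{rel}} &= \frac{e_{\text{SDM}} - e_{\text{sys}}}{e_{\text{SDM}}}, &
g_{\text{gap}} &= \frac{e_{\text{SDM}} - e_{\text{sys}}}{e_{\text{SDM}} - e_{\text{IHM}}} \\[4pt]
\text{WER:}\quad e_{\text{SDM}} &\approx \frac{22.3}{1 - 0.174} \approx 27.0, &
g_{\text{gap}} &\approx \frac{27.0 - 22.3}{27.0 - 14.4} \approx 0.37 \\[4pt]
\text{SAWER:}\quad e_{\text{SDM}} &\approx \frac{26.7}{1 - 0.224} \approx 34.41, &
g_{\text{gap}} &\approx \frac{34.41 - 26.7}{34.41 - 14.4} \approx 0.385
\end{align*}
```

Under these assumptions the gap reductions come out at roughly 37% and 38.5%, in line with the reported 37% and 39% up to rounding of the inputs.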
We can see that both beamforming and system combination (either with ROVER or CNC) contributed to the final performance, even though both steps combined information across channels. CNC provided the largest performance gain. While beamforming yielded a smaller gain, it is more easily used for real-time applications. The leave-one-out scheme provided slightly larger gains than the all-channel beamforming when combined with system combination, especially CNC, confirming our rationale in Section 3.2.
|No. of microphones||1||3||5||7|
Table 3 shows the WERs and SAWERs for different numbers of microphones. There is a clear correlation between the number of microphones and the amount of improvement over the single channel system. Even with only three microphones, our system yielded relative gains of 11.1% and 14.8% in WER and SAWER, respectively.
|System||SDM||BF||BF + CNC||IHM|
To assess the speech recognition accuracy when a single person is speaking, we scored the results only against segments that did not contain any form of overlap. (This was done by using NIST’s asclite with the “-overlap-limit 1” option; note that this discarded 58% of the words.) The results are shown in Table 4. By comparing the numbers with the results of Table 2, we can see that the system produced around 25% more accurate transcriptions for the non-overlapped segments. For the full system, the WER on non-overlapped speech is only 3.0% absolute worse than with close-talking microphones. Considering that the overlaps make up about 10% of the speech duration, this result shows that segments including overlaps are more affected by the speaker-microphone distance.
|Avg. by channel||10.5||3.3||1.8||15.6|
4.3 Speaker diarization accuracy
We took the speaker-attributed recognition output, added 0.5 s of extra duration at the margins of contiguous output from the same speaker, and evaluated the result according to the NIST “Who spoke when” task [Tranter06]. Note that our task is not speaker-agnostic diarization, but recognizing the known speakers. Also, we are not trying to recognize overlapping speakers, so about 10% of speech is missed, thus putting a floor on the missed speech and overall diarization error rate (DER).
Table 5 gives the speaker diarization error of the system, by channel and for the combined output. The false alarm rate is quite low since the recognizer acts as a very conservative speech detection engine. Similar to word recognition, CNC reduces the speaker error (44% relative) by pooling speaker label posterior probabilities across all channels.
5 Conclusions
We studied a meeting transcription architecture for asynchronous distant microphones, combining front-end and back-end techniques, and evaluated it on real meeting recordings. We found that both front-end (blind beamforming) and back-end (model or system combination) algorithms improve word error, speaker-attributed word error, and diarization error metrics. Both beamforming and senone posterior fusion can be made more effective in conjunction with system combination by using leave-one-out techniques. System combination was generalized such that it benefits both word and speaker hypotheses. On non-overlapped speech, the error rate is only 3% absolute worse than with close-talking microphones. In summary, our study shows the effectiveness of multiple asynchronous microphones for meeting transcription in real-world scenarios. A major remaining challenge is recognition of overlapped speech [Yoshioka18b].