1 Introduction
Overlapping speech is common in many scenarios: the meetings recorded in the AMI corpus [McCowan2005_AMIMeetingCorpus] show overlap in about  to  of the time, and in informal situations, as recorded in the CHiME-5 [Barker2018_FifthCHiMESpeech] challenge data, it can exceed . A typical task for the analysis of such recordings is ASR. In the case of overlapping speech, this is called multi-talker speech recognition, where the speech of multiple concurrently active talkers is to be recognized. Conventional speech recognizers are limited to handling a single talker at a time, which makes them inapplicable in such scenarios.
Much effort has already been invested in multi-talker speech recognition [Seki2018_PurelyEndtoendSystem, Chang2019_EndtoendMonauralMultispeaker, Kanda2020_SerializedOutputTraining, Settle2018_EndtoendMultispeakerSpeech, Menne2019_AnalysisDeepClustering, vonNeumann2019_EndtoendTrainingTime, Yu2017_RecognizingMultitalkerSpeech, Qian2018_SinglechannelMultitalkerSpeech, Bahmaninezhad2019_ComprehensiveStudySpeech]. The approaches can broadly be divided into monolithic and cascade systems. The monolithic ones form one large neural network that is optimized as a whole, e.g., by extending a single-talker CTC/attention ASR system so that its encoder outputs one latent representation for each speaker in the mixture and one or multiple attention decoders reconstruct one transcription per representation [Seki2018_PurelyEndtoendSystem, Chang2019_EndtoendMonauralMultispeaker]. Another monolithic approach, Serialized Output Training, modifies the target label sequence of a single-speaker recognizer to output the transcriptions of all talkers serially, delimited by a special “speaker change” token [Kanda2020_SerializedOutputTraining]. These systems have the disadvantage that they do not provide interpretable intermediate signals such as separated speech signals.
The cascade systems use source separation techniques followed by single-talker speech recognizers. They have the advantage that intermediate signals are interpretable and individual system parts can be trained and tested on their own. As separation front-ends, DPCL [Isik2016_SinglechannelMultispeakerSeparation], PIT, and the currently most promising TasNet architecture [Luo2017_TasNetTimedomainAudio] have been combined with different speech recognizers [Settle2018_EndtoendMultispeakerSpeech, Menne2019_AnalysisDeepClustering, Yu2017_RecognizingMultitalkerSpeech, Qian2018_SinglechannelMultitalkerSpeech, Bahmaninezhad2019_ComprehensiveStudySpeech, vonNeumann2019_EndtoendTrainingTime]. The experiments generally suggest that having both dedicated components for separation and recognition and joint end-to-end fine-tuning is beneficial.
Most separation and multi-talker ASR approaches assume that the number of talkers is known [Kolbaek2017_MultitalkerSpeechSeparation, Luo2017_TasNetTimedomainAudio, Luo2019_DualpathRNNEfficient, Seki2018_PurelyEndtoendSystem, Chang2019_EndtoendMonauralMultispeaker, Settle2018_EndtoendMultispeakerSpeech, Qian2018_SinglechannelMultitalkerSpeech], although this is not the case in realistic situations. Source number counting has been combined with source separation in iterative speech extraction [Kinoshita2018_ListeningEachSpeaker, Takahashi2019_RecursiveSpeechSeparation] and model selection schemes [Nachmani2020_VoiceSeparationUnknown], but not yet with speech recognition.
We propose the first jointly optimized multi-talker ASR system for an unknown number of speakers by combining source separation and number counting techniques from [Luo2019_DualpathRNNEfficient, Kinoshita2018_ListeningEachSpeaker, Takahashi2019_RecursiveSpeechSeparation] with a single-speaker CTC/attention speech recognizer [Watanabe2017_HybridCTCAttention, Kim2017_JointCTCattentionBased, Watanabe2018_EspnetEndtoendSpeech, Xiao2018_HybridCTCattentionBased]. To do so, we first investigate a DPRNN-TasNet separator for fixed numbers of speakers as a front-end for ASR. We jointly fine-tune the DPRNN with a pre-trained ASR system to optimize the overall model performance, which was shown to be effective in other scenarios [Heymann2017_BeamnetEndtoendTraining] and in multi-talker ASR [vonNeumann2019_EndtoendTrainingTime]. By doing so, we achieve new state-of-the-art performance in terms of WER. We then integrate the DPRNN into the OR-PIT architecture and extend it with elegant mechanisms for source number counting, which show promising performance. Finally, we combine the OR-PIT with an ASR system to form our new multi-speaker ASR system for unknown numbers of speakers. Experiments show that our system generalizes to a larger number of speakers than it saw during training.
2 Source separation and counting
In the following two subsections we first introduce the baseline approach to source separation, where the maximum number of talkers is assumed to be known. Then, we generalize it to separate an a priori unknown number of concurrent talkers.
2.1 Known number of talkers: Dual-Path RNN-TasNet
A single-channel discrete-time speech mixture signal $y(t)$ is modeled as a sum of $K$ single-talker speech signals $s_k(t)$:
$$y(t) = \sum_{k=1}^{K} s_k(t),$$
where $k = 1, \dots, K$ is the talker index. Source separation is concerned with extracting the source signals $s_k(t)$ from the mixture $y(t)$.
We use the TasNet [Luo2017_TasNetTimedomainAudio, Luo2018_ConvTasNetSurpassingIdeal] with a DPRNN [Luo2019_DualpathRNNEfficient]
as the separation network to obtain a fixed number $K$ of estimates $\hat{s}_k(t)$ for the sources $s_k(t)$, where the order of the estimates can be permuted with respect to the actual source signals:
$$[\hat{s}_1(t), \dots, \hat{s}_K(t)] = \operatorname{DPRNN\text{-}TasNet}(y(t)).$$
It works by encoding the time-domain signal into a latent domain with a convolutional encoder, separating this representation with the DPRNN, and transforming it back to the time domain with a de-convolutional decoder. The DPRNN models short- and long-term dependencies in an alternating manner by segmenting the input and skipping different numbers of time steps in adjacent layers. It was shown in [Luo2019_DualpathRNNEfficient] to be superior to the separator based on 1-D convolutions used in the Conv-TasNet [Luo2018_ConvTasNetSurpassingIdeal].
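The segmentation step of the DPRNN can be illustrated with a short NumPy sketch (the helper name `segment`, the 50% hop, and the zero-padding scheme are our illustrative choices, not specified details of [Luo2019_DualpathRNNEfficient]; the chunk size of 100 matches the configuration in Sec. 4.1):

```python
import numpy as np

def segment(x, chunk=100, hop=50):
    """Split a (T, N) latent sequence into overlapping chunks of length `chunk`.

    Returns an array of shape (num_chunks, chunk, N); an intra-chunk RNN would
    run along axis 1 (short-term context), an inter-chunk RNN along axis 0
    (long-term context), alternating between the two in adjacent layers.
    """
    T, N = x.shape
    # Zero-pad so the chunks tile the sequence exactly.
    pad = (-(T - chunk)) % hop if T > chunk else chunk - T
    x = np.pad(x, ((0, pad), (0, 0)))
    starts = range(0, x.shape[0] - chunk + 1, hop)
    return np.stack([x[s:s + chunk] for s in starts])
```

With a 50% hop, adjacent chunks share half their frames, so the inter-chunk RNN can propagate information across the whole sequence in a single pass.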
2.1.1 Time-domain training objective
In recent work, the training objective often is to maximize the scale-invariant source-to-distortion ratio (SI-SDR)¹ [Luo2017_TasNetTimedomainAudio, Luo2018_ConvTasNetSurpassingIdeal] by minimizing the negative SI-SDR
$$\mathcal{L}_{\text{SI-SDR}} = -10 \log_{10} \frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2}, \qquad \alpha = \frac{\hat{s}^{\mathsf{T}} s}{\lVert s \rVert^2}.$$
By setting $\alpha = 1$ and removing terms that do not depend on $\hat{s}$, the time-domain logarithmic MSE loss can be obtained [Heitkaemper2019_DemystifyingTasNetDissecting]:
$$\ell(s, \hat{s}) = 10 \log_{10} \lVert s - \hat{s} \rVert^2.$$
¹ Sometimes called SI-SNR [Luo2017_TasNetTimedomainAudio, Luo2018_ConvTasNetSurpassingIdeal].
The missing scale-invariance did not show negative effects on the separation performance [Heitkaemper2019_DemystifyingTasNetDissecting]. To be able to handle different numbers of speakers with such a model, it is common for frequency-domain separators to set the missing outputs to silence [Kolbaek2017_MultitalkerSpeechSeparation], i.e., the corresponding targets are $s_k(t) = 0$. To use silent targets here, the loss has to be prevented from approaching negative infinity for a perfect reconstruction, which would mask the loss terms of all other target signals. This can be achieved by, e.g., adding a small constant $\varepsilon$ to the argument of the logarithm:
$$\ell_\varepsilon(s, \hat{s}) = 10 \log_{10} \left( \lVert s - \hat{s} \rVert^2 + \varepsilon \right).$$
The total loss is calculated in a permutation-invariant manner with the set $\mathcal{P}$ of all permutations of length $K$:
$$\mathcal{L}_{\text{FE}} = \min_{\pi \in \mathcal{P}} \frac{1}{K} \sum_{k=1}^{K} \ell_\varepsilon\!\left(s_k, \hat{s}_{\pi(k)}\right).$$
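As an illustration, a minimal NumPy sketch of this permutation-invariant, stabilized log-MSE loss (the function names `log_mse` and `pit_loss` are ours; a practical implementation would operate on batched tensors):

```python
import itertools
import numpy as np

def log_mse(s, s_hat, eps=1e-8):
    """Time-domain logarithmic MSE; eps keeps the loss finite for a
    perfect reconstruction of a silent (all-zero) target."""
    return 10.0 * np.log10(np.sum((s - s_hat) ** 2) + eps)

def pit_loss(sources, estimates, eps=1e-8):
    """Minimum average loss over all assignments of estimates to sources."""
    K = len(sources)
    best = np.inf
    for perm in itertools.permutations(range(K)):
        loss = sum(log_mse(sources[k], estimates[p], eps)
                   for k, p in zip(range(K), perm)) / K
        best = min(best, loss)
    return best
```

Note that the exhaustive search over permutations grows factorially in $K$, which is unproblematic for the small speaker counts considered here.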
2.2 Unknown number of talkers: OR-PIT
Conventional source separators are limited to a fixed number of talkers. An arbitrary number of talkers can be handled by iterative source extraction approaches [Kinoshita2018_ListeningEachSpeaker, Takahashi2019_RecursiveSpeechSeparation]. Instead of directly separating the mixture into one stream for each talker, they apply a network repeatedly to extract one talker at a time.
Following the OR-PIT [Takahashi2019_RecursiveSpeechSeparation] scheme, an iterative source extractor is a two-output separator, in our case a DPRNN-TasNet, trained to output one talker $\hat{s}(t)$ at its primary output and the sum of all remaining talkers $\hat{y}(t)$ at its secondary output, so that $\hat{y}(t)$ can be fed back to extract the next talker, as visualized in the left part of Fig. 1. It is first trained with
$$\mathcal{L}_{\text{OR-PIT}} = \min_{k} \left[ \ell\!\left(s_k, \hat{s}\right) + \ell\Big(\sum_{j \neq k} s_j,\ \hat{y}\Big) \right]$$
on clean mixtures, where $\ell$ is a time-domain loss. It is then fine-tuned by feeding its own secondary output $\hat{y}(t)$ back as additional training data.
The number of talkers can be determined by counting the number of iterations required until $\hat{y}(t)$ no longer contains speech. The authors of [Takahashi2019_RecursiveSpeechSeparation] use AlexNet [Sutskever2012_ImagenetClassificationDeep] as an external classifier to detect the absence of speech. A stopping criterion can, however, be integrated into the separator by using thresholding or an additional output flag, as inspired by [Kinoshita2018_ListeningEachSpeaker].
Assuming that a speech signal contains a certain minimal amount of average energy and that the network is able to suppress speech well enough, the absence of speech can be detected by
$$\frac{1}{T} \sum_{t=1}^{T} \hat{y}(t)^2 < \theta,$$
with $T$ being the length of the signal and the threshold value $\theta$ being determined manually. Models that use this stopping criterion are called “threshold” models.
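The resulting extract-and-count loop can be sketched as follows (the helper `extract_all` and the callable `step`, which stands in for the trained OR-PIT, are illustrative assumptions; the number of extracted talkers is the length of the returned list):

```python
import numpy as np

def extract_all(mixture, step, threshold=1e-3, max_iters=10):
    """Iteratively apply a one-vs-rest separator `step(y) -> (s_hat, residual)`
    until the residual's average energy falls below `threshold`.
    Returns the list of extracted source estimates."""
    sources = []
    y = mixture
    for _ in range(max_iters):
        s_hat, residual = step(y)
        sources.append(s_hat)
        # Threshold-based stopping criterion: residual contains no more speech.
        if np.mean(residual ** 2) < threshold:
            break
        y = residual
    return sources
```

The `max_iters` guard is a practical safeguard against a residual that never falls below the threshold.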
An elegant way to integrate the classifier into the separator is to let the separation network predict a stop flag. This is done by adding an additional scalar output $\hat{b}$ to the network: the output size of the DPRNN is increased, and the newly added DPRNN outputs are transformed by a fully connected layer into one scalar per time step. The stop flag per utterance is obtained with, e.g., average pooling over time followed by a sigmoid function. The estimated flag $\hat{b}$ is trained to indicate when to stop iterating using the binary cross-entropy objective
$$\mathcal{L}_{\text{flag}} = -\left( b \log \hat{b} + (1 - b) \log (1 - \hat{b}) \right).$$
The target flag $b$ is set to $1$ when the second output should be empty and to $0$ otherwise.
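A minimal sketch of such a stop-flag head and its objective (the function names and the choice of plain average pooling are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stop_flag(frame_scores):
    """Utterance-level stop flag: average-pool the per-frame scalar scores
    over time, then squash the result with a sigmoid."""
    return sigmoid(np.mean(frame_scores))

def bce(b, b_hat, eps=1e-12):
    """Binary cross-entropy between target flag b and estimated flag b_hat;
    eps guards the logarithms against zero arguments."""
    return -(b * np.log(b_hat + eps) + (1.0 - b) * np.log(1.0 - b_hat + eps))
```

At inference time, iterating stops once `stop_flag` exceeds a decision threshold such as 0.5.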
3 Joint optimization of source separation and speech recognition
We propose to jointly fine-tune a source separation front-end (FE) with source number counting and a speech recognition back-end. The FE is either a TasNet or an OR-PIT, and the back-end is a CTC/attention speech recognizer similar to [Watanabe2017_HybridCTCAttention, Seki2018_PurelyEndtoendSystem] taken from the ESPnet toolkit [Watanabe2018_EspnetEndtoendSpeech], trained with the multi-task objective
$$\mathcal{L}_{\text{ASR}} = \lambda \mathcal{L}_{\text{CTC}} + (1 - \lambda) \mathcal{L}_{\text{att}}.$$
Joint fine-tuning is straightforward for the TasNet-based system without counting. The gradients are propagated from the back-end into the FE, and their losses are combined as a weighted sum
$$\mathcal{L} = \mathcal{L}_{\text{ASR}} + \gamma \mathcal{L}_{\text{FE}}$$
with a weighting factor $\gamma$. We solve the permutation problem with the FE signal-level loss.
For the iterative OR-PIT system, different options arise from the fact that the secondary output $\hat{y}(t)$ can contain more than one talker and is thus unusable for training the back-end. We formulate two training schemes, namely the single- and multi-iteration schemes. The single-iteration scheme unrolls a single iteration of the OR-PIT FE and uses the primary output $\hat{s}(t)$ to train the back-end; $\hat{y}(t)$ is optimized using only the signal-level loss. Here, the OR-PIT always sees unprocessed data, so there is a mismatch to evaluation, where its own output is fed back. To mitigate this, the model can be unrolled in the multi-iteration scheme, where the secondary output is used as the input for the following iteration and all primary outputs are used to train the ASR back-end.
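The multi-iteration unrolling can be sketched as follows (the helper `unroll_multi_iteration` and the callable `step`, standing in for the OR-PIT, are illustrative; in training, each collected primary output would be scored by the ASR loss, so gradients flow through the whole unrolled chain):

```python
import numpy as np

def unroll_multi_iteration(mixture, step, num_talkers):
    """Multi-iteration scheme: feed the secondary output back as the input to
    the next iteration and collect every primary output for the ASR loss."""
    primaries, y = [], mixture
    for _ in range(num_talkers):
        s_hat, y = step(y)       # y becomes the residual of remaining talkers
        primaries.append(s_hat)
    return primaries
```

In contrast, the single-iteration scheme corresponds to calling `step` once on the unprocessed mixture, which is exactly the mismatch to evaluation described above.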
4 Experiments
We evaluate our systems in terms of the average improvement of the signal-to-distortion ratio (SDRi)² [Vincent2006_PerformanceMeasurementBlind, Fevotte2005_BSSEVALToolbox], CER, WER, and counting accuracy on the WSJ, WSJ0-2mix, WSJ0-3mix, and WSJ0-4mix databases [Paul1992_DesignWallStreet, Isik2016_SinglechannelMultispeakerSeparation, Kolbaek2017_MultitalkerSpeechSeparation] with a sample rate of . We use WSJ0-4mix as created in [Takahashi2019_RecursiveSpeechSeparation]. The experiments on source separation and number counting are conducted on the min subsets of the WSJ0-mix databases, which contain full overlap only. Speech recognition is evaluated on the max subset, where no utterances are truncated.
² Provided by the mir_eval toolbox [Raffel2014_MirEvalTransparent].
4.1 Source Separation
Table 1: SDRi for different numbers of talkers.

| Model | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| TasNet 2 talker |  |  | — | — |
| TasNet 3 talker |  |  |  | — |
| TasNet 2+3 talker |  |  |  | — |
| RSAN stop-flag [Kinoshita2018_ListeningEachSpeaker] | — |  | — | — |
| Original DPRNN-TasNet [Luo2019_DualpathRNNEfficient] | — |  | — | — |
We first assume that the number of talkers is known. We train all source separators on long signal segments to comply with [Luo2017_TasNetTimedomainAudio, Luo2019_DualpathRNNEfficient]. We choose the DPRNN parameters according to the original paper [Luo2019_DualpathRNNEfficient], i.e., six blocks with two BLSTMs with 128 units each, an encoder window size of 16, and a chunk size of 100. We train two DPRNN-TasNets for two and three speakers only, respectively, and one with three outputs for both two- and three-speaker mixtures, where the last target is set to silence for two-speaker mixtures. Our OR-PIT is trained on single-, two-, and three-speaker recordings. For “DPRNN-TasNet 2+3 talker”, we use the loss with the additive constant in the logarithm to cope with the silent output, and the plain time-domain loss for all other models.
The separation performance displayed in Table 1 is evaluated given the oracle number of talkers, so that counting issues are not reflected in the SDRi. The OR-PIT is forced to the correct number of iterations. For the TasNet, only the outputs with the largest energy are considered for evaluation.
Our two-talker TasNet (DPRNN-TasNet 2 talker) achieves a separation performance slightly better than [Luo2019_DualpathRNNEfficient] on two talkers, and the same architecture works well on three talkers (DPRNN-TasNet 3 talker). However, both do not generalize well to smaller numbers of talkers: the models specialized to one specific number of talkers cannot produce a better reconstruction quality than the one they learned during training. The model trained on both two and three speakers handles both types of mixtures well and generalizes even to single-speaker recordings.
The OR-PIT can handle one to three speakers slightly better than the TasNet. It especially has advantages in the three-speaker case, as also observed in [Takahashi2019_RecursiveSpeechSeparation], where the OR-PIT network extracts one speaker at a time while the TasNet has to solve the more complex task of separating three speakers simultaneously. Its iterative structure allows the model to generalize to some extent to a larger number of talkers. The performance, however, drops compared to the original Conv-TasNet-OR-PIT. This might be caused by slight differences in training scheme.
4.2 Source Number Counting
Table 2: Source number counting accuracy for different numbers of talkers.

| Model | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| TasNet 2+3 talker |  |  |  | — |
| RSAN stop-flag [Kinoshita2018_ListeningEachSpeaker] |  |  | — | — |
| Model selection [Nachmani2020_VoiceSeparationUnknown] | — |  |  |  |
Table 3: Speech recognition performance of the proposed systems.

| # | Model |
|---|---|
| (1) | DPRNN-TasNet (2 talker) + ASR |
| (2) | + fine-tune ASR |
| (3) | + fine-tune FE + ASR |
| (4) | DPRNN-TasNet (3 talker) + ASR |
| (5) | + fine-tune ASR |
| (6) | + fine-tune FE + ASR |
| (7) | DPRNN-TasNet (2+3 talker) + ASR |
| (8) | + fine-tune ASR |
| (9) | + fine-tune FE + ASR |
| (10) | DPRNN-OR-PIT + ASR |
| (11) | + VAD |
| (12) | + single-iteration fine-tune ASR |
| (13) | + single-iteration fine-tune FE + ASR |
| (14) | + multi-iteration fine-tune ASR |
| (15) | + multi-iteration fine-tune FE + ASR |
| (16) | End-to-end ASR [Chang2019_EndtoendMonauralMultispeaker] |
| (17) | Conv-TasNet + ASR [vonNeumann2019_EndtoendTrainingTime] |
| (18) | End-to-end ASR [Zhang2020_ImprovingEndtoEndSingleChannel] |
| (19) | Oracle ASR result based on ground truth data |
The same models are evaluated for counting accuracy in Table 2. Counting for the 2+3 talker TasNet is performed by an energy-based threshold on its third output. All thresholds are chosen to maximize the accuracy on “cv” and “dev” data.
The “DPRNN 2+3 talker” model performs well in discriminating between two- and three-talker mixtures, but the threshold does not generalize well to the single-talker scenario due to higher energy levels in the estimated signals. A counting accuracy of over is possible with an adjusted threshold, but this degrades the accuracy on two- and three-speaker mixtures.
The OR-PIT detects the number of sources correctly in most cases. The threshold model performs slightly better for the numbers of talkers it was trained on, but the stop-flag model generalizes better to larger numbers of talkers.
Compared to the model selection scheme in [Nachmani2020_VoiceSeparationUnknown], our system performs better in source counting but worse in separation, where their model achieves a higher SI-SDR improvement than our system on two talkers. Modifications similar to theirs could be applied to the OR-PIT to possibly reach a similar separation performance.
Although not directly comparable, the stopping criteria introduced in this work seem to perform comparably to the AlexNet classifier of [Takahashi2019_RecursiveSpeechSeparation] for fewer than three talkers, while being simpler.
4.3 Single-talker speech recognition
We use a configuration similar to [Seki2018_PurelyEndtoendSystem] without the speaker-dependent layers for the speech recognizer. This results in two CNN layers followed by two BLSTMP layers for the encoder and one LSTM layer with 300 units for the decoder. We use a location-aware attention mechanism and ADADELTA [Zeiler2012_ADADELTAadaptivelearning] as the optimizer. Decoding is performed with a word-level RNN language model. Our ASR system achieves a WER of  on the WSJ eval92 set.
4.4 Two-talker speech recognition
We evaluate the speech recognition performance of the TasNet-based recognizer in rows 1 to 9 of Table 3. We observe tendencies similar to those found in [Heymann2017_BeamnetEndtoendTraining] and [vonNeumann2019_EndtoendTrainingTime]: fine-tuning the ASR part gives a greater improvement than fine-tuning the TasNet, but fine-tuning both jointly gives the best performance on two speakers (compare rows 1 to 3), a relative improvement over not fine-tuning at all. The final WER (3) forms a new state-of-the-art result on WSJ0-2mix and is a substantial improvement over previous techniques [vonNeumann2019_EndtoendTrainingTime]. Similar to the separation evaluation, the two-speaker TasNet performs better than the two- and three-speaker model (rows 7 to 9).
4.5 Multi-talker speech recognition with counting
After having shown that the joint system performs well with a separator for a fixed number of talkers, we evaluate joint training for the OR-PIT in rows 10 to 15 of Table 3. We use the stop-flag OR-PIT due to better counting accuracy on max data.
The OR-PIT shows an odd behavior that is not present in the TasNet. In regions where only a single talker is active, one output should be silent. This works for the TasNet, but the OR-PIT outputs a heavily scaled-down version of the speech signal. The ASR system picks it up and produces insertion errors. An energy-based voice activity detection (VAD) substantially improves the WER (11) for two talkers over the system without fine-tuning (10).
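Such an energy-based VAD can be sketched as follows (the function name, the frame size, and the threshold are illustrative assumptions, not the values used in our experiments):

```python
import numpy as np

def energy_vad(s, frame=256, threshold=1e-4):
    """Zero out frames whose mean energy is below `threshold`, so that
    heavily scaled-down leakage in otherwise silent regions does not
    reach the ASR back-end and cause insertion errors."""
    out = s.copy()
    for start in range(0, len(s), frame):
        seg = out[start:start + frame]
        if np.mean(seg ** 2) < threshold:
            out[start:start + frame] = 0.0
    return out
```

The threshold plays the same role as the stopping threshold of Sec. 2.2 and would likewise be tuned on development data.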
Fine-tuning the ASR system with the single-iteration scheme (12) improves the performance, but the mismatch created by additionally fine-tuning the FE this way (13) degrades the performance compared to fine-tuning the ASR alone. Using the multi-iteration fine-tuning scheme (14 and 15) prevents this, but the overall performance does not reach that of the two-speaker TasNet model, although the OR-PIT achieves a better separation performance according to Table 1. This might be caused by the more complex training procedure.
The performance for larger numbers of speakers is listed in the last two columns of Table 3. The OR-PIT again performs better than the TasNet on more than two speakers and generalizes to the unseen larger number of four speakers. In contrast to the results for two speakers, fine-tuning just the ASR part performs best for more than two speakers (rows 5, 8, 12 and 14). These results suggest that it is beneficial to fine-tune only the ASR back-end in some scenarios, especially when the separation task is more challenging, while fine-tuning the front-end contributes to over-fitting to a specific scenario.
5 Conclusions
We built the first joint end-to-end system that performs source number counting and multi-talker speech recognition. Our model specialized to two talkers provides a new state-of-the-art WER for the WSJ0-2mix database. The OR-PIT-based system, which can count the number of speakers, performs better than the TasNet-based model on three speakers and generalizes well to an unseen number of speakers, i.e., four speakers. The evaluation of the source counting abilities shows very promising performance.
Computational resources were provided by the Paderborn Center for Parallel Computing.