Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

by Thilo von Neumann, et al.

Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix. Among others, we set a new state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database.




1 Introduction

Overlapping speech is quite common for many scenarios, as the meetings recorded in the AMI corpus [McCowan2005_AMIMeetingCorpus] show overlap in about   to   of the time, and in informal situations, as recorded in the CHiME-5 [Barker2018_FifthCHiMESpeech] challenge data, it can exceed . A typical task for analysis of such recordings is ASR. In the case of overlapping speech, this is called multi-talker speech recognition where the speech of multiple concurrently active talkers is to be recognized. Conventional speech recognizers are limited to handling a single talker at a time which makes them inapplicable in those scenarios.

Considerable effort has already gone into multi-talker speech recognition [Seki2018_PurelyEndtoendSystem, Chang2019_EndtoendMonauralMultispeaker, Kanda2020_SerializedOutputTraining, Settle2018_EndtoendMultispeakerSpeech, Menne2019_AnalysisDeepClustering, vonNeumann2019_EndtoendTrainingTime, Yu2017_RecognizingMultitalkerSpeech, Qian2018_SinglechannelMultitalkerSpeech, Bahmaninezhad2019_ComprehensiveStudySpeech]. The approaches can be divided into monolithic and cascade systems. The monolithic ones form one large neural network that is optimized as a whole, e.g., by extending a single-talker CTC/attention ASR system so that its encoder outputs one latent representation for each speaker in the mixture and one or multiple attention decoders reconstruct one transcription for each representation [Seki2018_PurelyEndtoendSystem, Chang2019_EndtoendMonauralMultispeaker]. Another monolithic approach called Serialized Output Training modifies the target label sequence of a single-speaker recognizer to output the transcriptions for all talkers serially, delimited by a special “speaker change” token [Kanda2020_SerializedOutputTraining]. These systems have the disadvantage that they do not provide interpretable intermediate signals such as separated speech signals.

Figure 1: Example of the proposed iterative multi-speaker speech recognition system processing a three-speaker mixture.

The cascade systems use source separation techniques followed by single-talker speech recognizers. These systems have the advantage that intermediate signals are interpretable and individual system parts can be trained and tested on their own. As separation front-ends, deep clustering (DPCL) [Isik2016_SinglechannelMultispeakerSeparation], permutation invariant training (PIT) and the currently most promising TasNet architecture [Luo2017_TasNetTimedomainAudio] have been combined with different speech recognizers [Settle2018_EndtoendMultispeakerSpeech, Menne2019_AnalysisDeepClustering, Yu2017_RecognizingMultitalkerSpeech, Qian2018_SinglechannelMultitalkerSpeech, Bahmaninezhad2019_ComprehensiveStudySpeech, vonNeumann2019_EndtoendTrainingTime]. In general, the experiments suggest that it is beneficial to have dedicated parts for separation and recognition and to fine-tune them jointly end-to-end.

Most separation and multi-talker ASR approaches assume that the number of talkers is known [Kolbaek2017_MultitalkerSpeechSeparation, Luo2017_TasNetTimedomainAudio, Luo2019_DualpathRNNEfficient, Seki2018_PurelyEndtoendSystem, Chang2019_EndtoendMonauralMultispeaker, Settle2018_EndtoendMultispeakerSpeech, Qian2018_SinglechannelMultitalkerSpeech], although this is not the case in realistic situations. Source number counting has been combined with source separation in iterative speech extraction [Kinoshita2018_ListeningEachSpeaker, Takahashi2019_RecursiveSpeechSeparation] and model selection schemes [Nachmani2020_VoiceSeparationUnknown], but not yet with speech recognition.

We propose the first jointly optimized multi-talker ASR system for an unknown number of speakers by combining source separation and number counting techniques from [Luo2019_DualpathRNNEfficient, Kinoshita2018_ListeningEachSpeaker, Takahashi2019_RecursiveSpeechSeparation] with a single-speaker CTC/attention speech recognizer [Watanabe2017_HybridCTCAttention, Kim2017_JointCTCattentionBased, Watanabe2018_EspnetEndtoendSpeech, Xiao2018_HybridCTCattentionBased]. To do that, we first investigate a dual-path RNN (DPRNN) TasNet separator for fixed numbers of speakers as a front-end for ASR. We jointly fine-tune the DPRNN with a pre-trained ASR system, which was shown to be effective in other scenarios [Heymann2017_BeamnetEndtoendTraining] and in multi-talker ASR [vonNeumann2019_EndtoendTrainingTime], to optimize the overall model performance. By doing so, we achieve a new state-of-the-art WER on WSJ0-2mix. We then integrate the DPRNN into the one-and-rest permutation invariant training (OR-PIT) architecture and extend it with elegant mechanisms for source number counting. The counting mechanisms show promising performance. Finally, we combine the OR-PIT with an ASR system to form our new multi-speaker ASR system for unknown numbers of speakers. Experiments show that our system generalizes to a larger number of speakers than it saw during training.

2 Source separation and counting

In the following two subsections we first introduce the baseline approach to source separation, where the maximum number of talkers is assumed to be known. Then, we generalize it to separate an a priori unknown number of concurrent talkers.

2.1 Known number of talkers: Dual-Path RNN-TasNet

A single-channel discrete-time speech mixture signal $y$ is modeled as a sum of $K$ single-talker speech signals $x_k$:

$$y = \sum_{k=1}^{K} x_k,$$

where $k$ is the talker index. Source separation is concerned with extracting the source signals $x_k$ from the mixture $y$.

We use the TasNet [Luo2017_TasNetTimedomainAudio, Luo2018_ConvTasNetSurpassingIdeal] with a DPRNN [Luo2019_DualpathRNNEfficient] as the separation network to obtain a fixed number of estimates $\hat{x}_k$ for the sources $x_k$, where the order of the estimates $\hat{x}_k$ may be permuted relative to the actual source signals.


It works by encoding the time-domain signal into a latent domain with a convolutional encoder, separating this representation with the DPRNN, and transforming it back to time domain with a de-convolutional decoder. The DPRNN models short- and long-term dependencies in an alternating manner by segmenting the input and skipping different numbers of time-steps in adjacent layers. It was shown in [Luo2019_DualpathRNNEfficient] to be superior to a separator based on 1-D convolutions used in the Conv-TasNet [Luo2018_ConvTasNetSurpassingIdeal].
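The segmentation behind the dual-path processing can be pictured with a small NumPy sketch. The helper name `segment`, the 50 % hop and the zero-padding of the tail are assumptions made for illustration and are not taken from this text. The two mean reductions only stand in for the recurrent layers that would run within a chunk (short-term) and across chunks (long-term).

```python
import numpy as np

def segment(features, chunk_size, hop):
    """Split a (time, feat) sequence into overlapping chunks.

    Returns an array of shape (num_chunks, chunk_size, feat). The tail is
    zero-padded so that every frame is covered (illustrative choice).
    """
    num_frames, feat_dim = features.shape
    num_chunks = max(1, int(np.ceil((num_frames - chunk_size) / hop)) + 1)
    padded_len = (num_chunks - 1) * hop + chunk_size
    padded = np.zeros((padded_len, feat_dim), dtype=features.dtype)
    padded[:num_frames] = features
    return np.stack(
        [padded[i * hop:i * hop + chunk_size] for i in range(num_chunks)]
    )

frames = np.arange(1000 * 4, dtype=float).reshape(1000, 4)
chunks = segment(frames, chunk_size=100, hop=50)

# Intra-chunk path: neighbouring frames inside each chunk (axis 1).
intra_view = chunks.mean(axis=1)
# Inter-chunk path: frames that are `hop` steps apart across chunks (axis 0).
inter_view = chunks.mean(axis=0)
```

Alternating between the two axes lets each layer see either adjacent frames or frames that are a full hop apart, which is one way to read the "skipping different numbers of time-steps" description above.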

2.1.1 Time-domain training objective

In recent work, the training objective often is to maximize the scale-invariant source-to-distortion ratio (SI-SDR, sometimes called SI-SNR) [Luo2017_TasNetTimedomainAudio, Luo2018_ConvTasNetSurpassingIdeal] by minimizing the negative SI-SDR. By setting the scaling factor of the SI-SDR to one and removing terms that do not depend on the estimate $\hat{x}_k$, the time-domain logarithmic MSE loss can be obtained [Heitkaemper2019_DemystifyingTasNetDissecting]:

$$\mathcal{L}^{\text{LMSE}}(\hat{x}_k, x_k) = 10 \log_{10} \lVert x_k - \hat{x}_k \rVert^2$$
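The relation between the two objectives can be checked numerically. The following is a minimal sketch, assuming a summed (not averaged) squared error inside the logarithm; the function names are illustrative and do not come from any toolkit.

```python
import numpy as np

def si_sdr(estimate, target):
    """Scale-invariant SDR in dB (higher is better)."""
    alpha = np.dot(estimate, target) / np.dot(target, target)
    scaled_target = alpha * target
    noise = estimate - scaled_target
    return 10 * np.log10(np.sum(scaled_target ** 2) / np.sum(noise ** 2))

def log_mse(estimate, target):
    """Time-domain logarithmic MSE in dB (lower is better)."""
    return 10 * np.log10(np.sum((target - estimate) ** 2))

rng = np.random.default_rng(0)
target = rng.standard_normal(8000)
estimate = target + 0.1 * rng.standard_normal(8000)

# Rescaling the estimate leaves the SI-SDR unchanged but changes the log-MSE.
print(si_sdr(estimate, target), si_sdr(0.5 * estimate, target))
print(log_mse(estimate, target), log_mse(0.5 * estimate, target))
```

The log-MSE penalizes a wrong output level, which is the scale-invariance that is given up when the scaling factor is fixed to one.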


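The missing scale-invariance did not show negative effects on the separation performance [Heitkaemper2019_DemystifyingTasNetDissecting]. To be able to handle different numbers of speakers with such a model, it is common for frequency-domain separators to set the missing outputs to silence [Kolbaek2017_MultitalkerSpeechSeparation], i.e., to use all-zero target signals for them. To use silent targets here, the loss has to be prevented from going to negative infinity for a perfect reconstruction, which would mask the loss terms of all other target signals. This can be done by, e.g., adding a positive constant $\varepsilon$ to the argument of the logarithm:

$$\mathcal{L}^{\varepsilon}(\hat{x}_k, x_k) = 10 \log_{10} \left( \lVert x_k - \hat{x}_k \rVert^2 + \varepsilon \right),$$

where $\hat{x}_k$ denotes the estimate of the source signal $x_k$.

The total loss is calculated in a permutation-invariant manner with the set $\mathcal{P}_K$ of all permutations of length $K$:

$$\mathcal{L}^{\text{PIT}} = \min_{\pi \in \mathcal{P}_K} \sum_{k=1}^{K} \mathcal{L}\left(\hat{x}_{\pi(k)}, x_k\right)$$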
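A permutation-invariant loss with a silent target can be sketched in a few lines of NumPy. The constant `eps=1.0`, the unnormalized sum over outputs and the function names are assumptions made for this illustration only.

```python
import itertools
import numpy as np

def thresholded_log_mse(estimate, target, eps=1.0):
    # eps keeps the loss finite for a perfect (e.g. silent) reconstruction;
    # its value here is an assumption for illustration.
    return 10 * np.log10(np.sum((target - estimate) ** 2) + eps)

def pit_loss(estimates, targets, loss_fn=thresholded_log_mse):
    """Return (minimal loss, best permutation) over all output orderings."""
    num_outputs = len(estimates)
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_outputs)):
        # perm[k] is the index of the estimate assigned to target k.
        loss = sum(loss_fn(estimates[p], targets[k]) for k, p in enumerate(perm))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 4000))
# Two-speaker mixture handled by a three-output model: third target is silence.
targets = np.concatenate([sources, np.zeros((1, 4000))])
estimates = targets[[2, 0, 1]] + 0.01 * rng.standard_normal((3, 4000))
loss, perm = pit_loss(estimates, targets)
```

The exhaustive search grows factorially with the number of outputs, which is unproblematic for the two or three outputs considered here.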


2.2 Unknown number of talkers: OR-PIT

Conventional source separators are limited to a fixed number of talkers. An arbitrary number of talkers can be handled by iterative source extraction approaches [Kinoshita2018_ListeningEachSpeaker, Takahashi2019_RecursiveSpeechSeparation]. Instead of directly separating the mixture into one stream for each talker, they apply a network repeatedly to extract one talker at a time.

Following the OR-PIT [Takahashi2019_RecursiveSpeechSeparation] scheme, an iterative source extractor is a two-output separator, in our case a DPRNN-TasNet, trained to output one talker at its primary output and the sum of all remaining talkers at its secondary output, so that the secondary output can be fed back to extract the next talker, as visualized in the left part of Fig. 1. It is first trained on clean mixtures with a one-and-rest objective: a time-domain loss compares the primary output with one talker and the secondary output with the sum of the remaining talkers, and the choice of the extracted talker with the smallest loss is used. It is then fine-tuned by feeding its own secondary outputs back as additional training data.
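The one-and-rest idea can be sketched as follows. The unweighted sum of the two loss terms, the small `eps` inside the logarithm and all names are assumptions of this sketch; the original objective may weight the terms differently.

```python
import numpy as np

def log_mse(estimate, target, eps=1e-8):
    return 10 * np.log10(np.sum((target - estimate) ** 2) + eps)

def or_pit_loss(primary, secondary, sources, loss_fn=log_mse):
    """One-and-rest loss: try each talker as the extracted one, keep the best.

    The unweighted sum of the two terms is an assumption of this sketch.
    """
    total = np.sum(sources, axis=0)
    losses = [
        loss_fn(primary, source) + loss_fn(secondary, total - source)
        for source in sources
    ]
    best = int(np.argmin(losses))
    return losses[best], best

rng = np.random.default_rng(1)
sources = rng.standard_normal((3, 4000))
primary = sources[1] + 0.01 * rng.standard_normal(4000)
secondary = sources[0] + sources[2] + 0.01 * rng.standard_normal(4000)
loss, extracted = or_pit_loss(primary, secondary, sources)
```

Only as many candidates as there are talkers have to be compared, instead of all permutations.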

The number of talkers can be determined by counting the number of iterations required until the secondary output does not contain speech. The authors of [Takahashi2019_RecursiveSpeechSeparation] use AlexNet [Sutskever2012_ImagenetClassificationDeep] as an external classifier to detect the absence of speech. A stopping criterion can, however, be integrated into the separator by using thresholding or an additional output flag, as inspired by


Assuming that a speech signal contains a certain minimal amount of average energy and the network is able to suppress speech well enough, absence of speech in the secondary output $\hat{r}$ can be detected by

$$\frac{1}{T} \lVert \hat{r} \rVert^2 < \theta,$$

with $T$ being the length of the signal and the threshold value $\theta$ being determined manually. Models that use this stopping criterion are called “threshold” models.
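Counting by iterating until the residual falls below an energy threshold can be illustrated with a toy loop. The extractor below is a hypothetical stand-in (it peels off the strongest frequency bin rather than a talker), and the threshold value and iteration cap are assumptions for the example.

```python
import numpy as np

def toy_extractor(signal):
    # Hypothetical stand-in for the two-output separator: the strongest
    # frequency bin plays the role of "one talker", the rest is fed back.
    spectrum = np.fft.rfft(signal)
    peak = np.argmax(np.abs(spectrum))
    primary_spectrum = np.zeros_like(spectrum)
    primary_spectrum[peak] = spectrum[peak]
    primary = np.fft.irfft(primary_spectrum, n=len(signal))
    return primary, signal - primary

def extract_and_count(mixture, extractor, threshold=1e-3, max_iterations=10):
    """Extract one source per iteration until the residual is (almost) silent."""
    residual = mixture
    estimates = []
    for _ in range(max_iterations):
        primary, rest = extractor(residual)
        estimates.append(primary)
        if np.mean(rest ** 2) < threshold:  # average energy of the residual
            break
        residual = rest
    return estimates

t = np.arange(256)
mixture = sum(a * np.sin(2 * np.pi * f * t / 256)
              for a, f in [(1.0, 5), (0.8, 12), (0.6, 30)])
estimates = extract_and_count(mixture, toy_extractor)
num_sources = len(estimates)
```

The number of iterations equals the estimated source count, so a threshold that is too high stops early and a threshold that is too low adds spurious sources.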

An elegant way to integrate the classifier into the separator is to let the separation network predict a stop flag. This is done by adding an additional scalar output $\hat{f}$ to the network. The output size of the DPRNN is increased, and the newly added DPRNN outputs are transformed by a fully connected layer to a scalar for each time step. The stop flag per utterance is obtained with, e.g., an average pooling over time followed by a sigmoidal function. The estimated flag $\hat{f}$ is trained to indicate when to stop iterating using a binary cross-entropy objective

$$\mathcal{L}^{\text{flag}} = -\left[ f \log \hat{f} + (1 - f) \log (1 - \hat{f}) \right].$$

The target flag $f$ is set to $1$ when the second output should be empty and to $0$ otherwise.
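The pooling and training objective of the stop flag can be sketched as below. The per-frame scores are made-up numbers, and the convention that a flag close to one means "stop" is an assumption of this example.

```python
import numpy as np

def stop_flag(frame_scores):
    """Average-pool per-frame scalars over time, then squash to (0, 1)."""
    pooled = np.mean(frame_scores)
    return 1.0 / (1.0 + np.exp(-pooled))

def binary_cross_entropy(estimated_flag, target_flag, tiny=1e-12):
    # Clipping avoids log(0) for saturated flags.
    estimated_flag = np.clip(estimated_flag, tiny, 1.0 - tiny)
    return -(target_flag * np.log(estimated_flag)
             + (1.0 - target_flag) * np.log(1.0 - estimated_flag))

frame_scores = np.full(200, 3.0)   # hypothetical per-frame network outputs
flag = stop_flag(frame_scores)     # close to 1: "stop iterating"
loss_stop = binary_cross_entropy(flag, target_flag=1.0)
loss_continue = binary_cross_entropy(flag, target_flag=0.0)
```

At inference time the flag would be compared with a decision boundary such as 0.5, which is again an assumption and not stated in the text.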

3 Joint optimization of source separation and speech recognition

We propose to jointly fine-tune a source separation front-end (FE) with source number counting and a speech recognition back-end. The FE is either a TasNet or an OR-PIT and the back-end is a CTC/attention speech recognizer similar to [Watanabe2017_HybridCTCAttention, Seki2018_PurelyEndtoendSystem] taken from the ESPnet toolkit [Watanabe2018_EspnetEndtoendSpeech]. We replace the original feature extraction with differentiable STFT-based features. The back-end is trained with a multi-target loss function that weights the CTC and attention losses with the factor $\lambda$:

$$\mathcal{L}^{\text{ASR}} = \lambda \mathcal{L}^{\text{CTC}} + (1 - \lambda) \mathcal{L}^{\text{att}}$$



Joint fine-tuning is straightforward for the TasNet-based system without counting. The gradients are propagated from the back-end into the FE, and the FE and back-end losses are combined into a single training objective.


We solve the permutation problem with the FE signal-level loss.

For the iterative OR-PIT system, there are different options that arise from the fact that the secondary output can contain more than one talker and is, thus, unusable for training the back-end. We formulate two training schemes, namely the single- and multi-iteration schemes. The single-iteration scheme unrolls a single iteration of the OR-PIT FE and uses the primary output to train the back-end. The secondary output is optimized using only the signal-level loss. Here, the OR-PIT always sees unprocessed data, so there is a mismatch to evaluation, where its own output is fed back. To mitigate this, the model can be unrolled in the multi-iteration scheme, where the secondary output is used as the input for the following iteration and all primary outputs are used to train the ASR back-end.
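The difference between the two schemes comes down to which front-end outputs reach the back-end. The sketch below uses a purely symbolic front-end (a tuple of speaker labels instead of audio); `unroll` and `toy_front_end` are hypothetical names for illustration.

```python
def unroll(mixture, front_end, num_talkers, multi_iteration):
    """Collect the front-end outputs that would be passed to the ASR back-end.

    front_end maps a signal to (primary, secondary). In the single-iteration
    scheme only the first primary output is used; in the multi-iteration
    scheme the secondary output is fed back and all primaries are collected.
    """
    iterations = num_talkers if multi_iteration else 1
    asr_inputs, residual = [], mixture
    for _ in range(iterations):
        primary, residual = front_end(residual)
        asr_inputs.append(primary)
    return asr_inputs

def toy_front_end(mixture):
    # Symbolic stand-in: the "mixture" is a tuple of speaker labels, the
    # primary output is the first label and the secondary output the rest.
    return mixture[0], mixture[1:]

mixture = ("spk1", "spk2", "spk3")
single = unroll(mixture, toy_front_end, num_talkers=3, multi_iteration=False)
multi = unroll(mixture, toy_front_end, num_talkers=3, multi_iteration=True)
```

In the multi-iteration case every later iteration receives the front-end's own output as input, which matches the situation at evaluation time.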

4 Experiments

We evaluate our systems in terms of average improvement of signal-to-distortion ratio (SDRi, provided by the mir_eval toolbox [Raffel2014_MirEvalTransparent]) [Vincent2006_PerformanceMeasurementBlind, Fevotte2005_BSSEVALToolbox], CER, WER and counting accuracy on the WSJ, WSJ0-2mix, WSJ0-3mix and WSJ0-4mix databases [Paul1992_DesignWallStreet, Isik2016_SinglechannelMultispeakerSeparation, Kolbaek2017_MultitalkerSpeechSeparation] with a sample rate of . We use WSJ0-4mix as created in [Takahashi2019_RecursiveSpeechSeparation]. The experiments on source separation and number counting are conducted on the min subsets of the WSJ0-mix databases that contain full overlap only. Speech recognition is evaluated on the max subset where no utterances are truncated.

4.1 Source Separation

number of talkers
1 2 3 4


TasNet 2 talker
TasNet 3 talker
TasNet 2+3 talker
OR-PIT stop-flag
OR-PIT threshold
RSAN stop-flag [Kinoshita2018_ListeningEachSpeaker]
Conv-TasNet-OR-PIT [Takahashi2019_RecursiveSpeechSeparation]
Original DPRNN-TasNet [Luo2019_DualpathRNNEfficient]
Table 1: Source separation performance in SDRi in dB for varying numbers of talkers given the oracle number of sources. The mixtures contain full overlap only (min subset).

We first assume that the number of talkers is known. We train all source separators on long signal segments to comply with [Luo2017_TasNetTimedomainAudio, Luo2019_DualpathRNNEfficient]. We choose the DPRNN parameters according to the original paper [Luo2019_DualpathRNNEfficient], i.e., six blocks with two BLSTM with 128 units, an encoder window size of 16 and a chunk size of 100. We train two DPRNN-TasNet for two or three speakers only, respectively, and one with three outputs for two and three speakers where the last target is set to silence for two-speaker mixtures. Our OR-PIT is trained on single-, two- and three-speaker recordings. We use for “DPRNN-TasNet 2+3 talker” to cope with silent output, and for all other models.

The separation performance displayed in Table 1 is evaluated given the oracle number of talkers, so that counting issues are not reflected in the SDRi. The OR-PIT is forced to the correct number of iterations. For the TasNet, only the outputs with the largest energy are considered for evaluation.
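Selecting the outputs with the largest energy for a known number of talkers can be sketched as follows; the function name and the toy signals are illustrative assumptions.

```python
import numpy as np

def select_active_outputs(outputs, num_talkers):
    """Keep the num_talkers outputs with the largest energy, in original order."""
    energies = np.sum(np.asarray(outputs) ** 2, axis=-1)
    keep = np.sort(np.argsort(energies)[-num_talkers:])
    return [outputs[i] for i in keep], keep

# Three-output model applied to a two-talker mixture: the middle output is
# nearly silent and is dropped before scoring.
outputs = np.stack([np.ones(100), 0.01 * np.ones(100), 2.0 * np.ones(100)])
selected, kept_indices = select_active_outputs(outputs, num_talkers=2)
```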

Our two-talker TasNet (DPRNN-TasNet 2 talker) achieves a separation performance slightly better than [Luo2019_DualpathRNNEfficient] on two talkers, and the same architecture works well on three talkers (DPRNN-TasNet 3 talker). Neither generalizes well to smaller numbers of talkers. The models specialized to one specific number are not able to produce a better reconstruction quality than what they learned during training. The model trained on both two- and three-speaker mixtures handles both types of mixtures well and generalizes even to single-speaker recordings.

The OR-PIT can handle one to three speakers slightly better than the TasNet. It especially has advantages in the three-speaker case, as also observed in [Takahashi2019_RecursiveSpeechSeparation], where the OR-PIT network extracts one speaker at a time while the TasNet has to solve the more complex task of separating three speakers simultaneously. Its iterative structure allows the model to generalize to some extent to a larger number of talkers. The performance, however, drops compared to the original Conv-TasNet-OR-PIT. This might be caused by slight differences in the training scheme.

4.2 Source Number Counting

number of talkers
1 2 3 4


TasNet 2+3 talker
OR-PIT stop-flag
OR-PIT threshold
RSAN stop-flag [Kinoshita2018_ListeningEachSpeaker]
Model selection [Nachmani2020_VoiceSeparationUnknown]
Table 2: Source number counting accuracy in % for varying numbers of talkers. The mixtures contain full overlap only (min subset). (: Threshold not optimized for this number of sources)
WSJ0-2mix WSJ0-3mix WSJ0-4mix
(1) DPRNN-TasNet (2 talker) + ASR
(2) ++ fine-tune ASR
(3) ++ fine-tune FE + ASR
(4) DPRNN-TasNet (3 talker) + ASR
(5) ++ fine-tune ASR
(6) ++ fine-tune FE + ASR
(7) DPRNN-TasNet (2+3 talker) + ASR
(8) ++ fine-tune ASR
(9) ++ fine-tune FE + ASR
(11) ++ VAD
(12) +++ single-iteration fine-tune ASR
(13) +++ single-iteration fine-tune FE + ASR
(14) +++ multi-iteration fine-tune ASR
(15) +++ multi-iteration fine-tune FE + ASR
(16) End-to-end ASR [Chang2019_EndtoendMonauralMultispeaker]
(17) Conv-TasNet + ASR [vonNeumann2019_EndtoendTrainingTime]
(18) End-to-end ASR [Zhang2020_ImprovingEndtoEndSingleChannel]
(19) Oracle ASR result based on ground truth data
Table 3: Recognition performance of the multi-talker ASR systems for varying numbers of speakers given the oracle number of sources. The mixture signals contain full utterances (max subset). (“FE”: front-end, “ASR”: speech recognition part.)

The same models are evaluated for counting accuracy in Table 2. Counting for the 2+3 talker TasNet is performed by an energy-based threshold on its third output. All thresholds are chosen to maximize the accuracy on “cv” and “dev” data.
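Choosing a threshold that maximizes accuracy on held-out data can be done with a simple search. The grid of candidates and the toy energies below are assumptions for illustration; the text does not state how the search was carried out.

```python
import numpy as np

def pick_threshold(residual_energies, is_empty, candidates):
    """Pick the energy threshold with the best accuracy on held-out data."""
    accuracies = [
        np.mean((residual_energies < threshold) == is_empty)
        for threshold in candidates
    ]
    best = int(np.argmax(accuracies))
    return candidates[best], accuracies[best]

# Made-up average energies of the output used for counting, with labels that
# say whether this output should have been empty.
energies = np.array([1e-5, 3e-5, 2e-4, 4e-2, 9e-2, 2e-1])
is_empty = np.array([True, True, True, False, False, False])
candidates = np.logspace(-6, 0, 13)
threshold, accuracy = pick_threshold(energies, is_empty, candidates)
```

A threshold tuned this way is tied to the energy levels seen in the held-out data, which fits the observation that it may not transfer to other numbers of talkers.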

The “DPRNN 2+3 talker” model performs well in discriminating between two- and three-talker mixtures, but the threshold does not generalize well to the single-talker scenario due to higher energy levels in the estimated signals. A counting accuracy of over is possible with an adjusted threshold, but this degrades the accuracy on two- and three-speaker mixtures.

The OR-PIT detects the number of sources correctly in most cases. The threshold model performs slightly better for the numbers of talkers it was trained on, but the stop-flag model generalizes better to larger numbers of talkers.

Compared to the model selection scheme in [Nachmani2020_VoiceSeparationUnknown], our system performs better in source counting but worse in separation, where their model achieves SI-SDR improvement and our system on two talkers. Similar modifications as [Nachmani2020_VoiceSeparationUnknown] could be applied to the OR-PIT to possibly achieve a similar separation performance.

Although not directly comparable, the stopping criteria introduced in this work seem to perform comparably to the AlexNet classifier with accuracy for fewer than three talkers [Takahashi2019_RecursiveSpeechSeparation] while being simpler.

4.3 Single-talker speech recognition

We use a configuration similar to [Seki2018_PurelyEndtoendSystem] without the speaker dependent layers for the speech recognizer. This results in two CNN layers followed by two BLSTMP layers with units for the encoder, one LSTM layer with 300 units for the decoder and a feature dimension of . We use a location-aware attention mechanism and ADADELTA [Zeiler2012_ADADELTAadaptivelearning] as optimizer. Decoding is performed with a word-level RNN language model. Our ASR system achieves a WER of on the WSJ eval92 set.

4.4 Two-talker speech recognition

We evaluate the speech recognition performance of the TasNet-based recognizer in rows 1 to 9 of Table 3. We observe similar tendencies as were found in [Heymann2017_BeamnetEndtoendTraining] and [vonNeumann2019_EndtoendTrainingTime]. Fine-tuning the ASR part gives a greater improvement than fine-tuning the TasNet but fine-tuning both jointly gives the best performance on two speakers (compare rows 1 to 3) of relative improvement compared to not fine-tuning. The final WER (3) forms a new state-of-the-art result on WSJ0-2mix and is a huge improvement over previous techniques [vonNeumann2019_EndtoendTrainingTime]. Similar to the separation evaluation, the two-speaker TasNet performs better than the two- and three-speaker model (rows 7 to 9).

4.5 Multi-talker speech recognition with counting

After having shown that the joint system performs well with a separator for a fixed number of talkers, we evaluate joint training for the OR-PIT in rows 10 to 15 of Table 3. We use the stop-flag OR-PIT due to better counting accuracy on max data.

The OR-PIT shows an odd behavior that is not present in the TasNet. In regions where only a single talker is active, one output should be silent. This works for the TasNet, but the OR-PIT outputs the speech signal scaled down heavily. The ASR system picks it up and creates insertion errors. An energy-based voice activity detection (VAD) improves the WER substantially (11) for two talkers compared to the system without fine-tuning (10).
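An energy-based VAD that suppresses heavily attenuated leftover speech can be sketched as follows. The frame length, the relative threshold and the hard gating are assumptions for illustration and not the settings used in the experiments.

```python
import numpy as np

def energy_vad(signal, frame_length=400, relative_threshold=0.01):
    """Zero out frames whose energy is small relative to the loudest frame.

    frame_length and relative_threshold are illustrative values.
    """
    num_frames = len(signal) // frame_length
    frames = signal[:num_frames * frame_length].reshape(num_frames, frame_length)
    energies = np.mean(frames ** 2, axis=1)
    active = energies > relative_threshold * np.max(energies)
    gated = (frames * active[:, None]).reshape(-1)
    return gated, active

rng = np.random.default_rng(0)
speech = rng.standard_normal(4000)
leak = 0.01 * rng.standard_normal(4000)  # heavily scaled-down leftover signal
gated, active = energy_vad(np.concatenate([speech, leak]))
```

Frames below the threshold are set to zero before recognition, so the back-end no longer sees the scaled-down copy that would otherwise cause insertions.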

Fine-tuning the ASR system with the single-iteration scheme (12) can improve the performance, but the mismatch created by fine-tuning the FE this way degrades the performance compared to fine-tuning ASR (13). Using the multi-iteration fine-tuning scheme (14 and 15) can prevent this, but the overall performance is not comparable to the two-speaker TasNet model, although the OR-PIT achieves a better separation performance, according to Tab. 1. This might be caused by the more complex training procedure.

The performance on larger numbers of speakers is listed in the last two columns in Table 3. The OR-PIT again performs better than the TasNet on more than two speakers. It generalizes to the unseen larger number of four speakers. Contradicting the results for two speakers, fine-tuning just the ASR part performs best for more than two speakers (rows 5, 8, 12 and 14). These results suggest that it is beneficial to only fine-tune the ASR back-end in some scenarios, especially when the separation task is more challenging, while fine-tuning the front-end contributes to over-fitting to a specific scenario.

5 Conclusion

We build the first joint end-to-end system that performs source number counting and multi-talker speech recognition. Our model specialized to two talkers provides a new state-of-the-art WER on the WSJ0-2mix database. The OR-PIT-based system, which can count the number of speakers, performs better than the TasNet-based model on three speakers and generalizes well to a number of speakers not seen during training, i.e., four speakers. Evaluation of the source counting abilities shows very promising performance.

6 Acknowledgements

Computational resources were provided by the Paderborn Center for Parallel Computing.