The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

by Ashish Arora, et al.
Johns Hopkins University

This paper summarizes the JHU team's efforts in tracks 1 and 2 of the CHiME-6 challenge for distant multi-microphone conversational speech diarization and recognition in everyday home environments. We explore multi-array processing techniques at each stage of the pipeline, such as multi-array guided source separation (GSS) for enhancement and acoustic model training data, posterior fusion for speech activity detection, PLDA score fusion for diarization, and lattice combination for automatic speech recognition (ASR). We also report results with different acoustic model architectures, and integrate other techniques such as online multi-channel weighted prediction error (WPE) dereverberation and variational Bayes-hidden Markov model (VB-HMM) based overlap assignment to deal with reverberation and overlapping speakers, respectively. As a result of these efforts, our ASR systems achieve a word error rate of 40.5% on the track 1 evaluation set and 67.5% on the track 2 evaluation set. These are improvements of 10.8% and 10.4% absolute over the challenge baselines for the respective tracks.








1 Introduction

Far-field automatic speech recognition (ASR) and speaker diarization are important areas of research and have many real-world applications, such as transcribing meetings [Hain2012TranscribingMW, Hori2012LowLatencyRM, Renals2017DistantSR, Yoshioka2018RecognizingOS] and in-home conversations [barker2018fifth]. Although deep learning methods (including end-to-end approaches) have achieved promising results for several tasks such as Switchboard [Xiong2016AchievingHP, wang2019espresso] and LibriSpeech [Lscher2019RWTHAS, Synnaeve2019EndtoendAF], their performance remains unsatisfactory in far-field conditions in real environments, such as the CHiME-5 dataset [barker2018fifth]. This can be attributed to: (i) noise and reverberation in the acoustic conditions, (ii) conversational speech and speaker overlaps, and (iii) challenge-specific restrictions such as insufficient training data.

Several advances have been made in the last decade to tackle the challenges posed by real, far-field speech. For ASR, this improvement can be attributed to improved neural network architectures [zorilatoshiba, kanda2018hitachi], effective data augmentation techniques [zorila2019investigation], and advances in speech enhancement [boeddeker2018front]. Previous work has tried tackling reverberation and noise present in far-field recordings by multi-style training with data augmentation via room impulse responses and background noises [ko2017study, wang2019jhu]. Recently, spectral augmentation has been successfully used for both end-to-end [park2019specaugment] and hybrid ASR systems [Zhou2020TheRA]. Adapting the acoustic model to the environment [seltzer2013investigation] and speaker [saon2013speaker] has also been studied. Another popular direction is front-end based approaches such as dereverberation [drude2018nara] and denoising through beamforming [anguera2007acoustic, Nakatani2019AUC], which utilize multi-microphone data. Far-field speaker diarization [Garca2019SpeakerDI] has also benefited from enhancement methods [Sun2020ProgressiveMN, Kataria2020FeatureEW] and approaches to handle overlapping speech [bullock2019overlap]. Recently, guided source separation (GSS) [boeddeker2018front] was proposed, which makes use of additional information, such as time and speaker annotations, for mask estimation. However, it requires a strong diarization system to perform good separation.

In this paper, we describe a multi-microphone multi-speaker ASR system developed using many of these methods for the CHiME-6 challenge [watanabe2020chime]. The challenge aims to improve speech recognition and speaker diarization for far-field conversational speech in challenging environments in a multi-microphone setting. The CHiME-6 data [barker2018fifth] contains a total of 20 4-speaker dinner party recordings. Each dinner party is two to three hours long and is recorded simultaneously on the participants’ ear-worn microphones and six microphone arrays placed in the kitchen, dining room, and the living room. The challenge consists of two tracks. Track 1 allows the use of oracle start and end times of each utterance, and speaker labels for each segment. This track focuses on core ASR techniques, and measures system performance in terms of transcription accuracy. Track 2 is a “diarization+ASR” track. It additionally requires end-pointing speech segments in the recording and assigning them speaker labels, i.e., diarization. To this end, VoxCeleb2 data [nagrani2017voxceleb] is permitted for training a diarization system, and concatenated minimum-permutation word error rate (cpWER) is used to measure speaker-attributed transcription accuracy.
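For track 2 scoring, cpWER concatenates each speaker's utterances and takes the speaker permutation that minimizes the total word error rate. A minimal sketch of this computation (not the challenge's official scorer; the word-level Levenshtein distance here ignores timing, and all names are illustrative):

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cp_wer(ref_by_spk, hyp_by_spk):
    """cpWER: concatenate each speaker's utterances, then take the
    speaker permutation that minimizes the aggregate word error rate."""
    refs = [r.split() for r in ref_by_spk]
    hyps = [h.split() for h in hyp_by_spk]
    n_ref_words = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(len(hyps)))
    )
    return best / n_ref_words
```

Because of the minimization over permutations, a diarization system that swaps speaker identities consistently is not penalized, but splitting one speaker across two labels is.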

Our system for Track 2, as shown in Figure 1, consists of three main modules: enhancement, diarization and recognition, described in Sections 2, 3 and 4 below. The enhancement module performs (i) dereverberation using online multichannel weighted prediction error (WPE), followed by (ii) denoising with a weighted delay-and-sum beamformer, and (iii) multi-array guided source separation (GSS), described below. In the diarization module, the beamformer outputs, one per array, are used for (i) speech activity detection and (ii) speaker diarization, both of which fuse information across arrays to improve accuracy, and (iii) overlap-aware variational Bayes hidden Markov model (VB-HMM) resegmentation to assign multiple speakers to overlapped speech regions. Speaker marks from this diarization module are used in the multi-array GSS (part of the enhancement module) to produce enhanced, speaker-separated waveforms. The recognition module processes the GSS output using (i) an acoustic and n-gram language model for ASR decoding, (ii) an RNN language model for lattice rescoring, and (iii) sMBR lattice combination. We augment the clean acoustic training data with dereverberated, beamformed and GSS-enhanced far-field data to match the test conditions.

The diarization module is replaced with oracle speech segments and speaker labels in our system for Track 1.

Figure 1: Overview of the decoding pipeline for track 2. For track 1, we use a similar system, with the exception that the diarization module (shown in the dotted box in the figure) is replaced with oracle speech segments and speaker labels.

2 Speech Enhancement

2.1 Dereverberation and Denoising

We used an online version of the publicly available NARA-WPE [drude2018nara] implementation of weighted prediction error (WPE) based dereverberation for multi-channel signals [nakatani2010speech] for all the channels in each array. This was followed by array-level weighted delay-and-sum beamforming using the BeamformIt tool [anguera2007acoustic]. All further processing was done on the dereverberated and beamformed signals.
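To illustrate the delay-and-sum idea (BeamformIt additionally estimates per-channel weights and a reference channel; this numpy sketch uses uniform weights and GCC-PHAT delay estimates, and is not the tool's actual algorithm):

```python
import numpy as np

def gcc_phat_delay(x, ref, max_lag=160):
    """Estimate the delay (in samples) of x relative to ref via GCC-PHAT."""
    n = len(x) + len(ref)
    X, R = np.fft.rfft(x, n), np.fft.rfft(ref, n)
    cross = X * np.conj(R)
    cross /= np.abs(cross) + 1e-12            # phase transform (whitening)
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def delay_and_sum(channels, max_lag=160):
    """Align each channel to channel 0 and average with uniform weights."""
    ref = channels[0]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        d = gcc_phat_delay(ch, ref, max_lag)
        out += np.roll(ch, -d)                # undo the estimated delay
    return out / len(channels)
```

In practice the alignment is done per analysis block with fractional delays and per-channel reliability weights; the sketch above only conveys the time-alignment-then-average structure.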

2.2 Guided Source Separation (GSS)

Multi-array GSS [boeddeker2018front, kanda2019guided] was applied to enhance target speaker speech signals. For track 1, we used oracle speech segmentations and speaker labels, while for track 2, we used the segmentation estimated by the speaker diarization module described in Section 3. In GSS, the source activity pattern for each speaker, derived from the segmentation, aids in resolving the permutation ambiguity. A context window is used to obtain an extended segment, which brings sufficient sparsity to the activity pattern of the target speaker and further reduces speaker permutation issues. Our experiments showed a 20-second context to be ideal. GSS gave relative WER improvements of 9.3% and 6.5% on Dev and Eval respectively for track 2 over the baseline delay-and-sum beamformer, as seen in Section 5, Table 8.
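Conceptually, the diarization output enters GSS as per-speaker frame activity patterns, and the target segment is extended by the context window. A simplified sketch, with an assumed frame shift and hypothetical helper names:

```python
import numpy as np

def activity_masks(segments, num_frames, frame_shift=0.016, num_spk=4):
    """Frame-level speaker activity from diarization output.
    segments: list of (speaker_index, start_sec, end_sec) tuples.
    frame_shift is an assumed STFT hop in seconds, not the paper's value."""
    mask = np.zeros((num_spk, num_frames), dtype=bool)
    for spk, start, end in segments:
        s = int(round(start / frame_shift))
        e = min(int(round(end / frame_shift)), num_frames)
        mask[spk, s:e] = True
    return mask

def extended_segment(start, end, context=20.0, total=None):
    """Widen the target segment by `context` seconds on each side,
    mirroring the 20-second context window used for multi-array GSS."""
    s = max(0.0, start - context)
    e = end + context if total is None else min(total, end + context)
    return s, e
```

The mask guides which time-frequency components belong to which speaker during mask estimation, so the quality of the diarization directly bounds the quality of the separation.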

3 Speaker Diarization

Figure 2: The two-pass speaker diarization module. Synchronous SAD marks across microphone arrays enables PLDA score fusion before AHC in first-pass diarization (dotted rectangle). Overlap detection (bottom) enables the second-pass resegmentation, initialized by the first-pass output (upper RTTM), to assign more than one speaker to overlapped-speech regions in the final output (right RTTM).

3.1 Speech Activity Detection

For speech activity detection (SAD), we first trained a neural network classifier to assign each frame in an utterance a label from C = {silence, speech, garbage}. We used the architecture shown in Table 1, consisting of time-delay neural network (TDNN) layers to capture long temporal contexts [Peddinti2015ATD], interleaved with stats-pooling layers to aggregate utterance-level statistics. To obtain classifier training targets, we used a speaker-independent GMM-HMM ASR system to align whole recordings with the training transcriptions. To leverage multiple channels, we carried out posterior fusion of the classifier outputs across all the arrays at test time: if p_a(c | t) denotes the classifier’s probability for class c at frame t based on array a, then the fused posterior is p(c | t) = F( p_1(c | t), …, p_A(c | t) ), where F is the fusion criterion applied element-wise across the A arrays.

We then post-processed the per-frame classifier outputs to enforce minimum and maximum speech/silence durations, by constructing a simple HMM whose state transition diagram encodes these constraints, treating the per-frame SAD posteriors like emission probabilities, and performing Viterbi decoding to obtain the most likely SAD label-sequence.

Layer    Layer context    Total context    Input x Output
tdnn1    [-, -]           5                -
tdnn2    [-, -]           8                -
tdnn3    {-, -, -, -}     17               -
stats1   [0, -)           -                -
tdnn4    {-, -, -, -}     -                -
stats2   [0, -)           -                -
tdnn5    {-, -, -, -}     -                - x 256
[Per-layer context values and dimensions were lost in extraction; only the layer order, the first three total contexts, and the final 256-dim output survive.]

Table 1: Neural network architecture for speech activity detection (SAD). T is the input length, C the set of output classes.

System            Dev                    Eval
                  MS    FA    Total      MS    FA    Total
Baseline (U06)    2.7   0.6   3.3        4.4   1.5   5.9
Posterior Mean    1.7   0.7   2.4        3.2   1.9   5.1
Posterior Max     1.1   0.8   1.9        2.4   2.8   5.2

Table 2: Reducing SAD errors by fusing posterior probabilities from multiple arrays. SAD errors comprise missed speech (MS) and false alarms (FA).

For the fusion criterion, we experimented with simple element-wise mean and max, and found them to be effective. From Table 2, we can see that applying posterior fusion across arrays improved the SAD error by approximately 34% relative. Since most of the gain comes from reduction in missed speech, this also positively impacts downstream recognition rate.
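In code, the array-level fusion reduces to an element-wise mean or max followed by renormalization (a sketch; the renormalization is for readability and does not change per-frame argmax decisions):

```python
import numpy as np

def fuse_posteriors(array_posts, criterion="max"):
    """array_posts: (num_arrays, T, num_classes) per-array SAD posteriors.
    Fuse element-wise across arrays and renormalize per frame."""
    P = np.asarray(array_posts)
    fused = P.max(axis=0) if criterion == "max" else P.mean(axis=0)
    return fused / fused.sum(axis=-1, keepdims=True)
```

The max criterion lets any single array "vouch" for speech, which matches the observation that most of its gain comes from reduced missed speech (at the cost of slightly more false alarms).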

We also tried microphone-level posterior fusion jointly across all the arrays, but it did not yield any improvements over using the beamformed signals as described above.

3.2 First Pass Speaker Diarization

Our first-pass diarization followed the method described in [sell2018diarization]. The test recordings were cut into overlapping 1.5 second segments with a 0.25 second stride [landini2019but] (the CHiME baseline system used a 0.75 second stride), an x-vector was extracted from each segment, and agglomerative hierarchical clustering (AHC) was performed on the x-vectors, using the probabilistic linear discriminant analysis (PLDA) score of each pair of x-vectors as their pairwise similarity.

The x-vector extractor we used is similar to that of [snyder2018x]: it comprises several TDNN layers and stats-pooling, and was trained, per Challenge stipulations, on only the VoxCeleb data [nagrani2017voxceleb]. Data augmentation was performed by convolving the audio with simulated room impulse responses [ko2017study] and by adding background noises from the CHiME-6 training data. PLDA parameters were trained on (x-vectors of) 100k speech segments from the roughly 40 speakers in the CHiME-6 training data.

Similar to the SAD posterior probability fusion described in Section 3.1, we investigated improving diarization by leveraging multiple microphone arrays at test time. To compute the pairwise similarity of two 1.5 second segments during clustering, we fused the PLDA scores for their x-vectors extracted from different arrays. We found that multi-array PLDA score fusion, specifically the element-wise maximum across arrays, provided noticeable gains.
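Schematically, the fusion-then-clustering step looks as follows; the tiny average-linkage AHC here stands in for Kaldi's implementation, and the stopping threshold and score values are illustrative:

```python
import numpy as np

def fuse_plda_scores(score_mats):
    """Element-wise max over per-array PLDA similarity matrices,
    shaped (num_arrays, N, N); symmetrized for safety."""
    S = np.max(np.asarray(score_mats), axis=0)
    return (S + S.T) / 2

def ahc(sim, threshold):
    """Naive average-linkage agglomerative clustering on a similarity
    matrix: repeatedly merge the most similar pair of clusters while
    their average similarity exceeds the stopping threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = np.mean([sim[i, j] for i in clusters[a] for j in clusters[b]])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] += clusters.pop(b)
    labels = np.empty(len(sim), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

Taking the element-wise max across arrays means a speaker pair judged similar by any one array is clustered together, which helps when a speaker is close to only one of the six arrays.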

3.3 Overlap-Aware Resegmentation

Since AHC is not designed to handle overlapping speakers, we resegmented the audio using an overlap-aware version of the VB-HMM of [Dez2018SpeakerDB]. Speaker labels from the first-stage diarization of Section 3.2 were used to initialize the per-frame speaker posterior matrix, also known as the q-matrix, and one iteration of VB-HMM inference was performed to convert this (binary) q-matrix into per-frame speaker probabilities. Separately, we trained an overlap detector (a 2-layer bidirectional LSTM with SincNet input features [Ravanelli2018SpeakerRF] and binary non-overlapped/overlapped speech output labels) using the CHiME-6 training data. The per-frame decisions of the overlap detector on the test data were then used to assign each frame to either the one or two most likely speakers according to the q-matrix, as described in [bullock2019overlap]. An unintended side effect of VB resegmentation was a significant number of very short segments, which severely impact the computational complexity of the GSS module (cf. Section 2.2). We therefore removed all segments shorter than 200ms.
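The overlap assignment itself can be sketched as picking, per frame, the top VB posterior speaker, plus the runner-up wherever the overlap detector fires (a simplification of the procedure in [bullock2019overlap]; function and variable names are illustrative):

```python
import numpy as np

def assign_speakers(q, overlap):
    """q: (T, S) per-frame speaker posteriors from VB resegmentation;
    overlap: length-T boolean output of the overlap detector.
    Returns one set of speaker indices per frame: the most probable
    speaker, plus the runner-up on frames flagged as overlapped."""
    order = np.argsort(-q, axis=1)
    out = []
    for t in range(len(q)):
        speakers = {int(order[t, 0])}
        if overlap[t]:
            speakers.add(int(order[t, 1]))
        out.append(speakers)
    return out
```

Turning these per-frame sets back into segments is what produces the very short segments mentioned above, hence the 200ms filtering.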

The overall diarization process is shown in Figure 2.

3.4 Diarization Performance

The CHiME Challenge provided two “ground truths” for diarization, i.e. two NIST-style rich transcription time marks (RTTM): one based on utterance-level time marks by human annotators (the Annotation RTTM), and another based on forced-alignment of the acoustics to the transcripts (the Alignment RTTM). The diarization output was scored against each RTTM using the DiHARD dscore toolkit, and diarization error rate (DER) as well as Jaccard error rate (JER) were computed.

System              Dev              Eval
                    DER     JER      DER     JER
Alignment RTTM
Baseline (U06)      63.42   70.83    68.20   72.54
PLDA Fusion         63.97   71.65    71.56   71.32
+ 0.25s stride      61.00   66.23    69.64   69.81
+ overlap assign.   58.18   59.92    69.92   65.64
Annotation RTTM
Baseline (U06)      61.62   69.84    62.01   71.43
PLDA Fusion         60.09   70.31    62.97   70.09
+ 0.25s stride      57.85   65.36    61.60   69.35
+ overlap assign.   50.43   57.81    58.26   64.38

Table 3: Diarization performance on track 2, showing the impact of the modifications of Sections 3.2 and 3.3 to the baseline x-vector/AHC system. Some of the improvement derives from the improved SAD of Section 3.1.

Table 3 shows improvements in diarization performance, relative to the AHC baseline, due to PLDA score fusion, a 0.25s stride (vs. 0.75s), and overlap-aware resegmentation. The gains from PLDA fusion across arrays appear modest and somewhat inconsistent relative to the single-array (U06) DER, but were consistently better than the average single-array DER across the six arrays. The shorter (0.25s) x-vector stride yielded a robust improvement, and the most significant improvement came from overlap detection and multiple-speaker assignment (denoted overlap assign.).

Finally, note that diarization performance seems to degrade (particularly for the Eval data) when scored against the Alignment RTTM, but shows significant improvement with the Annotation RTTM. The former stipulates tighter speech boundaries by, for instance, marking short pauses between words as non-speech. This increases the (measured) false alarm errors of our diarization module. However, retaining such pauses is beneficial for the downstream ASR task, by inducing more appropriate utterance segmentation.

4 Automatic Speech Recognition

We built a hybrid DNN-HMM system using algorithms and tools available in the Kaldi ASR toolkit.

4.1 Acoustic Modeling

Our baseline acoustic model (AM) was a 15-layer factorized time-delay neural network, or TDNN-F [PoveyEtAlIS2018]. Training data for this model comprised 80h of clean CHiME-6 audio from the 2 worn microphones of the speaker of each transcribed utterance, 320h from a 4x distortion of this clean audio using synthetic room impulse responses and CHiME-6 background noises, and 200h of raw far-field audio derived by randomly sampling utterances from the many arrays. These 600h were subjected to 0.9x and 1.1x speed perturbation to yield 1800h of AM training data. The baseline TDNN-F was trained with the lattice-free maximum mutual information (LF-MMI) objective. A context-dependent triphone HMM-GMM system was first trained using standard procedures in Kaldi, and frame-level forced alignments were generated between the speech and reference transcripts to guide LF-MMI training [PoveyEtAlIS2016].

Our CHiME-6 Challenge baseline system used this acoustic model and achieved a word error rate (WER) of 51.8% and 51.3% respectively on the CHiME-6 Dev and Eval test sets. We will designate this AM by (a) in this paper.

We then experimented with several other model architectures and data selection and augmentation methods.

Figure 3: Illustration of the acoustic model architecture, inputs and outputs. N-targets is the number of leaves (context dependent HMM states) in the bi-phone clustering trees.

4.1.1 Neural Network Architecture

We first trained an AM with 6 CNN layers comprising 3x3 convolutional kernels, and 16 TDNN-F layers, using the 1800h training set; designated model (b) in this paper, it lowered the WERs to 49.6% and 49.3% on the Dev and Eval sets, respectively. But we discovered that these models take an inordinate amount of time to train, limiting our ability for exploration.

To expedite the turnaround time of AM training experiments, and informed by experiments in data selection and augmentation described in Section 4.1.2, we created a 500h training set comprised of the 80h of clean ear-worn microphone audio and 320h of 4x distortions of the clean audio using synthetic room impulse responses and CHiME-6 background noises, as above, supplemented with 60h raw far-field audio derived by randomly sampling utterances from the many arrays and 40h from multi-array GSS enhancement, as described in Section 2.2.

With this 500h training set, we experimented with the following AM architectures.

  (c) 6 CNN + 12 TDNN-F, comprised of 6 CNN layers with 3x3 convolutional kernels, followed by 12 TDNN-F layers. Frequency subsampling by a factor of two was applied at the third and fifth convolutional layers. The TDNN-F layers were 1536-dimensional with a factorized (lower) rank of 160.

  (d) 6 CNN + 10 TDNN-F + 2 Self-attention, in which the last two TDNN-F layers were replaced by self-attention layers. Each self-attention layer had 15 attention heads with key and value dimensions of 40 and 80, respectively.

  (e) CNN + TDNN + LSTM, comprised of 6 CNN layers, followed by 9 TDNN and 3 LSTM layers, interleaved as 3 x (3 TDNN + 1 LSTM). The CNN layers are the same as above, the TDNN layers have 1024-dimensional hidden units, and each LSTM has a 1024-dim layer and 256-dim output and recurrent projections.

The input to these AMs was 40-dim log-Mel filterbank coefficients, and the outputs were context-dependent HMM states (senones) derived from a left-biphone tree. We determined empirically that a tree with 4500 leaves, an L2-regularization value of 0.03 for the CNN and TDNN-F layers, and 0.04 for the dense layers performed well. Using 6 epochs instead of 4 (as in the baseline) further improved the AMs.

As shown in Table 4, the CNN + TDNN-F architecture performed better than the others, so we selected it for subsequent experiments.

(Model)   Architecture                       Dev      Eval
(a)       15 TDNN-F                          51.8%    51.3%
(b)       6 CNN + 16 TDNN-F                  49.6%    49.3%
(c)       6 CNN + 12 TDNN-F                  48.3%    48.5%
(d)       6 CNN + 10 TDNN-F + 2 SA           49.9%    49.4%
(e)       6 CNN + 3 x (3 TDNN + 1 LSTM)      50.1%    49.8%

Table 4: Track 1 ASR WERs for the AM architectures described in Section 4.1.1. The CNN + TDNN-F configuration works best.

4.1.2 Training Data Selection and Augmentation

(Model)   Training Data                              Dev      Eval
(b)       1800h incl. reverb'ed & raw far-field      49.6%    49.3%
(f)       675h incl. only clean & enhanced           44.5%    44.9%
(g)       675h of model (f) + sp/sil probs           45.0%    45.3%
(h)       3 x 500h from models (c)-(e)               44.6%    45.4%

Table 5: Track 1 WERs when training the 6 CNN + 16 TDNN-F AM on only enhanced (and not reverberated, noisy) speech.

Inspired by the findings of [zorila2019investigation], which reported good performance on CHiME data not by using the matched far-field speech, or synthetically reverberated speech, for AM training, but instead by applying speech enhancement during both AM training and test, we created a new AM training data set as follows. We applied beamforming to the 4 microphone arrays to obtain 4 x 40h of speech, obtained another 40h from the multi-array GSS described in Section 2.2, and combined them with the 80h of ear-worn microphone data described above. Data clean-up was applied to this 280h data set to remove utterances that failed forced-alignment under a narrow beam, followed by speed perturbation, resulting in a 675h AM training data set.

Note from Table 5 that careful data selection indeed confirmed the findings of [zorila2019investigation], reducing the WER to 44.5% and 44.9% respectively on the Dev and Eval sets. The model that attained this is designated (f) in this paper.

We also trained two additional AMs, one with the same 675h of data as model (f), but with improved estimates of the inter-word silence and pronunciation probabilities (see [ChenEtAlIS2015]), and another with speed perturbation of the 500h data set of models (c)-(e). These models are designated (g) and (h) respectively in Table 5. While they do not perform better than model (f), they were used for system combination in track 1, as described later in Table 7 below.

4.1.3 Overlap-Aware Training

Since the CHiME data have a significant proportion of overlapped speech, we looked into providing the AM a 1-bit input indicating the presence/absence of overlap. This overlap bit was determined during training from the time alignments, and used alongside the 40-dim filter-bank features and 100-dim i-vectors. We first projected the overlap bit to 40-dim and the i-vector to 200-dim, and applied batch normalization and L2 regularization. We then combined the two with the filter-bank features to create a single-channel input to the first CNN layer of model (f) described above.
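A rough numpy sketch of this input assembly; the random matrices stand in for learned projections, and the actual Kaldi recipe composes these into a single-channel CNN input rather than a flat concatenation, so treat the layout here as illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
W_bit = rng.standard_normal((1, 40)) * 0.1     # learned 1 -> 40 projection in the real model
W_ivec = rng.standard_normal((100, 200)) * 0.1  # learned 100 -> 200 projection

def assemble_input(fbank, ivector, overlap_bit):
    """fbank: (T, 40) log-Mel features; ivector: (100,); overlap_bit: 0 or 1.
    Project the auxiliary inputs, broadcast them over time, and
    concatenate per frame, yielding a (T, 40 + 40 + 200) input."""
    T = len(fbank)
    bit_emb = (np.array([[overlap_bit]]) @ W_bit).repeat(T, axis=0)
    ivec_emb = (ivector[None, :] @ W_ivec).repeat(T, axis=0)
    return np.concatenate([fbank, bit_emb, ivec_emb], axis=1)
```

Batch normalization and regularization of the projected auxiliary features (as in the text) are omitted here for brevity.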

The resulting model is designated (i) in Table 6, which illustrates that knowledge of the presence of overlapped speech yields a modest WER improvement in track 1 conditions. While we could have used model (i) in track 2 by using the output of the overlap detector of Section 3.3 as the overlap bit, we did not have sufficient time to carefully conduct these experiments.

(Model)   Input features                         Dev      Eval
(f)       log-Mel filter-bank and i-vector       44.5%    44.9%
(i)       + overlapped speech indicator bit      44.4%    44.5%

Table 6: Track 1 WERs when the presence of overlapped speech is known to the acoustic model.

4.2 Language Modeling and Rescoring

We used the training transcriptions to build our language models (LMs). A 3-gram LM trained with the SRILM toolkit [stolcke2002srilm] was used in the first-pass decoding. For neural LMs, we used Kaldi to train recurrent neural network LMs (RNNLMs) [xu2018neural]. We performed pruned lattice rescoring [xu2018pruned] with a forward and a backward (reversing the text at the sentence level) LSTM: we first rescored with the forward LSTM, and then performed another rescoring on top of the rescored lattices using the backward LSTM. Both LSTMs are 2-layer projected LSTMs with hidden and projection layer dimensions of 512 and 128, respectively. We also used L2 regularization on the embedding, hidden, and output layers.

4.3 Lattice Combination

For track 1, we used the lattice combination method to combine four CNN + TDNN-F acoustic models, named (f), (g), (h) and (i) in Sections 4.1.2 and 4.1.3, all with WERs in the 44%-45% range, as seen in Tables 5 and 6. While models (f), (g) and (i) were trained with only worn-microphone and enhanced far-field data, model (h) included raw far-field data and synthetically reverberated/noisy data. For track 2, we used only one acoustic model, (f); we performed GSS twice with two individual arrays and once with all arrays together, followed by lattice combination. The diarization output was shared by all input array signals, and minimum Bayes risk decoding [xu2011minimum] was applied on top of the combined lattice.

5 CHiME Challenge Results

We show improvement in WER for tracks 1 and 2, obtained using the different modifications described in the previous sections, in Table 7 and Table 8, respectively.

From Table 7, we see that using a larger CNN + TDNN-F acoustic model improved performance over the TDNN-F baseline system by approx. 2% absolute. For further improvement, we trained the AM on data closer to test conditions by augmenting the clean worn-microphone data with beamformed and GSS-enhanced data [zorila2019investigation]. This provided almost 5% absolute WER reduction. Adding the overlap bit provided a modest improvement. Lattice combination and LM rescoring were individually effective, and their combination provided a significant WER reduction of 4% absolute. Cumulatively, we obtained more than 10% absolute WER improvement over the Challenge baseline system for track 1.

Table 8 shows a similar step-wise WER improvement for track 2. Again, we obtained a 2% improvement using the larger CNN + TDNN-F architecture. Using multi-array fusion techniques in SAD and first-pass diarization reduced missed speech and speaker confusion, resulting in an additional 4% and 2% improvement on the Dev and Eval sets, respectively. Using GSS and overlap-aware VB-HMM diarization provided significant improvements of 5%–8%, since they permit separating the overlapped speech for ASR. Finally, LM rescoring and lattice combination of ASR output from multiple input streams (but one ASR system) provided additional gains, similar to those observed in track 1 from combining multiple ASR systems. Cumulatively, we obtained an absolute WER improvement of 17% and 10% respectively on the Dev and Eval sets.

System components (acoustic model)            Dev      Eval
Baseline TDNN-F (a)                           51.8%    51.3%
CNN-TDNN-F (b)                                49.6%    49.4%
+ Data selection and augmentation (f)         44.5%    44.9%
+ Overlap feature (i)                         44.4%    44.5%
+ RNN LM rescoring (i)                        42.8%    42.9%
+ Lattice combination (f)+(g)+(h)+(i)         41.8%    42.1%
+ Lattice combination & LM rescoring          40.3%    40.5%

Table 7: Stepwise improvement in WER on Track 1.

System components (acoustic model)            Dev      Eval
Baseline TDNN-F (a)                           84.3%    77.9%
CNN-TDNN-F (f)                                82.5%    75.8%
+ Multi-array SAD & PLDA fusion (f)           78.3%    73.6%
+ Multi-array GSS (f)                         71.0%    68.8%
+ VB Overlap Assignment (f)                   69.3%    68.8%
+ RNN LM Rescoring (f)                        68.7%    67.9%
+ Lattice combination (f)                     68.3%    68.3%
+ Lattice combination & LM rescoring (f)      67.8%    67.5%

Table 8: Stepwise improvement in WER on Track 2.

6 Conclusion

We described our system for the sixth CHiME challenge for distant multi-microphone conversational speaker diarization and speech recognition in everyday home environments. We explored several methods to incorporate multi-microphone and multi-array information for speech enhancement, diarization, and ASR. For track 1, most of the improvements in WER were obtained from data selection and augmentation, and language model rescoring. Through careful training data selection, we reduced the training time of the system 3-fold while also improving its performance. In track 2, array fusion and overlap handling in the diarization module provided more accurate speaker segments than the Challenge baseline, resulting in improved speech enhancement via multi-array GSS. The gains from acoustic modeling and RNNLM rescoring developed in track 1 also largely carried over to track 2.

7 Acknowledgement

We thank Daniel Povey for the tremendous support throughout the system development, Jing Shi for help with data simulation in our trials with neural beamforming, and Yiming Wang for help with the lattice combination setup. This work was partially supported by grants from the JHU Applied Physics Laboratory, Nanyang Technological University, and the Government of Israel, and an unrestricted gift from Applications Technology (AppTek) Inc.