Multi-Channel End-to-End Neural Diarization with Distributed Microphones

10/10/2021
by   Shota Horiguchi, et al.

Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace the Transformer encoders in EEND with two types of encoders that process a multi-channel input: spatio-temporal and co-attention encoders. Both are independent of the number and geometry of microphones and thus suitable for distributed microphone settings. We also propose a model adaptation method that uses only single-channel recordings. With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input. We also showed that the proposed method performed well even when spatial information is uninformative despite multi-channel inputs being available, such as in hybrid meetings in which the utterances of multiple remote participants are played back from the same loudspeaker.

1 Introduction

Meeting transcription is one of the largest application areas of speech-related technologies. One important component of meeting transcription is speaker diarization [1, 25], which assigns a speaker attribute to each transcribed utterance. In recent years, many end-to-end diarization methods have been proposed [9, 12, 22, 13] and have achieved accuracy comparable to that of modular methods [18, 24]. However, most of these attempts are based on single-channel recordings, in which no spatial information is available. Some meeting transcription systems are based on distributed microphones [5, 4, 34, 11], which enable flexible placement of recording devices and sound collection over a wide area. If diarization accuracy can be improved by extending diarization methods to distributed microphone settings, such methods will be compatible with those systems.

Even when multi-channel inputs are given, diarization methods that rely heavily on spatial information are sometimes inoperative. The best examples are direction-of-arrival (DOA) based diarization methods [3, 14]. Due to COVID-19, meetings are now often held remotely or as a hybrid of in-person and virtual gatherings. In hybrid meetings, remote attendees' utterances are played back from a single loudspeaker, so DOA is no longer a clue to distinguish those speakers. To cope with this situation, spatial information needs to be properly incorporated into speaker-characteristic-based speaker diarization.

In this paper, we propose multi-channel end-to-end neural diarization (EEND) that is invariant to the number and order of channels for distributed microphone settings. We replaced the Transformer encoders in the conventional EEND [12, 13] with two types of multi-channel encoders. One is a spatio-temporal encoder [30, 31], in which cross-channel and cross-frame self-attentions are stacked. In the context of speech separation, it was reported that this encoder performs well when the number of microphones is large but degrades significantly when the number of microphones is small [30]. The other is a co-attention encoder, in which both single- and multi-channel inputs are used and cross-frame co-attention weights are calculated from the multi-channel input. It contains only cross-frame attention; thus, its performance does not depend heavily on the number of channels. We further propose to adapt multi-channel EEND using only single-channel real recordings without losing the ability to benefit from spatial information when a multi-channel input is given during inference. We show that the proposed method can utilize spatial information and outperform the conventional EEND.

2 Related work

Some multi-channel diarization methods are fully based on DOA estimation [3, 14], but they assume that different speakers are never in the same direction and are thus not appropriate for hybrid meetings. Therefore, spatial information needs to be incorporated into single-channel-based methods, as in, e.g., [2]. Another possible approach is to combine channel-wise diarization results using an ensemble method [28, 26], but this does not fully utilize spatial information. Some recent neural-network-based diarization methods utilize spatial information by aggregating multi-channel features. For example, online RSAN [16] uses inter-microphone phase difference features in addition to a single-channel magnitude spectrogram. However, the number of channels is fixed by the network architecture, which makes the method less flexible. Moreover, phase-based features are not suited to distributed microphone settings, in which clock drift exists between channels. Multi-channel target-speaker voice activity detection (TS-VAD) [22] combines embeddings extracted from the second-to-last layer of single-channel TS-VAD. Although it is flexible in terms of the number of channels because an attention-based combination is used, it requires an external diarization system that provides an initial i-vector estimate for each speaker.

If we broaden our view to speech processing beyond diarization, there are several neural-network-based end-to-end multi-channel methods that are invariant to the number of channels, e.g., for speech recognition [23, 32, 7], separation [19, 20, 30, 10, 31], and dereverberation [33]. Many of them use attention mechanisms to handle an arbitrary number of channels. Our proposed method also uses attention-based multi-channel processing.

3 Conventional single-channel EEND

3.1 Formulation of EEND

In the EEND framework, the speech activities of multiple speakers are jointly estimated. Given $F$-dimensional acoustic features $X = [\mathbf{x}_1, \dots, \mathbf{x}_T] \in \mathbb{R}^{F \times T}$ for $T$ frames, we first apply a linear projection parameterized by $W_0 \in \mathbb{R}^{D \times F}$ and $\mathbf{b}_0 \in \mathbb{R}^{D}$ followed by layer normalization [6] to obtain $D$-dimensional frame-wise embeddings $E^{(0)} \in \mathbb{R}^{D \times T}$:

$E^{(0)} = \mathrm{LayerNorm}\left(W_0 X + \mathbf{b}_0 \mathbb{1}_T^\top\right)$,   (1)

where $\mathbb{1}_T$ is the $T$-dimensional all-one vector. The embeddings are further converted by $N$ stacked encoders, where the $n$-th encoder converts the frame-wise embeddings $E^{(n-1)}$ into embeddings $E^{(n)} \in \mathbb{R}^{D \times T}$ of the same dimension:

$E^{(n)} = \mathrm{Encoder}_n\left(E^{(n-1)}\right)$.   (2)

Finally, the frame-wise posteriors of the speech activities of $S$ speakers are estimated. In this paper, we used EEND-EDA [12, 13], in which speaker-wise attractors $A = [\mathbf{a}_1, \dots, \mathbf{a}_S] \in \mathbb{R}^{D \times S}$ are first calculated using an encoder-decoder based attractor calculation module (EDA) and the posteriors $P \in (0, 1)^{S \times T}$ are then estimated as

$A = \mathrm{EDA}\left(E^{(N)}\right)$,   (3)
$P = \sigma\left(A^\top E^{(N)}\right)$,   (4)

where $\sigma(\cdot)$ is the element-wise sigmoid function. A permutation-free objective is used for optimization as in previous studies [9, 12, 13].
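As a concrete reference for (1)–(4), the following is a minimal PyTorch sketch of the inference path, with the encoder stack and the EDA module replaced by stand-ins; the shapes, module choices, and variable names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of (1)-(4). Shapes follow the text: F-dim features, T frames,
# D-dim embeddings, S speakers. The encoder stack and EDA are stand-ins here,
# NOT the authors' implementation.
F_DIM, D, T, S, N_BLOCKS = 345, 256, 500, 2, 4

x = torch.randn(T, F_DIM)                      # acoustic features for one recording

proj = nn.Linear(F_DIM, D)                     # W_0, b_0 in (1)
norm = nn.LayerNorm(D)
e = norm(proj(x))                              # E^(0), shape (T, D) -- eq. (1)

encoders = nn.ModuleList([nn.Identity() for _ in range(N_BLOCKS)])  # placeholder Encoder_n
for enc in encoders:                           # eq. (2)
    e = enc(e)

attractors = nn.Parameter(torch.randn(S, D))   # stand-in for A = EDA(E^(N)) in (3)
posteriors = torch.sigmoid(e @ attractors.T)   # P = sigma(A^T E^(N)) in (4), shape (T, S)
print(posteriors.shape)                        # torch.Size([500, 2])
```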

3.2 Transformer encoder

Figure 1: Encoder blocks: (a) Transformer encoder, (b) spatio-temporal encoder, and (c) co-attention encoder. Each yellow area is skipped via a residual connection.

EEND-EDA uses a Transformer encoder [29] without positional encodings (Figure 1a) as $\mathrm{Encoder}_n$ in (2). Given $E^{(n-1)}$, the encoder converts it into $E^{(n)}$ as follows:

$\bar{E}^{(n-1)} = E^{(n-1)} + \mathrm{MA}\left(E^{(n-1)}, E^{(n-1)}, E^{(n-1)}; \lambda_n\right)$,   (5)
$E^{(n)} = \bar{E}^{(n-1)} + \mathrm{FF}\left(\bar{E}^{(n-1)}; \phi_n\right)$,   (6)

where $\lambda_n$ and $\phi_n$ are sets of parameters, and $\mathrm{MA}(\cdot)$ and $\mathrm{FF}(\cdot)$ denote multi-head scaled dot-product attention and a feed-forward network, respectively, each of which is formulated in the following subsections.

3.2.1 Multi-head scaled dot-product attention

Given a $D$-dimensional query $Q \in \mathbb{R}^{D \times T_q}$, key $K \in \mathbb{R}^{D \times T_k}$, and a $D'$-dimensional value $V \in \mathbb{R}^{D' \times T_k}$ as inputs, multi-head scaled dot-product attention is calculated as

$\mathrm{MA}\left(Q, K, V; \lambda\right) = W^{O}\left[H_1^\top, \dots, H_h^\top\right]^\top$,   (7)
$H_i = V_i \mathcal{A}_i \in \mathbb{R}^{\frac{D'}{h} \times T_q}$,   (8)
$\mathcal{A}_i = \mathrm{softmax}\left(\frac{K_i^\top Q_i}{\sqrt{D / h}}\right) \in \mathbb{R}^{T_k \times T_q}$,   (9)
$Q_i = W_i^{Q} Q, \quad K_i = W_i^{K} K$,   (10)
$V_i = W_i^{V} V$,   (11)

where $h$ is the number of heads, $i \in \{1, \dots, h\}$ is the head index, and $\mathrm{softmax}(\cdot)$ is the column-wise softmax function. The set of parameters $\lambda$ and the dimensions of its members are defined as

$\lambda = \left\{W^{O}\right\} \cup \left\{W_i^{Q}, W_i^{K}, W_i^{V}\right\}_{i=1}^{h}$,   (12)
$W_i^{Q}, W_i^{K} \in \mathbb{R}^{\frac{D}{h} \times D}, \quad W_i^{V} \in \mathbb{R}^{\frac{D'}{h} \times D'}, \quad W^{O} \in \mathbb{R}^{D' \times D'}$.   (13)
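For reference, the following is a small self-contained sketch of the multi-head scaled dot-product attention above, written with frame-major (T, D) tensors; the packed projection matrices and the toy dimensions are assumptions for illustration, not the authors' code.

```python
import math
import torch

def multihead_attention(q, k, v, w_q, w_k, w_v, w_o, n_heads):
    """Sketch of MA(Q, K, V) in (7)-(11) with (T, D)-shaped inputs.

    q: (Tq, D), k: (Tk, D), v: (Tk, Dv). The projection matrices are packed so
    that head i uses the i-th slice of the projected features. Illustrative only.
    """
    Tq, D = q.shape
    Tk, Dv = v.shape
    d_head = D // n_heads

    # Per-head projections, eqs. (10)-(11): shape (heads, T, d_head)
    qh = (q @ w_q).view(Tq, n_heads, -1).transpose(0, 1)
    kh = (k @ w_k).view(Tk, n_heads, -1).transpose(0, 1)
    vh = (v @ w_v).view(Tk, n_heads, -1).transpose(0, 1)

    # Scaled dot products; normalize over keys for each query frame, eq. (9)
    scores = qh @ kh.transpose(1, 2) / math.sqrt(d_head)   # (heads, Tq, Tk)
    weights = torch.softmax(scores, dim=-1)

    # Weighted sum of values, eq. (8); concatenate heads and project, eq. (7)
    heads = weights @ vh                                   # (heads, Tq, Dv/heads)
    concat = heads.transpose(0, 1).reshape(Tq, -1)         # (Tq, Dv)
    return concat @ w_o                                    # (Tq, Dv)

# Toy usage with illustrative dimensions
D, Dv, h, T = 256, 256, 4, 100
q = k = v = torch.randn(T, D)
params = [torch.randn(D, D), torch.randn(D, D), torch.randn(Dv, Dv), torch.randn(Dv, Dv)]
out = multihead_attention(q, k, v, *params, n_heads=h)
print(out.shape)  # torch.Size([100, 256])
```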

3.2.2 Feed-forward network

The feed-forward network consists of two fully connected layers:

$\bar{E} = \max\left(W_1 E + \mathbf{b}_1 \mathbb{1}_T^\top,\, 0\right)$,   (14)
$\mathrm{FF}\left(E; \phi\right) = W_2 \bar{E} + \mathbf{b}_2 \mathbb{1}_T^\top$,   (15)

where $W_1 \in \mathbb{R}^{D_\mathrm{ff} \times D}$ and $W_2 \in \mathbb{R}^{D \times D_\mathrm{ff}}$ are projection matrices, $\mathbf{b}_1 \in \mathbb{R}^{D_\mathrm{ff}}$ and $\mathbf{b}_2 \in \mathbb{R}^{D}$ are biases, and $\max(\cdot, 0)$ is the element-wise ramp (ReLU) function.
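Combining (5)–(6) with the feed-forward network above, one encoder block can be sketched as follows; torch.nn.MultiheadAttention is used for brevity instead of the explicit formulation in Section 3.2.1, and the hidden width d_ff is an illustrative choice, not a value taken from the paper.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of one Transformer encoder block, eqs. (5)-(6) and (14)-(15):
    residual multi-head self-attention followed by a residual two-layer
    feed-forward network with ReLU. No positional encoding is used."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),  # W_1, b_1 in (14)
            nn.ReLU(),                 # ramp function
            nn.Linear(d_ff, d_model),  # W_2, b_2 in (15)
        )

    def forward(self, e):                  # e: (batch, T, d_model)
        attn_out, _ = self.attn(e, e, e)   # MA(E, E, E) in (5)
        e = e + attn_out                   # residual connection
        return e + self.ff(e)              # FF with residual, eq. (6)

block = EncoderBlock()
out = block(torch.randn(8, 500, 256))
print(out.shape)  # torch.Size([8, 500, 256])
```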

4 Multi-channel EEND

To accept multi-channel inputs, we replaced the Transformer encoders in EEND-EDA with multi-channel encoders. In this paper, we investigated two types of encoders: a spatio-temporal encoder and a co-attention encoder.

4.1 Spatio-temporal encoder

The spatio-temporal encoder was originally proposed for speech separation with distributed microphones [30, 31]. It stacks cross-channel and cross-frame self-attention in one encoder block, as illustrated in Figure 1b. In the encoder, the frame-wise $M$-channel embeddings, which form a $D \times T \times M$ tensor, are first converted to a tensor of the same shape using cross-channel self-attention as

(16)
(17)

The tensor is then converted by cross-frame self-attention as

(18)
(19)

In the final encoder block, cross-frame self-attention is calculated over the embeddings averaged across channels so that the block outputs $E^{(N)} \in \mathbb{R}^{D \times T}$ for (3) and (4); i.e., the following are used instead of (18) and (19):

(20)
(21)

Speech activities are then calculated using (3) and (4). None of the calculations in (16)–(21) assume a specific number of channels or a specific microphone geometry, which makes this encoder independent of the number and geometry of microphones. Note that, following previous studies [30, 31], we did not include feed-forward networks in this encoder because we observed performance degradation with them.
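Since (16)–(21) are not reproduced above, the following sketch only illustrates the computation pattern just described: cross-channel self-attention over the M channels of each frame, followed by cross-frame self-attention over the T frames of each channel, with feed-forward networks omitted. The use of torch.nn.MultiheadAttention and all names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Pattern of one spatio-temporal encoder block: cross-channel
    self-attention per frame, then cross-frame self-attention per channel.
    Feed-forward networks are omitted, as stated in the text. Illustrative only."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_channel = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_frame = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, z):                      # z: (T, M, d_model) for one recording
        # Cross-channel self-attention: attend over the M channels of each frame.
        cc, _ = self.cross_channel(z, z, z)    # batch dimension = T frames
        z = z + cc
        # Cross-frame self-attention: attend over the T frames of each channel.
        zt = z.transpose(0, 1)                 # (M, T, d_model), batch dimension = M channels
        cf, _ = self.cross_frame(zt, zt, zt)
        return (zt + cf).transpose(0, 1)       # back to (T, M, d_model)

block = SpatioTemporalBlock()
out = block(torch.randn(500, 4, 256))          # T=500 frames, M=4 channels
print(out.shape)                               # torch.Size([500, 4, 256])
```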

4.2 Co-attention encoder

The spatio-temporal encoder includes cross-channel self-attention, the performance of which highly depends on the number of channels. Therefore, we also propose an encoder based only on cross-frame attention, which is characterized by its use of co-attention. The encoder accepts two inputs: frame-wise embeddings $E^{(n-1)} \in \mathbb{R}^{D \times T}$ and frame-channel-wise embeddings $Z^{(n-1)} \in \mathbb{R}^{D' \times T \times M}$, and converts them into $E^{(n)}$ and $Z^{(n)}$ as follows:

(22)
(23)
(24)
(25)
(26)

where each operation is parameterized analogously to those in Section 3.2. The single-channel input is converted by multi-head co-attention in (22), multi-head attention in (23), and a feed-forward network in (24). Each channel of the multi-channel input is first converted by multi-head co-attention in (25), whose attention weights are shared with those in (22), and then processed by a feed-forward network in (26).

Multi-head co-attention is similar to the multi-head attention in (7)–(11), but the attention weights are calculated from the multi-channel input as

(27)
(28)

Here, the per-head queries and keys are calculated from each channel as in (10), and the values are calculated as in (11). Note that the corresponding parameter sets are shared among channels.

After the final encoder block, the two outputs are concatenated as

(29)

to calculate speech activities using (3) and (4).
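Because (22)–(28) are likewise not reproduced above, the following is only one possible reading of multi-head co-attention (a single head is shown): attention weights are computed from the multi-channel input with query/key projections shared across channels, the per-channel logits are averaged before the softmax (an assumption, not necessarily the paper's exact combination in (27)–(28)), and the resulting weights are applied to values from the single-channel input.

```python
import math
import torch
import torch.nn as nn

class CoAttentionSketch(nn.Module):
    """One possible reading of multi-head co-attention (single head shown):
    attention weights come from the multi-channel input Z, values from the
    single-channel input E. Averaging the per-channel logits is an assumption
    made for illustration only."""

    def __init__(self, d=256, d_mc=64):
        super().__init__()
        self.w_q = nn.Linear(d_mc, d_mc, bias=False)   # shared across channels
        self.w_k = nn.Linear(d_mc, d_mc, bias=False)   # shared across channels
        self.w_v = nn.Linear(d, d, bias=False)         # acts on the single-channel input

    def forward(self, e, z):           # e: (T, d), z: (T, M, d_mc)
        q = self.w_q(z)                # (T, M, d_mc)
        k = self.w_k(z)
        # Per-channel attention logits, then average over channels (assumption).
        logits = torch.einsum('tmd,smd->mts', q, k) / math.sqrt(q.shape[-1])  # (M, T, T)
        weights = torch.softmax(logits.mean(dim=0), dim=-1)                   # (T, T)
        # Apply the shared weights to the single-channel values.
        return weights @ self.w_v(e)   # (T, d)

att = CoAttentionSketch()
out = att(torch.randn(500, 256), torch.randn(500, 4, 64))
print(out.shape)  # torch.Size([500, 256])
```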

4.3 Domain Adaptation

EEND performance can be improved by domain adaptation using real recordings. However, the amount of real recordings is usually limited, even more so when distributed microphones are used. Therefore, it would be useful if multi-channel EEND could be adapted to the target domain with only single-channel recordings. To ensure that adaptation using only single-channel recordings does not destroy the ability to benefit from multi-channel recordings, we propose to update only the channel-invariant part of the model. For the spatio-temporal encoder, we freeze the parameters of the cross-channel self-attention in (16) and (17). For the co-attention encoder, we freeze the parameters related to multi-channel processing, i.e., those used in (25)–(28).
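In PyTorch terms, adapting only the channel-invariant part amounts to disabling gradients for the multi-channel sub-modules before fine-tuning; the keyword strings below are hypothetical placeholders, not the authors' module names.

```python
def freeze_multichannel_parts(model, keywords=("cross_channel", "co_attention_mc")):
    """Freeze parameters whose names contain any of the given keywords, so that
    single-channel adaptation updates only the channel-invariant part.
    The keyword strings are hypothetical placeholders for the actual module names."""
    for name, param in model.named_parameters():
        if any(key in name for key in keywords):
            param.requires_grad = False


# Usage sketch (model is any multi-channel EEND variant implemented in PyTorch):
#   freeze_multichannel_parts(model)
#   optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```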

Dataset               Conversation  Recording  #Mic  #Sessions  Avg. duration  Overlap ratio
SRE+SWBD-train        Simulated     Simulated    10     20,000              –              –
SRE+SWBD-eval         Simulated     Simulated    10        500              –              –
SRE+SWBD-eval-hybrid  Simulated     Simulated    10        500              –              –
CSJ-train             Simulated     Recorded      9        100              –              –
CSJ-eval              Simulated     Recorded      9        100              –              –
CSJ-dialog            Real          Recorded      9         58              –              –

Table 1: Two-speaker conversational datasets.
                        SRE+SWBD-eval                       SRE+SWBD-eval-hybrid
Method                  1ch    2ch    4ch    6ch    10ch    1ch    2ch    4ch    6ch    10ch
1ch + posterior avg.    5.13   4.60   4.31   4.19   4.10    6.07   5.68   5.42   5.38   5.33
Spatio-temporal        32.86   2.97   1.49   1.19   1.03   34.73  10.60   8.65   8.36   8.21
Spatio-temporal†        6.34   3.02   1.56   1.28   1.07    8.11   8.23   6.98   6.72   6.40
Co-attention            7.23   2.83   1.85   1.59   1.50    9.03   7.53   6.82   6.51   6.65
Co-attention†           4.68   2.52   1.71   1.40   1.23    5.73   5.34   5.05   5.18   5.35
† Channel dropout was used during training.

Table 2: DERs (%) on SRE+SWBD-eval and SRE+SWBD-eval-hybrid.
Table 3: DERs (%) on CSJ-eval and CSJ-dialog.
                              CSJ-eval                            CSJ-dialog
Method                Adapt   1ch    2ch    4ch    6ch    9ch     1ch    2ch    4ch    6ch    9ch
1ch + posterior avg.  None   11.17   9.44   8.94   8.89   8.44   28.15  26.01  25.56  24.74  24.87
1ch + posterior avg.  1ch     3.27   2.31   2.25   2.05   1.75   22.56  20.82  20.34  19.68  20.25
Spatio-temporal       None   10.98  10.20   4.29   3.27   2.63   36.13  45.19  36.48  37.14  37.63
Spatio-temporal       1ch     3.44   1.60   1.34   1.07   1.13   20.06  20.02  17.83  16.19  19.74
Spatio-temporal       1ch†    3.64   1.78   1.64   1.27   1.32   20.57  19.02  17.37  15.49  18.70
Spatio-temporal       4ch     3.82   1.06   0.61   0.43   0.39   21.01  15.87  14.21  15.71  14.20
Co-attention          None    9.49   3.36   1.42   1.40   0.94   27.96  22.52  19.37  18.23  17.99
Co-attention          1ch     2.75   1.41   0.75   0.63   0.52   23.49  22.83  20.70  17.59  15.77
Co-attention          1ch†    3.26   1.46   0.68   0.48   0.42   22.45  17.90  15.53  14.34  14.05
Co-attention          4ch     3.31   1.19   0.57   0.40   0.39   21.42  17.51  14.95  14.21  13.87
† Adapted only the channel-invariant part of each model.
Figure 2: Peak VRAM usage during training with a batch size of 64.

5 Experiment

5.1 Datasets

For the experiments, we created three fully simulated two-speaker conversational datasets using the NIST Speaker Recognition Evaluation (2004–2006, 2008) (SRE), Switchboard-2 (Phase I–III), and Switchboard Cellular (Part 1, 2) (SWBD) corpora. To emulate reverberant environments, we generated room impulse responses (RIRs) using gpuRIR [8]. Following the procedure in [17], we sampled 200 rooms for each of three room sizes: small, medium, and large. In each room, a table was randomly placed, 10 speakers were randomly placed around the table, and 10 microphones were randomly placed on the table. To create SRE+SWBD-train and SRE+SWBD-eval, two-speaker conversations were simulated following [9], and then the RIRs of a randomly selected room and two speaker positions were convolved with them to obtain 10-channel mixtures. The MUSAN corpus [27] was used to add noise to each mixture. SRE+SWBD-eval-hybrid was created from the same utterances as SRE+SWBD-eval, but the two speakers were placed at the same position. This dataset was designed to mimic hybrid meetings, in which multiple speakers' utterances are played back from a single loudspeaker.
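The mixing step described above, i.e., convolving each source with the RIRs of its position for every microphone and adding noise, can be sketched as follows; RIR generation with gpuRIR and the MUSAN noise selection are outside the snippet, and the array shapes and the SNR handling are assumptions for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_mixture(sources, rirs, noise, snr_db=20.0):
    """Create a multi-channel mixture from single-channel sources.

    sources: list of S 1-D arrays (dry speaker signals)
    rirs:    array of shape (S, M, L) -- one RIR per speaker position and microphone
             (e.g., generated with gpuRIR; not shown here)
    noise:   1-D array at least as long as the mixture (e.g., from MUSAN)
    snr_db:  illustrative signal-to-noise ratio, not a value from the paper
    """
    n_mics = rirs.shape[1]
    length = max(len(s) for s in sources) + rirs.shape[2] - 1
    mixture = np.zeros((n_mics, length))
    for s, src in enumerate(sources):
        for m in range(n_mics):
            rev = fftconvolve(src, rirs[s, m])          # reverberant image at mic m
            mixture[m, :len(rev)] += rev
    # Scale the noise to the desired SNR and add it to every channel.
    speech_pow = np.mean(mixture ** 2)
    noise_seg = noise[:length]
    noise_pow = np.mean(noise_seg ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return mixture + gain * noise_seg

# Toy usage with random signals
sources = [np.random.randn(16000), np.random.randn(12000)]
rirs = np.random.randn(2, 10, 4000) * 0.01
mix = simulate_mixture(sources, rirs, np.random.randn(30000))
print(mix.shape)  # (10, 19999)
```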

We also prepared three real-recorded datasets based on the Corpus of Spontaneous Japanese (CSJ) [21]: CSJ-train, CSJ-eval, and CSJ-dialog. For CSJ-train and CSJ-eval, 100 two-speaker conversations were simulated using single-speaker recordings from the CSJ training and evaluation sets, respectively. For CSJ-dialog, we directly used the dialog portion of CSJ. To record each session, we distributed nine smartphone devices on a tabletop in a meeting room and placed four loudspeakers around the table. We played back the two speakers' utterances from two of the four loudspeakers, selected at random, and recorded them with the smartphone devices. The recorded signals were roughly synchronized by maximizing the correlation coefficient; neither clock drift nor frame dropping was compensated for.

All the experiments were based on two-speaker mixtures because our scope was the investigation of multi-channel diarization. Note that EEND-EDA can also be used when the number of speakers is unknown [12, 13].

5.2 Settings

As inputs to the single-channel baseline model [12, 13], 23-dimensional log-mel filterbanks were extracted for each frame, followed by splicing (±7 frames, i.e., 15 frames in total) and subsampling by a factor of 10, resulting in a 345-dimensional feature at one tenth of the original frame rate. For the spatio-temporal model, we extracted features from each channel in the same manner. For the co-attention model, the 345-dimensional features were averaged across channels and used as the single-channel input. As the multi-channel input, the log-mel filterbanks of the 15 spliced frames were averaged and then subsampled; thus, a 23-dimensional feature per channel was obtained at the same subsampled frame rate. We set the embedding dimensionalities to $D = 256$ and $D' = 64$, i.e., the 345-dimensional features were first converted to 256 dimensions via (1), and the 23-dimensional features were converted to 64 dimensions in the same manner. For each model, four of the encoder blocks illustrated in Figure 1 were stacked.
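The splicing and subsampling arithmetic works out to 23 log-mel bins × 15 spliced frames = 345 dimensions, with one spliced vector kept every 10 frames. A small sketch, assuming a ±7-frame context (inferred from 345 / 23 = 15) and simple edge padding at the boundaries:

```python
import numpy as np

def splice_and_subsample(logmel, context=7, factor=10):
    """Splice +/-`context` frames around each frame and subsample by `factor`.

    logmel: (T, 23) log-mel filterbank features
    returns: (T // factor, 23 * (2 * context + 1)) spliced features,
             i.e., 345-dimensional vectors for the settings in the text.
    """
    T, n_mels = logmel.shape
    padded = np.pad(logmel, ((context, context), (0, 0)), mode="edge")
    spliced = np.stack(
        [padded[t : t + 2 * context + 1].reshape(-1) for t in range(T)]
    )                                   # (T, 345)
    return spliced[::factor]            # keep one spliced vector every `factor` frames

feats = splice_and_subsample(np.random.randn(1000, 23))
print(feats.shape)                      # (100, 345)
```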

Each model was first trained on SRE+SWBD-train for 500 epochs with the Adam optimizer [15] using the Noam scheduler [29] with 100,000 warm-up steps. At each iteration, four of the ten channels were randomly selected and used for training. The models were then evaluated on SRE+SWBD-eval and SRE+SWBD-eval-hybrid using 1-, 2-, 4-, 6-, and 10-channel inputs. Each model was further adapted on CSJ-train for 100 epochs with the Adam optimizer with a fixed learning rate. The adapted models were evaluated on CSJ-eval and CSJ-dialog using 1-, 2-, 4-, 6-, and 9-channel inputs. To evaluate the conventional EEND-EDA [12, 13] with multi-channel inputs, we first found the optimal speaker permutation between the results from each channel using correlation coefficients of the posteriors and then averaged the posteriors across channels. To prevent the models from becoming overly dependent on spatial information, we also introduced channel dropout, in which a multi-channel input is randomly reduced to a single channel during training; the channel dropout ratio was set to 0.1. Each method was evaluated using the diarization error rate (DER) with collar tolerance applied.
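The posterior-averaging baseline for the conventional EEND-EDA can be sketched as follows for a small number of speakers: each channel's posteriors are aligned to a reference channel by choosing the speaker permutation that maximizes the summed correlation coefficients, and the aligned posteriors are then averaged. The greedy per-channel alignment against channel 0 is an assumption for illustration.

```python
import itertools
import numpy as np

def average_posteriors(posteriors):
    """Average per-channel diarization posteriors after permutation alignment.

    posteriors: array of shape (M, T, S) -- per-channel speech-activity posteriors.
    Each channel is aligned to channel 0 by choosing the speaker permutation
    that maximizes the summed correlation coefficients (greedy, per channel).
    """
    M, T, S = posteriors.shape
    ref = posteriors[0]
    aligned = [ref]
    for m in range(1, M):
        best_perm, best_score = None, -np.inf
        for perm in itertools.permutations(range(S)):
            score = sum(
                np.corrcoef(ref[:, s], posteriors[m][:, perm[s]])[0, 1]
                for s in range(S)
            )
            if score > best_score:
                best_perm, best_score = perm, score
        aligned.append(posteriors[m][:, list(best_perm)])
    return np.mean(aligned, axis=0)     # (T, S)

avg = average_posteriors(np.random.rand(4, 500, 2))
print(avg.shape)                        # (500, 2)
```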

5.3 Results

Table 2 shows the DERs on SRE+SWBD-eval and SRE+SWBD-eval-hybrid. On SRE+SWBD-eval, both the spatio-temporal and co-attention models outperformed the single-channel model with posterior averaging. Comparing the two multi-channel models, the spatio-temporal model degraded DER significantly with single-channel inputs. Channel dropout mitigated this degradation, but the co-attention model still outperformed the spatio-temporal model when the number of channels was small. On SRE+SWBD-eval-hybrid, the co-attention model always achieved the same or better DERs than the single-channel model. This means that the lack of usable spatial information does not degrade the diarization performance of the co-attention model.

Table 3 shows the DERs on CSJ-eval and CSJ-dialog. The evaluation was based on the models trained with channel dropout. Without adaptation, the co-attention model generalized well. The performance of all models improved through adaptation, regardless of whether the data used for adaptation were 1ch or 4ch. Of course, both the spatio-temporal and co-attention models benefit more from 4ch adaptation; however, it is worth noting that they can still utilize the spatial information provided by multi-channel inputs even if only 1ch recordings are used for adaptation. By freezing the parameters related to cross-channel calculation during 1ch adaptation, we obtained almost the same DERs as those of 4ch adaptation with the co-attention model, while this did not work well for the spatio-temporal model.

Finally, Figure 2 shows the peak VRAM usage with a batch size of 64. The VRAM usage of the co-attention model increased more slowly than that of the spatio-temporal model as the number of microphones increased because its multi-channel processing part is based on layers with fewer units. Thus, the co-attention model can be trained with a larger number of channels.

6 Conclusion

In this paper, we proposed a multi-channel end-to-end neural diarization method for distributed microphones. We replaced the Transformer encoders in the conventional EEND with two types of multi-channel encoders, each of which achieved better DERs with multi-channel inputs than the conventional EEND on both simulated and real-recorded datasets. We also proposed a model adaptation method that uses only single-channel recordings and achieved DERs comparable to those obtained with multi-channel adaptation data.

References

  • [1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals (2012) Speaker diarization: a review of recent research. IEEE TASLP 20 (2), pp. 356–370. Cited by: §1.
  • [2] X. Anguera, C. Wooters, and J. Hernando (2007) Acoustic beamforming for speaker diarization of meetings. IEEE TASLP 15 (7), pp. 2011–2022. Cited by: §2.
  • [3] S. Araki, M. Fujimoto, K. Ishizuka, H. Sawada, and S. Makino (2008) A DOA based speaker diarization system for real meetings. In HSCMA, pp. 29–32. Cited by: §1, §2.
  • [4] S. Araki, N. Ono, K. Kinoshita, and M. Delcroix (2018) Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer. In ICASSP, pp. 5694–5698. Cited by: §1.
  • [5] S. Araki, N. Ono, K. Kinoshita, and M. Delcroix (2017) Meeting recognition with asynchronous distributed microphone array. In ASRU, pp. 32–39. Cited by: §1.
  • [6] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. In NIPS 2016 Deep Learning Symposium, Cited by: §3.1.
  • [7] F. Chang, M. Radfar, A. Mouchtaris, and M. Omologo (2021) Multi-channel transformer transducer for speech recognition. In INTERSPEECH, pp. 296–300. Cited by: §2.
  • [8] D. Diaz-Guerra, A. Miguel, and J. R. Beltran (2021) gpuRIR: a python library for room impulse response simulation with GPU acceleration. Multimedia Tools and Applications 80 (4), pp. 5653–5671. Cited by: §5.1.
  • [9] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe (2019) End-to-end neural speaker diarization with permutation-free objectives. In INTERSPEECH, pp. 4300–4304. Cited by: §1, §3.1, §5.1.
  • [10] N. Furnon, R. Serizel, I. Illina, and S. Essid (2021) Distributed speech separation in spatially unconstrained microphone arrays. In ICASSP, pp. 4490–4494. Cited by: §2.
  • [11] S. Horiguchi, Y. Fujita, and K. Nagamatsu (2020) Utterance-wise meeting transcription system using asynchronous distributed microphones. In INTERSPEECH, pp. 344–348. Cited by: §1.
  • [12] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu (2020) End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors. In INTERSPEECH, pp. 269–273. Cited by: §1, §1, §3.1, §5.1, §5.2, §5.2.
  • [13] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and P. Garcia (2021) Encoder-decoder based attractor calculation for end-to-end neural diarization. Note: arXiv:2106.10654 Cited by: §1, §1, §3.1, §5.1, §5.2, §5.2.
  • [14] K. Ishiguro, T. Yamada, S. Araki, T. Nakatani, and H. Sawada (2011) Probabilistic speaker diarization with bag-of-words representations of speaker angle information. IEEE TASLP 20 (2), pp. 447–460. Cited by: §1, §2.
  • [15] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §5.2.
  • [16] K. Kinoshita, M. Delcroix, S. Araki, and T. Nakatani (2020) Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system. In ICASSP, pp. 381–385. Cited by: §2.
  • [17] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017) A study on data augmentation of reverberant speech for robust speech recognition. In ICASSP, pp. 5220–5224. Cited by: §5.1.
  • [18] F. Landini, J. Profant, M. Diez, and L. Burget (2022) Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks. Computer Speech & Language 71, pp. 101254. Cited by: §1.
  • [19] Y. Luo, E. Ceolini, C. Han, S. Liu, and N. Mesgarani (2019) FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. In ASRU, pp. 260–267. Cited by: §2.
  • [20] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka (2020) End-to-end microphone permutation and number invariant multi-channel speech separation. In ICASSP, pp. 6394–6398. Cited by: §2.
  • [21] K. Maekawa (2003) Corpus of spontaneous Japanese: its design and evaluation. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Cited by: §5.1.
  • [22] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko (2020) Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario. In INTERSPEECH, pp. 274–278. Cited by: §1, §2.
  • [23] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey (2017) Multichannel end-to-end speech recognition. In ICML, pp. 2632–2641. Cited by: §2.
  • [24] T. J. Park, K. J. Han, M. Kumar, and S. Narayanan (2020) Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters 27, pp. 381–385. Cited by: §1.
  • [25] T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe, and S. Narayanan (2021) A review of speaker diarization: recent advances with deep learning. Note: arXiv:2101.09624 Cited by: §1.
  • [26] D. Raj, L. P. Garcia-Perera, Z. Huang, S. Watanabe, D. Povey, A. Stolcke, and S. Khudanpur (2021) DOVER-Lap: a method for combining overlap-aware diarization outputs. In SLT, pp. 881–888. Cited by: §2.
  • [27] D. Snyder, G. Chen, and D. Povey (2015) MUSAN: a music, speech, and noise corpus. Note: arXiv:1510.08484 Cited by: §5.1.
  • [28] A. Stolcke and T. Yoshioka (2019) DOVER: a method for combining diarization outputs. In ASRU, pp. 757–763. Cited by: §2.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §3.2, §5.2.
  • [30] D. Wang, Z. Chen, and T. Yoshioka (2020) Neural speech separation using spatially distributed microphones. In INTERSPEECH, pp. 339–343. Cited by: §1, §2, §4.1.
  • [31] D. Wang, T. Yoshioka, Z. Chen, X. Wang, T. Zhou, and Z. Meng (2021) Continuous speech separation with ad hoc microphone arrays. In EUSIPCO, Cited by: §1, §2, §4.1.
  • [32] X. Wang, R. Li, S. H. Mallidi, T. Hori, S. Watanabe, and H. Hermansky (2019) Stream attention-based multi-array end-to-end speech recognition. In ICASSP, pp. 7105–7109. Cited by: §2.
  • [33] Z. Wang and D. Wang (2020) Multi-microphone complex spectral mapping for speech dereverberation. In ICASSP, pp. 486–490. Cited by: §2.
  • [34] T. Yoshioka, D. Dimitriadis, A. Stolcke, W. Hinthorn, Z. Chen, M. Zeng, and X. Huang (2019) Meeting transcription using asynchronous distant microphones. In INTERSPEECH, pp. 2968–2972. Cited by: §1.