With the rapid popularization of smart terminal devices, such as smartphones, vehicle-mounted devices, and smart speakers, far-field automatic speaker verification (ASV) has been widely studied. Due to noise, reverberation, and speech signal attenuation, the performance of single-channel ASV drops sharply in the far-field environment, which remains challenging. To make these smart devices robust against noisy and reverberant environments, one approach is to equip them with multiple microphones so that the spectral and spatial diversity of the target and interference signals can be leveraged by beamforming approaches [1, 2, 3]. It has been demonstrated in [4, 5, 6, 7] that single-channel and multi-channel speech enhancement leads to substantial improvement of ASV.
The above research focuses mainly on single-channel or multi-channel front-ends on single devices, where prior knowledge of the microphone array (e.g. its geometry) is known to be important to the performance. Different from a fixed microphone array, an ad-hoc microphone array is composed of a group of microphone nodes randomly distributed in space, whose number and arrangement are unknown in advance. An advantage of the ad-hoc microphone array is that it allows users to form a virtual microphone array system flexibly from their own mobile devices. Recently, several studies on ad-hoc microphone arrays have been conducted. In , the authors proposed deep ad-hoc beamforming based on speaker extraction, which used a supervised channel selection framework and a deep-learning-based MVDR algorithm. In , the authors used the attention mechanism to obtain the relevant information between and within channels for multi-channel speech separation. In , the authors leveraged neural transformer architectures for multi-channel speech recognition. However, there are few studies on ASV with ad-hoc microphone arrays.  proposed a far-field multi-channel text-dependent speaker verification dataset together with a set of single-channel ASV baseline systems. To further improve the performance, the authors proposed a testing-background-aware enrollment augmentation strategy.  proposed an utterance-level cross-channel attention (UCCA) layer to fuse the utterance-level speaker embeddings from each channel into the final speaker representation.
However, the UCCA layer operates at the utterance level, which misses the spatial and temporal information between the channels. To remedy this problem, inspired by , in this paper we propose a frame-level multi-channel ASV for ad-hoc microphone arrays. The core idea is to fuse the channels of an ad-hoc array at the frame level before the pooling layer by stacking multiple spatio-temporal processing blocks (STB), where each STB consists of a cross-frame processing layer (CFL) and a cross-channel processing layer (CCL). Several techniques and training schemes, such as the sparsemax function  and pretraining techniques , are further added to the system to make it effective and efficient in handling large-scale ad-hoc microphone arrays.
We conducted extensive experiments on a simulated corpus generated from Librispeech with ad-hoc microphone arrays and on the semi-real Libri-adhoc40 corpus. Experimental results with ad-hoc microphone arrays of as many as 40 nodes demonstrate the effectiveness of the proposed method on both corpora. For example, on Libri-adhoc40, STB-ASV achieves a relative EER reduction of about 16.9% over UCCA, and about 33.5% over a single-channel ASV system whose microphone is physically the closest one to the speaker source, in the mismatched 30-channel test environment.
2 Proposed Method
Figure 1 shows the processing flow of our proposed multi-channel speaker verification model with ad-hoc microphone arrays. It consists of two elements: a frame-level feature processor and a spatio-temporal processor.
2.1 Spatio-temporal processing block
As shown in Figure 2, the spatio-temporal processing block consists of a cross-frame processing layer and a cross-channel processing layer. By stacking multiple blocks, the entire network can make use of information across both channels and frames. A cross-channel self-attention layer exploits the nonlinear spatial correlation between different channels, whose effectiveness has been demonstrated in .
2.1.1 Cross-channel processing layer (CCL)
We apply the self-attention mechanism along the channel dimension to integrate the information of different channels. The detailed calculation process is as follows:
In this paper, we define $T$ as the frame length, $C$ as the number of input channels, and $D$ as the feature dimension of each channel. For each layer, $\mathbf{X} \in \mathbb{R}^{T \times C \times D}$ is the input of the layer, where $\mathbf{X}_t \in \mathbb{R}^{C \times D}$ is the input feature matrix at time $t$. Let $H$ denote the number of attention heads. For each attention head, the input features are transformed into query, key and value as follows:

$$\mathbf{Q}_t^i = \mathbf{X}_t \mathbf{W}_Q^i + \mathbf{b}_Q^i, \quad \mathbf{K}_t^i = \mathbf{X}_t \mathbf{W}_K^i + \mathbf{b}_K^i, \quad \mathbf{V}_t^i = \mathbf{X}_t \mathbf{W}_V^i + \mathbf{b}_V^i \quad (1)$$

where the matrices $\mathbf{Q}_t^i$, $\mathbf{K}_t^i$, $\mathbf{V}_t^i$ denote the query, key, and value embeddings respectively, all of which are in $\mathbb{R}^{C \times d}$. For the $i$-th attention head at time $t$, $\mathbf{W}_Q^i, \mathbf{W}_K^i, \mathbf{W}_V^i \in \mathbb{R}^{D \times d}$ and $\mathbf{b}_Q^i, \mathbf{b}_K^i, \mathbf{b}_V^i$ are the trainable parameters, where $d = D/H$. The cross-channel similarity matrix is computed as the product of the query and key matrices, and $\mathbf{P}_t^{i,(l-1)}$ denotes the raw attention scores from the previous cross-channel processing layer. The output of the $i$-th attention head is then computed by:

$$\mathbf{P}_t^{i,(l)} = \frac{\mathbf{Q}_t^i (\mathbf{K}_t^i)^{\top}}{\sqrt{d}} + \mathbf{P}_t^{i,(l-1)}, \quad \mathbf{O}_t^i = \mathrm{softmax}\big(\mathbf{P}_t^{i,(l)}\big)\mathbf{V}_t^i \quad (2)$$

where $\mathbf{P}_t^{i,(l)} \in \mathbb{R}^{C \times C}$ and $\mathbf{O}_t^i \in \mathbb{R}^{C \times d}$. Finally, the new raw attention scores $\mathbf{P}_t^{i,(l)}$ are sent to the upper layer, and the head outputs are concatenated across the subspaces by:

$$\mathbf{O}_t = \mathrm{Concat}\big(\mathbf{O}_t^1, \ldots, \mathbf{O}_t^H\big)\mathbf{W}_O \quad (3)$$

where $\mathbf{W}_O \in \mathbb{R}^{D \times D}$ is the weight matrix of the linear projection layer. After that, a position-wise feed-forward network (FFN) with ReLU activation is applied to $\mathbf{O}_t$ to generate the output of the current cross-channel processing layer. Layer normalization  is applied to the input before the attention module and the FFN module, separately. A residual connection is applied between the input and output of the attention module as well as the FFN to alleviate the gradient vanishing problem .
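To make this computation concrete, the following minimal NumPy sketch implements one attention head of the cross-channel layer for a single frame, including the residual raw-attention scores that are passed up to the next layer. All names and dimensions are illustrative, and the biases, multi-head concatenation, FFN, and layer normalization are omitted for brevity; this is a sketch, not the exact implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_channel_attention(X_t, W_q, W_k, W_v, prev_scores=None):
    """One attention head of the cross-channel layer at a single frame.

    X_t: (C, D) feature matrix of C channels at time t.
    W_q, W_k, W_v: (D, d) projection weights (illustrative, biases omitted).
    prev_scores: (C, C) raw attention scores from the previous cross-channel
                 layer, added before the softmax (residual attention).
    """
    Q, K, V = X_t @ W_q, X_t @ W_k, X_t @ W_v    # (C, d) each
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (C, C) cross-channel similarity
    if prev_scores is not None:
        scores = scores + prev_scores            # residual attention across layers
    out = softmax(scores, axis=-1) @ V           # (C, d) attended channel features
    return out, scores                           # raw scores are passed upward

rng = np.random.default_rng(0)
C, D, d = 4, 8, 8
X_t = rng.standard_normal((C, D))
Ws = [rng.standard_normal((D, d)) * 0.1 for _ in range(3)]
out, scores = cross_channel_attention(X_t, *Ws)
```

Note that the function returns the raw (pre-softmax) scores, so the next block can add them to its own similarity matrix as in the residual-attention scheme.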
2.1.2 Cross-frame processing layer (CFL)
A cross-frame self-attention layer allows the network to efficiently learn the contextual relationship within a single channel. We use the multi-head scaled dot-product attention as the scoring function to compute the attention weights across time. Let $\mathbf{X} \in \mathbb{R}^{C \times T \times D}$ denote the input, where $\mathbf{X}_c \in \mathbb{R}^{T \times D}$ is the input feature matrix of the $c$-th channel. Similar to (1), we obtain the query, key and value matrices via linear transformations, where the projection matrices and bias terms are learnable parameters. Similar to (2), the output of the $i$-th attention head is computed with the attention applied along the time dimension, where the raw attention scores from the previous cross-frame layer are added before the softmax. As in the CCL, we then apply the residual connection and layer normalization before feeding the contextual time-attended representations through the feed-forward layers to obtain the final cross-frame self-attention output.
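Shape-wise, the two layers can be summarized in a small NumPy sketch: the cross-frame layer attends over time within each channel, the cross-channel layer attends over channels within each frame, and a block maps a (T, C, D) tensor to a tensor of the same shape, so blocks can be stacked. Projections, multiple heads, FFNs, residual connections, and the exact ordering of the two layers are simplifying assumptions here:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Single-head scaled dot-product self-attention over the middle axis of a
    batch X of shape (batch, n, D); identity projections for brevity."""
    scores = X @ X.transpose(0, 2, 1) / np.sqrt(X.shape[-1])  # (batch, n, n)
    return softmax(scores, axis=-1) @ X                        # (batch, n, D)

def stb(X):
    """One spatio-temporal block on X of shape (T, C, D): a cross-frame layer
    (attention over time, per channel) followed by a cross-channel layer
    (attention over channels, per frame)."""
    cfl = self_attention(X.transpose(1, 0, 2)).transpose(1, 0, 2)  # over T
    ccl = self_attention(cfl)                                      # over C
    return ccl

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8, 16))   # T=50 frames, C=8 channels, D=16 dims
Y = stb(stb(X))                        # stacking two blocks preserves the shape
```

Because the attention is computed over whichever channels are present, the block itself has no parameters tied to a fixed number or ordering of microphones.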
2.2 Self-attention with sparsemax
The output elements of softmax can never be exactly zero, which forces the model to consider all channels of the ad-hoc microphone array. In practice, however, due to strong noise and reverberation, the information of many channels is useless and should be discarded. To address this problem, we replace the softmax in the CCL with sparsemax , which has shown its effectiveness in ASV .
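For reference, the sparsemax mapping of Martins and Astudillo projects a score vector onto the probability simplex, driving the weights of low-scoring channels exactly to zero. A minimal NumPy sketch for a single score vector (not the batched in-network version):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the score vector z onto the
    probability simplex; unlike softmax, many outputs are exactly 0."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # scores in descending order
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv           # candidate support sizes
    k_z = k[support][-1]                        # largest valid support size
    tau = (cssv[k_z - 1] - 1) / k_z             # threshold tau(z)
    return np.maximum(z - tau, 0.0)

p = sparsemax([2.0, 1.0, -1.0])  # the low-scoring entries are zeroed out
```

The output still sums to one like a softmax distribution, but channels whose scores fall below the threshold receive exactly zero weight and are effectively discarded.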
|Method|Parameters|Libri-adhoc-simu|Libri-adhoc40|
|Oracle one-best|2.77 M|12.48 / 11.45 / 10.91|17.11 / 14.80 / 13.54|
|UCCA-ASV |2.82 M|8.10 / 7.94 / 7.88|11.18 / 10.95 / 10.84|
|STB-ASV (proposed)|2.49 M|7.69 / 7.55 / 7.43|9.51 / 9.34 / 9.01|

Table 1: EER (%) of the comparison methods on the three test scenarios of Libri-adhoc-simu and Libri-adhoc40.
3.1 Datasets
Our experiments use three datasets: the Librispeech corpus , Librispeech simulated with ad-hoc microphone arrays (Libri-adhoc-simu), and Libri-adhoc40 . Each node of the ad-hoc microphone arrays of Libri-adhoc-simu and Libri-adhoc40 has only one microphone; therefore, a channel refers to a node in the remainder of this paper.
Considering that the large amount of data from a massive ad-hoc array, denoted as ad-hoc data for short, leads to a large memory requirement for model training, all compared multi-channel ASV models based on ad-hoc microphone arrays first trained a single-channel ASV with clean speech data, then used the single-channel ASV to initialize the multi-channel ASV model, and finally used the ad-hoc data to fine-tune the spatio-temporal processing blocks. In our experiments, the single-channel ASV systems were trained with the clean training data of Librispeech, while a separate portion of clean data was used for development.
The Libri-adhoc-simu corpus is a simulated database derived from Librispeech. We use 'train-clean-100', 'dev-clean' and 'test-clean' as the training, validation and test sets for simulating the ad-hoc data, respectively. The training set contains 251 speakers, and the validation and test sets each contain 40 speakers. We then add room impulse responses and noise to the clean speech, with the simulation parameters set as in : the room length, width, height and reverberation time are randomly sampled from the same ranges. For each room, we randomly select these parameters and place one speaker source and forty microphones in the room. The noise of the training and validation sets is randomly selected from a large-scale library of noise segments . The noise of the test set comes from the CHiME-3 dataset  and NOISEX-92 . RIR-Generator (https://github.com/ehabets/RIR-Generator) and ANF-Generator (https://github.com/ehabets/ANF-Generator) are used for the data simulation. The Libri-adhoc40 corpus was collected by replaying the speech data of Librispeech in a large room, in which forty microphones and a loudspeaker were placed .
For the above ad-hoc datasets, we randomly select a subset of channels to form the training and validation sets, and in each epoch the channels of the training set are reselected to improve the generalization performance of the model. For the test set, we randomly select three different numbers of channels to construct three test scenarios.
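The per-epoch reselection amounts to sampling a fresh channel subset before each epoch; a minimal sketch, where the helper name and the channel counts (40 available nodes, 20 selected) are illustrative assumptions:

```python
import numpy as np

def reselect_channels(num_available, num_selected, rng):
    """Randomly pick a channel subset without replacement; called once per
    training epoch so the model sees a different ad-hoc array configuration
    each time (illustrative helper, not the actual training code)."""
    return rng.choice(num_available, size=num_selected, replace=False)

rng = np.random.default_rng(0)
# e.g. three epochs, each drawing 20 of 40 available nodes
epoch_channels = [reselect_channels(40, 20, rng) for _ in range(3)]
```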
3.2 Experimental Setup
We denote the proposed model as STB-ASV. For the STB-ASV training, the network structure of its initial single-channel ASV is the same as in , which contains three main components: a front-end residual convolutional neural network (ResNet), a self-attentive pooling (SAP)  layer and a fully-connected layer. It was first trained on the Librispeech corpus. Then, the parameters of the ResNet and SAP layers were frozen and transferred to the multi-channel ASV. Finally, we trained the STB blocks of the proposed STB-ASV with the Libri-adhoc-simu and Libri-adhoc40 data respectively, stacking multiple spatio-temporal blocks with multi-head attention. We used voxceleb_trainer (https://github.com/clovaai/voxceleb_trainer) to build our models. The data preprocessing and training settings of the proposed model are the same as in . We adjusted the model sizes appropriately to ensure a fair comparison. The following two baselines are used for comparison:
Oracle one-best + ASV: This serves as the single-channel baseline. We pick the channel that is physically closest to the speaker source as the input of the single-channel ASV model. Note that, for the oracle one-best baseline, the distances between the speaker and the microphones are assumed to be known beforehand. To verify the rationality of this baseline, we selected each of the six closest channels in turn as the input of the single-channel ASV. As shown in Figure 3, selecting the closest channel is a reasonable choice.
Utterance-level cross-channel self-attention + ASV (UCCA-ASV) : It adds an utterance-level cross-channel self-attention layer and a global fusion layer after the pooling layer of a single-channel ASV.
Table 1 lists the performance of the comparison methods on Libri-adhoc-simu and Libri-adhoc40. From the table, we see that the proposed method performs well in all test scenarios. Specifically, compared with oracle one-best, STB-ASV with softmax achieves a relative EER reduction of over 30% on Libri-adhoc-simu, and over 33% on Libri-adhoc40. Compared with UCCA-ASV with softmax, STB-ASV with softmax achieves a relative EER reduction of over 14% on Libri-adhoc40, which demonstrates the advantage of modeling frame-level information over utterance-level information.
Moreover, the sparsemax function achieves slightly better performance than softmax. For example, STB-ASV with sparsemax achieves a lower EER than STB-ASV with softmax in the mismatched 30-channel test environment on Libri-adhoc40.
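For reference, the EER reported above can be estimated from verification scores as in the following NumPy sketch; the simple threshold sweep is an approximation, and the actual evaluation tooling may interpolate between operating points:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: sweep a threshold over all observed scores and return
    the operating point where the false-acceptance rate (FAR) and the
    false-rejection rate (FRR) are closest to equal."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))          # closest FAR/FRR crossing
    return (far[idx] + frr[idx]) / 2.0

rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 1000)    # target-trial scores (higher = same speaker)
non = rng.normal(-2.0, 1.0, 1000)   # non-target-trial scores
eer = compute_eer(tgt, non)         # small EER for well-separated scores
```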
To study the effects of the CCL and the CFL separately, we trained two variants of STB-ASV with the CCL or the CFL only. Table 2 lists the results. From the table, we see that the EER increases significantly when either attention layer is removed.
In this paper, we have presented a novel multi-channel ASV model for ad-hoc microphone arrays. It conducts channel fusion at the frame level by stacking multiple spatio-temporal processing blocks before the pooling layer. The blocks can be trained in a way that is independent of the number and permutation of the microphones. Compared to the utterance-level multi-channel ASV, the proposed STB-ASV model is able to mine spatial and temporal information for better performance. To handle large-scale ad-hoc microphone arrays, we further replace the softmax operator in the cross-channel self-attention with the sparsemax operator, which forces the weights of very noisy channels to zero. We evaluated our model on the Libri-adhoc-simu corpus with additive diffuse noise and the Libri-adhoc40 corpus with high reverberation. Experimental results show that the proposed STB-ASV achieves better performance than its utterance-level counterpart.
-  Ladislav Mošner, Pavel Matějka, Ondřej Novotnỳ, and Jan Honza Černockỳ, “Dereverberation and beamforming in far-field speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5254–5258.
-  Qin Jin, Runxin Li, Qian Yang, Kornel Laskowski, and Tanja Schultz, “Speaker identification with distant microphone speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2010, pp. 4518–4521.
-  Hassan Taherian, Zhong-Qiu Wang, and DeLiang Wang, “Deep learning based multi-channel speaker recognition in noisy and reverberant environments,” in Interspeech, 2019.
-  Hassan Taherian, Zhong-Qiu Wang, Jorge Chang, and DeLiang Wang, “Robust speaker recognition based on single-channel and multi-channel speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1293–1302, 2020.
-  Yi Jiang, DeLiang Wang, RunSheng Liu, and ZhenMing Feng, “Binaural classification for reverberant speech segregation using deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 2112–2121, 2014.
-  Zhong-Qiu Wang and DeLiang Wang, “All-neural multi-channel speech enhancement.,” in Interspeech, 2018, pp. 3234–3238.
-  Samia Abd El-Moneim, MA Nassar, Moawad I Dessouky, Nabil A Ismail, Adel S El-Fishawy, and Fathi E Abd El-Samie, “Text-independent speaker recognition using lstm-rnn and speech enhancement,” Multimedia Tools and Applications, vol. 79, no. 33, pp. 24013–24028, 2020.
-  Xiao-Lei Zhang, “Deep ad-hoc beamforming,” Computer Speech & Language, vol. 68, pp. 101201, 2021.
-  Ziye Yang, Shanzheng Guan, and Xiao-Lei Zhang, “Deep ad-hoc beamforming based on speaker extraction for target-dependent speech separation,” arXiv preprint arXiv:2012.00403, 2020.
-  Dongmei Wang, Takuya Yoshioka, Zhuo Chen, Xiaofei Wang, Tianyan Zhou, and Zhong Meng, “Continuous speech separation with ad hoc microphone arrays,” arXiv preprint arXiv:2103.02378, 2021.
-  Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, and Siegfried Kunzmann, “End-to-end multi-channel transformer for speech recognition,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5884–5888.
-  Xiaoyi Qin, Hui Bu, and Ming Li, “Hi-mia: A far-field text-dependent speaker verification database and the baselines,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7609–7613.
-  Chengdong Liang, Junqi Chen, Shanzheng Guan, and Xiao-Lei Zhang, “Attention-based multi-channel speaker verification with ad-hoc microphone arrays,” arXiv preprint arXiv:2107.00178, 2021.
-  Andre Martins and Ramon Astudillo, “From softmax to sparsemax: A sparse model of attention and multi-label classification,” in International Conference on Machine Learning. PMLR, 2016, pp. 1614–1623.
-  Shanzheng Guan, Shupei Liu, Junqi Chen, Wenbo Zhu, Shengqiang Li, Xu Tan, Ziye Yang, Menglong Xu, Yijiang Chen, Jianyu Wang, et al., “Libri-adhoc40: A dataset collected from synchronized ad-hoc microphone arrays,” arXiv preprint arXiv:2103.15118, 2021.
-  Ruining He, Anirudh Ravula, Bhargav Kanagal, and Joshua Ainslie, “Realformer: Transformer likes residual attention,” arXiv preprint arXiv:2012.11747, 2020.
-  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
-  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210.
-  Xu Tan and Xiao-Lei Zhang, “Speech enhancement aided end-to-end multi-task learning for voice activity detection,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6823–6827.
-  Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe, “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 504–511.
-  Andrew Varga and Herman JM Steeneken, “Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems,” Speech Communication, vol. 12, no. 3, pp. 247–251, 1993.
-  Joon Son Chung, Jaesung Huh, Seongkyu Mun, Minjae Lee, Hee Soo Heo, Soyeon Choe, Chiheon Ham, Sunghwan Jung, Bong-Jin Lee, and Icksang Han, “In defence of metric learning for speaker recognition,” arXiv preprint arXiv:2003.11982, 2020.
-  Yingke Zhu, Tom Ko, David Snyder, Brian Mak, and Daniel Povey, “Self-attentive speaker embeddings for text-independent speaker verification.,” in Interspeech, 2018, vol. 2018, pp. 3573–3577.