Frame-level multi-channel speaker verification with large-scale ad-hoc microphone arrays

10/12/2021 · Chengdong Liang, et al.

Ad-hoc microphone arrays, in which the number and arrangement of microphones are unknown, have received increasing attention. Traditional multi-channel processing methods cannot be applied directly to such arrays. To address this problem, an utterance-level ASV method with ad-hoc microphone arrays was recently proposed: it first extracts an utterance-level speaker embedding from each channel of the array, and then fuses the embeddings for the final verification. However, this method cannot make full use of cross-channel information. In this paper, we present a novel multi-channel ASV model that operates at the frame level. Specifically, we add spatio-temporal processing blocks (STB) before the pooling layer, which model the contextual relationships within and between channels and across time. The channel-attended outputs from the STBs are sent to the pooling layer to obtain an utterance-level speaker representation. Experimental results demonstrate the effectiveness of the proposed method.


1 Introduction

With the rapid popularization of smart terminal devices, such as smartphones, vehicle-mounted devices, and smart speakers, far-field automatic speaker verification (ASV) has been widely studied. Due to noise, reverberation, and speech signal attenuation, the performance of single-channel ASV drops sharply and still faces challenges in far-field environments. In order to make these smart devices robust against noisy and reverberant environments, one approach is to equip them with multiple microphones so that the spectral and spatial diversity of the target and interference signals can be leveraged using beamforming approaches [1, 2, 3]. It has been demonstrated in [4, 5, 6, 7] that single-channel and multi-channel speech enhancement leads to substantial improvements in ASV.

The above research focuses mainly on single-channel or multi-channel front-ends on single devices, where prior knowledge (e.g., the geometry) of the microphone array is known to be important to performance. Different from a fixed microphone array, an ad-hoc microphone array is composed of a group of microphone nodes randomly distributed in space, whose number and arrangement are unknown [8]. An advantage of the ad-hoc microphone array is that it allows users to use their own mobile devices to flexibly form a virtual microphone array system. Recently, several studies on ad-hoc microphone arrays have been conducted. In [9], the authors proposed deep ad-hoc beamforming based on speaker extraction, which used a supervised channel selection framework and a deep-learning-based MVDR algorithm. In [10], the authors used the attention mechanism to capture the relevant information between and within channels for multi-channel speech separation. In [11], the authors leveraged neural transformer architectures for multi-channel speech recognition systems. However, there are few studies on ASV with ad-hoc microphone arrays. [12] proposed a far-field multi-channel text-dependent speaker verification dataset together with a set of single-channel ASV baseline systems, and further proposed a testing-background-aware enrollment augmentation strategy to improve performance. [13] proposed an utterance-level cross-channel attention (UCCA) layer that fuses utterance-level speaker embeddings from each channel into the final speaker representation.

However, the UCCA layer [13] operates at the utterance level, which misses spatial and temporal information between the channels. To remedy this problem, inspired by [10], we propose in this paper a frame-level multi-channel ASV with ad-hoc microphone arrays. The core idea is to fuse the channels of an ad-hoc array at the frame level before the pooling layer by stacking multiple spatio-temporal processing blocks (STB), where each STB consists of a cross-frame processing layer (CFL) and a cross-channel processing layer (CCL). Several techniques and training schemes, such as the sparsemax function [14] and pretraining techniques [13], are further added to the system to make it effective and efficient in handling large-scale ad-hoc microphone arrays.

We conducted extensive experiments on a simulated corpus generated from Librispeech with ad-hoc microphone arrays and on the semi-real Libri-adhoc40 corpus [15]. Experimental results with ad-hoc microphone arrays of as many as 40 nodes demonstrate the effectiveness of the proposed method on both datasets. For example, on Libri-adhoc40 in the mismatched 30-channel test environment, STB-ASV achieves a clear relative EER reduction over UCCA, and an even larger reduction over a single-channel ASV system whose microphone is physically the closest one to the speaker source.

2 Proposed Method

Figure 1 shows the processing flow of our proposed multi-channel speaker verification model with ad-hoc microphone arrays. It consists of two elements: a frame-level feature processor and a spatio-temporal processor.

2.1 Spatio-temporal processing block

As shown in Figure 2, the spatio-temporal processing block consists of a cross-frame processing layer and a cross-channel processing layer. By stacking multiple blocks, the entire network can make use of information across channels and frames. A cross-channel self-attention layer exploits the nonlinear spatial correlation between different channels, which has demonstrated its effectiveness in [13].

2.1.1 Cross-channel processing layer (CCL)

We apply the self-attention mechanism along the channel dimension to integrate the information of different channels. The detailed calculation process is as follows:

In this paper, we define $T$ as the frame length, $C$ as the number of input channels, and $D$ as the feature dimension of each channel. For each layer, $\mathbf{X} \in \mathbb{R}^{T \times C \times D}$ is the input of the layer, where $\mathbf{X}_t \in \mathbb{R}^{C \times D}$ is the input feature matrix at time $t$. Let $H$ denote the number of attention heads. For each attention head, the input features are transformed into query, key and value as follows:

$\mathbf{Q}_t^i = \mathbf{X}_t \mathbf{W}_i^Q, \quad \mathbf{K}_t^i = \mathbf{X}_t \mathbf{W}_i^K, \quad \mathbf{V}_t^i = \mathbf{X}_t \mathbf{W}_i^V$   (1)

where $i = 1, \dots, H$; the matrices $\mathbf{Q}_t^i$, $\mathbf{K}_t^i$, $\mathbf{V}_t^i$ denote the query, key, and value embeddings respectively, all of which are in $\mathbb{R}^{C \times d}$. For the $i$-th attention head at time $t$, $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, $\mathbf{W}_i^V \in \mathbb{R}^{D \times d}$ are the trainable parameters, where $d = D/H$. The cross-channel similarity matrix $\mathbf{S}_t^i = \mathbf{Q}_t^i (\mathbf{K}_t^i)^\top / \sqrt{d}$ is computed as the scaled product of the query and key matrices. $\bar{\mathbf{A}}_t^i$ denotes the attention scores passed from the previous cross-channel processing layer [16]. The output of the $i$-th attention head is then computed by:

$\mathbf{A}_t^i = \mathbf{S}_t^i + \bar{\mathbf{A}}_t^i, \qquad \mathbf{H}_t^i = \mathrm{softmax}(\mathbf{A}_t^i)\,\mathbf{V}_t^i$   (2)

where $\mathbf{A}_t^i, \mathbf{S}_t^i \in \mathbb{R}^{C \times C}$ and $\mathbf{H}_t^i \in \mathbb{R}^{C \times d}$. The new attention scores $\mathbf{A}_t^i$ are sent to the upper layer, while the head outputs are concatenated across the subspaces by:

$\mathbf{O}_t = \mathrm{Concat}(\mathbf{H}_t^1, \dots, \mathbf{H}_t^H)\,\mathbf{W}^O$   (3)

where $\mathbf{W}^O \in \mathbb{R}^{D \times D}$ is the weight matrix of the linear projection layer. After that, a position-wise feed-forward network (FFN) with ReLU activation is applied to $\mathbf{O}_t$ to generate the output of the current cross-channel processing layer. Layer normalization [17] is applied on the input before the attention module and the FFN module, separately. A residual connection [18] is applied between the input and output of the attention module as well as the FFN to alleviate the gradient dispersion problem [13].
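To make the computation above concrete, the following is a minimal PyTorch-style sketch of one cross-channel processing layer following the notation of Eqs. (1)-(3). It is only an illustration of the described mechanism under assumed tensor shapes and hyper-parameters, not the authors' implementation.

```python
# Minimal sketch of a cross-channel processing layer (CCL), assuming PyTorch.
# Notation follows the text: T frames, C channels, D features, H heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossChannelLayer(nn.Module):
    def __init__(self, d_model=256, num_heads=4, d_ff=512):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.dk = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # W^Q, W^K, W^V for all heads
        self.out = nn.Linear(d_model, d_model)        # W^O
        self.norm_att = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x, prev_scores=None):
        # x: (B, T, C, D); attention is computed over the channel axis C, per frame.
        B, T, C, D = x.shape
        q, k, v = self.qkv(self.norm_att(x)).chunk(3, dim=-1)

        def split(t):  # (B, T, C, D) -> (B, T, H, C, dk)
            return t.view(B, T, C, self.h, self.dk).transpose(2, 3)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5   # (B, T, H, C, C)
        if prev_scores is not None:                          # residual attention scores
            scores = scores + prev_scores                     # from the previous CCL
        att = F.softmax(scores, dim=-1)                       # or sparsemax (Sec. 2.2)
        ctx = (att @ v).transpose(2, 3).reshape(B, T, C, D)   # concatenate heads
        x = x + self.out(ctx)                                  # residual around attention
        x = x + self.ffn(self.norm_ffn(x))                     # residual around FFN
        return x, scores                                       # scores go to the next layer
```

Threading `scores` from one layer to the next realizes the attention-score residual of Eq. (2); the softmax can be replaced by sparsemax as described in Section 2.2.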

Figure 1: Architecture of the proposed multi-channel ASV system. The blue block is trained with ad-hoc data.
Figure 2: Architecture of the spatio-temporal processing block.

2.1.2 Cross-frame processing layer (CFL)

A cross-frame self-attention layer allows the network to efficiently learn the contextual relationship within a single channel. We use multi-head scaled dot-product attention as the scoring function to compute the attention weights across time. Let $\mathbf{X} \in \mathbb{R}^{T \times C \times D}$ denote the input, where $\mathbf{X}_c \in \mathbb{R}^{T \times D}$ is the input feature matrix of the $c$-th channel. Similar to (1), we obtain the query, key and value matrices via linear transformations:

$\mathbf{Q}_c^i = \mathbf{X}_c \mathbf{U}_i^Q + \mathbf{b}_i^Q, \quad \mathbf{K}_c^i = \mathbf{X}_c \mathbf{U}_i^K + \mathbf{b}_i^K, \quad \mathbf{V}_c^i = \mathbf{X}_c \mathbf{U}_i^V + \mathbf{b}_i^V$   (4)

where $\mathbf{U}_i^Q$, $\mathbf{U}_i^K$, $\mathbf{U}_i^V$ and $\mathbf{b}_i^Q$, $\mathbf{b}_i^K$, $\mathbf{b}_i^V$ are learnable weight and bias parameters. Similar to (2), the output of the $i$-th attention head is computed by:

$\mathbf{H}_c^i = \mathrm{softmax}\!\left(\mathbf{Q}_c^i (\mathbf{K}_c^i)^\top / \sqrt{d} + \bar{\mathbf{B}}_c^i\right)\mathbf{V}_c^i$   (5)

where $\bar{\mathbf{B}}_c^i \in \mathbb{R}^{T \times T}$ is the raw attention from the previous layer. Similar to the CCL, we then add the residual connection and layer normalization before feeding the contextual time-attended representations through the feed-forward layers to obtain the final cross-frame self-attention output.
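Since the CFL applies the same attention pattern along the time axis instead of the channel axis, it can be sketched by reusing the cross-channel layer from the previous code block with the frame and channel axes swapped. The composition of the two layers into one STB is also shown; the ordering of CFL and CCL inside a block is an assumption here, not taken from the paper.

```python
# Continues the CrossChannelLayer sketch above (illustrative, not the authors' code).
class CrossFrameLayer(CrossChannelLayer):
    def forward(self, x, prev_scores=None):
        # x: (B, T, C, D) -> attend over the T frames within each channel
        x = x.transpose(1, 2)                         # (B, C, T, D)
        x, scores = super().forward(x, prev_scores)   # attention over the 3rd axis
        return x.transpose(1, 2), scores              # back to (B, T, C, D)

class STB(nn.Module):
    """One spatio-temporal processing block: a CFL followed by a CCL (order assumed)."""
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        self.cfl = CrossFrameLayer(d_model, num_heads)
        self.ccl = CrossChannelLayer(d_model, num_heads)

    def forward(self, x, prev_frame_scores=None, prev_chan_scores=None):
        x, frame_scores = self.cfl(x, prev_frame_scores)
        x, chan_scores = self.ccl(x, prev_chan_scores)
        return x, frame_scores, chan_scores
```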

2.2 Self-attention with sparsemax

The output elements of softmax can never be exactly zero, which forces the model to consider all channels of an ad-hoc microphone array. In practice, however, because of strong noise and reverberation, the information from many channels is useless and should be discarded. To address this problem, we replace the softmax in the CCL with sparsemax [14], which has shown its effectiveness in ASV [13].
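For reference, sparsemax projects the score vector onto the probability simplex and assigns exact zeros to low-scoring entries. Below is a minimal, generic implementation of the sparsemax forward pass of [14] (not the authors' code), together with a small example in which one channel receives exactly zero weight.

```python
# Minimal sparsemax forward pass (Martins & Astudillo, 2016), applied along `dim`.
# Unlike softmax, it returns exact zeros for low-scoring channels.
import torch

def sparsemax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    z, _ = torch.sort(scores, dim=dim, descending=True)     # sorted scores
    cumsum = z.cumsum(dim)                                    # running sums of sorted scores
    k = torch.arange(1, scores.size(dim) + 1,
                     device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)
    support = (1 + k * z) > cumsum                            # entries kept in the support
    k_z = support.sum(dim=dim, keepdim=True).to(scores.dtype) # support size k(z)
    tau = (cumsum.gather(dim, k_z.long() - 1) - 1) / k_z      # threshold tau(z)
    return torch.clamp(scores - tau, min=0.0)                 # p_i = max(z_i - tau, 0)

# Example: the lowest-scoring "channel" gets exactly zero weight.
print(sparsemax(torch.tensor([1.5, 1.0, -2.0])))  # tensor([0.7500, 0.2500, 0.0000])
```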

Method               #Params   Libri-adhoc-simu          Libri-adhoc40
                               20-ch   30-ch   40-ch     20-ch   30-ch   40-ch
Oracle one-best      2.77 M    12.48   11.45   10.91     17.11   14.80   13.54
UCCA-ASV [13]        2.82 M     8.10    7.94    7.88     11.18   10.95   10.84
  +sparsemax         2.82 M     7.59    7.53    7.47     10.89   10.62   10.53
STB-ASV (proposed)   2.49 M     7.69    7.55    7.43      9.51    9.34    9.01
  +sparsemax         2.49 M     7.46    7.34    7.25      9.21    9.06    8.73

Table 1: EER (%) comparison on Libri-adhoc-simu and Libri-adhoc40. The term 'ch' is short for the number of channels in the test.

3 Experiments

3.1 Dataset

Our experiments use three data sets: the Librispeech corpus [19], Librispeech simulated with ad-hoc microphone arrays (Libri-adhoc-simu), and Libri-adhoc40 [15]. Each node of the ad-hoc microphone arrays of Libri-adhoc-simu and Libri-adhoc40 has only one microphone; therefore, a channel refers to a node in the remainder of this paper.

Because a massive ad-hoc array produces a large amount of data (denoted ad-hoc data for short), which leads to a large memory requirement during model training, all compared multi-channel ASV models based on ad-hoc microphone arrays were trained in three steps: a single-channel ASV was first trained with clean speech data, the single-channel ASV was then used to initialize the multi-channel ASV model, and the ad-hoc data was finally used to fine-tune the spatio-temporal processing blocks. In our experiments, the single-channel ASV systems were trained on the clean data of Librispeech, with additional clean data held out for development.

The Libri-adhoc-simu corpus is a simulated database derived from Librispeech. We use 'train-clean-100', 'dev-clean' and 'test-clean' as the training, validation and test sets for the simulated ad-hoc data, respectively; the three sets contain disjoint speakers. We add room impulse responses and noise to the clean speech, with the simulation parameters set in the same way as in [13], including the ranges of the room length, width and height and the range of the reverberation time. The parameters are randomly selected, and a speaker and forty microphones are placed in each room. The noise of the training and validation sets is randomly selected from a large-scale library of noise segments [20]. The noise of the test set comes from the CHiME-3 dataset [21] and NOISEX-92 [22]. RIR-Generator (https://github.com/ehabets/RIR-Generator) and ANF-Generator (https://github.com/ehabets/ANF-Generator) are used for data simulation. The Libri-adhoc40 corpus was collected by replaying the speech data of Librispeech in a large room in which forty microphones and a loudspeaker are placed [15].

For the above ad-hoc datasets, we randomly select a subset of channels for the training and validation sets, and in each epoch the training channels are reselected to improve the generalization performance of the model. For the test set, we randomly select 20, 30 and 40 channels to construct three test scenarios.
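A minimal sketch of this per-epoch channel re-selection is given below; the tensor layout and the numbers of channels are illustrative assumptions, not the exact values used in the paper.

```python
# Sketch of per-epoch random channel selection for one utterance (assumed layout).
import numpy as np

def sample_channels(features: np.ndarray, num_keep: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Randomly keep `num_keep` of the available channels of one utterance.

    `features` is assumed to be shaped (num_channels, frames, feat_dim).
    Called once per utterance per epoch, so every epoch sees a different
    subset of channels, which helps generalization to unseen array layouts.
    """
    idx = rng.choice(features.shape[0], size=num_keep, replace=False)
    return features[idx]

rng = np.random.default_rng(0)
utt = rng.standard_normal((40, 300, 64))                  # 40-channel utterance (assumed)
print(sample_channels(utt, num_keep=20, rng=rng).shape)   # (20, 300, 64)
```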

Figure 3: Results of the six closest channels for the single-channel ASV.

3.2 Experimental Setup

We denote the proposed model as STB-ASV. For the STB-ASV training, the network structure of its initial single-channel ASV is the same as in [23], which contains three main components: a front-end residual convolutional neural network (ResNet) [18], a self-attentive pooling (SAP) layer [24], and a fully-connected layer. It was trained for epochs on the Librispeech corpus. Then, the parameters of the ResNet and SAP layers were fixed and transferred to the multi-channel ASV. Finally, we trained the STB blocks of the proposed STB-ASV with the Libri-adhoc-simu and Libri-adhoc40 data respectively, where the number of spatio-temporal blocks is , and the number of attention heads is . A rough sketch of this two-stage training scheme is given after the baseline list below. We used voxceleb_trainer (https://github.com/clovaai/voxceleb_trainer) to build our models. The preprocessing of the data and the training settings of the proposed model are the same as in [13]. We appropriately adjusted the model sizes to ensure a fair comparison between systems. The following two baselines are used for comparison:

  • Oracle one-best + ASV: This serves as the single-channel baseline. We pick the channel that is physically closest to the speaker source as the input of the single-channel ASV model. Note that, for the oracle one-best baseline, the distances between the speaker and the microphones are assumed to be known beforehand. To verify the rationality of the oracle one-best baseline, we used each of the six closest channels as the input of the single-channel ASV. As shown in Figure 3, selecting the closest channel is a reasonable choice.

  • Utterance-level cross-channel self-attention + ASV (UCCA+ASV) [13]: It adds an utterance-level cross-channel self-attention layer and a global fusion layer after the pooling layer of a single-channel ASV.
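As referenced above, the following is a rough sketch of the two-stage training scheme: pretrain a single-channel ASV, freeze its ResNet front-end and SAP layer, and fine-tune only the STB stack on ad-hoc data. The placeholder modules, file path and optimizer settings are assumptions for illustration, not the paper's configuration.

```python
# Sketch of the training scheme: the ResNet front-end and SAP pooling come from a
# pretrained single-channel ASV and are frozen; only the STB stack is fine-tuned
# on ad-hoc data. The modules below are simple placeholders, not real components.
import torch
import torch.nn as nn

frontend = nn.Sequential(nn.Conv1d(64, 256, 3, padding=1), nn.ReLU())  # stands in for ResNet
pooling = nn.Linear(256, 256)                                           # stands in for SAP
stb_stack = nn.ModuleList([nn.Linear(256, 256) for _ in range(2)])      # stands in for STBs

# 1) Load single-channel pretrained weights (path and keys are hypothetical).
# state = torch.load("single_channel_asv.pt")
# frontend.load_state_dict(state["frontend"]); pooling.load_state_dict(state["pooling"])

# 2) Freeze the pretrained components.
for module in (frontend, pooling):
    for p in module.parameters():
        p.requires_grad_(False)

# 3) Fine-tune only the STB parameters on ad-hoc data.
optimizer = torch.optim.Adam(stb_stack.parameters(), lr=1e-4)
```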

3.3 Results

Table 1 lists the performance of the comparison methods on Libri-adhoc-simu and Libri-adhoc40. From the table, we see that the proposed methods perform well on both corpora. Specifically, compared with oracle one-best, STB-ASV with softmax achieves a relative EER reduction of over 30% on Libri-adhoc-simu and over 33% on Libri-adhoc40. Compared with UCCA-ASV with softmax, STB-ASV with softmax achieves a relative EER reduction of over 14% on Libri-adhoc40, which demonstrates the effectiveness of modeling frame-level information over utterance-level information.

Moreover, the sparsemax function achieves slightly better performance than softmax. For example, in the mismatched 30-channel test environment on Libri-adhoc40, STB-ASV with sparsemax achieves an EER of 9.06%, which is about 3% relatively lower than the 9.34% of STB-ASV with softmax.
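The relative numbers above can be recomputed directly from the Table 1 entries as (baseline − proposed) / baseline, for example:

```python
# Relative EER reduction computed from the Table 1 entries (Libri-adhoc40, 30-ch).
def rel_reduction(baseline: float, proposed: float) -> float:
    return (baseline - proposed) / baseline * 100.0

print(round(rel_reduction(9.34, 9.06), 1))    # STB-ASV: softmax -> sparsemax, about 3.0
print(round(rel_reduction(10.95, 9.34), 1))   # UCCA-ASV -> STB-ASV (both softmax), about 14.7
print(round(rel_reduction(14.80, 9.34), 1))   # oracle one-best -> STB-ASV, about 36.9
```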

To study the effects of CCL and CFL separately, we train two variants of STB-ASV that use only CCL or only CFL. Table 2 lists the results. From the table, we see that the performance degrades significantly when either attention layer is removed.

CCL   CFL   Libri-adhoc-simu     Libri-adhoc40
            20-ch    30-ch       20-ch    30-ch
            7.33%    8.27%       7.82%    7.15%
            5.12%    6.01%       3.72%    4.51%
            0        0           0        0

Table 2: Relative EER reduction of the proposed model over the models that use CCL or CFL only.

4 Conclusion

In this paper, we presented a novel multi-channel ASV model for ad-hoc microphone arrays. It conducts channel fusion at the frame level by stacking multiple spatio-temporal processing blocks before the pooling layer. The block can be trained in a way that is independent of the number and permutation of the microphones. Compared to the utterance-level multi-channel ASV, the proposed STB-ASV model is able to mine spatial and temporal information for better performance. To handle large-scale ad-hoc microphone arrays, we further replace the softmax operator in the cross-channel self-attention with the sparsemax operator, which forces the weights of very noisy channels to zero. We evaluated our model on the Libri-adhoc-simu corpus with additive diffuse noise and the Libri-adhoc40 corpus with high reverberation. Experimental results show that the proposed STB-ASV achieves better performance than its utterance-level counterpart.

References