1 Introduction
Far-field speaker recognition has gained much interest in the research community, owing to its prevalent applications in consumer devices such as smart speakers and smartphones. The far-field condition presents additional challenges for speaker recognition because of severe reverberation and background noise. As in automatic speech recognition, deep-learning acoustic features have shown great improvements in these conditions compared to prior techniques. A number of speaker recognition systems based on deep neural network (DNN) embeddings have been reported in the literature [6][9][18]. More recently, SRI developed the VOiCES dataset [15] specifically for far-field speaker recognition, and showed that their DNN embeddings significantly outperformed i-vector systems [12].
The objective of our work is to develop a speaker recognition system robust to far-field channel conditions, using an advanced model training methodology. Furthermore, we designed the system so that inference is simple to perform on ultra-low-power accelerators such as the Intel® GNA [19]. In contrast to the popular approach of using probabilistic linear discriminant analysis (PLDA) [8] in the backend, our system relies only on the simple cosine distance for scoring. This allows the computations to be performed end-to-end on the accelerator. Finally, to obtain the best model-size efficiency, the crux of the paper focuses on the application of structural sparsification to our DNN model.
There have been extensive studies on accelerating DNN models. Pruning [4] and sparsity methods [10] can effectively reduce the size of CNN models while keeping performance similar to the original models. However, randomly distributed zeros in a model provide no benefit for execution on hardware. [24] elaborates the benefits of structural over non-structural sparsity for locality and parallelism during hardware execution. To force zero parameters into a regular arrangement, structural sparsity [22] was proposed for CNNs to learn sparse structures such as channels and filters.
To reduce the model size and inference time, we apply a structural sparsity learning method to speaker recognition models. The sparse structure we obtain is computationally friendly to specific hardware. Specifically, we add a group Lasso [23] penalty to the loss function, where each group is a structure we wish to make sparse. The sparse model's performance is the same as, or even better than, the baseline's, with fewer nonzero parameters. We also tested our method at three levels of sparsity granularity and found that, for the same number of nonzero parameters, models with smaller granularity achieve a lower equal error rate (EER) than models with larger granularity. Regardless of granularity, sparse models outperform dense models with the same number of nonzero parameters.
2 Related Work
Computational acceleration methods have been heavily explored over the past years. Pruning and sparsification have proven effective at removing redundant parameters and structures. In [4][5], pruning connections of fully-connected layers was shown to be effective at reducing the size of AlexNet and VGG-16. However, most of the computation and parameters reside in convolution layers. From this perspective, Wen et al. [22] proposed a framework that reduces model size by eliminating redundant structures in CNNs, such as filters or channels, and reported a speedup for AlexNet on GPU while keeping accuracy unchanged.
For speech recognition tasks, recurrent neural network (RNN) and long short-term memory (LSTM) models are widely used. It is more difficult to learn sparse structures for these models because the structures usually carry information about time sequences, and eliminating them has a larger impact on performance. Narang et al. [13] performed connection pruning on RNNs and removed 90% of connections. Wen et al. [21] further applied group Lasso regularization to LSTMs and achieved a speedup without perplexity loss. Zhang et al. [24] also extended the structural sparsity learning method to LSTM models for speech recognition and removed 72.5% of parameters with negligible accuracy loss.
3 Experimental setup
3.1 Model topology
This work is based on the x-vector model structure [18], with some simplifications. Compared to the original x-vector model, our architecture, shown in Table 1, increases the input feature dimension from 24 to 40, reduces the pooling dimension from 1500 to 512, removes a fully-connected layer between the embedding and speaker output layers, and reduces the embedding dimension from 512 to 256. In our testing, these modifications did not degrade recognition performance and had much lower complexity. We use this topology as the baseline for structural sparsity learning. In this particular case, the TDNN can be written as a one-dimensional convolution, so we implemented the model as a 5-layer CNN.
The softmax output is used only for model training; for speaker enrollment and verification, the DNN embedding is taken at the output of Segment6 in Table 1. One speaker embedding is computed per utterance, regardless of its length. We use the cosine distance between the 256-dimensional embedding vectors of the enrollment and test utterances to produce the speaker recognition score.
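The scoring step reduces to a single cosine similarity between the two embedding vectors; a minimal sketch (function and variable names are ours, not from the released system):

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between enrollment and test speaker embeddings."""
    return float(np.dot(enroll, test) /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))

# An embedding compared with itself scores 1.0; higher score = same speaker.
e = np.random.default_rng(0).standard_normal(256)
print(round(cosine_score(e, e), 6))  # → 1.0
```

Because the score is a plain dot product over normalized vectors, it maps directly onto accelerator-friendly affine operations, which is what allows the PLDA-free pipeline described above.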
Layer          Context         Affine      Convolution
Layer1         [t-2, t+2]      200x512     512x40x5
Layer2         {t-2, t, t+2}   1536x512    512x512x3
Layer3         {t-3, t, t+3}   1536x512    512x512x3
Layer4         {t}             512x512     512x512x1
Layer5         {t}             512x512     512x512x1
Stats pooling  [0, T)          512Tx1024   N/A
Segment6       {0}             1024x256    N/A
Softmax        {0}             256xN       N/A
N denotes the number of training speakers.
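As a concrete illustration of Table 1, the topology can be sketched in PyTorch as dilated 1-D convolutions (the class name, ReLU placement, and use of mean/std statistics pooling are our assumptions; the paper does not specify these details):

```python
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    """Sketch of the simplified x-vector topology in Table 1."""
    def __init__(self, num_speakers: int):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(40, 512, kernel_size=5), nn.ReLU(),                # Layer1: [t-2, t+2]
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),  # Layer2: {t-2, t, t+2}
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),  # Layer3: {t-3, t, t+3}
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),              # Layer4
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),              # Layer5
        )
        self.segment6 = nn.Linear(1024, 256)          # 256-dim embedding
        self.softmax = nn.Linear(256, num_speakers)   # training-only output

    def forward(self, x):  # x: (batch, 40 log-mel bins, frames)
        h = self.frame_layers(x)
        # Stats pooling: concatenate mean and std over time -> 1024 dims.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.segment6(stats)
        return emb, self.softmax(emb)

emb, logits = XVectorTDNN(100)(torch.randn(2, 40, 300))
print(emb.shape)  # torch.Size([2, 256])
```

The dilations reproduce the frame contexts of Table 1 while keeping every layer a plain one-dimensional convolution, which is what makes the model accelerator-friendly.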
3.2 Loss function
While the conventional softmax loss works reasonably well for training speaker embeddings, it is designed for classification, not verification. Speaker recognition systems trained with softmax loss typically use PLDA in the backend to improve separation between speakers. The triplet loss function, which is designed to reduce intra-speaker and increase inter-speaker distance, has been shown to be more effective for speaker recognition [9]. Likewise, the end-to-end loss [6] performs better than softmax. The downside of these losses is that the training infrastructure is significantly more complicated than that used for supervised learning with softmax. In a prior study [7], we explored several recently proposed loss functions first introduced in face recognition research. These loss functions are drop-in replacements for softmax, so the modification to training code is simple, with little overhead in training speed. We found Additive Margin Softmax (AM-softmax) [20] to perform best on the far-field test set, and incorporating PLDA did not improve performance over the simpler cosine distance. Eliminating PLDA from the inference pipeline makes the entire model easy to deploy to target hardware, with the help of tools such as the Intel® Distribution of OpenVINO™ toolkit [14].
3.3 Datasets and augmentation
We use VoxCeleb 1 and 2 [11][1] to train the system. These datasets contain 7323 identities combined. We perform 9x data augmentation, plus the original clean speech, to produce 12.7 million training utterances. For each augmentation, we randomly choose from 2000 room impulse responses generated with Pyroomacoustics [16], and add randomly selected background noise from MUSAN [17] and AudioSet [3]. For the test set, we use the VOiCES far-field dataset [15], which we believe captures the essence of challenging channel conditions. For all speech utterances, we use 40-dimensional log-mel filterbanks with 3-second sliding-window mean subtraction.
3.4 Training details
We describe our training pipeline as a three-step process:

Baseline model training:
We find that we get significantly better results when we start the sparsification process from a well-trained dense model. We train the model with the AM-softmax loss and the SGD optimizer, with the learning rate decaying from 0.01 to 0.0001 over 30 epochs with cosine annealing. The weight decay and batch size are set to 1e-6 and 256, respectively. For each batch, we select random segments of training utterances between 2.5 and 3.0 seconds long. These settings, except for the number of epochs, are used in the subsequent steps. The output of this step is the best dense model we can produce, which also serves as the baseline against which we measure EER.
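The optimizer settings of this step can be sketched as follows (the model here is a stand-in, and the training loop body is elided; only the schedule is shown):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model; in the real pipeline this is the 5-layer TDNN.
model = torch.nn.Linear(256, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-6)
# Cosine annealing from 0.01 down toward 0.0001 over 30 epochs.
scheduler = CosineAnnealingLR(optimizer, T_max=30, eta_min=0.0001)

lrs = []
for epoch in range(30):
    # ... one epoch of AM-softmax training on random 2.5-3.0 s crops ...
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
print(lrs[0], lrs[-1])  # starts at 0.01, decays toward 0.0001
```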

Learning sparse structure: We use the model from step 1 to initialize the dense model, and train for 20 epochs with group Lasso regularization added to the AM-softmax loss:

L = L_AM + λ Σ_{g=1}^{G} ||w^{(g)}||_2    (1)

where the first term is the original AM-softmax loss and the second term is the group Lasso penalty: the sum, over all G predefined groups (e.g., chunks of 8 or 16 weights, or an entire convolution filter), of the L2 norm of each group's weights w^{(g)}. This penalty is added to the total loss and forces group weights toward low values. The coefficient λ controls the balance between the AM-softmax loss and the group Lasso loss. This step produces sparse structures by training on the new loss function L. Groups with L2 norms below a threshold are set to 0 and discarded in the next step's learning.
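The group Lasso term of Eq. (1) is straightforward to compute over a convolution weight tensor; a minimal sketch at chunk granularity (the function name and the value of `lam` are illustrative, not the paper's settings):

```python
import torch

def group_lasso(weight: torch.Tensor, chunk: int = 8) -> torch.Tensor:
    """Sum of L2 norms over contiguous chunks of `chunk` weights.

    For filter granularity, the group would instead be each output filter:
    weight.reshape(weight.shape[0], -1).norm(dim=1).sum()
    """
    # Assumes the total number of weights is divisible by `chunk`.
    return weight.reshape(-1, chunk).norm(p=2, dim=1).sum()

# Adding the penalty to the training loss; lam plays the role of λ in Eq. (1).
w = torch.randn(512, 512, 3, requires_grad=True)
lam = 1e-4
loss = lam * group_lasso(w, chunk=8)  # + AM-softmax loss during real training
loss.backward()                       # gradient shrinks whole chunks toward zero
```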

Fine-tuning: Lastly, we fine-tune the sparse model produced by step 2 for 20 epochs using only the AM-softmax loss.
More details on steps 2 and 3 can be found in [24].
3.5 Hardware implementation
This work targets TDNN inference on the Intel® Gaussian & Neural Accelerator (GNA) [2]. Intel® GNA is designed for continuous neural network inference on edge devices with high performance and very low power consumption. Since Intel® GNA fetches weight matrices in 16-byte chunks of int8 or int16 weights, we investigated structural sparsity on chunks of 8 int16 elements or 16 int8 elements. Inference measurements were made on an Intel® Celeron® Processor J4005 with Intel® GNA inside.
4 Results
In the experiments, filter sparsity is defined as the number of zero filters over all filters, and chunk sparsity as the number of zero chunks over all chunks. We applied sparsity learning only to layers 1-4: our experiments showed that layer 5 and above were reluctant to become sparse. We suspect this is because, near the output of the network, the hidden representations carry a high density of information for speaker recognition. This effect appears at the input of the stats pooling layer.
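Chunk sparsity as defined here can be measured directly from a trained weight tensor; a small sketch (names are ours):

```python
import numpy as np

def chunk_sparsity(weight: np.ndarray, chunk: int = 8, tol: float = 1e-8) -> float:
    """Fraction of all-zero chunks among contiguous chunks of `chunk` weights."""
    groups = weight.reshape(-1, chunk)  # assumes size divisible by chunk
    return float(np.mean(np.linalg.norm(groups, axis=1) <= tol))

w = np.ones((4, 8))
w[:2] = 0.0                        # zero out 2 of the 4 chunk-8 groups
print(chunk_sparsity(w, chunk=8))  # → 0.5
```

Filter sparsity is measured the same way, with each group being one output filter instead of a fixed-size chunk.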
4.1 Result analysis
The experimental results are shown in Table 2. We applied structural sparsity to filters and chunks. Filter sparsity can be deployed on any hardware without special modification, while chunk-8 and chunk-16 sparsity target Intel® GNA. We also ran experiments on dense models to compare the performance of sparse and dense models.
Figure 1 visualizes the relationship between the coefficient λ and the sparsity of each layer. The y-axis denotes the overall percentage of sparsity across the four layers. Clearly, as λ increases, sparsity increases; however, the sparsity growth differs across layers, and filter sparsity shows a different growth trend from chunk sparsity. In Figure 1(a), sparse filters in the first layer (blue bar) account for much of the overall sparsity. In Figures 1(b) and (c), however, the first layer is not very sparse, while layers 2 and 3 have a majority of chunks learned to be zero. We suspect that the low sparsity in layer 1 is due to the denser spectral input dimension compared to the other layers, and that in layer 4 the output representation is becoming more relevant to the speaker recognition task, so making the network sparse there would incur a higher penalty on the AM-softmax loss.
Figure 2 visualizes the relationship between the number of nonzero parameters and EER or minimum detection cost function (minDCF). The x-axis is the number of nonzero parameters and the y-axis is the EER or minDCF. We compared filter, chunk-8, and chunk-16 sparsity with dense models of different sizes. Figure 2(a) shows that when the number of parameters is large, sparse models achieve lower EER than dense models of the same size; when the number of nonzero parameters is small, however, dense models perform better. In our experimental setting, the turning point is around 0.7 million parameters. For example, at an EER around 2.0%, models with smaller granularity are clearly smaller in size: chunk-8 reaches 1.99% EER with 0.99 million parameters, and chunk-16 reaches 1.96% EER with 1.07 million parameters. By comparison, the smaller dense model reaches 2.03% EER with 1.73 million parameters, so chunk-8 and chunk-16 both reach lower EER with less than 60% of the parameters. Also, when the nonzero parameter count exceeds 1.5 million, chunk-8 tends to perform best, while filter sparsity has higher EER at the same nonzero parameter count. For minDCF, shown in Figure 2(b), we observe patterns similar to those for EER.
A somewhat surprising finding is that filter_1, chunk8_1, and chunk16_1, despite having fewer parameters, achieve slightly lower EERs of 1.76%, 1.61%, and 1.68%, respectively, compared to the baseline's 1.81%. We believe this is because the group Lasso loss is an effective regularizer and, used in a small dose, helps produce better-generalized models.
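For reference, the EER reported throughout can be computed from target (same-speaker) and non-target (different-speaker) score lists; a simplified sketch (a production scorer would interpolate between thresholds rather than pick the nearest one):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(-scores)            # sweep thresholds from high to low
    labels = labels[order]
    far = np.cumsum(1 - labels) / len(nontarget_scores)  # false-accept rate
    frr = 1.0 - np.cumsum(labels) / len(target_scores)   # false-reject rate
    idx = np.argmin(np.abs(far - frr))     # nearest crossing point
    return float((far[idx] + frr[idx]) / 2)

# Perfectly separated scores give 0% EER.
print(compute_eer([0.9, 0.8], [0.1, 0.2]))  # → 0.0
```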
Method     λ        size (M)  EER (%)  minDCF
baseline   –        2.47      1.81     0.23
dense_1    –        1.73      2.03     0.25
dense_2    –        1.42      2.12     0.25
dense_3    –        1.15      2.20     0.27
dense_4    –        0.91      2.44     0.30
dense_5    –        0.70      2.51     0.30
dense_6    –        0.54      2.79     0.35
dense_7    –        0.41      3.62     0.41

Method     λ        size (M)  EER (%)  minDCF
baseline   –        2.47      1.81     0.23
filter_1   0.002    2.14      1.76     0.22
filter_2   0.005    1.70      1.88     0.24
filter_3   7.5e-3   1.27      2.07     0.26
filter_4   0.01     1.04      2.24     0.27
filter_5   0.015    0.79      2.49     0.31
filter_6   0.02     0.69      2.55     0.31
filter_7   0.04     0.50      3.49     0.38

Method      λ        size (M)  EER (%)  minDCF
baseline    –        2.47      1.81     0.23
chunk16_1   2.5e-5   2.28      1.68     0.22
chunk16_2   5e-5     1.73      1.86     0.24
chunk16_3   1e-4     1.33      1.90     0.25
chunk16_4   1.5e-4   1.07      1.96     0.26
chunk16_5   2e-4     0.84      2.28     0.29
chunk16_6   3e-4     0.70      2.49     0.32
chunk16_7   4e-4     0.56      3.13     0.37

Method     λ        size (M)  EER (%)  minDCF
baseline   –        2.47      1.81     0.23
chunk8_1   2e-5     2.29      1.61     0.21
chunk8_2   5e-5     1.33      1.93     0.25
chunk8_3   7.5e-5   0.99      1.99     0.27
chunk8_4   1e-4     0.85      2.29     0.28
chunk8_5   1.5e-4   0.65      2.57     0.33
chunk8_6   2e-4     0.57      3.10     0.36
chunk8_7   4e-4     0.43      3.62     0.42
4.2 Measurements on Intel® GNA
We also measured the actual inference time of several models to find out how much speedup sparse models can achieve on Intel® GNA. We selected four models with similar EER but different sparsity granularity: dense_1, filter_3, chunk16_4, and chunk8_3. We measured the actual inference time on Intel® GNA, using dense_1 as the baseline.
As shown in Figure 3, although the EERs of these models differ only slightly, the sparse models are much faster than the dense model. Among them, chunk8_3 has 1.99% EER, lower than the 2.03% of dense_1, while being substantially faster than the dense_1 model. The result for chunk16_4 is similar. Filter sparsity performs worse than chunk sparsity: filter_3 has a higher EER but a smaller speedup. Overall, Figure 3 shows the benefit of using sparse models over dense models.
5 Conclusion
In this paper, we applied structural sparsification to speaker recognition models. By using pre-trained models and group Lasso regularization, we kept the good performance of the original model while reducing the number of parameters and accelerating actual execution. For structurally sparse models that are only slightly smaller than the full-size dense model, we achieved better performance on both the EER and minDCF metrics.
Acknowledgements. This work is supported by National Science Foundation grant CCF-1910299.
References
 [1] (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622. Cited by: §3.3.
 [2] (2017) Implementation of efficient, low power deep neural networks on nextgeneration intel client platforms. http://sigport.org/1777. Cited by: §3.5.
 [3] (2017) Audio set: an ontology and humanlabeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. Cited by: §3.3.
 [4] (2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
 [5] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §2.
 [6] (2016) Endtoend textdependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115–5119. Cited by: §1, §3.2.
 [7] (2019) Intel FarField Speaker Recognition System for VOiCES Challenge 2019. In Proc. Interspeech 2019, pp. 2473–2477. External Links: Document, Link Cited by: §3.2.

 [8] (2006) Probabilistic linear discriminant analysis. In European Conference on Computer Vision, pp. 531–542. Cited by: §1.
 [9] (2017) Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304. Cited by: §1, §3.2.

 [10] (2015) In The IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1.
 [11] (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612. Cited by: §3.3.
 [12] (2018) Robust speaker recognition from distant speech under real reverberant environments using speaker embeddings.. In Interspeech, pp. 1106–1110. Cited by: §1.
 [13] (2017) Exploring sparsity in recurrent neural networks. arXiv:1704.05119. Cited by: §2.
 [14] OpenVINO toolkit. Note: https://docs.openvinotoolkit.org/ Accessed: 2019-10-14. Cited by: §3.2.
 [15] (2018) Voices Obscured in Complex Environmental Settings (VOiCES) corpus. arXiv preprint arXiv:1804.05053. Cited by: §1, §3.3.
 [16] (2018) Pyroomacoustics: a python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 351–355. Cited by: §3.3.
 [17] (2015) MUSAN: A Music, Speech, and Noise Corpus. Note: arXiv:1510.08484v1 External Links: 1510.08484 Cited by: §3.3.
 [18] (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. Cited by: §1, §3.1.
 [19] (2017) Speech recognition and understanding on hardwareaccelerated DSP. In Proc. Interspeech, pp. 2036–2037. Cited by: §1.
 [20] (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25 (7), pp. 926–930. Cited by: §3.2.
 [21] (2017) Learning intrinsic sparse structures within long shortterm memory. arXiv:1709.05027. Cited by: §2.
 [22] (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, Cited by: §1, §2.

 [23] (2006) Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), pp. 49–67. Cited by: §1.
 [24] (2019) Learning efficient sparse structures in speech recognition. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2717–2721. Cited by: §1, §2, §3.4.