Far-field speaker recognition has gained much interest in the research community, driven by its prevalent applications in consumer devices such as smart speakers and smartphones. The far-field condition presents additional challenges for speaker recognition because of severe reverberation and background noise. As in automatic speech recognition, deep-learning-based acoustic features have shown great improvements in these conditions compared to prior techniques. A number of speaker recognition systems based on deep neural network (DNN) embeddings have been reported in the literature. More recently, SRI developed the VOiCES dataset specifically for far-field speaker recognition and showed that their DNN embeddings significantly outperformed i-vector systems.
The objective of our work is to develop a speaker recognition system that is robust to far-field channel conditions, using advanced model training methodology. Furthermore, we designed the system so that inference is simple to run on ultra-low-power accelerators such as the Intel® GNA. In contrast to the popular approach of using probabilistic linear discriminant analysis (PLDA) in the back-end, our system relies only on simple cosine distance for scoring. This allows the computations to be performed end-to-end on the accelerator. Finally, to get the best model size efficiency, the core of the paper focuses on applying structural sparsification to our DNN model.
There have been extensive studies on accelerating DNN models. Pruning and sparsity methods can effectively reduce the size of CNN models while keeping performance similar to that of the original models. However, randomly distributed zeros in a model do not benefit execution on hardware. Prior work elaborates the benefit of structural sparsity over non-structural sparsity for locality and parallelism during hardware execution. To force zero parameters into a regular arrangement, structural sparsity has been proposed for CNNs to learn sparse structures such as channels and filters.
In order to reduce the model size and the inference time, we apply a structural sparsity learning method to speaker recognition models. The sparse structure we obtain is computationally friendly to specific hardware. Specifically, we add a group Lasso penalty to the loss function, where each group is a structure that we want to be sparse. With fewer non-zero parameters, the sparse models perform the same as or even better than the baseline. We also test our method at three different levels of sparse granularity and find that, for the same number of non-zero parameters, models with smaller granularity achieve a lower equal error rate (EER) than models with larger granularity. For the same number of non-zero parameters, sparse models outperform dense models at every granularity, provided the parameter count is above roughly 0.7 million in our experiments (Section 4.1).
2 Related work
Computational acceleration methods have been heavily explored in recent years. Pruning and sparsification have proven effective at removing redundant parameters and structures. Pruning the connections of fully connected layers has been shown to be effective at reducing the size of AlexNet and VGG-16. However, in CNNs most of the computation comes from the convolution layers. From this perspective, Wen et al. propose a framework that reduces model size by eliminating redundant structures in CNNs such as filters or channels. They report a speedup for AlexNet on GPU while keeping the accuracy unchanged.
For speech recognition tasks, recurrent neural network (RNN) and long short-term memory (LSTM) models are widely used. It is more difficult to learn sparse structures for these models because the structures usually carry information across time steps, so eliminating them has a larger impact on performance. Narang et al. applied connection pruning to RNNs and removed 90% of the connections. Wen et al. further applied group Lasso regularization to LSTMs and achieved a speedup without perplexity loss. Zhang et al. also extended the structural sparsity learning method to LSTM models for speech recognition and removed 72.5% of the parameters with negligible accuracy loss.
3 Experimental setup
3.1 Model topology
This work is based on the x-vector model structure, with some simplifications. Compared to the original x-vector model, our architecture, shown in Table 1, increases the input feature dimension from 24 to 40, reduces the pooling dimension from 1500 to 512, removes a fully-connected layer between the embedding and speaker output layers, and reduces the embedding dimension from 512 to 256. In our testing, these modifications did not degrade recognition performance and resulted in much lower complexity. We use this topology as the baseline for structural sparsity learning. In this particular case, each time-delay neural network (TDNN) layer can be written as a one-dimensional convolution, so we implemented the model as a 5-layer CNN.
The softmax output is used only for model training; for speaker enrollment and verification, the DNN embedding is taken at the output of Segment6 in Table 1. One speaker embedding is computed for an entire utterance, regardless of length. We use the cosine distance between the 256-dimensional embedding vectors of the enrollment and test utterances to produce the speaker recognition score.
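The scoring step described above can be illustrated with a minimal sketch. It assumes the score is the cosine similarity between the two embeddings (higher means more likely the same speaker); the function names and the decision threshold are hypothetical and not taken from the paper.

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between an enrollment and a test embedding."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_enroll = math.sqrt(sum(a * a for a in enroll))
    norm_test = math.sqrt(sum(b * b for b in test))
    if norm_enroll == 0.0 or norm_test == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_enroll * norm_test)


def same_speaker(enroll, test, threshold=0.5):
    """Accept the trial if the score reaches a (hypothetical) threshold."""
    return cosine_score(enroll, test) >= threshold
```

Because the score depends only on a dot product and two norms, no separate back-end model is needed at inference time, which is what makes this scoring easy to keep on the accelerator.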
N denotes the number of training speakers.
3.2 Loss function
While the conventional softmax loss works reasonably well for training speaker embeddings, it is designed for classification, not verification. Speaker recognition systems trained with softmax loss typically use PLDA in the back-end to improve separation between speakers. The triplet loss function, which is designed to reduce intra-speaker distance and increase inter-speaker distance, has been shown to be more effective for speaker recognition. Likewise, the end-to-end loss performs better than softmax. The downside of these losses is that the training infrastructure is significantly more complicated than the one used for supervised learning with softmax. In a prior study, we explored several recently proposed loss functions that were first introduced in face recognition research. These loss functions are drop-in replacements for softmax, so the training code needs only simple modifications and incurs little overhead in training speed. We found that Additive Margin Softmax (AM-softmax) performed best on the far-field test set, and that incorporating PLDA did not improve performance over the simpler cosine distance. Eliminating PLDA from the inference pipeline makes the entire model easy to deploy to target hardware with the help of tools such as the Intel® Distribution of OpenVINO™ toolkit.
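To make the "drop-in replacement" point concrete, the following is a minimal single-sample sketch of the AM-softmax loss. It assumes the inputs are already the cosines between the length-normalized embedding and each length-normalized class weight vector. The scale of 30 and margin of 0.35 are illustrative defaults, not values reported in this paper.

```python
import math


def am_softmax_loss(cosines, target, scale=30.0, margin=0.35):
    """AM-softmax loss for one sample.

    cosines: cos(theta_j) between the embedding and each class weight.
    target:  index of the true speaker.
    """
    # Subtract the additive margin from the target-class cosine only.
    logits = [
        scale * (c - margin) if j == target else scale * c
        for j, c in enumerate(cosines)
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]
```

With `margin=0.0` and `scale=1.0` this reduces to ordinary softmax cross-entropy on the cosines, which is why only the loss computation changes and the rest of the training loop stays the same.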
3.3 Datasets and augmentation
We use VoxCeleb 1 and 2 to train the system. Combined, these datasets contain 7323 identities. We perform 9x data augmentation and keep the original clean speech, producing 12.7 million training utterances. For each augmentation, we randomly choose from 2000 room impulse responses generated with Pyroomacoustics and add randomly selected background noise from MUSAN and AudioSet. For the test set, we use the VOiCES far-field dataset, which we believe captures the essence of challenging channel conditions. For all speech utterances, we use 40-dimensional log-mel filterbank features with mean subtraction over a 3-second sliding window.
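The feature normalization mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a 10 ms frame shift (so 3 seconds is 300 frames) and a window centered on the current frame, neither of which is specified in the text. The function name is hypothetical, and the loop is deliberately naive for readability.

```python
def sliding_mean_subtract(frames, window=300):
    """Subtract a sliding-window mean from each feature frame.

    frames: list of feature vectors (lists of floats), one per frame.
    window: window length in frames (300 assumes a 10 ms frame shift).
    """
    n = len(frames)
    if n == 0:
        return []
    dim = len(frames[0])
    half = window // 2
    normalized = []
    for t in range(n):
        # Clip the (assumed centered) window at the utterance boundaries.
        lo = max(0, t - half)
        hi = min(n, t - half + window)
        count = hi - lo
        mean = [sum(frames[k][d] for k in range(lo, hi)) / count
                for d in range(dim)]
        normalized.append([frames[t][d] - mean[d] for d in range(dim)])
    return normalized
```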
3.4 Training details
We describe our training pipeline as a three-step process:
Baseline model training: We find that we get significantly better results when we start the sparsification process from a well-trained dense model. We train the model with AM-softmax loss and an SGD optimizer whose learning rate decays from 0.01 to 0.0001 over 30 epochs with cosine annealing. The weight decay and batch size are set to 1e-6 and 256, respectively. For each batch, we select random segments of training utterances between 2.5 and 3.0 seconds long. These settings, except for the number of epochs, are also used in the subsequent steps. The output of this step is the best dense model we can produce, and it also serves as the baseline against which EER is measured.
Learning sparse structure: We initialize from the dense model produced in step 1 and train for 20 epochs with group Lasso regularization added to the AM-softmax loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{AMS}} + \lambda \sum_{g=1}^{G} \lVert \mathbf{w}^{(g)} \rVert_2$$

where the first term is the original AM-softmax loss function and the second term is the contribution from the group Lasso loss function. The group Lasso loss is essentially the sum over $G$ (the total number of groups) of the L2 norms of the weights $\mathbf{w}^{(g)}$ in predefined groups (e.g. chunks of 8 or 16, or an entire convolution filter). Minimizing it drives the weights of whole groups toward small values. The coefficient $\lambda$ controls the balance between the AM-softmax loss and the group Lasso loss. This step produces sparse structures by training on the new loss function $\mathcal{L}$. Groups whose L2 norm falls below a threshold are set to 0 and excluded from learning in the next step.
Fine-tuning: Lastly, we fine-tune the sparse model produced by step 2 for 20 epochs using only the AM-softmax loss.
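The regularized objective and the thresholding in step 2 can be sketched in a few lines. This is a minimal illustration, not the authors' code: weights are plain Python lists, each inner list is one predefined group, and the names `lam` (the balancing coefficient) and `threshold` are hypothetical placeholders whose values the text does not give.

```python
import math


def l2_norm(group):
    """L2 norm of the weights in one group."""
    return math.sqrt(sum(w * w for w in group))


def group_lasso(groups):
    """Group Lasso penalty: sum of the L2 norms of all groups."""
    return sum(l2_norm(g) for g in groups)


def total_loss(am_softmax, groups, lam):
    """AM-softmax loss plus the weighted group Lasso penalty."""
    return am_softmax + lam * group_lasso(groups)


def prune_groups(groups, threshold):
    """Set every group whose L2 norm is below the threshold to zero."""
    return [
        [0.0] * len(g) if l2_norm(g) < threshold else list(g)
        for g in groups
    ]
```

Because the penalty is a sum of unsquared norms, it pushes entire groups to zero together rather than shrinking individual weights independently, which is what yields hardware-friendly zero blocks.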
More detail on steps 2 and 3 can be found in Zhang et al.
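For reference, the learning-rate schedule used in step 1 can be sketched as below. The sketch assumes the standard cosine annealing form updated once per epoch; the text does not say whether the rate was updated per epoch or per iteration, and the function name is hypothetical.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=30, lr_max=0.01, lr_min=0.0001):
    """Learning rate after `epoch` epochs of cosine annealing."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `lr_max`, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches `lr_min`.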
3.5 Hardware implementation
This work targets TDNN inference on the Intel® Gaussian & Neural Accelerator (GNA). Intel® GNA is designed for continuous neural network inference on edge devices with high performance and very low power consumption. Since Intel® GNA fetches weight matrices in 16-byte chunks of int8 or int16 weights, we investigated structural sparsity on chunks of 8 int16 elements or 16 int8 elements. Inference measurements were made on an Intel® Celeron® Processor J4005 with integrated Intel® GNA.
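The chunk sizes above follow directly from the 16-byte fetch width. The sketch below shows the arithmetic and one way to partition a weight row into chunk groups. It assumes chunks are taken over consecutive elements of a row whose length is a multiple of the chunk size; the actual memory layout used on Intel® GNA is not described in the text, and the function names are hypothetical.

```python
FETCH_BYTES = 16  # fetch width stated in the text


def elements_per_chunk(bytes_per_weight):
    """Number of weights in one 16-byte chunk (2 for int16, 1 for int8)."""
    if bytes_per_weight <= 0 or FETCH_BYTES % bytes_per_weight != 0:
        raise ValueError("weight size must evenly divide the fetch width")
    return FETCH_BYTES // bytes_per_weight


def split_into_chunks(row, bytes_per_weight):
    """Partition one weight row into consecutive chunk-sized groups."""
    size = elements_per_chunk(bytes_per_weight)
    if len(row) % size != 0:
        raise ValueError("row length must be a multiple of the chunk size")
    return [row[i:i + size] for i in range(0, len(row), size)]
```

A chunk that is entirely zero never needs to be fetched or multiplied, so aligning the sparsity groups with the fetch width is what turns zeros into saved work.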
In the experiments, the sparsity of filters is defined as the number of zero filters divided by the total number of filters, while the sparsity of chunks is defined as the number of zero chunks divided by the total number of chunks. We applied sparsity learning only to layers 1-4. Our experiments showed that layers 5 and above resisted sparsification. We suspect that this is because, near the output of the network, the hidden representations carry a high density of information for speaker recognition. This seems to happen at the input of the stats pooling layer.
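The sparsity definition at the start of this paragraph applies equally to filters and chunks: it is the fraction of groups that are entirely zero. A minimal sketch, with a hypothetical function name and groups represented as lists of weights:

```python
def group_sparsity(groups):
    """Fraction of groups (filters or chunks) whose weights are all zero."""
    if not groups:
        raise ValueError("at least one group is required")
    zero_groups = sum(1 for g in groups if all(w == 0 for w in g))
    return zero_groups / len(groups)
```

Note that a group containing a single non-zero weight counts as non-zero, so this measure can be lower than the fraction of individual zero weights.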
4.1 Result analysis
The experimental results are shown in Table 2. We applied structural sparsity to filters and to chunks. Filter sparsity can be deployed on all hardware without any special modification, whereas chunk-8 and chunk-16 sparsity are targeted at Intel® GNA. We also ran experiments on dense models to compare the performance of sparse and dense models.
Figure 1 visualizes the relationship between the group Lasso coefficient λ and the sparsity in each layer. The Y-axis denotes the overall percentage of sparsity across the four layers. It clearly shows that sparsity increases as λ increases. However, the sparsity grows differently in each layer, and filter sparsity shows a different growth trend from chunk sparsity. In Figure 1(a), sparse filters in the first layer (blue bar) account for much of the overall sparsity. In Figure 1(b) and (c), however, the first layer is not very sparse, while layers 2 and 3 contain the majority of the chunks learned to be zero. We suspect that the low sparsity in layer 1 is due to the denser spectral input dimension compared to other layers, and that in layer 4 the output representation is becoming more relevant to the speaker recognition task, so making the network sparse there would incur a higher penalty on the AM-softmax loss.
Figure 2 visualizes the relationship between the number of non-zero parameters and the EER or minimum detection cost function (minDCF). The X-axis represents the number of non-zero parameters, and the Y-axis is the EER or minDCF. We compared filter sparsity, chunk-8 sparsity, and chunk-16 sparsity with dense models of different sizes. Figure 2(a) shows that when the number of parameters is large, sparse models achieve lower EER than dense models of the same size. However, when the number of non-zero parameters is small, dense models perform better. In our experimental setting, the turning point is around 0.7 million parameters. For example, at an EER of around 2.0%, models with smaller granularity are clearly smaller. Chunk-8 models reach 1.99% EER with 0.99 million parameters, and chunk-16 reaches 1.96% EER with 1.07 million parameters. By comparison, a smaller dense model reaches 2.03% EER with 1.73 million parameters, so chunk-8 and chunk-16 both reach a lower EER with roughly 60% of the parameters (about 57% and 62%, respectively). Also, when the non-zero parameter count is larger than 1.5 million, chunk-8 tends to perform best, while filter sparsity has higher EER for the same non-zero parameter count. For minDCF, shown in Figure 2(b), we observe patterns similar to those seen for EER.
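As a quick check on the size comparison at around 2.0% EER, the parameter fractions can be recomputed from the rounded counts quoted above. The helper name is hypothetical. With these rounded figures, chunk-8 comes out at about 57% of the dense model and chunk-16 at about 62%.

```python
def parameter_fraction(sparse_millions, dense_millions):
    """Non-zero parameters of a sparse model relative to a dense model."""
    if dense_millions <= 0:
        raise ValueError("dense parameter count must be positive")
    return sparse_millions / dense_millions


DENSE_MILLIONS = 1.73  # dense model at 2.03% EER
chunk8_fraction = parameter_fraction(0.99, DENSE_MILLIONS)   # 1.99% EER
chunk16_fraction = parameter_fraction(1.07, DENSE_MILLIONS)  # 1.96% EER
```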
A somewhat surprising finding in these results is that filter_1, chunk8_1, and chunk16_1, which have fewer parameters, achieve slightly lower EERs of 1.76%, 1.61%, and 1.68%, respectively, compared to the baseline EER of 1.81%. We believe this is because the group Lasso loss is an effective regularizer that, used in a small dose, helps produce models that generalize better.
4.2 Measurements on Intel® GNA
We also measured the actual inference time of some models to find out how much speedup sparse models can achieve on Intel® GNA. We selected four models with similar EER but different sparsity granularity: dense_1, filter_3, chunk16_4, and chunk8_3. We use dense_1 as the baseline.
As shown in Figure 3, although the EERs lie within 10% of each other, the sparse models are much faster than the dense model. Among them, chunk8_3 has an EER of 1.99%, lower than the 2.03% of dense_1, while also running faster than dense_1. The result for chunk16_4 is similar. Filter sparsity performs worse than chunk sparsity: filter_3 has a higher EER and a smaller speedup. Overall, Figure 3 shows the benefit of using sparse models compared to dense models.
In this paper, we applied structural sparsification to speaker recognition models. By using pretrained models and group Lasso regularization, we preserved the good performance of the original model while reducing the number of parameters and accelerating actual execution. Structurally sparse models that are only slightly smaller than the full-size dense model achieved better performance on both the EER and minDCF metrics.
Acknowledgements. This work is supported by the National Science Foundation CCF-1910299.
- (2018) VoxCeleb2: deep speaker recognition. arXiv preprint arXiv:1806.05622.
- (2017) Implementation of efficient, low power deep neural networks on next-generation Intel client platforms. http://sigport.org/1777.
- (2017) Audio Set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780.
- (2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems.
- (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
- (2016) End-to-end text-dependent speaker verification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115–5119.
- (2019) Intel far-field speaker recognition system for VOiCES Challenge 2019. In Proc. Interspeech 2019, pp. 2473–2477.
- Probabilistic linear discriminant analysis. In European Conference on Computer Vision, pp. 531–542.
- (2017) Deep Speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304.
- In The IEEE Conference on Computer Vision and Pattern Recognition.
- (2017) VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
- (2018) Robust speaker recognition from distant speech under real reverberant environments using speaker embeddings. In Interspeech, pp. 1106–1110.
- (2017) Exploring sparsity in recurrent neural networks. arXiv:1704.05119.
- OpenVINO toolkit. https://docs.openvinotoolkit.org/ (accessed 2019-10-14).
- (2018) Voices obscured in complex environmental settings (VOiCES) corpus. arXiv preprint arXiv:1804.05053.
- (2018) Pyroomacoustics: a Python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 351–355.
- (2015) MUSAN: a music, speech, and noise corpus. arXiv:1510.08484v1.
- (2018) X-vectors: robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333.
- (2017) Speech recognition and understanding on hardware-accelerated DSP. In Proc. Interspeech, pp. 2036–2037.
- (2018) Additive margin softmax for face verification. IEEE Signal Processing Letters 25 (7), pp. 926–930.
- (2017) Learning intrinsic sparse structures within long short-term memory. arXiv:1709.05027.
- (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems.
- Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1), pp. 49–67.
- (2019) Learning efficient sparse structures in speech recognition. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2717–2721.