Sparse Architectures for Text-Independent Speaker Verification Using Deep Neural Networks

05/19/2018 ∙ by Sara Sedighi, et al.

Network pruning is of great importance because it eliminates unimportant weights and features that are activated only due to network over-parametrization. Enforcing sparsity helps prevent overfitting and yields computational speedup. Given the large number of parameters in deep architectures and the huge computational power they require, network compression becomes critically important. In this work, we impose structured sparsity for speaker verification, i.e., validating a query speaker against a gallery of speaker models. We show that sparsity enforcement alone can improve verification results by countering the initial overfitting of the network.




I Introduction

Recent advancements in deep learning have produced new approaches for training deep networks that achieve near-human performance in image and object recognition, speech recognition [1, 2, 3, 4], and data mining [5, 6]. Approaches based on information theory have also been proposed to provide a framework for better interpreting deep architectures [7]. Some of these approaches, such as dropout [8], address the overfitting issue [9]. When training deep neural networks, over-parametrization makes the architecture unnecessarily complicated, and huge computational power is required for training and model evaluation [10].

Up to now, different approaches have been proposed for compressing models, including model compression [11, 12] and pruning [13, 14, 15]. Some previous works, such as [16], have shown that training only a small portion of the weights is sufficient when the rest can be predicted by kernel-based estimators. Many of the previously proposed methods rely on multiple steps of tuning, which makes them hard to scale. One central issue is model complexity and the computational burden related to the large number of network parameters. Feature selection is one approach to reducing the number of unimportant neurons: selecting the important features and removing unimportant elements effectively imposes weight pruning. A large number of feature selection methods in this area, such as PCA and autoencoders (AEs), have been proposed.

For effective network compression, different methods such as group lasso [17], structure scale constraining [18], and Structured Sparsity Learning (SSL) [19] have been proposed. Most of these works, however, do not address how accuracy relates to compression. In this work, we impose structured sparsity for speaker verification and show that simply sparsifying the network can improve the verification results.

II Imposing Sparsity

II-A Group Sparse Regularization

We focus on enforcing group sparsity to prune convolutional and fully-connected layers. Group lasso has been widely used for feature selection by enforcing sparsity on groups of weights [17, 20]. The objective of group sparsity is to select the effective channels of a convolutional layer or the effective neurons of a fully-connected layer.

Assume a convolutional layer is represented by $W \in \mathbb{R}^{C_{in} \times K \times K \times C_{out}}$, where $C_{in}$ indicates the number of input channels, $K$ shows the kernel dimension, and $C_{out}$ is the number of output channels. The objective loss function is as follows:

$$\mathcal{L} = \mathcal{L}_{task} + \lambda \, \mathcal{L}_{gs}$$

in which $\mathcal{L}_{gs}$ indicates the group sparsity loss and $\lambda$ is the hyper-parameter weighting the loss; for a convolutional layer, the number of groups corresponds to the number of channels of the layer. Assuming $L$ different groups of weights, the group sparsity loss is defined as follows:

$$\mathcal{L}_{gs} = \sum_{g=1}^{L} \sqrt{\left| w^{(g)} \right|} \, \left\| w^{(g)} \right\|_2$$

in which $w^{(g)}$ denotes the $g$-th group of weights and $|w^{(g)}|$ is the number of weights in the group.

Fig. 1: Group sparsity on the fully-connected layers.

The main objective of group sparse regularization is the removal of redundant features that are activated because of network over-parametrization. For fully-connected layers, a group is defined as all the weights connected to one neuron, as shown in Fig. 1.
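As a concrete illustration, the group-lasso penalty above can be sketched in a few lines of plain Python. This is a hypothetical minimal version for illustration only; in practice the penalty would be computed over a framework's weight tensors.

```python
import math

def group_sparsity_loss(groups):
    """Group-lasso penalty: sum over groups of sqrt(group size) * L2 norm.

    `groups` is a list of weight groups, e.g. all weights of one output
    channel (conv layer) or all weights into one neuron (FC layer).
    """
    total = 0.0
    for g in groups:
        l2 = math.sqrt(sum(w * w for w in g))  # L2 norm of the group
        total += math.sqrt(len(g)) * l2        # size-weighted group norm
    return total

# A group whose norm is driven to zero contributes nothing and can be pruned.
penalty = group_sparsity_loss([[3.0, 4.0], [0.0, 0.0]])
```

Because the penalty acts on whole groups rather than individual weights, entire channels or neurons can be removed, which is what yields structural rather than merely element-wise sparsity.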

III Experiments

Speaker verification compares a query speaker against a gallery of speaker models to validate the claimed identity. It is mainly divided into text-dependent and text-independent settings. In the text-dependent setting, the spoken utterances are the same for enrollment and test; in the text-independent setting, no assumption is made about the utterances. The challenge of the latter scenario comes from the fact that the features must be discriminative with respect to both speaker and speech content.
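To make the verification protocol concrete, the following minimal sketch scores a query embedding against a speaker's gallery embeddings by Euclidean distance and accepts if the best distance falls below a threshold. The function names and the threshold value are illustrative assumptions, not details from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def verify(query_emb, gallery_embs, threshold):
    """Accept the claimed identity if the query is close enough to any
    gallery embedding of that speaker."""
    best = min(euclidean(query_emb, g) for g in gallery_embs)
    return best <= threshold

accepted = verify([0.1, 0.2], [[0.1, 0.25], [0.9, 0.9]], threshold=0.2)
```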

Input: For each sample sound file, a window of 25ms with 15ms overlap is used, and the result is a spectrogram for each 1-second segment of audio. For the third dimension, first- and second-order derivative features are appended. The SpeechPy library [21] has been used for speech feature extraction.
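The windowing arithmetic can be sketched as follows. This is a pure-Python illustration of the 25 ms window with 15 ms overlap (i.e., a 10 ms hop); the 16 kHz sample rate is an assumption, and the paper itself uses SpeechPy for the actual feature extraction.

```python
def frame_count(num_samples, sample_rate=16000, win_ms=25, overlap_ms=15):
    """Number of analysis frames for a signal, given window and overlap."""
    win = int(sample_rate * win_ms / 1000)                  # 400 samples at 16 kHz
    hop = int(sample_rate * (win_ms - overlap_ms) / 1000)   # 160 samples (10 ms hop)
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop

# A 1-second utterance at 16 kHz yields 98 overlapping frames.
frames = frame_count(16000)
```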


Dataset: We used the VoxCeleb dataset for our experiments [22]. Of the 1211 available speakers, 40 are chosen for test and the rest are used for training, as suggested in [22]. The raw audio is extracted from regular YouTube videos, which include a variety of internal differences such as background noise and varying recording quality, making the dataset very challenging. For our experiments, we choose very short 1-second utterances and apply Voice Activity Detection (VAD) to remove the silent parts. Choosing short utterances has forensic applications and makes our experiments more challenging. It is also more realistic, because in real-world applications often only short utterances from each subject are available.

For the speaker verification architecture, we used convolutional neural networks due to their superiority in applications such as action recognition [23], object recognition [3], speaker verification, and audio-visual matching.

Training and optimization objective: The architecture, shown in Fig. 2, consists of two deep neural networks with weight sharing. This is a Siamese neural network [24], which has been utilized for different applications [25, 26, 27]. The main objective of a Siamese network is the creation of a joint output feature space that distinguishes between genuine and impostor pairs. The idea is that if the two elements of an input pair come from the same subject, their outputs should be close under a simple distance metric, and far apart if they have different identities. The training loss function should therefore account for both conditions. The contrastive loss is used for this aim and is defined as follows:

$$\mathcal{L}_{total} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(Y_i, (X_{1}, X_{2})_i\big)$$

where $N$ is the number of training samples, and $\mathcal{L}\big(Y_i, (X_{1}, X_{2})_i\big)$ is defined as follows:

$$\mathcal{L}\big(Y_i, (X_{1}, X_{2})_i\big) = Y_i \, \mathcal{L}_{gen}(D_W) + (1 - Y_i) \, \mathcal{L}_{imp}(D_W) + \lambda \|W\|_2^2$$

in which $\lambda$ is the regularization parameter, and $\mathcal{L}_{gen}$ and $\mathcal{L}_{imp}$ are the associated costs for the genuine and impostor pairs, respectively, defined as functions of $D_W$:

$$\mathcal{L}_{gen}(D_W) = \frac{1}{2} D_W^2, \qquad \mathcal{L}_{imp}(D_W) = \frac{1}{2} \max\{0, \, M - D_W\}^2$$

for which $M$ is a predefined margin and $D_W$ is the Euclidean distance between the output features associated with the input pair.
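The contrastive loss can be sketched directly in a few lines. This is a minimal plain-Python version for a single pair, omitting the weight-regularization term, and assuming the convention Y = 1 for genuine pairs.

```python
def contrastive_loss(y, d_w, margin=1.0):
    """Contrastive loss for one pair.

    y      -- 1 for a genuine pair, 0 for an impostor pair
    d_w    -- Euclidean distance between the two output embeddings
    margin -- predefined margin M for impostor pairs
    """
    loss_gen = 0.5 * d_w ** 2                     # pull genuine pairs together
    loss_imp = 0.5 * max(0.0, margin - d_w) ** 2  # push impostors beyond M
    return y * loss_gen + (1 - y) * loss_imp

# Impostor pairs already farther apart than the margin incur zero loss,
# so the gradient focuses on hard negatives inside the margin.
```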

Architecture: VGG-Net has been chosen as an effective model for image classification [28]. The architecture is modified to match the input features. The output dimensionality of the last layer is set to 64. Average pooling has also been employed for spatial dimension matching [29].

Fig. 2: The general employed architecture for speaker verification.

Results and comparison: The proposed approach is compared with baseline methods such as GMM-UBM [30], which uses 39 MFCC coefficients including first- and second-order derivatives; its Universal Background Model (UBM) has 512 mixture components. The I-vector system [31] is also used for comparison. The results are depicted in Table I. As can be observed, imposing the proposed sparsity approach outperforms the other methods.

Model            EER (%)
GMM-UBM [30]     28.22
I-vectors [31]   24.91
CNN [baseline]   24.68
CNN [SSL]        24.11

TABLE I: Comparing the proposed approach with the other methods.
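The Equal Error Rate (EER) reported in Table I is the operating point where the false-acceptance and false-rejection rates meet. The following minimal sketch approximates it from distance-based scores (accept if distance is at most the threshold); the scanning strategy and score convention are illustrative assumptions, not the paper's evaluation code.

```python
def equal_error_rate(genuine_d, impostor_d):
    """Approximate EER by scanning candidate thresholds over all scores.

    genuine_d  -- distances for same-speaker pairs (should be small)
    impostor_d -- distances for different-speaker pairs (should be large)
    """
    best = None
    for t in sorted(genuine_d + impostor_d):
        frr = sum(d > t for d in genuine_d) / len(genuine_d)     # false rejections
        far = sum(d <= t for d in impostor_d) / len(impostor_d)  # false acceptances
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Perfectly separable score distributions give an EER of 0.
eer = equal_error_rate([0.1, 0.2], [0.5, 0.6])
```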

An important benefit of enforcing sparsity is speedup. The results are shown in Fig. 3.

Fig. 3: The speedup for separate layers.

IV Conclusion

In this work, we proposed the application of sparsity imposition to speaker verification. Experimental results demonstrated the effectiveness of enforcing sparsity, owing to its power to prevent overfitting. This is a direct outcome of removing unimportant elements of the network, such as neurons in fully-connected layers and output filters in convolutional layers.


  • [1] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
  • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
  • [3] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 922–928, IEEE, 2015.
  • [4] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 4052–4056, IEEE, 2014.
  • [5] M. Piergallini, R. Shirvani, G. S. Gautam, and M. Chouikha, “Word-level language identification and predicting codeswitching points in swahili-english language data,” in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 21–29, 2016.
  • [6] R. Shirvani, M. Piergallini, G. S. Gautam, and M. Chouikha, “The howard university system submission for the shared task in language identification in spanish-english codeswitching,” in Proceedings of The Second Workshop on Computational Approaches to Code Switching, pp. 116–120, 2016.
  • [7] C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
  • [8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [9] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [10] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, pp. 3123–3131, 2015.
  • [11] J. Ba and R. Caruana, “Do deep nets really need to be deep?,” in Advances in neural information processing systems, pp. 2654–2662, 2014.
  • [12] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [13] R. Reed, “Pruning algorithms-a survey,” IEEE transactions on Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
  • [14] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
  • [15] M. D. Collins and P. Kohli, “Memory bounded deep convolutional networks,” arXiv preprint arXiv:1412.1442, 2014.
  • [16] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al., “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
  • [17] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
  • [18] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.
  • [19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
  • [20] L. Meier, S. Van De Geer, and P. Bühlmann, “The group lasso for logistic regression,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53–71, 2008.
  • [21] A. Torfi, “Speechpy-a library for speech processing and recognition,” arXiv preprint arXiv:1803.01094, 2018.
  • [22] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
  • [23] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
  • [24] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 539–546, IEEE, 2005.
  • [25] X. Sun, A. Torfi, and N. Nasrabadi, “Deep siamese convolutional neural networks for identical twins and look-alike identification,” Deep Learning in Biometrics, p. 65, 2018.
  • [26] R. R. Varior, M. Haloi, and G. Wang, “Gated siamese convolutional neural network architecture for human re-identification,” in European Conference on Computer Vision, pp. 791–808, Springer, 2016.
  • [27] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, vol. 2, 2015.
  • [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [29] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
  • [30] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
  • [31] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.