I Introduction
Recent advancements in deep learning have suggested new approaches for training deep networks, leading to almost human-level performance in image and object recognition, speech recognition [1, 2, 3, 4], and data mining [5, 6]. Approaches based on information theory have also been proposed to provide a framework for better interpreting deep architectures [7]. Some of these new approaches, such as dropout [8], address the overfitting issue [9]. For training deep neural networks, overparameterizing the network makes the architecture unnecessarily complicated, and huge computational power is required for training and model evaluation [10]. Different approaches have been proposed for compressing models, including model compression [11, 12] and pruning [13, 14, 15]. Previous works such as [16] have argued that training only a small portion of the weights is enough, with the rest predicted by kernel-based estimators. A large number of the previously proposed methods rely on multiple steps of tuning, which makes them hardly scalable. One central issue is the model complexity and computational burden caused by the large number of network parameters. Feature selection is one approach for reducing the number of unimportant neurons: selecting the important features and removing unimportant elements effectively imposes weight pruning. A large number of feature selection methods, such as PCA and autoencoders (AEs), have been proposed in this field.
For effective network compression, different methods have been proposed, such as utilizing the group lasso [17], structure scale constraining [18], and Structured Sparsity Learning (SSL) [19]. Most of these works, however, do not address how accuracy relates to the degree of compression. In this work, we propose imposing structured sparsity for speaker verification. We show that simply sparsifying the network can improve speaker verification results.
II Imposing sparsity
II-A Group sparse regularization
We focus on enforcing group sparsity to prune convolutional and fully-connected layers. Group lasso has widely been used for feature selection by enforcing sparsity on groups of weights [17, 20]. The objective of group sparsity is to select the effective channels or neurons, in the case of a convolutional layer or a fully-connected layer, respectively.
Assume a convolutional layer is represented by its weight tensor $W \in \mathbb{R}^{C \times K \times K \times F}$, where $C$ indicates the number of input channels, $K$ is the kernel dimension, and $F$ is the number of output channels. The objective loss function is as follows:

$\mathcal{L} = \mathcal{L}_{S} + \lambda \, \mathcal{L}_{gs}$   (1)

in which $\mathcal{L}_{S}$ is the task loss, $\mathcal{L}_{gs}$ indicates the group sparsity loss, and $\lambda$ is the hyperparameter weighting the sparsity term. For a convolutional layer, the groups are taken to be its output channels. Assuming the weights are partitioned into $L$ groups, the group sparsity term is defined as follows:

$\mathcal{L}_{gs} = \sum_{l=1}^{L} \sqrt{|G_l|} \, \big\| w^{(l)} \big\|_2$   (2)

in which $w^{(l)}$ denotes the $l$-th group of weights and $|G_l|$ is the number of weights in that group.
The main objective of group sparse regularization is the removal of redundant features that become active due to network overparameterization. For a fully-connected layer, a group is all the weights connected to one neuron, as shown in Fig. 1.
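The group sparsity penalty of Eq. (2) can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation; the filter-wise grouping and the layer shape are assumptions chosen for the convolutional case described above.

```python
import numpy as np

def group_sparsity_penalty(w):
    """Group-lasso penalty: sum over groups of sqrt(|G_l|) * ||w_l||_2.

    w: conv weights of shape (out_channels, in_channels, k, k).
    Each output filter is treated as one group, so driving a whole
    group to zero removes that filter.
    """
    flat = w.reshape(w.shape[0], -1)          # (num_groups, weights_per_group)
    group_size = flat.shape[1]                # |G_l|, equal for all groups here
    return float(np.sqrt(group_size) * np.linalg.norm(flat, axis=1).sum())

# Hypothetical layer: 4 output filters, 3 input channels, 3x3 kernels.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 3, 3))
penalty = group_sparsity_penalty(w)
```

During training this penalty would be scaled by $\lambda$ and added to the task loss, as in Eq. (1).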
III Experiments
Speaker verification compares a query speaker against a gallery of speaker models to validate the claimed identity. Speaker verification is mainly divided into text-independent and text-dependent types. In the text-dependent setting, the available spoken utterances are the same; in the text-independent setting, however, no assumption is made about the utterances. The challenge in the latter scenario comes from the fact that the features must be distinguishable with respect to both speaker and speech information.
Input: For each sample sound file, a 25 ms window with 15 ms overlap is used, and the result is a spectrogram for each 1-second segment of audio. For the third dimension, first- and second-order derivative features are appended. The SpeechPy library has been used for speech feature extraction [21].

Dataset: We used the VoxCeleb dataset for our experiments [22]. Of the 1211 available speakers, 40 are chosen for testing and the rest are used for training, as suggested in [22]. The raw audio is extracted from regular YouTube videos, which exhibit a variety of internal differences such as background noise and different recording qualities, making the dataset very challenging. For our experiments, we choose very short 1-second utterances, with Voice Activity Detection (VAD) used to remove the silent parts. Choosing short utterances has forensic applications and makes our experiments challenging. It is also more realistic: in real-world applications, most of the time only short utterances of different subjects are available.
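The 25 ms window with 15 ms overlap described above corresponds to a 10 ms frame shift. A quick sketch of the resulting frame count for a 1-second clip; the 16 kHz sample rate is an assumption, as the text does not state it:

```python
def num_frames(duration_s=1.0, sample_rate=16000, win_ms=25, overlap_ms=15):
    """Number of analysis frames for a clip of the given duration.

    A win_ms window with overlap_ms overlap implies a (win_ms - overlap_ms)
    frame shift. Sample rate is assumed, not given in the text.
    """
    win = int(sample_rate * win_ms / 1000)                 # samples per window
    hop = int(sample_rate * (win_ms - overlap_ms) / 1000)  # samples per shift
    n = int(duration_s * sample_rate)                      # samples in the clip
    return 1 + (n - win) // hop

frames = num_frames()  # frames per 1-second utterance under these assumptions
```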
For the speaker verification architecture, we used convolutional neural networks due to their success in applications such as action recognition [23], object recognition [3], speaker verification, and audio-visual matching.

Training and optimization objective: The architecture, shown in Fig. 2, consists of two deep neural networks with weight sharing. This is a Siamese neural network [24], which has been utilized for a variety of applications [25, 26, 27]. The main objective of a Siamese network is the creation of a joint output feature space that distinguishes between genuine and impostor pairs. The idea is that if the two elements of an input pair come from the same subject, their outputs should be close under a simple distance metric, and far apart if the elements have different identities. The training loss function should therefore consider both conditions. The contrastive loss is used for this aim and is defined as follows:
$\mathcal{L}_W(Y, X_1, X_2) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_W\big(Y_i, (X_1, X_2)_i\big)$   (3)

where $N$ is the number of training pairs, and the per-pair loss $\mathcal{L}_W\big(Y_i, (X_1, X_2)_i\big)$ is defined as follows:

$\mathcal{L}_W\big(Y_i, (X_1, X_2)_i\big) = Y_i \, \mathcal{L}_{gen}\big(D_W(X_1, X_2)_i\big) + (1 - Y_i) \, \mathcal{L}_{imp}\big(D_W(X_1, X_2)_i\big)$   (4)

in which $Y_i$ is the label of the $i$-th pair (1 for a genuine pair, 0 for an impostor pair). $\mathcal{L}_{gen}$ and $\mathcal{L}_{imp}$ are the associated costs for the genuine and impostor pairs, respectively, and are defined as functions of $D_W$:

$\mathcal{L}_{gen}(D_W) = \frac{1}{2} D_W^2, \qquad \mathcal{L}_{imp}(D_W) = \frac{1}{2} \max\{0, \, M - D_W\}^2$   (5)

for which $M$ is a predefined margin and $D_W$ is the Euclidean distance between the output features associated with the pair.
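A minimal NumPy sketch of the contrastive loss described above, following the standard formulation of [24]. The label convention (1 for genuine) and the margin value are assumptions for illustration, not values taken from the paper:

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Batch contrastive loss over pair distances.

    d: Euclidean distances between the output features of each pair, shape (N,)
    y: pair labels, 1 for genuine pairs and 0 for impostor pairs, shape (N,)
    Genuine cost: 0.5 * d^2. Impostor cost: 0.5 * max(0, margin - d)^2.
    """
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    gen = 0.5 * d ** 2                          # pull genuine pairs together
    imp = 0.5 * np.maximum(0.0, margin - d) ** 2  # push impostors past the margin
    return float(np.mean(y * gen + (1.0 - y) * imp))
```

Note that an impostor pair already farther apart than the margin contributes zero loss, so only "hard" impostors are pushed.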
Architecture: VGGNet has been chosen as an effective model for image classification [28]. The architecture is modified to match the input features, and the output dimensionality of the last layer is set to 64. Average pooling has also been employed for spatial dimension matching [29].
Results and comparison: The proposed approach is compared with several baseline methods. The GMM-UBM method [30] uses 39 MFCC coefficients including first- and second-order derivatives, with a Universal Background Model (UBM) of 512 mixture components. The I-vector system [31] is used as another baseline. The results are depicted in Table I. As can be observed, imposing the proposed sparsity outperforms the other methods.

Model           | EER (%)
----------------|--------
GMM-UBM [30]    | 28.22
I-vectors [31]  | 24.91
CNN [baseline]  | 24.68
CNN [SSL]       | 24.11
An important additional benefit of enforcing sparsity is the resulting speed-up. The results are shown in Fig. 3.
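The speed-up comes from discarding groups whose weights were driven to (near) zero by the sparsity term. A hypothetical sketch of this post-training pruning step; the tolerance and layer shape are assumptions, not values from the paper:

```python
import numpy as np

def prune_zero_filters(w, tol=1e-3):
    """Drop output filters whose group norm fell below tol after sparse training.

    w: conv weights of shape (out_channels, in_channels, k, k).
    Returns the pruned weights and the fraction of filters kept.
    """
    norms = np.linalg.norm(w.reshape(w.shape[0], -1), axis=1)  # one norm per group
    keep = norms > tol
    return w[keep], float(keep.mean())

# Illustrative example: two of four filters were driven to zero.
w = np.ones((4, 3, 3, 3))
w[1] = 0.0
w[3] = 0.0
pruned, kept_fraction = prune_zero_filters(w)
```

The kept fraction directly translates into fewer multiply-accumulates in that layer at inference time.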
IV Conclusion
In this work, we proposed imposing sparsity for speaker verification. Experimental results demonstrated the effectiveness of enforcing sparsity, in part due to its power to prevent overfitting. This is a direct outcome of removing unimportant elements of the network, such as neurons in fully-connected layers and output filters in convolutional layers.
References
 [1] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.
 [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.
 [3] D. Maturana and S. Scherer, “VoxNet: A 3D convolutional neural network for real-time object recognition,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp. 922–928, IEEE, 2015.
 [4] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 4052–4056, IEEE, 2014.
 [5] M. Piergallini, R. Shirvani, G. S. Gautam, and M. Chouikha, “Word-level language identification and predicting code-switching points in Swahili-English language data,” in Proceedings of the Second Workshop on Computational Approaches to Code Switching, pp. 21–29, 2016.
 [6] R. Shirvani, M. Piergallini, G. S. Gautam, and M. Chouikha, “The Howard University system submission for the shared task in language identification in Spanish-English code-switching,” in Proceedings of The Second Workshop on Computational Approaches to Code Switching, pp. 116–120, 2016.
 [7] C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.

 [8] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [9] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [10] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.
 [11] J. Ba and R. Caruana, “Do deep nets really need to be deep?,” in Advances in neural information processing systems, pp. 2654–2662, 2014.
 [12] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
 [13] R. Reed, “Pruning algorithms - a survey,” IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
 [14] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
 [15] M. D. Collins and P. Kohli, “Memory bounded deep convolutional networks,” arXiv preprint arXiv:1412.1442, 2014.
 [16] M. Denil, B. Shakibi, L. Dinh, N. de Freitas, et al., “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
 [17] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.

 [18] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814, 2015.
 [19] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.

 [20] L. Meier, S. Van De Geer, and P. Bühlmann, “The group lasso for logistic regression,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53–71, 2008.
 [21] A. Torfi, “SpeechPy - a library for speech processing and recognition,” arXiv preprint arXiv:1803.01094, 2018.
 [22] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.
 [23] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 221–231, 2013.
 [24] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 539–546, IEEE, 2005.
 [25] X. Sun, A. Torfi, and N. Nasrabadi, “Deep siamese convolutional neural networks for identical twins and look-alike identification,” Deep Learning in Biometrics, p. 65, 2018.
 [26] R. R. Varior, M. Haloi, and G. Wang, “Gated siamese convolutional neural network architecture for human re-identification,” in European Conference on Computer Vision, pp. 791–808, Springer, 2016.
 [27] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, vol. 2, 2015.
 [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [29] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.

 [30] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
 [31] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.