1 Introduction
Language identification (LID) is an utterance-level paralinguistic speech attribute recognition task with variable-length sequences as inputs. The duration of an input utterance may range from a few seconds to several minutes. Moreover, there are no constraints on the lexical content, so training utterances and test segments may be phonetically mismatched [1]. Our goal is therefore to find an effective and robust method that extracts the utterance-level information and encodes it into a fixed-dimensional vector representation.
To address the variable-length input issue for acoustic-feature-based LID, many methods have been proposed in the last two decades. The deterministic Vector Quantization (VQ) model is used for LID in [2, 3]. VQ assigns the frame-level acoustic features to the nearest cluster in a codebook and calculates the VQ distortions. Every language is characterized by an occupancy probability histogram. Compared to VQ, the Gaussian Mixture Model (GMM) is capable of modeling the complex distribution of the acoustic features [4] and generates soft posterior probabilities to assign the frame-level features to Gaussian components. Once the GMM is trained, the zero-order and first-order Baum-Welch statistics can be accumulated to construct a high-dimensional GMM Supervector [1, 5, 6], which is considered an utterance-level representation. Furthermore, the GMM Supervector can be projected onto a low-rank subspace using factor analysis, which results in the i-vector [7, 8]. Recently, the posterior statistics on the decision-tree senones generated by a speech recognition acoustic model have been adopted to construct the phonetic-aware DNN i-vector [9], which outperforms the GMM i-vector in LID tasks [10, 11, 12] owing to its discriminative frame-level alignments.
More recently, some end-to-end learning approaches have been proposed for the LID task and achieved superior performance [13, 14]. However, these methods may lack flexibility in dealing with duration-variant utterances, mainly because the deep learning module (e.g. a fully-connected layer) or the development platform usually requires fixed-length inputs. Each utterance with a different number of frames has to be zero-padded or truncated to a fixed size. This is not a desirable way to recognize spoken languages, speaker identities or other paralinguistic attributes on speech utterances of various durations. To address this problem, recurrent neural networks (RNN), e.g. Long Short-Term Memory (LSTM) [15], are introduced to LID, and the last time-step output of the RNN layer is used as the utterance-level representation [16]. Stacked long-term time delay neural networks (TDNN) are adopted to span a wider temporal context on the inputs, and a hierarchical structure is built to predict the likelihood over different languages [17]. Alternatively, modules such as temporal average pooling (TAP) are proposed to compute statistics over the length-variant feature sequences in order to generate fixed-dimensional representations for LID [18, 19]. These end-to-end systems perform well; however, simple statistics such as average pooling are computed globally over all frame-level features, which may smooth out the information carried by each cluster. In our earlier work, we imitated the GMM Supervector encoding procedure and introduced a learnable dictionary encoding (LDE) layer for the end-to-end LID system [20, 21]. The success of the LDE layer in the end-to-end LID framework inspires us to explore other encoding methods that may be feasible for the LID task.
In this paper, we adopt NetFV [22] and NetVLAD [23] in our end-to-end LID task and explore the feasibility of these two methods. NetFV is the “soft assignment” version of the standard Fisher Vector (FV) [24, 25, 26]; it is differentiable and can therefore be easily integrated into an end-to-end trainable system. Meanwhile, the Vector of Locally Aggregated Descriptors (VLAD) proposed in [27] is a simplified non-probabilistic version of the standard FV, and VLAD is similarly enhanced into a trainable layer named NetVLAD in [23]. Standard FV and VLAD have been widely employed in computer vision tasks such as image retrieval, place recognition and video classification [26, 27, 28]. Moreover, both NetFV and NetVLAD are considered more powerful pooling techniques for aggregating variable-length inputs into a fixed-length representation, and both encoding layers have been widely used and perform well in vision tasks [22, 23, 29].
For the LID task, we employ a residual network (ResNet) [30] as the front-end feature extractor, and use NetFV or NetVLAD to encode the variable-size CNN feature maps into fixed-size utterance-level representations. Experimental results on NIST LRE 07 show that the proposed method outperforms the GMM i-vector baseline as well as the TAP layer based end-to-end systems. Moreover, the proposed end-to-end system is flexible, effective and robust in both training and test phases.
The rest of this paper is organized as follows: Section 2 explains the LID methods based on the GMM i-vector, NetFV and NetVLAD as well as the overall end-to-end framework. Experimental results and discussions are presented in Section 3, while conclusions and future work are provided in Section 4.
2 Methods
In this section, we elaborate on the mechanisms of the GMM Supervector, NetFV and NetVLAD, and explain the relations and differences among these three encoding schemes. We also describe our flexible end-to-end framework in detail.
2.1 GMM Supervector
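Given a $C$-component Gaussian Mixture Model-Universal Background Model (GMM-UBM) with parameter set $\lambda = \{\alpha_c, \mu_c, \Sigma_c,\ c = 1, \dots, C\}$, where $\alpha_c$, $\mu_c$ and $\Sigma_c$ are the mixture weight, mean vector and covariance matrix of the $c$-th Gaussian component, respectively, and an $L$-frame speech utterance with $D$-dimensional features $\{x_1, x_2, \dots, x_L\}$, the normalized supervector of the $c$-th component is defined as
$$\tilde{F}_c = \frac{F_c}{N_c} = \frac{\sum_{t=1}^{L} P(c \mid x_t, \lambda)\,(x_t - \mu_c)}{\sum_{t=1}^{L} P(c \mid x_t, \lambda)} \tag{1}$$
where $P(c \mid x_t, \lambda)$ is the occupancy probability of $x_t$ on the $c$-th component of the GMM, and the denominator $N_c$ and the numerator $F_c$ are referred to as the zero-order and centered first-order Baum-Welch statistics, respectively. By concatenating all of the $\tilde{F}_c$ together, we derive the high-dimensional Supervector of the corresponding utterance:
$$\tilde{F} = \left[\tilde{F}_1^\top, \tilde{F}_2^\top, \dots, \tilde{F}_C^\top\right]^\top \tag{2}$$
The Supervector can then be projected onto a low-rank subspace using factor analysis to generate the i-vector.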
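As a rough illustration of Eq. 1 and Eq. 2, the following NumPy sketch accumulates the occupancy-weighted residuals and normalizes them per component. It assumes a diagonal-covariance GMM for brevity (the experiments in Section 3 use a full-covariance UBM), and the function and argument names are hypothetical rather than taken from any toolkit.

```python
import numpy as np

def normalized_supervector(x, weights, means, variances):
    """Sketch of Eq. 1-2 for a diagonal-covariance GMM (an assumption).

    x: (L, D) frames; weights: (C,); means, variances: (C, D).
    Returns the concatenated normalized supervector of shape (C * D,).
    """
    diff = x[:, None, :] - means[None, :, :]                  # residuals, (L, C, D)
    # Log-density of every frame under every diagonal Gaussian, (L, C).
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)           # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                   # occupancy P(c | x_t)
    n = post.sum(axis=0)                                      # zero-order statistics, (C,)
    f = np.einsum("lc,lcd->cd", post, diff)                   # centered first-order, (C, D)
    return (f / np.maximum(n, 1e-10)[:, None]).reshape(-1)

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 3))                             # toy utterance: L=50, D=3
sv = normalized_supervector(frames, np.full(4, 0.25),
                            rng.normal(size=(4, 3)), np.ones((4, 3)))
print(sv.shape)  # (12,)
```

With a single zero-mean component every occupancy equals 1, so the output is simply the mean of the frames, which gives a quick sanity check of the normalization.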
2.2 NetFV Layer
2.2.1 Fisher Vector
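Let $X = \{x_t \in \mathbb{R}^D,\ t = 1, \dots, L\}$ denote the sequence of input features with $L$ frames. The generation process of the data $X$ is assumed to be modeled by a probability density function $p_\lambda$ with parameters $\lambda$. As argued in [24, 31], the gradient of the log-likelihood describes the contribution of the parameters to the data generation process and can be used as a discriminative representation. The gradient vector of $X$ w.r.t. the parameters $\lambda$ can be defined as
$$G_\lambda^X = \frac{1}{L} \nabla_\lambda \log p_\lambda(X) \tag{3}$$
A Fisher Kernel is introduced to measure the similarity between two data samples [24] and is defined as
$$K(X, Y) = (G_\lambda^X)^\top F_\lambda^{-1}\, G_\lambda^Y \tag{4}$$
where $Y$ is a sequence of features like $X$, and $F_\lambda$ is the Fisher information matrix [24] of the probability density function $p_\lambda$:
$$F_\lambda = \mathbb{E}_{x \sim p_\lambda}\left[\nabla_\lambda \log p_\lambda(x)\, \nabla_\lambda \log p_\lambda(x)^\top\right] \tag{5}$$
Since $F_\lambda$ is a symmetric and positive definite matrix, its inverse has a Cholesky decomposition of the form $F_\lambda^{-1} = L_\lambda^\top L_\lambda$, where $L_\lambda$ is a lower triangular matrix. In this way, the Fisher kernel $K(X, Y)$ is a dot product between the normalized vectors $\mathcal{G}_\lambda = L_\lambda G_\lambda$, where $\mathcal{G}_\lambda^X$ is referred to as the Fisher Vector of $X$. We choose a $C$-component GMM to model the complex distribution of the data, so that $p_\lambda(x) = \sum_{c=1}^{C} \alpha_c\, p_c(x)$, where $\lambda = \{\alpha_c, \mu_c, \Sigma_c,\ c = 1, \dots, C\}$ is the parameter set of the GMM. The gradient vector is rewritten as
$$G_\lambda^X = \frac{1}{L} \sum_{t=1}^{L} \nabla_\lambda \log p_\lambda(x_t) \tag{6}$$
From the approximation theory of the Fisher information matrix [24, 25, 28], the covariance matrix $\Sigma_c$ can be restricted to a diagonal matrix, that is $\Sigma_c = \mathrm{diag}(\sigma_c^2)$, $\sigma_c \in \mathbb{R}^D$. Moreover, the normalization of the gradient by $L_\lambda = F_\lambda^{-1/2}$ is simply a whitening of the dimensions [31]. Let $\gamma_t(c)$ be the posterior probability of $x_t$ on the $c$-th GMM component,
$$\gamma_t(c) = \frac{\alpha_c\, p_c(x_t)}{\sum_{j=1}^{C} \alpha_j\, p_j(x_t)} \tag{7}$$
The gradient w.r.t. the weight parameters $\alpha_c$ brings little additional information and can thus be omitted [31]. The remaining gradients of $x_t$ w.r.t. the mean and standard deviation parameters are derived in [31] as
$$\mathcal{G}_{\mu_c}^{x_t} = \frac{1}{\sqrt{\alpha_c}}\, \gamma_t(c) \left(\frac{x_t - \mu_c}{\sigma_c}\right) \tag{8}$$
$$\mathcal{G}_{\sigma_c}^{x_t} = \frac{1}{\sqrt{2\alpha_c}}\, \gamma_t(c) \left[\frac{(x_t - \mu_c)^2}{\sigma_c^2} - 1\right] \tag{9}$$
where the division and squaring of vectors are element-wise. By concatenating the gradients in Eq. 8 and Eq. 9, we get the gradient vector $\mathcal{G}_c^{x_t}$ with dimension $2D$. With the $C$-component GMM, the Fisher Vector of $x_t$ is in the form of
$$\mathcal{G}_\lambda^{x_t} = \left[(\mathcal{G}_1^{x_t})^\top, (\mathcal{G}_2^{x_t})^\top, \dots, (\mathcal{G}_C^{x_t})^\top\right]^\top \tag{10}$$
which is a high-dimensional vector in $\mathbb{R}^{2CD}$. Finally, for the utterance-level representation, the Fisher Vector of the whole sequence $X$ is approximated by mean pooling over all $\mathcal{G}_\lambda^{x_t}$,
$$\mathcal{G}_\lambda^X = \frac{1}{L} \sum_{t=1}^{L} \mathcal{G}_\lambda^{x_t} \tag{11}$$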
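The chain from the posterior (Eq. 7) through the per-frame gradients (Eq. 8 and Eq. 9) to the pooled vector (Eq. 10 and Eq. 11) can be sketched in NumPy as follows. This is a minimal illustration assuming a diagonal-covariance GMM parameterized by standard deviations, following the standard formulation in [31]; the function name and argument layout are hypothetical.

```python
import numpy as np

def fisher_vector(x, weights, means, sigmas):
    """Sketch of a Fisher Vector with a diagonal-covariance GMM.

    x: (L, D) frames; weights: (C,); means, sigmas: (C, D) with sigmas the
    per-dimension standard deviations. Returns a vector of shape (2 * C * D,).
    """
    z = (x[:, None, :] - means[None]) / sigmas[None]          # whitened residuals, (L, C, D)
    log_p = (-0.5 * np.sum(z ** 2, axis=2)
             - np.sum(np.log(sigmas), axis=1)
             - 0.5 * x.shape[1] * np.log(2.0 * np.pi))         # log p_c(x_t), (L, C)
    log_post = np.log(weights) + log_p
    log_post -= log_post.max(axis=1, keepdims=True)            # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # posterior, Eq. 7
    g_mu = gamma[:, :, None] * z / np.sqrt(weights)[None, :, None]                  # Eq. 8
    g_sigma = gamma[:, :, None] * (z ** 2 - 1.0) / np.sqrt(2.0 * weights)[None, :, None]  # Eq. 9
    # Concatenate per component, then across components (Eq. 10), then mean-pool (Eq. 11).
    per_frame = np.concatenate([g_mu, g_sigma], axis=2).reshape(len(x), -1)
    return per_frame.mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
# A single Gaussian fitted by maximum likelihood: the log-likelihood gradient vanishes.
fv = fisher_vector(x, np.array([1.0]),
                   x.mean(axis=0, keepdims=True), x.std(axis=0, keepdims=True))
print(fv.shape, np.abs(fv).max())
```

The usage example fits one Gaussian to the data by maximum likelihood, in which case the Fisher Vector is numerically zero; this reflects the interpretation of the Fisher Vector as the direction in which the model would have to change to fit the utterance better.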
2.2.2 NetFV
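Once the FV codebook is trained, the parameters of the traditional FV are fixed and cannot be jointly learnt with other modules in the end-to-end system. To address this issue, as proposed in [22], two simplifications are made to the original FV: 1) assume all GMM components have equal weights; 2) simplify the Gaussian density to
$$p_c(x_t) = \frac{1}{(2\pi)^{D/2}} \exp\left\{-\frac{1}{2} (x_t - \mu_c)^\top \Sigma_c^{-1} (x_t - \mu_c)\right\} \tag{12}$$
Let $w_c = 1/\sigma_c$ and $b_c = -\mu_c$. With the assumption of $\alpha_c = 1/C$ (the resulting constant factor $1/\sqrt{\alpha_c}$ is dropped), the gradients in Eq. 8, Eq. 9 and the posterior probability in Eq. 7 are respectively rewritten as the final form of NetFV [22]:
$$\mathcal{G}_{\mu_c}^{x_t} = \gamma_t(c)\, \left[w_c \odot (x_t + b_c)\right] \tag{13}$$
$$\mathcal{G}_{\sigma_c}^{x_t} = \frac{1}{\sqrt{2}}\, \gamma_t(c) \left[\left(w_c \odot (x_t + b_c)\right)^2 - 1\right] \tag{14}$$
$$\gamma_t(c) = \frac{\exp\left\{-\frac{1}{2} \left\|w_c \odot (x_t + b_c)\right\|^2\right\}}{\sum_{j=1}^{C} \exp\left\{-\frac{1}{2} \left\|w_j \odot (x_t + b_j)\right\|^2\right\}} \tag{15}$$
where $\odot$ denotes the element-wise product. The three modified equations above are differentiable, so the parameter set $\{w_c, b_c\}$ can be learnt via the back-propagation algorithm.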
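A minimal NumPy sketch of a NetFV forward pass is given below. It assumes the form described in [22]: with $w_c = 1/\sigma_c$ and $b_c = -\mu_c$, the soft assignment is a softmax over $-\frac{1}{2}\|w_c \odot (x_t + b_c)\|^2$, and the mean and standard-deviation terms follow Eq. 8 and Eq. 9 with equal mixture weights and constant factors dropped. In a real system `w` and `b` would be trainable tensors in an automatic-differentiation framework; the names used here are hypothetical.

```python
import numpy as np

def netfv_forward(x, w, b):
    """Forward pass of a NetFV-style layer (sketch, assumed form from [22]).

    x: (L, D) frame-level features; w, b: (C, D) learnable parameters.
    Returns a fixed-size vector of shape (2 * C * D,) for any L.
    """
    z = w[None] * (x[:, None, :] + b[None])                    # w_c * (x_t + b_c), (L, C, D)
    logits = -0.5 * np.sum(z ** 2, axis=2)                     # (L, C)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    gamma = np.exp(logits)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # soft assignment
    g_mu = gamma[:, :, None] * z                               # mean term
    g_sigma = gamma[:, :, None] * (z ** 2 - 1.0) / np.sqrt(2.0)  # standard-deviation term
    per_frame = np.concatenate([g_mu, g_sigma], axis=2).reshape(len(x), -1)
    return per_frame.mean(axis=0)                              # mean pooling over frames

rng = np.random.default_rng(0)
w = np.abs(rng.normal(size=(16, 128)))                         # 16 clusters, 128 channels
b = rng.normal(size=(16, 128))
short = netfv_forward(rng.normal(size=(40, 128)), w, b)
long_ = netfv_forward(rng.normal(size=(300, 128)), w, b)
print(short.shape, long_.shape)
```

Both calls return vectors of the same size even though the inputs differ in length, which is the property the encoding layer is used for.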
2.3 NetVLAD Layer
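VLAD is another strategy used to aggregate a set of feature descriptors into a fixed-size representation [27]. With the same inputs $\{x_t\}$ as FV and $C$ clusters assumed in VLAD, i.e., $\{\mu_1, \mu_2, \dots, \mu_C\}$, the conventional VLAD aligns each $x_t$ to a cluster $\mu_c$. The VLAD fixed-size representation is defined as
$$V(c) = \sum_{t=1}^{L} \beta_t(c)\,(x_t - \mu_c) \tag{16}$$
where $\beta_t(c) = 1$ if $\mu_c$ is the closest cluster to $x_t$ and $\beta_t(c) = 0$ otherwise. This discontinuity prevents VLAD from being differentiable in the end-to-end learning pipeline. To make VLAD differentiable, the authors of [23] proposed a soft assignment for the function $\beta_t(c)$, that is
$$\beta_t(c) = \frac{e^{w_c^\top x_t + b_c}}{\sum_{j=1}^{C} e^{w_j^\top x_t + b_j}} \tag{17}$$
where $w_c \in \mathbb{R}^D$ and $b_c$ is a scalar. By integrating the soft assignment into Eq. 16, the final form of the differentiable VLAD method is derived as
$$V(c) = \sum_{t=1}^{L} \frac{e^{w_c^\top x_t + b_c}}{\sum_{j=1}^{C} e^{w_j^\top x_t + b_j}}\,(x_t - \mu_c) \tag{18}$$
which is the so-called NetVLAD [23] with the parameter set $\{w_c, b_c, \mu_c\}$. The fixed-size matrix $V = [V(1), V(2), \dots, V(C)]$ in Eq. 18 is normalized to generate the final utterance-level representation.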
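The NetVLAD aggregation can be sketched in a few lines of NumPy. The sketch assumes the soft assignment is a softmax over an affine function of each frame, as in [23], and it applies per-cluster L2 normalization followed by global L2 normalization; the text above only states that the output is normalized, so the exact normalization is an assumption. Parameter names are hypothetical.

```python
import numpy as np

def netvlad_forward(x, w, b, mu, eps=1e-12):
    """Forward pass of a NetVLAD-style layer (sketch).

    x: (L, D) frames; w: (C, D); b: (C,); mu: (C, D) cluster centres.
    Returns a unit-norm vector of shape (C * D,) for any L.
    """
    logits = x @ w.T + b                                       # (L, C)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    beta = np.exp(logits)
    beta /= beta.sum(axis=1, keepdims=True)                    # soft assignment per frame
    # sum_t beta_t(c) * (x_t - mu_c), computed without building an (L, C, D) tensor.
    v = beta.T @ x - beta.sum(axis=0)[:, None] * mu            # (C, D)
    # Assumed normalization: per-cluster L2, then global L2.
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + eps)

rng = np.random.default_rng(0)
w, b, mu = rng.normal(size=(4, 8)), rng.normal(size=4), rng.normal(size=(4, 8))
enc_a = netvlad_forward(rng.normal(size=(120, 8)), w, b, mu)
enc_b = netvlad_forward(rng.normal(size=(37, 8)), w, b, mu)
print(enc_a.shape, enc_b.shape)
```

Rewriting the weighted residual sum as `beta.T @ x - beta.sum(axis=0)[:, None] * mu` is algebraically identical to Eq. 18 and keeps memory linear in the number of frames.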
2.4 Insights into NetFV and NetVLAD
Comparing the GMM Supervector (Eq. 1), the gradient components w.r.t. the mean in the Fisher Vector (Eq. 8) and the VLAD expression (Eq. 16), we find that these three methods compute zero-order and first-order statistics to construct fixed-dimensional representations in a similar way. The residual vector measures the difference between an input feature and its corresponding GMM component or codebook cluster, and all three methods store the weighted sum of residuals. However, they use different formulas to compute the zero-order statistics. The Fisher Vector additionally captures the gradient components w.r.t. the covariance, which can be regarded as second-order statistics. In summary, these three encoding methods have theoretical explanations from different perspectives but lead to similar mathematical formulas. Motivated by the great success of the GMM Supervector, the NetFV and NetVLAD layers may in theory have good potential in paralinguistic speech attribute recognition tasks.
Compared to the temporal average pooling (TAP) layer, the NetFV and NetVLAD layers are capable of learning more discriminative feature representations in an end-to-end manner, while the TAP layer may de-emphasize some important information through simple average pooling. If the number of clusters in the NetFV or NetVLAD layer is 1 and its mean is zero, the encoding layer reduces to the TAP layer.
Our end-to-end framework is illustrated in Fig. 1. It comprises a CNN architecture with $D$ output channels, an encoding layer with cluster size $C$, and a fully connected layer. Taking the variable-length features as input, the CNN produces variable-size feature maps whose size depends on the input length $L$. The encoding layer then aggregates the feature maps into a fixed-size representation, and the fully connected layer acts as a back-end classifier. All the parameters in this framework are learnt via the back-propagation algorithm.
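The reduction to TAP can be checked numerically. In the sketch below (hypothetical names, NumPy only), a VLAD-style encoder with a single zero-mean cluster assigns every frame to that cluster with weight 1, so its output is the sum of the frames; this equals the temporal average up to the factor given by the number of frames and points in the same direction after length normalization.

```python
import numpy as np

def tap(feats):
    """Temporal average pooling over the frame axis."""
    return feats.mean(axis=0)

def single_cluster_vlad(feats, mu):
    """VLAD aggregation with one cluster: the softmax assignment is 1 for every frame."""
    beta = np.ones(len(feats))
    return (beta[:, None] * (feats - mu)).sum(axis=0)

rng = np.random.default_rng(0)
for num_frames in (30, 75):                                    # two different input lengths
    feats = rng.normal(size=(num_frames, 8))                   # (frames, channels)
    v = single_cluster_vlad(feats, np.zeros(8))
    a = tap(feats)
    same_direction = np.allclose(v / np.linalg.norm(v), a / np.linalg.norm(a))
    print(num_frames, v.shape, same_direction)
```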
3 Experiments
3.1 2007 NIST LRE Closed-set Task
The 2007 NIST Language Recognition Evaluation (LRE) is a closed-set language detection task. The training set consists of the Callfriend, LRE 03, LRE 05 and SRE 08 datasets and the development part of LRE 07. We split the training utterances into segments with durations from 3 to 120 seconds, which yields about 39000 utterances. The test set contains 14 target languages with 7530 utterances in total. The nominal durations of the test data are 3s, 10s and 30s.
3.2 Experimental Setup
Raw audio is converted to 56-dimensional shifted delta coefficient (SDC) features in the 7-1-3-7 configuration, and a frame-level energy-based voice activity detection (VAD) selects the features corresponding to speech frames. We train a 2048-component GMM-UBM with full covariance and extract 600-dimensional i-vectors, followed by whitening and length normalization. Finally, we adopt multi-class logistic regression to predict the language labels.
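For readers unfamiliar with the SDC notation, the four numbers conventionally denote N-d-P-k: N = 7 cepstral coefficients per frame, delta spread d = 1, block shift P = 3 and k = 7 stacked delta blocks. The sketch below computes such features with NumPy. Appending the 7 static coefficients to the 49 deltas to reach 56 dimensions, and replicating edge frames at the utterance boundaries, are assumptions, since the text does not spell them out.

```python
import numpy as np

def sdc(cep, d=1, p=3, k=7):
    """Shifted delta coefficients (sketch).

    cep: (T, N) cepstral frames. Returns (T, N * (k + 1)): the static
    coefficients followed by k delta blocks, where block i holds
    c(t + i*p + d) - c(t + i*p - d).
    """
    num_frames = cep.shape[0]
    pad = d + (k - 1) * p                                      # largest look-ahead needed
    padded = np.pad(cep, ((pad, pad), (0, 0)), mode="edge")    # replicate edge frames
    blocks = [cep]                                             # static part (assumed)
    for i in range(k):
        ahead = padded[pad + i * p + d: pad + i * p + d + num_frames]
        behind = padded[pad + i * p - d: pad + i * p - d + num_frames]
        blocks.append(ahead - behind)
    return np.hstack(blocks)

# Toy input: every coefficient equals the frame index, so interior deltas equal 2 * d.
ramp = np.arange(100, dtype=float)[:, None] * np.ones((1, 7))
feats = sdc(ramp)
print(feats.shape)  # (100, 56)
```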
For the end-to-end LID systems, 64-dimensional mel-filterbank coefficients are extracted along with sliding mean normalization over a window of 300 frames. The acoustic features are then fed to a randomly initialized ResNet-34 network with 128 output channels to produce the utterance-dependent feature maps. Temporal average pooling (TAP) is adopted as the encoding layer to build the baseline end-to-end system. For the NetFV and NetVLAD layers, the cluster size ranges from 16 to 128 in powers of 2 to find the best setup for each. The softmax and cross-entropy loss follow the fully connected layer. Finally, a stochastic gradient descent (SGD) optimizer with momentum 0.9, weight decay and an initial learning rate of 0.1 is used for back-propagation. The learning rate is divided by 10 at two scheduled epochs.
To train the system efficiently, we set the mini-batch size to 128 with data parallelism over 4 GPUs. For each mini-batch, a truncation length $l$ is randomly sampled from a preset range and used to truncate a segment of $l$ continuous frames from an $L$-frame utterance. The beginning index of the truncation lies randomly in the interval $[0, L - l]$. Consequently, each mini-batch is loaded with samples of a unified length $l$, and $l$ may change across mini-batches. In the test phase, all the 3s, 10s and 30s utterances are tested one by one on the same trained model, and no truncation is applied to these arbitrary-duration utterances.
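The random truncation used to build equal-length mini-batches can be sketched as follows. The bounds `min_len` and `max_len` are hypothetical placeholders for the sampling range of the truncation length, and capping the length at the shortest utterance in the batch is an assumption made here so that every cut is valid.

```python
import numpy as np

def sample_minibatch(utterances, rng, min_len, max_len):
    """Cut one random, equal-length segment from each utterance (sketch).

    utterances: list of (L_i, D) arrays. Returns an array of shape
    (batch, seg_len, D), where seg_len is redrawn on every call.
    """
    shortest = min(len(u) for u in utterances)
    seg_len = min(int(rng.integers(min_len, max_len + 1)), shortest)
    batch = []
    for u in utterances:
        start = int(rng.integers(0, len(u) - seg_len + 1))     # start index in [0, L - l]
        batch.append(u[start:start + seg_len])
    return np.stack(batch)

rng = np.random.default_rng(0)
# Toy utterances whose features equal the frame index, so cuts are easy to inspect.
utts = [np.tile(np.arange(n, dtype=float)[:, None], (1, 64)) for n in (500, 800, 1200)]
batch = sample_minibatch(utts, rng, min_len=200, max_len=1000)
print(batch.shape)
```

Because the encoding layer accepts any number of frames, the same model can be trained on these truncated segments and evaluated on full utterances without truncation.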
3.3 Evaluation
The training losses of the end-to-end systems are smoothed with a sliding window of size 400 and illustrated in Fig. 2. The NetFV and NetVLAD based systems converge faster and reach lower losses than the TAP based system. On closer inspection, NetFV is slightly better than NetVLAD, but both are competitive in the training phase. The language identification results are presented in Table 1. Performance is measured by the average detection cost $C_{avg}$ and the equal error rate (EER). As Table 1 shows, the end-to-end systems with TAP, LDE, NetFV and NetVLAD layers significantly outperform the conventional GMM i-vector baseline, which shows that the proposed end-to-end framework is feasible and effective for the LID task. The results of the systems based on the TAP and LDE layers are taken from our earlier work [20].
Moreover, we further compare the performance of the TAP, NetFV and NetVLAD encoding layers. Both NetFV and NetVLAD based systems achieve much lower $C_{avg}$ and EER than those with the TAP layer. Especially on the long utterances (30s), the best $C_{avg}$ and EER of NetVLAD are relatively reduced by about 27.9% and 72.0%, respectively, w.r.t. the results of TAP (1.32 vs. 1.83 and 1.02 vs. 3.64 in Table 1). Focusing on NetFV and NetVLAD only, performance generally improves as the cluster size grows from 16 to 64 and starts to degrade when the cluster size reaches 128. A larger cluster size in the encoding layer may therefore enhance the capacity of the network, but more data and training epochs may be required as well. In addition, the TAP based system shows accuracy rates of 75.49%, 89.71% and 93.56% on the 3s, 10s and 30s test sets respectively, while the NetVLAD based system improves the accuracies to 76.14%, 91.43% and 96.85%. Overall, NetVLAD is slightly superior to NetFV in the test phase and achieves the best performance when the cluster size is 64.
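The relative reductions on the 30s condition can be recomputed from the Table 1 entries for the TAP system and the NetVLAD system with 64 clusters. The helper below is a small illustrative sketch; its name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline

# 30s results from Table 1 as (C_avg, EER): TAP vs. NetVLAD with 64 clusters.
tap_30s = (1.83, 3.64)
netvlad64_30s = (1.32, 1.02)
cavg_red = relative_reduction(tap_30s[0], netvlad64_30s[0])
eer_red = relative_reduction(tap_30s[1], netvlad64_30s[1])
print(f"C_avg: {cavg_red:.2f}%  EER: {eer_red:.2f}%")
```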
Although the best NetVLAD result is slightly inferior to that of the LDE layer, the performance of the NetFV, NetVLAD and LDE layers is comparable. Moreover, these three encoding methods are complementary. With the cluster size set to 64, we fuse the three systems based on NetFV, NetVLAD and LDE at the score level. As shown in Table 1, the score-level fusion system further reduces the $C_{avg}$ and EER significantly.
Table 1. $C_{avg}$ (%) / EER (%) on the 3s, 10s and 30s test conditions.

| System description | 3s | 10s | 30s |
|---|---|---|---|
| GMM i-vector | 20.46/17.71 | 8.29/7.00 | 3.02/2.27 |
| ResNet-34 TAP | 9.24/10.91 | 3.39/5.58 | 1.83/3.64 |
| ResNet-34 LDE 64 | 8.25/7.75 | 2.61/2.31 | 1.13/0.96 |
| ResNet-34 NetFV 16 | 9.47/9.04 | 2.96/2.59 | 1.31/1.08 |
| ResNet-34 NetFV 32 | 8.95/8.37 | 2.88/2.49 | 1.35/1.31 |
| ResNet-34 NetFV 64 | 8.91/8.26 | 2.88/2.74 | 1.19/1.15 |
| ResNet-34 NetFV 128 | 9.05/8.64 | 2.91/2.72 | 1.27/1.34 |
| ResNet-34 NetVLAD 16 | 8.23/8.06 | 2.90/2.62 | 1.36/1.17 |
| ResNet-34 NetVLAD 32 | 8.87/8.58 | 3.10/2.50 | 1.46/1.15 |
| ResNet-34 NetVLAD 64 | 8.59/8.08 | 2.80/2.50 | 1.32/1.02 |
| ResNet-34 NetVLAD 128 | 8.72/8.44 | 3.15/2.76 | 1.53/1.14 |
| Fusion system | 6.14/6.86 | 1.81/2.00 | 0.89/0.92 |
4 Conclusions
In this paper, we apply the NetFV and NetVLAD encoding methods in our end-to-end LID framework to investigate their feasibility and performance. The NetFV and NetVLAD layers are powerful encoding techniques with learnable parameters that encode a variable-length sequence of features into a fixed-size representation. We integrate them into a flexible end-to-end framework for the LID task and conduct experiments on the NIST LRE 07 task to evaluate them. Promising experimental results show the effectiveness and great potential of NetFV and NetVLAD in the LID task. This end-to-end framework might also work for other paralinguistic speech attribute recognition tasks, which we leave for future work.
References
 [1] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
 [2] D. Cimarusti and R. Ives, “Development of an automatic identification system of spoken languages: Phase i,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 7. IEEE, 1982, pp. 1661–1663.
 [3] M. Sugiyama, “Automatic language recognition using acoustic features,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 813–816, 1991.
 [4] M. A. Zissman, “Comparison of four approaches to automatic language identification of telephone speech,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, p. 31, 1996.

 [5] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, “Support vector machines using gmm supervectors for speaker verification,” IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308–311, 2006.
 [6] C. H. You, H. Li, and K. A. Lee, “A gmm-supervector approach to language recognition with adaptive relevance factor,” in Signal Processing Conference, 2010 18th European. IEEE, 2010, pp. 1993–1997.
 [7] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
 [8] N. Dehak, P. A. Torres-Carrasquillo, D. Reynolds, and R. Dehak, “Language recognition via i-vectors and dimensionality reduction,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
 [9] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1695–1699.
 [10] M. Li and W. Liu, “Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [11] D. Snyder, D. Garcia-Romero, and D. Povey, “Time delay deep neural network-based universal background models for speaker recognition,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop. IEEE, 2015, pp. 92–97.
 [12] F. Richardson, D. Reynolds, and N. Dehak, “Deep neural network approaches to speaker and language recognition,” IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 2015.
 [13] I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, and P. Moreno, “Automatic language identification using deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5337–5341.
 [14] J. Gonzalez-Dominguez, I. Lopez-Moreno, H. Sak, J. Gonzalez-Rodriguez, and P. J. Moreno, “Automatic language identification using long short-term memory recurrent neural networks,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [16] G. Gelly, J.-L. Gauvain, V. B. Le, and A. Messaoudi, “A divide-and-conquer approach for language identification based on recurrent neural networks,” in Proc. Interspeech 2016, 2016, pp. 3231–3235.
 [17] D. Garcia-Romero and A. McCree, “Stacked long-term tdnn for spoken language recognition,” in Proc. Interspeech 2016, 2016, pp. 3226–3230.
 [18] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 165–170.
 [19] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, “Deep speaker: an end-to-end neural speaker embedding system,” arXiv preprint arXiv:1705.02304, 2017.
 [20] W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, “A novel learnable dictionary encoding layer for end-to-end language identification,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2018.
 [21] W. Cai, Z. Cai, W. Liu, X. Wang, and M. Li, “Insights into end-to-end learning scheme for language identification,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2018.
 [22] P. Tang, X. Wang, B. Shi, X. Bai, W. Liu, and Z. Tu, “Deep fishernet for object classification,” arXiv preprint arXiv:1608.00182, 2016.

 [23] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297–5307.
 [24] T. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” in Advances in neural information processing systems, 1999, pp. 487–493.
 [25] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–8.
 [26] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, “Large-scale image retrieval with compressed fisher vectors,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 3384–3391.
 [27] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 3304–3311.
 [28] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International Journal of Computer Vision, vol. 105, no. 3, pp. 222–245, 2013.
 [29] A. Miech, I. Laptev, and J. Sivic, “Learnable pooling with context gating for video classification,” arXiv preprint arXiv:1706.06905, 2017.
 [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
 [31] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in European Conference on Computer Vision. Springer, 2010, pp. 143–156.