Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise in an audio stream. This fundamental task has a wide range of applications in voice technology: speech coding, automatic speech recognition (ASR), audio surveillance and monitoring, speech enhancement, and speaker and language identification. In the workflow of these applications, VAD is generally the very first block. As a consequence, the main characteristics expected from a VAD algorithm are high efficiency and robustness to noise, as well as low computational latency.
Numerous studies have addressed the problem of VAD in the literature. Generally speaking, a VAD method consists of two successive steps: feature extraction and a discrimination model. Early works focused on energy-based features, possibly combined with the zero-crossing rate (ZCR) [4, 5]. These features are however strongly degraded in the presence of additive noise. Therefore, various other features have been proposed: autocorrelation-based features [6, 7, 8], Mel-Frequency Cepstral Coefficients (MFCCs), line spectral frequencies, a cepstral distance [11] or periodicity-based features [12, 13, 8]. Some other methods are based on a statistical model of the Discrete Fourier Transform (DFT) coefficients [14, 15]. Other approaches exploit the fact that the speech and noise signals should have different variability properties [16, 17]. Finally, some studies have addressed the combination of multiple features. These works differ in the way the features are combined: a linear combination whose weights are trained with a minimum classification error criterion, a linear or kernel discriminant analysis, or a principal component analysis.
The resulting acoustic information is then generally fed to a statistical model whose goal is to decide whether or not speech is present. Proposed approaches differ in whether or not they use a supervised framework. In the supervised case, several models have been used: Gaussian Mixture Models (GMM), Hidden Markov Models (HMM) or Multi-Layer Perceptrons (MLP). Some other works argue that supervised methods have two drawbacks: they require large amounts of labeled training data, and they are sensitive to a mismatch between training and testing conditions [8, 22]. As a consequence, unsupervised approaches have recently been proposed in [23, 8, 22].
With respect to the state of the art, the main contributions of this paper are the following: i) to propose the use of robust source-related features for VAD, ii) to assess the relative performance of source and filter-based features, iii) to investigate the best strategies to merge information from various feature sets, iv) to compare the proposed VAD system with existing algorithms on real data, v) to examine the generalization capabilities of a supervised approach when trained on a multi-condition dataset. Note that the novelty of the last two points must be tempered, as recent studies have conducted VAD experiments on real-life videos [24, 25], possibly with a multi-condition training approach.
II Proposed Technique
According to the mechanism of voice production, speech is considered as the result of a glottal flow (also called source or excitation signal) filtered by the vocal tract cavities. This physiological process motivates the goal of this paper, as we believe it is essential that a VAD exploits information from both of these complementary components of speech. The proposed VAD approach will be shown in Section IV to achieve a significant improvement over the best state-of-the-art approach. It is worth emphasizing that this would not be possible without the combined effect of 4 main factors: the joint use of filter and source-related information, the design of robust source features, an efficient information fusion strategy and multi-condition training.
II-A Filter-based Features
According to the source-filter model of speech, the spectral envelope, defined as a smooth function passing through the prominent peaks of the spectrum, is the transfer function of the filter. Various ways to parameterize the spectral envelope have been proposed in the literature. In this work, the following representations are considered: the Mel Frequency Cepstral Coefficients (MFCCs), the Perceptual Linear Prediction coefficients (PLP) and the Chirp Group Delay (CGD) of the zero-phase signal, which is a robust, high-resolution representation of the filter resonances. A vector of 13 coefficients is used for each feature type. The advantage of using such parameters is that they have already been shown to be efficient for ASR or speaker recognition [30, 31], and can generally be of high interest in any speech technology application following the VAD. Most of the time, their computation is therefore already required, and their integration in a VAD system can be achieved at a very low computational cost.
II-B Source-related Features
It has been pointed out that the weakest point of current glottal source processing algorithms is clearly related to their lack of robustness. It is therefore a challenging and still open problem to design source-related features for applications in adverse environments. One issue is the strong degradation of glottal flow estimation techniques when the speech signal gets noisier. When working in adverse conditions, it is consequently preferable to use indirect measurements derived either from the speech signal or from the LP residual. In this work, we aim at using robust source-related features which are compatible with the noisy environments targeted by our VAD.
Various existing studies have already used excitation information for VAD. The periodicity of the speech signal has been exploited in [6, 7, 12, 13, 8]. Furthermore, features extracted from the LP residual have also been used. In this work, we consider some of these features already proposed for VAD, as well as some new source-related measurements. Two popular features used since the early attempts are the log-energy of the speech signal and the zero-crossing rate (ZCR). Nemer et al. proposed the use of high-order statistics of the LP residual. As suggested in that study, we included the skewness and kurtosis of the LP residual, which are known to respectively characterize the polarity of the speech signal and the sparsity of the excitation at the glottal closure instants. Sadjadi recently proposed a VAD system using 4 voicing measures: the so-called harmonicity and clarity features derived from the average magnitude difference function (AMDF), the normalized LP error and the Harmonic Product Spectrum (HPS).
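The simplest frame-level measurements mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the frame is a plain list of samples, and the LP residual passed to the moment computation is assumed to have been obtained beforehand by inverse filtering (not shown here).

```python
import math

def log_energy(frame, eps=1e-12):
    """Log of the mean squared amplitude of one frame (eps avoids log(0))."""
    return math.log(sum(x * x for x in frame) / len(frame) + eps)

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def skewness_and_kurtosis(frame):
    """Third and fourth standardized moments (kurtosis is NOT excess kurtosis).

    In the setting of this paper the input would be a frame of the LP residual.
    """
    n = len(frame)
    mean = sum(frame) / n
    m2 = sum((x - mean) ** 2 for x in frame) / n
    m3 = sum((x - mean) ** 3 for x in frame) / n
    m4 = sum((x - mean) ** 4 for x in frame) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

A frame with a few large pulses (as expected around glottal closure instants) yields a high kurtosis, while an asymmetric pulse shape shows up in the sign of the skewness.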
In addition to the aforementioned features, we include three other source-related measurements. These features were proposed in previous studies and were selected here for their robustness properties. The first is the Cepstral Peak Prominence (CPP), which was originally proposed for the prediction of breathiness ratings. CPP is a measure of the amplitude of the cepstral peak at the hypothesized fundamental period. The two other features are extracted from the Summation of the Residual Harmonics (SRH) algorithm, a robust pitch tracker. The SRH criterion quantifies the level of voicing by taking into account the harmonics and inter-harmonics of the residual spectrum. The two features used in this work, referred to as SRH and SRH*, differ in whether or not the residual spectrum is energy-normalized. Note that the implementations of CPP and SRH are available from the COVAREP project.
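To make the role of harmonics and inter-harmonics concrete, here is a minimal sketch of an SRH-style criterion. The exact form is an assumption based on the published SRH algorithm and is not given in this text: the value at a candidate pitch adds the residual spectrum amplitude at each harmonic and subtracts it at each inter-harmonic. The function name, the 1 Hz spectrum resolution and the default number of harmonics are hypothetical choices for illustration; the COVAREP implementation should be taken as the reference.

```python
def srh_value(residual_spectrum, f0, n_harmonics=5):
    """SRH-style voicing criterion for one candidate pitch f0 (in Hz).

    residual_spectrum: amplitude spectrum of the LP residual, assumed here
    to be sampled every 1 Hz so that index k corresponds to k Hz.
    """
    def amplitude(freq_hz):
        index = int(round(freq_hz))
        # Frequencies beyond the available spectrum contribute nothing.
        return residual_spectrum[index] if index < len(residual_spectrum) else 0.0

    value = amplitude(f0)
    for k in range(2, n_harmonics + 1):
        # Reward energy at the k-th harmonic, penalize energy halfway
        # between harmonics k-1 and k (the inter-harmonic).
        value += amplitude(k * f0) - amplitude((k - 0.5) * f0)
    return value
```

With a toy spectrum containing peaks at 100, 200 and 300 Hz, the candidate 100 Hz accumulates all three peaks, whereas the candidate 200 Hz is penalized because 300 Hz falls on one of its inter-harmonics.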
II-C ANN-based Classification and Information Fusion
For our classification experiments, we opted for an ANN for its discriminant properties, its ability to model non-linear relations and the convenience of the posterior probabilities it generates. Each ANN is made of a single hidden layer of neurons whose activation function is a hyperbolic tangent sigmoid transfer function. Like every parameter used by the proposed technique, the number of neurons was set on the development data. Performance was very similar when using between 32 and 128 neurons, and we fixed this parameter to 32 in the remainder of this paper. The output layer is a single neuron with a sigmoid function, suited for a binary VAD decision. Note that we also tried recurrent neural networks. This however did not lead to any particular gain in performance, while it increased the computational load.
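The topology described above (one hidden layer of 32 tanh units followed by a single sigmoid output) can be sketched as a forward pass. This is only an illustration: the weight initialization, the dictionary layout and the function names are hypothetical, and training is not shown.

```python
import math
import random

def init_network(n_inputs, n_hidden=32, seed=0):
    """Random small weights for a single-hidden-layer network (untrained)."""
    rng = random.Random(seed)
    s1 = 1.0 / math.sqrt(n_inputs)
    s2 = 1.0 / math.sqrt(n_hidden)
    return {
        "W1": [[rng.uniform(-s1, s1) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-s2, s2) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def speech_posterior(net, features):
    """Forward pass: tanh hidden layer, then one sigmoid output neuron."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(net["W1"], net["b1"])
    ]
    z = sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]
    return 1.0 / (1.0 + math.exp(-z))
```

The sigmoid output lies strictly between 0 and 1, which is what allows it to be read as a posterior of speech activity and thresholded afterwards.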
Before being fed to the ANN, the feature vector at time t goes through two processing steps. First, the feature trajectories are smoothed using a median filter with a width of 11 frames (5 on each side). With a frame shift of 10 ms, this roughly corresponds to the phone scale. This operation removes possible spurious values. Secondly, contextual information is added by including the first and second derivatives, computed using a finite difference equation. To keep working at the phone level, the number of contextual frames is set to 10. At test time, the ANN outputs the posterior of speech activity. As a last post-processing step, the posterior trajectories are smoothed by a median filter whose width is again set to 11 frames, so as to remove possible erroneous isolated decisions.
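The two processing steps can be sketched as follows for a single feature trajectory. The median filter follows the description directly (5 frames on each side). The derivative, however, is an assumption: the exact finite difference equation is not recoverable from this text, so a symmetric difference over the contextual frames, clamped at the trajectory edges, is used purely for illustration. Function names are hypothetical.

```python
from statistics import median

def median_smooth(track, half_width=5):
    """Median filter of width 2*half_width+1, with shorter windows at the edges."""
    n = len(track)
    return [
        median(track[max(0, t - half_width):min(n, t + half_width + 1)])
        for t in range(n)
    ]

def delta(track, context=10):
    """Assumed symmetric finite difference over `context` frames on each side."""
    last = len(track) - 1
    return [
        (track[min(last, t + context)] - track[max(0, t - context)]) / (2 * context)
        for t in range(len(track))
    ]

def add_context(track):
    """Static trajectory plus first and second derivatives, frame by frame."""
    smoothed = median_smooth(track)
    first = delta(smoothed)
    second = delta(first)
    return list(zip(smoothed, first, second))
```

The same `median_smooth` function can be reused on the posterior trajectory produced by the classifier for the final post-processing step.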
Since our goal is to combine various sets of features, we consider two strategies to merge their information: feature fusion and decision fusion (also called early and late fusion). In the feature fusion case, synchronous feature vectors are simply concatenated and a single ANN is trained. In the decision fusion case, one specific ANN is trained for each feature set. Each ANN outputs a trajectory of posteriors, and the trajectories from the various ANNs are then merged to derive one final posterior value. Several strategies to combine the posteriors have been proposed in the literature. In this work, we have tried the arithmetic and the geometric mean (corresponding to the sum and product rules). The differences in performance that we noticed were however negligible, and the geometric mean is used throughout the rest of this paper.
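As an illustration of decision fusion, the sketch below merges the posterior trajectories of several classifiers frame by frame, with either the arithmetic or the geometric mean. The function name and the list-of-lists input layout are assumptions made for this example.

```python
import math

def fuse_posteriors(trajectories, rule="geometric"):
    """Merge synchronous posterior trajectories (one list per classifier)."""
    fused = []
    for frame_posteriors in zip(*trajectories):
        n = len(frame_posteriors)
        if rule == "arithmetic":
            fused.append(sum(frame_posteriors) / n)
        elif rule == "geometric":
            fused.append(math.prod(frame_posteriors) ** (1.0 / n))
        else:
            raise ValueError(f"unknown fusion rule: {rule}")
    return fused
```

With the geometric mean, a single classifier that is confident in the absence of speech pulls the fused posterior down more strongly than with the arithmetic mean.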
III Experimental Protocol
III-A Speech databases
For the training of the proposed technique, our goal was to use a corpus containing a large diversity of speakers and noise conditions. We chose a subset of 1500 files from 300 speakers of the TIMIT database. As the original utterances were recorded in clean studio conditions, the advantage of this approach is that the labels can be easily obtained by using a simple energy threshold to extract the speech endpoints. Noise was then artificially added to each file at two SNR levels, 0 and 10 dB, leading to a total of 3000 files. For each file, the noise was randomly selected among 4 types from the Noisex-92 database: babble, car, factory and jet noises. Note that we added 2 seconds of noise before and after each utterance so that the database is roughly balanced between speech activity and background noise. We expect this multi-condition training set to be sufficiently diverse for the classifier to be effective in various (possibly unseen) environments and with new speakers. The development set consists of a 5% held-out portion of the training set.
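The construction of one noisy training file can be sketched as below. This is an illustration under assumptions not stated in the text: the SNR is computed from the mean power of the whole clean utterance against the mean power of the noise excerpt, and the function names and the `pad` argument (number of noise-only samples on each side, e.g. 2 seconds times the sampling rate) are hypothetical.

```python
import math

def mean_power(signal):
    """Mean squared amplitude of a signal given as a list of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db, pad):
    """Scale `noise` to reach `snr_db` and add `speech` after `pad` samples."""
    if len(noise) != len(speech) + 2 * pad:
        raise ValueError("noise must cover the utterance plus both pads")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    mixed = [gain * n for n in noise]
    for i, s in enumerate(speech):
        mixed[pad + i] += s
    return mixed
```

Calling the function twice per utterance, with `snr_db=0` and `snr_db=10` and a randomly chosen noise type, would reproduce the doubling from 1500 to 3000 files described above.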
The testing corpus is a manually annotated proprietary database containing real data recorded in 5 places: mall, kitchen, street, station and living room. Various sources of noise are therefore covered and encompass TV in the background, people talking nearby, cooking, cars passing by, etc. The data consists of Japanese read speech from 5 speakers using either a tablet or a smartphone. The main characteristics of the testing database are summarized in Table I. Note that the averaged SNR only reflects one aspect of the noise, and that other characteristics such as its dynamics and its spectral shape might be a preponderant source of performance degradation.
|% of speech||12.3||20.2||20.5||22.6||18.9||18.9|
III-B Assessment Metrics
As a first metric to quantify the discrimination power of each feature individually, we use the normalized mutual information, defined as the mutual information (MI) of the feature with the class labels divided by the class entropy. The normalization ensures an intuitive interpretation, with values ranging between 0 and 1. This measure also has the advantage of being independent of the subsequent classifier. The computation of mutual information is carried out here via a histogram approach. The number of bins is set to 50 for each feature dimension, which is a trade-off between having enough bins for an accurate estimation and keeping sufficient samples per bin. Class labels correspond to the presence (or absence) of speech.
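A histogram-based estimate of this normalized mutual information can be sketched as follows for a single feature dimension. Uniform-width bins between the minimum and maximum feature values are an assumption, as are the function and argument names; the text only specifies the number of bins.

```python
import math
from collections import Counter

def normalized_mutual_information(values, labels, n_bins=50):
    """MI between a quantized feature and class labels, divided by class entropy."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]

    n = len(values)
    joint = Counter(zip(bins, labels))
    bin_counts = Counter(bins)
    label_counts = Counter(labels)

    mi = sum(
        (c / n) * math.log2(c * n / (bin_counts[b] * label_counts[l]))
        for (b, l), c in joint.items()
    )
    entropy = -sum((c / n) * math.log2(c / n) for c in label_counts.values())
    return mi / entropy if entropy > 0 else 0.0
```

A feature that separates the two classes perfectly reaches 1, whereas a feature that is statistically independent of the labels gives 0.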
To assess the performance after classification, two metrics are used. These two measures respectively characterize the frame and the utterance levels. By varying a decision threshold, a Receiver Operating Characteristic (ROC) curve can be obtained. The first metric is the so-called equal error rate (EER), which corresponds to the location on the ROC curve where the false accept rate and the false reject rate are equal. The second metric quantifies the ability to detect the endpoints of speech utterances. For this purpose, we use the F1 score (maximized over the decision threshold on the development set) as a single measure combining both precision and recall. The F1 score ranges from 0 to 1, where 1 implies a perfect classification. The correctness of a speech segment with regard to a reference follows the CENSREC-1-C criteria. Note that before being assessed at the utterance level, the vector of binary decisions goes through a hangover scheme consisting of a morphological closing (i.e. a dilation followed by an erosion) with a time constant of 600 ms and a length extension of 200 ms on each side. The same hangover scheme was applied to all techniques for the computation of the utterance-level results.
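The frame-level metric can be illustrated with a small threshold sweep. This is a sketch under assumptions: every distinct posterior is tried as a threshold, a frame is declared speech when its posterior is at or above the threshold, and the EER is approximated by the operating point where the two error rates are closest. The quadratic cost is acceptable for an example but not for large test sets; names are hypothetical.

```python
def equal_error_rate(posteriors, labels):
    """Return (EER estimate, threshold). labels: 1 = speech, 0 = non-speech."""
    n_speech = sum(labels)
    n_noise = len(labels) - n_speech
    best = None
    for threshold in sorted(set(posteriors)):
        decisions = [p >= threshold for p in posteriors]
        false_accepts = sum(1 for d, l in zip(decisions, labels) if d and l == 0)
        false_rejects = sum(1 for d, l in zip(decisions, labels) if not d and l == 1)
        far = false_accepts / n_noise
        frr = false_rejects / n_speech
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]
```

The results reported later as "1-EER" correspond to one minus the first returned value, expressed in percent.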
III-C Comparison with state-of-the-art techniques
Four state-of-the-art VAD systems are used for comparison: the G.729B algorithm, Sohn's statistical model-based VAD, Ying's unsupervised technique based on sequential Gaussian mixture models (whose code was kindly shared by Dongwen Ying), and Ghosh's VAD using long-term signal variability. As for the proposed technique, each of these methods makes use of a decision parameter which was tuned to optimize the EER and F1 scores, as discussed in Section III-B.
IV-A Mutual information-based assessment
The results of the MI-based assessment are presented in Table II. For the filter-based features (MFCC, CGD and PLP), MI values have been averaged across the 13 coefficients. Note also that, for each feature, these results are averaged across the static, first derivative and second derivative values. It can be seen that CGD gives the best results among the spectral envelope representations. Among the source-related features, the three proposed features interestingly provide the best results. They are followed by 3 features used in Sadjadi's system: HPS, harmonicity and clarity. The latter achieves an MI value comparable to that of the LP kurtosis and of the log-energy. As mentioned in Section II-B, designing robust source-related features is a challenging problem. The fact that the 3 proposed features yield better performance can be explained as follows: i) time-domain features are expected to be more sensitive to noise, and working either in the spectral or cepstral domain turns out to be more appropriate, ii) SRH features outperform HPS because they exploit inter-harmonics as well as the LP residual, which minimizes the effects of both the vocal tract resonances and the noise.
IV-B Classification results
For these experiments, we consider various sets of features: 13 MFCCs, 13 CGDs, 13 PLPs, the 4 voicing features (harmonicity, clarity, LP error and HPS) used in Sadjadi's paper, and the 3 new source-based features (CPP, SRH and SRH*) which have not been used for VAD yet. The last two sets of features will be referred to as Sadjadi and New in the following. The performance of these 5 feature sets is shown in Table III. Two main conclusions, which corroborate our observations from Section IV-A, can be drawn from these results: i) for VAD, source-related features are more relevant than those characterizing the filter, and among them the Sadjadi and New feature sets achieve similar performance; ii) across the filter representations, the CGD features, whose robustness was already highlighted for ASR, turn out to be the most efficient. Nonetheless, since MFCCs are widely used in various speech technology applications and their extraction is likely to be required anyway, we chose to use them as filter-based features in the rest of this paper.
|Feature set||MFCC||CGD||PLP||Sadjadi||New|
|1-EER (in %)||87.9||90.2||87.6||93.7||94.0|
|F1 score (in %)||77.1||79.2||75.5||86.8||86.7|
In the second part of our experiments, we investigated the combination of different feature sets either at the feature or the decision level (see Section II-C). The results are displayed in Table IV, where N and S respectively stand for the New and Sadjadi feature sets. Note that Table IV only shows the EER-based results; similar conclusions could however be drawn from the F1 scores. Interestingly, it can be observed that in all cases the decision fusion scheme outperforms feature fusion, by 3% absolute on average. Feature fusion even led to a degradation in 3 out of the 4 cases. This is important because feature concatenation is conventionally used in most existing approaches. One possible explanation is the curse of dimensionality: as the dimensionality of the feature vector increases, it becomes more and more difficult to accurately model the data, as an ever increasing number of samples is required. Although the association of the two excitation-based feature sets (S+N) already yields a high performance, the best results are obtained when they are combined with MFCCs. This is however only true when using decision fusion. In the rest of our experiments, the system based on these 3 feature sets and using decision fusion will be referred to as the proposed VAD system.
The comparative evaluation with state-of-the-art techniques is summarized in Table V for the 5 different environments, using the F1 score. Note that all the observations made hereafter were also corroborated using the EER metric. Three main conclusions can be drawn from Table V. First, across all conditions the proposed system clearly outperforms existing methods, sometimes by a large increase of the F1 score. This is especially true in the kitchen, living room and mall environments, where existing algorithms tend to fail dramatically. This is mostly due to the fact that the corresponding recordings contain sporadic impulsive noises such as coughing, laughter or cooking, whose dynamics can sometimes be similar to that of speech. These environments are therefore much more challenging than the street and station conditions, which are rather stationary. Secondly, it is worth recalling that the four state-of-the-art techniques used in this comparison are based on the power spectral density, and therefore discard any source-related information. This further supports our results from Tables II and III that excitation-based features are necessary in an efficient VAD system. Finally, despite the mismatch between training and testing data, the proposed algorithm works well in all environments. This suggests that the generalization capabilities of the proposed system are high, and that it can potentially adapt to new environments, speakers, languages or sensors. This is likely due to the robustness of the source-related features as well as the ability of the ANN to capture the speech patterns through the multi-condition training.
|G.729B ||Sohn ||Ying ||Ghosh ||Prop.|
The goal of this paper was to investigate the joint use of source and filter-based features for VAD. The main conclusions of this study are the following: i) source-related features, and especially the 3 proposed features, have a better discrimination power, and their use in an efficient VAD system is necessary, ii) as a strategy to merge different sources of information, decision fusion outperforms feature fusion, iii) the resulting proposed system, combining source and filter-based information, gives a significantly better performance than state-of-the-art methods, iv) the robustness of source-related features combined with the generalization capabilities of neural networks makes the proposed approach perform very well in unseen conditions. Features used in this paper can be extracted with the following toolkit: tcts.fpms.ac.be/drugman/files/VAD.zip.
-  A. Benyassine, E. Shlomot, H. Su, D. Massaloux, C. Lamblin, J. Petit: ITU-T recommendations G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications, IEEE Commun. Mag., vol. 35, pp. 64-73, 1997.
-  D. Vlaj, B. Kotnik, B. Horvat, Z. Kacic: A Computationally Efficient Mel-Filter Bank VAD Algorithm for Distributed Speech Recognition Systems, Eurasip J. Appl. Signal Processing, no. 4, pp. 487-497, 2005.
-  I. McCowan, D. Dean, M. McLaren, R. Vogt, S. Sridharan: The Delta-Phase Spectrum With Application to Voice Activity Detection and Speaker Recognition, IEEE Trans. Audio Speech Lang. Process., vol. 19, pp. 2026-2038, 2011.
-  F. Lamel, R. Rabiner, E. Rosenberg, G. Wilpon: An improved endpoint detector for isolated word recognition, IEEE Trans. Acoust. Speech Signal Process., vol. 29, pp. 777-785, 1981.
-  B. Kotnik, Z. Kacic, B. Horvat: A multiconditional robust front-end feature extraction with a noise reduction procedure based on improved spectral subtraction algorithm, in Proc. 7th Eurospeech, pp. 197-200, 2001.
-  B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan, R. Sarikaya: Robust speech recognition in noisy environments: the 2001 IBM SPINE evaluation system, Proc. ICASSP, pp. 53-56, 2002.
-  T. Kristjansson, S. Deligne, P. Olsen: Voicing features for robust speech detection, Proc. Interspeech, pp. 369-372, 2005.
-  S.O. Sadjadi, J. Hansen: Unsupervised Speech Activity Detection Using Voicing Measures and Perceptual Spectral Flux, IEEE Sig. Pro. Letters, vol. 20, pp. 197-200, 2013.
-  M. Marzinzik, B. Kollmeier: Speech pause detection for noise spectrum estimation by tracking power envelope dynamics, IEEE Trans. Speech Audio Process., vol. 10, pp. 109-118, 2002.
-  J. Haigh, J. Mason: A voice activity detector based on cepstral analysis, Proc. Eurospeech, pp. 1103-1106, 2003.
-  E. Nemer, R. Goubran, S. Mahmoud: Robust voice activity detection using higher-order statistics in the LPC residual domain, IEEE Trans. Speech Audio Process., vol. 9, pp. 217-231, 2001.
-  R. Tucker: Voice activity detection using a periodicity measure, Proc. Inst. Elect. Eng., vol. 139, pp. 377-380, 1992.
-  K. Ishizuka, T. Nakatani: Study of Noise Robust Voice Activity Detection Based on Periodic Component to Aperiodic Component Ratio, in Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, pp. 65–70, 2006.
-  J. Sohn, N. Kim, W. Sung: A statistical model-based voice activity detection, IEEE Sig. Pro. Letters, vol. 6, pp. 1-3, 1999.
-  J. Ramirez, J. Segura, M. Benitez, L. Garcia, A. Rubio: Statistical voice activity detection using a multiple observation likelihood ratio test, IEEE Signal Proc. Letters, vol. 12, pp. 689-692, 2005.
-  P. Ghosh, A. Tsiartas, S. Narayanan: Robust voice activity detection using long-term signal variability, IEEE Trans. Audio Speech Lang. Process., vol. 19, pp. 600-613, 2011.
-  J. Ramirez, J. Segura, M. Benitez, A. de la Torre, A. Rubio: Efficient voice activity detection algorithms using long-term speech information, Speech Comm., vol. 42, pp. 271-287, 2004.
-  Y. Kida, T. Kawahara: Voice Activity Detection based on Optimally Weighted Combination of Multiple Features, in Proc. Interspeech, pp. 2621-2624, 2005.
-  S. Soleimani, S. Ahadi: Voice Activity Detection based on Combination of Multiple Features using Linear/Kernel Discriminant Analyses, in Proc. Information and Communication Technologies: From Theory to Applications, pp. 1-5, 2008.
-  T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Vesely, P. Matejka: Developing a speech activity detection system for the DARPA RATS program, in Proc. Interspeech, 2012.
-  R. Sarikaya, J. Hansen: Robust detection of speech activity in the presence of noise, in Proc. ICSLP, pp. 1455-1458, 1998.
-  F. Germain, D. Sun, G. Mysore: Speaker and Noise Independent Voice Activity Detection, in Proc. Interspeech, 2013.
-  D. Ying, Y. Yan, J. Dang, F. Soong: Voice Activity Detection Based on an Unsupervised Learning Framework, IEEE Trans. Audio Speech Lang. Process., vol. 19, pp. 2624-2633, 2011.
-  A. Misra: Speech/nonspeech segmentation in web videos, in Proc. Interspeech, 2012.
-  F. Eyben, F. Weninger, S. Squartini, B. Schuller: Real-life voice activity detection with LSTM Recurrent Neural Networks and an application to Hollywood movies, in Proc. ICASSP, pp. 483-487, 2013.
-  T. Drugman, P. Alku, B. Yegnanarayana, A. Alwan: Glottal Source Processing: from Analysis to Applications, Computer Speech and Language, vol. 28, issue 5, pp. 1117-1138, 2014.
-  T. Drugman, Y. Stylianou: Fast Inter-Harmonic Reconstruction for Spectral Envelope Estimation in High-Pitched Voices, IEEE Signal Processing Letters, 2014.
-  F. Zheng, G. Zhang, Z. Song: Comparison of Different Implementations of MFCC, J. Computer Science & Technology, vol. 16(6), pp. 582-589, 2001.
-  H. Hermansky: Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am., vol. 87, pp. 1738-1752, 1990.
-  B. Bozkurt, L. Couvreur, T. Dutoit: Chirp group delay analysis of speech signals, Speech Comm., vol. 49, pp. 159-176, 2007.
-  T. Kinnunen, H. Li: An overview of text-independent speaker recognition: From features to supervectors, Speech Comm., vol. 52, pp. 12-40, 2010.
-  T. Drugman: Advances in Glottal Analysis and its Applications, PhD thesis, University of Mons, 2011.
-  T. Drugman, B. Bozkurt, T. Dutoit: A Comparative Study of Glottal Source Estimation Techniques, Computer Speech and Language, vol. 26, issue 1, pp. 20-34, 2012.
-  T. Drugman: Residual Excitation Skewness for Automatic Speech Polarity Detection, IEEE Signal Processing Letters, vol. 20, issue 4, pp. 387-390, 2013.
-  T. Drugman: Maximum Phase Modeling for Sparse Linear Prediction of Speech, IEEE Signal Processing Letters, vol. 21, issue 2, pp. 185-189, 2014.
-  J. Hillenbrand, R. Houde: Acoustic Correlates of Breathy Vocal Quality: Dysphonic Voices and Continuous Speech, Journal of Speech and Hearing Research, vol. 39, pp. 311-321, 1996.
-  T. Drugman, A. Alwan: Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics, in Proc. Interspeech, pp. 1973-1976, 2011.
-  G. Degottex, J. Kane, T. Drugman, T. Raitio, S. Scherer: COVAREP - A collaborative voice analysis repository for speech technologies, in Proc. ICASSP, pp. 960-964, 2014.
-  J. Kittler, M. Hatef, R. Duin, J. Matas: On combining classifiers, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, pp. 226-239, 1998.
-  DARPA-TIMIT, Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Disc 1-1.1, 1990.
-  T. Drugman, M. Gurban, J. Thiran: Relevant Feature Selection for Audio-Visual Speech Recognition, IEEE Multimedia Signal Processing, pp. 179-182, 2007.
-  N. Kitaoka, K. Yamamoto, T. Kusamizu, S. Nakagawa et al.: Development of VAD evaluation framework CENSREC-1-C and investigation of relationship between VAD and speech recognition performance, IEEE Automatic Speech Recognition & Understanding, pp. 607-612, 2007.
-  D. Vlaj, M. Kos, M. Grasic, Z. Kacic: Influence of Hangover and Hangbefore Criteria on Automatic Speech Recognition, in Proc. Int. Conf. on Systems, Signals and Image Processing (IWSSIP), pp. 1-4, 2009.
-  R. Bellman: Adaptive control processes: a guided tour, Princeton University Press, 1961.