Nowadays, the development of speaker verification systems has become very popular in real applications, and it will continue to grow over the coming years. Many researchers focus on the normal speech situation, an area where these systems are beginning to work very well. Nevertheless, very few studies address situations where there is an extreme mismatch in terms of vocal effort. Vocal effort has been studied in [1, 2], showing differences among five modes: whisper, soft, normal, loud and shout. Each mode alters speech production in a way that introduces noticeable differences affecting the performance of speaker verification systems. For this reason, we consider it necessary to study each mode in order to detect it automatically and to perform an appropriate pre-processing that compensates for the changes in the features and for the negative impact of non-neutral modes on performance.
This work studies the effect of shouted speech on speaker verification systems trained with normal speech, detecting and compensating this mismatch in order to improve speaker verification performance. State-of-the-art speaker verification systems obtain very good performance in the normal speech scenario, when utterances are produced in the same collaborative conditions (neutral, normal, calm environment, etc.). Some works [2, 4] study the effect of vocal effort conditions on the performance of speaker verification systems and demonstrate that accuracy is negatively affected in the presence of vocal effort mismatch.
High vocal effort is used by speakers to produce a louder acoustic signal in order to increase the signal-to-noise ratio, typically in noisy environments, distant communication or emergency situations. Several works address the detection of high vocal effort [4, 5, 6]. Among them, the spectral characteristics of shouted speech have been studied to show that many acoustic properties of the voice change in high vocal effort situations. In this paper, we focus on detecting shouted speech as a first step toward alleviating the performance degradation of speaker verification systems in those situations. Our detection method is based on a logistic regression model trained on embeddings directly obtained from shouted and normal utterances by means of a time-delay neural network (TDNN).
In previous work on robust speech recognition, a feature compensation technique for noisy domains was presented. Mel-frequency cepstral coefficients (MFCCs) are used to train different Gaussian mixture models (GMMs) associated with different noisy conditions and, then, bias compensation terms are estimated depending on the noisy environment. Finally, acoustic features are compensated with these terms to improve speech recognition performance in noisy conditions. In this work, we adapt this technique by training GMMs on embeddings extracted with a TDNN-based model instead of MFCCs and by compensating shouted embeddings, so as to mitigate the negative effect of shouted speech on the performance of speaker verification systems. We demonstrate that applying a linear compensation approach of this kind in the presence of vocal effort mismatch yields a relative improvement of up to 13.8% in terms of equal error rate (EER) in comparison with a system that applies neither shouted speech detection nor compensation.
The remainder of this paper is organized as follows: a brief comparison between shouted and normal speech is given in Section 2. Shouted speech detection is described in Section 3. Section 4 deals with shouted speech compensation. The experimental setup and results are presented in Sections 5 and 6, respectively. Finally, Section 7 concludes this work.
2 Shouted vs. Normal Speech
Many works have analyzed the acoustic differences between shouted and normal speech [1, 2, 9]. For instance, previous studies demonstrated that an increase of vocal effort changes many acoustical properties of speech. In the spectral domain, it causes the fundamental frequency and the first formant to increase, as well as a flattening of the spectral tilt. Short-term spectral features such as MFCCs are thus directly affected by the increased vocal effort, which in turn affects speaker recognition performance. To mitigate this effect, a spectral matching between shouted and normal speech on a perceptual scale has been proposed.
In this work, we want to study how these differences between shouted and normal speech affect speaker verification performance. State-of-the-art speaker verification is based on a TDNN trained with MFCCs to obtain speaker embeddings. Differences between shouted and normal speech also affect both the intra- and inter-speaker variability at the TDNN output, which can be visualized in the embedding domain. To see how shouted and normal speech conditions are represented, we use t-SNE to project the embeddings extracted from the TDNN onto a two-dimensional space.
In Figure 1, a two-dimensional speaker embedding representation from 11 males and 11 females is shown. Each speaker is characterized by 24 shouted and 24 normal points in the two-dimensional space. We can observe four clusters: embeddings separate by gender and, within each gender, by shouted versus normal speech. This affects the speaker verification task due to the intra-speaker variability introduced by the two vocal effort domains, one of shouted utterances and the other of normal speech. If two utterances from the same speaker but in different conditions are compared, the system may fail to verify them as the same speaker, since the embeddings are affected by the vocal effort mismatch and the trial is likely to be rejected.
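The two-dimensional projection described above can be sketched as follows; the array here is a random stand-in for the real 1,056 x-vectors, and scikit-learn is an assumed implementation choice:

```python
# Sketch: project speaker embeddings to 2-D with t-SNE (synthetic stand-in data).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for 22 speakers x 48 utterances (24 shouted + 24 normal) x 512-dim x-vectors.
embeddings = rng.standard_normal((22 * 48, 512))

# Project to two dimensions; perplexity is a tunable neighborhood-size parameter.
points_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(points_2d.shape)  # (1056, 2)
```

Each row of `points_2d` can then be scattered and colored by speaker, gender or vocal effort mode to reproduce a plot like Figure 1.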
3 Shouted Speech Detection
To avoid introducing unnecessary distortion to normal speech embeddings, it is crucial to have an accurate shouted speech detector before applying the proposed compensation techniques. To this end, we treat this task as a two-class classification problem by training a logistic regression model with embeddings obtained from shouted and normal speech utterances. Logistic regression is chosen due to both its low complexity and its very good performance. Let $\mathbf{x}$ be a $D$-dimensional embedding, and let $H_s$ and $H_n$ indicate the hypotheses that $\mathbf{x}$ is a shouted and a normal speech embedding, respectively. Thus, the probability that $\mathbf{x}$ comes from shouted speech, $P(H_s|\mathbf{x})$, is estimated in this work as

$$P(H_s|\mathbf{x}) = \frac{1}{1 + e^{-\left(\mathbf{w}^{\top}\mathbf{x} + b\right)}}, \quad (1)$$

where, as aforementioned, the parameters of the model, $\{\mathbf{w}, b\}$, are calculated from a set of training embeddings obtained from shouted and normal speech (see Subsection 5.2). At test time, an embedding $\mathbf{x}$ is classified as coming from a shouted speech utterance if $P(H_s|\mathbf{x}) > 0.5$.
The usefulness of this rather simple, yet effective method is shown in the results section.
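As a rough illustration of the detector described above, the following sketch trains a logistic regression model on synthetic stand-ins for shouted and normal embeddings (scikit-learn is an assumed toolkit choice, and the shifted Gaussian clusters are toy data, not real x-vectors):

```python
# Sketch of the two-class shouted/normal detector: logistic regression on embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 512  # x-vector dimensionality

# Hypothetical training embeddings: two shifted clusters stand in for the domains.
x_shout = rng.standard_normal((200, dim)) + 1.0
x_norm = rng.standard_normal((200, dim)) - 1.0
X = np.vstack([x_shout, x_norm])
y = np.array([1] * 200 + [0] * 200)  # 1 = shouted, 0 = normal

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classify a test embedding as shouted if P(shouted | x) > 0.5.
p_shout = clf.predict_proba(rng.standard_normal((1, dim)) + 1.0)[0, 1]
print(p_shout > 0.5)  # True for this well-separated toy example
```

In the paper's setting the training pairs would instead be TDNN embeddings from the shouted/normal corpus, trained under leave-one-speaker-out cross-validation.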
4 Shouted Speech Compensation
In this section, we describe the technique used to compensate the shouted speech embeddings. This technique is simple and has only a few parameters, which suits the data scarcity. We propose the use of Multi-Environment Model-based LInear Normalization (MEMLIN), a method borrowed from robust speech recognition. Given the normal speech embedding $\mathbf{x}$ and the shouted one $\mathbf{y}$, a normal speech embedding estimate, $\hat{\mathbf{x}}$, can be obtained by minimum mean square error estimation as

$$\hat{\mathbf{x}} = E[\mathbf{x}|\mathbf{y}] = \int \mathbf{x}\, p(\mathbf{x}|\mathbf{y})\, d\mathbf{x}, \quad (2)$$

where $E[\cdot]$ is the expectation operator and $p(\mathbf{x}|\mathbf{y})$ is the conditional probability density function of $\mathbf{x}$ given $\mathbf{y}$. In order to evaluate the expression in Eq. (2) for MEMLIN, the following assumptions are made.
First, normal speech embeddings are modelled by using a GMM:

$$p(\mathbf{x}) = \sum_{s_x} p(\mathbf{x}|s_x) P(s_x), \quad p(\mathbf{x}|s_x) = \mathcal{N}\left(\mathbf{x}; \boldsymbol{\mu}_{s_x}, \boldsymbol{\Sigma}_{s_x}\right), \quad (3)$$

where $s_x$ denotes each Gaussian of the normal speech model, and $\boldsymbol{\mu}_{s_x}$, $\boldsymbol{\Sigma}_{s_x}$ and $P(s_x)$ are the mean, covariance matrix (which is diagonal in this work, as we assume statistical independence among embedding components) and weight associated with Gaussian $s_x$. In addition, $p(\mathbf{x}|s_x)$ is the likelihood of the normal speech embedding $\mathbf{x}$ given the Gaussian $s_x$.
Secondly, shouted speech embeddings are similarly modelled as

$$p(\mathbf{y}) = \sum_{s_y} p(\mathbf{y}|s_y) P(s_y), \quad p(\mathbf{y}|s_y) = \mathcal{N}\left(\mathbf{y}; \boldsymbol{\mu}_{s_y}, \boldsymbol{\Sigma}_{s_y}\right). \quad (4)$$

Finally, the third assumption is that the normal embedding, $\mathbf{x}$, can be obtained from the shouted embedding, $\mathbf{y}$, by making use of the above models:

$$\mathbf{x} \approx \mathbf{y} - \mathbf{r}_{s_x, s_y}, \quad (5)$$

where $\mathbf{r}_{s_x, s_y}$ is a bias term depending on the pair of Gaussians $(s_x, s_y)$.
With all of these assumptions, Eq. (2) can be expressed, by using Bayes' rule and the proposed models for both domains, as

$$\hat{\mathbf{x}} = \mathbf{y} - \sum_{s_x} \sum_{s_y} \mathbf{r}_{s_x, s_y}\, P(s_y|\mathbf{y})\, P(s_x|\mathbf{y}, s_y), \quad (6)$$

where $\mathbf{r}_{s_x, s_y}$ is a bias term (see below). Given the shouted speech embedding $\mathbf{y}$, to obtain an estimate of the normal speech embedding it is necessary to compute the probability of the shouted speech Gaussian given $\mathbf{y}$, $P(s_y|\mathbf{y})$, and the probability of the normal speech Gaussian given the shouted embedding and the shouted speech Gaussian, $P(s_x|\mathbf{y}, s_y)$.
The bias terms, $\mathbf{r}_{s_x, s_y}$, are obtained in a training stage using a set of paired embeddings (see Subsection 5.2) from both domains, $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, following

$$\mathbf{r}_{s_x, s_y} = \frac{\sum_{i} (\mathbf{y}_i - \mathbf{x}_i)\, P(s_x|\mathbf{x}_i)\, P(s_y|\mathbf{y}_i)}{\sum_{i} P(s_x|\mathbf{x}_i)\, P(s_y|\mathbf{y}_i)}. \quad (7)$$
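The MEMLIN training and compensation steps above can be sketched as follows. The data, GMM sizes and, in particular, the stand-in used for the cross-model posterior $P(s_x|\mathbf{y}, s_y)$ are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal MEMLIN-style sketch: GMMs model each domain; cross-model bias terms
# learned from paired data map shouted embeddings toward the normal domain.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dim, n_comp = 8, 2  # small sizes for illustration (the paper uses 8-component GMMs)

x_norm = rng.standard_normal((500, dim))  # normal-domain embeddings
x_shout = x_norm + 2.0                    # paired shouted embeddings (toy constant offset)

gmm_n = GaussianMixture(n_comp, covariance_type="diag", random_state=0).fit(x_norm)
gmm_s = GaussianMixture(n_comp, covariance_type="diag", random_state=0).fit(x_shout)

# Bias r[sx, sy]: posterior-weighted mean of (y - x) over the paired training set.
p_sx = gmm_n.predict_proba(x_norm)   # P(sx | x_i), shape (N, n_comp)
p_sy = gmm_s.predict_proba(x_shout)  # P(sy | y_i), shape (N, n_comp)
diff = x_shout - x_norm
r = np.zeros((n_comp, n_comp, dim))
for sx in range(n_comp):
    for sy in range(n_comp):
        w = p_sx[:, sx] * p_sy[:, sy]
        r[sx, sy] = (w[:, None] * diff).sum(0) / w.sum()

def compensate(y):
    """Estimate the normal-domain embedding from a shouted one."""
    psy = gmm_s.predict_proba(y[None])[0]  # P(sy | y)
    psx = gmm_n.predict_proba(y[None])[0]  # crude stand-in for P(sx | y, sy)
    correction = sum(psx[sx] * psy[sy] * r[sx, sy]
                     for sx in range(n_comp) for sy in range(n_comp))
    return y - correction

y_test = rng.standard_normal(dim) + 2.0
x_hat = compensate(y_test)  # recovers roughly y_test - 2.0 for this toy offset
```

Because the toy shift is constant, every learned bias equals the offset and the compensation removes it exactly; with real embeddings the per-Gaussian biases differ and the correction becomes piecewise.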
In order to compare the performance of the proposed method with other well-known compensation techniques for robustness in automatic speech recognition, we also implemented Multivariate Gaussian-based Cepstral Normalization (RATZ) and Stereo-based Piecewise LInear Compensation for Environments (SPLICE). For instance, RATZ estimates the normal speech embedding as

$$\hat{\mathbf{x}} = \mathbf{y} - \sum_{s_x} \mathbf{r}_{s_x}\, P(s_x|\mathbf{y}), \quad (8)$$

where $\mathbf{r}_{s_x}$ is a bias term that only depends on the normal speech Gaussian $s_x$ and is obtained in a previous training phase from the set of paired embeddings according to

$$\mathbf{r}_{s_x} = \frac{\sum_{i} (\mathbf{y}_i - \mathbf{x}_i)\, P(s_x|\mathbf{x}_i)}{\sum_{i} P(s_x|\mathbf{x}_i)}. \quad (9)$$
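For contrast with the MEMLIN cross-model biases, a single-model compensation with one bias per Gaussian can be sketched like this (toy paired data again; here the GMM is trained on the shouted domain, as SPLICE does, whereas RATZ would train it on the normal domain):

```python
# Sketch of SPLICE-style compensation: one GMM on the shouted domain and one
# bias vector per shouted Gaussian, learned from paired embeddings.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
dim, n_comp = 8, 2
x_norm = rng.standard_normal((500, dim))
x_shout = x_norm + 2.0  # toy paired shouted embeddings

gmm_s = GaussianMixture(n_comp, covariance_type="diag", random_state=0).fit(x_shout)
p_sy = gmm_s.predict_proba(x_shout)                       # P(sy | y_i), shape (N, n_comp)
r = (p_sy.T @ (x_shout - x_norm)) / p_sy.sum(0)[:, None]  # per-Gaussian bias, (n_comp, dim)

def compensate_splice(y):
    """Subtract the posterior-weighted bias from a shouted embedding."""
    return y - gmm_s.predict_proba(y[None])[0] @ r

y_test = rng.standard_normal(dim) + 2.0
x_hat = compensate_splice(y_test)  # recovers roughly y_test - 2.0 here
```

The key design difference is which space indexes the biases: RATZ conditions on the normal-speech Gaussians only, SPLICE on the shouted ones only, and MEMLIN on both, which is why MEMLIN and SPLICE (which model the shouted space) compare favorably in the results below.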
5 Experimental Setup
5.1 Speaker Verification System
The speaker verification system is implemented according to the x-vector-based Kaldi recipe using augmented versions of the VoxCeleb1 and VoxCeleb2 corpora¹. The models generated from this recipe are freely available on the Internet². The EER (which is the primary evaluation metric in this paper) obtained using this baseline system for VoxCeleb is 3.1%.

¹https://github.com/kaldi-asr/kaldi/tree/master/egs/voxceleb
²https://kaldi-asr.org/models/m7
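For reference, the EER is the operating point where the false-acceptance and false-rejection rates coincide; a simple way to compute it from verification scores is sketched below (the score distributions here are synthetic, purely for illustration):

```python
# Sketch: computing EER from target (same-speaker) and non-target scores.
import numpy as np

def eer(target_scores, nontarget_scores):
    """EER: error rate where false-accept and false-reject curves cross."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    fr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejects
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false accepts
    i = np.argmin(np.abs(fr - fa))
    return (fr[i] + fa[i]) / 2

rng = np.random.default_rng(0)
tgt = rng.normal(2.0, 1.0, 1000)   # synthetic same-speaker trial scores
non = rng.normal(-2.0, 1.0, 1000)  # synthetic different-speaker trial scores
print(round(eer(tgt, non), 3))
```

Real systems would feed PLDA scores for the trial lists of Subsection 5.2 into such a function.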
This speaker verification system consists of a TDNN-based front-end for 512-dimensional speaker embedding (x-vector) computation, plus a probabilistic linear discriminant analysis (PLDA) back-end for verification. The TDNN is fed with 30-dimensional MFCC features extracted from speech signals that are framed using a 25 ms analysis window with a 10 ms shift. Voice activity detection is employed to discard non-speech frames. Then, prior to PLDA scoring, x-vectors are centered, reduced in dimensionality by means of linear discriminant analysis and length-normalized.
5.2 Test Database
The speech corpus used to perform the experiments consists of recordings from 11 male and 11 female speakers. Each of them recorded 24 sentences speaking normally and the same 24 sentences shouting. The sentences were recorded in an anechoic chamber using a high-quality microphone, so channel effects and environment variations were completely excluded. The sentences were spoken in Finnish, half in imperative and half in indicative mode. The average duration of each utterance is 3 seconds.
Due to the scarcity of shouted speech, both the shouted speech detection and compensation experiments are carried out using leave-one-speaker-out cross-validation to maximize the number of trials. All the utterances in the corpus are processed to extract x-vectors according to the process outlined in Subsection 5.1. Four different conditions are considered for experimental evaluation:
All vs. All (A-A): All the shouted and normal speech utterances are compared with each other, which yields 557,040 verification trials.
Normal vs. Normal (N-N): Normal speech utterances are compared with each other, which yields 139,128 verification trials.
Shouted vs. Shouted (S-S): Shouted speech utterances are compared with each other, which yields 139,128 verification trials.
Normal vs. Shouted (N-S): Normal speech utterances are compared against shouted speech utterances, which yields 278,784 verification trials.
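The trial counts above follow directly from the corpus size (22 speakers, 24 utterances per vocal effort mode), which can be checked with a few lines:

```python
# Sanity check of the verification trial counts for each condition.
n_norm = 22 * 24          # 528 normal utterances
n_shout = 22 * 24         # 528 shouted utterances
n_all = n_norm + n_shout  # 1056 utterances in total

print(n_all * (n_all - 1) // 2)      # A-A: all unordered pairs -> 557040
print(n_norm * (n_norm - 1) // 2)    # N-N: normal pairs        -> 139128
print(n_shout * (n_shout - 1) // 2)  # S-S: shouted pairs       -> 139128
print(n_norm * n_shout)              # N-S: cross-domain pairs  -> 278784
```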
6 Results
In this section, the use of MEMLIN-, RATZ- and SPLICE-based shouted speech compensation, also considering the proposed shouted speech detection, is compared with a baseline speaker verification system that applies neither shouted speech detection nor shouted speech compensation. Notice that the shouted speech compensation techniques employ 8-component GMMs.
First, Table 1 shows speaker verification results in terms of EER, in percentages, when oracle shouted speech detection is used. As we can see, for the baseline there is a relative worsening of around 114% between the Normal vs. Normal and Normal vs. Shouted conditions, which justifies the need for vocal effort mismatch compensation. To a greater or lesser extent, such a mismatch is reduced by MEMLIN, RATZ and SPLICE. In particular, the best results are those obtained by MEMLIN and SPLICE, in contrast to RATZ, which suggests that modelling the shouted embedding space is important to achieve better compensation performance. Furthermore, the utility of MEMLIN for shouted embedding compensation can be visually inspected in Figure 2.
Similarly to Table 1, Table 2 shows speaker verification results when using the shouted speech detection proposed in Section 3. These results are supported by the detection error trade-off curves of Figure 3. Indeed, under leave-one-speaker-out cross-validation, our shouted speech detector obtains 98.11% accuracy, with only 1.17% of shouted and 2.65% of normal utterances misclassified. In these circumstances, the high similarity between the results reported in Tables 1 and 2 is not surprising. It is important to remark that, in the All vs. All scenario, which is the most interesting from a practical perspective, SPLICE achieves a 13.8% relative improvement with respect to the baseline (in accordance with Table 2).
From Figure 1, one may think that applying gender-dependent shouted embedding compensation can bring about an improvement with respect to employing a gender-independent approach. For this reason, we evaluated gender-dependent versions of MEMLIN-, RATZ- and SPLICE-based shouted speech compensation, the results (averaged across genders) of which are shown in Table 3. As can be seen, the equivalent gender-independent shouted embedding compensation of Table 1 is superior to the gender-dependent approach.
7 Conclusions
In this work, we have shown the need for vocal effort mismatch compensation in the context of speaker verification. Moreover, we have also shown the potential of several linear compensation techniques intended to mitigate the mismatch between speaker embeddings extracted from shouted and normal speech utterances. These techniques work on top of a very effective shouted speech detector based on logistic regression.
As there is certainly room for improvement, future work will be concerned with studying other mismatch compensation approaches possibly involving unsupervised learning or transfer learning. Towards this goal, we will require the acquisition of larger corpora comprising high vocal effort speech data.
Acknowledgements
The authors would like to thank Dr. Tomi Kinnunen for providing the database with which this study has been performed. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the project TIN2017-85854-C4-1-R, Government of Aragón (Reference Group T36_17R) and co-financed with Feder 2014-2020 “Building Europe from Aragón”.
References
-  C. Zhang and J. Hansen, “Analysis and classification of speech mode: Whispered through shouted,” in Proc. of 8th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2007, pp. 2289–2292.
-  E. Shriberg, M. Graciarena, H. Bratt, A. Kathol, S. Kajarekar, H. Jameel, C. Richey, and F. Goodman, “Effects of vocal effort and speaking style on text-independent speaker verification,” in Proc. of 9th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2008, pp. 609–612.
-  J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yu, “A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result,” in Proc. of 43rd International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5349–5353.
-  J. Pohjalainen, T. Raitio, H. Pulakka, and P. Alku, “Automatic detection of high vocal effort in telephone speech,” in Proc. of 13th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2012.
-  J. Pohjalainen, T. Raitio, S. Yrttiaho, and P. Alku, “Detection of shouted speech in noise: Human and machine,” The Journal of the Acoustical Society of America, vol. 133, pp. 2377–2389, 2013.
-  H. Chao, L. Dong, and Y. Liu, “Two-stage vocal effort detection based on spectral information entropy for robust speech recognition,” Journal of Information Hiding and Multimedia Signal Processing, vol. 9, pp. 1496–1505, 2018.
-  D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. of 43rd International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
-  L. Buera, E. Lleida, A. Miguel, A. Ortega, and O. Saz, “Cepstral vector normalization based on stereo data for robust speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1098–1113, 2007.
-  A. Lopez, R. Saeidi, L. Juvela, and P. Alku, “Normal-to-shouted speech spectral mapping for speaker recognition under vocal effort mismatch,” in Proc. of 42nd International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 4940–4944.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008. [Online]. Available: http://www.jmlr.org/papers/v9/vandermaaten08a.html
-  P. Moreno, “Speech recognition in noisy environments,” Ph.D. dissertation, ECE Department, Carnegie-Mellon University, 1996.
-  J. Droppo, L. Deng, and A. Acero, “Evaluation of the SPLICE algorithm on the Aurora2 database,” in Proc. of 7th European Conference on Speech Communication and Technology (EUROSPEECH), 2001, pp. 217–220.
-  D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” 2011.
-  A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in Proc. of 18th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 2616–2620.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. of 19th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 1086–1090.
-  C. Hanilçi, T. Kinnunen, R. Saeidi, J. Pohjalainen, P. Alku, and F. Ertaş, “Speaker identification from shouted speech: Analysis and compensation,” in Proc. of 38th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8027–8031.