Humans adapt their speech to the physical environment. To avoid degradation of speech intelligibility in a noisy environment, speakers usually adjust the way they talk for effective communication. This involuntary adaptation is commonly referred to as the Lombard effect. Lombard speech, which refers to speech produced in the presence of noise, is known to be more intelligible than ‘normal’ speech when presented in equivalent amounts of noise [2, 3, 4, 5, 6]. This is the natural result of the feedback system between vocal production and auditory perception that enables correction of speech performance.
The regular changes between normal and Lombard speech involve not only loudness but also other acoustic features, such as longer speech durations, shifted timing of vocalizations, increased pitch of vocalizations [9, 10], shifts in the center frequencies of the formants F1 (mainly) and F2, and a shift in energy from low-frequency bands to middle or high bands. It has also been demonstrated that spectral tilt decreases, implying an increase in high-frequency components under the Lombard effect. Also, Lombard changes are greater in adults than in children, and in spontaneous speech than in reading tasks.
In addition to its role in psychophysiology, hearing tests, and studies on audio-vocal integration, the Lombard effect has significant impacts on applications of speech-related technology, such as speaker recognition, noisy speech recognition [15, 16], and Lombard speech synthesis [17, 18]. It is also applicable to the study of vocal disorders and speech production, and has even been used as a therapeutic tool to improve speech intelligibility in patients with Parkinson’s disease. In architectural acoustics and design, studies on the Lombard effect have been utilized to reduce unwanted noise and improve the intelligibility of speech indoors. The Lombard effect is also relevant to the study of phonetics and linguistics.
Although the Lombard effect has been widely studied, we still know far too little about the relationship between its mechanisms and human behavior, particularly when it comes to emotional speech. Emotions color language and are a necessary ingredient for natural human communication and interaction. With the development of artificial intelligence, machines are expected not only to understand human speech but also to be capable of capturing human emotions and generating emotional synthetic speech just like a real human being. There is a wide range of studies on emotional speech in areas such as speech recognition and speech synthesis. However, to the best of our knowledge, no published study has quantified the influence of the Lombard effect on emotion expression in human speech.
In this paper, we investigate how the Lombard effect affects emotional speech from both speaker and listener perspectives on the basis of confusion matrices and acoustic analysis. For this purpose, we recorded a Japanese emotional speech database that contains four emotions (happy, sad, angry, neutral) acted under both quiet and noisy conditions (speech samples are available at https://nii-yamagishilab.github.io/EmotionaLombardSpeech). The content of this database is balanced in terms of both emotions and environments. While the speakers uttered emotional speech, several listeners were asked to immediately identify the emotion contained in each utterance, and their feedback was given to the actor to improve the actor’s expression.
After recording the speech database, we first analyzed the accuracy of the actors’ performance and the listeners’ judgments on the basis of confusion matrices and Frobenius distance [21, 22]. We also statistically analyzed the acoustic changes of speech produced in quiet and noisy conditions. The experimental results reveal several interesting phenomena.
In Section 2 of this paper, we briefly introduce the recording procedure of emotional speech acted in quiet/noisy environments. In Section 3, we answer the first two questions mentioned in the abstract on the basis of the analysis results: we analyze the confusion matrices of acted/perceived emotions and calculate the Frobenius distance of confusion matrices computed from various perspectives. In Section 4, we show the distributions of acoustic features for different emotions in either quiet or noisy environments using statistical methods. In Section 5, we measure the speech intelligibility of the recorded emotional speech by using a standard objective measure. We conclude in Section 6 with a brief summary and mention of future work.
2 Recording Procedure of Emotional Speech Acted in Quiet/Noisy Environment
Twelve trained Japanese native speakers (six males and six females) aged 20–40 years took part in the recording. None of them had any basic knowledge about the Lombard effect. None reported any speech or hearing difficulties. These speakers are all professionally trained voice actors, six of them certified as ‘high-level’ and the other six as ‘low-level’. Three different listeners were assigned to identify the perceived emotion of each utterance for each speaker during the recordings. In total, 36 listeners (18 males and 18 females) aged 20–60 years (ten in their 20s, ten in their 30s, eight in their 40s, eight in their 50s) participated in the judgments.
All recordings in both the quiet and noisy environment conditions use the same set of ten parallel sentences with different contexts, and each sentence was uttered in four different emotions: happy, sad, angry, and neutral. The order of sentences and emotions was randomized during the recording. For each speaker, we have at least 80 successfully performed and correctly pronounced utterances (40 in the quiet environment and 40 in the noisy one). The speech material thus comprises at least 960 sentences (4 emotions × 12 actors × 10 sentences × 2 conditions).
The recording procedure is shown in Fig. 1. To capture the influence of the Lombard effect on emotional speech, speakers and listeners were recorded while playing a collaborative game in pairs. During recording, the speaker was first asked to read one utterance with the assumed emotion, the pronunciation of the recorded speech was then quickly verified, and the correctly pronounced speech was sent to the listeners’ headsets. The listeners were asked to write down the perceived emotion, and they could select ‘unknown’ if they were indecisive. Their judgments were sent back to the speaker to help him/her adjust the performance. Speakers and listeners were placed in different rooms to ensure they could not hear each other. Also, listeners could not see each other during the experiment.
During the recording of emotional speech in a noisy environment, the speakers were asked to listen to noise played through their headphones while they read out the sentences. Listeners were also asked to listen to the recorded speech under the same condition when they judged the emotion. Since the noise was played through the headphones, we could record emotional speech acted in noise separately from the noise signals. Speech and noise were recorded onto two separate time-aligned tracks so that we could compute the signal-to-noise ratio (SNR). The average SNR calculated from the separated tracks is around −8.7 dB.
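The SNR computation from the two time-aligned tracks can be sketched as follows. This is a minimal sketch with synthetic signals: the function name and the use of a simple full-utterance power ratio are our assumptions, not details taken from the recording setup.

```python
import numpy as np

def snr_db(speech, noise):
    """Estimate the signal-to-noise ratio (in dB) from two time-aligned
    tracks: the recorded speech and the noise played to the speaker."""
    p_speech = np.mean(np.asarray(speech, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)

# Toy example: noise with twice the RMS amplitude of the speech
# yields an SNR of approximately -6 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(48000)      # 1 s at 48 kHz
noise = 2.0 * rng.standard_normal(48000)
print(round(snr_db(speech, noise), 1))
```

In practice one would likely restrict the average to speech-active frames, which this sketch omits for brevity.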
2.3 Experimental conditions
The recording took place in a professional audio studio. Audio equipment included an AKG C314 condenser microphone and SHURE SE215 sound-isolating earphones. Three CueBoxes were used for audio monitoring by the listeners. Recordings were made at a sampling frequency of 48 kHz with 24-bit resolution, mono channel. The noise used for the recording and listener judgment was a mixture of speech-shaped noise called ICRA noise (6-person babble) and in-car noise. The long-term averaged spectrum of the noise is shown in Fig. 2.
3 Analysis on Acted and Perceived Emotions in Noise
Since the database was recorded under different emotions and environmental conditions, it is important for us to understand how the emotions and environmental conditions affect each other. First, a confusion matrix is employed to visualize the performance of acted and perceived emotions. Second, the Frobenius distance between the confusion and identity matrices is calculated for further analysis. Using these measures, we aim to answer the following questions (in addition, we carried out further analysis to see whether the speakers made some changes or the listeners learned some useful cues during the above recording procedure; see the appendix): a) Can speakers express their emotion correctly under both quiet and noisy conditions? b) Can listeners recognize the emotion contained in speech signals even in noise?
3.1 Analysis based on confusion matrix
Figure 3 shows the normalized confusion matrices in the quiet and noisy environments, and Fig. 4 shows the differences between high-level and low-level speakers under each environmental condition. In the confusion matrix, each row represents the acted emotion while each column stands for the perceived emotion. From Fig. 3, we find that the ratios of correctly perceived emotional voices under the noisy condition are clearly lower than those in the quiet condition. In the quiet environment, neutral tended to be the most confusable emotion, while in the noisy environment, the most confusable emotion was happy. Among the four emotions, angry was the least confusable in both the quiet and noisy conditions and was least affected by the environmental condition. In contrast, happy voices were affected most by the environment. Figure 4 shows that high-level speakers could express the four types of emotional speech better than low-level speakers in both environments. The emotional speech uttered by high-level speakers resulted in much less confusion than that of low-level speakers, especially in the noisy environment.
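The row normalization used for these confusion matrices can be sketched as follows. The counts below are hypothetical; the real matrices are computed from the listeners’ judgments collected during the recording procedure.

```python
import numpy as np

# Hypothetical judgment counts: rows = acted emotion, columns = perceived
# emotion, in the order (happy, sad, angry, neutral).
counts = np.array([
    [70,  2,  8, 20],
    [ 3, 75,  2, 20],
    [ 5,  1, 90,  4],
    [12, 10,  3, 75],
], dtype=float)

# Normalize each row to sum to 1: entry (i, j) is then the fraction of
# utterances acted as emotion i but perceived as emotion j.
confusion = counts / counts.sum(axis=1, keepdims=True)
print(np.round(confusion.diagonal(), 2))  # per-emotion recognition accuracy
```

The diagonal of the normalized matrix directly gives the per-emotion recognition accuracy plotted in the figures.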
These results demonstrate that speakers can express their emotions correctly in both quiet and noisy environments. If speakers are better trained, they can produce more appropriate emotions robust to noisy conditions.
3.2 Analysis based on Frobenius distance
The Frobenius distance is defined as the square root of the sum of the element-wise squared differences between two matrices. In order to compare the average accuracy over all emotions under different conditions, we calculate the Frobenius distance using

d_F = ||C − I||_F = sqrt( Σ_{i,j} (C_{ij} − I_{ij})² ),

where d_F is the Frobenius distance, C denotes the confusion matrix (excluding the ‘unknown’ column), and I is the identity matrix of the same size. The smaller the confusion of emotional speech, the smaller the distance.
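This distance can be computed directly with NumPy’s matrix norm. A minimal sketch follows; the 4×4 size matches our four emotion categories, but the values shown are illustrative.

```python
import numpy as np

def frobenius_distance(confusion):
    """Frobenius distance between a row-normalized confusion matrix
    and the identity matrix; smaller means less confusion."""
    identity = np.eye(confusion.shape[0])
    return np.linalg.norm(confusion - identity, ord='fro')

# A perfect confusion matrix has distance 0 ...
assert frobenius_distance(np.eye(4)) == 0.0

# ... while chance-level judgments over 4 emotions give sqrt(3) ~ 1.73.
chance = np.full((4, 4), 0.25)
print(round(frobenius_distance(chance), 2))
```

Any observed matrix therefore falls between 0 (perfect identification) and the chance-level value, which makes the distances in the tables directly comparable across conditions.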
Table 1 shows the Frobenius distances computed separately per gender of speakers. Specifically, we computed a confusion matrix for each gender and computed its Frobenius distance to the identity matrix. From Table 1, we can clearly see that gender is an important factor—emotional speech spoken by female speakers has a smaller Frobenius distance. In other words, it is less confusing than that uttered by male speakers.
However, this is not the case from the listener’s perspective. According to Table 2, the average distance computed separately per listener’s gender is almost the same. On the other hand, it turns out that the listeners’ age has a significant effect on the accuracy of their judgment, as we can see from Table 3. Listeners in their 20s and 30s were less confused than listeners in their 40s or 50s.
Also, we found that the performance is jointly influenced by the gender of speakers and listeners, as shown in Table 4. Interestingly, the combination of male speakers and male listeners resulted in the worst performance of any combination.
As indicated by the above discussion, listeners were able to recognize the emotion contained in speech signals even under the noisy condition. However, the recognition accuracy largely depended on the listeners’ age, and also had a relationship to the collocation of speakers’ and listeners’ gender.
4 Statistical Analysis of Acoustic Features
Next, we analyzed the acoustic features of the recorded emotional speech in quiet and noisy conditions to clarify which acoustic features are affected by noise. We used Parselmouth to extract various acoustic features from each utterance, including F0, sound intensity, harmonics-to-noise ratio (HNR), the first and second formants (F1 and F2), and spectral tilt. The distribution of each feature for each emotional category in both the quiet and noisy conditions is plotted in Fig. 5.
From these figures, we can see that our emotional speech data shows typical patterns of the Lombard effect, such as higher F0 values, larger intensity, and flatter spectral tilt. Interestingly, even the negative emotions (sad and angry) have higher pitch in noise. We can also see that, since the Lombard effect increases F0 and intensity and flattens spectral tilt in noise, the relative differences of mean F0, intensity, and spectral tilt among the emotional categories conversely become smaller. This is likely one reason why emotional speech in noise is more confusable.
We conducted one-way ANOVA tests to see whether there are statistically significant differences between the means of each feature of each emotion in the quiet and noisy environments. The results show that, at a significance level of 0.05, all of the above features differ significantly.
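A one-way ANOVA of this kind can be run with SciPy as follows. The F0 samples here are simulated for illustration only; the actual tests use the feature values extracted from the database.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated mean-F0 samples (Hz) per environmental condition; Lombard
# speech is given a higher mean, as observed in the recordings.
rng = np.random.default_rng(1)
f0_quiet = rng.normal(200, 15, 120)   # quiet condition
f0_noisy = rng.normal(230, 15, 120)   # noisy (Lombard) condition

stat, p = f_oneway(f0_quiet, f0_noisy)
# With group means 30 Hz apart, the difference is highly significant.
print(p < 0.05)
```

The same call generalizes to more than two groups, e.g. one sample array per emotion category.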
We also conducted two-way ANOVA testing to see if there are further interactions between emotional categories and environmental conditions. We confirmed that emotional and environmental conditions are not independent factors and there are interactions regarding all the above features except F1 and F2 at the significance level of 0.05. This indicates that although the emotional speech clearly shows the typical pattern of the Lombard effect, the actual acoustic differences between emotional speech in quiet and noisy conditions are not constant and depend on the emotion category.
In addition to the mean value of each feature, we computed the maximum and minimum values as well as the variance of each feature for each utterance. We then concatenated all of these values into one array per utterance and used the t-SNE algorithm to visualize the arrays in two-dimensional space. The results of a high-level speaker are shown in Fig. 6. By comparing the quiet and noisy conditions shown in the figure, we can see that there are two separated clusters in the quiet condition: happy and neutral speech samples are distributed in almost completely separate places. In the quiet condition, most points of sad speech overlap with many points of neutral speech, whereas in the noisy condition, many sad points significantly overlap with those of happy speech.
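The t-SNE projection step can be sketched as follows. The feature arrays here are random stand-ins; the real arrays concatenate the mean, maximum, minimum, and variance of each extracted feature per utterance.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in per-utterance feature arrays (e.g. 4 statistics x 6 features
# = 24 values each), with two artificial groups to mimic two conditions.
rng = np.random.default_rng(2)
features = np.vstack([
    rng.normal(0.0, 1.0, (40, 24)),   # e.g. quiet-condition utterances
    rng.normal(3.0, 1.0, (40, 24)),   # e.g. noisy-condition utterances
])

# Project the 24-dimensional arrays to 2-D for visualization.
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(features)
print(embedding.shape)
```

Each row of `embedding` is then plotted as one point, colored by emotion and condition, as in Fig. 6.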
5 Speech Intelligibility
Finally, in order to determine the influence of the Lombard effect on speech intelligibility, we computed short-time objective intelligibility (STOI) scores. STOI is a standard objective intelligibility measure whose value ranges between 0 and 1. Since the recorded speech clearly shows the tendency of the Lombard effect (as described earlier), we can hypothesize that emotional speech produced in noise is more intelligible than that produced in quiet. To verify this hypothesis, we first estimated the intelligibility of the recorded emotional speech without the Lombard effect: we added the same simulated noise discussed in Section 2 to the emotional speech recorded in the quiet condition and used the corresponding clean speech as the reference. We computed the STOI score of the emotional speech acted in noise in the same way. As expected, the emotional speech produced in noise has a significantly higher STOI score (0.61) than the emotional speech produced in the quiet environment (0.44), which indicates that the emotional speech produced in noise is more intelligible.
6 Conclusions
To investigate the influence of the Lombard effect on emotional speech, we recorded parallel emotional utterances by 12 speakers under both quiet and noisy conditions in a professional recording studio. From the initial analysis results, we derive the following conclusions. 1) Speakers can express their emotion correctly even under adverse conditions. If speakers are better trained, they can produce more appropriate emotions robust to noisy conditions. Emotional speech uttered by female speakers is more accurately recognized than that of male speakers in general. 2) Younger listeners are able to recognize the emotion contained in speech signals under noise better than older listeners. 3) Emotional speech in noise shows the typical characteristics of the Lombard effect. However, the changes are complex: the acoustic differences between emotional speech in quiet and noisy environments depend on the emotion category. Finally, because of interactions with the Lombard effect, the relative differences of important acoustic cues such as mean F0, intensity, and spectral tilt among emotion categories conversely become smaller. This is one reason why emotional speech in noise is more easily confused.
As our next step, we will use the above findings on the joint effect of Lombard and emotions for speech emotion recognition and emotional speech synthesis.
-  H. Lane and B. Tranel, “The Lombard sign and the role of hearing in speech,” Journal of speech and hearing research, vol. 14, no. 4, pp. 677–709, 1971.
-  J. J. Dreher and J. O’Neill, “Effects of ambient noise on speaker intelligibility for words and phrases,” The Journal of the Acoustical Society of America, vol. 29, no. 12, pp. 1320–1323, 1957.
-  W. V. Summers, D. B. Pisoni, R. H. Bernacki, R. I. Pedlow, and M. A. Stokes, “Effects of noise on speech production: Acoustic and perceptual analyses,” The Journal of the Acoustical Society of America, vol. 84, no. 3, pp. 917–928, 1988.
-  Y. Lu and M. Cooke, “Speech production modifications produced by competing talkers, babble, and stationary noise,” The Journal of the Acoustical Society of America, vol. 124, no. 5, pp. 3261–3275, 2008.
-  A. L. Pittman and T. L. Wiley, “Recognition of speech produced in noise,” Journal of Speech, Language, and Hearing Research, vol. 44, no. 3, pp. 487–496, 2001.
-  M. Garnier and N. Henrich, “Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?” Computer Speech & Language, vol. 28, no. 2, pp. 580–597, 2014.
-  S. A. Zollinger and H. Brumm, “The Lombard effect,” Current Biology, vol. 21, no. 16, pp. R614–R615, 2011.
-  J. Villegas, M. Cooke, and C. Mayo, “The role of durational changes in the Lombard speech advantage,” in The Listening Talker, 2012.
-  R. Patel and K. W. Schell, “The influence of linguistic content on the Lombard effect,” Journal of Speech, Language, and Hearing Research, vol. 51, no. 1, pp. 209–220, 2008.
-  Y. Lu and M. Cooke, “The contribution of changes in f0 and spectral tilt to increased intelligibility of speech produced in noise,” Speech Communication, vol. 51, no. 12, pp. 1253–1262, 2009.
-  C. Kirchhuebel, “The effects of Lombard speech on vowel formant measurements,” São Paulo School of Advanced Studies in Speech Dynamics SPSASSD 2010 Accepted Papers, p. 38, 2010.
-  M. Garnier, L. Bailly, M. Dohen, P. Welby, and H. Lœvenbruck, “An acoustic and articulatory study of Lombard speech: Global effects on the utterance,” in Ninth International Conference on Spoken Language Processing, 2006.
-  D. K. Amazi and S. R. Garber, “The Lombard sign as a function of age and task,” Journal of Speech, Language, and Hearing Research, vol. 25, no. 4, pp. 581–585, 1982.
-  J. H. Hansen and V. Varadarajan, “Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 366–378, 2009.
-  J. H. Hansen and O. N. Bria, “Lombard effect compensation for robust automatic speech recognition in noise,” in First International Conference on Spoken Language Processing, 1990.
-  S.-M. Chi and Y.-H. Oh, “Lombard effect compensation and noise suppression for noisy Lombard speech recognition,” in Spoken Language, ICSLP 96. Proceedings., Fourth International Conference on, vol. 4. IEEE, 1996, pp. 2013–2016.
-  T. Raitio, A. Suni, M. Vainio, and P. Alku, “Analysis of HMM-based Lombard speech synthesis,” in Twelfth Annual Conference of the International Speech Communication Association, 2011.
-  ——, “Synthesis and perception of breathy, normal, and Lombard speech in the presence of noise,” Computer Speech & Language, vol. 28, no. 2, pp. 648–664, 2014.
-  W. B. Kleijn, J. B. Crespo, R. C. Hendriks, P. Petkov, B. Sauert, and P. Vary, “Optimizing speech intelligibility in a noisy environment: A unified view,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 43–54, 2015.
-  K. M. Ting, “Confusion matrix,” in Encyclopedia of Machine Learning and Data Mining. Springer, 2017, pp. 260–260.
-  A. Amendola and G. Storti, “Model uncertainty and forecast combination in high-dimensional multivariate volatility prediction,” Journal of Forecasting, vol. 34, no. 2, pp. 83–91, 2015.
-  S. Laurent, J. V. Rombouts, and F. Violante, “On the forecasting accuracy of multivariate GARCH models,” Journal of Applied Econometrics, vol. 27, no. 6, pp. 934–955, 2012.
-  W. A. Dreschler, H. Verschuure, C. Ludvigsen, and S. Westermann, “ICRA noises: artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment: Ruidos ICRA: Señales de ruido artificial con espectro similar al habla y propiedades temporales para pruebas de instrumentos auditivos,” Audiology, vol. 40, no. 3, pp. 148–157, 2001.
-  Y. Jadoul, B. Thompson, and B. De Boer, “Introducing Parselmouth: A Python interface to Praat,” Journal of Phonetics, vol. 71, pp. 1–15, 2018.
-  F. J. Fahy and V. Salmon, “Sound intensity,” 1990.
-  P. Boersma, “Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound,” in Proceedings of the institute of phonetic sciences, vol. 17, no. 1193. Amsterdam, 1993, pp. 97–110.
-  L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
-  C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2010, pp. 4214–4217.
Appendix A Analysis of learning effect
To see whether the speakers made some changes or the listeners learned some useful cues during the above recording procedure, we plotted Fig. 6, in which the horizontal axis is the index of the recorded utterances in chronological order and the vertical axis is the number of listeners who correctly identified the acted emotion divided by the total number of listeners. We fit their relationship with linear regression and calculated the correlation coefficient. In the quiet environment, the correlation coefficient of the angry speech was as high as 0.84, the sad speech was moderately correlated (0.55), and the neutral speech was weakly correlated (0.32). Meanwhile, in the noisy environment, the correlation coefficients of the four emotions were all different: 0.76 for neutral, 0.47 for happy, 0.45 for angry, and 0.15 for sad.
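The learning-effect analysis above can be sketched as follows. The identification ratios used here are hypothetical; the real ratios come from the three listeners’ judgments per utterance.

```python
import numpy as np

# Hypothetical identification ratios over one session:
# x = utterance index in chronological order,
# y = fraction of the three listeners who identified the emotion correctly.
x = np.arange(10)
y = np.array([1/3, 1/3, 2/3, 1/3, 2/3, 2/3, 1.0, 2/3, 1.0, 1.0])

# Fit a line and report Pearson's correlation coefficient, as in the
# learning-effect analysis.
slope, intercept = np.polyfit(x, y, 1)
r = np.corrcoef(x, y)[0, 1]
print(slope > 0, round(r, 2))
```

A positive slope with a high correlation coefficient, as in this toy example, indicates that identification improved over the course of the session.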
From this experiment, while we cannot conclude what was learned by the speakers and listeners, or why, we can at least tell that the identification ratios became clearly better through the listeners’ feedback for angry and sad speech in quiet and for neutral speech in noise. Interestingly, the identification ratios of happy speech in the quiet condition and sad speech in noise remained weakly correlated. The reasons for these phenomena require further investigation.