MP3 Compression To Diminish Adversarial Noise in End-to-End Speech Recognition

07/25/2020 · Iustina Andronic et al., Technische Universität München

Audio Adversarial Examples (AAE) represent specially created inputs meant to trick Automatic Speech Recognition (ASR) systems into misclassification. The present work proposes MP3 compression as a means to decrease the impact of Adversarial Noise (AN) in audio samples transcribed by ASR systems. To this end, we generated AAEs with the Fast Gradient Sign Method for an end-to-end, hybrid CTC-attention ASR system. Our method is then validated by two objective indicators: (1) Character Error Rates (CER) that measure the speech decoding performance of four ASR models trained on uncompressed, as well as MP3-compressed data sets and (2) Signal-to-Noise Ratio (SNR) estimated for both uncompressed and MP3-compressed AAEs that are reconstructed in the time domain by feature inversion. We found that MP3 compression applied to AAEs indeed reduces the CER when compared to uncompressed AAEs. Moreover, feature-inverted (reconstructed) AAEs had significantly higher SNRs after MP3 compression, indicating that AN was reduced. In contrast to AN, MP3 compression applied to utterances augmented with regular noise resulted in more transcription errors, giving further evidence that MP3 encoding is effective in diminishing only AN.




1 Introduction

In our increasingly digitized world, Automatic Speech Recognition (ASR) has become a natural and convenient means of communicating with many daily-use gadgets. Recent advances in ASR push toward end-to-end systems that directly infer text from audio features, with sentence transcription done in a single stage, without intermediate representations [25, 12]. Such systems do not require handcrafted linguistic information, but learn to extract it themselves, as they are trained in an end-to-end manner. At present, there is a trend towards deeper neural networks (DNNs); however, these ASR models are also more complex and prone to security threats: Audio Adversarial Examples (AAEs) are audio inputs which carry a hidden message induced by adding Adversarial Noise (AN) to a regular speech input. The AN is optimized so that it misleads the ASR system into misclassification (i.e., recognizing the hidden message), while the AN itself is supposed to remain inconspicuous to humans [5, 14, 1, 20]. Yet from the perspective of ASR research, the system should transcribe the sentence as closely as a human would understand it. For this reason, we investigate MP3 compression as a means to diminish the detrimental effects of AN on ASR performance.

Our contributions are three-fold:

  • We create AAEs in the audio domain via a feature inversion procedure;

  • We evaluate MP3’s effectiveness in diminishing AN with four end-to-end ASR models trained on different levels of MP3 compression, which decode AAEs in uncompressed and MP3 formats derived from the VoxForge corpus;

  • Conversely, we assess the effects of MP3 compression when applied to inputs augmented with regular, non-adversarial noise.

2 Related Work

Adversarial Examples (AEs) were first created for image classification tasks [10]. Regular gradient-based training of DNNs calculates the gradient of a chosen loss function w.r.t. the network’s parameters, aiming for their step-wise, gradual improvement. By contrast, the Fast Gradient Sign Method (FGSM) [10] creates AN based on the gradient w.r.t. the input data in order to optimize towards misclassification. FGSM has already been applied in the context of end-to-end ASR to DeepSpeech [12, 6], a CTC-based speech recognition system, as well as to the attention-based system Listen-Attend-Spell (LAS) [7, 22].

For generating AAEs, most previous works have pursued one of the following goals, sometimes succeeding in both: (1) that their AAEs work in a physical environment, and (2) that they are less perceptible to humans. Carlini et al. [5] were the first to introduce so-called hidden voice commands, demonstrating that targeted attacks operating over-the-air against archetypal ASR systems (i.e., solely based on Hidden Markov Models, HMMs) are feasible. In contrast to previous works that targeted short adversarial phrases, [6] constructed adversarial perturbations for designated longer sentences. Their novel attack was achieved with a gradient-descent minimization based on the CTC loss function, which is optimized for time sequences. Moreover, [20] were the first to develop imperceptible AAEs for a conventional hybrid DNN-HMM ASR system by leveraging human psychoacoustics, i.e., manipulating the acoustic signal below the thresholds of human perception. In a follow-up publication [21], their psychoacoustic hiding method was enhanced to produce generic AAEs that remained robust in simulated over-the-air attacks.

Protecting ASR systems against AN embedded in AEs has also primarily been investigated in the image domain, which in turn inspired research in the audio domain. Two major defense strategies are considered from a security perspective [13]: proactive and reactive approaches. The former aims to enhance robustness during the training procedure of the ASR models themselves, e.g., by adversarial training [22] or network distillation [18]. Reactive approaches instead aim to detect whether an input is adversarial after the DNNs are trained, by means of input transformations such as compression, cropping, or resizing (for images), meant to at least partially discard the AN and thus recover the genuine transcription. Regarding AAEs, primitive signal processing operations such as local smoothing, down-sampling and quantization have been applied to input audio so as to disrupt the adversarial perturbations [26]; the effectiveness of this pre-processing defense was demonstrated especially for shorter AAEs. Rajaratnam et al. [19] likewise explored audio pre-processing methods such as band-pass filtering and compression (MP3 and AAC), while also venturing into more complex speech coding algorithms (Speex, Opus) to mitigate AAE attacks. Their experiments, which however targeted a much simpler keyword-spotting system with a limited dictionary, indicated that an ensemble strategy (combining speech coding with other forms of pre-processing, e.g., compression) is more effective against AAEs.

3 Experimental Set-up

MP3 is an audio compression algorithm that employs a lossy, perceptual audio coding scheme based on a psychoacoustic model in order to discard audio information below the hearing threshold and thus reduce the file size [4]. Schönherr et al. [20] recently hypothesized that MP3 could make for a robust countermeasure to AAE attacks, as it might remove exactly those inaudible ranges in the audio where the AN lies. However, to date there is no published experimental work proving the effectiveness of MP3 compression in mitigating AN targeted at a hybrid, end-to-end ASR system. Consequently, given an audio utterance that should transcribe to the original, non-adversarial phrase, we formulate our research question as follows: To what extent can MP3 aid in removing the AN and thus recover the benign character of the input? We analyze the AN reduction with two objective indicators: Character Error Rate (CER) and Signal-to-Noise Ratio (SNR).
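The CER used as the first indicator is the standard character-level edit distance normalized by the reference length. As a minimal, self-contained sketch (not the ESPnet scoring implementation):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance divided by reference length."""
    r, h = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between r[:i-1] and h[:j] (row-by-row DP)
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        curr = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(h)] / max(len(r), 1)
```

For example, `cer("kitten", "sitting")` is 3 edits over 6 reference characters, i.e. 0.5.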

Pipeline from original audio data to MP3-compressed AAEs.

A four-stage pipeline that transforms the original test data into MP3-compressed AAEs was implemented; the transformation includes the FGSM method and feature inversion, as depicted in Fig. 1. To also consider the effects of MP3 compression on ASR performance, i.e., whether the neural network adapts to MP3 compression, experiments were validated on four ASR models trained on data with four different levels of MP3 compression (uncompressed, 128 kbps, 64 kbps and 24 kbps MP3). Format-matched AAEs were then decoded by each of the four models.

Figure 1: General experimental workflow

All experiments are based on the English share of the open-source VoxForge speech corpus. It consists of 130.1 hours of utterances in various English accents that were recorded and released in uncompressed format (.wav), allowing for further compression and thus for the exploration of our research question. The LAME MP3 encoder was used as a command-line tool for batch MP3 compression. Log-mel filterbank (fbank) features were extracted from every utterance of the speech corpus and fed to the input of the ASR models in both the training and testing stages. The speech recognition experiments are performed with the hybrid CTC-attention ASR system ESPnet [25, 24], which combines the two main techniques for end-to-end speech recognition. First, Connectionist Temporal Classification (CTC [11]) carries the concept of hidden Markov states over to end-to-end DNNs as a training loss for sequence-classification networks; DNNs trained with the CTC loss classify token probabilities for each time frame. Second, attention-based encoder-decoder architectures are trained as auto-regressive, sequence-generative models that directly generate the sentence transcription from the entire input utterance. The multi-objective learning framework unifies the attention loss and the CTC loss with a linear interpolation weight (the parameter λ from Eq. 1), which is set to 0.3 in our experiments, allowing attention to dominate over CTC. Different from the original LAS ASR system [7], we use a location-aware attention mechanism that additionally uses attention weights from the previous sequence step; a more detailed description can be found in [16].
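The multi-objective interpolation described above can be sketched as a one-line combination of the two loss terms. This is a schematic, assuming the usual hybrid CTC/attention form L = λ·L_CTC + (1 − λ)·L_att, not the actual ESPnet training code:

```python
def hybrid_loss(ctc_loss: float, attention_loss: float, lam: float = 0.3) -> float:
    """Linear interpolation of CTC and attention losses.

    lam = 0.3 (as in the experiments) weights CTC by 0.3 and attention by 0.7,
    letting the attention objective dominate.
    """
    assert 0.0 <= lam <= 1.0, "interpolation weight must lie in [0, 1]"
    return lam * ctc_loss + (1.0 - lam) * attention_loss
```

With `lam = 0` the system trains on the attention loss alone; with `lam = 1` it reduces to pure CTC.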


Full descriptions of the system’s architecture can be consulted in [25, 24, 16]. We use the default ESPnet configuration for the VoxForge dataset [23] (ESPnet commit 81a383a9; the full parameter configuration is also listed in Table 4.2 of the Master’s Thesis underlying this publication [2]).

Adversarial Audio Generation.

The four trained ASR networks are subsequently integrated with the Fast Gradient Sign Method (FGSM [10]), adapted to the hybrid CTC-attention ASR system with a focus on attention-decoded sequences [8]. It generates an adversarial instance for each utterance of the four non-adversarial test sets with different degrees of MP3 compression. This method is similar to the sequence-to-sequence FGSM proposed in [22]. A previously decoded label sequence is used as the reference in order to avoid label leaking [15]. Using back-propagation, the cross-entropy loss gradient is obtained from a whitebox model (whose parameters are fully known). Gradients from a sliding window with a fixed length and stride are accumulated into the adversarial perturbation (Eq. 2); gradient normalization [9] is applied when accumulating the gradient directions. The intensity of the AN is determined by a scaling factor (Eq. 3), which we set to 0.3. The perturbation is then added to the original feature-domain input (Eq. 4) in order to trigger the network to output a wrong transcription, different from the ground truth (Eq. 5).
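The windowed-gradient FGSM steps above can be sketched in NumPy. This is a hypothetical illustration, not the implementation from [8]: the window length, stride, and normalization details are assumptions, and the gradient array would in practice come from back-propagation through the whitebox model:

```python
import numpy as np

def fgsm_perturbation(grad: np.ndarray, epsilon: float = 0.3) -> np.ndarray:
    """Basic FGSM noise: epsilon times the sign of the loss gradient w.r.t. the input."""
    return epsilon * np.sign(grad)

def sliding_window_fgsm(grads: np.ndarray, win: int, stride: int,
                        epsilon: float = 0.3) -> np.ndarray:
    """Accumulate L2-normalized gradients over sliding windows along the time
    axis of a (frames x features) array, then take the sign and scale by epsilon."""
    acc = np.zeros_like(grads)
    for start in range(0, grads.shape[0] - win + 1, stride):
        g = grads[start:start + win]
        # gradient normalization before accumulating the direction
        acc[start:start + win] += g / (np.linalg.norm(g) + 1e-12)
    return epsilon * np.sign(acc)
```

The resulting perturbation is simply added to the clean feature matrix, e.g. `x_adv = x + sliding_window_fgsm(grads, win=16, stride=8)`.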


Notably, we did not aim for psychoacoustically optimized AN, i.e., adversarial noise that would be totally inaudible, because the focus was not on developing powerful adversarial attacks, but rather on exploring ways to improve the ASR models’ robustness to more simplistic AN. Because the networks take acoustic feature vectors as input, FGSM originally creates AEs in the feature domain. Yet, in order to evaluate our research hypothesis, we needed AEs in the audio domain, that is, AAEs. Hence, we proceeded to invert the adversarial features and thus obtain synthetically reconstructed adversarial audio (rAAEs). The exact steps for both forward feature extraction and feature inversion are illustrated in Fig. 2 and were implemented with functions from the Librosa toolbox [17].

Figure 2: AAEs generation via feature inversion applied to the adversarial features
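Because the phase is discarded during feature extraction, the inversion has to re-estimate it; Librosa does this with Griffin-Lim iterations. The sketch below shows only that phase-recovery core on a linear STFT magnitude (the mel-to-linear mapping used in the actual pipeline is omitted), implemented with SciPy; all parameter values are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag: np.ndarray, n_fft: int = 512, n_iter: int = 32,
                seed: int = 0) -> np.ndarray:
    """Recover a time-domain signal from an STFT magnitude by iteratively
    re-estimating the missing phase (Griffin-Lim)."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, audio = istft(mag * phase, nperseg=n_fft)    # back to time domain
        _, _, spec = stft(audio, nperseg=n_fft)         # re-analyze
        spec = spec[:, :mag.shape[1]]                   # guard against frame drift
        if spec.shape[1] < mag.shape[1]:
            spec = np.pad(spec, ((0, 0), (0, mag.shape[1] - spec.shape[1])))
        phase = np.exp(1j * np.angle(spec))             # keep phase, reuse target magnitude
    _, audio = istft(mag * phase, nperseg=n_fft)
    return audio
```

Each iteration keeps the target magnitude fixed and only updates the phase estimate, which is why artefacts remain audible after a finite number of iterations, consistent with the listening observations below.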

Note that log-mel fbank features are a lossy representation of the original audio input, since they lack the phase information of the spectrum; this in turn hinders a highly accurate audio reconstruction. In fact, mere listening revealed that it was relatively easy to distinguish the reconstruction artefacts when comparing pairs of original (non-adversarial) audio samples with their reconstructed versions (reconstructed audio samples, both non-adversarial and adversarial, are available online). Yet to see whether the reconstruction method impaired the ASR performance in any way, we performed the following sanity check: the ASR model trained on uncompressed .wav files was used to decode both the original test set and its reconstructed counterpart (obtained by feature extraction directly followed by inversion, i.e., without applying FGSM). We observed just a mild 1.3% absolute increase in the Character Error Rate (CER) for the reconstructed set, which implies that the relevant acoustic features are still accurately preserved following audio reconstruction with our feature inversion method.

4 Results and Discussion

ASR results for non-adversarial vs. adversarial input.

Table 1 conveys the ASR results of decoding various test sets that originate from the same audio format as the training data of each ASR model, hence in a train-test matched setting. Adversarial inputs in the feature domain (column [b]) render far more transcription errors, across all models, than the baseline, non-adversarial input listed in column [a]. This validates the FGSM method as effective in creating adversarial features from input data of any audio format (uncompressed, as well as MP3-compressed). The error scores for reconstructed AAEs (rAAEs, column [c]) created from the adversarial features are also higher than the baseline, but, interestingly, lower than the CER scores for the adversarial features themselves. This suggests that our reconstruction method makes the AAEs less powerful in misleading the model, which in turn is beneficial for the system’s robustness to AN. When we further compress the rAAEs with MP3 at a bitrate of 24 kbps (column [d]), we observe an additional decline in the CER, indicating that MP3 compression is favourable in reducing the attack effectiveness of the adversarial input. However, these CER values are still much higher than the baseline across all ASR models, suggesting that the original transcription could not be fully recovered. The strongest reduction effect between MP3 rAAEs and the “original” adversarial features can be observed for adversarial compressed input (24 kbps) originating from 64 kbps-compressed data. Overall, the numbers show that MP3 compression manages to partially reduce the error rates on adversarial inputs.

Table 1: CER results for decoding adversarial input (marked as [b], [c] and [d]) with the four ASR models: #1 uncompressed, #2 128 kbps-MP3, #3 64 kbps-MP3 and #4 24 kbps-MP3, where the source format applies to both the train and test inputs in [a], [b], [c], [d]. Columns: [a] non-adversarial baseline, [b] adversarial features, [c] reconstructed rAAEs, [d] reconstructed, MP3-compressed AAEs, plus the relative CER difference (%) between [b] and [d]. The inputs from column [d] were compressed at 24 kbps.

MP3 effects on Signal-to-Noise Ratio (SNR).

SNR is a measure that quantifies the amount of noise in an audio sample; as such, it is natural to assess how MP3 encoding impacts the AN in terms of SNR. Since it was difficult to compute the SNR in the original, adversarial feature domain, we estimated the SNR of the reconstructed AAEs before and after MP3 compression as

SNR [dB] = 10 · log10( Σ_t x_rec(t)² / Σ_t (x_rAAE(t) − x_rec(t))² ),

where x_rec denotes the reconstructed version of the original speech utterance (no FGSM) and x_rAAE the reconstructed adversarial audio. For both the uncompressed and the MP3 rAAEs, the SNR was calculated with reference to the same signal in the numerator, the reconstructed original, so as to introduce similar artifacts as when reconstructing the adversarial audio and thus obtain a more accurate SNR estimate. As expected, we obtained different SNRs for each adversarial audio sample, because the AN varies with the speech input. Moreover, the SNR values differed before and after MP3 compression for each adversarial sample. As the normalized histograms in Fig. 3 show, most of the SNR values for the uncompressed rAAEs are negative (left plot), suggesting that the added AN has high levels and is therefore audible. However, after MP3 compression (right plot), most adversarial samples acquire positive SNRs, implying that the AN was diminished by MP3 encoding.
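The SNR estimate described above, with the reconstructed original in the numerator and the residual as the noise term, amounts to a few lines of NumPy. The array names are hypothetical placeholders for the reconstructed signals:

```python
import numpy as np

def snr_db(reference: np.ndarray, noisy: np.ndarray) -> float:
    """SNR in dB of `noisy` relative to `reference`.

    The noise term is the sample-wise difference between the two signals,
    mirroring the estimate used for the reconstructed AAEs.
    """
    noise = noisy - reference
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2)))
```

For instance, a perturbation one tenth the amplitude of a unit-amplitude reference yields an SNR of 20 dB; a perturbation larger than the signal yields a negative SNR, as seen for the uncompressed rAAEs.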

Figure 3: Normalized histograms following SNR estimation of reconstructed adversarial inputs (rAAEs): uncompressed (left) and MP3 at 24 kbps (right). The number of histogram bins was set to 50 for both plots.

To validate this, we applied the non-parametric, two-sided Kolmogorov-Smirnov (KS) test [3] to evaluate whether the bin counts from the normalized SNR histograms of rAAEs before and after MP3 compression originate from the same underlying distribution. We obtained a p-value of 0.039 (< 0.05), confirming that the difference between the underlying distributions of the two histograms is statistically significant. This validates MP3’s increasing effect on the SNR values of adversarial samples, which essentially means that MP3 reduces the AN.
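The two-sided KS test is available as `scipy.stats.ks_2samp`. The sketch below applies it to two synthetic stand-in samples (the real SNR values are not reproduced here), one centered below 0 dB and one above, mimicking the before/after-MP3 histograms:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# synthetic stand-ins for the per-utterance SNR values (dB)
snr_uncompressed = rng.normal(loc=-2.0, scale=3.0, size=500)  # mostly negative
snr_mp3 = rng.normal(loc=3.0, scale=3.0, size=500)            # mostly positive

stat, p_value = ks_2samp(snr_uncompressed, snr_mp3)
significant = p_value < 0.05  # reject the null of a common distribution
```

A small p-value rejects the null hypothesis that both samples come from the same distribution, which is the conclusion drawn for the two SNR histograms.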

ASR results for input augmented with regular noise.

To have a reference for the behaviour of the ASR systems when exposed to common types of noise, we also assessed the effects of MP3 compression on audio inputs corrupted by regular, non-adversarial noise. For this complementary analysis, we augmented the original test sets with four noise types, namely white, pink, brown and babble noise (overlapping voices), each with distinct spectral characteristics. The noise-augmented samples and their MP3-compressed versions were then fed as input to the decoding stage of the corresponding ASR system, i.e., the one trained on audio data of the same format as the test data. Table 2 lists the CER results of this decoding experiment for ASR model #1 (trained on uncompressed data).
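Augmenting an utterance with noise at a prescribed SNR means scaling the noise so that the signal-to-noise power ratio matches the target before mixing. A minimal sketch, with white noise as the example (pink and brown noise would additionally be spectrally shaped):

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray,
                     snr_db: float) -> np.ndarray:
    """Scale `noise` so that the mixture speech + noise has the requested SNR."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # choose the scale s with p_speech / (s^2 * p_noise) = 10^(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` over 30, 10, 5, 0, -5 and -10 dB reproduces the noise levels used in Table 2.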

Test set augmented with:       SNR [dB]
                               30     10     5      0      -5     -10
[A] white noise                19.1   32.7   41.9   53.7   66.2   78.2
    MP3 compr. (24 kbps)       29.1   51.2   61.7   71.2   78.7   86.0
[B] pink noise                 18.5   29.1   38.1   51.7   67.4   82.1
    MP3 compr. (24 kbps)       26.9   42.5   53.0   66.4   79.8   89.9
[C] brown noise                17.9   19.7   21.9   26.1   34.1   47.8
    MP3 compr. (24 kbps)       25.3   29.0   32.5   38.0   47.3   60.6
[D] babble noise               18.3   35.8   53.4   77.4   89.0   93.6
    MP3 compr. (24 kbps)       25.8   48.2   66.0   83.6   93.1   95.4
Table 2: CERs for test sets augmented with regular noise (in uncompressed and MP3 formats) at different SNR values, decoded by ASR model #1

Note that all ASR systems were trained on the original, noise-free data; therefore, decoding noisy inputs was expected to cause more transcription errors than the original, clean data. Based on the CER, one can observe that the lower the SNR (i.e., the higher the level of noise added to the input), the more error-prone the ASR system is, irrespective of the noise type. The most adverse effects occur for white and babble noise, rows [A] and [D], which also happen to have the richest spectral content, strongly interfering with the original speech.

Yet of utmost interest is what happens when the same treatment used for adversarial inputs, i.e., MP3 compression, is applied to the speech inputs augmented with regular noise. These results are shown in every second row of Table 2. Error rates turn out to be consistently higher for MP3-compressed inputs than for uncompressed ones, regardless of noise type or SNR. Consequently, MP3 compression has the inverse effect compared to what was observed for adversarial noise: while it reduced the number of errors returned by the ASR systems for adversarial input, it failed to do so for non-adversarial noise. On the contrary, MP3 compression increased the number of errors for inputs augmented with non-adversarial noise, especially at high and mid-range SNRs. This further validates that MP3 compression has the desired effect of partially reducing only adversarial perturbations, while it exacerbates the effect of regular, non-adversarial noise.

5 Conclusion

In this work, we explored the potential of MP3 compression as a countermeasure against Adversarial Noise (AN) compromising speech inputs. To this end, we constructed adversarial feature samples with the gradient-based FGSM method. The adversarial features were then mapped back into the audio domain by inverse feature extraction. The resulting adversarial audio (denoted as reconstructed) was thereafter MP3-compressed and presented to the input of four ASR models featuring a hybrid CTC-attention architecture, which had previously been trained on four types of audio data. Our three key findings are:

  1. In comparison to adversarial features, reconstructed AAEs, as well as MP3-compressed AAEs, had lower error rates in their transcriptions. The error reduction did not reach the performance on non-adversarial input, which implies that correct transcriptions were not completely recovered.

  2. MP3 compression had a statistically significant effect on the estimated SNR values of rAAEs, supporting the observation that MP3 increased the SNR of adversarial samples, which translates to AN reduction.

  3. Our experiments with non-adversarial noise suggest that MP3 compression is beneficial only in mitigating Adversarial Noise, while it deteriorates speech recognition performance on inputs with regular, non-adversarial noise.


  • [1] M. Alzantot, B. Balaji, and M. Srivastava (2018) Did you hear that? Adversarial examples against automatic speech recognition. (NIPS).
  • [2] I. Andronic (2020) MP3 Compression as a Means to Improve Robustness against Adversarial Noise Targeting Attention-based End-to-End Speech Recognition. Master’s Thesis, Technical University of Munich, Germany.
  • [3] V. W. Berger and Y. Zhou (2014) Kolmogorov-Smirnov test: overview. In Wiley StatsRef: Statistics Reference Online.
  • [4] K. Brandenburg (1999) MP3 and AAC explained. Audio Engineering Society, 17th International Conference, pp. 99–110.
  • [5] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou (2016) Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), pp. 513–530.
  • [6] N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In 2018 IEEE Symposium on Security and Privacy Workshops (SPW), pp. 1–7.
  • [7] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964.
  • [8] E. R. Chavez Rosas (2020) Improving Robustness of Sequence-to-sequence Automatic Speech Recognition by Means of Adversarial Training. Master’s Thesis, Technical University of Munich, Germany.
  • [9] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185–9193.
  • [10] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–11.
  • [11] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
  • [12] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. (2014) Deep Speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
  • [13] S. Hu, X. Shang, Z. Qin, M. Li, Q. Wang, and C. Wang (2019) Adversarial examples for automatic speech recognition: attacks and countermeasures. IEEE Communications Magazine 57 (10), pp. 120–126.
  • [14] D. Iter, J. Huang, and M. Jermann (2017) Generating adversarial examples for speech recognition. Stanford Technical Report.
  • [15] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. CoRR abs/1611.01236.
  • [16] L. Kürzinger, T. Watzel, L. Li, R. Baumgartner, and G. Rigoll (2019) Exploring hybrid CTC/attention end-to-end speech recognition with Gaussian processes. In International Conference on Speech and Computer, pp. 258–269.
  • [17] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto (2015) librosa: audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Vol. 8.
  • [18] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597.
  • [19] K. Rajaratnam, B. Alshemali, and J. Kalita (2018) Speech coding and audio preprocessing for mitigating and detecting audio adversarial examples on automatic speech recognition.
  • [20] L. Schönherr, K. Kohls, S. Zeiler, T. Holz, and D. Kolossa (2018) Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. CoRR abs/1808.05665.
  • [21] L. Schönherr, S. Zeiler, T. Holz, and D. Kolossa (2019) Imperio: robust over-the-air adversarial examples for automatic speech recognition systems.
  • [22] S. Sun, P. Guo, L. Xie, and M. Y. Hwang (2019) Adversarial regularization for attention-based end-to-end robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (11), pp. 1826–1838.
  • [23] VoxForge speech corpus.
  • [24] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai (2018) ESPnet: end-to-end speech processing toolkit. In Proceedings of INTERSPEECH 2018, pp. 2207–2211.
  • [25] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253.
  • [26] Z. Yang, B. Li, P. Chen, and D. Song (2018) Towards mitigating audio adversarial perturbations.