Detecting Adversarial Attacks On Audio-Visual Speech Recognition

12/18/2019 ∙ by Pingchuan Ma, et al. ∙ Imperial College London

Adversarial attacks pose a threat to deep learning models. However, research on adversarial detection methods, especially in the multi-modal domain, is very limited. In this work, we propose an efficient and straightforward detection method based on the temporal correlation between audio and video streams. The main idea is that the correlation between audio and video in adversarial examples will be lower than in benign examples due to the added adversarial noise. We use the synchronisation confidence score as a proxy for audio-visual correlation and use it to detect adversarial attacks. To the best of our knowledge, this is the first work on the detection of adversarial attacks on audio-visual speech recognition models. We apply recent adversarial attacks to two audio-visual speech recognition models trained on the GRID and LRW datasets. The experimental results demonstrate that the proposed approach is an effective way of detecting such attacks.




1 Introduction

Deep networks achieve state-of-the-art performance on several tasks such as image classification, image segmentation and face recognition. However, recent studies [14, 9] show that such networks are susceptible to adversarial attacks. Given any input $x$ and a classifier $f(\cdot)$, an adversary tries to carefully construct a sample $x'$ that is similar to $x$ but for which $f(x') \neq f(x)$. The adversarial examples are indistinguishable from the original ones but can easily degrade the performance of deep classifiers.

Existing studies on adversarial attacks have mainly focused on the image domain [9, 13, 2, 3]. Recently, adversarial attacks in the audio domain have also been presented [1, 4]. One of the most prominent studies is the iterative optimisation-based attack [4], which operates directly on an audio clip and, by adding a perturbation, causes it to be transcribed as any chosen phrase. Work on defences against adversarial attacks can be divided into three categories: adversarial training [9], gradient masking [12] and input transformation [16]. The first adds adversarial examples to the training set, whereas the second builds a model which does not have useful gradients. Both require the model to be retrained, which can be computationally expensive. In contrast, the third attempts to defend against adversarial attacks by transforming the input.

On the other hand, work on how to detect adversarial attacks is very limited. To the best of our knowledge, the only work in the audio domain was proposed by Yang et al. [17] and exploits the inherent temporal dependency in audio samples to detect adversarial examples. The main idea is that the transcribed results from an audio sequence and segments extracted from it are consistent in benign examples but not in adversarial ones. In other words, the temporal dependency is not preserved in adversarial sequences.

Inspired by the idea of using temporal dependency to detect audio adversarial examples, we propose a simple and efficient detection method against audio-visual adversarial attacks. To the best of our knowledge, this is the first work which presents a method for detecting adversarial attacks on audio-visual speech recognition. The key idea is that the audio stream is highly correlated with the video of the face (and especially the mouth region). In the case of an adversarial example, the noise added to the audio and video streams is expected to weaken the audio-visual correlation. Hence, we propose the use of audio-visual synchronisation as a proxy for correlation. In other words, we expect higher synchronisation scores for benign examples and lower scores for adversarial examples.

The proposed detection method is tested on speech recognition attacks on models trained on the Lip Reading in the Wild (LRW) [5] and GRID datasets [8]. Our results show that we can detect audio-visual adversarial attacks with high accuracy.

2 Databases

Figure 1: An overview of our proposed detection method. (a) A video and an audio clip are fed to the end-to-end audio-visual speech recognition model. They are also fed to the synchronisation network (b) which estimates a synchronisation confidence score which is used for determining if the audio-visual model has been attacked or not (c). The confidence distribution of 300 adversarial and benign examples from the GRID dataset is shown in (d).

For the purposes of this study, we use two audiovisual datasets, the LRW [5] and GRID [8] datasets. The LRW dataset is a large-scale audio-visual dataset consisting of clips from BBC programs. The dataset has 500 isolated words from more than 1000 speakers and contains 488766, 25000, and 25000 examples in the training, validation and test sets, respectively. Each utterance is a short segment with a length of 29 frames (1.16 seconds), where target words are centred in the segment of utterances.

The GRID dataset consists of 33 speakers and 33000 utterances (1000 per speaker). Each utterance is composed of six words taken from the combination of the following components: <command: 4><colour: 4><preposition: 4><letter: 25><digit: 10><adverb: 4>, where the number of choices for each component is indicated in the angle brackets. In this work, we follow the evaluation protocol from [15] where 16, 7 and 10 subjects are used for training, validation and testing, respectively.
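As a quick sanity check on the GRID grammar described above, the following minimal sketch multiplies the number of choices per slot to obtain the number of distinct six-word sentences, and confirms that the subject split adds up to the 33 speakers. The dictionary and function names are illustrative and not part of the dataset tooling.

```python
from math import prod

# Number of choices per slot of a GRID sentence, as listed in the text.
GRID_SLOTS = {
    "command": 4,
    "colour": 4,
    "preposition": 4,
    "letter": 25,
    "digit": 10,
    "adverb": 4,
}

# Subject split used in this work (training, validation, testing).
SUBJECT_SPLIT = {"train": 16, "val": 7, "test": 10}


def count_sentences(slots):
    """Number of distinct sentences the grammar can generate."""
    return prod(slots.values())


print(count_sentences(GRID_SLOTS))   # 64000
print(sum(SUBJECT_SPLIT.values()))   # 33
```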

3 Background

3.1 Attacks

In this study, we consider two attack methods, the Fast Gradient Sign Method (FGSM) [9] and the iterative optimisation-based attack [4]. FGSM, which is suitable for attacks on classification models, computes the gradient of the loss with respect to the benign input and updates each pixel in the direction that maximises the loss. The Basic Iterative Method (BIM) [10] extends FGSM by applying it multiple times with a small step size. Specifically, given a loss function $J(\cdot, \cdot)$ for training the classification model $f$, the adversarial noise is generated as follows:

$$x'_0 = x, \qquad x'_{N+1} = \mathrm{Clip}_{x,\epsilon}\left\{ x'_N + \alpha \, \mathrm{sign}\left( \nabla_x J(x'_N, y) \right) \right\} \qquad (1)$$

where $x$ is the benign input, $\alpha$ is the step size, $x'_N$ is the adversarial example after $N$ steps of the iterative attack and $y$ is the true label. After each step, pixel values in the adversarial images are clamped to the range $[x - \epsilon, x + \epsilon]$, where $\epsilon$ is the maximum change in each pixel value. This method was proposed for adversarial attacks on images but can also be applied to audio clips by crafting a perturbation for the audio input.
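The iterative update described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it replaces the audio-visual network with a toy logistic model whose input gradient is available in closed form, and the names `loss_gradient` and `bim` are hypothetical. FGSM corresponds to a single step with the step size equal to the maximum change.

```python
import math


def sign(v):
    """Sign of a scalar: -1, 0 or 1."""
    return (v > 0) - (v < 0)


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(value, high))


def loss_gradient(x, y, w):
    """Gradient of the binary cross-entropy loss w.r.t. the input x
    for a toy logistic model p = sigmoid(w . x) (an assumption made
    here so that the example stays self-contained)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def bim(x, y, w, epsilon, alpha, steps):
    """Basic Iterative Method: repeated signed-gradient steps of size
    alpha, clamped after each step to [x - epsilon, x + epsilon]."""
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(x_adv, y, w)
        x_adv = [
            clip(xa + alpha * sign(g), xi - epsilon, xi + epsilon)
            for xa, g, xi in zip(x_adv, grad, x)
        ]
    return x_adv


x = [100.0, 50.0, 200.0]      # toy "pixel" values
w = [0.01, -0.02, 0.005]      # toy model weights
x_adv = bim(x, y=1, w=w, epsilon=16, alpha=1, steps=20)
print(x_adv)                  # [84.0, 66.0, 184.0]
```

Every coordinate ends exactly 16 away from its benign value, which shows the clamp taking effect once the accumulated steps exceed the allowed maximum change.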

The second type of attack [4] has been recently proposed and is suitable for attacks on continuous speech recognition models. It generates audio adversarial examples that can be transcribed to any phrase but sound similar to the benign one. Specifically, the goal of this targeted attack is to seek an adversarial input $x'$, which is very close to the benign input $x$, but which the model decodes to the target phrase $t$. The objective of the attack is the following:

$$\text{minimise } J(x + \delta, t) \quad \text{such that } \|\delta\|_\infty < \epsilon \qquad (2)$$

where $\epsilon$ is introduced to limit the maximum change for each audio sample or pixel and $\delta = x' - x$ is the amount of adversarial noise.

3.2 Audio-visual Speech Recognition Threat Model

The architecture is shown in Fig. 1a. We use the end-to-end audiovisual model that was proposed in [11]. The video stream consists of spatiotemporal convolution, a ResNet18 network and a 2-layer BGRU network whereas the audio stream consists of a 5-layer CNN and a 2-layer BGRU network. These two streams are used for feature extraction from raw modalities. The top two-layer BGRU network further models the temporal dynamics of the concatenated feature.

According to the problem type, two different loss functions are applied for training. The multi-class cross entropy loss, where each input sequence is assigned a single class, is suitable for word-level speech recognition. The CTC loss is used for sentence-level classification. This loss transcribes directly from sequence to sequence when the alignment between inputs and target outputs is unknown. Given an input sequence $x$, CTC sums over the probability of all possible alignments to obtain the posterior of the target sequence.
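To make the "sum over all possible alignments" concrete, the sketch below enumerates every frame-level path by brute force, keeps those that collapse to the target sequence, and adds up their probabilities. It assumes per-frame symbol probabilities are given as dictionaries and uses `-` as the blank symbol; both are illustrative choices. The enumeration grows exponentially with the number of frames, so practical implementations use dynamic programming instead.

```python
from itertools import product

BLANK = "-"  # illustrative blank symbol


def collapse(path):
    """CTC collapsing rule: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def ctc_posterior(frame_probs, target):
    """Sum the probability of every alignment that collapses to target.

    frame_probs is a list with one dict per frame, mapping each symbol
    (including the blank) to its probability at that frame.
    """
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return total


# Two frames, two symbols, uniform probabilities.
probs = [{"a": 0.5, BLANK: 0.5}, {"a": 0.5, BLANK: 0.5}]
print(ctc_posterior(probs, "a"))  # 0.75 (paths "aa", "a-", "-a")
```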

4 Synchronisation-based Detection Method

Chung et al. [6, 7] introduced the SyncNet model, which predicts the synchronisation error given raw audio and video streams. This error is quantified by the synchronisation offset and a confidence score. A sliding window approach is used to determine the audio-visual offset. For each 5-frame video window, the offset is the one that minimises the distance between the visual features and all audio features within a 1 second range. The confidence score for a particular offset is defined as the difference between the minimum and the median of the Euclidean distances (computed over all windows). Audio and video are considered perfectly matched if the offset approaches zero with a high confidence score.
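The offset and confidence computation can be sketched as follows. This is a minimal illustration rather than the SyncNet implementation: it assumes the audio-visual Euclidean distances have already been computed for each candidate offset and each video window, and that distances are averaged over windows before the minimum and median are taken. The function name and input layout are hypothetical.

```python
from statistics import mean, median


def sync_offset_and_confidence(distances):
    """Estimate the audio-visual offset and its confidence score.

    distances maps each candidate offset (in frames) to a list of
    Euclidean distances, one per video window (assumed precomputed).
    """
    # Average over windows to get one distance per candidate offset.
    mean_dist = {off: mean(d) for off, d in distances.items()}
    # The estimated offset is the one with the smallest mean distance.
    best_offset = min(mean_dist, key=mean_dist.get)
    values = list(mean_dist.values())
    # Confidence: gap between the median and the minimum distance.
    confidence = median(values) - min(values)
    return best_offset, confidence


toy = {-1: [10.0, 12.0], 0: [2.0, 4.0], 1: [9.0, 11.0]}
print(sync_offset_and_confidence(toy))  # (0, 7.0)
```

A sharp minimum at one offset gives a large gap to the median and hence a high confidence, whereas a flat distance profile gives a confidence close to zero.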

In this work, we aim to explore whether such synchronisation is affected by adversarial noise. The detection method is shown in Fig. 1b and 1c. In the detection model, we measure the temporal consistency between the audio and video streams via a model trained for audio-visual synchronisation. For benign audio and video streams, the confidence score should be relatively high since audio and video are aligned and therefore highly synchronised. However, for adversarial audio and video examples, the confidence score is expected to be lower. The added perturbation, which aims to steer the model's output toward the target transcription, reduces the correlation between the two streams, so they are less synchronous. Fig. 1d shows the confidence distribution of 300 benign and adversarial examples from the GRID dataset.

Figure 2: One example using iterative optimisation-based attack on the GRID dataset. (a): benign example; (b): adversarial noise; (c): adversarial example; Raw audio waveforms, audio log-spectrum and raw images are presented from top to bottom.

5 Experimental Setup

5.1 Attacks

We evaluate our proposed method using two adversarial attacks on both modalities. We assume a white-box scenario, where the model parameters are known to the attacker.

Attacks against Word-level Classification: Attacks such as FGSM and BIM are suitable for word recognition models trained on the LRW dataset. For FGSM, $\epsilon_A$ for the audio stream and $\epsilon_V$ for the video stream were chosen heuristically. In our case, we set $\epsilon_A$ to 1024 and $\epsilon_V$ to 16 (pixel values are in the range [0, 255] and audio samples are in the range [-32768, 32767]). For BIM, the step size $\alpha$ was set to 1 in the image domain, which means the value of each pixel is changed by 1 at each iteration. The step size in the audio domain is set to 64. We follow the setting for the number of iterations suggested by [10], which is selected to be $\min(\epsilon_V + 4, 1.25\,\epsilon_V)$.
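The heuristic in [10] sets the number of BIM iterations to $\min(\epsilon + 4, 1.25\,\epsilon)$. The small sketch below evaluates it for the two video settings used here, assuming the heuristic is applied to the maximum pixel change; the function name is illustrative. The value obtained for a maximum pixel change of 32 matches the 36 iterations reported in the results.

```python
def bim_iterations(epsilon):
    """Number of BIM iterations suggested in [10]: min(eps + 4, 1.25 * eps)."""
    return int(min(epsilon + 4, 1.25 * epsilon))


for eps_v in (16, 32):
    print(eps_v, bim_iterations(eps_v))
# 16 20
# 32 36
```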

Attacks against Continuous Speech Recognition: For attacking a speech recognition model trained on GRID we use a recently proposed targeted attack [4]. The maximum change allowed, as defined by $\epsilon$ (see Eq. 2), is initialised to 2048 and 32 for audio and video, respectively, and is reduced during iterative optimisation. We implement the attack with 1000 iterations. In our studies, 10 random utterances are selected as target utterances, and 300 adversarial examples are randomly selected for each target utterance.

5.2 Evaluation Metrics

We use the Euclidean distance ($L_2$) for measuring the similarity between two images. We also use the $L_\infty$ norm to measure the maximum change per pixel. For audio samples we follow [4] and convert the $L_\infty$ norm to the scale of Decibels (dB): $dB(x) = \max_i 20 \log_{10}(x_i)$, where $x_i$ is an arbitrary audio sample point from the audio clip $x$. The audio distortion is specified as the relative loudness to the benign audio, which can be defined as $dB_x(\delta) = dB(\delta) - dB(x)$.
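The audio distortion measure of [4] can be written as a short sketch: the loudness of a clip is 20 times the base-10 logarithm of its peak sample, and the distortion is the loudness of the perturbation relative to that of the benign clip. Taking the absolute value of each sample is an assumption made here so that negative samples are handled; the function names are illustrative.

```python
import math


def db(samples):
    """Peak loudness of a clip in decibels: max_i 20 * log10(|x_i|)."""
    peak = max(abs(s) for s in samples)
    return 20.0 * math.log10(peak)


def relative_db(delta, x):
    """Distortion of perturbation delta relative to benign audio x."""
    return db(delta) - db(x)


benign = [0, 2500, -10000, 7000]   # toy 16-bit audio samples
noise = [30, -100, 80, 10]         # toy adversarial perturbation
print(round(relative_db(noise, benign), 2))  # -40.0
```

More negative values indicate a quieter perturbation relative to the benign audio.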

The Area Under the Curve (AUC) score is used for evaluating the detection approach. We compute the synchronisation confidence score in benign and adversarial examples and by varying the threshold we compute the Receiver Operating Characteristic (ROC) curve.
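The AUC obtained by sweeping the threshold over the ROC curve equals the probability that a randomly chosen benign example receives a higher confidence score than a randomly chosen adversarial one, with ties counted as one half. The sketch below computes it directly from this pairwise definition; the function name and the toy scores are illustrative.

```python
def roc_auc(benign_scores, adversarial_scores):
    """AUC as the fraction of (benign, adversarial) pairs in which the
    benign example has the higher confidence score (ties count 0.5)."""
    wins = 0.0
    for b in benign_scores:
        for a in adversarial_scores:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(benign_scores) * len(adversarial_scores))


benign = [5.0, 6.0, 7.0]
adversarial = [1.0, 2.0, 6.0]
print(round(roc_auc(benign, adversarial), 4))  # 0.8333
```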

Finally, in order to assess how this approach would work in a real scenario, we select the threshold (from Fig. 1c) which maximises the average $F_1$ score of the adversarial and benign classes on the validation set. Then we use this threshold to compute the average $F_1$ score on the test set.
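The threshold selection step can be sketched as below. The sketch assumes that the score being averaged over the two classes is the per-class F1 score, that an example is flagged as adversarial when its confidence falls below the threshold, and that candidate thresholds are the observed validation scores; all function names are illustrative.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_f1(benign, adversarial, threshold):
    """Mean of the F1 scores of the adversarial and benign classes when
    examples with confidence below threshold are flagged as adversarial."""
    tp = sum(s < threshold for s in adversarial)  # adversarial, flagged
    fn = len(adversarial) - tp                    # adversarial, missed
    fp = sum(s < threshold for s in benign)       # benign, flagged
    tn = len(benign) - fp                         # benign, accepted
    f1_adversarial = f1(tp, fp, fn)
    f1_benign = f1(tn, fn, fp)
    return (f1_adversarial + f1_benign) / 2


def select_threshold(benign_val, adversarial_val):
    """Pick the validation score that maximises the average F1."""
    candidates = sorted(set(benign_val) | set(adversarial_val))
    return max(
        candidates,
        key=lambda t: average_f1(benign_val, adversarial_val, t),
    )


threshold = select_threshold([6.0, 7.0, 8.0], [1.0, 2.0, 3.0])
print(threshold)                                       # 6.0
print(average_f1([5.5, 9.0], [2.0, 6.5], threshold))   # 0.5
```

The first call plays the role of the validation set and the second reuses the selected threshold on held-out scores, mirroring the validation and test protocol described above.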

6 Results

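| Attacks | CR (%) | $L_2$ | AUC | $F_1$ |
|---|---|---|---|---|
| FGSM ($\epsilon_A$=1024, $\epsilon_V$=16) | 13.67 | 3.46 | 0.99 | 0.94 |
| FGSM ($\epsilon_A$=1024, $\epsilon_V$=32) | 12.40 | 6.89 | 0.99 | 0.96 |
| BIM ($\epsilon_A$=1024, $\epsilon_V$=16) | 6.67 | 1.22 | 0.89 | 0.82 |
| BIM ($\epsilon_A$=1024, $\epsilon_V$=32) | 3.27 | 1.67 | 0.93 | 0.85 |

Table 1: Results for the proposed adversarial attack detection approach on word recognition models trained on the LRW dataset. CR is the classification rate. $L_\infty$ is 4 and 8 pixels when $\epsilon_V$ is 16 and 32, respectively. $dB_x(\delta)$ is -19dB when $\epsilon_A$ is 1024.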

6.1 Word-level Speech Recognition

Detection results for attacks on word-level speech recognition are shown in Table 1. In the presence of adversarial noise, the classification rate drops from 97.20% (the performance of the model trained on the LRW dataset when benign examples are fed to it) to 12.40% using FGSM, whereas it drops to 3.27% after running 36 iterations of BIM. On the other hand, the detection method achieves an AUC score as high as 0.99 for FGSM and 0.93 for BIM. The corresponding $F_1$ scores are 0.96 and 0.85, respectively.

We also notice that when the attack is stronger, e.g., when BIM is used instead of FGSM, the classification rate goes down, i.e., the attack is more successful, and at the same time the distortion ($L_2$) becomes smaller. Consequently, detection becomes more difficult, which is reflected in the lower AUC and $F_1$ scores.

We also investigate the detection performance when $\epsilon_V$ decreases from 32 to 16, i.e., the pixel values change less. It is clear from Table 1 that for both types of attack the distortion is smaller and, as a consequence, detection becomes harder: both the AUC and $F_1$ scores go down. However, such attacks are less successful since the classification rate goes up.

6.2 Sentence-level Speech Recognition

Detection results for fully targeted attacks on sentence-level speech recognition, i.e., where the goal of the attack is that the transcribed result is the same as the desired target phrase, are shown in Table 2. It is clear that the attack is almost always successful no matter what the target sentence is: at least 92% of the examples are transcribed as the target phrase. At the same time the detection rates are quite high, with an AUC between 0.93 and 0.97 and an $F_1$ score between 0.81 and 0.86. We also observe that the maximum distortions applied to the audio and video signals are similar in most cases.

We also consider another scenario where the WER between the transcribed results and target phrases is up to 50%. Results are shown in Table 3. In this case the attack is always successful. In addition, the generated audio and video adversarial examples are less distorted than the ones generated by the fully targeted attacks. In turn, this leads to lower AUC scores, between 0.88 and 0.92, and lower $F_1$ scores, between 0.76 and 0.80.

| Target Phrases | Success Rate | Video $L_2$ | Video $L_\infty$ | Audio $dB_x(\delta)$ (dB) | AUC | $F_1$ |
|---|---|---|---|---|---|---|
| bin blue at a zero please | 0.99 | 5.37 | 7.63 | -41.41 | 0.95 | 0.83 |
| bin white by o nine now | 0.99 | 5.27 | 7.48 | -40.18 | 0.94 | 0.83 |
| lay green with y seven again | 0.99 | 5.67 | 8.06 | -40.47 | 0.95 | 0.83 |
| lay red at c eight soon | 0.99 | 5.61 | 7.96 | -40.64 | 0.95 | 0.81 |
| place blue at p one again | 0.98 | 5.39 | 7.65 | -42.37 | 0.93 | 0.81 |
| place red by a one soon | 0.99 | 5.25 | 7.45 | -42.00 | 0.93 | 0.81 |
| place red by z two soon | 0.98 | 5.42 | 7.69 | -40.70 | 0.95 | 0.84 |
| set green in f one again | 0.99 | 5.53 | 7.87 | -40.09 | 0.96 | 0.83 |
| set red in x four now | 0.92 | 5.90 | 8.38 | -37.99 | 0.97 | 0.86 |
| set white in p five now | 0.97 | 5.66 | 8.04 | -39.99 | 0.95 | 0.82 |

Table 2: Results of the proposed audio-visual synchronisation detection on fully targeted adversarial attacks, i.e., the goal of the attack is to make the WER between transcribed and target phrases 0, on continuous speech recognition models trained on GRID. The success rate is the proportion of adversarial examples with WER equal to 0. ($\epsilon_A$ = 2048, $\epsilon_V$ = 32)

| Target Phrases | Success Rate | Video $L_2$ | Video $L_\infty$ | Audio $dB_x(\delta)$ (dB) | AUC | $F_1$ |
|---|---|---|---|---|---|---|
| bin blue at a zero please | 1.00 | 5.10 | 7.27 | -51.39 | 0.91 | 0.79 |
| bin white by o nine now | 1.00 | 5.25 | 7.47 | -48.86 | 0.91 | 0.79 |
| lay green with y seven again | 1.00 | 5.24 | 7.45 | -49.28 | 0.91 | 0.79 |
| lay red at c eight soon | 1.00 | 5.01 | 7.13 | -49.54 | 0.91 | 0.79 |
| place blue at p one again | 1.00 | 4.83 | 6.90 | -51.71 | 0.88 | 0.76 |
| place red by a one soon | 1.00 | 5.02 | 7.14 | -49.93 | 0.89 | 0.79 |
| place red by z two soon | 1.00 | 5.14 | 7.31 | -48.33 | 0.91 | 0.80 |
| set green in f one again | 1.00 | 5.14 | 7.32 | -47.43 | 0.92 | 0.80 |
| set red in x four now | 1.00 | 5.15 | 7.33 | -46.19 | 0.92 | 0.80 |
| set white in p five now | 1.00 | 5.19 | 7.40 | -46.39 | 0.91 | 0.78 |

Table 3: Results of the proposed audio-visual synchronisation detection on targeted adversarial attacks on continuous speech recognition models trained on GRID. The WER between transcribed and target phrases is up to 50%. The success rate is the proportion of adversarial examples with WER less than 50%. ($\epsilon_A$ = 2048, $\epsilon_V$ = 32)

7 Conclusion

In this work, we have investigated the use of audio-visual synchronisation as a method for detecting adversarial attacks. We hypothesised that the synchronisation confidence score would be lower in adversarial than in benign examples and demonstrated that this can be used for detecting adversarial attacks. In future work, we would like to investigate more sophisticated approaches for measuring the correlation between audio and visual streams.


  • [1] M. Alzantot, B. Balaji, and M. Srivastava (2017) Did you hear that? Adversarial examples against automatic speech recognition. In NIPS Machine Deception Workshop.
  • [2] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2018) Synthesizing robust adversarial examples. In ICML.
  • [3] W. Brendel, J. Rauber, and M. Bethge (2018) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. In ICLR.
  • [4] N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. In IEEE Security and Privacy Workshops.
  • [5] J. S. Chung and A. Zisserman (2016) Lip reading in the wild. In ACCV.
  • [6] J. S. Chung and A. Zisserman (2016) Out of time: automated lip sync in the wild. In ACCV.
  • [7] S. Chung, J. S. Chung, and H. Kang (2019) Perfect match: improved cross-modal embeddings for audio-visual synchronisation. In IEEE ICASSP.
  • [8] M. Cooke, J. Barker, S. Cunningham, and X. Shao (2006) An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America.
  • [9] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR.
  • [10] A. Kurakin, I. Goodfellow, and S. Bengio (2017) Adversarial examples in the physical world. In ICLR Workshop.
  • [11] P. Ma, S. Petridis, and M. Pantic (2019) Investigating the Lombard effect influence on end-to-end audio-visual speech recognition. In INTERSPEECH.
  • [12] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy.
  • [13] J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation.
  • [14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR.
  • [15] K. Vougioukas, S. Petridis, and M. Pantic (2018) End-to-end speech-driven facial animation with temporal GANs. In BMVC.
  • [16] W. Xu, D. Evans, and Y. Qi (2018) Feature squeezing: detecting adversarial examples in deep neural networks. In NDSS.
  • [17] Z. Yang, B. Li, P. Chen, and D. Song (2019) Characterizing audio adversarial examples using temporal dependency. In ICLR.