Deep networks achieve state-of-the-art performance on several tasks such as image classification, image segmentation and face recognition. However, recent studies [14, 9] show that such networks are susceptible to adversarial attacks. Given any input $x$ and a classifier $f(\cdot)$, an adversary tries to carefully construct a sample $x^{adv}$ that is similar to $x$ but for which $f(x) \neq f(x^{adv})$. The adversarial examples are nearly indistinguishable from the original ones but can easily degrade the performance of deep classifiers.
Existing studies on adversarial attacks have mainly focused on the image domain [9, 13, 2, 3]. Recently, adversarial attacks in the audio domain have also been presented [1, 4]. One of the most prominent studies is the iterative optimisation-based attack [4], which operates directly on an audio clip and causes it to be transcribed as any chosen phrase once a perturbation is added. Work on defences against adversarial attacks can be divided into three categories: adversarial training [9], gradient masking [12] and input transformation [16]. The first adds adversarial examples to the training set, whereas the second builds a model which does not have useful gradients. Both require the model to be retrained, which can be computationally expensive. In contrast, the third attempts to defend against adversarial attacks by transforming the input.
On the other hand, work on detecting adversarial attacks is very limited. To the best of our knowledge, the only work in the audio domain was proposed by Yang et al. [17] and exploits the inherent temporal dependency in audio samples to detect adversarial examples. The main idea is that the transcription of a full audio sequence and the transcriptions of segments extracted from it are consistent in benign examples but not in adversarial ones. In other words, the temporal dependency is not preserved in adversarial sequences.
Inspired by the idea of using temporal dependency to detect audio adversarial examples, we propose a simple and efficient method for detecting audio-visual adversarial attacks. To the best of our knowledge, this is the first work to present a detection method for adversarial attacks on audio-visual speech recognition. The key idea is that the audio stream is highly correlated with the video of the face (and especially the mouth region). In the case of an adversarial example, the noise added to the audio and video streams is expected to weaken the audio-visual correlation. Hence, we propose the use of audio-visual synchronisation as a proxy for correlation. In other words, we expect higher synchronisation scores for benign examples and lower scores for adversarial examples.
For the purposes of this study, we use two audio-visual datasets, LRW [5] and GRID [8]. The LRW dataset is a large-scale audio-visual dataset consisting of clips from BBC programs. The dataset has 500 isolated words from more than 1000 speakers and contains 488766, 25000, and 25000 examples in the training, validation and test sets, respectively. Each utterance is a short segment with a length of 29 frames (1.16 seconds), in which the target word is centred.
The GRID dataset consists of 33 speakers and 33000 utterances (1000 per speaker). Each utterance is composed of six words, one from each of the following components: <command: 4><colour: 4><preposition: 4><letter: 25><digit: 10><adverb: 4>, where the number of choices for each component is indicated in the angle brackets. In this work, we follow the evaluation protocol from [15], where 16, 7 and 10 subjects are used for training, validation and testing, respectively.
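The size of the sentence space implied by the GRID grammar follows directly from the per-component counts given above. A minimal sketch of that arithmetic (the names `GRID_CHOICES` and `num_grid_sentences` are ours, introduced only for illustration):

```python
from math import prod

# Number of choices per grammar slot, as listed in the text.
GRID_CHOICES = {
    "command": 4,
    "colour": 4,
    "preposition": 4,
    "letter": 25,
    "digit": 10,
    "adverb": 4,
}

def num_grid_sentences(choices=GRID_CHOICES):
    """Return the number of distinct six-word sentences the grammar allows."""
    return prod(choices.values())

print(num_grid_sentences())  # 64000
```

With 33000 recorded utterances, the corpus therefore covers only part of the 64000 sentences the grammar can generate.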
In this study, we consider two attack methods, the Fast Gradient Sign Method (FGSM) [9] and the iterative optimisation-based attack [4]. FGSM, which is suitable for attacks on classification models, computes the gradient of the loss with respect to the benign input and updates each pixel in the direction that maximises the loss. The Basic Iterative Method (BIM) [10] extends FGSM by applying it multiple times with a small step size. Specifically, given a loss function $J(\cdot, \cdot)$ for training the classification model $f(\cdot)$, the adversarial noise is generated as follows:
$$x^{adv}_0 = x, \qquad x^{adv}_{N+1} = \mathrm{Clip}_{x,\epsilon}\left\{ x^{adv}_N + \alpha \, \mathrm{sign}\left( \nabla_x J(f(x^{adv}_N), y) \right) \right\} \tag{1}$$
where $\alpha$ is the step size, $x^{adv}_N$ is the adversarial example after $N$ steps of the iterative attack and $y$ is the true label. After each step, pixel values in the adversarial images are clamped to the range $[x - \epsilon, x + \epsilon]$, where $\epsilon$ is the maximum change in each pixel value. This method was proposed for adversarial attacks on images but can also be applied to audio clips by crafting a perturbation for the audio input.
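The iterative signed-gradient update with clamping described above can be sketched in a few lines. This is an illustration only: the toy logistic-regression classifier, its analytic gradient, and the function names are our assumptions and stand in for the deep audio-visual model and automatic differentiation used in practice. FGSM corresponds to a single step whose step size equals the maximum change.

```python
import math

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def loss_gradient(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x
    for a toy logistic-regression classifier p = sigmoid(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def bim_attack(x, y, w, b, eps, alpha, steps, lo=0.0, hi=255.0):
    """Basic Iterative Method: repeated signed-gradient steps of size alpha,
    each followed by clamping to [x - eps, x + eps] and to [lo, hi]."""
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(x_adv, y, w, b)
        stepped = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = []
        for xa, xi in zip(stepped, x):
            lower = max(xi - eps, lo)
            upper = min(xi + eps, hi)
            x_adv.append(min(max(xa, lower), upper))
    return x_adv

x = [100.0, 150.0, 200.0]
x_adv = bim_attack(x, y=1, w=[0.02, -0.01, 0.03], b=-5.0, eps=16, alpha=1, steps=20)
print(x_adv)  # [84.0, 166.0, 184.0]
```

In this toy run every input value ends up exactly 16 away from its benign value, because the gradient sign never changes and the clamp stops further movement.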
The second type of attack [4] was proposed recently and is suitable for attacks on continuous speech recognition models. It generates audio adversarial examples that can be transcribed as any phrase but sound similar to the benign one. Specifically, the goal of this targeted attack is to find an adversarial input $x^{adv} = x + \delta$ which is very close to the benign input $x$ but which the model decodes as the target phrase $t$. The objective of the attack is the following:
$$\min_{\delta} \; J(f(x + \delta), t) \quad \text{subject to} \quad \|\delta\|_\infty < \epsilon \tag{2}$$
where $\epsilon$ is introduced to limit the maximum change for each audio sample or pixel and $\delta$ is the amount of adversarial noise.
3.2 Audio-visual Speech Recognition Threat Model
The architecture is shown in Fig. 1a. We use the end-to-end audio-visual model that was proposed in [11]. The video stream consists of a spatiotemporal convolution, a ResNet18 network and a 2-layer BGRU network, whereas the audio stream consists of a 5-layer CNN and a 2-layer BGRU network. These two streams extract features from the raw modalities. The top 2-layer BGRU network further models the temporal dynamics of the concatenated features.
Depending on the problem type, two different loss functions are used for training. The multi-class cross entropy loss, where each input sequence is assigned a single class, is suitable for word-level speech recognition. The CTC loss is used for sentence-level classification. This loss transcribes directly from sequence to sequence when the alignment between inputs and target outputs is unknown. Given an input sequence $x = (x_1, \ldots, x_T)$, CTC sums over the probability of all possible alignments to obtain the posterior of the target sequence.
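The sum over alignments can be made concrete with a brute-force sketch. It enumerates every frame-level path, applies the standard CTC mapping (merge repeats, then drop blanks), and adds up the probabilities of the paths that map to the target. The function names and the toy per-frame probabilities are ours; practical implementations use the forward-backward dynamic programme instead, since this enumeration is exponential in the number of frames.

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """Apply the CTC mapping: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def ctc_posterior(probs, target):
    """Sum the probability of every alignment that collapses to target.

    probs is a list with one dict per frame, mapping symbol -> probability.
    """
    symbols = list(probs[0].keys())
    total = 0.0
    for path in product(symbols, repeat=len(probs)):
        if collapse(path) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total

# Two frames, one real symbol "a" plus the blank.
probs = [{"a": 0.6, BLANK: 0.4}, {"a": 0.5, BLANK: 0.5}]
print(ctc_posterior(probs, "a"))  # approximately 0.8 (paths "aa", "a-", "-a")
```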
4 Synchronisation-based Detection Method
Chung et al. [6, 7] introduced the SyncNet model, which is able to predict the synchronisation error when raw audio and video streams are given. This error is quantified by the synchronisation offset and confidence score. A sliding window approach is used to determine the audio-visual offset. For each 5-frame video window, the offset is the one that minimises the distance between the visual features and all audio features within a 1 second range. The confidence score for a particular offset is defined as the difference between the minimum and the median of the Euclidean distances (computed over all windows). Audio and video are considered perfectly matched if the offset approaches zero with a high confidence score.
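The offset and confidence computation described above can be illustrated with a small sketch. It is not the SyncNet implementation: the one-dimensional toy embeddings, the function names, and the choice to average distances over windows before taking the median and minimum are our assumptions, made to keep the example self-contained.

```python
import math
from statistics import median

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sync_offset_and_confidence(video_feats, audio_feats, max_offset):
    """Return (offset, confidence) from per-window embeddings.

    For each candidate offset k, average the distance between video window t
    and audio window t + k over all valid t. The offset is the k with the
    smallest mean distance; the confidence is median minus minimum.
    """
    mean_dist = {}
    for k in range(-max_offset, max_offset + 1):
        dists = [
            euclidean(video_feats[t], audio_feats[t + k])
            for t in range(len(video_feats))
            if 0 <= t + k < len(audio_feats)
        ]
        if dists:
            mean_dist[k] = sum(dists) / len(dists)
    best = min(mean_dist, key=mean_dist.get)
    values = list(mean_dist.values())
    return best, median(values) - min(values)

# Toy embeddings: identical streams, so the best offset is zero.
video = [[float(t * t)] for t in range(8)]
audio = [[float(t * t)] for t in range(8)]
print(sync_offset_and_confidence(video, audio, max_offset=2))  # (0, 7.0)
```

A flatter distance profile across offsets, which is what a perturbed pair is expected to produce, gives a smaller gap between median and minimum and hence a lower confidence.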
In this work, we aim to explore whether such synchronisation is affected by adversarial noise. The detection method is shown in Fig. 1b and 1c. In the detection model, we measure the temporal consistency between the audio and video streams via a model trained for audio-visual synchronisation. For benign audio and video streams, the confidence score should be relatively high since audio and video are aligned and therefore highly synchronised. However, for adversarial audio and video examples, the confidence score is expected to be lower. The added perturbation, which aims to steer the model output toward the target transcription, reduces the correlation between the two streams, so they are less synchronous. Fig. 1d shows the confidence distribution of 300 benign and adversarial examples from the GRID dataset.
5 Experimental Setup
We evaluate our proposed method using two adversarial attacks on both modalities. We assume a white-box scenario, where the parameters of the models are known to the attacker.
Attacks against Word-level Classification: Attacks such as FGSM and BIM are suitable for word recognition models trained on the LRW dataset. For FGSM, $\epsilon_A$ for the audio stream and $\epsilon_V$ for the video stream were chosen heuristically. In our case, we set $\epsilon_A$ to 1024 and $\epsilon_V$ to 16 (pixel values are in the range [0, 255] and audio samples are in the range [-32768, 32767]). For BIM, the step size $\alpha$ was set to 1 in the image domain, which means the value of each pixel is changed by 1 at each iteration. The step size in the audio domain is set to 64. We follow the setting for the number of iterations suggested by [10], which is $\min(\epsilon_V + 4, 1.25\epsilon_V)$.
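As a quick check of the iteration count, the sketch below assumes the heuristic from the BIM paper [10] is min(eps + 4, 1.25 * eps), evaluated on the maximum pixel change. Rounding to the nearest integer is our assumption, and the function name is ours. The result for a maximum change of 32 matches the 36 iterations quoted in the results.

```python
def bim_num_iterations(eps):
    """Iteration-count heuristic min(eps + 4, 1.25 * eps), rounded to an integer."""
    return round(min(eps + 4, 1.25 * eps))

print(bim_num_iterations(16))  # 20
print(bim_num_iterations(32))  # 36
```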
Attacks against Continuous Speech Recognition: To attack a speech recognition model trained on GRID, we use a recently proposed targeted attack [4]. The maximum change allowed, as defined by $\epsilon$ (see Eq. 2), is initialised to 2048 and 32 for audio and video, respectively, and is reduced during iterative optimisation. We implement the attack with 1000 iterations. In our studies, 10 random utterances are selected as target utterances, and 300 adversarial examples are randomly selected for each target utterance.
5.2 Evaluation Metrics
We use the Euclidean distance ($L_2$) for measuring the similarity between two images. We also use the $L_\infty$ norm to measure the maximum change per pixel. For audio samples we follow [4] and convert the $L_\infty$ norm to the scale of Decibels (dB): $dB(x) = \max_i 20 \log_{10}(x_i)$, where $x_i$ is an arbitrary audio sample point from the audio clip $x$. The audio distortion is specified as the loudness relative to the benign audio, which can be defined as $dB_x(\delta) = dB(\delta) - dB(x)$.
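The audio distortion measure can be sketched as follows, assuming the definition from the cited attack paper: the loudness of a clip is 20 times the base-10 logarithm of its peak sample, and the distortion is the loudness of the perturbation minus that of the benign clip. Taking the absolute value of each sample is our assumption (the logarithm is undefined for negative samples), and the function names are ours.

```python
import math

def db(samples):
    """Peak loudness in decibels: max over samples of 20 * log10(|x_i|)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        raise ValueError("an all-zero signal has no finite dB value")
    return 20.0 * math.log10(peak)

def relative_db(delta, x):
    """Distortion of perturbation delta relative to benign audio x."""
    return db(delta) - db(x)

benign = [10000, -8000, 1200]
perturbation = [100, -50, 25]
print(relative_db(perturbation, benign))  # about -40.0
```

More negative values mean a quieter perturbation: a peak 100 times smaller than the benign peak corresponds to roughly -40 dB.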
The Area Under the Curve (AUC) score is used for evaluating the detection approach. We compute the synchronisation confidence score in benign and adversarial examples and by varying the threshold we compute the Receiver Operating Characteristic (ROC) curve.
Finally, to assess how this approach would work in a real scenario, we select the threshold (from Fig. 1c) which maximises the average $F_1$ score of the adversarial and benign classes on the validation set. We then use this threshold to compute the average $F_1$ score on the test set.
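Both evaluation steps can be sketched in plain Python. The sketch assumes that examples whose confidence falls below the threshold are flagged as adversarial and that the averaged score is the $F_1$ score of each class; the function names and the toy scores are ours, not values from the experiments.

```python
def auc(benign_scores, adversarial_scores):
    """Probability that a benign example scores higher than an adversarial
    one (ties count one half); this equals the area under the ROC curve."""
    wins = 0.0
    for b in benign_scores:
        for a in adversarial_scores:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(benign_scores) * len(adversarial_scores))

def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def average_f1(benign_scores, adversarial_scores, threshold):
    """Mean of the two per-class F1 scores when scores below the threshold
    are flagged as adversarial."""
    adv_tp = sum(s < threshold for s in adversarial_scores)
    adv_fn = len(adversarial_scores) - adv_tp
    adv_fp = sum(s < threshold for s in benign_scores)
    ben_tp = len(benign_scores) - adv_fp
    # For the benign class the two error types swap roles.
    return 0.5 * (f1(adv_tp, adv_fp, adv_fn) + f1(ben_tp, adv_fn, adv_fp))

def select_threshold(benign_scores, adversarial_scores):
    """Pick the candidate threshold with the best average F1 (validation data)."""
    candidates = sorted(set(benign_scores) | set(adversarial_scores))
    return max(candidates,
               key=lambda t: average_f1(benign_scores, adversarial_scores, t))

benign = [7.0, 6.5, 5.0, 8.2]
adversarial = [2.0, 3.5, 5.5, 1.0]
print(auc(benign, adversarial))  # 0.9375
```

The threshold returned by `select_threshold` on validation scores would then be passed unchanged to `average_f1` on the test scores.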
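Table 1: Detection results for attacks on word-level speech recognition (LRW dataset).
| Attack | Classification Rate (%) | Distortion | AUC | $F_1$ |
|---|---|---|---|---|
| FGSM ($\epsilon_A$=1024, $\epsilon_V$=16) | 13.67 | 3.46 | 0.99 | 0.94 |
| FGSM ($\epsilon_A$=1024, $\epsilon_V$=32) | 12.40 | 6.89 | 0.99 | 0.96 |
| BIM ($\epsilon_A$=1024, $\epsilon_V$=16) | 6.67 | 1.22 | 0.89 | 0.82 |
| BIM ($\epsilon_A$=1024, $\epsilon_V$=32) | 3.27 | 1.67 | 0.93 | 0.85 |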
6.1 Word-level Speech Recognition
Detection results for attacks on word-level speech recognition are shown in Table 1. In the presence of adversarial noise, the classification rate drops from 97.20% (the performance of the model trained on the LRW dataset when benign examples are fed to it) to 12.40% using FGSM, whereas it drops to 3.27% after running 36 iterations of BIM. On the other hand, the detection method achieves an AUC score as high as 0.99 for FGSM and 0.93 for BIM. The corresponding $F_1$ scores are 0.96 and 0.85, respectively.
We also notice that when the attack is stronger, e.g., BIM is used instead of FGSM, the classification rate goes down, i.e., the attack is more successful, and at the same time the distortion becomes smaller. Consequently, detection becomes more difficult, which is reflected in the lower AUC and $F_1$ scores.
We also investigate the detection performance when $\epsilon_V$ decreases from 32 to 16, i.e., the pixel values change less. It is clear from Table 1 that for both types of attacks the distortion is smaller and, as a consequence, detection becomes harder: both the AUC and $F_1$ scores go down. However, such attacks are less successful since the classification rate goes up.
6.2 Sentence-level Speech Recognition
Detection results for fully targeted attacks on sentence-level speech recognition, i.e., attacks whose goal is a transcription identical to the desired target phrase, are shown in Table 2. The attack is almost always successful no matter what the target sentence is: at least 92% of the examples are transcribed as the target phrase. At the same time the detection rates are quite high, with an AUC between 0.93 and 0.97 and an $F_1$ score between 0.81 and 0.86. We also observe that the maximum distortions applied to the audio and video signals are similar in most cases.
We also consider another scenario where the word error rate (WER) between the transcribed results and the target phrases is allowed to be up to 50%. Results are shown in Table 3. In this case the attack is always successful. In addition, the generated audio and video adversarial examples are less distorted than the ones generated by the fully targeted attacks. In turn, this leads to smaller AUC scores, between 0.88 and 0.92, and smaller $F_1$ scores, between 0.76 and 0.80.
Table 2: Detection results for fully targeted attacks on sentence-level speech recognition (GRID dataset).
| Target Phrase | Success Rate | $L_2$ | $L_\infty$ | $dB_x(\delta)$ (dB) | AUC | $F_1$ |
|---|---|---|---|---|---|---|
| bin blue at a zero please | 0.99 | 5.37 | 7.63 | -41.41 | 0.95 | 0.83 |
| bin white by o nine now | 0.99 | 5.27 | 7.48 | -40.18 | 0.94 | 0.83 |
| lay green with y seven again | 0.99 | 5.67 | 8.06 | -40.47 | 0.95 | 0.83 |
| lay red at c eight soon | 0.99 | 5.61 | 7.96 | -40.64 | 0.95 | 0.81 |
| place blue at p one again | 0.98 | 5.39 | 7.65 | -42.37 | 0.93 | 0.81 |
| place red by a one soon | 0.99 | 5.25 | 7.45 | -42.00 | 0.93 | 0.81 |
| place red by z two soon | 0.98 | 5.42 | 7.69 | -40.70 | 0.95 | 0.84 |
| set green in f one again | 0.99 | 5.53 | 7.87 | -40.09 | 0.96 | 0.83 |
| set red in x four now | 0.92 | 5.90 | 8.38 | -37.99 | 0.97 | 0.86 |
| set white in p five now | 0.97 | 5.66 | 8.04 | -39.99 | 0.95 | 0.82 |
Table 3: Detection results for targeted attacks on sentence-level speech recognition when a WER of up to 50% between the transcription and the target phrase is allowed (GRID dataset).
| Target Phrase | Success Rate | $L_2$ | $L_\infty$ | $dB_x(\delta)$ (dB) | AUC | $F_1$ |
|---|---|---|---|---|---|---|
| bin blue at a zero please | 1.00 | 5.10 | 7.27 | -51.39 | 0.91 | 0.79 |
| bin white by o nine now | 1.00 | 5.25 | 7.47 | -48.86 | 0.91 | 0.79 |
| lay green with y seven again | 1.00 | 5.24 | 7.45 | -49.28 | 0.91 | 0.79 |
| lay red at c eight soon | 1.00 | 5.01 | 7.13 | -49.54 | 0.91 | 0.79 |
| place blue at p one again | 1.00 | 4.83 | 6.90 | -51.71 | 0.88 | 0.76 |
| place red by a one soon | 1.00 | 5.02 | 7.14 | -49.93 | 0.89 | 0.79 |
| place red by z two soon | 1.00 | 5.14 | 7.31 | -48.33 | 0.91 | 0.80 |
| set green in f one again | 1.00 | 5.14 | 7.32 | -47.43 | 0.92 | 0.80 |
| set red in x four now | 1.00 | 5.15 | 7.33 | -46.19 | 0.92 | 0.80 |
| set white in p five now | 1.00 | 5.19 | 7.40 | -46.39 | 0.91 | 0.78 |
In this work, we have investigated the use of audio-visual synchronisation for detecting adversarial attacks. We hypothesised that the synchronisation confidence score is lower in adversarial examples than in benign ones and demonstrated that this can be used for detecting adversarial attacks. In future work, we would like to investigate more sophisticated approaches for measuring the correlation between audio and visual streams.
References
[1] Did you hear that? Adversarial examples against automatic speech recognition. In NIPS Machine Deception Workshop.
[2] Synthesizing robust adversarial examples. In ICML, 2018.
[3] Decision-based adversarial attacks: reliable attacks against black-box machine learning models. In ICLR.
[4] Audio adversarial examples: targeted attacks on speech-to-text. In IEEE Security and Privacy Workshops, 2018.
[5] Lip reading in the wild. In ACCV, 2016.
[6] Out of time: automated lip sync in the wild. In ACCV, 2016.
[7] Perfect match: improved cross-modal embeddings for audio-visual synchronisation. In IEEE ICASSP, 2019.
[8] An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 2006.
[9] Explaining and harnessing adversarial examples. In ICLR, 2015.
[10] Adversarial examples in the physical world. In ICLR Workshop, 2017.
[11] Investigating the Lombard effect influence on end-to-end audio-visual speech recognition. In INTERSPEECH, 2019.
[12] Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy.
[13] One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation.
[14] Intriguing properties of neural networks. In ICLR, 2014.
[15] End-to-end speech-driven facial animation with temporal GANs. In BMVC, 2018.
[16] Feature squeezing: detecting adversarial examples in deep neural networks. In NDSS, 2018.
[17] Characterizing audio adversarial examples using temporal dependency. In ICLR, 2019.