Intelligent voice control (IVC) has been widely used in human-computer interaction, such as Amazon Alexa [alexa], Google Assistant [googleassistant], Apple Siri [apple], Microsoft Cortana [microsoft] and iFLYTEK [iFLYTEK]. Running state-of-the-art ASR techniques, these systems can effectively interpret natural voice commands and execute the corresponding operations, such as unlocking the doors of homes or cars, making online purchases, and sending messages. This has been made possible by recent progress in machine learning, deep learning [hinton2012deep] in particular, which vastly improves the accuracy of speech recognition. In the meantime, these deep learning techniques are known to be vulnerable to adversarial perturbations [kurakin2016adversarial, brown2017adversarial, evtimov2017robust, dalvi2004adversarial, biggio2013evasion, szegedy2013intriguing, goodfellow2014explaining, papernot2016transferability]. Hence, it becomes imperative to understand the security implications of ASR systems in the presence of such attacks.
Threats to ASR
Prior research shows that carefully crafted perturbations, even in small amounts, can cause a machine learning classifier to misbehave in unexpected ways. Although such adversarial learning has been extensively studied for image recognition, little has been done for speech recognition, potentially due to a new challenge in this domain: unlike adversarial images, which perturb less noticeable background pixels, changes to voice commands often introduce noise that a modern ASR system is designed to filter out, so the system cannot be easily misled.
Indeed, a recent attack on ASR utilizes noise-like hidden voice commands [carlini2016hidden], but its white box attack targets a traditional speech recognition system that uses a Gaussian Mixture Model (GMM), not the DNNs behind today’s ASR systems. Another attack transmits inaudible commands through ultrasonic sound [zhang2017dolphinattack], but it exploits microphone hardware vulnerabilities instead of weaknesses of the DNN. Moreover, an attack device, e.g., an ultrasonic transducer or speaker, needs to be placed close to the target ASR system. So far, little success has been reported in generating “adversarial sound” that practically fools deep learning techniques yet remains inconspicuous to human ears, while also allowing remote playback (e.g., through YouTube) to attack a large number of ASR systems.
To find practical adversarial sound, a few technical challenges need to be addressed: (C1) the adversarial audio sample should be effective in a complicated, real-world audible environment, in the presence of electronic noise from speakers and other noise; (C2) it should be stealthy, unnoticeable to ordinary users; (C3) impactful adversarial sound should be remotely deliverable and playable by popular devices from online sources, so it can affect a large number of IVC devices. Our research has found all these challenges to be completely addressable, indicating that the threat of audio adversarial learning is indeed realistic.
CommanderSong. More specifically, in this paper, we report a practical and systematic adversarial attack on real-world speech recognition systems. Our attack can automatically embed a set of commands into a (randomly selected) song, allowing them to spread to a large audience (addressing C3). This revised song, which we call CommanderSong, sounds completely normal to ordinary users, but is interpreted as commands by ASR, leading to attacks on real-world IVC devices. To build such an attack, we leverage the open source ASR system Kaldi [kaldi], which includes an acoustic model and a language model. By carefully synthesizing the outputs of the acoustic model for both the song and the given voice command, we generate the adversarial audio with minimum perturbations through gradient descent, so that the CommanderSong is less noticeable to human users (addressing C2; we name this the WTA attack). To make such adversarial samples practical, our approach captures the electronic noise produced by different speakers and integrates a generic noise model into the algorithm for seeking adversarial samples (addressing C1; we call this the WAA attack).
In our experiment, we generated over 200 CommanderSongs that contain different commands, and attacked Kaldi with a 100% success rate in the WTA attack and a 96% success rate in the WAA attack. Our evaluation further demonstrates that such a CommanderSong can be used to perform a black box attack on the mainstream ASR system iFLYTEK [iFLYTEK], for which neither source code nor model is available. (We have reported this issue to iFLYTEK and are waiting for their response.) iFLYTEK is used as the voice input method by many popular commercial apps, including WeChat (a social app with 963 million users), Sina Weibo (another social app with 530 million users), JD (an online shopping app with 270 million users), etc. To demonstrate the impact of our attack, we show that CommanderSong can be spread through YouTube, potentially impacting millions of users. To understand the human perception of the attack, we conducted a user study (approved by the IRB) on Amazon Mechanical Turk [AmazonTurk]. Among over 200 participants, none identified the commands inside our CommanderSongs. We further developed defense solutions against this attack and demonstrated their effectiveness.
Contributions. The contributions of this paper are summarized as follows:
Practical adversarial attack against ASR systems. We designed and implemented the first practical adversarial attacks against ASR systems. Our attack is demonstrated to be robust, working over the air in the presence of environmental interference; transferable, being effective on a black box commercial ASR system (i.e., iFLYTEK); and remotely deliverable, potentially impacting millions of users.
Defense against CommanderSong. We designed two approaches (audio turbulence and audio squeezing) to defend against the attack, which our preliminary experiments show to be effective.
Roadmap. The rest of the paper is organized as follows: Section 2 gives the background information of our study. Section 3 provides motivation and overviews our approach. In Section 4, we elaborate on the design and implementation of CommanderSong. In Section 5, we present the experimental results, with emphasis on the difference between machine and human comprehension. Section 6 provides a deeper understanding of CommanderSongs. Section 7 presents defenses against the CommanderSong attack. Section 8 compares our work with prior studies, and Section 9 concludes the paper.
In this section, we overview existing speech recognition systems and discuss recent advances in attacks against both image and speech recognition systems.
2.1 Speech Recognition
Automatic speech recognition is a technique that allows machines to recognize/understand the semantics of human voice. Besides commercial products like Amazon Alexa, Google Assistant, Apple Siri, iFLYTEK, etc., there are also open source platforms such as the Kaldi toolkit [kaldi], Carnegie Mellon University’s Sphinx toolkit [cmu], the HTK toolkit [htk], etc. Figure 1 presents an overview of a typical speech recognition system, with two major components: feature extraction and decoding based on pre-trained models (e.g., acoustic models and language models).
After the raw audio is amplified and filtered, acoustic features need to be extracted from the preprocessed audio signal. The features contained in the signal change significantly over time, so short-time analysis is used to evaluate them periodically. Common acoustic feature extraction algorithms include Mel-Frequency Cepstral Coefficients (MFCC) [muda2010voice], Linear Predictive Coefficients (LPC) [itakura1975line], Perceptual Linear Prediction (PLP) [hermansky1990perceptual], etc. Among them, MFCC is the most frequently used in both open source toolkits and commercial products [o2008automatic]. GMMs can be used to analyze the properties of the acoustic features. The extracted acoustic features are matched against pre-trained acoustic models to obtain the likelihood probabilities of phonemes. Hidden Markov Models (HMM) are commonly used for statistical speech recognition. As GMMs are limited in describing a non-linear manifold of the data, the Deep Neural Network-Hidden Markov Model (DNN-HMM) has been widely used for speech recognition in academia and industry since 2012 [DNN-HMM].
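As a rough illustration of the MFCC pipeline described above, the following NumPy sketch computes framing, windowing, a power spectrum, a triangular mel filterbank, and a DCT. It is deliberately simplified (no pre-emphasis, liftering, or delta features), and all parameter values are illustrative defaults, not those of any particular toolkit.

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: framing -> power spectrum -> mel filterbank -> log -> DCT."""
    # Split the signal into overlapping short-time frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel-scale filterbank.
    def hz2mel(f): return 2595 * np.log10(1 + f / 700.0)
    def mel2hz(m): return 700 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the filterbank energies; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_energy @ dct.T

t = np.linspace(0, 1, 16000, endpoint=False)
feat = mfcc(np.sin(2 * np.pi * 440 * t))
print(feat.shape)  # one 13-dimensional vector per 10 ms hop
```

With a one-second 16 kHz input, the sketch yields 98 frames of 13 coefficients each, matching the short-time analysis described above.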
Recently, end-to-end deep learning has come into use in speech recognition systems. It trains on large scale datasets and uses the CTC (Connectionist Temporal Classification) loss function to directly output characters rather than phoneme sequences. CTC locates the alignment of text transcripts with input speech using an all-neural, sequence-to-sequence network. Traditional speech recognition systems involve many engineered processing stages, which CTC can supersede via deep learning [DeepSpeech]. The architecture of end-to-end ASR systems typically includes an encoder network corresponding to the acoustic model and a decoder network corresponding to the language model [End-to-end]. DeepSpeech [DeepSpeech] and Wav2Letter [Wav2Letter] are popular open source end-to-end speech recognition systems.
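CTC’s many-to-one output rule, which maps per-frame labels to a character string by merging consecutive repeats and then removing the blank symbol, can be sketched in a few lines. The frame labels and the blank symbol below are illustrative, not taken from any particular model.

```python
def ctc_collapse(frame_labels, blank="-"):
    """Apply CTC's many-to-one rule: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:  # new label that is not the blank symbol
            out.append(lab)
        prev = lab
    return "".join(out)

# Greedy per-frame labels for a short utterance; "-" is the CTC blank.
# A blank between the two l's keeps the doubled letter from being merged away:
print(ctc_collapse(list("hh-e-ll-l-o")))  # hello
```

The blank symbol is what lets CTC emit repeated characters such as the double "l" here, which is why it appears between identical labels in the alignment.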
2.2 Existing Attacks against Image and Speech Recognition Systems
Nowadays, people enjoy the convenience of image and speech as new input methods on mobile devices. Hence, the accuracy and dependability of image and speech recognition have a critical impact on the security of such devices. Intuitively, adversaries can compromise the integrity of the training data if they have either physical or remote access to it. By revising existing data or inserting extra data into the training dataset, adversaries can certainly tamper with the dependability of the trained models [BEBP].
When adversaries do not have access to the training data, attacks are still possible. Recent research has deceived image recognition systems into making wrong decisions by slightly revising the input data. The fundamental idea is to revise an image so slightly that human beings and machines “see” it differently. Depending on whether the adversary knows the algorithms and parameters used in the recognition system, there exist white box and black box attacks. Note that in both cases, the adversary needs to be able to interact with the target system, observing the corresponding output for any input. Early research [tronci2011fusion, schuckers2002spoofing, bao2009liveness] focused on revising or generating the digital image file, which is directly fed into the image recognition system. State-of-the-art research [kurakin2016adversarial, brown2017adversarial, evtimov2017robust] advances in practicality by printing the adversarial image and presenting it to a device with image recognition functionality.
However, the success of attacks against image recognition systems was not ported to speech recognition systems until very recently, due to the complexity of the latter. Speech, a time-domain continuous signal, contains many more features than static images. Hidden voice command [carlini2016hidden] launched both black box (i.e., inverse MFCC) and white box (i.e., gradient descent) attacks against speech recognition systems, generating obfuscated commands. Though seminal in attacking speech recognition systems, its practicality is limited. For instance, a large amount of human effort is involved as feedback in the black box approach, and the white box approach targets GMM-based acoustic models, which have been replaced by DNN-based ones in most modern speech recognition systems. The recent DolphinAttack [zhang2017dolphinattack] proposed a completely inaudible voice attack by modulating commands on ultrasound carriers and leveraging microphone vulnerabilities (i.e., the nonlinearity of microphones). As noted by the authors, such attacks can be eliminated by an enhanced microphone that suppresses acoustic signals on the ultrasound carrier, like that of the iPhone 6 Plus.
In this section, we present the motivation of our work, and overview the proposed approach to generate the practical adversarial attack.
Recently, adversarial attacks on image classification have been extensively studied [brown2017adversarial, evtimov2017robust]. Results show that even state-of-the-art DNN-based classifiers can be fooled by small perturbations added to the original image [kurakin2016adversarial], producing erroneous classification results. However, the impact of adversarial attacks on the most advanced speech recognition systems, such as those integrating DNN models, has never been systematically studied. Hence, in this paper, we investigate DNN-based speech recognition systems and explore adversarial attacks against them. Research shows that commands can be transmitted to IVC devices through inaudible ultrasonic sound [zhang2017dolphinattack] and noise [carlini2016hidden]. Even though these existing attacks against ASR systems are seminal, they are limited in some aspects. Specifically, ultrasonic sound can be defeated by a low-pass filter (LPF) or by analyzing the signal frequency range, and noise-like commands are easily noticed by users.
Therefore, the research in this paper is motivated by the following questions: (Q1) Is it possible to build a practical adversarial attack against ASR systems, given that most ASR systems are becoming more intelligent (e.g., by integrating DNN models) and that the generated adversarial samples must work in very complicated physical environments, e.g., with electronic noise from speakers, background noise, etc.? (Q2) Is it feasible to generate adversarial samples (including the target commands) that are difficult, or even impossible, for ordinary users to notice, so that control over the ASR systems happens in a “hidden” fashion? (Q3) If such adversarial audio samples can be produced, is it possible to impact a large number of victims in an automated way, rather than relying solely on attackers to play the adversarial audio and affect nearby victims? Below, we detail how our attack is designed to address these questions.
3.2 The Philosophy of Designing Our Attack
To address Q3, our idea is to choose songs as the “carrier” of the voice commands recognizable by ASR systems. The reasons for choosing such a “carrier” are at least two-fold. On one hand, listening to songs is a preferred way for people to relax, e.g., listening to a music station, streaming music from online libraries, or browsing YouTube for favorite programs. Moreover, such entertainment is no longer restricted to a radio, CD player, or desktop computer; a mobile device, e.g., an Android phone or Apple iPhone, allows people to enjoy songs everywhere. Hence, choosing a song as the “carrier” of the voice command automatically helps impact millions of people. On the other hand, “hiding” the desired command in a song also makes the command much more difficult for victims to notice, as long as Q2 can be reasonably addressed. Note that we do not rely on the lyrics of the song to help integrate the desired command. Instead, we intentionally avoid songs with lyrics similar to the desired command. For instance, if the desired command is “open the door”, choosing a song whose lyrics contain “open the door” would easily catch the victims’ attention. Hence, we use random songs as the “carrier” regardless of the desired commands.
Choosing songs as the “carrier” of desired commands actually makes Q2 even more challenging. Our basic idea is, when generating the adversarial samples, to revise the original song using the pure voice audio of the desired command as a reference. In particular, we find that revising the original song to generate adversarial samples is always a trade-off between preserving the fidelity of the original song and having ASR systems recognize the desired commands from the generated sample. To better obfuscate the desired commands in the song, in this paper we emphasize the former over the latter. In other words, we design our revision algorithm to maximally preserve the fidelity of the original song, at the expense of a slightly lower recognition success rate for the desired commands. However, this expense can be compensated by integrating the same desired command multiple times into one song (the command “open the door” may only last about 2 seconds), since one successful recognition suffices to impact the victims.
Technically, in order to address Q2, we need to investigate the details of an ASR system. As shown in Figure 1, an ASR system usually comprises two pre-trained models: an acoustic model describing the relationship between audio signals and phonetic units, and a language model representing statistical distributions over sequences of words. In particular, given a piece of pure voice audio of the desired command and a “carrier” song, we can feed them into an ASR system separately and intercept the intermediate results. By investigating the output of the acoustic model when processing the audio of the desired command, together with the details of the language model, we can identify the “information” in that output that is necessary for the language model to produce the correct text of the desired command. In designing our approach, we want to ensure that this “information” is only a small subset (hopefully the minimum subset) of the acoustic model’s output. Then, we carefully craft the output of the acoustic model when processing the original song, to make it “include” this “information” as well. Finally, we invert the acoustic model and the feature extraction together to directly produce the adversarial sample from the crafted output (carrying the “information” necessary for the language model to produce the correct text of the desired command).
Theoretically, the adversarial samples generated above will be recognized as the desired command by ASR systems if fed directly as input. Since such input is usually a wave file (in “WAV” format) and ASR systems need to expose APIs to accept it, we define this attack as the WAV-To-API (WTA) attack. However, to implement a practical attack as in Q1, the adversarial sample should be played by a speaker to interact with IVC devices over the air. We define this practical attack as the WAV-Air-API (WAA) attack. The challenge of the WAA attack is that, when the adversarial samples are played by a speaker, the electronic noise produced by the loudspeaker and the background noise in the open air significantly impact the recognition of the desired commands. To address this challenge, we improve our approach by integrating a generic noise model into the above algorithm, with the details in Section 4.3.
4 Attack Approach
We implement our attack by addressing two technical challenges: (1) minimizing the perturbations to the song, so that the distortion between the original song and the generated adversarial sample is as unnoticeable as possible, and (2) making the attack practical, meaning CommanderSong can be played over the air to compromise IVC devices. To address the first challenge, we propose pdf-id sequence matching to incur minimum revision at the output of the acoustic model, and use gradient descent to generate the corresponding adversarial samples, as described in Section 4.2. The second challenge is addressed by introducing a generic noise model to simulate both the electronic noise and the background noise, as described in Section 4.3. Below we elaborate on the details.
4.1 Kaldi Platform
We choose the open source speech recognition toolkit Kaldi [kaldi] due to its popularity in the research community; its source code repository on GitHub has 3,748 stars and 1,822 forks [Aspiremodel]. Furthermore, the model trained by Kaldi on the “Fisher” corpus is also used by IBM [Fisher_IBM] and Microsoft [Fisher_Microsoft].
In order to use Kaldi to decode audio, we need a trained model to begin with. There are models on the Kaldi website that can be used for research. We took advantage of the “ASpIRE Chain Model” (referred to as the “ASpIRE model” for short), which was one of the latest released decoding models when we began our study. (Of the three decoding models currently on the Kaldi platform, the ASpIRE Chain Model used in this paper was released on October 15th, 2016; the SRE16 Xvector Model was released on October 4th, 2017, and was not available when we began our study; and the CVTE Mandarin Model, released on June 21st, 2017, was trained in Chinese [kaldi].) After manually analyzing the source code of Kaldi (about 301,636 lines of shell scripts and 238,107 C++ SLOC), we completely explored how Kaldi processes audio and decodes it into text. First, it extracts acoustic features like MFCC or PLP from the raw audio. Then, based on the trained probability density functions (p.d.f.) of the acoustic model, those features are taken as input to the DNN to compute a posterior probability matrix. Each p.d.f. is indexed by a pdf identifier (pdf-id), which exactly indicates the column of the DNN’s output matrix.
A phoneme is the smallest unit composing a word. Each phoneme has three states of sound production (each denoted as an HMM state), and a series of transitions among those states identifies a phoneme. A transition identifier (transition-id) is used to uniquely identify an HMM state transition. Therefore, a sequence of transition-ids can identify a phoneme, so we name such a sequence a phoneme identifier in this paper. Note that each transition-id is also mapped to a pdf-id. During the Kaldi decoding procedure, the phoneme identifiers can be obtained. By referring to the pre-obtained mapping between transition-ids and pdf-ids, any phoneme identifier can also be expressed as a specific sequence of pdf-ids. Such a sequence of pdf-ids is actually a segment of the posterior probability matrix computed by the DNN. This implies that to make Kaldi decode any specific phoneme, we need to have the DNN compute a posterior probability matrix containing the corresponding sequence of pdf-ids.
To illustrate the above findings, we use Kaldi to process a piece of audio with several known words and obtain the intermediate results, including the posterior probability matrix computed by the DNN, the transition-id sequences, the phonemes, and the decoded words. Figure 2 demonstrates the decoded result of Echo, which contains three phonemes. The red boxes highlight the id representing the corresponding phoneme, and each phoneme is identified by a sequence of transition-ids, i.e., the phoneme identifiers. Table 1 is a segment of the relationship among the phoneme, pdf-id, transition-id, etc. By referring to Table 1, we can obtain the pdf-id sequence corresponding to a decoded transition-id sequence (for instance, the pdf-id sequence for one of these phonemes should be 6383, 5760, 5760, 5760, 5760, 5760, 5760, 5760, 5760, 5760). Hence, any posterior probability matrix demonstrating such a pdf-id sequence should be decoded by Kaldi as that phoneme.
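The translation from a phoneme identifier to its pdf-id sequence can be sketched as a simple table lookup. The transition-id values below are hypothetical, while the pdf-ids reuse the example sequence above; Kaldi’s real table is far larger.

```python
# Hypothetical excerpt of a transition-id -> pdf-id mapping (transition-id
# values are made up for illustration; pdf-ids reuse the example sequence above).
TRANS_TO_PDF = {
    16538: 6383,  # transition entering an HMM state of the phoneme
    16539: 5760,  # self-loop transition staying in that state
}

def phoneme_identifier_to_pdf_ids(transition_ids):
    """Translate a phoneme identifier (a sequence of transition-ids) into the
    pdf-id sequence it selects among the columns of the DNN output matrix."""
    return [TRANS_TO_PDF[t] for t in transition_ids]

# Slow speech repeats the self-loop transition, so the pdf-id repeats too:
seq = phoneme_identifier_to_pdf_ids([16538] + [16539] * 9)
print(seq)  # [6383, 5760, 5760, 5760, 5760, 5760, 5760, 5760, 5760, 5760]
```

The point of the sketch is that once this mapping is known, forcing Kaldi to decode a chosen phoneme reduces to forcing the DNN to emit the corresponding pdf-id sequence.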
4.2 Gradient Descent to Craft Audio
Figure 3 demonstrates the details of our attack approach. Given the original song and the pure voice audio of the desired command, we use Kaldi to decode them separately. By analyzing the decoding procedures, we can get the DNN’s output matrix A for the original song (Step 1⃝ in Figure 3) and the phoneme identifiers of the desired command audio (Step 4⃝ in Figure 3).
The DNN’s output is a matrix A containing the probability of each pdf-id at each frame. Suppose there are n frames and k pdf-ids; let a_{i,j} (1 ≤ i ≤ n, 1 ≤ j ≤ k) be the element at the ith row and jth column of A, so that a_{i,j} represents the probability of pdf-id j at frame i. For each frame i, we calculate the most likely pdf-id m_i as the one with the highest probability in that frame. That is,

m_i = argmax_{1 ≤ j ≤ k} a_{i,j}.

Let m = (m_1, m_2, …, m_n); m represents the sequence of most likely pdf-ids of the original song audio x(t). For simplicity, we use g(·) to represent the function that takes the original audio as input and outputs this sequence of most likely pdf-ids based on the DNN’s predictions. That is,

g(x(t)) = m.

As shown in Step 5⃝ in Figure 3, we can extract the pdf-id sequence of the command, c = (c_1, c_2, …, c_{n′}), where c_i (1 ≤ i ≤ n′) represents the highest-probability pdf-id of the command at frame i. To have the original song decoded as the desired command, we need to identify the minimum modification μ(t) on x(t) so that g(x(t) + μ(t)) is the same as, or close to, c. Specifically, we minimize the distance between g(x(t) + μ(t)) and c. As both are pdf-id sequences, we define this method as the pdf-id sequence matching algorithm.

Based on these observations, we construct the following objective function:

argmin_{μ(t)} dist(g(x(t) + μ(t)), c)    (1)

To ensure that the modified audio does not deviate too much from the original one, we optimize the objective function Eq (1) under the constraint that the perturbation μ(t) is bounded in magnitude.
Finally, we use gradient descent [papernot2016cleverhans], an iterative optimization algorithm for finding the local minimum of a function, to solve the objective function. Given an initial point, gradient descent follows the direction that reduces the value of the function most quickly. By repeating this process until the value stabilizes, the algorithm finds a local minimum. In particular, based on our objective function, we revise the song with the aim of making its sequence of most likely pdf-ids equal or close to that of the desired command. Therefore, the crafted audio can be decoded as the desired command.
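As a toy illustration of pdf-id sequence matching (not Kaldi’s actual DNN), the sketch below replaces the acoustic model with a random linear-softmax layer and runs projected gradient descent on a perturbation so that each frame’s most likely “pdf-id” matches a target sequence. All dimensions, the learning rate, and the perturbation bound are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_feat, n_pdf = 20, 8, 10

W = rng.normal(size=(n_pdf, n_feat))       # stand-in "acoustic model": linear + softmax
x = rng.normal(size=(n_frames, n_feat))    # acoustic features of the original song
target = rng.integers(0, n_pdf, n_frames)  # pdf-id sequence of the desired command

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

delta = np.zeros_like(x)                   # the perturbation we optimize
lr, bound = 0.05, 3.0
onehot = np.eye(n_pdf)[target]
for _ in range(1000):
    p = softmax((x + delta) @ W.T)         # per-frame pdf-id probabilities
    grad = (p - onehot) @ W                # gradient of cross-entropy w.r.t. the input
    delta -= lr * grad                     # descend on the perturbation only
    np.clip(delta, -bound, bound, out=delta)  # keep the revision to the song small

pred = softmax((x + delta) @ W.T).argmax(axis=1)
print((pred == target).mean())             # fraction of frames now decoding to the target
```

The clipping step plays the role of the magnitude constraint in Eq (1): the optimizer is free to move the per-frame argmax toward the command’s pdf-ids, but only within a bounded revision of the input.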
To further preserve the fidelity of the original song, one method is to minimize the duration of the revision. Typically, once the pure command voice audio is generated by a text-to-speech engine, all the phonemes are determined, and so are the phoneme identifiers. However, the speed of the speech determines the number of frames, and thus the number of transition-ids, in a phoneme identifier. Intuitively, slow speech produces repeated frames or transition-ids within a phoneme. People typically need six or more frames to realize a phoneme, but most speech recognition systems only need three to four frames to interpret one. Hence, to introduce minimal revision to the original song, we can analyze the command’s pdf-id sequence, reduce the number of repeated frames in each phoneme, and obtain a shorter target sequence.
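The frame-reduction step can be sketched as capping the number of consecutive repeats of each pdf-id. The cap of four reflects the three-to-four-frames observation above; the function name and cap value are ours, for illustration.

```python
def shorten_pdf_seq(pdf_ids, max_repeat=4):
    """Cap consecutive repeats of each pdf-id, shrinking the target sequence
    (and thus the span of the song that must be revised)."""
    out, run = [], 0
    for p in pdf_ids:
        run = run + 1 if out and p == out[-1] else 1
        if run <= max_repeat:
            out.append(p)
    return out

# A slow utterance repeats pdf-id 5760 nine times; four frames suffice:
print(shorten_pdf_seq([6383] + [5760] * 9))  # [6383, 5760, 5760, 5760, 5760]
```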
4.3 Practical Attack over the Air
By feeding the generated adversarial sample directly into Kaldi, the desired command can be decoded correctly. However, playing the sample through a speaker to physically attack an IVC device typically does not work. This is mainly due to the noise introduced by the speaker and the environment, as well as the distortion caused by the receiver of the IVC device. In this paper, we do not consider the variation of background noise across different environments, e.g., grocery, restaurant, office, etc., for the following reasons: (1) in a noisy environment like a restaurant or grocery store, even the original voice command may not be correctly recognized by IVC devices; (2) modeling arbitrarily varying background noise is still an open research problem; (3) based on our observation, in a normal environment like a home, office, or lobby, the major impacts on the physical attack are the electronic noise from the speaker and the distortion from the receiver of the IVC device, rather than the background noise.
Hence, our idea is to build a noise model that considers the speaker noise, the receiver distortion, and generic background noise, and to integrate it into the approach of Section 4.2. Specifically, we carefully picked several songs and played them through our speaker in a very quiet room. By comparing the recorded audio (captured by our receiver) with the original, we can capture the noise. Note that playing “silent” audio does not work, since the electronic noise from speakers may depend on the sound at different frequencies; therefore, we choose songs that cover more frequencies. To compare two pieces of audio, we first manually align them and then compute the difference. We redesign the objective function as shown in Eq (2):

argmin_{μ(t)} dist(g(x(t) + μ(t) + n(t)), c)    (2)

where μ(t) is the perturbation that we add to the original song x(t), n(t) is the noise sample that we captured, and g(·) and c are the decoded pdf-id sequence and the command’s pdf-id sequence, as in Eq (1). In this way, we obtain adversarial audio that can be used to launch the practical attack over the air.
The noise model above is quite device-dependent: since different speakers and receivers may introduce different noise/distortion when playing or receiving specific audio, the captured noise n(t) may only work with the devices that we used to capture it. To enhance robustness, we instead introduce random noise, as shown in Eq (3):

argmin_{μ(t)} dist(g(x(t) + μ(t) + rand()), c)    (3)

Here, the function rand() returns a vector of random numbers in the interval (−N, N), which is saved as a “WAV” format file to represent the noise. Our evaluation results show that this approach makes the adversarial audio robust enough for different speakers and receivers.
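The random-noise idea of Eq (3) can be sketched as averaging gradients over fresh noise draws on each step, in the spirit of expectation-over-transformation. The quadratic toy loss and all parameters below are illustrative placeholders, not the actual Kaldi objective.

```python
import numpy as np

rng = np.random.default_rng(1)

def robust_grad(x, delta, grad_fn, noise_amp=0.05, n_samples=8):
    """Average the loss gradient over random noise draws n(t) ~ U(-N, N), so the
    perturbation delta survives speaker/receiver noise (an EOT-style sketch)."""
    g = np.zeros_like(delta)
    for _ in range(n_samples):
        n = rng.uniform(-noise_amp, noise_amp, size=x.shape)  # rand() noise sample
        g += grad_fn(x + delta + n)   # gradient of the loss at the noisy input
    return g / n_samples

# Toy check with a quadratic loss L(z) = ||z||^2 (gradient 2z): the zero-mean
# noise averages out, leaving roughly the clean gradient.
x = np.ones(4)
delta = np.zeros(4)
g = robust_grad(x, delta, lambda z: 2 * z, noise_amp=0.01, n_samples=200)
print(np.round(g, 2))  # close to 2*x
```

Because each descent step sees a different noise draw, the resulting perturbation cannot overfit to one specific speaker/receiver pair, which is the intuition behind replacing the captured n(t) with rand().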
In this section, we present the experimental results for CommanderSong. We evaluated both the WTA and WAA attacks against machine recognition. To evaluate human comprehension, we conducted a survey examining the effects of “hiding” the desired command in the song. Then, we tested the transferability of the adversarial samples on other ASR platforms and checked whether CommanderSong can spread through the Internet and radio. Finally, we measured the efficiency in terms of the time needed to generate a CommanderSong. Demos of the attacks are available on our website (https://sites.google.com/view/commandersong/).