1 Introduction
Over the last decade, machine learning has made great advances and has been widely adopted for many diverse applications, including security-sensitive applications such as identity verification and fraud detection. However, recent research has also shown that many machine learning algorithms (specifically deep neural networks) are vulnerable to adversarial attacks, where small, but carefully designed, perturbations are added to original samples, leading the target model to make wrong predictions [Szegedy et al. 2013]. Such adversarial attack algorithms have been proposed for a variety of tasks, such as image recognition, speech processing, text classification, and malware detection, where they have also been shown to be highly effective [Moosavi-Dezfooli et al. 2016, Cisse et al. 2017, Gong and Poellabauer 2017, Alzantot et al. 2018, Carlini and Wagner 2018, Schönherr et al. 2018, Ebrahimi et al. 2017, Grosse et al. 2017].
Most existing adversarial example generation algorithms require that the entire original data sample that is fed into the target model is observed and that any part of the sample can then be modified. For example, speech adversarial attack algorithms typically design a perturbation for a given speech sample, add the perturbation to the original sample, and then feed the resulting sample into the target speech recognition system. However, this approach is not always feasible, particularly when the target system requires streaming input, where the input is continuously processed as it arrives. In this real-time processing scenario, an attacker can only observe past parts of the data sample and can only add perturbations to future parts of the data sample, while the decision of the target model will be based on the entire data sample. A few concrete scenarios that operate this way are as follows:
Financial Trading Systems. Financial institutions make trading decisions using automatic machine learning algorithms based on a sequence of observations of some market conditions (e.g., variations in the stock index). An attacker may influence the trading model’s outcomes by carefully perturbing the corresponding market conditions. However, while the target trading model usually makes decisions based on a long sequence of observations, the attacker cannot change any historical data. Instead, the attacker can only add perturbations to future (yet to be observed) market conditions, e.g., using market manipulations.
Real-time Speech Processing Systems. Machine learning based real-time speech processing systems (e.g., speech recognition and automatic translation systems) have been adopted widely, including in many security-sensitive applications. An attacker may want to change the output of such systems by playing a carefully designed noise that is unnoticeable to the human ear, but will be superimposed through the air on the speech generated by a human speaker. The attacker can only design such noise signals based on past speech signals and superimpose the noise only on future speech signals, while the speech processing system will perform its task using the entire speech segment (e.g., a word or a sentence).
When attacking a real-time system, the attacker faces a tradeoff between observation and action space. That is, assume that the target system takes a sequential input $x$; the attacker could choose to design adversarial perturbations at the beginning. However, in this case, the attacker does not have any observation of $x$, but perturbations can be added to any time point of $x$, i.e., the attacker has minimum observation and maximum action space. In contrast, if the attacker chooses to add adversarial perturbations at the end, the attacker has a full observation of $x$, but cannot add perturbations to the data (i.e., the attacker has maximum observation, but minimum action space). In the first case, it is hard to find an optimal perturbation for $x$ without having any observations, while in the second case, the attack cannot be implemented at all. To address this dilemma, we propose a new attack scheme that continuously uses observed data to approximate an optimal adversarial perturbation for future time points using a deep reinforcement learning architecture (illustrated in Figure 1). In this paper, we refer to such attacks as real-time adversarial attacks. To the best of our knowledge, this is the first study of dynamic real-time adversarial attacks, which have not yet received the attention they deserve. The closest related concept is universal adversarial perturbation, presented in [Moosavi-Dezfooli et al. 2017, Li et al. 2018, Neekhara et al. 2019], where the authors design a fixed adversarial perturbation that is effective for different samples. The main difference to our work is that the universal adversarial perturbation is built offline and does not take advantage of observations in real time to further improve the perturbation for a specific target input.
2 Real-time Adversarial Attacks
2.1 Problem Formalization
Let $x = \{x_1, x_2, \dots, x_n\}$ denote an $n$-point time-series data sample, where each point $x_i \in \mathbb{R}^k$; $f$ is a classifier mapping the time-series sample $x$ to a discrete label set. The goal of the attacker is to design a real-time adversarial perturbation generator $g(\cdot)$ that continuously uses observed data to approximate an optimal adversarial perturbation $r_{t+d}$ for a future time point $t+d$, where $d$ is the delay caused by processing the data or emitting the adversarial perturbation. That is,
$$r_{t+d} = g(x_1, x_2, \dots, x_t) \qquad (1)$$
We define a metric $m(\cdot)$ to measure the perceptibility of the adversarial perturbation $r = \{r_1, r_2, \dots, r_n\}$; a common choice for $m$ is the metric induced by the $\ell_p$ norm. We then aim to solve the following optimization problem for non-targeted adversarial attacks:
$$\text{minimize } m(r) \quad \text{subject to } f(x + r) \neq f(x) \qquad (2)$$
Equation 1 implies the constraint that the adversarial perturbation is crafted only based on the observed part of the data sample and can only be applied to the unobserved part of the data sample. Equation 2 implies that the attacker wants to make the perturbation as imperceptible as possible on the premise that the attack succeeds. Even without the constraint of Equation 1, directly solving Equation 2 is usually intractable when $f$ is a deep neural network due to its nonconvexity. Nevertheless, previous efforts have found effective approximation methods such as the fast gradient sign method (FGSM) [Goodfellow et al. 2014], DeepFool [Moosavi-Dezfooli et al. 2016], and the algorithm proposed in [Carlini and Wagner 2018]. However, all these methods require full observation and the freedom to change any point of the original data sample, and are therefore not compatible with the constraint imposed by Equation 1.
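The causality constraint of Equation 1 can be illustrated with a minimal Python sketch. The generator `toy_generator`, the delay `d = 2`, and the sample values are hypothetical placeholders rather than the model used in this paper; the sketch only shows that the perturbation applied at time t + d is computed from the points observed up to time t.

```python
from typing import Callable, List, Sequence


def toy_generator(observed: Sequence[float]) -> float:
    """Hypothetical stand-in for g: a small push against the running mean."""
    mean = sum(observed) / len(observed)
    return -0.1 if mean > 0 else 0.1


def stream_attack(x: Sequence[float],
                  g: Callable[[Sequence[float]], float],
                  d: int) -> List[float]:
    """Return the perturbation r with r[t + d] = g(x[0..t]).

    Indices are 0-based, so the first d points are never perturbed and
    no r[j] depends on any x[i] with i > j - d.
    """
    n = len(x)
    r = [0.0] * n
    for t in range(n):
        target = t + d
        if target < n:
            # Only the observed prefix x[0..t] is visible to the generator.
            r[target] = g(x[: t + 1])
    return r


x = [0.5, 0.2, -0.9, -0.8, 0.1, 0.3]
r = stream_attack(x, toy_generator, d=2)
x_adv = [xi + ri for xi, ri in zip(x, r)]
```

Changing the last points of `x` leaves `r` unchanged, which is exactly the restriction that rules out attack methods requiring full observation.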
Alternatively, a more natural way of describing this problem is to view the adversarial perturbation generator as an agent and model the problem as a partially observable decision process, i.e., the generator continuously observes the streaming data and makes a sequence of decisions on how to make the perturbation. This formalism is equivalent to Equations 1 and 2, but allows us to use the many tools available for reinforcement learning (RL) [Sutton et al. 1998] to solve the problem. Then, the problem can be described using a tuple $(O, S, A, T, R)$, where:

Observation $O$: $o_t = \{x_1, x_2, \dots, x_t\}$.

State $S$: unobservable hidden state.

Action $A$: $a_t = r_{t+d}$, i.e., adding the perturbation $r_{t+d}$ to the original sample at time $t+d$.

Transition $T$: unknown.

Reward $R$: received only at the end of the sample.
This means that the attacker performs an action $a_t$ to emit the perturbation valued at $r_{t+d}$ based on the observation $o_t$, which will change the internal hidden state according to an unknown transition rule (e.g., the state can be the attack success probability, and an action could make it increase or decrease). The adversarial generator will only get the reward at the end. The goal of RL is to learn an optimal policy $\pi$ that maximizes the expectation of the reward. In this problem, the environment is the target model $f$ and the input data distribution.
2.2 Adversarial Attacks Using Reinforcement Learning
As discussed in the previous section, real-time adversarial attacks can be described as reinforcement learning problems, which are usually solved by using deep neural networks (DNNs). RL-DNN based adversarial attacks and conventional optimization based adversarial attacks (e.g., FGSM and DeepFool) differ in that the former treat the original example and the corresponding adversarial perturbation as the input and output of an unknown nonlinear mapping and then use a DNN to approximate it, i.e., they use learning to substitute optimization. In geometric terms, the attack model is trying to predict the direction that pushes the original example out of the correct decision region using the shortest distance.
A challenge for the attack model is to “forecast” future perturbations on yet unobserved data. However, this is feasible since, given a specific machine learning task, the input sample, although yet unobserved, will obey some fixed distribution (e.g., the distribution of natural speech), and there usually exist dependencies among the data points of the data sample, which can be used to forecast some characteristics of future data points based on already observed data points. We expect that such characteristics contain information that can be used to estimate an optimal perturbation for future points, which is illustrated in Figure 2.
Further, another challenge of using RL to implement real-time adversarial attacks is the sparse reward problem, i.e., the agent only receives the reward at the end and it is difficult to obtain an estimate of the reward at each time point based on the observed data and past actions. For example, estimating the expected reward at a time point simply by feeding the observed (partial) input at that time, superimposed with the corresponding perturbation, into the target model (if accessible) and using the classification confidence to calculate the reward will not yield reliable results, because the model’s prediction is not reliable when only partial input is given. In fact, although there have been many efforts to solve the sparse reward problem, many tasks still suffer from high computational overhead and training instability. However, for the adversarial example crafting problem, we can generate many trajectories of observation-action pairs using state-of-the-art non-real-time adversarial generation algorithms. This naturally leads us to use an imitation learning and behavior cloning [Atkeson and Schaal 1997] strategy to overcome the sparse reward problem. We discuss it in the following section.
2.3 Imitation Learning Strategy
Imitation learning is an RL technique that learns an optimal policy by imitating the behavior of an expert. Specifically, imitation learning requires a set of decision trajectories generated by an expert, where each decision trajectory consists of a sequence of “observation-action” pairs, i.e., $\{(o_1, a_1), (o_2, a_2), \dots, (o_n, a_n)\}$. Such trajectories serve as demonstrations to teach the agent how to behave given an observation. We can extract all expert observation-action pairs from the trajectories and form a new dataset $D = \{(o, a)\}$. By treating $o$ as the input feature and $a$ as the output label, we can learn $\pi$ in a supervised learning manner using traditional algorithms.
Specifically for the adversarial example crafting problem, we can use state-of-the-art non-real-time attack models to generate “sample-perturbation” pairs as decision trajectories by feeding different original samples $x$ and collecting the corresponding output perturbations $r$. Here, $x$ and $r$ consist of a sequence of $x_t$ and $r_t$, respectively. Using the definition of observation and action in Section 2.1, we can convert each $x_t$ and $r_t$ to $o_t$ and $a_t$, and then build a training set $D$ and use supervised learning to learn $g$.
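The conversion from expert sample-perturbation pairs to observation-action pairs can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the delay `d = 1`, and the hand-written perturbation values are hypothetical stand-ins for the output of a real expert.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[List[float], float]  # (observation o_t, action a_t)


def trajectory_to_pairs(x: Sequence[float], r: Sequence[float], d: int) -> List[Pair]:
    """Convert one expert (sample, perturbation) pair into (o_t, a_t) pairs.

    o_t is the observed prefix x[0..t]; a_t is the expert perturbation at
    index t + d. Steps whose target index falls outside the sample are skipped.
    """
    pairs: List[Pair] = []
    for t in range(len(x)):
        if t + d < len(r):
            pairs.append((list(x[: t + 1]), r[t + d]))
    return pairs


def build_dataset(samples: Sequence[Sequence[float]],
                  perturbations: Sequence[Sequence[float]],
                  d: int) -> List[Pair]:
    """Pool the pairs of all trajectories into one supervised training set."""
    dataset: List[Pair] = []
    for x, r in zip(samples, perturbations):
        dataset.extend(trajectory_to_pairs(x, r, d))
    return dataset


# Placeholder samples and expert perturbations (not real expert output).
samples = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]
perturbations = [[0.0, 0.0, 0.05, -0.05], [0.0, 0.01, 0.0]]
dataset = build_dataset(samples, perturbations, d=1)
```

Each entry of `dataset` pairs a variable-length prefix with a single target value, which is the form a supervised learner would consume.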
2.3.1 Choice of Expert
We use a state-of-the-art non-real-time adversarial example crafting technique as the expert. Over the last few years, many new attack techniques have been developed and shown to be effective. These techniques can be roughly classified into two categories. The first category includes gradient-based methods such as FGSM, DeepFool, and the method presented in [Carlini and Wagner 2018]; these are typically based on deterministic optimization algorithms. The second category consists of gradient-free methods such as the methods presented in [Alzantot et al. 2018, Su et al. 2019]; these are typically based on stochastic optimization algorithms. Which method works better as an expert depends not only on the attack success rate; other important criteria include:
1. Flexibility of adding additional constraints. There are two reasons why we prefer an expert that provides some flexibility of adding additional constraints besides making the perturbation imperceptible. First, we ultimately need to learn from the trajectories generated by the expert using some supervised learning method, which inevitably will contain some error. We can add some regularization on the trajectories (e.g., perturb only after a specific time point) to simplify the supervised learning task, which requires additional constraints on the expert. Second, in realistic attack scenarios, the attacker usually faces additional constraints, e.g., when an attacker attempts to fool a speech recognition system by playing the perturbation over the air using a speaker, the frequency range of the perturbation is subject to the characteristics of the speaker. In general, stochastic optimization algorithms are more flexible than deterministic optimization algorithms for adding additional complex constraints.
2. Attacker’s knowledge. The attacker’s knowledge required for the proposed real-time adversarial attack follows exactly that of the chosen expert policy. Hence, the attacker should choose the expert policy according to the attack scenario.
3. Determinism of the expert. While state-of-the-art adversarial example crafting approaches are highly effective in terms of success rate, there is no guarantee that the generated perturbation is globally optimal. Specifically, perturbations generated for the same input sample using a stochastic optimization algorithm can vary with the random seed since the optimization might stop at different suboptimal points, which will make the mapping from observation to action ill-defined and increase the difficulty of training $g$. Therefore, a deterministic expert is preferred.
2.3.2 Computational Overhead and Speed
Existing adversarial example crafting techniques can be computationally expensive due to the complexity of optimization, e.g., the method in [Carlini and Wagner 2018] requires about one hour to craft a single speech adversarial example. Stochastic optimization algorithms typically need to call the target model (or the substitute model) hundreds or thousands of times to find the solution. However, since we use a deep neural network to substitute optimization, no matter which expert we choose to imitate, the computational overhead for generating an adversarial perturbation for one time point is fixed to be the inference time of $g$ (denoted by $d_c$, which is the computational delay). In the real-time scenario, if the input sample frequency is higher than $1/d_c$, then the generator is not fast enough to catch up with the streaming input. The attacker then needs to lower the update frequency by modifying $g$ to do batch processing, i.e., generate a batch of actions for $b$ future points in one inference, which could lower the delay requirement by $b$ times.
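The relation between inference time and batch size can be made concrete with a short calculation. The helper below is a hypothetical illustration: it assumes that one inference must cover at least as many future points as arrive while that inference runs, and the example values (16 kHz input, 0.01 s per inference) are chosen for illustration.

```python
import math


def min_batch_size(sample_rate_hz: float, inference_time_s: float) -> int:
    """Smallest number of future points to emit per inference so that the
    generator keeps up with the stream: b >= sample_rate * inference_time.

    The product is rounded to 9 decimals first to absorb floating-point noise.
    """
    points_per_inference = round(sample_rate_hz * inference_time_s, 9)
    return max(1, math.ceil(points_per_inference))


b = min_batch_size(16_000, 0.01)
```

With these values, each inference has to decide the actions for 160 points; a faster generator would allow a smaller batch and therefore more frequent updates.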
2.4 Implementation
Once we form the dataset $D$ consisting of observation-action pairs from the expert’s decision trajectories, we form the real-time adversarial generator $g$ as a deep neural network and learn it from the dataset. Note that each input $o_t$ is a sequence of variable length, so it is natural to use a recurrent neural network as part of the network. Specifically, the neural network can be divided into two parts: the encoder and the decoder. The encoder is a recurrent neural network that maps a variable-length input into a fixed-dimensional encoding. We expect that the learned encoding contains useful features from $o_t$; the decoder then makes the decision of the action, e.g., in the example in Figure 2, we expect that the encoding expresses which cluster the data sample belongs to, and the decoder can find the optimal perturbation based on this information. We can then calculate the error between the predicted action and the ground truth action and use standard backpropagation to update $g$.
Assume that we have $N$ trajectories and each trajectory consists of $n$ observation-action pairs. The dataset $D$ then has $N \times n$ samples, which can be very large and will make the training slow. In fact, observations from the same trajectory are highly dependent, i.e., the only difference between $o_t$ and $o_{t+1}$ is that $o_{t+1}$ has one more observed point $x_{t+1}$; therefore there will be a lot of repetitive computation in the recurrent neural network (i.e., the encoder). In order to expedite the training, we should train observation-action pairs from the same trajectory in a batch, i.e., after obtaining $a_t$ from feeding input $o_t$ into $g$, we do not feed a new input $o_{t+1}$ into $g$. Instead, we feed $x_{t+1}$ into $g$ and obtain the output of $g$ as $a_{t+1}$. Figure 3 illustrates this training process. Specifically, this approach avoids any repetitive encoder computation and can be viewed as sequence-to-sequence training. Note that the predicted actions are only dependent on the current observation (i.e., they are not based on any future observations), which is different from standard sequence-to-sequence training used in other applications such as machine translation, where the intermediate encoding contains information of the entire input sample. The pseudocode of the proposed algorithm is shown in Algorithm 1.
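The saving from processing a trajectory in one pass can be illustrated with a toy recurrence. The one-unit `encoder_step` and the linear `decoder` below are hypothetical stand-ins for the recurrent encoder and the decoder; the sketch only compares the number of encoder steps needed when every prefix is fed separately with the number needed when the hidden state is carried forward.

```python
from typing import List, Sequence, Tuple


def encoder_step(h: float, x_t: float) -> float:
    """Hypothetical one-unit recurrent cell (stand-in for the RNN encoder)."""
    return 0.5 * h + x_t


def decoder(h: float) -> float:
    """Hypothetical decoder mapping the encoding to an action."""
    return 2.0 * h


def actions_naive(x: Sequence[float]) -> Tuple[List[float], int]:
    """Feed every prefix o_t separately: the encoder restarts each time."""
    actions: List[float] = []
    steps = 0
    for t in range(1, len(x) + 1):
        h = 0.0
        for x_i in x[:t]:
            h = encoder_step(h, x_i)
            steps += 1
        actions.append(decoder(h))
    return actions, steps


def actions_single_pass(x: Sequence[float]) -> Tuple[List[float], int]:
    """Carry the hidden state forward and feed only the newly observed point."""
    actions: List[float] = []
    steps = 0
    h = 0.0
    for x_t in x:
        h = encoder_step(h, x_t)
        steps += 1
        actions.append(decoder(h))
    return actions, steps


x = [1.0, 2.0, 3.0, 4.0]
naive, naive_steps = actions_naive(x)
fast, fast_steps = actions_single_pass(x)
```

Both functions return the same actions, but the naive version needs n(n+1)/2 encoder steps for an n-point trajectory, whereas the single pass needs n. Each action still depends only on the points observed so far.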
It is worth mentioning that although in this paper we focus on using the basic behavior cloning algorithm for simplicity, there are many more advanced algorithms (e.g., Dataset Aggregation [Ross et al. 2011]) in imitation learning and reinforcement learning that can further improve the attack performance, e.g., it is possible to design a remedy mechanism for the real-time adversarial perturbation generator that allows it to adjust its future strategy if it realizes it has previously made a wrong decision. Hence, formalizing the real-time attack as a reinforcement learning problem is not only natural, but also allows us to apply existing tools and algorithms.
3 Case Study: Attacking a Voice Command Recognition System
In the previous section, we introduced the general real-time adversarial attack framework in a relatively abstract way; in this section, we further show how to adopt the framework in a realistic task: the audio adversarial attack. Code and demos are available at https://github.com/YuanGongND/realtimeadversarialattack.
3.1 Target Model and Attack Scenario
The goal is to attack a voice command recognition system based on a convolutional neural network [Sainath and Parada 2015]. This model is used as an official example for TensorFlow (www.tensorflow.org/tutorials/sequences/audio_recognition), is easy to reproduce, and has also been used as the target model for attacks in [Alzantot et al. 2018]. We train the voice command recognition model exactly as in the implementation of the TensorFlow example using the voice command dataset [Warden 2018], except that we only use 80% of the data for training, allowing us to use the other 20% for testing. Most audio samples are exactly 1 second long with a sampling rate of 16 kHz; all other samples are padded to 1 second for consistency. The model can classify ten keywords: “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, and “go”. The trained model achieves 88.7% accuracy on the validation set.
The proposed real-time scheme can greatly increase the real-world threat of the audio adversarial attack. As illustrated in Figure 4, compared to previous non-real-time audio adversarial attack technologies presented in [Carlini et al. 2016, Yakura and Sakuma 2018, Gong and Poellabauer 2018, Qin et al. 2019], the key advantage of the real-time audio adversarial attack scheme is that only with this scheme can the attacker attack an ongoing session, i.e., an ongoing human-computer interaction, and interfere with the voice command currently being spoken by a human speaker. This is because previous non-real-time adversarial attack approaches need a “preparation phase”, where the attacker obtains a complete original speech sample, designs specific adversarial perturbations for this sample, and adds a perturbation to the original sample to build a malicious adversarial example. Then, in the “attack phase”, the attacker needs to initiate a new session with the target system and replay the prepared malicious adversarial example. The application of such an attack is relatively limited, because during the attack phase, if the user is near the target system, then no matter how close the malicious sample sounds to a benign sample, it will be suspicious to the user; if the user is not near the target system, then it is not necessary to make the malicious sample imperceptible to humans. Further, it is not always easy or even possible to initiate a new session in security-sensitive systems. In contrast, the real-time adversarial attack scheme does not need a preparation phase; instead, it continuously processes the speech spoken by the user and emits the adversarial perturbation, which is superimposed on the original signal over the air in real time. In practice, the attack can be implemented by placing a device (e.g., a smartphone) equipped with a microphone and a speaker and installed with the real-time attack software near the target device.
3.2 Adversarial Attack Settings
We perform the non-targeted attack in a semi-black box setting, i.e., we assume that the attacker can call the target model an unlimited number of times and get the corresponding predictions and confidence scores, but has no knowledge about the model details (architecture, algorithm, and parameters). It is a realistic setting for speech recognition system attacks, because the loss function of many speech recognition models is not differentiable with respect to the input, and most state-of-the-art systems are cloud-based, which makes it difficult to obtain full knowledge of the model and perform a white-box attack. For example, the front end of our target model is not a neural network, but a set of filter banks extracting Mel-frequency cepstrum features, so it is hard to calculate the gradient of the loss function with respect to the input waveform, even when we have a copy of the model [Alzantot et al. 2018]; Google Speech is a commercial cloud-based model, and it is hard for the attacker to obtain full knowledge about its design. However, it allows users to upload speech samples and freely obtain predictions and confidence scores, which provides opportunities for semi-black box attacks.
In order to emulate a realistic situation, in this example, we apply the following constraints to the adversarial perturbation. First, we constrain the $\ell_0$ norm of the adversarial perturbation, i.e., we limit the number of nonzero points of the perturbation. This is because limiting the $\ell_2$ or $\ell_\infty$ norm will make the amplitude of the noise small, so that it does not pose an over-the-air threat; it is therefore more reasonable to generate short, but relatively loud perturbations. Second, we require that the nonzero points of the perturbation form clusters, i.e., consecutive noise segments. This is because it is impossible for an electronic speaker to generate a signal of a few nonconsecutive nonzero points due to the limitation of its dynamic characteristics. In this example, we perturb five 0.01-second segments and, for simplicity, the scales of the points in one segment are fixed and identical (i.e., the noise frequency is an integral multiple of the sampling frequency), but each noise segment can be any physically realizable signal the attacker desires. Data points that have amplitudes over 1 are clipped to 1. These two constraints also greatly lower the computational complexity for the real-time adversarial perturbation generator, which now only needs to decide the timing of emitting each of the five noise segments. In this example, we focus on the decision-making process, so we do not consider the signal attenuation and distortion during transmission through the air. An illustration of the proposed adversarial perturbation is shown in the upper part of Figure 5.
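The perturbation format described above (five 0.01-second segments of fixed amplitude, followed by clipping) can be sketched in a few lines. The start indices, the placeholder signal, and the symmetric clipping range [-1, 1] are assumptions made for illustration; only the segment count, the segment length, and the 16 kHz sampling rate come from the text.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # Hz, sampling rate of the 1-second clips
SEGMENT_LEN = 160      # 0.01 s at 16 kHz


def build_perturbation(starts: Sequence[int], amplitude: float,
                       n: int = SAMPLE_RATE) -> List[float]:
    """Place one constant-amplitude noise segment at each start index."""
    r = [0.0] * n
    for s in starts:
        for i in range(s, min(s + SEGMENT_LEN, n)):
            r[i] = amplitude
    return r


def apply_perturbation(x: Sequence[float], r: Sequence[float]) -> List[float]:
    """Superimpose r on x and clip to [-1, 1] (assumed symmetric range)."""
    return [max(-1.0, min(1.0, xi + ri)) for xi, ri in zip(x, r)]


starts = [1000, 3000, 5000, 8000, 12000]   # hypothetical emission timings
r = build_perturbation(starts, amplitude=0.5)
x = [0.8] * SAMPLE_RATE                    # placeholder "speech" signal
x_adv = apply_perturbation(x, r)
nonzero = sum(1 for v in r if v != 0.0)
```

Only 800 of the 16,000 points are nonzero, so the perturbation is fully described by its five start indices once the amplitude is fixed.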
3.3 The Expert
Since we are performing a semi-black box attack and, to ensure realism, add non-standard constraints to the optimization problem, we follow the discussion in Section 2.3.1 and choose a stochastic optimization based adversarial example crafting technique as the expert. Specifically, we take the differential evolution optimization [Storn and Price 1997], which was previously used for the “one-pixel” attack [Su et al. 2019] on image recognition systems with constraints similar to those of our proposed attack. We extend it to audio attacks and then use it as the expert. In our case, the candidate solution of the optimization is a 5-tuple consisting of the starting points of each noise segment (sorted). The optimization objective is to minimize the confidence score of the original label. At each iteration, the fitness of each candidate solution is calculated and new candidate solutions are produced using the standard differential evolution formula.
The differential evolution algorithm has two main parameters: the population size and the number of iterations. On the one hand, we want the optimization result to be optimal and deterministic (i.e., the result is invariant to random seeds), which requires large parameters. On the other hand, the computational overhead is linearly proportional to the population size and the iteration number, and evaluating the fitness of each candidate solution requires calling the DNN based target model once. Therefore, in order to generate the dataset consisting of over 20,000 trajectories for imitation learning in a reasonable time, we have to limit the population size and the iteration number. As shown in Figure 6, we test the performance and the standard deviation of the optimization result with different random seeds. We find that population size = 10 and iteration number = 75 provide a good balance between performance and computational overhead and use these values in our experiments. For each audio sample in the training set, we use the expert to generate a perturbation in the form of a 5-tuple. Note that each audio sample consists of 16,000 observations, and thus forms 16,000 observation-tuple pairs (a decision trajectory), where the tuple is identical for all observations since the optimal perturbation does not change with the observation.
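A minimal sketch of differential evolution over sorted 5-tuples of start points is given below. It uses the common rand/1/bin scheme; the scale factor `f_scale`, the crossover rate `cr`, the `repair` step, and the fitness function `toy_confidence` are assumptions for illustration (the real expert would call the target model to obtain the confidence score of the original label). Only the population size of 10 and the 75 iterations are taken from the text.

```python
import random
from typing import Callable, List, Sequence

N_POINTS = 16_000   # samples per 1-second clip
SEG_LEN = 160       # 0.01-second segment
K = 5               # number of noise segments


def repair(cand: Sequence[float]) -> List[int]:
    """Round, clamp to valid start positions, and sort the 5-tuple."""
    hi = N_POINTS - SEG_LEN
    return sorted(min(hi, max(0, int(round(c)))) for c in cand)


def differential_evolution(fitness: Callable[[List[int]], float],
                           pop_size: int = 10, iterations: int = 75,
                           f_scale: float = 0.5, cr: float = 0.7,
                           seed: int = 0) -> List[int]:
    """Minimize `fitness` with the rand/1/bin differential evolution scheme."""
    rng = random.Random(seed)
    pop = [repair([rng.uniform(0, N_POINTS) for _ in range(K)])
           for _ in range(pop_size)]
    scores = [fitness(p) for p in pop]
    for _ in range(iterations):
        new_pop, new_scores = list(pop), list(scores)
        for i in range(pop_size):
            others = [p for j, p in enumerate(pop) if j != i]
            a, b, c = rng.sample(others, 3)
            j_rand = rng.randrange(K)  # guarantees one mutated coordinate
            trial = repair([
                a[j] + f_scale * (b[j] - c[j])
                if (rng.random() < cr or j == j_rand) else pop[i][j]
                for j in range(K)
            ])
            s = fitness(trial)  # one fitness (target model) call per candidate
            if s <= scores[i]:
                new_pop[i], new_scores[i] = trial, s
        pop, scores = new_pop, new_scores
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best]


def toy_confidence(starts: List[int]) -> float:
    """Hypothetical stand-in for the target model's confidence score."""
    sensitive = [2000, 4000, 6000, 8000, 10000]
    return sum(abs(s - t) for s, t in zip(starts, sensitive)) / N_POINTS


best = differential_evolution(toy_confidence)
```

With these settings the sketch makes 10 + 10 × 75 = 760 fitness calls per sample, which shows why the population size and iteration number dominate the cost of generating expert trajectories.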
3.4 Training the Real-time Adversarial Perturbation Generator
3.4.1 Input and Output of the Network
The real-time adversarial perturbation generator is implemented using a deep neural network; the input of the network is simply an observation (of variable length), and the output of the network is a 5-tuple of the same definition as the solution of the differential evolution optimization algorithm, i.e., 5 time points at which to emit noise segments. The tuple can be easily converted to an action using the following rule: if the current estimated best emission timing is equal to or earlier than the current time point, then immediately emit the noise.
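The conversion rule from the predicted tuple to an action can be sketched as follows. The function name and the bookkeeping set are hypothetical, and the sketch assumes that each of the five segments is emitted at most once.

```python
from typing import List, Sequence, Set


def decide_emissions(estimates: Sequence[int], now: int,
                     already_emitted: Set[int]) -> List[int]:
    """Return the indices of the noise segments to emit at time `now`.

    estimates[k] is the current estimate of the best start time of segment k.
    A segment is emitted as soon as its estimated time is at or before `now`;
    the at-most-once bookkeeping is an assumption of this sketch.
    """
    to_emit = [k for k, t_k in enumerate(estimates)
               if t_k <= now and k not in already_emitted]
    already_emitted.update(to_emit)
    return to_emit


emitted: Set[int] = set()
first = decide_emissions([1200, 3000, 5000, 9000, 12000],
                         now=1600, already_emitted=emitted)
second = decide_emissions([1000, 1500, 5200, 9000, 12000],
                          now=1760, already_emitted=emitted)
```

In the second call the estimate for segment 1 has moved to an earlier time that has already passed, so the segment is emitted immediately, while segment 0 is not emitted again.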
3.4.2 Batch Processing
The frequency of the speech signal (i.e., 16 kHz) is much higher than the possible update speed of the real-time adversarial perturbation generator. Therefore, we apply batch processing as mentioned in Section 2.3.2; specifically, the adversarial generator updates every 0.01 seconds and each update makes a decision on the actions for 0.01 seconds, so the delay is also 0.01 seconds. Note that while the update period and the noise segment length are identical, they are not related.
3.4.3 The Network Architecture
As shown in Table 1, we use an end-to-end neural network. Since the input is an observation of variable length $t$, as a standard signal processing technique, we cut it into $t/160$ frames, where 160 is the frame length. We then use a series of convolution and pooling layers to extract the features. The features of each frame are then sequentially fed into the long short-term memory (LSTM) [Hochreiter and Schmidhuber 1997] layers to obtain the encoding, and two dense layers decode the encoding as the output. This basically follows the architecture shown in Figure 3: the layers before the LSTM layers are the encoder, and those after the LSTM layers are the decoder. We use 1e-3 as the learning rate, mean squared error loss, and the ADAM optimizer [Kingma and Ba 2014] for training. We train data samples in the same trajectory in a batch to expedite the computations as discussed in Section 2.4.
Layer Name         Output Dimension
Input              (t, 1)
Framing            (t/160, 160)
Conv1 / Pooling    (t/160, 80, 16)
Conv2 / Pooling    (t/160, 40, 32)
Conv3 / Pooling    (t/160, 20, 48)
Conv4 / Pooling    (t/160, 10, 64)
Flatten            (t/160, 640)
LSTM * 3           (256)
Dense 1            (256)
Dense 2            (128)
Output             (5)
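The output dimensions listed in the table can be checked with a short shape calculation. This sketch assumes that each pooling stage halves the within-frame axis and that the input length is a multiple of the frame length; kernel sizes and other layer details are not modeled.

```python
from typing import List, Tuple


def layer_shapes(t: int, frame_len: int = 160) -> List[Tuple[str, tuple]]:
    """Output shape per layer for an input of t samples (t divisible by frame_len)."""
    frames = t // frame_len
    shapes = [("Input", (t, 1)), ("Framing", (frames, frame_len))]
    width = frame_len
    channels = 1
    for i, channels in enumerate((16, 32, 48, 64), start=1):
        width //= 2  # assumed: each pooling stage halves the frame axis
        shapes.append((f"Conv{i} / Pooling", (frames, width, channels)))
    shapes.append(("Flatten", (frames, width * channels)))
    shapes += [("LSTM * 3", (256,)), ("Dense 1", (256,)),
               ("Dense 2", (128,)), ("Output", (5,))]
    return shapes


shapes = dict(layer_shapes(16_000))
```

For a full 1-second observation (16,000 samples) the encoder sees 100 frames, each flattened to 10 × 64 = 640 features before the LSTM layers.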
3.5 Experiments
In our experiments, we use the dataset and target model described in Section 3.1. The data is split as follows: we first hold out 20% of the data as a test set (test set 2) for evaluating the attack performance, so it is seen by neither the target model nor the attack model. We use the other 80% of the data to train the target voice recognition model; this same set is then reused to develop the attack model. Specifically, we use 75% of this set to train the attack model (attack training set), 6.25% for validation, and 18.75% for testing (test set 1). Therefore, test set 1 is seen by the target model, but not by the attack model. We then generate the expert demonstration of the optimal emission timing using the method described in Section 3.3 for each speech sample in the attack training set. Since in our setting the amplitude of each noise segment is a given fixed value, it is expected (and confirmed by our experiments later) that the expert demonstration of the optimal emission timing varies with the given amplitude value, because the emission strategy may be different for different noise amplitudes. In this experiment, we generate two versions of expert demonstrations using noise amplitudes of 0.1 and 0.5, respectively. Note that although the expert demonstrations of optimal emission time points are optimized for a given noise amplitude, the attacker can emit noise of any desired amplitude at these time points in the test phase, which might lead to suboptimal attack performance. We discuss this in detail in the next section.
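As a worked check of the split described above, the nested percentages can be converted into shares of the full dataset. The variable names are illustrative; the percentages are those stated in the text.

```python
from fractions import Fraction

total = Fraction(1)
test_set_2 = Fraction(20, 100) * total       # held out from both models
target_train = total - test_set_2            # 80% trains the target model

# The 80% set is reused and subdivided to develop the attack model.
attack_train = Fraction(75, 100) * target_train
attack_val = Fraction(625, 10_000) * target_train
test_set_1 = Fraction(1875, 10_000) * target_train

shares = {
    "attack_train": attack_train,
    "attack_val": attack_val,
    "test_set_1": test_set_1,
    "test_set_2": test_set_2,
}
```

In terms of the full dataset, the attack model is trained on 60%, validated on 5%, and tested on 15% (test set 1), with the remaining 20% forming test set 2.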
We then train the real-time adversarial perturbation generator to learn from the expert demonstrations using the approach described in Section 3.4. We use two metrics to evaluate the attack performance: 1) attack success rate (in the non-targeted attack setting, success means that the prediction of the perturbed sample is different from that of the original sample) and 2) confidence score drop of the original class caused by the attack (which measures the confidence of the attack).
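The two metrics can be computed as in the following sketch. The function names and the example predictions and confidence scores are hypothetical; averaging the confidence drop over all samples is an assumption about how the second metric is aggregated.

```python
from typing import Sequence


def attack_success_rate(orig_preds: Sequence[str],
                        adv_preds: Sequence[str]) -> float:
    """Fraction of samples whose prediction changes after the perturbation."""
    changed = sum(1 for o, a in zip(orig_preds, adv_preds) if o != a)
    return changed / len(orig_preds)


def mean_confidence_drop(orig_conf: Sequence[float],
                         adv_conf: Sequence[float]) -> float:
    """Average drop in the original class's confidence caused by the attack."""
    drops = [o - a for o, a in zip(orig_conf, adv_conf)]
    return sum(drops) / len(drops)


# Placeholder model outputs for four samples (not experimental data).
orig_preds = ["yes", "no", "up", "go"]
adv_preds = ["yes", "on", "up", "no"]
orig_conf = [0.9, 0.8, 0.7, 0.6]   # confidence of the original class, clean
adv_conf = [0.8, 0.3, 0.7, 0.1]    # confidence of the original class, perturbed

success = attack_success_rate(orig_preds, adv_preds)
drop = mean_confidence_drop(orig_conf, adv_conf)
```

The success rate only counts label changes, whereas the confidence drop also captures attacks that weaken the prediction without flipping it.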
3.5.1 Overall Result
We show the attack performance on test set 1 of two nonrealtime experts (optimized for perturbation amplitude of 0.1 and 0.5, respectively) and corresponding learned realtime adversarial perturbation generators in Figure 7. We observe that:
First, the attack success rate of the adversarial perturbation generator is up to 43.5% (when perturbation amplitude is 1), which is about half of the best nonrealtime expert (90.5%) and clearly outperforms the random noise. For most perturbation amplitudes, the attack success rate of the realtime attack is 30%50% of that of the expert.
Second, the attack performance of the expert varies with the perturbation amplitude it is optimized for. It is not surprising that the expert optimized for small noise amplitude of 0.1 performs better when the actual emission amplitude is small ( 0.23) while the expert optimized for large amplitude of 0.5 performs better when the actual emission amplitude is large ( 0.23). This difference also shows in the corresponding realtime adversarial perturbation generators, but the impact is much smaller, which gives the attacker a nice property that the attack performance does not drop much when the actual and expected noise amplitude are different (e.g., for audio adversarial attacks, it is hard for the attacker to know the actual amplitude of the noise signal received by the target system due to signal attenuation, but it does not matter).
Third, we conduct the same test on test set 2 (attack success rate up to 42.2%) and find no substantial difference between the results on test sets 1 and 2, indicating that the attack model generalizes to data samples not seen by the target model.
3.5.2 Real-time Dynamics
We next discuss how the proposed adversarial perturbation generator works in real time. To this end, we plot the dynamics of 64 attack trials for 64 different input samples in the left part of Figure 8. Each row represents one attack trial and shows the adversarial perturbation generator’s estimate of the optimal emission timing of the first noise segment at each time point. We place the ground truth on the right for reference. At the beginning of each attack, when no data has been observed yet, the generator outputs a prior guess that is similar across samples; as more data is observed, the estimate gradually improves and finally approaches the ground truth. We can also observe that the number of observations needed for a correct estimate differs among the trials. This is because the voice command samples begin with silence periods of different lengths, which do not contain information helpful for predicting an adversarial perturbation. This is further verified by the detailed dynamics of a real sample shown in Figure 5: the generator’s estimate changes dramatically while the speech signal is observed but barely changes while the silent period is observed, indicating that the generator bases its decisions mainly on the informative part of the signal and is able to correct them given more observations. In this sample, the estimate does not become stable until half of the speech signal has been observed, but three noise segments have already been emitted by that time point. This shows the tradeoff between the observation and action spaces: the attacker needs to emit the adversarial noise as soon as the current best-guess timing, based on partial observation, arrives; otherwise the timing will pass and the emission can no longer be carried out.
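The emission constraint discussed above can be illustrated with a deliberately simplified loop for a single noise segment. This is a sketch under stated assumptions, not the paper's implementation: time is discretized into integer steps, `estimates[t]` is a hypothetical list holding the generator's current estimate (in steps) after observing the input up to step `t`, and the rule "emit as soon as the current step reaches the current estimate" is our reading of the text.

```python
def realtime_emission_step(estimates):
    """Return the step at which the noise segment is actually emitted.

    estimates[t] is the best-guess emission step available at step t.
    The attacker must commit once the current step reaches that guess,
    even if a later estimate would have been more accurate.
    Returns None if the guess never arrives within the observed steps.
    """
    for t, est in enumerate(estimates):
        if est <= t:
            # The best-guess timing has arrived (or has just passed): emit now.
            return t
    return None


if __name__ == "__main__":
    # The estimate drifts as more of the signal is observed; the emission
    # happens at step 4, when the current guess (3) is no later than "now".
    print(realtime_emission_step([5, 5, 4, 6, 3, 3]))
```

Because the decision is irreversible, an early but wrong estimate can force an emission before the estimate has stabilized, which is the real-time decision error analyzed in Section 3.5.3.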
We also show the mean absolute prediction error over time in the right part of Figure 8, which demonstrates that the adversarial generator indeed improves with more observations.
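A curve of mean absolute prediction error over time can be computed from the per-trial estimate trajectories as sketched below. The function name and data layout are assumptions made for illustration: `estimates[i][t]` is the estimate for trial `i` at time step `t`, and `ground_truth[i]` is the expert's emission timing for that trial, with all trials assumed to have the same number of steps.

```python
def mean_abs_error_over_time(estimates, ground_truth):
    """Average |estimate - ground truth| across trials at each time step."""
    if not estimates or len(estimates) != len(ground_truth):
        raise ValueError("need one ground-truth value per trial")
    n_steps = len(estimates[0])
    if any(len(trial) != n_steps for trial in estimates):
        raise ValueError("all trials must have the same number of steps")
    n_trials = len(estimates)
    return [
        sum(abs(trial[t] - gt) for trial, gt in zip(estimates, ground_truth)) / n_trials
        for t in range(n_steps)
    ]


if __name__ == "__main__":
    # Two toy trials that start from the same prior guess (0.5) and
    # move toward their own ground truth as more input is observed.
    estimates = [[0.5, 0.4, 0.3], [0.5, 0.6, 0.7]]
    ground_truth = [0.3, 0.8]
    print(mean_abs_error_over_time(estimates, ground_truth))
```

A decreasing curve indicates that the estimates improve as more of the input is observed.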
3.5.3 Error Analysis
Finally, we analyze the error of the real-time adversarial generator. Two main types of error cause the performance gap between the expert and the real-time adversarial generator: prediction error and real-time decision error. The proposed real-time generator essentially tries to learn the mapping between the (partial) input and the output of the differential evolution optimization. While this substantially speeds up computation, such a mapping is challenging to learn. In particular, in our setting, the output of the stochastic optimization algorithm adopted by the expert is not deterministic (shown in the left part of Figure 6), which makes learning the mapping even harder. As shown in the right part of Figure 8, even after the real-time generator observes the full data, its prediction still has a certain amount of error. Further, as discussed in the previous section, the real-time adversarial generator may emit noise segments before it has a reliable estimate, due to the observation-action space tradeoff. We show the distribution of the actual timing error in Figure 9; it follows a zero-centered, bell-shaped distribution, and the errors of most trials are small. Statistically, the mean actual timing error (i.e., the difference between the actual emission time point and the expert’s demonstration) is 0.1135 seconds, which is only slightly larger than the prediction error (i.e., the difference between the predicted emission time point with full observation and the expert’s demonstration) of 0.1091 seconds. This indicates that the main error of our attack model is the prediction error, which can be reduced by further reducing the instability of the expert and optimizing the deep neural network architecture.
The proposed adversarial perturbation is audible even when the amplitude is small. It sounds similar to the “usual” noise commonly heard from electronic speakers (e.g., buzzing, interference, etc.), which makes the perturbation appear unsuspicious.
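The two error statistics compared in the error analysis above can be computed as in the following sketch. All names are hypothetical. We assume the reported means are means of absolute differences, since the signed timing errors are described as zero-centered and would otherwise average to roughly zero.

```python
def timing_error_summary(emitted, predicted_full, expert):
    """Compare actual timing error with full-observation prediction error.

    emitted[i]        : time (s) at which noise was actually emitted in trial i
    predicted_full[i] : predicted emission time (s) after observing the full input
    expert[i]         : expert-demonstrated emission time (s)
    """
    if not expert or not (len(emitted) == len(predicted_full) == len(expert)):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(expert)
    # Signed errors are what a histogram of the timing error would be built from.
    signed_actual = [e - x for e, x in zip(emitted, expert)]
    mean_actual = sum(abs(d) for d in signed_actual) / n
    mean_prediction = sum(abs(p - x) for p, x in zip(predicted_full, expert)) / n
    return {
        "signed_actual_errors": signed_actual,
        "mean_actual_error": mean_actual,
        "mean_prediction_error": mean_prediction,
        # Extra error attributable to committing before the estimate settles.
        "decision_gap": mean_actual - mean_prediction,
    }


if __name__ == "__main__":
    summary = timing_error_summary(
        emitted=[0.50, 0.30, 0.90],
        predicted_full=[0.45, 0.35, 0.85],
        expert=[0.40, 0.40, 0.80],
    )
    print(summary["mean_actual_error"], summary["mean_prediction_error"])
```

A small gap between the two means, as in the 0.1135 s versus 0.1091 s reported above, suggests that most of the error comes from prediction rather than from committing too early.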
4 Conclusions and Future Work
In this work, we propose the concept of real-time adversarial attacks and show how to attack a streaming-based machine learning model by designing a real-time perturbation generator that continuously uses observed data to design optimal perturbations for unobserved data. We use imitation learning with a behavior cloning algorithm to train the real-time adversarial perturbation generator from the demonstrations of a state-of-the-art non-real-time adversarial perturbation generator. The case study (voice command recognition) and its results demonstrate the effectiveness of the proposed approach. Nevertheless, we observe a certain performance gap between the real-time and non-real-time adversarial attacks when the basic behavior cloning algorithm is used. In future research, we plan to study how to adopt more advanced reinforcement learning tools to improve the decision-making process; e.g., when the real-time adversarial perturbation generator realizes it has previously made a wrong decision, can it adjust its future strategy to compensate? We also plan to study defense strategies to protect real-time systems against such real-time adversarial attacks.
References
 [Alzantot et al.2018] Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. Did you hear that? adversarial examples against automatic speech recognition. arXiv preprint arXiv:1801.00554, 2018.
 [Atkeson and Schaal1997] Christopher G Atkeson and Stefan Schaal. Robot learning from demonstration. In ICML, volume 97, pages 12–20. Citeseer, 1997.
 [Carlini and Wagner2018] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7. IEEE, 2018.
 [Carlini et al.2016] Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Micah Sherr, Clay Shields, David Wagner, and Wenchao Zhou. Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), pages 513–530, 2016.
 [Cisse et al.2017] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet. Houdini: Fooling deep structured prediction models. arXiv preprint arXiv:1707.05373, 2017.
 [Ebrahimi et al.2017] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017.
 [Gong and Poellabauer2017] Yuan Gong and Christian Poellabauer. Crafting adversarial examples for speech paralinguistics applications. arXiv preprint arXiv:1711.03280, 2017.
 [Gong and Poellabauer2018] Yuan Gong and Christian Poellabauer. An overview of vulnerabilities of voice controlled systems. arXiv preprint arXiv:1803.09156, 2018.
 [Goodfellow et al.2014] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 [Grosse et al.2017] Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial examples for malware detection. In European Symposium on Research in Computer Security, pages 62–79. Springer, 2017.
 [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [Li et al.2018] Shasha Li, Ajaya Neupane, Sujoy Paul, Chengyu Song, Srikanth V Krishnamurthy, Amit K Roy Chowdhury, and Ananthram Swami. Adversarial perturbations against real-time video classification systems. arXiv preprint arXiv:1807.00458, 2018.

 [MoosaviDezfooli et al.2016] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
 [MoosaviDezfooli et al.2017] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1765–1773, 2017.
 [Neekhara et al.2019] Paarth Neekhara, Shehzeen Hussain, Prakhar Pandey, Shlomo Dubnov, Julian McAuley, and Farinaz Koushanfar. Universal adversarial perturbations for speech recognition systems. arXiv preprint arXiv:1905.03828, 2019.
 [Qin et al.2019] Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison Cottrell, and Colin Raffel. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. arXiv preprint arXiv:1903.10346, 2019.

 [Ross et al.2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011.
 [Sainath and Parada2015] Tara Sainath and Carolina Parada. Convolutional neural networks for small-footprint keyword spotting. 2015.
 [Schönherr et al.2018] Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. arXiv preprint arXiv:1808.05665, 2018.

 [Storn and Price1997] Rainer Storn and Kenneth Price. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of global optimization, 11(4):341–359, 1997.
 [Su et al.2019] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
 [Sutton et al.1998] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
 [Szegedy et al.2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [Warden2018] Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
 [Yakura and Sakuma2018] Hiromu Yakura and Jun Sakuma. Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793, 2018.