Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior

03/17/2020 ∙ by Hu Zhang, et al. ∙ Amazon ∙ University of Technology Sydney

Deep neural networks are known to be susceptible to adversarial noise, i.e., tiny and imperceptible perturbations. Most previous work on adversarial attacks focuses on image models, while the vulnerability of video models is less explored. In this paper, we aim to attack video models by utilizing the intrinsic movement pattern and regional relative motion among video frames. We propose an effective motion-excited sampler to obtain a motion-aware noise prior, which we term the sparked prior. Our sparked prior underlines frame correlations and utilizes video dynamics via relative motion. By using the sparked prior in gradient estimation, we can successfully attack a variety of video classification models with fewer queries. Extensive experimental results on four benchmark datasets validate the efficacy of our proposed method.




1 Introduction

Despite the superior performance achieved in a variety of computer vision tasks, e.g., image classification [He2016resnet], object detection [renNIPS15fasterrcnn] and segmentation [he2017maskrcnn, chen2018deeplabv3plus], Deep Neural Networks (DNNs) have been shown to be susceptible to adversarial attacks: a well-trained DNN classifier may make severe mistakes because of a single imperceptible perturbation on a benign input, and suffer dramatic performance degradation. To investigate the vulnerability and robustness of DNNs, many effective attack methods have been proposed for image models. They consider either a white-box setting, where the adversary has full access to the model, including exact gradients for a given input, or a black-box setting, in which the structure and parameters of the model are hidden and the attacker can only access (input, output) pairs through queries.

Figure 1: (a) A pipeline of generating adversarial examples to attack a video model. (b) Loss curve comparison: i) Multi-noise: sample noise prior individually for each frame; ii) One-noise: sample one noise prior for all frames; iii) Sparked prior (ours): sample one noise prior for all frames and sparked by motion information. Loss is computed in attacking an I3D model on Kinetics-400 dataset. The lower loss indicates the better attacking performance. Our proposed sparked prior clearly outperforms (i) and (ii) in terms of attacking video models.

DNNs have also been widely applied in video tasks, such as video action recognition [Wang2016TSN, kay2017kinetics], video object detection [wu2019selsa], video segmentation [wang2019fast] and video inpainting [kim2019deep]. However, limited work has been done on attacking video models. A standard pipeline of video adversarial attack is shown in Fig. 1(a). Specially designed perturbations are estimated from a prior, which is normally random noise, and imposed on the clean video frames to generate adversarial examples. The goal is to use the adversarial examples to trick the model into giving a wrong prediction. Most of the literature [wei2018sparse, li2018adversarial] focuses on the white-box attack setting and simply transfers methods from the image domain to the video domain. Recently, [jiang2019black] proposed a black-box attack method that simply decodes a video into frames and transfers gradients from an image-pretrained model for each frame. All the aforementioned methods ignore the intrinsic difference between images and videos, namely the extra temporal dimension. This naturally leads to a question: should we use motion information in video adversarial attack?

In this work, we propose to use motion information for black-box attack on video models. In each optimization step, instead of directly using random noise as the prior, we first generate a motion map (i.e., motion vector or optical flow) between frames and construct a motion-excited sampler. The random noise is then selected by the sampler to obtain motion-aware noise for gradient estimation, which we term the sparked prior. In the end, we feed the original video frames and the sparked prior to the gradient estimator and use the estimated gradients to iteratively update the video. To show the effectiveness of using motion information, we perform a proof-of-concept comparison to two baselines. One initializes noise separately for each frame, extending [ilyas2018prior] (multi-noise); the other initializes one noise map and replicates it across all video frames (one-noise). We use the training loss curve to reflect the effectiveness of video attack methods, in which the loss measures the distance to a fake label. As we can see in Fig. 1(b), the loss of our proposed method (orange curve) drops significantly faster than that of the one-noise and multi-noise methods, which indicates that we need fewer queries to successfully attack a video model. This answers our previous question: we should use motion information in video adversarial attack. Our main contributions can be summarized as follows:

  • We find that simply transferring attack methods from image models to video models is less effective. Motion plays a key role in attacking video models.

  • We propose a motion-excited sampler to obtain sparked prior, which leads to more effective gradient estimation for faster adversarial optimization.

  • We perform thorough experiments on four video action recognition datasets against two models and show the efficacy of our proposed algorithm.

2 Related Work

Adversarial Attack. Adversarial examples have been well studied on image models. [szegedy2013intriguing] first shows that an adversarial sample, computed by imposing small noise on the original image, could lead to a wrong prediction. By defining a number of new losses, [carlini2017towards] demonstrates that previous defense methods do not significantly increase the robustness of neural networks. [papernot2017practical] first studies the black-box attack on image models by leveraging the transferability of adversarial examples; however, their success rate is rather limited. [ilyas2018black] extends Natural Evolution Strategies (NES) to do gradient estimation, and [ilyas2018prior] recently proposes to use time- and data-dependent priors to reduce queries in black-box image attack.

However, limited work has been done on attacking video models. In terms of white-box attack, [wei2018sparse] investigates the sparsity of adversarial perturbations and their propagation across video frames. [li2018adversarial] leverages a Generative Adversarial Network (GAN) to account for temporal correlations and generate adversarial samples for real-time video classification systems. [inkawhich2018adversarial] focuses on attacking the motion stream in a two-stream video classifier by extending [goodfellow2014explaining]. [chen2019appending] proposes to append a few dummy frames to attack different networks by optimizing a specially designed loss. The first black-box video attack method is proposed in [jiang2019black], which utilizes ImageNet-pretrained models to generate a tentative gradient for each video frame and uses NES to rectify it. More recently, [wei2019heuristic] and [yan2020sparse] focus on sparse perturbations applied only to selected frames and regions, instead of the whole video.

Our work is different from [jiang2019black] because we leverage the motion information directly in generating adversarial videos. We do not utilize ImageNet-pretrained models to generate a gradient for each video frame. Our work is also different from [wei2019heuristic, yan2020sparse] in terms of both problem setting and evaluation metric. We follow the setting of [jiang2019black] and treat the entire video as a whole, instead of attacking the video model from the perspective of frame difference in a sparse attack setting. We use attack success rate and consumed queries to evaluate our method instead of mean absolute perturbation (MAP). By using our proposed motion-aware sparked prior, we can successfully attack a number of video classification models using far fewer queries.

Video Action Recognition.

Recent methods for video action recognition can be categorized into two families depending on how they reason about motion information, i.e., 3D Convolutional Neural Networks (CNNs) [ji20123d, tran2015learning, carreira2017quo, wang2018nonlocal, Feichtenhofer2019slowfast] and two-stream networks [simonyan2014two, Feichtenhofer2016twofusion, Wang2016TSN, lin2019tsm]. 3D CNNs simply extend 2D filters to 3D filters in order to learn spatio-temporal representations directly from videos. Since early 3D models [ji20123d, tran2015learning] are hard to train, many follow-up works have been proposed [carreira2017quo, qiu2017p3d, tran2018r21d, Feichtenhofer2019slowfast]. Two-stream methods [simonyan2014two] train two separate networks, a spatial stream that takes RGB images as input and a temporal stream that takes stacked optical flow images as input. An early [Feichtenhofer2016twofusion] or late [Wang2016TSN] fusion is then performed to combine the results and make a final prediction. Although 3D CNNs and two-stream networks are two different families of methods, they are not mutually exclusive and can be combined. All aforementioned methods indicate the importance of motion information in video understanding.

Based on this observation, we believe motion information should benefit video adversarial attack. We thus propose the motion-excited sampler to generate a better prior for gradient estimation in a black-box attack setting. By incorporating motion information, our method shows superior attack performance on four benchmark datasets against two widely adopted video classification models.

3 Method

3.1 Problem Formulation

We consider a standard video action recognition task for attacking. Suppose the given videos have an underlying distribution denoted as $\mathcal{X}$, and a sample $x \in \mathbb{R}^{V \times H \times W \times C}$ and its corresponding label $y \in \{1, 2, \dots, K\}$ form a pair in $\mathcal{X}$, where $V$, $H$, $W$ and $C$ denote the number of video frames, height, width and channels of each frame respectively, and $K$ represents the number of categories. We denote the DNN model as a function $f_\theta$, where $\theta$ represents the model's parameters. The goal of a black-box adversarial attack is to find an adversarial example $x_{adv}$ with imperceptible difference from $x$ that fails the target model $f_\theta$ by querying the target model multiple times. It can be mathematically formulated as:

$$\arg\min_{x_{adv}} \; L(f_\theta(x_{adv}), y) \quad \text{s.t.} \quad \| x_{adv}^{i} - x^{i} \|_\rho \le \kappa, \;\; i = 0, 1, \dots, V-1; \qquad \#\text{queries} \le Q. \tag{1}$$

Here $i$ is the video frame index, starting from $0$ to $V-1$. $\| \cdot \|_\rho$ denotes the $\ell_\rho$ norm that measures how much perturbation is imposed, and $\kappa$ indicates the maximum perturbation allowed. $f_\theta(x_{adv})$ is the logits or probability returned by the target model $f_\theta$ when given an input video $x_{adv}$. The loss function $L(f_\theta(x_{adv}), y)$ measures the degree of certainty that the input $x_{adv}$ maintains its true class $y$. For simplicity, we shorten the loss function to $L(x_{adv}, y)$ in the rest of the paper since the model parameters remain unchanged. The goal is to minimize the certainty and successfully fool the classification model. The first constraint enforces high similarity between the clean video $x$ and its adversarial version $x_{adv}$, and the second imposes a fixed budget $Q$ on the number of queries used in the optimization. Hence, the fewer queries required to produce an adversarial video and the higher the overall success rate within the $\kappa$ perturbation, the better the attack method. The overview of our method is shown in Fig. 2.
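The two constraints of the formulation above can be made concrete with a small sketch. It assumes NumPy, an $\ell_\infty$ perturbation bound, and videos stored as arrays of shape (frames, height, width, channels); the names `QueryCounter` and `within_linf_budget` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

class QueryCounter:
    """Wraps a black-box model and counts how many times it is queried."""

    def __init__(self, model, budget):
        self.model = model      # any callable: video -> scores (assumed interface)
        self.budget = budget    # maximum number of queries allowed
        self.used = 0

    def __call__(self, video):
        if self.used >= self.budget:
            raise RuntimeError("query budget exhausted")
        self.used += 1
        return self.model(video)

def within_linf_budget(x_adv, x, kappa):
    """True if every frame of x_adv stays within kappa of x (per-frame l_inf norm)."""
    # Axis 0 is the frame axis; reduce over height, width and channels per frame.
    per_frame = np.abs(x_adv - x).reshape(x.shape[0], -1).max(axis=1)
    return bool(np.all(per_frame <= kappa + 1e-12))
```

Wrapping the model this way keeps the query accounting in one place, so an attack loop can simply stop when the wrapper raises.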

Figure 2: (a): Overview of our framework for black-box video attack. i) Compute motion maps from the given video frames; ii) Generate the sparked prior from random noise by the proposed motion-excited sampler; iii) Estimate gradients by querying the black-box video model; iv) Use the estimated gradient to perform iterative projected gradient descent (PGD) optimization on the video. (b): Illustration of the Motion-Excited Sampler. To determine the pixel value at a given location of the sparked prior, we first read the motion value from the motion map at that location. We then select the pixel value of the random noise at the coordinates given by that motion value and put it into the given location of the sparked prior.

3.2 Motion Map Generation

In order to incorporate motion information for video adversarial attack, we need to find an appropriate motion representation. There are two widely adopted motion representations in video analysis domain [simonyan2014two, Wang2016TSN], motion vector and optical flow. Both of them can reflect the pixel intensity changes between two adjacent video frames. In this work, we adopt the accumulated motion vector [wu2018compressed] and the TVL1 flow [chambolle2004algorithm] as the motion representation. We first describe accumulated motion vector as below.

Most modern video codecs divide a video into several intervals and split frames into I-frames (intracoded frames) and P-frames (predictive frames). Here, we denote the number of intervals as $N$ and the length of an interval as $T$. In each interval, the first frame is an I-frame and the rest are P-frames. The accumulated motion vector is formulated as the motion vector of each P-frame that traces back to the initial I-frame instead of depending on previous P-frames.

Suppose the accumulated motion vector in frame $\tau$ of interval $n$ is denoted as $m^{(\tau, n)} \in \mathbb{R}^{H \times W \times 2}$. Each pixel value at location $p$ in $m^{(\tau, n)}$ can be computed as:

$$m_p^{(\tau, n)} = p - B_p^{(\tau, n)}, \tag{2}$$

where $B_p^{(\tau, n)}$ represents the location traced back to the initial I-frame from the $\tau$-th frame in interval $n$. We refer the readers to [wu2018compressed] for more details on accumulated motion vectors.

For an interval with $T$ frames, we can obtain $T-1$ accumulated motion vectors. We only choose the last one since it traces all the way back to the I-frame in this interval and is the most motion-informative. We abbreviate it as $m^{(n)}$ and thus have a set of $N$ accumulated motion vectors for the whole video, denoted as $\mathcal{M} = \{m^{(1)}, \dots, m^{(N)}\}$.

Optical flow is a motion representation that is similar to the motion vector and has the same dimension $H \times W \times 2$. It also contains spatial details but is not directly available in compressed videos. Here we use the TVL1 algorithm [chambolle2004algorithm] to compute the flow given its wide adoption in video-related applications [Wang2016TSN, carreira2017quo, Feichtenhofer2019slowfast]. We apply the same selection strategy as for motion vectors and also obtain $N$ flow vectors. The set of flow vectors is also denoted as $\mathcal{M}$ for simplicity.
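The back-tracing idea behind accumulated motion vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the codec-level procedure of [wu2018compressed]: per-frame motion vectors are assumed to be integer (row, column) displacements pointing to the previous frame, out-of-range references are clipped to the frame border, and the function name `accumulate_motion` is hypothetical.

```python
import numpy as np

def accumulate_motion(per_frame_mv):
    """Accumulate per-frame motion vectors back to the first frame of an interval.

    per_frame_mv: int array of shape (T-1, H, W, 2). Entry [t, i, j] = (dy, dx)
    means pixel (i, j) of frame t+1 references pixel (i-dy, j-dx) of frame t.
    Returns an (H, W, 2) array: displacement of the last frame relative to frame 0.
    """
    _, H, W, _ = per_frame_mv.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([rows, cols], axis=-1)   # each pixel's own location
    traced = grid.copy()                     # frame 0 traces back to itself
    for mv in per_frame_mv:
        ref = grid - mv                      # referenced location in the previous frame
        ref[..., 0] = np.clip(ref[..., 0], 0, H - 1)
        ref[..., 1] = np.clip(ref[..., 1], 0, W - 1)
        # Follow the chain: where did the referenced pixel itself come from?
        traced = traced[ref[..., 0], ref[..., 1]]
    return grid - traced                     # location minus traced-back location
```

Only the map returned for the last frame of each interval would be kept, giving one motion map per interval.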

3.3 Motion-Excited Sampler

In a black-box setting, random noise is normally employed in generating adversarial examples. However, as stated before in Fig. 1(b), direct usage of random noise is not promising in video attack. To tackle this problem, we propose to involve motion information in the process of generating adversarial examples, and thus propose motion-excited sampler. In this section, we describe the details of the designed sampler.

First, we define the operation of the motion-excited sampler (ME-Sampler) as

$$r_s = \text{ME-Sampler}(r, m), \tag{3}$$

where $r$ denotes the initial random noise and the motion maps $m$ are selected without replacement from the set $\mathcal{M}$ introduced in Section 3.2. $r_s$ is the transformed motion-aware noise, which we term the sparked prior.

To be specific, we use the motion-excited sampler to “warp” the random noise by motion. It is not just rearranging the pixels in the random noise, but constructing a completely new prior given the motion information. For simplicity, we only consider the operation for one frame here; it is straightforward to extend it to multiple frames. With a slight abuse of notation, we still use $r$, $m$ and $r_s$ for the single-frame case in this section.

At the $p$-th location of the motion map $m$, we denote the motion vector value as $m_p$ and its coordinate as $(a_p, b_p)$, i.e., $m_p = m[a_p, b_p]$. Here, $m_p = (u_p, v_p)$ has two values, which indicate the horizontal and vertical relative movements, respectively. When computing the value at position $(a_p, b_p)$ in the sparked prior $r_s$, $(u_p, v_p)$ serves as the new coordinates for searching in the original random noise. The corresponding noise value $r[u_p, v_p]$ is assigned as the value of $r_s[a_p, b_p]$. Thus, we have:

$$r_s[a_p, b_p] = r[u_p, v_p].$$

We give a simplified example in Fig. 2(b) to show how our motion-excited sampler works. To determine the pixel value at a given location of $r_s$, we first read the motion value $(u, v)$ from the motion map at that location. We then select the pixel value located at $(u, v)$ in $r$ and put it into the given location of $r_s$.
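The sampling step described above can be sketched in a few lines of NumPy for a single frame. Two details are assumptions made here, since the text does not specify them: motion values are rounded to integers, and coordinates that fall outside the frame (including negative values) wrap around. The function name `me_sampler` and the (horizontal, vertical) channel order are likewise illustrative.

```python
import numpy as np

def me_sampler(noise, motion):
    """Build a motion-aware prior by looking up noise at motion-given coordinates.

    noise:  (H, W, C) random noise for one frame.
    motion: (H, W, 2) motion map; channel 0 = horizontal, channel 1 = vertical.
    Returns an (H, W, C) array whose value at each location is the noise value
    found at the coordinates stored in the motion map at that location.
    """
    H, W = noise.shape[:2]
    # Assumption: round to the nearest pixel and wrap out-of-range coordinates.
    u = np.rint(motion[..., 0]).astype(int) % W   # horizontal -> column index
    v = np.rint(motion[..., 1]).astype(int) % H   # vertical   -> row index
    return noise[v, u]                            # integer-array (fancy) indexing
```

For a multi-frame video of shape (V, H, W, C), the same lookup can be applied to all frames at once with `noise[:, v, u]`.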

The insight behind our motion-excited sampler is that when we do sampling and re-combination of selected pixels from random noise, we are incorporating the key information from motion into the generated sparked prior. Since motion information has been experimentally proved to be critical in distinguishing different actions, using sparked prior is likely to better identify the weakness of a video model and largely improve the final attack performance.

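Input: video $x$, its label $y$, number of video frames $V$, current estimate $g$, interval t for sampling a new motion map, $\delta$ for loss variation and $\epsilon$ for approximation.
1:  if loop % t = 0 then
2:     motion maps are chosen from $\mathcal{M}$ without replacement and concatenated to be $m$;
3:  end if
4:  sample random noise $r$;
5:  $r_s = \text{ME-Sampler}(r, m)$;
6:  $w_1 = g + \delta r_s$;
7:  $w_2 = g - \delta r_s$;
8:  $l(w_1) = -\langle \nabla_x L(x, y), w_1 \rangle \approx \frac{L(x, y) - L(x + \epsilon w_1, y)}{\epsilon}$;
9:  $l(w_2) = -\langle \nabla_x L(x, y), w_2 \rangle \approx \frac{L(x, y) - L(x + \epsilon w_2, y)}{\epsilon}$;
10:  $\Delta = \frac{l(w_1) - l(w_2)}{\delta} \, r_s$;
Output: $\Delta$.
Algorithm 1: Estimate $\Delta$, the gradient of $l(g)$.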

3.4 Gradient Estimation and Optimization

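Once we have the sparked prior, we combine it with the input video and feed them to the black-box model to estimate the gradients. We consider $\ell_\infty$ noise in our paper following [jiang2019black], but our framework also applies to other norms.

Similar to [ilyas2018prior], rather than directly estimating the gradient for generating the adversarial video, we search for it by iterative updating. The loss function designed for such optimization is

$$l(g) = -\langle \nabla_x L(x, y), g \rangle, \tag{4}$$

where $\nabla_x L(x, y)$ is the groundtruth gradient we desire and $g$ is the gradient to be estimated. An intuitive observation from this loss function is that iterative minimization of Eq. (4) will drive our estimated gradient closer to the true gradient.

We denote by $\Delta$ the gradient of the loss $l(g)$ with respect to $g$. We perform a two-query estimation of the expectation and apply antithetic sampling to get

$$\Delta = \frac{l(g + \delta r_s) - l(g - \delta r_s)}{\delta} \, r_s, \tag{5}$$

where $\delta$ is a small number adjusting the magnitude of loss variation. By substituting Eq. (4) into Eq. (5), we have

$$\Delta = \frac{\langle \nabla_x L(x, y), g - \delta r_s \rangle - \langle \nabla_x L(x, y), g + \delta r_s \rangle}{\delta} \, r_s. \tag{6}$$

In the context of the finite difference method, given the function $L$ at a point $x$ and a direction $w$, the directional derivative can be approximated as:

$$\langle \nabla_x L(x, y), w \rangle \approx \frac{L(x + \epsilon w, y) - L(x, y)}{\epsilon}, \tag{7}$$

where $\epsilon$ is a small constant for approximation. By combining Eq. (6)-(7), we have

$$\Delta = \frac{L(x + \epsilon w_2, y) - L(x + \epsilon w_1, y)}{\delta \epsilon} \, r_s, \tag{8}$$

with $w_1 = g + \delta r_s$ and $w_2 = g - \delta r_s$. The resulting algorithm for estimating the gradient of $l(g)$ is shown in Algorithm 1.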
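The two-query estimate described in this section can be sketched as below. It follows the antithetic finite-difference scheme of [ilyas2018prior]; the exact expression is a reconstruction from the surrounding description, and `loss_fn`, `prior`, `delta` and `eps` are illustrative names and defaults rather than values from the paper.

```python
import numpy as np

def estimate_delta(loss_fn, x, y, g, prior, delta=0.1, eps=0.1):
    """Two-query estimate of the gradient of l(g) = -<grad L(x, y), g>.

    loss_fn(video, label) -> float is the only access to the black-box model.
    prior is the search direction, e.g. a motion-aware noise prior.
    """
    w1 = g + delta * prior
    w2 = g - delta * prior
    # Each loss_fn call costs one query; the unperturbed loss cancels out.
    loss_diff = loss_fn(x + eps * w2, y) - loss_fn(x + eps * w1, y)
    return loss_diff / (delta * eps) * prior
```

For a loss that is linear in the input, with gradient `c`, the estimate equals `-2 * <c, prior> * prior` exactly, so a descent step on `g` moves it toward `c` along the sampled direction.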

Once we have Algorithm 1, we can use it to update the estimated gradient and optimize the adversarial video. To be specific, in iteration $j$, $\Delta_j$ is returned by Algorithm 1. We update $g_j$ by simply applying one-step gradient descent: $g_j = g_{j-1} - \eta \Delta_j$, where $\eta$ is a hyperparameter (step size) for updating $g_j$. The updated $g_j$ is the gradient we use for generating adversarial videos. Finally, we combine our estimated $g_j$ with projected gradient descent (PGD) to translate our gradient estimation algorithm into an efficient video adversarial attack method. The detailed procedure is shown in Algorithm 2, in which $\arg\max[f_\theta(\cdot)]$ returns the top predicted class label and $\text{CLIP}(\cdot)$ constrains the updated video $x_j$ to stay close to the original video $x_0$, where $x_0 - \kappa$ is the lower bound and $x_0 + \kappa$ the upper bound. $\kappa$ is the noise constraint in Eq. (1).

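Input: original video $x$, its label $y$, learning rate $h$ for updating the adversarial video.
1:  $x_0 = x$, initially estimated $g_0 = 0$, initial loop $j = 1$;
2:  while $\arg\max[f_\theta(x_{j-1})] = y$ do
3:     $\Delta_j \leftarrow$ Algorithm 1 applied to $(x_{j-1}, y, g_{j-1})$;
4:     $g_j = g_{j-1} - \eta \Delta_j$;
5:     $x_j = x_{j-1} - h \cdot \text{sign}(g_j)$;
6:     $x_j = \text{CLIP}(x_j, \; x_0 - \kappa, \; x_0 + \kappa)$;
7:     $j = j + 1$;
8:  end while
Output: $x_{j-1}$.
Algorithm 2: Adversarial Example Optimization for $\ell_\infty$ norm perturbations.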
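The optimization loop can be sketched as follows, assuming an $\ell_\infty$ bound and a signed gradient step. The callables `predict` and `grad_est` are hypothetical stand-ins for the black-box model and the gradient-estimation routine, and `max_iters` is an added safety limit; none of these names come from the paper.

```python
import numpy as np

def pgd_attack(predict, grad_est, x, y, kappa, h, eta, max_iters):
    """Iterate until the predicted label differs from y or max_iters is reached.

    predict(video) -> class scores; grad_est(video, label, g) -> update for g.
    Returns the final video and whether the attack succeeded.
    """
    x_adv = x.copy()
    g = np.zeros_like(x)
    for _ in range(max_iters):
        if int(np.argmax(predict(x_adv))) != y:
            return x_adv, True
        g = g - eta * grad_est(x_adv, y, g)            # refine the gradient estimate
        x_adv = x_adv - h * np.sign(g)                 # signed step that lowers the loss
        x_adv = np.clip(x_adv, x - kappa, x + kappa)   # stay within kappa of the original
    return x_adv, int(np.argmax(predict(x_adv))) != y
```

Passing the estimator in as a callable keeps the loop independent of how the search direction is sampled, which makes it easy to swap random noise for a motion-aware prior.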

3.5 Loss Function

Instead of applying the cross-entropy loss directly, we adopt the idea in [carlini2017towards] and design a logits-based loss. Here, the logits returned from the black-box model are denoted as $z \in \mathbb{R}^{K}$, where $K$ is the number of classes. We denote the class with the largest value in the logits as $y$, so the largest logits value is $z_y$. The final loss can be obtained as $L = \max(z_y - \max_{k \neq y} z_k, 0)$. Minimizing $L$ is expected to confuse the model with the second most confident class prediction so that our adversarial attack can succeed.
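A margin loss of this kind can be sketched in plain Python. The exact formula is an assumption in the style of [carlini2017towards]: the gap between the score of class `y` and the best competing score, floored at zero. The function name is illustrative.

```python
def logits_margin_loss(logits, y):
    """Margin between the score of class y and the best other class, floored at 0.

    logits: sequence of per-class scores returned by the model.
    y: index of the class whose confidence the attack wants to reduce.
    """
    best_other = max(v for k, v in enumerate(logits) if k != y)
    return max(logits[y] - best_other, 0.0)
```

The loss reaches zero exactly when some other class scores at least as high as `y`, which is the point at which an untargeted attack succeeds.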

4 Experiments

In this section, we provide a comprehensive evaluation of our proposed method for video attacks against two widely adopted video classification models on four benchmark video datasets.

Figure 3: Examples of motion vectors in generating adversarial samples. In (a)-(d), the first row is the original video frame, the second row is the motion vector and the third row is the generated adversarial video frame. a) Kinetics-400 on I3D: Abseiling → Rock climbing; b) UCF-101 on I3D: Biking → Walking with dog; c) Kinetics-400 on TSN: Playing bagpipes → Playing accordion; d) UCF-101 on TSN: Punching → Lunges.

4.1 Experimental Setting

Datasets. We perform video attack on four video action recognition datasets: UCF-101 [soomro2012ucf101], HMDB-51 [kuehne2011hmdb], Kinetics-400 [kay2017kinetics] and Something-Something V2 [goyal2017something]. UCF-101 consists of 13,320 videos spanning 101 action classes. HMDB-51 includes 6,766 videos in 51 classes. Kinetics-400 is a large-scale dataset which has around 300K videos in 400 classes. Something-Something V2 is a recent crowd-sourced video dataset on human-object interaction which needs more temporal reasoning. It contains 220,847 video clips in 174 classes. For notational simplicity, we use SthSth-V2 to represent Something-Something V2.

Video Models. We choose two video action recognition models, I3D [kay2017kinetics] and TSN2D [Wang2016TSN], as our black-box models. For I3D on Kinetics-400 and SthSth-V2, we train from ImageNet-initialized weights. For I3D on UCF-101 and HMDB-51, we train with Kinetics-400 pretrained parameters as initialization. For TSN2D, we use ImageNet-initialized weights on all four datasets. On SthSth-V2, we use 8 segments following [lin2019tsm], and for the other three datasets, we use 3 segments following [Wang2016TSN]. At test time, we divide each video into 10 segments for I3D and 25 segments for TSN2D on UCF-101, HMDB-51 and Kinetics-400. For SthSth-V2, we use 2 segments for both I3D and TSN2D. The test accuracy of the two models on the four datasets can be found in Table 1.

Model | Kinetics-400 | UCF-101 | HMDB-51 | SthSth-V2
I3D | 70.11 | 93.55 | 68.30 | 50.25
TSN2D | 68.87 | 86.04 | 54.83 | 35.11
Table 1: Test accuracy (%) of the video models.

Attack Setting. In this work, we perform both untargeted and targeted attacks. An untargeted attack requires the given video to be misclassified to any wrong label, and a targeted attack requires it to be classified as a specific label. For each dataset, we fix the random seed and randomly select one video from each category following the setting in [jiang2019black]. All selected videos are correctly classified by the black-box model. For all videos, we impose $\ell_\infty$ noise on video frames whose pixels are normalized to 0-1. We bound the maximum perturbation $\kappa$ separately for each setting, with a query budget of Q = 60,000 for untargeted attack and Q = 200,000 for targeted attack. If the attack on a video fails within these constraints, we record its consumed queries as the upper bound Q.

Evaluation Metric. We use the average number of queries (ANQ) required to generate effective adversarial examples and the attack success rate (SR) to measure attack performance. A smaller ANQ and a higher SR are preferred.
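The two metrics can be computed as in the short sketch below, which charges every failed video the full query budget as described in the attack setting. The function name `summarize` and its argument layout are illustrative.

```python
def summarize(queries_per_video, succeeded, budget):
    """Return (ANQ, SR in percent) over a set of attacked videos.

    queries_per_video: queries consumed by each attack.
    succeeded: matching list of booleans, True if the attack succeeded.
    budget: query limit Q; failed attacks are recorded as consuming Q queries.
    """
    charged = [q if ok else budget for q, ok in zip(queries_per_video, succeeded)]
    anq = sum(charged) / len(charged)
    sr = 100.0 * sum(succeeded) / len(succeeded)
    return anq, sr
```

Because failures count as the full budget, a low success rate also inflates ANQ, so the two numbers should be read together.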

4.2 Comparison to State-of-the-Art

We report the effectiveness of our proposed method across a variety of different models and datasets in Table 2. We present the results of leveraging two kinds of motion representations: Motion Vector (MV) and Optical Flow (OF) in our proposed method. In comparison, we first show the attacking performance of V-BAD [jiang2019black] under our video models since V-BAD is the only directly comparable method. We also extend two image attack methods [ilyas2018black, ilyas2018prior] as strong baselines to video to demonstrate the advantage of using motion information. They are denoted as E-NES and E-Bandits respectively and their attacking results are shown in Table 2.

Overall, our attack method using motion information achieves promising success rates across datasets and models. On SthSth-V2 and HMDB-51, we achieve 100% SR with motion vectors. On Kinetics-400 and UCF-101, we also get over 97% SR with motion vectors. The number of queries used in attacking these models is also encouraging. One observation worth noting is that it takes only hundreds of queries to completely break the models on SthSth-V2. On HMDB-51 against TSN2D, we need no more than 900 queries as well. For the rest of the models, we consume somewhat more queries. Comparing with Table 1, we observe that models that require more queries often have higher recognition accuracy. From this observation, we conjecture that a model is likely to be more robust if its performance on clean video is better.

Dataset | Method | I3D ANQ | I3D SR (%) | TSN2D ANQ | TSN2D SR (%)
SthSth-V2 | E-NES [ilyas2018black] | 11,552 | 86.96 | 1,698 | 99.41
SthSth-V2 | E-Bandits [ilyas2018prior] | 968 | 100.0 | 435 | 99.41
SthSth-V2 | V-BAD [jiang2019black] | 7,239 | 97.70 | 495 | 100.0
SthSth-V2 | ME-Sampler (OF) | 735 | 98.90 | 315 | 100.0
SthSth-V2 | ME-Sampler (MV) | 592 | 100.0 | 244 | 100.0
HMDB-51 | E-NES [ilyas2018black] | 13,237 | 84.31 | 19,407 | 76.47
HMDB-51 | E-Bandits [ilyas2018prior] | 4,549 | 99.80 | 4,261 | 100.0
HMDB-51 | V-BAD [jiang2019black] | 5,064 | 100.0 | 2,405 | 100.0
HMDB-51 | ME-Sampler (OF) | 3,306 | 100.0 | 842 | 100.0
HMDB-51 | ME-Sampler (MV) | 3,915 | 100.0 | 831 | 100.0
Kinetics-400 | E-NES [ilyas2018black] | 11,423 | 89.30 | 20,698 | 71.93
Kinetics-400 | E-Bandits [ilyas2018prior] | 3,697 | 99.00 | 6,149 | 97.50
Kinetics-400 | V-BAD [jiang2019black] | 4,047 | 99.75 | 2,623 | 99.75
Kinetics-400 | ME-Sampler (OF) | 3,415 | 99.30 | 2,631 | 98.80
Kinetics-400 | ME-Sampler (MV) | 2,717 | 99.00 | 1,715 | 99.75
UCF-101 | E-NES [ilyas2018black] | 23,531 | 69.23 | 41,328 | 34.65
UCF-101 | E-Bandits [ilyas2018prior] | 10,590 | 89.10 | 24,890 | 66.33
UCF-101 | V-BAD [jiang2019black] | 8,819 | 97.03 | 17,638 | 91.09
UCF-101 | ME-Sampler (OF) | 6,101 | 96.00 | 6,598 | 97.00
UCF-101 | ME-Sampler (MV) | 4,748 | 98.02 | 5,353 | 99.00
Table 2: Untargeted attacks on SthSth-V2, HMDB-51, Kinetics-400, UCF-101. The attacked models are I3D and TSN2D. For each model, the first column is the average number of queries (ANQ) and the second is the success rate (SR, %). “ME-Sampler” denotes the results of our method. “OF” denotes Optical Flow. “MV” denotes Motion Vector.

Comparing motion vector and optical flow, we find that motion vector outperforms optical flow in most cases, e.g., 5,353 (MV) vs 6,598 (OF) queries on UCF-101 against TSN2D. A likely reason is that the motion vector gives a clearer motion region since it is computed per small block of pixels, whereas optical flow is calculated per pixel. When the region used for describing motion is larger, it is easier to portray the overall motion accurately. When each pixel is considered on its own, the movement is more likely to be lost in tracking, which introduces errors.

Compared to E-NES and E-Bandits, we achieve much better results in both consumed queries and success rate. For example, when attacking a TSN2D model on UCF-101, our success rate is 99.00%, which is much higher than 34.65% for E-NES and 66.33% for E-Bandits. Our 5,353 queries are also far fewer than their 41,328 and 24,890. When compared to V-BAD, our method also requires far fewer queries. For example, against I3D models, the better of our two variants saves 1,758 queries on HMDB-51, 1,330 on Kinetics-400 and 4,071 on UCF-101. Meanwhile, we achieve a slightly better success rate: 98.02% vs 97.03% on UCF-101 and 100.0% vs 97.70% on SthSth-V2.

Finally, we show visualizations of adversarial frames on Kinetics-400 and UCF-101 in Fig. 3. We note that the generated video differs little from the original one but leads to a failed prediction. More visualizations can be found in the supplementary materials.

Figure 4: (a) SthSth-V2: Comparisons of targeted attack on SthSth-V2 with V-BAD: i) Average queries consumed by I3D and TSN2D; ii) Success rate achieved by I3D and TSN2D. (b) HMDB-51: Comparisons of targeted attack on HMDB-51 with V-BAD: i) Average queries consumed by I3D and TSN2D; ii) Success rate achieved by I3D and TSN2D.

4.3 Targeted Attack

In this section, we report the results of targeted attack on HMDB-51 and SthSth-V2 in Fig. 4. On SthSth-V2 (Fig. 4(a)), our method consumes fewer than 25,000 queries using either motion vector or optical flow, whereas V-BAD needs 71,791 queries against the I3D model and 52,182 against the TSN2D model. Our success rate is also about 6% higher on the TSN2D model, despite the far fewer queries. On HMDB-51 against I3D (Fig. 4(b)), we also outperform V-BAD by saving more than 10,000 queries while achieving a comparable success rate. For TSN2D, we require only half of the queries that V-BAD consumes while achieving a much higher success rate, i.e., 92.16% vs 64.86%.

Combining with the untargeted results, we conclude that our method is more effective in generating adversarial videos than the comparing baselines.

4.4 Ablation Study

In this section, we first show the necessity of motion information and then demonstrate that it is the movement pattern in the motion map that contributes to the attacking. We also study the effect of different losses. Experiments in this section are conducted on a subset of 30 randomly selected categories from UCF-101 and 80 from Kinetics-400 by following the setting in [jiang2019black].

Dataset | Method | I3D ANQ | I3D SR (%) | TSN2D ANQ | TSN2D SR (%)
Kinetics-400 | Multi-noise | 11,416 | 95.00 | 15,966 | 89.87
Kinetics-400 | One-noise | 8,258 | 96.25 | 8,392 | 96.25
Kinetics-400 | Ours | 3,089 | 100.0 | 2,494 | 100.0
UCF-101 | Multi-noise | 15,798 | 90.00 | 30,337 | 70.00
UCF-101 | One-noise | 22,908 | 93.33 | 16,620 | 90.00
UCF-101 | Ours | 6,876 | 100.0 | 8,399 | 100.0
Table 3: Comparison to cases without motion information.
Dataset | Method | I3D ANQ | I3D SR (%) | TSN2D ANQ | TSN2D SR (%)
Kinetics-400 | U-Sample | 10,250 | 96.25 | 9,166 | 96.20
Kinetics-400 | S-Value | 8,610 | 98.75 | 8,429 | 96.20
Kinetics-400 | Ours | 3,089 | 100.0 | 2,494 | 100.0
UCF-101 | U-Sample | 13,773 | 93.33 | 17,718 | 83.33
UCF-101 | S-Value | 11,471 | 96.67 | 17,116 | 86.67
UCF-101 | Ours | 6,876 | 100.0 | 8,399 | 100.0
Table 4: Comparison of motion map with two handcrafted maps.

The necessity of motion information. As mentioned in Section 1, we evaluate two cases without using motion information: 1) Multi-noise: directly introducing random noise for each frame independently; 2) One-noise: introducing only one random noise map and replicating it to all frames. We record their performance and compare it to ours in Table 3. The results show that methods without motion tend to spend more queries and suffer a lower success rate. On UCF-101 against the I3D model, ‘Multi-noise’ consumes more than twice as many queries as our method, and its success rate is 10% lower. On UCF-101 against TSN2D, the success rate drops by 30% for ‘Multi-noise’ and by 10% for ‘One-noise’. On Kinetics-400, the decrease in success rate is smaller for the two methods, but against the I3D model ‘One-noise’ still needs nearly 3 times as many queries as ours and ‘Multi-noise’ nearly 4 times as many. The gap on the TSN2D model is even larger: ‘Multi-noise’ needs more than 6 times as many queries and ‘One-noise’ more than 3 times as many. Such big gaps between methods without motion information and ours indicate that our mechanism for utilizing motion information plays an important role in directing effective gradient generation for an improved search of adversarial videos.
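The two motion-free baselines can be sketched as follows. The Gaussian distribution and the function names are assumptions made for illustration; the text only says that random noise is drawn per frame or once for all frames.

```python
import numpy as np

def multi_noise_prior(rng, V, H, W, C):
    """Multi-noise: an independent noise map for every one of the V frames."""
    return rng.standard_normal((V, H, W, C))

def one_noise_prior(rng, V, H, W, C):
    """One-noise: a single noise map replicated across all V frames."""
    frame = rng.standard_normal((H, W, C))
    return np.broadcast_to(frame, (V, H, W, C)).copy()
```

Either array can be used as the search direction of a two-query gradient estimate, which is what makes these baselines directly comparable to a motion-aware prior.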

Why does motion information help? To answer this question, we replace the pixel values in the original motion map with uniformly sampled values and with sequenced values.

We first define a region map, which has the same size as the original motion map. Its pixel values are 1 where the corresponding pixels in the original motion map are nonzero, and 0 elsewhere. We describe the two settings here. “Uniformly Sample” (U-Sample): a new map is created whose pixel values are uniformly sampled from [0, 1] and scaled to [-50, 50]. The region map and this new map are multiplied element-wise to replace the original motion map. “Sequenced Value” (S-Value): a map whose pixel values are arranged in order, starting from 0, 1, 2, from left to right and from top to bottom. The region map and this sequenced map are multiplied element-wise to serve as a replacement for the original motion vector.
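The two handcrafted maps can be sketched in NumPy as below. Several details are assumptions, since the text leaves them open: the region map is computed per element of the two-channel motion map, the linear rescaling from [0, 1] to [-50, 50] is `(2u - 1) * 50`, and the same row-major sequence is used for both channels. All function names are illustrative.

```python
import numpy as np

def region_map(motion):
    """1 where the original motion map is nonzero, 0 elsewhere."""
    return (motion != 0).astype(float)

def u_sample_map(motion, rng, scale=50.0):
    """U-Sample: uniform values in [0, 1], rescaled to [-scale, scale], masked."""
    u = rng.uniform(0.0, 1.0, size=motion.shape)
    return region_map(motion) * (2.0 * u - 1.0) * scale

def s_value_map(motion):
    """S-Value: values 0, 1, 2, ... from left to right, top to bottom, masked."""
    H, W = motion.shape[:2]
    seq = np.arange(H * W, dtype=float).reshape(H, W)
    seq = np.repeat(seq[..., None], motion.shape[2], axis=-1)  # same for each channel
    return region_map(motion) * seq
```

Both maps keep the spatial support of the real motion but discard its movement pattern, which is what isolates the contribution of the pattern itself.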

The results can be seen in Table 4. We first notice that ‘S-Value’ slightly outperforms ‘U-Sample’. On Kinetics-400 against I3D, ‘S-Value’ saves more than 1,000 queries and gets a 2.5% higher success rate. On UCF-101 against I3D, ‘S-Value’ saves more than 2,000 queries and gets a roughly 3% higher success rate. We attribute this to the gradual change of pixel values in ‘S-Value’, as opposed to irregular change. As for our method, on Kinetics-400 it needs nearly 3 times fewer queries than ‘S-Value’ and ‘U-Sample’ while achieving up to a 3.8% higher success rate. On UCF-101 with I3D, our method saves at least 4,595 queries and improves the success rate by up to 6.67%. Against TSN2D, 9,319 queries are saved and a 16.67% higher success rate is obtained compared to ‘U-Sample’. Through this comparison, we conclude that the movement pattern in the motion map is the key factor in improving attack performance.

Dataset        Method          I3D                          TSN2D
                               Queries   Success rate (%)   Queries   Success rate (%)
Kinetics-400   Cross-Entropy   3,452     98.75              2,248     100.0
               Probability     3,089     100.0              2,494     100.0
               Logits          2,423     100.0              1,780     100.0
UCF-101        Cross-Entropy   17,362    80.00              17,992    73.26
               Probability     13,217    90.00              14,842    81.19
               Logits          6,876     100.0              6,182     100.0
Table 5: Comparison of losses based on Cross-Entropy, Probability, Logits.

Comparison of different losses. We study the effect of three different losses for optimization: the cross-entropy loss, the logits-based loss in Section 3.5, and a probability-based loss. For the last one, we convert logits to probabilities with Softmax and construct the loss with respect to the class with the largest probability value. From the results in Table 5, we conclude that the logits-based loss always performs better, while the other two are less effective. We also observe that Kinetics-400 is less sensitive to the choice of optimization loss than UCF-101.
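To make the three signals concrete, the sketch below computes each of them from one logit vector. The exact formulas for the logits-based and probability-based losses are not reproduced in this section, so the margin form used here (score of the labelled class minus the best competing class) is an assumption borrowed from common practice, not a statement of the definitions in Section 3.5. All function names are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable Softmax over a 1D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy_loss(logits, label):
    """Standard cross-entropy of the labelled class."""
    return -np.log(softmax(logits)[label])


def margin(scores, label):
    """Assumed margin form: labelled-class score minus best other score."""
    others = np.delete(scores, label)
    return scores[label] - np.max(others)


def logits_margin(logits, label):
    """Margin computed directly on raw logits (assumed form)."""
    return margin(logits, label)


def probability_margin(logits, label):
    """Same margin, computed on Softmax probabilities (assumed form)."""
    return margin(softmax(logits), label)


logits = np.array([2.0, 1.0, 0.1])
label = 0
print(cross_entropy_loss(logits, label))
print(logits_margin(logits, label))
print(probability_margin(logits, label))
```

One design observation that holds regardless of the exact formula: Softmax squashes scores into [0, 1], so a probability-based signal saturates when the model is very confident, whereas raw logits keep changing.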

5 Conclusion

In this paper, we consider the black-box adversarial attack on video models. We find that directly transferring attack methods from images to videos is less effective, and we hypothesize that motion information between video frames plays a major role in misleading video action recognition models. We therefore propose a motion-excited sampler to generate a sparked prior and obtain significantly better attack performance. We perform extensive ablation studies to investigate why motion helps in attacking videos and demonstrate superior performance over previous methods. We hope that our work offers a new direction for studying adversarial attacks on video models and some insight into the differences between videos and images.


Appendix A Implementation details

For the accumulated motion vector, we follow the setting in [wu2018compressed] and set each interval within a video to 12 frames. There is no overlap between two adjacent intervals. For each interval, we generate one accumulated motion map. During the attack, we add noise to the original video frames, which are normalized to [0,1]. The modified videos are processed with standard ‘mean’ and ‘std’ normalization before being fed to the black-box model for gradient estimation. Each iteration consumes three queries: two queries in Algorithm 1 and one to determine whether the updated video succeeds in attacking the model. In the untargeted attack setting, we set the query limit to 60,000, so the maximal number of iterations for updating the adversarial video is 20,000. In the targeted attack setting, we set the query limit to 200,000, so the maximal number of iterations is 66,667 (200,000/3, rounded up).
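The interval layout and the query budget arithmetic above can be checked with a short sketch. The function names are hypothetical, and two details are assumptions: trailing frames that do not fill a full 12-frame interval are kept as a shorter final interval, and the iteration cap is obtained by ceiling division, which matches the stated 66,667 for a budget of 200,000.

```python
QUERIES_PER_ITERATION = 3  # two for gradient estimation, one success check


def split_into_intervals(num_frames, interval_length=12):
    """Non-overlapping (start, end) frame ranges, end exclusive.

    Keeping a shorter final interval is an assumption; the text does not
    say how leftover frames are handled.
    """
    return [
        (start, min(start + interval_length, num_frames))
        for start in range(0, num_frames, interval_length)
    ]


def max_iterations(query_limit, queries_per_iteration=QUERIES_PER_ITERATION):
    """Ceiling division of the query budget by the per-iteration cost."""
    return -(-query_limit // queries_per_iteration)


print(split_into_intervals(36))   # three intervals of 12 frames each
print(max_iterations(60_000))     # untargeted budget
print(max_iterations(200_000))    # targeted budget
```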

In Algorithm 1, the interval t for sampling new motion maps is 10, the parameter for adjusting the magnitude of the loss variation is 0.1, and the parameter for approximation is 0.1. In Algorithm 2, the learning rate for updating the estimated gradient is 0.1 and the learning rate for updating the adversarial video is 0.025.
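For reference, the settings listed in this appendix can be gathered into one configuration object. This is only a convenience sketch: the values come from the text, but every field name is hypothetical, since the original symbols for several parameters are not legible in this version of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackConfig:
    # Accumulated motion vector (field names are hypothetical).
    interval_length: int = 12            # frames per non-overlapping interval
    # Algorithm 1.
    motion_resample_interval: int = 10   # interval t for sampling new motion maps
    loss_variation_scale: float = 0.1    # adjusts the magnitude of loss variation
    approximation_step: float = 0.1      # parameter "for approximation"
    # Algorithm 2.
    gradient_lr: float = 0.1             # learning rate for the estimated gradient
    video_lr: float = 0.025              # learning rate for the adversarial video
    # Query budgets.
    untargeted_query_limit: int = 60_000
    targeted_query_limit: int = 200_000


config = AttackConfig()
print(config)
```

Marking the dataclass as frozen keeps the reported settings from being changed accidentally in the middle of an experiment.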

Appendix B More visualization

Demo video

We first put together a video to showcase our generated adversarial samples. The video can be found in MESampler-demo-video.mp4.

Video and motion map visualization

We also show more results of adversarial video frames and the adopted motion vector on four datasets. Visualizations on SthSth-V2 [goyal2017something] and HMDB-51 [kuehne2011hmdb] are shown in Fig. 5 and the results of Kinetics-400 [kay2017kinetics] and UCF-101 [soomro2012ucf101] are in Fig. 6.

From the demo video and the frame visualizations on the four datasets, we can see that our generated adversarial samples successfully fool the video classification model, even though the generated videos look the same as the originals and a human can still recognize the true labels without difficulty.

Failure cases

To have a better understanding of our model, we show several failure cases of our method. Examples on Kinetics-400 and UCF-101 are shown in Fig. 7 and Fig. 8, respectively.

There are two potential reasons behind the failures. The first concerns the confidence of the attacked video model. On certain videos, the model is highly confident in its prediction. Take the first video in Fig. 7 as an example: the black-box model outputs the true label ‘golf driving’ with confidence 0.9999. This confidence score is so high that the perturbation imposed on the video is likely to have little effect on the final result. The second concerns the quality of motion in the videos. We notice that for the videos we fail to attack, the motion map is rather obscure and unrecognizable. Under such circumstances, the advantage of motion information cannot be fully utilized.

Appendix C Better motion information

Here, we provide evidence for another assumption: clearer and more complete motion maps lead to better attack performance. Rather than fixing the starting point and length of each interval for generating motion maps, we modify them according to the trajectory of the given video to obtain a clearer and more complete description of the movement. We term the resulting map the ‘improved motion map’. We show two samples from Kinetics-400 in Fig. 9, from the classes ‘zumba’ and ‘vault’. The left column shows the video frames, the middle column the ‘improved motion map’, and the right column the original motion map. The improved motion map in the middle is visibly more consistent and clearer. For sample (a), applying the ‘improved motion map’ instead of the original one saves more than 10,000 queries. For sample (b), it saves 30,000 queries.

However, it is still difficult to automatically determine the starting point and the interval length that produce much clearer maps. We leave this study as future work.

Figure 5: Examples of motion vectors used in attacking and the generated adversarial samples. In (a)-(d), the first row is the original video frame, the second row is the motion vector and the third row is the generated adversarial video frame. a) SthSth-V2 on I3D: throwing a leaf in the air and letting it fall → throwing tooth paste; b) HMDB-51 on I3D: throw → fencing; c) SthSth-V2 on TSN2D: pretending or trying and failing to twist remote-control → pretending to open something without actually opening it; d) HMDB-51 on TSN2D: swing-baseball → throw.
Figure 6: Examples of motion vectors used in attacking and the generated adversarial samples. In (a)-(d), the first row is the original video frame, the second row is the motion vector and the third row is the generated adversarial video frame. a) Kinetics-400 on I3D: tango dancing → salsa dancing; b) UCF-101 on I3D: StillRings → PoleVault; c) Kinetics-400 on TSN2D: tossing coin → scissors paper; d) UCF-101 on TSN2D: Swing → TrampolineJumping.
Figure 7: Failed samples from Kinetics-400 against I3D and TSN2D. a) Sample from class ‘golf driving’ against I3D; b) Sample from class ‘presenting weather forecast’ against TSN2D. The first row shows the frames of the original video and the second row shows the motion vectors generated between frames. The movement between video frames changes little and the generated motion vectors are very obscure.
Figure 8: Failed samples from UCF-101 against I3D and TSN2D. a) Sample from class ‘BasketballDunk’ against I3D; b) Sample from class ‘BreastStroke’ against TSN2D. The first row shows the frames of the original video and the second row shows the motion vectors generated between frames. The movement between video frames changes little and the target object in the video is very small.
Figure 9: Rather than fixing the starting point and length of each interval for generating motion maps, the starting point and the interval length are modified according to the trajectory of the given video to obtain a clearer and more complete description of the movement. Two samples from Kinetics-400: a) Sample from class ‘zumba’; b) Sample from class ‘vault’. Left: the frames of the original video; Middle: ‘improved motion map’; Right: original motion map used in attacking. The ‘improved motion map’ is more complete and clearer than the original motion map on the right. The attack results are also better when using the ‘improved motion map’.