# Robustness Guarantees for Deep Neural Networks on Videos

The widespread adoption of deep learning models places demands on their robustness. In this paper, we consider the robustness of deep neural networks on videos, which comprise both the spatial features of individual frames extracted by a convolutional neural network and the temporal dynamics between adjacent frames captured by a recurrent neural network. To measure robustness, we study the maximum safe radius problem, which computes the minimum distance from the optical flow set obtained from a given input to that of an adversarial example in the norm ball. We demonstrate that, under the assumption of Lipschitz continuity, the problem can be approximated using finite optimisation via discretising the optical flow space, and the approximation has provable guarantees. We then show that the finite optimisation problem can be solved by utilising a two-player turn-based game in a cooperative setting, where the first player selects the optical flows and the second player determines the dimensions to be manipulated in the chosen flow. We employ an anytime approach to solve the game, in the sense of approximating the value of the game by monotonically improving its upper and lower bounds. We exploit a gradient-based search algorithm to compute the upper bounds, and the admissible A* algorithm to update the lower bounds. Finally, we evaluate our framework on the UCF101 video dataset.

## 1 Introduction

Deep neural networks (DNNs) have been developed for a variety of tasks, including self-driving cars, malicious software classification, and abnormal network activity detection. While the accuracy of neural networks has significantly improved, matching human cognitive perception, they are susceptible to adversarial examples. An adversarial example is an input which, whilst initially classified correctly, is misclassified with a slight, often imperceptible, perturbation.

Robustness of neural networks has been an active topic of investigation, and a number of approaches have been proposed. (See Related work below.) However, most existing works focus on robustness of neural networks on image classification problems, where convolutional neural networks (CNNs) are sufficient. One assumption that CNNs rely on is that inputs are independent of each other, and they are unable to accept a sequence of input data when the final output is dependent on intermediate outputs. In reality, though, tasks often contain sequential data as inputs, for instance, in machine translation sutskever2014sequence , speech/handwriting recognition graves2013speech ; graves2005framewise ; fernandez2007application , and protein homology detection hochreiter2007fast . To this end, recurrent neural networks (RNNs) come into play. For RNNs, the connections between neurons form a directed graph along a temporal sequence, which captures temporal dynamic behaviours. Unlike CNNs, RNNs can use the internal state (memory) to process sequences of inputs.

In this work, we evaluate robustness of neural networks, including CNNs and RNNs, on videos. Video classification is challenging because it comprises both the spatial features of each individual frame, which can be extracted by CNNs, and the temporal dynamics between neighbouring frames, which can be captured by RNNs. Our main contributions are as follows.

• We define the maximum safe radius problem for DNNs for sequential video inputs by working directly with the optical flow sets, and, using Lipschitz continuity, discretise the optimisation problem for computing the maximal such radius into a finite optimisation that approximates it.

• We solve the finite optimisation problem via a two-player turn-based game, where Player I selects among optical flows and Player II determines manipulations imposed within the chosen flows, and demonstrate that the solution is Player I's reward when taking the optimal strategy.

• To approximate the reward of the game, we design an anytime approach, in the sense of exploiting a gradient-based algorithm to compute the upper bounds and the admissible A* algorithm to improve the lower bounds.

• We evaluate the proposed framework on the UCF101 video dataset, and present converging upper and lower bounds of the maximum safe radius.

#### Related work

The notion of robustness for neural networks has been mainly studied in the context of image classification, but, to the best of our knowledge, there is no work addressing robustness guarantees for videos. We review only works that are most relevant to our approach. Apart from papernot2016limitations ; moosavi2017universal ; melis2017deep , Szegedy et al. szegedy2014intriguing implement a targeted search for adversarial examples for image classification via minimising the Euclidean distance between the images while maintaining misclassification. A subsequent improvement, the Fast Gradient Sign Method (FGSM) goodfellow2015explaining , computes a linearised version of the cost function to obtain the gradients for manipulation directions. Carlini & Wagner carlini2017towards transform the existence of adversarial examples into an optimisation problem so that optimisation algorithms can be applied. Automated verification methods gopinath2018deepsafe ; ruan2018reachability ; ruan2018global aim to compute robustness guarantees against adversarial attacks; we mention constraint solving pulina2010abstraction , e.g., Reluplex katz2017reluplex , or exhaustive exploration of a discretised neighbourhood of a point huang2017safety . In wu2018game a game-based verification approach is proposed for computing the maximal safe radius for feed-forward networks, including CNNs; our method draws on that approach but we are able to handle video inputs.

Adversarial attacks have also been developed for recurrent neural networks on time-series inputs. For instance, Papernot et al. papernot2016crafting extend previous algorithms papernot2016limitations ; goodfellow2015explaining to craft adversarial input sequences for RNNs by using computational graph unfolding to compute the forward derivative of the recurrence cycle. Moreover, both inkawhich2018adversarial and wei2018sparse develop adversarial attacks on the UCF101 dataset; while the former utilises a two-stream classifier, the latter chooses a CNN + RNN architecture. Apart from these attack methods, more recent efforts have attempted to verify the robustness of RNNs, though not on videos. kevorchian2018verification define a series of RNN abstractions in the form of feed-forward networks, prove their equivalence to the original ones, and subsequently perform reachability analysis via Linear Programming (LP) and Satisfiability Modulo Theories (SMT) barrett2018satisfiability . Alternatively, wang2018verification extract deterministic finite automata (DFA) from certain RNNs as the oracle, and use them to evaluate adversarial accuracy.

## 2 Preliminaries

#### Deep neural networks

Let $N$ be a neural network with a set of classes $C$. Given an input $v$ and a class $c \in C$, we use $N(v, c)$ to denote the confidence of $N$ believing that $v$ is in class $c$. We work with the logit value of the last layer, but the methods can be adapted to the probability value after normalisation. Thus, $N(v) = \arg\max_{c \in C} N(v, c)$ is the class into which $N$ classifies $v$. Moreover, as $N$ in this work can have convolutional and recurrent layers, we let $N_C$ denote the convolutional part and $N_R$ the recurrent part. Specifically, since the inputs we consider are videos, we let the input domain $D$ be $\mathbb{R}^{l \times w \times h \times ch}$, where $l$ is the length of $v$, i.e., the number of frames, and $w, h, ch$ are the width, height, and channels of each frame, respectively.

#### Optical flow

In order to capture the dynamic characteristics of the moving objects in a video, we utilise optical flow burton1978thinking ; warren2013electronic , which is a pattern of the apparent motion of the image objects between two consecutive frames caused by the movement of the objects or the camera. There exist methods in the computer vision community to compute optical flows, for instance, the Lucas-Kanade method lucas1981iterative and the Gunnar Farnebäck algorithm farneback2003two .

###### Definition 1 (Optical Flow Equation).

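Consider a pixel $I(x, y, t)$ in a frame, where $x, y$ denote the horizontal and vertical positions respectively, and $t$ denotes the time dimension. If after $dt$ time, the pixel moves by distance $(dx, dy)$ in the next frame, then $I(x, y, t) = I(x + dx, y + dy, t + dt)$ holds. After taking Taylor series approximation, removing common terms, and dividing by $dt$, the Optical Flow Equation is $I_x V_x + I_y V_y + I_t = 0$, where $I_x = \partial I / \partial x$ and $I_y = \partial I / \partial y$ are the image gradients, $I_t = \partial I / \partial t$ is the gradient along time, and the motion $(V_x, V_y) = (dx/dt, dy/dt)$ is unknown.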
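Because the Optical Flow Equation gives one linear constraint per pixel in two unknowns, a single pixel cannot determine the motion. The Lucas-Kanade idea is to assume a common motion inside a small window and solve the stacked constraints by least squares. The following is a minimal NumPy sketch of that idea on synthetic data; the function name `lucas_kanade_window`, the test pattern, and the window size are illustrative assumptions, not part of the paper's pipeline.

```python
import numpy as np

def lucas_kanade_window(frame0, frame1):
    """Estimate one motion vector (V_x, V_y) for a whole window by least squares.

    Every pixel contributes one optical flow equation: I_x*V_x + I_y*V_y = -I_t.
    """
    f0 = frame0.astype(float)
    f1 = frame1.astype(float)
    # np.gradient returns the derivative along axis 0 (rows, y) first, then axis 1 (columns, x).
    I_y, I_x = np.gradient(f0)
    I_t = f1 - f0  # temporal gradient for a unit time step
    A = np.stack([I_x.ravel(), I_y.ravel()], axis=1)
    b = -I_t.ravel()
    motion, *_ = np.linalg.lstsq(A, b, rcond=None)
    return motion  # array([V_x, V_y])

def pattern(x, y):
    """Smooth synthetic intensity pattern (assumed for illustration only)."""
    return np.sin(0.3 * x) + np.cos(0.2 * y)

ys, xs = np.mgrid[0:16, 0:16].astype(float)
true_motion = np.array([0.2, -0.1])  # small sub-pixel shift (V_x, V_y)
frame_a = pattern(xs, ys)
# Moving the pattern by (V_x, V_y) means sampling it at (x - V_x, y - V_y).
frame_b = pattern(xs - true_motion[0], ys - true_motion[1])

estimated_motion = lucas_kanade_window(frame_a, frame_b)
```

The estimate is only approximate, since both the Taylor expansion and the finite-difference gradients introduce small errors; for small motions on smooth patterns it should land close to `true_motion`.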

#### Distance metrics and Lipschitz continuity

In robustness evaluation, $L_k$ distance metrics are typically used to measure the discrepancy between inputs, denoted as $\|v - v'\|_{L_k}$, where $k \in \{1, 2, \infty\}$ indicates Manhattan ($L_1$), Euclidean ($L_2$), and Chebyshev ($L_\infty$) distances. Since our inputs are videos, i.e., sequences of frames, we will need a suitable metric. In this paper, we will work directly with $L_k$ distance metrics on optical flows, as described in the next section. Moreover, we consider neural networks that satisfy Lipschitz continuity, and note that networks with bounded inputs composed of the common fully-connected, convolutional, ReLU, and softmax layers are Lipschitz continuous. We denote by $\text{Lip}_c$ the Lipschitz constant for class $c$.
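As a concrete reading of the three metrics, the sketch below computes the Manhattan, Euclidean, and Chebyshev distances between two arrays of flow values. The function name `flow_distance` and the flattening of the whole flow set into one vector are assumptions made for illustration.

```python
import numpy as np

def flow_distance(flows_a, flows_b, k):
    """L_k distance between two equally shaped arrays, for k in {1, 2, np.inf}."""
    diff = (np.asarray(flows_a, dtype=float) - np.asarray(flows_b, dtype=float)).ravel()
    if k == 1:
        return float(np.sum(np.abs(diff)))      # Manhattan
    if k == 2:
        return float(np.sqrt(np.sum(diff ** 2)))  # Euclidean
    if k == np.inf:
        return float(np.max(np.abs(diff)))      # Chebyshev
    raise ValueError("k must be 1, 2, or np.inf")

original = [0.0, 0.0, 0.0]
perturbed = [3.0, -4.0, 0.0]
distances = {k: flow_distance(original, perturbed, k) for k in (1, 2, np.inf)}
```

For the same perturbation the three values are ordered $L_\infty \le L_2 \le L_1$, which is why a radius is only meaningful together with the metric it was computed in.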

## 3 Robustness: formulation and approximation

#### Robustness and maximum safe radius

In this work, we focus on pointwise robustness, which is defined as the invariance of a network's classification over a small neighbourhood of a given input. Following this, the robustness of a classification decision for a specific input can be understood as the non-existence of adversarial examples in the neighbourhood of the input. Here, we work with the norm ball as a neighbourhood of an input, that is, given an input $v$, a distance metric $L_k$, and a distance $d$, $B(v, L_k, d) = \{v' \mid \|v - v'\|_{L_k} \le d\}$ is the set of inputs whose distance to $v$ is no greater than $d$ based on the $L_k$-norm. Intuitively, the norm ball $B$ with centre $v$ and radius $d$ limits perturbations to at most $d$ w.r.t. $L_k$. Then (pointwise) robustness is defined as follows.

###### Definition 2 (Robustness).

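Given a network $N$, an input $v$, a distance metric $L_k$, and a distance $d$, an adversarial example $v'$ is such that $v' \in B(v, L_k, d)$ and $N(v') \neq N(v)$. Define the robustness of $v$ by $\text{Robust}(N, v, L_k, d) \triangleq \nexists\, v' \in B(v, L_k, d) \text{ s.t. } N(v') \neq N(v)$. If this holds, we say $N$ is safe with respect to $v$ within $d$ based on the $L_k$-norm.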

While the above definition returns only True or False, we take a step further to quantify the measurement of robustness. That is, we compute the distance to the original input in the sense that, if exceeding the distance, there definitely exists an adversarial example, whereas, within the distance, all the points are safe. We formally define this distance as the maximum safe radius as follows.

###### Definition 3 (Maximum Safe Radius).

Given a network $N$, an input $v$, a distance metric $L_k$, and a distance $d$, the maximum safe radius problem is to compute the minimum distance from the original input $v$ to an adversarial example $v'$, i.e.,

$$\text{MSR}(N, v, L_k, d) = \min_{v' \in D} \left\{ \|v - v'\|_{L_k} \mid v' \in B(v, L_k, d) \text{ s.t. } N(v') \neq N(v) \right\}. \tag{1}$$

If $v'$ does not exist in $B(v, L_k, d)$, we let $\text{MSR}(N, v, L_k, d) = d + \epsilon$.

#### Maximum safe radius with respect to optical flow

In existing works that evaluate a network's robustness over images, it is common to manipulate each image at pixel- or channel-level, and then compute the distance between the perturbed and original inputs. However, as we deal with time-series inputs, i.e., videos, instead of manipulating each individual frame directly, we impose perturbations on each optical flow that is extracted from every pair of adjacent frames, so that both spatial features on frames and temporal dynamics between frames can be captured. We define optical flow as follows.

###### Definition 4 (Optical Flow).

Given an input $v$ of $l$ frames, the optical flow extraction function maps $v$ to a set of optical flows $P(v)$, where each optical flow $p_t \in P(v)$ is extracted from a pair of adjacent frames of $v$.

Then, to study the crafting of adversarial examples, we construct manipulations on the optical flow to obtain perturbed inputs. Note that if the input values are bounded, e.g., $[0, 255]$ or $[0, 1]$, then the perturbed inputs need to be restricted to be within the bounds.

###### Definition 5 ((Atomic) Optical Flow Manipulation).

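Given an input $v$ with a set of optical flows $P(v)$, an instruction function $\Theta$, and a manipulation magnitude $\tau$, we define the input manipulation operations

$$M_{\Theta,\tau}(p_t[i]) = \begin{cases} p_t[i] + \Theta(p_t[i]) \cdot \tau, & \text{if } i \in [1, w \times h],\ i \in \mathbb{N}^+ \\ p_t[i], & \text{otherwise} \end{cases} \tag{2}$$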

where $w, h$ denote the width and height of $p_t$. Specifically, under a restricted choice of instruction, we say the manipulation is atomic.

Moreover, after remapping the manipulated flow back to the original frame, we obtain a perturbed new frame, and the manipulated flow set, $M_{\Theta,\tau}(P(v))$, maps to a new video $v'$ with the perturbation. To this end, we compute the distance from $M_{\Theta,\tau}(P(v))$ to $P(v)$ instead of that from $v'$ to $v$ because the former reflects both spatial and temporal manipulations simultaneously. That is, we compute the maximum safe radius with respect to optical flow, $\text{MSR}(N, P(v), L_k, d)$.
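The manipulation of Equation (2) can be sketched as an element-wise update of a flow array. In the sketch below the instruction function is encoded as an integer array with one entry per flow dimension, and an "atomic" manipulation is taken to change a single dimension by $\pm\tau$; both the encoding and this reading of "atomic" are assumptions of the example, as are the names `manipulate_flow` and `atomic_instruction`.

```python
import numpy as np

def manipulate_flow(flow, instruction, tau, bounds=None):
    """Add instruction[i] * tau to every flow dimension i, in the spirit of Equation (2).

    `instruction` is an integer array of the same shape as `flow` (assumed encoding
    of the instruction function). If `bounds` is given as (low, high), the result is
    clipped so that bounded values stay within their range.
    """
    flow = np.asarray(flow, dtype=float)
    perturbed = flow + np.asarray(instruction) * tau
    if bounds is not None:
        perturbed = np.clip(perturbed, bounds[0], bounds[1])
    return perturbed

def atomic_instruction(shape, index, sign):
    """Instruction that touches exactly one flattened dimension with sign +1 or -1."""
    instruction = np.zeros(shape, dtype=int)
    instruction.flat[index] = sign
    return instruction

flow = np.zeros((2, 3))  # toy flow with h = 2, w = 3
step = atomic_instruction(flow.shape, index=4, sign=-1)
perturbed_flow = manipulate_flow(flow, step, tau=0.5)
```

With this encoding the $L_1$ distance between the original and manipulated flow grows by exactly $\tau$ per atomic step, which is what makes the grid-based search described later tractable.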

#### Approximation based on Lipschitz continuity

Here, we utilise the fact that the networks studied in this work are Lipschitz continuous to discretise the neighbourhood space of an optical flow set, i.e., transform the infinite number of points in the norm ball into a finite number on the grid. First, based on the definitions of optical flow and input manipulation, we transform the problem into the following finite maximum safe radius problem.

###### Definition 6 (Finite Maximum Safe Radius).

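Given an input $v$, and a manipulation function $M_{\Theta,\tau}$, let $v'$ denote the perturbed input, then the finite maximum safe radius with respect to optical flow is

$$\text{FMSR}(N, P(v), L_k, d, \tau) = \min_{p_t \in P(v)} \min_{\theta \in \Theta} \left\{ \|P(v) - M_{\Theta,\tau}(P(v))\|_{L_k} \mid M_{\Theta,\tau}(P(v)) \in B(P(v), L_k, d) \text{ s.t. } N(v') \neq N(v) \right\}. \tag{3}$$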

If $v'$ does not exist in $B(P(v), L_k, d)$, we let $\text{FMSR}(N, P(v), L_k, d, \tau) = d + \epsilon$.

Intuitively, we aim to find a set of manipulations $\theta \in \Theta$ to impose on a set of optical flows $p_t \in P(v)$, such that the distance between the flow sets is minimal, and after the remapping procedure the perturbed input $v'$ is an adversarial example. Considering that, within a norm ball $B(P(v), L_k, d)$, the set of manipulations is finite for a fixed magnitude $\tau$, the problem only needs to explore a finite number of 'grid' points. To achieve this, we consider $\tau$-grid points, together with the set of $\tau$-grid points whose corresponding optical flow sets are in $B(P(v), L_k, d)$. Note that all the $\tau$-grid points are reachable from each other via manipulation. By selecting a proper $\tau$, we ensure that the optical flow space can be covered by small sub-spaces, each of grid width $\tilde{d}(L_k, \tau)$, whose value depends on whether the $L_1$, $L_2$, or $L_\infty$ norm is used. Now, we can use $\text{FMSR}$ to estimate $\text{MSR}$ within the error bounds, as in Figure 1.

###### Theorem 1 (Error Bounds).

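Given a manipulation magnitude $\tau$, the optical flow space can be discretised into a set of $\tau$-grid points, and $\text{MSR}$ can be approximated as

$$\text{FMSR}(N, P(v), L_k, d, \tau) - \tfrac{1}{2}\tilde{d}(L_k, \tau) \le \text{MSR}(N, P(v), L_k, d) \le \text{FMSR}(N, P(v), L_k, d, \tau). \tag{4}$$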

Then, the problem is to determine $\tilde{d}(L_k, \tau)$. Note that, in order to make sure each $\tau$-grid point covers all the possible manipulation points in its neighbourhood, we compute the largest admissible grid width. We now show that a bound on it can be obtained via Lipschitz continuity. For a network $N$ which is Lipschitz continuous at input $v$, given Lipschitz constant $\text{Lip}_c$ for each class $c \in C$, we have

$$\tilde{d}'(L_k, \tau) \le \frac{\min_{c \in C,\, c \neq N(v)} \{ N(v, N(v)) - N(v, c) \}}{\max_{c \in C,\, c \neq N(v)} \left( \text{Lip}_{N(v)} + \text{Lip}_c \right)}. \tag{5}$$

The detailed proof is attached in Appendix A.1. Here we remark that, while $\tilde{d}'(L_k, \tau)$ is with respect to input $v$ and $\tilde{d}(L_k, \tau)$ is with respect to the flow set $P(v)$, the relation between them is dependent on the optical flow extraction method used. As this is not the main focus of this work, we do not expand on this topic.
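To make Equations (4) and (5) concrete, the sketch below evaluates the right-hand side of Equation (5) from a vector of logits and per-class Lipschitz constants, and then forms the interval that Equation (4) places around the maximum safe radius. The numbers, and the function names `grid_width_bound` and `msr_interval`, are made up for illustration; how a bound with respect to the input translates to the flow set is left open, as in the text.

```python
import numpy as np

def grid_width_bound(logits, lipschitz):
    """Right-hand side of Equation (5): smallest logit margin over largest Lipschitz sum."""
    logits = np.asarray(logits, dtype=float)
    lipschitz = np.asarray(lipschitz, dtype=float)
    top = int(np.argmax(logits))  # predicted class N(v)
    others = [c for c in range(len(logits)) if c != top]
    margin = min(logits[top] - logits[c] for c in others)
    slope = max(lipschitz[top] + lipschitz[c] for c in others)
    return margin / slope

def msr_interval(fmsr, grid_width):
    """Interval of Equation (4): FMSR - grid_width / 2 <= MSR <= FMSR."""
    return fmsr - 0.5 * grid_width, fmsr

# Hypothetical three-class example.
bound = grid_width_bound(logits=[2.0, 5.0, 4.0], lipschitz=[1.0, 2.0, 3.0])
lower, upper = msr_interval(fmsr=1.0, grid_width=bound)
```

In this toy case the predicted class is the second one, the smallest margin is 1 and the largest Lipschitz sum is 5, so a finer grid (smaller width) tightens the interval around the finite maximum safe radius.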

## 4 A game-based robustness verification approach

In this section, we demonstrate that the finite optimisation problem of Definition 6 can be reduced to the computation of a player’s reward when taking an optimal strategy in a game-based setting. To this end, we adapt the game-based approach proposed in wu2018game for robustness evaluation of CNNs on images.

#### Problem solving as a two-player turn-based game

We define a two-player turn-based game, in which Player I chooses which optical flow to perturb, and Player II then imposes atomic manipulations of the dimensions within the selected flow.

###### Definition 7 (Game).

Given an input $v$ and its optical flow set $P(v)$, we define a game model with the following components:

• A set of game states, partitioned into the set of Player I's states and the set of Player II's states. Each state corresponds to an optical flow set in the norm ball $B(P(v), L_k, d)$.

• An initial state, which corresponds to the original optical flow set $P(v)$.

• Player I's transition relation $T_I$ and Player II's transition relation $T_{II}$, where the latter applies the atomic manipulation of Definition 5. Intuitively, in a game state, Player I selects an optical flow $p_t$ of the associated flow set and enters one of Player II's states, where Player II then chooses an atomic manipulation $\theta$ on $p_t$.

• A labelling function that assigns each game state's corresponding input to a class in $C$.

To compute $\text{FMSR}$ of Definition 6, we let the game be cooperative. When it proceeds, the two players take turns (Player I employs a strategy $\sigma_I$ to select an optical flow, then Player II employs a strategy $\sigma_{II}$ to determine atomic manipulations), thus forming a path $\rho$, which is a sequence of game states and the players' choices. Formally, we define the strategy of the game as follows. Let $\text{Path}^F_I$ be a set of finite paths ending in Player I's state, and $\text{Path}^F_{II}$ be a set of finite paths ending in Player II's state. We define a strategy profile $\sigma = (\sigma_I, \sigma_{II})$, such that $\sigma_I$ of Player I maps a finite path to a distribution over next actions, and similarly for $\sigma_{II}$ of Player II.

Intuitively, by imposing atomic manipulations in each round, the game searches for potential adversarial examples with increasing distance to the original optical flow. Given a path $\rho$, let $v'_\rho$ denote the input corresponding to the last state of $\rho$, and $P(v'_\rho)$ denote its optical flow set. We write the termination condition $tc(\rho) \equiv (N(v'_\rho) \neq N(v)) \lor (\|P(v'_\rho) - P(v)\|_{L_k} > d)$, which means that the game is in a state whose corresponding input is either classified differently, or the associated optical flow set is outside the norm ball. In order to quantify the distance accumulated along a path, we define a reward function as follows. Intuitively, the reward is the distance to the original optical flow if an adversarial example is found, and otherwise it is the weighted summation of the rewards of its children on the game tree.

###### Definition 8 (Reward).

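Given a strategy profile $\sigma = (\sigma_I, \sigma_{II})$, and a finite path $\rho$, we define a reward function

$$R(\sigma, \rho) = \begin{cases} \|P(v'_\rho) - P(v)\|_{L_k}, & \text{if } tc(\rho) \text{ and } \rho \in \text{Path}^F_I \\ \sum_{p_t \in P(v)} \sigma_I(\rho)(p_t) \cdot R(\sigma, \rho T_I(last(\rho), p_t)), & \text{if } \neg tc(\rho) \text{ and } \rho \in \text{Path}^F_I \\ \sum_{\theta \in \Theta} \sigma_{II}(\rho)(\theta) \cdot R(\sigma, \rho T_{II}(last(\rho), \theta)), & \text{if } \rho \in \text{Path}^F_{II} \end{cases} \tag{6}$$

where $\sigma_I(\rho)(p_t)$ is the probability of choosing optical flow $p_t$ along $\rho$, and $\sigma_{II}(\rho)(\theta)$ is the probability of choosing atomic manipulation $\theta$ along $\rho$. Also, $\rho T_I(last(\rho), p_t)$ and $\rho T_{II}(last(\rho), \theta)$ are the resulting paths of applying $\sigma_I$ and $\sigma_{II}$, respectively. Essentially, each transition appends a new state to $\rho$.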
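The recursion of Equation (6) can be traced on a toy game. The sketch below makes several simplifying assumptions that are not in the paper: each "flow" is a single number, the only atomic manipulation adds $+\tau$ (so the game tree is finite), the classifier is a hand-written threshold rule, and strategies depend only on the current state rather than on the whole path.

```python
TAU = 1.0            # manipulation magnitude
D = 3.0              # radius of the L1 norm ball
ORIGIN = (0.0, 0.0)  # original (toy) optical flow set with two one-dimensional flows

def is_adversarial(flows):
    # Hypothetical stand-in for N(v') != N(v).
    return flows[0] + 2 * flows[1] >= 4.0

def distance(flows):
    # L1 distance to the original flow set.
    return sum(abs(a - b) for a, b in zip(flows, ORIGIN))

def terminated(flows):
    # Termination condition tc: misclassified, or outside the norm ball.
    return is_adversarial(flows) or distance(flows) > D

def reward(flows, sigma_I, sigma_II):
    """Expected terminal distance under the strategy profile, as in Equation (6)."""
    if terminated(flows):
        return distance(flows)
    total = 0.0
    for t, prob_t in sigma_I(flows).items():                  # Player I picks a flow
        for theta, prob_theta in sigma_II(flows, t).items():  # Player II picks a manipulation
            child = list(flows)
            child[t] += theta * TAU
            total += prob_t * prob_theta * reward(tuple(child), sigma_I, sigma_II)
    return total

def uniform_flow(flows):
    return {t: 1.0 / len(flows) for t in range(len(flows))}

def prefer_second_flow(flows):
    return {1: 1.0}

def always_plus(flows, t):
    return {+1: 1.0}

reward_uniform = reward(ORIGIN, uniform_flow, always_plus)
reward_cooperative = reward(ORIGIN, prefer_second_flow, always_plus)
```

In this toy game the strategy that always manipulates the second flow reaches an adversarial state at distance 2, which is smaller than the expected reward of the uniform strategy; a cooperative game rewards exactly this kind of coordinated choice.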

#### Robustness guarantees

We now confirm that the game can return the optimal value of the reward function as the solution to the problem. Proof of the following theorem is in Appendix A.2.

###### Theorem 2 (Guarantees).

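Given an input $v$, a game model, and an optimal strategy profile $\sigma = (\sigma_I, \sigma_{II})$, the finite maximum safe radius problem is to minimise the reward of the initial state $s_0$ based on $\sigma$, i.e., $\text{FMSR}(N, P(v), L_k, d, \tau) = \min R(\sigma, s_0)$.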

## 5 Computation of the converging upper and lower bounds

We utilise a gradient-based search algorithm to compute an upper bound of $\text{FMSR}(N, P(v), L_k, d, \tau)$. Here, we utilise the spatial features extracted from individual frames.

###### Definition 9 (Spatial Features).

Given a network $N$, let $N_C$ denote the convolutional part, then $N_C$ maps from input $v$ to its extracted spatial features $\eta$, which has the same length $l$ as $v$ and the feature dimension of a frame. Then, we pass $\eta$ into the recurrent part $N_R$ and obtain the classification results, i.e., the confidence values $N(v, c)$ for $c \in C$.

The objective is to manipulate optical flow as imperceptibly as possible while altering the final classification. The objective function involves a constant and the perturbation imposed on each optical flow $p_t$. The key point is to minimise this perturbation so that it is unnoticeable while simultaneously changing $N(v)$. Here, we utilise the loss of $N$ on $v$ to quantify the classification change. Intuitively, if the loss increases, $N(v)$ is more likely to change. By utilising the concept of spatial features (Definition 9), we rewrite the perturbation as the element-wise (Hadamard) product of two gradients: the gradient of the network's loss w.r.t. the spatial features, and the gradient of the spatial features w.r.t. the optical flow. We introduce the computation of the two parts below.
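Written out, the decomposition described above has the following shape. The symbols here are assumed for illustration and may differ from the paper's own notation: $\ell$ stands for the loss of the network on the input, $\eta$ for the spatial features, $p_t$ for an optical flow, and $\nabla_{p_t}$ for the perturbation imposed on $p_t$.

```latex
\nabla_{p_t}
  \;=\;
  \underbrace{\frac{\partial \ell}{\partial \eta}}_{\text{loss w.r.t. spatial features}}
  \;\odot\;
  \underbrace{\frac{\partial \eta}{\partial p_t}}_{\text{spatial features w.r.t. optical flow}}
```

The first factor depends only on the recurrent part of the network, while the second depends only on the convolutional part and the flow extraction, which is why the two can be computed separately.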

On one hand, the gradient of the spatial features w.r.t. the optical flow essentially exhibits the relation between spatial features and optical flow. Here we reuse input manipulation (Definition 5) to compute it, though instead of manipulating the flow we impose perturbation directly on the frame. Intuitively, we manipulate the pixels of each frame to see how the subtle optical flow between the original and the manipulated frames will influence the spatial features. Each time we manipulate a single pixel of a frame, we get a new frame which is slightly different. Given the manipulated frame, its spatial features, and the subtle optical flow between the original and manipulated frames, this gradient can be computed as in Equation (8). On the other hand, the gradient of the loss w.r.t. the spatial features shows how the spatial features will influence the classification, which can be reflected by the loss of the network. After getting the spatial features from the convolutional part, we can obtain the loss from the recurrent part. If we perform pixel manipulation on a frame, and obtain a new input, then for this frame we have the gradient in Equation (9).

(8)
(9)
###### Remark.

From the definition of spatial features (Definition 9), we know that the spatial features depend only on each individual frame of $v$ and do not capture the temporal information between frames. That is, when the recurrent part remains unchanged, the spatial features and the classification output have a direct relation, which indicates that the gradient of the latter can reflect that of the former. Therefore, during implementation, instead of the distance between the spatial features of the original and manipulated frames, we calculate that between the corresponding network outputs.
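The single-pixel manipulation procedure amounts to a finite-difference estimate of how features respond to each pixel. The sketch below illustrates only that mechanic, under loud assumptions: `toy_features` is a hypothetical stand-in for the convolutional part, and the difference is divided by the pixel change `tau` rather than by the subtle optical flow between the original and manipulated frames that the text describes.

```python
import numpy as np

def toy_features(frame):
    # Hypothetical stand-in for the convolutional part: two summary features.
    return np.array([frame.mean(), (frame ** 2).mean()])

def feature_gradient_by_pixel(frame, tau=1e-3):
    """Forward-difference estimate of d(features)/d(pixel), one pixel at a time."""
    base = toy_features(frame)
    grads = np.zeros((frame.size, base.size))
    for i in range(frame.size):
        manipulated = frame.copy()
        manipulated.flat[i] += tau  # manipulate a single pixel
        grads[i] = (toy_features(manipulated) - base) / tau
    return grads

frame = np.array([[1.0, 2.0], [3.0, 4.0]])
grads = feature_gradient_by_pixel(frame)
# Analytic gradients of the toy features, for comparison:
# d(mean)/dx_i = 1/n and d(mean of squares)/dx_i = 2*x_i/n.
analytic = np.stack([np.full(frame.size, 1.0 / frame.size),
                     2.0 * frame.ravel() / frame.size], axis=1)
```

The loop over every pixel is what makes this step expensive for dense optical flow: one feature extraction is needed per manipulated pixel.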

We exploit admissible A* to compute the lower bound of Player I's reward, i.e., $\text{FMSR}(N, P(v), L_k, d, \tau)$. An A* algorithm gradually unfolds the game model into a tree, in the sense that it maintains a set of children nodes of the expanded partial tree, and computes an estimate for each node. The key point is that in each iteration it selects the node with the least estimated value to expand. The estimation comprises two components: (1) the exact reward up to the current node, and (2) the estimated reward to reach the goal node. To guarantee the lower bound, we need to make sure that the estimated reward is minimal. For this part, we let the A* algorithm be admissible, which means that, given a current node, it never overestimates the reward to the terminal goal state. For each state in the game model, we assign an estimated distance value that sums two terms: the distance from the original state to the current state based on the $L_k$-norm, and an admissible heuristic function that estimates the distance from the current state to the terminal state. Here, we use $\tilde{d}(L_k, \tau)$ in Equation (4). We present the admissible A* algorithm in Algorithm 1.
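The way an admissible A* search yields lower bounds can be seen on a toy problem. The sketch below is not the paper's Algorithm 1: it assumes one-dimensional flows, a single atomic manipulation that adds $+\tau$, a hand-written adversarial test, the $L_1$ norm, and a simple heuristic that charges one further step of cost $\tau$ to every non-adversarial state (admissible here because such a state needs at least one more manipulation).

```python
import heapq

TAU = 1.0

def is_adversarial(flows):
    # Hypothetical stand-in for N(v') != N(v).
    return flows[0] + 2 * flows[1] >= 4.0

def distance(flows):
    # L1 distance from the original flow set, taken to be the origin.
    return float(sum(abs(x) for x in flows))

def heuristic(flows):
    # Never overestimates: a non-adversarial state needs at least one more step.
    return 0.0 if is_adversarial(flows) else TAU

def astar_lower_bounds(start, d):
    """Return (minimum distance to an adversarial state or None, lower bounds per iteration)."""
    frontier = [(distance(start) + heuristic(start), start)]
    seen = {start}
    lower_bounds = []
    while frontier:
        estimate, flows = heapq.heappop(frontier)  # node with the least estimate
        lower_bounds.append(estimate)              # valid lower bound at this iteration
        if is_adversarial(flows):
            return distance(flows), lower_bounds
        for t in range(len(flows)):
            child = list(flows)
            child[t] += TAU
            child = tuple(child)
            if child not in seen and distance(child) <= d:
                seen.add(child)
                heapq.heappush(frontier, (distance(child) + heuristic(child), child))
    return None, lower_bounds

radius, bounds_trace = astar_lower_bounds((0.0, 0.0), d=3.0)
```

Because the heuristic never overestimates, the estimate of the node popped in each iteration cannot exceed the true minimum, so the recorded trace approaches the answer from below and can be reported at any time.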

## 6 Experimental results

This section presents the evaluation results of our framework to approximate the maximum safe radius w.r.t optical flow on a video dataset. We perform the experiments on a Linux server with NVIDIA GeForce GTX Titan Black GPUs, and the operating system is Ubuntu 14.04.3 LTS. The results are obtained from a VGG16 simonyan2015very + LSTM hochreiter1997long network on the UCF101 soomro2012ucf101 video dataset. Details about the dataset, the network structure, and training/testing parameters can be found in Appendix A.3.

#### Adversarial examples via manipulating optical flows

We illustrate how optical flow can capture the temporal dynamics of the moving objects in neighbouring frames. In this case, we exploit the Gunnar Farnebäck algorithm farneback2003two as it computes the optical flow for all the pixels in a frame, i.e., dense optical flow, instead of a sparse feature set. Figure 2 presents an optical flow generated from two adjacent frames of a video: (a) shows the two sampled frames; and (b) exhibits the characteristics of the flow. We observe that, while the indoor background essentially remains unchanged, the motion of the player together with the football is clearly captured by the flow. See more examples in Appendix A.4.

We now demonstrate how a very slight perturbation on the flow, almost imperceptible to human eyes, can lead to a misclassification of the whole video. Figure 4 exhibits a video whose original classification is changed by the manipulation. Two sampled frames are shown in the top row. If we compare the original optical flow (2nd row) generated from the frames with the perturbed one (bottom row), we can hardly notice the difference (3rd row). However, the classification of the video has changed.

#### Converging upper and lower bounds

We illustrate the convergence of the bound computation for the maximum safe radius with respect to manipulations on the optical flows extracted from the consecutive frames of a video. Take one video as an example. Figure 4 exhibits five sampled frames (top row) and the optical flows extracted between them (2nd row). By utilising our framework, we present the approximation of the maximum safe radius in Figure 6, where the red line indicates the descending trend of the upper bound, whereas the blue line denotes the ascending trend of the lower bound. Intuitively, the upper bound obtained from the gradient-based algorithm is the minimum distance to an adversarial example based on the chosen distance metric. That is, any manipulation imposed on the flows exceeding this upper bound is definitely unsafe. Figure 4 (3rd row) shows some such unsafe perturbations on each optical flow, which result in the misclassification of the video. As for the lower bound, the admissible A* algorithm raises it as the iterations proceed. That is, manipulations within the corresponding norm ball are absolutely safe. Some such safe perturbations can be found in the bottom row of Figure 4. Due to the space limit, we include another example in Appendix A.5.

#### Efficiency and scalability

As for the computation time, the upper bound requires the gradient of optical flow with respect to the frame, and because we extract dense optical flow, the algorithm needs to traverse each pixel of a frame to impose atomic manipulations; thus it takes minutes to retrieve the gradient of each frame. Once the gradient of the whole video is obtained, and the framework enters into the cooperative game, i.e., the expansion of the tree, each iteration takes minutes. Meanwhile, for the lower bound, the admissible A* algorithm expands the game tree in each iteration, which takes minutes, and updates the lower bound wherever applicable. Note that initially the lower bound may be updated in each iteration, but when the size of the game tree increases, it can take hours to update. Moreover, we analyse the scalability of our framework via an example video in Figure 6, which shows the lower bounds obtained with respect to different dimensions of the manipulated optical flows. We observe that, within the same number of iterations, decreasing the input dimension leads to faster convergence.

## 7 Conclusion

In this work, we study the maximum safe radius problem of neural networks, including CNNs and RNNs, with respect to the optical flow sets extracted from sequential videos. By relying on Lipschitz continuity, we transform the problem to a finite optimisation whose approximation has provable guarantees, and subsequently reduce the finite optimisation to the solution of a two-player turn-based game. We design algorithms to compute the upper and lower bounds, and demonstrate that the bounds converge to the maximum safe radius in the experiments.

## References

• [1] Clark Barrett and Cesare Tinelli. Satisfiability modulo theories. In Handbook of Model Checking, pages 305–343. Springer, 2018.
• [2] Andrew Burton and John Radford. Thinking in perspective: critical essays in the study of thought processes, volume 646. Routledge, 1978.
• [3] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
• [4] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian conference on Image analysis, pages 363–370. Springer, 2003.
• [5] Santiago Fernández, Alex Graves, and Jürgen Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. In International Conference on Artificial Neural Networks, pages 220–229. Springer, 2007.
• [6] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
• [7] Divya Gopinath, Guy Katz, Corina S Păsăreanu, and Clark Barrett. DeepSafe: A data-driven approach for assessing robustness of neural networks. In International Symposium on Automated Technology for Verification and Analysis, pages 3–19. Springer, 2018.
• [8] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE, 2013.
• [9] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
• [10] Sepp Hochreiter, Martin Heusel, and Klaus Obermayer. Fast model-based protein homology detection without alignment. Bioinformatics, 23(14):1728–1736, 2007.
• [11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
• [12] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, pages 3–29. Springer, 2017.
• [13] Nathan Inkawhich, Matthew Inkawhich, Yiran Chen, and Hai Li. Adversarial attacks for optical flow-based action recognition classifiers. arXiv preprint arXiv:1811.11875, 2018.
• [14] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97–117. Springer, 2017.
• [15] Andreea Kevorchian. Verification of recurrent neural networks. 2018.
• [16] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, pages 674–679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.
• [17] Marco Melis, Ambra Demontis, Battista Biggio, Gavin Brown, Giorgio Fumera, and Fabio Roli. Is deep learning safe for robot vision? Adversarial examples against the iCub humanoid. In Proceedings of the IEEE International Conference on Computer Vision, pages 751–759, 2017.
• [18] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1765–1773, 2017.
• [19] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.
• [20] Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016-2016 IEEE Military Communications Conference, pages 49–54. IEEE, 2016.
• [21] Luca Pulina and Armando Tacchella. An abstraction-refinement approach to verification of artificial neural networks. In International Conference on Computer Aided Verification, pages 243–257. Springer, 2010.
• [22] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. Reachability analysis of deep neural networks with provable guarantees. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2651–2659. AAAI Press, 2018.
• [23] Wenjie Ruan, Min Wu, Youcheng Sun, Xiaowei Huang, Daniel Kroening, and Marta Kwiatkowska. Global robustness evaluation of deep neural networks with provable guarantees for the Hamming distance. To appear in the International Joint Conference on Artificial Intelligence (IJCAI), 2019.
• [24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
• [25] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012.
• [26] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
• [27] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
• [28] Qinglong Wang, Kaixuan Zhang, Xue Liu, and C Lee Giles. Verification of recurrent neural networks through rule extraction. arXiv preprint arXiv:1811.06029, 2018.
• [29] David H Warren and Edward R Strelow. Electronic spatial sensing for the blind: contributions from perception, rehabilitation, and computer vision, volume 99. Springer Science & Business Media, 2013.
• [30] Xingxing Wei, Jun Zhu, and Hang Su. Sparse adversarial perturbations for videos. arXiv preprint arXiv:1803.02536, 2018.
• [31] Min Wu, Matthew Wicker, Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. A game-based approximate verification of deep neural networks with provable guarantees. To appear in the Journal of Theoretical Computer Science, 2019.

## Appendix A

### A.1 Proof of the error bounds in Theorem 1

In this section, we provide the detailed proof for the error bounds in Theorem 1, in particular, the value of $\tilde{d}'(L^k,\tau)$ in Equation (5).

###### Proof.

We first define the concept of the minimum confidence margin.

###### Definition 10 (Minimum Confidence Margin).

Given a network $N$, an input $v$, and a class $c \in C$, we define the minimum confidence margin as

$$\mathrm{Con}(v,c)=\min_{c'\in C,\,c'\neq c}\{N(v,c)-N(v,c')\}. \tag{10}$$

Intuitively, it is the discrepancy between the maximum confidence of $v$ being classified as $c$ and the second maximum confidence of $v$ being classified as $c'$. Then, for any input whose optical flow set is in the subspace of a grid point, and the input $v$ corresponding to the optical flow set of this grid point, we have

 (11)

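Now, since the optical flow set of this input is in the subspace of the grid point, we need to ensure that no class change occurs between it and the grid-point input $v$. That is, its minimum confidence margin with respect to the class $N(v)$ must remain non-negative. Therefore, we have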

$$\max_{c\in C,\,c\neq N(v)}\left(\mathrm{Lip}_{N(v)}+\mathrm{Lip}_{c}\right)\cdot\tilde{d}'(L^k,\tau)\leq \mathrm{Con}(v,N(v)). \tag{12}$$

Since the grid point is known, the minimum confidence margin for its corresponding input $v$ can be computed. Finally, replacing $\mathrm{Con}(v,N(v))$ with its definition, we have

$$\tilde{d}'(L^k,\tau)\leq\frac{\min_{c\in C,\,c\neq N(v)}\{N(v,N(v))-N(v,c)\}}{\max_{c\in C,\,c\neq N(v)}\left(\mathrm{Lip}_{N(v)}+\mathrm{Lip}_{c}\right)}. \tag{13}$$
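As a rough illustration of how the bound in Equation (13) could be evaluated, the following sketch computes the minimum confidence margin of Equation (10) and divides it by the largest sum of Lipschitz constants. The function names, the confidence values, and the per-class Lipschitz constants are all hypothetical placeholders chosen for this example, not quantities taken from the experiments.

```python
def confidence_margin(confidences, c):
    """Con(v, c): smallest gap between the confidence of class c and any other class."""
    return min(confidences[c] - conf
               for k, conf in enumerate(confidences) if k != c)


def grid_width_bound(confidences, lipschitz):
    """Right-hand side of Equation (13), evaluated at a grid-point input v."""
    n = len(confidences)
    top = max(range(n), key=confidences.__getitem__)   # predicted class N(v)
    margin = confidence_margin(confidences, top)       # Con(v, N(v))
    worst = max(lipschitz[top] + lipschitz[k] for k in range(n) if k != top)
    return margin / worst


# Hypothetical confidences N(v, c) and Lipschitz constants Lip_c (assumptions).
confidences = [0.7, 0.2, 0.1]
lipschitz = [2.0, 1.0, 3.0]
bound = grid_width_bound(confidences, lipschitz)
print(bound)
```

With these made-up numbers the margin is about 0.5 and the largest sum of Lipschitz constants is 5, so the bound is roughly 0.1: a smaller margin or a steeper network forces a finer grid.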

### A.2 Proof of the guarantees in Theorem 2

In this section, we provide the proof for the robustness guarantees in Theorem 2.

###### Proof.

On the one hand, we consider any optical flow set that is a $\tau$-grid point and whose corresponding input is an adversarial example. Intuitively, the first player's reward from the game on the initial state is no greater than the distance to any such $\tau$-grid manipulated optical flow set. That is, the reward value, once computed, is a lower bound of the optimisation problem. Note that the reward value can be obtained because every $\tau$-grid point can be reached by some game play, i.e., a sequence of atomic manipulations.

On the other hand, from the termination condition of the game, we observe that, for some , if holds, then there must exist some other such that . Therefore, the reward value is the minimum distance among all the $\tau$-grid points whose corresponding inputs are adversarial examples.

Finally, we notice that this minimum value is equivalent to the optimal value required by Equation (3). ∎
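To make the quantity discussed in this proof concrete, the sketch below enumerates every $\tau$-grid point inside a norm ball by brute force and returns the smallest distance to a grid point that changes the class. It is only a toy stand-in for the game: the classifier, the two-dimensional "flow" vector, and the function names are hypothetical, and exhaustive enumeration replaces the gradient-based and A* searches used in the paper.

```python
import itertools
import math


def toy_classifier(flow):
    """Hypothetical stand-in for the network: class 1 iff the components sum above 1."""
    return 1 if sum(flow) > 1.0 else 0


def finite_msr(flow, tau, radius, classify=toy_classifier):
    """Smallest L2 distance from `flow` to an adversarial tau-grid point in the ball.

    Returns None when no grid point within `radius` changes the class.
    """
    original = classify(flow)
    steps = int(radius / tau + 1e-9)           # grid steps per dimension inside the ball
    best = None
    for combo in itertools.product(range(-steps, steps + 1), repeat=len(flow)):
        delta = [k * tau for k in combo]       # accumulated atomic manipulations of size tau
        dist = math.sqrt(sum(d * d for d in delta))
        if dist > radius + 1e-9:
            continue                           # grid point lies outside the norm ball
        candidate = [x + d for x, d in zip(flow, delta)]
        if classify(candidate) != original and (best is None or dist < best):
            best = dist
    return best


print(finite_msr([0.25, 0.25], tau=0.25, radius=1.0))
```

Shrinking `tau` refines the grid and can only bring the returned value closer to the true minimum distance, at the cost of exponentially more grid points, which is why the paper resorts to a guided game search instead.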

### A.3 Details of the video dataset and the network

As a popular benchmark for human action recognition in videos, UCF101 [25] consists of 101 annotated action classes, which fall into five types: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. It labels 13,320 video clips of 27 hours in total, and each frame has a resolution of 320 × 240 pixels.

In the experiments, we exploit a VGG16 + LSTM architecture: the VGG16 network extracts the spatial features from the UCF101 video dataset, and these features are then passed to a separate recurrent unit, an LSTM. For each video, we sample frames at a fixed interval and stitch them together into a sequence of frames. Specifically, we run every sampled frame from every video through VGG16, excluding the top classification part of the network, i.e., saving the output from the final Max-Pooling layer. Hence, for each video, we retrieve a sequence of extracted spatial features. Subsequently, we pass the features into a single LSTM layer, followed by a Dense layer with some Dropout in between. Eventually, after the final Dense layer with the Softmax activation function, we get the classification outcome.
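The pipeline described above might be sketched in Keras roughly as follows. This is an assumption-laden outline rather than the authors' code: the helper names, the sampling step, the sequence length, the layer sizes, the dropout rate, and the choice of ImageNet weights are all hypothetical placeholders, and the TensorFlow imports are deferred so that the frame-sampling helper runs without TensorFlow installed.

```python
def sample_frames(frames, step, length):
    """Keep every `step`-th frame and truncate to `length` frames (hypothetical values)."""
    return frames[::step][:length]


def build_feature_extractor(input_shape, weights="imagenet"):
    """VGG16 without its classification top, so the output is the final max-pooling layer.

    The ImageNet weights are an assumption made for this sketch.
    """
    from tensorflow.keras.applications import VGG16  # lazy import: requires TensorFlow
    return VGG16(weights=weights, include_top=False, input_shape=input_shape)


def build_sequence_classifier(seq_len, feat_dim, n_classes=101,
                              lstm_units=256, dense_units=128, dropout=0.5):
    """LSTM -> Dense -> Dropout -> Dense(softmax); all sizes are placeholders.

    `feat_dim` is the length of one frame's VGG16 feature map after flattening.
    """
    from tensorflow.keras import Sequential, layers
    return Sequential([
        layers.Input(shape=(seq_len, feat_dim)),
        layers.LSTM(lstm_units),
        layers.Dense(dense_units, activation="relu"),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])


print(sample_frames(list(range(20)), step=5, length=3))  # -> [0, 5, 10]
```

Keeping the feature extractor and the sequence classifier as separate models mirrors the two-stage description: the VGG16 features can be computed once per video and cached before the LSTM is trained.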

We use the same loss function and metrics for both the VGG16 and LSTM models. Whilst the former has its own optimiser and directly exploits existing weights, we train the latter through a separate optimiser and record its training accuracy as well as its testing accuracy. Specifically, when the loss difference cannot reflect the subtle perturbation on optical flow during the computation of upper bounds, we use the discrepancy of values instead.

### A.4 More examples of the optical flows extracted from different videos

Apart from Figure 2 in Section 6, here we include more examples of the optical flows extracted from another two videos, shown in Figure 7 and Figure 8.

### A.5 Another example of the converging upper and lower bounds

Apart from the example in Section 6 (Figures 4 and 6), we attach another example to illustrate the convergence of the upper and lower bounds. Similarly, Figure 10 exhibits four sampled frames (top row) from a video and the optical flows extracted between them (2nd row). The descending upper bounds (red) and the ascending lower bounds (blue) that approximate the value of the maximum safe radius are presented in Figure 9. Intuitively, the upper bound reached by iterating the gradient-based algorithm is the minimum distance to an adversarial example under the chosen distance metric. That is, any manipulation imposed on the flows exceeding this upper bound is definitely unsafe. Figure 10 (3rd row) shows some such unsafe perturbations on each optical flow, which result in the misclassification of the video. As for the lower bound, it is the value reached by iterating the admissible A* algorithm. That is, manipulations within the norm ball of this radius are absolutely safe. Some such safe perturbations can be found in the bottom row of Figure 10.
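The anytime behaviour of the two bounds can be mimicked with a small bookkeeping class, sketched below. The class name and the distances fed into it are hypothetical and are not values from the experiments; the point is only that the upper bound never increases, the lower bound never decreases, and their gap shrinks as the searches proceed.

```python
class AnytimeBounds:
    """Tracks monotone bounds on the safe radius across search iterations."""

    def __init__(self):
        self.upper = float("inf")   # smallest distance to an adversarial example seen so far
        self.lower = 0.0            # largest radius certified to contain no adversarial example

    def report_adversarial(self, distance):
        """Called by the upper-bound search whenever it finds an adversarial example."""
        self.upper = min(self.upper, distance)

    def report_safe_radius(self, radius):
        """Called by the lower-bound search whenever it certifies a larger safe radius."""
        self.lower = max(self.lower, radius)

    def gap(self):
        return self.upper - self.lower


bounds = AnytimeBounds()
# Hypothetical (adversarial distance, certified radius) pairs, one per iteration.
for found, certified in [(5.0, 1.0), (6.0, 1.5), (3.5, 2.0)]:
    bounds.report_adversarial(found)
    bounds.report_safe_radius(certified)
    print(bounds.upper, bounds.lower, bounds.gap())
```

Note that the second iteration reports a worse adversarial distance (6.0), yet the stored upper bound stays at 5.0, which is what makes the curve in a plot such as Figure 9 descend monotonically.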