Adversarially Robust Frame Sampling with Bounded Irregularities

02/04/2020 ∙ by Hanhan Li, et al.

In recent years, video analysis tools for automatically extracting meaningful information from videos have been widely studied and deployed. Because most of them use computationally expensive deep neural networks, it is desirable to feed only a subset of video frames into such algorithms. Sampling frames at a fixed rate is attractive for its simplicity, representativeness, and interpretability. For example, a popular cloud video API generated video and shot labels by processing only the first frame of every second in a video. However, one can easily attack such strategies by placing chosen frames at the sampled locations. In this paper, we present an elegant solution to this sampling problem that is provably robust against adversarial attacks while introducing only bounded irregularities.




1 Introduction

The number of videos produced each year is growing at an incredible pace. The videos may originate from various kinds of devices such as professional video recorders, personal cameras, surveillance cameras, and smartphones, or they may be created entirely in software. Internet companies host an enormous number of videos for people to watch online. Video analysis algorithms, ranging from handcrafted ones such as shot boundary detection (Boreczky & Rowe, 1996; Pal et al., 2015) to machine-learned ones such as convolutional neural networks (Krizhevsky et al., 2012; He et al., 2016), have also advanced significantly. Machine learning algorithms are especially attractive these days, as they are capable of performing difficult tasks such as extracting semantic meaning from raw media. Companies increasingly rely on these video understanding tools to better filter, index, and rank videos for search and recommendation at scale, and these topics are widely studied in both academia and industry.

However, most video analysis algorithms are computationally expensive. For example, applying a ResNet-101 model to a single frame at 224×224 resolution requires about 7.6×10^9 multiply-add FLOPs (He et al., 2016). One may achieve better performance with a larger ResNet model or with a higher frame resolution, both of which require even more FLOPs. Therefore, analyzing all frames can be unaffordable or cost-inefficient, and sampling a subset of frames beforehand is usually desired.

Generally, we prefer to sample as uniformly as possible, for three reasons. First, a uniform sequence better represents the whole video. Second, it is easier to interpret and explain. Third, irregular sequences may add complexity to downstream algorithms or degrade their performance. There is a well-known cloud service that takes a video file from a user and returns the video labels (objects within the video), shot changes (scene changes within the video), shot labels (descriptions of video events over time), and more. The original service processed only the first frame of every second of a video to save computation. However, if images were inserted into a video at a rate of one frame per second, the API would output only video and shot labels related to the inserted images and ignore the rest of the video, which is the vast majority of its content. This vulnerability was discovered and demonstrated by Hosseini et al. (2017).
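
The weakness of fixed-rate sampling is easy to reproduce in a toy setting. The sketch below is illustrative only: the frame rate, the video length, and the helper names `fixed_rate_indices` and `make_attacked_video` are assumptions, and for simplicity it overwrites frames rather than inserting them. It is not the cloud service's code or the attack code of Hosseini et al.

```python
def fixed_rate_indices(num_frames: int, fps: int) -> list[int]:
    """Indices of the first frame of every second (fixed-rate sampling)."""
    return list(range(0, num_frames, fps))


def make_attacked_video(num_frames: int, fps: int) -> list[str]:
    """Toy video: 'adv' at the first frame of each second, 'orig' elsewhere."""
    return ["adv" if i % fps == 0 else "orig" for i in range(num_frames)]


fps = 30                      # assumed frame rate
num_frames = 10 * fps         # assumed 10-second video
video = make_attacked_video(num_frames, fps)

# The sampler sees only the frames the attacker controls.
sampled = [video[i] for i in fixed_rate_indices(num_frames, fps)]
adv_share_in_video = video.count("adv") / num_frames
adv_share_in_sample = sampled.count("adv") / len(sampled)

print(f"adversarial share of the video:   {adv_share_in_video:.3f}")
print(f"adversarial share of the samples: {adv_share_in_sample:.3f}")
```

In this setup the adversarial frames make up one frame in thirty of the video, yet they account for every sampled frame.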

One needs to introduce randomness into the sampling algorithm as a countermeasure to these image insertion attacks. However, randomness necessarily compromises uniformity, and we would like to keep the disturbance as small as possible. This paper provides such a solution, named ‘jittering with reflection’, that is provably robust and has bounded irregularities. As frame timestamps of a video can be treated as either continuous or discrete, we address both variants in this paper. To the best of our knowledge, there is no prior work that jointly optimizes both uniformity and adversarial robustness of frame sampling.

Please note that there is an orthogonal problem of pixel-level robustness associated with deep neural networks (Szegedy et al., 2013) that this paper does not address. The problem there is that certain imperceptible perturbations to the images can trick the network into making completely wrong predictions.

The rest of the paper is organized as follows. In Section 2, we present the theoretical formulation and solution to the continuous version of the sampling robustness problem. Its discrete counterpart is addressed in Section 3. Section 4 demonstrates an example sampling with its associated video classification performance, and Section 5 concludes this paper.

2 The continuous version

In this section, we will first define mathematically the desired uniformity properties and randomness properties as well as the rationale behind the definitions. We will then prove that these properties are sufficient to ensure security, i.e., robustness against insertion attacks. Finally, we will propose a sampling algorithm and prove that it satisfies all the desired properties. Time is treated as a continuous quantity as it is in the physical world.

2.1 Uniformity and randomness properties

For a given interval and a perturbation threshold , we would like to probabilistically sample an infinite sequence in . We require the following uniformity properties:

  • (U1) for any .

  • (U2) There exists some offset such that for any .

We also require the following randomness properties:

  • (R1) There exists some threshold such that for any .

  • (R2) The event becomes independent of all events for as . Formally, for any , there exists a such that for any , , and .


Here, is the probability density of the event . Intuitively, the probability of intersecting with is .

In our application, the sequence represents the desired timestamps with which we would like to take samples with a target frequency . The threshold is usually chosen to be much smaller than .

The uniformity properties ensure that the sampled frames are representative and interpretable for a normal video, as they are evenly spaced up to a bounded error. Property U1 is about incremental uniformity, and it ensures that the time intervals between two neighboring frames are reasonably close to . This is particularly important for algorithms that infer motion or depth from the difference between the contents of these two frames (Dosovitskiy et al., 2015; Gordon et al., 2019). Property U2 is about cumulative uniformity, and it ensures that there is no long-term deviation from a fixed frame rate. This is particularly important if we need to align the sampled frames with a stream of features from another modality that may have a constant frequency.

The randomness properties ensure that the sampled frames are robust against adversarially positioned frames. Property R1 ensures that there is no blind spot from which we never sample. Property R2 ensures that there is no long-range correlation that could be exploited. For example, if a uniform sequence is shifted by a single overall random offset, R1 could be satisfied, but R2 could not. In this case, an attacker needs at most a few attempts to find an offset in their favor and trick the sampling strategy.
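
The difference between R1 and R2 can be seen in a small simulation. The sketch below is an illustration under assumed values (a 1000 ms sampling interval, an attacker who controls a 100 ms window in every interval) with hypothetical helper names; it is not taken from the paper. It samples at a fixed rate with one global random offset and counts how many samples fall inside the attacker's windows.

```python
import random


def shifted_uniform_samples(period_ms: int, count: int, rng: random.Random) -> list[int]:
    """Fixed-rate timestamps (ms) sharing one global random offset."""
    offset = rng.randrange(period_ms)
    return [offset + n * period_ms for n in range(count)]


def hits(samples: list[int], period_ms: int, start_ms: int, width_ms: int) -> int:
    """Number of samples inside the attacker's window, which repeats every period."""
    return sum(1 for t in samples if (t - start_ms) % period_ms < width_ms)


rng = random.Random(0)
period_ms, count = 1000, 50          # assumed sampling interval and sequence length
start_ms, width_ms = 300, 100        # assumed attacker window: 10% of every period
trials = 2000

all_or_nothing = True
fully_fooled = 0
for _ in range(trials):
    samples = shifted_uniform_samples(period_ms, count, rng)
    h = hits(samples, period_ms, start_ms, width_ms)
    if h not in (0, count):
        all_or_nothing = False
    if h == count:
        fully_fooled += 1

print("every trial is all-or-nothing:", all_or_nothing)
print("fraction of trials fully fooled:", fully_fooled / trials)
```

Because all samples share the same offset, the attacker either controls every sampled frame or none of them. The chance of full success equals the attacker's coverage of one interval no matter how long the video is, which is the failure mode that R2 rules out.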

The following subsection proves that any set of frames that persistently appears throughout a video will have a probability exponentially close to 1 of getting caught in a sampled sequence with the above properties.

2.2 Security statement and proof

Theorem 1.

Let be a subset of with a measure of . Let be a sampled sequence in that satisfies the uniformity and randomness properties. Then the probability of and being disjoint goes to exponentially with the measure of as goes to .


Proof. Let be the threshold in Property R1. By Property R2, there exists an integer such that

for any , , and . It follows that


for any , , and under the same condition.

We pick an offset such that and that is an integer multiple of . We partition into the following sets, , , … , where

with the intervals defined as

Let us denote and . Then , , … is a partition of .

It’s clear that


for any and if .

Let , it’s clear that if .

Let be the measure function on . We define

It’s clear that

By Property U2, with a probability of , there is exactly one in the interval , and hence the events for all in are disjoint.

As a result, by Property R1,

In addition, by Equation 1 and Equation 2, for any ,


Therefore, the probability goes to as goes to , and the rate of convergence is exponential with . ∎

2.3 Jittering with reflection

Our proposed sampling strategy, ‘jittering with reflection’, satisfies all the desired properties. Moreover, it has an elegant behavior that is constantly , which maximally ensures robustness. The key to the algorithm is mirror reflection against the boundaries of the intervals . Figure 1 shows examples of the sampled frames.

Figure 1: This figure shows sampled frames from Algorithm 1 with 3 different seeds. We used and . Locations of the vertical lines represent times of the sampled frames.

Input:  A jittering distribution on which is symmetric about . It has a piecewise smooth p.d.f. For example .
Output:  A sampled sequence .
  Sample .
  for  do
     Sample , set .
     if  then
        Set .
     else if  then
        Set .
        Set .
     end if
  end for
Algorithm 1 Jittering with reflection (continuous version)
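
The sketch below is one possible reading of Algorithm 1, not the authors' implementation. It assumes an interval `T`, a jitter bound `eps` no larger than `T`, a uniform jitter on `[-eps, eps]`, and that the offset of the n-th sample from `n*T` performs a random walk that is mirrored back into `[0, T]`. The symbol names, ranges, and the function name are assumptions chosen for illustration.

```python
import random


def jitter_with_reflection(T: float, eps: float, count: int,
                           rng: random.Random) -> list[float]:
    """Sample `count` timestamps; the n-th one stays inside [n*T, (n+1)*T].

    Assumed reading of the algorithm: the offset s_n = t_n - n*T performs a
    random walk on [0, T] whose steps are mirrored back at the boundaries.
    """
    if not 0 < eps <= T:
        raise ValueError("this sketch assumes 0 < eps <= T")
    s = rng.uniform(0.0, T)                # initial offset, uniform on [0, T]
    times = [s]
    for n in range(1, count):
        s += rng.uniform(-eps, eps)        # symmetric jitter (uniform is an assumption)
        if s < 0.0:
            s = -s                         # reflect at the lower boundary
        elif s > T:
            s = 2.0 * T - s                # reflect at the upper boundary
        times.append(n * T + s)
    return times


rng = random.Random(0)
T, eps = 1.0, 0.2                          # assumed interval and jitter bound
times = jitter_with_reflection(T, eps, 10_000, rng)

offsets = [t - n * T for n, t in enumerate(times)]
gaps = [b - a for a, b in zip(times, times[1:])]
print("offsets stay in [0, T]:", all(-1e-9 <= s <= T + 1e-9 for s in offsets))
print("max |gap - T|:", max(abs(g - T) for g in gaps))
print("mean offset:", sum(offsets) / len(offsets))
```

Under these assumptions, consecutive gaps differ from `T` by at most `eps`, and every sample stays within its own interval of length `T`, which is how the reflection keeps the irregularities bounded.
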
Theorem 2.

The sampled sequence from Algorithm 1 satisfies the uniformity and randomness properties.


It is clear that uniformity properties U1 and U2 are satisfied, with offset .

Denote , then . Moreover,

is a Markov chain with transitions given by



Let be the probability density function of , and be the probability density function of . We prove by induction that for any . follows the distribution by design. Suppose follows it, we break the range of into three segments.

If , only the case in Equation 3a can happen:

If , only the cases in Equation 3a and Equation 3b can happen:

We used the fact that because is symmetric about . The same result for can be proved in a similar way. It follows that , which concludes the induction.

As the sequence partitions into intervals and each covers a unique interval

with uniform distribution, Property R1 is satisfied.

To prove Property R2, we will show that the Markov chain with any initial distribution will converge to the same stationary distribution with an exponential rate. First, we define that has a period of , with

And we look at the Fourier series of with period :

and the Fourier transform of


It is easy to see that has a nice behavior under ‘jittering with reflection’. Namely, is the convolution of and , or

By induction,


we see that , with equality holding only if . Also, by the Riemann–Lebesgue lemma, we have as . Therefore, when , only the zero-frequency coefficient survives, and becomes a constant function regardless of . Also, the convergence has a bounded exponential rate. This asymptotic independence implies that Property R2 is satisfied.

3 The discrete version

In this section, we formulate the discrete version of this problem, which is relevant to real products. For example, MediaPipe (Lugaresi et al., 2019) is an open-source framework for building multi-modal (e.g., audio) applied ML pipelines, and its timestamps are at microsecond granularity.

3.1 Uniformity and randomness properties

For a given interval and a perturbation threshold , we would like to probabilistically sample an infinite sequence in that satisfies both the uniformity properties and randomness properties as defined below.

  • (U1) for any .

  • (U2) There exists some offset such that for any .

  • (R1) There exists some threshold such that for any .

  • (R2) The event becomes independent of all events for as . Formally, for any , there exists an integer such that for any integer , , and .

3.2 Security Statement

Theorem 3.

Let be an infinite subset of . Let be a sampled sequence in that satisfies the uniformity and randomness properties. Then the probability of and being disjoint goes to exponentially with the cardinality of as goes to .

The proof is similar to that of the continuous version and is omitted.

3.3 Jittering with reflection

Input:  A jittering distribution on which is symmetric about . The gcd of and the indices of the nonzero entries of needs to be . For example .
Output:  A sampled sequence .
  Sample .
  for  do
     Sample , set .
     if  then
        Set .
     else if  then
        Set .
        Set .
     end if
  end for
Algorithm 2 Jittering with reflection (discrete version)

The discrete ‘jittering with reflection’ sampling is stated in Algorithm 2.
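
As with the continuous case, the sketch below is only one possible reading of Algorithm 2. It assumes integer timestamps, an integer interval `T`, an offset `t_n - n*T` that lives in `{0, ..., T-1}`, and reflection about the half-integer boundaries `-1/2` and `T - 1/2`. The function name, the parameter names, and the example jitter distribution are assumptions. The example jitter includes a step of size 1, so any gcd taken over its nonzero steps is 1.

```python
import random


def discrete_jitter_with_reflection(T: int, steps: list[int], weights: list[float],
                                    count: int, rng: random.Random) -> list[int]:
    """Integer timestamps whose offsets t_n - n*T stay in {0, ..., T-1}.

    Assumed reading: the offset takes a symmetric integer step and is mirrored
    about the half-integer boundaries -1/2 and T - 1/2.
    """
    if max(abs(j) for j in steps) > T:
        raise ValueError("this sketch assumes |step| <= T so one reflection suffices")
    s = rng.randrange(T)                               # uniform initial offset
    times = [s]
    for n in range(1, count):
        s += rng.choices(steps, weights=weights)[0]    # symmetric integer jitter
        if s < 0:
            s = -1 - s                                 # mirror about -1/2
        elif s > T - 1:
            s = 2 * T - 1 - s                          # mirror about T - 1/2
        times.append(n * T + s)
    return times


rng = random.Random(0)
T = 5                                                  # assumed interval, in frames
times = discrete_jitter_with_reflection(T, [-1, 0, 1], [0.25, 0.5, 0.25], 100_000, rng)

offsets = [t - n * T for n, t in enumerate(times)]
freq = [offsets.count(k) / len(offsets) for k in range(T)]
print("offset frequencies:", [round(f, 3) for f in freq])
```

With a symmetric jitter, this reflected walk has a doubly stochastic transition matrix, so the empirical offset frequencies should be close to uniform over the `T` possible offsets.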

Theorem 4.

The sampled sequence from Algorithm 2 satisfies the uniformity and randomness properties.

The proof is analogous to that of the continuous version. We can show that for any . The additional condition ensures that the Markov chain converges to for any initial distribution. The reason is that, in the discrete Fourier transform of , the coefficients of all terms except the constant term are less than .

4 An example: video classification after frame sampling

In this section, we demonstrate a simple example of the discrete frame sampling with interval and present its impact on video classification.

Let , then is a Markov chain with only two possible states, and . At each step, the jittering flips the state with a probability of . The transition matrix , where represents the probability of moving from state to state , is given by

It can be derived that the correlation between and is

It shows that the correlation drops to zero at an exponential speed, which is consistent with the randomness properties. We define the correlation length as the number of steps over which the correlation drops to , i.e.,

A larger gives a shorter correlation length and therefore more robustness. The red curve in Figure 2 visualizes this relation.
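
The exponential decay can be checked numerically for a symmetric two-state chain. The sketch below writes `p` for the flip probability and takes the correlation length to be the number of steps after which the correlation falls to 1/e. Both the symbol and the 1/e threshold are assumptions made for illustration, as are the function names. For such a chain the transition matrix has eigenvalues 1 and 1 - 2p, so the correlation after n steps is (1 - 2p)^n.

```python
import math


def transition_matrix(p: float) -> list[list[float]]:
    """Symmetric two-state chain: flip with probability p, stay otherwise."""
    return [[1.0 - p, p], [p, 1.0 - p]]


def matmul2(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def correlation(p: float, n: int) -> float:
    """Correlation between states n steps apart under the uniform stationary law."""
    power = [[1.0, 0.0], [0.0, 1.0]]
    m = transition_matrix(p)
    for _ in range(n):
        power = matmul2(power, m)
    # States are 0/1 with mean 0.5 and variance 0.25; E[x*y] = 0.5 * P^n[1][1].
    return (0.5 * power[1][1] - 0.25) / 0.25


def correlation_length(p: float) -> float:
    """Steps needed for the correlation to fall to 1/e (assumed definition)."""
    if not 0.0 < p < 0.5:
        raise ValueError("this sketch assumes 0 < p < 0.5")
    return -1.0 / math.log(1.0 - 2.0 * p)


for p in (0.05, 0.1, 0.2, 0.4):
    print(f"p={p:.2f}  corr after 3 steps={correlation(p, 3):.4f}  "
          f"closed form={(1 - 2 * p) ** 3:.4f}  length={correlation_length(p):.2f}")
```

The matrix-power result and the closed form agree, and the correlation length shrinks as `p` grows within the assumed range.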

Figure 2: Numerical results of the discrete ‘jittering with reflection’ sampling with . The quantity on the horizontal axis is a measure of the jittering magnitude, and the variance of the distance between two neighboring frames is proportional to it. The red line shows that the correlation length of the sampled frames decreases with . The blue line shows that the performance (as measured in GAP) of a YouTube-8M classification model decreases slightly with .

It can also be derived that the variance of is

This means a larger indicates further departure from uniformity. Sampling frames at irregular intervals may have many undesired consequences. Nevertheless, we use the following experiment to demonstrate that its impact on video level classification tasks is small.

YouTube-8M (Abu-El-Haija et al., 2016) is a large-scale labeled video dataset that consists of features from millions of YouTube videos with high-quality machine-generated annotations. We use the 2018 version which has about 6 million videos with a diverse vocabulary of 3862 audio-visual entities. 1024-dimensional visual features and 128-dimensional audio features at 1 frame per second are extracted from bottleneck layers of pre-trained deep neural networks and are provided as input features for this dataset.

We train a deep-bag-of-frames (DBoF) model as described in Li et al. (2019). In the DBoF model, a few layers (shared across frames) are applied to each frame, the frame-level features are then aggregated into a video feature, and finally a few additional layers are applied to obtain predictions. For evaluation, we take only frames at multiples of 5 seconds in order to magnify the effect of sampling. On top of these frames, we use the discrete ‘jittering with reflection’ with (which corresponds to 10 seconds in the videos) to sample frames before applying the DBoF model. The blue curve in Figure 2 shows that the global average precision (GAP) of the DBoF model degrades slightly when increases, which is expected. The variation is small because topical annotations are insensitive to the locations of the frames.

One can adjust this to achieve the desired trade-off between robustness (as measured by ) and uniformity (as measured by ).

5 Conclusions

In this paper, we formulated the uniformity and randomness properties desired of a general frame sampling strategy. We proved that if these properties are satisfied, a strategy is robust in the sense that any recurring sequence of frames has an exponentially small chance of concealing itself. We designed an algorithm, ‘jittering with reflection’, that satisfies all the desired properties, and the magnitude of the jittering can be tuned to achieve the desired trade-off between the amount of irregularity and the degree of robustness. We expect this algorithm to be widely useful in video analysis.


We thank Qingchun Ren for assistance in formalizing many proofs in this paper.