The amount of video we produce each year is growing at an incredible pace. Videos may originate from various kinds of devices such as professional video recorders, personal cameras, surveillance cameras, and smartphones, or they may be created entirely in software. Internet companies host an enormous number of videos for people to watch online. Video analysis algorithms, ranging from handcrafted ones such as shot boundary detection (Boreczky & Rowe, 1996; Pal et al., 2015) to learned ones based on deep neural networks (Krizhevsky et al., 2012; He et al., 2016), have also advanced significantly. Machine learning algorithms are especially attractive these days as they are capable of performing difficult tasks such as extracting semantic meaning from raw media. Companies increasingly rely on these video understanding tools to filter, index, and rank videos for search and recommendation at scale, and these topics are widely studied in both academia and industry.
However, most video analysis algorithms are computationally expensive. For example, it takes roughly $7.6 \times 10^9$ multiply-add FLOPs to apply a ResNet-101 model to a single frame at a $224 \times 224$ resolution (He et al., 2016). One may achieve better performance with a larger ResNet model or with a higher frame resolution, where even more FLOPs are required. Therefore, analyzing all frames can be unaffordable or cost-inefficient, and sampling a subset of frames beforehand is usually desired.
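As a back-of-envelope illustration of why exhaustive analysis is expensive, the sketch below compares the cost of running a model of roughly $7.6 \times 10^9$ multiply-adds per frame on every frame of a video against sampling one frame per second; the 30 fps frame rate and 10-minute duration are illustrative assumptions, not figures from this paper.

```python
# Back-of-envelope cost of per-frame analysis (all figures approximate).
FLOPS_PER_FRAME = 7.6e9      # ResNet-101 multiply-adds at 224x224 (He et al., 2016)
FPS = 30                     # assumed frame rate
DURATION_S = 10 * 60         # assumed 10-minute video

all_frames = FPS * DURATION_S
every_frame_cost = all_frames * FLOPS_PER_FRAME
sampled_cost = DURATION_S * FLOPS_PER_FRAME   # one frame per second

print(f"all frames: {every_frame_cost:.2e} FLOPs")            # 1.37e+14
print(f"1 fps:      {sampled_cost:.2e} FLOPs")                # 4.56e+12
print(f"savings:    {every_frame_cost / sampled_cost:.0f}x")  # 30x
```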
Generally, we prefer to sample as uniformly as possible for the following three reasons. First, a uniform sequence better represents the whole video. Second, it is more interpretable and explainable. Third, irregular sequences may add extra complexity to, or degrade the performance of, downstream algorithms. There is a well-known cloud service that takes a video file from a user and returns the video labels (objects within the video), shot changes (scene changes within the video), shot labels (descriptions of video events over time), and more. The original service processed only the first frame of every second of a video to save computing budget. However, if images were inserted into a video at a rate of one frame per second, the API would output video and shot labels related only to the inserted images and ignore the rest of the content, which is the vast majority. This vulnerability was discovered and demonstrated by Hosseini et al. (2017).
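The insertion attack described above is easy to reproduce in a toy setting. The sketch below simulates a deterministic sampler that keeps the first frame of every second; the frame labels and rates are made-up values for illustration.

```python
# Toy reconstruction of the insertion attack on a fixed sampler: the sampler
# takes the first frame of every second, so an adversary who inserts one frame
# exactly at each second boundary controls everything the analyzer sees.

def first_frame_of_each_second(frames, fps=30):
    """Deterministic sampler: keep frames whose index is a multiple of fps."""
    return [f for i, f in enumerate(frames) if i % fps == 0]

fps, seconds = 30, 5
frames = ["real"] * (fps * seconds)
for s in range(seconds):            # adversary overwrites the first frame
    frames[s * fps] = "inserted"    # of every second

sampled = first_frame_of_each_second(frames, fps)
print(sampled)  # ['inserted', 'inserted', 'inserted', 'inserted', 'inserted']
```

The analyzer sees only the five inserted frames, while 145 of the 150 real frames are ignored.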
One needs to introduce randomness into the sampling algorithm as a countermeasure to these image insertion attacks. However, randomness necessarily compromises uniformity, and we would like to keep the disturbance as small as possible. This paper provides such a solution, named ‘jittering with reflection’, that is provably robust and has bounded irregularities. As the frame timestamps of a video can be treated as either continuous or discrete, we address both variants in this paper. To the best of our knowledge, there is no prior work that jointly optimizes both the uniformity and the adversarial robustness of frame sampling.
Please note that there is an orthogonal problem of pixel-level robustness associated with deep neural networks (Szegedy et al., 2013) that this paper does not address. The problem there is that certain imperceptible perturbations to the images can trick a network into making completely wrong predictions.
The rest of the paper is organized as follows. In Section 2, we present the theoretical formulation and solution to the continuous version of the sampling robustness problem. Its discrete counterpart is addressed in Section 3. Section 4 demonstrates an example sampling with its associated video classification performance, and Section 5 concludes this paper.
2 The continuous version
In this section, we will first define mathematically the desired uniformity properties and randomness properties as well as the rationale behind the definitions. We will then prove that these properties are sufficient to ensure security, i.e., robustness against insertion attacks. Finally, we will propose a sampling algorithm and prove that it satisfies all the desired properties. Time is treated as a continuous quantity as it is in the physical world.
2.1 Uniformity and randomness properties
For a given interval $T > 0$ and a perturbation threshold $\Delta \in (0, T/2)$, we would like to probabilistically sample an infinite sequence $\{t_n\}_{n \ge 0}$ in $\mathbb{R}$. We require the following uniformity properties:

(U1) $|t_{n+1} - t_n - T| \le 2\Delta$ for any $n$.

(U2) There exists some offset $c$ such that $|t_n - nT - c| \le \Delta$ for any $n$.

We also require the following randomness properties:

(R1) There exists some threshold $\rho > 0$ such that $p_n(t) \ge \rho$ for any $t \in [nT + c - \Delta,\, nT + c + \Delta]$.

(R2) The event $\{t_m \in A\}$ becomes independent of all events concerning $\{t_n : n \le m - N\}$ as $N \to \infty$. Formally, for any $\epsilon > 0$, there exists an $N$ such that $|\Pr(t_m \in A \mid E) - \Pr(t_m \in A)| \le \epsilon$ for any $m$, any measurable $A \subseteq \mathbb{R}$, and any positive-probability event $E$ about $\{t_n : n \le m - N\}$.

Here $p_n(t)$ is the probability density of the event $\{t_n = t\}$. Intuitively, the probability of $t_n$ intersecting with $[t, t + \mathrm{d}t]$ is $p_n(t)\,\mathrm{d}t$.
In our application, the sequence $\{t_n\}$ represents the desired timestamps at which we would like to take samples with a target frequency $1/T$. The threshold $\Delta$ is usually chosen to be much smaller than $T$.
The uniformity properties ensure that the sampled frames represent a normal video well and remain interpretable, as they are evenly spaced up to a bounded error. Property U1 is about incremental uniformity: it ensures that the time interval between two neighboring frames stays reasonably close to $T$. This is particularly important for algorithms that infer motion or depth from the difference between the contents of these two frames (Dosovitskiy et al., 2015; Gordon et al., 2019). Property U2 is about cumulative uniformity: it ensures that there is no long-term deviation from a fixed frame rate. This is particularly important if we need to align the sampled frames with a stream of features from another modality (e.g., audio) that may have a constant frequency.
The randomness properties ensure that the sampled frames are robust against adversarially positioned frames. Property R1 ensures that there is no blind spot from which we never take a sample. Property R2 ensures that there is no long-term correlation that could be exploited. For example, if a uniform sequence is shifted by a single overall random offset, R1 can be satisfied, but R2 cannot. In this case, an attacker needs at most a few tries to find an offset in his or her favor and trick the sampling strategy.
The following subsection proves that any set of frames that persistently appears throughout a video will have a probability exponentially close to $1$ of getting caught in a sampled sequence with the above properties.
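The uniformity properties can be checked numerically for any candidate sequence. The sketch below assumes the bounds $|t_{n+1} - t_n - T| \le 2\Delta$ for U1 and $|t_n - nT - c| \le \Delta$ for U2; the helper name and test sequences are our own illustrations.

```python
# Numerical check of the uniformity properties for a candidate sequence.
# Assumed bounds (consistent with the definitions in this section):
#   U1: |t[n+1] - t[n] - T| <= 2*delta
#   U2: |t[n] - n*T - c|    <= delta   for some fixed offset c

def check_uniformity(t, T, delta, c=0.0):
    u1 = all(abs(t[n + 1] - t[n] - T) <= 2 * delta + 1e-12 for n in range(len(t) - 1))
    u2 = all(abs(t[n] - n * T - c) <= delta + 1e-12 for n in range(len(t)))
    return u1, u2

# A perfectly regular sequence satisfies both properties with any delta >= 0.
regular = [n * 1.0 for n in range(100)]
print(check_uniformity(regular, T=1.0, delta=0.0))   # (True, True)

# A sequence with accumulating drift keeps U1 but violates U2.
drifting = [n * 1.0 + n * 0.01 for n in range(100)]
print(check_uniformity(drifting, T=1.0, delta=0.1))  # (True, False)
```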
2.2 Security statement and proof
Let $\mu$ denote the Lebesgue measure on $\mathbb{R}$, and let $B$ be a subset of $\mathbb{R}$ whose intersection with each interval $I_n = [nT + c - \Delta,\, nT + c + \Delta]$ has measure either $0$ or at least a fixed $\beta > 0$. Let $\{t_n\}$ be a sampled sequence in $\mathbb{R}$ that satisfies the uniformity and randomness properties. Then the probability of $\{t_n\}$ and $B$ being disjoint goes to $0$ exponentially with the measure of $B \cap \bigcup_n I_n$ as that measure goes to $\infty$.

Let $\rho$ be the threshold in Property R1. By Property R2 applied with $\epsilon = \frac{1}{2}\rho\beta$, there exists an integer $N$ such that
$$\bigl|\Pr(t_m \in A \mid E) - \Pr(t_m \in A)\bigr| \le \tfrac{1}{2}\rho\beta$$
for any $m$, any measurable $A \subseteq \mathbb{R}$, and any positive-probability event $E$ about $\{t_n : n \le m - N\}$. It follows that
$$\Pr(t_m \in A \mid E) \ge \Pr(t_m \in A) - \tfrac{1}{2}\rho\beta$$
for any $m$, $A$, and $E$ under the same condition.

Write $B_n = B \cap I_n$, and let $n_1 < n_2 < \cdots < n_K$ enumerate indices with $\mu(B_{n_k}) \ge \beta$, thinned so that $n_{k+1} - n_k \ge N$. Since each $B_n$ has measure at most $2\Delta$, we can pick
$$K \ge \frac{\mu\bigl(B \cap \bigcup_n I_n\bigr)}{2\Delta N}.$$

By Property U2, with a probability of $1$, there is exactly one $t_n$ in each interval $I_n$; hence whether $B_{n_k}$ is hit is determined by $t_{n_k}$ alone. By Property R1,
$$\Pr(t_{n_k} \in B_{n_k}) \ge \rho\,\mu(B_{n_k}) \ge \rho\beta,$$
and therefore, conditioning on all previous misses,
$$\Pr\bigl(t_{n_k} \notin B_{n_k} \,\big|\, t_{n_j} \notin B_{n_j} \text{ for all } j < k\bigr) \le 1 - \rho\beta + \tfrac{1}{2}\rho\beta = 1 - \tfrac{1}{2}\rho\beta.$$
Chaining these factors over $k = 1, \dots, K$ gives
$$\Pr\bigl(\{t_n\} \cap B = \emptyset\bigr) \le \bigl(1 - \tfrac{1}{2}\rho\beta\bigr)^K.$$
Therefore, the probability goes to $0$ as $\mu(B \cap \bigcup_n I_n)$ goes to $\infty$, and the rate of convergence is exponential with the measure. ∎
2.3 Jittering with reflection
Our proposed sampling strategy, ‘jittering with reflection’, satisfies all desired properties. Moreover, it has the elegant behavior that $p_n$ is constantly $\frac{1}{2\Delta}$ on $I_n$, which maximally ensures robustness. The key of the algorithm is mirror reflection against the boundaries of the intervals $I_n$. Figure 1 shows examples of the sampled frames.
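The reflection step can be sketched as follows. This is a minimal illustration, assuming a uniform initial jitter on $[-\Delta, \Delta]$ and a per-step jitter drawn uniformly from $[-\sigma, \sigma]$ with $\sigma \le 2\Delta$; the function name and parameter choices are ours, not Algorithm 1 verbatim.

```python
import random

def jitter_with_reflection(T, delta, sigma, n_samples, seed=None):
    """Sketch of 'jittering with reflection' (continuous time).

    T      : target sampling interval
    delta  : jitter bound; sample n stays in [n*T - delta, n*T + delta]
    sigma  : per-step jitter magnitude (assumed uniform on [-sigma, sigma])
    """
    rng = random.Random(seed)
    s = rng.uniform(-delta, delta)        # initial jitter, uniform by design
    samples = []
    for n in range(n_samples):
        samples.append(n * T + s)
        s += rng.uniform(-sigma, sigma)   # propose the next jitter
        # Mirror-reflect against the boundaries +delta / -delta.
        if s > delta:
            s = 2 * delta - s
        elif s < -delta:
            s = -2 * delta - s
    return samples

ts = jitter_with_reflection(T=1.0, delta=0.25, sigma=0.2, n_samples=1000, seed=0)
# Every sample stays within delta of its nominal timestamp (Property U2).
assert all(abs(t - n * 1.0) <= 0.25 for n, t in enumerate(ts))
```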
The sampled sequence from Algorithm 1 satisfies the uniformity and randomness properties.
It is clear that uniformity properties U1 and U2 are satisfied, with offset $c$.
Denote $s_n = t_n - nT - c$; then $s_n \in [-\Delta, \Delta]$. Moreover, $\{s_n\}$ is a Markov chain with transitions given by
$$s_{n+1} = \begin{cases} s_n + \xi_n & \text{if } |s_n + \xi_n| \le \Delta, \quad \text{(3a)}\\ 2\Delta - (s_n + \xi_n) & \text{if } s_n + \xi_n > \Delta, \quad \text{(3b)}\\ -2\Delta - (s_n + \xi_n) & \text{if } s_n + \xi_n < -\Delta, \quad \text{(3c)} \end{cases}$$
where the jitters $\xi_n$ are i.i.d. with a density that is symmetric about $0$ and supported on $[-a, a]$ with $a \le 2\Delta$.

Let $q_n$ be the probability density function of $s_n$, and $g$ be the probability density function of $\xi_n$. We prove by induction that $q_n \equiv \frac{1}{2\Delta}$ on $[-\Delta, \Delta]$ for any $n$. $s_0$ follows the uniform distribution by design. Suppose $s_n$ follows it; we break the range of $s_{n+1}$ into three segments.

If $s \in [-\Delta + a, \Delta - a]$, only the case in Equation 3a can happen:
$$q_{n+1}(s) = \int q_n(s - x)\, g(x)\, \mathrm{d}x = \frac{1}{2\Delta} \int g(x)\, \mathrm{d}x = \frac{1}{2\Delta}.$$
For the two boundary segments, the direct contribution from Equation 3a and the mirrored contribution from Equation 3b (or 3c) add up to the same constant; here we use the fact that $g(x) = g(-x)$ because $g$ is symmetric about $0$. It follows that $q_{n+1} \equiv \frac{1}{2\Delta}$, which concludes the induction.
As the sequence of intervals $\{I_n\}$ tiles the sampling region and each $t_n$ covers its unique interval $I_n$ with a uniform distribution of density $\frac{1}{2\Delta}$, Property R1 is satisfied with $\rho = \frac{1}{2\Delta}$.
To prove Property R2, we will show that the Markov chain $\{s_n\}$ with any initial distribution converges to the same stationary distribution at an exponential rate. First, we define the reflected extension $\tilde q_n$ of $q_n$, which has a period of $4\Delta$:
$$\tilde q_n(s) = \begin{cases} q_n(s) & s \in [-\Delta, \Delta],\\ q_n(2\Delta - s) & s \in (\Delta, 3\Delta]. \end{cases}$$
Under this unfolding, a reflection at a boundary becomes ordinary addition modulo $4\Delta$. We then look at the Fourier series of $\tilde q_n$ with period $4\Delta$,
$$\tilde q_n(s) = \sum_{k} c_k^{(n)}\, e^{i \pi k s / (2\Delta)},$$
and the Fourier transform of $g$:
$$\hat g(\nu) = \int g(x)\, e^{-2\pi i \nu x}\, \mathrm{d}x.$$
It is easy to see that $\tilde q_n$ has a nice behavior under ‘jittering with reflection’. Namely, $\tilde q_{n+1}$ is the (periodized) convolution of $\tilde q_n$ and $g$, or
$$c_k^{(n+1)} = c_k^{(n)}\, \hat g\!\left(\frac{k}{4\Delta}\right).$$
Since $g$ is a probability density, we see that $|\hat g(\nu)| \le 1$, with the equality holding only if $\nu = 0$. Also, by the Riemann–Lebesgue lemma, we have $\hat g(\nu) \to 0$ as $\nu \to \infty$. Therefore, when $n \to \infty$, only the zero-frequency coefficient survives, and $\tilde q_n$ becomes a constant function regardless of $\tilde q_0$; the convergence has a bounded exponential rate, namely $\max_{k \neq 0} |\hat g(k/(4\Delta))| < 1$ per step. This asymptotic independence implies that Property R2 is satisfied.
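The contraction rate becomes concrete for a specific jitter distribution. Assuming, for illustration, a jitter uniform on $[-a, a]$ (whose Fourier transform is a sinc function), the sketch below evaluates $|\hat g|$ at the harmonics of the period-$4\Delta$ extension; the largest non-constant gain bounds the per-step convergence rate.

```python
import math

def harmonic_gain(k, a, delta):
    """|g_hat| at the k-th harmonic of the period-4*delta reflected extension,
    for a jitter uniform on [-a, a]; g_hat(nu) = sin(2*pi*nu*a)/(2*pi*nu*a)."""
    if k == 0:
        return 1.0
    x = math.pi * k * a / (2 * delta)  # 2*pi*a * (k / (4*delta))
    return abs(math.sin(x) / x)

a, delta = 0.2, 0.25
gains = [harmonic_gain(k, a, delta) for k in range(8)]
rate = max(gains[1:])   # per-step contraction of all non-constant modes
print(rate)             # < 1, so q_n converges to uniform geometrically
```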
3 The discrete version
In this section, we formulate the discrete version of this problem, which is relevant to real products. For example, MediaPipe (Lugaresi et al., 2019) is an open-source framework for building multi-modal (e.g., video, audio) applied ML pipelines, and its timestamps are at microsecond granularity.
3.1 Uniformity and randomness properties
For a given integer interval $T > 0$ and an integer perturbation threshold $\Delta \in (0, T/2)$, we would like to probabilistically sample an infinite sequence $\{t_n\}_{n \ge 0}$ in $\mathbb{Z}$ that satisfies both the uniformity properties and randomness properties as defined below.

(U1) $|t_{n+1} - t_n - T| \le 2\Delta$ for any $n$.

(U2) There exists some integer offset $c$ such that $|t_n - nT - c| \le \Delta$ for any $n$.

(R1) There exists some threshold $\rho > 0$ such that $\Pr(t_n = t) \ge \rho$ for any integer $t \in [nT + c - \Delta,\, nT + c + \Delta]$.

(R2) The event $\{t_m \in A\}$ becomes independent of all events concerning $\{t_n : n \le m - N\}$ as $N \to \infty$. Formally, for any $\epsilon > 0$, there exists an integer $N$ such that $|\Pr(t_m \in A \mid E) - \Pr(t_m \in A)| \le \epsilon$ for any integer $m$, any $A \subseteq \mathbb{Z}$, and any positive-probability event $E$ about $\{t_n : n \le m - N\}$.
3.2 Security Statement
Let $B$ be an infinite subset of $\mathbb{Z}$. Let $\{t_n\}$ be a sampled sequence in $\mathbb{Z}$ that satisfies the uniformity and randomness properties. Then the probability of $\{t_n\}$ and $B$ being disjoint goes to $0$ exponentially with the cardinality of $B \cap \bigcup_n I_n$ as it goes to $\infty$, where $I_n = [nT + c - \Delta,\, nT + c + \Delta]$.
The proof is similar to the continuous version and is omitted.
3.3 Jittering with reflection
The discrete ‘jittering with reflection’ sampling is stated in Algorithm 2.
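A minimal integer-timestamp sketch of the same reflection idea follows; the jitter distribution (a uniform step in $\{-1, 0, 1\}$) and the half-step reflection rule are illustrative assumptions, not Algorithm 2 verbatim.

```python
import random

def discrete_jitter_with_reflection(T, delta, n_samples, seed=None):
    """Sketch of discrete 'jittering with reflection'.

    Timestamps are integers; the jitter state s stays in {-delta, ..., delta}
    and each step moves it by -1, 0, or +1 (an assumed jitter distribution),
    reflecting at the boundaries. Reflecting about the half-step boundary,
    s -> 2*delta + 1 - s, keeps the uniform state distribution invariant.
    """
    rng = random.Random(seed)
    s = rng.randint(-delta, delta)        # uniform initial state
    out = []
    for n in range(n_samples):
        out.append(n * T + s)
        s += rng.choice((-1, 0, 1))
        if s > delta:                     # reflect at the upper boundary
            s = 2 * delta + 1 - s
        elif s < -delta:                  # reflect at the lower boundary
            s = -(2 * delta + 1) - s
    return out

ts = discrete_jitter_with_reflection(T=10, delta=2, n_samples=100, seed=0)
assert all(abs(t - 10 * n) <= 2 for n, t in enumerate(ts))
```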
The sampled sequence from Algorithm 2 satisfies the uniformity and randomness properties.
The proof is analogous to that of the continuous version. We can show that $\Pr(t_n = nT + c + s) = \frac{1}{2\Delta + 1}$ for any $s \in \{-\Delta, \dots, \Delta\}$. The additional condition in Algorithm 2 ensures that the Markov chain converges to the uniform distribution for any initial distribution. The reason is that, in the discrete Fourier transform of the jitter distribution, the coefficients of all but the constant term are less than $1$ in absolute value.
4 An example: video classification after frame sampling
In this section, we demonstrate a simple example of the discrete frame sampling with interval $T = 2$ and present its impact on video classification.
Let $s_n = t_n - 2n$; then $\{s_n\}$ is a Markov chain with only two possible states, $0$ and $1$. At each step, the jittering flips the state with a probability of $q$. The transition matrix $P$, where $P_{ij}$ represents the probability of moving from state $i$ to state $j$, is given by
$$P = \begin{pmatrix} 1 - q & q \\ q & 1 - q \end{pmatrix}.$$
It can be derived that the correlation between $s_n$ and $s_{n+k}$ is
$$\operatorname{corr}(s_n, s_{n+k}) = (1 - 2q)^k.$$
It shows that the correlation drops to zero at an exponential speed, which is consistent with the randomness properties. We define the correlation length $L$ as the number of steps over which the correlation drops to $1/e$, i.e.,
$$L = -\frac{1}{\ln(1 - 2q)}.$$
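The correlation behavior of the symmetric two-state chain can be checked numerically. The sketch below computes the correlation both from the closed form $(1 - 2q)^k$ and from powers of the transition matrix, along with the correlation length; the function names are ours.

```python
import math

def correlation(q, k):
    """Correlation between s_n and s_{n+k} for the symmetric two-state chain
    that flips with probability q at each step: corr = (1 - 2q)**k."""
    return (1 - 2 * q) ** k

def correlation_by_matrix(q, k):
    """Same quantity from powers of the 2x2 transition matrix."""
    # P = [[1-q, q], [q, 1-q]]; track row 0 under repeated multiplication.
    p00, p01 = 1.0, 0.0
    for _ in range(k):
        p00, p01 = p00 * (1 - q) + p01 * q, p00 * q + p01 * (1 - q)
    # With uniform marginals on states {0, 1}, corr = P(same) - P(different).
    return p00 - p01

def correlation_length(q):
    """Number of steps for the correlation to drop to 1/e."""
    return -1.0 / math.log(1 - 2 * q)

q = 0.1
print(correlation(q, 5))             # 0.8**5 = 0.32768
print(correlation_by_matrix(q, 5))   # matches the closed form
print(correlation_length(q))         # ~4.48 steps
```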
A larger $q$ gives a shorter correlation length and therefore more robustness. The red curve in Figure 2 visualizes this relation.
It can also be derived that the variance of the sampling gap is
$$\operatorname{Var}(t_{n+1} - t_n) = q.$$
This means a larger $q$ indicates a further departure from uniformity. Sampling frames at irregular intervals may have many undesired consequences. Nevertheless, we use the following experiment to demonstrate that the impact on video-level classification tasks is small.
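The gap variance of the two-state chain can likewise be checked by simulation; the sketch below uses an assumed flip probability and estimates the variance of consecutive gaps empirically.

```python
import random

def simulate_gap_variance(q, steps, seed=0):
    """Estimate Var(t_{n+1} - t_n) for the two-state jitter (T = 2, states 0/1)
    flipping with probability q; the closed-form value derived above is q."""
    rng = random.Random(seed)
    s = rng.randint(0, 1)
    prev = 0 * 2 + s
    gaps = []
    for n in range(1, steps):
        if rng.random() < q:   # flip the jitter state with probability q
            s = 1 - s
        t = 2 * n + s
        gaps.append(t - prev)
        prev = t
    mean = sum(gaps) / len(gaps)
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)

print(simulate_gap_variance(0.2, 200000))  # close to q = 0.2
```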
YouTube-8M (Abu-El-Haija et al., 2016) is a large-scale labeled video dataset that consists of features from millions of YouTube videos with high-quality machine-generated annotations. We use the 2018 version which has about 6 million videos with a diverse vocabulary of 3862 audio-visual entities. 1024-dimensional visual features and 128-dimensional audio features at 1 frame per second are extracted from bottleneck layers of pre-trained deep neural networks and are provided as input features for this dataset.
We train a deep-bag-of-frames (DBoF) model as described in Li et al. (2019). In the DBoF model, a few layers (shared across frames) are applied to each frame, then the frame-level features are aggregated into a video-level feature, and finally a few additional layers are applied to obtain predictions. For evaluation, we only take frames at multiples of 5 seconds, just to magnify the effect of sampling. On top of these frames, we use the discrete ‘jittering with reflection’ with $T = 2$ (which corresponds to 10 seconds in the videos) to sample frames before applying the DBoF model. The blue curve in Figure 2 shows that the global average precision (GAP) of the DBoF model degrades slightly as $q$ increases, which is expected. The variation is small because topical annotations are insensitive to the locations of the frames.
One can adjust $q$ to achieve the desired trade-off between robustness (as measured by the correlation length $L$) and uniformity (as measured by the variance of the sampling gap).
In this paper, we formulated the uniformity and randomness properties that a general frame sampling strategy desires. We proved that if these properties are satisfied, a strategy is robust in the sense that any recurring sequence of frames has an exponentially small chance of concealing itself. We designed an algorithm, ‘jittering with reflection’, that satisfies all the desired properties, and the magnitude of the jittering can be tuned to achieve the desired trade-off between the amount of irregularity and the degree of robustness. We expect this algorithm to be widely useful in video analysis.
We thank Qingchun Ren for assistance in formalizing many proofs in this paper.
- Abu-El-Haija et al. (2016) Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
- Boreczky & Rowe (1996) Boreczky, J. S. and Rowe, L. A. Comparison of video shot boundary detection techniques. Journal of Electronic Imaging, 5(2):122–129, 1996.
- Dosovitskiy et al. (2015) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., and Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766, 2015.
- Gordon et al. (2019) Gordon, A., Li, H., Jonschkowski, R., and Angelova, A. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hosseini et al. (2017) Hosseini, H., Xiao, B., Clark, A., and Poovendran, R. Attacking automatic video analysis algorithms: A case study of google cloud video intelligence api. In Proceedings of the 2017 on Multimedia Privacy and Security, pp. 21–32. ACM, 2017.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
- Li et al. (2019) Li, H., Ng, J. Y.-H., and Natsev, P. Ensemblenet: End-to-end optimization of multi-headed models. arXiv preprint arXiv:1905.09979, 2019.
- Lugaresi et al. (2019) Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.-L., Yong, M. G., Lee, J., et al. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.
- Pal et al. (2015) Pal, G., Rudrapaul, D., Acharjee, S., Ray, R., Chakraborty, S., and Dey, N. Video shot boundary detection: a review. In Emerging ICT for Bridging the Future-Proceedings of the 49th Annual Convention of the Computer Society of India CSI Volume 2, pp. 119–127. Springer, 2015.
- Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.