1 Introduction
Visual repetition is ubiquitous in the world around us. It is present in activities like rowing, music-making and cooking. It arises in natural and urban environments: traffic patterns, blinking lights, and leaves in the wind. Rhythm and repetition are used to approximate velocity, estimate progress and to trigger attention [13]. In computer vision, understanding repetition in video is important as it can serve action classification [9, 17], action localization [14, 24], human motion analysis [1, 21], 3D reconstruction [3] and camera calibration [12].
Estimating repetition remains challenging. First and foremost, repetition appears in many forms due to its variety in motion pattern and motion continuity. The viewpoint is crucial for the perception of recurrence. In practice, camera motion makes repetition estimation inevitably hard. Existing work on repetition estimation in video [15, 19] reports good results under the assumption that the motion is well-localized (static) and strongly periodic (stationary). In short, existing work focuses on video that is static in every aspect of repetition. As real life is more complex, our method relies on motion foreground segmentation to localize the salient motion and handle non-static video. Furthermore, we found fixed-period Fourier analysis [7, 19, 20] to be unsuitable for repetition estimation in real-world video as non-stationarity often appears. To permit non-stationary video dynamics, we adopt the wavelet transform for decomposing video signals into a time-frequency spectrum.
We reconsider the theory of repetition [19, 8] starting from the divergence, gradient and curl operators acting on the 3D flow field. We derive three motion types and three motion continuities. What follows are fundamental cases of intrinsic periodicity in 3D. For the 2D perception of 3D intrinsic periodicity, the observer’s viewpoint can be somewhere in the continuous range between two viewpoint extremes. Ultimately, we distinguish fundamental cases for the 2D perception of 3D intrinsic periodic motion.
The contributions of our work are the following. (1) Starting from the first principles of 3D periodicity and its perception in 2D, we derive fundamentally different cases of repetitive perception. (2) To estimate repetition in video under realistic circumstances, we compute a diverse flow-based representation over the motion foreground segmentation. Our method uses wavelets to handle non-stationary motion and automatically selects the most discriminative signal based on self-estimated quality assessment. (3) Extending beyond the video dataset of [15], we propose the new QUVA Repetition dataset for repetition estimation, which is more realistic and challenging by lifting the static and stationary assumptions. (4) We evaluate on the task of repetition counting and show that our method outperforms the deep learning-based state-of-the-art [15] on the new dataset.
2 Related Work
Existing approaches for repetition estimation in video typically represent video as one-dimensional signals that preserve the repetitive structure of the motion. Then, frequency information is extracted by Fourier analysis [2, 7, 19, 30], peak detection [28] or singular value decomposition [6]. Pogalin et al. [19] estimate the frequency of motion in video by tracking an object, performing principal component analysis over the tracked regions and employing the Fourier-based periodogram. However, methods relying on Fourier analysis for periodic motion are neither able nor intended to handle non-stationary motion, which is ubiquitous in the real world.
Briassouli & Ahuja [4] employ time-frequency analysis using the Short-Time Fourier Transform for dealing with multiple periodic motions. In [5], the authors propose a spatiotemporal filter bank for estimating repetition in video. Their filters work online and are effective when tuned correctly. However, we question its practical use, as their experiments are limited to stationary motion and the filter bank requires manual tuning. We also use a time-frequency decomposition of signals from video, but concentrate on handling non-stationary repetition. Instead of using the Short-Time Fourier Transform, we adopt the continuous wavelet transform to achieve better resolution [23].
The studies on periodic motion by [8, 19, 26] have encouraged us to reconsider visual repetition. Pogalin et al. [19] identify four visually periodic motion types (translation, rotation, deformation and intensity variation) supplemented with three cases of motion continuity (oscillating, constant and intermittent) in the 2D field of view. In this work, we argue that the 3D flow field is the right starting point to derive the foundations of repetition. From the 3D flow field and the differential operators acting on it, we derive three motion types and three motion continuities that organize into a Cartesian table. Moreover, the projection of 3D periodicity on 2D perception has to consider the viewpoint. What follows are fundamentally different cases of 2D repetitive perception from 3D periodicity.
Levy & Wolf [15] introduce a convolutional neural network for estimating repetition by counting in live video. Their network is trained to predict the motion period on synthetic video sequences in which moving squares exhibit periodic motion of four motion types from [19]. At test time, the method takes a stack of video frames, computes a region of interest by motion thresholding, and forwards the frame crops through the network to classify the motion period. The system is evaluated on the task of repetition counting and shows near-perfect performance on their YTSegments dataset. The videos are a good initial set of examples, but as the majority of videos have a static viewpoint and exhibit stationary periodic repetitions, we propose a new dataset. Our dataset better reflects reality by including more non-static and non-stationary examples. Similar to Levy & Wolf, we also evaluate repetition estimation by counting.
3 Theory
3.1 3D Intrinsic Periodicity
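In 3D, intrinsic periodicity is defined as the reappearing of the same 3D-flow induced by the motion of an object over time. For a moment in time $t$, we denote the flow by $\mathbf{F}(\mathbf{x}, t)$. The 3D-flow field tied to the object is periodic as expressed by $\mathbf{F}(\mathbf{x} + \mathbf{S}, t + T) = \mathbf{F}(\mathbf{x}, t)$, where we exclude for the moment the trivial case that the flow field is constant. The parameter $T$ is the period over time, where $\mathbf{S}$ is the period, if any, over space.
Let the flow field be given by its directional components: $\mathbf{F} = (F_1, F_2, F_3)$ along the axes $(x_1, x_2, x_3)$, with $\partial_i = \partial / \partial x_i$. From differential geometry, we have the three operators on the flow field:
$$\nabla \mathbf{F} = \nabla \otimes \mathbf{F}, \qquad (\nabla \mathbf{F})_{ij} = \partial_i F_j \tag{1}$$
$$\nabla \cdot \mathbf{F} = \partial_i F_i \tag{2}$$
$$\nabla \times \mathbf{F} = \epsilon_{ijk}\, \partial_j F_k\, \mathbf{e}_i \tag{3}$$
where in Eq. (1) the product $\nabla \otimes \mathbf{F}$ defines a dyadic tensor, and repeated indices are summed by the Einstein convention [27]. The equations define the gradient, divergence and curl of the flow field [25]. Three basic 3D-motion types emerge depending on the values of divergence and curl as follows:
translation: $\nabla \cdot \mathbf{F} = 0$, $\nabla \times \mathbf{F} = \mathbf{0}$
rotation: $\nabla \cdot \mathbf{F} = 0$, $\nabla \times \mathbf{F} \neq \mathbf{0}$
expansion: $\nabla \cdot \mathbf{F} \neq 0$, $\nabla \times \mathbf{F} = \mathbf{0}$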
In practice there may be a mixture of types; as we are aiming to handle realistic video, we select the dominant 3D-periodicity in the object’s motion, whichever is best measurable. In the rare case of counterbalancing expansion and contraction over different axes, it can be that the divergence vanishes while the motion is still periodic.
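As a worked illustration of the three motion types, consider three elementary flow fields. The symbols $\mathbf{F}$, $\mathbf{v}$, $\omega$ and $a$ are chosen here for the example only and are not taken from the original notation; the divergence and curl values follow from the standard definitions.

```latex
\begin{align*}
\text{translation:} \quad & \mathbf{F} = \mathbf{v}\ (\text{constant})
  & \nabla \cdot \mathbf{F} &= 0, & \nabla \times \mathbf{F} &= \mathbf{0} \\
\text{rotation:} \quad & \mathbf{F} = \omega\,(-y,\; x,\; 0)
  & \nabla \cdot \mathbf{F} &= 0, & \nabla \times \mathbf{F} &= (0,\; 0,\; 2\omega) \\
\text{expansion:} \quad & \mathbf{F} = a\,(x,\; y,\; z)
  & \nabla \cdot \mathbf{F} &= 3a, & \nabla \times \mathbf{F} &= \mathbf{0}
\end{align*}
```

A field such as $\mathbf{F} = a\,(x,\; -y,\; 0)$ expands along one axis and contracts along another, so its divergence is zero although the motion is neither a pure translation nor a rotation.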
In addition, the motion continuity in 3D can be a source of periodicity. Depending on the type of motion, the motion field needs to fulfill one of the following necessary periodic conditions:
where denotes a translation as the object’s periodicity may be superposed on translation. For robustness to illumination changes, the measurement of is preferred over . From these equations three different periodic motion continuities can be distinguished: constant, intermittent and oscillating periodicity. Again, in practice the motion continuity may be a mixture between types.
3.2 2D Recurrence of 3D Intrinsic Periodicity
So far we have considered the intrinsic periodicity in 3D. We reserve the term recurrent for the 2D observation of the 3D periodicity. Recurrence in the field of view is defined by:
(4) 
where is perceived flow in 2D image coordinates , is the observed displacement, is the recurrence and denotes the observational scale (camera zoom). The underlying principle is that the same period length will be observed in both 3D and 2D for all cases of intrinsic periodicity. As we perform all measurements within one image, from here on implies where subscript is omitted for clarity.
In addition, the intrinsic periodicity in 3D does not cover all perceived recurrence in an image sequence. For the trivial cases of constant translation and constant expansion in 3D, perceived recurrence will appear when a repetitive chain of objects (conveyor belt) or a repetitive appearance (checkered balloon) on the object, as given by Equation 4, is aligned with the motion. In such cases, recurrence will also be observed in the field of view. For constant rotation, the restriction is that the appearance cannot be constant over the surface, as no motion, let alone recurrent motion would be observed. In the rotational case, any rotational symmetry in appearance will induce a higher order recurrence as a multiplication of the symmetry and the rotational speed.
For the purpose of recurrence, nine cases organize in a Cartesian table of basic motion type times motion continuity, see (a). The corresponding examples of these nine cases are given in (b). This is the list of fundamental cases, where a mixture of types is permitted. In practice, some cases are ubiquitous, while for others it is hard to find examples at all and a mixture of types is rare.
3.3 The Viewpoint
The point of view has a large influence on the perception of the flow field. There are two fundamentally different viewpoints: the frontal view and the side view:
frontal view:  
side view: 
For translation there is one main axis and two perpendicular axes, which are both identical for our purpose. There is no distinction between the two perpendicular views. Similarly, for rotation the two perpendicular cases are also indistinguishable. For expansion there are one, two or three axes of expansion, again leaving us with the frontal case and the perpendicular case as the two fundamental cases. Consequently, for all cases considered, a distinction between frontal view and side view is sufficient. As a result, the perceived recurrence is summarized between the two extreme viewpoints, which results in the Cartesian product of two times nine basic cases as summarized in Figure 3. The two views are the ends of a continuous range of viewpoints. An actual viewpoint will be somewhere in between the frontal view and the side view, most of the time. This leaves the flow field asymmetrical or skewed, either in gradient, curl or divergence. As long as the signal can be measured this will not affect the recurrent nature of the signal.
3.4 Non-Static Repetition
So far we have assumed a static camera position. With recurrent motion in particular, the camera may move because (1) it is mounted on the moving object itself, (2) it is following the target of interest, or (3) it is in motion independent of the motion of the object. For the first two cases, the camera motion reflects the periodic dynamics of the object’s motion. The flow field may be outside the object, but otherwise it displays a complementary pattern in the flow field.
Only the third case demands removal of the camera motion prior to the repetitive motion analysis. In practice, this situation occurs frequently. Therefore, particular attention needs to be paid to camera motion independent of the target’s motion. When the viewpoint changes from frontal to side view due to the camera motion, the analysis will be inevitably hard. Figure 3 illustrates the dramatic changes in the flow field when the camera changes from one extreme viewpoint (side) to the other (frontal), or vice versa.
In addition, even when object motion and camera are both static, for none of the intrinsic motion types (translation, rotation, expansion), a point on the object will be at the same position in the camera field all the time. Under the double static condition, a point will just return to the same point on the camera field. As the intermediate points on the object or background have an arbitrary albedo and radiate an arbitrary luminance, no sinusoidal signal will result in general. This is noteworthy as all previous work [7, 16, 19] implicitly assumed such a signal by considering the Fourier transform or variants.
3.5 Non-Stationary Repetition
A recurrent signal is said to be stationary when the period length is constant over time. In the initial steps of periodicity analysis, it was assumed the periodic signal was near-stationary. In practice, we have observed that stationary repetitive signals are relatively rare. Decay in frequency or accelerating motion are common in realistic video. Therefore, in contrast to [7, 19], we do not assume stationarity, making the method more robust to acceleration. We will employ local wavelets in response to the anticipated signals.
4 Repetition Estimation
Our method for repetition estimation follows a three-stage approach (Figure 4). First, we localize the target instance in the scene, then we represent the target by a set of time-varying signals and finally we perform time-frequency decomposition to estimate repetition and select the most discriminative signal.
Signals from Video. To deal with camera motion and to handle the wide variety in repetitions, we construct a diverse set of time-varying flow-based signals that we compute over the motion foreground segmentation. Specifically, we measure the average-pooled flow field and the differentials of the flow. We estimate by measuring and . All the differentials of the flow field are computed using Gaussian derivative filters with a large filter size to obtain a global measurement over the foreground segmentation. The final measurement is the average-pooled value over a small radius around the object’s center. The differential operators of the flow field comprise four different measurements (as the curl has only one direction perpendicular to the screen), whereas there are two zeroth-order flow signals. In total these amount to six different signals.
For the cases of oscillating and intermittent motion observed from the side, will deliver the strongest repetitive signal. The flow field will convey a stronger repetitive signal for the cases of constant motion appearance. In practice, it may be hard to select the most discriminative signal, to which we return at the end of this section.
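The six measurements can be sketched as follows, assuming a dense 2D flow field stored as an (H, W, 2) array and a boolean foreground mask. The function names, the default `sigma`, and the choice of `grad_x` and `grad_y` as the diagonal derivative terms are assumptions made for this illustration. For brevity the sketch pools over the whole mask rather than over a small radius around the object's center.

```python
import numpy as np


def gaussian_kernels(sigma):
    """Return a sampled 1D Gaussian and its first derivative."""
    radius = int(4 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    dg = -x / sigma**2 * g  # derivative of the normalized Gaussian
    return g, dg


def separable_filter(img, k_rows, k_cols):
    """Convolve along rows (y) with k_rows, then along columns (x) with k_cols.

    The image must be larger than the kernels in both directions.
    """
    out = np.apply_along_axis(np.convolve, 0, img, k_rows, mode="same")
    return np.apply_along_axis(np.convolve, 1, out, k_cols, mode="same")


def flow_signals(flow, mask, sigma=2.0):
    """Six scalar measurements for one frame, average-pooled over the mask."""
    fx, fy = flow[..., 0], flow[..., 1]
    g, dg = gaussian_kernels(sigma)
    dfx_dx = separable_filter(fx, g, dg)
    dfx_dy = separable_filter(fx, dg, g)
    dfy_dx = separable_filter(fy, g, dg)
    dfy_dy = separable_filter(fy, dg, g)
    maps = {
        "Fx": fx,
        "Fy": fy,
        "div": dfx_dx + dfy_dy,    # divergence of the 2D flow
        "curl": dfy_dx - dfx_dy,   # single curl component, normal to the image
        "grad_x": dfx_dx,          # assumed gradient terms (illustrative choice)
        "grad_y": dfy_dy,
    }
    return {name: float(m[mask].mean()) for name, m in maps.items()}
```

Calling `flow_signals` on every frame and stacking the results over time yields six one-dimensional signals per video.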
Time-Frequency Decomposition. Consider a discrete signal $x_n$ for timesteps $n = 0, \ldots, N-1$ sampled at equally spaced intervals $\delta t$. Let $\psi_0(\eta)$ be some admissible wavelet function, depending on the non-dimensional time parameter $\eta$. The continuous wavelet transform [10] is defined as the convolution of $x_n$ with a “daughter” wavelet generated by scaling and translating the wavelet function $\psi_0(\eta)$:
$$W_n(s) = \sum_{n'=0}^{N-1} x_{n'}\, \psi^{*}\!\left[\frac{(n'-n)\,\delta t}{s}\right] \tag{5}$$
where the asterisk represents the complex conjugate. By varying the time parameter $n$ and the scale parameter $s$, the wavelet transform can generate a time-scale representation describing how the amplitude of the signal changes with time and scale. While formally a time-scale representation, it can also be considered a time-frequency representation since the wavelet scale is directly related to the Fourier frequency [29]. We use the Morlet wavelet, a complex exponential carrier modulated by a Gaussian envelope:
$$\psi_0(\eta) = \pi^{-1/4}\, e^{i\omega_0\eta}\, e^{-\eta^2/2} \tag{6}$$
Since the Morlet wavelet is complex, the wavelet transform $W_n(s)$ is also complex. Therefore, it is useful to define the wavelet power spectrum or scalogram as $|W_n(s)|^2$, representing the time-frequency localized energy. The 2D representation can reveal the signal’s non-stationary repetitive dynamics. Once the wavelet is chosen, what remains is defining the resolution of the time-frequency spectrum by specifying scales $s_j$. In practice, a logarithmic scaling is effective [29]: $s_j = s_0\, 2^{j\,\delta j}$ with $j = 0, 1, \ldots, J$. The smallest measurable scale $s_0$ and the number of scales $J$ determine the range of the frequency resolution.
To estimate non-stationary repetitions in a given video, we decompose the six signals into a time-frequency spectrum using the continuous wavelet transform. What follows are six 2D time-frequency representations that enable further analysis of the repetitive contents of the video.
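A minimal sketch of such a decomposition is given below, following the Morlet formulation of Torrence & Compo [29] with a direct sum over time rather than an FFT. The function names, the default `omega0 = 6`, and the scale grid in the usage note are assumptions for illustration, not the settings used in the experiments.

```python
import numpy as np


def morlet(eta, omega0=6.0):
    """Morlet mother wavelet: complex carrier under a Gaussian envelope."""
    return np.pi ** -0.25 * np.exp(1j * omega0 * eta) * np.exp(-eta**2 / 2)


def fourier_period(scale, omega0=6.0):
    """Equivalent Fourier period of a Morlet wavelet at the given scale [29]."""
    return 4 * np.pi * scale / (omega0 + np.sqrt(2 + omega0**2))


def cwt_morlet(x, dt, scales, omega0=6.0):
    """Continuous wavelet transform; returns a (num_scales, N) complex array."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x))
    lag = (idx[None, :] - idx[:, None]) * dt  # lag[n, n'] = (n' - n) * dt
    out = np.empty((len(scales), len(x)), dtype=complex)
    for j, s in enumerate(scales):
        # Daughter wavelet, normalized to unit energy at every scale.
        daughter = np.sqrt(dt / s) * morlet(lag / s, omega0)
        out[j] = np.conj(daughter) @ x  # sum over n' of x[n'] * conj(psi)
    return out
```

For example, with `scales = 2.0 * 2 ** (0.125 * np.arange(41))` and a sinusoid of period 20 samples, the power `np.abs(cwt_morlet(x, 1.0, scales)) ** 2` concentrates along the scale whose `fourier_period` is closest to 20.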
Counting. We assume there is only one dominant repetitive motion observable in the wavelet spectrum; this is reasonable as the foreground motion segmentation encourages temporal consistency. Selecting the modulus maximum from the wavelet spectrum for every timestep gives a local frequency measurement of approximately for a Morlet wavelet. Our method integrates local frequencies over time to estimate the repetition count: . For a stationary periodic signal the modulus maximum forms a horizontal ridge through time. We emphasize the ability to count non-stationary signals using our approach since the local frequency may change over time. Therefore, our method is able to deal with accelerations or transient phenomena.
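The counting step can be sketched as below, given a wavelet power spectrum of shape (num_scales, N). The conversion from scale to frequency uses the Morlet relation from [29], where the Fourier period is roughly 1.03 times the scale for a carrier frequency of 6; treating this as the conversion intended here is an assumption, as are the function and constant names.

```python
import numpy as np

# Fourier period is about 1.03 * scale for a Morlet wavelet with omega0 = 6 [29].
MORLET_FOURIER_FACTOR = 1.03


def count_repetitions(power, scales, dt):
    """Integrate the ridge frequency of a (num_scales, N) power spectrum."""
    power = np.asarray(power, dtype=float)
    ridge = np.argmax(power, axis=0)  # modulus maximum per timestep
    local_freq = 1.0 / (MORLET_FOURIER_FACTOR * np.asarray(scales)[ridge])
    return float(np.sum(local_freq) * dt)  # cycles = sum of f(t) * dt
```

Because the frequency is read off per timestep, a ridge that moves to a smaller scale halfway through the signal simply contributes more cycles per unit time from that point on.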
Min-Cost Signal Selection. The question that remains is selecting the most discriminative signal out of the six. We propose a selection mechanism that prioritizes signals with local regularity in the time-frequency space. Specifically, we adopt a min-cost algorithm for finding the optimal path through the time-frequency space. We turn the wavelet power into a cost surface for optimization by simply inverting it: . Traversing over a high-power region translates to low cost. As our goal is to characterize a signal by one cost measure, we run a greedy min-cost path-finding algorithm to assess the minimum cost required to traverse the spectrum through time. Consequently, the algorithm assigns a lower cost to paths with high local regularity. This is appealing as realistic video signals can be non-stationary but locally smooth. To make a final prediction we select the signal with minimum cost and its corresponding repetition count.
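One possible reading of this selection step is sketched below. The cost inversion (one minus the power normalized by its maximum), the restriction to moves of at most `max_jump` scale indices per timestep, and the averaging of the path cost over time are all assumptions made for the sketch; the names are hypothetical.

```python
import numpy as np


def greedy_path_cost(power, max_jump=1):
    """Greedy left-to-right path through a (num_scales, N) power spectrum."""
    power = np.asarray(power, dtype=float)
    cost = 1.0 - power / power.max()  # assumed inversion: high power, low cost
    row = int(np.argmin(cost[:, 0]))
    total, path = cost[row, 0], [row]
    for t in range(1, cost.shape[1]):
        lo = max(0, row - max_jump)
        hi = min(cost.shape[0], row + max_jump + 1)
        row = lo + int(np.argmin(cost[lo:hi, t]))  # cheapest reachable scale
        total += cost[row, t]
        path.append(row)
    return float(total) / cost.shape[1], path


def select_signal(spectra):
    """Pick the signal whose spectrum admits the cheapest path."""
    costs = {name: greedy_path_cost(p)[0] for name, p in spectra.items()}
    return min(costs, key=costs.get), costs
```

A spectrum with a smooth ridge yields a cheap path, while one whose maxima jump erratically between distant scales forces the path through low-power cells and accumulates cost.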
5 Datasets, Evaluation and Implementation
Motivated by the observation that the YTSegments [15] dataset for visual repetition estimation is limited in terms of its complexity, we present a new dataset that is more difficult in scene complexity, repetitive appearance and cycle length variation. Our code and data will be made available at https://tomrunia.github.io/projects/repetition.
QUVA Repetition consists of videos displaying a wide variety of repetitive video dynamics, including swimming, stirring, cutting, combing and music-making. The untrimmed videos are collected from YouTube. We asked two human annotators to label the temporal bounds of each interval containing at least four unambiguous repetitions. We found high agreement between the annotators and keep the intervals with the highest overlap to increase clarity. Final intervals are obtained by taking the intersection of the two temporal annotations. Next, we asked the annotators to label the repetition count and the temporal bounds of each cycle. Figure 5 shows a few video examples along with their annotation. In Table 1 we compare the characteristics of our dataset to the YTSegments [15]. Our videos have more variability in cycle length, motion appearance, camera motion and background clutter. By increasing difficulty in both scene complexity and temporal dynamics, our dataset represents a more realistic and challenging benchmark for estimating repetition in video.
YTSegments  QUVA Repetition  

Number of Videos  
Duration (s)  
Count Avg. Std.  
Count Min/Max  /  / 
Cycle Length Variation  
Camera Motion  
Superposed Translation 
Count Evaluation. Given a set of videos, we evaluate the performance between ground truth count and the count prediction for . We report the mean absolute error following prior work [15]: . We also record the off-by-one accuracy (OBOA) or count within-1 accuracy.
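The two metrics can be sketched as follows. Normalizing each absolute error by the ground-truth count is an assumption here (the `normalize` flag switches it off), and the function names are hypothetical.

```python
def mean_absolute_error(true_counts, pred_counts, normalize=True):
    """Mean absolute count error, optionally relative to the true count."""
    errors = []
    for c, c_hat in zip(true_counts, pred_counts):
        err = abs(c - c_hat)
        errors.append(err / c if normalize else err)
    return sum(errors) / len(errors)


def off_by_one_accuracy(true_counts, pred_counts):
    """Fraction of videos whose predicted count is within one of the truth."""
    hits = sum(abs(c - c_hat) <= 1 for c, c_hat in zip(true_counts, pred_counts))
    return hits / len(true_counts)
```

For ground-truth counts `[10, 4, 20]` and predictions `[9, 4, 25]`, the normalized errors are 0.1, 0 and 0.25, and two of the three predictions are within one repetition of the truth.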
Implementation. We use the motion segmentation of Papazoglou and Ferrari [18]. To account for incorrect segmentation masks we reuse the segmentation of the previous frame if the fraction of foreground pixels is less than of the entire frame. To compute the dense flow field we rely on EpicFlow [22]. We compute the divergence and curl by first-order Gaussian derivative filters with a filter size. We use a Morlet wavelet with logarithmic scales (, ) based on [29] in all experiments. We limit the range of corresponding to a minimum of four repetitions in the video. Before applying the wavelet transform, we mean filter and linearly detrend the input signals. The mean filter uses a window size of time steps in all experiments.
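The preprocessing applied before the wavelet transform can be sketched as a moving-average filter followed by removal of a least-squares line. The default window of 5 timesteps and the edge padding are placeholders chosen for this sketch, not the values used in the experiments.

```python
import numpy as np


def preprocess(signal, window=5):
    """Mean filter with an odd window, then subtract the best-fit line."""
    x = np.asarray(signal, dtype=float)
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")  # repeat border values
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    t = np.arange(len(smoothed))
    slope, intercept = np.polyfit(t, smoothed, 1)  # least-squares linear trend
    return smoothed - (slope * t + intercept)
```

The output has the same length as the input for odd window sizes and has zero mean, since the fitted line includes an intercept.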
Baselines. We choose the method of Pogalin et al. [19] to represent the class of Fourier-based methods for repetition estimation. Our reimplementation uses a more recent object tracker [11] but is identical otherwise. The tracker is initialized by manually drawing a box on the first frame. Converting the frequency to a count is trivial using the video length and frame rate. Additionally, we compare with the deep-learning method of Levy & Wolf [15] using their publicly available code and pretrained model without any modifications.
6 Experiments
6.1 Fourier versus Wavelets
Setup. We first compare the Fourier-based periodogram with a wavelet-based time-frequency representation for counting the number of repetitions in each signal. To assess this, we generate idealized signals by plotting sinusoids through the individual cycle bound annotations for every video in our QUVA Repetition dataset. From the periodogram we detect the maximum peak and convert its corresponding frequency to a count using the video’s duration.
Results. From the results in Figure 6 it is clear that wavelet-based counting outperforms the periodogram on idealized signals. We also add a significant amount of Gaussian noise () to the signals which has a minor negative effect on both methods (data not shown). We observe that increased cycle length variation negatively affects Fourier-based counting. This is expected as it globally measures frequency and is unable to deal with non-stationarity. As wavelets naturally handle non-stationary repetition they are less sensitive to cycle length variability.
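The periodogram baseline of this experiment can be sketched as below: take the strongest non-zero frequency and multiply it by the signal duration. The function name and the synthetic signals in the usage note are illustrative assumptions.

```python
import numpy as np


def periodogram_count(x, dt):
    """Count repetitions from the highest periodogram peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=dt)
    peak = freqs[1:][np.argmax(spectrum[1:])]  # skip the zero-frequency bin
    return float(peak * len(x) * dt)  # frequency times duration
```

On a stationary sinusoid with ten full cycles the estimate is ten. On a signal whose tempo doubles halfway (5 cycles followed by 10 cycles), the single global peak can reflect only one of the two tempi, so the estimate cannot match the true count of 15.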
6.2 Value of Diverse Signals
Setup. As wavelets prove to be effective for the counting task, we now assess the value of a diverse signal representation. The set of six signals that we verify comprises: . These are measured over the foreground segmentation and evaluated for individual performance. Again, we test repetition counting on our QUVA Repetition dataset. To obtain a lower bound on the error, we select the best signal per video in an oracle fashion.
Results. The results in Table 2 reveal that for the wide variability of repetitive appearance there is no one-size-fits-all solution. The individual signals are unable to handle the full variety of repetitive appearances by themselves, but their joint diversity results in a good lower bound. The vertical flow is best overall and selected more often than the others by the oracle. We explain this bias towards vertical flow by the observation that our dataset contains many sports videos in which gravity is often used as the opposing force. Repeating this experiment on the YTSegments dataset with oracle signal selection achieves an MAE of .
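The oracle selection used for the lower bound can be sketched as follows: for each video, pick the signal whose count is closest to the ground truth. The dictionary layout, the signal names in the usage note, and the normalization of the error by the true count are assumptions for illustration.

```python
from collections import Counter


def oracle_selection(true_counts, predictions):
    """Lower-bound error when the best signal is chosen per video.

    predictions maps a signal name to its list of predicted counts,
    one entry per video, in the same order as true_counts.
    """
    chosen, errors = Counter(), []
    for i, c in enumerate(true_counts):
        best = min(predictions, key=lambda name: abs(predictions[name][i] - c))
        chosen[best] += 1
        errors.append(abs(predictions[best][i] - c) / c)
    return sum(errors) / len(errors), chosen
```

With `true_counts = [10, 8]` and `predictions = {"Fy": [10, 5], "div": [7, 8]}`, the oracle picks `Fy` for the first video and `div` for the second, giving zero error even though neither signal is correct on both videos.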
MAE  OBOA  # Selected  

Oracle Best 
6.3 Video Acceleration Sensitivity
Setup. In this experiment we examine our method’s sensitivity to acceleration by artificially speeding up videos. Starting from the YTSegments dataset, we induce significant non-stationarity by artificially accelerating the videos halfway. Specifically, we modify the videos such that after the midpoint frame, the speed is increased by dropping every second frame. What follows are videos that play at twice the original speed from the midpoint onward. We compare against [15], which handles non-stationarity by predicting the period of motion in sliding-window fashion over the video. This experiment omits Fourier-based analysis, as by its nature it will inevitably fail on this task.
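The acceleration described above amounts to the following frame selection; the function name is hypothetical and frames are represented by any indexable sequence.

```python
def accelerate_second_half(frames):
    """Keep the first half intact and drop every second frame afterwards."""
    mid = len(frames) // 2
    return list(frames[:mid]) + list(frames[mid::2])
```

For a ten-frame clip with indices 0 to 9, the result keeps frames 0 to 4 and then frames 5, 7 and 9, so the number of repetitions is unchanged while the cycle length in the second half is halved.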
Results. Figure 7 presents the MAE in both the original and the accelerated setting. On their own dataset, the system of Levy & Wolf [15] excels. Acceleration changes the picture: our method deteriorates less and obtains a lower MAE on the accelerated videos, which reveals the sensitivity of [15] to acceleration.
6.4 Comparison with the State-of-the-Art
Setup. We carry out a full count comparison with the methods of Pogalin et al. [19] and Levy & Wolf [15] on both datasets. Our method uses fixed parameters in all cases and utilizes the min-cost signal selection algorithm to pick the most discriminative signal.
Results. The outcome of the final experiment is presented in Table 3. For the YTSegments dataset, the method of [15] performs best with an MAE of , where our method scores , better than the Fourier-based approach of [19]. The results change when considering the more realistic and challenging QUVA Repetition dataset. The method of [15] performs the worst, with an MAE of , which we attribute to the fact that their network only considers four motion types during training. The Fourier-based method of [19] scores an MAE of , whereas we obtain an error of . Overall our method is better able to handle the non-static and non-stationary video characteristics in our QUVA Repetition dataset while still performing reasonably well on the videos from YTSegments. We highlight three examples of our method in Figure 8.
7 Conclusion
We have categorized 3D intrinsic periodic motion as translation, rotation or expansion depending on the divergence and curl of the flow field. Analysis of the time-varying flow gradient distinguishes three motion continuities: constant, intermittent or oscillatory. For the 2D perception of 3D periodicity, two viewpoint extremes are considered. What follows is the categorization of fundamental cases of recurrent perception derived from the differential operators acting on the flow field. The use of the differentials extends beyond theory, as our experiments demonstrate that measuring flow-based signals over the motion foreground segmentation is effective for recurrence estimation in realistic video. We show that our method improves the state-of-the-art and effectively handles complex appearances, camera motion and non-stationarity on a realistic video dataset.
References
[1] A. B. Albu, R. Bergevin, and S. Quirion. Generic temporal segmentation of cyclic human motion. PR, 41(1):6–21, 2008.
[2] O. Azy and N. Ahuja. Segmentation of periodically moving objects. In ICPR, 2008.
[3] S. Belongie and J. Wills. Structure from periodic motion. In Spatial Coherence for Visual Motion Analysis, pages 16–24. Springer Berlin Heidelberg, 2006.
[4] A. Briassouli and N. Ahuja. Extraction and analysis of multiple periodic motions in video sequences. TPAMI, 29(7):1244–1261, 2007.
[5] G. J. Burghouts and J.-M. Geusebroek. Quasi-periodic spatiotemporal filtering. TIP, 15(6):1572–1582, 2006.
[6] D. Chetverikov and S. Fazekas. On motion periodicity of dynamic textures. In BMVC, 2006.
[7] R. Cutler and L. S. Davis. Robust real-time periodic motion detection, analysis, and applications. TPAMI, 22(8):781–796, 2000.
[8] J. Davis, A. Bobick, and W. Richards. Categorical representation and recognition of oscillatory motion patterns. In CVPR, 2000.
[9] R. Goldenberg, R. Kimmel, E. Rivlin, and M. Rudzsky. Behavior classification by eigendecomposition of periodic motions. PR, 38(7):1033–1043, 2005.
[10] A. Grossmann and J. Morlet. Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis, 15(4):723–736, 1984.
[11] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
[12] S. Huang, X. Ying, J. Rong, Z. Shang, and H. Zha. Camera calibration from periodic motion of a pedestrian. In CVPR, 2016.
[13] G. Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, pages 201–211, 1973.
[14] I. Laptev, S. J. Belongie, P. Perez, and J. Wills. Periodic motion detection and segmentation via approximate sequence alignment. In ICCV, 2005.
[15] O. Levy and L. Wolf. Live Repetition Counting. In CVPR, 2015.
[16] F. Liu and R. W. Picard. Finding periodicity in space and time. In ICCV, 1998.
[17] C. Lu and N. J. Ferrier. Repetitive motion analysis: Segmentation and event classification. TPAMI, 26(2):258–263, 2004.
[18] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In ICCV, 2013.
[19] E. Pogalin, A. W. M. Smeulders, and A. H. Thean. Visual quasi-periodicity. In CVPR, 2008.
[20] R. Polana and R. C. Nelson. Detection and recognition of periodic, nonrigid motion. IJCV, 23(3):261–282, 1997.
[21] Y. Ran, I. Weiss, Q. Zheng, and L. S. Davis. Pedestrian detection via periodic motion analysis. IJCV, 71(2):143–160, 2007.
[22] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
[23] O. Rioul and M. Vetterli. Wavelets and signal processing. Signal Processing Magazine, 8(4):14–38, 1991.
[24] B. Sarel and M. Irani. Separating transparent layers of repetitive dynamic behaviors. In ICCV, 2005.
[25] H. M. Schey. Div, grad, curl, and all that: an informal text on vector calculus. WW Norton, 2005.
[26] S. M. Seitz and C. R. Dyer. View-invariant analysis of cyclic motion. IJCV, 25(3):231–251, 1997.
[27] M. Spivak. Comprehensive Introduction to Differential Geometry. Publish or Perish, Inc., University of Tokyo Press, 1981.
[28] A. Thangali and S. Sclaroff. Periodic motion detection and estimation via space-time sampling. In WACV, 2005.
[29] C. Torrence and G. P. Compo. A practical guide to wavelet analysis. Bulletin of the American Meteorological Society, 79(1):61–78, 1998.
[30] P.-S. Tsai, M. Shah, K. Keiter, and T. Kasparis. Cyclic motion detection for motion based recognition. PR, 27(12):1591–1603, 1994.