Videos of natural scenes contain a vast variety of motion patterns. We divide these motion patterns into four categories, arranged in a table, based on their complexities measured by two criteria: i) sketchability (Guo et al (2007)), i.e. whether a local patch can be represented explicitly by an image primitive from a sparse coding dictionary, and ii) intrackability (or trackability) (Gong and Zhu (2012)), which measures the uncertainty of tracking an image patch using the entropy of the posterior probability on velocities. Fig.1 shows some examples of the different video patches in the four categories. Category A consists of the simplest vision phenomena, i.e. sketchable and trackable motions, such as trackable corners, lines, and feature points, whose positions and shapes can be tracked between frames. For example, patches (a), (b), (c) and (d) belong to category A. Category D is the most complex and is called textured motion or dynamic texture in the literature, such as water, fire or grass, in which the images have no distinct primitives or trackable motion, such as patches (h) and (i). The other categories are in between. Category B refers to sketchable but intrackable patches, which can be described by distinct image primitives but can hardly be tracked between frames due to fast motion, for example the patches (e) and (f) at the legs of the galloping horse. Finally, category C includes the trackable but non-sketchable patches, which are cluttered features or moving kernels, e.g. patch (g).
In the vision literature, as pointed out by Shi and Zhu (2007), there are two families of representations, which code images or videos by explicit and implicit functions respectively.
1. Explicit representations with generative models. (Olshausen (2003); Kim et al (2010)) learned an over-complete set of coding elements from natural video sequences using the sparse coding model (Olshausen and Field (1996)). (Elder and Zucker (1998)) and (Guo et al (2007)) represented the image/video patches by fitting functions with explicit geometric and photometric parameters. (Wang and Zhu (2004)) synthesized complex motion, such as birds, snowflakes, and waves, with a large number of particles and wave components. (Black and Fleet (2000)) represented two types of motion primitives, namely smooth motion and motion boundaries, for motion segmentation. In higher-level object motion tracking, people represented different tracking units depending on the underlying objects and scales, such as sparse or dense feature point tracking (Serby et al (2004); Black and Fleet (2000)), kernel tracking (Comaniciu et al (2003); Fan et al (2006)), contour tracking (Maccormick and Blake (2000)), and middle-level pairwise-component generation (Yuan et al (2010)).
2. Implicit representations with descriptive models. For textured motions or dynamic textures, people used numerous Markov models which are constrained to reproduce some statistics extracted from the input video. For example, dynamic textures (Szummer and Picard (1996); Campbell et al (2002)) were modeled by a spatio-temporal auto-regressive (STAR) model, in which the intensity of each pixel was represented by a linear summation of the intensities of its spatial and temporal neighbors. (Bouthemy et al (2006)) proposed mixed-state auto-models for motion textures by generalizing the auto-models in (Besag (1974)). (Doretto et al (2003)) derived an auto-regression moving-average model for dynamic texture. (Chan and Vasconcelos (2008)) and (Ravichandran et al (2009)) extended it to a stable linear dynamical system (LDS) model.
Recently, to represent complex motion, such as human activities, researchers have used Histogram of Oriented Gradients (HOG) (Dalal and Triggs (2005)) for appearance and Histogram of Oriented Optical-Flow (HOOF) (Dalal et al (2006); Chaudhry et al (2009)) for motion. The HOG and HOOF record the rough geometric information through the grids and pool the statistics (histograms) within the local cells. Such features are used for recognition in discriminative tasks, such as action classification, and are not suitable for video coding and reconstruction.
In the literature, these video representations are often manually selected for specific videos in different tasks. What is lacking is a generic representation and a criterion that can automatically select the proper models for different patterns of the video. Furthermore, as demonstrated in (Gong and Zhu (2012)), both sketchability and trackability change over scales, densities, and stochasticity of the dynamics, so a good video representation must adapt itself continuously in a long video sequence.
1.2 Overview and contributions
Motivated by the above observations, we study a unified middle-level representation, called video primal sketch (VPS), by integrating the two families of representations. Our work is inspired by Marr’s conjecture for a generic “token” representation called primal sketch as the output of early vision (Marr (1982)), and is aimed at extending the primal sketch model proposed by (Guo et al (2007)) from images to videos. Our goal is not only to provide a parsimonious model for video compression and coding, but more importantly, to support and be compatible with high-level tasks such as motion tracking and action recognition.
Fig.2 overviews an example of the video primal sketch. Fig.2.(a) is an input video frame which is separated into sketchable and non-sketchable regions by the sketchability map in (b), and trackable primitives and intrackable regions by the trackability map in (c). The sketchable or trackable regions are explicitly represented by a sparse coding model and reconstructed in (d) with motion primitives, and each non-sketchable and intrackable region has a textured motion which is synthesized in (e) by a generalized FRAME (Zhu et al (1998)) model (implicit and descriptive). The synthesis of this frame is shown in (f) which integrates the results from (d) and (e) seamlessly.
As Table 1 shows, the explicit representations include parameters for the positions, types, motion velocities, etc of the video primitives and the implicit representations have parameters for the histograms of a set of filter responses on dynamic textures. This table shows the efficiency of the VPS model.
This paper makes the following contributions to the literature.
We present and compare two different but related models to define textured motions. The first one is a spatio-temporal FRAME (ST-FRAME) model, which is a non-parametric Markov random field and generalizes the FRAME model (Zhu et al (1998)) of texture with spatio-temporal filters. The ST-FRAME model is learned so that it has marginal probabilities that match the histograms of the responses from the spatio-temporal filters on the input video. The second one is a motion-appearance FRAME model (MA-FRAME), which not only matches the histograms of some spatio-temporal filter responses, but also matches the histograms of velocities pooled over a local region. The MA-FRAME model achieves better results in video synthesis than the ST-FRAME model, and it is, to some extent, similar to the HOOF features used in action classification (Dalal et al (2006); Chaudhry et al (2009)).
We learn a dictionary of motion primitives from input videos using a generative sparse coding model. These primitives are used to reconstruct the explicit regions and include two types: i) generic primitives for the sketchable patches, such as corners, bars etc; and ii) specific primitives for the non-sketchable but trackable patches which are usually texture patches similar to those used in kernel tracking (Comaniciu et al (2003)).
The models for implicit and explicit regions are integrated in a hybrid representation – the video primal sketch (VPS), as a generic middle-level representation of video. We will also show how VPS changes over information scales affected by distance, density and dynamics.
We show the connections between this middle-level VPS representation and features for high-level vision tasks such as action recognition.
|Video Resolution|288 × 352 pixels|
|Explicit Region|31,644 pixels (≈ 30%)|
|Primitive Width|11 pixels|
|Explicit Parameters|3,600 (≈ 3.6%)|
Our work is inspired by the empirical study of Gong and Zhu (2012), which revealed the statistical properties of videos over scale transitions and defined intrackability as the entropy of local velocities. When the entropy is high, the patch cannot be tracked locally and thus its motion is represented by a velocity histogram. Gong and Zhu (2012) did not give a unified model for video representation and synthesis, which is the focus of the current paper.
This paper extends a previous conference paper (Han et al (2011)) in the following aspects:
We propose a new dynamic texture model, MA-FRAME, for better representing velocity information. Benefiting from the new temporal feature, the VPS model can be applied to high-level action representation tasks more directly.
We conduct a series of perceptual experiments to verify the high quality of video synthesis from the aspect of human perception.
The remainder of this paper is organized as follows. In Section 2, we present the framework of video primal sketch. In Section 3, we explain the algorithms for explicit representation, textured motion synthesis and video synthesis, and show a series of experiments. The paper is concluded with a discussion in Section 4.
2 Video primal sketch model
In his monumental book (Marr (1982)), Marr conjectured a primal sketch as the output of early vision that transfers the continuous “analog” signals in pixels to a discrete “token” representation. The latter should be parsimonious and sufficient to reconstruct the observed image without much perceivable distortion. A mathematical model was later studied by Guo et al (2007), which successfully modeled hundreds of images by integrating sketchable structures and non-sketchable textures. In this section, we extend it to the video primal sketch as a hybrid generic video representation.
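Let $\mathbf{I}$ be a video defined on a 3D lattice $\Lambda \subset \mathbb{Z}^3$. $\Lambda$ is divided disjointly into explicit and implicit regions,
$$\Lambda = \Lambda_{ex} \cup \Lambda_{im}, \quad \Lambda_{ex} \cap \Lambda_{im} = \emptyset. \tag{1}$$
Then the video $\mathbf{I}$ is decomposed into two components,
$$\mathbf{I}_{\Lambda} = (\mathbf{I}_{\Lambda_{ex}}, \mathbf{I}_{\Lambda_{im}}). \tag{2}$$
$\mathbf{I}_{\Lambda_{ex}}$ is defined by explicit functions $\mathbf{I} = g(w)$, in which each instance corresponds to a different functional form of $g(\cdot)$ and is indexed by a particular value of the parameter $w$. $\mathbf{I}_{\Lambda_{im}}$ is defined by implicit functions $H(\mathbf{I}) = h$, in which $H(\cdot)$ extracts the statistics of filter responses from the image $\mathbf{I}$ and $h$ is a specific value of the histograms.
In the following, we first present the two families of models for $\mathbf{I}_{\Lambda_{ex}}$ and $\mathbf{I}_{\Lambda_{im}}$ respectively, and then integrate them in the VPS model.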
2.1 Explicit representation by sparse coding
The explicit region $\Lambda_{ex}$ of a video $\mathbf{I}$ is decomposed into $n_{ex}$ disjoint domains (usually $n_{ex}$ is in the order of ),
$$\Lambda_{ex} = \bigcup_{i=1}^{n_{ex}} \Lambda_{ex,i}. \tag{3}$$
Here $\Lambda_{ex,i}$ defines the domain of a “brick”. A brick, denoted by $\mathbf{I}_{\Lambda_{ex,i}}$, is a spatio-temporal volume like a patch in images. These bricks are divided into the three categories A, B and C as mentioned in section 1.
The size of $\Lambda_{ex,i}$ influences the results of tracking and synthesis to some degree. The spatial size should depend on the scale of structures or the granularity of textures, and the temporal size should depend on the motion amplitude and frequency in the time dimension, which are hard to estimate in real applications. However, a general size works well in most cases, say 11 × 11 pixels × 3 frames for trackable bricks (sketchable or non-sketchable), or 11 × 11 pixels × 1 frame for sketchable but intrackable bricks. Therefore, in all the experiments of this paper, the size of $\Lambda_{ex,i}$ is chosen as such.
Fig.3 shows one example comparing the sketchable and trackable regions based on the sketchability and trackability maps shown in Fig.2(b) and (c) respectively. It is worth noting that the two regions largely overlap, with only a small percentage of the regions being sketchable only or trackable only.
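Each brick $\mathbf{I}_{\Lambda_{ex,i}}$ can be represented by a primitive $B_i \in \Delta_B$ through an explicit function,
$$\mathbf{I}(x,y,t) = \alpha_i B_i(x,y,t) + \epsilon, \quad \forall (x,y,t) \in \Lambda_{ex,i}. \tag{4}$$
$B_i$ means the $i$th primitive from the primitive dictionary $\Delta_B$, which fits the brick $\mathbf{I}_{\Lambda_{ex,i}}$ best. Here $i$ indexes the parameters such as type, position, orientation and scale of $B_i$. $\alpha_i$ is the corresponding coefficient. $\epsilon$ represents the residue, which is assumed to be i.i.d. Gaussian. For a trackable primitive, $B_i$ includes 3 frames and thus encodes the velocity in the 3 frames. For a sketchable but intrackable primitive, $B_i$ has only 1 frame.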
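The single-primitive fit described above can be sketched as follows. This is a minimal illustration, assuming bricks and primitives are flattened into equal-length lists and that the coefficient is chosen by least squares (which is what the i.i.d. Gaussian residue implies); the function names and the toy dictionary are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def fit_coefficient(brick: Sequence[float], primitive: Sequence[float]) -> float:
    """Least-squares coefficient minimizing ||brick - alpha * primitive||^2."""
    denom = sum(b * b for b in primitive)
    if denom == 0.0:
        return 0.0
    return sum(x * b for x, b in zip(brick, primitive)) / denom


def best_primitive(
    brick: Sequence[float], dictionary: List[Sequence[float]]
) -> Optional[Tuple[int, float, float]]:
    """Return (index, alpha, residual energy) of the single best-fitting primitive."""
    best = None
    for k, prim in enumerate(dictionary):
        alpha = fit_coefficient(brick, prim)
        residual = sum((x - alpha * b) ** 2 for x, b in zip(brick, prim))
        if best is None or residual < best[2]:
            best = (k, alpha, residual)
    return best


# Toy example: a flattened 2x2 "brick" and two hypothetical primitives.
dictionary = [
    [1.0, 1.0, -1.0, -1.0],  # horizontal edge
    [1.0, -1.0, 1.0, -1.0],  # vertical edge
]
brick = [2.1, 1.9, -2.0, -2.0]
index, alpha, residual = best_primitive(brick, dictionary)
```

Only one primitive is kept per brick, in contrast to an additive sparse code that would sum several weighted bases.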
As Fig. 4 illustrates, the dictionary is composed of two categories:
Common primitives. These are primitives shared by most videos, such as blobs, edges and ridges. They have explicit parameters for orientations and scales. They mostly belong to the sketchable region, as shown in Fig. 3.
Special primitives. These bricks do not have a common appearance and are limited to specific video frames. They are non-sketchable but trackable, and are recorded to code the specific video region. They mostly belong to the part of the trackable region that is not included in the sketchable region, as shown in Fig. 3.
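Equation (4) uses only one base function and thus is different from the conventional linear additive model. Following the Gaussian assumption for the residues, we have the following probabilistic model for the explicit region,
$$p(\mathbf{I}_{\Lambda_{ex}}; \mathbf{B}, \alpha) = \prod_{i=1}^{n_{ex}} \frac{1}{(2\pi)^{n/2}\sigma_i^{n}} \exp\{-E_i\}, \quad E_i = \sum_{(x,y,t)\in\Lambda_{ex,i}} \frac{(\mathbf{I}(x,y,t) - \alpha_i B_i(x,y,t))^2}{2\sigma_i^2}, \tag{5}$$
where $\mathbf{B} = (B_1, \dots, B_{n_{ex}})$ represents the selected primitive set, $n$ is the size of each primitive, $n_{ex}$ is the number of selected primitives and $\sigma_i$ is the estimated standard deviation of representing natural videos by $B_i$.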
2.2 Implicit representations by FRAME models
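The implicit region $\Lambda_{im}$ of video $\mathbf{I}$ is segmented into $n_{im}$ (usually $n_{im}$ is no more than ) disjoint homogeneous textured motion regions,
$$\Lambda_{im} = \bigcup_{j=1}^{n_{im}} \Lambda_{im,j}. \tag{6}$$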
One effective approach for texture modeling is to pool the histograms for a set of filters (Gabor, DoG and DooG) on the input image (Bergen and Adelson (1991); Chubb and Landy (1991); Heeger and Bergen (1995); Zhu et al (1998); Portilla and Simoncelli (2000)). Since Gabor filters model the response functions of the neurons in the primary visual cortex, two texture images with the same histograms of filter responses generate the same texture impression, and thus are considered perceptually equivalent (Silverman et al (1989)). The FRAME model proposed in (Zhu et al (1998)) generates the expected marginal statistics to match the observed histograms through the maximum entropy principle. As a result, any images drawn from this model will have the same filtered histograms and thus can be used for synthesis or reconstruction.
We extend this concept to video by adding temporal constraints and define each homogeneous textured motion region $\mathbf{I}_{\Lambda_{im,j}}$ by an equivalence class of videos,
$$\Omega_K(\mathbf{h}) = \{\mathbf{I}_{\Lambda_{im,j}} : H_k(\mathbf{I}_{\Lambda_{im,j}}) = h_k,\ k = 1, 2, \dots, K\}, \tag{7}$$
where $\mathbf{h} = (h_1, \dots, h_K)$ is a series of 1D histograms of filtered responses that characterize the macroscopic properties of the textured motion pattern. Thus we only need to code the histograms $\mathbf{h}$ and synthesize the textured motion region by sampling from the set $\Omega_K(\mathbf{h})$. As $\mathbf{I}_{\Lambda_{im,j}}$ is defined by the implicit functions, we call it an implicit representation. These regions are coded up to an equivalence class, in contrast to reconstructing the pixel intensities in the explicit representation.
To capture temporal constraints, one straightforward method is to choose a set of spatio-temporal filters and calculate the histograms of the filter responses. This leads to the spatio-temporal FRAME (ST-FRAME) model which will be introduced in section 2.3. Another method is to compute the statistics of velocity. Since the motion in these regions is intrackable, at each point of the image, its velocity is ambiguous (large entropy). We pool the histograms of velocities locally in a way similar to the HOOF (Histogram of Oriented Optical-Flow)(Dalal et al (2006); Chaudhry et al (2009)) features in action classification. This leads to the motion-appearance FRAME (MA-FRAME) model which uses histograms of both appearance (static filters) and velocities. We will elaborate on this model in section 2.4.
2.3 Implicit representation by spatio-temporal FRAME
ST-FRAME is an extension of the FRAME model (Zhu et al (1998)) by adopting spatio-temporal filters.
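A set of filters $\mathbf{F}$ is selected from a filter bank $\Delta_F$. Fig.5 illustrates the three types of filters in $\Delta_F$: i) the static filters for texture appearance in a single image; ii) the motion filters with certain velocities; and iii) the flicker filters that have zero velocity but opposite signs between adjacent frames. For each filter $F_k \in \mathbf{F}$, the spatio-temporal filter response of $\mathbf{I}$ at $(x,y,t) \in \Lambda_{im,j}$ is $F_k * \mathbf{I}(x,y,t)$. The convolution is over the spatial and temporal domain. By pooling the filter responses over all $(x,y,t) \in \Lambda_{im,j}$, we obtain a number of 1D histograms
$$H_k(z; \mathbf{I}_{\Lambda_{im,j}}) = \sum_{(x,y,t)\in\Lambda_{im,j}} \delta(z; F_k * \mathbf{I}(x,y,t)), \quad k = 1, \dots, K, \tag{8}$$
where $z$ indexes the histogram bins, and $\delta(z; x) = 1$ if $x$ belongs to bin $z$, and $\delta(z; x) = 0$ otherwise. Following the FRAME model, the statistical model of textured motion is written in the form of the following Gibbs distribution,
$$p(\mathbf{I}_{\Lambda_{im,j}}; \mathbf{F}, \beta) \propto \exp\Big\{-\sum_{k=1}^{K} \langle \beta_k, H_k(\mathbf{I}_{\Lambda_{im,j}}) \rangle\Big\}, \tag{9}$$
where $\beta_k$ are potential functions.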
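To make the pooling step and the Gibbs energy concrete, the sketch below bins a list of filter responses into a histogram and evaluates the inner product between potentials and histograms. It is an illustration only: the equal-width bins, the clamping of out-of-range responses, and the normalization of the histogram to sum to one are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def pool_histogram(
    responses: Sequence[float], lo: float, hi: float, bins: int
) -> List[float]:
    """Normalized histogram of filter responses over equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for r in responses:
        z = int((r - lo) / width)
        z = min(max(z, 0), bins - 1)  # clamp out-of-range responses to the end bins
        counts[z] += 1
    total = max(len(responses), 1)  # avoid division by zero for an empty region
    return [c / total for c in counts]


def gibbs_energy(
    betas: Sequence[Sequence[float]], histograms: Sequence[Sequence[float]]
) -> float:
    """Sum over filters of <beta_k, H_k>; the unnormalized probability is exp(-energy)."""
    return sum(
        b * h
        for beta_k, h_k in zip(betas, histograms)
        for b, h in zip(beta_k, h_k)
    )


# Toy example: five responses of one filter, four bins on [-1, 1].
h = pool_histogram([-0.9, -0.1, 0.1, 0.2, 0.8], lo=-1.0, hi=1.0, bins=4)
energy = gibbs_energy([[1.0, 0.0, -1.0, 0.0]], [h])
```

A positive potential on a bin lowers the probability of videos that place many responses in that bin, and a negative potential raises it.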
According to the theorem of ensemble equivalence (Wu et al (2000)), the Gibbs distribution converges to the uniform distribution over the set $\Omega_K(\mathbf{h})$ in (7), when $\Lambda_{im,j}$ is large enough. For any fixed local brick $\Lambda_0 \subset \Lambda_{im,j}$, the distribution of $\mathbf{I}_{\Lambda_0}$ follows the Markov random field model (9). The model can describe textured motion located in an irregularly shaped region $\Lambda_{im,j}$.
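The filters in $\mathbf{F}$ are pursued one by one from the filter bank $\Delta_F$ so that the information gain is maximized at each step,
$$F_k^{*} = \arg\max_{F_k \in \Delta_F} \big\| H_k^{obs} - H_k^{syn} \big\|. \tag{10}$$
$H_k^{obs}$ and $H_k^{syn}$ are the response histograms of $F_k$ before and after synthesizing $\mathbf{I}_{\Lambda_{im,j}}$ by adding $F_k$ respectively. The larger the difference, the more important is the filter.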
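The greedy pursuit step can be sketched as picking, among the filters not yet selected, the one whose two response histograms differ the most. This is a minimal illustration: the L1 distance is an assumption (the text only speaks of "the difference"), and the filter names and histograms are made up.

```python
from typing import Dict, Optional, Sequence, Set


def histogram_distance(h_a: Sequence[float], h_b: Sequence[float]) -> float:
    """L1 distance between two histograms (an assumed choice of norm)."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b))


def pursue_next_filter(
    h_obs: Dict[str, Sequence[float]],
    h_syn: Dict[str, Sequence[float]],
    selected: Set[str],
) -> Optional[str]:
    """Return the unselected filter with the largest histogram difference."""
    candidates = [name for name in h_obs if name not in selected]
    if not candidates:
        return None
    return max(candidates, key=lambda n: histogram_distance(h_obs[n], h_syn[n]))


# Toy example with three hypothetical filters and two-bin histograms.
h_obs = {"static_log": [0.5, 0.5], "motion_gabor": [0.9, 0.1], "flicker": [0.6, 0.4]}
h_syn = {name: [0.5, 0.5] for name in h_obs}
first = pursue_next_filter(h_obs, h_syn, selected=set())
second = pursue_next_filter(h_obs, h_syn, selected={"motion_gabor"})
```

In a full pursuit loop the synthesized histograms would be recomputed after each filter is added, so the ranking changes from step to step.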
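Following the distribution form of (9), the probabilistic model of the implicit parts of $\mathbf{I}$ is defined as
$$p(\mathbf{I}_{\Lambda_{im}}; \mathbf{F}, \beta) = \prod_{j=1}^{n_{im}} p(\mathbf{I}_{\Lambda_{im,j}}; \mathbf{F}, \beta), \tag{11}$$
where $\mathbf{F}$ represents the selected spatio-temporal filter set.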
In the experiments described later, we demonstrate that this model can synthesize a range of dynamic textures by matching the histograms of filter responses. The synthesis is done through sampling the probability by Markov chain Monte Carlo.
2.4 Implicit representation by motion-appearance FRAME
Unlike ST-FRAME, in which temporal constraints are based on spatio-temporal filters, the MA-FRAME model uses the statistics of velocities in addition to the statistics of filter responses for appearance.
For the appearance constraints, the filter response histograms are obtained similarly as ST-FRAME in (10)
where the filter set includes static and flicker filters in .
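For the motion constraints, the velocity distribution of each local patch is estimated via the calculation of trackability (Gong and Zhu (2012)), in which each patch is compared with its spatial neighborhood in the adjacent frame and the probability of the local velocity $\mathbf{v} = (v_x, v_y)$ is computed as
$$p(\mathbf{v} \mid \mathbf{I}(x,y,t-1), \mathbf{I}(x,y,t)) \propto \exp\Big\{-\frac{\|\mathbf{I}_{\partial\Lambda}(x - v_x, y - v_y, t-1) - \mathbf{I}_{\partial\Lambda}(x,y,t)\|^2}{2\sigma^2}\Big\}.$$
Here, $\mathbf{I}_{\partial\Lambda}(x,y,t)$ denotes the local patch around $(x,y)$ in frame $t$, and $\sigma$ is the standard deviation of the differences between local patches from adjacent frames based on various velocities. The statistical information of velocities for a certain area of texture is approximated by averaging the velocity distribution over region $\Lambda_{im,j}$,
$$\bar{p}_j(\mathbf{v}) = \frac{1}{|\Lambda_{im,j}|} \sum_{(x,y,t)\in\Lambda_{im,j}} p(\mathbf{v} \mid \mathbf{I}(x,y,t-1), \mathbf{I}(x,y,t)).$$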
Let collect the filter responses and velocities histograms of the video. The statistical model of textured motion can be written in the form of the following joint Gibbs distribution,
Here, is the parameter of the model.
In summary, the probabilistic model for the implicit regions of is defined as
where represents the selected filter set.
In the experiment section, we show the effectiveness of the MA-FRAME model and its advantages over the ST-FRAME model.
2.5 Hybrid model for video representation
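The FRAME models for the implicit regions take the explicit regions as boundary conditions, $p(\mathbf{I}_{\Lambda_{im}} \mid \mathbf{I}_{\partial\Lambda_{im}}; \mathbf{F}, \beta)$. Here, $\mathbf{I}_{\partial\Lambda_{im}}$ represents the boundary condition of $\mathbf{I}_{\Lambda_{im}}$, which belongs to the reconstruction of $\mathbf{I}_{\Lambda_{ex}}$. It leads to seamless boundaries in the synthesis.
By integrating the explicit and implicit representations, the video primal sketch has the following probability model,
$$p(\mathbf{I} \mid \mathbf{B}, \mathbf{F}, \alpha, \beta) = \frac{1}{Z} \exp\Big\{-\sum_{i=1}^{n_{ex}} \sum_{(x,y,t)\in\Lambda_{ex,i}} \frac{(\mathbf{I}(x,y,t) - \alpha_i B_i(x,y,t))^2}{2\sigma_i^2} - \sum_{j=1}^{n_{im}} \sum_{k=1}^{K} \langle \beta_k, H_k(\mathbf{I}_{\Lambda_{im,j}} \mid \mathbf{I}_{\partial\Lambda_{im,j}}) \rangle\Big\}, \tag{18}$$
where $Z$ is the normalizing constant.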
We denote by the representation for the video , where includes the histograms described by and ; and includes all the primitives with parameters for their indexes, position, orientation and scales etc.
gives the prior probability of video representation by. , in which, is the number of primitives. , in which, is the energy term and for instance, to penalize the number of implicit regions. Thus, the best video representation is obtained by maximizing the posterior probability,
following the video primal sketch model in (18).
Table 1 shows an example of the VPS. For a video of the size of 288 × 352 pixels, about 30% of the pixels are represented explicitly by motion primitives. As each primitive needs 11 parameters (the side length of the patch according to the primitive learning process in section 3.2) to record the profile and 1 more to record the type, the number of total parameters for the explicit representation is 3,600. textured motion regions are represented implicitly by the histograms, which are described by , and filters respectively. As each histogram has 15 bins, the number of the parameters for the implicit representation is 420.
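The parameter counts quoted above can be checked with a few lines of arithmetic. The inputs are the figures from Table 1 and the paragraph above; the number of primitives (300) and the total number of histograms (28) are not stated in the text and are derived here by division, so they should be read as implied values rather than reported ones.

```python
# Figures quoted in Table 1 and the surrounding text.
width, height = 288, 352
explicit_pixels = 31_644
explicit_params = 3_600
params_per_primitive = 11 + 1  # 11 profile parameters + 1 type parameter
implicit_params = 420
bins_per_histogram = 15

total_pixels = width * height                                # 101,376 pixels
explicit_pixel_ratio = explicit_pixels / total_pixels        # roughly 0.31, "about 30%"
explicit_param_ratio = explicit_params / total_pixels        # roughly 0.036
n_primitives = explicit_params // params_per_primitive       # implied number of primitives
n_histograms = implicit_params // bins_per_histogram         # implied number of histograms
```

Under these figures the explicit and implicit parameters together amount to about 4% of the pixel count of a single frame.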
2.6 Sketchability and Trackability for Model Selection
The computation of the VPS involves the partition of the domain $\Lambda$ into the explicit regions $\Lambda_{ex}$ and implicit regions $\Lambda_{im}$. This is done through the sketchability and trackability maps. In this subsection, we overview the general ideas and refer to previous work on sketchability (Guo et al (2007)) and trackability (Gong and Zhu (2012)) for details.
Let us consider one local volume $\Lambda_0 \subset \Lambda$ of the video $\mathbf{I}$. In the video primal sketch model, $\mathbf{I}_{\Lambda_0}$ may be modeled either by the sparse coding model in (5) or by the FRAME model in (11). The choice is determined via the competition between the two models, i.e. comparing which model gives the shorter coding length (Shi and Zhu (2007)) for representation.
If is represented by the sparse coding model, the posterior probability is calculated by
where . The coding length is
Since is estimated via the given data temporarily in real application, holds by definition. As a result, the coding length is derived as,
If is described by the FRAME model, the posterior probability is calculated by
The coding length is estimated through a sequential reduction process. When , with no constraints, the FRAME model is a uniform distribution, and thus the coding length is where is the cardinality of the space of all videos in . Suppose the intensities of the video range from 0 to 255, then . By adding each constraint, the equivalence will shrink in size, and the ratio of the compression is approximately equal to the information gain in (10). Therefore we can calculate the coding length by
By comparing and , whoever has the shorter coding length will win the competition and be chosen for .
In practice, we use a faster estimation which utilizes the relationship between the coding length and the entropy of the local posterior probabilities.
Consider the entropy of ,
It measures the uncertainty of selecting a primitive in for representation. The sharper the distribution is, the lower the entropy will be, which gives smaller according to (21). Hence, reflects the magnitude of . Set an entropy threshold on , ideally, if and only if . Therefore, when , we consider is lower and is modeled by the sparse coding model, else it is modeled by the FRAME model.
It is clear that this entropy has the same form and meaning as sketchability (Guo et al (2007)) in appearance representation and trackability (Gong and Zhu (2012)) in motion representation. Therefore, sketchability and trackability can be used for model selection for each local volume. Fig.2 (b) and (c) show the sketchability and trackability maps calculated by the local entropy of posteriors. The two maps decide the partition of the video into the explicit and implicit regions. Within the explicit regions, they also decide whether a patch is trackable (using primitives with a size of 11 × 11 pixels × 3 frames) or intrackable (using primitives with 11 × 11 pixels × 1 frame).
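The entropy-based shortcut for model selection can be sketched as follows. This is only an illustration of the thresholding rule: the posterior over candidate primitives (or velocities) is assumed to be given as a list of probabilities, and the threshold value of 0.5 nats is a hypothetical choice, not a value from the paper.

```python
import math
from typing import Sequence


def posterior_entropy(posterior: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete posterior distribution."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)


def select_model(posterior: Sequence[float], threshold: float) -> str:
    """Low entropy: explicit sparse coding model; otherwise: implicit FRAME model."""
    if posterior_entropy(posterior) < threshold:
        return "sparse_coding"
    return "FRAME"


# A sharply peaked posterior (one primitive clearly wins) versus a flat one.
sharp = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
choice_sharp = select_model(sharp, threshold=0.5)
choice_flat = select_model(flat, threshold=0.5)
```

The same function applies to both maps: with a posterior over primitives it plays the role of sketchability, and with a posterior over velocities it plays the role of trackability.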
3 Algorithms and experiments
3.1 Spatio-temporal filters
In the vision literature, spatio-temporal filters have been widely used for motion information extraction (Adelson and Bergen (1985)), optical flow estimation (Heeger (1987)), multi-scale representation of temporal data (Lindeberg and Fagerström (1996)), pattern categorization (Wildes and Bergen (2000)), and dynamic texture recognition (Derpanis and Wildes (2010)). In the experiments, we choose spatio-temporal filters as shown in Fig.5. They include three types:
Static filters. Laplacian of Gaussian (LoG), Gabor, gradient, or intensity filter on a single frame. They capture statistics of spatial features.
Motion filters. Moving LoG, Gabor or intensity filters at different speeds and directions over three frames. Gabor motion filters move perpendicularly to their orientations.
Flicker filters. One static filter with opposite signs at two frames. They contrast the static filter responses between two consecutive frames and detect the change of dynamics.
3.2 Learning motion primitives and reconstructing explicit regions
After computing the sketchability and trackability maps of one frame, we extract the explicit regions in the video. By calculating the coefficients $\alpha_i$ of each part with the motion primitives from the primitive bank $\Delta_B$, all the $\alpha_i$ are ranked from high to low. Each time, we select the primitive with the highest coefficient to represent the corresponding domain and then do local suppression in its neighborhood to avoid excessive overlapping of the extracted domains. The algorithm is similar to matching pursuit (Mallat and Zhang (1993)) and the primitives are chosen one by one.
In our work, in order to alleviate computational complexity, the $\alpha_i$ are calculated by filter responses. The filters used here are 11 × 11 pixels and have orientations and 8 scales. The fitted filter $F$ gives a raw sketch of the trackable patch and extracts property information, such as type and orientation, for generating the primitive. If the fitted filter is a Gabor-like filter, the primitive $B$ is calculated by averaging the intensities of the patch along the orientation of $F$, while if the fitted filter is a LoG-like filter, $B$ is calculated by averaging the intensities circularly around its center. Then $B$ is added to the primitive set $\mathbf{B}$ with its motion velocities calculated from the trackability map. It is also added into $\Delta_B$ for the dictionary buildup. The size of each primitive is 11 × 11, the same as the size of the fitted filter. The velocity $(v_x, v_y)$ gives two parameters for recording motion information. In Fig.4, we show some examples of different types of primitives, such as blob, ridge and edge. Fig.6 shows some examples of reconstruction by motion primitives. In each group, the original local image, the fitted filter, the generated primitive and the motion velocity are given. In the frame, each patch is marked by a square with a short line representing its motion information.
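The ranking and local suppression step can be sketched as a greedy loop over candidate fits. This is a minimal illustration under stated assumptions: each candidate is a hypothetical `(coefficient, x, y)` triple, and suppression is implemented as a minimum Chebyshev distance between selected centers, which is one simple way to limit overlap and is not necessarily the rule used in the paper.

```python
from typing import List, Tuple

Candidate = Tuple[float, int, int]  # (coefficient, x, y)


def select_primitives(candidates: List[Candidate], min_distance: int) -> List[Candidate]:
    """Greedily keep the highest-coefficient candidates, suppressing close neighbors."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    selected: List[Candidate] = []
    for coeff, x, y in ranked:
        far_enough = all(
            max(abs(x - sx), abs(y - sy)) >= min_distance for _, sx, sy in selected
        )
        if far_enough:
            selected.append((coeff, x, y))
    return selected


# Toy example: two clusters of overlapping candidates on an image grid.
candidates = [(0.8, 12, 11), (0.9, 10, 10), (0.2, 31, 29), (0.7, 30, 30)]
selected = select_primitives(candidates, min_distance=11)
```

Setting `min_distance` below the primitive width would allow partial overlap between neighboring primitives, while setting it equal to the width forbids overlap along at least one axis.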
Through the matching pursuit process, the sketchable regions are reconstructed by a set of common primitives. Fig.7 shows an example of the sketchable region reconstruction by using a series of common primitives. By comparing the observed frame (a) and reconstructed frame (b), (c) shows the error of reconstruction. The more detailed quantitative assessment is given in section 3.7. It is evident that a rich dictionary of video primitives can lead to a satisfactory reconstruction of explicit regions of videos.
For non-sketchable but trackable regions, based on the trackability map, we get the motion trace of each local trackable patch. Because each patch cannot be represented by a shared primitive, we record the whole patch and motion information as a special primitive for video reconstruction. It is obvious that special primitives increase model complexity compared with common primitives. However, as stated in section 2.1, the percentage of special primitives for the explicit region reconstruction of one video is very small (around 2-3%), hence it will not affect the final storage space significantly.
3.3 Synthesizing textured motions by ST-FRAME
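Each local volume $\mathbf{I}_{\Lambda_0}$ of textured motion located at $\Lambda_0$ follows a Markov random field model conditioned on its local neighborhood $\partial\Lambda_0$ following (9),
$$p(\mathbf{I}_{\Lambda_0} \mid \mathbf{I}_{\partial\Lambda_0}; \mathbf{F}, \beta) \propto \exp\Big\{-\sum_{k=1}^{K} \langle \beta_k, H_k(\mathbf{I}_{\Lambda_0}) \rangle\Big\},$$
where the Lagrange parameters $\beta_k$ are the discrete form of the potential function, learned from input videos by maximum likelihood,
$$\beta^{*} = \arg\max_{\beta} \log p(\mathbf{I}^{obs}; \mathbf{F}, \beta).$$
But the closed form of $\beta$ is not available in general. So it can be solved iteratively by
$$\frac{d\beta_k}{dt} = E_{p(\mathbf{I}; \mathbf{F}, \beta)}[H_k(\mathbf{I})] - H_k^{obs},$$
where the expectation is approximated by the histogram $H_k^{syn}$ of the synthesized sample.
In order to draw a typical sample frame from $p(\mathbf{I}; \mathbf{F}, \beta)$, we use the Gibbs sampler, which simulates a Markov chain. Starting from any random image, e.g. white noise, it converges to a stationary process with distribution $p(\mathbf{I}; \mathbf{F}, \beta)$. Therefore, we get the final converged results dominated by $p(\mathbf{I}; \mathbf{F}, \beta)$, which characterizes the observed dynamic texture.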
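The iterative learning of the potentials can be illustrated on a deliberately tiny case. The sketch below assumes a single site whose value falls into one of three bins with probability proportional to exp(-beta_z), so the model histogram is available in closed form and no sampling is needed; in the actual model this expectation would be estimated from frames drawn by the Gibbs sampler. The step size and iteration count are arbitrary illustrative choices.

```python
import math
from typing import List, Sequence


def model_histogram(beta: Sequence[float]) -> List[float]:
    """Expected histogram under p(z) proportional to exp(-beta_z) (toy single-site case)."""
    weights = [math.exp(-b) for b in beta]
    total = sum(weights)
    return [w / total for w in weights]


def update_potentials(
    beta: Sequence[float], h_syn: Sequence[float], h_obs: Sequence[float], step: float
) -> List[float]:
    """One gradient step: raise beta where the model puts too much mass, lower it elsewhere."""
    return [b + step * (s - o) for b, s, o in zip(beta, h_syn, h_obs)]


h_obs = [0.2, 0.3, 0.5]   # observed histogram to be matched
beta = [0.0, 0.0, 0.0]    # start from the uniform model
for _ in range(500):
    beta = update_potentials(beta, model_histogram(beta), h_obs, step=1.0)
h_fit = model_histogram(beta)
```

After the loop, `h_fit` agrees with `h_obs` to within numerical precision, which is the histogram-matching condition that the synthesis procedure iterates toward.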
In summary, the process of textured motion synthesis is given by the following algorithm.
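Algorithm 1. Synthesis for Textured Motion by ST-FRAME
Input: video $\mathbf{I}^{obs}$.
Suppose we have the previously synthesized frames; our goal is to synthesize the next frame $\mathbf{I}^{syn}_{(m)}$.
Select a group of spatio-temporal filters $\mathbf{F}$ from a filter bank $\Delta_F$.
Compute the histograms $\{H_k^{obs}\}$ of $\mathbf{I}^{obs}$.
Initialize $\mathbf{I}^{syn}_{(m)}$ as a uniform white noise image.
Repeat
    Calculate $\{H_k^{syn}\}$ from $\mathbf{I}^{syn}$.
    Update $\beta$ and $p(\mathbf{I}; \mathbf{F}, \beta)$.
    Sample $\mathbf{I}^{syn}_{(m)}$ by the Gibbs sampler.
Until $H_k^{syn}$ matches $H_k^{obs}$ closely enough for $k = 1, 2, \dots, K$.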
Fig.8 shows an example of the synthesis process. (f) is one frame from the textured motion of ocean. Starting from a white noise frame in (a), (b) is synthesized with only 7 static filters. It shows high smoothness in the spatial domain, but lacks temporal continuity with previous frames. In contrast, in (c) the synthesis with only 9 motion filters has a macroscopic distribution similar to the observed frame, but appears quite grainy in its local spatial relationships. By using both static and motion filters, the synthesis in (d) performs well on both spatial and temporal relationships. Compared with (d), the synthesis with 2 extra flicker filters in (e) is smoother and more similar to the observed frame.
In Fig.9, we show four groups of textured motion (4 bits) synthesis by Algorithm 1: ocean (a), water wave (b), fire (c) and forest (d). In each group, as time passes, the synthesized frames become more and more different from the observed ones. This is caused by the stochasticity of textured motions. Although the synthesized and observed videos are quite different at the pixel level, the two sequences are perceived as nearly identical by humans after matching the histograms of a small set of filter responses. This conclusion is further supported by the perceptual studies in section 3.9. Fig.10 shows that as the synthesized frame changes from white noise (Fig.8(a)) to the final synthesized result (Fig.8(e)), the histograms of filter responses become matched with the observed ones.
Table 2 shows the comparison of compression ratios between ST-FRAME and the dynamic texture model (Doretto et al (2003)). ST-FRAME has a significantly better compression ratio than the dynamic texture model, because the dynamic texture model has to record PCA components as large as the image size.
3.4 Computing velocity statistics
One popular method for velocity estimation is optical flow. Based on the optical flow, HOOF features extract the motion statistics by calculating the distribution of velocities in each region. Optical flow is an effective method for estimating the motions at trackable areas, but does not work for the intrackable dynamic texture areas. The three basic assumptions for optical flow equations, i.e. brightness constancy between matched pixels in consecutive frames, smoothness among adjacent pixels and slow motion, are violated in these areas due to the stochastic nature of dynamic textures. Therefore, we go for a different velocity estimation method.
Considering one pixel at $(x, y)$ in frame $t$, we denote its neighborhood as $\mathbf{I}_{\partial\Lambda}(x,y,t)$. Comparing this patch with all the patches in the previous frame within a searching radius, each patch corresponding to one velocity $\mathbf{v} = (v_x, v_y)$, we obtain a distribution
$$p(\mathbf{v}) \propto \exp\Big\{-\frac{\|\mathbf{I}_{\partial\Lambda}(x - v_x, y - v_y, t-1) - \mathbf{I}_{\partial\Lambda}(x,y,t)\|^2}{2\sigma^2}\Big\}.$$
This distribution describes the probability of the origin of the patch, i.e. the location where the patch moves from. Equivalently, it reflects the average probability of the motions of the pixels in the patch. Therefore, by clustering all the pixels according to their velocity distributions, the center of each cluster approximately gives the velocity statistics of all the pixels in this cluster, which reflects the motion pattern of these clustered pixels. Fig.14 and Fig.15 show some examples of velocity statistics, in which brighter means higher probability and darker means lower probability. The meanings of these two figures are explained later.
Compared to HOOF, the estimated velocity distribution is more suitable for modeling textured motion. Firstly, the velocity distribution is estimated pixel-wise. Hence it can depict more non-smooth motions. Secondly, although it compares the intensity pattern around a point to nearby regions at a subsequent temporal instance, which seems to also take the brightness constancy assumption into account, the difference here is that it calculates the probability of motions rather than a single pixel correspondence. As a result, the constraint imposed by the assumption is weakened, and it has the ability to represent stochastic dynamics.
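The patch-comparison estimate of the velocity distribution can be sketched as below. This is an illustration under assumptions: frames are small lists of rows, the patch difference is a sum of squared differences turned into a probability through a Gaussian with a hand-picked `sigma`, and the caller is responsible for keeping all patches inside the frame (Python would otherwise wrap negative indices silently). All names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Frame = List[List[float]]


def patch(frame: Frame, cx: int, cy: int, r: int) -> List[float]:
    """Flattened (2r+1) x (2r+1) patch centered at column cx, row cy."""
    return [frame[y][x] for y in range(cy - r, cy + r + 1)
            for x in range(cx - r, cx + r + 1)]


def velocity_distribution(prev: Frame, curr: Frame, cx: int, cy: int,
                          r: int, search: int, sigma: float) -> Dict[Tuple[int, int], float]:
    """p(v) for v = (vx, vy): the patch at (cx, cy) came from (cx - vx, cy - vy)."""
    target = patch(curr, cx, cy, r)
    scores = {}
    for vy in range(-search, search + 1):
        for vx in range(-search, search + 1):
            source = patch(prev, cx - vx, cy - vy, r)
            ssd = sum((a - b) ** 2 for a, b in zip(target, source))
            scores[(vx, vy)] = math.exp(-ssd / (2.0 * sigma ** 2))
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}


def entropy(dist: Dict[Tuple[int, int], float]) -> float:
    """Entropy of the velocity distribution; high values mean intrackable."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def blank(n: int) -> Frame:
    return [[0.0] * n for _ in range(n)]


# A bright dot moving one pixel to the right, versus a featureless region.
prev_dot, curr_dot = blank(7), blank(7)
prev_dot[3][2] = 1.0
curr_dot[3][3] = 1.0
p_dot = velocity_distribution(prev_dot, curr_dot, 3, 3, r=1, search=1, sigma=0.5)
p_flat = velocity_distribution(blank(7), blank(7), 3, 3, r=1, search=1, sigma=0.5)
```

The dot yields a distribution peaked at velocity (1, 0) with low entropy, while the featureless region yields a uniform distribution with maximal entropy, which is the intrackable case.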
3.5 Synthesizing textured motions by MA-FRAME
In the MA-FRAME model, as in ST-FRAME, each local volume of textured motion follows a Markov random field model. The difference is that MA-FRAME extracts motion information via the distribution of velocities .
In experiments, we design an effective way to sample from the above model. For each pixel, we build a 2D distribution matrix, whose two dimensions are velocities and intensities respectively, to guide the sampling process. The sampling probability for every candidate (labeled by one velocity and one intensity) is obtained by integrating the motion score and the appearance score and multiplying by the smoothness weight,
The details are explained with the illustration by Fig.11 for the sampling method at one pixel. For each pixel of the current frame , we consider its every possible velocity within the range . Each velocity corresponds to a position in the previous frame . Under velocity , the perturbation range of yields the intensity candidates for which is a smaller interval than and thus saves computational complexity. In the shown example (Fig.11(a)), and the perturbed intensity range is . Therefore we have velocity candidates and intensity candidates for each velocity, hence the size of the sampling matrix is (Fig.11(b)). With the motion constraints given by matching the velocity statistics, the velocity candidates have their motion scores. With the appearance constraints given by matching the filter response histograms, intensity candidates have their appearance scores. By integrating the two sets of scores, we obtain a preliminary sampling matrix shown in Fig.11(b).
To keep the motion of each pixel as consistent as possible with that of its neighbors, so that the macroscopic motion is smooth, we apply a set of weights to the distribution matrix, in which the multiplier for the candidates of one velocity is calculated by
The weights favor velocity candidates that are closer to the velocities of the neighboring pixels. With these weights, the sampled velocities can be regarded as a “blurred” optical flow. The main difference is that they preserve the uncertainty of the dynamics in a textured motion rather than assigning a definite velocity to every pixel.
After multiplying the preliminary matrix by the weights, we obtain the final sampling matrix. Although the main purpose of MA-FRAME is to sample the intensity of each pixel of a textured motion, the intensities are closely tied to the velocities, so the sampling process is in fact based on the joint distribution of velocity and intensity.
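The construction of the sampling matrix at one pixel can be illustrated as follows. This is a hedged sketch: the text says the motion and appearance scores are "integrated" and the smoothness weight is multiplied in, and here it is assumed that all three are combined by multiplication and then normalised. The function names and argument layout are hypothetical.

```python
import random

def build_sampling_matrix(motion_scores, appearance_scores, smooth_weights):
    """Combine scores into a normalised joint table over (velocity, intensity).

    motion_scores[i]        : score of velocity candidate i
    appearance_scores[i][j] : score of intensity candidate j under velocity i
    smooth_weights[i]       : neighbourhood-consistency weight of velocity i
    Assumption: the three terms are combined by multiplication.
    """
    matrix = [[smooth_weights[i] * motion_scores[i] * a for a in row]
              for i, row in enumerate(appearance_scores)]
    total = sum(sum(row) for row in matrix)
    return [[p / total for p in row] for row in matrix]

def sample_candidate(matrix, rng=random):
    """Draw one (velocity index, intensity index) pair from the joint table."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    probs = [matrix[i][j] for i, j in cells]
    return rng.choices(cells, weights=probs, k=1)[0]

# Two velocity candidates with two intensity candidates each; the second
# velocity is ruled out by a zero smoothness weight.
matrix = build_sampling_matrix([1.0, 3.0], [[1.0, 1.0], [1.0, 1.0]], [1.0, 0.0])
velocity_index, intensity_index = sample_candidate(matrix, random.Random(1))
```

Because a cell of the table is drawn as a whole, the velocity and the intensity are sampled jointly, which matches the description of the final sampling matrix.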
In summary, textured motion synthesis by MA-FRAME is given as follows
Algorithm 2. Synthesis for Textured Motion by MA-FRAME
Input video .
Suppose we have , our goal is to synthesize the next frame .
Select a group of static and flicker filters from a filter bank , where is the number of selected filters.
Compute , of , where is the number of velocity clusters.
Initialize velocity vector uniformly, and initialize by choosing intensities based on .
Calculate and from .
Update and .
Sample by Gibbs sampler.
Until for .
Fig.12 and Fig.13 show two examples of textured motion synthesis by MA-FRAME. Unlike ST-FRAME, MA-FRAME can handle videos of larger size, higher intensity resolution (8 bits here, compared to 4 bits in the ST-FRAME experiments) and more frames, because of its smaller sample space and higher temporal continuity. Furthermore, it generates better motion pattern representations.
Fig.14 compares the velocity statistics of the original and synthesized videos for different textured motion clusters; brighter values indicate higher motion probability and darker values lower probability. The two sets of statistics are clearly consistent, which means that the original and synthesized videos have similar macroscopic motion properties.
We also test local motion consistency between the observed and synthesized videos by comparing the velocity distributions of every pair of corresponding pixels. Fig.15 shows the comparisons for ten pairs of randomly chosen pixels. Most of them match well, which demonstrates that the motion distributions of most local patches are also well preserved during the synthesis procedure.
3.6 Dealing with occlusion parts in texture synthesis
Before presenting the full computational algorithm for VPS, we first describe how to deal with occluded areas.
In video, dynamic background textures are often occluded by the movement of foreground objects. Synthesizing background texture by ST-FRAME uses histograms of spatio-temporal filter responses. When a textured region becomes occluded, the pattern no longer belongs to the same equivalence class. In this event, the spatio-temporal responses are not precise enough for matching the given histograms, and may cause a deviation in the synthesis results. These errors may accumulate over frames and the synthesis will ultimately degenerate completely. Synthesis by MA-FRAME has a greater problem because the intensities in the current frame are selected from small perturbations in intensities from the previous frame. If a pixel cannot be found from the neighborhood in the previous frame that belongs to the same texture class, the intensity it adopts may be incompatible with other pixels around it.
To solve this problem, occluded pixels are sampled separately by the original (spatial) FRAME model. This gives two classes of filter response histograms:
Static filter response histograms . Histograms are calculated by summarizing static filter responses of all the textural pixels;
Spatio-temporal filter response histograms . Histograms are calculated by summarizing spatio-temporal filter responses of all the non-occluded textured pixels.
Therefore, in the sampling process, occluded and non-occluded pixels are treated differently. First, their statistics are constrained by different sets of filters. Second, in MA-FRAME, the intensities of non-occluded pixels are sampled from perturbations of the intensities at their neighborhood locations in the previous frame, while the intensities of occluded pixels are sampled from the whole intensity space, e.g. 0-255 for 8-bit grey levels.
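The different candidate sets for occluded and non-occluded pixels can be sketched as below. This is an illustration under stated assumptions: the helper name `intensity_candidates`, the convention that velocity (vx, vy) points back to position (x - vx, y - vy) in the previous frame, and the merging of candidates from all velocities into one sorted list are choices made here for brevity, not details taken from the paper.

```python
def intensity_candidates(prev_frame, x, y, velocities, perturb, occluded, levels=256):
    """Candidate intensities for pixel (x, y) of the frame being synthesised."""
    if occluded:
        # Occluded pixels: search the whole intensity space (spatial FRAME case).
        return list(range(levels))
    candidates = set()
    for vx, vy in velocities:
        # Intensity at the source location in the previous frame.
        base = prev_frame[y - vy][x - vx]
        # Small perturbation range around that intensity, clipped to valid levels.
        for delta in range(-perturb, perturb + 1):
            value = base + delta
            if 0 <= value < levels:
                candidates.add(value)
    return sorted(candidates)

prev_frame = [[10, 20, 30],
              [40, 100, 60],
              [70, 80, 90]]
visible = intensity_candidates(prev_frame, 1, 1, [(0, 0), (1, 0)], 1, occluded=False)
hidden = intensity_candidates(prev_frame, 1, 1, [(0, 0), (1, 0)], 1, occluded=True)
```

In this toy case the non-occluded pixel has six candidate intensities while the occluded pixel has all 256, which shows why occluded pixels are the more expensive case.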
3.7 Synthesizing videos with VPS
In summary, the full version of the computational algorithm for video synthesis of VPS is presented as follows.
Algorithm 3. Video Synthesis via Video Primal Sketch
Input a video .
Compute sketchability and trackability for separating into explicit region and implicit region .
Reconstruct by the sparse coding model with the selected primitives chosen from the dictionary to get .
For each region of homogeneous textured motion , using as boundary condition, synthesize by ST-FRAME model or MA-FRAME with the selected filter set chosen from the filter bank to get .
The synthesis of the th frame of the video is given by aligning and together seamlessly.
Output the synthesized video .
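The final composition step of Algorithm 3 can be sketched as a mask-based merge. This is only an assumption about how the two parts are joined: the paper says they are aligned "seamlessly" without giving details here, so a hard binary mask is used, and `compose_frame` is a hypothetical name.

```python
def compose_frame(explicit, implicit, explicit_mask):
    """Take explicit-region pixels where the mask is True, implicit ones elsewhere."""
    return [[e if m else t for e, t, m in zip(e_row, t_row, m_row)]
            for e_row, t_row, m_row in zip(explicit, implicit, explicit_mask)]

reconstructed = [[1, 1], [1, 1]]          # stand-in for the sparse-coding reconstruction
synthesized = [[9, 9], [9, 9]]            # stand-in for the FRAME texture synthesis
mask = [[True, False], [False, True]]     # True marks explicit (sketchable, trackable) pixels
frame = compose_frame(reconstructed, synthesized, mask)
```

Since the implicit regions are synthesized with the explicit reconstruction as boundary condition, a hard mask of this kind should not introduce visible seams, although a real implementation may blend more carefully.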
Fig.2 illustrates this process, as introduced in Section 1. Fig.16 shows three examples of video synthesis (YCbCr color space, 8 bits for grey level) by VPS, frame by frame. For every experiment, the observed frames, trackability maps, and final synthesized frames are shown. Table 3 compares the compression ratio of VPS with that of H.264 as a reference. VPS is smaller than H.264 on the first example and larger on the other two, so it is broadly competitive with a state-of-the-art video encoder on video compression.
To assess the quality of the synthesized results quantitatively, we adopt two criteria, one for each representation, rather than the traditional error-sensitivity approach, which has a number of limitations (Wang et al (2004)). The error for explicit representations is measured by the difference of pixel intensities
while for implicit representations, the error is given by the difference of filter response histograms,
Table 4 shows the quality assessments of the synthesis, which demonstrates good performance of VPS on synthesizing videos.
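Table 3: Compressed size of VPS compared with H.264 (percentage of raw size in parentheses).
| Example | Raw (Kb) | VPS (Kb) | H.264 (Kb) |
| 1 | 924 | 16.02 (1.73%) | 20.8 (2.2%) |
| 2 | 1,485 | 26.4 (1.78%) | 24 (1.62%) |
| 3 | 1,485 | 28.49 (1.92%) | 18 (1.21%) |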
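The two error criteria described above (pixel intensity difference for explicit regions, filter response histogram difference for implicit regions) can be sketched as follows. The exact formulas are not reproduced in this text, so this sketch assumes a mean absolute difference normalised by the intensity range for the explicit case and an L1 distance between normalised histograms for the implicit case. All function names, the bin count and the response range are hypothetical.

```python
def explicit_error(observed, synthesized, levels=256):
    """Mean absolute intensity difference, as a fraction of the intensity range."""
    diffs = [abs(a - b)
             for row_o, row_s in zip(observed, synthesized)
             for a, b in zip(row_o, row_s)]
    return sum(diffs) / (len(diffs) * (levels - 1))

def histogram(responses, bins, lo, hi):
    """Normalised histogram of filter responses over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for r in responses:
        k = int((r - lo) / width)
        counts[min(max(k, 0), bins - 1)] += 1   # clamp out-of-range responses
    return [c / len(responses) for c in counts]

def implicit_error(obs_responses, syn_responses, bins=8, lo=-1.0, hi=1.0):
    """L1 distance between the two normalised response histograms."""
    h_obs = histogram(obs_responses, bins, lo, hi)
    h_syn = histogram(syn_responses, bins, lo, hi)
    return sum(abs(a - b) for a, b in zip(h_obs, h_syn))

# Same responses in a different spatial order: zero histogram error,
# even though a pixel-wise comparison would report a difference.
same_statistics = implicit_error([0.1, 0.2, -0.3], [-0.3, 0.1, 0.2])
```

The last line illustrates why a histogram-based measure suits textured regions: it ignores where each response occurs and compares only the statistics.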
3.8 Computational Complexity Analysis
In this subsection, we analyze the computational complexity of the algorithms studied in this paper. We discuss four algorithms in turn. The implementation environment is a desktop computer with an Intel Core i7 2.9 GHz CPU, 16GB of memory and the Windows 7 operating system.
1) Video modeling by VPS. Suppose one frame of a video contains pixels, of which, pixels belong to explicit regions and in implicit regions. Let the size of the filter dictionary be and the filter size be , the computational complexity for calculating filter responses is . For extracting and learning explicit bricks, the complexity is no more than . For calculating the response histograms of chosen filters within the implicit regions, the complexity is no more than if there are homogeneous textural area in the regions. To sum up, the total computation complexity for video coding is no more than . In our experiments, for coding one frame of the video with the size of , the time consumption is less than 0.5 seconds.
2) Reconstruction of explicit regions. Because the information of all the bases for explicit regions is recorded and no additional computation is needed for reconstruction, the computational complexity can be regarded as and the reconstruction time is negligible in comparison to the other components.
3) Synthesis of implicit regions by Gibbs sampling by ST-FRAME. For one round sampling, each of the pixels will be sampled in the range of the overall intensity levels, say . For every sampling candidate, i.e. one intensity, the score is calculated via the change of synthesized filter response histograms. To reduce the computation burden, we can simply update the change of filter responses caused by the change of the intensity on the current pixel. This operation requires times of multiplications. As a result, the computational complexity for one round sampling of one frame is . In the experiments of this paper, one frame will be sampled for about 20 rounds. Then the running time is about 2 minutes if the image is 4 bits and the size of implicit region is pixels.
4) Synthesis of implicit regions by Gibbs sampling by MA-FRAME. The computational complexity of MA-FRAME is quite similar to that of ST-FRAME. The biggest difference is the number of sampling candidates. As the number of velocity candidates is and the intensity perturbation range is , the computational complexity is , which is on the same level as ST-FRAME. However, in real applications, because the intensities in the neighborhood of one pixel are close to each other, the intensity candidates under different velocities are highly redundant. As a result, MA-FRAME may save a lot of time compared with ST-FRAME, especially when the intensity level is high. For one frame with 8 bits and pixels, the running time is about 4 minutes for 20 rounds of sampling.
In summary, the computational complexity of video modeling / coding by VPS is small, but that of video synthesis is quite large because of the texture synthesis procedure. In VPS, the textures are modeled by MRFs and synthesized via Gibbs sampling, which is well known to be computationally expensive. However, video synthesis is only one of the applications of VPS and is used here to verify the correctness of the model, so its cost is not our main concern.
3.9 Perceptual Study
The error assessment of VPS is consistent with human perception. To support this claim, in this subsection we present a series of human perception experiments and examine perception accuracy. In the experiments below, the 30 participants include graduate students and researchers from mathematics, computer science and medical science. Their ages range from 22 to 39, and all have normal or corrected-to-normal vision.
In the first experiment, we randomly crop several video clips of different sizes from the four synthesized textured motion examples and their corresponding original videos. On the left side of Fig. 17, 18 and 19, one frame of each video is shown as an example, marked by (a), (b), (c) and (d) respectively; the clips have different sizes but are zoomed to the same size for display. Then, for the original and synthesized examples separately, each participant is shown 40 clips one by one (10 clips from each texture) and is asked to guess which texture each clip comes from. We show 3 representative groups of results below, in which the sizes of cropped examples are , and respectively. The confusion rates (%) of both the original and synthesized examples are shown in the tables on the right side of Fig. 17, 18 and 19. Each row gives the average rates at which a video clip labeled by the row title is judged to come from the textures labeled by the column titles. To test whether the syntheses are perceived in the same way as the original videos, we compare the original and synthesis confusion tables in each group. The confusion tables are mostly consistent. For a more precise quantitative estimate, we also analyze the recognition accuracies by ANOVA in Table 5, in which each row shows the corresponding F and p values for each texture in all three groups. The results show that the recognition accuracies on original and synthesized textures do not differ significantly.
|F/p||Group 1||Group 2||Group 3|
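The F value reported by a one-way ANOVA can be computed as in the sketch below. The accuracy values in the example are made up for illustration and are not the paper's data, and `one_way_anova_f` is a hypothetical helper. The p value is omitted because it requires the cumulative F distribution, for which a statistics library (for example `scipy.stats.f_oneway`) would normally be used.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of samples (lists of numbers)."""
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual variation inside each group.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical per-participant accuracies for one texture:
# original clips versus synthesized clips.
original_acc = [0.7, 0.8, 0.9]
synthesized_acc = [0.8, 0.7, 0.9]
f_value = one_way_anova_f([original_acc, synthesized_acc])
```

An F value near zero, as in this toy example where both groups have the same mean, corresponds to the "no significant difference" outcome described in the text.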
Also, textures (a) and (b) appear similar, while (c) and (d) tend to be confused with each other. Therefore, the confusion rates between (a) and (b), and between (c) and (d), are noticeably larger. However, from Fig. 17 to 19, as the size of the cropped videos gets larger, the confusion rate becomes lower, and when the size goes larger than in this experiment, the accuracies get very close to 100%. This experiment demonstrates that the dynamic textures synthesized from the statistics of dynamic filters can be well discriminated by human vision, although the synthesized and original videos are totally different at the pixel level. This indicates that the approximation of filter response histograms reflects the quality of video synthesis. Furthermore, it suggests that larger texture areas are perceived much better, because humans can extract more macroscopic statistical information and motion-appearance characteristics, while small local areas provide only salient structural information, which may be shared by a variety of different videos.
|Video||Scale 100%||Scale 75%||Scale 50%||Scale 25%|
In the second experiment, we test whether the video synthesized by VPS gives a visual impression similar to that of the original video. Each time, we show the original and synthesized videos to one participant at the same scale. The videos are played synchronously, and the participant is required to point out which is the original video within 5 seconds. Each pair of videos is tested at four scales: 100%, 75%, 50% and 25%. The accuracies are shown in Table 6. When the videos are shown at larger scales, it is easier to discriminate the original from the synthesized video, because observers can notice many structural details. As the scale gets smaller, macroscopic information dominates the visual impression, so the original and synthesized videos are perceived as almost the same and the accuracy drops toward 50%. This experiment shows that although VPS cannot give a complete pixel-level reconstruction of a video, especially for dynamic textures, the synthesis gives humans a similar visual impression, which means that most of the key information for perception is kept by the VPS model.
3.10 VPS adapting over scales, densities and dynamics
As observed in (Gong and Zhu (2012)), the optimal visual representation of a region is affected by distance, density and dynamics. In Fig.20, we show four video clips from a long video sequence. As the scale changes from high to low over time, the birds in the videos are perceived as boundary lines, groups of kernels, dense points and dynamic textures respectively. We show the VPS of each clip and demonstrate that the proper representations are chosen by the model. Fig.21 shows the types of primitives chosen for the explicit representations, in which circles represent blob-like primitives and short lines represent edge-like primitives. Table 7 compares the numbers of blob-like and edge-like primitives at each scale, within the first 50, 100, 150 and 200 chosen primitives respectively. The percentage of edge-like primitives chosen in the large-scale frame is much higher than that in the small-scale frame. Moreover, in the large-scale frame the blob-like primitives start to appear very late, which shows that edge-like primitives are much more important for representing videos at this scale. In the small-scale frame, by contrast, the blob-like primitives account for a large percentage at the very beginning, and the number of edge-like primitives grows faster and faster as more primitives are chosen. This demonstrates that blob-like structures are much more prominent at small scale. This experiment therefore shows that VPS can choose proper representations automatically and, furthermore, that the representation patterns may reflect the scale of the videos.
|Scale||First 50||First 100||First 150||First 200|
3.11 VPS supporting action representation
VPS is also compatible with high-level action representation. By grouping meaningful explicit parts in a principled way, it represents an action template. In Fig.22, (b) is the action template given by the deformable action template model (Yao and Zhu (2009)) from the video shown in (a). The action template consists essentially of the sketches from the explicit regions. (c) shows an action synthesis with only filters from a matching pursuit process. In (d), following the VPS model, the action parts and a few sketchable background regions are reconstructed by the explicit representation, and the large region of water is synthesized by the implicit representation, which yields the synthesis of the whole video. Here, the explicit regions correspond to meaningful “template” parts, while the implicit regions are auxiliary background parts.
To show the relationship between the VPS representation and effective high-level features, we take a KTH video (Schuldt et al (2004)) as an example. Fig.23 and Fig.24 show the spatial and temporal features of explicit regions respectively. In Fig.23, we compare the VPS spatial descriptor with the well-known HOG feature (Dalal and Triggs (2005)), which has been widely used for object representation. (b) is the HOG descriptor for the human in one video frame (a). (c) shows the structural features extracted by VPS, where circles and short edges represent 53 local descriptors. Compared with HOG in (b), VPS makes a local decision in each area based on the statistics of filter responses, and therefore provides a shorter coding length than HOG. Furthermore, it gives a more precise description than HOG; for example, the head is represented by a circle descriptor, which carries more information than a pure filter response histogram such as HOG. (d) gives a synthesis with the corresponding filters, which shows the human boundary precisely.
In Fig.24, we show the motion information between two consecutive frames (a) and (b) extracted by MA-FRAME in VPS. (d) gives the clustered motion styles in the current video. The motion statistics of the five styles are shown in (e). Region 1 represents the head, which is almost still in the waving motion, while region 5 corresponds to the two arms, which show a definite moving direction. Region 3 represents the legs, which form an oriented trackable area. Regions 2 and 4 are relatively ambiguous in motion direction and are basically textured background in the video. Once the trackability map shown in (c) is computed from these motion styles, the motion template pops up.
In summary, the information extracted by VPS is compatible with high-level object and motion representations. In particular, it is very close to the HOG and HOOF descriptors, which have proven to be effective spatial and temporal features respectively. The main difference is that VPS makes a local decision, which gives a more compact expression and is better suited to visualization. Therefore, VPS not only gives a middle-level representation of video, but also has a strong connection with low-level vision features and high-level vision templates.
4 Discussion and Conclusion
In this paper, we present a novel video primal sketch model as a middle-level generic representation of video. It is generative and parsimonious, integrating a sparse coding model for explicitly representing sketchable and trackable regions and extending the FRAME models for implicitly representing textured motions. It is a video extension of the primal sketch model (Guo et al (2007)). It can choose appropriate models automatically for video representation.
Based on the model, we design an effective algorithm for video synthesis, in which explicit regions are reconstructed by learned video primitives and implicit regions are synthesized through a Gibbs sampling procedure based on spatio-temporal statistics. Our experiments show that VPS is capable of video modeling and representation, with a high compression ratio and good synthesis quality. Furthermore, it learns explicit and implicit expressions for meaningful low-level vision features and is compatible with high-level structural and motion representations, and therefore provides a unified video representation for low-, middle- and high-level vision tasks.
In ongoing work, we will strengthen this work in several respects, especially by enhancing the connections with low-level and high-level vision tasks. For low-level study, we are learning a much richer and more comprehensive dictionary of video primitives. For high-level applications, we are applying the VPS features to object and action representation and recognition.
Acknowledgements. This work was done when Han was a visiting student at UCLA. We acknowledge the support of an NSF grant DMS 1007889 and ONR MURI grant N00014-10-1-0933 at UCLA. The authors also acknowledge the support of four grants in China: NSFC 61303168, 2007CB311002, NSFC 60832004, NSFC 61273020.
References
- Adelson and Bergen (1985) Adelson E, Bergen J (1985) Spatiotemporal energy models for the perception of motion. JOSA A 2(2)
- Bergen and Adelson (1991) Bergen JR, Adelson EH (1991) Theories of visual texture perception. Spatial Vision, D Regan (Eds), CRC Press
- Besag (1974) Besag J (1974) Spatial interactions and the statistical analysis of lattice systems. J Royal Statistics Soc, Series B 36
- Black and Fleet (2000) Black MJ, Fleet DJ (2000) Probabilistic detection and tracking of motion boundaries. IJCV 38(3)
- Bouthemy et al (2006) Bouthemy P, Hardouin C, Piriou G, Yao J (2006) Mixed-state auto-models and motion texture modeling. Journal of Mathematical Imaging and Vision 25(3)
- Campbell et al (2002) Campbell NW, Dalton C, Gibson D, Thomas B (2002) Practical generation of video textures using the auto-regressive process. Proceedings of British Machine Vision Conference pp 434–443
- Chan and Vasconcelos (2008) Chan AB, Vasconcelos N (2008) Modeling, clustering, and segmenting video with mixtures of dynamic textures. PAMI 30(5)
- Chaudhry et al (2009) Chaudhry R, Ravichandran A, Hager G, Vidal R (2009) Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. CVPR
- Chubb and Landy (1991) Chubb C, Landy MS (1991) Orthogonal distribution analysis: A new approach to the study of texture perception. Comp Models of Visual Proc, MS Landy et al (Eds), MIT Press
- Comaniciu et al (2003) Comaniciu D, Ramesh V, Meer P (2003) Kernel-based object tracking. PAMI 25(5)
- Dalal and Triggs (2005) Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. CVPR
- Dalal et al (2006) Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. ECCV
- Derpanis and Wildes (2010) Derpanis KG, Wildes RP (2010) Dynamic texture recognition based on distributions of spacetime oriented structure. CVPR
- Doretto et al (2003) Doretto G, Chiuso A, Wu YN, Soatto S (2003) Dynamic textures. IJCV 51(2)
- Elder and Zucker (1998) Elder J, Zucker S (1998) Local scale control for edge detection and blur estimation. PAMI 20(7)
- Fan et al (2006) Fan Z, Yang M, Wu Y, Hua G, Yu T (2006) Efficient optimal kernel placement for reliable visual tracking. CVPR
- Gong and Zhu (2012) Gong HF, Zhu SC (2012) Intrackability: Characterizing video statistics and pursuing video representations. IJCV 97(3)
- Guo et al (2007) Guo C, Zhu SC, Wu YN (2007) Primal sketch: integrating texture and structure. CVIU 106(1)
- Han et al (2011) Han Z, Xu Z, Zhu SC (2011) Video primal sketch: a generic middle-level representation of video. ICCV
- Heeger (1987) Heeger D (1987) Model for the extraction of image flow. JOSA A 4(8)
- Heeger and Bergen (1995) Heeger DJ, Bergen JR (1995) Pyramid-based texture analysis/synthesis. SIGGRAPH
- Kim et al (2010) Kim T, Shakhnarovich G, Urtasun R (2010) Sparse coding for learning interpretable spatio-temporal primitives. NIPS
- Lindeberg and Fagerström (1996) Lindeberg T, Fagerström D (1996) Scale-space with causal time direction. ECCV
- Maccormick and Blake (2000) Maccormick J, Blake A (2000) A probabilistic exclusion principle for tracking multiple objects. IJCV 39(1)
- Mallat and Zhang (1993) Mallat S, Zhang Z (1993) Matching pursuits with time-frequency dictionaries. IEEE TSP 41(12)
- Marr (1982) Marr D (1982) Vision. W H Freeman and Company
- Olshausen (2003) Olshausen BA (2003) Learning sparse, overcomplete representations of time-varying natural images. ICIP
- Olshausen and Field (1996) Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381
- Portilla and Simoncelli (2000) Portilla J, Simoncelli E (2000) A parametric texture model based on joint statistics of complex wavelet coefficients. IJCV 40(1):49–71
- Ravichandran et al (2009) Ravichandran A, Chaudhry R, Vidal R (2009) View-invariant dynamic texture recognition using a bag of dynamical systems. CVPR
- Schuldt et al (2004) Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. ICPR
- Serby et al (2004) Serby D, Koller-Meier S, Gool LV (2004) Probabilistic object tracking using multiple features. ICPR
- Shi and Zhu (2007) Shi K, Zhu SC (2007) Mapping natural image patches by explicit and implicit manifolds. CVPR
- Silverman et al (1989) Silverman MS, Grosof DH, Valois RLD, Elfar SD (1989) Spatial-frequency organization in primate striate cortex. Proc Natl Acad Sci 86
- Szummer and Picard (1996) Szummer M, Picard RW (1996) Temporal texture modeling. ICIP
- Wang and Zhu (2004) Wang YZ, Zhu SC (2004) Analysis and synthesis of textured motion: particles and waves. PAMI 26(10)
- Wang et al (2004) Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error measurement to structural similarity. IEEE TIP 13(4)
- Wildes and Bergen (2000) Wildes R, Bergen J (2000) Qualitative spatiotemporal analysis using an oriented energy representation. ECCV
- Wu et al (2000) Wu YN, Zhu SC, Liu XW (2000) Equivalence of Julesz ensemble and FRAME models. IJCV 38(3)
- Yao and Zhu (2009) Yao B, Zhu SC (2009) Learning deformable action templates from cluttered videos. ICCV
- Yuan et al (2010) Yuan F, Prinet V, Yuan J (2010) Middle-level representation for human activities recognition: the role of spatio-temporal relationships. ECCVW
- Zhu et al (1998) Zhu SC, Wu YN, Mumford DB (1998) Filters, random field and maximum entropy (FRAME): towards a unified theory for texture modeling. IJCV 27(2)