Dynamic patterns such as running water, traffic flow, fire and smoke are commonly seen visual phenomena in the natural world and have attracted much attention in computer vision, where they are known as dynamic textures or temporal textures [11, 46, 5, 31, 21, 45]. By the term “dynamic textures” (DTs), we refer in this paper to any spatial-temporal process, given as a sequence of images, that exhibits temporal regularities, i.e., certain stationarity properties in time. A substantial amount of work has been devoted to synthesizing dynamic textures from examples, i.e., example-based dynamic texture synthesis (EDTS), which aims to generate a new dynamic texture of desired length or size that is perceptually similar to the DT example. Recent years have witnessed significant progress in EDTS algorithms, either in parametric models [11, 12, 46, 5, 40, 41, 43] or in non-parametric models [31, 37, 21, 2, 45]. While most attempts have focused on synthesis approaches that generate high-quality results, how to analyse and evaluate DTs as input samples for synthesis has not yet been addressed.
1.1 Problem statement
The goal of example-based dynamic texture synthesis (EDTS) can be stated as follows: given a dynamic texture sample, synthesize a novel video sequence that looks perceptually equivalent to the input sample, with similar appearance and dynamics. This statement shows that the quality of the generated dynamic texture depends on both the synthesis algorithm and the input exemplar: the former provides the technical means and the latter provides the material to be simulated. We are interested in dynamic texture synthesizability with respect to both DT samples and EDTS methods, that is, how to analyse and evaluate DTs as input samples to aid EDTS. Our study of learning the synthesizability of dynamic texture samples is based on the following observations:
No existing EDTS method can tackle all kinds of DTs equally well. Although a growing number of EDTS approaches generate dynamic textures of high quality, none of them can synthesize all dynamic textures equally well. Each method has its own benefits and limits. For instance, statistical parametric models [11, 5, 41] are mathematically sound, but are likely to suffer from degraded visual quality because the statistics are not complete enough or not well enforced. Non-parametric models such as patch-based methods [37, 21], though efficient, sometimes produce verbatim copies of patterns. Moreover, some methods [12, 41] are good at stationary dynamic texture synthesis in the joint spatial-temporal domain, while others [11, 5] do better at controlling synthesis along the time axis.
Not all dynamic textures are equally reproducible in synthesis.
On the other hand, dynamic texture samples are not equally reproducible. Except when acquired under a controlled protocol, for instance in a specific lab environment, DT samples usually contain outliers, cluttered backgrounds, non-uniform illumination, or even objects and complex scenes. When the viewpoint is not fixed, video sequences of DTs are captured unstably by panning or jittering cameras. The appearance of DTs can vary widely, ranging from fire and smoke to crowds and traffic flow. Dynamics also exhibit different motion patterns, e.g. dominant orientation, heterogeneous orientation and stochastic dynamics. This diversity of appearance and dynamics makes EDTS difficult and challenging. It would be helpful to evaluate dynamic texture samples as inputs to EDTS according to how well their appearance and dynamics can be reproduced, by analysing only the original sample.
How to suggest suitable EDTS methods for DT samples? Existing synthesis methods cannot tackle all kinds of dynamic textures, nor are all dynamic textures equally reproducible. It is difficult to find a universal approach that synthesizes all kinds of DTs well. Since many EDTS methods are already available, why not adopt a flexible strategy that takes advantage of their respective strengths? Can we suggest appropriate EDTS methods adapted to different DT samples? Rather than struggle to build a perfect EDTS method, which is extremely difficult, we select the best method for each case among several existing alternatives, in order to reach success rates better than those of any individual method. This is a way of drawing on the wisdom of the crowd. For instance, the linear dynamic system (LDS) is a statistical generative model for synthesizing dynamic textures, but it tends to smooth the dynamics and degrade visual quality over time. It would be beneficial to resort to an alternative method in such a case, when long sequences need to be synthesized. This leads to a win-win situation for DT samples and EDTS methods: good DT samples are provided to EDTS methods, and suitable EDTS methods are chosen for DT samples.
1.2 Motivation and objective
Although it is intuitive that some videos are easier to synthesize than others, quantifying this intuition has not been addressed in previous studies. No previous investigation tries to quantify how synthesizable individual DTs are, and few computer vision systems predict synthesizability and suggest suitable synthesis methods for DT samples. Also, there are no video databases calibrated in terms of the degree of synthesizability of each dynamic texture. Motivated by these issues, we investigate the synthesizability of dynamic textures: how well the underlying dynamic patterns can be reproduced, judged by analysing only the original sample.
We characterize the synthesizability of a dynamic texture sample as the probability that existing EDTS methods will produce good synthesized results for that sample, estimated before any synthesis method is applied to it. The seminal work on predicting the synthesizability of static textures was proposed by Dai et al. Inspired by their work, we propose in this paper to predict the synthesizability score of a given dynamic texture sample and to suggest which EDTS method is best suited to synthesize it. Fig. 1(a) shows the synthesizability scores assigned to some dynamic texture samples by our system. Although dynamic textures are common in natural scenes, in many cases only part of the scene forms a dynamic texture. It would be useful to crop videos to regions with good synthesizability by discarding undesirable background. Fig. 1(b) illustrates the trimming of a video to its most synthesizable rectangular homogeneous regions (the red boxes).
In order to learn the synthesizability of dynamic textures, we have collected a dataset of dynamic textures with manually annotated synthesizability scores based on the synthesis results of a set of EDTS methods. We connect dynamic texture samples with synthesizability scores via a learning scheme in which regression models are learned from the annotated data. Feature representation is essential for learning synthesizability. We design a SCOP-DT descriptor specifically for dynamic texture representation, which extends the shape-based co-occurrence patterns (SCOP) from 2D static textures to 3D dynamic textures by incorporating temporal cues implicitly.
Our work is distinguished in the following aspects:
We make it possible to estimate the synthesizability of dynamic textures by learning regression models with appropriate spatiotemporal descriptors. Meanwhile, for a given DT sample, our system can also suggest the “best” synthesis method from off-the-shelf EDTS algorithms.
To the best of our knowledge, we are the first to investigate the “dynamic textureness” property of dynamic patterns in video in order to automatically discern dynamic textures, and to divide dynamic textures into two subcategories based on their spatial modes to suit different synthesis capacities.
We propose a novel SCOP-DT descriptor for dynamic texture representation, which captures geometrical aspects and temporal consistency simultaneously.
We compile a dynamic texture dataset with synthesizability annotations.
The rest of the paper is organized as follows. Section 2 briefly recalls related work. Section 3 presents the problem formulation of learning dynamic texture synthesizability. Section 4 investigates dynamic texture representations relevant to synthesizability and describes the proposed method of learning synthesizability. Section 5 introduces the dataset we collected for learning synthesizability. Section 6 presents the experimental results and analysis. Finally, Section 7 draws some concluding remarks. Note that all the experimental results are available at http://captain.whu.edu.cn/project/DTsynthesizability.html.
2 Related work
This section briefly reviews research that is closely related to our work.
Example-based dynamic texture synthesis:
Synthesizability aims to help find good samples and suggest suitable EDTS methods. Recent years have witnessed significant progress in example-based texture synthesis algorithms, which can be roughly divided into two main categories: parametric and non-parametric methods. Parametric methods usually define a parametric model consisting of a set of statistical measurements that cover the spatial extent and temporal domain of dynamic textures, e.g. the spatiotemporal autoregressive (STAR) model, the linear dynamic system (LDS) model and its variants [46, 38, 3, 5]. While most LDS-based methods are limited to learning temporal statistics, Doretto et al. tried to jointly model the spatio-temporal statistics with dynamic multiscale autoregressive models. Xia et al. [41, 40] developed a compact Gaussian texton representation to synthesize stationary dynamic textures. Based on convolutional neural networks, a spatial-temporal generative ConvNet (STGConvNet) has been proposed to model and synthesize dynamic textures. Parametric methods build explicit models with a mathematical foundation, but the main challenge lies in designing rigorous and meaningful mathematical models that capture the essence of different dynamic patterns. Because the statistics are incomplete or not enforced, parametric models often fail to synthesize complex geometric patterns.
Rather than build explicit models with estimated parameters, non-parametric methods bypass modeling the spatial-temporal mechanism of dynamic textures mathematically. They mainly include copy-based methods and feature-oriented synthesis. Copy-based methods produce new dynamic textures by resampling small parts of an input sample as elements, synthesizing either in the spatiotemporal domain [37, 21] or along time. Although they are efficient and their visual results are strikingly good, copy-based methods are prone to verbatim reproduction, too close to a mere copy-paste. Feature-oriented synthesis methods match feature statistics between synthetic videos and the original ones [2, 45]. This subgroup of methods does not generate verbatim repetition, but the statistics and spatiotemporal features need to be chosen or designed carefully.
Image or video quality evaluation:
Several works have investigated quantifying certain qualitative characteristics of images or videos in computer vision. These include interestingness, memorability, quality and city geo-awareness, which leverage data-driven approaches to compute high-level visual attributes related to psychological perception. The work most related to ours is that of Dai et al., which predicted the synthesizability of static textures. However, learning the synthesizability of dynamic textures remains an open issue. We introduce synthesizability to dynamic textures and develop a series of approaches in a dynamic setting.
Dynamic texture recognition:
Our work is also related to dynamic texture recognition in terms of representation and classification. Recognition involves several major steps, such as feature extraction [47, 8], metric learning [15, 29] and classifier design. In our task, we also need to exploit spatiotemporal features for dynamic texture representation.
3 Dynamic Texture Synthesizability
3.1 Problem formulation
Denote by $V$ a video with $c$ channels defined on the space-time domain $\Omega$. In particular, $c = 1$ for grey-scale videos and $c = 3$ for color videos. The synthesizability $s$ of a DT sample $V$ indicates the probability that EDTS methods will produce good synthesized results for $V$, estimated before any synthesis method is applied to it. A DT sample with synthesizability score is denoted as $(V, s)$.
SHDT and TDT: Dynamic textures are spatial-temporal visual patterns that exhibit temporal regularities, but they do not necessarily show statistical stationarity in the spatial domain. When DTs additionally satisfy spatial stationarity constraints, we call them spatially homogeneous dynamic textures (SHDTs). In contrast, we refer to DTs that show only temporal stationarity as time-stationary dynamic textures (TDTs). TDTs can only be synthesized along the time axis, whereas SHDTs can potentially be synthesized in both the spatial and temporal domains; see Fig. 2 for examples of the difference. Correspondingly, both spatial and temporal synthesizability can be pre-computed for every SHDT sample, while TDTs only have temporal synthesizability. We use a binary label $b$ to indicate the spatial mode of a DT sample $V$, i.e. ‘1’ for SHDT and ‘0’ for TDT.
To suggest which EDTS method is best suited to synthesize $V$, an integer label $m$ is used. We employed $K$ representative off-the-shelf EDTS methods as candidate synthesis methods, thus $m \in \{1, \dots, K\}$. Then, we denote a DT sample with synthesizability score $s$ and the two associated labels $b$ and $m$ as $(V, s, b, m)$ with $(s, b, m) \in \mathcal{Y}$, where $\mathcal{Y}$ is the label space. The task of learning dynamic texture synthesizability is to build a mapping $\mathcal{F}$ from a training data set $\{(V_i, s_i, b_i, m_i)\}_{i=1}^{N}$ in order to predict the label set of an unseen dynamic texture. The problem of learning dynamic texture synthesizability can be formulated as
$$(s, b, m) = \mathcal{F}(\phi(V)), \tag{1}$$
where $\phi$ is a video feature extractor applied to $V$.
The goal of learning dynamic texture synthesizability is to build the mapping $\mathcal{F}$. Then, given a DT sample $V$, the task of predicting synthesizability is to estimate and assign the label set $(s, b, m)$ to the sample $V$.
3.2 Label space
The three labels used for annotation are not independent but associated with each other. Based on the spatial mode, dynamic textures are divided into SHDTs and TDTs, labeled by $b$ in the training data, i.e. with or without synthesis capacity in space. Accordingly, the annotation of the synthesizability score differs between SHDTs and TDTs: an SHDT is annotated with spatial and temporal synthesizability separately, a TDT only with temporal synthesizability. In addition to the synthesizability score, the optimal synthesis method of each DT video in the training set was recorded and labeled as $m$. Not all EDTS methods are able to synthesize both TDTs and SHDTs: some methods can deal with both, but others favor one of them. Hence, $m$ should also be connected with $b$.
To summarize, there are three types of labels for a DT: a binary label $b$ for the spatial mode, a continuous value $s$ for the synthesizability score, and a discrete label $m$ for the method index. This is a multi-label problem, but more than that, because the three labels are interrelated. Whether to compute spatial synthesizability is determined by the label $b$: if $b = 0$ (TDT), we only compute the temporal synthesizability score, whereas if $b = 1$ (SHDT) we compute both the spatial and temporal scores.
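The dependency between the three labels can be made concrete with a small data structure. The sketch below is only an illustration: the class name `DTAnnotation`, its field names, and the assumption that scores lie in [0, 1] (suggested by the definition of synthesizability as a probability) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

SHDT, TDT = 1, 0  # spatial-mode labels: '1' for SHDT, '0' for TDT


@dataclass(frozen=True)
class DTAnnotation:
    """Hypothetical container for the three interrelated labels of one DT sample."""
    spatial_mode: int                      # binary label for the spatial mode
    temporal_score: float                  # temporal synthesizability score
    method_index: int                      # index of the best EDTS method
    spatial_score: Optional[float] = None  # only meaningful for SHDTs

    def __post_init__(self):
        if self.spatial_mode not in (SHDT, TDT):
            raise ValueError("spatial_mode must be 0 (TDT) or 1 (SHDT)")
        # A TDT only has temporal synthesizability.
        if self.spatial_mode == TDT and self.spatial_score is not None:
            raise ValueError("a TDT carries only a temporal score")
        # An SHDT has both spatial and temporal synthesizability.
        if self.spatial_mode == SHDT and self.spatial_score is None:
            raise ValueError("an SHDT needs both spatial and temporal scores")
        # Assumption: scores are probabilities in [0, 1].
        for score in (self.temporal_score, self.spatial_score):
            if score is not None and not 0.0 <= score <= 1.0:
                raise ValueError("scores are assumed to lie in [0, 1]")
```

Encoding the constraint in the constructor means an inconsistent annotation, such as a TDT with a spatial score, is rejected when it is created rather than later during training.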
3.3 Divide-and-conquer strategy
To clarify this problem: given the dynamic texture training set $\{(V_i, s_i, b_i, m_i)\}_{i=1}^{N}$, where every DT sample is associated with three labels, we want to learn a set of models to estimate the synthesizability of an unknown test dynamic texture. The mapping from one video to three labels is a one-to-many mapping, which is difficult to tackle directly. To simplify the problem, we break the complex mapping $\mathcal{F}$ down into three sub-tasks: binary classification of SHDT versus TDT, regression to predict the synthesizability score, and an additional classifier to suggest the “best” synthesis method.
3.3.1 Learning binary label
Binary classification of SHDT and TDT amounts to estimating the class posterior distribution over the binary classes $b \in \{0, 1\}$, which can be given as
$$P(b \mid \phi(V)), \quad b \in \{0, 1\}. \tag{2}$$
3.3.2 Learning synthesizability score
The synthesizability $s$ of a DT sample $V$ indicates the probability that EDTS methods will produce good synthesized results for $V$ before synthesizing it. We denote a DT sample with synthesizability score as $(V, s)$. The problem of learning the dynamic texture synthesizability score can be formulated as
$$s = f(\phi(V)), \tag{3}$$
where $f$ is a function to learn synthesizability, e.g. a regression model, and $\phi$ is a video feature extractor.
3.3.3 Suggest synthesis method index
We also suggest the “best” EDTS method by an additional classifier. This can be regarded as multi-class classification that assigns a method label $m$ to indicate the optimal synthesis method for a DT sample. Since synthesis methods differ in their ability to deal with different spatial modes, we can introduce prior information indicated by
$$P(m \mid b) \tag{4}$$
to constrain the possible methods for SHDTs or TDTs, which leads to a joint class posterior probability distribution:
$$P(m, b \mid \phi(V)) = P(m \mid b, \phi(V))\, P(b \mid \phi(V)). \tag{5}$$
The goal of learning dynamic texture synthesizability is to set up the function $f$ and build the class posterior probability distributions $P(b \mid \phi(V))$ and $P(m, b \mid \phi(V))$. Then, given a test DT sample $V$, the task of predicting synthesizability is to give the spatial mode label $b$ and estimate the synthesizability score $s$, as well as suggest the optimal EDTS method label $m$ for the sample $V$.
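One simple way to let the spatial mode constrain the method suggestion is to discard candidate methods that cannot handle the predicted mode and take the best-scoring method among the rest. The following is a minimal sketch under that assumption; the function name, the method names and the capability table are hypothetical placeholders, and the hard mask stands in for the prior described above rather than reproducing the paper's classifier.

```python
def suggest_method(method_scores, spatial_mode, supported_modes):
    """Return the best-scoring method that can handle the given spatial mode.

    method_scores   : dict, method name -> classifier score (higher is better)
    spatial_mode    : 1 for SHDT, 0 for TDT
    supported_modes : dict, method name -> set of spatial modes it can synthesize
    """
    # Hard prior: keep only methods able to deal with this spatial mode.
    allowed = [m for m in method_scores if spatial_mode in supported_modes[m]]
    if not allowed:
        raise ValueError("no candidate method supports this spatial mode")
    return max(allowed, key=lambda m: method_scores[m])


# Illustrative values only (not results from the paper).
scores = {"method_A": 0.9, "method_B": 0.7, "method_C": 0.4}
modes = {"method_A": {1}, "method_B": {0, 1}, "method_C": {0}}
best_for_tdt = suggest_method(scores, 0, modes)   # method_A is excluded for TDTs
best_for_shdt = suggest_method(scores, 1, modes)
```

With these placeholder values, the highest-scoring method overall is skipped for a TDT because it is listed as supporting only SHDTs.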
4 Method to Learn Dynamic Texture Synthesizability
In this section, we first investigate dynamic texture representations related to synthesizability. Next, the method of learning dynamic texture synthesizability is described in detail. Finally, we introduce how synthesizability can be used to detect synthesizable regions in video.
4.1 Dynamic texture representation
For dynamic texture representation relevant to synthesizability, we revisit general spatial-temporal features and design a novel SCOP-DT descriptor for dynamic texture.
4.1.1 LBP-TOP
Local binary patterns from three orthogonal planes (LBP-TOP) extend Local Binary Patterns (LBP) to 3D volumes for dynamic texture analysis by calculating the LBP code in three orthogonal 2D cross-sections: the XY, XT and YT planes. LBP-TOP is invariant to local contrast changes in dynamic textures, but cannot depict geometric aspects.
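To make the three-plane construction concrete, here is a simplified sketch in plain Python. It uses the basic 8-neighbour, radius-1 LBP code and processes every XY, XT and YT slice of the volume, which is a simplification of the original LBP-TOP (no circular sampling with interpolation, no uniform patterns). The function names are illustrative, and the video is assumed to be a nested list indexed as `video[t][y][x]` with at least three samples along each axis.

```python
def lbp_histogram(plane):
    """256-bin histogram of basic 8-neighbour LBP codes of a 2D list."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(plane), len(plane[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = plane[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets one bit.
                if plane[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist


def lbp_top(video):
    """Concatenate normalised LBP histograms of the XY, XT and YT planes."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    totals = [[0] * 256 for _ in range(3)]

    def accumulate(k, plane):
        for i, v in enumerate(lbp_histogram(plane)):
            totals[k][i] += v

    for t in range(T):   # XY planes: one per frame
        accumulate(0, video[t])
    for y in range(H):   # XT planes: fixed row y, time as the vertical axis
        accumulate(1, [[video[t][y][x] for x in range(W)] for t in range(T)])
    for x in range(W):   # YT planes: fixed column x, time as the vertical axis
        accumulate(2, [[video[t][y][x] for y in range(H)] for t in range(T)])

    feature = []
    for hist in totals:
        n = sum(hist) or 1
        feature.extend(v / n for v in hist)
    return feature       # 3 * 256 = 768 values
```

Because each code depends only on comparisons with the centre value, any increasing grey-level transform of the whole video leaves the descriptor unchanged, which is the contrast invariance mentioned above.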
4.1.2 C3D ConvNet
In static texture recognition, Cimpoi et al. successfully ported object/scene description methods, e.g. convolutional neural networks (CNNs), to texture description and significantly outperformed the state-of-the-art recognition rates established by specialized texture descriptors. Inspired by their work, we use generic video descriptors learned by CNNs to characterize DTs. Recently, the architecture of deep 3-dimensional convolutional networks (C3D ConvNets) has been proposed to learn spatiotemporal features for video description. C3D trains the network with 3D filter kernels over the space-time volume and can be used as a generic video feature extractor, but it is not specific to dynamic textures.
4.1.3 Shape-based co-occurrence patterns for dynamic textures
In static texture analysis, Shape Co-occurrence Patterns (SCOPs) [42, 23] provide a shape-based texture representation built on the co-occurrence patterns of shapes. Following shape-based invariant texture analysis (SITA), texture images are first represented by a tree of shapes (the topographical map), where each shape is associated with some predefined geometrical and radiometric attributes. SCOPs learn a set of co-occurrence patterns of shapes via clustering, based on the hierarchical relationship of the shapes in the tree. Using the co-occurrence patterns of shapes as the codewords of dictionaries, a texture image can be encoded into a vector descriptor. SCOPs capture geometrical aspects of textures and high-order statistics of shape relationships, and have demonstrated superior performance both on multiple texture image datasets and on a complex scene image dataset.
The original SCOP is designed for static textures in 2D space. In this paper, we extend the SCOP descriptor from image space to space-time dynamic textures by incorporating temporal cues implicitly. A natural way to extend SCOP to DTs is to treat a DT sequence as a 3D volume. This requires extending the tree of shapes from 2D images to 3D space-time video, where each shape is a 3D volume rather than a 2D region. This approach, while seemingly natural and sound, faces challenges such as dealing with varying frame rates or motion speeds. In this case, treating space and time equally in a 3D cuboid domain may not be reasonable, owing to the different scales and occurrences of elements observed in space and in motion. To bypass this problem, we propose an alternative method that implicitly captures the self-similarity of a DT sequence along the time axis, referred to as the SCOP-DT descriptor/feature in this paper.
The extraction of SCOP-DT from video is illustrated in Fig. 3. We use the fast level set transformation (FLST) to calculate the Tree of Shapes (ToS) of each frame in a DT video. Because the frames of a dynamic texture are self-similar, the ToSs extracted from different frames are also similar to each other. To construct dictionaries of shape co-occurrence patterns in the ToSs, we randomly choose the ToSs of a subset of frames from every training video to learn the codewords of the dictionaries. In the encoding stage, we exploit the temporal consistency of DTs and implicitly encode temporal information, so that both shape and dynamic aspects are incorporated into a joint spatial-temporal representation.
More precisely, suppose a DT sample $V = \{I_t\}_{t=1}^{T}$ is a sequence of $T$ images along the time axis. The Tree of Shapes (ToS) of the $t$-th frame $I_t$ is calculated by the fast level set transformation (FLST): $\mathcal{T}_t = \mathrm{FLST}(I_t)$. Then we randomly choose the ToSs of $n$ frames from each video in the training set to learn the dictionary $D$, e.g. via clustering. In the coding step, we slice a video into a set of $L$-frame clips with a stride of $\delta$ frames between two consecutive clips. To include temporal dependencies in SCOP-DT, we project the ToSs of a clip of $L$ consecutive frames onto the dictionary $D$ and compute a coding vector. The coding vector of a clip $C_k$ with the set of ToSs $\{\mathcal{T}_t\}_{t=t_k}^{t_k+L-1}$ can be encoded by:
$$h_k = \Pi_D\big(\{\mathcal{T}_t\}_{t=t_k}^{t_k+L-1}\big), \tag{6}$$
where $\Pi_D$ denotes projecting the set of $\mathcal{T}_t$s onto the learned $D$ to calculate the coding vector.
We sample the ToS sequence sets with a sliding window of stride $\delta$ along the time axis for all clips in a video, $t_k = (k-1)\delta + 1$, with maximum clip index $k_{\max} = \lfloor (T - L)/\delta \rfloor + 1$. The encoding of $V$ can be denoted by
$$H(V) = \{h_k\}_{k=1}^{k_{\max}}. \tag{7}$$
To extract the SCOP-DT feature, the coding vectors of all clips in $V$ are averaged to form the final SCOP-DT video descriptor.
It is worth noting that this approach implicitly incorporates time information by simultaneously encoding $L$ frames into one coding vector, which reflects the self-similarity and temporal stationarity of DTs. To further improve the stability of the descriptor, the mean of the coding vectors over all clips is used: averaging the response over time effectively suppresses the variation among the coding vectors of individual clips.
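The clip-wise encoding and temporal averaging can be sketched as follows. This is not the SCOP encoding itself: as a stand-in, each frame is assumed to be already summarised by a list of pattern vectors (a hypothetical proxy for the shape co-occurrence patterns of its tree of shapes), and a clip is encoded as a normalised histogram of nearest-codeword assignments. The function names, `clip_len` and `stride` are illustrative.

```python
def nearest_codeword(vec, dictionary):
    """Index of the dictionary codeword closest to vec (squared Euclidean)."""
    best, best_dist = 0, float("inf")
    for k, word in enumerate(dictionary):
        dist = sum((a - b) ** 2 for a, b in zip(vec, word))
        if dist < best_dist:
            best, best_dist = k, dist
    return best


def encode_clip(clip_patterns, dictionary):
    """One coding vector for all frames of a clip (stand-in for the projection)."""
    hist = [0.0] * len(dictionary)
    for frame_patterns in clip_patterns:   # frames of the clip are pooled together
        for vec in frame_patterns:
            hist[nearest_codeword(vec, dictionary)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def scop_dt_like_descriptor(frame_patterns, dictionary, clip_len, stride):
    """Average the coding vectors of all sliding-window clips of a video."""
    n_frames = len(frame_patterns)
    if clip_len > n_frames or stride < 1:
        raise ValueError("need clip_len <= number of frames and stride >= 1")
    starts = range(0, n_frames - clip_len + 1, stride)
    codes = [encode_clip(frame_patterns[s:s + clip_len], dictionary)
             for s in starts]
    return [sum(col) / len(codes) for col in zip(*codes)]


# Toy example: 4 frames, one 1-D pattern per frame, two codewords.
frames = [[[0.1]], [[0.9]], [[0.1]], [[0.9]]]
codebook = [[0.0], [1.0]]
descriptor = scop_dt_like_descriptor(frames, codebook, clip_len=2, stride=1)
```

Pooling all frames of a clip into one vector is what carries the temporal information; the final average over clips only stabilises the result.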
4.2 Framework of learning synthesizability
Having described the dynamic texture representation relevant to synthesizability, we now present the technical framework of the learning process. Our ultimate goal is to predict the synthesizability of a dynamic texture sample, but we intend to generalize the problem to a more unconstrained setting. We would like to first retrieve dynamic textures from general video resources (including DTs and non-DTs), and then compute synthesizability for SHDTs and TDTs according to the DT’s spatial mode. To this end, as illustrated in Fig. 4, we propose a hierarchical architecture that goes from generic video sequences to dynamic textures, further separates DTs into SHDTs and TDTs, and finally outputs the synthesizability score. This leads to a 2-level partition. At the first level, we retrieve DTs as positive examples to separate them from non-DT videos. At the second level, the partition into SHDTs and TDTs is performed by binary classification. The prediction of synthesizability in the final stage is treated as a regression problem, and an additional classifier is used to suggest the “best” EDTS method. We start with dynamic texture retrieval, move on to binary classification of SHDTs and TDTs, and end with the prediction of synthesizability and the suggestion of suitable EDTS methods.
The overall step-by-step procedure in Fig. 4 contains three main stages as follows:
retrieve dynamic textures to separate DTs from non-DT videos;
binary classification to distinguish DTs into SHDTs and TDTs;
predict synthesizability scores by regression and suggest suitable EDTS methods by classification.
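The three stages above can be chained as in the following sketch. All predictors are passed in as callables and are placeholders for trained models; the function name, argument names and the returned dictionary layout are assumptions made for illustration only.

```python
def predict_synthesizability(features, is_dt, is_shdt,
                             score_temporal, score_spatial, suggest_method):
    """Run the hierarchical procedure on the features of one video.

    is_dt, is_shdt     : callables returning a truthy value (stages 1 and 2)
    score_temporal     : regressor for the temporal synthesizability score
    score_spatial      : regressor for the spatial score (used for SHDTs only)
    suggest_method     : callable (features, spatial_mode) -> method label
    """
    # Stage 1: separate dynamic textures from non-DT videos.
    if not is_dt(features):
        return {"is_dt": False}

    # Stage 2: SHDT (1) versus TDT (0).
    spatial_mode = 1 if is_shdt(features) else 0

    # Stage 3: regression for the scores, classification for the method.
    result = {
        "is_dt": True,
        "spatial_mode": spatial_mode,
        "temporal_score": score_temporal(features),
    }
    if spatial_mode == 1:
        result["spatial_score"] = score_spatial(features)
    result["method"] = suggest_method(features, spatial_mode)
    return result
```

A video rejected at the first stage never reaches the regressors, and the spatial score is only computed when the second stage predicts an SHDT.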
4.2.1 Retrieve DTs from videos
Nowadays, with easy access to portable sensor devices (e.g. mobile phones), shared videos have become ubiquitous on the Internet. The explosive growth of video data has made video processing popular and imperative in recent years. For dynamic textures, big data on the one hand provides a great diversity of video sources; on the other hand, picking out dynamic texture exemplars from video data becomes a huge workload. Polana and Nelson categorized visual motion recorded as video into three kinds: activities, motion events, and DTs. We give several examples of each category in Fig. 5. Activities and motion events are more difficult to synthesize than dynamic textures. In this paper, we focus on dynamic textures and want to retrieve DTs from a variety of videos in which dynamic textures, motion events and activities all occur.
We train a classifier to distinguish dynamic textures from activities and dynamic scenes (mostly motion events). The dynamic texture dataset we collected provides the positive samples (1729 DTs in total). The Maryland and YUPENN dynamic scene datasets and the UCF101 action dataset provide the negative samples (events/activities), 13,520 videos in all. Generally speaking, DTs are only a small fraction of videos. Hence the number of negative samples (non-DTs) is almost an order of magnitude larger than that of the positive ones (DTs) in retrieval. We choose the C3D descriptor for representation in the retrieval task because it is a generic video descriptor validated on many video recognition and classification tasks. Its generality makes it equally effective for characterizing DTs and non-DTs without much specialization, which matters because all kinds of videos are included in retrieval. Random Forest was used as the regression model with the C3D descriptor as feature. The regression score is taken as the DT “textureness”, which quantifies the chance that a video is a dynamic texture.
4.2.2 Binary classification for SHDTs and TDTs
After picking out DTs, we split them into SHDTs and TDTs to accommodate either joint spatial-temporal synthesis or temporal-only synthesis. We train a classifier to separate DTs into SHDTs and TDTs using the proposed SCOP-DT feature. Since SCOP-DT characterizes the geometrical and radiometric attributes of textures, a property inherited from SCOP, it is useful for distinguishing SHDTs from TDTs, which differ mainly in the structural aspects of their appearance. To tackle the binary classification of SHDTs and TDTs, we use the proposed SCOP-DT as the feature to train a binary SVM classifier. Since earlier research in image recognition has shown that combining multiple descriptors is very useful for improving classification performance [17, 22], SCOP-DT is further combined with C3D and LBP-TOP in ensemble SVMs for the binary classification of SHDTs and TDTs.
4.2.3 Learning the synthesizability of dynamic texture samples
Now that DTs are classified into two subsets: SHDTs and TDTs, the last step is to predict synthesizability for them. SHDT samples have spatial and temporal synthesizability, while TDTs merely cope with temporal synthesizability.
We suppose that the synthesizability score is learnable and predictable, which can be formulated as a regression problem. Given the DT samples $V_i$ and labeled synthesizability scores $s_i$ in the training set $\{(V_i, s_i)\}_{i=1}^{N}$, we formalise the learning problem as a regression model $f$:
$$s = f(V). \tag{8}$$
Let the DT sample $V$ be described by a set of features $\mathbf{x}$,
$$\mathbf{x} = \phi(V), \tag{9}$$
where $\phi$ is a feature extractor, e.g. SCOP-DT, C3D or LBP-TOP.
With the feature representation of the DT sample in (9), we denote the regression model as
$$f(\mathbf{x}) = \mathbf{w}^{\top} \psi(\mathbf{x}) + w_0, \tag{10}$$
where $\mathbf{w}$ and $w_0$ are the parameters of the regression model corresponding to the feature representation $\mathbf{x}$, with $\top$ denoting the vector transpose, and $\psi$ can be a linear or nonlinear mapping.
To build the regression model (10), we train it on the training set $\{(\mathbf{x}_i, s_i)\}_{i=1}^{N}$ to determine the parameters $\mathbf{w}$ and $w_0$ by
$$(\mathbf{w}^{*}, w_0^{*}) = \arg\min_{\mathbf{w}, w_0} \sum_{i=1}^{N} \ell\big(s_i, f(\mathbf{x}_i)\big), \tag{11}$$
where $\ell$ denotes the regression loss.
Then, for a test DT sample $V$ with unknown synthesizability score, we use the trained regression model (10) to predict the synthesizability score $\hat{s} = f(\mathbf{x})$ from its feature representation $\mathbf{x} = \phi(V)$. The predicted synthesizability score is a computable index that quantifies how well a dynamic texture can be synthesized by analysing only the original sample: the higher the score, the better the synthesizability.
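As a rough illustration of training and applying such a regression model, the sketch below fits a linear model with a bias term by batch gradient descent. Two choices here are assumptions made only for the example: the mapping applied to the features is the identity, and the loss is the squared error. The paper itself uses SVM and Random Forest regressors, which this sketch does not reproduce; the function names and hyperparameters are illustrative.

```python
def fit_linear_regressor(X, y, lr=0.1, epochs=2000):
    """Fit y ~ w . x + w0 by batch gradient descent on the mean squared error."""
    n, d = len(X), len(X[0])
    w, w0 = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_w0 = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, x)) + w0 - target
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_w0 += err
        # Step against the averaged gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        w0 -= lr * grad_w0 / n
    return w, w0


def predict_score(w, w0, x):
    """Predicted synthesizability score for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


# Toy data generated from score = 0.2 * x + 0.1 (illustrative, not real annotations).
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.1, 0.3, 0.5, 0.7]
w, w0 = fit_linear_regressor(X_train, y_train)
score = predict_score(w, w0, [1.5])
```

In practice the feature vectors would come from a descriptor such as SCOP-DT, C3D or LBP-TOP, and the learning rate would need to be tuned to the feature scale.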
In the spirit of Dai et al.’s use of multi-feature combination to predict static texture synthesizability, we follow the same policy for dynamic texture synthesizability prediction. Unlike Dai et al.’s method, which simply concatenates multiple features, we aggregate different features at the decision level for more robust fusion. In general, it is beneficial to combine multiple descriptors in machine learning [17, 22]. The naive solution is to concatenate different descriptors into one vector, but this has some drawbacks. On the one hand, the resulting input vector can have very high dimensionality, which may lead to overfitting and degrade generalization performance. On the other hand, different features lie in different feature spaces with different scales and are likely to be incompatible when concatenated.
We combine multiple features to predict dynamic texture synthesizability by aggregating them at the decision level. We use three features in combination: SCOP-DT, C3D and LBP-TOP. Recently, ensembles of classifiers have been used to combine features efficiently in dynamic texture recognition. Therefore, we resort to decision-level feature combination with an ensemble scheme. Fusion at the decision level has two merits. First, we do not need to consider the normalization problem that arises when different descriptors are concatenated directly. Second, the choice of classifier can be more flexible for different types of descriptors, considering that each feature may favour a specific kind of classifier.
We select the best-performing regression model for each feature, e.g. Support Vector Machine (SVM) or Random Forest (RF). For each feature, a regression model is trained on the training data with labeled synthesizability scores, as in formulation (10). The trained models are then used to predict the synthesizability of a given video. Thus, for a video, three synthesizability scores can be predicted with respect to the three features. These scores are then averaged with weights to form the final synthesizability prediction. The weights are set manually according to the performance of each individual feature. The final output prediction score $\hat{s}$ is a weighted average of the per-feature predictions $f_j(\mathbf{x}_j)$ given the different features $\mathbf{x}_j$ of the unknown test sample $V$:
$$\hat{s} = \sum_{j=1}^{3} \omega_j\, f_j(\mathbf{x}_j), \tag{12}$$
where $\omega_1, \omega_2, \omega_3$ are weighting factors w.r.t. the features C3D, LBP-TOP and SCOP-DT.
The scheme of learning and predicting synthesizability by aggregating features is illustrated in Fig. 6. Firstly, we use regression models SVM and RF for every single feature to predict synthesizability, and compare the performance of two regression models. Secondly, we choose the optimal feature and regression model among them, and set weights manually for feature combination in formula (12). Finally, the predicted synthesizability scores of feature combination on decision level are output.
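The decision-level fusion step amounts to a weighted average of the per-feature predictions. A minimal sketch follows; the function name is hypothetical, the scores and weights are illustrative values rather than the ones used in the paper, and dividing by the weight sum is an added safeguard so that the result stays a weighted average even if the manually set weights do not sum to one.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-feature synthesizability predictions.

    scores  : dict, feature name -> predicted score from that feature's regressor
    weights : dict, feature name -> manually chosen non-negative weight
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same features")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(weights[name] * scores[name] for name in scores) / total


# Illustrative values only.
per_feature = {"C3D": 0.6, "LBP-TOP": 0.4, "SCOP-DT": 0.8}
manual_weights = {"C3D": 0.3, "LBP-TOP": 0.2, "SCOP-DT": 0.5}
fused = fuse_scores(per_feature, manual_weights)
```

Because each regressor outputs a score on the same scale, no feature normalisation is needed before fusion, which is one of the merits of decision-level combination noted above.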
4.2.4 Suggest synthesis methods
We not only predict the synthesizability score of a given DT sample, but also suggest the “best” EDTS method to synthesize it. The optimal synthesis method of each DT in the training set is recorded as a method label along with the synthesizability score, so the “best” EDTS method can be recommended by an additional classifier. We combine C3D and SCOP-DT features in ensemble SVMs for this classification. Algorithm 1 presents the implementation pipeline for predicting synthesizability and suggesting the optimal EDTS method in our approach.
4.3 Detecting synthesizable regions
In the natural world, dynamic textures normally appear against cluttered backgrounds in complex scenes. Correspondingly, DTs captured in unconstrained conditions occupy only parts of a video, with uncertain and irregular shapes, so the whole video is not an appropriate input exemplar for DT synthesis. It is therefore beneficial to tailor videos into regions with good synthesizability by discarding undesirable background. To this end, we first use the dynamic texture detection method to detect the rough, irregular DT regions in the video. The detected region is then trimmed into a regular shape with good synthesizability: 100 rectangular subregions are randomly sampled within the detected region, and their synthesizability scores are computed and compared. The most synthesizable subregion is then suggested, as shown in Fig. 11. The subregions are spatially homogeneous dynamic textures (SHDTs) and can be synthesized in spatial extent.
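The subregion search can be illustrated with a short sketch. It assumes the detected DT region is available as a boolean mask and uses rejection sampling to keep only rectangles that lie fully inside it; the names `rect_inside`, `sample_rectangles`, `most_synthesizable` and the stand-in scorer `toy_score` are hypothetical. A real system would crop the video to each rectangle and score it with the trained synthesizability predictor.

```python
import random


def rect_inside(mask, x, y, w, h):
    """True if every pixel of the rectangle lies in the detected region."""
    return all(mask[r][c] for r in range(y, y + h) for c in range(x, x + w))


def sample_rectangles(mask, n_samples=100, min_size=2, max_tries=100000, rng=None):
    """Rejection-sample rectangles (x, y, w, h) that lie fully inside a boolean mask.

    May return fewer than n_samples rectangles if max_tries is exhausted.
    """
    rng = rng or random.Random()
    rows, cols = len(mask), len(mask[0])
    rects = []
    for _ in range(max_tries):
        if len(rects) == n_samples:
            break
        w = rng.randint(min_size, cols)
        h = rng.randint(min_size, rows)
        x = rng.randint(0, cols - w)
        y = rng.randint(0, rows - h)
        if rect_inside(mask, x, y, w, h):
            rects.append((x, y, w, h))
    return rects


def most_synthesizable(rects, score_fn):
    """Return the rectangle with the highest predicted synthesizability score."""
    return max(rects, key=score_fn)


def toy_score(rect):
    """Stand-in scorer (assumption): rectangle area instead of a learned predictor."""
    _, _, w, h = rect
    return w * h


# Toy irregular region on a 12 x 16 grid (a block with one corner cut away).
mask = [[2 <= r < 10 and 3 <= c < 14 and not (r < 5 and c > 10) for c in range(16)]
        for r in range(12)]
candidates = sample_rectangles(mask, n_samples=100, rng=random.Random(0))
best = most_synthesizable(candidates, toy_score)
```

Rejection sampling is the simplest choice for an irregular region; for large masks an integral image would make the containment test cheaper.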
5 Data Collection and Annotation
Since we formulate the prediction of synthesizability as a regression problem, it is necessary to compile a collection of data annotated in terms of synthesizability. There are several established benchmark datasets in the dynamic texture community, but they have mainly targeted classification. To fit our problem, we compiled a dynamic texture dataset and manually annotated it with synthesizability scores. Most of the dynamic texture examples came from existing datasets, e.g. UCLA, DynTex and the Spacetime Texture Dataset. To enlarge the dataset, we selected samples belonging to the dynamic texture category from two dynamic scene datasets: the Maryland “In-The-Wild” dataset and the YUPENN Dynamic Scenes dataset. In contrast to dynamic textures, dynamic scenes are composed of moving scene elements with a certain spatial layout, where several interrelated regions of different dynamic patterns appear in complex settings (e.g. burning fire with billowing smoke in a forest, or a downtown street scene composed of pedestrians, vehicular traffic and flashing lights). Dynamic textures can thus be viewed as a particular kind of dynamic scene in simpler settings, typically with the field of view restricted to a single uniform dynamic region, which is why we also picked out dynamic texture samples from the dynamic scene datasets. We ended up with a dataset of 1729 DT samples, of which 452 are SHDTs and 1277 are TDTs.
Since no EDTS method performs well on all kinds of dynamic textures, several representative state-of-the-art methods were tested, in order to cover different aspects of EDTS and to provide a comparison across their synthesis results. The selected algorithms are: the classical LDS model, two compact representation models (SN-textons and AR-textons), the copy-based method Graphcut Textures, and two CNN-based methods, STGConvNet and Yang et al.’s work; the latter, called Gatys-DT here, follows Gatys et al.’s static texture synthesis method. The six methods differ in their suitability for the two types of dynamic textures, SHDTs and TDTs. All six methods were used for SHDTs, whereas for TDTs only three methods (the LDS model, Graphcut Textures and STGConvNet) were included, because the other three methods synthesize in the joint spatiotemporal domain and cannot be restricted to temporal synthesis.
We synthesized each dynamic texture example with the algorithms applicable to it and compared the synthesis results against each other. We then manually annotated the synthesizability score of each sample as the “goodness” level of its best synthesized result: given a sample, the annotator viewed all the synthesis results and scored the best one. Following the work of , the goodness of the synthesizability was divided into 3 levels: good, acceptable and bad, with quantitative scores of 1, 0.5 and 0 respectively. The best synthesis method of each DT example was also recorded for “good” and “acceptable” DTs; “bad” ones were assigned “NULL”. The synthesis results can only be evaluated qualitatively, by observing their perceptual quality. An ideal synthesized dynamic texture should look perceptually similar to the input example to a human observer, should have no visible artifacts such as seams, blocks or corrupted elements, and should show no discontinuities or missing frames during playback. Since the synthesized video should be as natural as possible while remaining perceptually equivalent to the input, verbatim copying of salient repeated parts is undesirable unless it also occurs in the original. The final synthesizability score annotated for a DT example is thus the “goodness” of the synthesized result that an expert annotator considered best. See Fig. 7 for examples of such annotation. TDTs can only be synthesized along time, whereas SHDTs can be synthesized both in space and in time. Therefore, we annotated TDTs with a temporal synthesizability score, and SHDTs with both a spatial and a temporal synthesizability score. For the temporal synthesizability of the 1277 TDTs, 25.69% of the samples were labeled bad, 24.67% acceptable and 49.65% good.
For the 452 SHDTs, spatial and temporal synthesizability scores were both annotated: spatially 29.42% bad, 36.95% acceptable and 33.63% good; temporally 4.65% bad, 32.30% acceptable and 63.05% good.
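The labeling rule can be written down compactly. The sketch below is illustrative only: it assumes a goodness level is available for every synthesis result of a sample, whereas in the annotation described above the expert picked and scored the best result directly. The names `LEVEL_TO_SCORE` and `annotate` are hypothetical, and ties between methods are resolved in favour of the first one listed.

```python
LEVEL_TO_SCORE = {"good": 1.0, "acceptable": 0.5, "bad": 0.0}


def annotate(results):
    """Turn per-method goodness levels for one DT sample into (score, method label).

    results : dict mapping EDTS method name -> "good" / "acceptable" / "bad"
    """
    best_method = max(results, key=lambda m: LEVEL_TO_SCORE[results[m]])
    score = LEVEL_TO_SCORE[results[best_method]]
    # Samples whose best result is still "bad" get no recommended method.
    label = best_method if score > 0 else "NULL"
    return score, label


# Hypothetical TDT sample judged on the three temporal synthesis methods.
tdt_label = annotate({"LDS": "bad", "Graphcut Textures": "acceptable", "STGConvNet": "good"})
# Hypothetical sample that no method synthesizes well.
bad_label = annotate({"LDS": "bad", "Graphcut Textures": "bad"})
```

The score feeds the regression task, while the method label feeds the separate classifier that suggests an EDTS method.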
6 Results and Validation
In this section, we evaluate all sub-tasks of the hierarchical procedure in Figure 4. We run through the experiments step by step to evaluate each task on the dataset. Both quantitative and qualitative experiments are reported. All the results are available at http://captain.whu.edu.cn/project/DTsynthesizability.html, where the videos can be viewed.
6.1 Quantitative evaluation
In each experiment, part of the dataset videos was used for training and the rest for testing. We report results over 100 random training-testing splits in all quantitative experiments.
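A repeated random-split protocol of this kind can be sketched as below. The helper names `random_splits` and `evaluate_over_splits` are hypothetical, and the default `train_fraction=0.5` is only a placeholder assumption, not a value confirmed for every experiment; the `evaluate` callback stands in for training a model on the training indices and measuring a metric on the test indices.

```python
import random
import statistics


def random_splits(n_items, n_splits=100, train_fraction=0.5, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_train = int(round(n_items * train_fraction))
    for _ in range(n_splits):
        order = list(range(n_items))
        rng.shuffle(order)
        yield order[:n_train], order[n_train:]


def evaluate_over_splits(n_items, evaluate, **kwargs):
    """Mean and standard deviation of evaluate(train, test) over all splits."""
    values = [evaluate(train, test) for train, test in random_splits(n_items, **kwargs)]
    return statistics.mean(values), statistics.pstdev(values)


# Dummy metric (assumption): the share of items that ended up in the test set.
mean_value, spread = evaluate_over_splits(
    10, lambda train, test: len(test) / 10, n_splits=5
)
```

Fixing the seed makes the 100 splits reproducible, so different features and regressors can be compared on identical partitions.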
6.1.1 Retrieve DTs from videos
We retrieve DTs from videos using a Random Forest with the C3D feature. There are 15,249 videos, of which 1729 are DTs. We evaluated the precision at different levels of recall in the retrieval task. The precision-recall curve is plotted in Figure 8. The average precision is reported with half of the videos used for training.
6.1.2 Binary classification for SHDTs and TDTs
There are 452 SHDTs and 1277 TDTs in our synthesizability dataset. We use the SCOP-DT feature to classify DTs into SHDTs and TDTs. To verify the proposed SCOP-DT for DTs, we compare it to the approach in , which randomly selects several frames of a DT, extracts SCOPs from each selected frame separately, and performs dynamic texture recognition in a late-fusion ensemble architecture. We set for SCOP-DT. In accord with this, we randomly choose frames for the approach in . The comparison is shown in Table 1, where the proposed SCOP-DT outperforms the usage of SCOP in  by a large margin. The results confirm that SCOP-DT, which contains implicit temporal information, benefits the description of DTs.
Since earlier work in image recognition has shown that combining multiple descriptors is very useful for improving classification performance [17, 22], SCOP-DT was combined with C3D and LBP-TOP in ensemble SVMs for binary classification of SHDTs and TDTs. We evaluated the three single features and their combination in Table 2. The accuracy of SCOP-DT is superior to that of both LBP-TOP and C3D. As expected, the feature combination further improves the classification rate, which reaches 96.06%.
6.1.3 Prediction of synthesizability
For the prediction of synthesizability, the three single features (C3D, LBP-TOP and SCOP-DT) and their combination were evaluated for SHDTs and TDTs respectively. For quantitative evaluation, we performed two-level retrieval tasks and evaluated the average precision of synthesizability prediction with individual features and with multiple features: (1) retrieve videos with “good” scores (good); (2) retrieve videos with “good” or “acceptable” scores (acceptable).
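The two retrieval tasks can be made concrete with a small sketch. It assumes the standard non-interpolated definition of average precision (the text does not spell out which variant is used), and the function names `average_precision` and `two_level_ap` as well as the example values are hypothetical.

```python
def average_precision(predicted_scores, relevant):
    """Average precision of the ranking induced by predicted_scores.

    predicted_scores : list of predicted synthesizability scores
    relevant         : list of booleans, True if the sample counts as a hit
    """
    ranked = sorted(zip(predicted_scores, relevant), key=lambda p: p[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def two_level_ap(predicted_scores, annotated_scores):
    """AP for retrieving 'good' (score 1) and 'good or acceptable' (score >= 0.5)."""
    good = [s >= 1.0 for s in annotated_scores]
    acceptable = [s >= 0.5 for s in annotated_scores]
    return (average_precision(predicted_scores, good),
            average_precision(predicted_scores, acceptable))


# Illustrative values only (not taken from the dataset).
predicted = [0.9, 0.7, 0.4, 0.2]
annotated = [1.0, 0.5, 1.0, 0.0]
ap_good, ap_acceptable = two_level_ap(predicted, annotated)
```

In this toy ranking the "acceptable" task is easier than the "good" task because more samples count as hits, which mirrors the gap between the two columns reported in the tables.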
Results on SHDTs:
In predicting the spatial synthesizability of SHDTs, we also compared LBP-TOP, C3D and SCOP-DT to the features used by Dai et al. for static texture synthesizability. To apply the static features to DTs, we randomly select 16 frames in each DT, extract the features of Dai et al. for each frame, and average the features over the 16 frames. The retrieval results for SHDTs are shown in Table 3. C3D with random forest and SCOP-DT with SVM perform better than LBP-TOP and Dai et al. Moreover, for spatial synthesizability prediction, SCOP-DT achieves an average precision of 66.63% for good, at least 6% higher than the others. We then combine the two best-performing features, C3D and SCOP-DT, with weighting factors of 0.5 each, which slightly improves the average precision.
For SHDTs, the precision scores (spatial synthesizability: 92.65% for acceptable and 67.43% for good; temporal synthesizability: 98.89% for acceptable and 88.26% for good) indicate that dynamic texture synthesizability is learnable and predictable. Table 3 shows that spatial synthesizability is harder to predict than temporal synthesizability, because DTs usually exhibit strong self-correlation in time, which makes it relatively easy to synthesize new samples along time only. In contrast, the degree of repetition and homogeneity is much lower in space than in time. Moreover, the complex structures in 2D space make spatial synthesis more difficult, whereas dynamics in time are confined to a single dimension and are simpler to some extent.
|Features||LBP-TOP||LBP-TOP||C3D||C3D||SCOP-DT||SCOP-DT||C3D + SCOP-DT||Dai et al.||Dai et al.|
|Regression methods||RF||SVM||RF||SVM||RF||SVM||RF + SVM||RF||SVM|
Results on TDTs:
The total number of TDTs is 1277, and the retrieval experiments are shown in Table 4. Among individual features, LBP-TOP with random forest achieves the best retrieval accuracy, although the performance of the three features is roughly comparable. We therefore combine all three features with equal weights to predict temporal synthesizability, which yields the highest precision scores (96.20% for acceptable and 90.30% for good). As already pointed out, the table also suggests that temporal synthesizability is not too difficult to predict, in line with synthesis being simpler in time than in space. The retrieval experiments confirm that very high precision can be expected when a fraction of well-synthesizable TDT examples needs to be retrieved. This is very useful for choosing synthesizable DTs from internet videos captured in unconstrained circumstances.
|Features||LBP-TOP||LBP-TOP||C3D||C3D||SCOP-DT||SCOP-DT||LBP-TOP + C3D + SCOP-DT||C3D + SCOP-DT|
|Regression methods||RF||SVM||RF||SVM||RF||SVM||RF + RF + SVM||RF + SVM|
6.1.4 Suggesting synthesis methods
For a given DT, we use an additional classifier to suggest the “best” EDTS method to synthesize it. For spatial-temporal synthesis of SHDTs, there are four candidate synthesis methods: SN-textons, AR-textons, Graphcut Textures and Gatys-DT. For temporal synthesis of TDTs, there are three: LDS, Graphcut Textures and STGConvNet. For simplicity, we use the two effective features C3D and SCOP-DT, along with their combination, for classification; the classification results are shown in Table 5.
|For SHDTs (%)||70.80||73.26||75.76|
|For TDTs (%)||80.86||80.88||82.39|
6.2 Qualitative evaluation
On the spatial synthesizability:
Fig. 9 shows SHDT examples together with their predicted spatial synthesizability. For each test example, the synthesizability predictor was trained on all annotated SHDTs except the test example itself. As can be seen, homogeneous, repetitive SHDTs with tiny oscillating dynamics obtain higher scores. Low scores are caused by many factors, such as surface irregularity, uneven illumination and outliers. Fig. 9 also shows the “best” dynamic textures synthesized by the EDTS methods. If none of the EDTS methods can synthesize a test example well (“bad” examples were assigned the “NULL” method), the synthesis result of a randomly chosen method is shown. The predicted synthesizability score is broadly consistent with the quality of the synthesized DTs. This is crucial because it allows us to select DT regions, including parts of videos, that can be synthesized well.
On the temporal synthesizability:
For temporal synthesizability, Fig. 10 shows some TDT examples together with their predicted temporal synthesizability. As shown, TDTs with repetitive and slight movement, especially turbulent dynamics of tiny structures, obtain higher scores, in agreement with the synthesis results. Low-quality temporal synthesis is due to dominant motion of large-structure patterns, temporal discontinuity and outliers. As can be seen in the 3rd example in Fig. 10, artifacts appear in the synthesis of scattered moving cars, where mosaic mismatches are noticeable, especially for the large truck.
6.3 Detecting synthesizable regions
To detect the most synthesizable regions, we predict the spatial synthesizability of the original videos and of the segmented DT subregions respectively, as shown in Fig. 11. The synthesis results are also illustrated; they show that synthesis is much better for the tailored parts than for the entire videos. It is thus possible to trim unconstrained videos into synthesizable DT examples, and the proposed method works well for this trimming task.
This paper investigated the synthesizability of dynamic texture samples via a learning scheme. To accommodate more general settings, we proposed a hierarchical architecture that first automatically identifies dynamic textures in a diverse set of unconstrained videos, then partitions the DTs into SHDTs and TDTs according to their spatial modes, and finally predicts the synthesizability scores of DTs to help find good DT examples for synthesis. To this end, we constructed a fairly large dynamic texture dataset and annotated it with respect to synthesizability. We solved the learning problem with regression models using the proposed SCOP-DT descriptor and other spatiotemporal features. Experimental results show that the proposed hierarchical learning scheme is effective for dynamic texture sample selection, partition and synthesizability prediction. It helps to pick out dynamic textures, to find good DT examples for synthesis, to crop synthesizable dynamic texture regions from unconstrained videos, and to suggest an appropriate EDTS method for synthesis. By making it possible to provide suitable DT examples, the proposed method of learning synthesizability also facilitates the development of future EDTS methods. For further study, it will be interesting to investigate the relationship between dynamic texture synthesizability and other measures such as video quality or motion patterns.
The authors thank Dr. Gang Liu for his insightful discussion and Prof. Gianfranco Doretto for providing the UCLA datasets.
[1] A. Abdullah, R. C. Veltkamp, and M. A. Wiering. Spatial pyramids and two-layer stacking SVM classifiers for image categorization: A comparative study. In IJCNN, pages 5–12, 2009.
[2] Z. Bar-Joseph, R. El-Yaniv, D. Lischinski, and M. Werman. Texture mixing and texture movie synthesis using statistical learning. IEEE TVCG, 7(2):120–135, 2001.
[3] A. B. Chan and N. Vasconcelos. Classifying video with kernel dynamic textures. In CVPR, pages 1–6, 2007.
[4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
[5] R. Costantini, L. Sbaiz, and S. Süsstrunk. Higher order SVD analysis for dynamic texture synthesis. IEEE TIP, 17(1):42–52, 2008.
[6] D. Dai, H. Riemenschneider, and L. Van Gool. The synthesizability of texture examples. In CVPR, pages 3027–3034, 2014.
[7] K. G. Derpanis, M. Lecce, K. Daniilidis, and R. P. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, pages 1306–1313, 2012.
[8] K. G. Derpanis and R. P. Wildes. Spacetime texture representation and recognition based on a spatiotemporal orientation analysis. IEEE TPAMI, 34(6):1193–1205, 2012.
[9] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. What makes Paris look like Paris? In SIGGRAPH, pages 101:1–101:9, 2012.
[10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014.
[11] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto. Dynamic textures. IJCV, 51(2):91–109, 2003.
[12] G. Doretto, E. Jones, and S. Soatto. Spatially homogeneous dynamic textures. In ECCV, pages 591–602, 2004.
[13] S. Fazekas, T. Amiaz, D. Chetverikov, and N. Kiryati. Dynamic texture detection based on motion analysis. IJCV, 82(1):48, 2009.
[14] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, pages 262–270, 2015.
[15] B. Ghanem and N. Ahuja. Maximum margin distance learning for dynamic texture recognition. In ECCV, pages 223–236, 2010.
[16] B. Ghanem and N. Ahuja. Sparse coding of linear dynamical systems with an application to dynamic texture recognition. In ICPR, pages 987–990, 2010.
[17] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, volume 2, pages 1458–1465, 2005.
[18] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In ICCV, pages 1633–1640, 2013.
[19] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In CVPR, pages 145–152, 2011.
[20] H. Ji, X. Yang, H. Ling, and Y. Xu. Wavelet domain multifractal analysis for static and dynamic texture classification. IEEE TIP, 22(1):286–299, 2013.
[21] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. ACM Transactions on Graphics (ToG), 22(3):277–286, 2003.
[22] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169–2178, 2006.
[23] G. Liu, G.-S. Xia, W. Yang, and L. Zhang. Texture analysis with shape co-occurrence patterns. In ICPR, pages 1627–1632, Aug 2014.
[24] Y. Luo and X. Tang. Photo and video quality evaluation: Focusing on the subject. In ECCV, pages 386–399, 2008.
[25] P. Monasse and F. Guichard. Fast computation of a contrast-invariant image representation. IEEE TIP, 9(5):860–872, 2000.
[26] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE TPAMI, 24(7):971–987, 2002.
[27] R. Péteri, S. Fazekas, and M. J. Huiskes. DynTex: A comprehensive database of dynamic textures. Pattern Recognition Letters, 31(12):1627–1632, 2010.
[28] R. Polana and R. Nelson. Temporal texture and activity recognition. In Motion-based recognition, pages 87–124. Springer, 1997.
[29] A. Ravichandran, R. Chaudhry, and R. Vidal. Categorizing dynamic textures using a bag of dynamical systems. IEEE TPAMI, 35(2):342–353, 2013.
[30] P. Saisan, G. Doretto, Y. N. Wu, and S. Soatto. Dynamic texture recognition. In CVPR, volume 2, pages II–58, 2001.
[31] A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa. Video textures. In SIGGRAPH, pages 489–498, 2000.
[32] N. Shroff, P. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, pages 1911–1918, 2010.
[33] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
[34] M. Szummer and R. W. Picard. Temporal texture modeling. In ICIP, volume 3, pages 823–826, 1996.
[35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.
[36] L.-Y. Wei, S. Lefebvre, V. Kwatra, and G. Turk. State of the art in example-based texture synthesis. In Eurographics, pages 93–117, 2009.
[37] L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In SIGGRAPH, pages 479–488, 2000.
[38] F. Woolfe and A. Fitzgibbon. Shift-invariant dynamic texture recognition. In ECCV, pages 549–562, 2006.
[39] G.-S. Xia, J. Delon, and Y. Gousseau. Shape-based invariant texture indexing. IJCV, 88(3):382–403, 2010.
[40] G.-S. Xia, S. Ferradans, G. Peyré, and J.-F. Aujol. Compact representations of stationary dynamic textures. In ICIP, pages 2993–2996, 2012.
[41] G.-S. Xia, S. Ferradans, G. Peyré, and J.-F. Aujol. Synthesizing and mixing stationary Gaussian texture models. SIAM J. Imaging Sciences, 7(1):476–508, 2014.
[42] G.-S. Xia, G. Liu, X. Bai, and L. Zhang. Texture characterization using shape co-occurrence patterns. IEEE TIP, 26(10):5005–5018, Oct 2017.
[43] J. Xie, S.-C. Zhu, and Y. N. Wu. Synthesizing dynamic patterns by spatial-temporal generative ConvNet. In CVPR, 2017.
[44] F. Yang, G.-S. Xia, G. Liu, L. Zhang, and X. Huang. Dynamic texture recognition by aggregating spatial and temporal features via ensemble SVMs. Neurocomputing, 173:1310–1321, 2016.
[45] F. Yang, G.-S. Xia, L. Zhang, and X. Huang. Stationary dynamic texture synthesis using convolutional neural networks. In ICSP, pages 1135–1139, 2016.
[46] L. Yuan, F. Wen, C. Liu, and H.-Y. Shum. Synthesizing dynamic texture with closed-loop linear dynamic system. In ECCV, pages 603–616, 2004.
[47] G. Zhao and M. Pietikainen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE TPAMI, 29(6):915–928, 2007.