1 Introduction
Dynamic patterns are spatiotemporal processes that exhibit complex spatial and motion patterns, such as dynamic textures (e.g., falling water, burning fire), as well as human facial expressions and movements. A fundamental challenge in understanding dynamic patterns is learning disentangled representations that separate the underlying factorial components of the observations without supervision [3, 17]. For example, given a video dataset of human facial expressions, a disentangled representation can include the face's appearance attributes (such as color, identity, and gender), the trackable motion attributes (such as movements of the eyes, lips, and nose), and the intrackable motion attributes (such as illumination changes). A disentangled representation of dynamic patterns is useful for manipulable video generation and for computing video statistics. The goal of this paper is not only to provide a representational model for video generation, but more importantly, for video understanding, by disentangling appearance, trackable motion, and intrackable motion in an unsupervised manner.
Studying video complexity is key to understanding motion perception, and is also useful for designing metrics that characterize video statistics. Researchers in the field of psychophysics, e.g., [20], have studied the human perception of motion uncertainty, and found that human vision fails to track objects when the number of moving objects increases or their motions are too random. In the field of computer vision, [15] proposes the intrackability concept in the context of surveillance tracking. [7] defines intrackability quantitatively as the entropy of the posterior probability over velocities, which measures the uncertainty of tracking an image patch. In this paper, we are also interested in providing a new method to define and measure the intrackability of videos, by disentangling the trackable and intrackable components of the videos in the context of the proposed model.
A widely used representational model for dynamic patterns is the state space model, where the hidden state evolves through time according to a transition model, and the state generates the image frames according to an emission model. The original dynamic texture model of [5] is such a model, in which the hidden state is a low-dimensional vector and both the transition model and the emission model are linear. The model can be generalized to nonlinear versions in which the nonlinear mappings of the transition and emission models are parametrized by neural nets [30].
Both the underlying physical processes and the perception of dynamic patterns are largely about motion, i.e., the movements of pixels or constituent elements, so it is desirable to have a model that is based explicitly on motion. In this paper, we propose such a motion-based model for dynamic patterns. Specifically, in the emission model, we let the hidden state generate a displacement field, which warps the trackable component of the previous image frame to generate the next frame, while adding a simultaneously emitted residual image to account for the change that cannot be explained by the deformation. Thus, each image frame is decomposed into a trackable component, obtained by warping the previous frame, and an intrackable component in the form of the simultaneously generated residual image.
We use the maximum likelihood method to learn the model parameters. The learning algorithm iterates between (1) inferring the latent noise vectors that drive the transition model, and (2) updating the parameters given the inferred latent vectors. Meanwhile, we adopt a regularization term that penalizes the norms of the residual images, to encourage the model to explain the change between image frames by motion. Unlike existing methods for dynamic patterns, we learn our model in an unsupervised setting without ground truth displacement fields or optical flows. Moreover, with the disentangled representation of a video, we can define a notion of intrackability, which compares the trackable and intrackable components of the image frames to measure video complexity.
Experiments show that our method can learn realistic dynamic pattern models, the learned motion can be transferred to testing images with unseen appearances, and intrackability can be quantitatively measured under the proposed representation.
Contribution. Our contributions are summarized below: (1) We propose a novel representational model of dynamic patterns that disentangles appearance, trackable motion, and intrackable motion. (2) The model can be learned in a purely unsupervised setting, in that the associated maximum likelihood learning algorithm works without ground truth or pre-inferred displacement fields or optical flows. (3) The learning algorithm does not rely on an extra assisting network as in VAEs [14, 21, 18] and GANs [8]. (4) The experiments show that appearance and motion can be well separated, and the motion can be effectively transferred to a new, unseen appearance. (5) With such a representational model, a measure of intrackability can be defined to characterize video statistics, i.e., video complexity, in the context of the model.
2 Related work
Learning generative models for dynamic textures has been extensively studied in the literature [5, 27, 28]. For instance, the original model for dynamic textures in [5] is a vector autoregressive model coupled with frame-wise dimension reduction by singular value decomposition; it is linear in both the transition model and the emission model.
By generalizing the energy-based generative ConvNet model in [32], [33] develops an energy-based model where the energy function is parametrized by a bottom-up spatial-temporal ConvNet with multiple layers of nonlinear spatial-temporal filters that capture complex spatial-temporal patterns in dynamic textures. The model is learned from scratch by maximizing the log-likelihood of the observed data.
[9] represents dynamic textures by a top-down spatial-temporal generator model that consists of multiple layers of spatial-temporal kernels. The model is trained via the alternating back-propagation algorithm. [31] proposes a cooperative learning scheme to jointly train the models of [33] and [9] for dynamic texture synthesis. Recently, [30] proposed a dynamic generator model that consists of a nonlinear transition model and a nonlinear emission model. Unlike the two models in [33] and [9], the model in [30] unfolds over time and is a causal model. Our work is based on [30] and extends it. Compared to [30], our model represents dynamic patterns with an unsupervised disentanglement of appearance (pixels), trackable motion (pixel displacements), and intrackable motion (residuals). Therefore our model can animate a static image by directly applying the motion extracted from another video to the static image, even when the two appearances differ; none of the models mentioned above can do this. Additionally, the intrackable motion provides a new perspective for defining and measuring the intrackability of videos, which makes our model significantly distinct from, and go beyond, [30].

Recently, multiple video generation frameworks based on GANs [8] have been proposed, for example VGAN [25], TGAN [22], and MoCoGAN [24].
All of the above methods need to recruit a discriminator with an appropriate convolutional architecture to evaluate whether the generated videos come from the training data or the video generator. Our work is not within the domain of adversarial learning: unlike GAN-based methods, our model is learned by maximum likelihood without recruiting a discriminator network.
3 Model and learning
3.1 Motion-based generative model
Let $\mathbf{I} = (\mathbf{I}_t, t = 1, \ldots, T)$ be the observed video sequence of a dynamic pattern, where $\mathbf{I}_t$ is a frame at time $t$, defined on the 2D rectangular lattice $D$. The motion-based model for dynamic patterns consists of the following components:

(1) $s_t = F_\alpha(s_{t-1}, \xi_t)$

(2) $M_t = M_\beta(s_t^{(m)})$

(3) $R_t = R_\gamma(s_t^{(r)})$

(4) $P_t = W(P_{t-1}, M_t)$

(5) $\mathbf{I}_t = P_t + R_t + \epsilon_t$

where $t = 1, \ldots, T$. We single out $P_1$ and discuss it in Equation (6) below.

Equation (1) is the transition model, where $s_t$ is the state vector at time $t$, $\xi_t \sim \mathrm{N}(0, I)$ is a hidden Gaussian white noise vector, and $I$ is the identity matrix. The $\xi_t$ are independent over $t$. $F_\alpha$ defines the transition from $s_{t-1}$ to $s_t$. The state vector consists of two sub-vectors, $s_t = (s_t^{(m)}, s_t^{(r)})$. One is for motion; the other is for the residual. While $s_t^{(m)}$ generates the motion of the trackable part of the image frame $\mathbf{I}_t$, $s_t^{(r)}$ generates the non-trackable part of $\mathbf{I}_t$.

Specifically, in Equation (2), $s_t^{(m)}$ generates the field of pixel displacements $M_t = (M_t(x), x \in D)$, which consists of the displacement $M_t(x)$ of each pixel $x$ in the image domain $D$. $M_t$ is a 2D image, because each displacement is 2D. $M_\beta$ defines the mapping from $s_t^{(m)}$ to $M_t$. In Equation (4), $M_t$ is used to warp the trackable part $P_{t-1}$ of the previous image frame by a warping function $W$, which is given by bilinear interpolation. There is no unknown parameter in $W$. In Equation (3), $s_t^{(r)}$ generates the residual image $R_t$. $R_\gamma$ defines the mapping from $s_t^{(r)}$ to $R_t$. In Equation (5), the image frame $\mathbf{I}_t$ is the sum of the warped image $P_t$ (note that the warped image $P_t$ is different from the observed image frame $\mathbf{I}_t$) and the residual image $R_t$, plus a Gaussian white noise error $\epsilon_t \sim \mathrm{N}(0, \sigma^2 I)$. We assume the variance $\sigma^2$ is given. In Equation (6), the initial trackable frame $P_1$ is generated by a generator $A_\rho$ from an appearance hidden vector $a$ that follows a Gaussian distribution. That is, to initialize the first frame $P_1$, we use the following method:

(6) $P_1 = A_\rho(a), \quad a \sim \mathrm{N}(0, I)$
Please see Figure 1 for an illustration of the proposed model.
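To make the generative process above concrete, the following is a minimal sketch of ancestral sampling from Equations (1)-(6). Toy linear maps stand in for the learned networks $F_\alpha$, $M_\beta$, $R_\gamma$, $A_\rho$; all dimensions and weights here are hypothetical illustrations, not the paper's architecture. The bilinear warping function $W$, which has no learnable parameters, is implemented directly.

```python
# Sketch of ancestral sampling from the motion-based model, Eqs. (1)-(6).
# Toy linear stand-ins for the learned networks; sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
H, W_, C = 16, 16, 3        # toy frame size
d_m, d_r = 5, 3             # motion / residual state dimensions
d = d_m + d_r

A_trans = 0.9 * np.eye(d)                            # toy transition weights
B_noise = 0.1 * rng.standard_normal((d, d))          # toy noise weights
W_motion = 0.5 * rng.standard_normal((H * W_ * 2, d_m))
W_resid = 0.05 * rng.standard_normal((H * W_ * 3, d_r))
W_app = 0.1 * rng.standard_normal((H * W_ * 3, d))   # toy appearance generator

def warp(frame, flow):
    """Warp an (H, W, C) frame by an (H, W, 2) displacement field
    using bilinear interpolation (the parameter-free map W)."""
    h, w, _ = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    return ((1 - wy) * ((1 - wx) * frame[y0, x0] + wx * frame[y0, x1])
            + wy * ((1 - wx) * frame[y1, x0] + wx * frame[y1, x1]))

def sample_video(T=5, sigma=0.01):
    a = rng.standard_normal(d)                       # appearance latent, Eq. (6)
    P = 0.5 + (W_app @ a).reshape(H, W_, C)          # P_1 = A(a) (toy)
    s = np.zeros(d)
    frames = []
    for _ in range(T - 1):
        xi = rng.standard_normal(d)                  # noise xi_t
        s = np.tanh(A_trans @ s + B_noise @ xi)      # transition, Eq. (1)
        M = (W_motion @ s[:d_m]).reshape(H, W_, 2)   # displacement field, Eq. (2)
        R = (W_resid @ s[d_m:]).reshape(H, W_, C)    # residual image, Eq. (3)
        P = warp(P, M)                               # trackable part, Eq. (4)
        frames.append(P + R + sigma * rng.standard_normal((H, W_, C)))  # Eq. (5)
    return np.stack(frames)

video = sample_video()
```

Note that each frame is produced by warping the previous trackable part and then adding the residual, so the trackable and intrackable contributions remain separately accessible at every time step.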
Multiple sequences. Our model can be easily generalized to handle multiple sequences. We only need to introduce a sequence-specific vector $c$, sampled from a Gaussian white noise prior distribution. For each video sequence, this vector is fixed, and it can be concatenated to the state vector $s_t$ in both the transition model and the emission model. We may also let $c = (c^{(m)}, c^{(r)})$, so that $c^{(m)}$ is concatenated to $s_t^{(m)}$ to generate $M_t$, and $c^{(r)}$ is concatenated to $s_t^{(r)}$ to generate $R_t$. This enables us to disentangle the motion pattern and the appearance pattern in the video sequence.
Intrackability. For the image $\mathbf{I}_t$, we define $P_t$ to be the trackable part, because it is obtained by the movements of pixels, and we define $R_t$ to be the non-trackable part. The intrackability of the sequence can be defined as the ratio between the average norm of the non-trackable parts $R_t$ and the average norm of the images $\mathbf{I}_t$, where the averages are over the time frames.
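This intrackability score can be sketched in a few lines; the function name and array layout below are our own illustration:

```python
# Sketch of the intrackability score: ratio of average residual norm
# to average frame norm (names and array layout are illustrative).
import numpy as np

def intrackability(frames, residuals):
    """frames, residuals: arrays of shape (T, H, W, C)."""
    num = np.mean([np.linalg.norm(r) for r in residuals])
    den = np.mean([np.linalg.norm(f) for f in frames])
    return num / den

# A video explained almost entirely by warping scores near 0;
# one explained entirely by residuals scores near 1.
frames = np.ones((4, 8, 8, 3))
score = intrackability(frames, 0.1 * frames)  # -> approximately 0.1
```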
Summarized form. Let $\xi = (\xi_t, t = 1, \ldots, T)$; $\xi$ consists of the hidden random vectors that need to be inferred from $\mathbf{I}$. We can also include the latent variables $a$ and $c$ in $\xi$ for notational simplicity. Although each pair $(M_t, R_t)$ is generated by the state vector $s_t$, the states $(s_t, t = 1, \ldots, T)$ are in turn generated by $\xi$. In fact, we can write $\mathbf{I} = G_\theta(\xi) + \epsilon$, where $G_\theta$ composes $F_\alpha$, $M_\beta$, $R_\gamma$, $A_\rho$, and $W$ over time $t$, with $\theta = (\alpha, \beta, \gamma, \rho)$ collecting all the model parameters, and $\epsilon$ denotes the observation errors.
3.2 Maximum likelihood learning algorithm
The model is a generator model with $\xi$ being the hidden vector. In the recent literature, such a model is commonly learned by VAEs [14, 21, 18] and GANs [8]. However, unlike a regular generator model, $\xi$ is a sequence of hidden vectors, and we would need to design a highly sophisticated inference network or discriminator network to implement a VAE or GAN, which is not an easy task. In this paper, we choose to learn the model by a maximum likelihood algorithm that is simple and efficient, without the need to recruit an extra inference or discriminator network.

Our maximum likelihood learning method is adapted from the recent work [30]. Specifically, let $p(\xi)$ be the Gaussian white noise prior distribution of $\xi$. Let $p_\theta(\mathbf{I} \mid \xi)$ be the conditional distribution of the video sequence $\mathbf{I}$ given $\xi$. The marginal distribution of $\mathbf{I}$ is $p_\theta(\mathbf{I}) = \int p(\xi)\, p_\theta(\mathbf{I} \mid \xi)\, d\xi$, with the latent variable $\xi$ integrated out. The log-likelihood is $\log p_\theta(\mathbf{I})$, which is analytically intractable due to the integral over $\xi$. The gradient of the log-likelihood can be computed using the following identity:
(7) $\frac{\partial}{\partial \theta} \log p_\theta(\mathbf{I}) = \mathbb{E}_{p_\theta(\xi \mid \mathbf{I})}\left[\frac{\partial}{\partial \theta} \log p_\theta(\mathbf{I}, \xi)\right]$

where $p_\theta(\xi \mid \mathbf{I})$ is the posterior distribution of the latent $\xi$ given the observed $\mathbf{I}$. The expectation with respect to $p_\theta(\xi \mid \mathbf{I})$ can be approximated by Monte Carlo sampling. The sampling of $\xi$ can be accomplished by the Langevin dynamics:

(8) $\xi_{\tau+1} = \xi_\tau + \frac{\delta^2}{2} \frac{\partial}{\partial \xi} \log p_\theta(\xi_\tau \mid \mathbf{I}) + \delta e_\tau$

where $\tau$ indexes the time step of the Langevin dynamics. Here we use the notation $\tau$ because we have used $t$ to index the time of the video sequence. $e_\tau \sim \mathrm{N}(0, I)$ is a Gaussian white noise vector, and $\delta$ is the step size of the Langevin dynamics. After sampling $\xi \sim p_\theta(\xi \mid \mathbf{I})$ using the Langevin dynamics, we can update $\theta$ by stochastic gradient ascent

(9) $\theta_{k+1} = \theta_k + \eta_k \frac{\partial}{\partial \theta} \log p_{\theta_k}(\mathbf{I}, \xi)$

where $k$ indexes the learning iteration, $\eta_k$ is the learning rate, and we use the sampled $\xi$ to approximate the expectation in (7).
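For completeness, identity (7) follows from differentiating under the integral sign and using $\frac{\partial}{\partial \theta} p = p\, \frac{\partial}{\partial \theta} \log p$:

$$
\frac{\partial}{\partial \theta} \log p_\theta(\mathbf{I})
= \frac{1}{p_\theta(\mathbf{I})}\,\frac{\partial}{\partial \theta}\int p_\theta(\mathbf{I}, \xi)\, d\xi
= \int \frac{p_\theta(\mathbf{I}, \xi)}{p_\theta(\mathbf{I})}\,\frac{\partial}{\partial \theta} \log p_\theta(\mathbf{I}, \xi)\, d\xi
= \mathbb{E}_{p_\theta(\xi \mid \mathbf{I})}\!\left[\frac{\partial}{\partial \theta} \log p_\theta(\mathbf{I}, \xi)\right].
$$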
The learning algorithm iterates the following two steps. (1) Inference step: given the current $\theta$, sample $\xi$ from $p_\theta(\xi \mid \mathbf{I})$ according to (8). (2) Learning step: given $\xi$, update $\theta$ according to (9). We can use a warm start to sample $\xi$ in step (1): when running the Langevin dynamics, we start from the current $\xi$ and run a finite number of steps. Then we update $\theta$ in step (2) using the sampled $\xi$. Such a stochastic gradient ascent algorithm has been analyzed by [34].

Since $\log p_\theta(\mathbf{I}, \xi) = -\frac{1}{2\sigma^2}\|\mathbf{I} - G_\theta(\xi)\|^2 - \frac{1}{2}\|\xi\|^2 + \text{const}$, both steps (1) and (2) are based on computing the derivatives of $\frac{1}{2\sigma^2}\|\mathbf{I} - G_\theta(\xi)\|^2 + \frac{1}{2}\|\xi\|^2$, where the constant term does not depend on $\theta$ or $\xi$. The derivatives with respect to $\theta$ and $\xi$ can be computed efficiently and conveniently by back-propagation through time.
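The two-step iteration can be sketched on a toy problem in which the generator is linear, $G_\theta(\xi) = W\xi$, so the posterior gradient needed by the Langevin dynamics is available in closed form, $\frac{\partial}{\partial \xi} \log p(\xi \mid \mathbf{I}) = W^\top(\mathbf{I} - W\xi)/\sigma^2 - \xi$. All sizes, step sizes, and iteration counts below are illustrative, not the paper's settings.

```python
# Toy inference/learning loop: Langevin sampling of the latent (Eq. 8)
# alternating with gradient ascent on the parameters (Eq. 9), for a
# linear generator I = W xi + eps. Sizes and step sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_obs, sigma = 4, 20, 0.1

# Synthesize one observation from a ground-truth linear generator
W_true = rng.standard_normal((d_obs, d_latent))
I_obs = W_true @ rng.standard_normal(d_latent) + sigma * rng.standard_normal(d_obs)

W = rng.standard_normal((d_obs, d_latent))  # current parameters theta
xi = np.zeros(d_latent)                     # warm-started latent vector

delta, lr = 0.01, 1e-4                      # Langevin step size, learning rate
for _ in range(500):
    # (1) Inference step: one Langevin update of xi, cf. Eq. (8)
    grad_xi = W.T @ (I_obs - W @ xi) / sigma**2 - xi
    xi = xi + 0.5 * delta**2 * grad_xi + delta * rng.standard_normal(d_latent)
    # (2) Learning step: gradient ascent on log p(I, xi) over W, cf. Eq. (9)
    grad_W = np.outer(I_obs - W @ xi, xi) / sigma**2
    W = W + lr * grad_W

# Reconstruction error shrinks as W and xi jointly explain the observation
recon_err = np.linalg.norm(I_obs - W @ xi) / np.linalg.norm(I_obs)
```

The warm start corresponds to persisting $\xi$ across learning iterations: each inference step continues the chain from the previous value rather than restarting from the prior.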
To encourage the model to explain the video sequence by the trackable motion, we add to the log-likelihood a penalty term $-\lambda_1 \sum_t \|R_t\|^2$ on the norms of the residual images. To encourage the smoothness of the inferred displacement fields $M_t$, we also add another penalty term $-\lambda_2 \sum_t \|\nabla M_t\|^2$. We estimate $\theta$ by gradient ascent on the penalized log-likelihood.

In a VAE, we would need to define an inference model to approximate the posterior distribution $p_\theta(\xi \mid \mathbf{I})$. Due to the complex structure of the model, it is not an easy task to design an accurate inference model. While a VAE maximizes a lower bound of the log-likelihood, where the tightness of the lower bound depends on the Kullback-Leibler divergence between the inference model and $p_\theta(\xi \mid \mathbf{I})$, our learning algorithm seeks to maximize the log-likelihood itself.

4 Experiments
Our paper studies learning to disentangle the appearance, trackable motion, and intrackable motion of dynamic patterns in an unsupervised manner by proposing a motion-based dynamic generator. We conduct the following three experiments to test and understand the proposed model. As a generative model for videos, Experiment 1 investigates how well the proposed model can be learned by evaluating its data generation capacity, which is a common way to check whether a learned model captures the target data distribution. Experiment 2 investigates whether the proposed model can successfully decompose appearance and motion via a motion transfer task. Experiment 3 studies the disentanglement of trackable and intrackable motions, and uses the intrackable component to define the concept of "intrackability", which is an application of our model.
4.1 Implementation details
Our model was implemented in Python with TensorFlow [1]. Each training video clip is prepared at a fixed spatial size and number of frames. The configuration of our model architecture is as follows.

Transition model. The transition model $F_\alpha$ is a three-layer feed-forward neural network that takes the 80-dimensional state vector $s_{t-1}$ and a 100-dimensional noise vector $\xi_t$ as input and outputs an 80-dimensional vector, which is added to $s_{t-1}$ and passed through a tanh activation to give $s_t$; this is a residual form [11] for computing $s_t$ given $s_{t-1}$. The output of each of the first two layers is followed by a ReLU operation. The tanh activation function is crucial to prevent $s_t$ from growing ever larger during the recurrent computation, by constraining it within the range $[-1, 1]$. Each state vector consists of two parts, $s_t = (s_t^{(m)}, s_t^{(r)})$, where $s_t^{(m)}$ is a 50-dimensional motion state vector and $s_t^{(r)}$ is a 30-dimensional residual state vector.

Emission model. The emission model for motion is a top-down deconvolutional neural network (generator model) that maps $s_t^{(m)}$ to the displacement field (optical flow) $M_t$ by 6 layers of deconvolutions with kernel size 4 and up-sampling from top to bottom. Batch normalization [12] and ReLU layers are added between the deconvolution layers, and a tanh activation function is used at the bottom layer so that the output signals fall within $[-1, 1]$. The emission model for residuals is also a generator model that maps $s_t^{(r)}$ to the residual image frame $R_t$; it has the same structure as the emission model for motion, except that the last layer has 3 output channels rather than 2. The generator of the trackable appearance, which maps a 10-dimensional noise vector to the first image frame $P_1$, follows the same structure as the residual generator.

Optimization and inference. Adam [13] is used for optimization with a learning rate of 0.001. The same Langevin step size is used for all latent variables, and the standard deviation $\sigma$ of the residual error is treated as given. During each learning iteration, we run a finite number of Langevin steps to infer the latent noise vectors. Unless otherwise stated, the penalty weights $\lambda_1$ for residuals and $\lambda_2$ for smoothness of the displacement field are fixed across experiments.

4.2 Experiment 1: Dynamic pattern synthesis
We first evaluate the representational power of the proposed model by applying it to dynamic pattern synthesis. A good generative model for video should be able to generate samples that are perceptually indistinguishable from the real training videos in terms of appearance and dynamics. We learn our models from a wide range of dynamic textures (e.g., flowing water, fire, etc.), selected from the DynTex++ dataset of [6] and the Internet. We learn a single model from each training example and generate multiple synthesized examples by simply drawing independent and identically distributed samples from the Gaussian distributions of the latent factors. Note that our model learns from raw video data only, without relying on other information such as ground truth optical flows.
Some results of dynamic texture synthesis are displayed in Figure 2. We show the synthesis results by displaying the frames in the video sequences. For each example, the first row displays 6 frames of the observed 30frame video sequence, while the second and the third rows show the corresponding 6 frames of two synthesized 30frame video sequences that are generated by the learned model.
Human perception has been used in [4, 23, 26, 30] to evaluate synthesis quality. We follow the protocol of [30] and conduct a human perceptual study to obtain feedback from human subjects on the visual quality of the generated dynamic textures. We randomly choose 20 human observers to participate in the perceptual test, where each participant performs 36 (12 categories × 3 examples per category) pairwise comparisons between a synthesized dynamic texture and its real version. For each pairwise comparison, participants are asked to select the more realistic one after observing each pair of dynamic textures for a specified observation time, chosen from discrete durations between 0.3 and 3.6 seconds. The varying observation time helps us investigate how quickly the difference between dynamic textures can be identified. We specifically ask the participants to check carefully for both temporal coherence and image quality. We present all the dynamic textures to the participants as videos at a fixed resolution. To obtain unbiased and reliable results, we randomize the comparisons across the left/right layout of the two videos in each pair and the display order of the different video pairs. We measure the realism of dynamic textures by the participant error rate in distinguishing synthesized dynamic textures from real ones. The higher the participant error rate, the more realistic the synthesized dynamic textures. "Perfectly" synthesized results would cause an error rate of 50%, because random guesses are made when the participants are incapable of distinguishing the synthesized examples from the real ones.
For comparison, we use four baseline methods: LDS (linear dynamic system) [5], TwoStream [23], MoCoGAN [24], and the dynamic generator (DG) [30]. The comparison is performed on 12 dynamic texture videos (e.g., waterfall, burning fire, waving flag) that have been used in [30].
The results of this study are summarized in Figure 3, which shows perceived realism (i.e., user error rate) as a function of observation time across methods. Overall, perceived realism decreases as observation time increases, and then stays at a relatively constant level for longer observation times. This means that as the observation time becomes longer, the participants find it easier to distinguish "fake" examples from real ones. The results clearly show that the dynamic textures generated by our models are more realistic than those obtained by LDS, TwoStream, and MoCoGAN, and on par with those synthesized by DG.
To better understand the comparison results, we further analyze the performance of the baselines. We notice that the linear model (i.e., LDS) surpasses the methods using complicated deep network architectures (i.e., TwoStream and MoCoGAN). This is because a single training example is insufficient to train MoCoGAN, which contains a large number of learnable parameters, under an unstable adversarial learning scheme, while the TwoStream method, which relies on pre-trained discriminative networks for feature matching, is incapable of synthesizing spatially inhomogeneous dynamic textures (i.e., dynamic textures with a structured background, e.g., boiling water in a static pot), as mentioned in [23] and observed in [30]. Our model is simple in the sense that it relies on neither auxiliary networks for variational or adversarial training nor pre-trained networks for feature matching, yet powerful in terms of disentanglement of appearance (represented by pixels), trackable motion (represented by pixel movements or optical flow), and intrackable motion (represented by residuals).
4.3 Experiment 2: Unsupervised disentanglement of appearance and motion
To study the performance of the proposed model on disentanglement of appearance and motion, we perform a motion exchange experiment between two randomly selected facial expression sequences from the MUG Facial Expression dataset [19]. We first disentangle the appearance vector $a$, the optical flows $M_t$ as trackable motion, and the residuals $R_t$ as intrackable motion for each of the two sequences by fitting our model to them. We then exchange their inferred motions and regenerate both sequences by repeatedly warping the appearance images generated by their own appearance vectors with the exchanged optical flows. Figure 4 displays the results: (a) shows some image frames of the two selected original facial expression videos, one of a man with a sad facial expression and the other of a woman with a surprised facial expression; (b) visualizes the learned trackable motions (optical flows) as color images, following [2], where each color represents a direction (please see the appendix for the displacement field color coding map); and (c) displays some image frames of the generated videos after the motion exchange between the man and the woman.
From Figure 4, we can see that the motion latent vectors do not encode any appearance information. The color, illumination, and identity information in the generated video sequences depend only on the appearance latent vector, and are not changed after the motion exchange. Figure 5 demonstrates the idea of learning from only one single video, unsupervisedly disentangling the motion and appearance of the video, and then transferring the motion to other appearances. Figure 5 (a) displays some image frames of an observed video in which a woman is performing a surprised expression. We first learn a model from the observed video. Figure 5 (b) visualizes the learned motion. We then fix the inferred appearance latent vector and synthesize new surprised facial expressions by randomly sampling the motion latent vectors of the learned model. Two newly synthesized "surprise" expressions on the same woman are shown in Figure 5 (c). We further study transferring the learned motion to unseen appearances. We select two unseen faces from the testing set, apply the learned motion (i.e., the learned warping sequence) to the first frame of each testing video, and generate new image sequences, as shown in Figure 5 (d). We can also apply the learned motion to faces from other domains. Figure 5 (e) shows two examples of transferring the learned motion to cartoon face images. In each example, the image frame shown in the first column is the input appearance, and the remaining image frames are generated by applying the learned warping sequence to the input appearance. We can even apply the learned human facial expression motion to non-human appearances, such as animal faces (see Figure 5 (f)). Figure 6 shows one more example of motion transfer from another input video.
Although the appearance domain in testing is significantly different from that in training, the motion transfer does not modify the appearance information, because our trackable motion does not encode any appearance information; this corroborates the disentangling power of the proposed model. Currently, our model does not consider geometric deformation of the face; we assume the face data used in this experiment are well aligned. When performing motion transfer to a non-aligned testing face, we can easily pre-align the face by morphing, and then morph the newly generated faces in each image frame back to the original shape. More rigorously, we could add one more generator that accounts for the geometric deformation of the appearance to deal with the alignment issue. Training such a model would lead to an unsupervised disentanglement of appearance, geometry, and motion of video. We leave this as future work.
Figure 7 shows another example of motion transfer from a dynamic texture. Similarly, we learn our model from the waving yellow flag shown in Figure 7(a), and transfer the learned motion (shown in Figure 7(b)) to new images of flags to make them wave, as shown in Figure 7(c). We can use the learned model to generate an arbitrarily long motion sequence and transfer it to different images.
4.4 Experiment 3: Unsupervised disentanglement of trackable and intrackable motions
Intrackability (or trackability) is an important concept for motion patterns, and has been studied in [7]. It was demonstrated in [29, 7, 10] that trackability changes with the scale, density, and stochasticity of the dynamics. For example, the trackability of a video of a waterfall depends on the distance between the observed target and the observer. Besides, the observer's preference for interpreting dynamic motions by tracking appearance details is a subjective factor that affects the perceived trackability of a dynamic pattern in the visual system of the brain.
In the context of our model, we can define intrackability as the ratio between the average norm of the non-trackable residual images $R_t$ and the average norm of the observed images $\mathbf{I}_t$. This ratio depends on the penalty parameter $\lambda_1$ on the norm of $R_t$ used in the learning stage. This penalty parameter corresponds to the subjective preference mentioned above: the larger the preference, the greater the extent to which we interpret a video by trackable content, the smaller the residuals, and the lower the intrackability score.
Our model can unsupervisedly disentangle the trackable and intrackable components of the training videos. The intrackability can be directly obtained as a result of learning the model, where we do not need the ground truth or preinferred optical flows. In addition, the intrackability is defined in terms of the coherent motion pattern learned from the whole video sequence by our model.
Figure 8 shows a curve of intrackability scores under different preference rates $\lambda_1$ for each of 10 different dynamic patterns. One typical image frame is illustrated for each of the video clips that we used. The model structure and hyper-parameter setting are the same as those used in Experiment 1. The penalty parameter $\lambda_2$ for smoothness is fixed to be 0.005. The results are reasonable and consistent with our empirical observations and intuitions. For example, under the same subjective preference, a video with a structured background and slow motion tends to have a lower intrackability score, because one can easily track the elements in motion (e.g., a video clip exhibiting boiling water in a static pot), while a video with fast and random motion tends to have a higher intrackability score, due to the loss of track of the elements in the video (e.g., a video clip exhibiting burning flame or flowing water). Moreover, we find that as the preference $\lambda_1$ increases, the intrackability of all videos decreases, because the model seeks to interpret each video using more trackable motion.

Figures 9 and 10 demonstrate two examples of unsupervised disentanglement of the trackable and intrackable components of an observed video under different preference rates. In each figure, panel (a) displays some image frames of the training video, while panels (b) and (c) show the disentanglement results under preference rates equal to 0.5 and 5, respectively. We can see that, as the preference rate increases, the residual part (i.e., the intrackable component) shrinks and the optical flows (or displacement fields) become more detailed and complicated. Our model thus provides a natural way to understand the concept of intrackability of dynamic patterns.
We also conduct an ablation study to investigate the effect of the intrackable motion component in our model, by comparing the full model with a variant that only accounts for the trackable motion. Table 1 reports the average training loss across 12 training videos at different numbers of training epochs. The results suggest that, at the same number of training epochs, the model that ignores intrackable motion tends to have a higher training loss, especially when the intrackability of the video is high. Thus, intrackable motion is indispensable in representing a dynamic pattern.
epoch            2000     3000     4000     5000
full model       0.0285   0.0253   0.0235   0.0222
trackable only   0.0487   0.0463   0.0442   0.0426
5 Conclusion
This paper proposes a motion-based generator model for dynamic patterns. The model is capable of disentangling an image sequence into appearance, trackable motion, and intrackable motion, by modeling them with a nonlinear state space model in which the nonlinear functions of the transition model and the emission model are parametrized by neural networks.
A key feature of our model is that it can be learned without ground truth or pre-inference of the movements of the pixels or the optical flows; they are automatically inferred in the learning process. We show that the learned motion model can be generalized to unseen images by animating them according to the learned motion pattern. We also show that, in the context of the learned model, we can define the notion of intrackability of the training dynamic patterns.
Project page
The code and videos of our generated results can be found at http://www.stat.ucla.edu/~jxie/MotionBasedGenerator/MotionBasedGenerator.html
Acknowledgement
The work is supported by DARPA XAI project N660011724029; ARO project W911NF1810296; ONR MURI project N000141612007. We thank Yifei Xu for his assistance with experiments. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Appendix
Figure 11 shows the color map for the color-coded displacement fields, as used in [16]. We visualize trackable motion (optical flow) using the same color map in this paper.
References

[1] (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §4.1.
 [2] (2011) A database and evaluation methodology for optical flow. International Journal of Computer Vision 92 (1), pp. 1–31. Cited by: §4.3.
 [3] (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
 [4] (2017) Photographic image synthesis with cascaded refinement networks. In International Conference on Computer Vision, pp. 1511–1520.
 [5] (2003) Dynamic textures. International Journal of Computer Vision 51 (2), pp. 91–109.
 [6] (2010) Maximum margin distance learning for dynamic texture recognition. In European Conference on Computer Vision, pp. 223–236.
 [7] (2012) Intrackability: characterizing video statistics and pursuing video representations. International Journal of Computer Vision 97 (3), pp. 255–275.
 [8] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
 [9] (2018) Learning generator networks for dynamic patterns. In IEEE Winter Conference on Applications of Computer Vision.
 [10] (2015) Video primal sketch: a unified middle-level representation for video. Journal of Mathematical Imaging and Vision 53 (2), pp. 151–170.

 [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 [12] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
 [13] (2015) Adam: a method for stochastic optimization. In ICLR.
 [14] (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations.
 [15] (2007) Dynamic feature cascade for multiple object tracking with trackability analysis. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 350–361.
 [16] (2010) SIFT flow: dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), pp. 978–994.
 [17] (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pp. 5040–5048.
 [18] (2014) Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning.
 [19] (2010) The MUG facial expression database. In Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pp. 12–14.
 [20] (2006) Dynamics of target selection in multiple object tracking (MOT). Spatial Vision 19 (6), pp. 485–504.

 [21] (2014) Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML), T. Jebara and E. P. Xing (Eds.), pp. 1278–1286.
 [22] (2017) Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV).
 [23] (2018) Two-stream convolutional networks for dynamic texture synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6703–6712.
 [24] (2018) MoCoGAN: decomposing motion and content for video generation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526–1535.
 [25] (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613–621.
 [26] (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
 [27] (2002) A generative method for textured motion: analysis and synthesis. In European Conference on Computer Vision, pp. 583–598.
 [28] (2004) Analysis and synthesis of textured motion: particles and waves. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (10), pp. 1348–1363.
 [29] (2008) From information scaling of natural images to regimes of statistical models. Quarterly of Applied Mathematics 66, pp. 81–122.

 [30] (2019) Learning dynamic generator model by alternating back-propagation through time. In The Thirty-Third AAAI Conference on Artificial Intelligence.
 [31] (2016) Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408.
 [32] (2016) A theory of generative ConvNet. In ICML.
 [33] (2017) Synthesizing dynamic patterns by spatial-temporal generative ConvNet. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7093–7101.
 [34] (1999) On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes 65 (3–4), pp. 177–228.