1 Introduction
Dynamic or video textures are movies that are stationary in both space and time. Common examples are movies of flame patterns in a fire or waves in the ocean. There is a long history of work on synthesising dynamic textures (e.g. [1, 2, 3, 4, 5, 6, 7]), and recently spatiotemporal Convolutional Neural Networks (CNNs) were proposed to generate samples of dynamic textures [8]. In this note we introduce a much simpler approach based on the feature spaces of a CNN trained on object recognition [9]. We demonstrate that our model produces comparable synthesis results without the need to train a separate network for every input texture.
2 Dynamic texture model
Our model directly extends the static CNN texture model of Gatys et al. [10]. In order to model a dynamic texture, we compute a set of spatiotemporal summary statistics from a given example movie of that texture. While the static texture model from [10] only captures spatial summary statistics of a single image, our model additionally includes temporal correlations over several video frames.
We start with a given example video texture $X$ consisting of $T$ frames $x^{(t)}$, for $t = 1, \ldots, T$. For each frame $x^{(t)}$ we compute the feature maps $F^\ell(x^{(t)})$ in layer $\ell$ of a pretrained CNN. Each column of $F^\ell$ is a vectorised feature map, and thus $F^\ell \in \mathbb{R}^{M_\ell \times N_\ell}$, where $N_\ell$ is the number of feature maps in layer $\ell$ and $M_\ell$ is the product of the height and width of each feature map.
In the static texture model from [10], a texture is described by a set of Gram matrices computed from the feature responses of the layers included in the texture model. The Gram matrix of the feature maps in layer $\ell$ in response to an image $x$ is defined as $G^\ell(x) = \frac{1}{M_\ell} F^\ell(x)^\top F^\ell(x)$.
To include temporal dependencies in our dynamic texture model we combine the feature maps of $\Delta t$ consecutive frames and compute one large Gram matrix from them (Fig. 1). We first concatenate the feature maps of the $\Delta t$ frames along the second axis, $F^\ell_{t,\Delta t} = \left[ F^\ell(x^{(t)}), F^\ell(x^{(t+1)}), \ldots, F^\ell(x^{(t+\Delta t - 1)}) \right]$, such that $F^\ell_{t,\Delta t} \in \mathbb{R}^{M_\ell \times \Delta t N_\ell}$. Then we use this large feature matrix to compute a Gram matrix $G^\ell_{t,\Delta t} = \frac{1}{M_\ell} \left(F^\ell_{t,\Delta t}\right)^\top F^\ell_{t,\Delta t}$ that now also captures temporal dependencies of order $\Delta t$ (Fig. 1). Finally, this Gram matrix is averaged over all time windows $t \in \{1, \ldots, T - \Delta t + 1\}$. Thus our model describes a dynamic texture by the spatiotemporal summary statistics

$$G^\ell_{\Delta t} = \frac{1}{T - \Delta t + 1} \sum_{t=1}^{T - \Delta t + 1} G^\ell_{t,\Delta t} \qquad (1)$$

computed at all layers $\ell$ included in the model. Compared to the static texture model [10], this increases the number of parameters by a factor of $\Delta t^2$.
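The statistic in Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code: it assumes the per-frame feature maps of one layer have already been extracted from the CNN and vectorised into matrices of shape $(M_\ell, N_\ell)$.

```python
import numpy as np

def gram(F):
    """Gram matrix of a feature matrix F of shape (M, N):
    M spatial positions, N channels, normalised by M."""
    return F.T @ F / F.shape[0]

def spatiotemporal_gram(features, delta_t):
    """Average Gram matrix over all length-delta_t time windows (Eq. 1).

    features: list of T arrays, each of shape (M, N), holding the
    vectorised feature maps of one frame in a given CNN layer.
    Returns an array of shape (delta_t * N, delta_t * N).
    """
    T = len(features)
    # Concatenate the feature maps of each window along the channel axis.
    windows = [np.concatenate(features[t:t + delta_t], axis=1)
               for t in range(T - delta_t + 1)]
    # One Gram matrix per window, averaged over all windows.
    return np.mean([gram(W) for W in windows], axis=0)
```

For `delta_t = 1` this reduces to the frame-averaged static Gram statistics of [10], which makes the model a strict generalisation of the static case.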
3 Texture generation
Once the spatiotemporal summary statistics have been extracted from an example movie, they can be used to generate new samples of the video texture. To that end we sequentially generate frames that match the extracted summary statistics. Each frame is generated by a gradient-based preimage search that starts from a white noise image and finds an image matching the texture statistics of the original video.
Thus, to synthesise a frame $\hat{x}^{(t)}$ given the previous $\Delta t - 1$ frames $\hat{x}^{(t-\Delta t+1)}, \ldots, \hat{x}^{(t-1)}$, we minimise the following loss function with respect to $\hat{x}^{(t)}$:

$$\mathcal{L}\left(\hat{x}^{(t)}\right) = \sum_\ell w_\ell E_\ell \qquad (2)$$

$$E_\ell = \frac{1}{4\,\Delta t^2 N_\ell^2} \sum_{i,j} \left( \hat{G}^\ell_{ij} - G^\ell_{\Delta t, ij} \right)^2 \qquad (3)$$

where $\hat{G}^\ell$ is the Gram matrix computed from the concatenated feature maps of the frames $\hat{x}^{(t-\Delta t+1)}, \ldots, \hat{x}^{(t)}$, and $w_\ell$ are the weighting factors of the layers included in the model.
For all results presented here we included the layers ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ of the VGG-19 network [9] in the texture model and weighted them equally.
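A minimal sketch of this frame-wise preimage search is given below. It is a toy stand-in, not the paper's implementation: `feat_fn` is an assumed placeholder for the VGG feature extractor, the loss covers a single layer, and gradients are obtained by finite differences rather than by backpropagating through the CNN.

```python
import numpy as np
from scipy.optimize import minimize

def gram(F):
    # Gram matrix of an (M, N) feature matrix, normalised by M
    return F.T @ F / F.shape[0]

def synthesise_frame(prev_feats, target_gram, feat_fn, x0, maxiter=500):
    """Gradient-based preimage search for the next frame (Eqs. 2-3).

    prev_feats: list of delta_t - 1 fixed (M, N) feature matrices
                of the previously generated frames.
    target_gram: spatiotemporal Gram statistics of the source texture.
    feat_fn: maps a flat image vector to an (M, N) feature matrix
             (a stand-in for the CNN feature extractor).
    x0: white-noise initialisation of the new frame.
    """
    n = target_gram.shape[0]  # n = delta_t * N
    def loss(x):
        # Concatenate fixed past features with those of the candidate frame.
        F = np.concatenate(prev_feats + [feat_fn(x)], axis=1)
        return np.sum((gram(F) - target_gram) ** 2) / (4 * n ** 2)
    # L-BFGS with finite-difference gradients; the paper instead
    # backpropagates the loss through the network.
    res = minimize(loss, x0, method="L-BFGS-B", options={"maxiter": maxiter})
    return res.x
```

Movies of arbitrary length can then be produced by sliding this window forward one frame at a time.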
The initial $\Delta t - 1$ frames can be taken from the example movie, which allows the direct extrapolation of an existing video. Alternatively, they can be generated jointly by starting with $\Delta t$ randomly initialised frames and minimising the loss in Eq. (2) jointly with respect to all of them.
In general this procedure can generate movies of arbitrary length because the extracted spatiotemporal summary statistics naturally do not depend on the length of the source video.
4 Experiments and Results
Here we present dynamic textures generated by our model. We used example video textures from the DynTex database [11] and from the Internet. Each frame was generated by minimising the loss function for 500 iterations of the L-BFGS algorithm [12]. All source textures and generated results can be found at https://bethgelab.org/media/uploads/dynamic_textures/.
First we show results for $\Delta t = 2$ and random initialisation of the initial frames (Fig. 2). We extracted the texture parameters either from all frames of the source movie or from just a single pair of adjacent frames. Surprisingly, we find that extracting the texture parameters from only two frames is often sufficient to generate diverse dynamic textures of arbitrary length (Fig. 2, bottom rows).
However, the entropy of the generated frames is clearly higher when the parameters are extracted from the full source movie, and for some videos (example: water) greyish regions appear in the generated texture if only two original frames are used.
Next we explored the effect of increasing the size of the time window $\Delta t$. In general we noted that for most video textures varying the size of the time window has little effect. We observed differences, however, in cases where the motion is more structured. For example, given a movie of a branch of a tree moving in the wind (Fig. 3, top row), the leaves only move slightly up and down for the smaller time window (Fig. 3, middle row), whereas for the larger time window the motion extends over a larger range (Fig. 3, bottom row).
Still, even for the larger time window, the generated video fails to capture the motion of the original texture. In particular, it fails to reproduce the global coherence of the motion in the source video. While in the source video all leaves move up or down together with the branch, in the synthesised one some leaves move up while others move down at the same time. This inability to capture the global structure of the motion is even more apparent in the second example in Fig. 3 and illustrates a limitation of our model.
Finally, instead of generating a video texture from a random initialisation, we can also initialise with frames from the example movie. In that way the spatial arrangement is kept and we predict the next frames of the movie based on the initial motion. We used three frames of the original video to define the texture statistics (Fig. 4). The first two frames of the new movie were taken from the example and the following frames were generated sequentially as described in Section 3. In the resulting video the different elements keep moving in the same direction: the squirrel continues flying to the top left, while the plants move upwards. If an element disappears from the image, it reappears somewhere else in the image. The generated movie can be arbitrarily long: in this case we used only the initial three frames to generate over 600 frames of a squirrel flying through the image and did not observe a decrease in image quality.
5 Discussion
We introduced a parametric model for dynamic textures based on the feature representations of a CNN trained on object recognition [9]. In contrast to the CNN-based dynamic texture model by Xie et al. [7], our model can capture a large range of dynamic textures without the need to retrain the network for every given input texture.
Surprisingly, we find that even when the temporal dependencies are extracted from as few as two adjacent frames, our model still produces diverse-looking dynamic textures (Fig. 2). This is also true for non-texture movies with simple motion, where we can generate a theoretically infinite movie repeating the same motion (Fig. 4).
However, our model fails to capture structured motion with more complex temporal dependencies (Fig. 3). Possibly spatiotemporal CNN features or the inclusion of optical flow measures [13] might help to model temporal dependencies of that kind.
In general, though, we find that for many dynamic textures the temporal statistics can be captured by second-order dependencies between complex spatial features, leading to a simple yet powerful parametric model for dynamic textures.
References
 [1] G. Doretto, A. Chiuso, Y. N. Wu, and S. Soatto, “Dynamic textures,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference On, vol. 2, pp. 439–446, IEEE, 2003.
 [2] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick, “Graphcut textures: image and video synthesis using graph cuts,” in ACM Transactions on Graphics (ToG), vol. 22, pp. 277–286, ACM, 2003.
 [3] A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa, “Video Textures,” in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, (New York, NY, USA), pp. 489–498, ACM Press/Addison-Wesley Publishing Co., 2000.
 [4] M. Szummer and R. W. Picard, “Temporal texture modeling,” in Image Processing, 1996. Proceedings., International Conference on, vol. 3, pp. 823–826, IEEE, 1996.
 [5] Y. Wang and S.-C. Zhu, “A generative method for textured motion: Analysis and synthesis,” in European Conference on Computer Vision, pp. 583–598, Springer, 2002.
 [6] L.-Y. Wei and M. Levoy, “Fast texture synthesis using tree-structured vector quantization,” in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 479–488, ACM Press/Addison-Wesley Publishing Co., 2000.
 [7] J. Xie, S.-C. Zhu, and Y. N. Wu, “Synthesizing Dynamic Textures and Sounds by Spatial-Temporal Generative ConvNet,” arXiv preprint arXiv:1606.00972, 2016.
 [8] Z. Zhu, X. You, S. Yu, J. Zou, and H. Zhao, “Dynamic texture modeling and synthesis using multi-kernel Gaussian process dynamic model,” Signal Processing, vol. 124, pp. 63–71, July 2016.
 [9] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [10] L. A. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” Advances in Neural Information Processing Systems 28, 2015.
 [11] R. Péteri, S. Fazekas, and M. J. Huiskes, “DynTex: A comprehensive database of dynamic textures,” Pattern Recognition Letters, vol. 31, pp. 1627–1632, Sept. 2010.
 [12] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transactions on Mathematical Software (TOMS), vol. 23, no. 4, pp. 550–560, 1997.
 [13] M. Ruder, A. Dosovitskiy, and T. Brox, “Artistic style transfer for videos,” in German Conference on Pattern Recognition, pp. 26–36, Springer, 2016.