Animating Landscape: Self-Supervised Learning of Decoupled Motion and Appearance for Single-Image Video Synthesis

10/16/2019 · by Yuki Endo, et al.

Automatic generation of a high-quality video from a single image remains a challenging task despite the recent advances in deep generative models. This paper proposes a method that can create, from a single landscape image, a high-resolution, long-term animation using convolutional neural networks (CNNs), where we mainly focus on skies and waters. Our key observation is that the motion (e.g., moving clouds) and appearance (e.g., time-varying colors in the sky) in natural scenes have different time scales. We thus learn them separately and predict them with decoupled control, while handling future uncertainty in both predictions by introducing latent codes. Unlike previous methods that infer output frames directly, our CNNs predict spatially-smooth intermediate data via self-supervised learning, i.e., without explicitly-provided ground truth: flow fields for warping (motion) and color transfer maps (appearance). These intermediate data are applied not to each previous output frame but to the input image only once for each output frame. This design is crucial for alleviating error accumulation in long-term predictions, which is the essential problem of previous recurrent approaches. The output frames can be looped like a cinemagraph, and can also be controlled directly by specifying latent codes or indirectly via visual annotations. We demonstrate the effectiveness of our method through comparisons with the state of the art in video prediction as well as appearance manipulation.




1. Introduction

From a scenery image, humans can imagine how the clouds move and the sky color changes as time goes by. Reproducing such transitions in scenery images is a common subject not only of artistic content called cinemagraphs [Bai et al., 2012; Liao et al., 2013; Oh et al., 2017] but also of various techniques for image manipulation (e.g., scene completion [Hays and Efros, 2007], time-lapse mining [Martin-Brualla et al., 2015], attribute editing [Shih et al., 2013; Laffont et al., 2014], and sky replacement [Tsai et al., 2016]). However, creating a natural animation from a scenery image remains a challenging task in the fields of computer graphics and computer vision.

Previous methods in this topic can be grouped into two categories. The first category is the example-based approach that can create a realistic animation by transferring exemplars, e.g., fluid motion [Okabe et al., 2009; Prashnani et al., 2017] or time-varying scene appearance [Shih et al., 2013]. This approach, however, heavily relies on reference videos that match the target scene. The other category is the learning-based approach, which is typified by the recent remarkable techniques using Deep Neural Networks (DNNs).

DNN-based techniques have achieved great success in image generation tasks, particularly thanks to Generative Adversarial Networks (GANs) [Karras et al., 2018; Wang et al., 2018] and other generative models, e.g., Variational Auto-Encoders (VAEs), which have also been used to generate a video [Xiong et al., 2018; Li et al., 2018] from a single image. Unfortunately, the resolution and quality of the resulting videos are far lower than those achieved in image generation tasks. One reason for the poor results is that the spatiotemporal domain of videos is too large for generative models to learn, compared to the domain of images. Another reason is the uncertainty in future frame prediction; for example, imagine clouds in the sky in a single still image. The clouds might move left, right, forward, or backward in the next frames depending on environmental factors such as wind. Due to such uncertainty, learning a unique output from a single input (i.e., a one-to-one mapping) is intractable and unstable. Recent work using VAEs to handle the uncertainty is still insufficient for generating realistic and diverse results [Li et al., 2018].

In this paper, we propose a learning-based approach that can create a high-resolution video from a single outdoor image using DNNs. This is accomplished by self-supervised learning with a training dataset of time-lapse videos. Our key idea is to learn the motion (e.g., moving clouds in the sky and ripples on a lake) and the appearance (e.g., time-varying colors in daytime, sunset, and night) separately, by considering their spatiotemporal differences. For example, clouds move rapidly on the scale of seconds, whereas sky color changes slowly on the scale of tens of minutes, as shown in the riverside scene of Figure 1. Moreover, the moving clouds exhibit detailed patterns, whereas the sky color varies overall smoothly.

With this observation in mind, we learn and predict the motion and appearance separately using two types of DNN models (Figure 2) as follows. For motion, because one-shot prediction of complicated motion is difficult, our motion predictor learns the differences between two successive frames as a backward flow field. Long-term prediction is achieved by inputting the predicted frames recurrently. Motion-added images are then generated at high resolution by reconstructing pixels from the input image after tracing back the flow fields. For appearance, our predictor learns the differences between the input frame and arbitrary frames in each training video as spatially-smooth color transfer functions. In the prediction phase, color transfer functions are predicted at sparse frames and are applied to the motion-predicted frames via temporal interpolation. We assume that the motion and color variations in landscape time-lapse videos are spatiotemporally smooth, and enforce such regularization in our training, which works well particularly with the motions of clouds in the sky and waves on water surfaces, as well as the color variations of dusk/sunset in the sky. The output animation can be looped, inspired by cinemagraph.

To combat the uncertainty of future prediction, we also extract latent codes both for motion and appearance, which depict potential future variations and enable the learning of one-to-many mappings. The user can manipulate the latent codes to control the motion and appearance smoothly in the latent space. Note that the backward flow fields, color transfer functions, and latent codes are learned in a self-supervised manner because their ground-truth data are not available in general. Unlike previous techniques using 3D convolutions [Vondrick et al., 2016; Xiong et al., 2018; Li et al., 2018] for predictions with fixed numbers of frames, our networks adopt 2D convolutional layers. This approach allows fast learning and prediction and abolishes the limit on the number of predicted frames by recurrent feeding.

Our main contributions are summarized as follows:

  • A framework for automatic synthesis of animation from a single outdoor image with fully convolutional neural network (CNN) models for motion and appearance prediction,

  • Higher-resolution and longer-term movie generation by training with only hundreds to thousands of time-lapse videos in a self-supervised manner, and

  • Decoupled control mechanism for the variations of motion and appearance that change at different time intervals based on latent codes.

We demonstrate these advantages by comparing with various methods of video prediction as well as attribute transfer. Our user study reveals that our results are subjectively evaluated as competitive or superior to those of previous methods or commercial software (see Appendix D). We also show applications for controlling the motion and appearance of output frames.

2. Related Work

Here we briefly review the related work of our technical components: optical flow prediction, color transfer, style transfer, video prediction, and so forth.

2.1. Optical Flow Prediction

Optical flow prediction from a single image has been studied with various approaches. Supervised approaches using CNNs have also been proposed [Walker et al., 2015; Gao et al., 2017]. The key question is how to prepare ground-truth flow fields for supervised learning. The above-mentioned methods exploited existing techniques (e.g., FlowNet [Dosovitskiy et al., 2015], DeepFlow [Weinzaepfel et al., 2013; Revaud et al., 2016], and SpyNet [Ranjan and Black, 2017]) for generating ground-truth flow fields synthetically. However, we confirmed that previous methods relying on such synthetic data yield poor predictions for time-lapse videos (see Section 6.2), for which no genuine ground truth is available.

Recently, self-supervised approaches have been proposed for estimating flow fields between two input images [Ren et al., 2017; Wang et al., 2018]. We also adopt a self-supervised approach where a flow field is computed between two consecutive frames. The main difference from the existing approaches is that we input only a single image in the inference phase and handle the prediction uncertainty by introducing latent codes.

Figure 2. Overview of our inference pipeline. Given the input image and latent codes that control future variations, the motion predictor generates future backward flows. The flows are used to warp the input image to synthesize motion-added images, which are then converted to a cyclic motion loop. The appearance predictor generates color transfer maps, which are finally used for color transfer to obtain the output video. Input photo: Per Erik Sviland (Vvuxdqn-0vo).

2.2. Appearance Manipulation

Color transfer [Reinhard et al., 2001] is a fundamental technique for changing color appearance. This technique makes the overall color of a target image conform to that of a reference image while retaining the scene structure of the target, by matching the statistics (i.e., the mean and standard deviation) of the two images. The original method [Reinhard et al., 2001] has been enhanced to respect local color distributions using soft clustering [Tai et al., 2005] or semantic region correspondence [Wu et al., 2013]. There is also a color tone transfer method specialized for sky replacement [Tsai et al., 2016].

Style transfer using DNNs can convey richer information, including textures, than color transfer. The original work [Gatys et al., 2016] in this literature optimizes an output image via backpropagation of a perceptual loss for retaining the source content and a style loss for transferring the target style. Faster transfer is accomplished by pre-training autoencoders for specific styles [Johnson et al., 2016] and by using whitening and coloring transforms (WCTs) for arbitrary styles [Li et al., 2017]. Semantic region correspondence can also be integrated [Luan et al., 2017]. However, the strong expressive power of style transfer works against our purpose; it yields unnatural results for various scenes due to overfitting (see Section 6). We instead delegate texture transfer to our motion predictor and change the color appearance using transfer functions that avoid overfitting.

A recent arXiv paper [Karacan et al., 2018] presents a method to manipulate attributes of natural scenes (e.g., night, sunset, and winter) via style transfer [Luan et al., 2017] and image synthesis using a conditional GAN. From a semantic layout of the input image and a target attribute vector, the method first synthesizes an intermediate style image, which is then used for style transfer with the input image. Animations can be generated by gradually changing the attribute vector, but enforcing temporal coherence is difficult with this two-step synthesis. In contrast, our method offers smooth appearance transitions via latent-space interpolation, as we demonstrate in our experiments.

2.3. Video Generation from a Still Image

An early attempt to animate a natural scene in a single image was a procedural approach called stochastic motion texture [Chuang et al., 2005]. This approach generates simple quasi-periodic motions of individual components, such as swaying trees, rippling water, and bobbing boats, with parameter tuning for each component.

Example-based approaches can reproduce realistic motion or appearance without complex parameters by directly transferring reference videos [Okabe et al., 2009, 2011, 2018; Prashnani et al., 2017; Shih et al., 2013]. However, their results become unnatural without an appropriate reference video similar to the input image. This issue can be alleviated at the cost of a larger database and more computational resources. Also, existing techniques often impose tedious manual processes for specifying, e.g., alpha mattes, flow fields, and fluid regions. Our method can generate high-resolution videos using only hundreds of megabytes of pre-trained data within a few minutes on a single GPU. Our method runs automatically yet can also be controlled using latent codes.

Example-based appearance transfer [Shih et al., 2013] can reproduce time-varying color variations in a static image with a reference video. However, simple frame-by-frame transfer suffers from flickering artifacts for dynamic objects in the scene. Key-frame interpolation alleviates such flickering but is not directly applicable when the outputs are videos containing dynamic objects, unlike ours. The method by Laffont et al. [2014] achieves appearance transfer using a manually-annotated database, whereas our training datasets do not require manual annotations.

The past few years have witnessed dramatic advances in learning-based approaches, particularly those using DNNs. For example, DNN architectures used for video prediction include not only 2D CNNs [Xue et al., 2016; Mathieu et al., 2016; Lotter et al., 2017; Babaeizadeh et al., 2018; Hao et al., 2018] but also convolutional Recurrent Neural Networks (cRNNs) [Ranzato et al., 2014], Long Short-Term Memory (LSTM) [Srivastava et al., 2015; Zhou and Berg, 2016; Denton and Birodkar, 2017; Byeon et al., 2018], and 3D CNNs [Vondrick et al., 2016; Xiong et al., 2018; Li et al., 2018]. However, even with the state-of-the-art techniques [Xiong et al., 2018; Li et al., 2018], the frame length and resolution of generated videos are quite limited (i.e., up to 16 or 32 frames at low resolutions) due to the training complexity and architecture design. In sharp contrast, our method can generate much higher-resolution videos with an unlimited number of frames by leveraging intermediate flow fields and color transfer functions, as we discuss in Section 6. Note that a recent work by Li et al. [2018] also predicts flow fields like our method. The key differences of our method are that i) their method requires ground-truth flow fields, whereas ours does not (i.e., learning is self-supervised); ii) their method uses a 3D CNN, whereas ours uses a 2D CNN, which reduces the training complexity; and iii) their method cannot provide direct control over appearance transitions, whereas ours can because we employ decoupled training of motion and appearance.

3. Method Overview

Figure 2 shows the whole pipeline of our video synthesis, where our method first generates motion-added frames from the single input image, optionally makes them looped by linear blending, and then applies color transfer to each frame. As we explained in Section 1, our motion predictor infers backward flows recurrently, whereas our appearance predictor infers a color transfer function for each frame. This design is crucial for handling the well-known problem in recurrent inference where error accumulates in the cycled output frames [Shi et al., 2015]; in our motion prediction, error accumulates in the backward flows, which we assume are spatially-smooth and thus less sensitive to error. Each predicted frame is reconstructed by tracing back to the input image to avoid error accumulation in RGB values due to repetitive color sampling. In our appearance prediction, on the other hand, we avoid recurrent feeding and infer time-varying color transfer maps from the input image directly. Blur artifacts and error accumulation in output RGB values can be avoided because the per-pixel RGB value in the input image is sampled only once for each output frame in both predictions.

We handle the future uncertainty in both predictions using latent codes extracted in the training phases. By assuming that the overall motion throughout an animation sequence is similar, we control the motion in a single animation only with a single latent code. On the other hand, because our appearance predictor is trained with frame pairs between an input image and arbitrary frames in each training video, we require a latent code to control the appearance of each frame. Consequently, for appearance control of an animation sequence, we require a sequence of latent codes, which has the same length as the output frame length. The latent codes can be specified automatically or manually, from latent codes stored during training (hereafter we refer to them as a codebook).

4. Models

Hereafter, we describe our network models and distinguish the notations for motion and appearance with the superscripts M and A, respectively. Our motion predictor and appearance predictor are encoder-decoders with the same fully convolutional architecture. The inputs of the predictors are i) a linearly-normalized RGB image I of size W × H (where W and H are the image width and height) and ii) a latent code z to account for the uncertainty of future prediction. Code z^M controls the motion in a whole sequence, whereas z^A controls the appearance of only a single frame. The outputs of the predictors are multi-channel intermediate maps that are then used to convert the input image I into an output RGB frame, where we use a circumflex (ˆ) to indicate an inferred output.

In the following subsections, for motion and appearance, we first explain the inference phase to illustrate the use cases of the predictors and then describe how to train the networks.

Figure 3. Recurrent inference using the motion predictor. A backward flow at time t+1 is predicted from the image at time t and a latent code z^M. An output frame is obtained by warping the image using the flow, and is then used as the next input. This procedure is repeated to obtain multiple frames. Input photo: echoesLA (zleuiAR2syI).

4.1. Motion Predictor


Given an input image I_t (where t indicates the time, i.e., the frame number) and a latent code z^M, the motion predictor infers a backward flow field F_{t+1}, where pixel positions are normalized to [-1, 1]. The pixel value at position p in the output frame Î_{t+1} is then reconstructed by sampling that in the current frame I_t at p + F_{t+1}(p) via bilinear interpolation, where F_{t+1}(p) is the flow vector at p. We call this reconstruction operation warping in this paper. We recurrently use the predicted frames as the next motion predictor inputs (see Figure 3). However, if we warp the current frame to synthesize the next frame naïvely, the output frames become gradually blurry, as explained in Section 3. Therefore, we instead warp the flow fields F_1, F_2, …, F_t sequentially to accumulate flow vectors so that we can reconstruct each output frame from the input image directly.
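The warping and trace-back accumulation described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names are ours, `predict_flow` stands in for the CNN motion predictor, and flows are given in pixel offsets rather than normalized coordinates.

```python
import numpy as np

def warp(image, flow):
    """Backward-warp `image` by sampling at p + flow(p) with bilinear interpolation.

    image: (H, W, C) float array; flow: (H, W, 2) array of (dy, dx) pixel offsets.
    """
    H, W = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = np.clip(ys + flow[..., 0], 0, H - 1)
    sx = np.clip(xs + flow[..., 1], 0, W - 1)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (sy - y0)[..., None], (sx - x0)[..., None]
    return ((1 - wy) * (1 - wx) * image[y0, x0]
            + (1 - wy) * wx * image[y0, x1]
            + wy * (1 - wx) * image[y1, x0]
            + wy * wx * image[y1, x1])

def rollout(image, predict_flow, n_frames):
    """Accumulate backward flows so every frame is reconstructed from `image` directly."""
    frames, acc = [], None
    for _ in range(n_frames):
        f = predict_flow()                            # backward flow for the next step
        acc = f if acc is None else warp(acc, f) + f  # chain flow fields, not pixels
        frames.append(warp(image, acc))               # sample the *input* image once per frame
    return frames
```

The key point is in `rollout`: the accumulated flow is warped and composed at each step, so RGB values are always resampled from the original input image, avoiding repeated resampling blur.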

Predicting flow fields in our self-supervised setting is challenging because it essentially amounts to finding correspondences between two consecutive frames with large degrees of freedom, which is easily trapped in local optima, yielding inconsistent flow fields. We thus restrict the range of the output flow fields in both the prediction and training phases by assuming that objects do not move significantly in a single timestep. Specifically, we divide the inferred flow fields by a constant to restrict the range of their magnitudes. Figure 4 demonstrates the effectiveness of this restriction, with the results obtained after training only using the single image shown at the top left. Without the restriction, the estimated flow fields are inconsistent and the reconstructed images are corrupted. With the restriction, the reconstructed frames match the ground truth more closely, thanks to the consistent flow field estimation.

Figure 4. Motion restriction with a constant divisor. The inferred flows become inconsistent without the restriction but yield warped images close to the ground-truth frames with it. Input photos: Melania Anghel (rM7aPu9WV2Q).

A straightforward way to train the motion predictor is to minimize the difference between inferred and ground-truth flow fields, as done in [Walker et al., 2015; Gao et al., 2017; Li et al., 2018]. Our motion predictor, in contrast, learns future flow fields in a self-supervised manner only from time-lapse videos, for which no ground-truth flows exist.

Figure 5. Training of the motion predictor. The training uses consecutive frames I_t and I_{t+1} such that the loss between I_{t+1} and the output Î_{t+1} is minimized. Î_{t+1} is obtained by warping I_t using the backward flow, which is regularized with a weighted total variation loss. The latent code z^M is obtained by encoding the previously inferred flow using the motion encoder in our self-supervised setting, where ground-truth flows do not exist. Input photos: Akio Terasawa (gRnKhf9Kw1Q).

Figure 5 outlines the training of the motion predictor. We first define an L2 loss for the network output Î_{t+1} obtained from the input image I_t and the next frame I_{t+1}:

  L_L2 = || Î_{t+1} − I_{t+1} ||_2,   (1)

where ||·||_2 means the L2 norm. Also, a weighted total variation loss is applied to the output flow field F̂_{t+1} for edge-preserving smoothing:

  L_tv = Σ_p Σ_{p′∈N(p)} w(p, p′) || F̂_{t+1}(p) − F̂_{t+1}(p′) ||_1,
  w(p, p′) = exp(−γ || I_{t+1}(p) − I_{t+1}(p′) ||_2),   (2)

where N(p) indicates the right and above neighbors of p, and γ is a constant that determines the influence of this term. The output flow field is smoothed using the weighting function w such that it respects the color variations of the next frame I_{t+1}. Using weights λ_L2 and λ_tv, our total training loss function is defined by

  L^M = λ_L2 L_L2 + λ_tv L_tv.   (3)
To handle future uncertainty and extract latent codes z^M, we simultaneously train a motion encoder. Problems similar to this one-to-many mapping were tackled in BicycleGAN [Zhu et al., 2017], where latent codes are learned from ground-truth images. In our case, the latent codes should be learned from the flow fields, whose ground truth is not available.

To overcome this chicken-and-egg problem, we initialize the input flow field of our motion encoder as a zero tensor in the first epoch and gradually update it with the inferred flows during the training phase. Another problem is that, because a pair of consecutive frames for training is selected randomly from each training video in each epoch (see Section 4.3), a naïve approach would initialize the encoder input for each pair, which yields slow convergence. We thus re-use the encoder input for each training video, assuming that frames throughout the video exhibit a similar motion. We refer to this re-used input as a common motion field for the training video and condition it on a single latent code z^M. The common motion field of each training video is stored in each epoch and used in the next epoch to extract the latent code of the corresponding video. In this way, we finally store the code in a codebook for use in the inference phase. A pseudo-code of this training procedure is shown in Appendix A.2.

4.2. Appearance Predictor


Given the input image I (which equals the motion-added frame at time t), our appearance predictor infers a color transfer map for an arbitrary frame t (Figure 6). Each color transfer map is controlled by the latent code z^A_t at frame t. The output frame is then computed by applying the map to the input image via the Hadamard product, with a clipping function used to restrict the pixel values of the output within the valid range.

In the final video generation (Section 5), we first interpolate the latent code sequence linearly, and then apply color transfer to each frame.
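A minimal sketch of applying a color transfer map is given below. The channel layout of the map is not spelled out in this excerpt, so a per-pixel, per-channel scale (Hadamard product) followed by clipping to [0, 1] is assumed for illustration.

```python
import numpy as np

def apply_color_transfer(image, transfer_map):
    """Apply a per-pixel color transfer map via the Hadamard (elementwise)
    product, then clamp the result to the valid range [0, 1].

    image, transfer_map: (H, W, 3) float arrays.
    """
    return np.clip(transfer_map * image, 0.0, 1.0)
```

In the full pipeline, one such map is predicted per (sparse) frame and the maps in between follow from the linearly interpolated latent codes.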

Figure 6. Inference using the appearance predictor. A color transfer map at time t is computed from the input image and a latent code z^A_t. An output frame is obtained by applying color transfer to the input. Multiple frames are obtained using a latent code sequence, unlike the recurrent feeding used in the motion predictor. Input photo: Domenico Loia.

Figure 7 outlines the training of the appearance predictor. We first define loss functions between two frames with different appearances sampled from the training dataset. To learn style conversion for the entire image, we use a style loss between the inferred output frame Ŷ and the ground-truth target frame Y:

  L_style = Σ_l || G(φ_l(Ŷ)) − G(φ_l(Y)) ||_2²,

where the function φ_l outputs feature maps obtained from the l-th layer of the pre-trained VGG16 [Simonyan and Zisserman, 2014], and the function G outputs the Gram matrix of the feature maps. Inspired by the existing style transfer algorithm [Johnson et al., 2016], we use relu_2_2, relu_3_3, and relu_4_3 as the layers l. Note that the style loss is insensitive to spatial color distributions due to the Gram matrix, which makes, for example, a partially red sky during sunset difficult to handle. Therefore, an additional weak constraint is imposed on the output frame to roughly conform to the spatial color distributions:

  L_dist = || SPP(Ŷ) − SPP(Y) ||_2²,

where SPP indicates the spatial pyramid pooling function [He et al., 2015], which outputs fixed-size feature maps by dividing an image into multi-level grids. We set the pyramid height to one and divide the image into a grid, where average pooling is applied to each cell. Whereas the above losses are defined against the ground-truth target frame Y, a content loss is defined against the input I to keep the input scene structure:

  L_content = || φ_l(Ŷ) − φ_l(I) ||_2².

As the layer l in this loss function, we use relu_1_2 only, to retain the high-frequency components of the input scene. Finally, the inferred color transfer map is regularized with a weighted total variation term to improve the generalization ability of the model. Note that, unlike Equation (2), these color transfer maps are smoothed such that the weighting function respects the scene structure of the input image I. The total loss is then given by the summation of the above losses with weights λ_style, λ_dist, λ_content, and λ_reg:

  L^A = λ_style L_style + λ_dist L_dist + λ_content L_content + λ_reg L_reg.
Figure 7. Training of the appearance predictor. The training uses each pair of a source image I and a target image Y such that the style and color-distribution losses between the output Ŷ and Y are minimized. Ŷ is obtained via color transfer based on the inferred map applied to I. The content and regularization losses impose that the content of I be preserved in Ŷ and that the color transfer map be regularized, respectively. The latent code z^A is obtained by encoding the target image using the appearance encoder. Input photos: Anonymous (a8CTqQAxBzI).

We also train the appearance encoder simultaneously to extract the latent codes z^A. The input of the appearance encoder is the target frame Y so that the inferred output is conditioned on Y. After the training, a sequence of latent codes for each training video is extracted using the encoder and stored in a codebook, similarly to the motion predictor.

4.3. Implementation

The network architectures of our predictors are summarized in Appendix A.1. The motion and appearance predictors are fully convolutional networks, each of which consists of three downsampling layers, five residual blocks, and three upsampling layers. The networks contain skip connections as also used in U-Net [Ronneberger et al., 2015]. Our motion and appearance encoders adopt the same network structure as resnet_128 [Zhu et al., 2017], which consists of six layers for convolution, pooling, and linear transformation.

To avoid training biases from longer video clips, we train on one pair of frames sampled randomly from each video clip in each epoch. Whereas the motion predictor learns from a pair of consecutive frames, the appearance predictor learns from an arbitrary pair of frames. The pseudo-codes of the training procedures are described in Appendix A.2.

The training images were resized to fixed resolutions for the predictors and for the encoders, for both motion and appearance. The number of dimensions of the latent codes z was set to 8. We used the Adam optimizer [Kingma and Ba, 2014] with a batch size of 8 for backpropagation. The learning rate, the two Adam coefficients, and the weights of the loss functions were chosen empirically.

5. Single-image Video Generation

Now we explain how to generate a video from a single image by integrating the two predictors. Inspired by cinemagraph, the output animation can be looped as an option. Here we explain the looped version.

Algorithm 1 summarizes the procedure of our video generation. The motion prediction first generates a sequence of frames, which is then converted to a looped one. A sequence of output frames is finally generated from the looped sequence through the appearance prediction. The non-looped sequence is used instead if the looping process is not required. To make a motion loop from the non-periodic sequence, various methods can be used [Schödl et al., 2000; Liao et al., 2015]. Among the several methods that we tested, simple cross-fading [Schödl et al., 2000] worked relatively well for making plausible animations without significant discontinuities. Whereas the resolutions of the input image, the intermediate frames, and the final video are not limited, the inputs to the predictors and encoders are resized to fixed resolutions for training. The inferred flow fields and color transfer maps are resized to the original size and then applied to the original input image. We do not magnify output frames directly, to avoid blurring. To handle sampling outside of previous flow fields during the reconstruction of output frames, reflection padding is applied to the input image and previous flow fields.

Algorithm 1. Single-image Video Generation
Input: Input image I, motion latent code z^M, appearance latent codes
Output: Output video
// Motion prediction
for each frame t do
    predict the backward flow from the resized current frame and z^M
    warp the input image I by the accumulated flow to obtain the next frame
end for
GenerateLoop()  // cross-fade the frames into a motion loop
// Appearance prediction
InterpolateLatentCodes()  // one appearance code per output frame
for all codes z^A_t in the interpolated sequence do
    GetNextFrameCyclically()  // step through the motion loop
    predict the color transfer map from the resized frame and z^A_t
    ColorTransfer()  // apply the resized map to the frame
end for
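The GenerateLoop step can be realized with simple cross-fading, as mentioned above. The sketch below is one plausible NumPy realization (the blend length is a free parameter of ours, not a value from the paper): the trailing frames are faded into the leading ones so the clip closes without a visible seam.

```python
import numpy as np

def make_loop(frames, blend):
    """Turn a non-periodic frame sequence into a seamless loop by
    cross-fading the trailing `blend` frames into the leading ones."""
    n = len(frames)
    assert 2 * blend <= n, "need enough frames to fade over"
    looped = [f.astype(float) for f in frames[:n - blend]]
    for i in range(blend):
        a = (i + 1) / (blend + 1)  # fade weight toward the sequence head
        looped[i] = (1 - a) * frames[n - blend + i] + a * frames[i]
    return looped
```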

We can control the future variations of the output frames with the motion and appearance latent codes, and also adjust the speeds of motion and appearance. The latent codes can be selected randomly (in this case, automatically) or manually from the codebook. We also show applications that control the latent codes indirectly in Section 6.4. The motion speed can be adjusted by simply multiplying the flow fields by an arbitrary scalar value. Meanwhile, the appearance speed is determined in two ways: by adjusting the number of latent codes in a sequence obtained from the codebook, or by repeating the motion loop an integer number of times during one cycle of appearance variation. We adopt the latter for all the looped videos. The latent code sequence for appearance at key-frames is linearly interpolated to generate latent codes for all frames. We also interpolate between the final and initial latent codes to generate a cycle.
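The key-frame interpolation of appearance codes, including the closing interpolation back to the first code, can be sketched as follows (a NumPy illustration with function and parameter names of our own choosing).

```python
import numpy as np

def interpolate_codes(key_codes, n_frames, cyclic=True):
    """Expand per-key-frame appearance latent codes into one code per output
    frame by linear interpolation; with `cyclic=True` the last code is also
    interpolated back to the first so the appearance variation forms a loop."""
    keys = [np.asarray(c, dtype=float) for c in key_codes]
    if cyclic:
        keys = keys + [keys[0]]   # close the cycle
    k = len(keys) - 1
    out = []
    for i in range(n_frames):
        # Map frame index to a continuous position along the key-frame axis.
        x = i * k / n_frames if cyclic else i * k / max(n_frames - 1, 1)
        j = min(int(x), k - 1)
        a = x - j
        out.append((1 - a) * keys[j] + a * keys[j + 1])
    return out
```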

6. Experiments

We implemented our system with the PyTorch library running on a PC with NVIDIA GeForce GTX 1080 Ti GPUs. We stopped training after 5,000 epochs, and the computation time was about one week on a single GPU. Motion and appearance inferences to generate a frame took 0.054 seconds and 0.058 seconds, respectively. The overall computation time to generate a cinemagraph of 1,010 frames was 98 seconds, which included loading the trained model parameters to a GPU (9 seconds), motion inference (11 seconds), motion loop generation (6 seconds), and appearance inference (59 seconds). The other processes consumed the remaining time.

The results in our paper are demonstrated in the supplemental video. The directions and magnitudes of optical flow vectors are visualized using the pseudo colors shown in Figure 8.

Figure 8. Qualitative comparison with the state-of-the-art video generation by Xiong et al. [2018] and Li et al. [2018]; our output resolution is higher than theirs. Input photo: Per Erik Sviland (Vvuxdqn-0vo).

6.1. Dataset Generation

For training the motion predictor and encoder, we used the time-lapse video dataset published by Xiong et al. [2018]. The dataset was divided into 1,825 video clips for training and 224 clips for testing. To avoid learning motions that are too subtle, we first sampled every other frame from each training video clip and then automatically omitted pairs of frames in which the average difference of pixel values between consecutive frames was less than 0.02. The resultant video clips contain 227 frames on average.

Because the videos used for motion modeling are too short to observe appearance transitions, we collected 125 one-day video clips from YouTube and the dataset published by Shih et al. [2013] for appearance modeling. Because appearance changes more slowly than motion, we omitted more redundant frames from the dataset. Specifically, we first sampled frames about every 10 minutes in real-world time for each video clip, and then omitted consecutive frames containing smaller appearance variations. To do this, we computed the sum of the RGB differences of the average of the pixel values for the consecutive frames, and adjacent frames were automatically omitted if the corresponding sum was less than 0.3. With this sampling process, the number of frames for each training clip is reduced to 15 on average. Note that the input images shown in this paper were not included in the training data unless otherwise noted.
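The appearance-clip subsampling is analogous but operates on real-world timestamps and per-channel mean colors; a sketch (function name, timestamp representation, and [0, 1] normalization are our assumptions, while the 10-minute interval and 0.3 threshold come from the text):

```python
import numpy as np

def filter_appearance_frames(frames, timestamps_sec, interval_sec=600.0,
                             rgb_threshold=0.3):
    """Subsample a one-day clip for appearance training (Sec. 6.1 sketch).

    1) Keep roughly one frame per `interval_sec` of real-world time
       (about 10 minutes in the paper).
    2) Drop a frame when the sum over RGB channels of the difference of
       per-channel mean pixel values to the previously kept frame is
       below `rgb_threshold`.
    """
    sampled, last_time = [], -np.inf
    for f, t in zip(frames, timestamps_sec):
        if t - last_time >= interval_sec:
            sampled.append(f)
            last_time = t
    kept = [sampled[0]]
    for f in sampled[1:]:
        diff = np.abs(f.mean(axis=(0, 1)) - kept[-1].mean(axis=(0, 1))).sum()
        if diff >= rgb_threshold:
            kept.append(f)
    return kept
```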

Figure 9. Another qualitative comparison with the state-of-the-art video generation by Xiong et al. [2018] and Li et al. [2018]. Their and our output resolutions are and , respectively. Input photo:

6.2. Comparisons with Video Prediction Models

To clarify the advantages of our method, we compared it with state-of-the-art video prediction models for a single input image. The comparison models are 3DCNN encoder-decoders [Li et al., 2018] that predict flow fields for a fixed number of frames from an input image and generate future frames based on the predicted flows. To train the comparison models, we used the same training data [Xiong et al., 2018] as ours, and ground-truth flow fields were created using SpyNet [Ranjan and Black, 2017] based on the authors' code [Li et al., 2018]. We used their default parameters and image size. The number of epochs was also the same as ours (5,000), and no improvement was observed with more epochs. We also compared our method with a recent GAN-based model [Xiong et al., 2018].

Figures 8 and 9 show qualitative comparisons. The right images are the frames and flow fields generated by each method from the upper-left image. As shown by the insets in the second and third rows, our method generates flow fields that are more consistent with the input scene structure than those of the previous method; for example, the clouds and the water surface move differently, whereas the land remains static overall. In the first row, the GAN-based method suffers severely from artifacts even in low-resolution images. In the second row, the frames generated by Li et al. are unnaturally abstracted despite the model's two-phase design that first predicts flow fields and then generates future frame pixels. Our results are clearer and of higher resolution, as demonstrated in the third row. Moreover, our method can theoretically generate an unlimited number of frames. Finally, as shown in the third row, our method can generate a looped animation that also contains appearance variations, thanks to decoupled learning, whereas the compared method cannot handle this sufficiently.

Figure 10. Quantitative comparisons with ground-truth for 224 test video clips. RMSE (left) and perceptual dissimilarity [Zhang et al., 2018] (right) are computed for each predicted frame. The solid lines and error bars denote average, minimum, and maximum values of the metrics, respectively.

In addition, we conducted quantitative evaluations on the 224 test video clips against the methods by Xiong et al. [2018] and Li et al. [2018], regarding the accuracy with respect to ground-truth successive frames. We compared the generated sequences with the ground-truth ones frame by frame. As evaluation metrics, we used RMSE and perceptual dissimilarity [Zhang et al., 2018] based on AlexNet [Krizhevsky et al., 2012]. Because our results depend on latent codes, we report the average, minimum, and maximum values of the metrics over five latent codes sampled from the codebook. The previous method [Li et al., 2018] based on VAE can also synthesize different future sequences; we therefore sampled noise five times from its normally-distributed latent space to generate five sequences, which were used to calculate the metrics in the same manner as ours. Figure 10 shows the frame-by-frame RMSEs and perceptual dissimilarities for each method. The solid lines and error bars denote the average, minimum, and maximum values of the metrics computed from the different future sequences. The increasing trends in both graphs imply that long-term prediction is challenging. Nevertheless, our method outperforms the state-of-the-art methods while also generating higher-resolution and longer sequences. In particular, our results are perceptually more similar to the ground-truth sequences, even when generated with different parameters.
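The per-frame aggregation over the five sampled sequences can be sketched as follows (an illustrative implementation; the function name and array shapes are our assumptions, and perceptual dissimilarity would be aggregated the same way using an LPIPS-style metric in place of RMSE):

```python
import numpy as np

def framewise_rmse(pred_seqs, gt_seq):
    """Per-frame RMSE statistics over several sampled sequences.

    pred_seqs: array (S, T, H, W, C) -- S sequences from S latent codes.
    gt_seq:    array (T, H, W, C)    -- ground-truth frames.
    Returns (avg, mn, mx), each of length T, matching the solid lines
    and error bars plotted in Figure 10.
    """
    pred_seqs = np.asarray(pred_seqs, dtype=np.float64)
    gt = np.asarray(gt_seq, dtype=np.float64)
    # RMSE per (sample, frame), averaged over pixels and channels
    err = np.sqrt(((pred_seqs - gt) ** 2).mean(axis=(2, 3, 4)))
    return err.mean(axis=0), err.min(axis=0), err.max(axis=0)
```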

6.3. Comparisons on Appearance Manipulation

Figure 11. Comparisons with previous methods for appearance manipulation. From left to right, the output image sizes are , , and , respectively. Input photos: Shih et al. [2013].

We further compared our appearance-only results with those of previous color/style transfer methods. Figure 11 shows the results of appearance transfer obtained using the source and target images (inset) in the top row. For the local color transfer [Tai et al., 2005] in the second row, the appearance variations are monotonic and inconsistent with the scene structures. Style transfer based on WCT [Li et al., 2017] in the third row can handle more diverse appearance variations, but some artifacts can be observed. Although these artifacts are alleviated by solving the screened Poisson equation [Mechrez et al., 2017], as shown in the fourth row, the results are still unnatural. On the other hand, the example-based hallucination [Shih et al., 2013] (fifth row) and deep photo style transfer [Luan et al., 2017] (sixth row) successfully transfer the target appearances. These methods, however, require a target video and an additional semantic segmentation map, respectively. Even worse, when applied to videos, frame-by-frame optimization causes flickering artifacts, and key-frame interpolation cannot be used with dynamic objects. Our results in the bottom row are generated without any additional inputs, except for latent codes encoded from target images. Thanks to latent codes, natural and smooth interpolation is possible in the latent space, as demonstrated in the seventh to ninth rows, where the appearance changes from the source to the target. Moreover, we can even dispense with target images if latent codes are specified from the codebook or are predicted from source images via LSTM prediction (see Appendix B).

Figure 12. Comparisons with the state-of-the-art attribute manipulation by Karacan et al. [2018]. The red circles indicate flicker artifacts in their results. In contrast, our method can reproduce smoothly-varying appearances. We also quantitatively visualize this difference based on the sum of absolute differences and structural similarity between consecutive frames. The output resolution is . Input photo: Heretiq/
Figure 13. Comparisons with commercial appearance editing software (Photoshop Match Color). The input image is the same as that in Figure 12, and the target images are also used to extract the latent codes for our method. The fade parameter (from 0 to 100) can control the degree of color transfer.
Figure 14. Effects of latent codes for motion. The input latent codes are extracted via self-supervised learning and sorted along the first principal axis of the codebook. We can see that the output frames are also aligned according to the input latent codes. The output resolution is . Input photo: echoesLA (zleuiAR2syI)/

To the best of our knowledge, there are no methods using generative models for appearance variation, except for the recent one for manipulating image attributes [Karacan et al., 2018]. This method can be applied to generating videos containing appearance transitions by gradually changing attributes. Therefore, in Figure 12, we compared our appearance predictor with their method using the input image and results on their project page. In our results, we selected the latent codes that yield appearances similar to those of the compared results. As demonstrated in the first row, for each output frame, the compared method can generate semantically plausible appearances that match the image content. Their sequence, however, contains flickering artifacts due to the two-stage synthesis, where temporal consistency is difficult to impose. In the second and third rows, we can see that our result is free from noticeable artifacts; we can generate temporally-coherent animations thanks to key-frame interpolation in the latent space, unlike the compared method. We also quantitatively visualize such artifacts by computing the sum of absolute differences and structural similarity between consecutive frames, as shown at the lower left. The resultant values imply that our method allows smoother transitions than the compared method. In addition, we compared our method with commercial appearance editing software (Photoshop Match Color). Figure 13 demonstrates that local appearance transitions cannot be reproduced by the software, unlike our method. The halo artifacts behind the cottage roof in Figure 13 are stronger than in our result in Figure 12. Note that, in the supplemental video, the resultant animation using the commercial software was generated by repeatedly applying color transfer to intermediate frames using multiple target images. In this case, the halo artifact is unexpectedly reduced, but the global color variation becomes monotonic due to error accumulation.
Our method can avoid such error accumulation thanks to the latent-space interpolation.
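The two temporal-smoothness measures used in Figure 12 can be sketched as follows (an illustrative implementation with our own function name; the SSIM here is a simplified global variant on intensity values in [0, 1], not the windowed formulation):

```python
import numpy as np

def flicker_metrics(frames):
    """Per-pair temporal smoothness proxies (Figure 12 sketch).

    For each consecutive frame pair, compute the sum of absolute
    differences (SAD) and a global SSIM; lower SAD and higher SSIM
    indicate smoother transitions (less flicker).
    """
    C1, C2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM constants for [0, 1]
    sads, ssims = [], []
    for a, b in zip(frames[:-1], frames[1:]):
        a = np.asarray(a, np.float64)
        b = np.asarray(b, np.float64)
        sads.append(np.abs(a - b).sum())
        mu_a, mu_b = a.mean(), b.mean()
        va, vb = a.var(), b.var()
        cov = ((a - mu_a) * (b - mu_b)).mean()
        ssims.append(((2 * mu_a * mu_b + C1) * (2 * cov + C2)) /
                     ((mu_a ** 2 + mu_b ** 2 + C1) * (va + vb + C2)))
    return sads, ssims
```

For identical consecutive frames, SAD is 0 and SSIM is 1; flickering sequences drive SAD up and SSIM down.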

6.4. Controlling Future Variations

6.4.1. Effects of latent codes.

To verify that our method can handle future uncertainty and can learn a meaningful latent space in an unsupervised manner, i.e., without any ground-truth labels such as wind directions or time labels (e.g., "daytime" or "night"), we investigated how latent codes affect outputs. Figure 14 shows examples for motion. Here we sorted the latent codes in the codebook according to the first principal component and applied them to the same input image. As we can see from the optical flows, our method generates similar motions from similar latent codes, while retaining a wide variety of motions. For appearance, a sequence contains time-varying latent codes, and similar consecutive latent codes yield smooth transitions over time (please see our supplemental video). Figure 15 also demonstrates that diverse appearances (e.g., sunset, twilight, and night) can be reproduced from the same input image with different latent code sequences.
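The sorting along the first principal component used for Figure 14 can be sketched as follows (an illustrative implementation via SVD; the function name is ours, and the sign of the principal axis is arbitrary, so the ordering may be reversed):

```python
import numpy as np

def sort_codes_by_first_pc(codebook):
    """Order latent codes along the first principal axis (Figure 14).

    codebook: array (N, D) of latent codes gathered during training.
    Returns the codes sorted by their projection onto the first
    principal component, so neighboring codes yield similar motions.
    """
    X = np.asarray(codebook, dtype=np.float64)
    Xc = X - X.mean(axis=0)
    # First right-singular vector of the centered data = first PC
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ vt[0]
    return X[np.argsort(proj)]
```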

Figure 15. Effects of latent codes for appearance. In a frame sequence in each row, diverse appearances are synthesized from a different latent code sequence in the codebook. The output resolution is . Input photo: Jezael Melgoza/

Whereas direct use of latent codes from the codebook yields natural transition (e.g., from daytime to sunset in appearance) because they are extracted from real time-lapse videos, we also provide means to indirectly specify latent codes, namely, using arrow annotations for motion and image patches for appearance, as explained below.

Figure 16. Motion specification using arrow annotations. The output resolution is . Input photo: Pixabay/

6.4.2. Motion control using arrow annotations

We offer arrow annotations for specifying flow directions of motion, as shown in the left column in Figure 16. We represent these sparse annotations as 2D maps in which the pixels under the arrows hold the specified flow vectors. Given such an annotation map and an input image, an optimum latent code is obtained via optimization with respect to the latent code while fixing the network parameters of the motion predictor, as done in [Gatys et al., 2016]:
    ẑ = argmin_z ‖ M ⊙ max(0, m − c(A, P(I, z)) ) ‖₁,    (12)
where the cosine function gives a map containing the cosine of the angle between two flows at each pixel. The mask restricts the error computation to the arrows, and the constant margin map (having 0.5 at each pixel) allows a certain level of difference between the estimated and specified flows. We used the Adam optimizer [Kingma and Ba, 2014] for this optimization. Using the optimized latent code and the input image, we recurrently predict image sequences containing motions similar to the directions of the annotations. The middle and right columns in Figure 16 demonstrate that entire flow fields are plausibly generated from the sparse annotations. Optimization for each user edit took about seven seconds.
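The masked cosine-margin objective can be sketched as follows (an illustrative implementation based on our reading of this section; the function and symbol names are ours, and in practice this loss is minimized with Adam with respect to the latent code while the motion predictor's weights stay fixed):

```python
import numpy as np

def annotation_loss(pred_flow, annot_flow, mask, margin=0.5, eps=1e-8):
    """Masked cosine-margin loss for arrow annotations (sketch).

    pred_flow, annot_flow: (H, W, 2) flow maps; mask: (H, W) with 1 on
    annotated pixels. The cosine between the predicted and specified
    flows is penalized only where it falls below the margin, so flows
    sufficiently aligned with the annotation incur no cost.
    """
    dot = (pred_flow * annot_flow).sum(axis=-1)
    norm = (np.linalg.norm(pred_flow, axis=-1) *
            np.linalg.norm(annot_flow, axis=-1) + eps)
    cos = dot / norm
    return (mask * np.maximum(0.0, margin - cos)).sum()
```

Perfectly aligned flows give zero loss; opposing flows incur the full margin-plus-one penalty at each annotated pixel.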

Figure 17. Appearance specification using image patches. The output resolution is . Input photo: Pixabay/

6.4.3. Appearance control using image patches

We offer a means to specify the appearance at specific positions and frames using image patches. From a map containing the placed patches, an optimum latent code is obtained similarly to Equation (12):

    ẑ = argmin_z ‖ M ⊙ (A − T(I, z)) ‖₁,    (13)
Figure 17 shows the results obtained via latent-space interpolation between the code of the input image and the optimized code. We can further change the appearance by placing multiple appearance patches, as shown in the right image in the middle row. Cyclic animations can also be created via interpolation between the final and input images, as demonstrated in the 300th to 400th frames.

6.5. Ablation Study

We conducted an ablation study to investigate the effectiveness of our loss functions. Figure 18 shows comparisons between the frames generated with and without each of the loss functions. Without the motion regularization term, the resultant flow fields in the third row are often inconsistent with the scene structure due to overfitting. For the same reason, the lack of the appearance regularization term also causes noticeable artifacts (see our supplemental video) in the generated frames in the sixth row. Moreover, without the loss function for learning spatial color distributions, the appearances vary uniformly, as shown in the fifth row, and the partially-reddish sky due to sunset is not sufficiently reproduced. In contrast, we can see that the frames generated with the full losses in the second and fourth rows are more stable than the others.

"Direct" in Figure 18 means that the output images were inferred directly without color transfer functions. For this, we used a CNN with the same architecture as the appearance predictor except for its three-channel output. Although this CNN was trained with the full losses (with the TV loss applied to the network outputs), the results are less natural than the others.

Figure 18. Ablation study of loss functions. The input image is the same as that in Figure 9. “Direct” means direct inference of output frames without color transfer. The output resolution is .

6.6. Discussion

Do latent codes need regularization?

To make the search for latent codes in the latent space more stable, an additional training option is to adopt the regularizers used by the Variational Auto-Encoder (VAE) [Kingma and Welling, 2013] and the Wasserstein Auto-Encoder (WAE) [Tolstikhin et al., 2017]. Whereas we regard the direct use of stored latent codes as the default choice because they yield plausible results, VAE and WAE allow us to select latent codes from a regularized latent space without referring to the codebook. In Figure 19, the predictors trained with these regularizers generated the results using latent codes sampled from a Gaussian distribution. In particular, the WAE regularizer is more effective than the VAE regularizer for generating varied outputs because the latent codes of different examples can stay far away from each other. In contrast, the models trained without these regularizers failed to generate plausible results from Gaussian latent codes. Although such regularization is not essential in our case (we rely on the codebook, and without it, how to select appropriate latent-code sequences for appearance from the regularized space is unclear), it might be useful for future applications.
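The two regularizers contrasted above can be sketched as follows (illustrative, stand-alone implementations of the standard VAE KL term and a WAE-style RBF-kernel MMD; function names and the single-bandwidth kernel are our simplifications):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """VAE regularizer: KL(N(mu, sigma^2) || N(0, I)), summed over
    dimensions. It pulls every code individually toward the prior."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def mmd_rbf(z, z_prior, sigma=1.0):
    """WAE regularizer: squared MMD with an RBF kernel between encoded
    codes and prior samples. It matches the batch distribution as a
    whole, so individual codes can stay far apart from each other."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return (k(z, z).mean() + k(z_prior, z_prior).mean()
            - 2 * k(z, z_prior).mean())
```

The batch-level matching of MMD (versus the per-sample KL penalty) is why the WAE variant preserves more diversity among latent codes.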

Figure 19. Comparison with and without the regularizations used in VAE and WAE. The appearance and motion (insets) in each column are inferred using the same latent code randomly sampled from a Gaussian distribution. The input image is the same as that in Figure 1.
Figure 20. Failure cases. Very long-term prediction causes distortions (red circles in the top right), and non-uniform motions are difficult to reproduce (bottom rows). The output resolutions are (top) and (bottom), respectively. Input photos: Race The World (jZOLRAIUW2s)/ and Justin Leibow/

Although our motion predictor can generate an unlimited number of frames, very long-term prediction causes unnatural distortions because predicted frames are reconstructed only from an input image. The first row in Figure 20 shows an example, where the clouds in the 500th predicted frame are unnaturally stretched and the border of the road is deformed, compared to those in the 200th frame. Also, our method erroneously generates uni-directional motion even for objects that should exhibit scattered motions such as splashes, as shown in the third and fourth rows in Figure 20. There is still room for improvement in handling specific targets; cloud motions sometimes look unnatural, and mirror images of the sky on the water surface do not move synchronously in our results. When the user controls cloud motions (Section 6.4.2), the reflected motions on water surfaces are also changed but do not necessarily move consistently with the clouds. These artifacts might be alleviated by introducing specialized loss function(s) (e.g., physically- or geometrically-inspired loss functions for clouds and mirror images) and training data for each target.

There is also a trade-off between the diversity of output videos and the generalization ability of the models. To handle more various motions and appearances, a straightforward solution is to reduce the regularization weights while restricting unnatural deformations and artifacts to a tolerable level.

We mainly focus on landscape animations, especially of skies and waters, and leave animations in which something new appears (e.g., flower florescence or building construction) outside our scope. Nonetheless, we believe that our scope covers a wide variety of landscape videos and that our motion predictor can also handle other types of motions (e.g., crowd motions seen from a distance) that are well described by flow fields.

7. Conclusion

This paper has presented a method that can create a plausible, high-resolution animation from a single scenery image. We demonstrated the effectiveness of our method by qualitatively and quantitatively comparing it with not only state-of-the-art video prediction models but also other appearance manipulation methods. To the best of our knowledge, it is unprecedented to synthesize high-resolution videos with separate control over motion and appearance variations. This was accomplished by self-supervised, decoupled learning and latent-space interpolation. Our method can generate higher-resolution images and longer sequences than previous methods. This advantage comes from indirect image synthesis using intermediate maps predicted via training with regularization, rather than from directly generating output frames as done in previous methods. The output sequences can also be controlled using latent codes extracted during training, which can be specified not only directly from the codebook but also indirectly via simple annotations.

One future direction is to improve the proposed model to create higher-quality animations. For example, additional information for semantic segmentation might be helpful for improving the performance, as done in existing style transfer methods [Luan et al., 2017]. Our current method does not adopt this approach to avoid the influence of segmentation error. Occlusion information [Wang et al., 2018] could be incorporated explicitly into training of the motion predictor. We believe that our work has taken a significant step in single-image synthesis of videos and will inspire successive work for diverse animations.

The authors would like to thank the anonymous reviewers for giving insightful comments. Many thanks also go to Dr. Kyosuke Nishida for discussion and encouragement. This work was supported by JSPS KAKENHI Grant Number 17K12689.


  • M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2018) Stochastic variational video prediction. In ICLR 2018. Cited by: §2.3.
  • J. Bai, A. Agarwala, M. Agrawala, and R. Ramamoorthi (2012) Selectively de-animating video. ACM Trans. Graph. 31 (4), pp. 66:1–66:10. Cited by: §1.
  • W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos (2018) ContextVP: fully context-aware video prediction. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI, pp. 781–797. Cited by: §2.3.
  • Y. Chuang, D. B. Goldman, K. C. Zheng, B. Curless, D. Salesin, and R. Szeliski (2005) Animating pictures with stochastic motion textures. ACM Trans. Graph. 24 (3), pp. 853–860. Cited by: §2.3.
  • E. L. Denton and V. Birodkar (2017) Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 4417–4426. Cited by: Appendix C, §2.3.
  • A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox (2015) FlowNet: learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2758–2766. Cited by: §2.1.
  • R. Gao, B. Xiong, and K. Grauman (2017) Im2Flow: motion hallucination from static images for action recognition. CoRR abs/1712.04109. External Links: 1712.04109 Cited by: §2.1, §4.1.
  • L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2414–2423. Cited by: §2.2, §6.4.2.
  • Z. Hao, X. Huang, and S. J. Belongie (2018) Controllable video generation with sparse trajectories. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7854–7863. Cited by: §2.3.
  • J. Hays and A. A. Efros (2007) Scene completion using millions of photographs. ACM Trans. Graph. 26 (3), pp. 4. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), pp. 1904–1916. Cited by: §4.2.
  • J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pp. 694–711. Cited by: §2.2, §4.2.
  • L. Karacan, Z. Akata, A. Erdem, and E. Erdem (2018) Manipulating attributes of natural scenes via hallucination. CoRR abs/1808.07413. External Links: 1808.07413 Cited by: §2.2, Figure 12, §6.3.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR 2018, Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: 1412.6980 Cited by: §4.3, §6.4.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. CoRR abs/1312.6114. External Links: Link, 1312.6114 Cited by: §6.6.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pp. 1106–1114. Cited by: §6.2.
  • P. Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays (2014) Transient attributes for high-level understanding and editing of outdoor scenes. ACM Trans. Graph. 33 (4), pp. 149:1–149:11. Cited by: Figure 23, Appendix C, §1, §2.3.
  • Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2017) Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 385–395. Cited by: §2.2, §6.3.
  • Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2018) Flow-grounded spatial-temporal video prediction from still images. In European Conference on Computer Vision, Cited by: Appendix C, Appendix D, §1, §1, §2.3, §4.1, Figure 8, Figure 9, §6.2, §6.2.
  • J. Liao, M. Finch, and H. Hoppe (2015) Fast computation of seamless video loops. ACM Trans. Graph. 34 (6), pp. 197:1–197:10. Cited by: §5.
  • Z. Liao, N. Joshi, and H. Hoppe (2013) Automated video looping with progressive dynamism. ACM Trans. Graph. 32 (4), pp. 77:1–77:10. Cited by: §1.
  • W. Lotter, G. Kreiman, and D. D. Cox (2017) Deep predictive coding networks for video prediction and unsupervised learning. In ICLR 2017. Cited by: §2.3.
  • F. Luan, S. Paris, E. Shechtman, and K. Bala (2017) Deep photo style transfer. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6997–7005. Cited by: §2.2, §2.2, §6.3, §7.
  • R. Martin-Brualla, D. Gallup, and S. M. Seitz (2015) Time-lapse mining from internet photos. ACM Trans. Graph. 34 (4), pp. 62:1–62:8. Cited by: §1.
  • M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. In ICLR 2016. Cited by: §2.3.
  • R. Mechrez, E. Shechtman, and L. Zelnik-Manor (2017) Photorealistic style transfer with screened poisson equation. In British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017, Cited by: §6.3.
  • T. Oh, K. Joo, N. Joshi, B. Wang, I. S. Kweon, and S. B. Kang (2017) Personalized cinemagraphs using semantic understanding and collaborative learning. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5170–5179. Cited by: §1.
  • M. Okabe, K. Anjyo, and R. Onai (2011) Creating fluid animation from a single image using video database. Comput. Graph. Forum 30 (7), pp. 1973–1982. Cited by: §2.3.
  • M. Okabe, K. Anjyo, T. Igarashi, and H. Seidel (2009) Animating pictures of fluid using video examples. Comput. Graph. Forum 28 (2), pp. 677–686. Cited by: §1, §2.3.
  • M. Okabe, Y. Dobashi, and K. Anjyo (2018) Animating pictures of water scenes using video retrieval. The Visual Computer 34 (3), pp. 347–358. Cited by: §2.3.
  • E. Prashnani, M. Noorkami, D. Vaquero, and P. Sen (2017) A phase-based approach for animating images using video examples. Comput. Graph. Forum 36 (6), pp. 303–311. Cited by: §1, §2.3.
  • A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 2720–2729. Cited by: §2.1, §6.2.
  • M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014) Video (language) modeling: a baseline for generative models of natural videos. CoRR abs/1412.6604. External Links: 1412.6604 Cited by: §2.3.
  • E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley (2001) Color transfer between images. IEEE Computer Graphics and Applications 21 (5), pp. 34–41. Cited by: §2.2.
  • Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha (2017) Unsupervised deep learning for optical flow estimation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 1495–1501. Cited by: §2.1.
  • J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid (2016) DeepMatching: hierarchical deformable dense matching. International Journal of Computer Vision 120 (3), pp. 300–323. Cited by: §2.1.
  • O. Ronneberger, P.Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS, Vol. 9351, pp. 234–241. Note: (available on arXiv:1505.04597 [cs.CV]) External Links: Link Cited by: §4.3.
  • A. Schödl, R. Szeliski, D. Salesin, and I. A. Essa (2000) Video textures. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2000, New Orleans, LA, USA, July 23-28, 2000, pp. 489–498. Cited by: §5.
  • C. Schüldt, I. Laptev, and B. Caputo (2004) Recognizing human actions: A local SVM approach. In 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK, August 23-26, 2004., pp. 32–36. Cited by: Figure 22, Appendix C.
  • X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 802–810. Cited by: §3.
  • Y. Shih, S. Paris, F. Durand, and W. T. Freeman (2013) Data-driven hallucination of different times of day from a single outdoor photo. ACM Trans. Graph. 32 (6), pp. 200:1–200:11. Cited by: Figure 21, §1, §1, §2.3, §2.3, Figure 11, §6.1, §6.3.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §4.2.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised learning of video representations using lstms. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 843–852. Cited by: §2.3.
  • Y. Tai, J. Jia, and C. Tang (2005) Local color transfer via probabilistic segmentation by expectation-maximization. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pp. 747–754. Cited by: §2.2, §6.3.
  • I. O. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf (2017) Wasserstein auto-encoders. CoRR abs/1711.01558. External Links: Link, 1711.01558 Cited by: §6.6.
  • Y. Tsai, X. Shen, Z. Lin, K. Sunkavalli, and M. Yang (2016) Sky is not the limit: semantic-aware sky replacement. ACM Trans. Graph. 35 (4), pp. 149:1–149:11. Cited by: §1, §2.2.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 613–621. Cited by: §1, §2.3.
  • J. Walker, A. Gupta, and M. Hebert (2015) Dense optical flow prediction from a static image. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2443–2451. Cited by: §2.1, §4.1.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu (2018) Occlusion aware unsupervised learning of optical flow. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §7.
  • P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid (2013) DeepFlow: large displacement optical flow with deep matching. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 1385–1392. Cited by: §2.1.
  • F. Wu, W. Dong, Y. Kong, X. Mei, J. Paul, and X. Zhang (2013) Content-based colour transfer. Comput. Graph. Forum 32 (1), pp. 190–203. Cited by: §2.2.
  • W. Xiong, W. Luo, L. Ma, W. Liu, and J. Luo (2018) Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, Cited by: Appendix D, §1, §1, §2.3, Figure 8, Figure 9, §6.1, §6.2, §6.2.
  • T. Xue, J. Wu, K. L. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 91–99. Cited by: Appendix C, §2.3.
  • R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 10, §6.2.
  • Y. Zhou and T. L. Berg (2016) Learning temporal transformations from time-lapse videos. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pp. 262–277. Cited by: §2.3.
  • J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 465–476. Cited by: §4.1, §4.3.

Appendix A Implementation Details of Our DNNs

A.1. Network Architectures

Components     Layers          Specifications
Downsampling   conv1           Concat(z), Conv2D(C(128), K(5), S(2), P(2)), LeakyReLU(0.1)
               conv2           Concat(z), Conv2D(C(256), K(3), S(2), P(1)), InstanceNorm(256), LeakyReLU(0.1)
               conv3           Concat(z), Conv2D(C(512), K(3), S(2), P(1)), InstanceNorm(512), LeakyReLU(0.1)
Residual       res1, …, res5   ResBlock2D(C(512), K(3), S(1), P(1))
Upsampling     upconv1         Concat(conv3), Upsample(2), Conv2D(C(256), K(3), S(1), P(1)), InstanceNorm(256), LeakyReLU(0.1)
               upconv2         Concat(conv2), Upsample(2), Conv2D(C(128), K(3), S(1), P(1)), InstanceNorm(128), LeakyReLU(0.1)
               upconv3         Concat(conv1), Upsample(2), Conv2D(C(3) or C(6), K(5), S(1), P(2))
Table 1. Network architecture of the motion predictor and appearance predictor. Concat(z) denotes concatenation of the latent code z with each pixel of the input feature maps. For 2D convolutional layers and residual blocks, C is the number of channels, K is the kernel width and height, S is the stride, and P is the padding. Upsample(2) magnifies input feature maps twice using nearest-neighbor interpolation.
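As a sanity check on the shapes in Table 1, the three stride-2 downsampling convolutions reduce each spatial dimension by a factor of eight, which the three Upsample(2) stages exactly undo. The following small helper (a sketch for illustration, not part of the authors' code) traces one spatial dimension through conv1–conv3 using the standard convolution output-size formula:

```python
def conv2d_out(size, kernel, stride, padding):
    """Spatial output size of a 2D convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size):
    """Trace one spatial dimension through conv1-conv3 of Table 1."""
    s1 = conv2d_out(size, kernel=5, stride=2, padding=2)  # conv1
    s2 = conv2d_out(s1, kernel=3, stride=2, padding=1)    # conv2
    s3 = conv2d_out(s2, kernel=3, stride=2, padding=1)    # conv3
    return s1, s2, s3
```

For example, a 256-pixel dimension passes through 128, 64, and 32 pixels, and the three nearest-neighbor Upsample(2) stages restore it to 256, so skip connections from conv1–conv3 can be concatenated at matching resolutions.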

Component   Layers    Specifications
Encoder     conv1     Conv2D(C(64), K(4), S(2), P(1))
            res1      ResBlock2D(C(128), K(3), S(1), P(1))
            res2      ResBlock2D(C(192), K(3), S(1), P(1))
            res3      ResBlock2D(C(256), K(3), S(1), P(1))
            pooling   LeakyReLU(0.2), AvgPool2D(8, 8)
            fc        Linear(8)
Table 2. Network architecture of the motion encoder and the appearance encoder. Notations are the same as those in Table 1.

Table 1 summarizes the architecture of our motion and appearance predictors, and Table 2 shows that of our motion and appearance encoders.

A.2. Training Algorithms

The training procedures are summarized in Algorithms 1 and 2. The motion and appearance predictors are trained using their respective time-lapse video datasets, each of which contains video clips consisting of a time series of frames. Note that, in the training of our motion predictor (Algorithm 1), each minibatch uses frames only from a specific set of video clips randomly selected in each epoch, and a latent code is learned and saved for each training video.
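In the training of the appearance predictor, the ColorTransfer step applies the predicted per-pixel color transfer map to the input image. The exact parameterization of the map is defined in the main text; as a hedged illustration only, the sketch below assumes a hypothetical per-pixel scale-and-offset form (one plausible reading of the 6-channel upconv3 output in Table 1):

```python
import numpy as np

def apply_color_transfer(image, scale_map, offset_map):
    """Apply a per-pixel color transfer map: out = scale * image + offset.

    image, scale_map, offset_map: float arrays of shape (H, W, 3),
    with image values in [0, 1]. The result is clipped back to [0, 1].
    """
    return np.clip(scale_map * image + offset_map, 0.0, 1.0)
```

Because the map varies per pixel, it can darken the sky while warming the horizon in the same frame, which a single global color transform cannot do.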

Algorithm 1. Training of Motion Predictor
Input: motion training dataset, number of epochs
 1: for each training video do
 2:     initialize a common motion field for the video
 3: endfor
 4: for each epoch do
 5:     for each minibatch do
 6:         for each video clip in the minibatch do
 7:             sample training frames from the clip (RandomSample)
 8:             encode the latent code and predict a flow field
 9:             warp the input frame with the predicted flow field (Warp)
10:             accumulate the losses
11:         endfor
12:         update the predictor and encoder parameters (Optimize)
13:     endfor
14: endfor

Algorithm 2. Training of Appearance Predictor
Input: appearance training dataset, number of epochs
 1: for each epoch do
 2:     for each minibatch do
 3:         for each video clip in the minibatch do
 4:             sample training frames from the clip (RandomSample)
 5:             encode the latent code and predict a color transfer map
 6:             apply the predicted map to the input frame (ColorTransfer)
 7:             accumulate the losses
 8:         endfor
 9:         update the predictor and encoder parameters (Optimize)
10:     endfor
11: endfor
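The Warp step applies the predicted flow field to the input image. Crucially, at inference time each output frame is produced by warping the input image once, rather than warping the previous output recurrently, which is what prevents error accumulation in long-term prediction. A minimal sketch (our illustration, using nearest-neighbor sampling where the actual implementation would typically use bilinear sampling):

```python
import numpy as np

def backward_warp(image, flow):
    """Warp `image` (H, W, C) with a dense flow field `flow` (H, W, 2).

    Each output pixel (x, y) is sampled from (x + u, y + v) in the input
    image, with source coordinates rounded to the nearest pixel and
    clamped to the image border.
    """
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]
```

With a zero flow field this is the identity; a constant flow translates the whole image, and a spatially-smooth flow yields the cloud-like deformations described in the main text.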

Appendix B Latent code prediction using LSTM

To generate latent code sequences for appearance without using the codebook in the inference phase, we can use a simple LSTM neural network that predicts future latent codes recurrently. The LSTM model is trained in advance using the latent code sequences in the codebook. In the inference phase, the first latent code is encoded by the appearance encoder, and successive latent codes are predicted recurrently by the LSTM model. The network architecture is shown in Table 3. Figure 21 shows the resultant video sequences with latent code sequences predicted only from the input images shown on the left.
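The recurrent prediction loop can be sketched as follows. This is an illustrative NumPy implementation of a single-layer LSTM with generic weight shapes, not the authors' code; in the actual model (Table 3) the latent codes are 8-dimensional and the hidden state has 128 units, with linear layers on the input and output:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. x: (d_in,); h, c: (d_h,); W: (4*d_h, d_in);
    U: (4*d_h, d_h); b: (4*d_h,). Gates are ordered [i, f, o, g]."""
    d_h = h.shape[0]
    z = W @ x + U @ h + b
    i = 1.0 / (1.0 + np.exp(-z[:d_h]))           # input gate
    f = 1.0 / (1.0 + np.exp(-z[d_h:2 * d_h]))    # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * d_h:3 * d_h]))  # output gate
    g = np.tanh(z[3 * d_h:])                     # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def predict_latent_sequence(z0, n_steps, W_in, lstm_params, W_out):
    """Recurrently predict future latent codes from the first code z0,
    feeding each predicted code back as the next input."""
    W, U, b = lstm_params
    d_h = U.shape[1]
    h, c = np.zeros(d_h), np.zeros(d_h)
    codes, z = [z0], z0
    for _ in range(n_steps):
        h, c = lstm_step(W_in @ z, h, c, W, U, b)
        z = W_out @ h  # project the hidden state back to a latent code
        codes.append(z)
    return codes
```

Feeding predictions back as inputs is what lets the model extrapolate an arbitrarily long code sequence from a single encoded frame, at the cost of the usual drift of recurrent prediction; here that drift only affects the smooth latent trajectory, not pixel values directly.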

Figure 21. Appearance predictions only from input images. In each row, the latent code for the first frame is encoded using the appearance encoder, and successive latent codes are predicted recurrently by the LSTM model. Various appearance transitions are reproduced from different latent code sequences, each of which varies smoothly via latent-space interpolation. From top to bottom, the output resolutions are , , and . Input photos: DC Snapshots/, Domenico Loia/, and Shih et al. [2013].

Appendix C Generalizability

Whereas most of our results contain cloud-like motion and one-day appearance transition simply because time-lapse videos in available datasets typically capture such scenes, we further investigated the generalizability of our method.

Component    Layers   Specifications
LSTM model   fc       Linear(128)
             lstm     LSTM(128)
             fc       Linear(8)
Table 3. Network architecture of the LSTM model for predicting latent codes for appearance.

Figure 22 compares gait motions generated from the KTH dataset [Schüldt et al., 2004]. Our method yields more plausible results than [Xue et al., 2016; Denton and Birodkar, 2017]. The quality of our first-half frames is comparable to that of the state-of-the-art [Li et al., 2018]. Meanwhile, our latter-half frames indicate that there is room for improvement in predicting which leg moves forward next after the leg-crossed pose. Modeling long-term dependency to handle such situations is left for future work.

Figure 22. Comparison of gait motions generated from the KTH dataset [Schüldt et al., 2004].

Figure 23 compares season transitions into winter, generated from the transient attribute dataset [Laffont et al., 2014]. Our transition sequences contain wider variations in spatially-local appearance and are more faithful to the target images than those of Photoshop Match Color, even at different times of day.

Figure 23. Comparison of season transitions into winter, generated from the transient attribute dataset [Laffont et al., 2014]. Input photo: Kevin Jarrett/

Appendix D User study

We conducted a user study to validate the subjective plausibility of our results. We compared our method with commercial software (Plotagraph, After Effects, and Photoshop) that requires manual annotations (e.g., static and movable regions plus fine flow directions), as well as with the previous methods [Xiong et al., 2018; Li et al., 2018]. We used 20 different scenes (ten for comparisons with the previous methods and the other ten for commercial software). For fair comparisons with the previous methods [Xiong et al., 2018; Li et al., 2018], we looped their results in the same way as ours and minified our results to the same size () as theirs. For comparisons with commercial software, we collected manually-created animations from YouTube and Vimeo and generated our results from the same input images. Because the collected animations do not contain appearance transition, we created two more results containing only appearance transition using Photoshop Match Color. The evaluation criteria are i) plausibility w.r.t. motion and appearance transition for motion-added animations, and ii) faithfulness to the target images for appearance-only animations (i.e., comparisons with Photoshop Match Color); appearance-only results are highly plausible with all methods, so we omitted the plausibility criterion for them. We asked 11 subjects to score video clips on a 1-to-4 scale ranging from “implausible (or unfaithful)” to “plausible (or faithful)” after watching each clip only once. The movie used in this user study is submitted as supplemental material.

Figure 24. Statistics of our user study. The graphs indicate that our method yields more plausible results than the previous methods and commercial software. The error bars represent standard errors. The results marked with * show statistically-significant differences (paired t-test).

Figure 24 summarizes the statistics of the user study. The graphs in Figures 24 (a) and (b) indicate that our method significantly outperforms the previous methods in terms of plausibility of motion and appearance. Figure 24 (c) shows that our motion scores are slightly better than those of animations manually created using commercial software. Figure 24 (d) shows that our method reproduces target styles more faithfully and handles wider variations in appearance than commercial software, as demonstrated in Figures 12, 13, and 23.