Painting Many Pasts: Synthesizing Time Lapse Videos of Paintings

01/04/2020 ∙ by Amy Zhao, et al. ∙ MIT

We introduce a new video synthesis task: synthesizing time lapse videos depicting how a given painting might have been created. Artists paint using unique combinations of brushes, strokes, colors, and layers. There are often many possible ways to create a given painting. Our goal is to learn to capture this rich range of possibilities. Creating distributions of long-term videos is a challenge for learning-based video synthesis methods. We present a probabilistic model that, given a single image of a completed painting, recurrently synthesizes steps of the painting process. We implement this model as a convolutional neural network, and introduce a training scheme to facilitate learning from a limited dataset of painting time lapses. We demonstrate that this model can be used to sample many time steps, enabling long-term stochastic video synthesis. We evaluate our method on digital and watercolor paintings collected from video websites, and show that human raters find our synthesized videos to be similar to time lapses produced by real artists.




1 Introduction

Aspiring artists often learn their craft by following step-by-step tutorials. There are many tutorials that describe how to paint common scenes, but how does one learn to create novel pieces without such guidance? For instance, how might an artist paint a unique fantasy landscape, or mimic the striking style of Paul Cezanne? We present a new video synthesis problem: given a completed painting, can we synthesize a time lapse video depicting how an artist might have painted it?

Figure 1: We present a probabilistic model for synthesizing time lapse videos of paintings. We demonstrate our model on Still Life with a Watermelon and Pomegranates by Paul Cezanne (top), and Wheat Field with Cypresses by Vincent van Gogh (bottom).

Artistic time lapses present many challenges for video synthesis methods. There is a great deal of variation in how people create art. Suppose two artists are asked to paint the same landscape. One artist might start with the sky, while the other might start with the mountains in the distance. One might finish each object before moving onto the next, while the other might work a little at a time on each object. During the painting process, there are often few visual cues indicating where the artist will apply the next stroke. The painting process is also long, often spanning hundreds of paint strokes and dozens of minutes.

In this work, we present a solution to the painting time lapse synthesis problem. We begin by defining the problem and describing its unique challenges. We then derive a principled, learning-based model to capture a distribution of steps that a human might use to create a given painting. We introduce a training scheme that encourages the method to produce realistic changes over many time steps. We demonstrate that our model can learn to solve this task, even when trained using a small, noisy dataset of painting time lapses collected from the web. We show that human evaluators almost always prefer our method to an existing video synthesis baseline, and often find our results indistinguishable from time lapses produced by real artists.

This work presents several technical contributions:

  1. We demonstrate the use of a probabilistic model to capture stochastic decisions made by artists, thereby capturing a distribution of plausible ways to create a painting.

  2. Unlike work in future frame prediction or frame interpolation, we synthesize long-term videos spanning dozens of time steps and many real-time minutes.

  3. We demonstrate a model that successfully learns from painting time lapses “from the wild.” This data is small and noisy, having been collected from uncontrolled environments with variable lighting, spatial resolution and video capture rates.

2 Related work

To the best of our knowledge, this is the first work that models and synthesizes distributions of videos of the past, given a single final frame. The most similar work to ours is a recent method called visual deprojection [5]. Given a single input image depicting a temporal aggregation of frames, their model captures a distribution of videos that could have produced that image. We compare our method to theirs in our experiments. Here, we review additional related literature in three main areas: video prediction, interpolation, and art synthesis.

2.1 Future frame prediction

Future video frame prediction is the problem of predicting the next frame or few frames of a video, given a sequence of past frames. Early work in this area focused on predicting motion trajectories [8, 16, 34, 50, 54] or synthesizing motions in small frames [40, 41, 49]. Recent methods train convolutional neural networks on large video datasets to synthesize videos of natural scenes and human actions [35, 38, 45, 51, 52]. Zhou et al. synthesize time lapse videos, but output only a few frames depicting specific physical processes: melting, rotting, or flowers blooming [68].

Our problem differs from these works in several key ways. First, most future frame prediction methods focus on short time scales, synthesizing frames on the order of seconds into the future, and encompassing relatively small changes. In contrast, painting time lapses span minutes or even hours, and depict dramatic content changes over time. Second, most future frame predictors output a single most likely sequence, making them ill-suited for capturing a variety of very different plausible painting trajectories. One study [62] uses a conditional variational autoencoder to model a distribution of plausible future frames of moving humans. We build upon these ideas to model paint strokes across multiple time steps. Finally, future frame prediction methods focus on natural videos, which depict motions of people and objects [51, 52, 62] or physical processes [68]. The input frames often contain visual cues about how the motion, action or physical process will progress, limiting the space of possibilities that must be captured. In contrast, snapshots of paintings provide few visual cues, leading to many plausible trajectories.

2.2 Frame interpolation

Our problem can be thought of as a long-term frame interpolation task between a blank canvas and a completed work of art, with many possible painting trajectories between them. In frame interpolation, the goal is to temporally interpolate between two frames in time. Classical approaches focus on natural videos, and estimate dense flow fields [4, 57, 64] or phase [39] to guide the interpolation process. More recent methods use convolutional neural networks to directly synthesize the interpolated frame [44], or combine flow fields with estimates of scene information [28, 43]. Most frame interpolation methods predict a single or a few intermediate frames, and are not easily extended to predicting long sequences, or predicting distributions of sequences.

2.3 Art synthesis

The graphics community has long been interested in simulating physically realistic paint strokes in digital media. Many existing methods focus on physics-based models of fluids or brush bristles [6, 7, 9, 12, 56, 61]. More recent learning-based methods leverage datasets of real paint strokes [31, 36, 67], often posing the artistic stroke synthesis problem as a texture transfer or style transfer problem [3, 37]. Several works focus on simulating watercolor-specific effects such as edge darkening [42, 55]. We focus on capturing large-scale, long-term painting processes, rather than fine-scale details of individual paint strokes.

In style transfer, images are transformed to simulate a specific style, such as a painting-like style [20, 21] or a cartoon-like style [66]. More recently, neural networks have been used for generalized artistic style transfer [18, 69]. We leverage insights from these methods to synthesize realistic progressions of paintings.

Several recent works use reinforcement learning by first designing parameterized brush strokes, and then training an agent to apply strokes to produce a given painting [17, 22, 26, 27, 58, 59]. Some works focus on specific artistic tasks such as hatching or other repetitive strokes [29, 60]. These approaches require careful hand-engineering, and are not optimized to produce varied or realistic painting progressions. In contrast, we learn a broad set of effects from real painting time lapse data.

3 Problem overview

Figure 2: Several real painting progressions of similar-looking scenes. Each artist fills in the house, sky and field in a different order.

Given a completed painting, our goal is to synthesize different ways that an artist might have created it. We work with recordings of digital and watercolor painting time lapses collected from video websites. Compared to natural videos of scenes and human actions, videos of paintings present unique challenges.

High Variability
  1. Painting trajectories: Even for the same scene, different artists will likely paint objects in different temporal orders (Figure 2).

  2. Painting rates: Artists work at different speeds, and apply paint in different amounts.

  3. Scales and shapes: Over the course of a painting, artists use strokes that vary in size and shape. Artists often use broad strokes early on, and add fine details later.

  4. Data availability: Due to the limited number of available videos in the wild, it is challenging to gather a dataset that captures the aforementioned types of variability.

Medium-specific challenges
  1. Non-paint effects: Tools that apply local blurring, smudging, or specialized paint brush shapes are common in digital art applications such as Procreate [23]. Artists can also apply global effects simulating varied lighting or tones.

  2. Erasing effects: In digital art programs, artists can erase or undo past actions, as shown in Figure 3.

  3. Physical effects in watercolor paintings: Watercolor painting videos exhibit distinctive effects resulting from the physical interaction of paint, water, and paper. These effects include specular lighting on wet paint, pigments fading as they dry, and water spreading from the point of contact with the brush (Figure 4).

In this work, we design a learning-based model to handle the challenges of high variability and painting medium-specific effects.

Figure 3: Example digital painting sequences. These sequences show a variety of ways to add paint, including fine strokes and filling (row 1), and broad strokes (row 3). We use red boxes to outline challenges, including erasing (row 2) and drastic changes in color and composition (row 3).
Figure 4: Example watercolor painting sequences. The outlined areas highlight some watercolor-specific challenges, including changes in lighting (row 1), diffusion and fading effects as paint dries (row 2), and specular effects on wet paint (row 3).

4 Method

We begin by formalizing the time lapse video synthesis problem. Given a painting $x_T$, our task is to synthesize the past frames $x_1, \ldots, x_{T-1}$. Suppose we have a training set of real time lapse videos $\{x^{(i)} = x^{(i)}_1, \ldots, x^{(i)}_{T^{(i)}}\}$. We first define a principled probabilistic model, and then learn its parameters using these videos. At test time, given a completed painting, we sample from the model to create new videos that show realistic-looking painting processes.

4.1 Model

We propose a probabilistic, temporally recurrent paint strokes model. At each time instance $t$, the model predicts a pixel-wise intensity change $\delta_t$ that should be added to the previous frame to produce the current frame; that is, $x_t = x_{t-1} + \delta_t$. This change does not necessarily correspond to a single paint stroke: it could represent one or multiple physical or digital paint strokes, or other effects such as erasing or fading.

We model $\delta_t$ as being generated from a random latent variable $z_t$, the completed piece $x_T$, and the image content at the previous time step $x_{t-1}$; the likelihood is $p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)$. Using a random variable $z_t$ helps to capture the stochastic nature of painting. Using both $x_T$ and $x_{t-1}$ enables the model to capture time-varying effects such as the progression of coarse to fine brush sizes, while the Markovian assumption facilitates learning from a small number of video examples.

Figure 5: We model the change $\delta_t$ at each time step as being generated from the latent variable $z_t$. Circles represent random variables; the shaded circle denotes a variable that is observed at inference time. The rounded rectangle represents model parameters.
Figure 6: We implement our model using a conditional variational autoencoder framework. At training time, the network is encouraged to reconstruct the current frame $x_t$, while sampling the latent $z_t$ from a distribution that is close to the standard normal. At test time, the auto-encoding branch is removed, and $z_t$ is sampled from the standard normal. We use the shorthand $\hat{\delta}_t = g_\theta(z_t, x_{t-1}, x_T)$.

It is common to define such image likelihoods as a per-pixel normal distribution, which results in an L2 image similarity loss term in maximum likelihood formulations [33]. In synthesis tasks, using L2 loss often produces blurry results [24]. We instead optimize both the L1 distance in pixel space and the L2 distance in a perceptual feature space. Perceptual losses are commonly used in image synthesis and style transfer tasks to produce sharper and more visually pleasing results [14, 24, 30, 44, 65]. We use the L2 distance between normalized VGG features [48] as described in [65]. We let the likelihood take the form:

$$p_\theta(\delta_t \mid z_t, x_{t-1}; x_T) \propto e^{-\frac{1}{\sigma_1^2} \lvert \delta_t - \hat{\delta}_t \rvert} \, \mathcal{N}\!\left(V(x_{t-1} + \delta_t);\; V(x_{t-1} + \hat{\delta}_t),\; \sigma_2^2 I\right), \tag{1}$$

where $\hat{\delta}_t = g_\theta(z_t, x_{t-1}, x_T)$, $g_\theta(\cdot)$ represents a function parameterized by $\theta$, $V(\cdot)$ is a function that extracts normalized VGG features, and $\sigma_1, \sigma_2$ are fixed noise parameters.

We assume the latent variable $z_t$ is generated from the multivariate standard normal distribution:

$$p(z_t) = \mathcal{N}(z_t;\; 0, I). \tag{2}$$
We show a diagram of this model in Figure 5.

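We aim to find model parameters $\theta$ that best explain all videos in our dataset:

$$\arg\max_\theta \prod_i \prod_t p_\theta\!\left(\delta^{(i)}_t \mid x^{(i)}_{t-1}; x^{(i)}_{T^{(i)}}\right) = \arg\max_\theta \prod_i \prod_t \int_{z_t} p_\theta\!\left(\delta^{(i)}_t \mid z_t, x^{(i)}_{t-1}; x^{(i)}_{T^{(i)}}\right) p(z_t) \, dz_t. \tag{3}$$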
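This integral is intractable, and the posterior $p(z_t \mid \delta_t, x_{t-1}; x_T)$ is also intractable, preventing the use of the EM algorithm. We instead use variational inference and introduce an approximate posterior distribution $q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)$ [32, 62, 63]. We let this approximate distribution take the form of a multivariate normal:

$$q_\phi(z_t \mid \delta_t, x_{t-1}; x_T) = \mathcal{N}\!\left(z_t;\; \mu_\phi(\delta_t, x_{t-1}, x_T),\; \Sigma_\phi(\delta_t, x_{t-1}, x_T)\right), \tag{4}$$

where $\mu_\phi(\cdot)$ and $\Sigma_\phi(\cdot)$ are functions parameterized by $\phi$, and $\Sigma_\phi(\cdot)$ is diagonal.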

Figure 7: In sequential CVAE training, our model is trained to reconstruct a training frame (outlined in green) while building upon its previous predictions for several time steps.
Figure 8: In sequential sampling training, we use a conditional frame critic to encourage all frames sampled from our model to look realistic. The image similarity loss on the final frame encourages the model to complete the painting within a fixed number of time steps.

4.1.1 Neural network framework

We implement the functions $g_\theta$, $\mu_\phi$ and $\Sigma_\phi$ as convolutional encoder-decoders parameterized by $\theta$ and $\phi$, using a conditional variational autoencoder (CVAE) framework [53, 63]. We use an architecture similar to [63], which we summarize in Figure 6. We include full details in the appendix.

4.2 Learning

We learn model parameters using short sequences from the training video dataset, which we discuss in further detail in Section 5.1. We use two stages of optimization to facilitate convergence: pairwise optimization, and sequence optimization.

4.2.1 Pairwise optimization

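From Equations (3) and (4), we obtain an expression for each pair of consecutive frames (a derivation is provided in the appendix):

$$\log p_\theta(\delta_t \mid x_{t-1}; x_T) \geq \mathbb{E}_{z_t \sim q_\phi(z_t \mid \delta_t, x_{t-1}; x_T)}\left[\log p_\theta(\delta_t \mid z_t, x_{t-1}; x_T)\right] - KL\left[q_\phi(z_t \mid \delta_t, x_{t-1}; x_T) \,\|\, p(z_t)\right], \tag{5}$$

where $KL[\cdot \,\|\, \cdot]$ denotes the Kullback-Leibler divergence. Combining Equations (1), (2), (4), and (5), we minimize:

$$\mathcal{L}_{KL} + \frac{1}{\sigma_1^2} \mathcal{L}_{L1}(\delta_t, \hat{\delta}_t) + \frac{1}{2\sigma_2^2} \mathcal{L}_{L2}\left(V(x_{t-1} + \delta_t),\; V(x_{t-1} + \hat{\delta}_t)\right), \tag{6}$$

where $\mathcal{L}_{KL} = \frac{1}{2}\left(-\log \Sigma_\phi + \Sigma_\phi + \mu_\phi^2\right)$, $\hat{\delta}_t$ is the change predicted by the model, $V(\cdot)$ extracts normalized VGG features, and $\mathcal{L}_{L1}$, $\mathcal{L}_{L2}$ represent L1 and L2 distance respectively. We refer to the last two terms as image similarity losses.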
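The objective above combines a KL term with an L1 pixel term and an L2 perceptual term. The following is a minimal sketch of how those three terms could be combined for a single pair of frames, using flat lists of pixel intensities. The function names, the `features` argument (a stand-in for the VGG feature extractor), and the default noise parameters are assumptions made for illustration, not the authors' implementation. The KL term here is the exact closed form for a diagonal Gaussian, which differs from a form without the constant only by an additive offset.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_l2_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_loss(delta, delta_hat, x_prev, mu, var, features,
                  sigma1=1.0, sigma2=1.0):
    """KL term plus the two image similarity terms for one time step.

    `features` is a hypothetical stand-in for a perceptual feature extractor;
    sigma1 and sigma2 play the role of the fixed noise parameters.
    """
    frame = [p + d for p, d in zip(x_prev, delta)]          # real current frame
    frame_hat = [p + d for p, d in zip(x_prev, delta_hat)]  # predicted current frame
    kl = kl_to_standard_normal(mu, var)
    pixel = l1_distance(delta, delta_hat) / sigma1 ** 2
    perceptual = squared_l2_distance(features(frame), features(frame_hat)) / (2.0 * sigma2 ** 2)
    return kl + pixel + perceptual


# Toy usage: two-pixel frames and an identity "feature extractor".
identity_features = lambda frame: list(frame)
loss = pairwise_loss([0.5, 0.0], [0.0, 0.0], [0.0, 0.0],
                     [0.0, 0.0], [1.0, 1.0], identity_features)
```

In a real model the three terms would be computed on image tensors inside the training framework; the list-based version only shows how the weights on each term follow from the noise parameters.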

We optimize Equation (6) on single time steps, which we obtain by sampling all pairs of consecutive frames from the dataset. We also train the model to produce the first frame from videos that begin with a blank canvas, given a white input frame and the completed painting $x_T$. These starter sequences are important for teaching the model how to start a painting at inference time.

4.2.2 Sequence optimization

To synthesize an entire video, we run our model recurrently for multiple time steps, building upon its own predicted frames. It is common when making sequential predictions to observe compounding errors or artifacts over time [51]. We use a novel sequential training scheme to enforce that the outputs of the model are accurate and realistic over multiple time steps. We alternate between two sequential training modes.

  • Sequential CVAE training encourages sequences of frames to be well-captured by the learned distribution, by reducing the compounding of errors. Specifically, we train the model sequentially for a few frames, predicting each intermediate frame using the output of the model at the previous time step: $\hat{x}_t = \hat{x}_{t-1} + g_\theta(z_t, \hat{x}_{t-1}, x_T)$. We compare each predicted frame to the corresponding training frame using the image similarity losses in Eq. (6). We illustrate this strategy in Figure 7.

  • Sequential sampling training encourages random samples from our learned distribution to look like realistic partially-completed paintings. During inference (described below), we rely on sampling from the prior $p(z_t)$ at each time step to synthesize new videos. A limitation of the variational strategy is the limited coverage of the latent space during training [15], sometimes leading to unrealistic predictions for $z_t \sim p(z_t)$. To compensate for this, we introduce supervision on such samples by amending the reconstruction term in Equation (5) using a conditional critic loss term [19]:

    $$\mathcal{L}_{critic} = \mathbb{E}_{z_t \sim p(z_t)}\left[D_\psi(\hat{\delta}_t, \hat{x}_{t-1}, x_T)\right] - \mathbb{E}_{\delta_t}\left[D_\psi(\delta_t, x_{t-1}, x_T)\right], \tag{7}$$

    where $D_\psi(\cdot)$ is a critic function with parameters $\psi$. The critic encourages the distribution of sampled strokes $\hat{\delta}_t = g_\theta(z_t, \hat{x}_{t-1}, x_T)$ to match the distribution of training strokes $\delta_t$. We use a critic architecture based on [10] and optimize it using WGAN-GP [19].

    In addition to the critic loss, we apply the image similarity losses discussed above after a fixed number of time steps, to encourage the model to eventually produce the completed painting.
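For reference, WGAN-GP [19] trains a critic by adding a gradient penalty to the Wasserstein critic loss. The block below sketches that standard objective written for a conditional critic, here denoted $D_\psi$. The tuple notation $u$, $\hat{u}$, $\tilde{u}$, the interpolation weight $\epsilon$, and the penalty weight $\lambda$ are notation assumed for illustration, and the exact conditioning used in the paper may differ.

```latex
\begin{align*}
u &= (\delta_t,\; x_{t-1},\; x_T) && \text{(training stroke and its context)} \\
\hat{u} &= (\hat{\delta}_t,\; \hat{x}_{t-1},\; x_T) && \text{(sampled stroke and its context)} \\
\tilde{u} &= \epsilon\, u + (1 - \epsilon)\, \hat{u}, \qquad \epsilon \sim U[0, 1] \\
\mathcal{L}_{D} &= \mathbb{E}\big[D_\psi(\hat{u})\big] - \mathbb{E}\big[D_\psi(u)\big]
  + \lambda\, \mathbb{E}\Big[\big(\lVert \nabla_{\tilde{u}} D_\psi(\tilde{u}) \rVert_2 - 1\big)^2\Big]
\end{align*}
```

The critic minimizes $\mathcal{L}_{D}$, while the generator is updated to raise the critic's score on sampled strokes; the penalty term keeps the critic's gradient norm close to 1 along lines between real and sampled inputs.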

4.3 Inference: video synthesis

Given a completed painting $x_T$ and the learned model parameters, we synthesize videos by sampling from the model at each time step. Specifically, we synthesize each frame $\hat{x}_t$ using the synthesized previous frame $\hat{x}_{t-1}$ and a randomly sampled $z_t$. We start each video using $\hat{x}_0$, a blank frame.
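The sampling procedure described above can be sketched as a short loop. In this sketch, `step_fn` is a hypothetical stand-in for the trained decoder, `toy_step` is an invented toy decoder used only to make the example runnable, frames are flat lists of intensities in [0, 1], and the blank canvas is assumed to be white (all ones). None of these names come from the paper.

```python
import math
import random


def sample_video(painting, step_fn, num_steps, latent_dim=8, seed=0):
    """Recurrently sample frames, each built on the previous synthesized frame."""
    rng = random.Random(seed)
    frame = [1.0] * len(painting)  # assumed blank (white) starting canvas
    frames = [frame]
    for _ in range(num_steps):
        # Draw a fresh latent from the standard normal prior at every step.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        delta = step_fn(z, frame, painting)           # predicted intensity change
        frame = [p + d for p, d in zip(frame, delta)]  # add change to previous frame
        frames.append(frame)
    return frames


def toy_step(z, prev, target):
    """Toy decoder (illustration only): move a latent-dependent fraction toward the target."""
    rate = 0.5 + 0.1 * math.tanh(z[0])
    return [rate * (t - p) for p, t in zip(prev, target)]


video = sample_video([0.2, 0.8], toy_step, num_steps=10)
```

Because a new latent is drawn at every step, different seeds yield different trajectories toward the same completed painting, which is the behavior the recurrent formulation is meant to support.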

4.4 Implementation

We implement our model using Keras and TensorFlow [1]. We experimentally selected the hyperparameters controlling the reconstruction loss weights using the validation set.

Figure 9: Diversity of sampled videos. We show examples of our method applied to a digital (top 3 rows) and a watercolor (bottom 3 rows) painting from the test set. Our method captures diverse and plausible painting trajectories.

(a) Similarly to the artist, our method paints in a coarse-to-fine manner. Blue arrows show where our method first applies a flat color, and then adds fine details. Red arrows indicate where the baselines add fine details even in the first time step.

(b) Our method works on similar regions to the artist, although it does not use the same color layers to achieve the completed painting. Blue arrows show where our method paints similar parts of the scene to the artist (filling in the background first, and then the house, and then adding details to the background). Red arrows indicate where the baselines do not paint according to semantic boundaries, filling in both the background and the house.
Figure 10: Videos predicted from the digital (top) and watercolor (bottom) test sets. For the stochastic methods vdp and ours, we show the nearest sample to the real video out of 2000 samples. We show additional results in the appendices.

5 Experiments

5.1 Datasets

We collected recordings of painting time lapses from YouTube and Vimeo. We selected digital and watercolor paintings (which are common painting methods on these websites), and focused on landscapes or still lifes (which are common subjects for both mediums). We downloaded each video at a fixed resolution and cropped it temporally and spatially to include only the painting process (excluding any introductions or sketching that might have preceded the painting). We split each dataset in a 70:15:15 ratio into training, validation, and held-out test video sets.

Footnote 1: While we cannot host individual video files, we make our download scripts available at

  1. Digital paintings: We collected digital painting time lapses. The average duration is 4 minutes, with many videos having already been sped up by artists using the Procreate application [23]. We selected videos with minimal zooming and panning. We manually removed segments that contained movements such as translations, flipping and zooming. Figure 3 shows example video sequences.

  2. Watercolor paintings: We collected watercolor time lapses, with an average duration of 20 minutes. We only kept videos that contained minimal movement of the paper, and manually corrected any small translations of the painting. We show examples in Figure 4.

    A challenge with videos of physical paintings is the presence of the hand, paintbrush and shadows in many frames. We trained a simple convolutional neural network to identify and remove frames that contained these artifacts.

  3. Sequence extraction: We synthesize time lapses at a lower temporal resolution than real-time for computational feasibility. We extract training sequences from the raw videos at a fixed period of frames (i.e., skipping a fixed number of real frames in each synthesized time step), with a bounded maximum variance in that period. Allowing some variance in the sampling rate enables us to extract sequences at an approximate period, which is useful for (1) improving robustness to varied painting rates, and (2) extracting sequences from watercolor painting videos where many frames containing hands or paintbrushes have been removed. We select the period and its maximum variance independently for each dataset. We avoid capturing static segments of each video (e.g., when the artist is speaking) by requiring that adjacent frames in each sequence have at least a minimum fraction of the pixels changing by a fixed intensity threshold. We use a dynamic programming method to find all sequences that satisfy these criteria. We train on sequences of length 3 or 5 for sequential CVAE training, and a separate fixed length for sequential sampling training, which we determined using experiments on the validation set. For the test set, we extract a single sequence from each test video that satisfies the filtering criteria.

  4. Crop extraction: To facilitate learning from small numbers of videos, we extract multiple crops from each video. We first downsample each video spatially, so that most patches contain visually interesting content and spatial context.
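The sequence extraction criteria above (an approximate sampling period plus a minimum amount of change between adjacent frames) can be illustrated with a small sketch. This version is a greedy simplification that returns a single sequence, not the dynamic programming search over all valid sequences that the text describes, and every parameter name and value here is a hypothetical placeholder rather than a setting from the paper.

```python
def changed_fraction(frame_a, frame_b, intensity_threshold):
    """Fraction of pixels whose intensity changes by at least the threshold."""
    changed = sum(1 for a, b in zip(frame_a, frame_b)
                  if abs(a - b) >= intensity_threshold)
    return changed / len(frame_a)


def extract_sequence(frames, start, period, max_offset, length,
                     min_fraction, intensity_threshold):
    """Greedily pick frame indices roughly `period` apart that each show enough change.

    Returns a list of indices, or None if no valid sequence exists from `start`.
    """
    seq = [start]
    while len(seq) < length:
        target = seq[-1] + period
        lo = max(seq[-1] + 1, target - max_offset)
        hi = min(len(frames), target + max_offset + 1)
        # Prefer candidates closest to the nominal period.
        candidates = sorted(range(lo, hi), key=lambda i: abs(i - target))
        chosen = next(
            (i for i in candidates
             if changed_fraction(frames[seq[-1]], frames[i],
                                 intensity_threshold) >= min_fraction),
            None,
        )
        if chosen is None:
            return None  # static segment or end of video: reject this start
        seq.append(chosen)
    return seq


# Toy video: ten 4-pixel frames whose intensity increases by 1 each frame.
toy_frames = [[i] * 4 for i in range(10)]
indices = extract_sequence(toy_frames, start=0, period=3, max_offset=1,
                           length=3, min_fraction=0.5, intensity_threshold=1)
```

Allowing an offset around the nominal period is what lets the extraction skip over removed or static frames while keeping the sampling rate roughly constant.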

5.2 Baselines

We compare our method to the following baselines:

  • Deterministic video synthesis (unet): In image synthesis tasks, it is common to use a simple encoder-decoder architecture with skip connections, similar to U-Net [24, 46]. We adapt this technique to synthesize the entire video at once.

  • Stochastic video synthesis (vdp): Visual deprojection synthesizes a distribution of videos from a single temporally-projected input image [5].

We design each baseline model architecture to have a comparable number of parameters to our model. Both baselines output videos of a fixed length, which we choose to be comparable to the sequence length used in Section 5.1.

5.3 Results

Comparison | All paintings | Watercolor paintings | Digital paintings
real > vdp | 90% | 90% | 90%
real > ours | 55% | 60% | 51%
ours > vdp | 91% | 90% | 88%
Table 1: User study results. Users compared the realism of pairs of videos randomly sampled from ours, vdp, and real videos. The vast majority of participants preferred our videos over vdp videos. Similarly, most participants chose real videos over vdp videos. Users preferred real videos over ours, but many participants confused our videos with the real videos, especially for digital paintings.

We conducted both quantitative and qualitative evaluations. We first present a user study quantifying human perception of the realism of our synthesized videos. Next, we qualitatively examine our synthetic videos, and discuss characteristics that contribute to their realism. Finally, we discuss quantitative metrics for comparing sets of sampled videos to real videos. We show additional results, including videos and visualizations using the tipiX tool [13] on our project page at

We experimented with training each method on digital or watercolor paintings only, as well as on the combined paintings dataset. For all methods, we found that training on the combined dataset produced the best qualitative and quantitative results (likely due to our limited dataset size), and we only present results for those models.

5.3.1 Human evaluations

We surveyed 158 people using Amazon Mechanical Turk [2]. Participants compared the realism of pairs of videos, with each pair containing videos randomly sampled from ours, vdp, or the real videos. In this study, we omit the weaker baseline unet, which performed consistently worse on all metrics (discussed below).

We first trained the participants by showing them several examples of real painting time lapses. We then showed them a pair of time lapse videos generated by different methods for the center crop of the same painting, and asked “Which video in each pair shows a more realistic painting process?” We repeated this process for 14 randomly sampled paintings from the combined test set. We include full study details in the appendix. Table 1 indicates that almost every participant thought synthetic videos produced by our model looked more realistic than those produced by vdp. Furthermore, participants confused our synthetic videos with real videos nearly half of the time. In the next sections, we show example synthetic videos and discuss aspects that make the results of our model appear more realistic, offering an explanation for these promising user study results.

5.3.2 Qualitative results

Figure 9 shows sample sequences produced by our model, for two input paintings. Our model chooses different orderings of semantic regions from the beginning of each sequence, leading to different paths that still converge to the same completed painting.

Figure 10 shows sequences synthesized by each method. To objectively compare the stochastic methods vdp and ours, we show the most similar prediction by L1 distance to the ground truth sequence. The ground truth sequences show that artists tend to paint in a coarse-to-fine manner, using broad strokes near the start of a painting, and finer strokes near the end. As we highlight with arrows, our method captures this tendency better than baselines, having learned to focus on separate semantic regions such as mountains, cabins and trees. Our predicted trajectories are similar to the ground truth, showing that our sequential modeling approach is effective at capturing realistic temporal progressions. In contrast, the baselines tend to make blurry changes without separating the scene into components, a common result for methods that do not explicitly model stochastic processes.

We examine failure cases from our method in Figure 11, such as making many fine or disjoint changes in a single time step and creating an unrealistic effect.

(a) The proposed method does not always synthesize realistic strokes for fine details. Blue arrows highlight frames where the method makes realistic paint strokes, working in one or two semantic regions at a time. Red arrows show how our method sometimes fills in many details in the frame at once.
(b) The proposed method sometimes synthesizes changes in disjoint regions. Red arrows indicate where the method produces paint stroke samples that fill in small patches that correspond to disparate semantic regions, leaving unrealistic blank gaps throughout the frame. Like the previous example, this example also fills in much of the frame in one time step, although most of the filled areas in the second frame are coarse.
Figure 11: Failure cases. We show unrealistic effects that are sometimes synthesized by our method, for a watercolor painting (top) and a digital painting (bottom).
(a) Digital paintings test set.
(b) Watercolor paintings test set.
Figure 12: Quantitative measures. As we draw more samples from each stochastic method (solid lines), the best video similarity to the real video improves. This indicates that some samples are close to the artist’s specific painting choices. We use L1 distance as the metric on the left (lower is better), and stroke IOU on the right (higher is better). Shaded regions show standard deviations of the stochastic methods. We highlight several insights from these plots. (1) Both our method and vdp produce samples that are comparably similar to the real video by L1 distance (left). However, our method synthesizes strokes that are more similar in shape to those used by artists (right). (2) At low numbers of samples, the deterministic unet method is closer (by L1 distance) to the real video than samples from vdp or ours, since L1 favors blurry frames that average many possibilities. (3) Our method shows more improvement in L1 distance and stroke area IOU than vdp as we draw more samples, indicating that our method captures a more varied distribution of videos.

5.3.3 Quantitative results

Comparing synthesized results to “ground truth” in a stochastic task is ill-defined, and developing quantitative measures of realism is difficult [25, 47]; these challenges motivated our user study above. In this section, we explore quantitative metrics designed to measure aspects of time lapse realism. For each video in the test set, we extract a 40-frame long sequence according to the criteria described in Section 5.1, and evaluate each method on 5 random crops using several video similarity metrics:

  • Best (across samples) overall video similarity: For each test painting, we draw a set of sample videos from each model and report the closest sample to the true video using a per-pixel L1 loss [5]. A method that has captured the distribution of real time lapses well should produce better “best” estimates as the number of samples increases. This captures whether some samples drawn from a model get close to the real video, and also whether a method is diverse enough to capture each artist’s specific choices.

  • Best (across samples) stroke shape similarity: We quantify how similar the set of stroke shapes are between the ground truth and each predicted video, disregarding the order in which they were performed. We define stroke shape as a binary map of the changes made in each time step. For each test video, we compare the artist’s stroke shape to the most similarly shaped stroke synthesized by each method, as measured by intersection-over-union (IOU), and report the average IOU over all ground truth strokes. This captures whether a method paints in similar semantic regions to the artist.
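Both metrics above can be sketched in a few lines. The version below works on videos represented as lists of flat frames; the change threshold used to binarize strokes, the handling of empty strokes, and all function names are assumptions made for illustration rather than details from the paper.

```python
def video_l1(video_a, video_b):
    """Mean absolute per-pixel difference between two equal-length videos."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(video_a, video_b):
        for a, b in zip(frame_a, frame_b):
            total += abs(a - b)
            count += 1
    return total / count


def best_l1(samples, real):
    """L1 distance of the sampled video closest to the real one."""
    return min(video_l1(sample, real) for sample in samples)


def stroke_masks(video, threshold=0.0):
    """Binary map of changed pixels for each pair of consecutive frames."""
    return [[abs(b - a) > threshold for a, b in zip(prev, cur)]
            for prev, cur in zip(video, video[1:])]


def iou(mask_a, mask_b):
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0  # assumption: two empty strokes match


def stroke_iou(pred, real, threshold=0.0):
    """Average, over real strokes, of the best IOU with any predicted stroke."""
    pred_masks = stroke_masks(pred, threshold)
    real_masks = stroke_masks(real, threshold)
    return sum(max(iou(r, p) for p in pred_masks) for r in real_masks) / len(real_masks)


# The artist paints the left half, then the right half.
real = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
# Same strokes in the opposite order.
reordered = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1]]
# A gradual fade over the whole frame.
fade = [[0, 0, 0, 0], [0.5, 0.5, 0.5, 0.5], [1, 1, 1, 1]]
```

In this toy example the fade is closer to the real video by L1 distance (1/6 versus 1/3), while the reordered video reaches a stroke IOU of 1.0 against 0.5 for the fade, which illustrates why the two metrics are reported together.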

In Table 2 we introduce the interp baseline, which linearly interpolates in time, as a quantitative lower bound. The deterministic interp and unet approaches perform poorly for both metrics. vdp and our method are able to produce samples that lead to comparable “best video similarity”, highlighting the strength of methods designed to capture distributions of videos. The stroke IOU metric shows that our method synthesizes strokes that are significantly more realistic than the other methods.
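As a concrete picture of the interp baseline, the sketch below linearly blends a white frame into the completed painting over a fixed number of frames. The function name and the convention that white is 1.0 on a [0, 1] intensity scale are assumptions for illustration.

```python
def interp_video(painting, num_frames):
    """Linearly interpolate from a white frame to the painting (num_frames >= 2)."""
    white = 1.0  # assumed intensity of a blank canvas
    frames = []
    for k in range(num_frames):
        alpha = k / (num_frames - 1)  # 0 at the first frame, 1 at the last
        frames.append([(1 - alpha) * white + alpha * p for p in painting])
    return frames


faded = interp_video([0.0, 0.5], 3)
```

Every pixel changes a little at every step, so the result has no localized strokes at all, which is why it serves as a lower bound on stroke realism.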

We show the effect of increasing the number of samples in Figure 12. At low sample counts, the blurry videos produced by interp and unet attain lower L1 distance to the real video than vdp and ours, likely because L1 distance penalizes samples with different painting progressions more than it penalizes blurry frames. In other words, a blurry, gradually fading video with “average” frames will typically have a lower L1 distance to the artist’s time lapse, compared to different plausible painting processes. As the number of samples increases, vdp and ours produce some samples that are close to the real video. Together with the user study described above, these metrics provide encouraging evidence that our method captures a realistic variety of painting time lapses.

Method | Digital paintings: L1, Stroke IOU | Watercolor paintings: L1, Stroke IOU
Table 2: We compare videos synthesized from the digital and watercolor painting test sets to the artists’ videos. We include a simple baseline interp that linearly interpolates in time between a white frame and the completed painting. For stochastic methods vdp and ours, we draw 2000 video samples and report the closest one to the ground truth.
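As an illustration of the interp baseline described in the caption, the following sketch blends linearly in time from a white frame to the completed painting. The function name, the `num_frames` parameter, and the assumption that pixel intensities lie in [0, 1] (so that white is 1.0) are choices made for this example, not details taken from the paper.

```python
import numpy as np


def interp_video(painting, num_frames):
    """Linearly interpolate from a white frame to the completed painting.

    painting: array of shape (H, W, C) with intensities assumed to lie in [0, 1].
    Returns an array of shape (num_frames, H, W, C); num_frames should be >= 2.
    """
    painting = np.asarray(painting, dtype=np.float64)
    white = np.ones_like(painting)
    # One blending weight per frame, broadcast over the image dimensions.
    alphas = np.linspace(0.0, 1.0, num_frames).reshape(-1, *([1] * painting.ndim))
    return (1.0 - alphas) * white + alphas * painting
```

Every intermediate frame is a uniformly faded copy of the final painting, which is why this baseline looks like a gradual fade rather than a sequence of strokes.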

6 Conclusion

In this work, we introduce a new video synthesis problem: making time lapse videos that depict the creation of paintings. We propose a recurrent probabilistic model that captures the stochastic decisions of human artists. We introduce an alternating sequential training scheme that encourages the model to make realistic predictions over many time steps. We demonstrate our model on digital and watercolor paintings, and use it to synthesize realistic and varied painting videos. Our results, including human evaluations, indicate that the proposed model is a powerful first tool for capturing stochastic changes from small video datasets.

7 Acknowledgments

We thank Zoya Bylinskii of Adobe Inc. for her insights around designing effective and accurate user studies. This work was funded by Wistron Corporation.


  • [1] M. Abadi et al. (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §4.4.
  • [2] Inc. Amazon Mechanical Turk (2005) Amazon mechanical turk: overview. Cited by: §5.3.1.
  • [3] R. Ando and R. Tsuruno (2010) Segmental brush synthesis with stroke images. Cited by: §2.3.
  • [4] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski (2011) A database and evaluation methodology for optical flow. International Journal of Computer Vision 92 (1), pp. 1–31. Cited by: §2.2.
  • [5] G. Balakrishnan, A. V. Dalca, A. Zhao, J. V. Guttag, F. Durand, and W. T. Freeman (2019) Visual deprojection: probabilistic recovery of collapsed dimensions. In IEEE International Conference on Computer Vision (ICCV). Cited by: §2, §5.2, §5.3.3.
  • [6] W. Baxter, Y. Liu, and M. C. Lin (2004) A viscous paint model for interactive applications. Computer Animation and Virtual Worlds 15 (3-4), pp. 433–441. Cited by: §2.3.
  • [7] W. V. Baxter and M. C. Lin (2004) A versatile interactive 3d brush model. In Computer Graphics and Applications, 2004. PG 2004. Proceedings. 12th Pacific Conference on, pp. 319–328. Cited by: §2.3.
  • [8] M. Bennewitz, W. Burgard, and S. Thrun (2002) Learning motion patterns of persons for mobile service robots. In Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292), Vol. 4, pp. 3601–3606. Cited by: §2.1.
  • [9] Z. Chen, B. Kim, D. Ito, and H. Wang (2015) Wetbrush: gpu-based 3d painting simulation at the bristle level. ACM Transactions on Graphics (TOG) 34 (6), pp. 200. Cited by: §2.3.
  • [10] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: Figure 13, §4.2.2.
  • [11] F. Chollet et al. (2015) Keras. GitHub. Cited by: §4.4.
  • [12] N. S. Chu and C. Tai (2005) MoXi: real-time ink dispersion in absorbent paper. In ACM Transactions on Graphics (TOG), Vol. 24, pp. 504–511. Cited by: §2.3.
  • [13] A. V. Dalca, R. Sridharan, N. Rost, and P. Golland TipiX: rapid visualization of large image collections. MICCAI-IMIC Interactive Medical Image Computing Workshop. Cited by: §5.3.
  • [14] A. Dosovitskiy and T. Brox (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in neural information processing systems, pp. 658–666. Cited by: §4.1.
  • [15] J. Engel, M. Hoffman, and A. Roberts (2018) Latent constraints: learning to generate conditionally from unconditional generative models. In International Conference on Learning Representations. Cited by: §4.2.2.
  • [16] S. Gaffney and P. Smyth (1999) Trajectory clustering with mixtures of regression models. In KDD, Vol. 99, pp. 63–72. Cited by: §2.1.
  • [17] Y. Ganin, T. Kulkarni, I. Babuschkin, S. Eslami, and O. Vinyals (2018) Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118. Cited by: §2.3.
  • [18] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576. Cited by: §2.3.
  • [19] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: Figure 13, §4.2.2.
  • [20] A. Hertzmann (1998) Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 453–460. Cited by: §2.3.
  • [21] F. Huang, B. Wu, and B. Huang (2015) Synthesis of oil-style paintings. In Pacific-Rim Symposium on Image and Video Technology, pp. 15–26. Cited by: §2.3.
  • [22] Z. Huang, W. Heng, and S. Zhou (2019) Learning to paint with model-based deep reinforcement learning. In IEEE International Conference on Computer Vision (ICCV). Cited by: §2.3.
  • [23] S. Interactive (2016) Procreate artists’ handbook. Savage. Cited by: §3, §5.1.
  • [24] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004. Cited by: §4.1, §5.2.
  • [25] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §5.3.3.
  • [26] B. Jia, J. Brandt, R. Mech, B. Kim, and D. Manocha (2019) LPaintB: learning to paint from self-supervision. CoRR abs/1906.06841. Cited by: §2.3.
  • [27] B. Jia, C. Fang, J. Brandt, B. Kim, and D. Manocha (2019) PaintBot: a reinforcement learning approach for natural media painting. CoRR abs/1904.02201. Cited by: §2.3.
  • [28] H. Jiang, D. Sun, V. Jampani, M. Yang, E. Learned-Miller, and J. Kautz (2018) Super slomo: high quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9000–9008. Cited by: §2.2.
  • [29] P. Jodoin, E. Epstein, M. Granger-Piché, and V. Ostromoukhov (2002) Hatching by example: a statistical approach. In Proceedings of the 2nd international symposium on Non-photorealistic animation and rendering, pp. 29–36. Cited by: §2.3.
  • [30] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §4.1.
  • [31] M. Kim and H. J. Shin (2010) An example-based approach to synthesize artistic strokes using graphs. In Computer Graphics Forum, Vol. 29, pp. 2145–2152. Cited by: §2.3.
  • [32] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589. Cited by: Appendix A, §4.1.
  • [33] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.1.
  • [34] C. Liu, J. Yuen, and A. Torralba (2010) Sift flow: dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 978–994. Cited by: §2.1.
  • [35] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala (2017) Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4463–4471. Cited by: §2.1.
  • [36] J. Lu, C. Barnes, S. DiVerdi, and A. Finkelstein (2013) RealBrush: painting with examples of physical media. ACM Transactions on Graphics (TOG) 32 (4), pp. 117. Cited by: §2.3.
  • [37] M. Lukáč, J. Fišer, P. Asente, J. Lu, E. Shechtman, and D. Sỳkora (2015) Brushables: example-based edge-aware directional texture painting. In Computer Graphics Forum, Vol. 34, pp. 257–267. Cited by: §2.3.
  • [38] M. Mathieu, C. Couprie, and Y. Lecun (2016) Deep multi-scale video prediction beyond mean square error. Cited by: §2.1.
  • [39] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung (2015) Phase-based frame interpolation for video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1410–1418. Cited by: §2.2.
  • [40] V. Michalski, R. Memisevic, and K. Konda (2014) Modeling deep temporal dependencies with recurrent grammar cells. In Advances in neural information processing systems, pp. 1925–1933. Cited by: §2.1.
  • [41] R. Mittelman, B. Kuipers, S. Savarese, and H. Lee (2014) Structured recurrent temporal restricted boltzmann machines. In International Conference on Machine Learning, pp. 1647–1655. Cited by: §2.1.
  • [42] S. E. Montesdeoca, H. S. Seah, P. Bénard, R. Vergne, J. Thollot, H. Rall, and D. Benvenuti (2017) Edge-and substrate-based effects for watercolor stylization. In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering, pp. 2. Cited by: §2.3.
  • [43] S. Niklaus and F. Liu (2018) Context-aware synthesis for video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1710. Cited by: §2.2.
  • [44] S. Niklaus, L. Mai, and F. Liu (2017) Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 261–270. Cited by: §2.2, §4.1.
  • [45] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014) Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604. Cited by: §2.1.
  • [46] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §5.2.
  • [47] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §5.3.3.
  • [48] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • [49] I. Sutskever, G. E. Hinton, and G. W. Taylor (2009) The recurrent temporal restricted boltzmann machine. In Advances in neural information processing systems, pp. 1601–1608. Cited by: §2.1.
  • [50] D. Vasquez and T. Fraichard (2004) Motion prediction for moving objects: a statistical approach. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA’04. 2004, Vol. 4, pp. 3931–3936. Cited by: §2.1.
  • [51] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee (2017) Learning to generate long-term future via hierarchical prediction. In ICML. Cited by: §2.1, §2.1, §4.2.2.
  • [52] C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pp. 613–621. Cited by: §2.1, §2.1.
  • [53] J. Walker, C. Doersch, A. Gupta, and M. Hebert (2016) An uncertain future: forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851. Cited by: §4.1.1.
  • [54] J. Walker, A. Gupta, and M. Hebert (2014) Patch to the future: unsupervised visual prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3302–3309. Cited by: §2.1.
  • [55] M. Wang, B. Wang, Y. Fei, K. Qian, W. Wang, J. Chen, and J. Yong (2014) Towards photo watercolorization with artistic verisimilitude. IEEE transactions on visualization and computer graphics 20 (10), pp. 1451–1460. Cited by: §2.3.
  • [56] D. Way and Z. Shih (2001) The Synthesis of Rock Textures in Chinese Landscape Painting. Computer Graphics Forum. ISSN 1467-8659. Cited by: §2.3.
  • [57] M. Werlberger, T. Pock, M. Unger, and H. Bischof (2011) Optical flow guided TV-L1 video interpolation and restoration. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 273–286. Cited by: §2.2.
  • [58] N. Xie, H. Hachiya, and M. Sugiyama (2012) Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. CoRR abs/1206.4634. Cited by: §2.3.
  • [59] N. Xie, T. Zhao, F. Tian, X. Zhang, and M. Sugiyama (2015) Stroke-based stylization learning and rendering with inverse reinforcement learning. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 2531–2537. ISBN 978-1-57735-738-4. Cited by: §2.3.
  • [60] J. Xing, H. Chen, and L. Wei (2014) Autocomplete painting repetitions. ACM Transactions on Graphics (TOG) 33 (6), pp. 172. Cited by: §2.3.
  • [61] S. Xu, M. Tang, F. Lau, and Y. Pan (2002) A solid model based virtual hairy brush. In Computer Graphics Forum, Vol. 21, pp. 299–308. Cited by: §2.3.
  • [62] T. Xue, J. Wu, K. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In Advances in neural information processing systems, pp. 91–99. Cited by: Appendix A, §2.1, §4.1.
  • [63] X. Yan, J. Yang, K. Sohn, and H. Lee (2016) Attribute2image: conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Cited by: Appendix A, §4.1.1, §4.1.
  • [64] Z. Yu, H. Li, Z. Wang, Z. Hu, and C. W. Chen (2013) Multi-level video frame interpolation: exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology 23 (7), pp. 1235–1248. Cited by: §2.2.
  • [65] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595. Cited by: §4.1.
  • [66] Y. Zhang, W. Dong, C. Ma, X. Mei, K. Li, F. Huang, B. Hu, and O. Deussen (2017) Data-driven synthesis of cartoon faces using different styles. IEEE Transactions on image processing 26 (1), pp. 464–478. Cited by: §2.3.
  • [67] M. Zheng, A. Milliez, M. Gross, and R. W. Sumner (2017) Example-based brushes for coherent stylized renderings. In Proceedings of the Symposium on Non-Photorealistic Animation and Rendering, pp. 3. Cited by: §2.3.
  • [68] Y. Zhou and T. L. Berg (2016) Learning temporal transformations from time-lapse videos. Vol. 9912, pp. 262–277. ISBN 978-3-319-46483-1. Cited by: §2.1, §2.1.
  • [69] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.3.

Appendix A ELBO derivation

We provide the full derivation of our model and losses from Equation (3). We start with our goal of finding model parameters

that maximize the following probability for all videos and all


We use variational inference and introduce an approximate posterior distribution [32, 62, 63].


We use the shorthand for , and apply Jensen’s inequality:


where is the Kullback-Leibler divergence, arriving at the ELBO presented in Equation (5) in the paper.

Combining the first term in Equation (5) with our image likelihood defined in Equation (1):


giving us the image similarity losses in Equation (6). We derive in Equation (6) by similarly taking the logarithm of the normal distributions defined in Equations (2) and (4).
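For readers who want the steps of this appendix written out, the following is a sketch of the standard variational lower bound in generic notation. The symbols are assumptions introduced here and are not necessarily those of Equations (1) to (6): $x$ stands for the quantity predicted at one time step, $c$ for the information the model conditions on, $z$ for the latent variable with a prior $p(z)$ that does not depend on $c$, and $q_\phi$ for the approximate posterior.

```latex
% Generic notation (assumed): x = predicted quantity, c = conditioning
% information, z = latent variable, q_\phi = approximate posterior.
\begin{align}
\log p_\theta(x \mid c)
  &= \log \int p_\theta(x \mid z, c)\, p(z)\, dz \\
  &= \log \mathbb{E}_{z \sim q_\phi(z \mid x, c)}
     \left[ \frac{p_\theta(x \mid z, c)\, p(z)}{q_\phi(z \mid x, c)} \right] \\
  &\geq \mathbb{E}_{z \sim q_\phi(z \mid x, c)}
     \left[ \log \frac{p_\theta(x \mid z, c)\, p(z)}{q_\phi(z \mid x, c)} \right]
     \quad \text{(Jensen's inequality)} \\
  &= \mathbb{E}_{z \sim q_\phi(z \mid x, c)}
     \left[ \log p_\theta(x \mid z, c) \right]
     - \mathrm{KL}\left[ q_\phi(z \mid x, c) \,\|\, p(z) \right].
\end{align}

% If the prior is a standard normal and q_\phi is a normal distribution with
% diagonal covariance (an assumption made here), the KL term has the closed form
\begin{equation}
\mathrm{KL}\left[ \mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I) \right]
  = \frac{1}{2} \sum_j \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right).
\end{equation}
```

The first term of the bound is an expected log-likelihood, so substituting a concrete image likelihood turns it into an image similarity loss up to additive constants. For example, a Gaussian likelihood yields a squared-error term.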

Appendix B Network architecture

We provide details about the architecture of our recurrent model and our critic model in Figure 13.

Figure 13: We use an encoder-decoder style architecture for our model. For our critic, we use a similar architecture to StarGAN [10], and optimize the critic using WGAN-GP [19] with a gradient penalty weight of 10 and 5 critic training iterations for each iteration of our model. All strided convolutions and downsampling layers reduce the size of the input volume by a factor of 2.
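To make the critic settings in the caption concrete, here is a small NumPy sketch of a WGAN-GP style critic loss and update schedule. It is not the paper's critic: the toy critic $f(x) = \sum_j w_j \tanh(x_j)$ is a stand-in chosen only because its input gradient can be written analytically, and all function names are hypothetical. Only the penalty weight of 10 and the 5 critic iterations per model iteration come from the text.

```python
import numpy as np

GP_WEIGHT = 10.0   # gradient penalty weight stated in the caption
CRITIC_ITERS = 5   # critic updates per model update stated in the caption


def critic(x, w):
    """Toy stand-in critic f(x) = sum_j w_j * tanh(x_j); one score per row of x."""
    return np.tanh(x) @ w


def critic_input_grad(x, w):
    """Analytic gradient of the toy critic with respect to its input."""
    return w * (1.0 - np.tanh(x) ** 2)


def gradient_penalty(real, fake, w, rng):
    """Penalize deviation of the critic's gradient norm from 1 at random interpolates."""
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_input_grad(x_hat, w), axis=1)
    return GP_WEIGHT * np.mean((grad_norm - 1.0) ** 2)


def critic_loss(real, fake, w, rng):
    """Wasserstein critic loss plus the gradient penalty."""
    wasserstein = critic(fake, w).mean() - critic(real, w).mean()
    return wasserstein + gradient_penalty(real, fake, w, rng)


def update_schedule(model_iters):
    """Yield the order of updates: CRITIC_ITERS critic steps, then one model step."""
    for _ in range(model_iters):
        for _ in range(CRITIC_ITERS):
            yield "critic"
        yield "model"
```

In a real implementation the input gradient would come from automatic differentiation in the training framework rather than from a hand-written formula.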

Appendix C Human study

We surveyed 150 human participants. Each participant took a survey containing a training section followed by 14 questions.

  1. Calibration: We first trained the participants by showing them several examples of real digital and watercolor painting time lapses.

  2. Evaluation: We then showed each participant 14 pairs of time lapse videos, comprised of a mix of watercolor and digital paintings selected randomly from the test sets. Although each participant only saw a subset of the test paintings, every test painting was included in the surveys. Each pair contained videos of the same center-cropped painting. The videos were randomly chosen from all pairwise comparisons between real, vdp, and ours, with the ordering within each pair randomized as well. Samples from vdp and ours were generated randomly.

  3. Validation: Within the survey, we also showed two repeated questions comparing a real video with a linearly interpolated video (which we described as interp in Table 2 in the paper) to validate that users understood the task. We did not use results from users who chose incorrect answers for one or both validation questions.
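The filtering and tallying described in the steps above could be implemented along the following lines. The record layout (a `validation_correct` list of two booleans and a `comparisons` list of `(left, right, chosen)` tuples) and the function names are hypothetical, introduced only for this sketch.

```python
def passed_validation(participant):
    """True only if the participant answered both validation questions correctly."""
    return all(participant["validation_correct"])


def preference_rate(participants, method_a, method_b):
    """Fraction of method_a-vs-method_b comparisons in which method_a was chosen.

    Participants who failed a validation question are skipped. Returns None when
    no valid comparison between the two methods exists.
    """
    wins = 0
    total = 0
    for participant in participants:
        if not passed_validation(participant):
            continue
        for left, right, chosen in participant["comparisons"]:
            # The on-screen ordering within a pair is randomized, so compare as a set.
            if {left, right} == {method_a, method_b}:
                total += 1
                wins += int(chosen == method_a)
    return wins / total if total else None
```

Treating each pair as an unordered set keeps the tally independent of which video happened to be shown first.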

Appendix D Additional results

We include additional qualitative results in Figures 14 and 15. We encourage the reader to view the supplementary video, which illustrates many of the discussed effects.

(a) The proposed method paints similar regions to the artist. Red arrows in the second row show where unet adds fine details everywhere in the scene, ignoring the semantic boundary between the rock and the water, and contributing to an unrealistic fading effect. The video synthesized by vdp uses more coarse strokes early on, but introduces an unrealistic-looking blurring and fading effect on the rock (red arrows in the third row). Blue arrows highlight that our method makes similar strokes to the artist, filling in the base color of the water, then the base colors of the rock, and then fine details throughout the painting.
(b) The proposed method identifies appropriate colors and shape for each layer of paint. Red arrows indicate where the baselines fill in details that the artist does not complete until much later in the sequence (not shown in the real sequence, but visible in the input image). Blue arrows show where our method adds a base layer for the vase with a reasonable color and shape, and then adds fine details to it later.
Figure 14: Videos synthesized from the watercolor paintings test set. For the stochastic methods vdp and ours, we examine the nearest sample to the real video out of 2000 samples. We discuss the variability among samples from our method in Section 5, and in the supplementary video.
(a) The proposed method paints using coarse-to-fine layers of different colors, similarly to the real artist. Red arrows indicate where the baseline methods fill in details of the house and bush at the same time, adding fine-grained details even early in the painting. Blue arrows highlight where our method makes similar strokes to the artist, adding a flat base color for the bush first before filling in details, and using layers of different colors.
(b) The proposed method synthesizes watercolor-like effects such as paint fading as it dries. Red arrows indicate where the baselines fill in the house and the background at the same time. Blue arrows in the first two video frames of the last row show that our method uses coarse strokes early on. Blue arrows in frames 3-5 show where our method simulates paint drying effects (with the intensity of the color fading over time), which are common in real watercolor videos.
Figure 15: Videos synthesized from the watercolor paintings test set. For the stochastic methods vdp and ours, we show the nearest sample to the real video out of 2000 samples.