Video Synthesis from a Single Image and Motion Stroke

12/05/2018 · by Qiyang Hu et al. · University of Maryland, Universität Bern

In this paper, we propose a new method to automatically generate a video sequence from a single image and a user provided motion stroke. Generating a video sequence based on a single input image has many applications in visual content creation, but it is tedious and time-consuming to produce even for experienced artists. Automatic methods have been proposed to address this issue, but most existing video prediction approaches require multiple input frames. In addition, generated sequences have limited variety since the output is mostly determined by the input frames, without allowing the user to provide additional constraints on the result. In our technique, users can control the generated animation using a sketch stroke on a single input image. We train our system such that the trajectory of the animated object follows the stroke, which makes it both more flexible and more controllable. From a single image, users can generate a variety of video sequences corresponding to different sketch inputs. Our method is the first system that, given a single frame and a motion stroke, can generate animations by recurrently generating videos frame by frame. An important benefit of the recurrent nature of our architecture is that it facilitates the synthesis of an arbitrary number of generated frames. Our architecture uses an autoencoder and a generative adversarial network (GAN) to generate sharp texture images, and we use another GAN to guarantee that transitions between frames are realistic and smooth. We demonstrate the effectiveness of our approach on the MNIST, KTH, and Human 3.6M datasets.


1 Introduction

Figure 1: Illustration of our end-to-end system that synthesizes videos of variable length from a single image and a motion stroke.

Synthesizing a video from a single still image is a useful operation for visual content creation, but manually creating such animations is time-consuming even for experts. Recently, deep learning has been leveraged to automate this process with some success. These methods usually require several input frames, both during training to learn from given motion sequences and then to predict sequences of future frames. For example, architectures using LSTMs and RNNs have been effective in learning to produce video sequences. However, most methods based on LSTMs or RNNs, for example [3], require multiple images as input to allow the network to establish the initial “memory” needed to produce future frames.

In this paper, in contrast, we propose a method to synthesize a video sequence from a single input image containing an object to be animated against a background scene. Using a single image as input, however, leaves too much ambiguity in the choice of animation. Therefore, we allow users to provide an additional sketch stroke that controls the motion trajectory of the animated object, as shown in Figure 1. This enables the creation of more meaningful and controllable video sequences. We believe we are the first to develop an end-to-end system for generating animations of arbitrary length that are controlled by motion strokes.

The key component in our proposed architecture is a recursive predictor that generates features of a new frame given the features of the previous frame. To avoid degradation over time and to enable motion control by the given stroke, the predictor also uses learned features of the initial frame and the motion stroke as additional inputs. Finally, we train a generator to map the features into temporally coherent image sequences by using an autoencoding constraint and adversarial training.

We demonstrate successful results of our approach on several datasets with human motion. We show that we are able to generate video sequences from an input image that correspond to given strokes. In summary, we make the following contributions: 1) A novel system to predict video from a single image and a motion stroke that controls the video generation; 2) A novel recurrent system to predict video without the limitation of generating a fixed number of frames: although we train on a limited number of frames, we can generate sequences of variable length; 3) An evaluation on the MNIST, KTH, and Human3.6M datasets to demonstrate the effectiveness of our approach.

2 Related Work

Video generation has been studied for several years. Early works focused on synthesizing continuously varying textures, so-called video textures, from single or multiple still frames [26, 35]. In recent years, deep generative models such as GANs or variational autoencoders (VAE) have been successfully used to generate realistic images or videos from latent codes [9, 16, 25, 32]. Furthermore, the generated output can be conditioned on an additional input, e.g. a class label or a content image [20, 31]. This allows one to keep the content fixed while sampling appearance, pose, etc. In contrast, Tulyakov et al. [30] decompose motion and content components directly in the latent space. They use a recurrent GAN architecture and sample the motion vector at each time step. Indeed, recurrent neural networks [12] are a natural choice to learn from time-dependent signals such as text, video, or audio. Byeon et al. [3] use a multi-dimensional LSTM that aggregates contextual information in a video for each pixel in all spatial directions and in time. However, in practice RNNs are more difficult to train than feedforward neural networks [23].

In this work, we propose a recursive network that generates video conditioned on a still frame and a motion stroke image. To the best of our knowledge this is the first work that uses strokes as motion representation for animation in a generative setting. In the following paragraphs we describe the works related to video prediction and motion editing.

Video Prediction from Multiple Frames

It is well known that using the mean-squared error as a reconstruction loss leads to blurry future frame predictions. Mathieu et al. [19] use adversarial training to tackle this problem and combine a pixel-wise reconstruction loss with the GAN objective to obtain sharper predictions. In contrast to the original GAN, they do not input noise to the generator, and therefore their predictions are fully deterministic. This may be less of an issue since their prediction is conditioned on multiple frames, and hence the ambiguity is minimal. Denton et al. [8] learn the prior distribution of the latent space at each time step of their LSTM given the previous frame, and sample from the learned prior to predict the next frame.

In this work, we aim to generate the future from a single frame and only use a stroke as guidance for the global motion.

Video Prediction from Single Frame

Predicting the future from a single still image is highly ambiguous. Prior works use a variational approach in order to constrain the future outcomes in the training phase and, at the same time, allow sampling from the latent space at test time [1, 18]. Two works closely relate to ours. The first is by Li et al. [18], who predict a fixed-length video from a single image. A variational autoencoder is used to sample optical flows conditioned on the input frame, which are then fed to a separately trained network that synthesizes the full frames from the optical flow maps. The second is by Hao et al. [10], who synthesize a video clip from a single image and sparse trajectories. Hao et al. generate a video of the whole scene, whereas we focus on animating the object in the image. Due to its recursive design, our proposed architecture is able to output a variable-length video, and we also do not require pixel-accurate optical flows.

Human Motion Synthesis

Recent works have demonstrated the effectiveness of convolutional neural networks for accurate human pose estimation in real time [4, 22, 33]. The pose extracted from real images can further be used to synthesize images of people in novel poses. Balakrishnan et al. [2] achieve this by segmenting the person's body parts and the background, transforming the parts to the new pose, and fusing the result with the background. Chan et al. [5] transfer the pose of a person in a source video to a target subject in another video. Their model is personalized, meaning that it needs to be trained for the target subject's appearance.

Both of these works can render high-quality video with realistic motions because of the strong supervision through pose. In our case, a single stroke gives only a very sparse description of the motion and lacks exact pose joint movements. This missing information introduces more ambiguity and thus makes generating realistic renderings more challenging.

Figure 2: Structure of our network. The core of our architecture is the predictor P that recursively predicts the features of a new frame given the previous one. Using the encoders E_x and E_s, the predictor leverages the learned features z_0 and c of the input image x_0 and the motion stroke s, respectively, as additional inputs. This ensures that the generated frames preserve the appearance of the input and that the animated object follows the stroke. Finally, the predicted features are decoded into temporally consistent image sequences using the generator G, which is trained both with an autoencoding constraint and in an adversarial manner by competing against the discriminators D_1 and D_2.

Sketch-based Animation

To this date, character animation remains a challenging and labor-intensive task. Early works for automated animation from sketches focused on cartoon figures, which, despite their simplistic appearance, have a similar complexity in terms of motion compared to real images. Davis et al. [7] take a sequence of 2D pose sketches and reconstruct the most-likely 3D poses which are applied to a 3D character model for animation. Thorne et al. [29] require only a sketch of the character and a continuous stroke for the motion. Chen et al. [6] reconstruct the 3D wireframe of the character from a sequence of sketches and provided correspondences. This allows them to add realistic lighting, textures and shading on top of the animated character. Our system is fully end-to-end and does not require to explicitly model the 3D or rendering pipeline, and apart from the motion sketch there are no annotations needed.

3 Video Synthesis from Motion Stroke

The formulation of our problem is simple: given a single image x_0 and a hand-drawn stroke s, we aim to synthesize future images x_1, ..., x_T, i.e. a video, of a plausible motion that follows the drawn stroke. It is assumed that the starting point of the drawing is roughly at the center of a movable object, or, more precisely, at the center of the object's bounding box. If the input does not satisfy this assumption, it becomes ambiguous which part of the image should move, and the behavior of the synthesis algorithm is undefined. In our work we focus on human motions, e.g. walking or running, although our model makes no assumptions specific to humans.

3.1 Architecture

To enable applications where the user wishes to synthesize videos of variable length, we address the problem with a recursive neural network that continually outputs video frames. Our network, as depicted in Figure 2, is composed of three main stages that are all jointly trained in an end-to-end fashion: The encoding stage, the prediction stage, and the decoding stage.

We use two encoders to extract the texture information from the input frame and the motion information from the stroke. These features are concatenated and fed to our predictor, which outputs the feature of the next frame. At test time this predictor is applied recursively by feeding its output back as input. In addition, the feature encoding of the initial frame is always given as input in order to retain a reference to the beginning of the sequence. Finally, at the decoding stage the generator network outputs the RGB frame. In the following paragraphs we detail each of these building blocks and point out the differences between training and test time. For clarity, we omit the network parameters from our notation.

Encoding Stage

Our future prediction is performed in a low-dimensional latent space. At the beginning we are given the initial frame x_0 and the motion stroke s. First, we encode the image with E_x to a feature z_0 = E_x(x_0), and the entire stroke with E_s to a global motion feature c = E_s(s). For the motion at time t, we extract the stroke segment s_t between consecutive keypoints k_t and k_{t+1} from s and concatenate the two to obtain the instant motion feature m_t.

Prediction and Decoding

At any time step t, we input the features z_t, z_0, c, and m_t to our recursive predictor P, which is a function

z_{t+1} = P(z_t, z_0, c, m_t)    (1)

that operates on the features of images and motion. Note that, at any time, the predictor has access not only to the instant motion but also to the entire stroke that was given as input. Finally, each output of the predictor is individually decoded by the generator G to produce RGB frames \hat{x}_{t+1} = G(z_{t+1}), which are then concatenated to form the final video.
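For illustration, the following PyTorch-style sketch implements such a recursive prediction step. The module name, feature sizes, and the simple MLP body are assumptions made for clarity; the actual predictor is built from dense blocks as described in Section 3.4.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Maps (z_t, z_0, c, m_t) to the feature z_{t+1} of the next frame.

    A minimal stand-in for Eq. (1): the feature sizes and the MLP body are
    illustrative assumptions, not the dense-block predictor of Section 3.4.
    """
    def __init__(self, feat_dim=256, stroke_dim=64):
        super().__init__()
        in_dim = 2 * feat_dim + 2 * stroke_dim  # z_t, z_0, c, m_t concatenated
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, z_t, z_0, c, m_t):
        # The initial-frame feature z_0 and the global stroke feature c are
        # part of the input at every step, anchoring the rollout to the input.
        return self.net(torch.cat([z_t, z_0, c, m_t], dim=1))
```

A single step then reads z_next = predictor(z_t, z_0, c, m_t), and the rollout simply feeds z_next back in at the following step.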

3.2 Training

We train all parts of the network together by minimizing a combination of reconstruction, adversarial, and perceptual losses. First, we use a pixel-wise reconstruction loss between the ground-truth and the synthesized frames, i.e.,

\mathcal{L}_{pix} = \sum_{t=1}^{T} \| \hat{x}_t - x_t \|_1 .    (2)

A second reconstruction loss enforces that the features generated by P and E_x have the same structure, i.e., when P predicts z_{t+1} it should match the encoding E_x(x_{t+1}):

\mathcal{L}_{feat} = \sum_{t=0}^{T-1} \| P(z_t, z_0, c, m_t) - E_x(x_{t+1}) \|_1 .    (3)

This guarantees a consistent encoding at each time step. Furthermore, to avoid the blurred outputs caused by the pixel-wise loss, we use two discriminators: one, D_1, that distinguishes between predicted “fake” frames \hat{x}_t and “real” frames x from the distribution of real images p_{data}, i.e.,

\mathcal{L}_{D_1} = \mathbb{E}_{x \sim p_{data}} [\log D_1(x)] + \mathbb{E}_{t} [\log(1 - D_1(\hat{x}_t))],    (4)

and another one, D_2, that discriminates generated pairs (\hat{x}_t, \hat{x}_{t+1}) from real pairs (x_t, x_{t+1}), i.e.,

\mathcal{L}_{D_2} = \mathbb{E} [\log D_2(x_t, x_{t+1})] + \mathbb{E} [\log(1 - D_2(\hat{x}_t, \hat{x}_{t+1}))].    (5)

Furthermore, the single-frame discriminator in Equation 4 is conditioned on the instant stroke s_t, which we omit from the notation for readability. Finally, we also measure the perceptual loss

\mathcal{L}_{perc} = \sum_{t=1}^{T} \| \phi(\hat{x}_t) - \phi(x_t) \|_1    (6)

between the output video and the ground truth, where \phi(\cdot) are the features extracted from a pre-trained VGG network [28].
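As a rough illustration of how the adversarial and perceptual terms could be implemented, the sketch below uses the standard binary cross-entropy GAN formulation and an L1 distance on VGG-16 features; the choice of VGG layer, norm, and loss form are assumptions rather than the exact settings of our implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# VGG feature extractor for the perceptual loss (Eq. 6). The cut-off layer is
# an assumption; the paper only states that a pre-trained VGG network is used.
vgg = models.vgg16(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(fake, real):
    # L1 distance between VGG features of generated and ground-truth frames.
    return F.l1_loss(vgg(fake), vgg(real))

def d_loss(d, real, fake):
    """Discriminator objective in the standard BCE form (Eqs. 4 and 5).

    For the pair discriminator, `real` and `fake` are channel-wise
    concatenations of two consecutive frames.
    """
    r, f = d(real), d(fake.detach())
    return (F.binary_cross_entropy_with_logits(r, torch.ones_like(r)) +
            F.binary_cross_entropy_with_logits(f, torch.zeros_like(f)))

def g_adv_loss(d, fake):
    """Generator side: make the discriminator classify generated frames as real."""
    f = d(fake)
    return F.binary_cross_entropy_with_logits(f, torch.ones_like(f))
```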

With all losses combined, we formulate the adversarial objective as

\min_{\theta} \max_{\psi} \; \mathcal{L}_{pix} + \mathcal{L}_{feat} + \mathcal{L}_{perc} + \mathcal{L}_{D_1} + \mathcal{L}_{D_2},    (7)

where \theta denotes the network weights of the encoders and the generator, and \psi denotes the network weights of the discriminators. We optimize the above objective by alternating stochastic gradient descent on \theta and \psi.

Since the predicted frame at time t+1 depends on the input at time t, we train the network sequentially by using the previous output as input. The feature z_t can be regarded as the state of the system at time t. However, unlike in recurrent neural networks [12], we do not compute the gradient over all past states, for simplicity.
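The sequential training scheme can be sketched as follows. The helper reuses the names from the earlier sketches, keeps only the pixel, feature, and single-frame adversarial terms for brevity (the stroke conditioning of D_1, the perceptual and pair-discriminator terms, loss weights, and the alternating discriminator update are omitted), and detaches the state between steps so that gradients are not propagated through all past states.

```python
import torch
import torch.nn.functional as F

def train_step(frames, c, m, enc_x, predictor, generator, disc_frame, opt_g):
    """One sequential training pass over a clip (names follow the earlier sketches).

    frames: ground-truth clip of shape [T+1, B, C, H, W]; frames[0] is the input.
    c, m:   global stroke feature and per-step instant motion features.
    """
    z0 = enc_x(frames[0])
    z = z0
    loss = 0.0
    for t in range(frames.shape[0] - 1):
        z_next = predictor(z, z0, c, m[t])
        fake = generator(z_next)
        target = frames[t + 1]

        loss = loss + F.l1_loss(fake, target)           # pixel loss, Eq. (2)
        loss = loss + F.l1_loss(z_next, enc_x(target))  # feature loss, Eq. (3)
        logits = disc_frame(fake)                       # generator side of Eq. (4)
        loss = loss + F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

        # Truncate the gradient at each time step: unlike full backpropagation
        # through time, past states do not receive gradients.
        z = z_next.detach()

    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```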

3.3 Runtime

At test time our system is given only a single image x_0 and the stroke s. The encoders E_x and E_s are used at time t=0 to obtain the features z_0, c, and m_0, on which the predictor is recursively applied for a number of time steps. At each time step, the predicted feature z_{t+1} is fed back as input to the next step, which produces z_{t+2}, and so on. As in the training phase, z_0 is given as input at each step to provide a reference to the beginning of the sequence.
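A minimal sketch of this test-time rollout is shown below; module and variable names follow the earlier sketches and are assumptions rather than the exact interface of our implementation.

```python
import torch

@torch.no_grad()
def synthesize(x0, c, m, enc_x, predictor, generator, n_frames):
    """Recursively roll out the predictor from a single image x0.

    c is the global stroke feature and m[t] the instant motion feature for step t.
    """
    z0 = enc_x(x0)
    z = z0
    frames = []
    for t in range(n_frames):
        z = predictor(z, z0, c, m[t])  # z0 is fed at every step as a reference
        frames.append(generator(z))
    return torch.stack(frames, dim=1)  # [B, n_frames, C, H, W]
```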

3.4 Implementation

We use convolutional layers for encoding and transposed convolutions for decoding. The predictor is composed of dense blocks [13]. Spectral normalization [21] is employed only in the discriminators, and no other normalization techniques are applied. We use the Adam optimizer [15] with the same settings for both the generator and the discriminator.
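For example, applying spectral normalization to the discriminator layers and setting up the optimizer could look as follows in PyTorch; the stack is shortened, and the learning rate and betas shown are common GAN defaults rather than our exact values.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(cin, cout, k, s, p):
    # Spectral normalization is applied to discriminator layers only.
    return nn.Sequential(spectral_norm(nn.Conv2d(cin, cout, k, s, p)),
                         nn.LeakyReLU(0.2))

# Shortened discriminator stack; the full layer list is given in Table 3.
disc = nn.Sequential(
    sn_conv(3, 64, 3, 1, 1),
    sn_conv(64, 128, 4, 2, 1),
    sn_conv(128, 256, 4, 2, 1),
)

# Adam is used for both generator and discriminator; the values below are
# placeholders, not the paper's exact hyperparameters.
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4, betas=(0.5, 0.999))
```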

Figure 3: Results on the MNIST dataset. (a) The first row is a ground-truth example corresponding to the given stroke; the second to fifth rows show our results. In each row, the first column is the input stroke, where the intensity of each point goes from black to light grey with increasing time, the second column is the input image, and the remaining columns show the generated video sequence. (b) Experiments with different strokes and the same initial image. The odd rows are the ground-truth sequences for the given stroke input and the even rows are the corresponding generated video sequences.

4 Experiments

We evaluate our models on three datasets: MNIST [17] handwritten digits, KTH Actions [27], and Human3.6M [14] human actions. We show qualitative results on all datasets, and quantitative evaluations and ablation studies on KTH. The results on KTH and Human3.6M demonstrate that our method is effective at synthesizing videos of real human motion. Although these datasets are large, their videos contain only a restricted variety of object trajectories, which limits the ability of our model to generate videos with unusual motion strokes. However, to demonstrate that our method can potentially generate videos with very diverse trajectory strokes as input, we train on MNIST with a large variety of synthetically generated strokes.

Stroke generation

For MNIST, the trajectory is randomly generated online during training and testing. For both KTH and Human 3.6M, we compute the centroid of the bounding box of the person in each frame and use the centroid as the stroke point so that we do not have to annotate the training set. The bounding box is detected using the YOLO object detector [24]. We encode the time instant in the stroke with the pixel’s grayscale intensity (black indicates the beginning and light grey indicates the end).
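As an illustration, a stroke image can be rasterized from the per-frame centroids as in the sketch below, where time is mapped to grayscale intensity; the intensity ramp and single-pixel rasterization are assumptions made for clarity.

```python
import numpy as np

def render_stroke(centroids, height, width):
    """Rasterize a list of (x, y) centroids into a single-channel stroke image.

    Earlier points are darker, later points lighter, so the stroke image
    encodes the time instant of each keypoint (the exact ramp is illustrative).
    """
    stroke = np.zeros((height, width), dtype=np.float32)
    n = len(centroids)
    for t, (x, y) in enumerate(centroids):
        # Map t=0 to a dark value and the last keypoint to light grey.
        intensity = 0.1 + 0.7 * t / max(n - 1, 1)
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            stroke[yi, xi] = intensity
    return stroke
```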

MNIST

The MNIST dataset consists of 60K handwritten digits for the training set and 10K for the test set. In order to test our system on arbitrary trajectories, we create a synthetic dataset of moving MNIST digits: we take the MNIST digits and move them within a larger image window for 16 frames. The first frame is given as the input and the system predicts the following 15 frames.
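A minimal generator for such synthetic moving-digit clips is sketched below; the canvas size, step size, and random-walk trajectory model are illustrative assumptions rather than the exact protocol used in our experiments.

```python
import numpy as np

def make_moving_digit_sequence(digit, canvas=64, n_frames=16, max_step=3, rng=None):
    """Place a 2-D digit array on a larger canvas and move it along a random trajectory.

    Returns the frame sequence and the list of digit-center positions, which
    can be rasterized into the stroke input. All sizes are illustrative.
    """
    rng = rng or np.random.default_rng()
    h, w = digit.shape
    y, x = rng.integers(0, canvas - h), rng.integers(0, canvas - w)
    frames, centers = [], []
    for _ in range(n_frames):
        frame = np.zeros((canvas, canvas), dtype=digit.dtype)
        frame[y:y + h, x:x + w] = digit
        frames.append(frame)
        centers.append((x + w / 2, y + h / 2))
        # Random step, clipped so the digit stays inside the canvas.
        y = int(np.clip(y + rng.integers(-max_step, max_step + 1), 0, canvas - h))
        x = int(np.clip(x + rng.integers(-max_step, max_step + 1), 0, canvas - w))
    return np.stack(frames), centers
```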

Our method can generate video sequences from a single image according to the input strokes. Fig. 3 (a) shows examples of different generated frame sequences given the same stroke input but different digits. Fig. 3 (b) shows the resulting video sequences given the same input image and different strokes. We observe that the digits move along the given trajectory as desired.

KTH

The KTH Action dataset contains grayscale videos of 25 persons performing various actions, from which we choose the walking, running, and jogging subsets since the others do not contain large enough global motions. We train on all data of these three motions except persons 21-25, which we reserve for testing, yielding 98k and 19k frames for training and testing, respectively. Since there are sequences where the person walks out of the frame, we employ the YOLO object detector [24] to exclude frames where the confidence of detecting a person is less than 0.5.
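The confidence-based filtering can be sketched as follows; detect_person is a hypothetical wrapper around the detector (returning a flag, a confidence, and a bounding box), not an actual YOLO API.

```python
def filter_frames(frames, detect_person, min_conf=0.5):
    """Keep only frames in which a person is detected with sufficient confidence.

    `detect_person(frame) -> (found, confidence, bbox)` is a hypothetical
    detector interface; in our setup it would wrap the YOLO detector.
    """
    kept = []
    for frame in frames:
        found, conf, bbox = detect_person(frame)
        if found and conf >= min_conf:
            kept.append((frame, bbox))
    return kept
```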

Figure 4: Experiments on the KTH dataset, with panels (a) and (b). For each row, the first column is the input stroke, where the intensity of each point goes from black to light grey with time, the second column is the input image, and the remaining columns are frames from the generated sequence.

Qualitative results are shown in Fig. 4. We randomly take 16 sequential frames from one video sequence as one video clip, use the first frame as input, and compute the centroid of the bounding box in each frame as the stroke input. The system predicts the following 15 frames. In order to demonstrate that the output of our network is a function of the motion stroke, we test it with varying input strokes while keeping the initial frame fixed. Examples are shown in Fig. 4 for the KTH dataset. The stroke in each row is taken from a different video in the test set and applied to the image in the first column. We can see that, given the same input image, different strokes result in different video sequences. It turns out that the intervals between consecutive stroke points encode the type of motion: dense strokes translate into a walking motion, strokes with large intervals correspond to a running motion, and intervals in between yield a jogging motion. We also observe that the generated frames follow the position of the sketch points. From the third and fourth rows of Fig. 4 (b) we see that, given similar strokes, the system can generate the same type of motion but with different details. Furthermore, the results show that if the pose of the person in the input image resembles a running posture but the stroke describes a walking motion, the system smoothly generates a realistic video transitioning from running to walking, instead of jumping directly to a walking pose.

Figure 5: Longer sequence (frames t=3 to t=23). The frames inside the yellow boundary are the extended predicted frames beyond the 16-frame training regime.

Since our system generates images recurrently frame by frame, we can also test it on longer video sequences by manipulating the stroke input. Fig. 5 shows that, although we trained with sequences of 16 frames, at runtime we can synthesize up to 24 frames while preserving sharpness. We speculate that if the system were trained with longer sequences, it could generate even longer sequences of high-quality frames.

Many video prediction methods require multiple frames as input. We take the recent work of Denton et al. [8] for comparison; Fig. 6 shows the qualitative results. Their method requires 10 frames as input, so we take their 10th frame as our single input. We observe that our result with only one image is comparable to that of Denton et al. with 10 input frames. Like our method, Li et al. [18] require only one image as input; they additionally take a noise vector and predict a fixed number (16) of frames. Fig. 7 shows that we generate images with a quality similar to that of Li et al. and better than Xue et al. [34].

Note that, unlike the other methods, we do not resize the images to a square during training. Resizing to a square distorts the aspect ratio and makes the animated person look slimmer than in the original input sequence, which is apparent in the videos generated by Li et al. and Denton et al. when compared to our results.

Figure 6: Qualitative comparison between Denton et al. [8] and ours on the KTH dataset (rows: Input, Denton, Ours; frames t=0 to t=24). To be consistent with their paper, we generate images of the same size. We take the last frame of their input frames as our single input image.
Figure 7: Qualitative comparison between Li et al. [18] and ours on the KTH dataset (rows: Input, Li, Ours; frames t=1 to t=15).
Figure 8: Results on the Human3.6M dataset. For each row, the first column is the input stroke, where the intensity of each point goes from dark to light over time, the second column is the input image, and the remaining columns show the video sequences. The odd rows are the ground truth for the given stroke input and the even rows are the corresponding generated video sequences.
Figure 9: Failure case on KTH, where the input stroke is not drawn in the direction the person is facing.
                      walking        jogging        running        det.
                      mean    std    mean    std    mean    std
Denton et al. [8]     7.5     10.0   9.9     11.9   10.7    11.5   54.2
Li et al. [18]        7.4     9.1    10.1    11.3   8.7     9.9    54.9
Ours                  7.2     7.7    8.2     9.1    9.2     10.5   87.1
Ground truth          4.3     5.7    5.3     5.8    7.4     6.8    100.0
Table 1: Pose smoothness on generated KTH sequences. We compare the relative mean Euclidean distance (%) and its standard deviation between pose joints in consecutive frames. The last column denotes the detection rate (%) of valid bounding boxes across all classes. The percentage is relative to the image size.
           Li et al. [18]   Denton et al. [8]   Ours
FID        149.9            55.6                26.0
LPIPS      0.15             0.11                0.09
Table 2: Comparison of the Fréchet Inception Distance (FID) [11] and LPIPS [36] between the generated and test set distributions. Smaller values are better. Li et al. have released their code, but not the model weights for KTH; we re-trained their model but were not able to reproduce the quality shown in their paper.

Quantitative analysis is done by computing motion statistics on the generated sequences, and by using the Fréchet Inception Distance (FID) [11] as well as the Learned Perceptual Image Patch Similarity (LPIPS) [36] to measure how realistic the generated images are. Since video prediction in ours and the compared works is non-deterministic, we have no access to a ground truth against which to evaluate our output with a pixel-wise metric. Instead, we compute motion and object-detection statistics on generated as well as ground-truth KTH sequences independently; the comparison is shown in Table 1. We extract the pose joints in consecutive images using the convolutional pose machine [33] and measure the mean and standard deviation of the Euclidean distance between corresponding joints. Lower numbers represent a smoother trajectory of detected pose joints, meaning that the pose detection benefits from better image quality. The pose is only evaluated on the subset of frames where a person is detected, and the fraction of such frames is listed in the last column of Table 1. The detection rate on the ground truth is 100% since we train and test only on frames where a person is detected. Table 1 shows that we outperform the other methods in all motion categories except running and have the highest object detection rate.
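The pose-smoothness statistic of Table 1 can be computed as sketched below; the normalization by the image diagonal is our reading of "relative to image size", and joint extraction is assumed to be done beforehand by the pose estimator.

```python
import numpy as np

def pose_smoothness(joint_sequences, image_diag):
    """Mean/std of Euclidean distance between corresponding joints in consecutive frames.

    `joint_sequences` is a list of arrays of shape [T, J, 2] (one per video),
    containing 2-D joint positions only for frames with a valid detection.
    Distances are reported relative to the image size (here: the diagonal).
    """
    dists = []
    for joints in joint_sequences:
        if len(joints) < 2:
            continue
        d = np.linalg.norm(joints[1:] - joints[:-1], axis=-1)  # [T-1, J]
        dists.append(d.ravel() / image_diag)
    dists = np.concatenate(dists)
    return dists.mean() * 100.0, dists.std() * 100.0  # in percent
```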

Furthermore, our comparison of the perceptual metrics for 15 predicted frames is shown in Table 2 (lower scores are better). We compute the numbers for Li et al. and Denton et al. by re-training their publicly available code with our data. We outperform both works, and are closest to Denton et al. who condition on 10 past frames.
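For reference, LPIPS can be computed with the publicly available lpips package as in the sketch below, averaging over paired frames; FID is computed separately over the two image sets with a standard implementation.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone, the common default

def mean_lpips(generated, reference):
    """Average LPIPS over paired frames; inputs in [-1, 1], shape [N, 3, H, W]."""
    with torch.no_grad():
        d = loss_fn(generated, reference)  # one distance per pair
    return d.mean().item()
```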

Human3.6M

We also evaluate our model on another real dataset, Human3.6M, which is a collection of indoor videos of eleven actors filmed from four viewpoints while performing various actions. In our experiments we only use the walking subset, consisting of 219k frames, from which we take 24k for testing; none of the test actors appear in our training data. We sample sequences of 16 frames by choosing a random start and varying strides to diversify the subsequences. Fig. 8 shows generated video sequences with different strokes. We see that our model generates realistic video sequences.

4.1 Discussion

We also show a failure case in the last row of Fig. 9, where the stroke is not drawn in the direction the person is facing. This is due to missing training data where the person is taking a turn and walking in the opposite direction. We speculate that this shortcoming can be addressed with more diverse training data. Nonetheless these qualitative results demonstrate that our system generates diverse motions that vary according to the user input.

5 Ablation study

Fig. 11 shows the necessity of the initial input image x_0, the whole stroke s, and the instant stroke segment s_t from time t to t+1. We use a tuple to denote the combination of input elements; for example, (x_0, s) means that we give x_0 and s as input, plus the feature z_t predicted in the last iteration. In every configuration, we recurrently use the predicted feature z_t to predict the next frame x_{t+1}. For each variant, the network is re-trained and tested on the same data.

In Fig. 11, the second row shows the result of (x_0, s) and the third row shows the result of (x_0, s_t). The stroke image encodes time with increasing intensity, but since our system is recurrent and we update the network for each frame, it is difficult for the system to know to which stroke keypoint the current frame corresponds. As a result, in Fig. 11 (b) we can see that the object does not move forward when the ordering of the frames is unknown. With the additional stroke segment s_t as input at time t, the system can locate the current frame in time and predict the next frame x_{t+1}. On the other hand, without the complete stroke s, it is challenging for the system to understand the motion of the whole video. Fig. 11 (c) shows that the person moves forward, as s_t provides guidance in the general direction, but the result is not a walking animation.

In a third ablation study, also shown in Fig. 11, we remove the input image x_0 and only keep the strokes s and s_t. One can observe that without x_0 the image quality decreases as more frames are generated. Since our system is recurrent, artifacts in the generated frames accumulate from one predicted image to the next. This shows that x_0 provides useful texture information at each time step and is necessary to reduce the accumulation of errors.

Finally, the last row in Fig. 11 is the result with all inputs combined, i.e. (x_0, s, s_t), as presented in the paper. This shows that all inputs contribute to improved image quality and realism of the motion.


Figure 10: Video of generated sequences demonstrating that the animation follows the stroke. Left: ground truth. Right: generated sequence. The blue line is the input stroke. The animation plays when viewed in video-compatible readers such as Adobe Reader 9 or higher.
Figure 11: Input ablation study. (a) The ground-truth sequence and motion stroke. (b) (x_0, s), i.e. the instant stroke segment s_t is removed. (c) (x_0, s_t), i.e. the global stroke s is removed. (d) (s, s_t), i.e. the input image x_0 is removed. (e) Result with all inputs, i.e. (x_0, s, s_t).
Encoder
  Layer    Ch.   Kernel  Stride  Pad  Act.
  conv     64    3       2       1    LReLU
  conv     128   3       2       1    LReLU
  conv     256   3       2       1    LReLU

Predictor: Dense Block (bottleneck layer)
  conv     128   1       1       1    ReLU
  conv     256   3       2       1    ReLU

Predictor: Transition Layer
  conv     256   1       1       1    ReLU

Decoder
  deconv   256   4       2       1    ReLU
  deconv   128   4       2       1    ReLU
  deconv   64    4       2       1    ReLU
  deconv   3     4       1       1    ReLU

Discriminator (two-branch)
  Layer    Ch.   Kernel  Stride  Pad  Act.   Norm.
  First branch (full image):
  conv     64    3       1       1    LReLU  SN
  conv     128   4       2       1    LReLU  SN
  conv     128   3       1       1    LReLU  SN
  conv     256   4       2       1    LReLU  SN
  conv     256   3       1       1    LReLU  SN
  conv     512   4       2       1    LReLU  SN
  conv     512   3       1       1    LReLU  SN
  Second branch (masked crop):
  conv     32    3       1       1    LReLU  SN
  conv     64    4       2       1    LReLU  SN
  conv     64    3       1       1    LReLU  SN
  conv     128   4       2       1    LReLU  SN
  conv     128   3       1       1    LReLU  SN
  conv     256   4       2       1    LReLU  SN
  conv     256   3       1       1    LReLU  SN
  fc       1     -       -       -    -      SN

Discriminator (single-branch)
  Layer    Ch.   Kernel  Stride  Pad  Act.   Norm.
  conv     64    3       1       1    LReLU  SN
  conv     128   4       2       1    LReLU  SN
  conv     128   3       1       1    LReLU  SN
  conv     256   4       2       1    LReLU  SN
  conv     256   3       1       1    LReLU  SN
  conv     512   4       2       1    LReLU  SN
  conv     512   3       1       1    LReLU  SN
  fc       1     -       -       -    -      SN

Table 3: Details of our architecture.

6 Video clip

We show multiple sample sequences of 16 frames and the corresponding input strokes in Figure 10; the corresponding video file video1.avi is also available as part of the supplementary material. On the left side is the ground truth and on the right side is our generated video. The blue path is the centroid of the bounding box in the ground-truth frames. These samples demonstrate that the person moves along the provided stroke. We also show generated videos with 24 frames (the last 8 frames are extended predictions beyond the 16-frame training regime) in the supplementary material video2.avi. We can see that in each video clip the person moves smoothly.

7 Architecture details

We show our network details in Table 3. For the predictor, we use an architecture similar to DenseNet [13] with six dense blocks and one transition layer. In each dense block there are six bottleneck layers, each with the same architecture; Table 3 only shows the output channels of each conv layer before concatenation. Shown for each layer are the number of feature channels, the kernel size, the stride, the padding, and the non-linearity that follows. conv and deconv denote the convolution and transposed convolution operations, respectively. In the discriminators, we apply spectral normalization after each layer (denoted as SN), and the last layer is fully connected (denoted as fc). One of the discriminators has two branches: the first takes the input image and the second takes the cropped image with a mask centered on the object. We flatten the output of the last convolution layer of each branch to a vector, concatenate the two, and feed the result to the fully connected layer at the end.
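The fusion of the two discriminator branches can be sketched as follows. Channel counts follow Table 3 but the stacks are shortened, and the fully connected layer size assumes 64x64 inputs; these are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(cin, cout, k, s, p):
    return nn.Sequential(spectral_norm(nn.Conv2d(cin, cout, k, s, p)),
                         nn.LeakyReLU(0.2))

class TwoBranchDiscriminator(nn.Module):
    """Discriminator with one branch for the full frame and one for the masked crop.

    Both branches are shortened versions of the stacks in Table 3; their outputs
    are flattened, concatenated, and scored by a final fully connected layer.
    """
    def __init__(self):
        super().__init__()
        self.image_branch = nn.Sequential(
            sn_conv(3, 64, 3, 1, 1), sn_conv(64, 128, 4, 2, 1), sn_conv(128, 256, 4, 2, 1))
        self.crop_branch = nn.Sequential(
            sn_conv(3, 32, 3, 1, 1), sn_conv(32, 64, 4, 2, 1), sn_conv(64, 128, 4, 2, 1))
        # Flattened sizes assume 64x64 inputs (two stride-2 convs -> 16x16 maps).
        self.fc = spectral_norm(nn.Linear(256 * 16 * 16 + 128 * 16 * 16, 1))

    def forward(self, image, crop):
        a = torch.flatten(self.image_branch(image), start_dim=1)
        b = torch.flatten(self.crop_branch(crop), start_dim=1)
        return self.fc(torch.cat([a, b], dim=1))
```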

8 Conclusions

In this paper, we propose a novel method to synthesize a video clip from one single image and a user-provided motion stroke. Our method is based on a recurrent architecture and is capable of generating videos by predicting the next frame given the previous one. Thus it is possible to generate videos of arbitrary length. We demonstrate our approach on several real datasets with human motion and find that it can animate images realistically. Although the proposed method can generate different videos based on the input motion stroke, the variability of the output videos depends on the data that the model was trained on. We believe that this limitation could be addressed by collecting and training on more data.

References