Synthesizing a video from a single still image is a useful operation for visual content creation, but manually creating such animations is time consuming even for experts. Recently, deep learning has been leveraged to automate this process with some success. These methods usually require several input frames, both during training to learn from given motion sequences and at test time to predict sequences of future frames. For example, architectures based on LSTMs and RNNs have been effective in learning to produce video sequences. However, most methods based on LSTMs or RNNs require multiple images as input so that the network can establish the initial “memory” from which to produce future frames.
In this paper instead, we propose a method to synthesize a video sequence from a single input image containing an object to be animated against a background scene. Using a single image as input, however, leaves too much ambiguity in the choice of animation. Therefore, we allow users to provide an additional sketch stroke that controls the motion trajectory of the animated object, as shown in Figure 1. This enables the creation of more meaningful and controllable video sequences. We believe we are the first to develop an end-to-end system for generating animations of arbitrary length that are controlled by motion strokes.
The key component in our proposed architecture is a recursive predictor that generates the features of a new frame given the features of the previous frame. To avoid degradation over time and to enable motion control by the given stroke, the predictor also uses learned features of the initial frame and the motion stroke as additional inputs. Finally, we train a generator to map the features into temporally coherent image sequences by using an autoencoding constraint and adversarial training.
We demonstrate successful results of our approach on several datasets with human motion. We show that we are able to generate video sequences from an input image that correspond to given strokes. In summary, we make the following contributions: 1) A novel system to predict video from a single image and a motion stroke that controls the video generation; 2) A novel recurrent system to predict video without the limitation of generating a fixed number of frames: although trained on a limited number of frames, we can generate sequences of variable length; 3) An evaluation on the MNIST, KTH, and Human3.6M datasets to demonstrate the effectiveness of our approach.
2 Related Work
Video generation has been studied for several years. Early works focused on synthesizing continuously varying textures, so-called video textures, from single or multiple still frames [26, 35]. In recent years, deep generative models such as GANs or variational autoencoders (VAE) have been successfully used to generate realistic images or videos from latent codes [9, 16, 25, 32]. Furthermore, the generated output can be conditioned on an additional input, e.g. a class label or a content image [20, 31]. This allows one to keep the content fixed while sampling appearance, pose, etc. Tulyakov et al. [30] decompose the latent representation into motion and content for video generation. Recurrent architectures such as the LSTM [12] are a natural choice to learn from time-dependent signals such as text, video or audio. Byeon et al. [3] use a multi-dimensional LSTM that aggregates contextual information in a video for each pixel in all spatial directions and in time. However, in practice RNNs are more difficult to train than feedforward neural networks [23].
In this work, we propose a recursive network that generates video conditioned on a still frame and a motion stroke image. To the best of our knowledge this is the first work that uses strokes as motion representation for animation in a generative setting. In the following paragraphs we describe the works related to video prediction and motion editing.
Video Prediction from Multiple Frames
It is well known that using the mean-squared error as a reconstruction loss leads to blurry future-frame predictions. Mathieu et al. [19] use adversarial training to tackle this problem and combine a pixel-wise reconstruction loss with the GAN objective to obtain sharper predictions. In contrast to the original GAN, they do not input noise to the generator and therefore their predictions are fully deterministic. This may be less of an issue since their next prediction is conditioned on multiple frames, and hence the ambiguity is minimal. Denton et al. [8] learn the prior distribution of the latent space at each time step of their LSTM given the previous frame, and sample from the learned prior to predict the next frame.
In this work, we aim to generate the future from a single frame and only use a stroke as guidance for the global motion.
Video Prediction from a Single Frame
Predicting the future from a single still image is highly ambiguous. Prior works use a variational approach in order to constrain the future outcomes during training while allowing sampling from the latent space at test time [1, 18]. Two works closely relate to ours. The first is from Li et al. [18], who predict a fixed-length video from a single image: a variational autoencoder samples optical flow conditioned on the input frame, which is then fed to a separately trained network that synthesizes the full frames from the optical flow maps. The second is by Hao et al. [10], who synthesize a video clip from a single image and sparse trajectories. Hao et al. generate a video of the whole scene, whereas we focus on animating the object in the image. Due to its recursive design, our architecture can output a video of variable length, and we do not require pixel-accurate optical flow.
Human Motion Synthesis
Recent works synthesize videos of human motion by conditioning the generation on body pose [2, 5]. Both of these works can render high-quality video with realistic motion because of the strong supervision through pose. In our case, a single stroke provides only a very sparse description of the motion and lacks exact joint movements. With this missing information there is more ambiguity, and hence a bigger challenge in generating realistic renderings.
To this date, character animation remains a challenging and labor-intensive task. Early works on automated animation from sketches focused on cartoon figures, which, despite their simplistic appearance, have a complexity of motion similar to that of real footage. Davis et al. [7] take a sequence of 2D pose sketches and reconstruct the most likely 3D poses, which are applied to a 3D character model for animation. Thorne et al. [29] require only a sketch of the character and a continuous stroke for the motion. Chen et al. [6] reconstruct the 3D wireframe of the character from a sequence of sketches and provided correspondences, which allows them to add realistic lighting, textures, and shading on top of the animated character. Our system is fully end-to-end and does not require explicit 3D modeling or a rendering pipeline, and no annotations are needed apart from the motion sketch.
3 Video Synthesis from Motion Stroke
The formulation of our problem is simple: given a single image and a hand-drawn stroke, we aim to synthesize future images, i.e. a video, of a plausible motion that follows the drawn stroke. It is assumed that the starting point of the drawing is roughly at the center of a movable object, or, more precisely, at the center of the object's bounding box. If the input does not satisfy this assumption, which part of the image should move becomes ambiguous and the behavior of the synthesis algorithm is undefined. In our work we focus on human motions, e.g. walking or running, although our model makes no assumptions specific to humans.
To enable applications where the user wishes to synthesize videos of variable length, we address the problem with a recursive neural network that continually outputs video frames. Our network, as depicted in Figure 2, is composed of three main stages that are all jointly trained in an end-to-end fashion: The encoding stage, the prediction stage, and the decoding stage.
We use two encoders to extract the texture information from the input frame and the motion information from the stroke. These features are concatenated and fed to our predictor, which outputs the features of the next frame. At test time this predictor is applied recursively by feeding its output back as input. In addition, the feature encoding of the initial frame is always provided as input in order to retain a reference to the beginning of the sequence. Finally, at the decoding stage the generator network outputs the RGB frame. In the following paragraphs we detail each of these building blocks and point out the differences between training and test time. For clarity, we omit the notation for network parameters from our formulations.
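To make the three stages concrete, the following is a minimal PyTorch-style sketch of the pipeline. The module names, layer counts, and feature sizes are our placeholders rather than the architecture from Table 3; the sketch only illustrates how the encoders, the recursive predictor, and the generator interact (the per-step instant-motion feature is omitted here for brevity).

```python
import torch
import torch.nn as nn

class StrokeVideoModel(nn.Module):
    """Sketch of the three-stage pipeline: encode, predict recursively, decode."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.enc_image = nn.Sequential(              # texture encoder (placeholder)
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, 2, 1), nn.ReLU())
        self.enc_stroke = nn.Sequential(             # motion/stroke encoder (placeholder)
            nn.Conv2d(1, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, 2, 1), nn.ReLU())
        self.predictor = nn.Conv2d(3 * feat_dim, feat_dim, 3, 1, 1)   # stands in for the dense-block predictor
        self.generator = nn.Sequential(              # decoder to RGB (placeholder)
            nn.ConvTranspose2d(feat_dim, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, image0, stroke, num_frames):
        z0 = self.enc_image(image0)                  # feature of the initial frame
        m = self.enc_stroke(stroke)                  # feature of the full stroke
        z_t, frames = z0, []
        for _ in range(num_frames):
            # the initial-frame feature z0 is re-injected at every step
            z_t = self.predictor(torch.cat([z_t, z0, m], dim=1))
            frames.append(self.generator(z_t))
        return torch.stack(frames, dim=1)            # (batch, time, 3, H, W)
```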
Our future prediction is performed in a low-dimensional latent space. At the beginning we are given the initial frame and the motion stroke. First, we encode the image with the image encoder to obtain its feature. For the motion, we extract from the stroke each segment between two consecutive keypoints and concatenate the two to obtain the instant motion feature for that time step.
Prediction and Decoding
At any time step, we input the previous frame feature, the initial-frame feature, and the motion features to our recursive predictor, a function that operates entirely in the feature space of images and motion. Note that, at any time, the predictor has access not only to the instant motion but also to the entire stroke that was given as input. Finally, each output of the predictor is individually decoded by the generator to produce an RGB frame, and the frames are concatenated to form the final video.
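Written with notation we introduce here for readability (these symbols are our own shorthand rather than the paper's original notation: the image encoder $E_x$, the stroke encoder $E_s$, the predicted feature $z_t$, the instant motion feature $m_t$, the full stroke $s$, the generator $G$, and the synthesized frame $\hat{x}_t$), the recursion takes roughly the form:

```latex
z_{t+1} = P\bigl(z_t,\; z_0,\; m_t,\; E_s(s)\bigr), \qquad
\hat{x}_{t+1} = G\bigl(z_{t+1}\bigr), \qquad t = 0, 1, 2, \dots
```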
We train all parts of the network together by minimizing a combination of reconstruction, adversarial, and perceptual losses. Firstly, we use a pixel-wise reconstruction loss between each ground-truth and synthesized frame. A second reconstruction loss enforces that the features produced by the predictor and by the image encoder have the same structure: when the predictor outputs the feature of the next frame, it should match the encoding of the corresponding ground-truth frame.
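As a sketch in the notation introduced above, and assuming an $\ell_1$ pixel loss and an $\ell_2$ feature loss (the exact choice of norms is an assumption), the two reconstruction terms could be written as:

```latex
\mathcal{L}_{\mathrm{pix}}  = \sum_{t=1}^{T} \bigl\lVert \hat{x}_t - x_t \bigr\rVert_1,
\qquad
\mathcal{L}_{\mathrm{feat}} = \sum_{t=1}^{T} \bigl\lVert z_t - E_x(x_t) \bigr\rVert_2^2
```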
This guarantees consistent encoding at each time step. Furthermore, to avoid the blurred outputs caused by the pixel-wise loss, we use two discriminators: one that distinguishes predicted “fake” frames from “real” frames drawn from the distribution of real images, and another that discriminates generated frame pairs from real frame pairs. The single-frame discriminator is additionally conditioned on the instant stroke, which we omit from the notation for readability. Finally, we also measure a perceptual loss between the output video and the ground truth, computed on features extracted from a pre-trained VGG network [28].
With all losses combined we formulate an adversarial objective, in which one set of parameters denotes the network weights of the encoders and the generator, and the other denotes the weights of the discriminators. We optimize the objective by alternating stochastic gradient descent on the two sets of weights.
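As a sketch only, again in our own notation and with unspecified weighting factors $\lambda$ (the exact form of the adversarial terms is an assumption; $D_f$ denotes the stroke-conditioned single-frame discriminator, $D_p$ the pair discriminator, $\phi$ the VGG features, $\theta$ the encoder/generator weights, and $\psi$ the discriminator weights), the combined objective could read:

```latex
\mathcal{L}_{\mathrm{adv}} =
  \mathbb{E}\bigl[\log D_f(x_t \mid m_t)\bigr]
  + \mathbb{E}\bigl[\log\bigl(1 - D_f(\hat{x}_t \mid m_t)\bigr)\bigr]
  + \mathbb{E}\bigl[\log D_p(x_t, x_{t+1})\bigr]
  + \mathbb{E}\bigl[\log\bigl(1 - D_p(\hat{x}_t, \hat{x}_{t+1})\bigr)\bigr]

\mathcal{L}_{\mathrm{perc}} = \sum_{t=1}^{T} \bigl\lVert \phi(\hat{x}_t) - \phi(x_t) \bigr\rVert_1

\min_{\theta}\,\max_{\psi}\;
  \mathcal{L}_{\mathrm{adv}}
  + \lambda_{\mathrm{pix}}\,\mathcal{L}_{\mathrm{pix}}
  + \lambda_{\mathrm{feat}}\,\mathcal{L}_{\mathrm{feat}}
  + \lambda_{\mathrm{perc}}\,\mathcal{L}_{\mathrm{perc}}
```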
Since each predicted frame depends on the output of the previous time step, we train the network sequentially by using the previous output as input. The predicted feature can be regarded as the state of the system at each time step. However, unlike in recurrent neural networks, we do not compute the gradient over all past states, for simplicity.
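A minimal sketch of this sequential training scheme, assuming the StrokeVideoModel sketch above and showing only the pixel reconstruction term; detaching the previous feature is how we illustrate that no gradient is propagated over past states:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, image0, stroke, gt_frames):
    """One sequential training step; gt_frames has shape (batch, T, 3, H, W)."""
    optimizer.zero_grad()
    z0 = model.enc_image(image0)
    m = model.enc_stroke(stroke)
    z_t, loss = z0, 0.0
    for t in range(gt_frames.shape[1]):
        # detach the previous state so the gradient does not flow through past steps
        z_t = model.predictor(torch.cat([z_t.detach(), z0, m], dim=1))
        frame = model.generator(z_t)
        loss = loss + F.l1_loss(frame, gt_frames[:, t])   # pixel reconstruction term only
    loss.backward()
    optimizer.step()
    return loss.item()
```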
At test time our system is given only a single image and the stroke. The two encoders are applied once at the beginning to obtain the image and motion features, on which the predictor is then applied recursively for an arbitrary number of time steps. At each step, the predicted feature is fed back as input to produce the next one, and so on. As in the training phase, the initial-frame feature is given as input at every step to provide a reference to the beginning of the sequence.
We use convolutional layers for encoding and transposed convolutions for decoding. The predictor is composed of dense blocks [13]. Spectral normalization [21] is employed only in the discriminators; no other normalization techniques are applied. The input and output frames share the same spatial dimensions. We use the Adam optimizer [15] for both the generator and the discriminator.
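For illustration only (this is not the exact discriminator from Table 3), spectral normalization can be applied per convolution in PyTorch as follows, with no additional normalization layers afterwards:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# One discriminator block: the convolution weight is spectrally normalized,
# and only a LeakyReLU non-linearity follows (no batch/instance normalization).
disc_block = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
)
```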
Figure 3: Results on the MNIST dataset. (a) The first row is one ground-truth example corresponding to the given stroke; the second to the fifth row show our results. For each row, the first column is the input stroke, where the intensity of the points goes from black to light grey with increasing time, and the second column is the input image. From the third column to the far right we show the generated video sequence. (b) Experiments with different strokes and the same initial image. The odd rows are the ground-truth sequences corresponding to the stroke input and the even rows are the generated video sequences.
We evaluate our models on three datasets: the MNIST [17] handwritten digits, KTH Actions [27], and Human3.6M [14]. We show qualitative results on all datasets and quantitative evaluations and ablation studies on KTH. The results on KTH and Human3.6M show that our method is effective at synthesizing videos of human motion from real images. Although these datasets are large, their videos include only a restricted variety of object trajectories, which limits the ability of our model to generate videos with unusual motion strokes. However, to demonstrate that our method can potentially generate videos with very diverse trajectory strokes as input, we train on MNIST with a large variety of synthetically generated strokes.
For MNIST, the trajectory is randomly generated on the fly during training and testing. For both KTH and Human3.6M, we compute the centroid of the person's bounding box in each frame and use it as a stroke point, so that we do not have to annotate the training set. The bounding box is detected with the YOLO object detector [24]. We encode the time instant in the stroke with the pixel's grayscale intensity (black indicates the beginning and light grey the end).
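A minimal sketch of how such a stroke image could be rasterized from per-frame centroids; the canvas size, point radius, and intensity ramp endpoints below are our assumptions, not values from the paper:

```python
import numpy as np

def render_stroke(centroids, height=128, width=128, radius=1):
    """Rasterize per-frame object centroids into a single-channel stroke image.
    Time is encoded by intensity: the first point is black (0.0), the last light grey (~0.8)."""
    stroke = np.ones((height, width), dtype=np.float32)          # white background
    T = len(centroids)
    for t, (cx, cy) in enumerate(centroids):
        intensity = 0.8 * t / max(T - 1, 1)                      # black -> light grey over time
        y0, y1 = max(int(cy) - radius, 0), min(int(cy) + radius + 1, height)
        x0, x1 = max(int(cx) - radius, 0), min(int(cx) + radius + 1, width)
        stroke[y0:y1, x0:x1] = intensity
    return stroke

# Example: a person moving left to right across the frame over 16 time steps.
points = [(10 + 7 * t, 64) for t in range(16)]
stroke_img = render_stroke(points)
```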
The MNIST dataset consists of 60K handwritten digits for the training and 10K for the test set. In order to test our system on arbitrary trajectories, we create a synthetic dataset of moving MNIST digits: we take the MNIST digits and move them within a fixed-size window for 16 frames. The first frame is given as input and the system predicts the following 15 frames.
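The snippet below sketches one way such a moving-digit clip and its stroke keypoints could be generated; the 64-pixel canvas, step size, and random-walk trajectory are our assumptions rather than the exact trajectory generator used for training:

```python
import numpy as np

def make_moving_digit(digit, num_frames=16, canvas=64, step=3, seed=0):
    """Paste a 28x28 MNIST digit onto a canvas along a random-walk trajectory.
    Returns the frames and the per-frame centroids used to build the stroke."""
    rng = np.random.default_rng(seed)
    h, w = digit.shape                                   # 28 x 28
    x, y = rng.integers(0, canvas - w), rng.integers(0, canvas - h)
    frames, centroids = [], []
    for _ in range(num_frames):
        frame = np.zeros((canvas, canvas), dtype=np.float32)
        frame[y:y + h, x:x + w] = digit
        frames.append(frame)
        centroids.append((x + w // 2, y + h // 2))
        # random walk, clipped so the digit stays inside the canvas
        x = int(np.clip(x + rng.integers(-step, step + 1), 0, canvas - w))
        y = int(np.clip(y + rng.integers(-step, step + 1), 0, canvas - h))
    return np.stack(frames), centroids
```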
Our method can generate video sequences from a single image according to the input strokes. Fig. 3 (a) shows examples of different generated frame sequences given the same stroke input but different digits. Fig. 3 (b) shows the resulting video sequences given the same input image and different strokes. We observe that the digits move along the given trajectory as desired.
The KTH Action dataset contains grayscale videos of 25 persons performing various actions, from which we choose the subsets walking, running and jogging, since the others do not contain sufficiently large global motion. We train on all data of these three motions except persons 21-25, which we reserve for testing, yielding 98k frames for training and 19k for testing. Since there are sequences where the person walks out of the frame, we employ the YOLO object detector [24] to exclude frames where the confidence of detecting a person is below 0.5.
Qualitative results are shown in Fig. 4. We randomly take 16 sequential frames of one video sequence as one clip, use the first frame as input, and compute the centroid of the bounding box in each frame as the stroke input. The system predicts the following 15 frames. In order to demonstrate that the output of our network is a function of the motion stroke, we test it with varying input strokes while keeping the initial frame fixed. Examples are shown in Fig. 4 for the KTH dataset; the stroke in each row is taken from a different video in the test set and applied to the image in the first column. We can see that, given the same input image, different strokes result in different video sequences. It turns out that the spacing between consecutive stroke points encodes the type of motion: densely spaced strokes translate into walking, strokes with large intervals correspond to running, and jogging lies between the two. We also observe that the generated frames follow the position of the sketch points. From the third and fourth rows of Fig. 4 (b) we further see that, given similar strokes, the system can generate the same type of motion with different details. Moreover, if the pose of the person in the input image resembles a running posture while the stroke describes a walking motion, the system smoothly generates a realistic transition from running to walking, instead of jumping directly to a walking pose.
Since our system generates images recurrently frame by frame, we can also test it on generating longer video sequences by manipulating the stroke input. Fig. 5 shows that, although we trained with sequences of 16 frames, at runtime we could synthesize up to 24 frames while preserving sharpness. We speculate that if the system were trained with longer sequences, it could generate even longer high-quality videos.
Many video prediction methods require multiple frames as input. We take the recent work of Denton et al. [8] for comparison. Fig. 6 shows the qualitative comparison: Denton et al. require 10 frames as input, so we take their 10th frame as our single input. We observe that our result with only one image is comparable to that of Denton et al. with 10 input frames. Like our method, Li et al. [18] require only one image as input (together with a noise vector) and predict a fixed number (16) of frames. Fig. 7 shows that we generate images with a quality similar to that of Li et al. and better than Xue et al. [34].
Note that, unlike the other methods, we do not resize the image to a square during training. This resizing makes the animated person look slimmer than in the original input sequence, which is apparent in the videos generated by Li et al. and Denton et al. when compared to our results.
Table 1 (excerpt): mean and standard deviation of the joint displacement between consecutive frames for each motion category; the last column is the person detection rate in percent.
|Denton et al. [8]||7.5||10.0||9.9||11.9||10.7||11.5||54.2|
|Li et al. [18]||7.4||9.1||10.1||11.3||8.7||9.9||54.9|
Quantitative analysis is performed by computing motion statistics on the generated sequences and by using the Fréchet Inception Distance (FID) [11] as well as the Learned Perceptual Image Patch Similarity (LPIPS) [36] to assess how realistic the generated images are. Since video prediction in our work and in the compared works is non-deterministic, we have no access to ground truth with which to evaluate the output using a pixel-wise metric. Instead, we compute motion and object-detection statistics on generated as well as ground-truth KTH sequences independently; the comparison is shown in Table 1. We extract the pose joints in consecutive images using the convolutional pose machine [33] and measure the mean and standard deviation of the Euclidean distance between corresponding joints. Lower numbers represent a smoother trajectory of the detected pose joints, meaning that the pose detection benefits from better image quality. The pose is only evaluated on the subset of frames in which a person is detected, and the fraction of such frames is listed in the last column of Table 1. The detection rate on the ground truth is 100% since we train and test only on frames where a person is detected. Table 1 shows that we outperform the other methods in all motion categories except running and have the highest object detection rate.
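A small NumPy sketch of this joint-smoothness statistic, assuming the detected joints of a clip are given as a (T, J, 2) array; this is our illustration rather than the evaluation code used for the paper:

```python
import numpy as np

def joint_motion_stats(joints):
    """joints: array of shape (T, J, 2) with 2D pose joints for T consecutive frames.
    Returns mean and std of the Euclidean distance between corresponding joints
    in consecutive frames; lower values indicate smoother detected trajectories."""
    diffs = joints[1:] - joints[:-1]                  # (T-1, J, 2) frame-to-frame displacement
    dists = np.linalg.norm(diffs, axis=-1)            # (T-1, J) per-joint Euclidean distances
    return dists.mean(), dists.std()

# Example with random joints for a 16-frame clip with 14 joints.
mean_d, std_d = joint_motion_stats(np.random.rand(16, 14, 2) * 100)
```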
Furthermore, our comparison of the perceptual metrics for 15 predicted frames is shown in Table 2 (lower scores are better). We compute the numbers for Li et al. and Denton et al. by re-training their publicly available code on our data. We outperform both works, and our scores are closest to those of Denton et al., who condition on 10 past frames.
We also evaluate our model on another real dataset, Human3.6M, a collection of indoor videos in which eleven actors filmed from four viewpoints perform various actions. In our experiments we only use the walking subset consisting of 219k frames, from which we take 24k for testing; none of the test actors appear in our training data. We sample sequences of 16 frames by choosing a random starting frame and varying strides to diversify the subsequences. Fig. 8 shows generated video sequences with different strokes. We see that our model generates realistic video sequences.
We also show a failure case in the last row of Fig. 9, where the stroke is not drawn in the direction the person is facing. This is due to a lack of training data in which a person turns around and walks in the opposite direction; we speculate that this shortcoming can be addressed with more diverse training data. Nonetheless, these qualitative results demonstrate that our system generates diverse motions that vary according to the user input.
5 Ablation study
Fig. 11 shows the necessity of the initial input image, the whole stroke, and the instant stroke between consecutive time steps. We use a shorthand for each combination of input elements; for example, one variant receives only a subset of these elements as input, in addition to the feature predicted in the previous iteration. In every variant the predicted feature is used recurrently to predict the next frame, and each variant is re-trained and tested on the same data.
In Fig. 11, the second and third rows show the results of two such reduced-input variants. The stroke image encodes time with increasing color intensity, but since our system is recurrent and the network is updated for each frame, it is difficult for the system to know which stroke keypoint the current frame corresponds to. As a result, in Fig. 11 (b) we can see that the object does not move forward when the ordering of the frames is unknown. With the additional instant stroke segment given as input at each time step, the system can locate the current frame in time and predict the next one. On the other hand, without the complete stroke it is challenging for the system to understand the motion of the whole video: Fig. 11 (c) shows that the person moves forward, as the instant stroke provides guidance in the general direction, but the result is not a walking animation.
In a third ablation, also shown in Fig. 11, we remove the input image and only keep the strokes. One can observe that without the image the quality decreases as more frames are generated: since our system is recurrent, artifacts in the generated frames accumulate from one predicted image to the next. This shows that the initial image provides useful texture information at each time step and is necessary to reduce the accumulation of errors.
Finally, the last row in Fig. 11 shows the result with all inputs combined, as presented in the paper. This confirms that all inputs contribute to improved image quality and realism of the motion.
6 Video clip
We show multiple sample sequences of 16 frames and the corresponding input strokes in Fig. 10. The corresponding video file video1.avi is also available as part of the supplementary material. On the left side is the ground truth and on the right side is our generated video; the blue path is the centroid of the bounding box in the ground-truth frames. These samples demonstrate that the person moves along the provided stroke. We also show generated videos with 24 frames in the supplementary material video2.avi (the last 8 frames are predicted beyond the 16-frame training regime). In these videos the person moves smoothly in each clip.
7 Architecture details
We show our network details in Table 3. For the predictor, we use an architecture similar to DenseNet [13] with six dense blocks and one transition layer. In each dense block there are six bottleneck layers, each with the same architecture; in Table 3 we only show the output channels of each conv layer before concatenation. Shown for each layer are the number of feature channels, the kernel size, the stride, the padding, and the non-linearity that follows. conv and deconv denote the convolution and transposed convolution operations respectively. In the discriminators, we apply spectral normalization after each layer, and the last layer is fully-connected (denoted as fc). One discriminator has two parts: the first part takes the input image and the second part takes the image cropped with a mask centered on the object. We flatten the output of the last convolution layer of each part into a vector, concatenate the two, and feed the result to the fully-connected layer at the end.
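A rough sketch of such a two-branch discriminator follows; the layer counts, channel widths, and input resolution are our placeholders, not the values from Table 3:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class TwoBranchDiscriminator(nn.Module):
    """Two convolutional branches (full image and object-centered crop) whose
    flattened outputs are concatenated and scored by a final fully-connected layer."""
    def __init__(self, in_ch=3, feat=64, spatial=8):   # spatial = input size / 8 (e.g. 64x64 input)
        super().__init__()
        def branch():
            return nn.Sequential(
                spectral_norm(nn.Conv2d(in_ch, feat, 4, 2, 1)), nn.LeakyReLU(0.2),
                spectral_norm(nn.Conv2d(feat, feat * 2, 4, 2, 1)), nn.LeakyReLU(0.2),
                spectral_norm(nn.Conv2d(feat * 2, feat * 4, 4, 2, 1)), nn.LeakyReLU(0.2))
        self.full_branch = branch()     # operates on the whole frame
        self.crop_branch = branch()     # operates on the masked, object-centered crop
        self.fc = spectral_norm(nn.Linear(2 * feat * 4 * spatial * spatial, 1))

    def forward(self, full_img, cropped_img):
        a = self.full_branch(full_img).flatten(1)
        b = self.crop_branch(cropped_img).flatten(1)
        return self.fc(torch.cat([a, b], dim=1))        # real/fake score
```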
In this paper, we propose a novel method to synthesize a video clip from one single image and a user-provided motion stroke. Our method is based on a recurrent architecture and is capable of generating videos by predicting the next frame given the previous one. Thus it is possible to generate videos of arbitrary length. We demonstrate our approach on several real datasets with human motion and find that it can animate images realistically. Although the proposed method can generate different videos based on the input motion stroke, the variability of the output videos depends on the data that the model was trained on. We believe that this limitation could be addressed by collecting and training on more data.
-  M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
-  G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag. Synthesizing images of humans in unseen poses. arXiv preprint arXiv:1804.07739, 2018.
-  W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos. Contextvp: Fully context-aware video prediction. arXiv preprint arXiv:1710.08518, 2017.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1611.08050, 2016.
-  C. Chan, S. Ginosar, T. Zhou, and A. A. Efros. Everybody dance now. arXiv preprint arXiv:1808.07371, 2018.
-  B.-Y. Chen, Y. Ono, and T. Nishita. Character animation creation using hand-drawn sketches. The Visual Computer, 21(8-10):551–558, 2005.
-  J. Davis, M. Agrawala, E. Chuang, Z. Popović, and D. Salesin. A sketching interface for articulated figure animation. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 320–328. Eurographics Association, 2003.
-  E. Denton and R. Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  Z. Hao, X. Huang, and S. Belongie. Controllable video generation with sparse trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7854–7863, 2018.
-  M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
-  C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang. Flow-grounded spatial-temporal video prediction from still images. In European Conference on Computer Vision, 2018.
-  M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
-  R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV), volume 2, page 5, 2017.
-  A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa. Video textures. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 489–498. ACM Press/Addison-Wesley Publishing Co., 2000.
-  C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  M. Thorne, D. Burke, and M. van de Panne. Motion doodles: an interface for sketching character motion. In ACM Transactions on Graphics (TOG), volume 23, pages 424–431. ACM, 2004.
-  S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993, 2017.
-  A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
-  C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, pages 613–621, 2016.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
-  T. Xue, J. Wu, K. L. Bouman, and W. T. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
-  C. Yin, Y. Gui, Z. Xie, and L. Ma. Shape context based video texture synthesis from still images. In Computational and Information Sciences (ICCIS), 2011 International Conference on, pages 38–42. IEEE, 2011.
-  R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924, 2018.