Learning a perceptual manifold with deep features for animation video resequencing

We propose a novel deep learning framework for animation video resequencing. Our system produces new video sequences by minimizing a perceptual distance between images from an existing animation video clip. To measure perceptual distance, we utilize the activations of convolutional neural networks and learn a perceptual distance by training a small network on these features with data comprised of human perceptual judgments. We show that with this perceptual metric and graph-based manifold learning techniques, our framework can produce new smooth and visually appealing animation video results for a variety of animation video styles. In contrast to previous work on animation video resequencing, the proposed framework applies to a wide range of image styles and does not require hand-crafted feature extraction, background subtraction, or feature correspondence. In addition, we show that our framework can be applied to arrange unordered collections of images in an appealing way.




1 Introduction

From its beginnings, animation has brought to life the creative potential of the human mind. It has developed into a dominant visual storytelling tool, and today there exists a wealth of archived animation video sequences created with both traditional and modern computer animation techniques. The visual style of animation video sequences is diverse, from stop-motion and three-dimensional photo-realistic renderings to cartoon illustrations and line-sketches. Although many techniques have been developed to ease the computer animation pipeline, production is still an arduous process, due in large part to the complexity of digital characters, environments, and motion. The goal of this paper is to generate new video sequences by resequencing general and diverse animation video sources. With our framework, animators can efficiently and interactively generate new animations according to their intent, which reduces the complexity, time, and expense of creating animations.

Figure 1: An example of the manifold topology and two animation video resequences generated from a collection of unordered animation images with the proposed method.

Previous works on animation resequencing [de2004cartoon, yu2012combining, yu2012semisupervised] have focused on hand-crafted feature extraction techniques to measure texture and shape similarity for cartoon-style animation, and they require background subtraction, segmentation, and other image processing techniques. Motivated by these limitations, we propose a framework for animation video resequencing which can be applied to a broader range of image styles and does not require hand-crafted feature extraction, background subtraction, or feature correspondence. Given the abundance of animation data which is currently available, our proposed resequencing framework generates new animations from existing animation video clips.

The proposed framework learns a topological manifold of images where paths on the manifold represent smooth and visually plausible animation video frame sequences. To demonstrate that our proposed framework is general enough to handle a variety of animation video styles as source data, including photorealistic, non-photorealistic, stylized, and line-sketch styles, we experimentally evaluate our method using different kinds of image and animation data. Also, given selected key-frames, our animation video resequencing method can create smooth and appealing user-controlled animation video.

In this paper, we utilize the activations of deep convolutional neural networks for smooth image sequencing and animation video resequencing. Given a pre-trained CNN and selected activation layers, we learn a perceptual similarity metric which reflects the perceptual judgments of humans. Then, from a collection of images and their pair-wise perceptual distances, we generate smooth new animation video sequences by traversing paths and cycles in an estimated perceptual manifold. To the best of our knowledge, we are the first to apply deep features to the problem of animation video resequencing. Our technical contributions can be summarized as follows:

  • We combine deep feature extraction and a perceptual similarity metric with a graph-based manifold learning technique to generate new and smooth animation video sequences.

  • We implement our method with two well-known deep learning architectures, VGG [simonyan2014very], and AlexNet [krizhevsky2017imagenet], and perform an experimental evaluation of the deep features learned by these architectures.

  • We give a quantitative comparison of our animation video resequencing results with other image similarity metrics, including distance in image space, distance of the bottleneck layer activations of a denoising autoencoder [vincent2010stacked], and results obtained by traditional manifold learning techniques, Locally Linear Embedding [roweis2000nonlinear] and Isomap [tenenbaum2000global].

  • We demonstrate that our method can facilitate several applications such as creating image layouts and video synthesis.

  • To the best of our knowledge, this is the first framework shown to handle different kinds of animation styles in general without extra effort for each kind of data.

We organize the remainder of this paper as follows. In section 2, we review previous research in the sequencing domain. In section 3, we give an overview of our proposed system. In section 4, we describe the methods used in our system. In section 5, we present our experimental results and evaluations, followed by additional applications. The conclusions and our future work are presented in the last section.

2 Related work

Smooth image sequencing is the key to producing a visually pleasing animation video. Most previous research divides this sequencing problem into two distinct steps. The first step is to establish a suitable distance measure for the similarity between the input images. The second step is to determine an optimal sequence according to the similarity measure defined in the first step. While the earliest work [schodl2000video] used simple distance in the original image space as a similarity measure, more recent works have focused on feature extraction and dimension reduction techniques to measure higher-level features of shape [ling2007shape], appearance [fried2017patch2vec], pose [osadchy2007synergistic], and motion [holden2015learning].

In the proposed method, we use the activations of deep convolutional neural networks for feature extraction and a metric inspired by the Learned Perceptual Image Patch Similarity (LPIPS) metric proposed by [zhang2018unreasonable] to measure the perceptual distance of images. PSNR and SSIM [hore2010image] are well-known metrics that are widely used to measure the similarity between two images. However, since the goal of our work is to learn a perceptual manifold with deep features, we adopt the LPIPS metric, which is suitable for measuring perceptual similarity across deep visual representations [zhang2018unreasonable].

2.1 Feature Extraction and Non-linear Dimension Reduction

Patch2Vec [fried2017patch2vec] proposed a novel learning framework for image patch embedding, where an embedding is learned so that distance in the embedding space provides a useful measure for high-level features of texture dissimilarity. They trained a convolutional neural network (CNN), using a segmentation dataset and a triplet loss function, to map image patches having the same texture to points which are nearby in the embedding space while mapping image patches having other textures as far away as possible. The proposed method takes a similar approach by using deep convolutional networks to extract an optimal set of image features. However, instead of learning a perceptual metric with a triplet loss, we use the LPIPS metric trained on perceptual judgments of humans.


[osadchy2007synergistic] proposed a view-independent Energy-Based Model that simultaneously detects faces and estimates head pose. They train a CNN to map images containing faces to points on a lower-dimensional face-manifold. After training, if the CNN maps an image to a point close to the manifold, then its pitch, roll, and yaw are estimated from the position of the point's projection onto the manifold. However, no comparable energy-based model exists for general animation sequencing, since the motion of a variety of characters and scenes must be estimated.


[holden2015learning] used a convolutional autoencoder for learning a manifold of human motion. They trained their proposed CNN with motion capture data consisting of time-series of human poses, and each convolution layer of the CNN performs one-dimensional convolution over the temporal domain. Since their CNN is trained using motion capture data, it is not suitable for feature extraction of images.


[yu2012combining, yu2012interactive] described methods for cartoon retrieval and clip synthesis using a multi-feature distance function and a partially user-labeled database of cartoon characters to construct a lower-dimensional feature space with a sparse transfer learning technique. Cartoon Textures [de2004cartoon] proposed a feature-distance based on the shape [huttenlocher1993comparing], appearance, and the temporal ordering of an input cartoon sequence, and then utilized the manifold learning technique Spatio-Temporal Isomap (ST-Isomap) to recover a lower-dimensional embedding of the input. Unlike the proposed method, ST-Isomap requires an initial ordering for the input images and thus does not apply to unordered collections of images. Moreover, ST-Isomap and traditional manifold learning techniques like Locally Linear Embedding (LLE) [roweis2000nonlinear] and Laplacian Eigenmaps (LEM) [belkin2003laplacian] require predetermined parameters, including the dimension of the manifold and the number of neighbors of each input image. These parameters require fine-tuning for each collection of input images. Cartoon Textures [de2004cartoon] and the methods proposed by [yu2012combining, yu2012interactive] all use hand-crafted feature-extraction methods specific to cartoon images and require much preprocessing, including segmentation of a character and pairwise computation of the Hausdorff distance. Furthermore, the Cartoon Textures [de2004cartoon] method relied on knowledge of the ordering of the original input sequence, and the methods of [yu2012combining, yu2012interactive] required user-labeled data to construct the embedding used to measure image similarity. In contrast, the method proposed in this paper requires no user-labeling, no segmentation, and works with a variety of image styles.

2.2 Sequential Ordering of Images

Determining a sequential ordering of images is usually posed as a path-finding problem in a weighted graph, where nodes correspond to images and transition costs are based on the image dissimilarity measure and possibly other criteria such as path smoothness, temporal ordering of the input sequence, or user-control. Previous work considers transition costs in complete graphs and nearest neighbor graphs.

Video textures [schodl2000video] is a video-based rendering technique that applies Q-learning [kaelbling1996reinforcement] to generate a video sequence of arbitrary length with similar dynamics to the input video. The method produces convincing results when the input video has repetitive motion or unstructured stochastic motion but fails for complex structured motion like full-body human motion. This limitation is a consequence of using distance on the raw pixels of the images, which cannot sufficiently measure similarity in high-level features of motion. To ameliorate this issue, Schödl and Essa [schodl2001machine, schodl2002controlled] trained a linear binary classifier from manually labeled training data based on six hand-crafted features. Images deemed unacceptable by the linear classifier are not considered for transitions, and for the other images, the similarity measure is the linear classifying function. Schödl and Essa [schodl2001machine] applied a beam-search technique to obtain the optimal sequence. To improve the sequencing results of their beam search, Schödl and Essa [schodl2002controlled] considered the temporal information of the original input sequence in the transition cost and adopted a greedy hill-climbing optimization, which starts with a random image sequence and iteratively changes subsequences to lower the total path cost. In contrast to these previous methods, our method applies to unordered and unlabeled collections of images.

Unlike the previously discussed techniques which sequence photo-realistic images, Cartoon Textures [de2004cartoon] and the works by [yu2012combining, yu2012interactive] synthesize cartoon animations. [yu2012interactive] used a greedy method, choosing a random cartoon image as the first frame, and then choosing the most similar image, measured in the low-dimensional subspace, for each subsequent frame. Cartoon Textures and [yu2012combining] synthesized new animations by finding the shortest paths in a graph constructed by the ST-Isomap and Isomap manifold learning algorithms, respectively. It is important to note that traditional manifold learning algorithms such as ST-Isomap and Isomap do not construct the graph automatically: they require the neighbor relations of the input data and the dimension of the embedding to be defined beforehand, both of which are difficult to estimate. In our framework, we do not compute an explicit embedding and can adapt to different CNN architectures. Also, we automatically determine neighbor relations by minimizing the perceptual distance of the input data. Therefore, our method does not require fine-tuning for each input collection like traditional manifold learning techniques.

3 System overview

We outline our proposed framework in Figure 2. The input to our system is a collection of images and a trained CNN. The CNN serves as an image feature extractor, and we learn a perceptual distance by training another neural network on a dataset comprised of perceptual judgments of humans. In our implementation, we test the activations of AlexNet and VGG trained for image classification. However, features extracted by other CNNs trained for tasks other than classification are also useful for measuring perceptual distance [zhang2018unreasonable] and could be incorporated into our system. After extracting the deep features from each image, we compute the pairwise perceptual distance of each image in the input collection using the LPIPS metric proposed by [zhang2018unreasonable]. Once the perceptual distances are computed, the proposed system can create:

  • a path sequence which uses all images in the collection given start and terminal frames;

  • a cycle animation which uses all or some subset of images in the collection;

  • or a path animation with smooth in-between images given a set of key-frames.

For some collections of images, it may not be possible to obtain a smooth animation sequence using every image in the collection. For input collections obtained from densely sampled videos, we can assume there exists at least one smooth sequence which uses all images in the input collection, i.e., the original animation. However, if the images come from sparsely sampling an existing animation video or from an unordered collection of images, it may not be possible to resequence all of the images into a smooth sequence. Therefore, we also detect and prune outlier images from the input collection by fitting the perceptual distance of nearest neighbors to an optimal probability distribution using maximum likelihood estimation.

Then, from the pairwise perceptual distances, we construct a complete graph where each image corresponds to a node, and the weight of each edge is equal to the perceptual distance of the adjacent images. To generate an optimal animation sequence through all of the input images, we compute the shortest Hamiltonian path between the start and terminal frames assigned by the user, or generate a looping animation by computing the shortest tour. For key-frame path-finding, we compute a minimum spanning tree (MST) of the complete graph and generate animation sequences by traversing paths in the MST. Figure 2 shows an overview of the system.

Section 4.1 describes the feature extraction details, section 4.2 describes the procedure for computing perceptual distance, section 4.3 describes the method for automatic outlier detection and removal, and section 4.4 describes animation resequencing.

Figure 2: System overview of the proposed method.

4 Method

4.1 Deep feature extraction

To compute the perceptual distance, we first extract features with a trained convolutional neural network (CNN) $\mathcal{F}$. The image features for an image $x$ are a set of activations, $\{y^l \in \mathbb{R}^{H_l \times W_l \times C_l}\}_{l=1}^{L}$, obtained from the selected layers after applying the CNN $\mathcal{F}$ to the image, where $L$ is the total number of selected activation layers, and $H_l$, $W_l$, and $C_l$ are the dimensions of the output of a selected activation layer $l$. Thus, the size of the features and the speed at which they are extracted depend on the architecture of $\mathcal{F}$.

In our implementation, we test two off-the-shelf networks, namely, VGG [simonyan2014very] and AlexNet [krizhevsky2017imagenet]. Both networks are trained on the ImageNet dataset [russakovsky2015imagenet] and have excellent performance in image classification, object detection, and related tasks. In addition, a number of deep learning-based studies in this problem domain adopt them for feature extraction. Thus, these two pre-trained networks are suitable for extracting features from images. In each backbone, we remove the later fully connected layers and use the first five activation layers as feature extractors. While VGG and AlexNet are trained for classification of natural images, our experimental results show that their activations are also useful for resequencing the non-photorealistic image styles used in animation.

4.2 Perceptual Distance

In this section, we briefly summarize the perceptual distance used for generating animation sequences. We use the LPIPS metric proposed by [zhang2018unreasonable]. Given a trained convolutional network $\mathcal{F}$ and two images $x$ and $x_0$, we extract the activations from the $L$ selected layers and unit-normalize in the channel dimension to obtain features $\hat{y}^l$ and $\hat{y}^l_0$ for each selected layer $l$. To compute the perceptual distance $d(x, x_0)$, we scale the difference of activations element-wise by learned perceptual calibration weight tensors $w_l$, compute the squared $\ell_2$ norm, average spatially, and sum over all layers. This distance is expressed as follows:

$$d(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}^l_{hw} - \hat{y}^l_{0hw} \right) \right\|_2^2 \qquad (1)$$

where the weights $w_l$ are learned by a small network $\mathcal{G}$ trained to predict a perceptual judgment $h$ from distance pairs $(d_0, d_1)$, where $d_0 = d(x, x_0)$, $d_1 = d(x, x_1)$, $x$ is a reference image, and $x_0$ and $x_1$ are distorted images of $x$. The judgment $h$ is determined by the proportion of humans that perceived the image $x_1$ to be more similar to $x$ than $x_0$, and the weights are obtained by minimizing a cross-entropy loss function, which is formulated as:

$$\mathcal{L}(x, x_0, x_1, h) = -h \log \mathcal{G}(d_0, d_1) - (1 - h) \log\left( 1 - \mathcal{G}(d_0, d_1) \right) \qquad (2)$$

The perceptual judgments are obtained from the publicly available Two Alternative Forced Choice (2AFC) dataset collected by [zhang2018unreasonable]. For additional details, we refer the reader to the original paper [zhang2018unreasonable]. For each pair of images $x_i$ and $x_j$ in the input collection, we use the LPIPS metric in equation 1 to compute a perceptual distance $d(x_i, x_j)$.
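The computation described by equation 1 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are plain nested lists indexed as `[h][w][c]`, the function names are hypothetical, and the weights are toy values rather than learned calibration weights.

```python
import math

def unit_normalize(feature_map, eps=1e-10):
    """Scale every channel vector feature_map[h][w] to unit length."""
    normalized = []
    for row in feature_map:
        new_row = []
        for channels in row:
            norm = math.sqrt(sum(v * v for v in channels)) + eps
            new_row.append([v / norm for v in channels])
        normalized.append(new_row)
    return normalized

def layer_distance(feat_a, feat_b, weights):
    """Spatial average of the squared norm of the weighted feature difference."""
    a, b = unit_normalize(feat_a), unit_normalize(feat_b)
    height, width = len(a), len(a[0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            total += sum((wc * (va - vb)) ** 2
                         for wc, va, vb in zip(weights, a[h][w], b[h][w]))
    return total / (height * width)

def perceptual_distance(feats_a, feats_b, layer_weights):
    """Sum the per-layer distances over all selected layers."""
    return sum(layer_distance(fa, fb, w)
               for fa, fb, w in zip(feats_a, feats_b, layer_weights))

# Toy example (hypothetical values): one layer, 1x1 spatial size, 2 channels.
feats_x = [[[[1.0, 0.0]]]]
feats_y = [[[[0.0, 2.0]]]]
toy_weights = [[1.0, 1.0]]
print(perceptual_distance(feats_x, feats_y, toy_weights))  # close to 2.0
```

In practice the feature maps would come from the selected CNN layers and the weights from the trained calibration network; the sketch only shows the order of operations (normalize, weight, square, average spatially, sum over layers).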

In section 5.3, we show a comparative analysis of animation results obtained with the LPIPS metric and other image similarity metrics, including distance in image space, distance in a denoising autoencoder's bottleneck activation feature space, the embeddings of two traditional manifold learning methods, and the cosine distance of the same deep features used with LPIPS.

4.3 Outlier Detection and Removal

Images which have a large perceptual distance from all other images in the input collection may negatively affect the smoothness of the resequenced animation result. Thus, to maintain the smoothness of the generated sequence, we remove outlier images which have a large perceptual distance from their nearest neighbors. A naive approach would be to simply threshold the perceptual distance of nearest neighbors in the complete graph. However, a constant threshold value cannot adapt to disparate input data. Therefore, we fit the perceptual distance of nearest neighbors to a probability distribution to detect and remove outlier images.


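Let $X_i$ be a random variable equal to the average perceptual distance between image $x_i$ and its $k$ nearest neighbors for $i = 1, \dots, N$:

$$X_i = \frac{1}{k} \sum_{j \in \mathcal{N}_k(i)} d(x_i, x_j)$$

where $\mathcal{N}_k(i)$ denotes the indices of the $k$ nearest neighbors of $x_i$. In our implementation, the number of nearest neighbors $k$ is a fixed constant.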

To find the most likely distribution, we estimate the parameters of the generalized gamma probability density function [stacy1962generalization] given the samples $X_i$ for $i = 1, \dots, N$, where $N$ is the total number of images in the input collection. The generalized gamma probability density is defined as

$$f(x; a, c, \mu, \sigma) = \frac{|c|}{\sigma\, \Gamma(a)} \left( \frac{x - \mu}{\sigma} \right)^{ca - 1} \exp\left( -\left( \frac{x - \mu}{\sigma} \right)^{c} \right), \quad x > \mu.$$

The parameters $a$, $c$, $\mu$, and $\sigma$ are obtained with maximum likelihood estimation, i.e., by maximizing the log-likelihood function of $f$ given the random samples $X_1, \dots, X_N$. Once the parameters are found, we calculate the 0.9 quantile value $q_{0.9}$ as a threshold, remove the $i$-th column and row from the original distance matrix for any $X_i > q_{0.9}$, and update the complete graph.

We choose the generalized gamma distribution because of its flexibility. We also tested the Normal and Beta distributions but found that the generalized gamma distribution produced better fits to the sample histograms than these other distribution models.
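The pruning step can be illustrated with a short sketch. Note the substitution: to stay within the Python standard library, the threshold below is the empirical 0.9 quantile of the nearest-neighbor averages, not the quantile of a fitted generalized gamma distribution as described above. The function names and the toy data are hypothetical.

```python
def knn_average_distances(dist, k):
    """Average distance from each item to its k nearest neighbors."""
    n = len(dist)
    averages = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        averages.append(sum(others[:k]) / k)
    return averages

def empirical_quantile(values, q):
    """Linear-interpolation quantile (stand-in for the fitted distribution)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction

def prune_outliers(dist, k, q=0.9):
    """Drop rows and columns whose k-NN average exceeds the quantile threshold."""
    averages = knn_average_distances(dist, k)
    threshold = empirical_quantile(averages, q)
    keep = [i for i, a in enumerate(averages) if a <= threshold]
    pruned = [[dist[i][j] for j in keep] for i in keep]
    return keep, pruned

# Toy example: nine evenly spaced points on a line plus one far-away point.
positions = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
kept, pruned = prune_outliers(toy_dist, k=2)
print(kept)  # the index of the far-away point (9) is removed
```

Replacing `empirical_quantile` with the quantile of a distribution fitted by maximum likelihood would recover the procedure described in this section; the rest of the sketch (k-NN averaging, row and column removal) is unchanged by that choice.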

4.4 Animation Resequencing

From the perceptual distance matrix, we construct a complete graph $G = (V, E)$ where a node $v_i \in V$ corresponds to image $x_i$ and the weight of an edge $(v_i, v_j) \in E$ is $d(x_i, x_j)$. Initially, we view the complete graph as a crude approximation of a perceptual manifold. Traversing the complete graph would allow for large perceptual jumps, since each pair of images is adjacent, and a large perceptual distance between adjacent frames would result in an unsmooth animation sequence. To improve our estimation of the perceptual manifold, we find subgraphs which prune large edges from the complete graph and generate animations by traversing a modified graph structure. We consider three different types of subgraphs: the shortest Hamiltonian path [cormen2009introduction], the shortest Hamiltonian cycle [cormen2009introduction], and the minimum spanning tree (MST) [cormen2009introduction].

In this section, we formalize these problems and give additional details and justification for these methods. In general, finding a shortest Hamiltonian path or shortest Hamiltonian cycle is an NP-hard problem [garey1979computers]. For a given input collection with $N$ frames, an exhaustive search must examine all permutations of the set $\{1, \dots, N\}$. For larger image collections, finding an exact solution quickly becomes infeasible. However, for a complete graph, the existence of a Hamiltonian path and a Hamiltonian cycle is guaranteed, and many polynomial-time approximation algorithms with bounded error have been proposed [laporte1992traveling]. In our implementation, we use the commercial software Mathematica [Mathematica] to solve the shortest Hamiltonian path and Hamiltonian cycle problems. The MST, on the other hand, can be computed very efficiently using a greedy method such as Kruskal's algorithm [kruskal1956shortest].

4.4.1 Shortest Hamiltonian Path Sequence

For an input collection of images with $N$ frames, the shortest Hamiltonian path in the complete graph is the permutation $\pi$ of images in the set $\{x_1, \dots, x_N\}$ which minimizes the total perceptual distance between adjacent frames:

$$\pi^{*} = \arg\min_{\pi} \sum_{i=1}^{N-1} d\left( x_{\pi(i)}, x_{\pi(i+1)} \right)$$

Optionally, a user can add constraints to the set of permutations so that the first frame in the Hamiltonian path has index $s$ and the terminal frame has index $t$, i.e., $\pi(1) = s$ and $\pi(N) = t$. For input animations which do not contain cyclic motion, this method can be used to reconstruct the original animation sequence given $s$ and $t$.
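For small collections, the constrained shortest Hamiltonian path can be computed exactly with dynamic programming over subsets (the Held-Karp scheme). The sketch below is only an illustration of the optimization problem stated above, not the Mathematica solver used in our implementation; the function name is hypothetical, and the running time grows exponentially with the number of frames, so it is practical only for roughly twenty frames or fewer.

```python
from itertools import combinations

def shortest_hamiltonian_path(dist, start, end):
    """Exact shortest path visiting every node once, from start to end."""
    n = len(dist)
    if start == end:
        if n == 1:
            return [start], 0.0
        raise ValueError("start and end must differ when n > 1")
    others = [v for v in range(n) if v != start]
    # best[(mask, j)] = (cost, predecessor) of the cheapest path that starts
    # at `start`, visits exactly the nodes in `mask`, and ends at node j.
    best = {(1 << start, start): (0.0, None)}
    for size in range(1, n):
        for subset in combinations(others, size):
            mask = 1 << start
            for v in subset:
                mask |= 1 << v
            for j in subset:
                prev_mask = mask & ~(1 << j)
                candidates = []
                for i in (start,) + subset:
                    state = best.get((prev_mask, i))
                    if i != j and state is not None:
                        candidates.append((state[0] + dist[i][j], i))
                if candidates:
                    best[(mask, j)] = min(candidates)
    full = (1 << n) - 1
    cost = best[(full, end)][0]
    # Walk the predecessor links backwards to recover the frame order.
    path, mask, node = [], full, end
    while node is not None:
        path.append(node)
        previous = best[(mask, node)][1]
        mask &= ~(1 << node)
        node = previous
    return path[::-1], cost

# Toy example: four frames whose "distance" is the gap between line positions.
positions = [0, 3, 1, 2]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
print(shortest_hamiltonian_path(toy_dist, start=0, end=1))
```

Here frame 0 is the user-selected first frame and frame 1 the terminal frame; the returned order visits the frames in increasing line position, which is the minimum-cost ordering for this toy input.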

4.4.2 Shortest Hamiltonian Cycle Sequence

To compute a cyclic animation sequence, we compute the shortest Hamiltonian cycle of the complete graph. Finding the shortest Hamiltonian cycle is equivalent to the well-known traveling salesman problem and corresponds to a cyclic permutation $\pi$ of images which minimizes the total perceptual distance between adjacent frames in the set $\{x_1, \dots, x_N\}$:

$$\pi^{*} = \arg\min_{\pi} \left[ \sum_{i=1}^{N-1} d\left( x_{\pi(i)}, x_{\pi(i+1)} \right) + d\left( x_{\pi(N)}, x_{\pi(1)} \right) \right]$$

The Hamiltonian cycle can generate looping sequences with continuously smooth motion, which can be chained together to create a looping animation of arbitrary length. We show the result of uniformly sampling the shortest Hamiltonian cycle sequence in Figure 3.

Figure 3: Results from uniformly sampling the shortest Hamiltonian cycle sequence. In this example, the sequence is generated from six images. The blue arrows lead from the starting frame (top-left frame) to the ending frame (bottom-left frame) in a cycle.
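When an exact solver is not available, a looping order can be approximated with a standard heuristic. The sketch below builds a nearest-neighbor tour and improves it with 2-opt edge exchanges. This is a swapped-in heuristic for illustration, not the exact solver used in our implementation; it assumes a symmetric distance matrix, gives no optimality guarantee, and all function names are hypothetical.

```python
import math

def tour_length(dist, tour):
    """Total cost of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def nearest_neighbour_tour(dist, start=0):
    """Greedy initial tour: always move to the closest unvisited frame."""
    tour = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: (dist[last][j], j))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def two_opt(dist, tour):
    """Reverse tour segments while doing so shortens the cycle."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a node; nothing to swap
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                delta = dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]
                if delta < -1e-12:
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour

# Toy example: corners of a unit square listed in a scrambled order.
points = [(0, 0), (1, 1), (1, 0), (0, 1)]
toy_dist = [[math.dist(p, q) for q in points] for p in points]
loop = two_opt(toy_dist, nearest_neighbour_tour(toy_dist))
print(loop, tour_length(toy_dist, loop))
```

The resulting cycle can be played repeatedly to obtain a looping animation; for larger collections, the heuristic trades optimality for a running time that remains practical.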

4.4.3 Key-frame Path-finding

In key-frame path-finding, we use paths in the minimum spanning tree (MST) to return temporally coherent in-between frames given a set of key-frames chosen by the user. Animators typically choose key-frames as the beginning and end points of a temporally coherent transition. Thus, for in-between sequencing, we would like a high level of confidence that in-between images remain close to the perceptual manifold, and we aim to return a sequence of many temporally coherent images to the user.

Since all of the perceptual distances are positive, the MST is the minimum-distance subgraph which connects all of the images. Therefore our proposed method produces in-betweens by traversing the path from one key-frame node to another along an MST. The paths connecting key-frame nodes in an MST are well suited for finding in-between images since the distance between nodes is relatively small which gives us a higher level of confidence that the in-between images are temporally coherent. The MST also has the advantage of having the minimal set of edges for a path-connected graph containing each image in the input collection, thus reducing both the time and space complexity of path-finding.

With this method, users may create animations from any number of key-frames by computing paths between consecutive key-frames and combining the results. The user can also view a 2D linear embedding of the MST to see an overview of the entire dataset and help drive their decisions in key-frame selection. Figure 1 shows how a user can use the MST’s 2D linear embedding to choose key-frames and view the sequence of in-between images.
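The key-frame procedure can be sketched as follows: build the MST with Kruskal's algorithm, find the unique tree path between each pair of consecutive key-frames, and concatenate the paths. This is a minimal illustration under the assumption of a symmetric distance matrix given as nested lists; the function names are hypothetical and the 2D embedding used for visualization is not covered.

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on the complete graph; returns an adjacency dict."""
    n = len(dist)
    parent = list(range(n))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    adjacency = {v: [] for v in range(n)}
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components: keep it
            parent[root_i] = root_j
            adjacency[i].append(j)
            adjacency[j].append(i)
    return adjacency

def tree_path(adjacency, source, target):
    """Unique path between two nodes of a tree (depth-first search)."""
    previous = {source: None}
    stack = [source]
    while stack:
        v = stack.pop()
        if v == target:
            break
        for u in adjacency[v]:
            if u not in previous:
                previous[u] = v
                stack.append(u)
    path, v = [], target
    while v is not None:
        path.append(v)
        v = previous[v]
    return path[::-1]

def keyframe_sequence(adjacency, keyframes):
    """Chain the tree paths between consecutive key-frames."""
    sequence = [keyframes[0]]
    for a, b in zip(keyframes, keyframes[1:]):
        sequence.extend(tree_path(adjacency, a, b)[1:])
    return sequence

# Toy example: five frames whose "distance" is the gap between line positions.
positions = [0, 4, 1, 3, 2]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
mst = minimum_spanning_tree(toy_dist)
print(keyframe_sequence(mst, [0, 4, 1]))
```

As noted in section 5.2, the number of in-betweens returned depends on the tree structure, so the user controls it only indirectly through key-frame selection.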

5 Experimental Results

We separate our experimental results and evaluation into five subsections. In section 5.1, we show the training details in our proposed framework. In section 5.2, we show some representative results of our framework and a variety of animation sequences generated with the Hamiltonian path, Hamiltonian cycle, and key-frame pathfinding methods. In section 5.3, we compare the effectiveness of applying the deep features and the LPIPS metric to animation resequencing with other image similarity metrics. We give a quantitative comparison between distance in image space, distance of the bottleneck layer of a denoising autoencoder, traditional manifold learning algorithms LLE and Isomap, cosine distance and the LPIPS metric [zhang2018unreasonable] applied to the deep features extracted by VGG and AlexNet. In section 5.4, we show results of our framework applied to unordered image sets for image layout applications and in section 5.5 we discuss the main limitations of our framework.

In addition to the results presented here, please see our supplementary material and video for additional results and comparisons which are available on our project website:

5.1 Training details

In the network architecture of the denoising autoencoder used for comparison, we use both general convolutional layers and residual blocks [he2016deep]. A residual block is a block of layers where the input to the block is added element-wise to the output of the block. This technique helps prevent vanishing gradients, which is a common problem in training deep neural networks. We use scaled exponential linear units (SELUs) [klambauer2017self] as our activation functions, except in the last layer, where we apply a sigmoid function to guarantee that the output image pixel values are between zero and one. We also use batch normalization layers in our model to keep the values of tensors propagating in the network at zero mean and unit variance. Together, the SELU activation functions and batch normalization layers allow us to train a deeper network and make training converge faster. Figure 4 gives additional details about the network architecture.

For training, we collected 20 Japanese cartoon animations, where each video is about 25 minutes long, and linearly rescaled each frame to a fixed width and height to reduce training time. To avoid images which are nearly identical, we obtained the training images by uniformly sampling one out of every ten frames. In total, our training set and validation set consist of 60000 and 10000 images, respectively. We use the $L_2$ distance to measure the error between the original image $I$ and the reconstructed image $\hat{I}$:

$$L_2(I, \hat{I}) = \sqrt{ \sum_{(i,j)} \left[ (r_{ij} - \hat{r}_{ij})^2 + (g_{ij} - \hat{g}_{ij})^2 + (b_{ij} - \hat{b}_{ij})^2 \right] }$$

where $(r_{ij}, g_{ij}, b_{ij})$ and $(\hat{r}_{ij}, \hat{g}_{ij}, \hat{b}_{ij})$ are the red, green, and blue components of the pixel $(i, j)$ in the original image and the reconstructed image, respectively. Then the optimal parameters of the encoder and decoder, $\theta$, are those which minimize a mean-square error loss across all iterations of the training process, where the batch size is set to 16. The initial parameters of the autoencoder, $\theta$, are set by drawing samples from a truncated normal distribution, similar to the technique described by Klambauer et al. [klambauer2017self]. Finally, we use the stochastic gradient-based optimization algorithm Adam [kingma2014adam] to obtain the optimal solution.

Figure 4: The architecture of the denoising autoencoder used in testing animation reconstruction.

5.2 Animation Resequencing Results

In this section, we show some representative results generated with our framework. To generate new animation sequences, we collected test data by sampling frames from animation videos, extracted deep features from the first five activation layers of VGG, applied the LPIPS metric to each pair of images in the input collection, and resequenced the animations with the proposed outlier removal and graph traversal methods.

Figure 5: Results from uniformly sampling the shortest Hamiltonian path sequence.

Figure 5 shows uniformly sampled frames of sequences generated by computing the Hamiltonian path. We show the full sequences in our supplementary video, as well as a comparison with animation results using features extracted from AlexNet and the other feature extraction methods described in section 5.3. The Hamiltonian path often reconstructs the original animation if the user gives the initial and terminal frames as constraints. However, by using the outlier removal and other frame constraints, it is possible to create new motion sequences which do not resemble the original input.

Our results show that Hamiltonian cycles can generate novel looping motion. A Hamiltonian cycle can create a pleasing looping effect, even when the input images come from an animation that is not originally a loop. Figure 3 shows a uniform sampling of the Hamiltonian cycle results. See our supplementary material and video for additional results. Although we obtained many pleasing results, there are input data for which the Hamiltonian cycle cannot immediately produce a smooth looping sequence. The proposed outlier removal method can improve outcomes in some of these cases. Another option is to remove outlier images manually. One advantage of our system is that image layouts of the Hamiltonian cycle make it much easier to visually detect outliers and smooth subsequences.

Figure 6 shows key-frame results generated with the proposed method. To create these results, we examined the MSTs to guide key-frame selection so that precisely six in-between images are returned. In general, the user cannot directly control the number of in-between images returned for an arbitrary key-frame selection. However, using the linear embeddings of the MST for visualization provides a useful way to select key-frames that produce the desired number of in-betweens. Figure 1 shows a portion of an MST’s 2D linear embedding, and our supplementary material shows the full versions for all results shown in the paper.

The in-between frames generated by the proposed method are typically temporally coherent for key-frames which have a relatively short path distance in the MST, but as the path distance between key-frame nodes increases, so does the probability of unreasonable in-betweens. In practice, we do not consider this a significant drawback, since choosing additional intermediate key-frames can avoid this issue.
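The key-frame idea can be sketched as follows: build a minimum spanning tree over the pairwise distances, then read off the unique tree path between two user-selected key-frames. This is a minimal illustration assuming a symmetric distance matrix given as nested lists; the function names are hypothetical, and Prim's algorithm is used here only as one standard way to obtain an MST.

```python
from collections import deque


def minimum_spanning_tree(dist):
    """Prim's algorithm on a complete graph (symmetric distance matrix).

    Returns the tree as an adjacency list {node: [neighbors]}.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking node to the tree
    parent = [-1] * n
    best[0] = 0.0
    adj = {i: [] for i in range(n)}
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            adj[u].append(parent[u])
            adj[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return adj


def tree_path(adj, start, goal):
    """Unique path between two nodes of a tree, found by breadth-first search."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]
```

The in-between frames are the interior nodes of the returned path, so their number is determined by the tree structure rather than chosen directly by the user.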

5.3 LPIPS Evaluation

To evaluate the performance of LPIPS for animation sequencing, we perform an animation reconstruction experiment with six related image similarity metrics. We collected 39 animations between 24 and 230 frames in length, shuffled the images, and attempted to reconstruct the original sequence by finding an optimal sequence which minimizes image dissimilarities between adjacent frames. The animations vary in style and content. We show the original animations and the reconstructions for each image dissimilarity used for comparison in the supplementary video.

We compare the LPIPS metric of deep features of VGG and AlexNet with:

  • distance in image space;

  • distance on the deep features of the bottleneck layer of a custom denoising autoencoder;

  • distance on the embeddings learned by the traditional manifold learning techniques LLE and Isomap;

  • cosine distance of deep features of VGG and AlexNet.

5.3.1 Distance in Image Space

To compute distance in image space, we represent each RGB image as a flat vector whose length is the product of the number of color channels, the width, and the height of our test images, and measure the distance between these vectors.
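As a minimal sketch of this baseline, the snippet below flattens each image and fills a symmetric pairwise distance matrix that can serve as edge weights of the complete graph. The Euclidean (L2) norm is an assumption, since the norm is not recoverable from the text here; the nested-list image layout and function names are also hypothetical.

```python
import math


def flatten(image):
    """Flatten image[row][col] = (r, g, b) into one list of channel values."""
    return [c for row in image for px in row for c in px]


def image_space_distance(img_a, img_b):
    """Distance between two flattened images (L2 norm is an assumption)."""
    a, b = flatten(img_a), flatten(img_b)
    if len(a) != len(b):
        raise ValueError("images must have the same shape")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(images):
    """Symmetric matrix of distances between every pair of images."""
    n = len(images)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = image_space_distance(images[i], images[j])
            dist[i][j] = dist[j][i] = d
    return dist
```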


5.3.2 Distance in DAE Bottleneck Activation Space

We compare our results with a custom denoising autoencoder (DAE) [vincent2010stacked]. An autoencoder is a neural network divided into two parts, an encoder and a decoder. We treat the output of the bottleneck layer as the features that the encoder extracts from the input. The encoding network of our DAE reduces each image to a lower-dimensional latent vector, where the latent space has channel, width, and height dimensions. To measure image similarity, we use distance on the activations of the bottleneck layer.


The architecture and training procedure of the denoising autoencoder are described in the supplementary materials.

5.3.3 Distance in LLE and Isomap Embeddings

The traditional manifold learning techniques LLE and Isomap map a set of images to a set of low-dimensional vectors, where the dimension of the embedding must be specified by the user. In addition to the dimension of the embedding, the number of neighbors of each image must be specified.

In our comparison with traditional manifold learning, we test both LLE and Isomap over all combinations of the tested values for the number of nearest neighbors and the embedding dimension, using distance on the learned embedding vectors.

Figure 6: Results of our proposed key-frame method. The first and last frames are selected by a user and the in-between frames are generated by traversing the minimum spanning tree.

5.3.4 Cosine Distance in VGG and AlexNet Activation Space

Lastly, we compare the cosine distance in the channel dimension of the same deep features used with the LPIPS metric described in Section 4.1.
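A rough sketch of a channel-wise cosine distance is given below for a single feature map stored as features[row][col] = list of channel activations. Averaging over spatial positions and using a single layer are assumptions made for illustration; the exact aggregation used in the experiments is not recoverable from the text here, and the function names are hypothetical.

```python
import math


def cosine_distance(u, v, eps=1e-10):
    """One minus the cosine similarity of two channel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)  # eps guards against zero norms


def feature_cosine_distance(feat_a, feat_b):
    """Channel-wise cosine distance averaged over all spatial positions."""
    dists = [
        cosine_distance(ch_a, ch_b)
        for row_a, row_b in zip(feat_a, feat_b)
        for ch_a, ch_b in zip(row_a, row_b)
    ]
    return sum(dists) / len(dists)
```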


The Kendall tau distances for each method and each test case (sorted independently for clarity) and a box and whisker chart are shown in Figure 7.

Figure 7: (a) A comparison of the normalized Kendall tau distance for the reconstruction of 39 animations with different image similarity metrics (independently sorted). (b) A box and whisker chart for all test animations and test methods. The whisker endpoints show the maximum and minimum distance values, the solid box shows the 25 percent and 75 percent quantiles, and the white notch shows the median value.

5.3.5 Animation Reconstruction Experiment

For each image metric, we repeat the following procedure for each animation in our test set:

  1. compute the complete weighted graph of images with the appropriate distance function as edge weights;

  2. compute a Hamiltonian path from the node corresponding to the first frame to the node corresponding to the last frame;

  3. calculate the normalized Kendall tau distance [kendall1938new] of the original sequence and the sequence generated by the Hamiltonian path.
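Step 2 of the procedure above can be sketched with an exact dynamic program over subsets (Held-Karp style) for a Hamiltonian path with fixed endpoints. This is a swapped-in technique for illustration: the solver used in the experiments is not specified in this section, and this exact approach is exponential in the number of frames, so it is only practical for roughly a dozen nodes. The function name is hypothetical.

```python
from itertools import combinations


def shortest_hamiltonian_path(dist, start, end):
    """Exact shortest Hamiltonian path from `start` to `end`.

    dist is a symmetric matrix (nested lists). Runs in O(n^2 * 2^n).
    """
    n = len(dist)
    if start == end:
        raise ValueError("start and end must differ")
    inner = [v for v in range(n) if v not in (start, end)]
    if not inner:
        return [start, end]
    # cost[(S, v)] = (length of best path start -> ... -> v visiting S, predecessor of v)
    cost = {(frozenset([v]), v): (dist[start][v], start) for v in inner}
    for size in range(2, len(inner) + 1):
        for subset in combinations(inner, size):
            s = frozenset(subset)
            for v in subset:
                rest = s - {v}
                cost[(s, v)] = min(
                    (cost[(rest, u)][0] + dist[u][v], u) for u in rest
                )
    full = frozenset(inner)
    _, last = min((cost[(full, v)][0] + dist[v][end], v) for v in inner)
    # Walk the predecessor links back to the start node.
    path, s, v = [end], full, last
    while v != start:
        path.append(v)
        s, v = s - {v}, cost[(s, v)][1]
    path.append(start)
    return path[::-1]
```

For clips with hundreds of frames, a heuristic or a dedicated travelling-salesman solver would be needed in place of this exact search.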

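For an animation with $n$ frames, let $\sigma = (1, 2, \ldots, n)$ denote the original sequence of frames and let $\tau$ denote the shortest Hamiltonian path from frame-$1$ to frame-$n$, where $\tau(i)$ is the position of frame $i$ in that path. Then the normalized Kendall tau distance between the original sequence $\sigma$ and $\tau$ is defined as:

$$K(\sigma, \tau) = \frac{2}{n(n-1)} \left| \{ (i, j) : i < j,\ \tau(i) > \tau(j) \} \right| \qquad (12)$$
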
The Kendall tau distance measures the number of discordant pairs in the Hamiltonian path sequence. It is normalized so that the distance lies between zero and one for any number of frames in the animation clip. A Hamiltonian path sequence with the same order as the test animation has zero distance and a sequence with the reverse order has a distance of one; thus the normalized Kendall tau distance also gives a measure of rank correlation.
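The normalized Kendall tau distance can be computed directly by counting discordant pairs, as in the following minimal sketch. It assumes both sequences contain the same frame identifiers exactly once; the function name is hypothetical.

```python
def normalized_kendall_tau(original, reconstructed):
    """Fraction of frame pairs whose relative order differs between sequences."""
    n = len(original)
    if n < 2:
        return 0.0
    # Position of every frame in the reconstructed sequence.
    position = {frame: i for i, frame in enumerate(reconstructed)}
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if position[original[i]] > position[original[j]]:
                discordant += 1
    return discordant / (n * (n - 1) / 2)
```

The quadratic pair count is adequate for clips of a few hundred frames; a merge-sort based inversion count would be preferable for much longer sequences.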

To consider an input animation as ground truth for a Hamiltonian path, it must not contain cyclic motion. Thus, we visually inspected each animation and removed examples with cyclic motion. Additionally, we removed trivial cases where all test methods perfectly reconstruct the animation.

In total, we tested the reconstruction of 39 animations. The average reconstruction errors are shown in Figure 8. In the case of the traditional manifold learning algorithms LLE and Isomap, we test all settings of the number of nearest neighbors and the embedding dimension and select the lowest reconstruction error for each test animation. The results show that, on average, using features from VGG or AlexNet with the LPIPS metric produces Hamiltonian path sequences which are closer to the original sequence than all other test metrics. While all similarity metrics have relatively small reconstruction errors, the bottleneck activations of the denoising autoencoder have the worst results. This may be because the DAE is trained solely on Japanese manga-style images; more diverse training data could possibly improve the results of the DAE’s bottleneck features. The traditional manifold learning technique LLE outperforms Isomap and slightly outperforms the cosine distance of the extracted deep features of VGG and AlexNet. However, the experiment was slightly biased towards traditional manifold learning, since each animation was tested with 171 different parameter settings and only the single best result was counted towards the average reconstruction error. Despite this bias, LPIPS with VGG features and AlexNet features performed better than LLE without the need for any parameter tuning. We give additional results and details of the reconstruction experiment in our supplementary material.

Figure 8: Comparison of the average Kendall tau distance (equation 12) for 39 reconstructed animations. We test 8 different distance measures: pairwise distance of the raw image pixels, of the bottleneck layer of a denoising autoencoder, and of the low-dimensional embeddings learned by Isomap and LLE; and cosine distance and LPIPS of the activations of selected layers of VGG and AlexNet.

5.4 Additional Applications

5.4.1 Image Layouts

The proposed framework can also be used to create image layouts used for quickly browsing large collections of unordered images. Placing perceptually similar images next to each other can improve human image retrieval tasks by reducing the perceptual load and thus accelerating visual processing [schoeffmann2011similarity].

We tested our framework on input data for the data-driven morphing technique proposed by [averbuch2016smooth], comprising four image sets, each containing between 148 and 722 images of different instances of the same object. Our framework was capable of producing many smooth and visually appealing image layouts from this data. Because of the large number of images in the datasets, our framework can be useful for visualizing smooth subsequences, for example by identifying continuous subsequences of a given length with a minimum perceptual distance between adjacent frames, or by sampling longer sequences at even intervals. Figure 9 shows a radial image layout for image sequences generated with our proposed method, an example of how our system could be used to visualize large datasets of images of similar objects. Figure 10 shows an example of a smooth linear image layout generated by the proposed Hamiltonian path sequencing method applied to a collection of textured boot images. In the supplementary materials, we also present more results on this kind of application.
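One way to identify a smooth subsequence of a given length is a sliding window over the adjacent-frame distances of an already sequenced collection. The sketch below minimizes the total adjacent distance inside the window, which is an assumed reading of "minimum perceptual distance between adjacent frames"; the function name and the matrix representation are hypothetical.

```python
def smoothest_subsequence(sequence, dist, length):
    """Contiguous run of `length` frames with the smallest summed adjacent distance.

    sequence: ordered list of node indices; dist: pairwise distance matrix.
    Returns (subsequence, total adjacent distance).
    """
    if not 2 <= length <= len(sequence):
        raise ValueError("length must be between 2 and len(sequence)")
    # Distance of every consecutive pair in the ordered sequence.
    steps = [dist[a][b] for a, b in zip(sequence, sequence[1:])]
    window = sum(steps[: length - 1])
    best_cost, best_start = window, 0
    for start in range(1, len(sequence) - length + 1):
        # Slide the window by one frame: add the new step, drop the oldest.
        window += steps[start + length - 2] - steps[start - 1]
        if window < best_cost:
            best_cost, best_start = window, start
    return sequence[best_start : best_start + length], best_cost
```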

Figure 9: A radial image layout with sequences automatically generated by our system.
Figure 10: Linear image layout example generated by the proposed method. Readers are encouraged to see our supplementary video for a better visualization.
Figure 11: Natural image video examples.

5.4.2 Video Synthesis

While this work focuses on animation video resequencing, our framework also applies to natural image video resequencing. If the input images depict stochastic motion, such as grass swaying in the wind or ripples of water in a pool, as in the examples shown in Figure 11, smooth resequenced videos and cyclic animations can be generated using the LPIPS distance and the graph traversal algorithm described in our framework.

5.5 Limitations

Our framework’s main limitation is its dependence on the input data. If the collection of input images is taken by densely sampling a video sequence with a strong distinction between backward and forward motion, such as the school of fish and falling sequence shown in Figure 12, then the MST may be path-like, and the proposed sequencing methods will likely select frames which are very similar to the dynamics of the original video sequence. In general, it may not be possible to generate new dynamics from input collections that do not contain a sufficient variety in motion and appearance. One possible way to overcome this limitation would be to develop an image-synthesis technique to generate new images that interpolate or extrapolate new motion by considering motion directions of objects.

Figure 12: The school of fish example (a) and the falling sequence (b) are examples where the proposed framework cannot produce smooth sequences other than the original video sequence.

6 Conclusion and Future work

We proposed a novel deep-learning framework for animation video resequencing, which can generate smooth sequences and subsequences for many image styles. Our framework can serve as an efficient tool to automatically create new animation sequences from a collection of images. We also believe our framework could assist users in creating a comprehensive animation dataset by extracting many smooth subsequences from existing animation data. To our knowledge, a well-labeled dataset for general animation data does not yet exist.

Our results suggest that the activations of convolutional neural networks are useful features for smooth sequencing of photorealistic, non-photorealistic, ordered, and unordered image collections. Our quantitative analysis shows that deep features and the LPIPS metric can reconstruct animation sequences with greater accuracy than cosine distance of the same features, distance of the activations of a denoising autoencoder, distance in image space, and distance in the embedding space obtained by traditional manifold learning. Our qualitative results also show that the LPIPS metric produces a visible improvement over these other methods.

Despite the various styles, animators utilize a standard set of principles, including natural movement, to create more realistic looking animations. Thus, in the future, we would like to develop a self-supervised learning technique to extract motion features from existing animation video and combine metric learning and sequencing in a single deep learning optimization framework to solve problems in Figure