Self-Supervised Equivariant Scene Synthesis from Video

by Cinjon Resnick, et al.
New York University

We propose a self-supervised framework to learn scene representations from video that are automatically delineated into background, characters, and their animations. Our method capitalizes on moving characters being equivariant with respect to their transformation across frames and on the background being invariant to that same transformation. After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components. To the best of our knowledge, ours is the first method to perform unsupervised extraction and synthesis of interpretable background, character, and animation. We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.




1 Introduction

Learning manipulable representations of dynamic scenes is a challenging task. Ideally, we would give our model a video and receive an inverse rendering of coherent moving objects (‘characters’) along with a set of static backgrounds through which those characters move.

A motivating example is creating a stop-motion video in which each character is painstakingly moved across static backgrounds to form the frames. Our goal is to have a model train on any number of videos, each of which has an independent set of characters and background, and then be able to synthesize new scenes mixing and matching (possibly unseen) characters and backgrounds, thereby expediting the process of creating stop-motion videos. This long-standing problem, which has no complete solution (see Section 5), can be thought of as unsupervised separation and synthesis of background and foreground, and a strong solution would address other applications as well, such as video compositing. Current unsupervised methods performing synthesis are limited in that they either cannot handle backgrounds [NIPS2015_5845, visualdynamics16, Kosiorek2018sqair], cannot synthesize the foreground and background independently [DBLP:journals/corr/VillegasYHLL17, Tulyakov_2018_CVPR, siarohin2020order], or cannot handle animation [DBLP:journals/corr/VondrickPT16].

Figure 1: Approach: We use neural networks to parameterize the character encoder f_c, the background encoder f_b, the transformation function T, and the renderer g. We encode the image to be transformed, x_k, as well as the two images, x_i and x_j, whose transformation we impose on x_k. The output of T is an affine matrix that we apply to the character encoding f_c(x_k). That result is added to the background encoding f_b(x_k) and then decoded by g to yield the output x̂. For training, we require only two distinct images (x_i, x_j) from the same video, where x_k = x_i and x̂ should approximate x_j. At test time, we can render new scenes where only x_i and x_j need to be from the same sequence.

To be more specific, assume an input video of a dynamic scene captured by a static camera. The background consists of the objects that are constant across the frames. In contrast, the foreground consists of the objects that are equivariant with respect to some family of transformations. We use this difference to infer a scene representation that captures separately the background and the characters. Building on Dupont et al. [dupont2020equivariant], we attain two novel results:

  1. Learn the transformation: We do not require the underlying transformation applied to foreground objects during training, but instead learn it from nearby video frames. At inference time, this allows us to animate characters according to any transformation, not just ones derived from frames containing the character being transformed.

  2. Distinguish characters and background: By independently encoding the background and the manipulable character, we yield disentangled representations for both and can mix and match them freely.

Our model is trained in a self-supervised fashion requiring only pairs of nearby frames (without any annotation) from each video sequence. We impose no other constraint and can render new scenes combining characters and backgrounds on the fly with additional (potentially unrelated) frames from other sequences. In Sections 4.3 and 4.4 respectively, we show strong results on both Moving MNIST [DBLP:journals/corr/SrivastavaMS15] with static backgrounds and 2D video game sprites from the Liberated Pixel Cup [bart_2012], where we demonstrate the following manipulations:

  • Render the character from one sequence but with the background from another.

  • Render the character from one sequence using the transformation seen in the change between two frames of another sequence.

  • Combine both manipulations to render the character from one sequence, on the background from a second, using the transformation exhibited by the change seen in a third.

In Section 4.5, we report results on Fashion Modeling, a more realistic dataset for which our results are not as mature.

2 Background

Following Dupont et al. [dupont2020equivariant], we define an image x and a scene representation z, a rendering function g mapping scene representations to images, and an inverse renderer f mapping images to scene representations. It is helpful to consider g(z) as a 2D rendering of a character from a specific camera viewpoint. An equivariant scene representation satisfies the following relation for a renderer g that is equivariant with respect to a transformation θ, where t_θ acts in image space and T_θ is the corresponding operation in feature space:

  t_θ(g(z)) = g(T_θ(z)).  (1)

Figure 2: Labeling the rows from top to bottom, the first three are ground truth; the fourth row pairs the background from the second row with the character from the first; the fifth row pairs the character of the first row with the animations from the third; and the sixth row combines both manipulations. With counterclockwise direction and bottom-left origin, the transformations are rotate(15), rotate(9), translate(10, 4), translate(-8, -6).

In other words, transforming a rendering in image space is equivalent to first transforming the scene in feature space and then rendering the result. Dupont et al. posit functions g and f satisfying the following, where x is a 2D image rendering, z = f(x) is the scene representation of x, and θ is an affine transformation:

  t_θ(x) = g(T_θ(f(x))).  (2)
Equation 2 is a difficult equality to satisfy, so they propose learning neural networks to approximate it from data triples (x_1, x_2, θ), where x_1 and x_2 are renderings of the same character and θ is a rotation transforming the character from its appearance in x_1 to its appearance in x_2. They assume T_θ is the same operation as t_θ but in feature space, and consequently apply a rotation of θ to the 3D representation z_1 = f(x_1). With z_1 = f(x_1) and z_2 = f(x_2), they then train f and g to minimize the reconstruction loss:

  L = ||x_2 − g(T_θ(f(x_1)))||².  (3)
After training, new renderings can be inferred by first inverse rendering z = f(x), then applying rotations T_θ to z, and finally rendering the image g(T_θ(z)). In other words, the model operates entirely in feature space. For their purposes, this lets them manipulate the output rendering quickly and without access to the object in image space. As we show in Section 3, this can be further extended to manipulations that are very difficult to perform in image space but become easy in feature space.
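This training scheme can be sketched in a few lines of numpy. The identity encoder/decoder and point-cloud "scene" below are toy stand-ins of our own, not the paper's architecture; the point is the shape of the reconstruction loss ||x_2 − g(T_θ(f(x_1)))||²:

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy stand-ins: the "scene representation" is a set of 2D feature points,
# and f / g are identity maps between image and feature space.
def f(x):
    return x.copy()   # inverse renderer

def g(z):
    return z.copy()   # renderer

def reconstruction_loss(x1, x2, theta):
    """|| x2 - g(T_theta(f(x1))) ||^2 with T_theta a rotation in feature space."""
    z2_hat = f(x1) @ rotation(theta).T   # transform in feature space
    return np.mean((x2 - g(z2_hat)) ** 2)

x1 = np.array([[1.0, 0.0], [0.0, 2.0]])
theta = np.pi / 2
x2 = x1 @ rotation(theta).T              # ground-truth rotated view
```

With exact equivariance (as here), the loss with the true θ is zero; with a wrong θ it is strictly positive, which is what drives the learned networks toward the equivariant solution.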

3 Method

Our motivation is to use the smoothly changing nature of video to learn the transformations between frames. Like Dupont et al., we assume that the change between frames x_i and x_j can be modeled with affine transformations. While they use only rotation, we assume an arbitrary affine transformation on the character plus an invariant transformation on the background, in-painting as needed. Accordingly, we remove the requirement of a known θ at both training and test time by learning the transformation from data with a network T. A question arises: if we learn the transformation from data, why assume that it is an affine operation? The first reason is that it biases T towards a commonly used model for motion estimation. Second, it limits us to six interpretable axes of variation in our animation, which lets us compare directly against a ground-truth affine transformation when one is given. Finally, this approach also limits the operator's ability to 'hide' information about the target image in the representation, exemplified by works like CycleGAN [CycleGAN2017] as demonstrated by Chu et al. [DBLP:journals/corr/abs-1712-02950].


Building on Section 2, we define f_b and f_c, respectively representing the encoders of the background and the character. With scalar coefficients λ_t, λ_i, and λ_j, and denoting the predicted affine matrix by θ̂ = T(f_c(x_i), f_c(x_j)) and the transformed rendering by

  x̂ = g(f_b(x_i) + θ̂ · f_c(x_i)),

we learn neural networks f_c, f_b, T, and g by minimizing L below. During training, this requires only a dataset consisting of random pairs of frames (x_i, x_j) from each clip, where x_i and x_j have roughly the same background.

  L = λ_t ||x_j − x̂||² + λ_i ||x_i − g(f_b(x_i) + f_c(x_i))||² + λ_j ||x_j − g(f_b(x_j) + f_c(x_j))||².
The two autoencoding losses help with training, but they are not by themselves indicative of successful training. We attain quality results only when the transformation term is sufficiently minimized. Other possible constraints, such as a transformation-inverse loss on T, were not necessary.
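The structure of this objective can be sketched as follows. The plain linear maps standing in for the four networks, the least-squares solve standing in for T, and the exact weighting of the auxiliary terms are all illustrative assumptions of ours, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the four learned networks: plain linear maps.
W_c = np.eye(8) + 0.1 * rng.standard_normal((8, 8))   # character encoder f_c
W_b = np.eye(8) + 0.1 * rng.standard_normal((8, 8))   # background encoder f_b
f_c = lambda x: x @ W_c
f_b = lambda x: x @ W_b
g = lambda z: z                                       # renderer (identity here)

def T(z_i, z_j):
    # Stand-in for the transformation network: a least-squares fit of the
    # map taking f_c(x_i) to f_c(x_j).
    theta, *_ = np.linalg.lstsq(z_i, z_j, rcond=None)
    return theta

def loss(x_i, x_j, lam=(1.0, 0.5, 0.5)):
    lam_t, lam_i, lam_j = lam
    theta_hat = T(f_c(x_i), f_c(x_j))
    x_hat = g(f_b(x_i) + f_c(x_i) @ theta_hat)            # transformed rendering
    L_t = np.mean((x_j - x_hat) ** 2)                     # transformation term
    L_i = np.mean((x_i - g(f_b(x_i) + f_c(x_i))) ** 2)    # autoencode x_i
    L_j = np.mean((x_j - g(f_b(x_j) + f_c(x_j))) ** 2)    # autoencode x_j
    return lam_t * L_t + lam_i * L_i + lam_j * L_j

x_i = rng.standard_normal((4, 8))               # "frames" from one clip
x_j = x_i + 0.1 * rng.standard_normal((4, 8))   # nearby frames
L = loss(x_i, x_j)
```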

Both T and g learn to handle every character similarly. This is important because it means that at inference time we can render novel scenes given a pair of nearby frames from a video. The renderings described in Section 1 and demonstrated in Section 4 are now:

  • g(f_b(y_k) + f_c(x_i)): Render the character as seen in x_i but with the background from y_k.

  • g(f_b(y_k) + θ̂ · f_c(y_k)), with θ̂ = T(f_c(x_i), f_c(x_j)): Render the character and background in y_k using the transformation exhibited by the change in the character from x_i to x_j.

  • g(f_b(y_m) + θ̂ · f_c(y_k)): Combine the above two to render the character in y_k on the background from y_m using the transformation exhibited by the change in the character from x_i to x_j.

Learning the transformation.

The smoothly changing nature of video lets us advance past affine transformations parametrically defined with an angle of rotation and instead learn transformations from frame changes. One advantage of this approach is that the model becomes agnostic to which transformations the data exhibits. We show this to be explicitly true for Moving MNIST and implicitly true for Sprites, where the transformations are not affine. Another advantage arises at inference time because the model is agnostic to whether the frame to be transformed is itself an input to the transformation function.

Distinguishing characters and background

The renderer is equivariant to character transformations. A static background, however, will be constant across a video. We take advantage of this to learn T along with encoders f_c and f_b corresponding to the character and the background such that, at inference time, we can mix and match characters and backgrounds never seen together before.

4 Experiments

The first dataset is a variant of Moving MNIST [DBLP:journals/corr/SrivastavaMS15] in which we overlay each sequence of moving digits on a randomly drawn background. Experiments on this dataset let us test our model on a wide range of explicit translations and rotations of the digits.

The second dataset was introduced by Reed et al. [NIPS2015_5845] and consists of moving 2D video game character sprites without any backgrounds, using graphic assets from the Liberated Pixel Cup [bart_2012]. We perform experiments both with and without backgrounds. These experiments test our model on more complex mappings, like firing a bow and arrow, rather than just affine transformations in image space.

This test-bed is suitable for demonstrating that our method exhibits three desirable capabilities: 1) Separate foreground from background; 2) Learn an interpretable but complex transformation of the foreground in the feature space; 3) Render transformed foreground objects. Furthermore, it allows us to demonstrate generalization to unseen characters (digits and sprites) and unseen backgrounds.

We also show results on a third dataset, Fashion Modeling, from Zablotskaia et al. [DBLP:journals/corr/abs-1910-09139], consisting of videos of models exhibiting clothing without any backgrounds. This dataset tests our model on realistic poses and fine detail.

4.1 Architecture

We parameterize f_c, f_b, T, and g with neural networks. While separate, f_c and f_b share the same residual-network architecture. The renderer g is a transposition of that architecture, although we drop the residual components. Please see Figures 22-24 in the Appendix for details.

The transformation network T exhibits three properties. First, it is input-order dependent. Second, it uses PyTorch's [paszke2017automatic] affine_grid and grid_sample functions to transform the scene, similarly to how Spatial Transformers [DBLP:journals/corr/JaderbergSZK15] operate. Third, it is initialized as the identity, reflecting our prior that the transformation should initially leave the foreground unchanged.
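A minimal sketch of this mechanism, using the real PyTorch affine_grid and grid_sample APIs (the function and variable names are ours):

```python
import torch
import torch.nn.functional as F

def apply_affine(z, theta):
    """Warp a feature map z (N, C, H, W) by affine matrices theta (N, 2, 3),
    the same mechanism used by Spatial Transformers."""
    grid = F.affine_grid(theta, z.shape, align_corners=False)
    return F.grid_sample(z, grid, align_corners=False)

# Identity initialization: the transformation starts as a no-op on the scene.
identity = torch.tensor([[[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]]])

z = torch.rand(1, 4, 16, 16)
z_out = apply_affine(z, identity)
print(torch.allclose(z, z_out, atol=1e-4))  # True: identity leaves z unchanged
```

Initializing the predicted matrix at the identity is what encodes the prior that, before training, T should not alter the foreground.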

4.2 Background Generation

We create randomly generated backgrounds for each of train, val, and test. For each background, we select a color from the Matplotlib CSS4 colors list. We then place five diamonds on the background, each with a different random color, along with an independent and randomly chosen center and radius. The radius is uniformly chosen from between seven and ten, inclusive.
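A sketch of this generation procedure, assuming a 64x64 canvas and a small hard-coded palette in place of the full Matplotlib CSS4 color list (both assumptions of ours):

```python
import numpy as np

PALETTE = np.array([  # small stand-in for the Matplotlib CSS4 color list
    [70, 130, 180], [255, 160, 122], [154, 205, 50],
    [219, 112, 147], [100, 149, 237], [244, 164, 96],
], dtype=np.uint8)

def make_background(size=64, n_diamonds=5, rng=None):
    rng = rng or np.random.default_rng()
    bg_color = PALETTE[rng.integers(len(PALETTE))]
    img = np.tile(bg_color, (size, size, 1))
    yy, xx = np.mgrid[:size, :size]
    for _ in range(n_diamonds):
        cy, cx = rng.integers(0, size, 2)        # independent random center
        r = rng.integers(7, 11)                  # radius in [7, 10] inclusive
        mask = np.abs(yy - cy) + np.abs(xx - cx) <= r   # L1 ball = diamond
        img[mask] = PALETTE[rng.integers(len(PALETTE))]
    return img

bg = make_background(rng=np.random.default_rng(0))
print(bg.shape)  # (64, 64, 3)
```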

For the Moving MNIST experiments, we use a fixed number of backgrounds for each split because the model overfit when trained with too few. For the Sprite experiments, we used a single set size for test but a range of set sizes for training.

4.3 Moving MNIST

We generate fixed-length videos of MNIST digits (characters) moving on a static background. At each training step, we select digits from the train split of MNIST, as well as a background from the set of pre-generated training backgrounds. We then randomly place these digits. For each digit, and for each frame, we choose randomly between rotation and translation. If translation, then we translate the character independently in each of the x and y directions by a random amount within a fixed range. If rotation, then we rotate the character by a random angle within a fixed range. If the character would leave the image boundaries, we redo the transformation selection. Otherwise, that transformation is applied cumulatively to yield the next character position.

This results in digit sequences on blank canvases. We overlay them on the chosen background to produce a sequence where the change in each character between frames is small and affine for the character and constant for the background. The overlay is performed by locating the (black) MNIST digit pixels and blackening those locations on the canvases. Accordingly, we do not use black as a background color. We then randomly choose two frame indices and use the corresponding pair of frames as the training pair. See Figure 2 for example sequences.
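The trajectory-sampling loop described above can be sketched as follows; the canvas and digit sizes, per-step transformation ranges, and retry cap are illustrative placeholders of ours, since the paper's exact values are not reproduced in this text, and for simplicity the bounds check covers only position:

```python
import numpy as np

def sample_trajectory(n_steps, canvas=64, digit=28, max_t=3.0, max_r=10.0,
                      rng=None, max_tries=100):
    """Cumulative per-frame transforms for one digit: each step is either a
    translation (up to max_t px per axis) or a rotation (up to max_r degrees);
    steps that would push the digit off the canvas are re-drawn."""
    rng = rng or np.random.default_rng()
    pos = rng.integers(0, canvas - digit, 2).astype(float)
    angle, frames = 0.0, []
    for _ in range(n_steps):
        for _ in range(max_tries):
            if rng.random() < 0.5:                       # translate
                new_pos = pos + rng.uniform(-max_t, max_t, 2)
                new_angle = angle
            else:                                        # rotate
                new_pos = pos
                new_angle = angle + rng.uniform(-max_r, max_r)
            if np.all(new_pos >= 0) and np.all(new_pos <= canvas - digit):
                break                                    # digit stays in bounds
        pos, angle = new_pos, new_angle                  # apply cumulatively
        frames.append((pos.copy(), angle))
    return frames

traj = sample_trajectory(10, rng=np.random.default_rng(0))
print(len(traj))  # 10
```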

Qualitative Results

Our model learns to render new scenes using characters from the test set of MNIST and held out backgrounds. All shown sequences are on unseen backgrounds with unseen MNIST digits where there are at least two transformations of each of rotation and translation.

Figure 2 convincingly demonstrates the manipulations from Section 3. The final row is rendered by encoding the character from the first row, the background from the second row, and using the transformations exhibited in the third row.

Figure 3: Per-pixel MSE over 10,000 test examples. The transform and background manipulations use our learned functions; Video frames is MSE of a frame against a random (non-identical) frame from the same video; No object is MSE of the background versus the full frame of background and character.

Quantitative Results

We tested reconstruction by evaluating the per-pixel MSE over the Moving MNIST test set. For each example, we randomly chose two (background, digit) pairs and made a corresponding video for each. We then indexed into the same random positions in both sequences to get a frame pair from each video. Figure 3 shows a boxplot comparing two manipulations, Transformation and Background, along with two baselines, Video frames and No object.

For Transformation, we compute an MSE by comparing the manipulated output against the ground truth attained by rendering the background and character from the first video, transformed as seen in the second. For Background, we compute an MSE by comparing the manipulated output against the rendering of the character from the first video on the background from the second, with the original transforms. Video frames is the MSE of two random frames from the same video. No object is the MSE of a full frame against only the background from that frame.

Given that MSE is a measure of reconstruction quality with lower values being better, we expect the baselines to serve as upper bounds. Video frames is the upper bound when reconstruction recovers the character but places it incorrectly. No object is the upper bound when reconstruction fails to include the character at all. On this measure, we see that the background manipulation is much better than the baselines, but we cannot say with certainty that the transform manipulation is better, as it is within the confidence interval of Video frames and its boxplot overlaps with both baselines.

(a) Spellcast Animation
(b) Thrust Animation
(c) Shoot Animation
Figure 4: Transferring animations. In each panel, the first and second rows are ground truth; the third row is given by applying the transformation exhibited by the first row to the character in the second. Observe in panel (b) that a spear is not hallucinated; instead, the thrust action is applied to the character without a spear. And in panel (c), even though there is no bow in the input to the transformation, the model understands to move the bow and pull the string of the character to which the transformation is applied.
Figure 5: The rows are, in order, background manipulation, animation transfer, and both combined.

4.4 Video Game Sprites

The dataset comprises color images of sprites with seven attributes (sex, body type, arm, hair, armor, greaves, and weapon) for a total of 672 unique characters. We use the same dataset splits as Reed et al. with 500 training characters, 72 validation characters, and 100 test characters. For each character, there are four viewpoints for each of five animations: spellcast, thrust, walk, slash, and shoot. Our sequences are the twenty animations per character, which have between six and thirteen frames. We only show results on test characters and a held out set of test backgrounds.

For results without backgrounds, we select a random sequence and then two random frames from that sequence as input to our model. For results with backgrounds, we additionally choose a random background and then use the provided masks to center the sprite data on top of the background. See Figures 4, 5 for examples of sequences without and with backgrounds.

Qualitative Results

Figure 4 shows strong results on three different types of animations (spellcast, thrust, shoot) without backgrounds. In each panel, the first and second rows are ground truth, as is the first frame of the third row. The rest of the third row is animated according to the transformations exhibited by the first row. In all three panels, the major concern is the hand color. The model occasionally blends the green hands of the original character with those of the character being animated. In the first panel, this appears as gray, while in the second and third panels it has a green tint. On the other hand, the model successfully performs the thrusting action without hallucinating a spear in the second panel. Similarly, the model operates the bow and arrow in the third panel, even though the first row contains only the shooting motion and not the actual bow.

Figure 5 shows results when using backgrounds. We see an example of each of the type of manipulations from Section 3. The first row is changing the background, the second row is animation transfer, and the third row is performing both background change and animation transfer. The model is strong but clearly not perfect at this task. In addition to the hand discoloring (2nd row), we also see that the backgrounds are a bit blurry near the character in all three rows.

Quantitative Results

For results without backgrounds, we tested reconstruction by evaluating the per-pixel MSE over the test set. Figure 6 compares our approach against prior work on analogy reconstruction. We are competitive in every category and the most consistent in our results, with Figure 7 showing that our model is capable across all categories. This is in contrast to the model with the best average performance from Xue et al. [DBLP:journals/corr/XueWBF16], 'VD', which performs poorly on the slash category. We further show our model's flexibility by taking a version trained with diamond backgrounds and asking it to perform the same analogy task but with a fixed gray canvas as the input to the background encoder. We use gray because the model never saw black in its training process. Figure 6 shows that this model, 'Ours-BG', attains comparable results. Examples from 'Ours-BG' and a representative MSE boxplot are in the Appendix (Figures 16 and 17).

Figure 6:

Reconstructing analogous sprites without backgrounds in pixel MSE, with max cutoff at 60.0. Our method is competitive in every category and has much lower variance than the other models, even though it was designed for datasets with affine transformations and backgrounds, both of which are missing in this context. We highlight our model’s flexibility by using a model trained with diamond backgrounds (‘Ours-BG’), asking it to remove the background, and still yielding comparable results.

Figure 7: A boxplot of sprite analogy reconstruction MSE without backgrounds. This expands upon the reported averages in Figure 6.

For sprite results with backgrounds, we similarly evaluate per-pixel MSE in Figure 9. We include only the aggregate boxplots and put the full breakdown in the Appendix (Figure 21). The first boxplot, 'Analogy', represents analogy reconstruction. It shows a clear (expected) increase in MSE over reconstruction without backgrounds. Is this due to the background or the character?

To answer that question, we utilize the provided character mask. The boxplot 'Background-only' is the MSE that results from masking out where the character should be in both the prediction and the ground truth and then computing MSE over just the unmasked part, presumed to be the background. The boxplot 'Character-only' is similar, but masks out everything except where the character should be and computes MSE over the presumed character. The 'Background-only' MSE remains consistent and is also much lower than 'Analogy'. In contrast, the 'Character-only' aggregate MSE is substantially higher, with its maximum on slash and its minimum on spellcast. This analysis suggests that the loss in MSE is dominated by the character reconstruction. This is encouraging for improving our results given the litany of recent techniques [siarohin2020order, siarohin2020motionsupervised, DBLP:journals/corr/abs-1910-09139, Tulyakov_2018_CVPR] focusing on character animation in similar settings that can be coupled with our method.
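The masked evaluation can be sketched as a single helper (the names are ours):

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Per-pixel MSE restricted to mask (True = included).
    'Background-only' passes the negated character mask;
    'Character-only' passes the character mask itself."""
    diff = (pred.astype(float) - target.astype(float)) ** 2
    return diff[mask].mean()

# Tiny worked example: prediction and target differ by 1 everywhere.
pred = np.zeros((8, 8))
target = np.ones((8, 8))
char = np.zeros((8, 8), dtype=bool)
char[2:5, 2:5] = True                     # where the character should be
print(masked_mse(pred, target, char))     # 1.0 over the character region
print(masked_mse(pred, target, ~char))    # 1.0 over the background region
```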

Figure 8: Analogy reconstruction with backgrounds. The more backgrounds used during training, the better the model does at test-set reconstruction. This effect tapers near 1000.
Figure 9: Sprite analogy reconstruction with backgrounds over all animation types. Compared to Figure 7, the ‘Analogy’ boxplot shows an (expected) increase in MSE. The other two boxplots highlight that this is dominantly due to character reconstruction rather than background reconstruction.
Figure 10: In both panels, the first four columns are ground truth. The fifth is the model's output when inputting just the background encoding to the decoder g. The sixth is the output when inputting just the transformed character encoding to g. The last column includes both as input to g. Notice how the model separates the character and the background more cleanly in Moving MNIST. This is because the characters move throughout the scene, so the model has distinguishing information about all areas of the frame. Figure 15 in the Appendix shows that when the sprite moves during training in only a single cardinal direction, the model learns a similarly clean separation.

4.5 Fashion Modeling

Method                                         Perceptual (↓)  FID (↓)  AKD (↓)
MonkeyNet [DBLP:journals/corr/abs-1812-08861]      0.3726       19.74     2.47
CBTI [DBLP:journals/corr/abs-1811-11459]           0.6434       66.50     4.20
DwNet [DBLP:journals/corr/abs-1910-09139]          0.2811       13.09     1.36
Ours                                               0.4222       70.87    13.06
Table 1: Comparison with the state of the art on Fashion; lower is better on all three metrics.

Table 1 compares our results on Fashion Modeling against the current state of the art. While our model is capable of learning the transformation, it is not yet up to par with respect to realism. We consider this to be a question of model architecture and training and are actively working to improve this result. One direction is to adopt the 3D architecture from Dupont et al., which we expect to improve our results greatly given the common output errors. See Figures 25 and 26 in the Appendix for example renderings.

4.6 Analyzing θ̂

There must be structure in the predicted transformations θ̂ given their success in animating characters. We analyze this with Moving MNIST because we can compare against the ground-truth character transformation θ. For Sprites, that comparison is more difficult because the ground truth is categorical. We gather unique frame pairs of a fixed digit instance on a fixed background and show results for digits 8 and 5.

We hypothesize that there is structure in the Moving MNIST θ̂ with respect to both the translation and rotation parameters independently. Figure 12 shows a kernel density estimation plot of the Euclidean norm of the translation parameters for digit 5 (see Figure 14 in the Appendix for digit 8), attained by isolating the third column of the affine matrices. While these are the only parameters corresponding to translation in the ground-truth θ, we caution that there may be more parameters affecting translation in θ̂ due to how affine matrices compose. Observe, though, how strongly coupled and linear the relationship between these norms is, with almost all of the joint density lying in a small range.

For rotations on Moving MNIST, it is trivial to parse the angle from θ as the arctan of the first column of the corresponding affine matrix. This is not true of θ̂ because it is an unconstrained prediction with entries violating the conditions of a rotation matrix. This observation is supported by Figure 11, which shows the density plot of the top-left entry of the matrix of transformations. It is apparent that θ̂, contrary to θ, cannot correspond to only translation and rotation in how it manipulates the feature space, because these entries have too high a magnitude. Consequently, we use SVD [10.1007/BF02163027] to decompose θ̂ and yield the best approximating rotation matrix [SVD_rotation_proof_paper]. We then use the angle from that rotation matrix as the prediction and compare it to the angle attained from θ. Their exceptionally linear relationship can be observed for both digits in Figure 13.
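The SVD step is the standard nearest-rotation (orthogonal Procrustes) construction, with the angle then read off the first column as described above. A sketch, with the scale and noise levels illustrative rather than taken from the learned model:

```python
import numpy as np

def nearest_rotation(A):
    """Closest rotation to A in the Frobenius norm, via SVD:
    A = U S V^T  ->  R = U diag(1, ..., 1, det(U V^T)) V^T."""
    U, _, Vt = np.linalg.svd(A)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0] * (A.shape[0] - 1) + [d])   # fix a possible reflection
    return U @ D @ Vt

def rotation_angle(R):
    """Angle of a 2D rotation matrix, read from its first column."""
    return np.arctan2(R[1, 0], R[0, 0])

# A scaled, lightly perturbed rotation, mimicking the unconstrained 2x2
# block of a predicted transformation.
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle),  np.cos(angle)]])
A = 1.7 * R_true + 0.01 * np.random.default_rng(0).standard_normal((2, 2))
R = nearest_rotation(A)
print(rotation_angle(R))  # close to 0.3, despite the scaling and noise
```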

The relationship seen in these figures is further supported by a linear regression from θ̂ to θ. We flatten both to six-dimensional representations and use sklearn's LinearRegression [scikit-learn]. As expected, the R² score for each digit is exceptionally high.
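The paper uses sklearn's LinearRegression; an equivalent numpy sketch of the fit and a pooled R² score, run on synthetic data in place of the real (θ̂, θ) pairs, looks like this:

```python
import numpy as np

def r2_linear_fit(X, Y):
    """Fit Y ~ X W + b by least squares and return a pooled R^2 score,
    closely analogous to LinearRegression().fit(X, Y).score(X, Y)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    residual = Y - X1 @ W
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
# Synthetic stand-ins: flattened 2x3 matrices related by a linear map + noise.
theta_hat = rng.standard_normal((500, 6))
theta = theta_hat @ rng.standard_normal((6, 6)) + 0.05 * rng.standard_normal((500, 6))
print(r2_linear_fit(theta_hat, theta))  # near 1.0: a near-linear relationship
```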

Altogether, this analysis suggests that the learned θ̂ has a strong linear relationship with the ground truth θ, even though the model had no prior knowledge of θ. Furthermore, it also suggests that even though θ̂ and θ have very different statistics (exemplified by the two plots in Figure 11), θ̂ is not actually that different from θ in how it manipulates the image space, which is what we desire.

Figure 11: Transformation densities over 40,000 (MNIST) and 105,000 (Sprites) random, unique frame pairs. The left plot shows the largest singular value of the affine transformation, where θ is not shown as its upper-left square matrix is always unitary. The right plot shows the top-left entry of the matrix; the isolated region of positive density is a plotting artifact.
Figure 12: Kernel density estimation plot of the Euclidean norm of the translation parameters for Moving MNIST, with the norm from θ̂ on the y-axis and the norm from θ on the x-axis.
Figure 13: Moving MNIST scatterplot comparing the rotation angle of θ̂ against the rotation angle of θ. The latter is ground truth, while the former is attained from the rotation matrix we yield by decomposing θ̂ with SVD.

5 Related Work

Both the problem of background and foreground separation, as well as that of manipulable synthesis, are longstanding concerns in unsupervised learning 

[leroux2011learning, 786972, articlegreedy, Rubinstein13Unsupervised, DBLP:journals/corr/VillegasYHLL17, NIPS2017_2d2ca7ee]. As far as we know, we are the first to perform both without strong supervision such that we can independently manipulate each of the animation, the background, and the character.

Besides Dupont et al. [dupont2020equivariant], the most closely related papers are Worrall et al. [DBLP:journals/corr/abs-1710-07307], Olszewski et al. [DBLP:journals/corr/abs-1904-06458], and Reed et al. [NIPS2015_5845]. The first two also use equivariance to learn representations capable of manipulating scenes. However, they do not delineate characters and backgrounds, nor do they learn the transformation from data. Dupont et al. assume that θ is given during training; in Worrall et al., the transformation is a block-diagonal composition of (given) domain-specific transformations; Olszewski et al. use a user-provided transformation. Because we learn the transformation from data, we can use datasets without ground-truth transformations, like Sprites [bart_2012]. Reed et al. similarly apply transformations to frames to yield an analogous frame. Where we use an affine transform, they use addition. They also require three frames as input at inference and four frames during training. Most problematic is that they require the full analogy during training. This means they cannot learn the transformation from random frame pairs as we do, but instead require a carefully built training set.

The papers from Siarohin et al. [siarohin2020order, siarohin2020motionsupervised] also use just pairs of images at training time and can then animate an unseen character at inference. They do not handle backgrounds independently and consequently are not able to mix and match foreground and background without artifacts. MonkeyNet [DBLP:journals/corr/abs-1812-08861] and Zablotskaia et al. [DBLP:journals/corr/abs-1910-09139] are also in this lineage and use the inductive bias provided by DensePose (and a warping module for the latter). They similarly focus on just modeling the character and place the background as is.

There are approaches that decompose the latent space with GANs [DBLP:journals/corr/MathieuZSRL16, Tulyakov_2018_CVPR] or VAEs [visualdynamics16, Kosiorek2018sqair]. Most share the same limitation of being unable to synthesize the background and foreground independently, with no mechanism for delineating the two, even though there may be mechanisms for parsing the foreground, as in Kosiorek et al. [Kosiorek2018sqair]. Vondrick et al. [DBLP:journals/corr/VondrickPT16] does have such a mechanism; however, it serves only to separate background from character pixels through a masking function. Because our mechanism acts on the learned character function, we are able to model the animation as well.

6 Conclusion

We have presented a self-supervised framework for learning an equivariant renderer capable of delineating the background, the characters, and their animation such that it can manipulate and synthesize each independently. Our framework requires only video sequences. We tested it on two datasets chosen to highlight the model's capability in handling both truly affine movements (Moving MNIST) as well as movements from video (Sprites). We then showed that it can produce convincing and consistent scene manipulations, both of background and animation transfer. Our analysis revealed a strong association between the learned θ̂ and the ground-truth θ, which was never seen during training, implying the effectiveness of the proposed self-supervised learning criterion. We also observed aspects of θ̂ not fully explained by θ.

The assumption that the transformation is affine does not hold in general in pixel space; a clear counterexample is video with nonlinear lighting effects. However, there is no reason why such an assumption could not hold on the learned latent representation. Consequently, we do not see any barriers to our approach working on real-world examples such as the motivating stop-motion animation. While our results are not yet strong enough on complex real-world applications (see Section 4.5), we see this work as a promising step in that direction and leave that advance for future work.
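A small numerical sketch (ours, not from the paper) illustrates why the affine assumption belongs in a learned latent space rather than in pixel space: gamma correction, a simple nonlinear lighting change, is poorly fit by any affine map on raw intensities, yet becomes exactly affine under a hypothetical log-encoding of the pixels.

```python
import numpy as np

# Nonlinear lighting (gamma correction) on pixel intensities in (0, 1].
x = np.linspace(0.1, 1.0, 50)
y = x ** 2.2

# The best affine fit a*x + b in pixel space leaves a clear residual ...
a, b = np.polyfit(x, y, 1)
pixel_residual = float(np.max(np.abs(a * x + b - y)))

# ... but under a hypothetical log-encoder z = log(x), the same effect
# is exactly affine: log(x ** 2.2) = 2.2 * log(x) + 0.
z = np.log(x)
latent_residual = float(np.max(np.abs(2.2 * z - np.log(y))))

print(pixel_residual > 1e-2, latent_residual < 1e-9)  # True True
```

The log-encoder here is chosen by hand for illustration; the point of the framework is that an encoder with this linearizing property can be learned from data.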



Figure 14: Matching KDE plot for digit 8. See Figure 12 in the main text for digit 5.
Figure 15: Panel referenced by Figure 10 showing five results from a model trained with sprites undergoing x-translation during training in addition to the normal changes in the character animation. Observe that the model learns a clean separation of background and character in this case.
Figure 16: MSE plot for reconstructing analogies on blank canvases using a model trained with backgrounds. See Figures 3–6 for matching analysis and Figure 17 for examples. Note that we did not use black backgrounds for the reconstruction but gray ones instead, since the model was never trained with the color black and generalizing to a completely new color is out of scope.
Figure 17: Example sequences of removing the background. The first two columns are the animation, the third column is the character, and the fourth column is the character on a bare background. These inputs yield the output seen in the sixth column which can be compared to the ground truth expected output in the fifth column.
Figure 18: Reconstructions: The first, third, and fifth rows are original sequences. The second, fourth, and sixth rows are reconstructions of the prior row where the transformation is fixed to the ground-truth transformation. Note that in these scenarios, the results are not as strong as when the transformation is a learned function.
Figure 19: Backgrounds: The first, second, and fourth rows are originals. The third and fifth rows are the prior row but with the background changed to that of the first sequence.
Figure 20: Transformations: The first, second, and fourth rows are originals. The third and fifth rows are the prior row but the transformation is a function of the first row.
(a) Analogy MSE
(b) Background-only MSE
(c) Character-only MSE
Figure 21: Analogy reconstruction with backgrounds. As expected, the model’s MSE increases when incorporating backgrounds. Panels (b) and (c) show that this is dominantly due to reconstructing the character and not the background.
Figure 22: Architecture of the Encoder for both characters and backgrounds.
Figure 23: Architecture of the Decoder.
Figure 24: Architecture of the transformation network.
Figure 25: Fashion reconstructions. The first column serves as the first animation frame and the character; the second serves as the second animation frame and the target. In the first row, we can see that the model is unsure what to do with the legs. In the second and third rows, the model has successfully repositioned the character, but the output remains blurry.
Figure 26: Fashion analogies. The first and second columns represent the animation to be applied to the character in the third column. The first and fourth rows show successful animations, even with the subtle motion in the latter. The second row faces the wrong direction. The third row should have her arm still up and her leg moved to the side.