Learning manipulable representations of dynamic scenes is a challenging task. Ideally, we would give our model a video and receive an inverse rendering of coherent moving objects (‘characters’) along with a set of static backgrounds through which those characters move.
A motivating example is creating a stop-motion video in which each character is painstakingly moved across static backgrounds to form the frames. Our goal is to have a model train on any number of videos, each of which has an independent set of characters and background, and then be able to synthesize new scenes by mixing and matching (possibly unseen) characters and backgrounds, thereby expediting the process of creating stop-motion videos. This long-standing problem with no good solution (see Section 5) can be thought of as unsupervised separation and synthesis of background and foreground, and a strong solution would address other applications as well, such as video compositing. Current unsupervised methods performing synthesis are limited in that they either cannot handle backgrounds [NIPS2015_5845, visualdynamics16, Kosiorek2018sqair], cannot synthesize the foreground and background independently [DBLP:journals/corr/VillegasYHLL17, Tulyakov_2018_CVPR, siarohin2020order], or cannot handle animation [DBLP:journals/corr/VondrickPT16].
To be more specific, assume an input video of a dynamic scene captured by a static camera. The background consists of the objects that are constant across the frames. In contrast, the foreground consists of the objects that are equivariant with respect to some family of transformations. We use this difference to infer a scene representation that separately captures the background and the characters. Building on Dupont et al. [dupont2020equivariant], we attain two novel results:
Learn the transformation: We do not require the underlying transformation applied to foreground objects during training, but instead learn it from nearby video frames. At inference, this allows us to animate characters according to any transformation, not just transformations inferred from frames that include the character being transformed.
Distinguish characters and background: By independently encoding the background and the manipulable character, we obtain disentangled representations for both and can mix and match them freely.
Our model is trained in a self-supervised fashion, requiring only nearby frames (without any annotation) from the $i$-th video sequence. We impose no other constraint and can render new scenes on the fly, combining characters and backgrounds from additional (potentially unrelated) frames of sequences $j$ and $k$. In Sections 4.3 and 4.4 respectively, we show strong results on both Moving MNIST [DBLP:journals/corr/SrivastavaMS15] with static backgrounds and 2D video game sprites from the Liberated Pixel Cup [bart_2012], where we demonstrate the following manipulations:
Render the character from sequence $j$ but with the background from sequence $k$.
Render the character from sequence $j$ using the transformation seen in the change between frames of sequence $k$.
Combine both manipulations to render the character from sequence $j$ on the background from sequence $k$ using the transformation exhibited by the change seen between frames of sequence $l$.
In Section 4.5, we report results on Fashion Modeling, a more realistic dataset for which our results are not as mature.
Following Dupont et al. [dupont2020equivariant], we define an image $x$ and a scene representation $z$, a rendering function $g$ mapping scene representations to images, and an inverse renderer $g^{-1}$ mapping images to scene representations. It is helpful to consider $g(z)$ as a 2D rendering of a character from a specific camera viewpoint. An equivariant scene representation satisfies the following relation for a renderer that is equivariant with respect to a transformation $T$, acting as $T_z$ in feature space and $T_x$ in image space:
$$g(T_z\, z) = T_x\, g(z). \qquad (1)$$
In other words, transforming a rendering in image space is equivalent to first transforming the scene in feature space and then rendering the result. Dupont et al. posit functions $g$ and $g^{-1}$ satisfying the following, where $x$ is a 2D image rendering, $z = g^{-1}(x)$ is the scene representation of $x$, and $T$ is an affine transformation:
$$g\big(T \cdot g^{-1}(x)\big) = T \cdot x. \qquad (2)$$
Equation 2 is a difficult equality to satisfy exactly, so they propose learning neural networks to approximate it from data triples $(x_1, x_2, \theta)$, where $x_1$ and $x_2$ are renderings of the same character and $\theta$ is a rotation transforming the character from its appearance in $x_1$ to its appearance in $x_2$. They assume the feature-space transformation is the same operation as $\theta$ but applied in feature space, and consequently apply a rotation of $\theta$ to the 3D representation $z_1 = g^{-1}(x_1)$. With $z_1$ and $\theta$, they then train $g$ and $g^{-1}$ to minimize the reconstruction loss:
$$\mathcal{L} = \big\lVert g\big(R_\theta\, g^{-1}(x_1)\big) - x_2 \big\rVert, \qquad (3)$$
where $R_\theta$ denotes the rotation by $\theta$ in feature space.
After training, new renderings can be inferred by first inverse rendering $z = g^{-1}(x)$, then applying rotations $R_\theta$ to $z$, and finally rendering images $g(R_\theta z)$. In other words, the model operates entirely in feature space. For their purposes, this lets them manipulate the output rendering quickly and without access to the object in image space. As we show in Section 3, this can be extended further to manipulations that are very difficult to perform in image space but become easy in feature space.
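For concreteness, the training step implied by Equation 3 can be sketched as below. The function names (`renderer`, `inverse_renderer`, `rotate_z`) are placeholders rather than the authors' actual API, and the loss is shown as a simple mean-squared error.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the Section 2 training step (names are placeholders,
# not the original implementation). `inverse_renderer` maps an image to a 3D
# feature volume z, `renderer` maps z back to an image, and `rotate_z` applies
# the given image-space rotation theta to the feature volume.

def training_step(renderer, inverse_renderer, rotate_z, x1, x2, theta, optimizer):
    z1 = inverse_renderer(x1)          # z1 = g^{-1}(x1)
    z1_rot = rotate_z(z1, theta)       # R_theta z1: the same rotation, in feature space
    x2_hat = renderer(z1_rot)          # g(R_theta z1)
    loss = F.mse_loss(x2_hat, x2)      # reconstruction loss against the target frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```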
Our motivation is to use the smoothly changing nature of video to learn the transformations between frames. Like Dupont et al., we assume that the change between frames $x_t$ and $x_{t+k}$ can be modeled with affine transformations. While they use only rotation, we assume an arbitrary affine transformation on the character plus an invariant transformation on the background, in-painting as needed. Accordingly, we remove the requirement of a given transformation at both training and test time by learning the transformation $\hat T$ from data. A question arises: if we learn $\hat T$ from data, then why assume that it is an affine operation? The first reason is that it biases $\hat T$ towards a commonly used model for motion estimation. Second, it limits us to six interpretable axes of variation in our animation, which lets us compare directly against a ground-truth affine transformation when one is given. Finally, this approach also limits the operator's ability to 'hide' information about the target image in the representation, a failure mode exemplified by works like CycleGAN [CycleGAN2017] as demonstrated by Chu et al. [DBLP:journals/corr/abs-1712-02950].
Building on Section 2, we define $z_b$ and $z_c$, respectively representing the encoding of the background and of the character. With scalar coefficients $\lambda_1$, $\lambda_2$, $\lambda_3$ weighting the loss terms, and denoting by $\hat T$ the feature-space transformation predicted from a pair of frames, we learn neural networks $f_b$, $f_c$, $f_T$, and $g$ (the background encoder, character encoder, transformation predictor, and renderer, respectively) by minimizing the loss $\mathcal{L}$ below. During training, this requires only a dataset consisting of random pairs of frames $(x_i, x_j)$ from each clip, where $x_i$ and $x_j$ have roughly the same background.
Two of the loss terms help with training, but they are not by themselves indicative of successful training; we attain quality results only when the remaining reconstruction term is sufficiently minimized. Other possible constraints, such as a transformation inverse loss on $\hat T$, were not necessary.
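A minimal sketch of one self-supervised training step is given below, under the notation assumed above. The module names ($f_b$, $f_c$, $f_T$, $g$) are illustrative, only the reconstruction term of the full loss is shown, and `apply_affine` stands for the feature-space warp described in the implementation details (sketched later).

```python
import torch
import torch.nn.functional as F

# Minimal, assumed sketch of one training step on a frame pair (x_i, x_j) from
# the same clip; f_b, f_c, f_T, g, and apply_affine are illustrative callables.

def train_step(f_b, f_c, f_T, g, apply_affine, x_i, x_j, optimizer):
    z_b = f_b(x_i)                    # background encoding (roughly shared across the clip)
    z_c = f_c(x_i)                    # character encoding of the earlier frame
    T_hat = f_T(x_i, x_j)             # predicted affine transformation between the frames
    x_j_hat = g(z_b, apply_affine(z_c, T_hat))   # transformed character rendered on the background
    loss = F.mse_loss(x_j_hat, x_j)   # reconstruction term only; the full loss has more terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```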
Both $f_c$ and $f_T$ learn to handle every character similarly. This is important because it means that at inference time we can render novel scenes given only a pair of nearby frames from a video. The renderings described in Section 1 and demonstrated in Section 4 are now:
$g\big(f_b(x^k), f_c(x^j)\big)$: Render the character as seen in $x^j$ but with the background from $x^k$.
$g\big(f_b(x^j), \hat T f_c(x^j)\big)$ with $\hat T = f_T(x^k_t, x^k_{t+1})$: Render the character and background in $x^j$ using the transformation exhibited by the change in the character from $x^k_t$ to $x^k_{t+1}$.
$g\big(f_b(x^k), \hat T f_c(x^j)\big)$ with $\hat T = f_T(x^l_t, x^l_{t+1})$: Combine the above two to render the character in $x^j$ on the background from $x^k$ using the transformation exhibited by the change in the character from $x^l_t$ to $x^l_{t+1}$.
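Continuing the illustrative API from the training sketch above, these three manipulations can be written as below; all module names and inputs are placeholders, not the exact implementation.

```python
import torch

@torch.no_grad()
def manipulate(f_b, f_c, f_T, g, apply_affine, x_j, x_k, x_l_t, x_l_t1):
    """The three inference-time manipulations, using the assumed API above."""
    z_b_k = f_b(x_k)                       # background from sequence k
    z_c_j = f_c(x_j)                       # character from sequence j
    T_hat = f_T(x_l_t, x_l_t1)             # transformation from a change in sequence l

    swap_bg  = g(z_b_k, z_c_j)                          # j's character on k's background
    transfer = g(f_b(x_j), apply_affine(z_c_j, T_hat))  # animate j's character with l's motion
    combined = g(z_b_k, apply_affine(z_c_j, T_hat))     # both manipulations together
    return swap_bg, transfer, combined
```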
Learning the transformation.
The smoothly changing nature of video lets us move past affine transformations parametrically defined by an angle of rotation and instead learn transformations from frame changes. One advantage of this approach is that the model becomes agnostic to which transformations the data exhibits. We show this to be explicitly true for Moving MNIST and implicitly true for Sprites, where the transformations are not affine. Another advantage arises at inference time: the model is agnostic to whether the frame to be transformed is one of the frames input to the transformation function.
Distinguishing characters and background
The renderer is equivariant to character transformations; a static background, however, is constant across a video. We take advantage of this difference to learn $\hat T$ along with encoders $f_c$ and $f_b$ corresponding to the character and the background, such that, at inference time, we can mix and match characters and backgrounds never previously seen together.
The first dataset is a variant of Moving MNIST [DBLP:journals/corr/SrivastavaMS15] in which we overlay each sequence of moving digits on a randomly drawn background. Experiments on this dataset let us test our model on a wide range of explicit translations and rotations of the digits.
The second dataset was debuted by Reed et al. [NIPS2015_5845] and consists of moving 2D video game character sprites without any backgrounds, using graphic assets from the Liberated Pixel Cup [bart_2012]. We perform experiments both with and without backgrounds. These experiments test our model on more complex mappings, like firing a bow and arrow, rather than just affine transformations in image space.
This test-bed is suitable for demonstrating that our method exhibits three desirable capabilities: 1) separating foreground from background; 2) learning an interpretable but complex transformation of the foreground in feature space; 3) rendering transformed foreground objects. Furthermore, it allows us to demonstrate generalization to unseen characters (digits and sprites) and unseen backgrounds.
We also show results on a third dataset, Fashion Modeling, from Zablotskaia et al. [DBLP:journals/corr/abs-1910-09139], consisting of videos of models exhibiting clothing without any backgrounds. This dataset tests our model on realistic poses and fine detail.
We parameterize $f_b$, $f_c$, $f_T$, and $g$ with neural networks. While separate, $f_b$ and $f_c$ share the same residual-network architecture. The renderer $g$ is a transposition of that architecture, although we drop the residual components. Please see Figures 22, 23, 24 in the Appendix for details.
The transformation network $f_T$ exhibits three properties. First, it is dependent on the order of its two input frames. Second, the predicted transformation is applied to the scene using PyTorch's [paszke2017automatic] affine_grid and grid_sample functions, similarly to how Spatial Transformers [DBLP:journals/corr/JaderbergSZK15] operate. Third, it is initialized to the identity, reflecting our prior that, by default, the transformation does not alter the foreground.
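The sketch below illustrates how a predicted 2x3 affine matrix can be applied to a feature map with affine_grid and grid_sample, and how an identity initialization can be achieved; the `TransformPredictor` head and the way the frame pair is reduced to a feature vector are assumptions for illustration, not the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_affine(z, theta):
    """Warp a feature map z (N, C, H, W) by 2x3 affine matrices theta (N, 2, 3),
    in the spirit of Spatial Transformers (a sketch, not the exact implementation)."""
    grid = F.affine_grid(theta, z.size(), align_corners=False)
    return F.grid_sample(z, grid, align_corners=False)

class TransformPredictor(nn.Module):
    """Illustrative f_T head: predicts an affine matrix from pooled features of
    an ordered frame pair (the feature extraction itself is omitted here)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.head = nn.Linear(feat_dim, 6)
        # Identity initialization: zero weights plus an identity-affine bias,
        # so the predicted transformation initially leaves the foreground unchanged.
        nn.init.zeros_(self.head.weight)
        self.head.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, pair_feats):          # pair_feats: (N, feat_dim)
        return self.head(pair_feats).view(-1, 2, 3)
```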
4.2 Background Generation
We create randomly generated backgrounds for each of the train, val, and test splits. For each background, we select a base color from the Matplotlib CSS4 colors list. We then place five diamonds on the background, each with a different random color and an independently, randomly chosen center and radius. The radius is chosen uniformly between seven and ten, inclusive.
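A minimal sketch of how such backgrounds could be generated is given below; the 64x64 canvas size is an assumption for illustration, and a diamond is drawn as an L1 ball around its center.

```python
import numpy as np
from matplotlib.colors import CSS4_COLORS, to_rgb

def make_background(size=64, rng=None):
    """Sketch of the described background generator: a base CSS4 color plus five
    randomly colored diamonds with random centers and radii in [7, 10].
    The canvas size is an illustrative assumption."""
    rng = rng or np.random.default_rng()
    names = list(CSS4_COLORS)
    canvas = np.ones((size, size, 3)) * to_rgb(CSS4_COLORS[rng.choice(names)])
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(5):
        color = to_rgb(CSS4_COLORS[rng.choice(names)])
        cy, cx = rng.integers(0, size, size=2)
        r = rng.integers(7, 11)                          # radius chosen uniformly in [7, 10]
        mask = (np.abs(yy - cy) + np.abs(xx - cx)) <= r  # diamond = L1 ball around the center
        canvas[mask] = color
    return canvas
```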
For the Moving MNIST experiments, for each split because the model overfit with . For the Sprite experiments, we used for test but a range of for training.
4.3 Moving MNIST
We generate fixed-length videos of MNIST digits (characters) moving on a static background; the digits and the background each have fixed dimensions. At each training step, we select digits from the train split of MNIST, as well as a background from the set of pre-generated training backgrounds, and place the digits at random positions. For each digit, and for each frame transition, we choose randomly between rotation and translation. If translation, we translate the character independently in each of the $x$ and $y$ directions by a random amount within a fixed range. If rotation, we rotate the character by a random angle within a fixed range. If the character would leave the image boundaries, we redo the transformation selection; otherwise, that transformation is applied cumulatively to yield the next character position.
This results in digit trajectories on blank canvases. We overlay them on the chosen background to produce a sequence in which the change in each character from frame to frame is small and affine, while the background stays constant. The overlay is performed by locating the digit pixels and blackening those locations on the canvas; accordingly, we do not use black as a background color. We then randomly choose two indices and use the corresponding pair of frames as the training pair. See Figure 2 for example sequences.
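The cumulative-transformation sampling can be sketched as below. The step ranges are illustrative assumptions (the exact values are not reproduced here), and the in-bounds check and the actual digit rendering are supplied by the caller.

```python
import numpy as np

def random_step(rng, max_shift=3, max_rot_deg=15):
    """One transformation step: either a translation or a rotation, as a 3x3
    homogeneous matrix. The ranges here are illustrative, not the paper's values."""
    if rng.random() < 0.5:
        dx, dy = rng.uniform(-max_shift, max_shift, 2)
        return np.array([[1, 0, dx],
                         [0, 1, dy],
                         [0, 0, 1.0]])
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0],
                     [s,  c, 0],
                     [0,  0, 1.0]])

def trajectory(length, rng, in_bounds):
    """Compose per-step affines cumulatively, re-sampling any step whose result
    would move the character out of the image (in_bounds is user-supplied)."""
    poses = [np.eye(3)]
    while len(poses) < length:
        candidate = random_step(rng) @ poses[-1]
        if in_bounds(candidate):
            poses.append(candidate)
    return poses
```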
Our model learns to render new scenes using characters from the MNIST test set and held-out backgrounds. All shown sequences use unseen backgrounds and unseen MNIST digits and contain at least two transformations each of rotation and translation.
We tested reconstruction by evaluating the per-pixel MSE over the Moving MNIST test set. For each example, we randomly chose two (background, digit) pairs and generated the corresponding videos. We then indexed into the same random position in both sequences to get a pair of frames from each. Figure 3 shows a boxplot comparing two manipulations, Transformation and Background, along with two baselines, Video frames and No object.
The predicted value for Transformation is the rendering of the character and background from the first video's frame, transformed according to the change observed between the second video's frames. We compute an MSE for this manipulation by comparing it against the ground truth obtained by rendering the background and character from the first video transformed in exactly that way. The predicted value for Background is the rendering of the character from the first video on the background from the second. We compute an MSE for this manipulation by comparing it against the ground truth obtained by rendering that character on that background with the original transforms. Video frames is the MSE between two random frames from the same video. No object is the MSE between a full frame and only the background from that frame.
Since MSE measures reconstruction error, with lower values being better, we expect the baselines to serve as upper bounds. Video frames is the upper bound for a reconstruction that recovers the character but places it incorrectly; No object is the upper bound for a reconstruction that fails to include the character at all. On this measure, the background manipulation is much better than the baselines, but we cannot say with certainty that the transform manipulation is better, as it is within the confidence interval of Video frames and its boxplot overlaps with both baselines.
4.4 Video Game Sprites
The dataset comprises color images of sprites with seven attributes (sex, body type, arm, hair, armor, greaves, and weapon) for a total of 672 unique characters. We use the same dataset splits as Reed et al. with 500 training characters, 72 validation characters, and 100 test characters. For each character, there are four viewpoints for each of five animations: spellcast, thrust, walk, slash, and shoot. Our sequences are the twenty animations per character, which have between six and thirteen frames. We only show results on test characters and a held out set of test backgrounds.
For results without backgrounds, we select a random sequence and then two random frames from that sequence as input to our model. For results with backgrounds, we additionally choose a random background and then use the provided masks to center the sprite data on top of the background. See Figures 4, 5 for examples of sequences without and with backgrounds.
Figure 4 shows strong results on three different types of animations (spellcast, thrust, shoot) without backgrounds. In each panel, the first and second rows are ground truth, as is the first frame of the third row. The rest of the third row is animated according to the transformations exhibited by the first row. In all three panels, the main concern is the hand color: the model occasionally blends the green hands of the original character with those of the character being animated. In the first panel this appears as gray, while in the second and third panels it has a green tint. On the other hand, the model successfully performs the thrusting action without hallucinating a spear in the second panel. Similarly, the model operates the bow and arrow in the third panel, even though the first row contains only the shooting motion and not the actual bow.
Figure 5 shows results when using backgrounds. We see an example of each type of manipulation from Section 3: the first row changes the background, the second row is animation transfer, and the third row performs both background change and animation transfer. The model is strong but clearly not perfect at this task: in addition to the hand discoloration (second row), the backgrounds are a bit blurry near the character in all three rows.
For results without backgrounds, we tested reconstruction by evaluating the per-pixel MSE over the test set. Figure 6 compares our approach against prior work on analogy reconstruction. We are competitive in every category and the most consistent across categories, with Figure 7 showing that our model is capable across all of them. This is in contrast to the model with the best average performance from Xue et al. [DBLP:journals/corr/XueWBF16], 'VD', which performs poorly on the slash category. We further show our model's flexibility by taking a version trained with diamond backgrounds and asking it to perform the same analogy task but with a fixed gray canvas as the input to the background encoder. We use gray because the model never saw black during training. Figure 6 shows that this model, 'Ours-BG', attains comparable results. Examples from 'Ours-BG' and a representative MSE boxplot are in the Appendix (Figures 16, 17).
For sprite results with backgrounds, we similarly report per-pixel MSE in Figure 9. We include only the aggregate boxplots and put the full breakdown in the Appendix (Figure 21). The first boxplot, 'Analogy', represents analogy reconstruction. It shows a clear (expected) increase in MSE over reconstruction without backgrounds. Is this increase due to the background or the character?
To answer that question, we utilize the provided character mask. The 'Background-only' boxplot is the MSE obtained by masking out where the character should be in both the prediction and the ground truth and then computing MSE over just the unmasked part, presumed to be the background. The 'Character-only' boxplot is analogous, masking out everything except where the character should be and computing MSE over the presumed character. The 'Background-only' MSE remains consistent across categories and is much lower than 'Analogy'. In contrast, the 'Character-only' MSE is highest for slash and lowest for spellcast. This analysis suggests that the loss in MSE is dominated by the character reconstruction. That is encouraging for improving our results, given the litany of recent techniques [siarohin2020order, siarohin2020motionsupervised, DBLP:journals/corr/abs-1910-09139, Tulyakov_2018_CVPR] focusing on character animation in similar settings that could be coupled with our method.
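A minimal sketch of the masked MSE computation follows; the tensor shapes are assumptions, and the character mask is the one provided with the dataset.

```python
import torch

def masked_mse(pred, target, mask):
    """Per-pixel MSE restricted to the region where mask == 1.
    pred, target: (N, C, H, W); mask: (N, 1, H, W) binary character mask."""
    sq_err = (pred - target) ** 2 * mask
    return sq_err.sum() / (mask.sum() * pred.shape[1] + 1e-8)

# 'Character-only' uses the character mask directly; 'Background-only' uses its complement:
#   char_mse = masked_mse(pred, target, mask)
#   bg_mse   = masked_mse(pred, target, 1 - mask)
```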
4.5 Fashion Modeling
Table 1: Results on Fashion Modeling against the state of the art, reported as Perceptual loss, FID, and AKD (lower is better for all three).
Table 1 compares our results on Fashion Modeling against the current state of the art. While our model is capable of learning the transformation, it is not yet up to par with respect to realism. We consider this a question of model architecture and training and are actively working to improve the result. One direction is to adopt the 3D architecture from Dupont et al., which we expect to improve our results considerably given the kinds of errors common in our outputs. See Figures 25, 26 in the Appendix for example renderings.
There must be structure in $\hat T$, given its success at animating characters. We analyze this with Moving MNIST because we can compare against the ground-truth character transformation $T$. For Sprites, that comparison is more difficult because the ground truth is categorical. We gather unique frame pairs of a fixed digit instance on a fixed background and show results for digits 8 and 5.
We assume that there is structure in $\hat T$ on MNIST with respect to both the translation and rotation parameters independently. Figure 12 shows a kernel density estimation plot of the Euclidean norm of the translation parameters for one digit (see Figure 14 in the Appendix for the other), attained by isolating the third column of the affine matrices. While these are the only parameters corresponding to translation in the gathered $T$, we caution that more parameters may affect translation in $\hat T$ due to how affine matrices compose. Observe, though, how strongly coupled and linear the relationship between these norms is, with almost all of the joint density lying in a narrow range.
For rotations on MNIST, it is trivial to parse the angle from $T$ as the arctangent of the first column of the corresponding affine matrix. This is not true of $\hat T$, because it is an unconstrained prediction whose entries violate the conditions of a rotation matrix. This observation is supported by Figure 11, which shows the density plot of the top-left entry of the predicted transformation matrices. It is apparent that $\hat T$, contrary to $T$, cannot correspond to only translation and rotation in how it manipulates the feature space, because these entries have too large a magnitude. Consequently, we use the SVD [10.1007/BF02163027] to decompose $\hat T$ and yield the best approximating rotation matrix [SVD_rotation_proof_paper]. We then use the angle from that rotation matrix as the prediction and compare it to the angle attained from $T$. Their exceptionally linear relationship can be observed for both digits in Figure 13.
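The SVD-based projection onto the nearest rotation matrix (orthogonal Procrustes) can be sketched as below; this is an illustration of the analysis step, not the exact analysis code.

```python
import numpy as np

def best_rotation_angle(T_hat):
    """Project the 2x2 linear part of a predicted affine matrix onto the nearest
    rotation matrix via SVD, then return its angle (a sketch of the analysis step)."""
    A = np.asarray(T_hat)[:2, :2]
    U, _, Vt = np.linalg.svd(A)
    R = U @ Vt
    if np.linalg.det(R) < 0:          # ensure a proper rotation (det = +1), not a reflection
        U[:, -1] *= -1
        R = U @ Vt
    return np.arctan2(R[1, 0], R[0, 0])
```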
The relationship seen in these figures is further supported by a linear regression from $\hat T$ to $T$. We flatten both to six-dimensional vectors and use sklearn's LinearRegression [scikit-learn]. As expected, the $R^2$ score is exceptionally high for both digit 5 and digit 8.
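The regression can be reproduced roughly as follows; the input arrays of collected predicted and ground-truth matrices are assumed to come from the gathered test pairs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_r2(T_hat_matrices, T_matrices):
    """R^2 of a linear regression from flattened predicted transforms to
    flattened ground-truth transforms (a sketch of the reported analysis)."""
    X = np.reshape(T_hat_matrices, (len(T_hat_matrices), -1))   # (num_pairs, 6)
    y = np.reshape(T_matrices, (len(T_matrices), -1))           # (num_pairs, 6)
    return LinearRegression().fit(X, y).score(X, y)
```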
Altogether, this analysis suggests that the learned $\hat T$ has a strong linear relationship with the ground truth $T$, even though the model had no prior knowledge of $T$. It also suggests that, even though $\hat T$ and $T$ have very different statistics (exemplified by the two plots in Figure 11), $\hat T$ is not actually that different from $T$ in how it manipulates the image space, which is what we desire.
5 Related Work
Both the problem of background and foreground separation and that of manipulable synthesis are long-standing concerns in unsupervised learning [leroux2011learning, 786972, articlegreedy, Rubinstein13Unsupervised, DBLP:journals/corr/VillegasYHLL17, NIPS2017_2d2ca7ee]. As far as we know, we are the first to perform both without strong supervision such that we can independently manipulate each of the animation, the background, and the character.
Besides Dupont et al. [dupont2020equivariant], the most closely related papers are Worrall et al. [DBLP:journals/corr/abs-1710-07307], Olszewski et al. [DBLP:journals/corr/abs-1904-06458], and Reed et al. [NIPS2015_5845]. The first two also use equivariance to learn representations capable of manipulating scenes. However, they do not delineate characters and backgrounds, nor do they learn the transformation from data: Dupont et al. assume the transformation is given during training; in Worrall et al., it is a block-diagonal composition of (given) domain-specific transformations; and Olszewski et al. use a user-provided transformation. Because we learn the transformation from data, we can use datasets without ground-truth transformations, like Sprites [bart_2012]. Reed et al. similarly apply transformations to frames to yield an analogous frame; where we use an affine transform, they use addition. They also require three frames as input at inference and four frames during training. Most problematic is that they require the full analogy during training, which means they cannot learn the transformation from random frame pairs as we do but instead require a carefully constructed training set.
The papers from Siarohin et al. [siarohin2020order, siarohin2020motionsupervised] also use just pairs of images at training time and can then animate an unseen character at inference. They do not handle backgrounds independently and consequently are not able to mix and match foreground and background without artifacts. MonkeyNet [DBLP:journals/corr/abs-1812-08861] and Zablotskaia et al. [DBLP:journals/corr/abs-1910-09139] are also in this lineage and use the inductive bias provided by DensePose (and a warping module for the latter). They similarly focus on just modeling the character and place the background as is.
There are approaches that decompose the latent space with GANs [DBLP:journals/corr/MathieuZSRL16, Tulyakov_2018_CVPR] or VAEs [visualdynamics16, Kosiorek2018sqair]. Most share the same limitation of being unable to synthesize the background and foreground independently, with no mechanism for delineating the two, even though there may be mechanisms for parsing the foreground, as in Kosiorek et al. [Kosiorek2018sqair]. Vondrick et al. [DBLP:journals/corr/VondrickPT16] do have such a mechanism; however, it serves only to separate background from character pixels through a masking function. Because our mechanism acts on the learned character function, we are able to model the animation as well.
We have presented a self-supervised framework for learning an equivariant renderer capable of delineating the background, the characters, and their animation such that it can manipulate and synthesize each independently. Our framework requires only video sequences. We tested it on two datasets chosen to highlight the model's capability in handling both truly affine movements (Moving MNIST) and movements found in video (Sprites), and showed that it produces convincing and consistent scene manipulations, both background swaps and animation transfer. Our analysis revealed a strong association between the learned $\hat T$ and the ground-truth $T$, which was never seen during training, implying the effectiveness of the proposed self-supervised learning criterion. We also observed aspects of $\hat T$ not fully explained by $T$.
An assumption that the image-space change is affine does not hold in general; a clear counterexample is video with nonlinear lighting effects. However, there is no reason why such an assumption on the feature-space transformation $\hat T$ could not hold. Consequently, we do not see any barriers to our approach working on real-world examples such as the motivating stop-motion animation. While our results are not yet strong enough for complex real-world applications (see Section 4.5), we see this work as a promising step in that direction and leave that advance for future work.