Motion Inbetweening via Deep Δ-Interpolator

by Boris N. Oreshkin, et al.

We show that the task of synthesizing missing middle frames, commonly known as motion inbetweening in the animation industry, can be solved more accurately and effectively if a deep learning interpolator operates in the delta mode, using the spherical linear interpolator as a baseline. We demonstrate our empirical findings on the publicly available LaFAN1 dataset. We further generalize this result by showing that the Δ-regime is viable with respect to the reference of the last known frame (also known as the zero-velocity model). This supports the more general conclusion that deep inbetweening in the reference frame local to input frames is more accurate and robust than inbetweening in the global (world) reference frame advocated in previous work. Our code is publicly available at




1 Introduction


Figure 1: Demonstration of the proposed Δ-interpolator approach (left, green), the linear interpolator baseline (middle, yellow), and the ground truth motion (right, white). The animation is best viewed in Adobe Reader.

Computer generated imagery (CGI) has enabled the generation of high quality special effects in the movie industry, with the movie making process becoming ever more technologically intensive as time progresses (ji2010production). Animating 3D characters is integral to the process of creating high quality CGI in films and in a variety of other contexts. Unsurprisingly, 3D computer animation is at the core of the modern film, television and video gaming industries (beane20123d). Traditional animation workflows rely on 3D sketches of the most important key-frames in the produced character motion sequence, drawn by experienced artists. For example, one viable animation workflow relies on a senior artist drawing a few 3D key-frames pinpointing the most important aspects of the authored 3D sequence and a few less senior animators working out the details. Another viable computer aided animation workflow relies on massive motion capture (MOCAP) databases, enabling an animator to query suitable motion sequences from them based on a variety of inputs, such as a few key-frames, a root-motion curve, meta-data tags, etc. The discovered motion fragments can be further blended to create new motion sequences with desired characteristics. The time and effort involved in exploring and traversing such diverse databases to find and blend suitable motion segments can be a significant bottleneck.

These tedious processes can be automated to a certain extent by filling the missing in-between frames using an array of techniques including linear interpolation, motion blending trees and character-specific MOCAP libraries. However, a more promising avenue towards data-driven motion generation has been opened more recently by advances in deep learning algorithms (harvey2020robust; duan2021singleshot). Deep learning models trained on massive high quality MOCAP data, when conditioned on a few key-frames, are capable of producing relatively long, high quality animation sequences, saving time for animators. This data-driven approach leverages existing high quality MOCAP footage and is able to re-purpose it towards creating new animation sequences, providing a relatively simple yet powerful user interface. In this new type of AI-assisted workflow, the deep learning model plays a few critical functions. First, the model compactly compresses a MOCAP dataset, encoding it in its weights by developing an internal latent space representation of motion. Second, it provides a familiar yet flexible and powerful user interface by accepting key-frame conditioning from the animator. Finally, the deep neural network queries its latent space with the information provided by the user and outputs a highly believable, sophisticated motion sequence that matches natural motion statistics and blends smoothly into the user-provided key-frames. It is clear by now that deep learning will continue to expand its reach in the field of 3D computer animation.

The extent and velocity of this process depend significantly on progress in improving the accuracy, robustness and flexibility of deep motion models, which is the focus of our current paper. We build on previous work (duan2021singleshot) and propose a configuration of the deep motion inbetweening interpolator that has better accuracy and is more robust to out-of-distribution changes due to the local nature of the deep neural network inputs and outputs. Additionally, we contribute to democratizing animation authoring by integrating our model inside the Unity Editor as a publicly available package and making our model code publicly available. A sample animation demonstrating our inbetweening approach in action is presented in Figure 1.

1.1 Summary of Contributions

We propose the Δ-mode for deep motion inbetweening and empirically show its effectiveness on the LaFAN1 dataset.

The rest of the paper is organized as follows. The next section reviews existing work and the section after formally defines the inbetweening problem. Section 4 describes our proposed Δ-interpolator approach. Sections 5 and 6 present our empirical results and ablation studies and discuss our findings. Finally, Section 7 concludes the paper.

2 Related Work

Conditional motion synthesis, and more specifically motion completion, is the process of filling the missing frames in a temporal sequence given a set of key-frames. Early work focused on generating missing in-between frames by integrating key-frame information into space-time models (witkin1988spacetime; ngo1993spacetime). The next generation of inbetweening works developed a more sophisticated probabilistic approach to constrained motion synthesis, formulating it as a maximum a posteriori (MAP) problem (chai2007constraint) or a Gaussian process (grochow2004style; wang2007gaussian). Over the past decade, deep learning has become the modus operandi, significantly outperforming probabilistic approaches. Due to the temporal nature of the task, RNNs have dominated the field compared to other neural network architectures (holden2016framework; harvey2018recurrent). The latest RNN work by Harvey et al. (harvey2020robust) focuses on augmenting a Long Short-Term Memory (LSTM) based architecture with time-to-arrival embeddings and a scheduled target noise vector, allowing the system to be robust to target distortions. Geleijn et al. (gelejein2021lightweight) propose a lightweight and simplified recurrent architecture that maintains good performance while significantly reducing computational complexity. Beyond RNN models, motion has been modelled by 1D temporal convolutions (holden2015learning). More recent work has relied on Transformer based models (devlin2018bert; duan2021singleshot) predicting the entire motion in one pass, unlike auto-regressive recurrent models. Most of the recent works in this thread rely on the LaFAN1 benchmark (harvey2018recurrent), in which a neural network is provided with 10 past poses and one future pose, the task being to fill the frames in between. This is the benchmark we use in our paper.

Motion inbetweening is an instance of conditional motion generation in which the conditioning signal has the form of key-frames. Therefore, methods using other types of conditioning to generate motion are related to ours. These range from audio-conditioned human motion generation, such as creating dances for a musical piece (kao2020temporally; li2021learn), to semantically conditioned motion (chuan2020action2motion; petrovich2021actionconditioned).

In the same vein, our problem can be related to the motion style transfer problem, in which conditioning is done based on frames of a sequence with the desired target style. holden2016framework were the first to incorporate a style loss in the motion synthesis problem, combining two distinct motion styles in a single motion. This was later extended by holden2017style and du2019stylistic. Furthermore, aberman2020unpaired transferred 3D motion styles from videos with no paired MOCAP, relying on previous work retargeting video to animation (aberman2019retargeting).

3 Problem Statement and Background

For a skeleton with $J$ joints, we define a tensor $\mathbf{x}_t$ fully specifying the pose at time $t$. It contains the concatenation of the global positions $\mathbf{p}_t \in \mathbb{R}^{J \times 3}$ as well as rotations $\mathbf{r}_t \in \mathbb{R}^{J \times 6}$. By default, the rotation of the root joint is defined in the global coordinate system, whereas the rest of the joints specify rotations relative to their parent joints, unless explicitly stated otherwise.

For positions, we use a standard 3D Euclidean representation. For joint rotations, we use the robust ortho6D representation proposed by zhou2019onthecontinuity. This addresses the continuity issues reminiscent of quaternion and Euler representations. Let $\|\mathbf{u}\|$ denote the vector norm and let the vector cross product be $\mathbf{u} \times \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \sin(\psi)\, \mathbf{n}$ ($\psi$ is the angle between $\mathbf{u}$ and $\mathbf{v}$ in the plane containing them and $\mathbf{n}$ is the normal to the plane). Given the ortho6D rotation $\mathbf{r}_{t,j} = [\mathbf{a}_1, \mathbf{a}_2]$, the rotation matrix of joint $j$ is computed via Gram-Schmidt orthogonalization:

$\mathbf{b}_1 = \frac{\mathbf{a}_1}{\|\mathbf{a}_1\|}, \quad \mathbf{b}_2 = \frac{\mathbf{a}_2 - (\mathbf{b}_1 \cdot \mathbf{a}_2)\,\mathbf{b}_1}{\|\mathbf{a}_2 - (\mathbf{b}_1 \cdot \mathbf{a}_2)\,\mathbf{b}_1\|}, \quad \mathbf{b}_3 = \mathbf{b}_1 \times \mathbf{b}_2, \quad \mathbf{R} = [\mathbf{b}_1\; \mathbf{b}_2\; \mathbf{b}_3]^\top.$
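To make the Gram-Schmidt construction concrete, here is a minimal NumPy sketch of the ortho6D-to-matrix mapping (an illustration rather than the paper's implementation, which delegates such conversions to pytorch3d; the rows-as-basis-vectors convention follows common implementations of zhou2019onthecontinuity):

```python
import numpy as np

def ortho6d_to_matrix(x6):
    """Map an ortho6D rotation representation to a 3x3 rotation matrix
    via Gram-Schmidt orthogonalization (Zhou et al., 2019)."""
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)            # first basis vector
    a2p = a2 - np.dot(b1, a2) * b1          # strip the component along b1
    b2 = a2p / np.linalg.norm(a2p)          # second basis vector
    b3 = np.cross(b1, b2)                   # completes a right-handed frame
    return np.stack([b1, b2, b3])           # basis vectors as matrix rows
```

Any non-degenerate 6D vector maps to a valid rotation matrix, which is what makes the representation continuous and regression-friendly.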

The model observes a subset of known samples (key-frames) $\mathbf{x}_{\mathcal{K}}$ in the time window of total length $T$. The model solves the inbetweening task on the subset of missing samples $\mathbf{x}_{\mathcal{M}}$ by predicting global positions and rotations of the root joint and local rotations of the other joints. This is sufficient to fully specify the pose by computing global positions and rotations of all joints after the Forward Kinematics (FK) pass. The forward kinematics pass operates on a skeleton defined by the offset vectors specifying the displacement of each joint with respect to its parent joint when joint rotation is zero (the norm of the offset vector is the bone length). Provided with the local offset vectors $\mathbf{o}_j$ and rotation matrices $\mathbf{R}_j$ of all joints, the global rigid transform $\mathbf{G}_j$ of any joint $j$ is computed following the tree recursion from its parent joint $\mathrm{par}(j)$:

$\mathbf{G}_j = \mathbf{G}_{\mathrm{par}(j)} \begin{bmatrix} \mathbf{R}_j & \mathbf{o}_j \\ \mathbf{0} & 1 \end{bmatrix}.$

The global transform matrix $\mathbf{G}_j$ of joint $j$ contains its global rotation matrix $\mathbf{R}^{g}_j$ and its 3D global position $\mathbf{p}^{g}_j$.
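The tree recursion above can be sketched as a short FK routine (a simplified NumPy sketch under the assumption that joints are topologically ordered, so a parent always precedes its children):

```python
import numpy as np

def forward_kinematics(offsets, rotations, parents):
    """Compute 4x4 global joint transforms G_j = G_par(j) @ L_j, where the
    local transform L_j combines joint j's local rotation and its offset
    from the parent. parents[j] == -1 marks the root."""
    n = len(parents)
    G = np.zeros((n, 4, 4))
    for j in range(n):
        L = np.eye(4)
        L[:3, :3] = rotations[j]     # local rotation
        L[:3, 3] = offsets[j]        # offset from parent (bone vector)
        G[j] = L if parents[j] < 0 else G[parents[j]] @ L
    return G
```

The translation column of `G[j]` is joint j's global position and its upper-left 3x3 block is its global rotation.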

In this paper we consider the variation of the inbetweening problem defined in the LaFAN1 benchmark (harvey2018recurrent): a neural network is provided with 10 past poses and one future pose, the task being to fill the frames in between.

4 Method

In this section we present the proposed Δ-interpolator approach and its ingredients. We first introduce the spherical linear interpolator (SLERP) serving as the foundation for the proposed delta principle. We then discuss the details of the proposed neural network architecture, including the composition of its inputs and outputs as well as the mathematical details of the neural architecture whose high-level block diagram is depicted in Fig. 2. We close the section with a detailed analysis of the architectural novelty of the proposed method with respect to the most closely related work by duan2021singleshot and a description of the loss terms.

4.1 The SLERP Interpolator Baseline

Common tool-chains in the animation industry often offer the linear interpolator as a tool for motion inbetweening. The linear interpolator generates inbetween root motion via linear interpolation and inbetween local rotations via spherical linear interpolation (SLERP) in the quaternion space (shoemake1985animating). We denote by $\hat{\mathbf{y}}_{\mathrm{SLERP}}$ the SLERP interpolator output computed based on the key-frames $\mathbf{x}_{\mathcal{K}}$.
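For reference, quaternion SLERP can be sketched as follows (a standard textbook implementation rather than the paper's code; it handles the double-cover sign flip and falls back to normalized linear interpolation for nearly parallel quaternions):

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1
    at fraction t in [0, 1] (Shoemake, 1985)."""
    q0 = np.asarray(q0, dtype=float)
    q1 = np.asarray(q1, dtype=float)
    dot = np.dot(q0, q1)
    if dot < 0.0:                      # q and -q encode the same rotation:
        q1, dot = -q1, -dot            # take the shorter arc
    if dot > 0.9995:                   # nearly parallel: lerp + renormalize
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

The root positions of the in-between frames are obtained by ordinary linear interpolation; SLERP handles the rotational channels.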

Figure 2: Proposed Δ-interpolator architecture. Full-pose information from key-frames is referenced with respect to the last-frame root reference $\mathbf{x}_{\mathrm{ref}}$, concatenated with positional encoding and linearly projected into model width $d$. A self-attention block encodes key-frame representations. A cross-attention block combines key-frame encodings with zero-initialized and positionally encoded templates of missing frames. Six encoder blocks are followed by a decoder MLP that produces Δ-predictions of root position and joint rotations. The SLERP interpolator output is added to the Δ-predictions of the neural network to produce the final prediction.

4.2 The Δ-Interpolator

We propose the Δ-interpolator approach in which the output of the neural network $f_\theta$ with parameters $\theta$ is defined as a delta w.r.t. the prediction of the linear interpolator, and its input is a delta w.r.t. a reference $\mathbf{x}_{\mathrm{ref}}$. The latter can be defined, for example, as the concatenation of the 3D coordinate and 6D rotation of the root joint of the last history frame. The Δ-solution $\hat{\mathbf{y}}$:

$\hat{\mathbf{y}}_\Delta = f_\theta(\mathbf{x} - \mathbf{x}_{\mathrm{ref}}), \qquad (1)$

$\hat{\mathbf{y}} = \hat{\mathbf{y}}_{\mathrm{SLERP}} + \hat{\mathbf{y}}_\Delta, \qquad (2)$

belongs to the space local to the reference frame implicitly defined by the prediction of the linear interpolator, $\hat{\mathbf{y}}_{\mathrm{SLERP}}$. Additionally, the input to $f_\theta$ is localized with respect to the reference frame defined based on the information available in $\mathbf{x}_{\mathrm{ref}}$. The intuition behind the proposed Δ-interpolator is two-fold. First, if the linear interpolator provides a good initial prediction, solving the Δ-interpolation task should be easier, requiring less training data and resulting in better generalization accuracy. Second, since the output is defined in the reference frame relative to the linear interpolator, it provides the opportunity for defining the input relative to a reference without loss of information. (Note that an approach that needs to predict an absolute global position would necessarily incur information loss if its input were made local.) As a result, the Δ-interpolator neural network does not have to deal with any absolute domain shifts, either at the input or at the output. This improves its ability to cope with any global reference frame shifts and out-of-distribution domain shifts arising between training and inference scenarios. As an example, suppose the test data is defined in a reference frame that is shifted with respect to the original training reference frame by an arbitrary constant delta $\mathbf{d}$. In our experience, even a relatively small $\mathbf{d}$, comparable to the span of the skeleton, is enough to significantly degrade performance. In contrast, it is easy to show that the Δ-interpolator defined in equations (1-2) is completely insensitive to it. First, the domain shift in $\mathbf{x}$ will be compensated by that in $\mathbf{x}_{\mathrm{ref}}$, and the input to the neural network will be essentially the same with or without $\mathbf{d}$. Moreover, the domain shift will be added to the output of the Δ-interpolator in equation (2) via $\hat{\mathbf{y}}_{\mathrm{SLERP}}$, resulting in the correct reference shift presented at the output without the neural network even noticing it.
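The shift-invariance argument can be checked numerically. The sketch below (positions only, with an arbitrary stand-in for the trained network; all names are illustrative) verifies that shifting all inputs by a constant offset shifts the Δ-interpolator output by exactly that offset:

```python
import numpy as np

def lerp_baseline(p_start, p_end, n_missing):
    """Linear-interpolator baseline for the root positions of the
    n_missing frames between p_start and p_end."""
    t = np.linspace(0.0, 1.0, n_missing + 2)[1:-1, None]
    return (1.0 - t) * p_start + t * p_end

def delta_interpolate(p_start, p_end, n_missing, f):
    """Delta-mode prediction: f sees inputs local to the reference
    (last known frame) and its output is added to the baseline."""
    x_ref = p_start
    local_inputs = np.stack([p_start - x_ref, p_end - x_ref])
    return lerp_baseline(p_start, p_end, n_missing) + f(local_inputs)

# Stand-in "network": any deterministic function of the *local* inputs.
f = lambda local: 0.1 * local.sum() * np.ones((3, 3))

p0, p1 = np.zeros(3), np.array([1.0, 0.0, 0.0])
d = np.array([100.0, -50.0, 7.0])          # arbitrary global reference shift
y = delta_interpolate(p0, p1, 3, f)
y_shifted = delta_interpolate(p0 + d, p1 + d, 3, f)
assert np.allclose(y_shifted, y + d)       # the shift passes through exactly
```

The baseline absorbs the shift while the network input is unchanged, so the network never observes the global displacement.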

4.3 The Neural Network

In this section we describe the details of the proposed neural architecture, graphically depicted in Fig. 2. We start by defining some standard notions from the Transformer literature. We use the Transformer approach with a multi-head self-attention encoder for the key-frames and a multi-head cross-attention encoder for the missing frames. Following the original Transformer formulation (vaswani2017attention), the attention unit is defined as:

$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V}, \qquad (3)$

where $d$ is the model-wide hidden dimension width. The multi-head attention unit is defined as:

$\mathrm{MHA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = [\mathrm{head}_1; \ldots; \mathrm{head}_H]\, \mathbf{W}^O, \quad \mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}\mathbf{W}^Q_i, \mathbf{K}\mathbf{W}^K_i, \mathbf{V}\mathbf{W}^V_i).$
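A minimal NumPy rendering of the attention unit above (single sequence, no batching or learned projections — just the scaled dot-product core):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V
```

In the self-attention block Q, K and V all come from the key-frame stream; in the cross-attention block Q comes from the missing-frame templates while K and V come from the key-frame encodings.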

Next, we describe the details of the proposed architecture.

4.3.1 Inputs

As mentioned previously, the input of the network is localized w.r.t. $\mathbf{x}_{\mathrm{ref}}$. We define $\mathbf{x}_{\mathrm{ref}}$ by (i) setting all its position components along the time and joint axes to the root position at the last available frame and (ii) setting all its rotation components along the time and joint axes to the root rotation at the last available frame. Furthermore, the network accepts (i) $\mathbf{E}^{(0)}$: the key-frame embedding initialized as the concatenation of $\mathbf{x}_{\mathcal{K}} - \mathbf{x}_{\mathrm{ref}}$ with the learnable input position encoding, linearly projected into model width $d$; as well as (ii) $\mathbf{Z}^{(0)}$: the missing frame template initialized as the learnable output position encoding, linearly projected into model width $d$. Input and output position embeddings are shared and consist of a learnable vector for each possible frame index.

4.3.2 Key-frame encoder

The key-frame encoder is described by the following recursion over $L$ encoder residual blocks taking $\mathbf{E}^{(0)}$ as input:

$\mathbf{E}^{(\ell)} = \mathrm{EncoderBlock}(\mathbf{E}^{(\ell-1)}), \quad \ell = 1, \ldots, L,$

where each block applies multi-head self-attention followed by an MLP, both wrapped in residual connections.

The purpose of the key-frame encoder is to create a deep representation of key-frames that can be further cross-correlated with the templates of missing frames in the missing frame encoder. Using a separate encoder for key-frames is computationally efficient. Typically, the number of key-frames is low, and they encapsulate all the information needed to infer the missing in-between frames. Therefore, processing them in the same self-attention block as the missing frames, as proposed by duan2021singleshot, is unnecessary from the information processing standpoint and cost-ineffective from the computational standpoint. See Section 4.4 for an extended discussion.

4.3.3 Missing frame encoder

The missing frame encoder iterates over $L$ residual blocks, cross-processing the key- and missing frame representations. At each level $\ell$, it accepts the key-frame encodings $\mathbf{E}^{(\ell)}$ created by the key-frame encoder and the missing frame encodings from the previous level, $\mathbf{Z}^{(\ell-1)}$. The first level accepts the missing frame template, $\mathbf{Z}^{(0)}$.

The missing frame encoder is the place where the known key-frames meet the unknown missing in-between frames. Note that the common self-attention block in (duan2021singleshot) accepts both key-frames and missing frames, cross-pollinating their representations. This can be interpreted as an inductive bias allowing missing frames to overwrite the information in key-frames. Our architecture embeds a different inductive bias, emphasizing that information is supposed to flow in a directed fashion from the key-frames, provided by the user as a conditioning signal, towards the missing frames that need to be inferred at the output.
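This directed flow can be illustrated with a simplified, weight-free sketch (single-head attention, no learned projections or MLPs; shapes and level count are illustrative): key-frames refine themselves via self-attention, missing-frame templates query the key-frame encodings via cross-attention, and nothing flows back from the missing frames into the key-frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries, keys_values):
    """Single-head attention: queries attend to keys_values."""
    d = queries.shape[-1]
    return softmax(queries @ keys_values.T / np.sqrt(d)) @ keys_values

K, M, d = 11, 30, 16
E = rng.normal(size=(K, d))   # key-frame embeddings
Z = np.zeros((M, d))          # zero-initialized missing-frame templates
for _ in range(6):            # six encoder levels, as in Fig. 2
    E = E + attend(E, E)      # self-attention over key-frames only
    Z = Z + attend(Z, E)      # missing frames read from key-frames
```

Because `E` is never a function of `Z`, the key-frame representations cannot be corrupted by the (initially uninformative) missing-frame stream.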

                                             L2Q               L2P               NPSS
Length                                       5    15   30      5    15   30      5      15     30
Zero Velocity                                0.56 1.10 1.51    1.52 3.69 6.60    0.0053 0.0522 0.2318
SLERP Interpolation                          0.22 0.62 0.98    0.37 1.25 2.32    0.0023 0.0391 0.2013
HHM-VAE (li2021taskgeneric)                  0.24 0.54 0.94    N/A  N/A  N/A     N/A    N/A    N/A
ERD-QV (rec. loss) (harvey2020robust)        0.21 0.48 0.83    0.32 0.85 1.82    0.0025 0.0304 0.1608
ERD-QV (rec. & adv. loss) (harvey2020robust) 0.17 0.42 0.69    0.23 0.65 1.28    0.0020 0.0258 0.1328
SSMCT (local) (duan2021singleshot)           0.17 0.44 0.71    0.23 0.74 1.37    0.0019 0.0291 0.1430
SSMCT (global) (duan2021singleshot)          0.14 0.36 0.61    0.22 0.56 1.10    0.0016 0.0234 0.1222
Δ-Interpolator (Ours)                        0.11 0.32 0.57    0.13 0.47 1.00    0.0014 0.0217 0.1217
Table 1: Key empirical results on the LaFAN1 dataset. Lower scores are better.
4.3.4 Decoder

The decoder is an MLP that maps the representations of key-frames and missing frames, $\mathbf{E}^{(L)}$ and $\mathbf{Z}^{(L)}$, into the Δ-predictions of full poses of all frames in the sequence:

$\hat{\mathbf{y}}_\Delta = \mathrm{MLP}_{\mathrm{dec}}([\mathbf{E}^{(L)}; \mathbf{Z}^{(L)}]).$

Note that the MHA and MLP blocks are shared between the key-frame and missing frame encoder blocks; similarly, the same decoder is used to output key-frame reconstructions and missing frame predictions.

4.3.5 Outputs

The Δ-predictions supplied by the decoder contain the rotation prediction (global rotation of the root plus local rotations of all other joints in their parent-relative coordinate systems) and the global root joint position prediction. These are further subjected to the skeleton forward kinematics pass to derive the full-pose missing frame predictions and the key-frame reconstructions containing global rotations and global positions of all joints.

4.4 Architectural novelty with respect to (duan2021singleshot).

The architecture in (duan2021singleshot) is based on a single self-attention module; key-frames concatenated with the linear interpolator predictions are used as input to the self-attention, and the key-frames and the missing frames are marked with the key-frame encoding so that the neural network knows which frames to predict. We show a few interesting things in the inbetweening context. First, when we use self-attention for the key-frames and cross-attention for the missing frames, the key-frame encoding is not needed. Second, we show that the MHA and MLP blocks can be shared between the self-attention and cross-attention blocks. Weight sharing is important for reducing the model size, making the package distribution more effective. Third, the architecture of duan2021singleshot is not operational without the interpolator providing input initialization. Therefore, using this architecture in contexts such as motion prediction is not viable. In our case, neither the input reference nor the output reference has to be based on the linear interpolator. We show that our approach can use the last-frame reference both at the input and at the output, while the missing frame input is initialized with zeros. This makes our approach significantly more general (e.g. also suitable for motion prediction applications). Finally, processing key-frames via self-attention and missing frames via cross-attention is more computationally efficient. Indeed, for a problem with $K$ key-frames and $M$ in-between frames, the attention complexity (i.e. the complexity of equation (3)) of (duan2021singleshot) scales as $(K+M)^2$, whereas our approach's attention complexity scales as $K^2 + KM$. The ratio of the two quantities is $(K+M)/K$. For example, with $K=11$ and $M=30$, the attention complexity of our approach is approximately 3.7 times smaller.
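The complexity comparison is easy to verify with a couple of lines (a sketch that counts only the attention score computations of equation (3)):

```python
def attention_cost_ratio(K, M):
    """Ratio of joint self-attention cost over all K + M frames to the
    split scheme (K-frame self-attention plus M-to-K cross-attention)."""
    joint = (K + M) ** 2      # one self-attention over K + M frames
    split = K * K + M * K     # self-attn on K frames + cross-attn M -> K
    return joint / split      # simplifies to (K + M) / K

# LaFAN1-style setting: 11 key-frames (10 past + 1 future), 30 missing.
ratio = attention_cost_ratio(11, 30)   # = 41 / 11, about 3.7
```

The advantage grows with the number of in-between frames relative to the number of key-frames, which is exactly the long-inbetweening regime where cost matters most.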

4.5 Losses

The neural network is supervised using a combination of quaternion and position losses, similar to (duan2021singleshot). The position loss is an L1 loss defined both on missing and key-frames:

$\mathcal{L}_{\mathrm{pos}} = \|\hat{\mathbf{p}}_{\mathcal{M}} - \mathbf{p}_{\mathcal{M}}\|_1 + \|\hat{\mathbf{p}}_{\mathcal{K}} - \mathbf{p}_{\mathcal{K}}\|_1. \qquad (4)$

In the case of missing frames we call it the predictive position loss. In the case of key-frames we call it the reconstruction position loss, as there the network acts as an autoencoder. Here $\hat{\mathbf{p}}$ and $\mathbf{p}$ are the position-only components of the prediction and the ground truth, respectively.


The rotational output is supervised with an L1 loss based on the quaternion representations $\hat{\mathbf{q}}$ and $\mathbf{q}$ derived from the ortho6D predictions of global rotations:

$\mathcal{L}_{\mathrm{rot}} = \|\hat{\mathbf{q}}_{\mathcal{M}} - \mathbf{q}_{\mathcal{M}}\|_1 + \|\hat{\mathbf{q}}_{\mathcal{K}} - \mathbf{q}_{\mathcal{K}}\|_1. \qquad (5)$


The L1 norm here is taken over the last tensor dimension, whereas the two leading dimensions are average pooled.
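The pooling convention described above can be written down directly (a sketch; a (time, joint, coordinate) tensor layout is assumed for illustration):

```python
import numpy as np

def l1_loss(pred, target):
    """L1 norm over the last (coordinate) dimension, average-pooled over
    the two leading (time and joint) dimensions."""
    return np.abs(pred - target).sum(axis=-1).mean()
```

The same routine serves both the position and the quaternion loss terms, applied to missing frames and key-frames respectively.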

5 Experimental Results

In this section we present our empirical results. The section starts with a description of the LaFAN1 dataset, benchmark and metrics used for the quantitative evaluation of our model. We then carefully describe the details of the training and evaluation setups as well as the computational budget used to produce our results. Our key results are presented in Table 1, demonstrating the state-of-the-art performance of our method relative to existing work. Finally, we conclude the section by describing the ablation studies we conducted. The ablation results confirm the significance and flexibility of our Δ-interpolator approach.

5.1 Dataset and Benchmarking

We empirically evaluate our method on LaFAN1 (harvey2020robust), a well-established motion synthesis benchmark. LaFAN1 consists of 496,672 motion frames captured in a production-grade MOCAP studio. The footage is captured at 30 frames per second for a total of over 4.5 hours of content. There are 77 sequences in total with 15 action categories that include a diverse set of activities such as walking, dancing, fighting and various others. Each sequence contains actions performed by one of 5 subjects (MOCAP actors). Sequences with subject 5 are used as the test set. Each clip spans a few minutes, which is quite long compared to other human motion datasets. We build the training and test sets following the approach proposed by harvey2020robust, sampling smaller sliding windows from the original long sequences. Following their approach, we sample our training windows from sequences acted by subjects 1-4, retrieving windows of 50 frames with an offset of 20 frames. We repeat a similar exercise with larger windows and a larger offset for the test set, sampling sequences of subject 5 every 40 frames, yielding 2232 test windows of 65 frames. We also extract the same statistics used for data normalization as outlined in (harvey2020robust), applying the same data normalization procedures.

We report our key empirical results in Table 1, demonstrating the state-of-the-art performance of our method. Results in Table 1 are based on 8 runs of the algorithm with different random seeds, with metric values averaged over the last 10 optimization epochs. Following previous work, we consider L2Q (the global quaternion loss (6)), L2P (the global position loss (7)) and NPSS (the normalized power spectrum similarity score) (gopalakrishnan2019neural):

$\mathrm{L2Q} = \frac{1}{|\mathcal{D}|} \sum_{s \in \mathcal{D}} \sum_{t} \|\hat{\mathbf{q}}^s_t - \mathbf{q}^s_t\|_2, \qquad (6)$

$\mathrm{L2P} = \frac{1}{|\mathcal{D}|} \sum_{s \in \mathcal{D}} \sum_{t} \|\hat{\mathbf{p}}^s_t - \mathbf{p}^s_t\|_2, \qquad (7)$

where $\mathbf{q}^s_t$ represents the quaternion vector of all skeletal joints at time $t$, $\mathbf{p}^s_t$ denotes the global position vectors, and superscript $s$ denotes a sequence in the test dataset $\mathcal{D}$. We measure the metrics for the 5, 15 and 30 missing in-between frame scenarios.

Hyperparameter Value
Epochs 300
Batch size 64
Optimizer Adam
Learning rate, max 2e-4
Warmup period, epochs 50
Scheduler step size 200
Scheduler gamma 0.1
Dropout 0.2
Losses L1
Width () 1024
MHA heads () 8
Residual Blocks () 6
Layers, encoder MLP () 3
Layers, decoder MLP () 2
Embedding dimensionality 32
Augmentation NONE
Table 2: Settings of the Δ-interpolator hyperparameters.
Δ-mode             L2Q               L2P               NPSS
Input  Output      5    15   30      5    15   30      5      15     30
Last I 0.11 0.32 0.57 0.13 0.47 1.00 0.0014 0.0217 0.1217
Last Last 0.12 0.33 0.58 0.14 0.49 1.01 0.0015 0.0221 0.1217
No No 0.15 0.35 0.62 0.22 0.56 1.17 0.0017 0.0238 0.1300
No I 0.11 0.32 0.59 0.13 0.48 1.09 0.0014 0.0221 0.1252
No Last 0.12 0.33 0.59 0.14 0.51 1.12 0.0015 0.0227 0.1245
Table 3: Ablation of the Δ-interpolation regime on the LaFAN1 dataset. Lower scores are better.
Reconstruction Loss      L2Q               L2P               NPSS
                         5    15   30      5    15   30      5      15     30
With (proposed)          0.11 0.32 0.57    0.13 0.47 1.00    0.0014 0.0217 0.1217
Without                  0.13 0.34 0.59    0.15 0.50 1.03    0.0015 0.0228 0.1247
Table 4: Ablation of the loss terms on the LaFAN1 dataset. Lower scores are better.

5.2 Training and Hyperparameters

The training loop is implemented in PyTorch (paszke2019pytorch) using the Adam optimizer (kingma2015adam). The learning rate is warmed up to 0.0002 during the first 50 epochs using a linear schedule and steps down by a factor of 10 at epoch 250; training is continued until epoch 300. Most 3D geometry operations (transformations across quaternion, ortho6D and rotation matrix representations, quaternion arithmetic) are handled by pytorch3d (ravi2020pytorch3d).

Batch sampling. Following previous work, we split the entire dataset into windows of maximum length (see Appendix A for the details of building the dataset). To construct each batch (the batch size is 64), we set the number of start key-frames to 10 and the number of end key-frames to 1; we then sample the number of in-between frames $M$ in the range [5, 39] with weights $w_M$ without replacement. The weight associated with the number of in-between frames is set to be inversely proportional to it, $w_M \propto 1/M$. This weighting prevents overfitting on the windows with a large number of in-between frames: shorter windows are sampled more often, as they are more abundant and hence harder to overfit. Indeed, the number of unique non-overlapping sequences of a given total length $T$ is approximately inversely proportional to $T$. Finally, given the total sampled sequence length $T$, the sequence start index is sampled uniformly at random in the admissible range.
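The inverse-length weighting can be sketched as follows (illustrative code, not the paper's implementation; the final draw is shown with replacement purely to visualize the distribution):

```python
import numpy as np

def inbetween_length_probs(lo=5, hi=39):
    """Sampling distribution over the number of in-between frames M,
    with weight w_M proportional to 1 / M."""
    lengths = np.arange(lo, hi + 1)
    w = 1.0 / lengths
    return lengths, w / w.sum()

lengths, p = inbetween_length_probs()
rng = np.random.default_rng(0)
draws = rng.choice(lengths, size=10_000, p=p)
```

Note that P(M = 5) / P(M = 39) = 39/5, i.e. the shortest windows are drawn 7.8 times more often than the longest ones.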

Hyperparameters and compute requirements. The neural network encoder has 6 residual blocks of width 1024 with 8 MHA heads and 3 MLP layers each. The decoder's MLP has one hidden layer. All hyperparameter settings are provided in Table 2. Each row in our empirical tables is based on 8 runs of the algorithm with different random seeds, with metric values averaged over the last 10 optimization epochs and over the 8 runs. One training run takes approximately 8 hours on a single NVIDIA M40 GPU in a Dell PowerEdge C4130 server equipped with 2 x Intel Xeon E5-2660v3 CPUs. It takes approximately 4 days on this server to reproduce the results presented in Tables 3 and 4 (6 rows x 8 runs x 8 hours / 4 GPUs).

5.3 Ablation Studies

5.3.1 Δ-regime ablation

The ablation of the Δ-regime is presented in Table 3, which compares our proposed approach (Δ-Interpolator) against a few variants applying different configurations at the input and output of the neural network. We explored three configurations with the delta-regime and one configuration with no delta-regime at either input or output. Configuration “I” implies that the output is a Δ w.r.t. the linear interpolator. Configuration “Last” implies, for the input, that it is in the Δ-mode w.r.t. the last history frame and, for the output, that it is in the Δ-mode w.r.t. the Zero-Velocity model (last history frame). Finally, configuration “No” implies no Δ-mode (i.e. at the output the neural network predicts the global root position directly, and at the input the reference is not subtracted).

Our proposed Δ-Interpolator corresponds to the configuration (Last, I) in the first row of Table 3. The first alternative in the ablation study (Last, Last) differs in that, instead of relying on SLERP as a baseline, it uses the last known frame as a baseline both at the input and at the output, making the neural network operation completely local w.r.t. the reference frame implied by the last known frame. We can see that this leads to a small but consistent deterioration of metrics compared to the proposed Δ-interpolator. It is worth pointing out, however, that this variant does not rely on the SLERP interpolator at all. Therefore, it is viable to achieve very impressive results simply using interpolation with respect to the last known frame. We noticed that the SLERP interpolator may be computationally quite demanding and hard to optimize on GPU; therefore, in applications where computation is a bottleneck, the (Last, Last) configuration may turn out to be interesting. The third row of Table 3 is the configuration that does not rely on interpolation at all. We can see that it has noticeably degraded performance. Still, compared to the existing methods in Table 1, its performance is very competitive, validating our neural network architecture design. Comparing the first and the third rows in Table 3 confirms the effectiveness of the proposed Δ-Interpolator approach. Rows 4-5 ablate the use of the Δ-mode at the input of the deep neural network. We can see that (i) the input Δ-mode helps more significantly when the output Δ-mode is set to the last frame, (ii) the positive impact of the input Δ-mode is more pronounced when the inbetweening length is longer and (iii) the Δ-mode at the input has a positive effect both for the last-frame (Last) and the interpolator (I) configurations of the output Δ-mode.

5.3.2 Ablation of Reconstruction Loss

The ablation study of the reconstruction loss terms is presented in Table 4. The second row in this table shows the generalization result with the reconstruction loss terms removed. The reconstruction term includes two parts: the position reconstruction loss from (4) and the rotation reconstruction loss from (5). Interestingly, these terms do not directly penalize the errors on the missing frames, but only the reconstruction error on the key-frames. Also, the key-frames are provided as inputs, and it might seem that the task of reconstructing them should be trivial. However, we can clearly see that penalizing the reconstruction error on the key-frames provides a noticeable positive regularizing effect, showing in improved metrics measured on the test missing frames.

6 Discussion of Findings

Our key results demonstrate the state-of-the-art performance of the Δ-mode interpolator. Note that in our approach both the input and the output of the deep part of the network are defined in the local coordinate frame relative to the SLERP interpolator outputs. This is opposite to the findings of duan2021singleshot, who advocate the use of the global reference frame for neural interpolation. Our approach produces more accurate results and, owing to its local nature, makes the neural network robust w.r.t. out-of-distribution domain shifts. We believe this is an important contribution towards a methodology for developing robust and accurate inbetweening algorithms.

Our ablation studies demonstrate that our approach without the Δ-regime is significantly worse (although still competitive against (duan2021singleshot)), confirming the importance of the proposed Δ-interpolation approach. We additionally show that the last known frame can be used in lieu of SLERP as a reference, resulting in only slightly worse performance. This removes the dependency on SLERP during neural inbetweening, further validates the general Δ-interpolation idea laid out in equations (1,2), and makes our approach suitable for forward motion prediction applications.
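The reason the last-frame reference enables forward prediction is that, unlike SLERP, it needs no future key frame; it coincides with the zero-velocity model. A minimal sketch, with an illustrative pose layout:

```python
import numpy as np

def zero_velocity_baseline(context, n_missing):
    """Zero-velocity model: hold the last known frame across the gap.

    context   : (T_ctx, D) array of known past frames
    n_missing : number of frames to fill

    Using this as the Delta reference requires no future key frame,
    which is what makes the approach applicable to forward motion
    prediction, where only past context is available.
    """
    return np.repeat(context[-1][None, :], n_missing, axis=0)
```

In the (Last Last) configuration of Table 3, this trajectory simply replaces the SLERP output as the reference around which the network predicts its residual.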

The ablation studies imply that different input and output referencing methods have a noticeable effect on inbetweening performance, which opens up a promising direction for future research. In our view, answering the question of what simpler and more optimal input and output references would look like has the potential to bring additional inbetweening accuracy benefits without inflating computational costs. We would also like to point out the importance of the reconstruction terms shown in Section 5.3.2, a result that was somewhat surprising to us. We believe it points towards (i) searching for more optimal auxiliary reconstruction losses, (ii) smarter sampling of inputs, or (iii) employing data augmentation in conjunction with reconstruction as potentially promising ways to further improve generalization results.

7 Conclusions

This paper addresses a prominent 3D animation task: motion inbetweening. We propose an inbetweening algorithm in which a deep neural network acts in the Δ-regime, predicting the residual with respect to a linear SLERP interpolator. We empirically demonstrate that this mode of operation yields more accurate neural inbetweening results on the publicly available LaFAN1 benchmark. Additionally, the Δ-regime provides stronger robustness with respect to out-of-distribution shifts, as both the input and the output of the network can be defined in the reference frame local to the SLERP interpolator baseline. Moreover, we show that the last known frame can be used in lieu of the SLERP interpolator, further simplifying the implementation of the algorithm at a small accuracy cost and making it applicable to forward motion prediction applications.


Supplementary Material A LaFAN1 Data Preparation

Since the authors of (harvey2020robust) did not release a PyTorch-compatible data loader, we implement our own in accordance with their description of the sliding-window technique and data normalization. To ensure that our approach is correct, we implement the same basic baselines as the authors, also in PyTorch, including the zero-velocity model and the spherical linear interpolator (SLERP). We then validate our implementation by obtaining the same metric values for the zero-velocity and SLERP baselines as those reported in (harvey2020robust). Note that our Table 1 reports zero-velocity and SLERP metrics based on our own implementation of the data loader and the algorithms.
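For reference, the SLERP baseline between two unit quaternions can be written compactly; this is a generic minimal sketch of the standard formula, not the paper's exact implementation:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1
    at fraction t in [0, 1], each given as a length-4 array."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:               # q and -q are the same rotation:
        q1, dot = -q1, -dot     # take the shorter arc
    if dot > 0.9995:            # nearly parallel: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)      # angle between the two quaternions
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

Applied per joint between the last known frame and the first future key frame, this produces the baseline trajectory on which the Δ-Interpolator predicts its residual.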

As previously noted, the LaFAN1 benchmark (harvey2020robust) was provided to us in raw BVH format. To make the data usable in the PyTorch and Unity pipeline we implemented, it was necessary to first convert it to a set of regular CSV files (one for each BVH animation) and to assign the animations to a Unity engine avatar for integrating our model in the engine, as shown in Figure 1. This section outlines the steps taken for this conversion:

  1. Download the original LaFAN1 source and follow the steps outlined in the repository to extract 77 BVH files and verify data integrity.

  2. To import the animations into the Unity engine we first convert them from BVH to FBX files. This is done in Blender, a free and open-source computer graphics application.

  3. Create a new Unity project, import all the animations as assets, and drop one into the hierarchy window (top left).

  4. Before continuing, we disable compression, unit conversion, and keyframe reduction, and enable baking of axis conversion; these settings were chosen to maximize animation quality. If the conversion from Blender was done correctly, the animated avatar should appear correctly oriented when the software renders a bone structure for the avatar in the scene.

  5. To make the animations viewable in Unity we select the clip group and add an avatar to the FBX animation clips.

  6. We create a dataset from the clip group with the following parameters set: Bone positions (world space), Bone rotations (local space), root reference (local space) and timestamps. We do not enable root motion.

  7. Finally, we add an animation baker component to the clip in the hierarchy tab, link it to our skeleton characterization, and select 30 FPS output. The baker produces a time-stamped CSV file for each of the 77 animations.

While our process results in high-fidelity copies of the BVH animations in CSV format, there are some small sources of error. Specifically, steps 2 and 3 introduce most of the error, since Blender uses a right-handed coordinate system whereas the Unity engine operates with left-handed coordinates. The different handedness introduces a sign flip along the Z-axis, which we handle by exporting negative Z-axis coordinates from Blender, thereby introducing another axis flip and restoring the original coordinate orientation. However, this also changes the exported joint rotations, which are challenging to fix and are therefore left as is, apart from applying a quaternion discontinuity fix similar to (harvey2020robust).
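The discontinuity fix exploits the fact that q and -q encode the same rotation; whenever consecutive quaternions in a track fall on opposite hemispheres, the later one is negated. A minimal sketch of this standard fix (the function name is ours):

```python
import numpy as np

def fix_quaternion_discontinuities(quats):
    """Enforce sign continuity along a quaternion track of shape (T, 4).

    Since q and -q represent the same rotation, a negative dot product
    between consecutive quaternions indicates a sign flip that would
    appear as a jump in the raw coordinates; we negate to remove it.
    """
    out = quats.copy()
    for t in range(1, len(out)):
        if np.dot(out[t - 1], out[t]) < 0.0:
            out[t] = -out[t]
    return out
```

After this pass, consecutive quaternions always have a non-negative dot product, so downstream interpolation and learning see a smooth track.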

To validate that our approach produces high-fidelity conversions from BVH to CSV, we re-implement the zero-velocity and linear interpolation baselines and evaluate them on the same tasks as (harvey2020robust). Since these dummy models have zero parameters, if our pre-processing does not compromise data integrity we should obtain values very similar to those reported in (harvey2020robust). As we can see in Table 5, the benchmark models perform within 1%, thus validating our pre-processing pipeline.