
Human Motion Diffusion Model

Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource hungry and hard to control. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from motion generation literature. A notable design choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion.



1 Introduction

Human motion generation is a fundamental task in computer animation, with applications spanning from gaming to robotics. It is a challenging field for several reasons, including the vast span of possible motions, and the difficulty and cost of acquiring high-quality data. For the recently emerging text-to-motion setting, where motion is generated from natural language, another inherent problem is data labeling. For example, the label "kick" could refer to a soccer kick, as well as a Karate one. At the same time, given a specific kick there are many ways to describe it, from how it is performed to the emotions it conveys, constituting a many-to-many problem. Current approaches have shown success in the field, demonstrating plausible mapping from text to motion (Petrovich et al., 2022; Tevet et al., 2022; Ahuja & Morency, 2019). All these approaches, however, still limit the learned distribution since they mainly employ auto-encoders or VAEs (Kingma & Welling, 2013) (implying a one-to-one mapping or a normal latent distribution, respectively). In this respect, diffusion models are a better candidate for human motion generation, as they are free from assumptions on the target distribution, and are known to express well the many-to-many distribution matching problem we have described.

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2020; Ho et al., 2020) are a generative approach that is gaining significant attention in the computer vision and graphics community. When trained for conditioned generation, recent diffusion models (Ramesh et al., 2022; Saharia et al., 2022b) have shown breakthroughs in terms of image quality and semantics. The competence of these models has also been shown for other domains, including videos (Ho et al., 2022) and 3D point clouds (Luo & Hu, 2021). The problem with such models, however, is that they are notoriously resource demanding and challenging to control.

In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted diffusion-based generative model for the human motion domain. Being diffusion-based, MDM benefits from the aforementioned native many-to-many expressiveness, as evidenced by the resulting motion quality and diversity (Figure 1). In addition, MDM combines insights already well established in the motion generation domain, helping it be significantly more lightweight and controllable.


Figure 1: Our Motion Diffusion Model (MDM) reflects the many-to-many nature of text-to-motion mapping by generating diverse motions given a text prompt. Our custom architecture and geometric losses help yield high-quality motion. Darker color indicates later frames in the sequence.

First, instead of the ubiquitous U-net (Ronneberger et al., 2015) backbone, MDM is transformer-based. As we demonstrate, our architecture (Figure 2) is lightweight and better fits the temporal and non-spatial nature of motion data (represented as a collection of joints). A large volume of motion generation research is devoted to learning using geometric losses (Kocabas et al., 2020; Harvey et al., 2020; Aberman et al., 2020). Some, for example, regulate the velocity of the motion (Petrovich et al., 2021) to prevent jitter, or specifically consider foot sliding using dedicated terms (Shi et al., 2020). Consistently with these works, we show that applying geometric losses in the diffusion setting improves generation.

The MDM framework has a generic design enabling different forms of conditioning. We showcase three tasks: text-to-motion, action-to-motion, and unconditioned generation. We train the model in a classifier-free manner (Ho & Salimans, 2022), which enables trading-off diversity to fidelity, and sampling both conditionally and unconditionally from the same model. In the text-to-motion task, our model generates coherent motions (Figure 1) that achieve state-of-the-art results on the HumanML3D (Guo et al., 2022a) and KIT (Plappert et al., 2016) benchmarks. Moreover, our user study shows that human evaluators prefer our generated motions over real motions of the time (Figure 4(a)). In action-to-motion, MDM outperforms the state-of-the-art (Guo et al., 2020; Petrovich et al., 2021), even though they were specifically designed for this task, on the common HumanAct12 (Guo et al., 2020) and UESTC (Ji et al., 2018) benchmarks.

Lastly, we also demonstrate completion and editing. By adapting diffusion image-inpainting (Song et al., 2020b; Saharia et al., 2022a), we set a motion prefix and suffix, and use our model to fill in the gap. Doing so under a textual condition guides MDM to fill the gap with a specific motion that still maintains the semantics of the original input. By performing inpainting in the joints space rather than temporally, we also demonstrate the semantic editing of specific body parts, without changing the others (Figure 3).

Overall, we introduce Motion Diffusion Model, a motion framework that achieves state-of-the-art quality in several motion generation tasks, while requiring only about three days of training on a single mid-range GPU. It supports geometric losses, which are non-trivial in the diffusion setting but crucial to the motion domain, and combines state-of-the-art generative power with well-thought-out domain knowledge.

2 Related Work

2.1 Human Motion Generation

Neural motion generation, learned from motion capture data, can be conditioned by any signal that describes the motion. Many works use parts of the motion itself for guidance. Some predict motion from its prefix poses (Fragkiadaki et al., 2015; Martinez et al., 2017; Hernandez et al., 2019; Guo et al., 2022b). Others (Harvey & Pal, 2018; Kaufmann et al., 2020; Harvey et al., 2020; Duan et al., 2021) solve in-betweening and super-resolution tasks using bi-directional GRU (Cho et al., 2014) and Transformer (Vaswani et al., 2017) architectures. Holden et al. (2016) use an auto-encoder to learn a latent motion representation, then utilize it to edit and control motion with spatial constraints such as root trajectory and bone lengths. Motion can be controlled with high-level guidance given by an action class (Guo et al., 2020; Petrovich et al., 2021; Cervantes et al., 2022), audio (Li et al., 2021; Aristidou et al., 2022), or natural language (Ahuja & Morency, 2019; Petrovich et al., 2022). In most cases, authors suggest a dedicated approach to map each conditioning domain into motion.

In recent years, the leading approach for the text-to-motion task has been to learn a shared latent space for language and motion. JL2P (Ahuja & Morency, 2019) learns the KIT motion-language dataset (Plappert et al., 2016) with an auto-encoder, limiting it to a one-to-one mapping from text to motion. TEMOS (Petrovich et al., 2022) and T2M (Guo et al., 2022a) suggest using a VAE (Kingma & Welling, 2013) to map a text prompt into a normal distribution in latent space. Recently, MotionCLIP (Tevet et al., 2022) leveraged the shared text-image latent space learned by CLIP (Radford et al., 2021) to expand text-to-motion beyond the data limitations and to enable latent space editing.

The human motion manifold can also be learned without labels, as shown by Holden et al. (2016), V-Poser (Pavlakos et al., 2019), and more recently the dedicated MoDi architecture (Raab et al., 2022). We show that our model is capable of such an unsupervised setting as well.

2.2 Diffusion Generative Models

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2020) are a class of neural generative models, based on the stochastic diffusion process as it is modeled in thermodynamics. In this setting, a sample from the data distribution is gradually noised by the diffusion process. Then, a neural model learns the reverse process of gradually denoising the sample. Sampling the learned data distribution is done by denoising pure initial noise. Ho et al. (2020) and Song et al. (2020a) further developed the practices for image generation applications. For conditioned generation, Dhariwal & Nichol (2021) introduced classifier-guided diffusion, which was later adapted by GLIDE (Nichol et al., 2021) to enable conditioning over CLIP textual representations. The classifier-free guidance approach (Ho & Salimans, 2022) enables conditioning while trading off fidelity and diversity, and achieves better results (Nichol et al., 2021). In this paper, we implement text-to-motion by conditioning on CLIP in a classifier-free manner, similarly to text-to-image (Ramesh et al., 2022; Saharia et al., 2022b). Local editing of images is typically defined as an inpainting problem, where a part of the image is constant, and the inpainted part is denoised by the model, possibly under some condition (Song et al., 2020b; Saharia et al., 2022a). We adapt this technique to edit specific body parts or temporal intervals of a motion (in-betweening) according to an optional condition.

More recently, concurrent to this work, Zhang et al. (2022) and Kim et al. (2022) have suggested diffusion models for motion generation. Our work requires significantly fewer GPU resources and makes design choices that enable geometric losses, which improve results.

3 Motion Diffusion Model


Figure 2: (Left) Motion Diffusion Model (MDM) overview. The model is fed a motion sequence $x_t^{1:N}$ of length $N$ in a noising step $t$, as well as $t$ itself and a conditioning code $c$. $c$, a CLIP (Radford et al., 2021) based textual embedding in this case, is first randomly masked for classifier-free learning and then projected together with $t$ into the input token $z_{tk}$. In each sampling step, the transformer-encoder predicts the final clean motion $\hat{x}_0^{1:N}$. (Right) Sampling MDM. Given a condition $c$, we sample random noise $x_T$ at the dimensions of the desired motion, then iterate from $T$ to $1$. At each step $t$, MDM predicts the clean sample $\hat{x}_0$, and diffuses it back to $x_{t-1}$.

An overview of our method is described in Figure 2. Our goal is to synthesize a human motion $x^{1:N}$ of length $N$ given an arbitrary condition $c$. This condition can be any real-world signal that will dictate the synthesis, such as audio (Li et al., 2021; Aristidou et al., 2022), natural language (text-to-motion) (Tevet et al., 2022; Guo et al., 2022a) or a discrete class (action-to-motion) (Guo et al., 2020; Petrovich et al., 2021). In addition, unconditioned motion generation is also possible, which we denote as the null condition $c = \emptyset$. The generated motion $x^{1:N} = \{x^i\}_{i=1}^{N}$ is a sequence of human poses represented by either joint rotations or positions $x^i \in \mathbb{R}^{J \times D}$, where $J$ is the number of joints and $D$ is the dimension of the joint representation. MDM can accept motion represented by either locations, rotations, or both (see Section 4).

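Framework. Diffusion is modeled as a Markov noising process, $\{x_t^{1:N}\}_{t=0}^{T}$, where $x_0^{1:N}$ is drawn from the data distribution and

$$q(x_t^{1:N} \mid x_{t-1}^{1:N}) = \mathcal{N}\left(\sqrt{\alpha_t}\, x_{t-1}^{1:N},\ (1-\alpha_t) I\right),$$

where $\alpha_t \in (0,1)$ are constant hyper-parameters. When $\alpha_t$ is small enough, we can approximate $x_T^{1:N} \sim \mathcal{N}(0, I)$. From here on we use $x_t$ to denote the full sequence at noising step $t$.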
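As a minimal sketch of the noising chain, the snippet below repeatedly applies a Gaussian step that scales the previous state by $\sqrt{\alpha_t}$ and adds noise of variance $1-\alpha_t$, the standard form from Ho et al. (2020). The linear `alphas` schedule, the toy motion size, and all function names are illustrative assumptions and not the authors' implementation.

```python
import math
import random

def noising_step(x_prev, alpha_t, rng):
    """Sample x_t ~ N(sqrt(alpha_t) * x_{t-1}, (1 - alpha_t) * I), elementwise."""
    scale = math.sqrt(alpha_t)
    sigma = math.sqrt(1.0 - alpha_t)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in frame]
            for frame in x_prev]

def diffuse(x0, alphas, rng):
    """Run the Markov chain x_0 -> x_1 -> ... -> x_T and return every state."""
    states = [x0]
    for alpha_t in alphas:
        states.append(noising_step(states[-1], alpha_t, rng))
    return states

# Toy motion: N = 4 frames with 6 features each (hypothetical sizes).
rng = random.Random(0)
x0 = [[1.0] * 6 for _ in range(4)]
T = 200
alphas = [1.0 - 0.1 * (t + 1) / T for t in range(T)]  # assumed schedule
states = diffuse(x0, alphas, rng)

# Share of the original signal left in x_T: sqrt of the product of all alpha_t.
signal_scale = math.sqrt(math.prod(alphas))
```

With this assumed schedule `signal_scale` is close to zero, which is the sense in which the final state can be treated as approximately standard Gaussian noise.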

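In our context, conditioned motion synthesis models the distribution $p(x_0 \mid c)$ as the reversed diffusion process of gradually cleaning $x_T$. Instead of predicting $\epsilon_t$ as formulated by Ho et al. (2020), we follow Ramesh et al. (2022) and predict the signal itself, i.e., $\hat{x}_0 = G(x_t, t, c)$, with the simple objective (Ho et al., 2020),

$$\mathcal{L}_{\text{simple}} = E_{x_0 \sim q(x_0 \mid c),\ t \sim [1,T]}\left[\lVert x_0 - G(x_t, t, c) \rVert_2^2\right].$$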
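The following sketch shows one Monte Carlo term of a signal-prediction objective: draw a step $t$, noise the clean motion to that level, and score the denoiser's estimate of the clean motion with a squared L2 distance. It assumes the standard closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with $\bar\alpha_t = \prod_{k \le t} \alpha_k$ (Ho et al., 2020). The denoiser `G` is a hypothetical callable, and `identity_G` is only a stand-in used to exercise the code.

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t directly from x_0: sqrt(abar) * x0 + sqrt(1 - abar) * noise."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [[a * v + s * rng.gauss(0.0, 1.0) for v in frame] for frame in x0]

def simple_loss(x0, x0_hat):
    """Squared L2 distance between the clean motion and its prediction."""
    return sum((a - b) ** 2
               for fa, fb in zip(x0, x0_hat)
               for a, b in zip(fa, fb))

def training_term(G, x0, c, alpha_bars, rng):
    """One sample of the objective: draw t, noise x0, score G's prediction."""
    t = rng.randint(1, len(alpha_bars))        # t ~ Uniform{1, ..., T}
    x_t = q_sample(x0, alpha_bars[t - 1], rng)
    return simple_loss(x0, G(x_t, t, c))

def identity_G(x_t, t, c):
    """Hypothetical stand-in denoiser that returns its noisy input unchanged."""
    return x_t
```

Because the target is the clean motion rather than the noise, any loss that is defined on motions can be added to this term without extra conversion.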


Geometric losses. In the motion domain, generative networks are standardly regularized using geometric losses (Petrovich et al., 2021; Shi et al., 2020). These losses enforce physical properties and prevent artifacts, encouraging natural and coherent motion. In this work we experiment with three common geometric losses that regulate (1) positions (in case we predict rotations), (2) foot contact, and (3) velocities.

$$\mathcal{L}_{\text{pos}} = \frac{1}{N} \sum_{i=1}^{N} \lVert FK(x_0^i) - FK(\hat{x}_0^i) \rVert_2^2$$

$$\mathcal{L}_{\text{foot}} = \frac{1}{N-1} \sum_{i=1}^{N-1} \lVert \left(FK(\hat{x}_0^{i+1}) - FK(\hat{x}_0^i)\right) \cdot f_i \rVert_2^2$$

$$\mathcal{L}_{\text{vel}} = \frac{1}{N-1} \sum_{i=1}^{N-1} \lVert (x_0^{i+1} - x_0^i) - (\hat{x}_0^{i+1} - \hat{x}_0^i) \rVert_2^2$$

In case we predict joint rotations, $FK(\cdot)$ denotes the forward kinematic function converting joint rotations into joint positions (otherwise, it denotes the identity function). $f_i \in \{0,1\}^J$ is the binary foot contact mask for each frame $i$. Relevant only to the feet, it indicates whether they touch the ground, and is set according to binary ground truth data (Shi et al., 2020). In essence, it mitigates the foot-sliding effect by nullifying velocities when touching the ground.
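The three geometric terms can be sketched on plain nested lists, where a motion is a list of frames, a frame is a list of joints, and a joint is a 3D position. This is an illustration under stated assumptions: the forward kinematics function defaults to the identity (the case where positions are predicted directly), the contact mask holds one 0/1 flag per joint per frame, and all names are hypothetical.

```python
def frame_diff(a, b):
    """Per-joint difference a - b for two frames of shape [J][3]."""
    return [[pa - pb for pa, pb in zip(ja, jb)] for ja, jb in zip(a, b)]

def sq_norm(frame, mask=None):
    """Squared L2 norm of a frame; mask[j] (0 or 1) optionally gates joint j."""
    total = 0.0
    for j, joint in enumerate(frame):
        w = 1.0 if mask is None else mask[j]
        total += w * sum(v * v for v in joint)
    return total

def identity_fk(frame):
    """Stand-in for forward kinematics when positions are predicted directly."""
    return frame

def pos_loss(x0, x0_hat, fk=identity_fk):
    """Mean squared distance between true and predicted joint positions."""
    return sum(sq_norm(frame_diff(fk(a), fk(b)))
               for a, b in zip(x0, x0_hat)) / len(x0)

def vel_loss(x0, x0_hat):
    """Mean squared difference between true and predicted frame-to-frame velocities."""
    n = len(x0)
    total = 0.0
    for i in range(n - 1):
        v_true = frame_diff(x0[i + 1], x0[i])
        v_pred = frame_diff(x0_hat[i + 1], x0_hat[i])
        total += sq_norm(frame_diff(v_true, v_pred))
    return total / (n - 1)

def foot_loss(x0_hat, contact, fk=identity_fk):
    """Penalize predicted velocity of joints flagged as touching the ground."""
    n = len(x0_hat)
    total = 0.0
    for i in range(n - 1):
        v_pred = frame_diff(fk(x0_hat[i + 1]), fk(x0_hat[i]))
        total += sq_norm(v_pred, mask=contact[i])
    return total / (n - 1)
```

Note that the foot term only looks at the prediction: a joint marked as in contact is pushed toward zero velocity regardless of the ground-truth trajectory.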

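Overall, our training loss is

$$\mathcal{L} = \mathcal{L}_{\text{simple}} + \lambda_{\text{pos}} \mathcal{L}_{\text{pos}} + \lambda_{\text{vel}} \mathcal{L}_{\text{vel}} + \lambda_{\text{foot}} \mathcal{L}_{\text{foot}}.$$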


Model. Our model is illustrated in Figure 2. We implement $G$ with a straightforward transformer (Vaswani et al., 2017) encoder-only architecture. The transformer architecture is temporally aware, enabling learning arbitrary length motions, and is well-proven for the motion domain (Petrovich et al., 2021; Duan et al., 2021; Aksan et al., 2021). The noise time-step $t$ and the condition code $c$ are each projected to the transformer dimension by separate feed-forward networks, then summed to yield the token $z_{tk}$. Each frame of the noised input $x_t$ is linearly projected into the transformer dimension and summed with a standard positional embedding. $z_{tk}$ and the projected frames are then fed to the encoder. Excluding the first output token (corresponding to $z_{tk}$), the encoder result is projected back to the original motion dimensions, and serves as the prediction $\hat{x}_0$. We implement text-to-motion by encoding the text prompt to $c$ with the CLIP (Radford et al., 2021) text encoder, and action-to-motion with learned embeddings per class.
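To make the token layout concrete, the sketch below assembles the encoder input as one conditioning token followed by one token per frame, and reads the prediction back by dropping the first output token. It is only a layout illustration: the transformer itself is omitted, the time-step and condition embeddings are assumed to be already projected to the transformer dimension, the positional embedding is assumed to be the sinusoidal one of Vaswani et al. (2017), and all names and weight matrices are hypothetical.

```python
import math

def positional_embedding(pos, dim):
    """Sinusoidal embedding (Vaswani et al., 2017); dim must be even."""
    emb = []
    for k in range(0, dim, 2):
        angle = pos / (10000 ** (k / dim))
        emb += [math.sin(angle), math.cos(angle)]
    return emb

def linear(vec, weight):
    """Apply a weight matrix given as a list of rows (no bias, for brevity)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def build_encoder_input(x_t, t_emb, c_emb, frame_proj):
    """Return [z_tk, frame_1, ..., frame_N] as vectors of the transformer dimension."""
    dim = len(t_emb)
    z_tk = [a + b for a, b in zip(t_emb, c_emb)]   # summed time-step and condition
    tokens = [z_tk]
    for i, frame in enumerate(x_t):
        proj = linear(frame, frame_proj)
        pos = positional_embedding(i, dim)
        tokens.append([p + q for p, q in zip(proj, pos)])
    return tokens

def read_prediction(encoder_out, out_proj):
    """Drop the first output token (the z_tk slot) and project back to motion space."""
    return [linear(tok, out_proj) for tok in encoder_out[1:]]
```

Since the sequence length is just the number of frames plus one, the same layout works for motions of any length.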

Sampling from $p(x_0 \mid c)$ is done in an iterative manner, according to Ho et al. (2020). In every time step $t$ we predict the clean sample $\hat{x}_0 = G(x_t, t, c)$ and noise it back to $x_{t-1}$. This is repeated from $t = T$ until $x_0$ is achieved (Figure 2 right). We train our model using classifier-free guidance (Ho & Salimans, 2022). In practice, $G$ learns both the conditioned and the unconditioned distributions by randomly setting $c = \emptyset$ for a fraction of the samples, such that $G(x_t, t, \emptyset)$ approximates $p(x_0)$. Then, when sampling $G$ we can trade off diversity and fidelity by interpolating or even extrapolating the two variants using the guidance scale $s$:

$$G_s(x_t, t, c) = G(x_t, t, \emptyset) + s \cdot \left(G(x_t, t, c) - G(x_t, t, \emptyset)\right)$$
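A hedged sketch of the sampling loop with classifier-free guidance follows. At each step it mixes the conditioned and unconditioned predictions of the clean sample with a scale `s`, then draws the next, less noisy state from the Gaussian posterior of Ho et al. (2020) written in terms of the predicted clean sample. Motions are flattened to 1D lists for brevity, `None` stands for the null condition, and `G` is a hypothetical denoiser; none of this is taken from the authors' code.

```python
import math
import random

def guided_prediction(G, x_t, t, c, s):
    """Classifier-free guidance: G(x, t, None) + s * (G(x, t, c) - G(x, t, None))."""
    uncond = G(x_t, t, None)
    cond = G(x_t, t, c)
    return [u + s * (k - u) for u, k in zip(uncond, cond)]

def sample(G, c, s, alphas, dim, rng):
    """Iterate t = T..1: predict x0_hat, then draw x_{t-1} given x_t and x0_hat."""
    T = len(alphas)
    alpha_bar = [1.0]                       # alpha_bar[0] = 1 is the clean-data level
    for a in alphas:
        alpha_bar.append(alpha_bar[-1] * a)
    x_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        x0_hat = guided_prediction(G, x_t, t, c, s)
        beta_t = 1.0 - alphas[t - 1]
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        # Posterior q(x_{t-1} | x_t, x0_hat): mean coefficients and variance.
        coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
        coef_xt = math.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
        var = beta_t * (1.0 - ab_prev) / (1.0 - ab_t)
        x_t = [coef_x0 * a + coef_xt * b + math.sqrt(var) * rng.gauss(0.0, 1.0)
               for a, b in zip(x0_hat, x_t)]
    return x_t
```

With `s = 1` the guided prediction equals the conditioned one, `s = 0` recovers unconditioned sampling, and `s > 1` extrapolates away from the unconditioned prediction.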



Editing. We enable motion in-betweening in the temporal domain, and body part editing in the spatial domain, by adapting diffusion inpainting to motion data. Editing is done only during sampling, without any training involved. Given a subset of the motion sequence inputs, when sampling the model (Figure 2 right), at each iteration we overwrite $\hat{x}_0$ with the input part of the motion. This encourages the generation to remain coherent with the original input, while completing the missing parts. In the temporal setting, the prefix and suffix frames of the motion sequence are the input, and we solve a motion in-betweening problem (Harvey et al., 2020). Editing can be done either conditionally or unconditionally (by setting $c = \emptyset$). In the spatial setting, we show that body parts can be re-synthesized according to a condition while keeping the rest intact, through the use of the same completion technique.
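The overwrite step described above can be sketched as a masked merge of the model's clean-sample prediction with the known input motion. The mask layout (one 0/1 flag per feature per frame), the denoiser `G`, and the function names are assumptions made for illustration.

```python
def overwrite_known(x0_hat, x_input, keep_mask):
    """Where keep_mask is 1 take the input motion, elsewhere keep the prediction."""
    return [[inp if m else pred for pred, inp, m in zip(pf, xf, mf)]
            for pf, xf, mf in zip(x0_hat, x_input, keep_mask)]

def edited_prediction(G, x_t, t, c, x_input, keep_mask):
    """One sampling iteration's prediction with the known part pasted back in."""
    x0_hat = G(x_t, t, c)                   # hypothetical denoiser call
    return overwrite_known(x0_hat, x_input, keep_mask)
```

The merged prediction is then noised back to the next step exactly as in ordinary sampling, so no retraining is needed for editing.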

4 Experiments


Figure 3: Editing applications. Light blue frames represent motion input and bronze frames are the generated motion. Motion in-betweening (left+center) can be performed conditioned on text or without condition by the same model. Specific body part editing using text is demonstrated on the right: the lower body joints are fixed to the input motion while the upper body is altered to fit the input text prompt.

We implement MDM for three motion generation tasks: text-to-motion (4.1), action-to-motion (4.2), and unconditioned generation (5.2). Each sub-section reviews the data and metrics of the benchmarks used, provides implementation details, and presents qualitative and quantitative results. Then, we show implementations of motion in-betweening (both conditioned and unconditioned) and body-part editing by adapting diffusion inpainting to motion (5.1). Our models have been trained with noising steps and a cosine noise schedule. All of them have been trained on a single NVIDIA GeForce RTX 2080 Ti GPU for a period of about three days.
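For readers unfamiliar with cosine noise schedules, the sketch below follows the common formulation of Nichol & Dhariwal (2021); that this exact variant, including the offset `s = 0.008` and the clipping value, matches the one used here is an assumption. The step count in the example is a toy value.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level at step t, normalized so that the value at t = 0 is 1."""
    def f(u):
        return math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def cosine_betas(T, max_beta=0.999):
    """Per-step noise variances beta_t = 1 - alpha_bar(t) / alpha_bar(t - 1), clipped."""
    return [min(1.0 - cosine_alpha_bar(t, T) / cosine_alpha_bar(t - 1, T), max_beta)
            for t in range(1, T + 1)]

betas = cosine_betas(50)   # T = 50 is chosen only for illustration
```

The per-step coefficients of the noising process then follow as `alpha_t = 1 - beta_t`.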

4.1 Text-to-Motion

Text-to-motion is the task of generating motion given an input text prompt. The output motion is expected both to implement the textual description and to be a valid sample from the data distribution (i.e., adhering to general human abilities and the rules of physics). In addition, for each text prompt, we also expect a distribution of motions matching it, rather than just a single result. We evaluate our model using two leading benchmarks, KIT (Plappert et al., 2016) and HumanML3D (Guo et al., 2022a), over the set of metrics suggested by Guo et al. (2022a): R-precision and Multimodal-Dist measure the relevancy of the generated motions to the input prompts, FID measures the dissimilarity between the generated and ground truth distributions (in latent space), Diversity measures the variability in the resulting motion distribution, and MultiModality is the average variance given a single text prompt. For the full implementation of the metrics, please refer to Guo et al. (2022a). We use HumanML3D as a platform to compare different backbones of our model, discovering that the diffusion framework is relatively agnostic to this attribute. In addition, we conduct a user study comparing our model to current art and ground truth motions.
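As a rough illustration of a diversity-style metric, the sketch below averages the Euclidean distance between randomly drawn pairs of motion feature vectors. This is one plausible reading of the idea and not the reference protocol: the feature extractor, the number of pairs, and the pairing scheme are assumptions, and Guo et al. (2022a) should be consulted for the exact definition.

```python
import math
import random

def diversity(features, n_pairs, rng):
    """Mean Euclidean distance between n_pairs randomly drawn pairs of feature vectors."""
    total = 0.0
    for _ in range(n_pairs):
        a, b = rng.sample(features, 2)      # two distinct motions
        total += math.dist(a, b)
    return total / n_pairs
```

Applying the same function only to motions generated from a single prompt gives a per-prompt spread in the spirit of MultiModality.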

Data. HumanML3D is a recent dataset, textually re-annotating motion capture from the AMASS (Mahmood et al., 2019) and HumanAct12 (Guo et al., 2020) collections. It contains motions annotated by textual descriptions. In addition, it suggests a redundant data representation including a concatenation of root velocity, joint positions, joint velocities, joint rotations and the foot contact binary labels. In this section we also use the same representation for the KIT dataset, provided by the same publishers. Although KIT is limited in the number and the diversity of samples, most text-to-motion research is based on it, hence we view it as important to evaluate on it as well.

Implementation. In addition to our Transformer encoder-only backbone (Section 3), we experiment with three more backbones for MDM: (1) Transformer decoder, which injects $z_{tk}$ through the cross-attention layer instead of as an input token; (2) Transformer decoder + input token, where $z_{tk}$ is injected both ways; and (3) GRU (Cho et al., 2014), which concatenates $z_{tk}$ to each input frame (Table 1). Our models were trained with batch size , layers (except GRU that was optimal at ), and latent dimension . To encode the text we use a frozen CLIP-ViT-B/32 model. Each model was trained for steps, after which the checkpoint that minimizes the FID metric was chosen to be reported. Since foot contact and joint locations are explicitly represented in HumanML3D, we don't apply geometric losses in this section. We evaluate our models with a guidance scale $s$ that provides a diversity-fidelity sweet spot (Figure 4).

Quantitative evaluation. We evaluate and compare our models to current art (JL2P (Ahuja & Morency, 2019), Text2Gesture (Bhattacharya et al., 2021), and T2M (Guo et al., 2022a)) with the metrics suggested by Guo et al. (2022a). As can be seen in Tables 1 and 2, MDM achieves state-of-the-art results in FID, Diversity, and MultiModality, indicating high diversity per input text prompt, and high-quality samples, as can also be seen qualitatively in Figure 1.

User study. We asked 31 users to choose between MDM and state-of-the-art works in a side-by-side view, with both samples generated from the same text prompt randomly sampled from the KIT test set. We repeated this process with 10 samples per model and 10 repetitions per sample. This user study enabled a comparison with the recent TEMOS model (Petrovich et al., 2022), which was not included in the HumanML3D benchmark. Fig. 4 shows that most of the time, MDM was preferred over the compared models, and even preferred over ground truth samples in of the cases.

Method R Precision (top 3) FID Multimodal Dist Diversity Multimodality
Real -
JL2P -
Text2Gesture -
MDM (ours)
MDM (decoder)
      + input token
Table 1: Quantitative results on the HumanML3D test set. All methods use the real motion length from the ground truth. '→' means results are better if the metric is closer to the real distribution. We run all the evaluation 20 times (except MultiModality, which runs 5 times) and '±' indicates the 95% confidence interval. Bold indicates best result.
Method R Precision (top 3) FID Multimodal Dist Diversity Multimodality
Real -
JL2P -
Text2Gesture -
MDM (ours)
Table 2: Quantitative results on the KIT test set.
(a) KIT User Study
(b) Classifier-free scale sweep
Figure 4: (a) Text-to-motion user study for the KIT dataset. Each bar represents the preference rate of MDM over the compared model. MDM was preferred over the other models most of the time, and of the cases even over ground truth samples. The dashed line marks . (b) Guidance-scale sweep for the HumanML3D dataset. The FID (lower is better) and R-precision (higher is better) metrics, as a function of the scale $s$, draw an accuracy-fidelity sweet spot around .
Method FID Accuracy Diversity Multimodality
Real (INR)
Real (ours)
Action2Motion (2020)
ACTOR (2021)
INR (2022)
MDM (ours)
  w/o foot contact
Table 3: Evaluation of action-to-motion on the HumanAct12 dataset. Our model leads the board in three out of four metrics. Ground-truth evaluation results are slightly different for each of the works, due to implementation differences, such as Python package versions. It is important to assess the diversity and multimodality of each model using its own ground-truth results, as they are measured by their distance from GT. We show the GT metrics measured by our model and by the leading compared work, INR (Cervantes et al., 2022). Bold indicates best result, underline indicates second best, '±' indicates the 95% confidence interval, and '→' indicates that closer to real is better.
Method Accuracy Diversity Multimodality
ACTOR (2021)
INR (2022) (best variation)
MDM (ours)
  w/o foot contact
Table 4: Evaluation of action-to-motion on the UESTC dataset. The performance improvement with our model shows a clear gap from state-of-the-art. Bold indicates best result, underline indicates second best, '±' indicates the 95% confidence interval, and '→' indicates that closer to real is better.

4.2 Action-to-Motion

Action-to-motion is the task of generating motion given an input action class, represented by a scalar. The output motion should faithfully animate the input action, and at the same time be natural and reflect the distribution of the dataset on which the model is trained. Two datasets are commonly used to evaluate action-to-motion models: HumanAct12 (Guo et al., 2020) and UESTC (Ji et al., 2018). We evaluate our model using the set of metrics suggested by Guo et al. (2020), namely Fréchet Inception Distance (FID), action recognition accuracy, diversity and multimodality. The combination of these metrics makes a good measure of the realism and diversity of generated motions.

Data. HumanAct12 (Guo et al., 2020) offers approximately 1200 motion clips, organized into 12 action categories, with 47 to 218 samples per label. UESTC (Ji et al., 2018) consists of 40 action classes, 40 subjects and 25K samples, and is split to train and test. We adhere to the cross-subject testing protocol used by current works, with 225-345 samples per action class. For both datasets we use the sequences provided by Petrovich et al. (2021).

Implementation. The implementation presented in Figure 2 holds for all the variations of our work. In the case of action-to-motion, the only change would be the substitution of the text embedding by an action embedding. Since action is represented by a scalar, its embedding is fairly simple; each input action class scalar is converted into a learned embedding of the transformer dimension.

The experiments have been run with batch size , a latent dimension of , and an encoder-transformer architecture. Training on HumanAct12 and UESTC has been carried out for and steps respectively. In our tables we display the evaluation of the checkpoint that minimizes the FID metric.

Quantitative evaluation. Tables 3 and 4 reflect MDM’s performance on the HumanAct12 and UESTC datasets respectively. We conduct 20 evaluations, with 1000 samples in each, and report their average and a 95% confidence interval. We test two variations, with and without foot contact loss. Our model leads the board for both datasets. The variation with no foot contact loss attains slightly better results; nevertheless, as shown in our supplementary video, the contribution of foot contact loss to the quality of results is important, and without it we witness artifacts such as shakiness and unnatural gestures.
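Reporting an average with a 95% confidence interval over repeated evaluation runs can be sketched as below. The normal approximation (1.96 standard errors) is an assumption, since the exact interval construction is not stated here, and the function name is illustrative.

```python
import math
import statistics

def mean_and_ci95(values):
    """Return (mean, half-width) of a normal-approximation 95% confidence interval."""
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width
```

For example, passing the 20 per-run values of one metric yields the pair that would be printed as mean ± half-width in a results table.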

5 Additional Applications

5.1 Motion Editing

In this section we implement two motion editing applications, in-betweening and body part editing, both using the same approach in the temporal and spatial domains, respectively. For in-betweening, we fix the first and last frames of the motion, leaving the model to generate the remainder in the middle. For body part editing, we fix the joints we don't want to edit and leave the model to generate the rest. In particular, we experiment with editing the upper body joints only. In Figure 3 we show that in both cases, using the method described in Section 3 generates smooth motions that adhere both to the fixed part of the motion and the condition (if one was given).
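The two editing modes differ only in which entries of the motion are held fixed, which can be expressed as a 0/1 mask over frames and features. The sketch below builds such masks; the `keep_fraction` parameter, the joint indices, and the flat per-frame feature layout are hypothetical choices for illustration.

```python
def temporal_mask(n_frames, n_features, keep_fraction):
    """1 for the first and last keep_fraction of frames, 0 for the gap to generate."""
    k = int(n_frames * keep_fraction)
    return [[1 if (i < k or i >= n_frames - k) else 0] * n_features
            for i in range(n_frames)]

def joint_mask(n_frames, n_joints, feats_per_joint, fixed_joints):
    """1 for every feature of a joint listed in fixed_joints, 0 elsewhere."""
    row = []
    for j in range(n_joints):
        row += [1 if j in fixed_joints else 0] * feats_per_joint
    return [list(row) for _ in range(n_frames)]
```

For upper-body editing, `fixed_joints` would list the lower-body joints so that only the upper body is re-synthesized.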

Method         FID     KID    Precision   Recall   Multimodality
ACTOR (2021)   48.80   0.53   0.72        0.74     14.10
MoDi (2022)    13.03   0.12   0.71        0.81     17.57
MDM (ours)     31.92   0.36   0.66        0.62     17.00
Table 5: Evaluation of unconstrained synthesis on the HumanAct12 dataset. We test MDM in the challenging unconstrained setting, and compare with MoDi (Raab et al., 2022), a work that was specially designed for such a setting. We demonstrate that in addition to being able to support any condition, we can achieve plausible results in the unconstrained setting. Bold indicates best result.

5.2 Unconstrained Synthesis

The challenging task of unconstrained synthesis has been studied by only a few (Holden et al., 2016; Raab et al., 2022). In the presence of data labeling, e.g., action classes or text descriptions, the labels work as a supervising factor, and facilitate a structured latent space for the training network. The lack of labeling makes training more difficult. The human motion field possesses rich unlabeled datasets (Adobe Systems Inc., 2021), and the ability to train on top of them is an advantage. Daring to test MDM in the challenging unconstrained setting, we follow MoDi (Raab et al., 2022) for evaluation. We use the metrics they suggest (FID, KID, precision/recall and multimodality), and run on an unconstrained version of the HumanAct12 (Guo et al., 2020) dataset.

Data. Although annotated, we use HumanAct12 (see Section 4.2) in an unconstrained fashion, ignoring its labels. The choice of HumanAct12 rather than a dataset with no labels (e.g., Mixamo (Adobe Systems Inc., 2021)), is for compatibility with previous publications.

Implementation. Our model uses the same architecture for all forms of conditioning, as well as for the unconstrained setting. The only change to the structure shown in Figure 2 is the removal of the conditional input, such that $z_{tk}$ is composed of the projection of $t$ only. To simulate unconstrained behavior, ACTOR (Petrovich et al., 2021) has been trained by Raab et al. (2022) with a single class label assigned to all motions.

Quantitative evaluation. The results of our evaluation are shown in Table 5. We demonstrate superiority over works that were not designed for an unconstrained setting, and get closer to MoDi (Raab et al., 2022). MoDi is carefully molded for unconstrained settings, while our work can be applied to any (or no) constraint, and also provides editing capabilities.

6 Discussion

We have presented MDM, a method that lends itself to various human motion generation tasks. MDM is an atypical classifier-free diffusion model, featuring a transformer-encoder backbone, and predicting the signal rather than the noise. This yields both a lightweight model that is easy to train, and an accurate one that gains much from the applicable geometric losses. Our experiments show superiority in conditioned generation, but also that this approach is not very sensitive to the choice of architecture.

A notable limitation of the diffusion approach is the long inference time, requiring a forward pass for every denoising step to produce a single result. Since our motion model is small anyway, using dimensions an order of magnitude smaller than images, our inference time shifts from less than a second to only about a minute, which is an acceptable compromise. As diffusion models continue to evolve, besides better compute, in the future we would be interested in seeing how to incorporate better control into the generation process, and widen the options for applications even further.


We thank Rinon Gal for his useful suggestions and references. This research was supported in part by the Israel Science Foundation (grants no. 2492/20 and 3441/21), Len Blavatnik and the Blavatnik family foundation, and The Tel Aviv University Innovation Laboratories (TILabs).