
Playable Environments: Video Manipulation in Space and Time

by   Willi Menapace, et al.

We present Playable Environments - a new representation for interactive video generation and manipulation in space and time. With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions. The actions are learnt in an unsupervised manner. The camera can be controlled to get the desired viewpoint. Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering. To support diverse appearances of objects, we extend neural radiance fields with style-based modulation. Our method trains on a collection of various monocular videos requiring only the estimated camera parameters and 2D object locations. To set a challenging benchmark, we introduce two large scale video datasets with significant camera movements. As evidenced by our experiments, playable environments enable several creative applications not attainable by prior video synthesis works, including playable 3D video generation, stylization and manipulation. Further details, code and examples are available at





Code Repositories


Official PyTorch implementation of "Playable Environments: Video Manipulation in Space and Time", CVPR 2022


1 Introduction

Figure 1: Given a single initial frame, our method creates playable environments that allow the user to interactively generate different videos by specifying discrete actions to control players, manipulating camera trajectory and indicating the style for each object in the scene.

What would you change in the last tennis match you saw? The actions of the player? The style of the field, or, perhaps, the camera trajectory to observe a highlight more dramatically? To do so interactively, the geometry and the style of the field and the players need to be reconstructed. Players’ actions need to be understood and the outcomes of future actions anticipated. To enable these features one needs to reconstruct the observed environment in 3D and provide simple and intuitive interaction, offering an experience similar to playing a video game. We call these representations Playable Environments (PE).

Such a representation enables multiple creative applications, such as 3D- and action-aware video editing, camera trajectory manipulation, changing the action sequence, the agents and their styles, or continuing the video in time, beyond the observed footage. Fig. 1 shows a playable environment for tennis matches. In it, the user specifies actions to move the players, controls the viewpoint and changes the style of the players and the field. The environment can be played, akin to a video game, but with real objects.

In this work, we propose a method to construct PEs of complex scenes that supports a large set of interactive manipulations. Trained on a dataset of monocular videos, our method presents six core characteristics listed in Tab. 1 that enable the creation of such PEs. Our framework allows the user to interactively generate videos by providing discrete actions 1 and controlling the camera pose 2. Furthermore, it can represent environments with multiple objects 3 with varying poses 4 and appearances 5 and is robust to imprecise inputs 6. In particular, we do not require ground-truth camera intrinsics and extrinsics, but assume they can be estimated for each frame. Neither do we assume ground-truth object locations, but rely on an off-the-shelf object detector [ren2015faster] to locate the agents in 2D, such as both tennis players. No other supervision is required.

Playable Environments encapsulate and extend representations built by several prior image or video manipulation methods. Novel view synthesis and volumetric rendering methods support re-rendering of static scenes. However, while some methods support moving or articulated objects [pumarola2021dnerf, tretschk2021nonrigid, Ost_2021_CVPR, yuan2021star], it is challenging for them to handle dynamic environments and they do not allow user interaction, making them undesirable for modeling compelling environments. Video synthesis methods manipulate videos by predicting future frames [lee2018savp, kumar2020videoflow, tulyakov2018moco, tian2021good], animating [siarohin2019monkeynet, Siarohin2019firstorder, siarohin2021motion] or playing videos [menapace2021pvg], but environments modeled with such methods typically lack camera control and multi-object support. Consequently, these methods limit interactivity as they do not take into account the 3D nature of the environment.

| # | Name | Description |
|---|------|-------------|
| 1 | Playability | The user can control generation with discrete actions. |
| 2 | Camera control | The camera pose is explicitly controlled at test time. |
| 3 | Multi-object | Each object is explicitly modeled. |
| 4 | Deformable objects | The model handles deformable objects such as human bodies. |
| 5 | Appearance changes | The model handles objects whose appearance is not constant in the training set. |
| 6 | Robustness | The model is robust to calibration and localization errors. |

Table 1: Characteristics of our method for Playable Environments. Each row is referred to in the text by its number.

Our method consists of two components. The first one is the synthesis module. It extracts the state of the environment—location, style and non-rigid pose of each object—and renders the state back to the image space. Recently introduced Neural Radiance Fields (NeRFs) [mildenhall2020nerf] represent an attractive tool for their ability to render novel views. In this work, we introduce a style-based modification of NeRF to support objects of different appearances. Furthermore, we propose a compositional non-rigid volumetric rendering approach handling the rigid parts of the scene and non-rigid objects. The second component—the action module—enables playability. It takes two consecutive states of the environment and predicts an action with respect to the camera orientation. We train our framework using reconstruction losses in the image space and the state space, and a novel loss for action consistency. Finally, to improve temporal dynamics, we introduce a temporal discriminator that operates on sequences of environment states.

To thoroughly evaluate 1-6, we introduce two complementary large-scale datasets for the training of playable environments, a synthetic and a real one. The first is intended to evaluate 1-5, with a particular focus on camera control thanks to the synthetic ground truth, the second to evaluate 1-6, with a particular focus on 4-6 given the high diversity present in this dataset. We propose an extensive evaluation of our method with several baselines derived from existing NeRF and video generation methods. These experiments show that our method is able to generate high-quality videos and outperforms all baselines in terms of playability, camera control and video quality.

In summary, the primary contributions of this work are as follows: a new framework for the creation of compelling Playable Environments with the characteristics in Tab. 1, featuring a new compositional NeRF that handles deformable objects with different visual styles and an action module that operates in the latent space of our NeRF model; two challenging large-scale datasets for training and evaluating PEs to stimulate future research in this area.

2 Related Works

Video generation has seen incredible progress over past years. The video synthesis task has numerous formulations which mostly differ in the type of conditional information that is used for generation. The generation process could be conditioned on previous frames [finn2016cdna, mathieu2015deep, vondrick2015anticipating, lee2018savp, tulyakov2018moco], on another video [wang2018video, siarohin2019monkeynet, Siarohin2019firstorder, siarohin2021motion], on the pose of the agent [chan2019everybody] or even be completely unconditional [saitotrain, tulyakov2018moco]. Moreover, several works proposed to condition the generation of each single frame on an action label [chiappa2017recurrent, nunes2020action, oh2015action, Kim2020_GameGan]. Still, all these methods require action supervision for training.

Playable video generation (PVG) was recently introduced in Menapace et al. [menapace2021pvg]. Differently from prior works in this domain which required annotated action labels [Kim2020_GameGan, kim2021drivegan], their method, CADDY, automatically infers actions during training in a completely unsupervised manner from raw videos. This method is closely related to ours. However, CADDY assumes only a single controllable object while here we also model the camera movement, complex 3D interactions and support a variety of object appearances.

Novel view synthesis methods traditionally utilized depth maps [CDSD13, Soft3DReconstruction] or multi-view geometry [Kopf2013, Zitnick2004HighqualityVV, Seitz2006comparison] in order to reconstruct the underlying 3D representation and later render new views of the corresponding scene. Recently, Neural Radiance Fields (NeRF) [mildenhall2020nerf] revolutionized the field of novel view synthesis. The main idea of NeRF is to model the scene as a continuous 5D function, usually represented by an MLP, and directly query this function along the camera rays. Since the pioneering work in [mildenhall2020nerf], numerous NeRF-based models have been proposed. For instance, some works proposed to decompose the foreground and background [kaizhang2020nerfplusplus, Niemeyer2021CAMPARI]. Other works generalised NeRF to dynamic scenes [Ost_2021_CVPR, tretschk2021nonrigid, yuan2021star, zhang2021stnerf]. GIRAFFE [niemeyer2021giraffe] and GANcraft [hao2021GANcraft] proposed to utilize an internal representation that is rendered in a feature space and later decoded by a standard 2D convolutional network. However, none of these methods is able to generalize to multiple monocular videos, several moving and deforming objects, and diverse object and scene appearances. Comparatively, our method can be trained with such data. Moreover, for enriching the interactivity of the playable environment, our method can control objects in the scene with action labels that are discovered in an unsupervised manner.

3 Method

Figure 2: Overview of our framework. The encoder extracts environment states for every object in the scene. The synthesis module follows a NeRF-like architecture to reconstruct the input frame and allows for camera manipulation. We introduce the action module that learns to encode state dynamics with discrete action labels. At test time, these learned action labels are provided by the user to control the generated content.

Our framework is based on the encoder-decoder architecture shown in Fig. 2 whose design is driven by the playable environment characteristics 1-6 in Tab. 1. At each time step, the encoder network outputs a state vector for every object in the scene. To enable playability 1, we include an action module in the bottleneck layer that has two goals. First, it learns discrete action labels in an unsupervised manner. More precisely, we learn to discretize the transition between two consecutive states using a discrete action label, where the number of actions is a hyper-parameter specified before training. Second, the action module is used at test time to condition the next frame generation on the action selected by the user. Finally, the decoder network, referred to as the synthesis module, is in charge of reconstructing the input frame by combining the state of every object and the camera parameters to allow for camera control 2. The synthesis and action modules are trained in two separate phases using reconstruction as the main driving loss.

To handle environments with multiple objects 3, we adopt a compositional formulation for our encoder-decoder: we decompose the environment into a predefined set of objects. We distinguish between two object categories, namely static objects (e.g. background) and playable objects (e.g. human), where the latter are the dynamic objects the user will be able to control. We define the environment state of each object as a triple consisting of the position of the object in the environment, a style descriptor, and an object pose descriptor. We introduce the pose and style descriptors to handle deformable objects 4, such as humans, and to model appearance changes of objects in the training set 5, respectively. For every static object, we assume the position to be fixed and known. For each playable object instead, given the current camera parameters and the object's bounding box, we approximate the position by projecting the middle point of the lower bounding box edge onto the ground plane. We then compute the style and pose descriptors using a convolutional encoder network for each object. The encoder takes as input the image cropped at the location defined by the bounding box of each object and outputs both the style and the pose descriptor. In the rest of the paper, we omit object indexes.
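The position estimate described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera without skew, the convention x_cam = R x_world + t, and a ground plane at z = 0 in world coordinates. The function names and the bounding box layout (left, top, right, bottom) are hypothetical.

```python
def matvec_t(R, v):
    """Multiply the transpose of the 3x3 matrix R by the vector v."""
    return [sum(R[r][c] * v[r] for r in range(3)) for c in range(3)]


def bbox_to_ground(bbox, fx, fy, cx, cy, R, t):
    """Project the middle of the lower bounding box edge onto the plane z = 0.

    Assumptions (not from the paper): pinhole intrinsics (fx, fy, cx, cy),
    extrinsics with x_cam = R @ x_world + t, bbox = (left, top, right, bottom).
    """
    left, top, right, bottom = bbox
    u, v = 0.5 * (left + right), bottom              # middle of the lower edge
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]      # ray direction, camera frame
    d_world = matvec_t(R, d_cam)                     # R^T d
    origin = [-c for c in matvec_t(R, t)]            # camera centre: -R^T t
    if abs(d_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    s = -origin[2] / d_world[2]                      # solve origin.z + s * d.z = 0
    if s <= 0:
        raise ValueError("ground plane is behind the camera")
    return [origin[i] + s * d_world[i] for i in range(3)]
```

With a camera placed two units above the ground and looking straight down, a box whose lower edge is centred one focal length below the principal point maps to a ground point two units away from the point directly under the camera.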

To address our challenging setting, we introduce a novel synthesis module detailed in Sec. 3.1. The action module is described in Sec. 3.2. The training procedures for both modules are given in Secs. 3.3 and 3.4.

3.1 Synthesis Module

Figure 3: The synthesis module consists of two steps. First, non-rigid neural radiance fields with a bending network and style modulation are used to generate a feature map. Second, the feature maps are fed to a ConvNet.

The aim of the synthesis module is to reconstruct the input image from the camera pose and the environment states. We found NeRF [mildenhall2020nerf] to be a reasonable base architecture for explicit camera control 2. Therefore, we propose a novel architecture (Fig. 3) that combines non-rigid neural radiance fields with a convolutional image generator to address 2-6.

Camera Control 2 is achieved by employing NeRF [mildenhall2020nerf] as a base architecture. Our NeRF represents scenes using a fully-connected network whose input is a single vector containing a point location in 3D. It outputs the volume density and radiance for the input point location. Given a desired virtual camera, 3D points are sampled along the camera ray traced through each pixel. The color value of every pixel is computed by integration along the ray (Eq. 1).
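For reference, in the standard NeRF formulation [mildenhall2020nerf], restricted to a position-only input as described above, this integral takes the following form. The symbols used here (ray r(t) = o + t d, near and far bounds t_n and t_f, density σ, radiance c and transmittance T) are the usual NeRF notation and are an assumption, not notation recovered from this text.

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma\big(\mathbf{r}(t)\big)\,\mathbf{c}\big(\mathbf{r}(t)\big)\,dt,
\qquad
T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma\big(\mathbf{r}(s)\big)\,ds\Big)
```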


Similarly to [niemeyer2021giraffe, hao2021GANcraft], instead of directly predicting color values, our neural radiance fields generate feature maps for the input camera pose, while a convolutional image generator is in charge of generating realistic frames. For more details about NeRFs, please refer to the Supp. Mat. and [mildenhall2020nerf].
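In practice, the ray integral is evaluated numerically. The sketch below shows the standard NeRF quadrature applied to per-sample feature vectors rather than colors. It is an illustration under assumptions: the function name, the argument layout and the toy inputs are hypothetical, and the densities and features would normally come from the MLP.

```python
import math


def render_ray(ts, sigmas, feats):
    """Numerical quadrature of the volume rendering integral along one ray.

    ts:     increasing depths delimiting N intervals (length N + 1)
    sigmas: N volume densities, one per interval
    feats:  N feature (or color) vectors, one per interval
    Returns the accumulated feature vector and the remaining transmittance.
    """
    out = [0.0] * len(feats[0])
    transmittance = 1.0                              # light not yet absorbed
    for i, (sigma, feat) in enumerate(zip(sigmas, feats)):
        delta = ts[i + 1] - ts[i]
        alpha = 1.0 - math.exp(-sigma * delta)       # opacity of this interval
        weight = transmittance * alpha
        out = [o + weight * f for o, f in zip(out, feat)]
        transmittance *= 1.0 - alpha
    return out, transmittance
```

Empty intervals (zero density) contribute nothing, while a very dense interval absorbs all remaining transmittance, so the samples behind it are ignored.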

Multi-object 3. Each object is modeled using a separate feature field parametrized as an object-specific MLP. The field is bounded by a volume centered at the respective object location. Given a ray, we compute its features according to the following procedure. We first intersect the ray with each bounding volume to compute the ingress and egress locations of the ray for each object. For each object, we then uniformly sample a given number of positions between the ingress and egress locations and obtain the respective features and opacities by querying the object's MLP. The feature of the ray is obtained by integration similarly to Eq. (1).
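The intersection and sampling steps can be sketched as follows. The text does not specify the shape of the bounding volume, so this example assumes an axis-aligned box and uses the classic slab test; the function names and the midpoint sampling scheme are likewise assumptions made for illustration.

```python
def ray_box_interval(origin, direction, box_min, box_max):
    """Slab test: depths (t_in, t_out) where the ray is inside the box, or None."""
    t_in, t_out = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:                  # ray parallel to this pair of faces
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_in, t_out = max(t_in, t0), min(t_out, t1)
    return (t_in, t_out) if t_in <= t_out else None


def uniform_samples(t_in, t_out, n):
    """Depths at the midpoints of n equal bins between ingress and egress."""
    step = (t_out - t_in) / n
    return [t_in + (i + 0.5) * step for i in range(n)]
```

Rays that miss an object's volume return None, so no samples are spent on that object for the corresponding pixel.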

Deformable objects 4. To handle deformable objects such as humans, we make use of non-rigid NeRF models, similarly to [tretschk2021nonrigid]. For each playable object, we introduce a ray bending network parametrized as an MLP. Given an object pose descriptor and a position on the ray, we use the bending network to regress the corresponding position on the bent ray.

We then make use of the positions on the bent ray when sampling the object's feature field. In this way, the bending network encodes the transformation from the space of the deformed object to a canonical space, and the object-specific MLP encodes a canonical representation of the object.

Appearance changes 5. The appearance of each object may vary widely in the dataset. In order for each object-specific model to be able to represent the complete set of possible appearances of its object, we propose the use of a style embedding layer inspired by AdaIN [huang2017adain], which we embed into the object-specific MLP. Given a hidden feature of the MLP and a style code, the layer modulates the hidden feature, with the modulation parametrized by two trainable linear layers.

Following [mildenhall2020nerf], we design the object-specific MLP as a backbone terminated by two separate branches, one for opacity and one for feature prediction. We assume that the style of an object should influence its features, but not its geometry. Therefore, we insert our modulation layer in the feature prediction branch only.
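A minimal sketch of such a style modulation layer is given below. The exact form of the modulation is an assumption: here the style code is mapped by two linear layers to a per-channel scale and shift, in the spirit of AdaIN, without the normalization step. The helper names and the weight layout are hypothetical.

```python
def linear(weight, bias, x):
    """Apply a dense layer: weight is a list of rows, bias one value per row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def modulate(hidden, style, scale_layer, shift_layer):
    """Scale and shift a hidden feature using a style code.

    scale_layer and shift_layer are (weight, bias) pairs; assumed form:
    hidden' = scale(style) * hidden + shift(style), applied per channel.
    """
    scale = linear(scale_layer[0], scale_layer[1], style)
    shift = linear(shift_layer[0], shift_layer[1], style)
    return [s * h + b for s, h, b in zip(scale, hidden, shift)]
```

Because the layer only touches the feature branch, two objects with the same pose and different style codes would share opacity and differ only in rendered features.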

Robustness 6 to calibration and localization errors is achieved through a Feature Renderer. Our compositional NeRF model outputs a feature map corresponding to an input image patch. We employ a ConvNet to reconstruct the patch from it. Due to the ability of ConvNets to model cross-pixel relationships, inaccuracies in the estimation of features caused by input noise can be compensated, reducing the associated blur. Note that the ConvNet contains upsampling layers. This allows an important reduction in the number of rays that are to be sampled by the NeRF model since it outputs a feature map at a lower resolution than the image. Therefore, we reduce memory consumption, allowing larger patches to be rendered. We also find it beneficial to use multiple input feature maps at different resolutions to capture details at different scales (see Supp. Mat. for details).

3.2 Action Module

Figure 4: The action module. Given the states at two consecutive time steps, the action network predicts a discrete action label and an action variability embedding, which are combined by the dynamics network to estimate the new environment state given the old one.

The action module (Fig. 4) learns the action space and enables playability 1. The actions of each playable object are modeled by a separate action module, consisting of the action and dynamics networks.

Action Network. Given two successive environment states, we use an action network to infer a discrete representation of the action performed by the object in the input sequence. Following [menapace2021pvg], to address the non-determinism present in the environment, we also extract an action variability embedding describing the particular variation of the action performed at that time step.

Dynamics Network. The role of the dynamics network is to predict the next state from the current state and the action label. We adopt a recurrent model implemented as an LSTM to model the dynamics of the object.

In our preliminary experiments, we observe that when the dynamics network directly regresses the next object position, the model learns actions that are independent of the current camera position. This behavior is unnatural for the user since in applications such as video games, object movements are typically expressed relative to the camera pose. To avoid this behavior, the dynamics network is instead asked to predict the object movement expressed in the camera coordinate system. The estimated position is then obtained by rotating this movement with the rotation matrix expressing the orientation of the camera and adding it to the current position.
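The camera-relative position update can be sketched in a few lines. The direction of the rotation is an assumption: the example takes the rotation matrix to map camera coordinates to world coordinates. All names are hypothetical.

```python
def matvec(R, v):
    """Multiply the 3x3 matrix R by the vector v."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]


def next_position(position, delta_cam, R_cam_to_world):
    """Apply a movement predicted in the camera frame to a world position.

    Assumes R_cam_to_world rotates camera-frame vectors into the world frame.
    """
    delta_world = matvec(R_cam_to_world, delta_cam)
    return [p + d for p, d in zip(position, delta_world)]
```

With this convention, the same predicted movement (for example "one unit to the camera's right") yields different world displacements when the camera is rotated, which is what keeps an action label consistent from the viewer's perspective.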

3.3 Synthesis Module Training

We train our model in two steps by first training the encoder and synthesis module until convergence, and then the action module. We train the encoder and synthesis module using the perceptual loss of Johnson et al. [johnson2016perceptual] that assesses image reconstruction quality in the feature spaces of a pretrained VGG network. The loss is computed between the ground truth and reconstructed image patches. The perceptual loss is complemented by an L2 reconstruction loss in the pixel space.

Our preliminary experiments showed that training may fail to correctly disentangle object style and pose. Indeed, the reconstruction losses can be minimized using the style code alone by predicting a constant, non-deforming surface with changing style. To avoid this problem, we make the observation that the pose of an object in neighboring frames can change while the style does not. Therefore, we enforce better disentanglement by permuting the order of the style codes along the temporal dimension for each sequence before feeding them to the synthesis module.
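The permutation trick can be illustrated with a small sketch. It assumes that each state is stored as a (position, style, pose) tuple and that the codes being permuted are the style codes; the function name and data layout are hypothetical.

```python
import random


def permute_styles(states, rng):
    """Shuffle style codes across time for one object sequence.

    states: list over time of (position, style, pose) tuples.
    Positions and poses keep their temporal order; only styles move.
    """
    styles = [style for _, style, _ in states]
    rng.shuffle(styles)
    return [(pos, style, pose)
            for (pos, _, pose), style in zip(states, styles)]
```

Since a frame may now be rendered with a style code taken from a different frame, the style code cannot carry frame-specific pose information without hurting reconstruction.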

3.4 Action Module Training

In the second phase of training, we train the action module using a combination of losses. Each loss is computed separately for each playable object and then averaged to produce the final optimization objective.

Reconstruction loss. For each playable object, we reconstruct the input sequence of environment states, obtained by encoding each input image using the encoder, and impose a reconstruction loss between the input sequence and the corresponding reconstructed sequence.

Action learning losses. We employ the information-theoretic action learning loss of [menapace2021pvg] to foster the understanding of actions. For each playable object, the action network produces internal estimates of action probabilities for the input and reconstructed environment states respectively. By imposing the maximization of mutual information between these two distributions, we foster the action network both to discover the action categories, avoiding mode collapse, and to produce consistent action estimates for the input and reconstructed sequence.
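To make the mutual information objective concrete, the sketch below estimates it from a batch of paired action probability vectors. It is an illustration, not the loss of [menapace2021pvg] verbatim: the joint distribution is estimated as the batch average of outer products, and the function name and the epsilon guard are assumptions.

```python
import math


def mutual_information(p, q, eps=1e-12):
    """Mutual information between two batches of action distributions.

    p, q: lists of N probability vectors over K actions, e.g. estimated
    from the input and from the reconstructed state sequences.
    """
    n, k = len(p), len(p[0])
    # Joint distribution: batch average of the outer products p_i q_i^T.
    joint = [[sum(p[i][a] * q[i][b] for i in range(n)) / n
              for b in range(k)] for a in range(k)]
    row = [sum(joint[a]) for a in range(k)]
    col = [sum(joint[a][b] for a in range(k)) for b in range(k)]
    mi = 0.0
    for a in range(k):
        for b in range(k):
            if joint[a][b] > eps:
                mi += joint[a][b] * math.log(joint[a][b] / (row[a] * col[b]))
    return mi
```

The value is zero both when every sample is assigned to the same action (mode collapse) and when predictions are uniform, and it is maximal when the two estimates agree and all actions are used equally often.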


In addition, to improve consistency between discrete actions and 3D movements, we propose to optimize a novel loss consisting of a soft version of the Δ Mean Squared Error (Δ-MSE) introduced in [menapace2021pvg]. This metric is based on the idea that the same actions should correspond to similar object motions Δ. Assuming a batch of image pairs, we extract the object motion for each pair and estimate the mean object motion for each action, weighting each pair by the probability for it to be assigned to that action. We then minimize the mean squared distance between the motion and the mean motion for each action, scaled by a normalization factor.
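A possible reading of this loss is sketched below. Two points are assumptions: the squared distances are weighted by the same assignment probabilities used for the means, and the normalization factor is taken to be the total variance of the motions in the batch, mirroring the normalization of the Δ-MSE metric. Names are hypothetical.

```python
def action_consistency_loss(deltas, probs):
    """Soft, probability-weighted variant of a per-action motion MSE.

    deltas: N motion vectors; probs: N probability vectors over K actions.
    Returns the weighted within-action squared error divided by the total
    squared deviation of the motions from their global mean (assumed
    normalization).
    """
    n, k, dim = len(deltas), len(probs[0]), len(deltas[0])
    within = 0.0
    for a in range(k):
        mass = sum(p[a] for p in probs)
        if mass == 0.0:
            continue                     # action never selected in this batch
        mean = [sum(p[a] * d[j] for p, d in zip(probs, deltas)) / mass
                for j in range(dim)]
        within += sum(p[a] * sum((d[j] - mean[j]) ** 2 for j in range(dim))
                      for p, d in zip(probs, deltas))
    g = [sum(d[j] for d in deltas) / n for j in range(dim)]
    total = sum(sum((d[j] - g[j]) ** 2 for j in range(dim)) for d in deltas)
    return within / total if total > 0.0 else 0.0
```

Under this reading the loss is 0 when each action explains its motions perfectly and 1 when the action assignment carries no information about the motion.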

Temporal Discriminator. Previous methods for playable video generation [menapace2021pvg] tend to produce sequences where the playable objects move in the scene with unrealistic motions. We attribute this behavior to the use of reconstruction as the main training objective. Optimizing reconstruction losses does not penalize action representations that lead to temporally inconsistent videos. To address the problem, for each playable object we introduce a temporal discriminator implemented as a 1D ConvNet over the temporal dimension. Given a sequence of environment states, the temporal discriminator is trained to classify them as real if produced by encoding the input images using the encoder, or as fake if reconstructed by the action module. We implement our adversarial training procedure using a vanilla GAN loss with separate loss terms for the action module and the temporal discriminator.

Total loss. Our optimization objective for the action module is a weighted sum of the losses introduced above, where each term is scaled by its own weighting parameter. For training the temporal discriminator, we minimize the adversarial objective of the discriminator.

Inference. At inference time, we assume that only the first frame of the sequence is given. We use the encoder module to extract the first environment state. At each timestep, we let the user specify a discrete action for each playable object and use the dynamics network to derive the next environment state in an autoregressive way. Since the action input is specified by the user, during inference we do not make use of the action network and always set the action variability embedding to zero. The environment states generated by the dynamics network are rendered to images using the synthesis module.
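The autoregressive procedure can be summarized with a short sketch. The `dynamics` callable stands in for the recurrent dynamics network and `toy_dynamics` is a purely hypothetical replacement used to make the example runnable; setting the variability embedding to zero follows the description above.

```python
def rollout(first_state, actions, dynamics, variability_dim=1):
    """Generate a state sequence from one initial state and user actions."""
    zero_variability = [0.0] * variability_dim   # no action network at inference
    states, memory = [first_state], None
    for action in actions:
        next_state, memory = dynamics(states[-1], action, zero_variability, memory)
        states.append(next_state)                # fed back in at the next step
    return states


def toy_dynamics(state, action, variability, memory):
    """Hypothetical stand-in for the LSTM: action 0 moves +1, action 1 moves -1."""
    step = {0: 1.0, 1: -1.0}[action]
    return [state[0] + step + variability[0]], memory
```

Each generated state would then be passed to the synthesis module, together with the chosen camera, to render the corresponding frame.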

4 Experiments

Datasets. Evaluating 1-6 is challenging and requires video datasets featuring camera motion 2, multiple playable objects 1,3, deforming objects 4 and varied appearance 5. For this reason, we collect three datasets:
• Minecraft dataset. We collect a synthetic video dataset with a duration of 1h with two sparring Minecraft [minecraft] players. Wide camera movement and diverse, deforming players allow the evaluation of 1-5.
• Minecraft Camera dataset. We collect Minecraft [minecraft] sequences where the camera is moved in the neighborhood of a starting position. We use these frames as a synthetic ground truth for the evaluation of camera control 2.
• Tennis dataset. We collect a large-scale dataset of 43 broadcast tennis matches totalling 12h of videos for the evaluation of 1-6. The dataset features challenging player poses 4, high variability in tennis fields and players 5 and noise in camera estimation and player localization 6.
To allow comparison with playable video generation methods under their simplifying assumptions, we adopt the Tennis dataset of [menapace2021pvg], referred to as Static Tennis. The dataset features limited camera movement, each video is cropped to depict only a single player, only one field is present and players have uniform appearance, thus only 1,4 are evaluated. The datasets are detailed in the Supp. Mat.

Evaluation Protocol. We perform a separate evaluation of the synthesis 2-6 and the action modules 1 using similar evaluation protocols. For the former, we reconstruct each test sequence by extracting the environment state of each frame and rendering the original frame back. For the action module, we follow the evaluation protocol of [menapace2021pvg]. In particular, we consider a test sequence and extract the environment state of the first frame, then we use the action network to extract the sequence of discrete actions present in the sequence and reconstruct each frame starting from the first environment state. As video quality metrics 2,4-6 we adopt LPIPS [zhang2018unreasonable], FID [heusel2017advances] and FVD [unterthiner2018towards] computed between the test sequences and the reconstructed sequences. For evaluation of the action space 1,3, following [menapace2021pvg], we define Δ as the difference in position of an object between two given frames and use the following metrics:
• Δ Mean Squared Error (Δ-MSE): The expected error in terms of MSE in the regression of Δ from a discrete action. For each action, the average Δ is used as the optimal estimator. The metric is normalized by the variance of Δ.
• Δ-based Action Accuracy (Δ-Acc): The accuracy with which a discrete action can be regressed from Δ.
• Average Detection Distance (ADD): The average Euclidean distance between the bounding box centers of corresponding objects in the test and reconstructed frames.
• Missing Detection Rate (MDR): The portion of detections that are present in the test sequences but that are not matched by any detection in the reconstructed sequences.
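The two detection-based metrics can be computed as in the sketch below. It assumes that detections have already been matched between test and reconstructed frames, with `None` marking a test detection that has no match; the matching step, the box layout (left, top, right, bottom) and the function names are assumptions.

```python
import math


def box_center(box):
    """Center of a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = box
    return 0.5 * (left + right), 0.5 * (top + bottom)


def add_and_mdr(test_boxes, recon_boxes):
    """Average Detection Distance (pixels) and Missing Detection Rate (%).

    test_boxes:  detections in the test frames
    recon_boxes: matched detections in the reconstructed frames, None if missing
    """
    distances, missing = [], 0
    for test_box, recon_box in zip(test_boxes, recon_boxes):
        if recon_box is None:
            missing += 1
            continue
        (tx, ty), (rx, ry) = box_center(test_box), box_center(recon_box)
        distances.append(math.hypot(tx - rx, ty - ry))
    add = sum(distances) / len(distances) if distances else float("nan")
    mdr = 100.0 * missing / len(test_boxes)
    return add, mdr
```

ADD is averaged over matched detections only, so a model that fails to render a player is penalized through MDR rather than through an arbitrarily large distance.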

4.1 Comparison on Playable Video Generation

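| Method | LPIPS | FID | FVD | Δ-MSE | Δ-Acc | ADD | MDR |
|--------|-------|-----|-----|-------|-------|-----|-----|
| MoCoGAN [tulyakov2018moco] | 0.266 | 132 | 3400 | 101 | 26.4 | 28.5 | 20.2 |
| MoCoGAN+ | 0.166 | 56.8 | 1410 | 103 | 28.3 | 48.2 | 27.0 |
| SAVP [lee2018savp] | 0.245 | 156 | 3270 | 112 | 19.6 | 10.7 | 19.7 |
| SAVP+ | 0.104 | 25.2 | 223 | 116 | 33.1 | 13.4 | 19.2 |
| CADDY [menapace2021pvg] | 0.102 | 13.7 | 239 | 72.2 | 45.5 | 8.85 | 1.01 |
| Ours | 0.089 | 15.3 | 237 | 32.8 | 68.1 | 9.47 | 0.15 |

Table 2: Comparison with PVG state of the art on the Static Tennis dataset of [menapace2021pvg]. Δ-MSE, Δ-Acc and MDR in %, ADD in pixels.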

In this section, we evaluate the action-modeling capabilities of our method by comparing against the state of the art in the related problem of Playable Video Generation (PVG) [menapace2021pvg] where the objective is to learn a set of discrete action labels in an unsupervised fashion to condition video generation. Differently from our setting, in PVG no explicit camera control is required. Moreover, existing PVG methods assume a single user-controllable object and that camera motion is limited.

To satisfy these simplifying assumptions, we adopt the Static Tennis dataset of [menapace2021pvg]. Tab. 2 shows the results. Our method substantially improves the Δ-MSE and Δ-Acc action quality metrics, suggesting that the learned actions are better correlated with player movement. In addition, the reduced LPIPS and MDR indicate an improvement in the quality of the generated reconstruction, which is supported by a user study shown in the Supp. Mat. We report qualitative results in the Supp. Mat.

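| Method | Tennis LPIPS | Tennis FID | Tennis FVD | Tennis Δ-MSE | Tennis Δ-Acc | Tennis ADD | Tennis MDR | Camera LPIPS | Camera FID | Camera ADD | Camera MDR |
|--------|------|------|------|------|------|------|------|------|------|------|------|
| CADDY [menapace2021pvg] (i) | 0.313 | 61.0 | 877 | 0.901 | 42.6 | 35.1 | 36.9 | 0.747 | 306 | 11.7 | 95.8 |
| CADDY [menapace2021pvg] (ii) | 0.351 | 69.2 | 1109 | 0.592 | 59.6 | 29.0 | 24.8 | 0.762 | 324 | 44.7 | 92.2 |
| CADDY [menapace2021pvg] (iii) | 0.213 | 15.4 | 727 | 0.693 | 57.5 | 18.7 | 11.7 | 0.669 | 244 | 29.2 | 82.0 |
| CADDY [menapace2021pvg] (iv) | 0.445 | 70.3 | 1568 | 0.797 | 62.4 | 29.6 | 33.0 | 0.699 | 314 | 62.0 | 89.4 |
| CADDY [menapace2021pvg] (v) | 0.534 | 191 | 8083 | 0.633 | 73.5 | 20.2 | 60.3 | 0.679 | 337 | 19.1 | 93.6 |
| Ours | 0.181 | 17.4 | 485 | 0.293 | 95.7 | 14.0 | 4.84 | 0.242 | 29.2 | 5.69 | 8.07 |

Table 3: Playability evaluation with baselines on the Tennis dataset and camera control evaluation on the Minecraft Camera dataset ("Camera" columns). Aux.: use of auxiliary bounding box and camera pose information; H.Res.: use of the high resolution model; use of the loss for Δ-MSE. Δ-MSE, Δ-Acc and MDR in %, ADD in pixels.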

4.2 Comparison with Previous Methods

Baselines. We propose to build baselines for the creation of PEs from state of the art methods in the related problem of PVG. We make use of the following set of versions of CADDY [menapace2021pvg] which are modified to account for multiple playable objects and for camera motion: (i) the action network produces a distinct output for each dynamic object in the environment; (ii) (i) + the action and dynamics networks are conditioned on bounding box and camera information; (iii) (ii) + output resolution is increased to match our method; (iv) (ii) + the loss for Δ-MSE; (v) (iii) + the loss for Δ-MSE.

Playability evaluation 1. We evaluate player control capabilities in Tab. 3 and in the Supp. Mat. On the Tennis dataset our model substantially improves over the baselines in the action space metrics, LPIPS and FVD, suggesting better controllability of the players. In particular, the considerably lower MDR indicates a better capacity of the model in generating players with respect to the baselines. Fig. 5 shows qualitative reconstruction results for our method. As suggested by the MDR and ADD scores, the model correctly synthesizes both players and is able to reconstruct the player movements of the ground truth sequence using only a sequence of discrete actions. In addition, a visualization of the learned action space (see Fig. 6) shows that the model learns a set of diverse discrete actions that correspond to the main movement directions.

To further evaluate the quality of the action space, we perform a user study (see Supp. Mat.) on the Tennis dataset, following the protocol of Menapace et al. [menapace2021pvg]. To evaluate the consistency of learned actions, we measure user agreement using the Fleiss' kappa measure [fleiss1971measuring]. Our method achieves an agreement of 0.444, while the best baseline shows a lower agreement of 0.353.
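For readers unfamiliar with the agreement measure, the sketch below computes Fleiss' kappa from a table of rating counts. The input layout and the toy numbers are illustrative only and are unrelated to the user study data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j]: number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items          # observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    return (p_bar - p_exp) / (1.0 - p_exp)
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean raters agree less often than chance would predict.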

Figure 5: Qualitative reconstruction results produced by our method on the Tennis and Minecraft datasets. In the reconstructed sequence, playable objects move according to the ground truth sequence and are rendered in realistic poses.
Figure 6: Action space learned by our method on the Tennis dataset. Each color represents a learned action and each arrow shows the effects of applying the respective action six times to the initial player. The overlay on the floor shows the distribution of possible ending positions after the application of each action.

Camera control evaluation 2. We evaluate the quality with which the model can synthesize novel views. We perform the quantitative evaluation on the Minecraft Camera dataset, since it provides ground truth for novel views. We start from the first frame and reconstruct each sequence using the camera parameters of the novel views. Results are shown in Tab. 3. Despite the auxiliary bounding box and camera pose inputs provided to CADDY [menapace2021pvg], the baseline method fails to synthesize the scene from novel perspectives. We ascribe this phenomenon to the lack of an explicit model for the camera. Our method instead successfully synthesizes the scene from novel camera perspectives.

In Fig. 7 we show qualitative camera and style manipulation results for our method on the Tennis dataset. Our model can synthesize the scene under novel views and correctly alter the style of the field and players to match that of a target image. We present additional camera and style manipulation results in the Supp. Mat.

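| Var. | LPIPS | FID | FVD | ADD | MDR |
|---|---|---|---|---|---|
| (a) | 0.735 | 376 | 2548 | 109.1 | 99.9 |
| (b) | 0.595 | 266 | 1617 | 45.4 | 86.4 |
| (c) | 0.648 | 301 | 1818 | 10.17 | 50.2 |
| (d) | 0.361 | 68.6 | 482 | 7.39 | 31.9 |
| (e) | 0.350 | 61.0 | 465 | 8.27 | 31.8 |
| (f) | 0.341 | 67.4 | 1371 | 88.5 | 88.8 |
| Full | 0.193 | 16.5 | 289 | 5.45 | 33.7 |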
Table 4: Synthesis module ablation results on the Minecraft dataset. Multi: use of multi-object modeling. The remaining components are deformation modeling, style modulation layers (or direct style encoding), and the feature renderer (or the simplified renderer). ADD in pixels, MDR in %.
Figure 7: Camera and style manipulation results on the Tennis dataset. The original image is rendered under a novel camera perspective using varying styles for the field and players.

4.3 Ablation Studies

Synthesis module Ablation Study 3-6. In this section we evaluate the contribution of each proposed component of the synthesis module: multi-object modeling (Multi) 3, deformation modeling 4, style modulation layers for appearance changes 5, and the feature renderer for robustness 6. We produce the following method variations: (a) no component is used; this approach resembles NeRF [mildenhall2020nerf]; (b) Multi only; (c) Multi and deformation modeling; this architecture is akin to NR-NeRF [tretschk2021nonrigid] with multi-object modeling; (d) Multi, deformation modeling, and style injected by concatenation rather than through style modulation layers; (e) Multi, deformation modeling, and style injected through style modulation layers; (f) Multi, deformation modeling, style modulation layers, and a simplified ConvNet renderer trained by rendering the complete frame from feature maps at a single resolution; architecturally, this feature rendering strategy resembles that of GIRAFFE [niemeyer2021giraffe].

Results are shown in Tab. 4 and in the Supp. Mat. Variants (c) and (e) show that deformation modeling and style modeling with style modulation layers are both necessary to synthesize the scene accurately, but they generate blurry results due to calibration and localization errors. We recover sharpness by introducing our ConvNet feature renderer, which reduces blur by modeling cross-pixel correlations. Substituting our renderer with the one of (f) degrades performance: rendering the complete frame forces, due to memory constraints, an excessively sparse sampling of rays, which leads to 3D consistency artifacts that are particularly apparent in the region of dynamic objects.

Action module Ablation Study. We now evaluate the contribution of the main components of the action module by ablating the following: camera-relative object movement in the dynamics network (Rel.); the temporal discriminator; the loss on Δ-MSE; and the information-theoretic action learning loss. Results are shown in Tab. 5. Removing the temporal discriminator causes an increase in the FVD. A qualitative analysis of the results (see Supp. Mat.) shows that models not using the temporal discriminator produce sequences where the players translate in the scene, but fail to realistically move their limbs. In addition, the introduction of produces a positive impact on the action space metrics. We also note that, thanks to the presence of , the model learns an action space even in the absence of . Lastly, without camera-relative object movement in the dynamics network, the model produces movements that are independent of the current camera orientation, which is undesirable (Sec. 3.2).
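The role of camera-relative movement can be illustrated with a small geometric sketch. It is not the dynamics network of this work: it assumes, purely for illustration, that a displacement is predicted on the ground plane in a camera-aligned frame and that the camera orientation reduces to a single yaw angle. The name `camera_relative_to_world` and both parameters are hypothetical.

```python
import math
from typing import Tuple


def camera_relative_to_world(
    delta_cam: Tuple[float, float], camera_yaw: float
) -> Tuple[float, float]:
    """Rotate a ground-plane displacement expressed in a camera-aligned frame
    into the world frame. camera_yaw is the camera rotation about the
    vertical axis, in radians (an assumed simplification)."""
    dx, dz = delta_cam
    c, s = math.cos(camera_yaw), math.sin(camera_yaw)
    return (c * dx - s * dz, s * dx + c * dz)


# The same camera-relative step ("one unit to the right on screen")
# maps to different world-space displacements as the camera turns.
step_front = camera_relative_to_world((1.0, 0.0), 0.0)
step_side = camera_relative_to_world((1.0, 0.0), math.pi / 2)
```

In this simplified picture, an action keeps a consistent meaning from the viewer's perspective because the world-space displacement is obtained by rotating the camera-relative one, whereas a model that predicts world-space displacements directly would ignore the camera orientation.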

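| Var. | LPIPS | FID | FVD | Δ-MSE | Δ-Acc | ADD | MDR |
|---|---|---|---|---|---|---|---|
| (A) | 0.205 | 17.0 | 334 | 0.903 | 33.9 | 18.7 | 33.0 |
| (B) | 0.204 | 17.0 | 329 | 0.290 | 76.0 | 18.6 | 33.5 |
| (C) | 0.203 | 16.9 | 340 | 0.263 | 80.0 | 15.4 | 34.0 |
| (D) | 0.204 | 17.0 | 323 | 0.289 | 77.0 | 17.8 | 34.3 |
| (E) | 0.204 | 16.9 | 335 | 0.276 | 77.5 | 17.5 | 34.0 |
| Full | 0.204 | 16.8 | 329 | 0.271 | 77.7 | 17.8 | 33.9 |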

Table 5: Action module ablation results on the Minecraft dataset. Rel.: use of camera-relative residual output. The remaining components are the temporal discriminator, the loss for Δ-MSE, and the information-theoretic action learning loss. Δ-MSE, Δ-Acc and MDR in %, ADD in pixels.

5 Conclusions and Discussion

In conclusion, we present a new framework featuring a NeRF-based encoder-decoder architecture and an action module for the creation of compelling playable environments. Extensive experimental evaluation on two large-scale datasets shows that our method achieves state-of-the-art performance. We discuss the main limitations and ethical aspects of the method in the Supp. Mat.