## 1 Introduction

3D reconstruction is one of the fundamental problems of computer vision and a cornerstone of augmented and virtual reality. Concurrently with steady progress towards real-time photo-realistic rendering of 3D environments in game engines, the last few decades have seen great strides towards photo-realistic 3D reconstruction. A recent achievement in this direction is the discovery of a fairly general formulation for representing radiance fields [Mildenhall_2020_NeRF, liu2020neural, martin2021nerf, Schwarz2020NEURIPS, zhang2020nerf++, yu2021pixelnerf, trevithick2020grf, bi2020neural, srinivasan2021nerv, niemeyer2021giraffe, sucar2021imap]. Neural radiance fields are remarkably versatile for reconstructing real-world objects with high-fidelity *geometry* and *appearance*. But static appearance is only the first step: it ignores how an object moves and interacts with its environment. 4D reconstruction tackles this problem in part by incorporating the time dimension: with more intricate capture setups and more data, we can reconstruct objects over time, but can only re-play the captured sequences. Today, in the age of mixed reality, a photo-realistically reconstructed object might still destroy immersion if it is not “physically realistic” because *the object cannot be interacted with.* (For example, if a soft object appears as rigid as the rocks next to it when stepped on.)

By building on advances in computer vision and physics simulation, we begin to tackle the problem of physically-realistic reconstruction and create *Virtual Elastic Objects*: virtual objects that not only look like their real-world counterparts but also behave like them, even when subject to novel interactions. For the first time, this allows for full-loop reconstruction of deforming elastic objects: from capture, to reconstruction, to simulation, to interaction, to re-rendering.

Our core observation is that with the latest advances in 4D reconstruction using neural radiance fields, we can both capture radiance and deformation fields of a moving object over time, and re-render the object given novel deformation fields. That leaves as the main challenge the core problem of capturing an object’s physics from observations of its interactions with the environment. With the right representation that jointly encodes an object’s geometry, deformation, and material behavior, compatible with both differentiable physical simulation and the deformation fields provided by 4D reconstruction algorithms, we can use these deformation fields to provide the necessary supervision to learn the material parameters.

But even with this insight, multiple challenges remain to create Virtual Elastic Objects. We list them together with our technical contributions:

1) Capture.
To create VEOs, we need to collect data that not only contains visual information but also information about physical forces.
We present the new PLUSH dataset containing occlusion-free 4D recordings of elastic objects deforming under known controlled force fields.
To create this dataset, we built a multi-camera capture rig that incorporates an air compressor with a movable, tracked nozzle.
More details can be found in Sec. 3.1.

2) Reconstruction.
VEOs do not require any prior knowledge about the geometry of the object to be reconstructed; the reconstruction thus must be template-free and provide full 4D information (*i.e*., a 3D reconstruction and deformation information over time).
We extend Non-rigid Neural Radiance Fields [tretschk2021nonrigid] with novel losses, and export point clouds and point correspondences to create the data required to supervise learning material behavior using physical simulation. We provide further details in Sec. 3.2.

3) Simulation. Crucially for creating realistic interactive objects, a physical simulation is required, both to optimize for an unknown object’s physical parameters and to generate deformations of that object in response to novel interactions. We implement a differentiable quasi-static simulator that is particle-based and is compatible with the deformation field data provided by our 4D reconstruction algorithm. We present the differentiable simulator and explain how we use it to obtain physical parameters in Sec. 3.3, and describe simulations of novel interactions in Sec. 3.4.

4) Rendering. Since we convert from a neural representation of the captured object’s geometry to a point cloud reconstructing the object’s physical properties, we require a function that allows rendering the object given new simulated deformations of the point cloud. We introduce a mapping function that enables us to use deformed point clouds instead of continuous deformation fields to alter the ray casting for the Neural Radiance Fields we used for the original reconstruction. Further details on re-rendering can be found in Sec. 3.5.

## 2 Related Work

Our work integrates multiple areas of computer vision, computer graphics, and simulation.

Recovering Elastic Parameters for 3D Templates. A number of prior works estimate material parameters of a pre-scanned 3D template by tracking the object over time from depth input. Wang *et al*. [wang2015deformation] were among the first to tackle tracking, rest pose estimation, and material parameter estimation from multi-view depth streams. They adopt a gradient-free downhill simplex method for parameter fitting, and can only optimize a limited number of material parameters. Objects built from multiple types of materials cannot be faithfully captured without manual guidance or prior knowledge of a part decomposition. Hahn *et al*. [Hahn:2019] learn an inhomogeneous viscoelastic model from recordings of motion markers covering the object. Recently, Weiss *et al*. [weiss2020correspondence] infer homogeneous linear material properties by tracking deformations of a given template with a single depth camera. In contrast to these methods, ours jointly reconstructs not just object deformations and physics *without a need for depth input or markers* but also geometry and appearance *without a need for a template*. Our formulation can model inhomogeneous, nonlinear materials without prior knowledge or annotations.

Dynamic Reconstruction. Reconstructing non-rigid objects from a video sequence is a long-standing computer vision and graphics problem [Zhang2003SpacetimeSS, tung_complete_2009]. Shape-from-Template methods deform a provided template using RGB [yu2015direct] or RGB-D data [zollhofer2014real]. DynamicFusion [Newcombe_2015_CVPR] is a model-free, real-time method for reconstructing general scenes from a single RGB-D video. When reliable 2D correspondences are available from optical flow, non-rigid structure-from-motion (NRSfM) can be used to reconstruct the 3D geometry [Agudo2014OnlineDN, grassmanian_2018], perhaps even using physics-based priors [agudo2015sequential]. There are also image-based approaches that do not yield a true 3D scene [yoon2020novel, Bemana2020xfields]. Recently, reconstruction using neural representations has become more common. Whereas OccupancyFlow [niemeyer2019occupancy] requires 3D supervision, Neural Volumes [Lombardi:2019] reconstructs a dynamic scene from multi-view input only, but does not compute temporal correspondences. See a recent survey on neural rendering [Tewari2020NeuralSTAR] for more.

Neural Radiance Fields [Mildenhall_2020_NeRF], the seminal work of Mildenhall *et al*., lays the groundwork for several follow-up reconstruction methods that extend it to dynamic scenes [Li2021, park2021hypernerf, attal2021torf, pumarola2020d, park2021nerfies, li2021neural, du2021nerflow, Gaofreeviewvideo, xian2021space, Lombardi_2021_MVP]. In this work, we assume multi-view RGB video input with known camera parameters and foreground segmentation masks and so extend Non-Rigid Neural Radiance Fields (NR-NeRF) [tretschk2021nonrigid].

Data-Driven Physics Simulation. Much recent research has explored the potential of machine learning to enhance or even replace traditional physics-based simulation. Learning natural laws from data without any priors has been shown possible for a few simple physics systems [schmidt2009distilling], but the computational cost scales exponentially with the complexity of the system, and remains intractable for real-world problems. For simulating elastic objects specifically, one line of work replaces traditional mesh kinematics with a learned deformation representation to improve performance: Fulton *et al*. [Fulton:2018] use an autoencoder to learn a nonlinear subspace for elastic deformation, and Holden *et al*. [Holden:2019] train a neural network to predict the deformation of cloth using a neural subspace. Some methods use neural networks to augment coarse traditional simulations with fine details [deepWrinkles, geng2020coercing]. Another line of work uses data to fit a parameterized material model to observed deformations. This idea has been successfully applied to muscle-actuated biomechanical systems such as human faces [Kadlecek:2019, Srinivasan:2021], learning the rest pose of an object in zero gravity [chen2014anm], the design of soft robotics [hu2019chainqueen, hu2020difftaichi], and motion planning with frictional contacts [Geilinger:2020, du2021diffpd]. Yang *et al*. [Yang2017] learn physical parameters for cloth by analysing the wrinkle patterns in video. While all of these methods learn physical parameters from data, our method is unique in requiring no template or other prior knowledge about object geometry to reconstruct and re-render novel deformations of an object.

Meshless Simulation. Meshless physics-based simulation emerged as a counterpart to traditional mesh-based methods [muller2004] and is ideal for effects such as melting or fracture [muller2004, pauly2004meshless]. These methods were later extended to support oriented particles and skinning [muller2011solid, gilles2011frame, macklin2014unified]. Another extension of point-based simulation incorporates a background Eulerian grid, which enables more efficient simulation of fluid-like phenomena [stomakhin2013material, jiang2017anisotropic].

## 3 Method

### 3.1 Capture

To create a physically accurate representation of an object, we first need to record visual data of its deformation under known physical forces.
For recording, we use a static multi-view camera setup consisting of 19 OpenCV AI-Kit Depth (OAK-D) cameras (https://store.opencv.ai/products/oak-d), each containing an RGB and two grey-scale cameras (note that VEO does not use the stereo camera data to infer classical pairwise stereo depth).
They represent an affordable, yet surprisingly powerful solution for volumetric capture.
In particular, their on-board H265 encoding capability facilitates handling the amount of data produced during recording (5.12GB/s uncompressed).
Since the cameras lack a lens system with zoom capabilities, we keep them close to the object to optimize the pixel coverage and re-configure the system depending on object size.
The maximum capture volume has a size of roughly .
We put a black sheet around it to create a dark background with the exception of five stage lights that create a uniform lighting environment.

In addition to the images, we also need to record force fields on the object surface. This raises a problem: if a prop is used to exert force on the capture subject, the prop becomes an occluder that interferes with photometric reconstruction. We solved this problem when capturing our PLUSH dataset by actuating the object using transparent fishing line and a compressed air stream; see Sec. 4.1 for further details.

### 3.2 4D Reconstruction

Given the captured video of an object deforming under external forces, we need 4D reconstruction to supply a temporally-coherent point cloud that can be used to learn the object material properties. To that end, we use NR-NeRF [tretschk2021nonrigid], which extends the static reconstruction method NeRF [Mildenhall_2020_NeRF] to the temporal domain. NeRF learns a volumetric scene representation: a coordinate-based Multi-Layer Perceptron (MLP) that regresses geometry (opacity) and appearance (RGB color) at each point in 3D space. At training time, the weights of the MLP are optimized through 2D supervision by RGB images with known camera parameters: for a given pixel of an input image, the camera parameters allow us to trace the corresponding ray through 3D space. We then sample the NeRF at points along the ray, and use a volumetric rendering equation to accumulate the samples front-to-back via weighted averaging (*i.e*., alpha blending with alpha values derived from the opacities). A reconstruction loss encourages the resulting RGB value to be similar to the RGB value of the input pixel.
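The front-to-back accumulation described above can be illustrated with a minimal NumPy sketch. It assumes the NeRF-style conversion from opacity to alpha, `alpha = 1 - exp(-opacity * delta)`; the function name `composite_ray` and the `deltas` argument (spacing between consecutive samples) are illustrative choices, not names from our implementation.

```python
import numpy as np

def composite_ray(colors, opacities, deltas):
    """Alpha-blend samples along one ray, ordered from near to far.

    colors:    (N, 3) RGB values regressed at the samples
    opacities: (N,)   non-negative opacities regressed at the samples
    deltas:    (N,)   distances between consecutive samples (assumed input)
    Returns the accumulated RGB value and the per-sample blending weights.
    """
    colors = np.asarray(colors, dtype=float)
    opacities = np.asarray(opacities, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # Assumed conversion from opacity to an alpha value in [0, 1].
    alphas = 1.0 - np.exp(-opacities * deltas)
    # Transmittance: fraction of light not absorbed by the samples in front.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ colors, weights
```

An opaque first sample receives weight one and hides everything behind it, which is the behavior the reconstruction loss relies on when comparing the accumulated color to the input pixel.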

On top of the static geometry and appearance representation (the *canonical model*), NR-NeRF models deformations explicitly via a jointly learned ray-bending MLP that regresses a 3D offset for each point in space at time $t$.
(An auto-decoded latent code conditions the ray-bending MLP on the deformation at time $t$.)
When rendering a pixel at time $t$ with NR-NeRF, the ray-bending MLP is queried for each sample on the ray in order to deform it into the canonical model: the canonical model is evaluated at the sample position plus the regressed offset.
Unlike NR-NeRF’s monocular setting, we have a multi-view capture setup.
We thus disable the regularization losses of NR-NeRF and only use its reconstruction loss.

Extensions. We improve NR-NeRF in several ways to adapt it to our setting. The input videos contain background, which we do not want to reconstruct. We obtain foreground segmentations for all input images via image matting [Lin2021] together with a hard brightness threshold. During training, we use a background loss to discourage geometry along rays of background pixels. When later extracting point clouds, we need opaque samples on the inside of the object as well. However, we find that the background loss leads the canonical model to prefer empty space even inside the object. We counteract this effect with a density loss that raises the opacity of point samples of a foreground ray that are ‘behind’ the surface, while a separate loss term empties out the space in front of the surface. During training, we first build a canonical representation by pretraining the canonical model on a few frames and subsequently using it to reconstruct all images. Our capture setup not only provides RGB streams but also grey-scale images. We use these for supervision as well. In practice, we use a custom weighted combination of these techniques for each sequence to get the best reconstruction.

Point Cloud Extraction. In order to extract a temporally-consistent point cloud from this reconstruction, we require a forward deformation model, which warps from the canonical model to the deformed state at time $t$. However, NR-NeRF’s deformation model is a backward warping model: it deforms each deformed state into the canonical model. We therefore jointly train a coordinate-based MLP to approximate the inverse of the ray-bending MLP. After training, we need to convert the reconstruction from its continuous MLP format into an explicit point cloud. To achieve that, we cast rays from all input cameras and extract points from the canonical model that are at or behind the surface and whose opacity exceeds a threshold. These points can then be deformed from the canonical model into the deformed state at time $t$ via the inverse MLP. We thus obtain a 4D reconstruction in the form of a 3D point cloud’s evolving point positions, which are in correspondence across time. To keep the computational cost of the subsequent reconstruction steps feasible, we downsample the point cloud to 9-15 points if necessary.

### 3.3 Learning Material Parameters

Before we can simulate novel interactions with a captured object, we need to infer its physical behavior. Given that we have no prior knowledge of the object, we make several simplifying assumptions about its mechanics, with an eye towards minimizing the complexity of the physical model while also remaining flexible enough to capture heterogeneous objects built from multiple materials.

First, we assume a *spatially varying, isotropic nonlinear Neo-Hookean material model* for the object. Neo-Hookean elasticity approximates the behavior of many real-world materials well, including rubber and many types of plastic, and is popular in computer graphics applications because its nonlinear stress-strain relationship guarantees that no part of the object can invert to have negative volume, even if the object is subjected to arbitrarily large and nonlinear deformations.
Finally, Neo-Hookean elasticity admits a simple parameterization: a pair of Lamé parameters $(\mu_i, \lambda_i)$ at each point $i$ of the point cloud.

Second, we assume that the object deforms *quasistatically* over time: that at each point in time, the internal elastic forces exactly balance gravity and applied external forces. The quasistatic assumption greatly simplifies learning material parameters, and is valid so long as inertial forces in the captured video sequences are negligible (or equivalently, so long as external forces change sufficiently slowly over time that there is no secondary motion, which is true for the air stream and string actuations in our PLUSH dataset).

#### Overview.

We first formulate a differentiable, mesh-free *forward* physical simulator that is tailored to work directly with the (potentially noisy) reconstructed point cloud. This forward simulator maps from the point cloud of the object in its *reference pose* (where it is subject to no external forces besides gravity), an assignment of Lamé parameters to every point, and an assignment of an external force to each point on the object surface, to the deformed position of every point in the point cloud after the object equilibrates against the applied forces.

Next, we learn the Lamé parameters that match the object’s observed behavior by minimizing a loss function that sums, over all times $t$, the distance between the simulated position of each point and the corresponding target position of the point in the 4D point cloud.

#### Quasistatic Simulation.

To compute the equilibrium positions $\mathbf{y}_i$ of the points in the point cloud for given external loads and material parameters, we solve the variational problem

$$\mathbf{y}^* = \arg\min_{\mathbf{y}} E(\mathbf{y}), \tag{1}$$

where $E$ is the total energy of the physical system, capturing both the elastic energy of deformation as well as work done on the system by external forces. In what follows, we derive the expression for $E$, and discuss how to solve Eq. 1.

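Following Müller *et al*. [muller2004], we adopt a mesh-free, point-based discretization of elasticity to perform forward simulation. For every point $\mathbf{x}_i$ in the reference point cloud, we define a neighborhood $\mathcal{N}_i$ containing the 6 nearest neighbors of $\mathbf{x}_i$ in the reference point cloud. For any given set of deformed positions $\mathbf{y}_i$ of the points, we estimate strain within the neighborhood in the least-squares sense. More specifically, the local material deformation gradient $\mathbf{F}_i$ maps the neighborhood $\mathcal{N}_i$ from the reference to the deformed state:

$$\mathbf{F}_i \left(\mathbf{x}_j - \mathbf{x}_i\right) = \mathbf{y}_j - \mathbf{y}_i \quad \forall\, \mathbf{x}_j \in \mathcal{N}_i. \tag{2}$$

For neighborhoods larger than three, Eq. 2 is over-determined, and we hence solve for $\mathbf{F}_i$ in the least-squares sense, yielding the closed-form solution:

$$\mathbf{F}_i = \mathbf{Y}_i \mathbf{W}_i \mathbf{X}_i^\top \left(\mathbf{X}_i \mathbf{W}_i \mathbf{X}_i^\top\right)^{-1}, \tag{3}$$

where the $j$-th columns of $\mathbf{X}_i$ and $\mathbf{Y}_i$ are $\mathbf{x}_j - \mathbf{x}_i$ and $\mathbf{y}_j - \mathbf{y}_i$, respectively, and $\mathbf{W}_i$ is a diagonal matrix of weights depending on the distance from $\mathbf{x}_j$ to $\mathbf{x}_i$ [muller2004].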
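The weighted least-squares estimate of the deformation gradient can be sketched in a few lines of NumPy. This is an illustrative sketch rather than our implementation: the function name `deformation_gradient` is hypothetical, and the per-neighbor weights are passed in as an argument because the specific distance kernel is not restated here.

```python
import numpy as np

def deformation_gradient(x_i, x_nbrs, y_i, y_nbrs, weights):
    """Weighted least-squares fit of F such that F (x_j - x_i) ~= y_j - y_i.

    x_i, y_i:       (3,)   reference and deformed position of the center point
    x_nbrs, y_nbrs: (k, 3) reference and deformed positions of its k neighbors
    weights:        (k,)   positive per-neighbor weights (assumed to be given)
    """
    X = (np.asarray(x_nbrs, dtype=float) - x_i).T  # 3 x k, columns x_j - x_i
    Y = (np.asarray(y_nbrs, dtype=float) - y_i).T  # 3 x k, columns y_j - y_i
    W = np.diag(np.asarray(weights, dtype=float))  # k x k diagonal weights
    A = X @ W @ X.T                                # 3 x 3 moment matrix
    # Closed-form minimizer of sum_j w_j * ||F (x_j - x_i) - (y_j - y_i)||^2.
    return Y @ W @ X.T @ np.linalg.inv(A)
```

If the deformed neighborhood is an exact affine image of the reference neighborhood, the fit recovers the affine map's linear part exactly; for noisy reconstructed points it returns the best fit in the weighted least-squares sense. The moment matrix is only invertible when the neighbors are not all coplanar with the center point.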

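The elastic energy of the object can be computed from the classic Neo-Hookean energy density [ogden1984non]:

$$\Psi_i = \frac{\mu_i}{2}\left(I_c - 3\right) - \mu_i \log J + \frac{\lambda_i}{2}\left(\log J\right)^2, \tag{4}$$

where $I_c$ is the trace of the right Cauchy-Green tensor $\mathbf{F}_i^\top \mathbf{F}_i$, and $J$ is the determinant of $\mathbf{F}_i$. $\mu_i$ and $\lambda_i$ are the Lamé parameters assigned to point $i$. The total elastic energy is then:

$$E_{\text{NH}} = \sum_i V_i \Psi_i, \tag{5}$$

where $V_i$ approximates the volume of $\mathcal{N}_i$.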
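As an illustration, the sketch below evaluates a Neo-Hookean energy density and the volume-weighted total energy. It assumes the commonly used log-based compressible form, $\frac{\mu}{2}(I_c - 3) - \mu \log J + \frac{\lambda}{2}(\log J)^2$; other Neo-Hookean variants exist, and the function names are hypothetical.

```python
import numpy as np

def neo_hookean_density(F, mu, lam):
    """Energy density for one 3x3 deformation gradient F (assumed log-based form)."""
    F = np.asarray(F, dtype=float)
    J = np.linalg.det(F)                 # local volume change
    if J <= 0.0:
        raise ValueError("inverted or degenerate neighborhood (det F <= 0)")
    I_c = np.trace(F.T @ F)              # trace of the right Cauchy-Green tensor
    log_J = np.log(J)
    return 0.5 * mu * (I_c - 3.0) - mu * log_J + 0.5 * lam * log_J ** 2

def total_elastic_energy(Fs, mus, lams, volumes):
    """Sum of per-point densities weighted by the per-point volume estimates."""
    return sum(v * neo_hookean_density(F, m, l)
               for F, m, l, v in zip(Fs, mus, lams, volumes))
```

The density vanishes in the undeformed state and grows without bound as the determinant approaches zero, which is the property that prevents inversion.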

We also need to include the virtual work done by the external force field to Eq. 1:

(6) |

where is the force applied to point (the force of the air stream on the boundary). If we measured the tension in the fishing lines, we could also include the forces they exert on the object in Eq. 6. But since a fishing line is effectively inextensible relative to the object we are reconstructing, we instead incorporate the fishing lines as soft constraints on the positions of the points attached to the lines: we assume that at time , points in should match their observed positions in , and formulate an attraction energy:

(7) |

where is the position of the point corresponding to in , and is a large penalty constant. We found that this soft constraint formulation works better in practice than alternatives such as enforcing as a hard constraint.

The total energy in Eq. 1 is thus the sum of the elastic energy (Eq. 5), the external work term (Eq. 6), and the attraction energy (Eq. 7), which we minimize using Newton’s method. Since Newton’s method can fail when the Hessian of the energy is not positive-definite, we perform a per-neighborhood eigen-decomposition of the Hessian and replace all eigenvalues that are smaller than a threshold with the threshold value; note that this is a well-known technique to improve robustness of physical simulations [teran2005robust]. We also make use of a line search to ensure stability and handling of position constraints at points where the capture subject touches the ground; see the supplemental material for further implementation details.

#### Material Reconstruction.

Given the 4D point cloud and the forces acting on the object, we use our forward simulator to learn the Lamé parameters that best explain the observed deformations. More specifically, at each time $t$ we define the loss:

(8) |

where is the position of point in , and is the output of the forward simulation. We use an

loss to penalize outliers strongly, which would jeopardize the reconstruction quality otherwise.

We choose a training subsequence of 20-50 frames from the input where the impact of the air stream roughly covers the surface so that we have some reference for each part of the object, and compute the desired Lamé parameters by minimizing the sum of the loss over all times $t$ using the gradient-based Adam optimizer [KingmaB14]:

(9) |

It is not trivial to back-propagate through the Newton solve for , even if we ignore the line search and assume a fixed number of Newton iterations . The gradient of with respect to the Lamé parameters (

for instance) can be computed using the chain rule:

(10) |

and, for any ,

(11) |

To avoid an exponentially-large expression tree, we approximate the derivative of the th Newton iterate by neglecting the higher-order derivative of the Hessian and of the gradient of the energy with respect to the previous position update:

Although it is not guaranteed that the higher-order terms are always negligible, this approximation provides a sufficiently high-quality descent direction for all examples we tested. To improve performance and to capture hysteresis in cases where has multiple local minima at some times , we warm-start the Newton optimization at time using the solution from time .
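The Newton iterations discussed in this section rely on the eigenvalue replacement introduced for the quasistatic simulator. A minimal sketch of that repair step is given below; the function name `project_psd` and the default threshold are illustrative assumptions, and the sketch operates on a single symmetric Hessian block.

```python
import numpy as np

def project_psd(H, eps=1e-6):
    """Replace eigenvalues of a symmetric block that fall below eps with eps."""
    H = np.asarray(H, dtype=float)
    H = 0.5 * (H + H.T)                    # remove numerical asymmetry
    evals, evecs = np.linalg.eigh(H)       # eigen-decomposition of a symmetric matrix
    evals = np.maximum(evals, eps)         # clamp small or negative eigenvalues
    return (evecs * evals) @ evecs.T       # reassemble V diag(evals) V^T
```

Applying this to every per-neighborhood block yields a positive-definite assembled Hessian, so the Newton direction is always a descent direction and a line search can make progress.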

### 3.4 Novel Interactions

Given a reconstructed VEO, we can use the same physical simulator used for material inference to re-simulate the captured object subject to novel interactions. New force fields can easily be introduced by modifying the external forces in the energy. Other possible interactions include changing the direction of gravity, adding contact forces to allow multiple objects to mutually interact, or allowing manipulation of the object using mixed-reality tools.

We demonstrate the feasibility of re-simulating novel interactions by implementing a simple penalty energy to handle contact between a VEO and a secondary object, represented implicitly as a signed distance field. The penalty energy is given by:

(12) | ||||

(13) |

where the penalty weight is chosen large enough to prevent visually-noticeable penetration of the VEO by the secondary object.
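To make the idea of a contact penalty concrete, the sketch below penalizes points that lie inside a secondary object described by a signed distance function. It assumes a quadratic penalty on the penetration depth, which may differ from the exact form in Eqs. 12 and 13; the names `contact_penalty`, `sdf`, and `stiffness` are hypothetical.

```python
import numpy as np

def contact_penalty(points, sdf, stiffness):
    """Quadratic penalty (an assumed form) on points that penetrate an obstacle.

    points:    (N, 3) current positions of the simulated points
    sdf:       callable mapping a 3D position to its signed distance,
               negative inside the secondary object
    stiffness: penalty weight; larger values allow less penetration
    """
    distances = np.array([sdf(p) for p in np.asarray(points, dtype=float)])
    depth = np.minimum(distances, 0.0)     # zero for points outside the obstacle
    return stiffness * np.sum(depth ** 2)

def unit_sphere_sdf(p):
    """Signed distance to a unit sphere at the origin (example obstacle)."""
    return np.linalg.norm(p) - 1.0
```

Points outside the obstacle contribute nothing, so the energy only becomes active on contact, and its gradient pushes penetrating points back towards the surface.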

### 3.5 Rendering

We are able to interact freely with the VEO in a physically plausible manner. Hence, we can close the full loop and realistically render the results of simulated novel interactions using neural radiance fields. While we used the ray-bending MLP for deformations during the reconstruction, we are now given a new deformed state induced by a discrete point cloud: a canonical reference point cloud and its deformed version. We need to obtain a continuous backward-warping field from that point cloud in order to replace the ray-bending MLP, which bends straight rays into the canonical model. To that end, we interpolate the deformation offsets at a 3D sample point in deformed space using inverse distance weighting (IDW):

(14) |

where are the nearest neighbors of in , and with . We can then sample the canonical model at as before: . To remove spurious geometry that might show, we set for that are further than some threshold from . Thus, we can now bend straight rays into the canonical model and render the interactively deformed state of the object in a realistic fashion.
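The interpolation step can be sketched as follows. This is a minimal version of inverse distance weighting under stated assumptions: plain reciprocal-distance weights normalized to sum to one, a brute-force nearest-neighbor search, and a hypothetical neighbor count `k`; the exact weighting used in Eq. 14 may differ.

```python
import numpy as np

def idw_offset(p, deformed_pts, canonical_pts, k=4, eps=1e-8):
    """Interpolate a backward-warping offset at sample point p in deformed space.

    deformed_pts, canonical_pts: (N, 3) arrays in row-wise correspondence
    k:   number of nearest neighbors used for interpolation (assumed value)
    eps: small constant that avoids division by zero when p hits a point
    """
    p = np.asarray(p, dtype=float)
    deformed_pts = np.asarray(deformed_pts, dtype=float)
    canonical_pts = np.asarray(canonical_pts, dtype=float)
    dists = np.linalg.norm(deformed_pts - p, axis=1)
    nearest = np.argsort(dists)[:k]                  # brute-force k-NN
    weights = 1.0 / (dists[nearest] + eps)           # closer points count more
    weights /= weights.sum()
    # Offsets point from the deformed state back into the canonical model.
    offsets = canonical_pts[nearest] - deformed_pts[nearest]
    return weights @ offsets
```

Adding the returned offset to `p` gives the position at which the canonical model is sampled. For large point clouds a spatial data structure would replace the brute-force search.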

When needed, we can upsample the point cloud from the simulation to make it denser. Unlike for rendering, we need to consider forward warping for this case.

## 4 Results

### 4.1 Dataset

Figure 2: The PLUSH objects with their mass and recording duration: Baby Alien (179g, 41s), Dino Rainbow (672g, 37s), Dino Blue (148g, 55s), Dino Green (76g, 42s), Fish (282g, 65s), Leaf (58g, 32s), Serpentine (54g, 40s), Mr. Seal (444g, 53s), Pillow (406g, 42s), Pony (197g, 51s), Dog (213g, 67s), and Sponge (21g, 46s). The last two panels show the Lamé parameters for Baby Alien and Pony.

The PLUSH dataset consists of 12 soft items encountered in everyday life (see Fig. 2): a pillow, a sponge, and various plush toys.
We chose items that are composed of soft (and in some cases, heterogeneous) material, complex geometry, and rich texture and color to enable successful background subtraction, 4D reconstruction and tracking.
Our strategy for applying external forces is based on the observation that our chosen objects consist of *bulk volumes* (such as the body of a plush toy) along with *flexible extremities* (ears and fingers of the toy). We move object extremities by using transparent fishing line, and we use a stream of compressed air to exert force on bulk volumes.
The nozzle position and stream direction must be tracked during video capture to provide the direction and magnitude of forces acting on the object at every point in time.
Of the 19 cameras in our capture rig, we use three to track the nozzle using an attached ArUco marker [Garrido-Jurado2016, Romero-Ramirez2018].
Using this system, we generate multi-part video sequences for each capture subject, where we sequentially actuate the fishing lines (when applicable) followed by sweeping the air stream over the object.
We record between 32s and 67s of video for each object, at a frame rate of 40FPS.

### 4.2 Virtual Elastic Objects

| Object | average (mm) | 95% (mm) | max (mm) |
|---|---|---|---|
| Baby Alien | 3.8 | 14.4 | 29.3 |
| Fish | 1.1 | 6.6 | 18.5 |
| Leaf | 0.4 | 1.1 | 9.8 |
| Mr. Seal | 0.4 | 1.9 | 171.9 |
| Pillow | 1.5 | 7.8 | 18.35 |
| Dog | 1.7 | 7.5 | 28.8 |
| Sponge | 0.2 | 1.8 | 15.8 |
| Dino Rainbow | 4.0 | 14.6 | 171.4 |
| Dino Blue | 5.5 | 56.0 | 105.8 |
| Dino Green | 6.2 | 68.4 | 132.0 |
| Pony | 21.1 | 164.3 | 204.9 |
| Serpentine | 7.5 | 43.1 | 94.7 |
| Average* | 2.5 | 18.0 | 70.2 |
| Average | 4.4 | 32.3 | 83.5 |

For each of the 12 examples, we create a VEO using 20-50 frames from the reconstruction and evaluate on the remaining 500-1500 frames. We use the distance between the surface points of the VEO to the reconstructed point cloud from the captured data to evaluate the quality of the reconstructed parameters. For all examples except for the Baby Alien we use the external force field data obtained using the air stream. For the Baby Alien, we specifically use the arm and ear motion to demonstrate the versatility of our method in this scenario. We present the results in Tab. 1.

The error is relatively small for all objects, which shows that our method is applicable to objects with different geometries, and can learn the corresponding material parameters even for heterogeneous objects. Larger errors are observed for objects with a thin and tall component (see the last 4 rows of the table). This error is largely caused by tracking inaccuracies of the nozzle: even slight inaccuracies can cause large errors when, for example, the neck of the dinosaur moves while the recorded air stream direction does not touch the object, or barely touches it.

Inhomogeneous Material. An important feature of our method is that it can identify different material parameters for different parts of the object (cf. Fig. 2, lower right). This is crucial for building a detailed physics model with no prior knowledge of the object. Moreover, our method can reliably learn the ‘effective’ softness of the material even in places with unreliable tracking, for example thin geometrical structures close to joints. In the case of the Baby Alien, our method learns that the ears and arms are softer compared to the other body parts; the mane and tail of the Pony are softer, even though these regions are very hard to track. Both reconstructions match the properties of their real counterparts.

We compare our method, which assumes independent material parameters at every point, with a baseline that uses only one global set of material parameters. We trained the baseline model with the exact same procedure as before but learn just one $\mu$ and one $\lambda$ for the energy in Eq. 4. As shown in Fig. 3, our inhomogeneous model is visually indistinguishable from the ground truth point cloud, while the homogeneous baseline model has a larger error. The homogeneous model fails to capture the exact movements at the arms and ears of the Baby Alien, but instead distributes the deformation evenly at the ears and arms.

Generalization to Novel Poses. The strength of the underlying physics simulator is the ability to generalize to scenarios that are not encountered in the training set. We show different simulated poses of the Baby Alien in Fig. 4, such as pulling the ears in opposite directions, and moving just one single arm. This deformation is particularly challenging for purely data-driven methods since both ears and arms only move synchronously in the training data.

Interaction with Virtual Objects. The physical model of the object enables interactions with all kinds of different virtual items. Fig. 5 shows the one-way coupled interaction of the learned elastic objects with other virtual items.

Rendering. Our pipeline ends with re-rendering an object under novel interactions not seen during training. Fig. 5 contains renderings of the Dino Blue and Dog objects, including interactions with two virtual objects. Tab. 2 contains quantitative results, where we compare the renderings obtained from the reconstructed point clouds (which are used for supervision when learning the material parameters) and the simulated point clouds. The former thus provides a soft upper bound of the quality that the simulator can achieve. We find that the simulator results are very close to those from the reconstructed point clouds. Thus, both the quantitative and qualitative results show that our approach is able to synthesize renderings of novel deformed states in a realistic manner.

| Object | Simulated / Not Masked: PSNR | Simulated / Not Masked: SSIM | Simulated / Not Masked: LPIPS | Simulated / Masked: PSNR | Simulated / Masked: SSIM | Simulated / Masked: LPIPS | Reconstructed / Not Masked: PSNR | Reconstructed / Not Masked: SSIM | Reconstructed / Not Masked: LPIPS | Reconstructed / Masked: PSNR | Reconstructed / Masked: SSIM | Reconstructed / Masked: LPIPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baby Alien | 18.40 | 0.734 | 0.255 | 21.17 | 0.840 | 0.174 | 18.75 | 0.747 | 0.249 | 21.92 | 0.853 | 0.167 |
| Fish | 19.75 | 0.692 | 0.239 | 22.55 | 0.808 | 0.173 | 20.03 | 0.701 | 0.235 | 22.96 | 0.818 | 0.169 |
| Leaf | 25.14 | 0.901 | 0.091 | 27.32 | 0.935 | 0.065 | 25.19 | 0.901 | 0.091 | 27.37 | 0.935 | 0.065 |
| Mr. Seal | 20.61 | 0.697 | 0.240 | 24.03 | 0.801 | 0.180 | 20.65 | 0.698 | 0.239 | 24.11 | 0.802 | 0.180 |
| Pillow | 21.45 | 0.743 | 0.223 | 23.18 | 0.806 | 0.174 | 21.92 | 0.760 | 0.218 | 23.84 | 0.823 | 0.169 |
| Dog | 18.98 | 0.751 | 0.206 | 24.68 | 0.904 | 0.104 | 19.05 | 0.757 | 0.203 | 25.24 | 0.912 | 0.100 |
| Sponge | 21.94 | 0.846 | 0.130 | 26.99 | 0.925 | 0.070 | 21.92 | 0.846 | 0.130 | 27.01 | 0.925 | 0.070 |
| Dino Rainbow | 18.64 | 0.754 | 0.302 | 23.87 | 0.839 | 0.232 | 20.22 | 0.778 | 0.281 | 26.21 | 0.859 | 0.213 |
| Dino Blue | 18.48 | 0.702 | 0.244 | 20.70 | 0.848 | 0.160 | 19.56 | 0.726 | 0.227 | 22.06 | 0.871 | 0.143 |
| Dino Green | 18.94 | 0.779 | 0.190 | 21.49 | 0.863 | 0.135 | 20.46 | 0.794 | 0.180 | 23.59 | 0.879 | 0.121 |
| Pony | 16.54 | 0.758 | 0.245 | 19.20 | 0.859 | 0.163 | 19.31 | 0.798 | 0.200 | 24.65 | 0.906 | 0.108 |
| Serpentine | 18.22 | 0.798 | 0.181 | 21.39 | 0.903 | 0.111 | 19.95 | 0.813 | 0.162 | 23.14 | 0.916 | 0.091 |
| Average* | 20.23 | 0.760 | 0.212 | 23.60 | 0.857 | 0.145 | 20.78 | 0.771 | 0.205 | 24.43 | 0.868 | 0.140 |
| Average | 19.76 | 0.763 | 0.212 | 23.05 | 0.861 | 0.147 | 20.58 | 0.777 | 0.201 | 24.34 | 0.875 | 0.133 |

*Rendering evaluation.* We report the classic error metrics PSNR and SSIM [wang2004image] ($-1$ to $+1$), where higher is better for both, and the learned perceptual metric LPIPS [zhang2018unreasonable] (0 is best). We use deformed point clouds to render deformed states of the canonical model, see Sec. 3.5. We use either the point cloud that the reconstruction (Sec. 3.2) provides directly (‘Reconstructed’) or the point cloud that the simulator provides after learning the material parameters (Sec. 3.3, ‘Simulated’). We report two versions: we either apply the segmentation masks of the input images to the rendered image to remove all artifacts that spill over onto the background (‘Masked’) or we do not (‘Not Masked’). Note that the values on the reconstructed point cloud are a (soft) upper bound for what the simulator can achieve. The simulated results are close to the reconstructed results, demonstrating that the learned material parameters yield deformation fields that allow re-rendering the object as well as the reconstruction can.
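For reference, the PSNR metric and the effect of masking can be sketched as below. The sketch assumes images with values in $[0, 1]$ and implements the ‘Masked’ variant by zeroing the background of the rendering only; the function name `psnr` and this exact masking convention are illustrative assumptions rather than a description of our evaluation code.

```python
import numpy as np

def psnr(rendered, reference, mask=None, peak=1.0):
    """Peak signal-to-noise ratio in dB between two (H, W, 3) images.

    mask: optional (H, W) foreground mask with values in {0, 1}; when given,
          background pixels of the rendering are set to zero before comparison.
    """
    rendered = np.asarray(rendered, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if mask is not None:
        rendered = rendered * np.asarray(mask, dtype=float)[..., None]
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")            # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform error of 0.1 on a unit-range image corresponds to 20 dB, which gives a sense of scale for the values in the table. Masking removes spill-over onto the background and therefore can only raise the score when the reference background is black.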

## 5 Limitations

Artifacts.
Due to the sparse camera setup (16 cameras for 360 degree coverage), we found NeRF unable to reconstruct viewpoint dependent effects, leading to artifacts around specular regions like eyes.
Furthermore, the air compressor leads to quickly oscillating surfaces (*e.g*., the fins of the fish), which pose a challenge for reconstruction and material parameter estimation and also impact calibration.
These issues impact the extracted point clouds as well as the final renderings (artifacts visible in Fig. 5); we manually removed resulting background clutter in the point clouds.
The physical simulator turned out to be remarkably robust towards noise and can run with any reconstructed point cloud with temporal correspondences.

Known Forces. The simulator requires the forces impacting the object during capture to be known. This limits the variety of forces that can be applied and hence the kind of objects that are compatible with the presented method. We expect an extension to handle unknown forces to be a challenging but exciting direction for future work. Finding good force priors could be a viable approach in this direction.

## 6 Conclusion

We introduced a novel, holistic problem setting: estimating physical parameters of a general deformable object from RGB input and known physical forces, and rendering its physically plausible response to novel interactions realistically.
We further proposed Virtual Elastic Objects as a solution and demonstrated their ability to synthesize deformed states that greatly differ from observed deformations.
Our method leverages a physical simulator that is able to estimate plausible physical parameters from a 4D reconstruction of a captured object.
Finally, we showed that these deformed states can be re-rendered with high quality.
We hope that the presented results and the accompanying dataset will inspire and enable future work on reconstructing and re-rendering *interactive* objects.