3D Neural Scene Representations for Visuomotor Control

07/08/2021 · by Yunzhu Li, et al. · MIT

Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.


1 Introduction

Existing state-of-the-art model-based systems operating from vision treat images as 2D grids of pixels [4, 9, 36]. The world, however, is three-dimensional. Modeling the environment in 3D enables amodal completion and allows agents to operate from different views. It is therefore desirable to obtain 3D-aware representations of the environment from 2D observations: they improve task performance whenever accurate 3D reasoning is essential, and they also make it easier to specify tasks and to learn from third-person videos.

One of the core questions in model learning for robotic manipulation is how to determine the state representation for learning the dynamics model. The desired representation should make it easy to capture the environment dynamics, exhibit a good 3D understanding of the objects in the scene, and be applicable to diverse object sets such as rigid bodies, deformable objects, and fluids. One line of prior work learns the dynamics model directly in image pixel space [7, 5, 53, 41]. However, modeling dynamics in such a high-dimensional space is challenging, and these methods typically generate blurry images when performing long-horizon future prediction. Another line of work focuses on predicting only task-relevant features identified as keypoints [21, 22, 48, 8, 15]. Such models generalize well at the category level, i.e., the same set of keypoints can represent different instances within the same category, but they are not sufficient to model objects with large variations such as fluids and granular materials. Other methods learn dynamics in a latent space [49, 1, 10, 9, 36]. However, most of these methods learn the dynamics model using 2D convolutional neural networks and a reconstruction loss, which shares the same problem as predicting dynamics in image space: the learned representations lack equivariance to 3D transformations. Time contrastive networks [37], on the other hand, aim to learn viewpoint-invariant representations from multi-view inputs, but do not perform detailed modeling of the 3D contents. As a result, previously unseen scene configurations and camera poses are out-of-distribution samples for the state estimator; as we will see, this leads to wrong state estimates and faulty control trajectories.

Figure 1: Comparison of the control results between a 2D-based baseline and our 3D-aware approach. The task here is to achieve the configuration shown on the left, observed from a viewpoint that is outside the training distribution. The agent only takes a single-view visual observation as input from a viewpoint that is vastly different from the goal (images with blue frames). Our method generalizes well in this scenario and outperforms the 2D-based baseline, demonstrating the benefits of the learned 3D-aware scene representations.

Meanwhile, recent work in computer vision has made impressive progress in learning 3D-structured neural scene representations. These approaches allow inference of 3D structure and appearance while being trained only on 2D observations, either by overfitting to a single scene [39, 19, 23] or by generalizing across scenes [40, 46, 47]. Through their 3D inductive bias, the scene representations inferred by these models encode the scene contents more accurately and are invariant to changes in camera perspective. It is natural to push these ideas further and ask how such methods, which reason directly over 3D, can benefit dynamics modeling and complex control tasks.

In this work, we aim to leverage recently proposed 3D-structure-aware implicit neural scene representations for visuomotor control tasks. We propose to embed neural radiance fields [23] in an auto-encoder framework, enabling tractable inference of the 3D-structure-aware scene state for dynamic environments. By additionally enforcing a time-contrastive loss on the estimated states, we ensure that the learned state representations are viewpoint-invariant. We then train a dynamics model that predicts the evolution of the state space conditioned on the input action, enabling us to perform control in the learned state space. Though the representation itself is 3D-structured, the convolutional encoder is not. At test time, we overcome this limitation by performing inference-via-optimization [29, 40], enabling accurate state estimation even for out-of-distribution camera poses and, therefore, control of tasks where the goal view is specified from an entirely unseen camera perspective. These contributions enable us to perform model-based visuomotor control of complex scenes, modeling the 3D dynamics of both rigid objects and fluids. Comparisons with various baselines show that the representation learned by our model describes the contents of 3D scenes more precisely, allowing it to accomplish control tasks involving both rigid objects and fluids with significantly higher accuracy (Figure 1). Please see our supplementary video for better visualization.

We summarize our contributions as follows: (i) We extend an autoencoding framework with a neural radiance field rendering module and time contrastive learning that allows us to learn 3D-aware scene representations for dynamics modeling and control purely from visual observations. (ii) By incorporating the auto-decoder mechanism at test time, our framework can adjust the learned representation and accomplish the control tasks with the goal specified from camera viewpoints outside the training distribution. (iii) We are the first to augment neural radiance fields using a time-invariant dynamics model, supporting future prediction and novel view synthesis across a wide range of environments with different types of objects.

Figure 2: Overview of the training procedure. Left: an encoder that maps the input images into a latent scene representation. The images are first passed through an image encoder to produce per-image feature representations; a state encoder then combines the image features from the same time step into the scene representation. A time contrastive loss encourages the model to be invariant to camera viewpoints. Middle: a decoder that takes the scene representation as input and generates the observation image conditioned on a given viewpoint. An L2 loss encourages the reconstructed image to be similar to the ground-truth image. Right: a dynamics model that predicts the future scene representation from the current representation and action. An L2 loss encourages the predicted latent representation to be similar to the feature representation extracted from the true visual observation at the next time step.

2 Related Work

3D Scene Representation Learning. Prior work leverages the latent spaces of autoencoder-like models as learned representations of the underlying 3D scene to enable novel view synthesis from a single image [51, 42]. Eslami et al. [6] embed this approach in a probabilistic framework. To endow models with 3D structure, voxel grids can be leveraged as neural scene representations [32, 25, 39, 45, 56], while others have tried to predict particle sets from images [14]. Sitzmann et al. [40] propose to learn neural implicit representations of 3D shape and appearance supervised with posed 2D images via a differentiable renderer. Generalizing across neural implicit representations can also be realized by local conditioning on CNN features [33, 44, 54], but this does not learn a global representation of the scene state. Alternatively, gradient-based meta-learning has been proposed for faster inference of implicit neural representations [38]. Deformable scenes can be modeled by transporting input coordinates to neural implicit representations with an implicitly represented flow field [26, 30, 31, 43, 18, 3, 52, 28]; however, such models typically fit a single trajectory and cannot handle different initial conditions or external action inputs, limiting their use in control.

Model-Based RL in Robotic Manipulation. We can categorize model-based RL methods by whether they use physics-based or data-driven models, and whether they assume full state access or only visual observations. Methods that rely on first principles typically assume full-state information of the environment [12, 55] and require knowledge of the object models, making them hard to generalize to novel objects or partially observable scenarios. Among data-driven models, prior work has learned dynamics models for closed-loop planar pushing [2] or dexterous manipulation [24]. Schenck and Fox [34, 35] tackle a similar fluid pouring task via closed-loop simulation. Although these methods achieve impressive results, they rely on state estimators customized for specific tasks, limiting their applicability to more general and diverse manipulation tasks.

Various model-based RL methods have been proposed to learn state representations from visual observations, such as image-space dynamics [7, 5, 4, 41], keypoint representation [13, 22, 15], and low-dimensional latent space [49, 10, 9, 36]. Some works learn a meaningful representation space using reconstruction loss [10, 9]. Others jointly train the forward and inverse dynamics models [1], or use time contrastive loss to regularize the latent embedding [37]. We differ from the previous methods by explicitly incorporating a 3D volumetric rendering process during training.

3 3D-Aware Representations for Dynamics Modeling

Inspired by Neural Radiance Fields (NeRF) [23], we propose a framework that learns a viewpoint-invariant model for dynamic environments. As shown in Figure 2, our framework has three parts: (1) an encoder that maps the input images into a latent state representation, (2) a decoder that generates an observation image under a certain viewpoint based on the state representation, and (3) a dynamics model that predicts the future state representations based on the current state and the input action.

3.1 3D-Aware Scene Representation Learning

Neural Radiance Field. Given a 3D point $\mathbf{x} \in \mathbb{R}^3$ in a scene and a viewing direction unit vector $\mathbf{d}$ from a camera, NeRF learns to predict a volumetric radiance field. This is represented using a differentiable rendering function that predicts the corresponding density $\sigma$ and RGB color $\mathbf{c}$ via $(\sigma, \mathbf{c}) = f_{\text{NeRF}}(\mathbf{x}, \mathbf{d})$. To render the color of an image pixel, NeRF integrates the information along the camera ray using $\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt$, where $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ is the camera ray with origin $\mathbf{o}$ and unit direction vector $\mathbf{d}$, and $T(t) = \exp\big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\big)$ is the accumulated transparency between the pre-defined near depth $t_n$ and far depth $t_f$ along that camera ray. The mean squared error between the reconstructed color and the ground-truth color $C(\mathbf{r})$ is:

$$\mathcal{L}_{\text{rec}} = \sum_{\mathbf{r}} \big\| \hat{C}(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2 \qquad (1)$$
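In practice, the integral above is approximated by a quadrature over samples along each ray, as in the original NeRF formulation. The PyTorch sketch below illustrates this discretization; the `radiance_field` callable, sample count, and tensor shapes are illustrative assumptions rather than the exact implementation used in the paper.

```python
import torch

def render_ray_color(radiance_field, origin, direction, t_near, t_far, n_samples=64):
    """Approximate the volume rendering integral along one camera ray.

    radiance_field(points, dirs) -> (sigma, rgb): placeholder for the learned network,
    where sigma has shape [n_samples] and rgb has shape [n_samples, 3].
    origin, direction: tensors of shape [3] (ray origin and unit direction).
    """
    # Sample depths between the pre-defined near and far bounds.
    t = torch.linspace(t_near, t_far, n_samples)
    points = origin[None, :] + t[:, None] * direction[None, :]   # [n_samples, 3]
    dirs = direction[None, :].expand(n_samples, 3)                # [n_samples, 3]

    sigma, rgb = radiance_field(points, dirs)                     # [n], [n, 3]

    # Distances between adjacent samples (last interval treated as very large).
    deltas = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])    # [n_samples]

    # alpha_i = 1 - exp(-sigma_i * delta_i); T_i = prod_{j<i} (1 - alpha_j).
    alpha = 1.0 - torch.exp(-sigma * deltas)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = transmittance * alpha                               # [n_samples]

    return (weights[:, None] * rgb).sum(dim=0)                    # [3] estimated pixel color
```

The reconstruction loss then compares this estimated pixel color with the ground-truth pixel, summed over sampled rays.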

Neural Radiance Field for Dynamic Scenes. One key limitation of NeRF is that it assumes the scene is static; for a dynamic scene, it must learn a separate radiance field for each time step. This severely limits NeRF's ability to model environments that change over time, as it is both time-consuming and unclear how to transfer knowledge across time steps or to new scenes that resemble old ones. While other models have shown generalization across scenes [40, 27], these representations do not capture fine details. To model dynamic scenes, we learn an encoding function that maps the visual observations at each time step to a scene representation $s_t$, and learn the volumetric radiance field decoding function conditioned on $s_t$. At each time step $t$, the scene is captured by a set of 2D images taken from one or more camera viewpoints. We use ResNet-18 [11] to extract a feature vector for each image: we take the output of ResNet-18 before the pooling layer and send it to a fully-connected layer, resulting in a 256-dimensional image feature. This image feature is concatenated with the corresponding camera viewpoint information (a 16-D vector obtained by flattening the camera view matrix), and the concatenated feature is processed by a small multilayer perceptron (MLP) to produce the final image feature. The scene representation $s_t$ at time $t$ is obtained by first averaging the image features across the camera viewpoints and then encoding the result with another small MLP.
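A minimal PyTorch sketch of this encoder is shown below. The module names (`ImageEncoder`, `StateEncoder`), the MLP widths beyond the stated 256-D image feature, and the way the pre-pooling ResNet-18 feature map is reduced before the fully-connected layer are assumptions made for illustration, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn
import torchvision

class ImageEncoder(nn.Module):
    """ResNet-18 features (pre-pooling) -> 256-D image feature, fused with the camera pose."""
    def __init__(self, feat_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
        self.fc = nn.Linear(512, feat_dim)
        # Small MLP that fuses the image feature with the flattened 4x4 view matrix (16-D).
        self.fuse = nn.Sequential(nn.Linear(feat_dim + 16, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, feat_dim))

    def forward(self, image, view_matrix):
        # image: [B, 3, H, W], view_matrix: [B, 16]
        # Here the pre-pooling feature map is spatially averaged before the FC layer;
        # the exact reduction used in the paper is an assumption.
        feat = self.backbone(image).mean(dim=(2, 3))   # [B, 512]
        feat = self.fc(feat)                           # [B, 256]
        return self.fuse(torch.cat([feat, view_matrix], dim=-1))

class StateEncoder(nn.Module):
    """Average per-view image features, then encode them into the scene representation s_t."""
    def __init__(self, feat_dim=256, state_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, state_dim), nn.ReLU(),
                                 nn.Linear(state_dim, state_dim))

    def forward(self, view_features):
        # view_features: [B, K, feat_dim] for K camera viewpoints at the same time step.
        return self.mlp(view_features.mean(dim=1))     # [B, state_dim]
```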

Given a 3D point $\mathbf{x}$, a viewing direction unit vector $\mathbf{d}$, and a scene representation $s_t$, we learn a decoding function that predicts the radiance field, represented by the density $\sigma$ and RGB color $\mathbf{c}$. As in NeRF, we integrate the information along each camera ray to render the color of image pixels from an input viewpoint, and then compute the image reconstruction loss using Equation 1. During each training iteration, we render two images from different viewpoints to obtain more accurate gradient updates. Because the rendered output depends on the scene representation $s_t$, the representation is forced to encode the 3D contents of the scene in order to support rendering from different camera poses.

Figure 3: Forward prediction and viewpoint extrapolation. (a) We first feed the input image(s) at time $t$ to the encoder to derive the scene representation $s_t$. The dynamics model then takes $s_t$ and the corresponding action sequence as input to iteratively predict the future. The decoder synthesizes the visual observation conditioned on the predicted state representation and an input viewpoint. (b) We propose an auto-decoding inference-via-optimization framework to enable extrapolated viewpoint generalization. Given an input image taken from a viewpoint outside the training distribution, the encoder first predicts the scene representation. The decoder then reconstructs the observation from this representation and the input camera viewpoint. We calculate the L2 distance between the reconstruction and the input image and backpropagate the gradient through the decoder to update the scene representation. This updating process is repeated for a fixed number of iterations, resulting in a more accurate representation of the underlying 3D scene.

Time Contrastive Learning. To make the image encoder viewpoint-invariant, we regularize the feature representation of each image using a multi-view time contrastive loss (TCN) [37] (see Figure 2a). The TCN loss encourages features of images from different viewpoints at the same time step to be similar, while pushing apart features of images from different time steps. More specifically, given a time step $t$, we randomly select one image as the anchor and extract its image feature $v^{a}$ using the image encoder. We then randomly select one positive image from the same time step but a different camera viewpoint, and one negative image from a different time step but the same viewpoint, and use the same image encoder to extract their image features $v^{p}$ and $v^{n}$. Similar to [37], we minimize the following time contrastive loss:

$$\mathcal{L}_{\text{tc}} = \max\big( \| v^{a} - v^{p} \|_2^2 - \| v^{a} - v^{n} \|_2^2 + m,\; 0 \big) \qquad (2)$$

where $m$ is a hyper-parameter denoting the margin between the positive and negative pairs.
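A small sketch of this triplet-style loss in PyTorch, assuming batched anchor, positive, and negative image features and squared L2 distances, consistent with Equation 2:

```python
import torch

def time_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style TCN loss over image features.

    anchor:   features of images at time t (one viewpoint)        [B, D]
    positive: features at the same time t, different viewpoint    [B, D]
    negative: features at a different time t', same viewpoint     [B, D]
    """
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # squared L2 to the positive
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # squared L2 to the negative
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```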

3.2 Learning the Predictive Model

After we have obtained the latent state representation $s_t$, we use supervised learning to estimate the forward dynamics model $\hat{s}_{t+1} = f_{\text{dyn}}(s_t, a_t)$. Given $s_t$ and a sequence of actions $\{a_t, \dots, a_{t+H-1}\}$, we predict $H$ steps into the future by iteratively feeding the actions into the one-step forward model. We implement $f_{\text{dyn}}$ as an MLP, trained by optimizing the following loss function:

$$\mathcal{L}_{\text{dyn}} = \sum_{h=1}^{H} \big\| \hat{s}_{t+h} - s_{t+h} \big\|_2^2 \qquad (3)$$

where $\hat{s}_{t+h}$ is the prediction obtained by iteratively applying $f_{\text{dyn}}$ starting from $s_t$, and $s_{t+h}$ is the representation encoded from the true visual observation at time $t+h$.

We define the final loss as a combination of the image reconstruction loss $\mathcal{L}_{\text{rec}}$, the time contrastive loss $\mathcal{L}_{\text{tc}}$, and the dynamics prediction loss $\mathcal{L}_{\text{dyn}}$. We first train the encoder and decoder together using stochastic gradient descent (SGD) by minimizing $\mathcal{L}_{\text{rec}}$ and $\mathcal{L}_{\text{tc}}$, which ensures that the learned scene representation encodes the 3D contents and is viewpoint-invariant. We then fix the encoder parameters and train the dynamics model by minimizing $\mathcal{L}_{\text{dyn}}$ using SGD. See the supplementary materials for the network architecture and training details.
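A minimal sketch of the dynamics model and its rollout loss (Equation 3) is shown below. The MLP depth and width, the action dimensionality, and the rollout horizon are illustrative assumptions; `f_dyn` plays the role of the one-step forward model described above.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """One-step forward model s_{t+1} = f_dyn(s_t, a_t), implemented as an MLP."""
    def __init__(self, state_dim=256, action_dim=7, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def rollout_loss(f_dyn, s_t, actions, target_states):
    """Iteratively roll out f_dyn and compare against encoder targets (Equation 3).

    s_t:           [B, state_dim]      initial representation
    actions:       [B, H, action_dim]  action sequence
    target_states: [B, H, state_dim]   representations encoded from true observations
    """
    loss, s = 0.0, s_t
    for h in range(actions.shape[1]):
        s = f_dyn(s, actions[:, h])                                    # predict next state
        loss = loss + (s - target_states[:, h]).pow(2).sum(dim=-1).mean()
    return loss
```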

4 Visuomotor Control

4.1 Online Planning for Closed-Loop Control

When given the goal image and its associated camera pose, we feed them through the encoder to obtain the goal state representation $s_{\text{goal}}$; we use the same procedure to compute the state representation $s_t$ for the current scene configuration. The goal of the online planning problem is to find an action sequence $\{a_t, \dots, a_{T-1}\}$ that minimizes the distance between the predicted future representation and the goal representation at time $T$. As shown in Figure 3a, given a sequence of actions, our model can iteratively predict a sequence of latent state representations. The latent-space dynamics model can then be used for downstream closed-loop control via online planning with model-predictive control (MPC), which incorporates feedback from the environment and gives the agent a chance to adjust the current action sequence. We formally define the online planning problem as follows:

$$\min_{a_t, \dots, a_{T-1}} \; \big\| \hat{s}_{T} - s_{\text{goal}} \big\|_2^2 \quad \text{s.t.} \quad \hat{s}_{t'+1} = f_{\text{dyn}}(\hat{s}_{t'}, a_{t'}), \quad \hat{s}_t = s_t \qquad (4)$$

Many existing off-the-shelf model-based RL methods can be used to solve the MPC problem [5, 24, 10, 7, 22, 16, 17]. We experimented with random shooting, gradient-based trajectory optimization, the cross-entropy method, and model-predictive path integral (MPPI) planners [50], and found that MPPI performed the best. MPPI is a sampling-based, gradient-free optimizer that considers temporal coordination between time steps when sampling action trajectories. At time $t$, the algorithm first samples $N$ action sequences around the current nominal actions via $\hat{a}^{i}_{t+j} = a_{t+j} + n^{i}_{j}$. Each noise sample $n^{i}_{j}$, denoting the noise value at time step $j$ of the $i$-th trajectory, is generated using a filtering coefficient $\beta$ as follows:

$$n^{i}_{j} = \beta\, u^{i}_{j} + (1 - \beta)\, n^{i}_{j-1}, \qquad u^{i}_{j} \sim \mathcal{N}(0, \Sigma) \qquad (5)$$

We then roll the sampled sequences out in parallel on the GPU using the learned model, and re-weight the trajectories according to their rewards to update the action sequence using a reward-weighting factor $\gamma$: $a_{t+j} \leftarrow \sum_{i=1}^{N} e^{\gamma R_i}\, \hat{a}^{i}_{t+j} / \sum_{i=1}^{N} e^{\gamma R_i}$, where $R_i$ is the reward of the $i$-th sampled trajectory. This procedure is repeated for a fixed number of iterations, at which point the best action sequence is selected. In our experiments, we specify the action space as the position and orientation of the arm's end-effector; the joint angles of the arm are computed via inverse kinematics.
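The following sketch outlines one MPPI iteration as described above: temporally correlated noise (Equation 5), parallel rollouts through the learned dynamics model, and exponential reward weighting. The reward function, sample count, and hyper-parameter values are placeholders, and in this sketch the reward is computed only from the final predicted state.

```python
import torch

def mppi_update(f_dyn, reward_fn, s_t, s_goal, actions, n_samples=128,
                noise_std=0.1, beta=0.7, gamma=1.0):
    """One MPPI iteration: sample noisy action sequences, roll them out, reweight.

    actions: [H, action_dim] current nominal action sequence.
    reward_fn(s_pred, s_goal) -> [N] reward per sampled trajectory
        (e.g., negative L2 distance between the final predicted state and the goal).
    """
    H, A = actions.shape
    # Temporally correlated noise: n_j = beta * u_j + (1 - beta) * n_{j-1}, u ~ N(0, sigma^2).
    u = noise_std * torch.randn(n_samples, H, A)
    noise = torch.zeros_like(u)
    noise[:, 0] = beta * u[:, 0]
    for j in range(1, H):
        noise[:, j] = beta * u[:, j] + (1.0 - beta) * noise[:, j - 1]
    sampled = actions[None] + noise                       # [N, H, A]

    # Roll out the learned dynamics model in parallel for all sampled sequences.
    s = s_t[None].expand(n_samples, -1)                   # [N, state_dim]
    for j in range(H):
        s = f_dyn(s, sampled[:, j])
    rewards = reward_fn(s, s_goal)                        # [N]

    # Reward-weighted average of the sampled action sequences.
    w = torch.softmax(gamma * rewards, dim=0)             # [N]
    return (w[:, None, None] * sampled).sum(dim=0)        # [H, A] updated nominal actions
```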

Figure 4: Qualitative control results of our method on three types of testing scenarios. The image on the right shows the target configuration we aim to achieve. The left three columns show the control process; these are also the input images to the agent. The fourth column shows the control result from the same viewpoint as the goal image. The first trial specifies the goal from a viewpoint that differs from the agent's but was encountered during training. The second trial uses a goal view that is an interpolation of training viewpoints. The third trial uses an extrapolated viewpoint that is outside the training distribution. Our method performs well in all settings.
Figure 5: Qualitative comparisons between our method and baseline approaches on the control tasks. We show closed-loop control results in the FluidPour and FluidShake environments. The goal image viewpoint (top-left image of each block) is outside the training distribution and differs from the viewpoint observed by the agent (bottom-left image of each block). Our final control results are much better than those of a variant that does not perform auto-decoding test-time optimization (Ours w/o AD) and the best-performing baseline (TC+AE), both of which fail to accomplish the task; their control results (blue points) exhibit an apparent deviation from the target configuration (red points) when measured in 3D point space.

4.2 Auto-Decoder for Viewpoint Extrapolation

End-to-end visuomotor agents can undergo a significant performance drop when the test-time visual observations are captured from camera poses outside the training distribution. The convolutional image encoder suffers from the same problem: it is not equivariant to changes in the camera pose, so it struggles to generalize to out-of-distribution camera views. As shown in Figure 3b, when the model encounters an image from a viewpoint outside the training distribution, with a pixel distribution vastly different from what it was trained on, passing the image through the encoder yields only an amortized estimate of the scene representation. The decoded image is then likely to differ from the ground truth, since the viewpoint was never encountered during training.

We fix this problem at test time by applying an inference-via-optimization (also known as auto-decoding) framework that backpropagates through the volumetric renderer and the neural implicit representation into the state estimate [29, 40]. This is motivated by the fact that the rendering function is viewpoint-equivariant: its output depends only on the state representation, the 3D location, and the ray direction, and is therefore invariant to the camera position along a given camera ray, i.e., even if we move the camera closer or farther away along the ray, the renderer still tends to generate the same result. We leverage this property by calculating the L2 distance between the input image and the reconstructed image, and then updating the scene representation using stochastic gradient descent. We repeat this update for a fixed number of iterations to derive the state representation of the underlying 3D scene. Note that this update only changes the scene representation while keeping the decoder parameters fixed. The resulting representation is used as the goal representation in Equation 4 to solve the online planning problem.
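A minimal sketch of this test-time auto-decoding step, assuming a differentiable `render_image(state, camera_pose)` function built from the decoder of Section 3.1; the optimizer choice, iteration count, and learning rate are placeholders.

```python
import torch

def autodecode_state(render_image, s_init, camera_pose, target_image,
                     n_steps=100, lr=1e-2):
    """Refine the encoder's amortized state estimate by gradient descent through the renderer.

    render_image(state, camera_pose) -> [3, H, W]: differentiable volumetric rendering
        of the scene representation from the given viewpoint (decoder weights stay frozen).
    """
    s = s_init.clone().detach().requires_grad_(True)   # only the state is optimized
    optimizer = torch.optim.SGD([s], lr=lr)
    for _ in range(n_steps):
        optimizer.zero_grad()
        recon = render_image(s, camera_pose)
        loss = (recon - target_image).pow(2).mean()    # L2 distance in pixel space
        loss.backward()                                # gradients flow through the decoder
        optimizer.step()
    return s.detach()                                  # refined goal representation for Eq. 4
```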

Figure 6: Quantitative comparisons between our method and baselines on FluidPour and FluidShake. In each environment, we compare results using three evaluation metrics under three settings: (1) the target image view is seen during training, (2) the target image view is inside the training distribution but not seen during training (interpolation), and (3) the target image view is outside the training distribution (extrapolation). The height of each bar indicates the mean, and the error bar denotes the standard error. Our model significantly outperforms all baselines under all testing settings.

5 Experiments

Environments. We consider the following four environments that involve both fluid and rigid objects for evaluating the proposed model and baseline approaches. The environments are simulated using NVIDIA FleX [20]. (1) FluidPour (Figure 8a): This environment contains a fully-actuated cup that pours fluids into a container at the bottom. (2) FluidShake (Figure 8b): A fully-actuated container moves on a 2D plane. Inside the container are fluids and a rigid cube floating on the surface. (3) RigidStack (Figure 8c): Three rigid cubes form a vertical stack and are released from a certain height but in different horizontal positions. They fall down and collide with each other and the ground. (4) RigidDrop (Figure 8d): A cube falls down from a certain height. There is a container fixed at a random position on the ground. The cube either falls into the container or bounces out.

Evaluation Metrics. We use the first two environments, i.e., FluidPour and FluidShake, to measure the control performance, where we specify the target configuration of the control task using images from (1) one of the viewpoints encountered during training, (2) an interpolated viewpoint between training viewpoints, and (3) an extrapolated viewpoint outside the training distribution (Figure 4).

We provide quantitative evaluations of the control performance in FluidPour and FluidShake by extracting the particle set from the simulator and measuring the Chamfer distance between the result and the goal, which we denote as "Chamfer Dist.". In FluidPour, we additionally measure the L2 distance of the cup's position and orientation to the goal, denoted as "Position Error" and "Angle Error" respectively. In FluidShake, we calculate the L2 distance of the container's and cube's positions to the goal, denoted as "Container Error" and "Cube Error" respectively.
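For reference, a symmetric Chamfer distance between two particle sets can be computed as below; this is a generic implementation, and the exact variant and normalization used for the reported numbers may differ.

```python
import torch

def chamfer_distance(x, y):
    """Symmetric Chamfer distance between point sets x: [N, 3] and y: [M, 3]."""
    d = torch.cdist(x, y)                                   # [N, M] pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```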

5.1 Baseline Methods

For comparison, we consider the following three baselines. TC: Similar to [37], this baseline uses only the time contrastive loss for learning the image feature, without reconstructing the scene; we learn a dynamics model directly on the image features for control. TC+AE: Instead of using neural radiance fields to reconstruct the image, this method uses a standard convolutional decoder to reconstruct the target image given a new viewpoint, making it similar to [6] augmented with a time contrastive loss. NeRF: This method is a direct adaptation of the original NeRF paper [23] and is the same as ours except that it includes neither the time contrastive loss during training nor the auto-decoding test-time optimization. We use the same dynamics model shown in Figure 2b, trained separately for each baseline, for dynamics prediction, and the same feedback control method, i.e., MPPI [50], for our model and all baselines.

Figure 7: Nearest neighbor (NN) results using our learned state representation. Given a query image (red boundary), we search its nearest neighbors based on their state representation. Our learned scene representations can retrieve reasonable neighbor images, indicating that our state representations retain a good estimation of the contents inside the 3D scene and are invariant to camera poses.
Figure 8: Forward prediction and novel view synthesis on four environments. Given a scene representation and an input action sequence, our dynamics model predicts the subsequent latent scene representation, which is used as the input of our decoder model to reconstruct the corresponding visual observation based on different viewpoints. In each block, we render images based on the open-loop future dynamic prediction. Images in the dotted box compare the novel view synthesis results of our model in the last time step with the ground truth from three different viewpoints.

5.2 Control Results

Goal Specification from Novel Viewpoints. Figure 4 (right) shows the goal configuration, and we ask the learned model to perform three control trials where the goal is specified from different types of viewpoints. The left three columns show the MPC control process from the agent's viewpoint. The fourth column visualizes the final configuration the agent achieves, rendered from the same viewpoint as the goal image. The first trial specifies the goal from a viewpoint that differs from the agent's but was encountered during training. The second trial uses a goal view that is an interpolation of training viewpoints. Our agent achieves the target configuration with decent accuracy in both cases. For the third trial, we specify the goal view by moving the camera closer, higher, and facing more downward with respect to the container; note that this goal view is outside the distribution of training viewpoints. With the help of the test-time auto-decoding optimization introduced in Section 4.2, our method can still successfully achieve the target configuration, as shown in the figure.

Baseline comparisons. We benchmark our model against the baselines by assessing their performance on the downstream control tasks. Figure 5 shows a qualitative comparison between our model (Ours), a variant of our model that does not perform the auto-decoding test-time optimization (Ours w/o AD), and the best-performing baseline (TC+AE) introduced in Section 5.1. We find that when the target view is outside the training distribution and vastly different from the agent view (bottom-left image in each trial block), our full method achieves the target configuration far more accurately. The variant without auto-decoding optimization and TC+AE fail to accomplish the task and exhibit an apparent deviation from the ground truth in 3D point space.

We also provide quantitative comparisons on the control results. Figure 6 shows the performance in the FluidPour and FluidShake environments. We find our full model significantly outperforms the baseline approaches in both environments under all scenarios and evaluation metrics. The results effectively demonstrate the advantages of the learned 3D-aware scene representations, which contain a more precise encoding of the contents in the 3D environments.

5.3 Analysis of the Learned Model

To better understand why our 3D-aware model outperforms other baselines in the downstream control tasks, we provide a deeper investigation of the learned state representation.

Nearest Neighbor Search Using the Learned Representation. When the goal image is specified from a viewpoint different from the agent's view, for the planning problem defined in Equation 4 to still work, it is essential that distance in the learned feature space reflects distance in the actual 3D space, i.e., scenes that are more similar in the real 3D space should be closer in the learned feature space, even if the visual observations are captured from different viewpoints. We visualize nearest neighbor results in Figure 7. Given a query image, we search for its nearest neighbors based on their state representations (introduced in Section 3.1). Even when the images look quite different from each other, the learned 3D-aware scene representations retrieve reasonable neighbor images that share similar 3D contents, indicating that they hold a good understanding of the real 3D scene and are invariant to camera viewpoints.

Dynamic Prediction and Novel View Synthesis. Conditioned on a scene representation and an input action sequence, our dynamics model can iteratively predict the evolution of the scene representation. Our decoder can then take the predicted state representation and reconstruct the corresponding visual observation from a query viewpoint. Figure 8 shows that our model can accurately predict the future and perform novel view synthesis in four environments involving both fluid and rigid objects, suggesting its usefulness for trajectory optimization. Please see the video results in the supplementary material for better visualization.
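Putting the pieces together, open-loop future prediction with novel view synthesis amounts to rolling the dynamics model forward in latent space and rendering each predicted state from arbitrary query viewpoints. The sketch below reuses the hypothetical `f_dyn` and `render_image` interfaces from the earlier sketches.

```python
def predict_and_render(f_dyn, render_image, s_t, actions, query_poses):
    """Open-loop rollout in latent space, rendering each predicted state from query viewpoints.

    s_t:         [1, state_dim]   current scene representation (batch of one)
    actions:     [H, action_dim]  open-loop action sequence
    query_poses: list of camera poses for novel view synthesis
    Returns a list of length H; each entry holds the rendered images for all query poses.
    """
    frames, s = [], s_t
    for h in range(actions.shape[0]):
        s = f_dyn(s, actions[h:h + 1])                               # predict next latent state
        frames.append([render_image(s[0], pose) for pose in query_poses])
    return frames
```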

6 Conclusion

In this paper, we proposed to learn viewpoint-invariant, 3D-aware scene representations from visual observations using an autoencoding framework augmented with a neural radiance field rendering module and time contrastive learning. We showed that the learned 3D representations perform well on model-based visuomotor control tasks. When coupled with an auto-decoding test-time optimization mechanism, our method allows goal specification from viewpoints outside the training distribution. We demonstrated the applicability of the proposed framework in a range of complicated physics environments involving rigid objects and fluids, and we hope it can facilitate future work on visuomotor control for complex 3D manipulation tasks.

References

  • [1] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine (2016) Learning to poke by poking: experiential learning of intuitive physics. arXiv preprint arXiv:1606.07419. Cited by: §1, §2.
  • [2] M. Bauza, F. R. Hogan, and A. Rodriguez (2018) A data-efficient approach to precise and controlled pushing. In Conference on Robot Learning, pp. 336–345. Cited by: §2.
  • [3] Y. Du, Y. Zhang, H. Yu, J. B. Tenenbaum, and J. Wu (2020) Neural radiance flow for 4d view synthesis and video processing. arXiv e-prints, pp. arXiv–2012. Cited by: §2.
  • [4] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568. Cited by: §1, §2.
  • [5] F. Ebert, C. Finn, A. X. Lee, and S. Levine (2017) Self-supervised visual planning with temporal skip connections.. In CoRL, pp. 344–356. Cited by: §1, §2, §4.1.
  • [6] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. (2018) Neural scene representation and rendering. Science 360 (6394), pp. 1204–1210. Cited by: §2, §5.1.
  • [7] C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. arXiv preprint arXiv:1605.07157. Cited by: §1, §2, §4.1.
  • [8] W. Gao and R. Tedrake (2021) KPAM 2.0: feedback control for category-level robotic manipulation. arXiv preprint arXiv:2102.06279. Cited by: §1.
  • [9] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603. Cited by: §1, §1, §2.
  • [10] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565. Cited by: §1, §2, §4.1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.1.
  • [12] F. R. Hogan and A. Rodriguez (2016) Feedback control of the pusher-slider system: a story of hybrid and underactuated contact dynamics. arXiv preprint arXiv:1611.08268. Cited by: §2.
  • [13] T. D. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih (2019) Unsupervised learning of object keypoints for perception and control. Advances in neural information processing systems 32, pp. 10724–10734. Cited by: §2.
  • [14] Y. Li, T. Lin, K. Yi, D. Bear, D. Yamins, J. Wu, J. Tenenbaum, and A. Torralba (2020) Visual grounding of learned physical models. In International conference on machine learning, pp. 5927–5936. Cited by: §2.
  • [15] Y. Li, A. Torralba, A. Anandkumar, D. Fox, and A. Garg (2020) Causal discovery in physical systems from videos. Advances in Neural Information Processing Systems 33. Cited by: §1, §2.
  • [16] Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba (2019) Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. In ICLR, Cited by: §4.1.
  • [17] Y. Li, J. Wu, J. Zhu, J. B. Tenenbaum, A. Torralba, and R. Tedrake (2019) Propagation networks for model-based control under partial observation. In ICRA, Cited by: §4.1.
  • [18] Z. Li, S. Niklaus, N. Snavely, and O. Wang (2020) Neural scene flow fields for space-time view synthesis of dynamic scenes. arXiv preprint arXiv:2011.13084. Cited by: §2.
  • [19] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh (2019) Neural volumes: learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751. Cited by: §1.
  • [20] M. Macklin, M. Müller, N. Chentanez, and T. Kim (2014) Unified particle physics for real-time applications. ACM Transactions on Graphics (TOG) 33 (4), pp. 1–12. Cited by: §5.
  • [21] L. Manuelli, W. Gao, P. Florence, and R. Tedrake (2019) Kpam: keypoint affordances for category-level robotic manipulation. arXiv preprint arXiv:1903.06684. Cited by: §1.
  • [22] L. Manuelli, Y. Li, P. Florence, and R. Tedrake (2020) Keypoints into the future: self-supervised correspondence in model-based reinforcement learning. In Conference on Robot Learning (CoRL), Cited by: §1, §2, §4.1.
  • [23] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pp. 405–421. Cited by: Appendix B, Figure 11, Appendix D, §1, §1, §3, §5.1.
  • [24] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar (2020) Deep dynamics models for learning dexterous manipulation. In Conference on Robot Learning, pp. 1101–1112. Cited by: §2, §4.1.
  • [25] T. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang (2018) Rendernet: a deep convolutional network for differentiable rendering from 3d shapes. arXiv preprint arXiv:1806.06575. Cited by: §2.
  • [26] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2019) Occupancy flow: 4d reconstruction by learning particle dynamics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5379–5389. Cited by: §2.
  • [27] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3504–3515. Cited by: §3.1.
  • [28] J. Ost, F. Mannan, N. Thuerey, J. Knodt, and F. Heide (2021) Neural scene graphs for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2856–2865. Cited by: §2.
  • [29] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) Deepsdf: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174. Cited by: §1, §4.2.
  • [30] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Brualla (2020) Deformable neural radiance fields. arXiv preprint arXiv:2011.12948. Cited by: §2.
  • [31] A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer (2020) D-nerf: neural radiance fields for dynamic scenes. arXiv preprint arXiv:2011.13961. Cited by: §2.
  • [32] D. J. Rezende, S. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess (2016) Unsupervised learning of 3d structure from images. arXiv preprint arXiv:1607.00662. Cited by: §2.
  • [33] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) Pifu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2304–2314. Cited by: §2.
  • [34] C. Schenck and D. Fox (2017) Reasoning about liquids via closed-loop simulation. arXiv preprint arXiv:1703.01656. Cited by: §2.
  • [35] C. Schenck and D. Fox (2018) Perceiving and reasoning about liquids using fully convolutional networks. The International Journal of Robotics Research 37 (4-5), pp. 452–471. Cited by: §2.
  • [36] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. (2020) Mastering atari, go, chess and shogi by planning with a learned model. Nature 588 (7839), pp. 604–609. Cited by: §1, §1, §2.
  • [37] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain (2018) Time-contrastive networks: self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. Cited by: §1, §2, §3.1, §5.1.
  • [38] V. Sitzmann, E. R. Chan, R. Tucker, N. Snavely, and G. Wetzstein (2020) Metasdf: meta-learning signed distance functions. arXiv preprint arXiv:2006.09662. Cited by: §2.
  • [39] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhöfer (2019) DeepVoxels: learning persistent 3d feature embeddings. In Proc. CVPR, Cited by: §1, §2.
  • [40] V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. arXiv preprint arXiv:1906.01618. Cited by: §1, §1, §2, §3.1, §4.2.
  • [41] H. Suh and R. Tedrake (2020) The surprising effectiveness of linear models for visual foresight in object pile manipulation. arXiv preprint arXiv:2002.09093. Cited by: §1, §2.
  • [42] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2015) Single-view to multi-view: reconstructing unseen views with a convolutional network. CoRR abs/1511.06702 1 (2), pp. 2. Cited by: §2.
  • [43] E. Tretschk, A. Tewari, V. Golyanik, M. Zollhöfer, C. Lassner, and C. Theobalt (2020) Non-rigid neural radiance fields: reconstruction and novel view synthesis of a deforming scene from monocular video. arXiv preprint arXiv:2012.12247. Cited by: §2.
  • [44] A. Trevithick and B. Yang (2020) GRF: learning a general radiance field for 3d scene representation and rendering. arXiv preprint arXiv:2010.04595. Cited by: §2.
  • [45] H. F. Tung, R. Cheng, and K. Fragkiadaki (2019) Learning spatial common sense with geometry-aware recurrent networks. Proc. CVPR. Cited by: §2.
  • [46] H. F. Tung, R. Cheng, and K. Fragkiadaki (2019) Learning spatial common sense with geometry-aware recurrent networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2595–2603. Cited by: §1.
  • [47] H. F. Tung, Z. Xian, M. Prabhudesai, S. Lal, and K. Fragkiadaki (2020) 3D-oes: viewpoint-invariant object-factorized environment simulators. arXiv preprint arXiv:2011.06464. Cited by: §1.
  • [48] C. Wang, R. Martín-Martín, D. Xu, J. Lv, C. Lu, L. Fei-Fei, S. Savarese, and Y. Zhu (2020) 6-pack: category-level 6d pose tracker with anchor-based keypoints. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 10059–10066. Cited by: §1.
  • [49] M. Watter, J. T. Springenberg, J. Boedecker, and M. Riedmiller (2015) Embed to control: a locally linear latent dynamics model for control from raw images. arXiv preprint arXiv:1506.07365. Cited by: §1, §2.
  • [50] G. Williams, A. Aldrich, and E. Theodorou (2015) Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149. Cited by: §4.1, §5.1.
  • [51] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow (2017) Interpretable transformations with encoder-decoder networks. In Proc. ICCV, Vol. 4. Cited by: §2.
  • [52] W. Xian, J. Huang, J. Kopf, and C. Kim (2021) Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9421–9431. Cited by: §2.
  • [53] L. Yen-Chen, M. Bauza, and P. Isola (2020) Experience-embedded visual foresight. In Conference on Robot Learning, pp. 1015–1024. Cited by: §1.
  • [54] A. Yu, V. Ye, M. Tancik, and A. Kanazawa (2020) PixelNeRF: neural radiance fields from one or few images. arXiv preprint arXiv:2012.02190. Cited by: §2.
  • [55] J. Zhou, Y. Hou, and M. T. Mason (2019) Pushing revisited: differential flatness, trajectory planning, and stabilization. The International Journal of Robotics Research 38 (12-13), pp. 1477–1489. Cited by: §2.
  • [56] J. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, and B. Freeman (2018) Visual object networks: image generation with disentangled 3d representations. In Proc. NIPS, pp. 118–129. Cited by: §2.

Appendix A Auto-Decoder for Viewpoint Extrapolation

Figure 9 shows qualitative results of the auto-decoding test-time optimization. When the model encounters an image from a viewpoint outside the training distribution, this mechanism helps derive a better representation of the scene that more accurately describes the 3D contents. The representation obtained after the optimization can then be used as the goal embedding in Equation 4 that the agent needs to achieve.

Appendix B Model Details

In the decoder model, we use a network architecture similar to that of the NeRF paper [23]. As shown in Figure 11, we send a 3D point $\mathbf{x}$, a camera ray direction $\mathbf{d}$, and a state feature representation $s_t$ into a fully-connected network that outputs the corresponding density $\sigma$ and RGB color $\mathbf{c}$.
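A PyTorch sketch of this decoder, following the description in the Figure 11 caption (8 ReLU layers of width 256 with a skip connection at the fifth layer, a density head, and a 128-channel color branch); the positional-encoding frequencies and the state-representation width are assumptions.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs):
    """NeRF-style positional encoding: sin/cos at exponentially spaced frequencies."""
    out = [x]
    for i in range(n_freqs):
        out += [torch.sin((2.0 ** i) * x), torch.cos((2.0 ** i) * x)]
    return torch.cat(out, dim=-1)

class ConditionedNeRFDecoder(nn.Module):
    """NeRF-style MLP conditioned on the learned state representation s_t."""
    def __init__(self, state_dim=256, n_freq_xyz=10, n_freq_dir=4, width=256):
        super().__init__()
        self.n_freq_xyz, self.n_freq_dir = n_freq_xyz, n_freq_dir
        in_xyz = 3 * (1 + 2 * n_freq_xyz) + state_dim       # encoded point + state feature
        in_dir = 3 * (1 + 2 * n_freq_dir)
        # 8 fully-connected ReLU layers, with the input re-injected at the fifth layer.
        self.layers1 = nn.Sequential(
            nn.Linear(in_xyz, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.layers2 = nn.Sequential(
            nn.Linear(width + in_xyz, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU())
        self.sigma_head = nn.Linear(width, 1)                # volume density (ReLU-rectified)
        self.feature_head = nn.Linear(width, width)          # 256-D feature for the color branch
        self.color_branch = nn.Sequential(
            nn.Linear(width + in_dir, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid())                 # RGB in [0, 1]

    def forward(self, x, d, state):
        # x: [N, 3] points, d: [N, 3] unit view directions, state: [N, state_dim]
        h_in = torch.cat([positional_encoding(x, self.n_freq_xyz), state], dim=-1)
        h = self.layers1(h_in)
        h = self.layers2(torch.cat([h, h_in], dim=-1))       # skip connection
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)   # nonnegative density
        feat = self.feature_head(h)
        rgb = self.color_branch(
            torch.cat([feat, positional_encoding(d, self.n_freq_dir)], dim=-1))
        return sigma, rgb
```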

Appendix C Environment Details

In the FluidPour environment, we generated trajectories for training. Each trajectory has frames with camera views sampled around the objects with a fixed distance towards the world origin. The action space for the control task is the position and tilting angle of the cup, which are randomly generated when constructing the training set.

In the FluidShake environment, we generated trajectories for training. Each trajectory has frames with camera views sampled around the objects with a fixed distance towards the world origin. The action space for the control task is the 2D location of the container in the world coordinate, which is also randomly generated when constructing the training set.

Figure 12 shows some example visual observations for FluidPour and FluidShake used during training. We also include some example images from viewpoints that are outside the training distribution, which are then used to evaluate our model’s extrapolated generalization ability.

In the RigidStack environment, we generated trajectories for training. Each trajectory has frames with camera views sampled around the objects with a fixed distance towards the world origin.

In the RigidDrop environment, we generated trajectories for training. Each trajectory has frames with camera views sampled around the objects with a fixed distance towards the world origin.

Figure 9: Qualitative results of the auto-decoding test-time optimization. Following the pipeline illustrated in Figure 3b, if the input image is outside the training distribution, as shown in the left column, the encoder cannot generate the most accurate state representation. When the predicted state embedding and the same viewpoint are passed to the decoder, the generated image does not match the underlying scene, as shown in the second column. We then calculate the L2 distance between the pixels of the generated image and the true observation, backpropagate the gradient to the state representation, and update it using SGD. As discussed in Section 4.2, the equivariant nature of the decoder allows it to effectively optimize the latent representation so that it better reflects the 3D contents of the scene. After the optimization, the generated visual observation is much closer to the ground truth, as shown in the third column. In contrast, a vanilla autoencoder with a CNN-based decoder cannot capture the underlying scene even with test-time auto-decoding optimization, as shown on the right.
Figure 10: Quantitative comparisons between our method and baselines in the FluidPour and FluidShake environments. In each environment, we compare results using three evaluation metrics under three settings: (1) the target image view is seen during training, (2) the target image view is inside the training distribution but not seen during training (interpolation), and (3) the target image view is outside the training distribution (extrapolation). The height of each bar indicates the mean, and the error bar denotes the standard error. Our model significantly outperforms all baselines under all testing settings.
Figure 11: A visualization of our decoder network architecture.

All layers are standard fully-connected layers; black arrows indicate layers with ReLU activations, orange arrows indicate layers with no activation, dashed black arrows indicate layers with sigmoid activation, and "+" denotes vector concatenation. We concatenate the positional encoding of the input location $\mathbf{x}$ and our learned state representation $s_t$ and pass them through 8 fully-connected ReLU layers, each with 256 channels. We follow the NeRF [23] architecture and include a skip connection that concatenates this input to the fifth layer's activation. An additional layer outputs the volume density $\sigma$ (rectified using a ReLU to ensure that the output volume density is nonnegative) and a 256-dimensional feature vector. This feature vector is concatenated with the positional encoding of the input viewing direction $\mathbf{d}$ and processed by an additional fully-connected ReLU layer with 128 channels. A final layer (with a sigmoid activation) outputs the emitted RGB radiance at position $\mathbf{x}$, as viewed by a ray with direction $\mathbf{d}$, given the current 3D state representation $s_t$.
Figure 12: Comparison between the viewpoints used for training and the subsequent viewpoint extrapolation experiments. (a) Here, we show some example images that the model used during training. For both environments, the camera is placed from a fixed distance and facing towards the world origin. (b) To evaluate the model’s ability for viewpoint extrapolation, i.e., processing visual observations from viewpoints that are outside the training distribution, we generate another set of viewpoints that are closer, higher, and facing more downwards with respect to the world origin. It is clear from the figure that the images from viewpoints used during training are very different from the ones used for viewpoint extrapolation when measured using pixel difference. It is essential to build a model that can directly reason over 3D in order to provide the desired extrapolated generalization ability. Although the model has access to visual observations from multiple cameras during training, it can only observe the environment from one camera when performing the downstream control task.

Appendix D Training Details

For the encoder and decoder model described in Figure 2 of the main text, we use the Adam optimizer for all experiments, with an initial learning rate that is decreased during training. The hyperparameters in the decoder are the same as in the original NeRF model [23], except that the near and far distances between the objects and the cameras differ across our environments, with separate values for FluidPour, FluidShake, RigidStack, and RigidDrop.

Appendix E Control Details

The auto-decoding test-time optimization runs for a fixed number of updating iterations. The MPPI optimizer uses a fixed number of sampled trajectories, with more iterations for updating the action sequence at the first time step than at subsequent control steps to maintain a better trade-off between efficiency and effectiveness. The reward-weighting factor, the filtering coefficient, and the control horizon are fixed and shared between FluidPour and FluidShake. The hyperparameters are the same for all compared methods.