Synthetic images rendered by graphics engines have emerged as a promising source of training data for deep networks, especially for vision and robotics tasks that involve perceiving 3D structures from RGB pixels Butler et al. (2012); Yeh et al. (2012); Varol et al. (2017); Ros et al. (2016); McCormac et al. (2017); Xia et al. (2018); Chang et al. (2017); Kolve et al. (2017); Song et al. (2017); Richter et al. (2016, 2017); Zhang et al. (2017); Li and Snavely (2018). A major appeal of generating training images from computer graphics is that they are in virtually unlimited supply and come with high-quality 3D ground truth for free.
Despite its great promise, however, using synthetic training images from graphics poses its own challenges. One of them is ensuring that the synthetic training images are useful for real-world tasks, in the sense that they help train a network to perform well on real images. Ensuring this is challenging because a graphics-based generation pipeline requires numerous design decisions, including the selection of 3D shapes, the composition of scene layout, the application of texture, the configuration of lighting, and the placement of the camera. These design decisions can profoundly impact the usefulness of the generated training data, but have largely been made manually by researchers in prior work, potentially leading to suboptimal results.
In this paper we address the problem of automatically optimizing a generation pipeline of synthetic 3D training data, with the explicit objective of improving the generalization performance of a trained deep network on real images. One idea is black-box optimization: we try a particular configuration of the pipeline, use the pipeline to generate training images, train a deep network on these images, and evaluate the network on a validation set of real images. We can treat the performance of the trained network as a black-box function of the configuration of the generation pipeline, and apply black-box optimization techniques. In fact, recent work by Yang and Deng (2018) has explored this exact direction. They use genetic algorithms to optimize the 3D shapes used in the generation pipeline. In particular, they start with a collection of simple primitive shapes such as cubes and spheres, and evolve them through mutation and combination into complex shapes, whose fitness is determined by the generalization performance of a trained network. They show that the 3D shapes evolved from scratch can provide more useful training data than manually created 3D CAD models.
The advantage of black-box optimization is that it makes no assumption about the function being optimized as long as it can be evaluated. As a result, it can be applied to any existing function, including advanced photorealistic renderers. On the other hand, black-box optimization is computationally expensive: knowing nothing else about the function, it needs many trials to find a good update to the current solution. In contrast, gradient-based optimization can be much more efficient by assuming the availability of analytical gradients, which can be efficiently computed and directly correspond to good updates to the current solution. The downside is that analytical gradients are often unavailable, especially for many advanced photorealistic renderers.
In this work, we propose a new method that optimizes the generation of 3D training data based on what we call “hybrid gradient”. The basic idea is to make use of analytical gradients where they are available, and combine them with black-box optimization for the rest of the function. Our hypothesis is that hybrid gradient will lead to more efficient optimization than black-box methods because it makes use of the partially available analytical gradient. Concretely, if we parametrize the design decisions as a real vector, the function mapping this vector to network performance can be decomposed into two parts: (1) from the design parameters to the generated training images, and (2) from the training images to the network performance.
The first part often does not have analytical gradients, due to the use of advanced photorealistic renderers. We instead compute an approximate gradient by averaging finite difference approximations along random directions Mania et al. (2018). For the second part, we compute analytical gradients through backpropagation: with SGD training unrolled, the performance of the network is a differentiable function of the training images. Then we combine the approximate gradient and the analytical gradient to obtain the hybrid gradient of the network performance with respect to the design parameters, as illustrated in Fig. 1.
A key ingredient of our approach is representing design decisions as real vectors of fixed dimensions, including the selection and composition of shapes. Yang and Deng (2018) represent 3D shapes as a finite set of graphs, one for each shape. This representation is suitable for a genetic algorithm but is incompatible with our method. Instead, we propose to represent 3D shapes as random samples generated by a Probabilistic Context-Free Grammar (PCFG) Harrison (1978). To sample a 3D shape, we start with an initial shape, and repeatedly sample a production rule in the grammar to modify it. The (conditional) probabilities of applying the production rules are parametrized as a real vector of a fixed dimension.
Our approach is novel in multiple aspects. First, to the best of our knowledge, we are the first to propose the idea of hybrid gradient, i.e. combining approximate gradients and analytical gradients, especially in the context of optimizing the generation of 3D training data. Second, the integration of PCFG-based shape generation and hybrid gradient is also novel. We evaluate our approach on the task of estimating surface normals from a single image. Experiments on standard benchmarks show that our approach can outperform the prior state of the art on optimizing the generation of 3D training data, particularly in terms of computational efficiency.
2 Related Work
Generating 3D training data
Synthetic images generated by computer graphics have been extensively used for training deep networks for numerous tasks, including single image 3D reconstruction Song et al. (2015); Hua et al. (2016); McCormac et al. (2017); Janoch et al. (2011); Yang and Deng (2018); Chang et al. (2015), optical flow estimation Mayer et al. (2018); Butler et al. (2012); Gaidon et al. (2016), human pose estimation Varol et al. (2017); Chen et al. (2016), action recognition Roberto de Souza et al. (2017), natural language modeling Johnson et al. (2017), and many others Weichao Qiu (2017); Martinez-Gonzalez et al. (2018); Xia et al. (2018); Tobin et al. (2017); Richter et al. (2017, 2016); Wu et al. (2018). The success of these works has demonstrated the effectiveness of synthetic images. To ensure the relevance of the generated training data to real-world tasks, a large amount of manual effort has been necessary, particularly in acquiring 3D assets such as shapes and scenes Chang et al. (2015); Janoch et al. (2011); Choi et al. (2016); Xiang et al. (2016); Hua et al. (2016); McCormac et al. (2017); Song et al. (2017).
To reduce manual labor, some heuristics have been proposed to automatically generate 3D configurations. For example, Zhang et al. (2017) use the entropy of object masks and the color distribution of the rendered image to select sampled camera poses. McCormac et al. (2017) simulate gravity for physically plausible object configurations inside a room.
Prior work has also performed explicit optimization of 3D configurations. For example, Yeh et al. (2012) synthesize layouts with the target of satisfying constraints such as non-overlapping and occupation. Jiang et al. (2018) learn a probabilistic grammar model for indoor scene generation, with parameters learned using maximum likelihood estimation on the existing 3D configurations in SUNCG Song et al. (2017). Similarly, Veeravasarapu et al. (2017) tune the parameters for stochastic scene generation using generative adversarial networks, with the goal of making synthetic images indistinguishable from real images. Qi et al. (2018) synthesize 3D room layouts based on human-centric relations among furniture, to achieve visual realism, functionality and naturalness of the scenes. However, these optimization objectives are different from ours, which is the generalization performance of a trained network on real images. The closest prior work to ours is that of Yang and Deng (2018), who use a genetic algorithm to optimize the 3D shapes used for rendering synthetic training images. Their optimization objective is the same as ours, but their optimization method is different in that they do not use any gradient information.
Unrolling and backpropagating through network training
One component of our approach is unrolling and backpropagating through the training iterations of a deep network. This technique has often been used by existing work in other contexts, including hyperparameter optimization Maclaurin et al. (2015) and meta-learning Andrychowicz et al. (2016); Ha et al. (2017); Munkhdalai and Yu (2017); Li and Malik (2017); Finn et al. (2018). Our work is different in that we apply this technique in a novel context: it is used to optimize the generation of 3D training data and it is integrated with approximate gradients to form hybrid gradients.
Our method is connected to hyperparameter optimization in the sense that we can treat the design decisions of the 3D generation pipeline as hyperparameters of the training procedure. Hyperparameter optimization is typically approached as black-box optimization Bergstra and Bengio (2012); Bergstra et al. (2011); Lacoste et al. (2014); Brochu et al. (2010). Since black-box optimization does not assume knowledge about the function being optimized, it requires repeated evaluation of the function, which is expensive in this case because each evaluation involves training and evaluating a deep network. In contrast, we combine analytical gradients from backpropagation with approximate gradients from generalized finite differences for more efficient optimization.
3 Problem Setup
Suppose we have a probabilistic generative pipeline that takes a real vector $\beta$ and a random number $r$ as input. After 3D composition and rendering, an image $X$ and its 3D ground truth $Y$ are computed through a function $f(\beta, r)$. By randomly sampling $r$ for $n$ times, we obtain a dataset of size $n$ for training:
$$(X_i, Y_i) = f(\beta, r_i), \quad i = 1, \dots, n.$$
Then, a deep neural network with initialized weights $w_0$ is trained on the training data $(X_{1:n}, Y_{1:n})$, with the function $\mathrm{train}(w_0, X_{1:n}, Y_{1:n})$ representing the optimization process and generating the weights of the trained network. The network is then evaluated on real data $(\hat{X}, \hat{Y})$ with a validation loss $l_{\mathrm{eval}}$ to obtain a generalization performance $L$:
$$L = l_{\mathrm{eval}}\big(\mathrm{train}(w_0, X_{1:n}, Y_{1:n}), \hat{X}, \hat{Y}\big).$$
Combining the above two functions, $L$ is a function of $\beta$, and the task is to optimize this value with respect to the parameters $\beta$. As we mentioned in the previous section, black-box algorithms typically need repeated evaluation of this function, which is expensive.
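To make the setup concrete, the following is a minimal Python sketch of the composite function from generation parameters to generalization loss. Everything in it is a toy assumption for illustration only: the "renderer" `generate` is a 1-D scaling of a random number, the "network" is a single scalar weight trained by SGD, and the names `generate`, `train`, `evaluate` and `generalization_loss` are hypothetical rather than taken from the paper.

```python
import random


def generate(beta, r):
    """Toy stand-in for the generation pipeline: one (x, y) training pair.

    beta controls the spread of the synthetic inputs; r is the random number.
    The label follows a fixed rule y = 2 * x (an assumption of this sketch).
    """
    x = beta * (2.0 * r - 1.0)  # input drawn from [-beta, beta]
    return x, 2.0 * x


def train(w0, data, lr=0.1):
    """Plain SGD with batch size 1 on the squared error of y ~ w * x."""
    w = w0
    for x, y in data:
        w -= lr * (w * x - y) * x
    return w


def evaluate(w, real_data):
    """Mean squared error on held-out 'real' pairs."""
    return sum((w * x - y) ** 2 for x, y in real_data) / len(real_data)


def generalization_loss(beta, real_data, n=50, seed=0):
    """Black-box objective: generate n samples, train, then evaluate."""
    rng = random.Random(seed)
    data = [generate(beta, rng.random()) for _ in range(n)]
    return evaluate(train(0.0, data), real_data)


# Stand-in validation set of "real" pairs following the same rule.
real = [(x / 10.0, 2.0 * x / 10.0) for x in range(-10, 11)]
loss_small = generalization_loss(0.1, real)  # narrow synthetic inputs
loss_large = generalization_loss(1.0, real)  # wider synthetic inputs
```

Each call to `generalization_loss` runs a full generate, train and evaluate cycle, which is why a black-box optimizer that needs many such calls becomes expensive once `train` is a deep network rather than a scalar model.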
4.1 Generative Modeling of Synthetic Training Data
We decompose the generation function into two parts: 3D composition and rendering.
Context-free grammars have been used in scene generation Jiang et al. (2018); Qi et al. (2018) and in parsing of Constructive Solid Geometry (CSG) shapes (Sharma et al., 2018). Here, we design a probabilistic context-free grammar (PCFG) Foley et al. (1990) to control the random generation of unlimited shapes. In a PCFG, a tree is randomly sampled given a set of probabilities. Starting from a root node, the leaf nodes of the tree keep expanding according to a set of rules. The process stops when no leaf node can be expanded further. Since multiple rules may apply, the parameters of a PCFG define the probability distribution over the applicable rules. In our PCFG, a shape can be constructed by composing two other shapes through union and difference, and this construction can be recursively applied until all leaf nodes belong to a predefined set of concrete primitive shapes (terminals). The parameters can be the probability of either expanding a node or replacing it with a terminal. Given our PCFG model with its probability parameters, a 3D shape can be composed by sampling from the grammar.
Rendering training images
We use a graphics renderer to render the composed shape. The rendering configurations (e.g. camera poses) are also sampled from a distribution controlled by a set of parameters. By drawing the random number from a uniform distribution, we obtain a set of training images and their 3D ground truth.
4.2 Hybrid Gradient
After a deep network is trained on the synthetic training data, it is evaluated on a set of validation images to obtain the generalization loss. Recall that to compute the hybrid gradient used to optimize the generation parameters, we multiply two gradients: the gradient of network training and the gradient of image generation, as shown in Fig. 2.
Analytical gradient from backpropagation
We assume the network is trained on a set of previously generated training images. Without loss of generality, we assume mini-batch stochastic gradient descent (SGD) with a batch size of 1 is used for weight updates, and we express each SGD step as a function of the current weights and the training batch, defined through the training loss.
Note that the SGD step is differentiable with respect to the network weights as well as the training batch, if our training loss is twice (sub-)differentiable. This requirement is satisfied in most practical cases. To simplify the equation, we assume the training loss and the learning rate do not change during one update step of the generation parameters, so these variables can be safely discarded in the equation. Therefore, the gradient of the generalization loss with respect to each training sample can be computed through backpropagation, with the boundary condition computed from the evaluation function.
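The backward recursion through unrolled SGD can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the model is a scalar linear regressor with loss 0.5 (w x - y)^2, the learning rate is fixed, and the names `sgd_step` and `unrolled_gradients` are hypothetical. The boundary condition is the gradient of the validation loss with respect to the final weight, and each step back multiplies by the derivative of the SGD step with respect to the previous weight.

```python
def sgd_step(w, x, y, lr):
    """One SGD step on the loss 0.5 * (w * x - y) ** 2 (batch size 1)."""
    return w - lr * (w * x - y) * x


def unrolled_gradients(w0, data, val_data, lr=0.1):
    """Backpropagate the validation loss through every SGD step.

    Returns (loss, dL_dx, dL_dy), where dL_dx[i] and dL_dy[i] are the
    gradients of the validation loss with respect to training sample i.
    """
    # Forward pass: store the weight seen before each step.
    weights = [w0]
    for x, y in data:
        weights.append(sgd_step(weights[-1], x, y, lr))
    w_final = weights[-1]

    # Validation loss and the boundary condition dL/dw_n.
    n_val = len(val_data)
    loss = sum(0.5 * (w_final * x - y) ** 2 for x, y in val_data) / n_val
    g = sum((w_final * x - y) * x for x, y in val_data) / n_val

    # Backward pass through the steps in reverse order.
    dL_dx = [0.0] * len(data)
    dL_dy = [0.0] * len(data)
    for i in range(len(data) - 1, -1, -1):
        x, y = data[i]
        w_prev = weights[i]
        dL_dx[i] = g * (-lr * (2.0 * w_prev * x - y))  # d(step)/dx
        dL_dy[i] = g * (lr * x)                        # d(step)/dy
        g = g * (1.0 - lr * x * x)                     # d(step)/dw
    return loss, dL_dx, dL_dy
```

In a real network the three partial derivatives of the step would come from automatic differentiation rather than from hand-written formulas; the structure of the recursion stays the same.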
Approximate gradient from finite difference
For the formulation in Eq. 5, the graphics renderer can be a general, non-differentiable black box. We can approximate the gradient of each rendered image and its ground truth with respect to the generation parameters using Basic Random Search, a generalized finite difference method described by Mania et al. (2018). First, we sample a set of noise vectors from an uncorrelated multivariate Gaussian distribution (Mania et al., 2018). Next, we approximate the Jacobian for each sample by averaging the outer products of the finite differences with their noise vectors (Mania et al., 2018).
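A random-direction finite difference estimate of a Jacobian can be sketched in Python as follows. This is an illustration under assumptions, not the paper's exact estimator: the function `estimate_jacobian`, its arguments, the use of central differences, and the normalization by the number of directions times the noise variance are choices made here so that the estimate is unbiased for a linear function.

```python
import random


def estimate_jacobian(f, beta, num_dirs=1000, sigma=0.1, seed=0):
    """Estimate the Jacobian of f at beta from random finite differences.

    f maps a list of d floats to a list of k floats and is only ever
    evaluated, never differentiated. Returns a k x d nested list.
    """
    rng = random.Random(seed)
    d = len(beta)
    k = len(f(beta))
    jac = [[0.0] * d for _ in range(k)]
    for _ in range(num_dirs):
        # Noise direction from an uncorrelated Gaussian.
        delta = [rng.gauss(0.0, sigma) for _ in range(d)]
        f_plus = f([b + e for b, e in zip(beta, delta)])
        f_minus = f([b - e for b, e in zip(beta, delta)])
        # Accumulate the outer product of the central difference and delta.
        for a in range(k):
            diff = 0.5 * (f_plus[a] - f_minus[a])
            for c in range(d):
                jac[a][c] += diff * delta[c]
    scale = 1.0 / (num_dirs * sigma * sigma)
    return [[v * scale for v in row] for row in jac]
```

The estimate needs two evaluations of `f` per direction, so its cost grows with the number of directions rather than with the dimension of the output, which is what makes it usable when `f` is a renderer producing whole images.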
Following Yang and Deng (2018), we incrementally update the generation parameters and the network weights. At each timestep, we update the generation parameters with the hybrid gradient; for the network weights, we simply use the latest trained network as the initialization for the next timestep.
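The combination of the two gradients can be sketched as a chain-rule product followed by a gradient step. The Python below is a minimal illustration under assumptions: each training sample is flattened to a vector of length k, the gradient with respect to the ground truth is treated as part of that vector, contributions are summed over samples, and the names `hybrid_gradient`, `update_parameters` and `step_size` are hypothetical.

```python
def hybrid_gradient(dL_dX, jacobians):
    """Chain rule over samples.

    dL_dX[i] is the analytical gradient of the loss with respect to sample i
    (length k); jacobians[i] is the approximate Jacobian of sample i with
    respect to the generation parameters (k x d). Returns a length-d list.
    """
    d = len(jacobians[0][0])
    grad = [0.0] * d
    for g_x, jac in zip(dL_dX, jacobians):
        for a, g in enumerate(g_x):
            for c in range(d):
                grad[c] += g * jac[a][c]
    return grad


def update_parameters(beta, grad, step_size=0.01):
    """One gradient-descent update of the generation parameters."""
    return [b - step_size * g for b, g in zip(beta, grad)]
```

After such an update, the next round would generate new samples with the updated parameters and continue training from the most recent network weights.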
5 Experimental Setup
We experiment on the task of surface normal estimation, a standard task for single-image 3D. The input is an RGB image and the output is pixel-wise surface normals. We evaluate on two datasets of real images: the MIT-Berkeley Intrinsic Images Dataset (MBII) Barron and Malik (2015), which focuses on images of single objects, and NYU Depth Silberman et al. (2012), which focuses on indoor scenes. For MBII, we use pure synthetic shapes Yang and Deng (2018) to render training images. We first compare our method with ablation baselines, then show that our algorithm is better than the previous state of the art. For NYU Depth, we base our generative model on SUNCG Song et al. (2017) and augment the original 3D configurations in Zhang et al. (2017). We report the performance of surface normal estimation with the metrics commonly used in previous works, including mean angle error (MAE), median angle error, mean squared error (MSE), and the proportion of pixels whose normals fall within a given angular error threshold.
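For reference, the angle-based metrics can be computed as in the following Python sketch. The function names and the threshold argument are hypothetical, and normals are represented as plain 3-tuples. MSE is left out because the text does not say over which quantity it is computed.

```python
import math
import statistics


def angle_errors_deg(pred, gt):
    """Per-pixel angle (degrees) between predicted and ground-truth normals."""
    errors = []
    for p, g in zip(pred, gt):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        errors.append(math.degrees(math.acos(cos)))
    return errors


def summarize(errors, threshold_deg):
    """Mean angle error, median angle error, and fraction within a threshold."""
    return {
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
        "within": sum(e <= threshold_deg for e in errors) / len(errors),
    }
```

Lower is better for the mean and median errors, while higher is better for the fraction of pixels within the threshold.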
5.1 MIT-Berkeley Intrinsic Images
Following Yang and Deng (2018), we recover the surface normals of an object from a single image.
Synthetic shape generation
In Yang and Deng (2018), a population of primitive shapes such as cylinders, spheres and cubes is evolved and rendered to train deep networks. The evolution operators are defined as transformations of individual shapes, as well as boolean operations on shapes in Constructive Solid Geometry (CSG) Foley et al. (1990). In our algorithm, we also use the CSG grammar for our PCFG:
S => E;
E => C(E, T(E)) | P;
C => union | subtract;
P => sphere | cube | truncated_cone | tetrahedron;
T => attach * rand_translate * rand_rotate * rand_scale;
In this PCFG, the parameter vector consists of three parts: (1) the probabilities of the different rules; (2) the means and variances of the log-normal distributions controlling shape primitives (P), such as the radius of the sphere; (3) the means and variances of the log-normal distributions controlling transformation parameters (T), such as scale values. Examples of sampled shapes are shown in Fig. 3. We compose our shapes in mesh representations, slightly different from the implicit functions in (Yang and Deng, 2018). Therefore, we re-implemented their algorithm with mesh representations for a fair comparison. For network training and evaluation, we follow (Yang and Deng, 2018) and train the Stacked Hourglass Network Newell et al. (2016) on the images, and use the standard split of the MBII dataset for the optimization of the parameter vector and for testing.
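Sampling from a grammar of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the parameter names (`p_expand`, `p_union`, `p_primitive`, `size_mu`, and so on) are hypothetical, the transformation rule is reduced to a single log-normal scale, each primitive carries one log-normal size, and the `max_depth` cap is an added assumption to guarantee termination.

```python
import random

PRIMITIVES = ["sphere", "cube", "truncated_cone", "tetrahedron"]


def sample_shape(params, rng, depth=0, max_depth=6):
    """Sample a CSG expression tree from the grammar as nested tuples."""
    # Rule E => C(E, T(E)) | P: expand with probability p_expand.
    if depth < max_depth and rng.random() < params["p_expand"]:
        # Rule C => union | subtract.
        op = "union" if rng.random() < params["p_union"] else "subtract"
        left = sample_shape(params, rng, depth + 1, max_depth)
        right = sample_shape(params, rng, depth + 1, max_depth)
        # Rule T: only the random scale is modelled, as a log-normal draw.
        scale = rng.lognormvariate(params["scale_mu"], params["scale_sigma"])
        return (op, left, ("transform", scale, right))
    # Rule P: pick a primitive and a log-normal size parameter.
    name = rng.choices(PRIMITIVES, weights=params["p_primitive"])[0]
    size = rng.lognormvariate(params["size_mu"], params["size_sigma"])
    return (name, size)


def count_primitives(tree):
    """Number of terminal primitives in a sampled tree."""
    if tree[0] in PRIMITIVES:
        return 1
    if tree[0] == "transform":
        return count_primitives(tree[2])
    return count_primitives(tree[1]) + count_primitives(tree[2])
```

Flattening the values in `params` into a list gives a real vector of fixed dimension, regardless of how large the sampled trees become.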
5.2 NYU Depth
S => E,P;
E => T_shapes * R_shapes * E0;
P => T_camera * R_camera * P0;
T_shapes => translate(rand_x, rand_y, rand_z);
R_shapes => rotate_euler(rand_yaw, rand_pitch, rand_roll);
For each 3D scene S, we perturb the positions and poses of the original cameras (P0) and shapes (E0). The position perturbations follow a mixture of uncorrelated Gaussians, and the perturbations for pose angles (yaw, pitch and roll) follow a mixture of von Mises distributions, which closely approximate wrapped Gaussians. The parameter vector consists of the parameters of the above distributions.
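Drawing such perturbations can be illustrated with Python's standard `random` module, which provides both Gaussian and von Mises samplers. The sketch below rests on simplifying assumptions: a single mixture is shared by the three position axes and another by the three angles, and the parameter names (`pos_weights`, `pos_components`, `ang_weights`, `ang_components`) are hypothetical.

```python
import math
import random


def sample_mixture(rng, weights, components, draw):
    """Pick one mixture component by weight and draw a value from it."""
    index = rng.choices(range(len(components)), weights=weights)[0]
    return draw(*components[index])


def sample_perturbation(params, rng):
    """Sample a translation (x, y, z) and Euler angles (yaw, pitch, roll)."""
    # Position components are (mean, standard deviation) pairs.
    translation = tuple(
        sample_mixture(rng, params["pos_weights"], params["pos_components"], rng.gauss)
        for _ in range(3)
    )
    angles = []
    for _ in range(3):
        # Angle components are (mean angle, concentration kappa) pairs.
        a = sample_mixture(
            rng, params["ang_weights"], params["ang_components"], rng.vonmisesvariate
        )
        # vonmisesvariate returns values in [0, 2*pi); shift to (-pi, pi].
        angles.append(a - 2.0 * math.pi if a > math.pi else a)
    return translation, tuple(angles)
```

The mixture weights, means, standard deviations and concentrations together would form the real-valued parameter vector that the optimization adjusts.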
Our networks are only trained on synthetic images, and evaluated on NYU Depth V2 Silberman et al. (2012) with the same setup as in Zhang et al. (2017).
For real images in our optimization pipeline, we sample a subset of images from the standard validation images in NYU Depth V2.
6 Experiment Results
6.1 MIT-Berkeley Intrinsic Images
We first sample 10 random values of the parameter vector in advance, then for each value we train a network, with exactly the same training and evaluation configurations as in our hybrid gradient method. We then report the best, median and worst performance of those 10 networks, and label the corresponding parameter values as best, median and worst. For hybrid gradient, we initialize the parameter vector from each of these three values and report the performance on test images, also in Table 1. From the table we can observe that training with a fixed parameter vector can hardly match the performance of our method, even with multiple trials. In contrast, our hybrid gradient approach can optimize the parameters to a reasonable performance regardless of the initialization. This simple diagnostic experiment demonstrates that our algorithm is working properly.
Comparison with previous work
In this experiment, we compare with black-box algorithms including Basic Random Search Mania et al. (2018) and Shape Evolution Yang and Deng (2018). Because we use a mesh implementation for CSG instead of the implicit computation graph in Yang and Deng (2018), we re-implemented Shape Evolution with the same setting for a fair comparison. We follow (Yang and Deng, 2018) for the initialization of the parameter vector, train the networks and update the parameters for the same number of steps. We then report the test performance of the network which has the best validation performance. The results are shown in Table 2.
Table 2. Methods compared on MBII: SIRFS (Barron and Malik, 2015); Evolution (Yang and Deng, 2018), as reported; Evolution (Yang and Deng, 2018), our implementation; Basic Random Search (Mania et al., 2018).
We also run the experiments on the same set of CPUs and GPUs, and plot the test mean angle error with respect to the CPU time, GPU time and total computation time (Fig. 4). We see that our algorithm is more efficient than the above baselines. Shapes sampled from our optimized PCFG are shown in Fig. 3.
6.2 NYU Depth
We initialize our network using the original model in (Zhang et al., 2017) and initialize the parameter vector using a small value. To compare with random parameters, we construct a dataset of k images with a small random parameter vector for each image. We then load the same pre-trained network and train for the same number of iterations as in hybrid gradient. We then evaluate the networks on the test set of NYU Depth V2 Silberman et al. (2012), following the same protocol. The results are reported in Table 3. Note that none of these networks has seen a single real image except for validation.
Table 3. Methods compared on NYU Depth V2: Original (Zhang et al., 2017); training with random parameters.
The numbers indicate that our parametrized generation of SUNCG augmentation exceeds the original baseline performance. Note that the network trained with random parameters performs worse than the original. This means that, without proper optimization of the perturbation parameters, such random augmentation may hurt generalization, demonstrating that good choices of these parameters are crucial for generalization to real images.
In this paper, we have proposed hybrid gradient, a novel approach to the problem of automatically optimizing a generation pipeline of synthetic 3D training data. We evaluate our approach on the task of estimating surface normals from a single image. Our experiments show that our algorithm can outperform the prior state of the art on optimizing the generation of 3D training data, particularly in terms of computational efficiency.
Acknowledgments
This work is partially supported by the National Science Foundation under Grant No. 1617767.
- Andrychowicz et al.  Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- Barron and Malik  Jonathan T Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. TPAMI, 2015.
- Bergstra and Bengio  James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. J. Mach. Learn. Res., 13(1):281–305, February 2012. ISSN 1532-4435.
- Bergstra et al.  James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2546–2554. Curran Associates Inc., 2011. ISBN 978-1-61839-599-3.
- Brochu et al.  Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
- Butler et al.  D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, October 2012.
- Chang et al.  Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
- Chang et al.  Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
- Chen et al.  Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Zhenhua Wang, Changhe Tu, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Synthesizing training images for boosting human 3d pose estimation. In 3D Vision (3DV), 2016.
- Choi et al.  Sungjoon Choi, Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. A large dataset of object scans. arXiv:1602.02481, 2016.
- Finn et al.  Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9516–9527. Curran Associates, Inc., 2018.
- Foley et al.  James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes. Computer Graphics: Principles and Practice (2Nd Ed.). Addison-Wesley Longman Publishing Co., Inc., 1990. ISBN 0-201-12110-7.
- Gaidon et al.  Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- Ha et al.  David Ha, Andrew Dai, and Quoc Le. Hypernetworks. In ICLR, 2017.
- Harrison  M. A. Harrison. Introduction to Formal Language Theory. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1978. ISBN 0201029553.
- Hua et al.  Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. Scenenn: A scene meshes dataset with annotations. In International Conference on 3D Vision (3DV), 2016.
- Janoch et al.  A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A category-level 3-d object dataset: Putting the kinect to work. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 1168–1174, Nov 2011.
- Jiang et al.  Chenfanfu Jiang, Siyuan Qi, Yixin Zhu, Siyuan Huang, Jenny Lin, Lap-Fai Yu, Demetri Terzopoulos, and Song-Chun Zhu. Configurable 3d scene synthesis and 2d image rendering with per-pixel ground truth using stochastic grammars. International Journal of Computer Vision, 126(9):920–941, 2018.
- Johnson et al.  Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- Kolve et al.  Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai, 2017.
- Lacoste et al.  Alexandre Lacoste, Hugo Larochelle, Mario Marchand, and François Laviolette. Sequential model-based ensemble optimization. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pages 440–448. AUAI Press, 2014. ISBN 978-0-9749039-1-0.
- Li and Malik  Ke Li and Jitendra Malik. Learning to optimize. In ICLR, 2017.
- Li and Snavely  Zhengqi Li and Noah Snavely. Cgintrinsics: Better intrinsic image decomposition through physically-based rendering. In The European Conference on Computer Vision (ECCV), September 2018.
- Maclaurin et al.  Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
- Mania et al.  Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search of static linear policies is competitive for reinforcement learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1800–1809. Curran Associates, Inc., 2018.
- Martinez-Gonzalez et al.  Pablo Martinez-Gonzalez, Sergiu Oprea, Alberto Garcia-Garcia, Alvaro Jover-Alvarez, Sergio Orts-Escolano, and Jose Garcia-Rodriguez. UnrealROX: An extremely photorealistic virtual reality environment for robotics simulations and synthetic data generation. ArXiv e-prints, 2018.
- Mayer et al.  Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. What makes good synthetic training data for learning disparity and optical flow estimation? Int. J. Comput. Vision, 126(9):942–960, September 2018. ISSN 0920-5691.
- McCormac et al.  John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- Munkhdalai and Yu  Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 2554–2563. JMLR.org, 2017.
- Newell et al.  Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, volume 9912 of Lecture Notes in Computer Science, pages 483–499. Springer, 2016.
- Qi et al.  Siyuan Qi, Yixin Zhu, Siyuan Huang, Chenfanfu Jiang, and Song-Chun Zhu. Human-centric indoor scene synthesis using stochastic grammar. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Richter et al.  Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Computer Vision – ECCV 2016, pages 102–118. Springer International Publishing, 2016.
- Richter et al.  Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun. Playing for benchmarks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- Roberto de Souza et al.  Cesar Roberto de Souza, Adrien Gaidon, Yohann Cabon, and Antonio Manuel Lopez. Procedural generation of videos to train deep action recognition networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- Ros et al.  German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
- Sharma et al.  Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. Csgnet: Neural shape parser for constructive solid geometry. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Silberman et al.  Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision – ECCV 2012, pages 746–760. Springer Berlin Heidelberg, 2012.
- Song et al.  Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
- Song et al.  Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. Proceedings of 29th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Tobin et al.  Joshua Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
- Varol et al.  Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017.
- Veeravasarapu et al.  V. S. R. Veeravasarapu, Constantin A. Rothkopf, and Visvanathan Ramesh. Adversarially tuned scene generation. CoRR, abs/1701.00405, 2017.
- Weichao Qiu  Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, Yizhou Wang, and Alan Yuille. Unrealcv: Virtual worlds for computer vision. ACM Multimedia Open Source Software Competition, 2017.
- Wu et al.  Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3d environment. In ICLR (Workshop). OpenReview.net, 2018.
- Xia et al.  Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Xiang et al.  Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh Mottaghi, Leonidas Guibas, and Silvio Savarese. Objectnet3d: A large scale database for 3d object recognition. In European Conference Computer Vision (ECCV), 2016.
- Yang and Deng  Dawei Yang and Jia Deng. Shape from shading through shape evolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Yeh et al.  Yi-Ting Yeh, Lingfeng Yang, Matthew Watson, Noah D. Goodman, and Pat Hanrahan. Synthesizing open worlds with constraints using locally annealed reversible jump mcmc. ACM Trans. Graph., 31(4):56:1–56:11, July 2012. ISSN 0730-0301.
- Zhang et al.  Yinda Zhang, Shuran Song, Ersin Yumer, Manolis Savva, Joon-Young Lee, Hailin Jin, and Thomas Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.