Kaolin: A PyTorch Library for Accelerating 3D Deep Learning Research

11/12/2019 · Krishna Murthy Jatavallabhula, et al. · NVIDIA

We present Kaolin, a PyTorch library aiming to accelerate 3D deep learning research. Kaolin provides efficient implementations of differentiable 3D modules for use in deep learning systems. With functionality to load and preprocess several popular 3D datasets, and native functions to manipulate meshes, pointclouds, signed distance functions, and voxel grids, Kaolin mitigates the need to write wasteful boilerplate code. Kaolin packages together several differentiable graphics modules including rendering, lighting, shading, and view warping. Kaolin also supports an array of loss functions and evaluation metrics for seamless evaluation and provides visualization functionality to render the 3D results. Importantly, we curate a comprehensive model zoo comprising many state-of-the-art 3D deep learning architectures, to serve as a starting point for future research endeavours. Kaolin is available as open-source software at https://github.com/NVIDIAGameWorks/kaolin/.




1 Introduction

3D deep learning is receiving attention and recognition at an accelerated rate due to its high relevance in complex tasks such as robotics [19, 43, 35, 44], self-driving cars [32, 27, 6], and augmented and virtual reality [10, 1]. The advent of deep learning and ever-growing compute infrastructure have allowed for the analysis of highly complicated, and previously intractable, 3D data [16, 12, 30]. Furthermore, 3D vision research has started an interesting trend of exploiting well-known concepts from related areas such as robotics and computer graphics [17, 21, 25]. Despite this accelerating interest, conducting research within the field involves a steep learning curve due to the lack of standardized tools. No system yet exists that allows a researcher to easily load popular 3D datasets, convert 3D data across various representations and levels of complexity, plug into modern machine learning frameworks, and train and evaluate deep learning architectures. New researchers in the field of 3D deep learning must inevitably compile a collection of mismatched code snippets from various code bases to perform even basic tasks, which has resulted in an uncomfortable absence of comparisons across different state-of-the-art methods.

With the aim of removing the barriers to entry into 3D deep learning and expediting research, we present Kaolin, a 3D deep learning library for PyTorch [29]. Kaolin provides efficient implementations of all core modules required to quickly build 3D deep learning applications. From loading and pre-processing data, to converting it across popular 3D representations (meshes, voxels, signed distance functions, pointclouds, etc.), to performing deep learning tasks on these representations, to computing task-specific metrics and visualizations of 3D data, Kaolin makes the entire life-cycle of a 3D deep learning application intuitive and approachable. In addition, Kaolin implements a large set of popular methods for 3D tasks, along with their pre-trained models, in our model zoo, to demonstrate the ease with which new methods can now be implemented, and to highlight it as a home for future 3D DL research. Finally, with the advent of differentiable renderers for explicit modeling of geometric structure and other physical processes (lighting, shading, projection, etc.) in 3D deep learning applications [17, 22, 5], Kaolin features a generic, modular differentiable renderer which easily extends to all popular differentiable rendering methods, and is also simple to build upon for future research and development.

Figure 1: Kaolin makes training 3D DL models simple. We provide an illustration of the code required to train and test a PointNet++ classifier for car vs. airplane in five lines of code.
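The spirit of that five-line workflow can be conveyed with a compact, self-contained PyTorch sketch. Note this is a simplified PointNet-style classifier trained on random stand-in data, not the PointNet++ implementation or API from Kaolin's model zoo:

```python
import torch
import torch.nn as nn

# Minimal PointNet-style classifier: a shared per-point MLP followed by a
# global max-pool, which makes the prediction invariant to point order.
class TinyPointNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, num_classes)

    def forward(self, points):                 # points: (B, N, 3)
        feats = self.mlp(points)               # (B, N, 128) per-point features
        global_feat = feats.max(dim=1).values  # order-invariant pooling
        return self.head(global_feat)          # (B, num_classes) logits

model = TinyPointNet()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
points = torch.randn(8, 1024, 3)   # a random stand-in batch of pointclouds
labels = torch.randint(0, 2, (8,)) # binary "car vs. airplane" labels
loss = nn.functional.cross_entropy(model(points), labels)
optim.zero_grad(); loss.backward(); optim.step()
```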

Library | 3D representations | Dataset preprocessing | Differentiable rendering | Model zoo | USD support
GVNN [13] | Mainly RGB(D) images | – | – | – | –
Kornia [31] | Mainly RGB(D) images | – | – | – | –
TensorFlow Graphics [38] | Mainly meshes | – | ✓ | – | –
Kaolin (ours) | Comprehensive | ✓ | ✓ | ✓ | ✓

Table 1: Kaolin is the first comprehensive 3D DL library. With extensive support for various representations, datasets, and models, it complements existing 3D libraries such as TensorFlow Graphics [38], Kornia [31], and GVNN [13].
Figure 2: Kaolin provides efficient PyTorch operations for converting across 3D representations. While meshes, pointclouds, and voxel grids continue to be the most popular 3D representations, Kaolin has extensive support for signed distance functions (SDFs), orthographic depth maps (ODMs), and RGB-D images.

2 Kaolin - Overview

Kaolin aims to provide efficient and easy-to-use tools for constructing 3D deep learning architectures and manipulating 3D data. By providing extensive boilerplate functionality, Kaolin lets 3D deep learning researchers and practitioners direct their efforts exclusively to developing the novel aspects of their applications. In the following sections, we briefly describe each major functionality of this 3D deep learning package. For an illustrated overview, see Fig. 1.

2.1 3D Representations

The choice of representation in a 3D deep learning project can have a large impact on its success due to the varied properties different 3D data types possess [15]. To ensure high flexibility in this choice of representation, Kaolin exhaustively supports all popular 3D representations:

  • Polygon meshes

  • Pointclouds

  • Voxel grids

  • Signed distance functions and level sets

  • Depth images (2.5D)

Each representation type is stored as a collection of PyTorch tensors within an independent class. This allows for operator overloading over common functions for data augmentation and modification supported by the package. Efficient (and, wherever possible, differentiable) conversions across representations are provided within each class. For example, we provide differentiable surface sampling mechanisms that enable conversion from polygon meshes to pointclouds by application of the reparameterization trick [33]. Network architectures are also supported for each representation, such as graph convolutional networks and MeshCNN for meshes [18, 14], 3D convolutions for voxels [16], and PointNet and PointNet++ for pointclouds [30, 40]. The following example demonstrates the ease with which a mesh model can be loaded into Kaolin, differentiably converted into a pointcloud, and then rendered in both representations:
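A minimal sketch of such a differentiable mesh-to-pointcloud conversion, written in plain PyTorch rather than Kaolin's own API (the function below and its names are illustrative):

```python
import torch

def sample_points_from_mesh(vertices, faces, num_points=1000):
    """Differentiably sample a pointcloud from a triangle mesh.

    Sketch of the area-weighted sampling idea described above; gradients
    flow to `vertices` because each sample is a convex (barycentric)
    combination of its face's vertices.
    vertices: (V, 3) float tensor, faces: (F, 3) long tensor.
    """
    tris = vertices[faces]                           # (F, 3, 3)
    v0, v1, v2 = tris[:, 0], tris[:, 1], tris[:, 2]
    # Face areas give the sampling distribution over triangles.
    areas = 0.5 * torch.linalg.cross(v1 - v0, v2 - v0).norm(dim=1)
    face_idx = torch.multinomial(areas, num_points, replacement=True)
    # Uniform barycentric coordinates via the square-root trick.
    u, v = torch.rand(num_points), torch.rand(num_points)
    su = u.sqrt()
    w0, w1, w2 = 1 - su, su * (1 - v), su * v
    return (w0[:, None] * v0[face_idx]
            + w1[:, None] * v1[face_idx]
            + w2[:, None] * v2[face_idx])

# Example: sample from a single triangle; gradients reach the vertices.
verts = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]],
                     requires_grad=True)
faces = torch.tensor([[0, 1, 2]])
cloud = sample_points_from_mesh(verts, faces, 500)   # (500, 3)
cloud.sum().backward()
```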

2.2 Datasets

Kaolin provides complete support for many popular 3D datasets, reducing the large overhead involved in file handling, parsing, and augmentation to a single function call. (For datasets which do not possess open-access licenses, the data must be downloaded independently, and their location specified to Kaolin's dataloaders.) Access to all data is provided via extensions of the PyTorch Dataset and DataLoader classes. This makes pre-processing and loading 3D data as simple and intuitive as loading MNIST [20], and directly grants users the efficient loading of batched data that PyTorch dataloaders natively support. All data is importable and exportable in Universal Scene Description (USD) format [37], which provides a common language for defining, packaging, assembling, and editing 3D data across graphics applications.

Datasets currently supported include ShapeNet [4], PartNet [26], SHREC [4, 42], ModelNet [42], ScanNet [8], HumanSeg [23], and many more common and custom collections. ShapeNet [4], for example, provides a huge repository of CAD models, spanning tens of thousands of objects across dozens of classes. ScanNet [8] provides more than 1500 RGB-D video scans, comprising over 2.5 million unique depth maps, with full annotations for camera pose, surface reconstruction, and semantic segmentation. Both of these large collections of 3D information, and many more, are easily accessed through single function calls. For example, accessing ModelNet [42], providing it to a PyTorch dataloader, and loading a batch of voxel models is as easy as:
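Schematically, the pattern is the standard PyTorch Dataset/DataLoader one. The sketch below substitutes a synthetic voxel dataset (`ToyVoxelDataset`, a hypothetical stand-in, not Kaolin's ModelNet wrapper) so that it runs without any downloads:

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Stand-in for a Kaolin dataset class. The real call would be a single
# constructor invocation (names illustrative; the actual API may differ),
# roughly: dataset = kaolin.datasets.ModelNet(root='...', categories=[...])
class ToyVoxelDataset(Dataset):
    def __init__(self, num_models=100, resolution=32):
        # Random occupancy grids in place of real ModelNet voxelizations.
        self.voxels = torch.rand(num_models, resolution,
                                 resolution, resolution) > 0.5
        self.labels = torch.randint(0, 10, (num_models,))

    def __len__(self):
        return len(self.voxels)

    def __getitem__(self, idx):
        return self.voxels[idx].float(), self.labels[idx]

# Batched loading comes for free from the PyTorch DataLoader.
loader = DataLoader(ToyVoxelDataset(), batch_size=16, shuffle=True)
batch_voxels, batch_labels = next(iter(loader))   # (16, 32, 32, 32), (16,)
```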

Figure 3: Modular differentiable renderer: Kaolin hosts a flexible, modular differentiable renderer that allows for easy swapping of individual sub-operations to compose new variations.
Figure 4: Applications of Kaolin: (Clockwise from top-left) 3D object prediction with 2D supervision [5], 3D content creation with generative adversarial networks [34], 3D segmentation [14], automatically tagging 3D assets from TurboSquid [36], 3D object prediction with 3D supervision [33], and a lot more…

2.3 3D Geometry Functions

At the core of Kaolin is an efficient suite of 3D geometric functions that allow manipulation of 3D content. Rigid-body transformations are implemented in several parameterizations (Euler angles, Lie groups, and quaternions). Differentiable image warping layers, such as the perspective warping layers defined in GVNN (a neural network library for geometric vision) [13], are also implemented. The geometry submodule allows for 3D rigid-body, affine, and projective transformations, as well as 3D-to-2D projection and 2D-to-3D backprojection. It currently supports orthographic and perspective (pinhole) projection.
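The projection/backprojection round trip can be sketched in a few lines of PyTorch (illustrative helper functions, not Kaolin's geometry submodule API):

```python
import torch

def project(points, K):
    """Perspective (pinhole) projection of camera-frame 3D points to pixels.
    points: (N, 3), K: (3, 3) intrinsics. Returns (N, 2) pixel coordinates.
    """
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]   # divide by depth

def backproject(pixels, depth, K):
    """Inverse operation: lift pixels with known depth back to 3D."""
    ones = torch.ones(len(pixels), 1)
    rays = torch.cat([pixels, ones], dim=1) @ torch.inverse(K).T
    return rays * depth[:, None]      # scale unit-depth rays by depth

K = torch.tensor([[500., 0., 320.],
                  [0., 500., 240.],
                  [0., 0., 1.]])
pts = torch.tensor([[0.1, -0.2, 2.0], [0.5, 0.3, 4.0]])
px = project(pts, K)
recovered = backproject(px, pts[:, 2], K)   # round-trips to the inputs
```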

2.4 Modular Differentiable Renderer

Recently, differentiable rendering has grown into an active area of research, allowing deep learning researchers to perform 3D tasks using predominantly 2D supervision [17, 22, 5]. Developing differentiable rendering tools is no easy feat, however: the operations involved are computationally heavy and complicated. With the aim of removing these roadblocks to further research in this area, and to allow for easy use of popular differentiable rendering methods, Kaolin provides a flexible and modular differentiable renderer. Kaolin defines an abstract base class, DifferentiableRenderer, containing abstract methods for each component in a rendering pipeline (geometric transformations, lighting, shading, rasterization, and projection). Assembling the components, swapping out modules, and developing new techniques using this abstract class is simple and intuitive.
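Schematically, such an abstract base class looks as follows (a Python sketch under assumed method names, not the library's actual code):

```python
from abc import ABC, abstractmethod
import torch

# Each pipeline stage is an abstract method, so a new renderer overrides
# only the stages it changes. Method names here are illustrative.
class DifferentiableRenderer(ABC):
    def render(self, vertices, faces):
        vertices = self.transform(vertices)      # geometric transformation
        vertices = self.project(vertices)        # 3D -> 2D projection
        colors = self.shade(vertices, faces)     # lighting and shading
        return self.rasterize(vertices, faces, colors)

    @abstractmethod
    def transform(self, vertices): ...
    @abstractmethod
    def project(self, vertices): ...
    @abstractmethod
    def shade(self, vertices, faces): ...
    @abstractmethod
    def rasterize(self, vertices, faces, colors): ...

# Swapping a module amounts to overriding one method in a subclass.
class FlatOrthographic(DifferentiableRenderer):
    def transform(self, v): return v
    def project(self, v): return v[:, :2]                 # drop depth
    def shade(self, v, f): return torch.ones(len(f), 3)   # constant albedo
    def rasterize(self, v, f, c): return c                # per-face colors

out = FlatOrthographic().render(torch.randn(3, 3), torch.tensor([[0, 1, 2]]))
```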

Kaolin supports multiple lighting (ambient, directional, specular), shading (Lambertian, Phong, Cosine), projection (perspective, orthographic, distorted), and rasterization modes. An illustration of the architecture of the abstract DifferentiableRenderer class is shown in Fig. 3. Wherever necessary, implementations are written in CUDA for optimal performance (cf. Table 2). To demonstrate the reduced development overhead in this area, multiple publicly available differentiable renderers [17, 22, 5] are provided as concrete instances of our DifferentiableRenderer class. One such example, DIB-Renderer [5], is instantiated and used to differentiably render a mesh to an image using Kaolin in the following few lines of code:
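To illustrate the kind of operation such a renderer differentiates through, here is a minimal soft rasterizer for a single 2D triangle, in the spirit of SoftRasterizer [22] and DIB-R [5] (pure PyTorch; `soft_silhouette` and all names are illustrative, not either library's API):

```python
import torch

def soft_silhouette(verts2d, res=64, tau=1e-2):
    """Soft-rasterize one 2D triangle into a differentiable silhouette.

    Each pixel receives a coverage probability (a sigmoid of a signed
    inside measure), so gradients flow from the image to the vertices.
    verts2d: (3, 2) tensor in [0, 1]^2.
    """
    ys, xs = torch.meshgrid(torch.linspace(0, 1, res),
                            torch.linspace(0, 1, res), indexing='ij')
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (res*res, 2)
    a, b, c = verts2d[0], verts2d[1], verts2d[2]
    # Barycentric coordinates of every pixel w.r.t. the triangle.
    T = torch.stack([b - a, c - a], dim=1)               # columns b-a, c-a
    bary = (pix - a) @ torch.inverse(T).T                # (res*res, 2)
    w = torch.cat([1 - bary.sum(dim=1, keepdim=True), bary], dim=1)
    # Soft inside test: positive iff all barycentric coords are positive.
    return torch.sigmoid(w.min(dim=1).values / tau).reshape(res, res)

verts = torch.tensor([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]],
                     requires_grad=True)
img = soft_silhouette(verts)     # (64, 64) coverage probabilities
img.sum().backward()             # gradients reach the vertices
```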

2.5 Loss Functions and Metrics

A common challenge for 3D deep learning applications lies in defining and implementing tools for evaluating performance and for supervising neural networks. For example, comparing surface representations such as meshes or pointclouds might require matching positions of thousands of points or triangles, and CUDA functions are a necessity [9, 39, 33]. As a result, Kaolin provides implementations of an array of commonly used 3D metrics for each 3D representation. Included in this collection are intersection over union for voxels [7], Chamfer distance and (a quadratic approximation of) Earth mover's distance for pointclouds [9], and the point-to-surface loss [33] for meshes, along with many other mesh metrics such as the Laplacian, smoothness, and edge-length regularizers [39, 17].
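As an example of one such metric, the Chamfer distance can be written in a few lines of PyTorch (an illustrative reference version; Kaolin's own implementation is CUDA-accelerated):

```python
import torch

def chamfer_distance(p1, p2):
    """Symmetric Chamfer distance between pointclouds p1 (N, 3) and
    p2 (M, 3): mean squared distance from each point to its nearest
    neighbour in the other cloud, summed over both directions."""
    # Pairwise squared distances, (N, M).
    d = torch.cdist(p1, p2,
                    compute_mode='donot_use_mm_for_euclid_dist') ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

a = torch.randn(100, 3)
b = a + 0.01 * torch.randn(100, 3)
loss = chamfer_distance(a, b)   # small but non-zero for nearby clouds
```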

2.6 Model-zoo

Researchers new to the field of 3D deep learning are faced with a storm of questions over the choice of 3D representations, model architectures, loss functions, etc. We ameliorate this by providing a rich collection of baselines, as well as state-of-the-art architectures, for a variety of 3D tasks including, but not limited to, classification, segmentation, 3D reconstruction from images, super-resolution, and differentiable rendering. In addition to source code, we also release pre-trained models for these tasks on popular benchmarks, to serve as baselines for future research. We also hope that this will help encourage standardization in a field where evaluation methodology and criteria are still nascent.

Methods found in this model-zoo currently include Pixel2Mesh [39], GEOMetrics [33], and AtlasNet [11] for reconstructing mesh objects from single images; NM3DR [17], Soft-Rasterizer [22], and DIB-Renderer [5] for the same task with only 2D supervision; MeshCNN [14] for generic learning over meshes; PointNet [30] and PointNet++ [40] for generic learning over pointclouds; 3D-GAN [41], 3D-IWGAN [34], and 3D-R2N2 [7] for learning over distributions of voxels; and Occupancy Networks [24] and DeepSDF [28] for learning over level sets and SDFs, among many more. As examples of these methods and their pre-trained models, Figure 4 highlights an array of results directly accessible through Kaolin's model zoo.

Feature/operation | Reference approach | Our speedup
Mesh adjacency information | MeshCNN [14] | ×
DIB-Renderer | DIB-R [5] | ×
Sign testing points with meshes | Occupancy Networks [24] | ×
SoftRenderer | SoftRasterizer [22] | ×

Table 2: Sample speedups obtained by Kaolin over existing open-source code.

2.7 Visualization

An undeniably important aspect of any computer vision task is visualizing data. For 3D data, however, this is not at all trivial. While Python packages exist for visualizing some datatypes, such as voxels and pointclouds, no package supports visualization across all popular 3D representations. One of Kaolin's key features is visualization support for all of its representation types. This is implemented via lightweight visualization libraries such as Trimesh and pptk for run-time visualization. As all data is exportable to USD [37], 3D results can also easily be visualized in more intensive graphics applications with far higher fidelity (see Figure 4 for example renderings). For headless applications, such as when running on a server with no attached display, we provide compact utilities to render images and animations to disk for visualization at a later point.

3 Roadmap

While we view Kaolin as a major step in accelerating 3D DL research, the efforts do not stop here. We intend to foster a strong open-source community around Kaolin, and welcome contributions from other 3D deep learning researchers and practitioners. In this section, we present a general roadmap of Kaolin as open-source software.

  1. Model Zoo: We will continually improve our model zoo, especially since Kaolin provides extensive functionality that reduces the time required to implement new methods (most approaches can be implemented in a day or two of work).

  2. Differentiable rendering: We plan to extend support to newer differentiable rendering tools, and to include functionality for additional tasks such as domain randomization, material recovery, and the like.

  3. LiDAR datasets: We plan to include several large-scale semantic and instance segmentation datasets. For example, supporting S3DIS [2] and nuScenes [3] is a high-priority task for future releases.

  4. 3D object detection: Currently, Kaolin does not have models for 3D object detection in its model zoo. This is a thrust area for future releases.

  5. Automatic Mixed Precision: To make 3D neural network architectures more compact and faster, we are investigating the applicability of Automatic Mixed Precision (AMP) to commonly used 3D architectures (PointNet, MeshCNN, Voxel U-Net, etc.). Nvidia Apex supports most AMP modes for popular 2D deep learning architectures, and we would like to investigate extending this support to 3D.

  6. Secondary light effects: Kaolin currently supports only primary lighting effects in its differentiable rendering class, which limits a renderer's ability to reason about more complex scene information such as shadows. Future releases are planned to support path tracing and ray tracing [21], so that these secondary effects are within the scope of the package.

We look forward to the 3D community trying out Kaolin, giving us feedback, and contributing to its development.


Acknowledgements

The authors would like to thank Amlan Kar for suggesting the need for this library. We also thank Ankur Handa for his advice during the initial and final stages of the project. Many thanks to Johan Philion, Daiqing Li, Mark Brophy, Jun Gao, and Huan Ling, who performed detailed internal reviews and provided constructive comments. We also thank Gavriel State for all his help during the project.


References

  • [1] (2016) Applying deep learning in augmented reality tracking. 2016 12th International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), pp. 47–54. Cited by: §1.
  • [2] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese (2017) Joint 2D-3D-semantic data for indoor scene understanding. ArXiv e-prints. External Links: 1702.01105. Cited by: item 3.
  • [3] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: item 3.
  • [4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §2.2.
  • [5] W. Chen, J. Gao, H. Ling, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler (2019) Learning to predict 3D objects with an interpolation-based differentiable renderer. NeurIPS. Cited by: §1, Figure 4, §2.4, §2.4, §2.6, Table 2.
  • [6] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun (2016) Monocular 3d object detection for autonomous driving. In CVPR, Cited by: §1.
  • [7] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In ECCV, Cited by: §2.5, §2.6.
  • [8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) Scannet: richly-annotated 3d reconstructions of indoor scenes. In CVPR, Cited by: §2.2.
  • [9] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In CVPR, Cited by: §2.5.
  • [10] M. Garon, P. Boulet, J. Doironz, L. Beaulieu, and J. Lalonde (2016) Real-time high resolution 3d data on the hololens. In 2016 IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), Cited by: §1.
  • [11] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018) AtlasNet: a papier-mâché approach to learning 3D surface generation. CVPR. Cited by: §2.6.
  • [12] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1.
  • [13] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison (2016) Gvnn: neural network library for geometric computer vision. In ECCV Workshop on Geometry Meets Deep Learning, Cited by: Table 1, §2.3.
  • [14] R. Hanocka, A. Hertz, N. Fish, R. Giryes, S. Fleishman, and D. Cohen-Or (2019) MeshCNN: a network with an edge. ACM Transactions on Graphics (TOG) 38 (4), pp. 90:1–90:12. Cited by: Figure 4, §2.1, §2.6, Table 2.
  • [15] Hao Su (2019)(Website) University of California San Diego. External Links: Link Cited by: §2.1.
  • [16] S. Ji, W. Xu, M. Yang, and K. Yu (2012) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231. Cited by: §1, §2.1.
  • [17] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3D mesh renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §1, §2.4, §2.4, §2.5, §2.6.
  • [18] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §2.1.
  • [19] J. Lalonde, N. Vandapel, D. F. Huber, and M. Hebert (2006) Natural terrain classification using three-dimensional ladar data for ground robot mobility. J. Field Robotics 23, pp. 839–861. Cited by: §1.
  • [20] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.2.
  • [21] T. Li, M. Aittala, F. Durand, and J. Lehtinen (2018) Differentiable monte-carlo ray tracing through edge sampling. In SIGGRAPH Asia 2018 Technical Papers, pp. 222. Cited by: §1, item 6.
  • [22] S. Liu, T. Li, W. Chen, and H. Li (2019) Soft rasterizer: a differentiable renderer for image-based 3d reasoning. ICCV. Cited by: §1, §2.4, §2.4, §2.6, Table 2.
  • [23] H. Maron, M. Galun, N. Aigerman, M. Trope, N. Dym, E. Yumer, V. G. Kim, and Y. Lipman (2017) Convolutional neural networks on surfaces via seamless toric covers.. ACM Trans. Graph. 36 (4), pp. 71–1. Cited by: §2.2.
  • [24] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR, Cited by: §2.6, Table 2.
  • [25] S. Milz, G. Arbeiter, C. Witt, B. Abdallah, and S. Yogamani (2018) Visual slam for automated driving: exploring the applications of deep learning. In CVPR Workshops, Cited by: §1.
  • [26] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su (2019-06) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In CVPR, Cited by: §2.2.
  • [27] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. In CVPR. Cited by: §1.
  • [28] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019-06) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, Cited by: §2.6.
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §1.
  • [30] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. CVPR. Cited by: §1, §2.1, §2.6.
  • [31] E. Riba and G. Bradski (2019) Kornia: an open source differentiable computer vision library for pytorch. In Winter Conference on Applications of Computer Vision, External Links: Link Cited by: Table 1.
  • [32] T. Roddick, A. Kendall, and R. Cipolla (2019) Orthographic feature transform for monocular 3d object detection. British Machine Vision Conference (BMVC). Cited by: §1.
  • [33] E. Smith, S. Fujimoto, A. Romero, and D. Meger (2019) GEOMetrics: exploiting geometric structure for graph-encoded objects. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5866–5876. External Links: Link. Cited by: Figure 4, §2.1, §2.5, §2.6.
  • [34] E. J. Smith and D. Meger (2017) Improved adversarial systems for 3d object generation and reconstruction. In Conference on Robot Learning, Cited by: Figure 4, §2.6.
  • [35] J. Sung, S. H. Jin, and A. Saxena (2018) Robobarista: object part based transfer of manipulation trajectories from crowd-sourcing in 3d pointclouds. In Robotics Research, pp. 701–720. Cited by: §1.
  • [36] TurboSquid: 3d models for professionals. Note: https://www.turbosquid.com/ Cited by: Figure 4.
  • [37] Universal scene description. Note: https://github.com/PixarAnimationStudios/USD Cited by: §2.2, §2.7.
  • [38] J. Valentin, C. Keskin, P. Pidlypenskyi, A. Makadia, A. Sud, and S. Bouaziz (2019) TensorFlow graphics: computer graphics meets deep learning. Cited by: Table 1.
  • [39] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2Mesh: generating 3d mesh models from single rgb images. In ECCV, Cited by: §2.5, §2.6.
  • [40] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG). Cited by: §2.1, §2.6.
  • [41] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, Cited by: §2.6.
  • [42] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In CVPR, Cited by: §2.2.
  • [43] S. Yang, D. Maturana, and S. Scherer (2016) Real-time 3d scene layout from a single image using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §1.
  • [44] J. Yu, K. Weng, G. Liang, and G. Xie (2013) A vision-based robotic grasping system using deep learning for 3D object recognition and pose estimation. 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1175–1180. Cited by: §1.