3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers

by Zai Shi, et al.

3D reconstruction aims to reconstruct 3D objects from 2D views. Previous works for 3D reconstruction mainly focus on feature matching between views or on using CNNs as backbones. Recently, Transformers have been shown to be effective in multiple applications of computer vision. However, whether or not Transformers can be used for 3D reconstruction is still unclear. In this paper, we fill this gap by proposing 3D-RETR, which is able to perform end-to-end 3D REconstruction with TRansformers. 3D-RETR first uses a pretrained Transformer to extract visual features from 2D input images. 3D-RETR then uses another Transformer Decoder to obtain the voxel features. A CNN Decoder then takes the voxel features as input to obtain the reconstructed objects. 3D-RETR is capable of 3D reconstruction from a single view or multiple views. Experimental results on two datasets show that 3D-RETR reaches state-of-the-art performance on 3D reconstruction. An additional ablation study also demonstrates that 3D-RETR benefits from using Transformers.



1 Introduction

3D reconstruction focuses on using a single or multiple 2D images of an object to rebuild its 3D representation. 3D reconstruction has played an important role in various downstream applications, including CAD [cad], human detection [human], architecture [archi], etc. These wide applications have motivated researchers to develop numerous methods for 3D reconstruction. Early works mostly use feature matching between different views of an object [slam, stereoscan, fast_matching]. However, the performance of such methods largely depends on accurate and consistent margins between different views of objects and is thus vulnerable to rapid changes between views [match_0, match_1, match_2]. Additionally, these methods are not suitable for single-view 3D reconstruction, where only one view of an object is available.

Advances in deep learning have motivated neural network-based approaches to 3D reconstruction. On the one hand, some researchers formulate 3D reconstruction as a sequence learning problem and use recurrent neural networks to solve it [3dr2n2, lsm]. On the other hand, other researchers employ an encoder-decoder architecture for 3D reconstruction [mvcon, pix2vox]. Furthermore, researchers have also used Generative Adversarial Networks (GANs) for 3D reconstruction. However, these approaches often rely on sophisticated pipelines of convolutional neural networks (CNNs) and build models with large numbers of parameters, which are computationally expensive.

Recently, Transformers [transformers] have gained attention from the computer vision community. Transformer-based models have achieved state-of-the-art performance in many downstream applications of computer vision, including image classification [vit], semantic segmentation [swin], image super-resolution [texture], etc. Despite these achievements, whether or not Transformers can be used in 3D reconstruction is still unclear.

In this paper, we propose 3D-RETR (code: https://github.com/FomalhautB/3D-RETR), which is capable of performing end-to-end single and multi-view 3D REconstruction with TRansformers. 3D-RETR uses a pretrained Transformer to extract visual features from 2D images. 3D-RETR then obtains the 3D voxel features by using another Transformer Decoder. Finally, a CNN Decoder outputs the 3D representation from the voxel features. Our contributions in this paper are threefold:

  • We propose 3D-RETR for end-to-end single and multi-view 3D reconstruction with Transformers. To the best of our knowledge, we are the first to use Transformers for end-to-end 3D reconstruction. Experimental results show that 3D-RETR reaches state-of-the-art performance under both synthetic and real-world settings.

  • We conduct additional ablation studies to understand how each part of 3D-RETR contributes to the final performance. The experimental results show that our choices of the encoder, decoder, and loss are beneficial.

  • 3D-RETR is efficient compared to previous models. 3D-RETR reaches higher performance than previous models despite using far fewer parameters.

2 Related Work

In this section, we briefly review previous works. Section 2.1 gives an overview of previous works on 3D reconstruction, Section 2.2 introduces Transformers, and Section 2.3 discusses differentiable rendering.

2.1 3D reconstruction

3D reconstruction has been widely used in various downstream applications, including architecture [archi], CAD [cad], human detection [human], etc. Researchers have mainly focused on two types of methods for 3D reconstruction. Some researchers use depth cameras such as Kinect to collect images with depth information [depth], which is subsequently processed for 3D reconstruction. However, such methods require sophisticated hardware and data collection procedures and are thus not practical in many scenarios.

To mitigate this problem, other researchers have resorted to 3D reconstruction from single or multiple views, where only 2D images are available. Early researchers leverage feature matching between different views for 3D reconstruction with 2D images. For example, [3d_match_0] uses a multi-stage parallel matching algorithm for feature matching, and [fast_matching] proposes a cascade hashing strategy for efficient image matching and 3D reconstruction. Although these methods are useful, their performance degrades when margins between different views are large, making these methods hard to generalize.

Recent works mainly focus on neural network-based approaches. Some researchers have formulated multi-view 3D reconstruction as a sequence learning problem. For example, [3dr2n2] proposes a 3D recurrent neural network, which takes one view as input at each timestep and outputs the reconstructed object representation. Others employ an encoder-decoder architecture by first encoding the 2D images into fixed-size vectors, from which a decoder decodes the 3D representations [mvcon, pix2vox, mvsuper]. Furthermore, researchers have also used Generative Adversarial Networks (GANs) [gal] and 3D-VAEs [3d_vae_0, 3d_vae_1] for 3D reconstruction. However, these neural network-based methods often rely on sophisticated pipelines of different convolutional neural networks and often have large numbers of parameters, which makes them computationally expensive.

2.2 Transformers

Transformers were first proposed for applications in natural language processing [transformers], including machine translation, language modeling, etc. Transformers use a multi-head self-attention mechanism, in which the input at each time step attends to the entire input sequence.
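
As a rough illustration of the multi-head self-attention mechanism described above, the following NumPy sketch lets every position of a sequence attend to the whole sequence. The function name, the random weights, and the toy sizes (5 positions, 12 dimensions, 3 heads) are assumptions chosen for illustration; they are not taken from 3D-RETR or from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model). Returns the output and the attention weights."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Every position scores every other position: (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

# Toy example with hypothetical sizes and random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 12, 3
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, attn.shape)
```

Each row of `attn` sums to one, so every output vector is a weighted mixture of value vectors drawn from the entire input sequence.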

Recently, Transformers have also gained attention from the computer vision community. Vision Transformer (ViT) [vit] reaches state-of-the-art performance on image classification by feeding images as patches into a Transformer. DeiT [deit] achieves better performance than ViT [vit] with much less pretraining data and a smaller parameter size. Transformers are also useful in other computer vision applications. For example, DETR [detr], consisting of a Transformer Encoder and a Transformer Decoder, has reached state-of-the-art performance on object detection. Other applications of Transformers include super-resolution [texture], semantic segmentation [swin], video understanding [video], etc.

In this paper, we propose 3D-RETR for end-to-end single and multi-view 3D reconstruction. 3D-RETR consists of a Transformer Encoder, a Transformer Decoder, and a CNN Decoder. We show in our experiments (see Section 4) that 3D-RETR reaches state-of-the-art performance while using far fewer parameters than previous models, including pix2vox++ [pix2vox++], 3D-R2N2 [3dr2n2], etc.

2.3 Differentiable Rendering

Recently, differentiable rendering methods like NeRF [nerf] and IDR [idr] have become popular. These methods implicitly represent the scene using deep neural networks and have achieved impressive results. Nevertheless, there are three main limitations in these methods for 3D reconstruction: (1) Each neural network can only represent a single scene. In this case, we need to train a new model every time we reconstruct an object from images, which might take hours to days. (2) These methods require camera positions. (3) They require large amounts of images from different views, ranging from 50 to hundreds. As a result, these methods struggle to output reasonable results when fewer images are available.

In contrast, our method, along with other previous 3D reconstruction methods including 3D-R2N2 [3dr2n2], OGN [ogn] and pix2vox [pix2vox], aims to reconstruct the volume without rendering the 2D images. 3D-RETR learns the 3D shape prior from input 2D images and generates 3D voxels at inference time.

3 Methodology

Figure 1: An illustration of our 3D-RETR. A Transformer Encoder first extracts image features from 2D images. 3D-RETR then obtains the voxel features by using another Transformer Decoder. A CNN Decoder finally outputs the 3D object representation.

From a high level, 3D-RETR consists of three main components (see Figure 1): a Transformer Encoder, a Transformer Decoder, and a CNN Decoder. The Transformer Encoder takes as input the images, which are subsequently encoded into fixed-size image feature vectors. Then, the Transformer Decoder obtains voxel features by cross-attending to the image features. Finally, the CNN Decoder decodes 3D object representations from the voxel features. Figure 1 illustrates the architecture of 3D-RETR.

In this paper, we have two variants of 3D-RETR: (1) The base model, 3D-RETR-B, has 163M parameters; (2) The smaller model, 3D-RETR-S, has 11M parameters. We describe the details of these two models in Section 4.

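We denote the $n$ input images of an object from different views as $X = \{x_1, \dots, x_n\}$ with $x_i \in \mathbb{R}^{3 \times H \times W}$, where $3$ is the number of RGB channels, and $H$ and $W$ are the height and width of the images, respectively. We denote the reconstructed voxel by $Y \in \{0, 1\}^{R \times R \times R}$ with entries $y_i$, where $i$ is the index into the voxel grid, $y_i = 0$ indicates an empty voxel grid, $y_i = 1$ indicates an occupied grid, and $R$ is the resolution of the voxel representation.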

3.1 Transformer Encoder

A Vision Transformer takes an image as input by splitting the image into patches. At each time step, the corresponding patch is embedded by first being linearly transformed into a fixed-size vector, which is then added to positional embeddings. The Transformer takes the embedded patch features as input and outputs encoded dense image feature vectors. For single-view reconstruction, we keep all the dense image vectors. For multi-view reconstruction, at each time step, we take the average across different views, and keep the averaged dense vectors.
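
The patch embedding and the multi-view averaging described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the helper names, the 32x32 image size, the patch size of 8, and the embedding width of 24 are hypothetical, and the Transformer layers themselves are omitted, so the averaging here is applied directly to the embedded patches rather than to real encoder outputs.

```python
import numpy as np

def split_into_patches(image, patch_size):
    """image: (C, H, W) -> (num_patches, C * patch_size * patch_size)."""
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(c, gh, patch_size, gw, patch_size)
    patches = patches.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, p, p)
    return patches.reshape(gh * gw, c * patch_size * patch_size)

def embed_patches(patches, w_embed, pos_embed):
    # Linear projection of each flattened patch, plus positional embeddings.
    return patches @ w_embed + pos_embed

def average_views(per_view_features):
    """(num_views, num_patches, dim) -> (num_patches, dim)."""
    return per_view_features.mean(axis=0)

# Hypothetical sizes: 3 views of 3x32x32 images, 8x8 patches, 24-dim features.
rng = np.random.default_rng(0)
num_views, patch_size, dim = 3, 8, 24
views = rng.normal(size=(num_views, 3, 32, 32))
w_embed = rng.normal(size=(3 * patch_size * patch_size, dim)) * 0.02
pos_embed = rng.normal(size=(16, dim)) * 0.02

per_view = np.stack([
    embed_patches(split_into_patches(v, patch_size), w_embed, pos_embed)
    for v in views
])
fused = average_views(per_view)
print(per_view.shape, fused.shape)
```

Because the averaged result has the same shape regardless of how many views are stacked, the stages that follow do not need to know the number of input views.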

In our implementation, we use the Data-efficient image Transformer (DeiT) [deit]. Our base model, 3D-RETR-B, uses the DeiT Base (DeiT-B) as the Transformer Encoder. DeiT-B consists of 12 layers, each of which has 12 heads and 768-dimensional hidden embeddings. The smaller model, 3D-RETR-S, uses the DeiT Tiny (DeiT-Ti) as the Transformer Encoder. DeiT-Ti has 12 layers, each of which has 3 heads and 192-dimensional hidden embeddings. Both 3D-RETR-B and 3D-RETR-S have . We feed all dense image vectors in the next stage into the Transformer Decoder, which we introduce in Section 3.2.

3.2 Transformer Decoder

The Transformer Decoder takes learned positional embeddings as its input and cross-attends to the output of the Transformer Encoder. Our Transformer Decoder is similar to that of DETR [detr], where the Transformer decodes all input vectors in parallel, instead of autoregressively as in the original Transformer [transformers].
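
To make the parallel decoding concrete, the sketch below shows a single-head cross-attention step in NumPy: a set of learned query embeddings attends to the encoder output all at once, with no causal mask and no step-by-step generation. The sizes (8 queries, 16 encoder vectors, 12 dimensions) and the random weights are assumptions for illustration only; the actual decoder stacks several multi-head layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory, w_q, w_k, w_v):
    """queries: (num_queries, d); memory: (num_memory, d)."""
    q = queries @ w_q
    k = memory @ w_k
    v = memory @ w_v
    # All queries are scored against all encoder vectors in one matrix product.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # (num_queries, num_memory)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_queries, num_memory, d = 8, 16, 12  # hypothetical sizes
# Stand-in for the learned positional embeddings that act as decoder input.
learned_queries = rng.normal(size=(num_queries, d))
# Stand-in for the image features produced by the Transformer Encoder.
encoder_output = rng.normal(size=(num_memory, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

voxel_features, attn = cross_attention(learned_queries, encoder_output, w_q, w_k, w_v)
print(voxel_features.shape, attn.shape)
```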

The 3D-RETR-B model has a Transformer Decoder of 8 layers, each of which has 12 heads and 768-dimensional hidden embeddings. For 3D-RETR-S, we use a Transformer Decoder of layers, each of which has 3 heads and 192-dimensional hidden embeddings. To enable the Transformer Decoder to understand the spatial relations between voxel features, we create positional embeddings for the Transformer Decoder. The positional embeddings are learnable and are updated during training. We use for both 3D-RETR-B and 3D-RETR-S.

3.3 CNN Decoder

Figure 2: Details of the CNN Decoder in 3D-RETR. The CNN Decoder consists of two residual blocks and three transposed 3D convolutional layers. is the hidden size of the Transformer.

The CNN Decoder takes the voxel features from the Transformer Decoder as input and outputs the voxel representation. As the Transformer Encoder and Transformer Decoder already give rich information, we use a relatively simple architecture for the CNN Decoder. The CNN Decoder first stacks the voxel feature vectors into a cube of size , and then upsamples the cube iteratively until the desired resolution is obtained.

Figure 2 illustrates the architecture of our CNN Decoder. Specifically, the CNN Decoder has two residual blocks [resnet], each consisting of three 3D convolutional layers, and three transposed 3D convolutional layers. For the residual blocks, the first two convolutional layers have a kernel size of 3, and the last one uses a kernel size of 1. In addition, all three layers have 64 channels. For the transposed 3D convolutional layers, all three layers have a kernel size of 4, a stride of 2, a channel size of 64, and a padding size of 1. We add an additional convolutional layer at the end of the CNN Decoder to compress the 64 channels into one channel. The model finally outputs cubes of size .
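
The effect of the three transposed convolutions on the cube size can be checked with the standard output-size formula for a transposed convolution (assuming no dilation and no output padding). The starting side length of 4 below is a hypothetical value chosen for illustration, not a number stated in the text.

```python
def transposed_conv_output_size(size, kernel=4, stride=2, padding=1):
    """Output side length of a transposed convolution along one axis,
    assuming dilation 1 and no output padding."""
    return (size - 1) * stride - 2 * padding + kernel

# With kernel 4, stride 2, and padding 1, each layer doubles the side length.
side = 4  # hypothetical input cube side
sizes = [side]
for _ in range(3):  # three transposed 3D convolutional layers
    side = transposed_conv_output_size(side)
    sizes.append(side)
print(sizes)
```

With these hyperparameters the formula reduces to `2 * size`, so three layers enlarge each side of the cube by a factor of 8.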

3.4 Loss Function

While previous works on 3D reconstruction mainly use the cross-entropy loss, researchers have also shown that other losses such as Dice and Focal loss are better for optimizing IoUs [dice, focal, lovasz]. Although these losses are originally proposed for 2D image tasks, they can be easily adapted to 3D tasks.

This paper uses Dice loss as the loss function, which is suitable for 3D reconstruction as the voxel occupancy is highly unbalanced. Formally, we have the Dice loss for the 3D voxels as follows:

where $p_i$ is the $i$-th predicted probability of the voxel occupation and $y_i$ is the $i$-th ground-truth voxel.
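
For readers who want something concrete, here is a minimal sketch of a common soft Dice loss over flattened voxels. It is an assumption rather than the exact formulation used for 3D-RETR, which may differ (for example by also including a term for empty voxels); the function name and the smoothing constant `eps` are hypothetical.

```python
def soft_dice_loss(pred_probs, target, eps=1e-6):
    """Soft Dice loss for flattened voxel grids.

    pred_probs: predicted occupancy probabilities in [0, 1].
    target: ground-truth occupancies, each 0 or 1.
    """
    intersection = sum(p * y for p, y in zip(pred_probs, target))
    total = sum(pred_probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)

target = [1, 0, 0, 1, 0, 0, 0, 0]  # mostly empty, as voxel grids usually are
perfect = soft_dice_loss([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0], target)
all_empty = soft_dice_loss([0.0] * 8, target)
print(perfect, all_empty)
```

Predicting "all empty" is heavily penalised here even though most voxels are empty, which is why overlap-based losses are attractive when occupancy is unbalanced.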

3.5 Optimization

To train 3D-RETR, we use the AdamW [adamw] optimizer with a learning rate of , , , and a weight decay of . The batch size is set to 16 for all the experiments. We use two RTX Titan GPUs in our experiments. Training takes 1 to 3 days, depending on the exact setting. We use mixed precision to speed up training.

4 Experiments

In this section, we present our experimental results. We evaluate 3D-RETR on ShapeNet [shapenet] and Pix3D [pix3d]. Following previous works [3dr2n2, pix2vox], we use Intersection over Union (IoU) as our evaluation metric.
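
As a reminder of how the metric is computed on voxels, the sketch below binarises predicted occupancy probabilities and divides the number of voxels occupied in both grids by the number occupied in either. The threshold of 0.5 and the function name are assumptions; the text does not state the threshold used in the experiments.

```python
def voxel_iou(pred_probs, target, threshold=0.5):
    """IoU between a thresholded prediction and a binary ground truth,
    both given as flattened voxel grids."""
    pred = [p > threshold for p in pred_probs]
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty grids are treated as a perfect match.
    return intersection / union if union else 1.0

pred_probs = [0.9, 0.8, 0.2, 0.6, 0.1]
target = [1, 1, 0, 0, 1]
iou = voxel_iou(pred_probs, target)
print(iou)  # 2 voxels in the intersection, 4 in the union -> 0.5
```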

4.1 ShapeNet

ShapeNet is a large-scale 3D object dataset consisting of object categories with 3D models. Following the setting in Pix2Vox [pix2vox], we use the same subset of categories and about models. The 3D models are pre-processed using Binvox (https://www.patrickmin.com/binvox/) with a resolution of . (We asked the authors of previous works for higher-resolution datasets. Unfortunately, the authors no longer have access to the datasets. Therefore, we cannot compare and evaluate performance at other resolutions.) The images are then rendered in the resolution of from random views.

For single-view 3D reconstruction on ShapeNet, we compare our results with previous state-of-the-art models, including 3D-R2N2 [3dr2n2], OGN [ogn], Matryoshka Networks [matryoshka], AtlasNet [atlasnet], Pixel2Mesh [pixel2mesh], OccNet [occnet], IM-Net [imnet], AttSets [attsets], and Pix2Vox++ [pix2vox++]. Table 1 shows the results. We can observe that both 3D-RETR-S and 3D-RETR-B outperform all previous models in terms of overall IoU. Additionally, 3D-RETR-S outperforms all other baselines in 6 of the 13 categories, while 3D-RETR-B is the best among other baselines in 9 of the 13 categories.

Category 3D-R2N2 OGN Matryoshka AtlasNet Pixel2Mesh OccNet IM-Net AttSets Pix2Vox++ 3D-RETR-S 3D-RETR-B
aeroplane 0.513 0.587 0.647 0.493 0.508 0.532 0.702 0.594 0.674 0.696 0.704
bench 0.421 0.481 0.577 0.431 0.379 0.597 0.564 0.552 0.608 0.643 0.650
cabinet 0.716 0.729 0.776 0.257 0.732 0.674 0.680 0.783 0.799 0.804 0.802
car 0.798 0.816 0.850 0.282 0.670 0.671 0.756 0.844 0.858 0.858 0.861
chair 0.466 0.483 0.547 0.328 0.484 0.583 0.644 0.559 0.581 0.579 0.592
display 0.468 0.502 0.532 0.457 0.582 0.651 0.585 0.565 0.548 0.576 0.574
lamp 0.381 0.398 0.408 0.261 0.399 0.474 0.433 0.445 0.457 0.463 0.483
rifle 0.544 0.593 0.616 0.573 0.468 0.656 0.723 0.601 0.721 0.665 0.668
sofa 0.628 0.646 0.681 0.354 0.622 0.669 0.694 0.703 0.725 0.729 0.735
speaker 0.662 0.637 0.701 0.296 0.672 0.655 0.683 0.721 0.617 0.719 0.724
table 0.513 0.536 0.573 0.301 0.536 0.659 0.621 0.590 0.620 0.615 0.633
telephone 0.661 0.702 0.756 0.543 0.762 0.794 0.762 0.743 0.809 0.796 0.781
watercraft 0.513 0.632 0.591 0.355 0.471 0.579 0.607 0.601 0.603 0.621 0.636
overall 0.560 0.596 0.635 0.352 0.552 0.626 0.659 0.642* 0.670* 0.674 0.680
Table 1: Results of single-view 3D reconstruction on the ShapeNet dataset. Bold is the best performance, while italic is the second best. For overall IoU, we report the mean IoU across all 13 categories. However, for entries marked with *, the overall IoU is NOT the averaged IoU across categories. We nevertheless use the original numbers from Pix2Vox++ [pix2vox++] and AttSets [attsets]. As a reference, the average IoU across categories for AttSets and Pix2Vox++ are and , respectively.
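
The difference between a reported overall IoU and a per-category mean can be checked directly from the columns of Table 1. The short script below copies four columns from the table (in the row order aeroplane to watercraft) and averages them; the variable names are illustrative.

```python
from statistics import mean

# Per-category IoU copied from Table 1, rows aeroplane ... watercraft.
table1 = {
    "AttSets": [0.594, 0.552, 0.783, 0.844, 0.559, 0.565, 0.445,
                0.601, 0.703, 0.721, 0.590, 0.743, 0.601],
    "Pix2Vox++": [0.674, 0.608, 0.799, 0.858, 0.581, 0.548, 0.457,
                  0.721, 0.725, 0.617, 0.620, 0.809, 0.603],
    "3D-RETR-S": [0.696, 0.643, 0.804, 0.858, 0.579, 0.576, 0.463,
                  0.665, 0.729, 0.719, 0.615, 0.796, 0.621],
    "3D-RETR-B": [0.704, 0.650, 0.802, 0.861, 0.592, 0.574, 0.483,
                  0.668, 0.735, 0.724, 0.633, 0.781, 0.636],
}

# Unweighted mean over the 13 categories, rounded to three decimals.
category_mean = {name: round(mean(values), 3) for name, values in table1.items()}
for name, value in category_mean.items():
    print(f"{name}: {value:.3f}")
```

With the values as printed in Table 1, the per-category means of the two 3D-RETR columns reproduce their "overall" row, while the means of the two starred columns come out lower than their starred overall entries.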

For the multi-view setting, we take the number of input 2D images and compare the performance of 3D-RETR with previous state-of-the-art models, including 3D-R2N2 [3dr2n2], AttSets [attsets], and Pix2Vox++ [pix2vox++]. As we can see in Table 2, 3D-RETR outperforms all previous works on all different views. Furthermore, Figure 3 illustrates the relation between the number of views and model performance. One can observe that the performance of 3D-RETR increases rapidly compared to other methods as more views become available. Additionally, while models like 3D-R2N2 and AttSets gradually become saturated, our best model, 3D-RETR-B, continues to benefit from more views, indicating that 3D-RETR has a higher capacity.

As 3D-RETR simply takes the average over different views in the Transformer Encoder, we can train and evaluate 3D-RETR with different numbers of views. To understand how 3D-RETR performs when the number of views differs between training and evaluation, we conduct additional experiments in which we train 3D-RETR-B with 3 views and evaluate its performance under different numbers of views. We show the results in Table 2 (see the row 3D-RETR-B (3 views)). Surprisingly, 3D-RETR-B still outperforms previous state-of-the-art models, even when the number of views during training and evaluation is different. This robustness to a mismatch between the number of training and evaluation views demonstrates that 3D-RETR is flexible.

Model 1 view 2 views 3 views 4 views 5 views 8 views 12 views 16 views 20 views
3D-R2N2 0.560 0.603 0.617 0.625 0.634 0.635 0.636 0.636 0.636
AttSets 0.642 0.662 0.670 0.675 0.677 0.685 0.688 0.692 0.693
Pix2Vox++* 0.670 0.695 0.704 0.708 0.711 0.715 0.717 0.718 0.719
3D-RETR-S 0.674 0.695 0.707 0.715 0.719 0.728 0.734 0.737 0.738
3D-RETR-B (3 views) 0.674 0.707 0.716 0.720 0.723 0.727 0.729 0.730 0.731
3D-RETR-B 0.680 0.701 0.716 0.725 0.736 0.739 0.747 0.755 0.757

Table 2: Results of multi-view 3D reconstruction on the ShapeNet dataset. Our smallest model (3D-RETR-S) already reaches state-of-the-art performance. *: As mentioned in Table 1, while all other models report mean IoU across categories, Pix2Vox++ [pix2vox++] reports their overall IoU by taking the average across all the examples. For Pix2Vox++, we cannot compute mean IoU across different categories as Pix2Vox++ does not report per-category IoU for multi-view reconstruction.
Table 3: Examples of single-view 3D reconstruction from the ShapeNet dataset. Columns, from left to right: Input, 3D-R2N2, AtlasNet, OccNet, IM-NET, AttSets, Pix2Vox++, 3D-RETR-S, 3D-RETR-B, and ground truth (GT).
Figure 3: Model performance with different views. 3D-RETR-B continues to benefit from more views, while baselines including AttSets and Pix2Vox++ become saturated.

4.2 Pix3D

Different from ShapeNet, in which all examples are synthetic, Pix3D [pix3d] is a dataset of aligned 3D models and real-world 2D images. Evaluating models on Pix3D gives a better understanding of the model performance under practical settings. Following the same setting in Pix3D [pix3d], we use the subset consisting of 2,894 untruncated and unoccluded chair images as the test set. Moreover, we follow [renderforcnn] to synthesize 60 random images for each image in the ShapeNet-Chair category and use these synthesized images as our training set.

We compare 3D-RETR with previous state-of-the-art models, including DRC [drc], 3D-R2N2 [3dr2n2], Pix3D [pix3d], and Pix2Vox++ [pix2vox++]. Table 4 shows the results. 3D-RETR-B outperforms all previous models, and 3D-RETR-S reaches comparable performance even though it is much smaller than 3D-RETR-B in terms of parameter size.

3D-R2N2 DRC Pix3D Pix2Vox++ 3D-RETR-S 3D-RETR-B
0.136 0.265 0.282 0.288 0.283 0.290

Table 4: Results of single-view reconstruction on the Pix3D-Chair dataset.

Table 5: Example outputs of 3D-RETR on single-view 3D reconstruction from the Pix3D dataset. We do not show examples from baseline models as none of the baselines have released their implementation on Pix3D.

4.3 Ablation Study

Name Encoder First Decoder Second Decoder Loss IoU
3D-RETR-B Base Base CNN Dice 0.680
3D-RETR-S Small Small CNN Dice 0.674
Setup 1 Base Tiny CNN Dice 0.667
Setup 2 Base (w/o pre.) Base CNN Dice 0.279
Setup 3 ResNet-50 Base CNN Dice 0.670
Setup 4 Base Base VQ-VAE - 0.598
Setup 5 Base Base CNN CE 0.668
Setup 6 Base Base MLP Dice 0.658

Table 6: Ablation Study. We ablate 3D-RETR by using different encoders, decoders, and loss functions.

We ablate 3D-RETR by using different Transformer Encoders, Transformer Decoders, CNN Decoders, and loss functions. Table 6 shows the results of our ablation study. Specifically, we discuss the following model variants:

  • Setup 1: One might think that the Transformer Decoder is redundant, as the Transformer Encoder and CNN Decoder are already available. We show that the Transformer Decoder is necessary by replacing it with a tiny Transformer Decoder. The tiny Transformer Decoder has only layer and head, which serves only as a simple mapping between the outputs of the Transformer Encoder and the input of the CNN Decoder. We can see that the performance decreases from 0.680 to 0.667 after using the tiny Transformer Decoder.

  • Setup 2: Pretraining for the Transformer Encoder is crucial, since Transformers need large amounts of data to gain prior knowledge about images. In this setup, the performance of 3D-RETR drops from 0.680 to 0.279 without pretraining.

  • Setup 3: We show the advantage of the Transformer Encoder over a CNN Encoder by replacing the Transformer Encoder with a pretrained ResNet-50 [resnet]. After the replacement, the model performance decreases from 0.680 to 0.670.

  • Setup 4: Previous studies, including VQ-VAE [vqvae], VQ-GAN [vqgan], and DALL-E [dalle], have employed a two-stage approach for generating images. We apply a similar approach to 3D-RETR by first training a 3D VQ-VAE [vqvae] and replacing the CNN Decoder with the VQ-VAE Decoder. In this setting, the Transformer Decoder decodes autoregressively. The training process for this variant is also different from the standard 3D-RETR. We first generate the discretized features using ground-truth voxels and the VQ-VAE Encoder. These discretized features are then used as the ground truth for the Transformer Decoder. During evaluation, the Transformer Decoder generates the discretized features one by one and then feeds them into the VQ-VAE Decoder. Table 6 shows that this two-stage approach (0.598) does not perform as well as our single-stage setup (0.680).

  • Setup 5: To understand how loss functions affect model performance, we train a 3D-RETR-B with the standard cross-entropy loss. From Table 6, we can see that replacing Dice loss with cross-entropy loss degrades performance from 0.680 to 0.668, indicating that Dice loss is the better choice for 3D-RETR.

  • Setup 6: We replace the CNN Decoder with a simple one-layer MLP, so the model becomes a pure Transformer model. The performance (0.658) is not as good as that of the original model with the CNN Decoder (0.680), but it remains comparable.

We give further comparisons of parameter size and model performance in Table 7. Although our 3D-RETR-S is smaller than previous state-of-the-art models, it still reaches better performance. Furthermore, 3D-RETR-B outperforms 3D-RETR-S, showing that increasing the parameter size is helpful for 3D-RETR.

Model 3D-R2N2 OGN Matryoshka Pix2Vox++ 3D-RETR-S 3D-RETR-B
#Parameter 36M 12M 46M 98M 11M 163M
IoU 0.560 0.596 0.635 0.670* 0.674 0.680

Table 7: Parameter size and performance comparison between 3D-RETR and other baseline models. 3D-RETR-S reaches better performance with fewer parameters. *See Table 1 and Table 2.

5 Conclusion

Although Transformers have been widely used in various applications in computer vision [vit, detr, texture], whether or not Transformers can be used for single and multi-view 3D reconstruction remains unclear. In this paper, we fill this gap by proposing 3D-RETR, which is capable of performing end-to-end single and multi-view 3D REconstruction with TRansformers. 3D-RETR consists of a Transformer Encoder, a Transformer Decoder, and a CNN Decoder. Experimental results show that 3D-RETR reaches state-of-the-art performance on 3D reconstruction under both synthetic and real-world settings. 3D-RETR is more efficient than previous models [pix2vox, ogn, matryoshka], as 3D-RETR reaches better performance with far fewer parameters. In the future, we plan to improve 3D-RETR by using other variants of Transformers, including Performer [performer], Reformer [reformer], etc.


Appendix A 3D-RETR with VQ-VAE

We describe in detail the VQ-VAE setting in our ablation study of Section 4.3 (see Figure 4). We train 3D-RETR with VQ-VAE in two separate stages.

In the first stage, we pretrain a VQ-VAE with a codebook size of 2048, where each codebook vector has dimensions. The VQ-VAE Encoder and Decoder each have three layers. For the VQ-VAE Decoder, we use the same residual blocks as in the CNN Decoder. The VQ-VAE Encoder encodes the voxel into a discrete sequence of length 64, where each element in the sequence is an integer between 0 and 2047. The VQ-VAE is trained with cross-entropy loss. The reconstruction IoU is about 0.885.

In the second stage, for every input image $x$ and its corresponding ground-truth voxel $Y$, we first generate a discrete sequence $z$ using the pretrained VQ-VAE Encoder. Then, the Transformer Encoder generates the hidden representation for the input image $x$, and the Transformer Decoder uses the output of the Transformer Encoder to generate another discrete sequence $\hat{z}$. To generate $\hat{z}$, we use a linear layer with softmax at the output of the Transformer Decoder. We use the sequence $z$ as the ground truth and train the Transformer Encoder and Decoder with cross-entropy loss to generate $\hat{z}$, which should be as close as possible to $z$.
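
The discretisation step of a VQ-VAE can be illustrated with a nearest-neighbour codebook lookup. The sketch below uses the codebook size (2048) and sequence length (64) stated above; the code dimension of 16, the random codebook, and the function name are assumptions for illustration, since the real codebook is learned and its dimension is not given here.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook vector.

    features: (seq_len, dim); codebook: (num_codes, dim).
    Returns integer indices of shape (seq_len,).
    """
    # Squared Euclidean distance between every feature and every code.
    diff = features[:, None, :] - codebook[None, :, :]
    distances = (diff ** 2).sum(axis=-1)  # (seq_len, num_codes)
    return distances.argmin(axis=1)

rng = np.random.default_rng(0)
num_codes, seq_len, dim = 2048, 64, 16  # dim is a hypothetical value
codebook = rng.normal(size=(num_codes, dim))
encoder_features = rng.normal(size=(seq_len, dim))  # stand-in for encoder output

indices = quantize(encoder_features, codebook)
print(indices.shape, indices.min(), indices.max())
```

The resulting integer sequence plays the role of the ground-truth target that the Transformer Decoder is trained to predict with cross-entropy loss.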

Figure 4: 3D-RETR with VQ-VAE. This corresponds to Setup 4 of our ablation study.

Appendix B Additional Examples

We show more examples of the ShapeNet dataset and the Pix3D dataset from our 3D-RETR-B model. Table 8 shows additional examples of the Pix3D dataset. Table 9 shows examples from the ShapeNet dataset with different numbers of views as inputs. We can see a clear quality improvement when more views become available.

Table 8: Examples from the Pix3D dataset. All predictions are generated by 3D-RETR-B.
Table 9: Examples from the ShapeNet dataset. All predictions are generated by 3D-RETR-B.

Appendix C Model Performance with Different Views

In Table 2 of the paper, we show that 3D-RETR trained on three views still outperforms previous state-of-the-art results even when evaluated under different numbers of input views. In Table 10 and Figure 5, we give additional results on training and evaluating under different numbers of views. We can observe that more views during evaluation can boost model performance. Another observation is that models trained with more views are not necessarily better than models trained with fewer views, especially when the number of views available during evaluation is far fewer than the number of available views during training. For example, when only one view is available, the model trained with one view reaches an IoU of 0.680, while the model trained with 20 views only reaches an IoU of 0.534.

Figure 5: Model performance with different views.
Train \ Eval 1 view 2 views 3 views 4 views 5 views 8 views 12 views 16 views 20 views
1 view 0.680 0.688 0.688 0.687 0.687 0.686 0.686 0.685 0.684
2 views 0.676 0.701 0.709 0.711 0.713 0.716 0.718 0.719 0.720
3 views 0.674 0.707 0.716 0.720 0.723 0.729 0.729 0.730 0.731
4 views 0.674 0.711 0.721 0.725 0.728 0.731 0.734 0.735 0.736
5 views 0.667 0.712 0.724 0.729 0.734 0.738 0.741 0.743 0.743
8 views 0.634 0.699 0.719 0.726 0.732 0.739 0.742 0.745 0.746
12 views 0.606 0.691 0.714 0.724 0.733 0.742 0.747 0.750 0.751
16 views 0.588 0.687 0.713 0.726 0.735 0.745 0.752 0.755 0.757
20 views 0.534 0.657 0.694 0.712 0.727 0.742 0.750 0.755 0.757
Table 10: Model performance with different views during training and evaluation. Bold indicates the best performance in an evaluation setting.
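
Since the bold markers are not visible in this text version, the best training setting for each evaluation setting can be recovered from the numbers in Table 10. The script below copies the table as printed (rows are training views, columns are evaluation views) and picks the row with the highest IoU in each column; ties are resolved in favour of the smaller number of training views, which is an arbitrary choice made here.

```python
views = [1, 2, 3, 4, 5, 8, 12, 16, 20]

# Rows: number of training views. Columns: number of evaluation views.
table10 = {
    1:  [0.680, 0.688, 0.688, 0.687, 0.687, 0.686, 0.686, 0.685, 0.684],
    2:  [0.676, 0.701, 0.709, 0.711, 0.713, 0.716, 0.718, 0.719, 0.720],
    3:  [0.674, 0.707, 0.716, 0.720, 0.723, 0.729, 0.729, 0.730, 0.731],
    4:  [0.674, 0.711, 0.721, 0.725, 0.728, 0.731, 0.734, 0.735, 0.736],
    5:  [0.667, 0.712, 0.724, 0.729, 0.734, 0.738, 0.741, 0.743, 0.743],
    8:  [0.634, 0.699, 0.719, 0.726, 0.732, 0.739, 0.742, 0.745, 0.746],
    12: [0.606, 0.691, 0.714, 0.724, 0.733, 0.742, 0.747, 0.750, 0.751],
    16: [0.588, 0.687, 0.713, 0.726, 0.735, 0.745, 0.752, 0.755, 0.757],
    20: [0.534, 0.657, 0.694, 0.712, 0.727, 0.742, 0.750, 0.755, 0.757],
}

best_for_eval = {}
for col, n_eval in enumerate(views):
    # max() returns the first maximal element, i.e. the fewest training views.
    best_train = max(views, key=lambda n_train: table10[n_train][col])
    best_for_eval[n_eval] = (best_train, table10[best_train][col])

for n_eval, (n_train, iou) in best_for_eval.items():
    print(f"eval with {n_eval:2d} views: best trained with {n_train:2d} views (IoU {iou:.3f})")
```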