1 Introduction

3D reconstruction focuses on using a single or multiple 2D images of an object to rebuild its 3D representation. 3D reconstruction plays an important role in various downstream applications, including CAD [cad], human detection [human], architecture [archi], etc. These wide applications have motivated researchers to develop numerous methods for 3D reconstruction. Early works mostly use feature matching between different views of an object [slam, stereoscan, fast_matching]. However, the performance of such methods largely depends on accurate and consistent margins between different views of objects, and these methods are thus vulnerable to rapid changes between views [match_0, match_1, match_2]. Additionally, these methods are not suitable for single-view 3D reconstruction, where only one view of an object is available.
With the development of deep learning, researchers have proposed various neural network-based approaches. On the one hand, some researchers formulate 3D reconstruction as a sequence learning problem and use recurrent neural networks to solve it [3dr2n2, lsm]. On the other hand, other researchers employ the encoder-decoder architecture for 3D reconstruction [mvcon, pix2vox]. Furthermore, researchers have also used Generative Adversarial Networks (GANs) for 3D reconstruction [gal]. However, these approaches often rely on sophisticated pipelines of convolutional neural networks (CNNs) and build models with large numbers of parameters, which are computationally expensive.
Recently, Transformers [transformers] have gained attention from the computer vision community. Transformer-based models have achieved state-of-the-art performance in many downstream applications of computer vision, including image classification [vit], semantic segmentation [swin], image super-resolution [texture], etc. Despite these achievements, whether Transformers can be used for 3D reconstruction remains unclear.
In this paper, we propose 3D-RETR (code available at https://github.com/FomalhautB/3D-RETR), which is capable of performing end-to-end single and multi-view 3D REconstruction with TRansformers. 3D-RETR uses a pretrained Transformer to extract visual features from 2D images. It then obtains 3D voxel features with another Transformer Decoder. Finally, a CNN Decoder outputs the 3D representation from the voxel features. Our contributions in this paper are threefold:
We propose 3D-RETR for end-to-end single and multi-view 3D reconstruction with Transformers. To the best of our knowledge, we are the first to use Transformers for end-to-end 3D reconstruction. Experimental results show that 3D-RETR reaches state-of-the-art performance under both synthetic and real-world settings.
We conduct additional ablation studies to understand how each part of 3D-RETR contributes to the final performance. The experimental results show that our choices of the encoder, decoder, and loss are beneficial.
3D-RETR is efficient compared to previous models: it reaches higher performance even though it uses far fewer parameters.
2 Related Work
2.1 3D reconstruction
3D reconstruction has been widely used in various downstream applications, including architecture [archi], CAD [cad], human detection [human], etc. Researchers have mainly focused on two types of methods for 3D reconstruction. Some researchers use depth cameras such as Kinect to collect images with depth information [depth], which is subsequently processed for 3D reconstruction. However, such methods require sophisticated hardware and data collection procedures and are thus not practical in many scenarios.
To mitigate this problem, other researchers have resorted to 3D reconstruction from single or multiple views, where only 2D images are available. Early researchers leverage feature matching between different views for 3D reconstruction with 2D images. For example, [3d_match_0] uses a multi-stage parallel matching algorithm for feature matching, and [fast_matching] proposes a cascade hashing strategy for efficient image matching and 3D reconstruction. Although these methods are useful, their performance degrades when margins between different views are large, making these methods hard to generalize.
Recent works mainly focus on neural network-based approaches. Some researchers have formulated multi-view 3D reconstruction as a sequence learning problem. For example, [3dr2n2] proposes a 3D recurrent neural network, which takes one view as input at each time step and outputs the reconstructed object representation. Others employ an encoder-decoder architecture, first encoding the 2D images into fixed-size vectors, from which a decoder then decodes the 3D representations [mvcon, pix2vox, mvsuper]. Furthermore, researchers have also used Generative Adversarial Networks (GANs) [gal] and 3D-VAEs [3d_vae_0, 3d_vae_1] for 3D reconstruction. However, these neural network-based methods often rely on sophisticated pipelines of different convolutional neural networks and build models with large numbers of parameters, which are computationally expensive.
2.2 Transformers

Transformers were first proposed for applications in natural language processing [transformers], including machine translation, language modeling, etc. Transformers use a multi-head self-attention mechanism, in which the input at a specific time step attends to the entire input sequence.
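To make the mechanism concrete, the sketch below (ours, not code from any cited model) computes single-head scaled dot-product self-attention over a toy sequence; for clarity it uses identity projections in place of learned query/key/value weight matrices:

```python
import numpy as np

def self_attention(x):
    """Minimal single-head self-attention: every position attends to all others.

    x: (seq_len, d) input features. Identity projections stand in for the
    learned query/key/value matrices of a real Transformer.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (seq_len, seq_len) logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ x                              # each output mixes all inputs

x = np.random.RandomState(0).randn(5, 8)
out = self_attention(x)
print(out.shape)  # (5, 8): same shape, but each row now depends on the whole sequence
```

Multi-head attention runs several such maps in parallel on learned projections of `x` and concatenates the results.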
Recently, Transformers have also gained attention from the computer vision community. In image classification, the Vision Transformer (ViT) [vit] reaches state-of-the-art performance by feeding images as patches into a Transformer, and DeiT [deit] achieves better performance than ViT [vit] with much less pretraining data and a smaller parameter size. Transformers are also useful in other computer vision applications. For example, DETR [detr], consisting of a Transformer Encoder and a Transformer Decoder, has reached state-of-the-art performance on object detection. Other applications of Transformers include super-resolution [texture], semantic segmentation [swin], video understanding [video], etc.
In this paper, we propose 3D-RETR for end-to-end single and multi-view 3D reconstruction. 3D-RETR consists of a Transformer Encoder, a Transformer Decoder, and a CNN Decoder. We show in our experiments (see Section 4) that 3D-RETR reaches state-of-the-art performance while using far fewer parameters than previous models, including Pix2Vox++ [pix2vox++], 3D-R2N2 [3dr2n2], etc.
2.3 Differentiable Rendering
Recently, differentiable rendering methods like NeRF [nerf] and IDR [idr] have become popular. These methods implicitly represent the scene using deep neural networks and have achieved impressive results. Nevertheless, there are three main limitations in these methods for 3D reconstruction: (1) Each neural network can only represent a single scene. In this case, we need to train a new model every time we reconstruct an object from images, which might take hours to days. (2) These methods require camera positions. (3) They require large amounts of images from different views, ranging from 50 to hundreds. As a result, these methods struggle to output reasonable results when fewer images are available.
In contrast, our method, along with other previous 3D reconstruction methods including 3D-R2N2 [3dr2n2], OGN [ogn] and pix2vox [pix2vox], aims to reconstruct the volume without rendering the 2D images. 3D-RETR learns the 3D shape prior out of input 2D images and generates 3D-voxels during the inference time.
3 3D-RETR

At a high level, 3D-RETR consists of three main components (see Figure 1): a Transformer Encoder, a Transformer Decoder, and a CNN Decoder. The Transformer Encoder takes the images as input and encodes them into fixed-size image feature vectors. The Transformer Decoder then obtains voxel features by cross-attending to the image features. Finally, the CNN Decoder decodes 3D object representations from the voxel features.
In this paper, we have two variants of 3D-RETR: (1) The base model, 3D-RETR-B, has 163M parameters; (2) The smaller model, 3D-RETR-S, has 11M parameters. We describe the details of these two models in Section 4.
We denote the input images of an object from different views as $\mathcal{I} = \{I_1, I_2, \dots\}$, where each $I_i \in \mathbb{R}^{3 \times H \times W}$, $3$ is the number of RGB channels, and $H$ and $W$ are the height and width of the images, respectively. We denote the reconstructed voxels by $\hat{V} \in \{0, 1\}^{R \times R \times R}$, where $i$ indexes the voxel grids, $\hat{V}_i = 0$ indicates an empty voxel grid, $\hat{V}_i = 1$ indicates an occupied grid, and $R$ is the resolution of the voxel representation.
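To make the notation concrete, here is a minimal NumPy sketch of the binary occupancy grid $\hat{V}$; the resolution $R = 32$ and the toy object are illustrative choices, not values from this section:

```python
import numpy as np

R = 32  # illustrative voxel resolution

# A reconstructed object is a binary occupancy grid: 0 = empty, 1 = occupied.
voxels = np.zeros((R, R, R), dtype=np.uint8)
voxels[8:24, 8:24, 8:24] = 1  # toy "object": a solid cube in the middle

occupancy = voxels.mean()
print(voxels.shape, occupancy)  # (32, 32, 32) 0.125 -- most voxels are empty
```

The low occupancy ratio here previews why the loss choice in Section 3.4 matters: occupied and empty voxels are highly unbalanced.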
3.1 Transformer Encoder
A Vision Transformer takes an image as input by splitting it into fixed-size patches. Each patch is embedded by first linearly transforming it into a fixed-size vector, to which positional embeddings are added. The Transformer takes the embedded patch features as input and outputs encoded dense image feature vectors. For single-view reconstruction, we keep all the dense image vectors. For multi-view reconstruction, we take the average across different views at each time step and keep the averaged dense vectors.
In our implementation, we use the Data-efficient image Transformer (DeiT) [deit]. Our base model, 3D-RETR-B, uses DeiT Base (DeiT-B) as the Transformer Encoder. DeiT-B consists of 12 layers, each of which has 12 heads and 768-dimensional hidden embeddings. The smaller model, 3D-RETR-S, uses DeiT Tiny (DeiT-Ti) as the Transformer Encoder. DeiT-Ti has 12 layers, each of which has 3 heads and 192-dimensional hidden embeddings. Both 3D-RETR-B and 3D-RETR-S use a patch size of 16×16. We feed all dense image vectors into the Transformer Decoder in the next stage, which we introduce in Section 3.2.
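The encoder stage, including the multi-view averaging, can be sketched in PyTorch as follows. This is a toy stand-in for DeiT, not the paper's implementation: the layer count, patch size, and hidden dimension are illustrative (we reuse DeiT-Ti-like sizes), and the class name is ours.

```python
import torch
import torch.nn as nn

class ToyViTEncoder(nn.Module):
    """Stand-in for DeiT: patchify, linearly embed, add positional embeddings,
    run a Transformer encoder, and return dense patch features."""
    def __init__(self, img_size=224, patch=16, dim=192, layers=2, heads=3):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided conv is the standard way to linearly embed image patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, views):                  # views: (V, 3, H, W), V = #views
        x = self.patch_embed(views)            # (V, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)       # (V, n_patches, dim)
        x = self.encoder(x + self.pos_embed)   # dense per-view features
        return x.mean(dim=0, keepdim=True)     # average across views

enc = ToyViTEncoder()
feats = enc(torch.randn(3, 3, 224, 224))       # three views of one object
print(feats.shape)  # torch.Size([1, 196, 192])
```

Because the views are simply averaged, the same model accepts any number of input views at both training and evaluation time, which is the property exploited in Section 4.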
3.2 Transformer Decoder
The Transformer Decoder takes learned positional embeddings as its input and cross-attends to the output of the Transformer Encoder. Our Transformer Decoder is similar to that of DETR [detr], where the Transformer decodes all input vectors in parallel, instead of autoregressively as in the original Transformer [transformers].
The 3D-RETR-B model has a Transformer Decoder of 8 layers, each of which has 12 heads and 768-dimensional hidden embeddings. For 3D-RETR-S, each Transformer Decoder layer has 3 heads and 192-dimensional hidden embeddings. To enable the Transformer Decoder to understand the spatial relations between voxel features, we create positional embeddings for the Transformer Decoder. These positional embeddings are learnable and are updated during training. Both 3D-RETR-B and 3D-RETR-S use the same number of positional embeddings.
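The DETR-style parallel decoding can be sketched as follows. The query count of 512 (= 8³, matching the cube the CNN Decoder stacks in Section 3.3) and the small layer sizes are assumptions for illustration; the class name is ours.

```python
import torch
import torch.nn as nn

class VoxelQueryDecoder(nn.Module):
    """Sketch of the non-autoregressive decoder: a fixed set of learned
    positional embeddings ("voxel queries") cross-attends to image features."""
    def __init__(self, n_queries=512, dim=192, layers=2, heads=3):
        super().__init__()
        # Learnable positional embeddings, updated during training.
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)

    def forward(self, image_feats):            # image_feats: (1, n_patches, dim)
        # All queries are decoded in parallel (no causal mask), as in DETR,
        # rather than autoregressively as in the original Transformer.
        return self.decoder(self.queries, image_feats)

dec = VoxelQueryDecoder()
voxel_feats = dec(torch.randn(1, 196, 192))    # encoder output as memory
print(voxel_feats.shape)  # torch.Size([1, 512, 192])
```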
3.3 CNN Decoder
The CNN Decoder takes the voxel features from the Transformer Decoder as input and outputs the voxel representation. As the Transformer Encoder and Transformer Decoder already provide rich information, we use a relatively simple architecture for the CNN Decoder. The CNN Decoder first stacks the voxel feature vectors into a cube and then upsamples the cube iteratively until the desired resolution is reached.
Figure 2 illustrates the architecture of our CNN Decoder. Specifically, the CNN Decoder has two residual blocks [resnet] together with transposed 3D convolutional layers for upsampling. In each residual block, the first two convolutional layers have a kernel size of 3 and the last one has a kernel size of 1; all three layers have 64 channels. The transposed 3D convolutional layers each have a kernel size of 4, a stride of 2, 64 channels, and a padding size of 1. We add an additional convolutional layer at the end of the CNN Decoder to compress the 64 channels into one channel. The model finally outputs cubes of size 32×32×32.
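A minimal sketch of this upsampling path, assuming the 512 voxel features are stacked into an 8×8×8 cube and omitting the residual blocks for brevity (the layer arrangement and channel sizes are illustrative, not the exact architecture of Figure 2); note how kernel 4 / stride 2 / padding 1 exactly doubles each spatial dimension:

```python
import torch
import torch.nn as nn

class ToyCNNDecoder(nn.Module):
    """Sketch: stack voxel features into a cube, upsample with transposed
    3D convolutions, then compress to a single occupancy channel."""
    def __init__(self, dim=192):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(dim, 64, kernel_size=4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(64, 64, kernel_size=4, stride=2, padding=1),   # 16 -> 32
            nn.ReLU(),
            nn.Conv3d(64, 1, kernel_size=1),  # compress 64 channels to one
        )

    def forward(self, voxel_feats):                    # (1, n_queries, dim)
        b, n, d = voxel_feats.shape
        s = round(n ** (1 / 3))                        # 512 queries -> side 8
        cube = voxel_feats.transpose(1, 2).reshape(b, d, s, s, s)
        return torch.sigmoid(self.net(cube))           # occupancy probabilities

dec = ToyCNNDecoder()
out = dec(torch.randn(1, 512, 192))
print(out.shape)  # torch.Size([1, 1, 32, 32, 32])
```

The sigmoid maps the output to per-voxel occupancy probabilities, which the loss in Section 3.4 consumes directly.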
3.4 Loss Function
While previous works on 3D reconstruction mainly use the cross-entropy loss, researchers have also shown that other losses such as Dice and Focal loss are better for optimizing IoUs [dice, focal, lovasz]. Although these losses are originally proposed for 2D image tasks, they can be easily adapted to 3D tasks.
This paper uses the Dice loss as the loss function, which is suitable for 3D reconstruction because voxel occupancy is highly unbalanced. Formally, the Dice loss for the 3D voxels is:

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2 \sum_{i} p_i \, g_i}{\sum_{i} p_i + \sum_{i} g_i}$$

where $p_i$ is the $i$-th predicted probability of voxel occupancy and $g_i$ is the $i$-th ground-truth voxel.
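A NumPy sketch of the standard Dice loss over flattened voxel grids (the epsilon smoothing term is a common implementation detail we add for numerical stability, not something specified in this section):

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    """Dice loss over flattened voxel grids.

    p: predicted occupancy probabilities in [0, 1]; g: binary ground truth.
    Because the loss only involves sums over occupied regions, the large
    number of empty voxels does not dominate it the way it can dominate
    cross-entropy, which is why Dice suits unbalanced occupancy grids.
    """
    p, g = p.ravel(), g.ravel()
    return 1.0 - (2.0 * (p * g).sum() + eps) / (p.sum() + g.sum() + eps)

g = np.zeros(32 ** 3)
g[:100] = 1.0                         # mostly empty ground truth
print(dice_loss(g, g))                # perfect prediction -> loss near 0
print(dice_loss(1.0 - g, g))          # inverted prediction -> loss near 1
```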
To train 3D-RETR, we use the AdamW [adamw] optimizer. The batch size is set to 16 for all experiments. We use two RTX Titan GPUs in our experiments; training takes 1 to 3 days, depending on the exact setting, and we use mixed precision to speed up training.
4 Experiments

We show our experimental results in this section. We evaluate 3D-RETR on ShapeNet [shapenet] and Pix3D [pix3d]. Following previous works [3dr2n2, pix2vox], we use Intersection over Union (IoU) as our evaluation metric.
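For concreteness, voxel IoU between a predicted probability grid and a binary ground-truth grid can be computed as follows (a sketch; thresholding the prediction at 0.5 is a conventional choice, not a value specified here):

```python
import numpy as np

def voxel_iou(pred, gt, threshold=0.5):
    """Intersection over Union between a predicted occupancy grid
    (probabilities) and a binary ground-truth grid."""
    p = pred >= threshold                      # binarize the prediction
    g = gt.astype(bool)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else 1.0

gt = np.zeros((4, 4, 4))
gt[:2] = 1                                     # 32 occupied voxels
pred = np.zeros((4, 4, 4))
pred[:3] = 0.9                                 # 48 voxels predicted occupied
print(voxel_iou(pred, gt))                     # 32 / 48 = 0.666...
```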
4.1 ShapeNet

ShapeNet is a large-scale dataset of 3D object models. Following the setting in Pix2Vox [pix2vox], we use the same subset of 13 categories. The 3D models are pre-processed using Binvox (https://www.patrickmin.com/binvox/) with a resolution of 32×32×32. (We asked the authors of previous works for higher-resolution datasets; unfortunately, they no longer have access to them, so we cannot compare and evaluate at other resolutions.) The images are then rendered from random views.
For single-view 3D reconstruction on ShapeNet, we compare our results with previous state-of-the-art models, including 3D-R2N2 [3dr2n2], OGN [ogn], Matryoshka Networks [matryoshka], AtlasNet [atlasnet], Pixel2Mesh [pixel2mesh], OccNet [occnet], IM-Net [imnet], AttSets [attsets], and Pix2Vox++ [pix2vox++]. Table 1 shows the results. Both 3D-RETR-S and 3D-RETR-B outperform all previous models in terms of overall IoU. Additionally, 3D-RETR-S outperforms all baselines in 6 of the 13 categories, while 3D-RETR-B performs best in 9 of the 13 categories.
For the multi-view setting, we vary the number of input 2D images and compare the performance of 3D-RETR with previous state-of-the-art models, including 3D-R2N2 [3dr2n2], AttSets [attsets], and Pix2Vox++ [pix2vox++]. As Table 2 shows, 3D-RETR outperforms all previous works for all numbers of views. Furthermore, Figure 3 illustrates the relation between the number of views and model performance: the performance of 3D-RETR increases rapidly compared to other methods as more views become available. Additionally, while models like 3D-R2N2 and AttSets gradually saturate, our best model, 3D-RETR-B, continues to benefit from more views, indicating that 3D-RETR has a higher capacity.
As 3D-RETR simply takes the average over different views in the Transformer Encoder, we can train and evaluate 3D-RETR with different numbers of views. To understand how 3D-RETR performs when the number of views differs between training and evaluation, we conduct additional experiments in which we train 3D-RETR-B with 3 views and evaluate it under different numbers of views. We show the results in Table 2 (see the row for 3D-RETR-B (3 views)). Surprisingly, 3D-RETR-B still outperforms previous state-of-the-art models even when the number of views during training and evaluation differs, demonstrating that 3D-RETR is flexible.
| Model | 1 view | 2 views | 3 views | 4 views | 5 views | 8 views | 12 views | 16 views | 20 views |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3D-RETR-B (3 views) | 0.674 | 0.707 | 0.716 | 0.720 | 0.723 | 0.727 | 0.729 | 0.730 | 0.731 |
4.2 Pix3D

Different from ShapeNet, in which all examples are synthetic, Pix3D [pix3d] is a dataset of aligned 3D models and real-world 2D images. Evaluating models on Pix3D gives a better understanding of model performance in practical settings. Following the same setting as Pix3D [pix3d], we use the subset consisting of 2,894 untruncated and unoccluded chair images as the test set. Moreover, we follow [renderforcnn] to synthesize 60 random images for each model in the ShapeNet-Chair category and use these synthesized images as our training set.
We compare 3D-RETR with previous state-of-the-art models, including DRC [drc], 3D-R2N2 [3dr2n2], Pix3D [pix3d], and Pix2Vox++ [pix2vox++]. Table 4 shows the results. 3D-RETR-B outperforms all previous models, and 3D-RETR-S reaches comparable performance even though it is much smaller than 3D-RETR-B in parameter size.
4.3 Ablation Study
| Name | Encoder | First Decoder | Second Decoder | Loss | IoU |
| --- | --- | --- | --- | --- | --- |
| Setup 2 | Base (w/o pre.) | Base | CNN | Dice | 0.279 |
We ablate 3D-RETR by using different Transformer Encoders, Transformer Decoders, CNN Decoders, and loss functions. Table 6 shows the results of our ablation study. Specifically, we discuss the following model variants:
Tiny Transformer Decoder: One might think that the Transformer Decoder is redundant, as the Transformer Encoder and CNN Decoder are already available. We show that the Transformer Decoder is necessary by replacing it with a tiny Transformer Decoder with only one layer and one head, which serves merely as a simple mapping between the outputs of the Transformer Encoder and the inputs of the CNN Decoder. The performance decreases from 0.680 to 0.667 with the tiny Transformer Decoder.
No pretraining: Pretraining the Transformer Encoder is crucial, since Transformers need large amounts of data to gain prior knowledge of images. In this setup, we observe that the performance of 3D-RETR decreases significantly without pretraining.
CNN Encoder: We show the advantage of the Transformer Encoder over a CNN Encoder by replacing the Transformer Encoder with a pretrained ResNet-50 [resnet]. After this replacement, the model performance decreases from 0.680 to 0.670.
Two-stage VQ-VAE: Previous studies, including VQ-VAE [vqvae], VQ-GAN [vqgan], and DALL-E [dalle], have employed a two-stage approach for generating images. We adopt a similar approach for 3D-RETR by first training a 3D VQ-VAE [vqvae] and replacing the CNN Decoder with the VQ-VAE Decoder. In this setting, the Transformer Decoder decodes autoregressively, and the training process differs from that of the standard 3D-RETR: we first generate discretized features from the ground-truth voxels with the VQ-VAE Encoder and use them as the ground truth for the Transformer Decoder. During evaluation, the Transformer Decoder generates the discretized features one by one and feeds them into the VQ-VAE Decoder. Table 6 shows that this two-stage approach does not perform as well as our single-stage setup.
MLP Decoder: We replace the CNN Decoder with a simple one-layer MLP, so that the model becomes a pure Transformer model. The performance is not as good as with the original CNN Decoder, but is still comparable.
We give further comparisons of parameter size and model performance in Table 7. Although 3D-RETR-S is smaller than previous state-of-the-art models, it still reaches better performance. Furthermore, 3D-RETR-B outperforms 3D-RETR-S, showing that increasing the parameter size is helpful for 3D-RETR.
5 Conclusion

Although Transformers have been widely used in various computer vision applications [vit, detr, texture], whether Transformers can be used for single and multi-view 3D reconstruction has remained unclear. In this paper, we fill this gap by proposing 3D-RETR, which is capable of performing end-to-end single and multi-view 3D REconstruction with TRansformers. 3D-RETR consists of a Transformer Encoder, a Transformer Decoder, and a CNN Decoder. Experimental results show that 3D-RETR reaches state-of-the-art performance on 3D reconstruction under both synthetic and real-world settings. 3D-RETR is also more efficient than previous models [pix2vox, ogn, matryoshka], as it reaches better performance with far fewer parameters. In the future, we plan to improve 3D-RETR by using other variants of Transformers, including Performer [performer], Reformer [reformer], etc.
Appendix A 3D-RETR with VQ-VAE
We describe in detail the VQ-VAE setting in our ablation study of Section 4.3 (see Figure 4). We train 3D-RETR with VQ-VAE in two separate stages.
In the first stage, we pretrain a VQ-VAE with a codebook size of 2048. The VQ-VAE Encoder and Decoder each have three layers. For the VQ-VAE Decoder, we use the same residual blocks as in the CNN Decoder. The VQ-VAE Encoder encodes the voxels into a discrete sequence of length 64, where each element is an integer between 0 and 2047. The VQ-VAE is trained with cross-entropy loss, and its reconstruction IoU is about 0.885.
In the second stage, for every input image and its corresponding ground-truth voxels, we first generate a discrete target sequence using the pretrained VQ-VAE Encoder. Then, the Transformer Encoder generates the hidden representation of the input image, and the Transformer Decoder uses the output of the Transformer Encoder to generate another discrete sequence through a linear layer with softmax at its output. We use the target sequence as the ground truth and train the Transformer Encoder and Decoder with cross-entropy loss, so that the generated sequence is as close as possible to the target sequence.
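The discretization at the heart of this pipeline, mapping continuous encoder features to codebook indices, can be sketched as follows. The codebook size (2048) and sequence length (64) match the values above; the feature dimension of 32 is an assumption for illustration.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour codebook lookup, the core of a VQ-VAE bottleneck:
    each continuous feature vector is replaced by the index of its closest
    codebook entry, yielding a discrete target sequence."""
    # z: (seq_len, d) encoder features; codebook: (K, d) codebook vectors.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (seq_len, K)
    return d2.argmin(axis=1)                                    # integer codes

rng = np.random.RandomState(0)
codebook = rng.randn(2048, 32)    # codebook size 2048, as above
z = rng.randn(64, 32)             # sequence of length 64, as above
codes = quantize(z, codebook)
print(codes.shape)                # (64,) integers in [0, 2047]
```

The Transformer Decoder is then trained with cross-entropy against these integer codes, and at evaluation time its predicted codes index back into the codebook before the VQ-VAE Decoder reconstructs the voxels.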
Appendix B Additional Examples
We show more examples of the ShapeNet dataset and the Pix3D dataset from our 3D-RETR-B model. Table 8 shows additional examples of the Pix3D dataset. Table 9 shows examples from the ShapeNet dataset with different numbers of views as inputs. We can see a clear quality improvement when more views become available.
Appendix C Model Performance with Different Views
In Table 2 of the paper, we show that 3D-RETR trained on three views still outperforms previous state-of-the-art results even when evaluated under different numbers of input views. In Table 10 and Figure 5, we give additional results on training and evaluating under different numbers of views. We can observe that more views during evaluation can boost model performance. Another observation is that models trained with more views are not necessarily better than models trained with fewer views, especially when the number of views available during evaluation is far fewer than the number of available views during training. For example, when only one view is available, the model trained with one view reaches an IoU of 0.680, while the model trained with 20 views only reaches an IoU of 0.534.
| Train \ Eval | 1 view | 2 views | 3 views | 4 views | 5 views | 8 views | 12 views | 16 views | 20 views |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |