Humans are able to understand the visual world in 3D. This helps us manipulate objects and intuitively make sense of the world. Crucially, we do not need to observe everyday objects from all viewpoints to have this ability. Usually a single view suffices for a strong understanding of a 3D object – for example, we can imagine the object from novel viewpoints and "mentally rotate" it in our heads. In computer vision, this task is called single view 3D reconstruction: given a single 2D input image, we wish to output a 3D object model (e.g., in the form of voxels, point clouds, or triangular meshes). Note that this is different from the classical problem of structure from motion, which requires dense, full coverage of all viewpoints. The single view case is difficult because, in general, it is highly unconstrained. For instance, given a picture of a car from the front, no algorithm or person can produce an image of the car from the back with 100 percent certainty. However, certain cues can lead to a reasonable result, since we know that the world follows certain geometric patterns and rules. Additionally, since we have seen many pictures of cars in the past, we can incorporate prior knowledge specific to the car class.
Nearly all existing methods use a synthetic dataset of mesh models called ShapeNet for both training and testing, because it provides many 2D rendered images of objects along with their ground truth 3D representations. However, because these are CAD models with little detail, no texture, and no background, models trained on ShapeNet do not perform well when presented with real-world images, due to the domain gap between synthetic and real data. To be practical, a model should be able to work with real-world images with complicated backgrounds. Therefore, techniques from the domain adaptation literature can be applied, allowing us to transfer knowledge from a synthetic source domain with 3D ground truth to a real target domain without 3D ground truth.
In this paper, we present results which extend a current state-of-the-art, synthetic-data-based, single view voxel reconstruction method called pix2vox so that it can be applied in the real world. First, we provide a summary of the related literature in Section 2. Then, our rationale for choosing this framework is discussed in Section 3. In particular, we utilize several domain adaptation methods based on the maximum mean discrepancy (MMD) loss, Deep CORAL, and the domain adversarial neural network (DANN). We also propose a novel architecture which takes advantage of the fact that in this setting, the target domain data is unsupervised with regard to the 3D model but supervised with regard to class labels. We then share our results in Section 4: we demonstrate that, as is, pix2vox fails on real-world images, evaluate the domain adaptation methods, and verify the usefulness of incorporating class labels.
2 Related Work
2.1 3D Single View Object Reconstruction
Single view 3D reconstruction has several immediate applications. For example, it would enable objects to be reconstructed as 3D models and placed into an augmented (AR) or virtual (VR) environment, so that the user could manipulate them in that environment. Another use case is robotic grasping – if an object such as a cup could be scanned, it would provide valuable information for a robot trying to pick it up by its handle. As a result, several methods have been proposed in the computer vision literature [28, 16, 6, 20, 25, 9]. They vary in the type of 3D representation used, and each has its own trade-offs. For example, voxels are easily adapted to Convolutional Neural Networks (CNNs) but are spatially inefficient; triangular meshes are efficient but suffer from irregularities; point clouds are simple but lack explicit structural information. However, most of these methods use ShapeNet for training and testing, and do not incorporate techniques from the domain adaptation literature to allow for real-world viability. Pixel2Mesh uses a Graph Convolutional Network (GCN) to deform an ellipsoid mesh into an output mesh; the results, however, are not very accurate, and the method fails to preserve the genus of the ground truth. Another approach uses a 2D convolutional network to generate dense point clouds that shape the surface of 3D objects in an undiscretized 3D space; it predicts accurate shapes with higher point density but is problematic when objects contain very thin structures. 3D-R2N2 proposed a novel architecture that unifies single and multi-view 3D reconstruction into a single framework. It uses a deep convolutional neural network (the 3D Recurrent Reconstruction Neural Network) to learn a mapping from observations to the underlying 3D shapes of objects from a large collection of training data, and incrementally improves its reconstructions as it sees more views of an object; however, it is unable to reconstruct many details and struggles with objects having high texture levels.
2.2 Unsupervised Domain Adaptation
Classical approaches to unsupervised domain adaptation usually consist of matching the feature distributions between the source and target domains. Generally, these methods can be categorized as either sample re-weighting or feature space transformations. Convolutional neural networks are also used for this purpose, because of their ability to learn powerful features. These methods are generally trained to minimize a classification loss while maximizing domain confusion. The classification loss is usually computed using a fully connected or convolutional neural network trained on the labeled data. The domain confusion is usually achieved either via a discrepancy loss, which reduces the shift between the two domains, or via an adversarial loss, which encourages a common feature space with respect to a discriminator loss. CORAL achieves domain confusion by aligning the second-order statistics of the learned feature representations. Deep Domain Confusion applies a domain-confusion loss based on MMD to the final layer representation of a network. The deep adaptation network extends this by summing multiple MMDs computed between several layers, and the joint adaptation network aligns the joint distributions of features across layers.
3.1 3D Reconstruction Backbone Architecture
The backbone for single view 3D reconstruction chosen for this project is the pix2vox architecture. It is a convolutional neural network which encodes the input image into a latent feature map, which is then decoded into a 3D voxel grid. The encoder uses convolutional layers, while the decoder uses 3D transposed convolutional layers. An additional refiner CNN, based on the U-Net, is also employed to increase performance. Standard techniques are used throughout the network, including batch normalization and a VGG encoder pretrained on ImageNet. This architecture was chosen because it is much more efficient than other competing methods such as PSGN, OGN, and 3D-R2N2, while being comparable or better in terms of performance. Due to limited computational resources, this was a critical factor in our decision. Note that in the original paper, pix2vox utilizes the ShapeNet dataset.
3.2 Maximum Mean Discrepancy (MMD)
Unsupervised domain adaptation is quite challenging, since we have no labeled information for the target domain. One family of approaches bounds the target error by the source error plus a discrepancy metric between the source and the target. The Maximum Mean Discrepancy (MMD) is a measure of the difference between two probability distributions computed from their samples. It is an effective criterion that compares distributions without first estimating their density functions. Given two probability distributions $p$ and $q$ on $\mathcal{X}$, the MMD is defined as

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right),$$

where $\mathcal{F}$ is a class of functions $f : \mathcal{X} \to \mathbb{R}$. By defining $\mathcal{F}$ as the set of functions in the unit ball of a universal Reproducing Kernel Hilbert Space (RKHS), denoted by $\mathcal{H}$, it was shown that $\mathrm{MMD}[\mathcal{F}, p, q] = 0$ if and only if $p = q$, so the MMD will detect any discrepancy between $p$ and $q$.

Let $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ be data vectors drawn from distributions $p$ and $q$ on the data space $\mathcal{X}$, respectively. Since $\mathcal{F}$ is the unit ball in a universal RKHS, we can rewrite the empirical estimate of the MMD as

$$\mathrm{MMD}[X, Y] = \left\lVert \frac{1}{m} \sum_{i=1}^{m} \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \right\rVert_{\mathcal{H}},$$

where $\phi : \mathcal{X} \to \mathcal{H}$ is referred to as the feature space map.
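To make the estimate above concrete, the squared MMD can be expanded with the kernel trick $k(x, y) = \langle \phi(x), \phi(y) \rangle$ into sums of pairwise kernel evaluations. Below is a minimal NumPy sketch using an RBF kernel; the function names and the bandwidth $\sigma$ are illustrative choices of ours, not taken from any implementation referenced in this paper.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # Pairwise RBF kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of the squared MMD between samples X ~ p and Y ~ q."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2.0 * rbf_kernel(X, Y, sigma).mean())
```

Identical sample sets yield an MMD of zero, while samples from shifted distributions yield a strictly larger value, which is what makes the quantity usable as a discrepancy loss.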
3.3 Deep CORAL
Another approach to domain adaptation is to align the statistics of the source and target domains. CORAL does this by using a linear transformation to align the covariances (second-order statistics) of the two domains. Assume we have a labeled source domain $D_S = \{x_i\}$ and an unlabeled target domain $D_T = \{u_i\}$, where each sample is a $d$-dimensional vector. The CORAL loss is defined as

$$\ell_{\mathrm{CORAL}} = \frac{1}{4d^2} \lVert C_S - C_T \rVert_F^2,$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm and $C_S$, $C_T$ are the feature covariance matrices of the source and target data, respectively. These matrices are given by

$$C_S = \frac{1}{n_S - 1} \left( D_S^\top D_S - \frac{1}{n_S} (\mathbf{1}^\top D_S)^\top (\mathbf{1}^\top D_S) \right),$$
$$C_T = \frac{1}{n_T - 1} \left( D_T^\top D_T - \frac{1}{n_T} (\mathbf{1}^\top D_T)^\top (\mathbf{1}^\top D_T) \right),$$

where $\mathbf{1}$ is a column vector with all elements equal to 1, and $D_S \in \mathbb{R}^{n_S \times d}$ and $D_T \in \mathbb{R}^{n_T \times d}$ are the matrices containing the source and target data, respectively.
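As an illustration, the CORAL loss can be computed in a few lines of NumPy; the variable names are ours, and `np.cov` applies the same $1/(n-1)$ normalization as the covariance formulas above.

```python
import numpy as np

def coral_loss(D_s, D_t):
    """CORAL loss: squared Frobenius distance between the source and target
    feature covariance matrices, normalized by 4 d^2."""
    d = D_s.shape[1]
    C_s = np.cov(D_s, rowvar=False)  # d x d source covariance
    C_t = np.cov(D_t, rowvar=False)  # d x d target covariance
    return np.sum((C_s - C_t) ** 2) / (4.0 * d * d)
```

In training, this loss is added to the task loss and minimized jointly, pulling the second-order feature statistics of the two domains together.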
3.4 Domain Adversarial Neural Network (DANN)
DANN focuses on combining domain adaptation and deep feature learning in a single training process. It embeds domain adaptation into representation learning so that the obtained features are both discriminative and domain invariant. This is achieved by jointly optimizing the underlying features along with two discriminative classifiers operating on these features: a label predictor and a domain classifier. The label predictor predicts class labels and is used both during training and at test time, while the domain classifier discriminates between the source and the target domains during training. The model minimizes the loss of the label classifier while adversarially maximizing the loss of the domain classifier, thereby encouraging domain-invariant features.
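In practice, this min–max objective is commonly implemented with a gradient reversal layer: the identity in the forward pass, and gradient negation (scaled by a factor $\lambda$) in the backward pass. A minimal framework-free sketch of that behavior follows; the class and parameter names are ours.

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer: the forward pass is the identity, while the
    backward pass multiplies incoming gradients by -lambda. Placed between the
    feature extractor and the domain classifier, minimizing the domain loss
    downstream maximizes it with respect to the features upstream."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # identity in the forward direction

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient
```

Within an autograd framework the same effect is obtained by defining a custom backward rule, so the whole network can still be trained with ordinary gradient descent.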
3.5 Voxel Classification Architecture
The domain adaptation techniques discussed above can be readily applied to the pix2vox architecture. However, their success may be limited, since the gap between the synthetic and real domains is large. To help with this, in addition to the domain adaptation techniques, we classify the output voxel by vectorizing it and passing it through several fully connected layers of size 100 and 20 with ReLU activations. This is possible because we have ground truth class labels for both the source and target domains. The proposed architecture is shown in Figure 2; all losses are trained end-to-end, and we use the standard cross entropy loss for classification. This idea is inspired by the fact that, in general, output voxels should resemble their respective classes. We found that this additional source of supervision, which applies to both the source and target domains, is highly beneficial; further details can be found in Section 4.
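A sketch of the classification head's forward pass is given below. The weight shapes, the number of classes, and the placement of the final logits layer are illustrative assumptions of ours; only the 100 and 20 hidden-layer widths are specified above.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def voxel_classifier(voxel, W1, b1, W2, b2, W3, b3):
    """Forward pass of the voxel classification head: vectorize the predicted
    voxel grid, then FC(100) -> ReLU -> FC(20) -> ReLU -> FC(n_classes).
    The exact layer ordering around the logits is an assumption."""
    x = voxel.reshape(-1)    # vectorize the voxel grid
    h1 = relu(x @ W1 + b1)   # hidden layer of width 100
    h2 = relu(h1 @ W2 + b2)  # hidden layer of width 20
    return h2 @ W3 + b3      # class logits
```

The returned logits feed the cross entropy loss, which is backpropagated through the reconstruction network together with the other losses.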
4.1 Relevant Datasets
As mentioned previously, ShapeNet is a synthetic dataset which provides many 2D rendered images of objects, as well as their ground truth 3D mesh representations (which can be converted into voxels). As shown in Figure 1, they are CAD models with little detail, no texture, and no background. While the original ShapeNet has 270 classes, pix2vox uses a subset of 13 classes.
In addition, we utilize the ODDS dataset, which is a real, multiview, class-organized image dataset with multiple domains. It contains 25 classes, 6 of which overlap with the ShapeNet classes used to train the original pix2vox model. The overlapping classes are airplanes, cars, monitors, lamps, telephones, and boats, so these are the classes we use when comparing results between datasets. Each ODDS class contains 20 object instances (for example, the monitor class has 20 different types of monitors), and each object instance has 8 images taken at 45 degree increments. Please see Figure 1 for some example images. There are 3 domains in the ODDS dataset that we work with. First, OWILD contains images of objects in various real-world locations, with pictures taken using a smartphone. Second, OOWL is taken with a drone inside a lab setting; as a result, it contains several domain peculiarities such as camera blur and lower camera resolution. Finally, OOWL Seg is a segmented version of OOWL. Note that this data captures the real-world input statistics that we would like to work with: real objects, photographed with smartphone cameras in various real-world locations. The domains also represent a trade-off between ease of collection and realism, illustrated by the arrow on the right of Figure 1.
4.2 Evaluation Metrics
The standard evaluation metric for 3D reconstruction with voxels is the intersection over union (IoU) score. Formally,

$$\mathrm{IoU} = \frac{\sum_{i,j,k} \mathbb{I}\left(\hat{p}_{(i,j,k)} > t\right) \, \mathbb{I}\left(p_{(i,j,k)}\right)}{\sum_{i,j,k} \mathbb{I}\left[ \mathbb{I}\left(\hat{p}_{(i,j,k)} > t\right) + \mathbb{I}\left(p_{(i,j,k)}\right) \right]},$$

where $\hat{p}_{(i,j,k)}$ is the predicted occupancy probability at voxel location $(i,j,k)$, $p_{(i,j,k)}$ is the ground truth value at that location, $\mathbb{I}(\cdot)$ is an indicator function, and $t$ is a binarization threshold. Given a predicted reconstruction voxel grid and a ground truth voxel grid, an IoU of 1 means the two are identical, while an IoU of 0 means there is no intersection between them. Thus, a higher IoU score indicates a better reconstruction result. It is important to note that for the ODDS dataset we do not have 3D ground truth (currently, we are not aware of any publicly available, sufficiently large multiview dataset with 3D ground truth). Therefore, within the scope of this paper, ODDS is primarily used as a test dataset for judging qualitative reconstruction results; we cannot quantitatively evaluate metrics like IoU due to the lack of 3D ground truth (the dataset is unsupervised in this regard). However, we do utilize the ground truth class labels that come with OWILD as supervision.
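For a binarized prediction, the IoU computation can be sketched as follows; the 0.5 threshold and the handling of two empty grids are illustrative choices of ours, not values specified in this paper.

```python
import numpy as np

def voxel_iou(pred_prob, gt, thresh=0.5):
    """IoU between a predicted occupancy-probability grid and a binary
    ground-truth voxel grid. The threshold value is an illustrative default."""
    pred = pred_prob > thresh          # binarize the prediction
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                     # both grids empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```

Identical grids score 1.0 and disjoint grids score 0.0, matching the interpretation of the formula above.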
4.3 Application of Domain Adaptation on a Vanilla Pix2Vox Model
First, we report the results of applying DANN and Deep CORAL domain adaptation to the vanilla pix2vox model for single view 3D reconstruction. Qualitative reconstruction results are shown in Figure 7, under "CORAL" and "DANN". We observed that reconstructions without any domain adaptation generally look like random noise (these results are omitted to save space). In our experiments we also found that MMD was not effective, so MMD results have been omitted as well. Regarding Deep CORAL and DANN, we can see that both generally help, though results are still far from perfect. Visually inspecting the reconstructions, we found that DANN performed better than CORAL. We also embed the learned feature maps into 2D, using t-SNE as the dimensionality reduction algorithm; this is shown in Figure 5. We can see that DANN and Deep CORAL both help to make the embedded features more domain invariant – the distribution of the source ShapeNet domain (purple) is better matched with the distribution of the target OOWL domain (yellow). We also note that introducing domain adaptation negatively impacts the IoU on the source domain, as shown in Table 1. Intuitively, this makes sense: the network is constrained to output only domain-invariant latent representations, which makes training more difficult. In the future, we would like to investigate whether source-domain IoU can be maintained while performing domain adaptation.
4.4 Reconstruction with a Voxel Classification Loss
As mentioned above, we found that DANN helps the vanilla pix2vox model. However, results are still suboptimal. To address this, we utilize our proposed voxel classification network. Training is performed end-to-end, and we report the training losses as a function of epoch in Figure 6. While the voxel classification loss does decrease, we found it difficult to reduce it past epoch 20 – it fluctuates beyond that point. In the future, we plan to look into ways of addressing this. Meanwhile, the domain discrepancy loss stays at around 0.5, which is expected given the adversarial training induced by the gradient reversal layer.
Next, using this trained model, we evaluate the differences between the domains in ODDS. This gives us a way to gauge how large the domain gap is between ShapeNet and the target domains OOWL, OOWL Seg, and OWILD. We report t-SNE embeddings in Figure 4 and reconstructions in Figure 3 for the three target domains. In general, OWILD appears to be the most challenging domain; we believe this is because its complex backgrounds place it very far from the ShapeNet domain. On the other hand, OOWL Seg performs quite well; we believe the segmentation makes its images very similar to ShapeNet images, which also lack backgrounds.
5 Conclusion and Future Work
In this paper, we have focused on the task of single view voxel reconstruction in the real world. To do this, we extended the pix2vox architecture using domain adaptation between the supervised synthetic ShapeNet dataset and the unsupervised, real ODDS dataset. However, we showed that simply applying domain adaptation is not enough; reconstruction results are only marginally better. Therefore, we proposed an architecture which also utilizes a voxel classification loss in addition to an adversarial loss, which led to better results.
There are several extensions that were not pursued due to time and computational constraints, and which we plan to explore in the future. First, it would be interesting to see whether our conclusions hold for other architectures (e.g., mesh- or point-cloud-based). Second, no large dataset exists with real-world 3D ground truth. Perhaps the closest is Pix3D, but it has only 8 classes, its pictures are taken from only one angle, and no class is shared among Pix3D, OWILD, and ShapeNet. If such a dataset existed, it would be feasible to obtain quantitative, not just qualitative, reconstruction results. Third, we plan to work on improving results on OWILD through more experimentation. For example, we want to try augmenting ShapeNet by pasting backgrounds from the MIT Places dataset to see if that improves results. We also want to further explore the domain gaps between the datasets through methods like domain bridges, which perform domain adaptation gradually over several intermediate domains of increasing difficulty toward the final target domain. Finally, because ODDS is a multiview dataset, it would be natural to generalize our results to the multiview reconstruction case.
- (2014) Domain-adversarial neural networks. Cited by: §2.2.
- (2013) Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 769–776. Cited by: §2.2.
- (2006) Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, pp. e49–e57. Cited by: §3.2.
- (2016) Domain separation networks. CoRR abs/1608.06019. Cited by: §2.2.
- (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: §1.
- (2016) 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision, pp. 628–644. Cited by: §2.1, §3.1.
- (2019) Adaptation across extreme variations using unlabeled domain bridges. arXiv preprint arXiv:1906.02238. Cited by: §5.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §3.1.
- (2017) A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613. Cited by: §2.1, §3.1.
- (2015) Domain-adversarial training of neural networks. Cited by: §2.2.
- (2017) Domain-adversarial training of neural networks. In Domain Adaptation in Computer Vision Applications, G. Csurka (Ed.), pp. 189–209. Cited by: §3.4.
- (2008) A kernel method for the two-sample problem. Cited by: §2.2.
- (2007) Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19, pp. 601–608. Cited by: §2.2.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.
- (2007) Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 264–271. Cited by: §2.2.
- (2018) Learning efficient point cloud generation for dense 3D object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.1.
- (2015) Learning transferable features with deep adaptation networks. Cited by: §2.2.
- (2017) Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 2208–2217. Cited by: §2.2.
- (2017) A survey of structure from motion. Acta Numerica 26, pp. 305–364. Cited by: §1.
- (2018) Matryoshka networks: predicting 3D geometry via nested shape layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1936–1944. Cited by: §2.1.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §3.1.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.1.
- (2015) Return of frustratingly easy domain adaptation. Cited by: §2.2, §3.3.
- (2016) Deep CORAL: correlation alignment for deep domain adaptation. Cited by: §2.2.
- (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3D outputs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2088–2096. Cited by: §2.1, §3.1.
- (2014) Deep domain confusion: maximizing for domain invariance. Cited by: §2.2.
- (2018) Pixel2Mesh: generating 3D mesh models from single RGB images. CoRR abs/1804.01654. Cited by: §2.1.
- (2019) Pix2Vox: context-aware 3D reconstruction from single and multi-view images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2690–2698. Cited by: §1, §2.1, §3.1.
- (2018) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464. Cited by: §5.