Reconstructing 3D shape from a single-view RGB image is an important but challenging computer vision task with applications in robotics, CAD, and virtual and augmented reality. Humans can easily infer the 3D shape of an object from a single image thanks to rich prior knowledge and visual understanding, whereas the problem is extremely difficult and ill-posed for a machine vision system because a single-view image cannot deliver sufficient information about the object to be reconstructed.
Most existing methods use prior shapes implicitly and employ conventional encoder-decoder architectures to predict a 3D voxel (volume pixel) grid from a single RGB image. However, implicitly encoding prior knowledge from all available shape priors into the latent parameter space of the reconstruction network can neither retain shape information completely nor account for the different shape prior requirements of different objects. In general, only a few shape priors are helpful when reconstructing a particular object, and using irrelevant shape priors can introduce noise and make the reconstruction worse. Encoder-decoder based approaches also suffer seriously from object occlusions, rough surfaces and noisy backgrounds, which are common in single-view images. The unsatisfactory performance of these methods is shown in Fig. 1.
Humans can infer a reasonable 3D shape from a single image with incomplete visual cues because they can retrieve similar shape priors from memory and apply these priors to fill in the invisible parts of the object. Inspired by memory-based meta-learning, which uses an external memory to store prior knowledge and achieves fast domain adaptation by exploiting the relationship between the current input and items in memory, we propose a framework named Meta3D. Meta3D makes the usage of prior shapes explicit rather than implicit by storing the shape priors in an external memory, as shown in Fig. 2. Meta3D stores existing ‘image-3D shape’ pairs in the memory slots with a novel write controller. Later, for each input image, a read controller retrieves the few 3D shapes in memory that are most relevant to the input. An LSTM then contextually encodes the knowledge useful for reconstruction across the retrieved shape priors to synthesize the parameters of a shape-specific refiner. Our main contributions are summarized as follows:
We are the first to propose a memory-based meta-learning approach, i.e., Meta3D, for single-view 3D reconstruction. By explicitly utilizing shape priors and fully exploiting the relationship between a single-view image and the shape priors in memory, Meta3D can reconstruct a reasonable 3D shape even for objects with invisible parts and complex backgrounds.
A novel write controller is devised to treat the memory update as an image-shape feature aggregation problem. Our write controller brings images with similar volumes (3D shapes) closer to each other while pushing them away from images with different volumes. Moreover, a novel read controller with a parameter prediction network is proposed to sequentially encode the most input-relevant prior shapes in the memory and form a shape-specific refiner for each object to be reconstructed.
Experimental results on three popular benchmarks demonstrate that Meta3D outperforms state-of-the-art methods by a large margin. By selecting input-relevant prior shapes and applying the priors efficiently, Meta3D can deal with extremely difficult cases, e.g., objects with truncated and occluded parts or with very complex backgrounds, which cannot be handled by other methods.
2 Related Work
2.1 Single-image 3D Reconstruction
Recently, 3D shape reconstruction from a single-view image has attracted increasing research effort because of its wide range of real-world applications. Recovering object shape from a single-view image is an ill-posed problem because the visual cues are limited. Existing works use silhouettes [Dibra_2017_CVPR], shading [Richter_2015_CVPR], and texture [DBLP:journals/ai/Witkin81] to recover 3D shape. With the success of deep learning, especially generative adversarial networks [DBLP:conf/nips/GoodfellowPMXWOCB14] and variational autoencoders [DBLP:journals/corr/KingmaW13], deep neural network based encoder-decoders such as 3D-VAE-GAN [DBLP:conf/nips/0001ZXFT16] have become the mainstream architecture. MarrNet [DBLP:conf/nips/0001WXSFT17] reconstructs 3D objects by estimating depth, surface normals, and silhouettes. PSGN [DBLP:conf/cvpr/FanSG17] and 3DLMNet [DBLP:conf/bmvc/MandikalLAR18] generate point representations from single-view images. 3D-R2N2 [DBLP:conf/eccv/ChoyXGCS16] applies a 2D CNN to encode the input single-view image into a feature map, and then uses a 3D convolutional neural network to decode the feature representation into a 3D shape. Tulsiani et al. [DBLP:conf/cvpr/TulsianiEM18] adopt an unsupervised solution for 3D object reconstruction. However, all these works suffer from object occlusion and from foregrounds that are hard to distinguish from the background (Fig. 1). Unlike existing works, which use shape priors implicitly by tuning the parameters of the reconstruction network, we propose a framework, Meta3D, that explicitly selects and uses the most relevant prior shapes to guide the reconstruction and is better at reconstructing the invisible parts of the object.
2.2 Memory-based Meta-learning
Meta-learning studies how to distill prior knowledge from past experience and enable fast adaptation to new, even unseen, tasks with only a limited number of samples. The mainstream approaches to meta-learning can be broadly categorized into three groups: optimization-based [DBLP:conf/iclr/RaviL17][DBLP:conf/icml/FinnAL17][DBLP:conf/nips/FinnXL18][DBLP:conf/iclr/RusuRSVPOH19], metric-based [DBLP:conf/nips/VinyalsBLKW16][DBLP:conf/nips/SnellSZ17][DBLP:journals/corr/abs-1803-00676][DBLP:conf/iclr/BertinettoHTV19][DBLP:conf/cvpr/SungYZXTH18] and memory-based [DBLP:journals/corr/SantoroBBWL16]. MANN [DBLP:journals/corr/SantoroBBWL16] is the first work to demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data and leverage them to make accurate predictions after only a few samples.
The Memory Network was first proposed in [weston2014memory], which augments neural networks with an external memory module that enables the network to store long-term memory. Later works [DBLP:conf/nips/SukhbaatarSWF15][DBLP:conf/icml/KumarIOIBGZPS16] improve Memory Networks so that they can be trained end to end. Hierarchical Memory Networks [ch2016hierarchical] allow the read controller to efficiently access large-scale memories. Key-Value Memory Networks [DBLP:journals/corr/MillerFDKBW16] store prior knowledge in a key-value structured memory: keys are used to address relevant memories, whose corresponding values are returned. We adopt and improve the memory-based approach in Meta3D because the memory module can store critical information over long periods, which is close to human perception. In this paper, we use a memory network to store shape priors so that prior information can be retained completely, and we devise novel write and read strategies for our specific task.
2.3 Parameter Prediction
Parameter prediction is one of the meta-learning strategies. It refers to training one network to generate the parameters of another network, which is an effective way to encode relational information into the network and make it adaptive to novel samples. Early work [DBLP:conf/iccv/BaSFS15] explores predicting the weight parameters of a deep neural network by training a multi-layer perceptron to predict a binary classifier from a class-specific text description. The authors of [DBLP:journals/neco/Schmidhuber92] propose fast weights, in which one network produces context-dependent weight changes for a second network. A few subsequent works study practical applications of fast parameter prediction, e.g., object detection [Wang_2019_ICCV] and image super-resolution [Hu_2019_CVPR]. We devise a shape-specific parameter predictor to achieve fast shape domain adaptation, so that our framework can refine the reconstructed volumes shape by shape.
3 Method Overview
The basic idea of Meta3D is to refine the generated coarse volume with relevant prior shapes that were stored in memory during the training stage. To store and utilize shape priors explicitly, we adopt a Key-Value Memory Network [DBLP:conf/emnlp/MillerFDKBW16] as our memory module, storing the shape-discriminative feature of an input image as the ‘Key’ and its corresponding ground-truth volume as the ‘Value’.
As illustrated in Fig. 2, our framework is composed of the following modules: an external memory module (which includes many memory slots, a shape-discriminative feature extractor, a write controller and a read controller), a shape-agnostic volume generator, and a shape-specific refiner. The feature extractor extracts shape-discriminative features from input images with the help of a newly devised shape threshold triplet loss. The write controller takes the extracted shape-discriminative features and their corresponding ground-truth volumes as input, and determines how to store the ‘image-volume’ pairs in the memory. The read controller retrieves the volumes most relevant to the input image by computing similarities between the extracted features of the input image and the ‘Key’s stored in the memory. After several volumes are retrieved, the read controller contextually encodes all the retrieved volumes to synthesize the parameters of the shape-specific refiner, which is thus adapted to each input image.
4 The Proposed Method
In this section, we will introduce the external memory module, the shape-agnostic generator module and the shape-specific refiner module individually.
4.1 External Memory Module
Employing RNN-based models is a natural way to encode contextual relationships and similarities among shape priors. However, the latent parameter space of an RNN, which acts as its memory, may be too small to remember prior shape information completely. Moreover, reconstructing a particular object may rely on only a subset of shape priors, whereas the latent parameter space of an RNN is fixed after training and does not account for the different shape prior requirements of different objects. Inspired by the Memory Network [DBLP:journals/corr/WestonCB14] and the Key-Value Memory Network [DBLP:conf/emnlp/MillerFDKBW16], we design an external memory module for 3D reconstruction that can manage a large external memory and store shape priors in a flexible key-value form. By explicitly storing and utilizing shape priors, Meta3D can construct a relevant shape prior subset to guide the reconstruction of a particular object.
The memory consists of a fixed number of slots, each holding an item of the form [key, value, age]. A ‘key’ memory slot stores a shape-discriminative feature vector of an input image, and the keys are used to compute cosine similarities with the queries. A ‘value’ memory slot stores the ground-truth 3D volume of its corresponding key. The age tracks how long an item has gone unused, so that rarely used slots can be chosen for overwriting.
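The [key, value, age] layout described above can be sketched as a small data structure. This is a minimal illustration only: the names `MemoryItem`, `make_memory` and `oldest_slot` are hypothetical, and the sizes passed in are placeholders rather than the settings used in the experiments.

```python
from dataclasses import dataclass
from typing import List

Voxels = List[List[List[int]]]  # binary occupancy grid, indexed [i][j][k]


@dataclass
class MemoryItem:
    key: List[float]  # shape-discriminative feature of an image
    value: Voxels     # ground-truth volume paired with the key
    age: int = 0      # how long the slot has gone unused


def make_memory(num_slots: int, key_dim: int, res: int) -> List[MemoryItem]:
    """Allocate empty slots (illustrative sizes, not the paper's settings)."""
    return [
        MemoryItem(
            key=[0.0] * key_dim,
            value=[[[0] * res for _ in range(res)] for _ in range(res)],
        )
        for _ in range(num_slots)
    ]


def oldest_slot(memory: List[MemoryItem]) -> int:
    """Index of the slot with the largest age, i.e. the overwrite candidate."""
    return max(range(len(memory)), key=lambda i: memory[i].age)
```

Keeping the age alongside each key-value pair lets the overwrite candidate be found with a single scan over the slots.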
4.1.1 Shape-discriminative Feature Extractor and Write Controller.
Note that the memory slots are a set of constructed data without any trainable parameters. Their construction relies on the extracted image features and the write strategy. The write controller takes both the extracted features (the query) and the ground-truth volumes as input, and determines how to update the memory slots and how to optimize the shape-discriminative feature extractor. A feature extractor for image retrieval [DBLP:conf/mm/Yang0ZYM19] aims to bring images of the same semantic or visual class closer to each other while pushing images of different classes further apart. In contrast, our shape-discriminative feature extractor must retrieve images that are similar in shape to the input image rather than similar in appearance, because images of similar semantic or visual classes may not have similar shapes. We therefore devise an objective function that treats feature extraction as an image-volume feature aggregation problem, bringing images with similar volumes closer to each other while pushing them away from images with different volumes. We define the positive neighbor of the input ground-truth volume as the memory slot with the smallest index whose stored volume has a similarity to the input volume above a threshold. Similarly, the negative neighbor is the memory slot with the smallest index whose stored volume has a similarity to the input volume below a threshold. The volume similarity is computed over the full voxel grid, whose resolution is 32 in the ShapeNet dataset.
The triplet loss is constructed from the query, its positive neighbor and its negative neighbor. It minimizes the distance among images with similar 3D volumes and maximizes the distance among images with different 3D volumes.
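The neighbor selection and the triplet loss described above can be illustrated with a short sketch. The exact formulas are not reproduced in this copy of the text, so everything below is an assumption: the volume similarity is taken as one minus the mean squared voxel difference, the slots are scanned in the order given (for example, ranked by key similarity to the query), and the loss is a standard hinge on cosine similarities. The function names are hypothetical; the default margin of 0.1 is the value reported in the implementation details.

```python
import math
from typing import List, Optional, Sequence


def volume_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed form: 1 - mean squared difference of two flattened voxel grids."""
    return 1.0 - sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def first_neighbor(volume: Sequence[float], values: List[Sequence[float]],
                   threshold: float, positive: bool) -> Optional[int]:
    """Smallest index whose stored volume is above (positive) or below
    (negative) the similarity threshold; None if no slot qualifies."""
    for i, stored in enumerate(values):
        s = volume_similarity(volume, stored)
        if (positive and s > threshold) or (not positive and s < threshold):
            return i
    return None


def triplet_loss(query: Sequence[float], pos_key: Sequence[float],
                 neg_key: Sequence[float], margin: float = 0.1) -> float:
    """Hinge loss: the query should be closer to the positive key than to
    the negative key by at least the margin (assumed standard form)."""
    return max(cosine_similarity(query, neg_key)
               - cosine_similarity(query, pos_key) + margin, 0.0)
```

Note that the positive and negative neighbors are chosen by comparing volumes, while the loss itself acts on image features, which is what ties feature distance to shape similarity.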
After the shape-discriminative feature of the input RGB image is extracted as the query, the write controller takes the query and its corresponding ground-truth volume as input. To update the memory slots, two write strategies are considered: 1) aggregating similar samples into one memory slot by updating the corresponding memory key; 2) storing distinctive samples in a new memory slot. Which strategy is used depends on the similarity between the input volume and the volume of the query's first nearest neighbor in the memory.
Case 1: If this similarity is over the threshold, we update the key of the neighbor slot and set the age of the slot to zero, but keep its value unchanged.
Case 2: If this similarity is less than the threshold, there is no memory slot that stores a volume similar to the current input. We therefore look for the oldest memory slot (the one with the largest age) and store the new input there.
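The two write cases can be sketched as a single update function. This is a hedged illustration, not the authors' implementation: the key update rule is not reproduced in this copy of the text, so the sketch assumes the normalized sum of the old key and the query, a rule commonly used in key-value memories. It also assumes that every slot ages by one on each write and reuses the assumed volume similarity of one minus the mean squared voxel difference. All function names are hypothetical.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def vol_sim(a: List[float], b: List[float]) -> float:
    # Assumed similarity: 1 - mean squared voxel difference.
    return 1.0 - sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def write(keys: List[List[float]], values: List[List[float]], ages: List[int],
          query: List[float], volume: List[float], threshold: float) -> int:
    """Update the memory in place and return the index of the slot written."""
    for i in range(len(ages)):
        ages[i] += 1  # assumption: all slots age by one per write
    # First nearest neighbor of the query among the stored keys.
    nearest = max(range(len(keys)), key=lambda i: cosine(query, keys[i]))
    if vol_sim(volume, values[nearest]) > threshold:
        # Case 1: similar volume already stored; merge the query into the key,
        # reset the age, and leave the stored value untouched.
        merged = [q + k for q, k in zip(query, keys[nearest])]
        norm = math.sqrt(sum(x * x for x in merged)) or 1.0
        keys[nearest] = [x / norm for x in merged]
        ages[nearest] = 0
        return nearest
    # Case 2: no similar volume; overwrite the oldest slot with the new pair.
    oldest = max(range(len(ages)), key=lambda i: ages[i])
    keys[oldest] = list(query)
    values[oldest] = list(volume)
    ages[oldest] = 0
    return oldest
```

Because the neighbor is found with keys but accepted or rejected with volumes, a query whose nearest key belongs to a differently shaped object falls through to Case 2 and claims its own slot.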
Note that the write controller is only active during training, because it takes the ground-truth volumes as input, which are unavailable at test time.
4.1.2 Read Controller with Shape-Specific Parameter Prediction.
The ultimate aim of Meta3D is to read and construct an object-relevant prior shape subset from the memory to guide the reconstruction of a particular object. Inspired by the success of fast parameter generation in fast domain adaptation and few-shot learning [DBLP:journals/corr/BertinettoHVTV16][DBLP:journals/neco/Schmidhuber92][Noh_2016_CVPR], we utilize an LSTM network [doi:10.1162/neco.1922.214.171.1245] to sequentially encode the relationships and contextual information of the k specific shape priors read from the memory and to generate the shape-specific parameters with which the refiner guides the reconstruction of the specific object.
The procedure is as follows. The query is the extracted shape-discriminative feature of the input image. We compute the cosine similarities between the query and all the keys in the memory and choose the volumes with the top-k similarities as the k specific shape priors of the query. These priors are encoded by the LSTM, and a fully connected layer then transforms the outputs of the LSTM to the dimension of the refiner weights. The generated parameters are applied to the volume refiner to make it a shape-specific volume refiner.
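The retrieval step of the read controller, i.e. ranking keys by cosine similarity to the query and returning the top-k volumes, can be sketched as follows. The function name `read_top_k` is hypothetical, and the LSTM and fully connected layer that consume the retrieved volumes are deliberately left out of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def read_top_k(query: Sequence[float], keys: List[Sequence[float]],
               values: list, k: int) -> Tuple[list, List[float]]:
    """Return the k stored volumes whose keys are most similar to the query,
    most similar first, together with their similarity scores."""
    scores = [cosine(query, key) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return [values[i] for i in order], [scores[i] for i in order]
```

Returning the volumes in descending order of similarity gives the sequence model a consistent ordering of the priors, from most to least relevant.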
4.2 Shape-Agnostic Generator Module
This module is used to generate coarse 3D shapes from input RGB images. The 3D shape of an object is represented by a 3D voxel grid, where 0 denotes an empty cell and 1 denotes an occupied cell. The term ‘shape-agnostic’ indicates that this generator is trained on all kinds of objects and is therefore a generic solution for reconstructing coarse 3D models. To make a fair comparison and isolate the effectiveness of our memory module, we adopt the same network architecture as the current state-of-the-art work [Xie_2019_ICCV]. The generator is composed of an encoder and a decoder. The encoder takes images as input and computes 2D feature maps of the images. The decoder is responsible for transforming the 2D feature maps into 3D volumes. The details of the network architecture are discussed in Section 5.2. The reconstruction loss of the generator is computed between the predicted volume of the generator and the corresponding ground truth, over all voxels of the 3D volume.
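The formula of the reconstruction loss is not reproduced in this copy of the text. As a purely illustrative assumption, a common choice for voxel generators of this kind is the voxel-wise binary cross-entropy averaged over the grid; the sketch below shows that choice on flattened volumes. The function name and the clamping constant `eps` are hypothetical.

```python
import math
from typing import Sequence


def bce_reconstruction_loss(pred: Sequence[float], target: Sequence[int],
                            eps: float = 1e-7) -> float:
    """Mean voxel-wise binary cross-entropy between predicted occupancy
    probabilities and a binary ground-truth grid (both flattened)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)
```

Averaging over the number of voxels keeps the loss scale independent of the grid resolution.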
4.3 Shape-Specific Refiner Module
This module is used to correct and refine the coarse volumes. ‘Shape-specific’ emphasizes that the refiner can adapt to different input images, i.e., different shapes, by applying shape-specific parameters. The shape-specific parameters are produced by the parameter prediction networks mentioned above. The refiner takes the coarse volume as input and produces a refined volume.
A parameter prediction loss is computed for the parameter prediction networks, which are then updated by backpropagation on this loss to minimize the prediction error.
5.1 Datasets and Metrics
ShapeNet [DBLP:journals/corr/ChangFGHHLSSSSX15], Pix3D [DBLP:conf/cvpr/Sun0ZZZXTF18] and PASCAL 3D+ are used to evaluate the performance of Meta3D; they are the most commonly used public datasets for single-view 3D reconstruction. ShapeNet, which is used to learn shape priors, is composed of synthetic images and corresponding 3D volumes. The evaluation datasets, Pix3D and PASCAL 3D+, are much more challenging because their images come from the real world and contain noisy backgrounds and occlusions, as shown in Fig. 1. We evaluate Meta3D on both synthetic and real-world images to demonstrate that our model can handle complicated self-occlusion, noisy backgrounds, and truncation.
5.1.2 Evaluation Metrics
We apply the intersection over union (IoU) and Chamfer Distance (CD) evaluation metrics, which are widely used by existing works. The IoU measures the similarity between the ground-truth and reconstructed voxels. The predicted occupancy probability at each grid position (i,j,k) is binarized with a threshold, and the number of voxels occupied in both the prediction and the ground truth is divided by the number of voxels occupied in either of them. The Chamfer distance is defined between two point clouds: for each point in each set, CD finds the closest point in the other set and averages the distances. We sample points on the voxel isosurface to compute the CD for voxel occupancy, in the same way as [Pinheiro_2019_ICCV].
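Both metrics can be illustrated with a short sketch on flattened voxel grids and small point sets. Several details are assumptions rather than the paper's exact definitions: the Chamfer distance here uses squared Euclidean distances and sums the two directional averages, and the IoU of two empty grids is defined as 1.0. The function names are hypothetical, and the threshold is left as a parameter because its value is not given in this copy of the text.

```python
from typing import Sequence


def voxel_iou(pred: Sequence[float], target: Sequence[int],
              threshold: float) -> float:
    """IoU between thresholded predicted occupancies and a binary ground truth."""
    inter = 0
    union = 0
    for p, t in zip(pred, target):
        occupied = p > threshold
        truth = t == 1
        inter += int(occupied and truth)
        union += int(occupied or truth)
    return inter / union if union else 1.0  # assumption: empty vs empty is 1.0


def chamfer_distance(a: Sequence[Sequence[float]],
                     b: Sequence[Sequence[float]]) -> float:
    """Symmetric Chamfer distance with squared Euclidean distances (assumed)."""
    def mean_nearest(src, dst):
        return sum(
            min(sum((x - y) ** 2 for x, y in zip(p, q)) for q in dst)
            for p in src
        ) / len(src)
    return mean_nearest(a, b) + mean_nearest(b, a)
```

A higher IoU and a lower CD both indicate a better reconstruction, which is why the two metrics are reported side by side.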
5.2 Implementation Details
Most of our settings follow previous works to make a fair comparison. Specifically, we resize input images into 224x224 and downsample the voxels provided by the official repository into . We train all the modules with Adam optimizer with = 0.9 and
5.2.1 Memory Module
In the memory module, we use a ResNet pre-trained on ImageNet as our shape-discriminative feature extractor. We use the feature extracted by the pool5 layer of the ResNet18, whose dimension is 512. The volume similarity threshold is set to 0.7, 0.9, and 0.9 for ShapeNet, Pix3D and PASCAL 3D+, respectively. The margin in the triplet loss is 0.1. We set the size of the memory to 20,000, which means the memory module can store 20,000 items, each of the form [key, value, age]. The in the read controller is 256 and the size of the hidden layer in LSTM is 512.
5.2.2 Generator Module
We follow the same generator architecture as previous work [Xie_2019_ICCV] for a fair comparison. Specifically, the generator is composed of an encoder and a decoder. The encoder adopts the first nine convolutional layers of a VGG pre-trained on ImageNet, followed by three sets of 2D convolutional layers, batch normalization layers, and ELU layers. The kernel sizes are, and , respectively. The decoder is composed of five 3D transposed convolutional layers. The first four transposed convolutional layers are of a kernel size of [Xie_2019_ICCV].
5.2.3 Refiner Module
The encoder of the refiner has three 3D convolutional layers, each of which has a bank of filters with padding of 2, followed by a batch normalization layer, a leaky ReLU and a max-pooling layer with a kernel size of. The encoder is followed by two fully connected layers with dimensions of 2048 and 8192. The decoder consists of three transposed convolutional layers, with a bank of filters with padding of 2 and stride of 1. All layers except the last are followed by a batch normalization layer and a ReLU activation. The last layer is followed by a sigmoid function.
5.3 Reconstruction on ShapeNet
As in previous works [DBLP:conf/eccv/ChoyXGCS16], we use a subset of ShapeNet consisting of 13 major categories and 43,783 3D models, in which the voxel resolution is $32^3$. In the memory training stage, we train the shape-discriminative feature extractor and apply the write controller to fill the memory slots. The generator and the parameter predictor are trained in the next stage. The experimental results are shown in Table 1. Although several methods take extra information as input (PSGN needs object masks and uses 220k 3D CAD models), Meta3D outperforms all other methods by a large margin across all categories. Meta3D benefits from the external memory module, which explicitly retains complete shape priors and applies them according to each object's individual needs through efficient fast domain adaptation.
|Category||3D-R2N2 [DBLP:conf/eccv/ChoyXGCS16]||OGN [Tatarchenko_2017_ICCV]||DRC [DBLP:conf/cvpr/TulsianiEM18]||PSGN [Fan_2017_CVPR]||Pix2Vox [Xie_2019_ICCV]||Meta3D|
5.4 Reconstruction on Pix3D
Pix3D is a large-scale benchmark of diverse image-shape pairs. The most significant category in this dataset is chairs. Most previous works [Xie_2019_ICCV][DBLP:conf/cvpr/Sun0ZZZXTF18][Wu_2018_ECCV] evaluate their approaches on the 2894 hand-selected untruncated and unoccluded ‘chair’ images. In this work, we evaluate Meta3D on both the 2894 untruncated and unoccluded ‘chair’ images and the truncated and occluded ‘chair’ set, to demonstrate the ability of Meta3D to handle invisible parts and noisy backgrounds.
The training procedure is performed on the ShapeNet-Core [DBLP:journals/corr/ChangFGHHLSSSSX15] dataset, which contains over 50k object instances of 55 categories. In the first training stage, we use ShapeNet-Core to help the shape-agnostic generator learn shape priors and to optimize the shape-discriminative feature extractor and parameter predictor. In the next training stage, we empty the memory slots and train with a higher volume similarity threshold on the subset containing only ‘chairs’, so as to store more diverse ‘chair’ shape priors. We follow previous work and use the same rendered view images as [Pinheiro_2019_ICCV] and [DBLP:conf/eccv/ChoyXGCS16].
|Meta3D (w/o replacing slots)||0.331||0.101|
|Meta3D (w/ replacing slots)|
First, we evaluate Meta3D on the 2894 untruncated and unoccluded chair images, as in most previous works, and compare it with the state-of-the-art methods, as shown in Table 2. Note that these methods use different types of extra information. For instance, MarrNet [DBLP:conf/nips/0001WXSFT17], DRC [DBLP:conf/cvpr/TulsianiEM18] and ShapeHD [Wu_2018_ECCV] use extra depth, surface normal and silhouette information, and PSGN [DBLP:conf/cvpr/FanSG17] takes object masks as input. Although Meta3D uses only an RGB image as input, it still surpasses the state-of-the-art methods by a large margin. The result also shows that replacing all the shapes in the memory slots with ‘chair’ shapes performs better. This may be because the limited memory capacity otherwise cannot store all ‘chair’ shapes.
We make another comparison on the truncated and occluded ‘chair’ images of the Pix3D dataset, which is an extremely challenging task. The experimental results are shown in Table 3 and Fig. 4. The performance of all other methods decreases significantly, while Meta3D handles the hardest samples better than the other methods. The external memory module makes the usage of shape priors explicit rather than implicit and constructs a relevant shape prior subset according to each object's needs. Meta3D extracts the feature of the input and retrieves relevant shape priors from the memory to guide the reconstruction process. Using the retrieved 3D volumes, which are clean and complete, to guide the reconstruction can significantly filter out the background noise and refine the generated coarse shapes. Fig. 3 shows some examples of volumes retrieved from memory and the refined results.
|Meta3D (w/o replacing slots)||0.289||0.193|
|Meta3D (w/ replacing slots)|
5.5 Reconstruction on PASCAL 3D+
Similar to previous works, we use PASCAL 3D+ only for evaluation, not for training. Our model learns the shape priors from ShapeNet. We train our model on the categories that are present in both PASCAL 3D+ and the ShapeNet renderings: ‘aeroplane’, ‘car’, ‘chair’, ‘table’, and ‘tv’. Comparison results on the three categories reported in [Pinheiro_2019_ICCV][DBLP:conf/cvpr/TulsianiEM18][Wu_2018_ECCV] are shown in Table 4. Note that DRC and ShapeHD use depth, normals and silhouettes as extra information during training. Taking only single-view RGB images as input, Meta3D still achieves the lowest CD by explicitly retaining and using the shape priors.
All the experiments and comparisons demonstrate the superiority of Meta3D in 3D shape reconstruction from a single-view image. Our method can efficiently clean and complete the generated coarse volumes, and handles self-occlusion and diverse noisy backgrounds very well.
In this work, we proposed Meta3D, a network that explicitly stores shape priors in an external memory module, then retrieves and applies the priors according to each object's individual needs through efficient fast domain adaptation. The newly devised write and read controllers give the memory module the ability to aggregate shape-similar images and encode the shape priors effectively. Experimental results on both synthetic and real-world 3D reconstruction demonstrate the superiority of Meta3D. The experiments on occluded and truncated images also demonstrate that Meta3D can handle more difficult samples, which makes it more valuable in real-world applications.