AutoSweep
The implementation for the AutoSweep (TVCG 2018)
This paper presents a fully automatic framework for extracting editable 3D objects directly from a single photograph. Unlike previous methods, which recover depth maps, point clouds, or mesh surfaces, we aim to recover 3D objects with semantic parts that can be directly edited. We base our work on the assumption that most human-made objects are constituted by parts and that these parts can be well represented by generalized primitives. Our work makes an attempt towards recovering two types of primitive-shaped objects, namely, generalized cuboids and generalized cylinders. To this end, we build a novel instance-aware segmentation network for accurate part separation. Our GeoNet outputs a set of smooth part-level masks labeled as profiles and bodies. Then, in a key stage, we simultaneously identify profile-body relations and recover 3D parts by sweeping the recognized profile along its body contour, jointly optimizing the geometry to align with the recovered masks. Qualitative and quantitative experiments show that our algorithm can recover high quality 3D models and outperforms existing methods in both instance segmentation and 3D reconstruction. The dataset and code of AutoSweep are available at https://chenxin.tech/AutoSweep.html.
There is an emerging demand for automatic extraction of high quality 3D objects from a single photograph. Applications are numerous, ranging from image manipulation [65, 64, 38] to emerging 3D printing [48, 22] and virtual and augmented reality [2, 3]. For example, in e-commerce, it is highly desirable to automatically and quickly recover the 3D model of a commercial product from its 2D image (e.g., in an advertisement). Further, the geometry and the texture map should be of high quality to be useful. The problem, however, remains challenging: any successful solution should be able to reliably segment an object from the image and then recover its shape and structure, yet both problems are ill-posed and generally require imposing priors and using sophisticated optimization.
A photograph is inherently “flat” and does not contain associated depth information. Traditional solutions rely on multi-view stereo or volumetric reconstructions to recover the point cloud, normals, or visual hull of the object. They require multiple images of an object, which are most likely inaccessible in applications such as e-commerce. More importantly, the recovered 3D geometry is of low quality even with the most advanced reconstruction algorithms. Alternative solutions [11, 19, 37, 64] treat an object as a composition of simple, primitive components [8, 29] and set out to estimate each individual component. Most existing methods in this category require extensive human input for partitioning the object. Most recently, end-to-end methods [14, 26] have leveraged generative neural networks to directly infer point cloud or volumetric representations of an object from a single image. They are able to produce coarse geometry that resembles the actual shape. Yet the quality of the resulting model still falls short of that of a CAD model or a parametric mesh.
In this paper, we present a fully automatic, single-image based technique for producing very high quality 3D geometry of a specific class of objects: objects composed of generalized cuboids and generalized cylinders or GC-GCs, for short. Both a generalized cuboid and a generalized cylinder could be represented as a profile (i.e., a circle or a rectangle) sweeping along a trajectory axis as in traditional CAD systems. Normally, the profile is allowed to scale and the trajectory axis is curved [11]. An intriguing benefit of our reconstruction pipeline is that each cuboid and cylindrical part can be directly edited by altering the profile or the trajectory axis and then composed together to form a new GC-GC. See Fig. 1, 5 for examples of GC-GCs.
In our solution, we first partition and recognize each semantic part of a GC-GC object. We exploit the instance segmentation network Mask R-CNN [31], which is capable of handling “invisible profiles” that are caused by occlusions from the foreground or even self-occlusions. However, due to a small receptive field, the output often contains erroneous boundaries and incomplete masks that do not agree with the actual object mask. We extend the structure of Mask R-CNN and construct our Geometry Network (GeoNet for short) by incorporating contour and edge maps into a concatenated network, derived from [10, 18], which we call the deformable convolutional network (DCN). The edge maps and 2D contours are used to better learn the boundary of the body and face regions, which are crucial in the subsequent modeling process. Our network outputs smooth masks around the boundary regions.
Once we segment each component, we conduct reconstruction via a volume sweeping scheme. We decouple the process into two stages of profile fitting and profile sweeping. To estimate the 3D profile, we jointly optimize the profile with the camera pose. We then extract the trajectory axis of each body mask and map it to 3D with the estimated camera pose to guide the optimization of the profile sweeping.
We demonstrate our approach on various images. Our system is capable of automatically generating 3D models from a single photograph, which can then be used for editing and rearranging. Qualitative and quantitative experiments are conducted to verify the effectiveness of our method.
Semantic Segmentation. Recent deep neural networks have shown great success in improving traditional classification and semantic segmentation tasks. The classifier in the fully convolutional networks (FCNs) [44] can conduct inference and learning on an arbitrary sized image but does not directly output individual object instances. Mask R-CNN [31] extends Faster R-CNN [51] by adding a branch for predicting object masks on top of bounding box extraction. [17, 16] use a multi-task cascaded structure to identify instances with position-sensitive score maps. FCIS [42] proposed inside and outside maps to preserve the spatial extent of the original image. [7] observed that the large receptive fields and the amount of pooling layers of these networks can degrade the quality of instance masks, causing aliasing effect. Region proposal network (RPN) [51] can only capture the rough shape of the object and its extensions [7, 43, 47] aim to improve the segmentation boundary.
Single-Image Depth Estimation. Classic methods on monocular depth estimation mainly relied on hand-crafted features and graphical models [33, 55]. More recently, several learning-based approaches boost the performance by utilizing deep models. [25] employed a multi-scale deep network with two stacks for both global and local prediction to achieve depth estimation on a single image. [46] used CNN for simultaneous depth estimation and semantic segmentation. However, these works only seek to obtain the relative 3D relationship between different layers and therefore their depth results are much less accurate and clearly insufficient for high quality reconstruction. In our reconstruction, the surface is highly curved but smooth and therefore the depth map needs to be at an ultra-high accuracy, which is extremely difficult to achieve even under the stereo setting, let alone single-image.
Single-Image 3D Reconstruction. Recovering 3D shape from a single image is a long standing problem in computer vision [8], stemming from image metrology [49, 15]. The problem is inherently ill-posed and tremendous efforts have focused on imposing constraints such as geometric priors [19, 60], symmetry [27, 34, 37], planarity constraints [63], shape priors [56, 36], etc., or relying on stock 3D models for 2D-3D alignments [38, 4, 50, 6]. Latest approaches leverage deep learning techniques [61, 32, 23] on large datasets. Eigen et al. [24] infer depth maps using a multi-scale deep network. The 3D-R2N2 [14] attempts to recover 3D voxels from a single and multiple photographs. [26] recovers a dense set of 3D point cloud using a generation network. It is also possible to incorporate 3D geometry proxies such as volumetric abstraction [54, 59], hierarchical CSG tree [12], part models [53], etc. Results from these techniques are promising but still fall short compared with CSG models. Closest to ours is the work from the Magic Leap group, with a clear interest in virtual and augmented reality, to recognize and reconstruct 3D cuboids in a single photograph [23]. Our approach is able to recover more general shapes, namely generalized cuboid and cylindrical objects.
Sweep-based 3D Modeling. A core technique we employ is 3D sweeping. Sweeping a 2D profile along a specific 3D trajectory is a common practice for generating 3D models in computer-aided design (CAD). Early CAD systems [52] use simple linear sweeps (sweeping a 2D polygon along a linear path) to generate solid models. Shiroma et al. [58] develop a generalized sweeping method for CSG modeling. Their technique supports curved sweep axis with varying shapes to produce highly complex objects. [1] conducts volume preserving stretching while avoiding self-intersections. More recent 3-Sweep [11] and its extension, D-Sweep [35], pair sweeping with image snapping. All previous approaches require manual inputs from the user whereas we focus on fully automated shape generation.
The pipeline of our framework is shown in Fig. 2. We take a single photograph containing objects of interest and feed it into our GeoNet to produce instance masks labeled as cuboid profile, cuboid body, cylinder profile, and cylinder body. These instance masks are then used for estimating the 3D profile (a circle or a rectangle) and the camera pose, along with a trajectory axis (a planar 3D curve) along which the profile is swept to create the 3D model.
The architecture of our GeoNet is illustrated in Fig. 3. We build upon the instance segmentation network of Mask R-CNN. The output of Mask R-CNN, coupled with contour image and the edge map, is fed into a deformable convolutional network which is derived from [10] and [18]. With the information of contour and edge maps, DCN is capable of learning a better and smooth boundary. Details are given in Section 4.1.
To sweep a primitive part, we first correlate the profile and body masks that could constitute a 3D part. Given correlated profile-body masks, a 3D profile is optimized together with the camera FoV, and a trajectory axis is computed from the body/profile masks. Then, sweeping is performed in 3D to progressively transform and place the estimated 3D profile along the trajectory axis to construct the final model.
Our GeoNet takes an image as input and outputs the following four types of instance masks: cuboid profile, cuboid body, cylinder profile, and cylinder body. A direct instance segmentation network (Mask R-CNN) could lead to erroneous boundaries and incomplete masks that do not agree with the actual object mask (Fig. 4), because the resolution of the feature map is lower due to the ROI memory consumption [31]. The atrous convolution of Deeplab keeps the receptive fields within a reasonable range, while deformable convolution yields more effective receptive fields, which can improve the detail of the segmentation results. Thus, we integrate the deformable convolution layers proposed in [18] into the network structure of Deeplab [10] and concatenate it with Mask R-CNN for segmentation refinement. We call the sub-network concatenated to Mask R-CNN the deformable convolutional network (DCN).
To boost the performance of our GeoNet, instead of directly feeding the DCN with the results from Mask R-CNN, we use more information from the original image to help GeoNet learn more boundary features. We have tested various cases, including different combinations of the original image, the edge map of the original image, and the probability maps from Mask R-CNN, as input to the DCN. Quantitative comparisons are given in Section 6. In the end, we find that combining the edge map [11] and contour map [13] of the input image with the probability maps given by Mask R-CNN achieves the best performance. Specifically, for each instance probability map from Mask R-CNN, we combine it with the edge map and the contour map and convert them into a single image (the probability map takes the green channel, and the edge map and the contour map take the red and blue channels respectively; see Fig. 3, middle). We assign different green values (40 for cuboid body, 100 for cuboid profile, 150 for cylinder body, and 200 for cylinder profile), weighted with the probability map, to the different instance categories to distinguish the instances. The instances in one category have quite similar geometrical characteristics, so labeling the instances with different green values helps the network learn better geometrical features within each category. We find that this simple strategy greatly improves the performance of the DCN.
The output of the DCN is a refined instance mask. After passing through the DCN, we combine all instance masks to form the final mask. To enforce feature learning, the beginning of our DCN is formed by a ResNet with deformable convolution layers in res-5a, res-5b and res-5c, followed by 2 convolution layers and 1 deconvolution layer.
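As an illustration of the channel packing described above, the following is a minimal sketch and not the released implementation. The function name `compose_dcn_input`, the assumption that all maps are floats in [0, 1], the scaling of the edge and contour maps to 8 bits, and the assignment of the edge map to red and the contour map to blue (following the order in which they are named) are all assumptions.

```python
import numpy as np

# Green values per instance category, as stated in the text.
GREEN_VALUES = {
    "cuboid_body": 40,
    "cuboid_profile": 100,
    "cylinder_body": 150,
    "cylinder_profile": 200,
}

def compose_dcn_input(prob_map, edge_map, contour_map, category):
    """Pack one instance into an H x W x 3 uint8 image in (R, G, B) order.

    All three maps are assumed to be float arrays in [0, 1] of equal shape.
    """
    green = GREEN_VALUES[category] * prob_map  # class value weighted by probability
    red = 255.0 * edge_map                     # assumed 8-bit scaling of the edge map
    blue = 255.0 * contour_map                 # assumed 8-bit scaling of the contour map
    image = np.stack([red, green, blue], axis=-1)
    return np.clip(np.rint(image), 0, 255).astype(np.uint8)
```

Keeping the class identity in the green intensity lets a single three-channel image carry both the category and the confidence of an instance alongside the boundary cues.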
Pre-training. Large nets are typically difficult to train. A good initial guess of the parameters usually leads to better convergence. Thus, before using the real images, we pre-train the net with synthetic data. We manually construct a dataset containing 10 exemplar cuboids and generalized cylinders collected from ShapeNet [9] (see Fig. 5). We render these examples from uniformly sampled view angles to generate 1000 images for each example, which gives us 10000 examples for pre-training. Since we do not have a large number of instances in our dataset, we decrease the ROI number from 256 to 128 during the training of Mask R-CNN. We also enlarge our dataset with flipped images.
Given the output masks from GeoNet, our next task is to create a 3D model that agrees with the target masks. We first separate the masks into independent parts (i.e., primitives) constituted by profiles and body and then construct each part independently.
Let us denote the set of instances segmented from the network as unlabelled profile faces and labelled bodies. Our task is to match each unlabelled profile with its corresponding body. This is essentially a labeling problem.
We formulate the following minimization problem:
(1) |
where the first term is the unary term. It combines two costs. The first cost measures the closest Euclidean distance between a profile and a body. We set it to a large constant if the distance exceeds a threshold (3% of the image height in our implementation). The second cost measures the proximity of the face to the body. It is defined from the portion of the points on the profile which are inside the oriented bounding box of the body. Both weights are set to 0.3.
The binary term is defined through an indicator function which takes value 1 if two profiles overlap and are assigned to the same body, and takes value 0 otherwise. The binary term thus penalizes two overlapping (i.e., occluded) profiles being assigned to the same body. We solve the above optimization as a Markov random field (MRF).
For bodies that have no corresponding profiles, such as the handle of a mug whose profile is invisible due to occlusion, we gather them to form a handle set and attach them to the closest bodies. Fig. 2 left gives a brief illustration. We discard falsely detected handles if they are far away from any detected body ( of the image height in our experiments).
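To make the unary cost concrete, here is a small sketch under stated assumptions; it is not the authors' code. The 3% threshold and the 0.3 weights come from the text. Everything else is hypothetical: the names, the value of `LARGE_COST`, normalizing the distance by the image height, taking the proximity cost as one minus the inside fraction, and approximating the oriented bounding box by a PCA-aligned box.

```python
import numpy as np

LARGE_COST = 1e6  # hypothetical value for the "large constant"

def closest_distance(profile_pts, body_pts):
    """Smallest Euclidean distance between any profile point and any body point."""
    diff = profile_pts[:, None, :] - body_pts[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())

def fraction_inside_obb(profile_pts, body_pts):
    """Portion of profile points inside a PCA-aligned box around the body points."""
    center = body_pts.mean(axis=0)
    _, _, axes = np.linalg.svd(body_pts - center, full_matrices=False)
    body_local = (body_pts - center) @ axes.T
    lo, hi = body_local.min(axis=0), body_local.max(axis=0)
    prof_local = (profile_pts - center) @ axes.T
    inside = np.all((prof_local >= lo) & (prof_local <= hi), axis=1)
    return float(inside.mean())

def unary_cost(profile_pts, body_pts, image_height, w_dist=0.3, w_prox=0.3):
    """Weighted sum of a distance cost and a proximity cost for one profile-body pair."""
    d = closest_distance(profile_pts, body_pts)
    dist_cost = LARGE_COST if d > 0.03 * image_height else d / image_height
    prox_cost = 1.0 - fraction_inside_obb(profile_pts, body_pts)  # assumed form
    return w_dist * dist_cost + w_prox * prox_cost
```

A full labeling would add the pairwise penalty for overlapping profiles and hand the costs to an MRF solver; the sketch covers the per-pair cost only.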
To fit our 3D model, we use perspective projection rather than orthogonal (which was used in [11]) to create 3D models resembling real world objects. Direct global optimization of the primitive and camera parameters could easily render the problem difficult due to the large variable space. We thus decouple the problem into three steps: profile fitting, trajectory axis estimation, and 3D sweeping.
As the object profiles in our case are circles and rectangles, this imposes strong priors for our optimization. We assume a fixed camera pose and camera-to-object distance. Below are details for fitting the 3D circle and rectangle respectively. The key is to find a plausible initial value for the optimization.
Circle. Circles in 3D become ellipses in 2D after projection. We use the PCA center of the profile mask as the initial circle center, with a default depth value of 10. The 3D positions of the endpoints of the PCA major axis are also obtained at depth 10. The initial radius is then assigned according to the length of the 3D major axis. For the circle orientation, we cast a ray from the camera through one of the endpoints of the minor axis to intersect with the sphere of the initial radius centered at the initial circle center. The orientation is set as the normal of the plane passing through the intersection point and the major axis.
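The initialization above can be sketched as follows. This is an illustrative sketch only: the pinhole camera with focal length `focal` and principal point `principal`, and all function names, are assumptions not given in the text. The default depth of 10 is taken from the text.

```python
import numpy as np

def pca_ellipse(mask):
    """PCA of the mask pixels: center, major/minor directions, semi-axis lengths (pixels)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    center = pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov((pts - center).T))  # ascending order
    # For a uniformly filled ellipse, the variance along a semi-axis a is a^2 / 4.
    semi_minor, semi_major = 2.0 * np.sqrt(eigvals)
    return center, eigvecs[:, 1], semi_major, eigvecs[:, 0], semi_minor

def backproject(pixel, depth, focal, principal):
    """Lift a pixel (x, y) to a 3D camera-space point at the given depth (pinhole model)."""
    x = (pixel[0] - principal[0]) * depth / focal
    y = (pixel[1] - principal[1]) * depth / focal
    return np.array([x, y, depth])

def initial_circle(mask, focal, principal, depth=10.0):
    """Initial 3D center and radius from the PCA center and one major-axis endpoint."""
    center, major_dir, semi_major, _, _ = pca_ellipse(mask)
    c3d = backproject(center, depth, focal, principal)
    e3d = backproject(center + semi_major * major_dir, depth, focal, principal)
    return c3d, float(np.linalg.norm(e3d - c3d))
```

The major axis of the projected ellipse is the least foreshortened diameter of the circle, which is why it gives a reasonable starting radius before the optimization refines it.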
Given the initial circle, together with the mask outline, we optimize 5 variables using Levenberg-Marquardt. The 5 variables are the radius, the normal, and the field of view (FoV) of the camera. We define the following optimization formulation:
(2) |
The first term stands for the alignment error after projection. It is defined from the portion of points which are not inside the mask, with a constant set to 40. This term ensures that the circle is inside the mask boundary while its radius is as large as possible after the projection. The second term stands for the error between the profile normal and the starting direction of the trajectory axis (Section 5.3) under different FoVs. It is defined from the acute angle between the normal projected to 2D and the starting direction of the medial axis mentioned in Section 5.3. The last term is a function that guarantees that the normal has a squared magnitude of 1.
In a second step, we optimize the circle position separately using only the first term of the objective function to get an updated center. With the new center, we go back to the optimization of the radius, normal, and camera FoV. The two steps are iterated until convergence.
Rectangle. Rectangles are optimized in a similar way. We first detect four vertices by fitting a quadrilateral to the profile mask. We then cast four rays from the camera to the four vertices. The 3D vertices (in clockwise order), which lie on the four rays, are then optimized as follows:
(3) |
where the additional term keeps the spatial information of the rectangle through the following constraints: (1) parallel edges have equal length, (2) adjacent edges are perpendicular to each other, (3) the four vertices are coplanar. We define this term as:
(4) |
where the edge vectors are created by adjacent vertices, and the angle function computes the cosine of the acute angle between two vectors. We add parameters to normalize each term. The remaining terms are the same as above with the radius replaced by the side length. We rectify the 3D vertices to form a strictly planar rectangle during iteration.
We then extract a trajectory axis that approximates the main axis of the body. The curve will be a guiding line for the sweeping procedure. We use a morphological operation called thinning [39] to get a single-pixel-wide skeleton of the mask image, as shown in Fig. 6 (b). To better account for the completeness of the skeleton, we use both body and profile masks for thinning. To remove the spurious branches in the skeleton, we use a simple pruning procedure. We mark the skeleton points as branching points and end points using the hit-or-miss transform [21]. Branches are identified as paths connecting end points and branching points. We progressively delete the shortest branches until no branching point remains.
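As a rough illustration of how skeleton points can be marked before pruning, the sketch below counts 8-connected neighbours instead of applying the hit-or-miss transform. This substitution is an assumption made for brevity, as are the function names; it is not the procedure of [21].

```python
import numpy as np

def neighbour_count(skel):
    """Number of 8-connected skeleton neighbours for every pixel."""
    padded = np.pad(skel.astype(np.uint8), 1)
    h, w = skel.shape
    total = np.zeros((h, w), dtype=np.uint8)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            total += padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return total

def classify_skeleton(skel):
    """Mark end points (one neighbour) and branching points (three or more neighbours)."""
    skel = skel.astype(bool)
    n = neighbour_count(skel)
    end_points = skel & (n == 1)
    branch_points = skel & (n >= 3)
    return end_points, branch_points
```

With plain neighbour counting, pixels adjacent to a junction may also be flagged as branching points, so a pruning loop built on it would need to merge such clusters before measuring branch lengths.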
Our purpose is to reconstruct cylindrical and cuboid objects whose trajectory axis is either a straight line or a curve, so we perform trajectory axis classification. The goal is to classify whether the trajectory axis is a straight line or not. Simple heuristics such as line fitting with specific thresholds could lead to erroneous estimations. For a more general solution, we utilize the training data available in our dataset. We employ LeNet [40] and modify the last FC layer to output 2 classes. We use both the body mask and the associated profile masks as input to provide the net with more contextual information. Specifically, we compute their bounding box and scale them to the size of as input to the net. We get an accuracy of for this task.
If the trajectory axis is labeled as a straight line, we rectify the axis direction w.r.t. the profile axis in case the thinning process gives an erroneous skeleton (e.g., for a cylinder we simply set the axis to be orthogonal (in 2D) to the major axis, see Fig. 7). When the trajectory axis is labeled as a curve, we set the starting point to the profile center and perform bilateral filtering to get the final curve axis. See Fig. 6 (d) for an example. We find that this simple thinning-and-rectifying strategy performs well in our experiments.
Given the 3D profile and the trajectory axis, our next task is to sweep a 3D model which approximates the body mask. As in [11], we assume that the trajectory axis lies on a plane which is orthogonal to the profile plane and passes through the profile center. For simplicity, we set the plane orientation to be orthogonal to the camera direction if the object is a generalized cylinder. For a cuboid, we let the plane pass through one of the diagonal lines of the rectangle profile.
We project the 2D body mask and the trajectory axis onto that plane and start to place the 3D profile uniformly along the projected trajectory axis. For each part to sweep, we start with the profile with the smaller fitting error if there are two. For each individual profile, indexed by its frame, we cast a 3D ray from its center to intersect with the projected body mask and regard this distance as an initial guess for the profile radius. The final radius of each profile is optimized with
(5) |
Here the optimized quantity belongs to the intermediate sweeping profile, which is represented by its sampling points. M is a 2D logical matrix representing the segmentation mask. The 3D-to-2D projection function outputs a 2-dimensional vector in the camera space. The vector is regarded as an index into M, with its two components representing row and column respectively. In Eqn 5, the first term measures how many sample points fall inside the body mask; the second term is the distance between the intersection points and their nearest points on the profile boundary, as in [11]. The intersection points are computed by casting a 2D ray from the projected center and intersecting it with the edges on the edge map. Here we reuse the edge map mentioned in Section 4.1. The third term aims to ensure that the radius is not too small. A constant in this formulation equals 0.025 in our experiment.
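The mask-coverage idea behind the first term of Eqn 5 can be sketched as follows. This is a hedged illustration rather than the paper's objective: the pinhole model with `focal` and `principal`, the circular sampling, and all function names are assumptions.

```python
import numpy as np

def project(points, focal, principal):
    """Pinhole projection of N x 3 camera-space points to (row, col) image indices."""
    col = focal * points[:, 0] / points[:, 2] + principal[0]
    row = focal * points[:, 1] / points[:, 2] + principal[1]
    return np.stack([row, col], axis=1)

def circle_samples(center, normal, radius, n=64):
    """Sample n points on a 3D circle with the given center, normal, and radius."""
    normal = normal / np.linalg.norm(normal)
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u = u / np.linalg.norm(u)
    v = np.cross(normal, u)
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return center + radius * (np.outer(np.cos(t), u) + np.outer(np.sin(t), v))

def inside_fraction(points, mask, focal, principal):
    """Fraction of projected sample points that land on True pixels of the mask."""
    rc = np.rint(project(points, focal, principal)).astype(int)
    h, w = mask.shape
    valid = (rc[:, 0] >= 0) & (rc[:, 0] < h) & (rc[:, 1] >= 0) & (rc[:, 1] < w)
    hits = np.zeros(len(points), dtype=bool)
    hits[valid] = mask[rc[valid, 0], rc[valid, 1]]
    return float(hits.mean())
```

A radius that is too large pushes samples outside the mask and lowers this fraction, which is why the full objective balances it against the edge-distance term and the term that keeps the radius from shrinking.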
The above procedure optimizes the radius for individual sweeping profiles. To ensure the continuity of the geometry, we perform a global optimization on all swept profiles after the individual frame optimization. For all frames, the aim is to refine all centers and orientations.
(6) |
Eqn 6 uses the Laplacian smoothing operator. The first term in Eqn 6 measures the smoothness of the geometry, and the second is the deviation of the centers and orientations from their initial values over the frames. Every weight inside is computed by the dot product between the tangential directions of the current and the next frame center on the trajectory axis. Eqn 5 and 6 are iterated to get the final result. In our experiments, both optimizations take around 1-3 iterations to converge.
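The trade-off between smoothness and fidelity to the initial frames can be illustrated with a simple iterative scheme. This sketch is an assumption-laden stand-in, not Eqn 6: it uses a uniform data weight instead of the tangent-based weights, smooths only the centers, and the names `weight`, `step`, and `iters` are hypothetical.

```python
import numpy as np

def laplacian(points):
    """Discrete Laplacian of an open polyline: neighbour average minus the point itself."""
    lap = np.zeros_like(points, dtype=float)
    lap[1:-1] = 0.5 * (points[:-2] + points[2:]) - points[1:-1]
    return lap  # end points keep a zero Laplacian

def roughness(points):
    """Sum of squared Laplacians, a simple measure of how jagged the polyline is."""
    return float((laplacian(points) ** 2).sum())

def smooth_centers(initial, weight=1.0, step=0.2, iters=200):
    """Relax centers towards their neighbours while pulling them back to the initial values."""
    initial = np.asarray(initial, dtype=float)
    centers = initial.copy()
    for _ in range(iters):
        centers = centers + step * (laplacian(centers) - weight * (centers - initial))
    return centers
```

A larger `weight` keeps the result closer to the per-frame estimates, while a smaller one favours a smoother sweep.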
For generalized cylinders or cuboids which have no associated profiles (e.g., a teapot handle), we estimate an initial position and radius for the profile by analyzing the contact region with the already constructed 3D body. The sweeping process is then performed similarly to create those parts (see Fig. 2, 9). Note that before the sweeping process, we globally optimize the camera pose (FoV) with all estimated 3D profiles.
Method | cub@0.7 | cuf@0.7 | cyb@0.7 | cyf@0.7 | mAP@0.7 | cub@0.9 | cuf@0.9 | cyb@0.9 | cyf@0.9 | mAP@0.9 |
---|---|---|---|---|---|---|---|---|---|---|
FCIS | 68.19 | 61.24 | 50.33 | 37.51 | 54.32 | 33.04 | 23.71 | 10.51 | 9.09 | 19.09 |
GeoNet w. FCIS | 68.61 | 61.47 | 56.75 | 37.23 | 56.01 | 48.64 | 36.88 | 17.14 | 10.30 | 28.24 |
Mask R-CNN | 68.36 | 61.22 | 55.93 | 40.26 | 56.44 | 35.73 | 30.13 | 7.29 | 10.17 | 20.83 |
GeoNet w. Mask R-CNN | 69.49 | 61.04 | 57.90 | 37.84 | 56.57 | 50.18 | 37.92 | 13.89 | 11.37 | 28.34 |
Dataset. Besides the synthetic data described in Section 4.1, our real dataset contains multiple human-made primitive-shaped objects widely used in daily life, such as mugs, bottles, taps, cages, books, and fridges. There are 11657 real images and 10000 synthetic images (with 11590 generalized cuboids and 15008 generalized cylinders). The real dataset contains about unannotated images from ImageNet [20], annotated images from Xiao et al. [61], and images collected from the Internet. The real dataset is further separated into training images and testing images. We perform evaluations of all experiments on the testing set of real images.
Experiment of GeoNet. In order to make full use of the information from the original image as well as the outputs of the instance segmentation network, we test various combinations of the gray map, edge map, and contour map of the image, and the mask and probability map from the network. We restrict the combination to form a three-channel image, and duplicate channels when the number of assembled maps is less than 3. For this experiment on the combination strategy, we adopt Mask R-CNN as the first stage of our GeoNet. We use mean intersection-over-union (mIoU) defined over image pixels as the evaluation metric, since the instances are the same during these experiments and we are focusing on boundary refinement. The results are shown in Table III; the combination of the edge map, contour map, and probability map significantly outperforms the others.
Method | cub | cuf | cyb | cyf | mean |
---|---|---|---|---|---|
Mask R-CNN | 77.56 | 80.51 | 68.68 | 75.74 | 75.62 |
GeoNet w. | 87.51 | 85.50 | 77.89 | 82.87 | 83.44 |
GeoNet w. | 89.34 | 85.84 | 79.01 | 83.19 | 84.34 |
GeoNet w. | 90.12 | 85.92 | 78.28 | 83.22 | 84.39 |
GeoNet w. | 89.67 | 86.03 | 79.78 | 83.82 | 84.83 |
GeoNet w. | 90.88 | 86.84 | 79.51 | 84.36 | 85.40 |
GeoNet w. | 91.80 | 86.24 | 85.27 | 85.37 | 87.17 |
GeoNet w. | 92.47 | 86.81 | 84.72 | 87.02 | 87.76 |
Since our GeoNet is built upon existing instance segmentation networks, to evaluate its effectiveness we experimented with the generally accepted networks FCIS [42] and Mask R-CNN [31]. We attach the DCN to both FCIS and Mask R-CNN and evaluate the improvement in the segmentation results.
Accuracy is evaluated by mean average precision (mAP) [30] at mask-level IOU (intersection-over-union), with the overlap threshold set to 0.7 and 0.9 respectively. The results are shown in Table I. DCN performs better at larger overlap thresholds. At threshold 0.9, DCN raises the mAP from 19.09 to 28.24 with FCIS and from 20.83 to 28.34 with Mask R-CNN (Table I), which shows that DCN is capable of refining the segmentation result given an adequate base result (see also Fig. 4 for a visual comparison). For a fair comparison, we set the instance count to a fixed number for computing mAP. The chart in Fig. 8 shows the mAP at different overlap thresholds. DCN works better when the base results from FCIS and Mask R-CNN agree with the ground truth. We only visualize the range [0.6, 0.9] since DCN is capable of boosting the performance when the segmentation results are rather accurate w.r.t. the ground truth, while when mAP is lower than 0.6, we find that DCN is much less helpful for refining the boundary.
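For readers unfamiliar with the overlap criterion, the following minimal sketch shows mask-level IOU and the thresholding at 0.7 and 0.9. The function names are illustrative, and the full mAP computation (ranking detections by score and averaging precision) is not reproduced here.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)

def is_match(pred, gt, threshold):
    """A predicted mask counts as correct when its IOU reaches the overlap threshold."""
    return mask_iou(pred, gt) >= threshold
```

A 10 x 10 mask shifted by a single pixel column already drops to an IOU of 90/110, which passes the 0.7 threshold but fails at 0.9; this is why the stricter threshold is sensitive to boundary quality.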
It is also noteworthy that our method is capable of segmenting and reconstructing objects from raw sketch inputs as shown in the last column of Fig. 9. This indicates that our DCN network is able to learn cues from the input contour images and edge maps for predicting the final mask.
Metric | Method | cub | cuf | cyb | cyf | mean |
---|---|---|---|---|---|---|
PP-IOU | Baseline | 80.97 | 76.66 | 78.75 | 59.85 | 74.06 |
PP-IOU | BNF [7] | 83.40 | 77.86 | 79.02 | 58.46 | 74.69 |
PP-IOU | Ours | 82.94 | 77.69 | 80.70 | 60.62 | 75.49 |
PI-IOU | Baseline | 79.50 | 78.52 | 77.70 | 59.36 | 73.77 |
PI-IOU | BNF [7] | 80.39 | 76.67 | 77.19 | 47.81 | 70.52 |
PI-IOU | Ours | 81.47 | 79.94 | 78.51 | 59.42 | 74.84 |
Comparisons to boundary refinement method. We compare GeoNet with Boundary Neural Fields (BNF) [7] on the semantic segmentation task on our test set containing 1614 cuboids and 1840 cylinders. We use the same evaluation metrics as [7]: pixel intersection-over-union averaged per pixel (PP-IOU) and pixel intersection-over-union averaged per image (PI-IOU). We also run the evaluation on the Mask R-CNN output as a baseline for the comparison.
PP-IOU is computed on a per-pixel basis. As a result, images that contain large object regions are given more importance. On the other hand, PI-IOU gives equal weight to each image. As shown in Table III, BNF has lower accuracy on PI-IOU, which indicates that it is not able to segment small objects accurately. Our method, however, outperforms Mask R-CNN and BNF in average accuracy on both metrics.
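The difference between the two averages can be seen in a short sketch. It reflects our reading of the description above rather than the reference implementation of [7]: PP-IOU pools intersections and unions over all images before dividing, while PI-IOU averages the per-image ratios. The convention for images with an empty union is an assumption.

```python
import numpy as np

def _inter_union(pred, gt):
    """Pixel counts of the intersection and the union of two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    return int(np.logical_and(pred, gt).sum()), int(np.logical_or(pred, gt).sum())

def pp_iou(preds, gts):
    """IOU pooled over all pixels of all images, so large regions dominate."""
    pairs = [_inter_union(p, g) for p, g in zip(preds, gts)]
    inter = sum(i for i, _ in pairs)
    union = sum(u for _, u in pairs)
    return inter / union if union else 1.0

def pi_iou(preds, gts):
    """Mean of per-image IOUs, so every image has equal weight."""
    scores = []
    for p, g in zip(preds, gts):
        inter, union = _inter_union(p, g)
        scores.append(inter / union if union else 1.0)  # assumed convention for empty masks
    return float(np.mean(scores))
```

With one large object segmented perfectly and one small object missed entirely, the pooled score stays near 1 while the per-image score falls to 0.5, which matches the remark that PI-IOU exposes failures on small objects.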
Comparisons to cuboid detection and reconstruction methods. We use the SUN primitive dataset [61] to evaluate our method on cuboid reconstruction and compare with the methods of [61] and [23]. For cuboid detection, a bounding box is correct if the Intersection over Union (IOU) overlap is greater than 0.5. For keypoint localization, we use the re-projection accuracy (RA) used in the baseline approach of Xiao et al. [61], as well as the Probability of Correct Keypoint (PCK) and Average Precision of Keypoint (APK) metrics used in the state-of-the-art method of Dwibedi et al. [23]. The latter two are commonly used in the human pose estimation task. We use the re-projected corners of the reconstructed cuboids as keypoints for this task. The comparison results are shown in Table IV. The numbers show that our approach performs better in both tasks.
Method | AP | RA | APK | PCK |
---|---|---|---|---|
Xiao et al.[61] | 24.00 | 38.00 | - | - |
Dwibedi et al.[23] | 75.47 | - | 41.21 | 38.27 |
Ours | 79.56 | 49.79 | 47.56 | 45.11 |
Comparisons to point/voxel-based and semi-automatic reconstruction methods. We compare our method with two single-image reconstruction methods using neural networks, Choy et al. [14] and Fan et al. [26]. We also compare with the Densely Connected 3D Autoencoder of Li et al. [62] from the ShapeNet reconstruction challenge. All of them are able to generate a rough representation of the 3D object from a single photograph. Visual comparison examples are shown in Fig. 10 and Fig. 11. We train their networks using the 2000 cup and 2000 lamp models collected from ShapeNet [9]. The models are generated with the code provided by the authors with default parameters. It can be seen that our result is cleaner and more accurate. In addition, our models can be directly textured and edited while theirs cannot, due to the lack of semantic part information.
Additionally, we conduct experiments to compare our approach with the semi-automatic method 3-Sweep [11] on models (5 tables, 5 lamps) using our own implementation. The average reconstruction error for 3-Sweep is 1.263% whereas ours is 1.262%. Fig. 11 shows qualitative examples.
Timing. The training of the networks is performed on a server with NVIDIA GeForce GTX Titan X GPUs, an Intel i7-6700K CPU, and 64GB RAM. It takes three days to train the Mask R-CNN and one day to train the DCN on our dataset of 8183 images. It takes for GeoNet to segment one image and less than 1 second to reconstruct objects from the masks including stages of instance labeling, profile fitting, and 3D sweeping with multi-thread acceleration (the individual profile optimization can be performed in parallel).
Limitations.
Our method has a few limitations. As shown in Fig. 12, the network is not able to infer the regions of instances which are cluttered or under occlusion. Priors such as symmetry and physical validity can be enforced to alleviate the problem as in [28, 57]. Next, the network may also give wrong class labels when the 2D projection of the shape is vague. As shown in Fig. 12, the remote control is mistaken for a generalized cylinder by the network. For complex objects, our method is currently not able to predict parts which deviate much from the training set or cannot be approximated by GC-GCs such as the parts of the table shown in Fig. 12. In this example, it should also be noted that our method may fail to predict correct alignments between the parts. This is because in our experiments, individual parts are constructed in parallel whereas their semantic relations such as coplanar or co-axial may need further rectification utilizing methods of e.g., [41]. In the future, it would be interesting to incorporate such semantics in the network design. Finally, our method cannot handle cases where the axis of the object does not lie on a spatial plane. Thus the object can not have spiral axis such as a spring. To infer such spatially varying curved trajectory requires additional assumptions [5]. We also leave this for future work.
This paper presents a fully automatic method for extracting 3D editable objects from a single photograph. Our framework uses Mask R-CNN as a basis to build a network which is capable of improving the instance segmentation results. In the subsequent modeling stage, we simultaneously optimize for the camera pose and the 3D object profile and estimate the 3D body shape by a sweeping algorithm.
Our framework is capable of reconstructing primitive objects constituted by generalized cuboids and generalized cylinders. Unlike previous 3D reconstruction methods which reconstruct either 3D point clouds, voxels, or surface meshes, our model recovers high-quality semantic parts and their relations, which naturally enables plausible edits of the image objects. Qualitative and quantitative results have demonstrated the effectiveness of our method. In the future, we plan to explore possibilities of building a more generic and end-to-end framework to reconstruct high-quality primitive 3D shapes from single images or videos.
The authors would like to thank all the reviewers for their insightful comments. This work was supported in part by the National Natural Science Foundation of China No. 61502306, No. U1609215, the National Key Research & Development Program of China (2016YFB1001403), and the China Young 1000 Talents Program.
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3762–3769.
Proceedings of the 6th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’79, San Francisco, CA, USA, pp. 105–113. ISBN 0-934613-47-8.
Imagining the unseen: stability-based cuboid arrangements for scene understanding. ACM Trans. Graph. 33 (6), pp. 209:1–209:11. ISSN 0730-0301.