AutoSweep: Recovering 3D Editable Objects from a Single Photograph

by   Xin Chen, et al.
University of Leeds
Zhejiang University

This paper presents a fully automatic framework for extracting editable 3D objects directly from a single photograph. Unlike previous methods, which recover depth maps, point clouds, or mesh surfaces, we aim to recover 3D objects with semantic parts that can be directly edited. We base our work on the assumption that most human-made objects are constituted by parts and that these parts can be well represented by generalized primitives. Our work makes an attempt towards recovering two types of primitive-shaped objects, namely, generalized cuboids and generalized cylinders. To this end, we build a novel instance-aware segmentation network for accurate part separation. Our GeoNet outputs a set of smooth part-level masks labeled as profiles and bodies. Then, in a key stage, we simultaneously identify profile-body relations and recover 3D parts by sweeping the recognized profile along the body contour, jointly optimizing the geometry to align with the recovered masks. Qualitative and quantitative experiments show that our algorithm can recover high-quality 3D models and outperforms existing methods in both instance segmentation and 3D reconstruction. The dataset and code of AutoSweep are available at





Code Repositories


The implementation for the AutoSweep (TVCG 2018)


1 Introduction

There is an emerging demand for the automatic extraction of high-quality 3D objects from a single photograph. Applications are numerous, ranging from image manipulation [65, 64, 38] to emerging 3D printing [48, 22] and virtual and augmented reality [2, 3]. For example, in e-commerce, it is highly desirable to automatically and quickly recover the 3D model of a commercial product from its 2D image (e.g., in an advertisement). Further, the geometry and the texture map should be of high quality to be useful. The problem, however, remains challenging: any successful solution must reliably segment an object from the image and then recover its shape and structure, yet both problems are ill-posed and generally require imposing priors and using sophisticated optimization.

A photograph is inherently “flat” and does not contain associated depth information. Traditional solutions rely on multi-view stereo or volumetric reconstruction to recover the point cloud, normals, or visual hull of the object. They require multiple images of an object, which are most likely unavailable in applications such as e-commerce. More importantly, the recovered 3D geometry is of low quality even with the most advanced reconstruction algorithms. Alternative solutions [11, 19, 37, 64] treat an object as a composition of simple, primitive components [8, 29] and set out to estimate each individual component. Most existing methods in this category require extensive human input for partitioning the object. Most recently, end-to-end methods [14, 26] have leveraged generative neural networks to directly infer point cloud or volumetric representations of an object from a single image. They are able to produce coarse geometry that resembles the actual shape. Yet, the quality of the resulting model still falls well short of that of a CAD model or a parametric mesh.

Fig. 1: Exemplar 3D models generated using our method.
Fig. 2: The pipeline. Our method takes as input a single photograph and extracts its semantic part masks labeled as cylinder profile, cuboid profile, cylinder body, etc., which are then used in a sweeping procedure to construct a textured 3D model.

In this paper, we present a fully automatic, single-image based technique for producing very high quality 3D geometry of a specific class of objects: objects composed of generalized cuboids and generalized cylinders or GC-GCs, for short. Both a generalized cuboid and a generalized cylinder could be represented as a profile (i.e., a circle or a rectangle) sweeping along a trajectory axis as in traditional CAD systems. Normally, the profile is allowed to scale and the trajectory axis is curved [11]. An intriguing benefit of our reconstruction pipeline is that each cuboid and cylindrical part can be directly edited by altering the profile or the trajectory axis and then composed together to form a new GC-GC. See Fig. 1, 5 for examples of GC-GCs.

In our solution, we first partition and recognize each semantic part of a GC-GC object. We exploit the instance segmentation network Mask R-CNN [31], which is capable of handling “invisible profiles” that are caused by occlusions from the foreground or even by self-occlusions. However, due to a small receptive field, the output often contains erroneous boundaries and incomplete masks that do not agree with the actual object mask. We extend the structure of Mask R-CNN and construct our Geometry Network (GeoNet for short) by incorporating contour and edge maps into a concatenated network, which we call the deformable convolutional network (DCN), derived from [10, 18]. The edge maps and 2D contours are used to better learn the boundaries of the body and face regions, which are crucial in the subsequent modeling process. Our network outputs smooth masks around the boundary regions.

Once we segment each component, we conduct reconstruction via a volume sweeping scheme. We decouple the process into two stages of profile fitting and profile sweeping. To estimate the 3D profile, we jointly optimize the profile with the camera pose. We then extract the trajectory axis of each body mask and map it to 3D with the estimated camera pose to guide the optimization of the profile sweeping.

We demonstrate our approach on various images. Our system is capable of automatically generating 3D models from a single photograph, which can then be used for editing and rearranging. Qualitative and quantitative experiments are conducted to verify the effectiveness of our method.

2 Related Work

Semantic Segmentation. Recent deep neural networks have shown great success in improving traditional classification and semantic segmentation tasks. The classifier in the fully convolutional networks (FCNs) [44] can conduct inference and learning on an arbitrary sized image but does not directly output individual object instances. Mask R-CNN [31] extends Faster R-CNN [51] by adding a branch for predicting object masks on top of bounding box extraction. [17, 16] use a multi-task cascaded structure to identify instances with position-sensitive score maps. FCIS [42] proposed inside and outside maps to preserve the spatial extent of the original image. [7] observed that the large receptive fields and the amount of pooling layers of these networks can degrade the quality of instance masks, causing aliasing effect. Region proposal network (RPN) [51] can only capture the rough shape of the object and its extensions [7, 43, 47] aim to improve the segmentation boundary.

Single-Image Depth Estimation. Classic methods on monocular depth estimation mainly relied on hand-crafted features and graphical models[33, 55]. More recently, several learning-based approaches boost the performance by utilizing deep models. [25] employed a multi-scale deep network with two stacks for both global and local prediction to achieve depth estimation on a single image. [46] used CNN for simultaneous depth estimation and semantic segmentation. However, these works only seek to obtain the relative 3D relationship between different layers and therefore their depth results are much less accurate and clearly insufficient for high quality reconstruction. In our reconstruction, the surface is highly curved but smooth and therefore the depth map needs to be at an ultra-high accuracy, which is extremely difficult to achieve even under the stereo setting, let alone single-image.

Single-Image 3D Reconstruction. Recovering 3D shape from a single image is a long-standing problem in computer vision [8], stemming from image metrology [49, 15]. The problem is inherently ill-posed and tremendous efforts have focused on imposing constraints such as geometric priors [19, 60], symmetry [27, 34, 37], planarity constraints [63], shape priors [56, 36], etc., or relying on stock 3D models for 2D-3D alignments [38, 4, 50, 6].

Latest approaches leverage deep learning techniques [61, 32, 23] on large datasets. Eigen et al. [24] infer depth maps using a multi-scale deep network. 3D-R2N2 [14] attempts to recover 3D voxels from single or multiple photographs. [26] recovers a dense 3D point cloud using a generation network. It is also possible to incorporate 3D geometry proxies such as volumetric abstraction [54, 59], hierarchical CSG trees [12], part models [53], etc. Results from these techniques are promising but still fall short compared with CSG models. Closest to ours is the work from the Magic Leap group, with a clear interest in virtual and augmented reality, to recognize and reconstruct 3D cuboids in a single photograph [23]. Our approach is able to recover more general shapes, namely generalized cuboid and cylindrical objects.

Sweep-based 3D Modeling. A core technique we employ is 3D sweeping. Sweeping a 2D profile along a specific 3D trajectory is a common practice for generating 3D models in computer-aided design (CAD). Early CAD systems [52] use simple linear sweeps (sweeping a 2D polygon along a linear path) to generate solid models. Shiroma et al.[58] develop a generalized sweeping method for CSG modeling. Their technique supports curved sweep axis with varying shapes to produce highly complex objects. [1] conducts volume preserving stretching while avoiding self-intersections. More recent 3-Sweep [11] and its extension, D-Sweep [35], pair sweeping with image snapping. All previous approaches require manual inputs from the user whereas we focus on fully automated shape generation.

Fig. 3: The structure of our GeoNet is composed by an instance segmentation network (Mask R-CNN) and a deformable convolutional network derived from [10, 18]. The net outputs instance masks labeled as semantic parts (profiles, bodies).

3 Overview

The pipeline of our framework is shown in Fig. 2. We take a single photograph containing objects of interest and feed it into our GeoNet to produce instance masks labeled as cuboid profile, cuboid body, cylinder profile, and cylinder body. These instance masks are then used for estimating the 3D profile (a circle or a rectangle) and the camera pose, along with a trajectory axis (a planar 3D curve) along which the profile is swept to create the 3D model.

The architecture of our GeoNet is illustrated in Fig. 3. We build upon the instance segmentation network of Mask R-CNN. The output of Mask R-CNN, coupled with the contour image and the edge map, is fed into a deformable convolutional network derived from [10] and [18]. With the information from the contour and edge maps, the DCN is capable of learning a better, smoother boundary. Details are given in Section 4.1.

To sweep a primitive part, we first associate the profile and body masks that together constitute a 3D part. Given the associated profile-body masks, a 3D profile is optimized together with the camera FoV, and a trajectory axis is computed from the body and profile masks. Then, sweeping is performed in 3D to progressively transform and place the estimated 3D profile along the trajectory axis to construct the final model.

4 Instance Segmentation

4.1 GeoNet

Our GeoNet takes an image as input and outputs the following four types of instance masks: cuboid profile, cuboid body, cylinder profile, and cylinder body. Using an instance segmentation network (Mask R-CNN) directly could lead to erroneous boundaries and incomplete masks that do not agree with the actual object mask (Fig. 4), because the resolution of the feature map is lowered to limit ROI memory consumption [31]. The atrous convolution of Deeplab keeps the receptive fields within a reasonable range, while deformable convolution yields more effective receptive fields, which can improve the detail of the segmentation results. Thus, we integrate the deformable convolution layers proposed in [18] into the network structure of Deeplab [10] and concatenate it with Mask R-CNN for segmentation refinement. We call the sub-network concatenated to Mask R-CNN the deformable convolutional network (DCN).

Fig. 4: (a) Input image. Segmentation (b) and modeling (c) results of Mask R-CNN. Our GeoNet is capable of filling the gaps and snapping to the boundary (d), (e).

To boost the performance of our GeoNet, instead of directly feeding the results from Mask R-CNN into the DCN, we use more information from the original image to help GeoNet learn more boundary features. We have tested various cases, including different combinations of the original image, the edge map of the original image, and the probability maps from Mask R-CNN, as inputs to the DCN. Quantitative comparisons are given in Section 6. We find that combining the edge map [11] and contour map [13] of the input image with the probability maps given by Mask R-CNN achieves the best performance. Specifically, for each instance probability map from Mask R-CNN, we combine it with the edge map and the contour map and convert them into a single image (the probability map takes the green channel, and the edge map and the contour map take the red and blue channels respectively, see Fig. 3 middle). We assign different green values (40 for cuboid body, 100 for cuboid profile, 150 for cylinder body, and 200 for cylinder profile), weighted by the probability map, to distinguish the instance categories. Instances in one category have quite similar geometric characteristics, so labeling the instances with category-specific green values helps the network learn better geometric features within each category. We find this simple strategy greatly improves the performance of the DCN.
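The channel packing described above can be sketched as follows. This is a minimal illustration and not the released implementation: the function name `encode_instance`, the nested-list image representation, the scaling of the edge and contour maps to 0..255, and the assignment of the edge map to red and the contour map to blue are all assumptions. Only the four green values are taken from the text.

```python
# Green value per instance category, as listed in the text.
GREEN_VALUE = {
    "cuboid_body": 40,
    "cuboid_profile": 100,
    "cylinder_body": 150,
    "cylinder_profile": 200,
}


def encode_instance(prob, edge, contour, category):
    """Pack one instance into an H x W grid of (R, G, B) tuples.

    prob, edge, contour: H x W nested lists with values in [0, 1].
    Assumptions: edge -> red, contour -> blue, both scaled to 0..255;
    green is the category value weighted by the probability map.
    """
    green = GREEN_VALUE[category]
    image = []
    for p_row, e_row, c_row in zip(prob, edge, contour):
        image.append([
            (round(255 * e), round(green * p), round(255 * c))
            for p, e, c in zip(p_row, e_row, c_row)
        ])
    return image


prob = [[1.0, 0.5], [0.0, 0.25]]
edge = [[0.0, 1.0], [0.0, 0.0]]
contour = [[1.0, 0.0], [0.0, 0.0]]
encoded = encode_instance(prob, edge, contour, "cylinder_profile")
```

With this encoding, a confident cylinder-profile pixel carries a green value of 200, while an uncertain one is scaled down in proportion to its probability.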

The output of the DCN is a refined instance mask. After passing through the DCN, all instance masks are combined to form the final mask. To enforce feature learning, the front of our DCN is formed by ResNet with deformable convolution layers in res-5a, res-5b and res-5c, followed by 2 convolution layers and 1 deconvolution layer.

Pre-training. Large nets are typically difficult to train. A good initial guess of the parameters usually leads to better convergence. Thus before using the real images, we pre-train the net with synthetic data. We manually construct a dataset containing 10 exemplar cuboids and generalized cylinders collected from ShapeNet [9] (see in Fig. 5). We render these examples from uniformly sampled view angles to generate 1000 images for each example, which gives us 10000 examples for pre-training. Since we do not have a large number of instances in our dataset, we decrease the ROI number from 256 to 128 during the training of Mask R-CNN. We also enlarge our dataset with flipped images.

5 Modeling

Given the output masks from GeoNet, our next task is to create a 3D model that agrees with the target masks. We first separate the masks into independent parts (i.e., primitives) constituted by profiles and body and then construct each part independently.

Fig. 5: Representative synthetic models used in our pre-training. The second and third rows are the corresponding contour maps and label masks, respectively.

5.1 Instance labelling

Let us denote the set of instances segmented from the network as unlabelled profile faces $F = \{f_i\}$ and labelled bodies $B = \{b_j\}$. Our task is to match each unlabelled profile $f_i$ with its corresponding body $b_j$. This is essentially a labeling problem.

We formulate the following minimization problem over the labels $l_i \in B$ assigned to the profiles:

$$\min_{\{l_i\}} \; \sum_{i} D(f_i, l_i) + \sum_{i < k} V(l_i, l_k)$$

where $D(f_i, l_i) = \omega_d D_d(f_i, l_i) + \omega_p D_p(f_i, l_i)$ is the unary term. $D_d$ measures the closest Euclidean distance between profile $f_i$ and body $l_i$. We set it to a large constant if the distance exceeds a threshold (3% of the image height in our implementation). $D_p$ measures the proximity of the face to the body. It is defined in terms of $\rho$, the portion of the points on profile $f_i$ which are inside the oriented bounding box of body $l_i$. Both $\omega_d$ and $\omega_p$ are set to 0.3.

The binary term $V(l_i, l_k)$ is defined through $\delta(f_i, f_k)$, a function which takes value 1 if $f_i$ and $f_k$ overlap and $l_i$ is equal to $l_k$, and takes value 0 otherwise. The binary term penalizes two overlapping (i.e., occluded) profiles being assigned to the same body. We solve the above optimization as a Markov random field (MRF).

For bodies that have no corresponding profiles, such as the handle of a mug whose profile is invisible due to occlusion, we gather them to form a handle set $H$ and attach them to the closest bodies in $B$. Fig. 2 left gives a brief illustration. We discard falsely detected handles if they are too far from every detected body (the threshold is a fraction of the image height in our experiments).
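The profile-to-body assignment can be illustrated with a small sketch. It is not the MRF solver used in the paper: for clarity it enumerates every assignment exhaustively, which is only practical for a handful of instances. The proximity cost `1 - inside_ratio`, the value of the large constant, the unit penalty for the binary term, and all function names are assumptions; the weights of 0.3 and the 3% distance threshold come from the text.

```python
from itertools import product

BIG = 1e3  # large constant for profiles too far from a body (assumed value)


def unary(dist, inside_ratio, w_d=0.3, w_p=0.3, max_dist=0.03):
    """Unary cost of assigning one profile to one body.

    dist: closest profile-body distance, as a fraction of the image height.
    inside_ratio: portion of profile points inside the body's oriented box.
    Assumption: the proximity cost is 1 - inside_ratio.
    """
    d = BIG if dist > max_dist else dist
    return w_d * d + w_p * (1.0 - inside_ratio)


def label_profiles(dists, ratios, overlaps):
    """Return the assignment (one body index per profile) of minimal energy.

    dists[i][j], ratios[i][j]: measurements for profile i and body j.
    overlaps: set of (i, k) profile index pairs whose masks overlap.
    """
    n_profiles, n_bodies = len(dists), len(dists[0])
    best, best_energy = None, float("inf")
    for labels in product(range(n_bodies), repeat=n_profiles):
        energy = sum(
            unary(dists[i][j], ratios[i][j]) for i, j in enumerate(labels)
        )
        # Binary term: penalize overlapping profiles that share a body.
        energy += sum(1.0 for i, k in overlaps if labels[i] == labels[k])
        if energy < best_energy:
            best, best_energy = labels, energy
    return best


dists = [[0.01, 0.02], [0.01, 0.02]]
ratios = [[0.9, 0.1], [0.8, 0.6]]
with_overlap = label_profiles(dists, ratios, {(0, 1)})
without_overlap = label_profiles(dists, ratios, set())
```

In this toy case both profiles prefer body 0 on their own, but once they are marked as overlapping the binary term pushes the second profile to body 1.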

To fit our 3D model, we use perspective projection rather than orthographic projection (which was used in [11]) to create 3D models resembling real-world objects. Direct global optimization of the primitive and camera parameters is difficult because of the large variable space. We thus decouple the problem into three steps: profile fitting, trajectory axis estimation, and 3D sweeping.

5.2 Profile fitting

As the object profiles in our case are circles and rectangles, this imposes strong priors for our optimization. We assume a fixed camera pose and camera-to-object distance. Below are details for fitting the 3D circle and rectangle respectively. The key is to find a plausible initial value for the optimization.

Circle. Circles in 3D become ellipses in 2D after projection. We use the PCA center of the profile mask as the initial circle center $\mathbf{c}$, with a default depth value of 10. The 3D positions of the endpoints of the PCA major axis are also obtained at depth 10. The initial radius $r$ is then assigned according to the length of the 3D major axis. For the circle orientation, we cast a ray from the camera through one of the endpoints of the minor axis to intersect the sphere of radius $r$ centered at $\mathbf{c}$. Let $\mathbf{p}$ be the intersection point. The orientation $\mathbf{n}$ is set as the normal of the plane spanned by the 3D major axis and $\mathbf{p}$.

Given the initial circle $(\mathbf{c}, r, \mathbf{n})$, together with the mask outline, we optimize 5 variables using Levenberg-Marquardt. The 5 variables are the radius $r$, the three components of the normal $\mathbf{n}$, and $\theta$, which is the field of view (FoV) of the camera. We define the following optimization formulation:

$$\min_{r,\,\mathbf{n},\,\theta} \; E_a + E_n + E_u$$

The alignment term $E_a$ stands for the alignment error after projection. It is defined in terms of $\rho$, the portion of projected circle points which are not inside the mask, with an associated parameter set to 40. $E_a$ ensures the circle is inside the mask boundary while its radius is as large as possible after the projection. The normal term $E_n$ stands for the error between the profile normal and the starting direction of the trajectory axis (Section 5.3) under different FoVs. It is a function of $\phi$, the acute angle between $\mathbf{n}'$ and $\mathbf{d}$, with $\mathbf{n}'$ denoting the normal projected to 2D and $\mathbf{d}$ denoting the starting direction of the medial axis mentioned in Section 5.3. $E_u$ is a function that guarantees that the normal $\mathbf{n}$ has a squared magnitude of 1.

In a second step, we optimize the circle position separately using only the first term of the objective function to get an updated $\mathbf{c}$. With the new $\mathbf{c}$, we go back to the optimization of the radius, normal and camera FoV. The two steps are iterated until convergence.
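The quantity at the heart of the alignment term, the portion of projected circle points that fall outside the mask, can be sketched as follows. The camera convention (pinhole camera at the origin looking down +z, FoV measured vertically, principal point at the image center), the number of samples, and all function names are assumptions made for illustration; points are assumed to lie in front of the camera.

```python
import math


def circle_points(center, normal, radius, n=64):
    """Sample n points on a 3D circle given its center, normal, and radius."""
    nx, ny, nz = normal
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    nx, ny, nz = nx / norm, ny / norm, nz / norm
    # Pick a helper axis that is not parallel to the normal.
    ax = (1.0, 0.0, 0.0) if abs(nx) < 0.9 else (0.0, 1.0, 0.0)
    # u = normal x helper axis, then normalize; v = normal x u.
    u = (ny * ax[2] - nz * ax[1], nz * ax[0] - nx * ax[2], nx * ax[1] - ny * ax[0])
    ul = math.sqrt(sum(c * c for c in u))
    u = tuple(c / ul for c in u)
    v = (ny * u[2] - nz * u[1], nz * u[0] - nx * u[2], nx * u[1] - ny * u[0])
    points = []
    for k in range(n):
        a = 2.0 * math.pi * k / n
        points.append(tuple(
            center[i] + radius * (math.cos(a) * u[i] + math.sin(a) * v[i])
            for i in range(3)
        ))
    return points


def project(point, fov_deg, width, height):
    """Pinhole projection to pixel coordinates (column, row)."""
    focal = 0.5 * height / math.tan(math.radians(fov_deg) / 2.0)
    x, y, z = point
    return (width / 2.0 + focal * x / z, height / 2.0 + focal * y / z)


def outside_ratio(points, mask, fov_deg):
    """Portion of projected points that do not land inside the binary mask."""
    height, width = len(mask), len(mask[0])
    outside = 0
    for p in points:
        col, row = project(p, fov_deg, width, height)
        r, c = math.floor(row), math.floor(col)
        if not (0 <= r < height and 0 <= c < width and mask[r][c]):
            outside += 1
    return outside / len(points)


mask = [[1] * 100 for _ in range(100)]
small = outside_ratio(circle_points((0, 0, 10), (0, 0, 1), 1.0), mask, 60.0)
large = outside_ratio(circle_points((0, 0, 10), (0, 0, 1), 10.0), mask, 60.0)
```

A fitting loop would evaluate this ratio for each candidate radius, normal, and FoV; growing the radius until the ratio starts to rise matches the stated goal of a circle that is as large as possible while staying inside the mask.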

Rectangle. Rectangles are optimized in a similar way. We first detect four vertices by fitting a quadrilateral to the profile mask. We then cast four rays from the camera through the four vertices. The 3D vertices $\mathbf{v}_1, \dots, \mathbf{v}_4$ (in clockwise order), which lie on the four rays, are then optimized as follows:

$$\min_{\mathbf{v}_1, \dots, \mathbf{v}_4} \; E_r + E_a + E_n$$

where $E_r$ keeps the spatial information of the rectangle through the following constraints: (1) parallel edges have equal length, (2) adjacent edges are perpendicular to each other, (3) the four vertices are coplanar. $E_r$ is defined on the edge vectors $\mathbf{e}_i$ created by adjacent vertices, using a function that computes the cosine of the acute angle between two vectors. We add parameters to normalize each term. The alignment term $E_a$ and the normal term $E_n$ are the same as for the circle, with the radius replaced by the side length. We rectify the 3D vertices to form a strict planar rectangle during iteration.
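One way to turn the three rectangle constraints into a single penalty is sketched below. The exact form of the term is not recoverable from the text, so the squared penalties, the normalization by the mean edge length, and the function names are assumptions; the sketch only shows that each constraint can be expressed with edge vectors and cosines, and that the penalty vanishes for a perfect planar rectangle.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def length(a):
    return math.sqrt(dot(a, a))


def rectangle_energy(v):
    """Penalty for four ordered 3D vertices; 0 for a planar rectangle."""
    e = [sub(v[(i + 1) % 4], v[i]) for i in range(4)]  # edge vectors
    scale = sum(length(x) for x in e) / 4.0  # normalizer (assumed choice)
    # (1) Opposite (parallel) edges have equal length.
    e_len = sum(
        (length(e[i]) - length(e[i + 2])) ** 2 for i in range(2)
    ) / scale ** 2
    # (2) Adjacent edges are perpendicular: their cosine should be 0.
    e_perp = sum(
        (dot(e[i], e[(i + 1) % 4])
         / (length(e[i]) * length(e[(i + 1) % 4]))) ** 2
        for i in range(4)
    )
    # (3) Coplanarity: the fourth vertex lies in the plane of the first three.
    n = cross(e[0], e[1])
    e_plane = (dot(n, sub(v[3], v[0])) / (length(n) * scale)) ** 2
    return e_len + e_perp + e_plane


flat = rectangle_energy([(0, 0, 5), (2, 0, 5), (2, 1, 5), (0, 1, 5)])
skewed = rectangle_energy([(0, 0, 5), (2, 0, 5), (2, 1, 6), (0, 1, 5)])
```

Normalizing by the mean edge length keeps the three penalties on a comparable scale regardless of the depth at which the rectangle sits.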

5.3 Trajectory axis extraction

Fig. 6: (a) Original image with profile mask. (b) 3D profile (in white). (c) Trajectory axis after thinning. (d) Trajectory axis after pruning.

We then extract a trajectory axis that approximates the main axis of the body. The curve serves as a guiding line for the sweeping procedure. We use a morphological operation called thinning [39] to get a single-pixel-wide skeleton of the mask image, as shown in Fig. 6 (c). To make the skeleton more complete, we use both body and profile masks for thinning. To remove spurious branches from the skeleton, we apply a simple pruning scheme. We mark the skeleton points as branching points and end points using the hit-or-miss transform [21]. Branches are identified as paths connecting end points and branching points. We progressively delete the shortest branch until no branching point remains.
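The pruning loop can be sketched on a skeleton stored as a set of pixel coordinates. Two simplifications are assumed here and differ from the description above: end points and branching points are found by counting neighbors instead of applying the hit-or-miss transform, and 4-connectivity is used so that junction pixels are unambiguous. The skeleton is assumed to be a single connected component, and the function names are hypothetical.

```python
def neighbors(p, pixels):
    """4-connected skeleton neighbors of pixel p = (row, col)."""
    r, c = p
    return [q for q in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if q in pixels]


def trace_branch(end, pixels):
    """Walk from an end point until a branching point is reached.

    Returns the branch pixels, excluding the branching point itself.
    """
    path, prev, cur = [end], None, end
    while True:
        nxt = [q for q in neighbors(cur, pixels) if q != prev]
        if len(nxt) != 1:
            # More than one way forward: cur is a branching point, drop it.
            return path[:-1] if len(nxt) > 1 else path
        prev, cur = cur, nxt[0]
        path.append(cur)


def prune(pixels):
    """Delete the shortest branch repeatedly until no branching point is left."""
    pixels = set(pixels)
    while any(len(neighbors(p, pixels)) > 2 for p in pixels):
        ends = [p for p in pixels if len(neighbors(p, pixels)) == 1]
        if not ends:
            break  # only closed loops remain
        shortest = min((trace_branch(e, pixels) for e in ends), key=len)
        pixels -= set(shortest)
    return pixels


# A horizontal axis with a short spur growing upward from column 4.
skeleton = {(5, c) for c in range(11)} | {(4, 4), (3, 4)}
pruned = prune(skeleton)
```

In this example the two-pixel spur is the shortest branch, so it is removed first and the horizontal axis survives intact.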

Fig. 7: Trajectory axis extraction. (a) The input mask image. (b) Our result. (c) Medial axis extracted using the method of [45]

Our purpose is to reconstruct cylindrical and cuboid objects whose trajectory axis is either a straight line or a curve, so we perform trajectory axis classification. The goal is to classify whether the trajectory axis is a straight line or not. Simple heuristics such as line fitting with specific thresholds could lead to erroneous estimations. For a more general solution, we utilize the training data available in our dataset. We employ LeNet [40] and modify the last FC layer to output 2 classes. We use both the body mask and the associated profile masks as input to provide the net with more contextual information. Specifically, we compute their bounding box and scale them to a fixed size as input to the net. We get an accuracy of for this task.

If the trajectory axis is labeled as a straight line, we rectify the axis direction w.r.t. the profile axis in case the thinning process gives an erroneous skeleton (e.g., for a cylinder we simply set the axis to be orthogonal (in 2D) to the major axis, see Fig. 7). When the trajectory axis is labeled as a curve, we set the starting point to the profile center and perform bilateral filtering to get the final curve axis. See Fig. 6 (d) for an example. We find this simple thinning-and-rectifying strategy to perform well in our experiments.

We also investigated the previous medial axis extraction method of [45]. Their method disregards the context information of the profile faces and thus could lead to erroneous estimations (see an example in Fig. 7).

5.4 Sweeping

Given the 3D profile and the trajectory axis, our next task is to sweep a 3D model which approximates the body mask. As in [11], we assume that the trajectory axis lies on a plane which is orthogonal to the profile plane and passes through the profile center. For simplicity, we set the plane orientation to be orthogonal to the camera direction if the object is a generalized cylinder. For a cuboid, we let the plane pass through one of the diagonal lines of the rectangle profile.

We project the 2D body mask and the trajectory axis onto that plane and place the 3D profile uniformly along the projected trajectory axis. For each part to sweep, we start with the profile with the smaller fitting error if there are two. For each individual profile $P_t$, where $t$ stands for the frame index, we cast a 3D ray from its center to intersect the projected body mask and regard this distance as an initial guess for the profile radius. The final radius of $P_t$ is optimized with a three-term objective (Eqn 5).

Here $P_t$ is the intermediate sweeping profile, and $\mathbf{s}_k$ represents the sampling points of profile $P_t$. $M$ is a 2D logical matrix representing the segmentation mask. $\Pi$ is a 3D-to-2D projection function which outputs a 2-dimensional vector $(x, y)$ in the camera space. The vector is regarded as an index into $M$, with $x$ and $y$ representing row and column respectively. In Eqn 5, the first term measures how many sample points fall inside the body mask; the second term is the distance between the intersection points $\mathbf{q}$ and their nearest points on the profile boundary, as in [11]. $\mathbf{q}$ is computed by casting a 2D ray from the projected center $\Pi(\mathbf{c}_t)$ and intersecting it with the edges on the edge map. Here we reuse the edge map mentioned in Section 4.1. The third term aims to ensure that the radius is not too small; its associated parameter equals 0.025 in our experiment.

The above procedure optimizes the radius for individual sweeping profiles. To ensure the continuity of the geometry, we perform a global optimization on all swept profiles after the individual frame optimization. For all frames, the aim is to refine all centers $\mathbf{c}_t$ and orientations $\mathbf{n}_t$ (Eqn 6).

In Eqn 6, $\Delta$ is the Laplacian smoothing operator. The first term measures the smoothness of the geometry, and the second is the deviation of $\mathbf{c}_t$ and $\mathbf{n}_t$ from their initial values. Every weight in the second term is computed as the dot product between the tangential directions of the current and the next frame center on the trajectory axis. Eqn 5 and 6 are iterated to get the final result. In our experiments, both optimizations take around 1-3 iterations to converge.
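The trade-off between smoothness and fidelity to the initial frames can be illustrated with a simple relaxation. This is not Eqn 6: it replaces the Laplacian term with a neighbor-averaging update, keeps the two end frames fixed, and treats only the centers. The weights follow the description above (dot product of the unit tangents at the current and next frame) and are assumed to be non-negative; all function names are hypothetical.

```python
def tangent_weights(axis):
    """Per-frame weight: dot product of unit tangents at this and the next frame."""
    def unit(a, b):
        d = [y - x for x, y in zip(a, b)]
        n = sum(c * c for c in d) ** 0.5
        return [c / n for c in d]

    tangents = [unit(axis[i], axis[i + 1]) for i in range(len(axis) - 1)]
    tangents.append(tangents[-1])  # reuse the last tangent for the final frame
    last = len(tangents) - 1
    return [
        sum(a * b for a, b in zip(tangents[i], tangents[min(i + 1, last)]))
        for i in range(len(axis))
    ]


def smooth_centers(initial, weights, iterations=50):
    """Relax interior centers toward their neighbors' average.

    Each update balances the neighbor average against the initial value,
    with the per-frame weight controlling how strongly the initial value pulls.
    """
    centers = [list(c) for c in initial]
    for _ in range(iterations):
        updated = [list(c) for c in centers]
        for i in range(1, len(centers) - 1):
            for d in range(len(centers[i])):
                avg = 0.5 * (centers[i - 1][d] + centers[i + 1][d])
                updated[i][d] = (
                    (weights[i] * initial[i][d] + avg) / (weights[i] + 1.0)
                )
        centers = updated
    return centers


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
w_straight = tangent_weights(straight)
kept = smooth_centers(straight, w_straight)
spike = smooth_centers([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)], [1.0, 1.0, 1.0])
```

Frames on a straight stretch of the axis receive weight 1 and stay close to their initial estimates, while frames at sharp turns receive smaller weights and are smoothed more strongly.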

For generalized cylinders or cuboids which have no associated profiles (e.g., a teapot handle), we estimate an initial position and radius for the profile by analyzing the region of contact with the already constructed 3D body. The sweeping process is then performed in the same way to create those parts (see Fig. 2, 9). Note that before the sweeping process, we globally optimize the camera pose (FoV) with all estimated 3D profiles.

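| Method | cub@0.7 | cuf@0.7 | cyb@0.7 | cyf@0.7 | mAP@0.7 | cub@0.9 | cuf@0.9 | cyb@0.9 | cyf@0.9 | mAP@0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| FCIS | 68.19 | 61.24 | 50.33 | 37.51 | 54.32 | 33.04 | 23.71 | 10.51 | 9.09 | 19.09 |
| GeoNet w. FCIS | 68.61 | 61.47 | 56.75 | 37.23 | 56.01 | 48.64 | 36.88 | 17.14 | 10.30 | 28.24 |
| Mask R-CNN | 68.36 | 61.22 | 55.93 | 40.26 | 56.44 | 35.73 | 30.13 | 7.29 | 10.17 | 20.83 |
| GeoNet w. Mask R-CNN | 69.49 | 61.04 | 57.90 | 37.84 | 56.57 | 50.18 | 37.92 | 13.89 | 11.37 | 28.34 |

TABLE I: Evaluation of GeoNet with FCIS [42] and Mask R-CNN [31] at overlap thresholds of 0.7 and 0.9 respectively. Columns: cub = cuboid body, cuf = cuboid profile, cyb = cylinder body, cyf = cylinder profile.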

6 Experiments

Dataset. Besides the synthetic data described in Section 4.1, our real dataset contains multiple human-made primitive-shaped objects widely used in daily life, such as mugs, bottles, taps, cages, books, and fridges. There are 11657 real images and 10000 synthetic images (with 11590 generalized cuboids and 15008 generalized cylinders). The real dataset consists of unannotated images from ImageNet [20], annotated images from Xiao et al. [61], and images collected from the Internet. The real dataset is further separated into training images and testing images. We perform evaluations of all experiments on the testing set of real images.

Experiment of GeoNet. In order to make full use of the information from the original image as well as the outputs of the instance segmentation network, we test various combinations of the gray map, edge map, and contour map of the image, and the mask and probability map from the network. We restrict each combination to form a three-channel image, and duplicate channels when fewer than three maps are combined. For this experiment on the combination strategy, we adopt Mask R-CNN as the first stage of our GeoNet. We use mean intersection-over-union (mIoU) defined over image pixels as the evaluation metric, since the instances are the same during these experiments and we are focusing on boundary refinement. The results are shown in Table II; the combination of the edge map, the contour map, and the probability map outperforms the others.

| Method | cub | cuf | cyb | cyf | mean |
|---|---|---|---|---|---|
| Mask R-CNN | 77.56 | 80.51 | 68.68 | 75.74 | 75.62 |
| GeoNet w. | 87.51 | 85.50 | 77.89 | 82.87 | 83.44 |
| GeoNet w. | 89.34 | 85.84 | 79.01 | 83.19 | 84.34 |
| GeoNet w. | 90.12 | 85.92 | 78.28 | 83.22 | 84.39 |
| GeoNet w. | 89.67 | 86.03 | 79.78 | 83.82 | 84.83 |
| GeoNet w. | 90.88 | 86.84 | 79.51 | 84.36 | 85.40 |
| GeoNet w. | 91.80 | 86.24 | 85.27 | 85.37 | 87.17 |
| GeoNet w. | 92.47 | 86.81 | 84.72 | 87.02 | 87.76 |

TABLE II: Evaluation of GeoNet on different combinations of the gray map, edge map, and contour map of the image, and the mask and probability map from Mask R-CNN.

Since our GeoNet is built upon existing instance segmentation networks, we evaluate its effectiveness with two widely used networks, FCIS [42] and Mask R-CNN [31]. We attach the DCN to both FCIS and Mask R-CNN and evaluate the improvement in the segmentation results.

Accuracy is evaluated by mean average precision (mAP) [30] at mask-level intersection-over-union (IoU), with the overlap threshold set to 0.7 and 0.9 respectively. The results are shown in Table I. The DCN performs better at larger overlap thresholds. At threshold 0.9, the DCN improves the mAP by 9.15 and 7.51 points for FCIS and Mask R-CNN, respectively, which shows that the DCN is capable of refining the segmentation result when the base result is adequate (see also Fig. 4 for a visual comparison). For a fair comparison, we set the instance count to a fixed number for computing mAP. The chart in Fig. 8 shows the mAP at different overlap thresholds. The DCN works better when the base results from FCIS and Mask R-CNN agree with the ground truth. We only visualize the range [0.6, 0.9] since the DCN boosts the performance when the segmentation results are rather accurate w.r.t. the ground truth; below 0.6, we find that the DCN is much less helpful for refining the boundary.
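The mAP gains at threshold 0.9 can be checked directly against Table I, where each mAP entry is the mean of the four per-category values. The short script below copies those values from the table; the dictionary layout and function names are only for illustration.

```python
# Per-category AP at overlap threshold 0.9, copied from Table I
# (column order as in the table: cub, cuf, cyb, cyf).
AP_AT_09 = {
    "FCIS": [33.04, 23.71, 10.51, 9.09],
    "GeoNet w. FCIS": [48.64, 36.88, 17.14, 10.30],
    "Mask R-CNN": [35.73, 30.13, 7.29, 10.17],
    "GeoNet w. Mask R-CNN": [50.18, 37.92, 13.89, 11.37],
}


def mean_ap(values):
    """mAP as the plain average over the four categories."""
    return sum(values) / len(values)


def gain(base, refined):
    """mAP improvement of the refined network over its base network."""
    return mean_ap(AP_AT_09[refined]) - mean_ap(AP_AT_09[base])


fcis_gain = gain("FCIS", "GeoNet w. FCIS")
mask_gain = gain("Mask R-CNN", "GeoNet w. Mask R-CNN")
print(f"{fcis_gain:.2f} {mask_gain:.2f}")  # prints 9.15 7.51
```

The same computation at threshold 0.7 gives gains of under two points for both base networks, which is consistent with the observation that the DCN helps most at strict overlap thresholds.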

It is also noteworthy that our method is capable of segmenting and reconstructing objects from raw sketch inputs as shown in the last column of Fig. 9. This indicates that our DCN network is able to learn cues from the input contour images and edge maps for predicting the final mask.

Fig. 8: DCN improves the performance of segmentation results when the base segmentation results are more faithful to the ground truth.
Fig. 9: Representative results generated using our method. Our method is able to recover objects constituted by multiple semantic parts (e.g., teapots, lamps, water taps, etc.). The first row shows some of the editing results of the model created. The two examples (last column) show that our method can be directly applied to sketch input. We assume symmetry in texture maps, mirror the front texture to back, and finally stitch them together.
| Metric | Method | cub | cuf | cyb | cyf | mean |
|---|---|---|---|---|---|---|
| PP-IOU | Baseline | 80.97 | 76.66 | 78.75 | 59.85 | 74.06 |
| PP-IOU | BNF [7] | 83.40 | 77.86 | 79.02 | 58.46 | 74.69 |
| PP-IOU | Ours | 82.94 | 77.69 | 80.70 | 60.62 | 75.49 |
| PI-IOU | Baseline | 79.50 | 78.52 | 77.70 | 59.36 | 73.77 |
| PI-IOU | BNF [7] | 80.39 | 76.67 | 77.19 | 47.81 | 70.52 |
| PI-IOU | Ours | 81.47 | 79.94 | 78.51 | 59.42 | 74.84 |

TABLE III: Semantic segmentation comparison on our dataset. Note that BNF has a significant drop on cylinder profile because it may fail when the boundaries are not clear, while many cylinder profiles have no clear boundaries due to self occlusion in our case.

Comparisons to boundary refinement method. We compare GeoNet with Boundary Neural Fields (BNF) [7] on the semantic segmentation task on our test set containing 1614 cuboids and 1840 cylinders. We use the same evaluation metrics as [7]: pixel intersection-over-union averaged per pixel (PP-IOU) and pixel intersection-over-union averaged per image (PI-IOU). We also run the evaluation on the Mask R-CNN output as a baseline for the comparison.

PP-IOU is computed on a per-pixel basis. As a result, images that contain large object regions are given more importance. PI-IOU, on the other hand, gives equal weight to each image. As shown in Table III, BNF has lower accuracy on PI-IOU, which indicates that it is not able to segment small objects accurately. Our method outperforms both Mask R-CNN and BNF in mean accuracy on both metrics.
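The difference between the two averaging schemes can be made concrete with a small sketch. It reflects one reading of the metrics, in which PP-IOU pools intersections and unions over all images before dividing and PI-IOU averages the per-image ratios; the exact definition in [7] may differ in detail, and the function names are hypothetical.

```python
def iou_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def pp_iou(pairs):
    """Pool intersections and unions over all images, then divide once."""
    counts = [iou_counts(p, g) for p, g in pairs]
    return sum(i for i, _ in counts) / sum(u for _, u in counts)


def pi_iou(pairs):
    """Compute IoU per image, then average with equal weight per image."""
    counts = [iou_counts(p, g) for p, g in pairs]
    return sum(i / u for i, u in counts) / len(counts)


# One image with a large, well-segmented object and one with a small, missed one.
large_image = ([[1] * 9 + [0]], [[1] * 10])
small_image = ([[0, 1]], [[1, 0]])
pairs = [large_image, small_image]
pooled = pp_iou(pairs)
per_image = pi_iou(pairs)
```

Here the pooled score is 0.75 while the per-image score is 0.45: a method that fails on small objects is barely penalized by PP-IOU but clearly penalized by PI-IOU.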

Comparisons to cuboid detection and reconstruction methods. We use the SUN primitive dataset [61] to evaluate our method on cuboid reconstruction and compare with the methods of [61] and [23]. For cuboid detection, a bounding box is considered correct if its Intersection over Union (IOU) overlap is greater than 0.5. For keypoint localization, we use the re-projection accuracy (RA) from the baseline approach of Xiao et al. [61], as well as the Probability of Correct Keypoint (PCK) and Average Precision of Keypoint (APK) metrics used in the state-of-the-art method of Dwibedi et al. [23]. The latter two are commonly used in the human pose estimation task. We use the re-projected corners of the reconstructed cuboids as keypoints for this task. The comparison results are shown in Table IV. The numbers show that our approach performs better in both tasks.
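As a rough illustration of the detection criterion and of a PCK-style keypoint score, consider the sketch below. The 0.5 IOU threshold comes from the text; the PCK tolerance (a fraction `alpha` of the longer side of the ground-truth box) and all function names are assumptions made for this example and may differ from the protocol of [23].

```python
import math


def box_iou(a, b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, gt_box, threshold=0.5):
    """A detection counts as correct when its IOU exceeds the threshold."""
    return box_iou(pred_box, gt_box) > threshold


def pck(pred_pts, gt_pts, gt_box, alpha=0.1):
    """Fraction of keypoints within alpha * max(box width, box height).

    The tolerance rule is an assumed, commonly used convention.
    """
    tol = alpha * max(gt_box[2] - gt_box[0], gt_box[3] - gt_box[1])
    hits = 0
    for (px, py), (gx, gy) in zip(pred_pts, gt_pts):
        if math.hypot(px - gx, py - gy) <= tol:
            hits += 1
    return hits / len(gt_pts)


gt_box = (0.0, 0.0, 10.0, 10.0)
good_box = (0.0, 0.0, 10.0, 8.0)     # IOU = 80 / 100
bad_box = (5.0, 5.0, 15.0, 15.0)     # IOU = 25 / 175

gt_corners = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
pred_corners = [(0.5, 0.0), (10.0, 0.5), (10.0, 12.0), (0.0, 10.0)]
corner_pck = pck(pred_corners, gt_corners, gt_box)  # three of four corners hit
```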

Xiao et al.[61] 24.00 38.00 - -
Dwibedi et al.[23] 75.47 - 41.21 38.27
Ours 79.56 49.79 47.56 45.11
TABLE IV: Comparison of cuboid bounding box detection and keypoint localization. AP is the average precision for bounding box detection used in Xiao et al.[61].
Fig. 10: Comparison with point/voxel-based image reconstruction methods. (a) The input image. (b) The result of point-based framework [26]. (c) Our result.

Comparisons to point/voxel-based and semi-automatic reconstruction methods. We compare our method with two neural-network-based single-image reconstruction methods, Choy et al. [14] and Fan et al. [26]. We also compare with the Densely Connected 3D Autoencoder in Yi et al. [62] from the ShapeNet reconstruction challenge. All of them are able to generate a rough representation of the 3D object from a single photograph. Visual comparisons are shown in Fig. 10 and Fig. 11. We train their networks using the 2000 cup and 2000 lamp models collected from ShapeNet [9]. The models are generated with the code provided by the authors with default parameters. It can be seen that our results are cleaner and more accurate. In addition, our models can be directly textured and edited, while theirs cannot due to the lack of semantic part information.

Additionally, we compare our approach with the semi-automatic method 3-sweep [11] on 10 models (5 tables and 5 lamps) using our own implementation. The average reconstruction error for 3-sweep is 1.263%, whereas ours is 1.262%. Fig. 11 shows qualitative examples.

Fig. 11: Comparison with 3-sweep, 3D-R2N2, and the Densely Connected 3D Autoencoder in Yi et al. [62].

Timing. The training of the networks is performed on a server with NVIDIA GeForce GTX Titan X GPUs, an Intel i7-6700K CPU, and 64GB RAM. It takes three days to train the Mask R-CNN and one day to train the DCN on our dataset of 8183 images. It takes for GeoNet to segment one image and less than 1 second to reconstruct objects from the masks including stages of instance labeling, profile fitting, and 3D sweeping with multi-thread acceleration (the individual profile optimization can be performed in parallel).


Fig. 12: The failure cases of our approach.

Our method has a few limitations. As shown in Fig. 12, the network is not able to infer the regions of instances which are cluttered or under occlusion. Priors such as symmetry and physical validity can be enforced to alleviate the problem, as in [28, 57]. Next, the network may also give wrong class labels when the 2D projection of the shape is ambiguous; for example, the remote control in Fig. 12 is mistaken for a generalized cylinder. For complex objects, our method is currently not able to predict parts that deviate much from the training set or cannot be approximated by GC-GCs, such as the parts of the table shown in Fig. 12. This example also shows that our method may fail to predict correct alignments between parts. This is because, in our experiments, individual parts are constructed in parallel, whereas their semantic relations, such as being coplanar or co-axial, may need further rectification using methods such as [41]. In the future, it would be interesting to incorporate such semantics in the network design. Finally, our method cannot handle cases where the axis of the object does not lie on a spatial plane, so the object cannot have a spiral axis such as that of a spring. Inferring such a spatially varying curved trajectory requires additional assumptions [5]. We also leave this for future work.
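The planar-axis restriction can be made concrete with a small sketch that tests whether a sampled 3D axis lies on a single plane. This is only an illustration of the geometric condition, not part of the described pipeline; the function `is_planar_axis`, its tolerance, and the sample curves are hypothetical.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def norm(a):
    return math.sqrt(dot(a, a))


def is_planar_axis(points, tol=1e-6):
    """Return True if all 3D sample points lie on one plane (within tol)."""
    origin = points[0]
    first = None
    normal = None
    # Build a plane normal from the first two non-parallel offsets.
    for p in points[1:]:
        d = sub(p, origin)
        if norm(d) <= tol:
            continue
        if first is None:
            first = d
            continue
        n = cross(first, d)
        if norm(n) > tol * norm(first) * norm(d):
            length = norm(n)
            normal = (n[0] / length, n[1] / length, n[2] / length)
            break
    if normal is None:
        return True  # all samples are collinear, hence trivially planar
    # Every sample must have (near) zero distance to the plane.
    return all(abs(dot(sub(p, origin), normal)) <= tol for p in points)


ts = [0.5 * k for k in range(12)]
arc = [(math.cos(t), math.sin(t), 0.0) for t in ts]         # planar curved axis
spring = [(math.cos(t), math.sin(t), 0.1 * t) for t in ts]  # helical axis
```

Under these assumptions the arc passes the test while the helix does not, which matches the spring example mentioned above.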

7 Conclusion

This paper presents a fully automatic method for extracting 3D editable objects from a single photograph. Our framework uses Mask R-CNN as a basis to build a network which is capable of improving the instance segmentation results. In the subsequent modeling stage, we simultaneously optimize for the camera pose and the 3D object profile and estimate the 3D body shape by a sweeping algorithm.

Our framework is capable of reconstructing primitive objects constituted by generalized cuboids and generalized cylinders. Unlike previous 3D reconstruction methods which reconstruct either 3D point clouds, voxels, or surface meshes, our model recovers high-quality semantic parts and their relations, which naturally enables plausible edits of the image objects. Qualitative and quantitative results have demonstrated the effectiveness of our method. In the future, we plan to explore possibilities of building a more generic and end-to-end framework to reconstruct high-quality primitive 3D shapes from single images or videos.


The authors would like to thank all the reviewers for their insightful comments. This work was supported in part by the National Natural Science Foundation of China (No. 61502306, No. U1609215), the National Key Research & Development Program of China (2016YFB1001403), and the China Young 1000 Talents Program.


  • [1] A. Angelidis, M. Cani, G. Wyvill, and S. King (2006) Swirling-sweepers: constant-volume modeling. Graphical Models 68 (4), pp. 324–332. Cited by: §2.
  • [2] (2017) Apple arkit. Apple Inc.. Cited by: §1.
  • [3] R. Arora, R. H. Kazi, F. Anderson, T. Grossman, K. Singh, and G. Fitzmaurice (2017) Experimental evaluation of sketching on surfaces in vr. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, New York, NY, USA. External Links: Document Cited by: §1.
  • [4] M. Aubry, D. Maturana, A. A. Efros, B. C. Russell, and J. Sivic (2014) Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3762–3769. Cited by: §2.
  • [5] S. Bae, R. Balakrishnan, and K. Singh (2008) ILoveSketch: as-natural-as-possible sketching system for creating 3d curve models. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, UIST ’08, pp. 151–160. External Links: ISBN 978-1-59593-975-3 Cited by: §6.
  • [6] A. Bansal, B. Russell, and A. Gupta (2016) Marr revisited: 2d-3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5965–5974. Cited by: §2.
  • [7] G. Bertasius, J. Shi, and L. Torresani (2016) Semantic segmentation with boundary neural fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3602–3610. Cited by: §2, TABLE III, §6.
  • [8] R. A. Brooks, R. Greiner, and T. O. Binford (1979) The acronym model-based vision system. In Proceedings of the 6th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’79, San Francisco, CA, USA, pp. 105–113. External Links: ISBN 0-934613-47-8, Link Cited by: §1, §2.
  • [9] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: an information-rich 3d model repository. CoRR abs/1512.03012. External Links: Link, 1512.03012 Cited by: §4.1, §6.
  • [10] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2016) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR abs/1606.00915. External Links: Link, 1606.00915 Cited by: §1, Fig. 3, §3, §4.1.
  • [11] T. Chen, Z. Zhu, A. Shamir, S. Hu, and D. Cohen-Or (2013) 3-sweep: extracting editable objects from a single photo. ACM Transactions on Graphics (TOG) 32 (6), pp. 195. Cited by: §1, §1, §2, §4.1, §5.1, §5.4, §5.4, §6.
  • [12] X. Chen, J. Tang, and C. Li (2017) Progressive 3d shape abstraction via hierarchical csg tree. In Second International Workshop on Pattern Recognition, Vol. 10443, pp. 1044315. Cited by: §2.
  • [13] M. Cheng (2009) Curve structure extraction for cartoon images. In Proceedings of the 5th Joint Conference on Harmonious Human Machine Environment, pp. 13–25. Cited by: §4.1.
  • [14] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, pp. 628–644. Cited by: §1, §2, §6.
  • [15] A. Criminisi, I. Reid, and A. Zisserman (2000-11) Single view metrology. Int. J. Comput. Vision 40 (2), pp. 123–148. External Links: ISSN 0920-5691, Link, Document Cited by: §2.
  • [16] J. Dai, K. He, Y. Li, S. Ren, and J. Sun (2016) Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pp. 534–549. Cited by: §2.
  • [17] J. Dai, K. He, and J. Sun (2016) Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150–3158. Cited by: §2.
  • [18] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017-10) Deformable convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, Fig. 3, §3, §4.1.
  • [19] P. E. Debevec, C. J. Taylor, and J. Malik (1996) Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’96, New York, NY, USA, pp. 11–20. External Links: ISBN 0-89791-746-4, Link, Document Cited by: §1, §2.
  • [20] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §6.
  • [21] E. R. Dougherty (1992) An introduction to morphological image processing. Spie Optical Engineering tt9. Cited by: §5.3.
  • [22] J. Dumas, J. Hergel, and S. Lefebvre (2014-07) Bridging the gap: automated steady scaffoldings for 3d printing. ACM Trans. Graph. 33 (4), pp. 98:1–98:10. External Links: ISSN 0730-0301, Link, Document Cited by: §1.
  • [23] D. Dwibedi, T. Malisiewicz, V. Badrinarayanan, and A. Rabinovich (2016) Deep cuboid detection: beyond 2d bounding boxes. CoRR abs/1611.10010. External Links: Link, 1611.10010 Cited by: §2, TABLE IV, §6.
  • [24] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, Vol. 3, pp. 2366–2374. Cited by: §2.
  • [25] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pp. 2366–2374. Cited by: §2.
  • [26] H. Fan, H. Su, and L. J. Guibas (2017-07) A point set generation network for 3d object reconstruction from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Fig. 10, §6.
  • [27] A. R. J. François and G. G. Medioni (2002) Reconstructing mirror symmetric scenes from a single view using 2-view stereo geometry. In Proceedings of the 16 th International Conference on Pattern Recognition (ICPR’02) Volume 4 - Volume 4, ICPR ’02, Washington, DC, USA, pp. 40012–. External Links: ISBN 0-7695-1695-X, Link Cited by: §2.
  • [28] R. Guo and D. Hoiem (2013) Support surface prediction in indoor scenes. In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV ’13, Washington, DC, USA, pp. 2144–2151. External Links: ISBN 978-1-4799-2840-8, Link, Document Cited by: §6.
  • [29] A. Gupta, A. A. Efros, and M. Hebert (2010) Blocks world revisited: image understanding using qualitative geometry and mechanics. In European Conference on Computer Vision(ECCV), Cited by: §1.
  • [30] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik (2014) Simultaneous detection and segmentation. In European Conference on Computer Vision, pp. 297–312. Cited by: §6.
  • [31] K. He, G. Gkioxari, P. Dollar, and R. Girshick (2017-10) Mask r-cnn. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §4.1, TABLE I, §6.
  • [32] M. Hejrati and D. Ramanan (2016) Categorizing cubes: revisiting pose normalization. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, pp. 1–9. Cited by: §2.
  • [33] D. Hoiem, A. A. Efros, and M. Hebert (2005) Geometric context from a single image. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, Vol. 1, pp. 654–661. Cited by: §2.
  • [34] W. Hong, A. Y. Yang, K. Huang, and Y. Ma (2004-12-01) On symmetry and multiple-view geometry: structure, pose, and calibration from a single image. International Journal of Computer Vision 60 (3), pp. 241–265. External Links: ISSN 1573-1405, Document, Link Cited by: §2.
  • [35] P. Hu, H. Cai, and F. Bu (2014) D-sweep: using profile snapping for 3d object extraction from single image. In International Symposium on Smart Graphics, pp. 39–50. Cited by: §2.
  • [36] Q. Huang, H. Wang, and V. Koltun (2015-07) Single-view reconstruction via joint analysis of image and shape collections. ACM Trans. Graph. 34 (4), pp. 87:1–87:10. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [37] N. Jiang, P. Tan, and L. Cheong (2009-12) Symmetric architecture modeling with a single image. ACM Trans. Graph. 28 (5), pp. 113:1–113:8. External Links: ISSN 0730-0301, Link, Document Cited by: §1, §2.
  • [38] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh (2014) 3D object manipulation in a single photograph using stock 3d models. ACM Transactions on Graphics (TOG) 33 (4), pp. 127. Cited by: §1, §2.
  • [39] L. Lam, S. Lee, and C. Y. Suen (1992) Thinning methodologies-a comprehensive survey. IEEE Transactions on pattern analysis and machine intelligence 14 (9), pp. 869–885. Cited by: §5.3.
  • [40] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324. Cited by: §5.3.
  • [41] Y. Li, X. Wu, Y. Chrysanthou, A. Sharf, D. Cohen-Or, and N. J. Mitra (2011-07) GlobFit: consistently fitting primitives by discovering global relations. ACM Trans. Graph. 30 (4), pp. 52:1–52:12. External Links: ISSN 0730-0301, Link, Document Cited by: §6.
  • [42] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei (2017-07) Fully convolutional instance-aware semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, TABLE I, §6.
  • [43] G. Lin, A. Milan, C. Shen, and I. Reid (2017-07) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [44] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §2.
  • [45] A. S. Montero and J. Lang (2012) Skeleton pruning by contour approximation and the integer medial axis transform. Computers & Graphics 36 (5), pp. 477 – 487. Note: Shape Modeling International (SMI) Conference 2012 External Links: ISSN 0097-8493, Document, Link Cited by: Fig. 7, §5.3.
  • [46] A. Mousavian, H. Pirsiavash, and J. Košecká (2016) Joint semantic segmentation and depth estimation with deep convolutional networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pp. 611–619. Cited by: §2.
  • [47] P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár (2016) Learning to refine object segments. In European Conference on Computer Vision, pp. 75–91. Cited by: §2.
  • [48] R. Prévost, E. Whiting, S. Lefebvre, and O. Sorkine-Hornung (2013-07) Make it stand: balancing shapes for 3d fabrication. ACM Trans. Graph. 32 (4), pp. 81:1–81:10. External Links: ISSN 0730-0301, Link, Document Cited by: §1.
  • [49] I. D. Reid and A. Zisserman (1996) Goal-directed video metrology. In Proceedings of the 4th European Conference on Computer Vision - Volume II, ECCV ’96, London, UK, pp. 647–658. External Links: ISBN 3-540-61123-1, Link Cited by: §2.
  • [50] K. Rematas, C. Nguyen, T. Ritschel, M. Fritz, and T. Tuytelaars (2016) Novel views of objects from a single image. arXiv preprint arXiv:1602.00328. Cited by: §2.
  • [51] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.
  • [52] A. A. Requicha and H. B. Voelcker (1982) Solid modeling: a historical summary and contemporary assessment. IEEE Computer Graphics and Applications (2), pp. 9–24. Cited by: §2.
  • [53] P. Sala and S. Dickinson (2010) Contour grouping and abstraction using simple part models. Computer Vision–ECCV 2010, pp. 603–616. Cited by: §2.
  • [54] P. Sala and S. Dickinson (2015) 3-d volumetric shape abstraction from a single 2-d image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1–9. Cited by: §2.
  • [55] A. Saxena, S. H. Chung, and A. Y. Ng (2006) Learning depth from single monocular images. In Advances in neural information processing systems, pp. 1161–1168. Cited by: §2.
  • [56] A. Saxena, M. Sun, and A. Y. Ng (2008) Make3D: depth perception from a single still image. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI’08, pp. 1571–1576. External Links: ISBN 978-1-57735-368-3, Link Cited by: §2.
  • [57] T. Shao, A. Monszpart, Y. Zheng, B. Koo, W. Xu, K. Zhou, and N. J. Mitra (2014-11) Imagining the unseen: stability-based cuboid arrangements for scene understanding. ACM Trans. Graph. 33 (6), pp. 209:1–209:11. External Links: ISSN 0730-0301 Cited by: §6.
  • [58] Y. Shiroma, Y. Kakazu, and N. Okino (1991) A generalized sweeping method for sgc modeling. In Proceedings of the first ACM symposium on Solid modeling foundations and CAD/CAM applications, pp. 149–157. Cited by: §2.
  • [59] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik (2017) Learning shape abstractions by assembling volumetric primitives. In Proc. CVPR, Vol. 2. Cited by: §2.
  • [60] M. Wilczkowiak, P. Sturm, and E. Boyer (2005-02) Using geometric constraints through parallelepipeds for calibration and 3d modeling. IEEE Trans. Pattern Anal. Mach. Intell. 27 (2), pp. 194–207. External Links: ISSN 0162-8828, Link, Document Cited by: §2.
  • [61] J. Xiao, B. Russell, and A. Torralba (2012) Localizing 3d cuboids in single-view images. In Advances in neural information processing systems, pp. 746–754. Cited by: §2, TABLE IV, §6, §6.
  • [62] L. Yi, H. Su, L. Shao, M. Savva, H. Huang, Y. Zhou, B. Graham, M. Engelcke, R. Klokov, V. Lempitsky, et al. (2017) Large-scale 3d shape reconstruction and segmentation from shapenet core55. arXiv preprint arXiv:1710.06104. Cited by: Fig. 11, §6.
  • [63] Y. Zhang, W. Xu, Y. Tong, and K. Zhou (2015-11) Online structure analysis for real-time indoor scene reconstruction. ACM Trans. Graph. 34 (5), pp. 159:1–159:13. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [64] Y. Zheng, X. Chen, M. Cheng, K. Zhou, S. Hu, and N. J. Mitra (2012-07) Interactive images: cuboid proxies for smart image manipulation. ACM Trans. Graph. 31 (4), pp. 99:1–99:11. External Links: ISSN 0730-0301 Cited by: §1, §1.
  • [65] S. Zhou, H. Fu, L. Liu, D. Cohen-Or, and X. Han (2010-07) Parametric reshaping of human bodies in images. ACM Trans. Graph. 29 (4), pp. 126:1–126:10. External Links: ISSN 0730-0301, Link, Document Cited by: §1.