OASIS: A Large-Scale Dataset for Single Image 3D in the Wild

07/26/2020 ∙ by Weifeng Chen, et al. ∙ University of Michigan, Princeton University

Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research. Project site: https://pvl.cs.princeton.edu/OASIS.


1 Introduction

Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single RGB image. It is a core computer vision problem of critical importance. 3D scene interpretation is a foundation for understanding events and planning actions. 3D shape representation is crucial for making object recognition robust against changes in viewpoint, pose, and illumination. 3D from a single image is especially important due to the ubiquity of monocular images and videos. Even with a stereo camera, with which 3D can be reconstructed by triangulating matching pixels across views, monocular 3D cues remain necessary in textureless or specular regions where pixels are difficult to match reliably.

Single-image 3D is challenging. Unlike multiview 3D, it is ill-posed and resists tractable analytical formulation except in the most simplistic settings. As a result, data-driven approaches have shown greater promise, as evidenced by a plethora of works that train deep networks to map an RGB image to depth, surface normals, or 3D models [12, 19, 39, 16, 46, 27]. However, despite substantial progress, the best systems today still struggle with handling scenes “in the wild”—arbitrary scenes that a camera may encounter in the real world. As prior work has shown [5], state-of-the-art systems often give erroneous results when presented with unfamiliar scenes with novel shapes or layouts.

We hypothesize that a major obstacle to single-image 3D is data. Unlike object recognition, whose progress has been propelled by datasets like ImageNet [10] covering diverse object categories with high-quality labels, single-image 3D has lacked an ImageNet equivalent that covers diverse scenes with high-quality 3D ground truth. Existing datasets are restricted to either a narrow range of scenes [34, 9] or simplistic annotations such as sparse relative depth pairs or surface normals [5, 7].

In this paper we introduce Open Annotations of Single-Image Surfaces (OASIS), a large-scale dataset for single-image 3D in the wild. It consists of human annotations that enable pixel-wise reconstruction of 3D surfaces for 140,000 randomly sampled Internet images. The teaser figure shows human annotations for example images along with the reconstructed surfaces.

A key feature of OASIS is its rich annotations of human 3D perception. Six types of 3D properties are annotated for each image: occlusion boundary (depth discontinuity), fold boundary (normal discontinuity), surface normal, relative depth, relative normal (orthogonal, parallel, or neither), and planarity (planar or not). These annotations together enable a reconstruction of pixelwise depth.

To construct OASIS, we created a UI for interactive 3D annotation. The UI allows a crowd worker to annotate the aforementioned 3D properties. It also provides a live, rotatable rendering of the resulting 3D surface reconstruction to help the crowd worker fine-tune their annotations.

It is worth noting that 140,000 images may not seem very large compared to millions of images in datasets like ImageNet. But the number of images can be a misleading metric. For OASIS, annotating one image takes 305 seconds on average. In contrast, verifying a single image-level label takes no more than a few seconds. Thus in terms of the total amount of human time, OASIS is already comparable to millions of image-level labels.
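As a rough back-of-the-envelope comparison (assuming about 3 seconds per image-level label, an illustrative figure rather than a measured one):

140{,}000 \times 305\,\mathrm{s} \approx 4.3\times 10^{7}\,\mathrm{s} \approx 11{,}900\ \text{hours}, \qquad \frac{4.3\times 10^{7}\,\mathrm{s}}{3\,\mathrm{s/label}} \approx 1.4\times 10^{7}\ \text{image-level labels}.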

OASIS opens up new research opportunities on a wide range of single-image 3D tasks—depth estimation, surface normal estimation, boundary detection, and instance segmentation of planes—by providing in-the-wild ground truth either for the first time, or at a much larger scale than prior work. For depth estimation and surface normals, pixelwise ground truth is available for images in the wild for the first time—prior data in the wild provide only sparse annotations [5, 6]. For the detection of occlusion boundaries and folds, OASIS provides annotations at a scale 700 times larger than prior work—existing datasets [36, 17] have annotations for only about 200 images. For instance segmentation of planes, ground truth annotation is available for images in the wild for the first time.

To facilitate future research, we provide extensive statistics of the annotations in OASIS, and train and evaluate leading deep learning models on a variety of single-image 3D tasks. Experiments show that there is substantial room for performance improvement, pointing to ample research opportunities for designing new learning algorithms for single-image 3D. We expect OASIS to serve as a useful resource for 3D vision research.

2 Related Work

3D Ground Truth from Depth Sensors and Computer Graphics Major 3D datasets are either collected by sensors [34, 14, 32, 33, 9] or synthesized with computer graphics [4, 26, 35, 25, 29]. But due to the limitations of depth sensors and the lack of varied 3D assets for rendering, the diversity of scenes is quite limited. For example, sensor-based ground truth is mostly available for indoor or driving scenes [34, 9, 26, 35, 14].

3D Ground Truth from Multiview Reconstruction Single-image 3D training data can also be obtained by applying classical Structure-from-Motion (SfM) algorithms to Internet images or videos [21, 41, 6]. However, classical SfM algorithms have many well-known failure modes, including scenes with moving objects and scenes with specular or textureless surfaces. In contrast, humans can annotate all types of scenes.

3D Ground Truth from Human Annotations Our work is connected to many previous works that crowdsource 3D annotations of Internet images. For example, prior work has crowdsourced annotations of relative depth [5] and surface normals [7] at sparse locations of an image (a single pair of relative depth and a single normal per image). Prior work has also aligned pre-existing 3D models to images [42, 37]. However, this approach has a drawback that not every shape can be perfectly aligned with available 3D models, whereas our approach can handle arbitrary geometry.

Our work is related to that of Karsch et al. [17], who reconstruct pixelwise depth from human annotations of boundaries, with the aid of a shape-from-shading algorithm [2]. Our approach is different in that we annotate not only boundaries but also surface normals, planarity, and relative normals, and our reconstruction method does not rely on automatic shape from shading, which is still unsolved and has many failure modes.

One of our inspirations is LabelMe3D [31], which annotated 3D planes attached to a common ground plane. Another is OpenSurfaces [3], which also annotated 3D planes. We differ from LabelMe3D and OpenSurfaces in that our annotations recover not only planes but also curved surfaces. Our dataset is also much larger than LabelMe3D and OpenSurfaces in terms of the number of images annotated, and more diverse, because LabelMe3D and OpenSurfaces include only city or indoor scenes.

Figure 1: (a) Our UI allows a user to annotate rich 3D properties and includes a preview window for interactive 3D visualization. (b) An illustration of the depth scaling procedure in our backend.

3 Crowdsourcing Human Annotations

We use random keywords to query and download Creative Commons Flickr images with a known focal length (extracted from the EXIF data). Each image is presented to a crowd worker for annotation through a custom UI as shown in Fig. 1 (a). The worker is asked to mask out a region that she wishes to work on with a polygon of her choice, with the requirement that the polygon covers a pair of randomly pre-selected locations. She then works on the annotations and iteratively monitors the generated mesh (detailed in Sec 4) from an interactive preview window (Fig. 1 (a)).

Occlusion Boundary and Fold An occlusion boundary marks locations of depth discontinuity, where the surface on one side is physically disconnected from the surface on the other side. When drawing it, the worker also specifies which side of the occlusion is closer to the viewer, i.e. the depth order of the surfaces on the two sides of the occlusion. Workers need to distinguish between two kinds of occlusion boundaries. A smooth occlusion (green in Fig 1 (a)) is one where the closer surface smoothly curves away from the viewer; there the surface normals are orthogonal to the occlusion line, parallel to the image plane, and point toward the farther side. A sharp occlusion (red in Fig 1 (a)) has none of these constraints. A fold, on the other hand, marks locations of surface normal discontinuity, where the surface geometry changes abruptly but the surfaces on the two sides of the fold are still physically attached to each other (orange in Fig 1 (a)).

Occlusion boundaries segment a region into subregions, each of which is a continuous surface whose geometry can change abruptly but remains physically connected in 3D. Folds further segment a continuous surface into smooth surfaces whose geometry varies smoothly without discontinuity of surface normals.

Surface Normal The worker first specifies whether a smooth surface is planar or curved. She annotates one normal for each planar surface, which indicates the orientation of the plane. For each curved surface, she annotates normals at as many locations as she sees fit. A normal is visualized as a blue arrow originating from a green grid (see the appendix), rendered in perspective projection according to the known focal length. Such visualization helps workers perceive the normal in 3D [7]. To rotate and adjust the normal, the worker only needs to drag the mouse.

Relative Normal Finally, to annotate normals with higher accuracy, the worker specifies the relative normal between each pair of planar surfaces, choosing between Neither, Parallel, and Orthogonal. Surface pairs that are parallel or orthogonal to each other then have their normals adjusted automatically to reflect the relation.

Interactive Previewing While annotating, the worker can click a button to see a visualization of the 3D shape constructed from the current annotations (detailed later in Sec. 4). She can rotate or zoom to inspect the shape from different angles in a preview window (Fig 1 (a)), and keeps refining the annotations until she is satisfied with the shape.

Quality Control Completing our 3D annotation task requires knowledge of relevant concepts. To ensure good quality of the dataset, we require each worker to complete a training course to learn concepts such as occlusions, folds and normals, and usage of the UI. She then needs to pass a qualification quiz before being allowed to work on our annotation task. Besides explicitly selecting qualified workers, we also set up a separate quality verification task on each collected mesh. In this task, a worker inspects the mesh to judge if it reflects the image well. Only meshes deemed high quality are accepted.

To improve our annotation throughput, we collected annotations from three sources: Amazon Mechanical Turk, which accounts for 11% of all annotations, and two data annotation companies that employ full-time annotators, who supplied the rest of the annotations.

Figure 2: Statistics of OASIS. (a) The distribution of focal length (unit: relative length to the image width). (b) The distribution of surface normals. (c) Boundary: the ratio of regions containing only occlusion, only fold, and both. Curvature: the distribution of regions containing only planes, only curved surfaces, and both. (d) The frequency distribution of each surface type in a region.

4 From Human Annotations to Dense Depth

Because humans do not directly annotate the depth value of each pixel, we need to convert the human annotations to pixelwise depth in order to visualize the 3D surface.

Generating Dense Surface Normals We first describe how we generate dense surface normals from annotations. We assume the normals to be smoothly varying in the spatial domain, except across folds or occlusion boundaries where the normals change abruptly. Therefore, our system propagates the known normals to the unknown ones by requiring the final normals to be smooth overall, but stops the propagation at fold and occlusion lines.

More concretely, let n_p denote the normal at pixel p on a normal map N, and let F and B denote the sets of pixels belonging to folds and occlusion boundaries, respectively. We have a set of known normals {\bar{n}_p : p \in A} from (i) the surface normal annotations by workers and (ii) the pre-computed normals along the smooth occlusion boundaries mentioned in Sec. 3. Each pixel p has four neighbors \mathcal{N}(p); neighbor links that cross F or B are removed, so that the propagation stops at fold and occlusion lines. If p lies on an occlusion boundary, only its neighbors on the closer side of that boundary are kept; if p lies on a fold line, only its neighbors on one fixed, randomly chosen side of the line are kept. We solve for the optimal normals by minimizing the following least-squares objective using LU factorization, and then normalize each normal to unit norm:

\min_{\{n_p\}} \; \sum_{p} \sum_{q \in \mathcal{N}(p)} \left\| n_p - n_q \right\|^2    (1)
s.t. \; n_p = \bar{n}_p, \quad \forall p \in A    (2)
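To make the propagation step concrete, the sketch below solves this sparse least-squares system with SciPy. The 4-connected graph construction, the penalty-based handling of the constraint in Eq. (2), and all names are illustrative assumptions, not the authors' implementation.

```python
# A sketch of normal propagation over a smooth surface, assuming a 4-connected
# pixel graph in which neighbor links crossing folds or occlusion boundaries
# have already been removed. Names and data layout are illustrative.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def propagate_normals(known, mask_known, neighbor_pairs, h, w):
    """known: (h*w, 3) annotated normals (rows valid where mask_known is True).
    neighbor_pairs: iterable of (i, j) flat pixel indices that stay connected."""
    n = h * w
    rows, cols, vals = [], [], []
    for i, j in neighbor_pairs:
        # Smoothness term ||n_i - n_j||^2 contributes a graph-Laplacian block.
        rows += [i, i, j, j]
        cols += [i, j, j, i]
        vals += [1.0, -1.0, 1.0, -1.0]
    L = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    big = 1e6  # soft version of the hard constraint in Eq. (2)
    C = sp.diags(np.where(mask_known, big, 0.0))
    A = (L + C + 1e-8 * sp.eye(n)).tocsc()  # tiny ridge keeps A non-singular
    b = np.where(mask_known[:, None], big * known, 0.0)
    normals = splu(A).solve(b)              # LU factorization, as in the paper
    normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-12
    return normals.reshape(h, w, 3)
```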

Generating Dense Depth Our depth generation pipeline consists of two stages. First, from the surface normals and focal length, we recover the depth of each continuous surface through integration [28] (we snap the z component of the surface normals to be no smaller than 0.3 so that the generated depth does not stretch to a huge distance). Next, we adjust the depth order among these surfaces by performing surface-wise depth scaling (Fig. 1 (b)), i.e. each surface has its own scale factor.

Our design is motivated by the following fact: in single-view depth recovery, depth within a continuous surface can be recovered only up to an ambiguous scale; thus different surfaces may end up with different scales, leading to incorrect depth ordering between surfaces. But workers have already decided which side of each occlusion boundary is closer to the viewer. Based on this knowledge, we correct the depth order by scaling the depth of each surface.

We now describe the details. Let S denote the set of all continuous surfaces. From integration, we obtain the depth D_s of each s \in S. We then solve for a scaling factor x_s for each s \in S, which is used to scale its depth to x_s D_s. Let B denote the set of occlusion boundaries. Along B, we densely sample a set of point pairs P = {(p, q)}, where p lies on the closer side of one of the occlusion boundaries and q on the farther side. Let s(p) denote the continuous surface a pixel p lies on and D(p) its integrated depth. The set of optimal scaling factors {x_s} is solved for as follows:

\min_{\{x_s\}} \; \sum_{s \in S} x_s^2    (3)
s.t. \; x_{s(q)} D(q) - x_{s(p)} D(p) \ge \delta, \quad \forall (p, q) \in P    (4)
x_s \ge \epsilon, \quad \forall s \in S    (5)

where \delta is a minimum separation between surfaces and \epsilon is a minimum scale factor. Eq. (4) requires the surfaces to satisfy the depth order constraints specified by the point pairs after scaling, while Eq. (3) keeps the values of x_s from increasing indefinitely. After correcting the depth order, the final depth of surface s is x_s D_s. We normalize and reproject the final depth to 3D as point clouds, and generate 3D meshes for visualization.
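The scaling step of Eq. (3)-(5) is a small constrained optimization; a minimal sketch using SciPy's SLSQP solver is shown below. The data layout and parameter values are assumptions for illustration only, not the authors' implementation.

```python
# Sketch of surface-wise depth scaling (Eq. 3-5). `pairs` holds tuples
# (d_close, s_close, d_far, s_far): integrated depths and surface ids of the
# closer/farther point of each sampled occlusion-boundary pair.
import numpy as np
from scipy.optimize import minimize

def solve_surface_scales(pairs, num_surfaces, delta=0.05, eps=0.1):
    """Solve for one scale factor per surface so that scaled depths respect
    the annotated depth ordering across occlusion boundaries."""
    x0 = np.ones(num_surfaces)  # start from unit scales

    def objective(x):
        # Eq. (3): keep the scale factors from growing without bound.
        return np.sum(x ** 2)

    constraints = []
    for d_close, s_close, d_far, s_far in pairs:
        # Eq. (4): farther point must stay at least `delta` behind the closer one.
        constraints.append({
            "type": "ineq",
            "fun": lambda x, dc=d_close, sc=s_close, df=d_far, sf=s_far:
                x[sf] * df - x[sc] * dc - delta,
        })

    bounds = [(eps, None)] * num_surfaces  # Eq. (5): minimum scale factor
    res = minimize(objective, x0, bounds=bounds, constraints=constraints,
                   method="SLSQP")
    return res.x
```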

NYU Depth [34] (depth mean: 2.471 m, depth std: 0.754 m) Tanks & Temples [18] (depth mean: 4.309 m, depth std: 3.059 m)
Human-Human Human-Sensor CNN-Sensor Human-Human Human-Sensor CNN-Sensor
Depth (EDist) 0.078m 0.095m 0.097m [19] 0.194m 0.213m 0.402m [19]
Normals (MAE, degrees) 13.13 17.82 14.19 [47] 14.33 20.29 29.11 [47]
Post-Rotation Depth (EDist) 0.037m 0.048m - 0.082m 0.080m -
Depth Order (WKDR) 5.68% 8.67% 11.90% 9.28% 10.80% 32.13%
Table 1: Depth and normal difference between different humans (Human-Human), between human and depth sensor (Human-Sensor), and between ConvNet and depth sensor (CNN-Sensor). The results are averaged over all human pairs.
Figure 3: Humans estimate shape correctly but the absolute orientation can be slightly off, causing large depth error after perspective back-projection into 3D. Depth error drops significantly (from 0.07m to 0.01m) after a global rotation of normals.

5 Dataset Statistics

Statistics of Surfaces Fig. 2 plots various statistics of the 3D surfaces. Fig. 2 (a) plots the distribution of focal length. We see that focal lengths in OASIS vary greatly: they range from wide-angle to telephoto, and are mostly 1 to 10 times the width of the image. Fig. 2 (b) visualizes the distribution of surface normals. We see that a substantial proportion of normals point directly towards the camera, suggesting that fronto-parallel surfaces occur frequently in natural scenes. Fig. 2 (c) presents region-wise statistics. We see that most regions (90%+) contain occlusion boundaries and close to half contain both occlusion boundaries and folds (top). We also see that most regions (70%+) contain at least one curved surface (bottom). Fig. 2 (d) shows the histogram of the number of each kind of surface in an annotated region. We see that most regions consist of multiple disconnected pieces and have non-trivial geometry in terms of continuity and smoothness.

Annotation Quality We study how accurate and consistent the annotations are. To this end, we randomly sample 50 images from NYU Depth [34] and 70 images from Tanks and Temples [18], and have 20 workers annotate each image. Tab. 1 reports the depth and normal difference between human annotations, between human annotations and sensor ground truth, and between predictions from state-of-the-art ConvNets and sensor ground truth. Depth difference is measured by the mean Euclidean distance (EDist) between corresponding points in two point clouds, after aligning one to the other through a global translation and scaling (surface-wise scaling for human annotations and CNN predictions). Normal difference is measured in mean angular error (MAE). We see in Tab. 1 that human annotations are highly consistent with each other and with sensor ground truth, and are better than ConvNet predictions, especially when the ConvNet is not trained and tested on the same dataset.

We observe that humans often estimate the shape correctly, but the overall orientation can be slightly off, causing a large depth error against sensor ground truth (Fig. 3). This error can be particularly pronounced for planes close to orthogonal to the image plane. Thus we also compute the error after a rotational alignment with the sensor ground truth—we globally rotate the human annotated normals (up to 30 degrees) before generating the shape. After accounting for this global rotation of normals, human-sensor depth difference is further reduced by 47.96% (relative) for NYU and 62.44% (relative) for Tanks and Temples; a significant drop of normal error is also observed in human-human difference.

We also measure the qualitative aspect of human annotations by evaluating the WKDR metric [5], i.e. the percentage of point pairs with inconsistent depth ordering between the query and reference depth. Depth pairs are sampled in the same way as [5]. Tab. 1 again shows that human annotations are qualitatively accurate and highly consistent with each other.

It is worth noting that metric 3D accuracy is not required for many tasks such as navigation, object manipulation, and semantic scene understanding—humans do well without perfect metric accuracy. Therefore human perception of depth alone can be the gold standard for training and evaluating vision systems, regardless of its metric accuracy. As a result, our dataset would still be valuable even if it were less metrically accurate than it is currently.

Figure 4: Qualitative outputs of the four tasks from representative models. More details and examples are in the appendix.

6 Experiments

To facilitate future research, we use OASIS to train and evaluate leading deep learning models on a suite of single-image 3D tasks, including depth estimation, normal estimation, boundary detection, and plane segmentation. Qualitative results are shown in Fig. 4. A train-val-test split of 110K, 10K, and 20K images is used for all tasks.

For each task we estimate human performance to provide an upper bound that accounts for the variance of human annotations. We randomly sample 100 images from the test set and have each image re-annotated by 8 crowd workers. That is, each image now has “predictions” from 8 different humans. We evaluate each prediction and report the mean as the performance expected of an average human.

6.1 Depth Estimation

We first study single-view depth estimation. OASIS provides pixelwise metric depth in the wild. But as discussed in Sec 4, due to inherent single-image ambiguity, depth in OASIS is independently recovered within each continuous surface, after which the depth undergoes a surface-wise scaling to correct the depth order. The recovered depth is only accurate up to scaling within each continuous surface and ordering between continuous surfaces.

Given this, OASIS provides metric depth ground truth that is accurate within each continuous surface up to a scaling factor. This new form of depth necessitates new evaluation metrics and training losses.

Depth Metric The images in OASIS have varied focal lengths. This means that to evaluate depth estimation, we cannot simply use pixelwise difference between a predicted depth map and the ground truth map. This is because the predicted 3D shape depends greatly on the focal length—given the same depth values, decreasing the focal length will flatten the shape along the depth dimension. In practice, the focal length is often unknown for a test image. Thus, we require a depth estimator to predict a focal length along with depth. Because the predicted focal length may differ from the ground truth focal length, pixelwise depth difference is a poor indicator of how close the predicted 3D shape is to the ground truth.

A more reasonable metric is the Euclidean distance between the predicted and ground-truth 3D point clouds. Concretely, we backproject the predicted depth to a 3D point cloud P using the predicted focal length, and the ground truth depth to P* using the ground truth focal length. We then calculate the distance between P and P*.

The metric also needs to be invariant to surface-wise depth scaling and translation. Therefore we introduce a surface-wise scaling factor a_s and a surface-wise translation b_s to align each predicted surface in P to the ground truth point cloud P* in a least-squares manner. The final metric, which we call Locally Scale-Invariant RMSE (LSIV_RMSE), is defined as:

\mathrm{LSIV\_RMSE}(P, P^{*}) = \min_{\{a_s, b_s\}} \sqrt{ \frac{1}{|P|} \sum_{p} \left\| a_{s(p)} P(p) + b_{s(p)} - P^{*}(p) \right\|^{2} }    (6)

where s(p) denotes the surface pixel p is on. The ground truth point cloud P* is normalized to a canonical scale by the standard deviation of its X coordinates. Under this metric, as long as P is accurate up to surface-wise scaling and translation, it will align perfectly with P* and achieve zero error.
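A minimal sketch of how LSIV_RMSE can be computed is given below, using a closed-form per-surface least-squares scale and translation; this is our reading of Eq. (6), not the authors' released evaluation code, and all names are illustrative.

```python
# Sketch of LSIV_RMSE: align each predicted surface to the ground truth with
# its own scalar scale and 3D translation, then report the RMSE over points.
import numpy as np

def lsiv_rmse(pred_pts, gt_pts, surface_ids):
    """pred_pts, gt_pts: (N, 3) point clouds; surface_ids: (N,) surface labels."""
    # Normalize the ground truth to a canonical scale using the std of its X coords.
    gt = gt_pts / (np.std(gt_pts[:, 0]) + 1e-12)
    sq_err = np.zeros(len(gt))
    for s in np.unique(surface_ids):
        idx = surface_ids == s
        p, g = pred_pts[idx], gt[idx]
        # Least-squares scale a and translation b: minimize ||a*p + b - g||^2.
        p_mean, g_mean = p.mean(axis=0), g.mean(axis=0)
        pc, gc = p - p_mean, g - g_mean
        a = (pc * gc).sum() / ((pc * pc).sum() + 1e-12)
        b = g_mean - a * p_mean
        sq_err[idx] = ((a * p + b - g) ** 2).sum(axis=1)
    return np.sqrt(sq_err.mean())
```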

Note that LSIV_RMSE ignores the ordering between two separate surfaces; it allows objects floating in the air to be arbitrarily scaled. This is typically not an issue because most scenes do not contain many floating objects. But we nonetheless also measure the correctness of depth ordering. We report WKDR [5], the percentage of point pairs that have incorrect depth order in the predicted depth. We evaluate on depth pairs sampled in the same way as [5], i.e. half are random pairs and half are from the same random horizontal lines.
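For completeness, a sketch of the WKDR check on sampled pairs follows; the ratio-based tolerance and all names are illustrative assumptions, and [5] defines the exact protocol.

```python
# Sketch of WKDR: the percentage of sampled point pairs whose predicted depth
# ordering disagrees with the ground-truth ordinal label.
import numpy as np

def wkdr(pred_depth, pairs, gt_labels, tol=0.02):
    """pred_depth: flat depth array; pairs: list of (i, j) indices;
    gt_labels: ground-truth relations in {'closer', 'farther', 'equal'}."""
    wrong = 0
    for (i, j), gt in zip(pairs, gt_labels):
        ratio = pred_depth[i] / (pred_depth[j] + 1e-12)
        if ratio > 1 + tol:
            pred = 'farther'   # first point predicted farther than the second
        elif ratio < 1 - tol:
            pred = 'closer'
        else:
            pred = 'equal'
        wrong += (pred != gt)
    return 100.0 * wrong / max(len(pairs), 1)
```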

Models We train and evaluate two leading depth estimation networks on OASIS: the Hourglass network [5] and ResNetD [41], a dense prediction network based on ResNet-50. Each network predicts a metric depth map and a focal length, which are used together to backproject pixels to 3D points; the points are compared against the ground truth to compute the LSIV_RMSE metric, which we also optimize as the loss function during training. Note that we do not directly supervise the predicted focal length.

We also evaluate leading pre-trained models that estimate single-image depth on OASIS, including FCRN [19] trained on ILSVRC [30] and NYU Depth [34], Hourglass [21] trained on MegaDepth [21], ResNetD [41] trained on a combination of datasets including ILSVRC [30], Depth in the Wild [5], ReDWeb [41] and YouTube3D [6]. For networks that do not produce a focal length, we use the validation set to find the best focal length that leads to the smallest LSIV_RMSE, and use this focal length for each test image. In addition, we also evaluate plane, a naive baseline that predicts a uniform depth map.

Method Training Data LSIV_RMSE WKDR
FCRN [19] ImageNet [30] + NYU [34] 0.67 (0.67) 39.95% (39.94%)
Hourglass [5, 21] MegaDepth [21] 0.67 (0.67) 38.37% (38.37%)
ResNetD [41, 6] ImageNet [30] + YouTube3D [6] + ReDWeb [41] + DIW [5] 0.66 (0.66) 34.01% (34.03%)
ResNetD [41] ImageNet [30] + OASIS 0.37 (0.37) 32.62% (32.04%)
ResNetD [41] OASIS 0.47 (0.47) 39.73% (38.79%)
Hourglass [5] OASIS 0.45 (0.47) 39.01% (39.64%)
Plane - 0.67 (0.67) 100.00% (100.00%)
Human (Approx) - 0.24 (0.24) 19.04% (19.33%)
Table 2: Depth estimation performance of different networks on OASIS (lower is better). For networks that do not produce a focal length, we use the best focal length leading to the smallest error. See Sec. A7: Crossed-out Numbers in Tables about the crossed-out numbers.

Tab. 2 reports the results. In terms of metric depth, we see that networks trained on OASIS perform the best. This is expected because they are trained to predict a focal length and to directly optimize the LSIV_RMSE metric. It is noteworthy that ImageNet pretraining provides a significant benefit even for this purely geometric task. Off-the-shelf models do not perform better than the naive baseline, probably because they were not trained on diverse enough scenes or were not trained to optimize metric depth error. In terms of relative depth, it is interesting that ResNetD trained on ImageNet and OASIS performs the best, even though the training loss does not enforce depth ordering. We also see that there is still a significant gap between human and machine performance. At the same time, the gap is not hopelessly large, indicating the effectiveness of a large training set.

OASIS
Method Training Data Angle Distance (Mean / Median) % Within (11.25° / 22.5° / 30°) Relative Normal (AUC)
Hourglass [7] OASIS 23.91 (23.24) 18.16 (18.08) 31.23 (31.44) 59.45 (59.79) 71.77 (72.25) 0.5913(0.5508) 0.5786 (0.5439)
Hourglass [7] SNOW [7] 31.35 (30.74) 26.97 (26.65) 13.98 (14.33) 40.20 (40.84) 56.03 (56.73) 0.5329(0.5329) 0.5016 (0.4714)
Hourglass [7] NYU [34] 35.32 (34.69) 29.21 (28.76) 14.23 (14.65) 37.72 (38.49) 51.31 (52.06) 0.5467(0.5415) 0.5132 (0.5064)
PBRS [47] NYU [34] 38.29 (38.09) 33.16 (33.00) 11.59 (11.94) 32.14 (32.58) 45.00 (45.29) 0.5669(0.5729) 0.5253 (0.5227)
Front_Facing - 31.79 (31.20) 24.80 (24.76) 27.52 (27.36) 46.61 (46.62) 56.80 (56.94) 0.5000 (0.5000) 0.5000 (0.5000)
Human (Approx) - 17.27 (17.43) 12.92 (13.08) 44.36 (43.89) 76.16 (75.94) 85.24 (84.72) 0.8826 (0.8870) 0.6514 (0.6439)

Table 3: Surface normal estimation on OASIS. See Sec. A7: Crossed-out Numbers in Tables about the crossed-out numbers.
DIODE [38] ETH3D [33]
Method Training Data Angle Distance (Mean) % Within (11.25° / 22.5° / 30°) Angle Distance (Mean) % Within (11.25° / 22.5° / 30°)
Hourglass [7] OASIS 34.21 (34.57) 14.45 (13.71) 36.98 (35.69) 51.36 (49.65) 33.00 (34.51) 26.25(23.52) 54.07 (52.04) 65.36 (62.73)
Hourglass [7] SNOW [7] 40.10 8.29 27.20 40.67 45.71 10.69 31.16 43.16
Hourglass [7] NYU [34] 42.23 10.97 29.76 41.35 41.84 21.94 44.05 53.81
PBRS [47] NYU [34] 42.59 9.96 29.08 40.72 39.91 18.68 44.76 56.08
Front_Facing - 47.76 5.62 18.70 28.05 58.97 11.84 23.75 30.19
Table 4: Cross-dataset generalization. See Sec. A7: Crossed-out Numbers in Tables about the crossed-out numbers.

6.2 Surface Normal Estimation

We now turn to single-view surface normal estimation. We evaluate on absolute normal, i.e. the pixel-wise predicted normal values, and relative normal, i.e. the parallel and orthogonal relation predicted between planar surfaces.

Absolute Normal Evaluation We use standard metrics proposed in prior work [40]: the mean and median of the angular error measured in degrees, and the percentage of pixels whose angular error is within t degrees, for t = 11.25, 22.5, and 30.
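These metrics are straightforward to compute from per-pixel angular errors; a minimal sketch (assuming unit normals at valid pixels, with illustrative names) is below.

```python
# Sketch of the standard surface normal metrics: mean/median angular error and
# the percentage of pixels within t degrees. pred, gt: (N, 3) unit normals.
import numpy as np

def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    # Clip the dot product to avoid NaNs from numerical noise before arccos.
    cos = np.clip((pred * gt).sum(axis=1), -1.0, 1.0)
    err_deg = np.degrees(np.arccos(cos))
    stats = {"mean": err_deg.mean(), "median": np.median(err_deg)}
    for t in thresholds:
        stats[f"within_{t}"] = 100.0 * (err_deg <= t).mean()
    return stats
```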

We evaluate on OASIS four state-of-the-art networks that are trained to directly predict normals: (1) Hourglass [7] trained on OASIS, (2) Hourglass trained on the Surface Normal in the Wild (SNOW) dataset [7], (3) Hourglass trained on NYU Depth [34], and (4) PBRS, a normal estimation network by Zhang et al. [47] trained on NYU Depth [34]. We also include Front_Facing, a naive baseline predicting all normals to be orthogonal to the image plane.

Figure 5: Limitations of standard metrics: a deep network gets low mean angle error but important details are wrong.

Tab. 3 reports the results. As expected, the Hourglass network trained on OASIS performs the best. Although SNOW is also an in-the-wild dataset, the same network trained on it does not perform as well, though it is still better than training on NYU. Notably, the human-machine gap appears fairly small numerically (17.27 versus 23.91 in mean angle error). However, the naive baseline achieves 31.79, so the dynamic range of this metric is small to start with, due to the natural distribution of normals in the wild. In addition, a close examination of the results suggests that these standard surface normal metrics do not align well with perceptual quality. In natural images there can be large areas that dominate the metric but have uninteresting geometry, such as a blank wall in the background. For example, in Fig. 5, a neural network gets the background correct, but largely misses the important details in the foreground. This opens up an interesting research question about developing new evaluation metrics.

Relative Normal Evaluation We also evaluate the predicted normals in terms of relative relations, specifically orthogonality and parallelism. Getting these relations correct is important because they can help find vanishing lines and perform self-calibration.

We first define a metric to evaluate relative normal. From the human annotations, we sample an equal number of point pairs from surface pairs that are parallel, orthogonal, and neither. Given a predicted normal map, we look at the two predicted normals at each point pair and measure the angle \theta between them. We consider the pair orthogonal if \theta is within \theta_o of 90°, and parallel if \theta is within \theta_p of 0°, where \theta_o and \theta_p are thresholds. We then plot the precision-recall curve for orthogonal by varying \theta_o, and measure its area under the curve (AUC_o), using neither and parallel pairs as negative examples. Varying \theta_p and using neither and orthogonal pairs as negative examples, we obtain AUC_p for parallel.
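A sketch of this evaluation for one relation is given below; the scoring rule (distance of the predicted pair angle from 90° or from 0°) and the use of average precision as the PR-curve area are our assumptions, not the authors' exact protocol.

```python
# Sketch of the relative-normal AUC: score each pair by how close its predicted
# angle is to the target relation, then integrate the precision-recall curve
# with the other two relations as negatives.
import numpy as np
from sklearn.metrics import average_precision_score

def relative_normal_auc(pred_n1, pred_n2, labels, relation="orthogonal"):
    """pred_n1, pred_n2: (N, 3) unit normals at the two points of each pair.
    labels: array-like of strings in {"parallel", "orthogonal", "neither"}."""
    labels = np.asarray(labels)
    cos = np.clip(np.abs((pred_n1 * pred_n2).sum(axis=1)), 0.0, 1.0)
    angle = np.degrees(np.arccos(cos))   # angle between the pair, in [0, 90]
    if relation == "orthogonal":
        score = -np.abs(angle - 90.0)    # closer to 90 degrees -> higher score
    else:  # parallel
        score = -angle                   # closer to 0 degrees -> higher score
    y = (labels == relation).astype(int)
    # Area under the precision-recall curve (average precision).
    return average_precision_score(y, score)
```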

Tab. 3 reports results of relative normal evaluation. Notably, all methods perform similarly, and all perform very poorly compared to humans. This suggests that existing approaches to normal estimation have limitations in capturing orthogonality and parallelism, indicating the need for further research.

Cross-Dataset Generalization Next we study how networks trained on OASIS generalize to other datasets. Surface normal estimation is ideal for such evaluation because unlike depth, which is tricky to evaluate on a new dataset due to scale ambiguity and varying focal length, a normal estimation network can be directly evaluated on a new dataset without modification.

We train the same Hourglass network on OASIS and on NYU, and report performance on two benchmarks not seen during training: DIODE [38] and ETH3D [33]. From Tab. 4 we see that training on NYU underperforms on all benchmarks, showing that networks trained on scene-specific datasets have difficulty generalizing to diverse scenes. Training on OASIS outperforms on all benchmarks, demonstrating the effectiveness of diverse annotations.

6.3 Fold and Occlusion Boundary Detection

Occlusion and fold are both important 3D cues, as they tell us about physical connectivity and curvature: occlusion delineates boundaries at which surfaces are physically disconnected from each other, while fold is where geometry changes abruptly but the surfaces remain connected.

Task We investigate joint boundary detection and occlusion-versus-fold classification: deciding whether a pixel is a boundary (fold or occlusion) and if so, which kind it is. Prior work has explored similar topics: Hoiem et al. [15] and Stein et al. [36] handcraft edge or motion features to perform occlusion detection, but our task involves folds, not just occlusion lines.

Metric \ Model Edge: All Fold Edge: All Occ HED [43] Hourglass [5] Human (Approx)
ODS 0.123 0.539 0.547 (0.533) 0.581 (0.585) 0.810
OIS 0.129 0.576 0.606 (0.584) 0.639 (0.639) 0.815
AP 0.02 0.44 0.488 (0.466) 0.530 (0.547) 0.642
Table 5: Boundary detection performance on OASIS. See Sec. A7: Crossed-out Numbers in Tables about the crossed-out numbers.

Evaluation Metric We adopt metrics similar to the standard ones used in edge detection [1, 43]: F-score at the optimal per-image threshold (OIS), F-score at the optimal fixed threshold over the dataset (ODS), and average precision (AP). For a boundary to be considered correct, it has to be labeled correctly as either occlusion or fold. More details on the metrics can be found in the appendix.

To perform joint detection of fold and occlusion, we adapt and train two networks on OASIS: Hourglass [5], and a state-of-the-art edge detection network, HED [43]. The networks take in an image and output two probabilities per pixel: p_b, the probability of being a boundary pixel (occlusion or fold), and p_f, the probability of being a fold pixel. Given a threshold \tau, pixels with p_b < \tau are neither fold nor occlusion; pixels with p_b \ge \tau are classified as fold if p_f exceeds a second threshold and as occlusion otherwise.
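The thresholding logic can be summarized in a few lines; the 0.5 cut on the fold probability below is an illustrative choice rather than a value specified in the paper.

```python
# Sketch of turning the two per-pixel probability maps into labels.
import numpy as np

def classify_boundaries(p_b, p_f, tau):
    """Return an integer map: 0 = no boundary, 1 = occlusion, 2 = fold."""
    labels = np.zeros_like(p_b, dtype=np.int64)
    boundary = p_b >= tau                 # pixels kept at threshold tau
    labels[boundary & (p_f >= 0.5)] = 2   # fold (0.5 cut is illustrative)
    labels[boundary & (p_f < 0.5)] = 1    # occlusion
    return labels
```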

As baselines, we also investigate how a generic edge detector performs on this task. We use the HED network trained on the BSDS dataset [1] to detect image edges, and classify all resulting edges as either occlusion (Edge: All Occ) or fold (Edge: All Fold).

All results are reported in Tab. 5. Hourglass outperforms HED when trained on OASIS, and significantly outperforms both the All-Fold and All-Occlusion baselines, but still underperforms humans by a large margin, suggesting that fold and occlusion boundary detection remains challenging in the wild.

6.4 Instance Segmentation of Planes

Our last task focuses on instance segmentation of planes in the wild. This task is important because planes often have special functional roles in a scene (e.g. supporting surfaces, walls). Prior work has explored instance segmentation of planes, but is limited to indoor or driving environments [24, 45, 23, 44]. Thanks to OASIS, we are able to present the first-ever evaluation of this task in the wild.

We follow the way prior work [24, 23, 45] performs this task: a network takes in an image and produces instance masks of planes, along with an estimate of the planar parameters that define each 3D plane. To measure performance, we report metrics used in the instance segmentation literature [22]: the average precision (AP) computed and averaged across a range of overlap thresholds (from 50% to 95%, as in [22, 8]). A ground truth plane is considered correctly detected if it overlaps with one of the detected planes by more than the overlap threshold, and we penalize multiple detections as in [8]. We also report the AP at 50% overlap (AP50) and at 75% overlap (AP75).
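A minimal sketch of this COCO-style AP computation with greedy matching is shown below; it follows the cited protocol [22, 8] in spirit, with illustrative names, and is not the authors' exact evaluation script.

```python
# Sketch of AP over mask-overlap thresholds for planar instance segmentation.
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def average_precision(pred_masks, pred_scores, gt_masks, thresh):
    # Greedy matching: visit predictions in order of confidence; a ground-truth
    # plane can be matched at most once, so duplicates count as false positives.
    order = np.argsort(-np.asarray(pred_scores))
    matched = set()
    tp = np.zeros(len(order))
    fp = np.zeros(len(order))
    for rank, i in enumerate(order):
        ious = [mask_iou(pred_masks[i], g) for g in gt_masks]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= thresh and j not in matched:
            tp[rank] = 1
            matched.add(j)
        else:
            fp[rank] = 1
    recall = np.cumsum(tp) / max(len(gt_masks), 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-12)
    return np.trapz(precision, recall)  # area under the precision-recall curve

def ap_over_thresholds(pred_masks, pred_scores, gt_masks):
    threshs = np.arange(0.5, 1.0, 0.05)  # overlap thresholds from 50% to 95%
    return float(np.mean([average_precision(pred_masks, pred_scores, gt_masks, t)
                          for t in threshs]))
```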

PlanarReconstruction by Yu et al. [45] is a state-of-the-art method for planar instance segmentation. We train PlanarReconstruction on three combinations of data: (1) ScanNet [9] only as done in [45], (2) OASIS only, and (3) ScanNet + OASIS. Tab. 6 compares their performance.

As expected, training on ScanNet alone performs the worst, because ScanNet contains only indoor images. Training on OASIS leads to better performance, and leveraging both ScanNet and OASIS is the best overall. But even the best network significantly underperforms humans, suggesting ample space for improvement.

Method Training Data AP AP50 AP75
PlanarReconstruction [45] ScanNet [9] 0.076 (0.076) 0.161 (0.161) 0.064 (0.065)
PlanarReconstruction [45] OASIS 0.125 (0.127) 0.249 (0.250) 0.110 (0.112)
PlanarReconstruction [45] ScanNet [9] + OASIS 0.137 (0.139) 0.262 (0.264) 0.126 (0.130)
Human (Approx) - 0.461 0.542 0.476
Table 6: Planar instance segmentation performance on OASIS. See Sec. A7: Crossed-out Numbers in Tables about the crossed-out numbers.

7 Conclusion

We have presented OASIS, a dataset of rich human 3D annotations. We trained and evaluated leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research.

Acknowledgement This work was partially supported by a National Science Foundation grant (No. 1617767), a Google gift, and a Princeton SEAS innovation grant.

References

  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), pp. 898–916.
  • [2] J. T. Barron and J. Malik (2012) Color constancy, intrinsic images, and shape estimation. In European Conference on Computer Vision, pp. 57–70.
  • [3] S. Bell, P. Upchurch, N. Snavely, and K. Bala (2013) OpenSurfaces: a richly annotated catalog of surface appearance. ACM Transactions on Graphics (SIGGRAPH) 32 (4).
  • [4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black (2012) A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision (ECCV), Part IV, LNCS 7577, pp. 611–625.
  • [5] W. Chen, Z. Fu, D. Yang, and J. Deng (2016) Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pp. 730–738.
  • [6] W. Chen, S. Qian, and J. Deng (2019) Learning single-image depth from videos using quality assessment networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5604–5613.
  • [7] W. Chen, D. Xiang, and J. Deng (2017) Surface normals in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 22–29.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
  • [9] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In CVPR.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [11] P. Dollár and C. L. Zitnick (2013) Structured forests for fast edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1841–1848.
  • [12] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pp. 2366–2374.
  • [13] A. Fernández-Baldera, J. M. Buenaposada, and L. Baumela (2018) BAdaCost: multi-class boosting with costs. Pattern Recognition 79, pp. 467–479.
  • [14] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
  • [15] D. Hoiem, A. A. Efros, and M. Hebert (2011) Recovering occlusion boundaries from an image. International Journal of Computer Vision 91 (3), pp. 328–346.
  • [16] E. Ilg, T. Saikia, M. Keuper, and T. Brox (2018) Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 614–630.
  • [17] K. Karsch, Z. Liao, J. Rock, J. T. Barron, and D. Hoiem (2013) Boundary cues for 3D object shape recovery. In CVPR.
  • [18] A. Knapitsch, J. Park, Q. Zhou, and V. Koltun (2017) Tanks and Temples: benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36 (4).
  • [19] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab (2016) Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 239–248.
  • [20] K. Lasinger, R. Ranftl, K. Schindler, and V. Koltun (2019) Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. arXiv preprint arXiv:1907.01341.
  • [21] Z. Li and N. Snavely (2018) MegaDepth: learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2041–2050.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • [23] C. Liu, K. Kim, J. Gu, Y. Furukawa, and J. Kautz (2019) PlaneRCNN: 3D plane detection and reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4450–4459.
  • [24] C. Liu, J. Yang, D. Ceylan, E. Yumer, and Y. Furukawa (2018) PlaneNet: piece-wise planar reconstruction from a single RGB image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2579–2588.
  • [25] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1512.02134.
  • [26] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison (2016) SceneNet RGB-D: 5M photorealistic images of synthetic indoor trajectories with ground truth. arXiv preprint arXiv:1612.05079.
  • [27] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3D reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460–4470.
  • [28] Y. Quéau, J. Durou, and J. Aujol (2018) Normal integration: a survey. Journal of Mathematical Imaging and Vision 60 (4), pp. 576–593.
  • [29] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In European Conference on Computer Vision, pp. 102–118.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • [31] B. C. Russell and A. Torralba (2009) Building a database of 3D scenes from user annotations. In CVPR 2009, pp. 2711–2718.
  • [32] A. Saxena, S. H. Chung, and A. Y. Ng (2008) 3-D depth reconstruction from a single still image. International Journal of Computer Vision 76 (1), pp. 53–69.
  • [33] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR).
  • [34] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus (2012) Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pp. 746–760.
  • [35] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser (2017) Semantic scene completion from a single depth image. In IEEE Conference on Computer Vision and Pattern Recognition.
  • [36] A. N. Stein and M. Hebert (2009) Occlusion boundaries from motion: low-level detection and mid-level reasoning. International Journal of Computer Vision 82 (3), pp. 325.
  • [37] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman (2018) Pix3D: dataset and methods for single-image 3D shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2974–2983.
  • [38] I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, et al. (2019) DIODE: a dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463.
  • [39] P. Wang, X. Shen, B. Russell, S. Cohen, B. Price, and A. L. Yuille (2016) SURGE: surface regularized geometry estimation from a single image. In Advances in Neural Information Processing Systems, pp. 172–180.
  • [40] X. Wang, D. Fouhey, and A. Gupta (2015) Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 539–547.
  • [41] K. Xian, C. Shen, Z. Cao, H. Lu, Y. Xiao, R. Li, and Z. Luo (2018) Monocular relative depth perception with web stereo data supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 311–320.
  • [42] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese (2016) ObjectNet3D: a large scale database for 3D object recognition. In European Conference on Computer Vision, pp. 160–176.
  • [43] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1395–1403.
  • [44] F. Yang and Z. Zhou (2018) Recovering 3D planes from a single image via convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100.
  • [45] Z. Yu, J. Zheng, D. Lian, Z. Zhou, and S. Gao (2019) Single-image piece-wise planar 3D reconstruction via associative embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1029–1037.
  • [46] B. Zeisl, M. Pollefeys, et al. (2014) Discriminatively trained dense surface normal estimation. In European Conference on Computer Vision, pp. 468–484.
  • [47] Y. Zhang, S. Song, E. Yumer, M. Savva, J. Lee, H. Jin, and T. Funkhouser (2017) Physically-based rendering for indoor scene understanding using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

A1: Surface Normal Annotation UI

The surface normal annotation UI is shown in Fig. I.

Figure I: Surface normal annotation UI. The surface normal is visualized as a blue arrow originating from a green grid, rendered in perspective projection according to the known focal length.

A2: Planar versus Curved Regions

Tab. I measures the annotation quality separately for planar regions and curved regions.

NYU Depth [34]
Human-Human Human-Sensor
Planar Regions 0.079m 0.091m
Curved Regions 0.077m 0.102m
Table I: Depth difference between different humans (Human-Human) and between humans and depth sensors (Human-Sensor) in planar and curved regions. The results are averaged over all human pairs. The mean of depth in tested samples is 2.471 m, the standard deviation is 0.754 m.

A3: Comparison with Other Datasets

Tab. II compares OASIS and other datasets.

Dataset In the Wild Acquisition Depth Normals Occlusion & Fold Relative Normals Planar Inst Seg # Images
OASIS ✓ Human annotation Metric (up to scale) Dense ✓ ✓ ✓ 140K
NYU Depth V2 [34] - Kinect Metric Dense - - - 407K
KITTI [14] - LiDAR Metric - - - - 93K
DIW [5] ✓ Human annotation Relative - - - - 496K
SNOW [7] ✓ Human annotation - Sparse - - - 60K
MegaDepth [21] ✓ SfM Metric (up to scale) - - - - 130K
ReDWeb [41] ✓ Stereo Metric (up to scale) - - - - 3.6K
3D Movie [20] ✓ Stereo Metric (up to scale) - - - - 75K
OpenSurfaces [3] - Human annotation - Dense - - - 25K
CMU Occlusion [36] ✓ Human annotation - - Occlusion Only - - 538
Table II: Comparison between OASIS and other 3D datasets. "Metric (up to scale)" denotes that the depth is metrically accurate up to a scale factor.
Figure II: Additional human annotations from OASIS. Note that each planar instance has a different color.

A4: Additional Examples from OASIS

Additional human annotations are shown in Fig. II.

A5: Additional Qualitative Outputs

Figure III: Additional qualitative outputs from four tasks: (1) depth estimation, (2) normal estimation, (3) fold and occlusion boundary detection, and (4) planar instance segmentation.

Qualitative predictions presented in both Fig. III and Fig. 4 are produced as follows: depth predictions by a ResNetD [41] network trained on OASIS + ImageNet [10]; surface normal predictions by an Hourglass [7] network trained on OASIS alone; occlusion boundary and fold predictions by an Hourglass [5] network trained on OASIS alone; and planar instance segmentations by a PlanarReconstruction [45] network trained on ScanNet [9] + OASIS.

A6: Evaluating Fold and Occlusion Boundary Detection

This section provides details on evaluating fold and occlusion boundary detection. As discussed in Sec 6.3 of the main paper, our metric is based on the ones used in evaluating edge detection [1, 11, 43, 13].

The input to our evaluation pipeline consists of (1) the probability p_b of each pixel being on an edge (fold or occlusion), and (2) a label of each pixel being occlusion or fold. By thresholding p_b, we first obtain an edge map at threshold \tau, which we split into its predicted occlusion pixels and predicted fold pixels using the labels. We compare the predicted occlusion pixels against the ground-truth occlusion boundaries using the same protocol as [1] and obtain the true positive count TP_occ, false positive count FP_occ, and false negative count FN_occ. We follow the same protocol to compare the predicted fold pixels against the ground-truth folds and obtain TP_fold, FP_fold, and FN_fold.

We then calculate the joint counts: TP = TP_occ + TP_fold, FP = FP_occ + FP_fold, and FN = FN_occ + FN_fold.

We iterate through different thresholds \tau, obtaining the joint counts at each threshold, and from these compute the final ODS/OIS F-scores and AP.
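A sketch of this bookkeeping is shown below; compare_to_gt is a hypothetical helper standing in for the boundary-matching protocol of [1], and all other names are illustrative.

```python
# Sketch of combining per-class counts into joint precision/recall/F-scores
# across thresholds. `labels` uses 1 = occlusion, 2 = fold.
import numpy as np

def joint_f_scores(p_b, labels, gt_occ, gt_fold, taus, compare_to_gt):
    f_scores = []
    for tau in taus:
        edge = p_b >= tau
        tp_o, fp_o, fn_o = compare_to_gt(edge & (labels == 1), gt_occ)
        tp_f, fp_f, fn_f = compare_to_gt(edge & (labels == 2), gt_fold)
        tp, fp, fn = tp_o + tp_f, fp_o + fp_f, fn_o + fn_f   # joint counts
        prec = tp / max(tp + fp, 1e-12)
        rec = tp / max(tp + fn, 1e-12)
        f_scores.append(2 * prec * rec / max(prec + rec, 1e-12))
    return f_scores  # ODS picks the best fixed tau over the whole dataset
```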

A7: Crossed-out Numbers in Tables

We made minor quality improvements to the dataset after the camera-ready deadline of CVPR 2020, affecting less than 10% of the images. The crossed-out numbers are those presented in the CVPR camera-ready version of this paper (https://openaccess.thecvf.com/content_CVPR_2020/papers/Chen_OASIS_A_Large-Scale_Dataset_for_Single_Image_3D_in_the_CVPR_2020_paper.pdf) and are from the older, obsolete version of the dataset. The publicly released version of OASIS is the new version with the quality improvements.