## 1 Introduction

Despite their tremendous effectiveness in tasks such as object category detection, most deep neural networks do not understand the 3D nature of object categories. Reasoning about objects in 3D is necessary in many applications: for physical reasoning, or to understand the geometric relationships between different objects or scene elements.

The typical approach to learn 3D objects is to make use of large collections of high quality CAD models such as [5] or [46], which can be used to fully supervise models to recognize the objects’ viewpoint and 3D shape. Alternatively, one can start from standard image datasets such as PASCAL VOC [8], augmented with other types of supervision, such as object segmentations and keypoint annotations [4]. Whether synthetically generated or manually collected, annotations have so far been required in order to overcome the significant challenges of learning 3D object categories, where both viewpoint and geometry are variable.

In this paper, we develop an alternative approach that can learn 3D object categories in an *unsupervised manner* (fig. 1), replacing synthetic or manual supervision with *motion*. Humans learn about the visual world by experiencing it continuously, through a variable viewpoint, which provides very strong cues on its 3D structure. Our goal is to build on such cues in order to learn the 3D geometry of object categories, using videos rather than images of objects. We are motivated by the fact that videos are almost as cheap as images to capture, and do not require annotations.

We build on mature structure-from-motion (SFM) technology to extract 3D information from individual video sequences. However, these cues are specific to each object instance as contained in different videos. The challenge is to integrate this information in a global 3D model of the object category, as well as to work with noisy and incomplete reconstructions from SFM.

We propose a new deep architecture composed of three modules (fig. 2). The first module estimates the *absolute viewpoint* of objects in all video sequences (sec. 3.2). This aligns different object instances to a common reference frame where geometric relationships can be modeled more easily. The second estimates the 3D shape of an object from a given viewpoint, producing a *depth map* (sec. 3.3). The third *completes the depth map to a full 3D reconstruction* in the globally-aligned reference frame (sec. 3.4). Combined and trained end-to-end without supervision, from videos alone, these components constitute VpDR-Net, a network for viewpoint, depth and reconstruction, capable of extracting the viewpoint and shape of a new object instance from a single image.

One of our main contributions is thus to demonstrate the utility of using motion cues in learning 3D categories. We also introduce two significant technical innovations in the viewpoint and shape estimation modules as well as design guidelines and training strategies for 3D estimation tasks.

The first innovation (sec. 3.2) is a new approach to align video sequences of different 3D objects based on a *Siamese viewpoint factorization network*. While existing methods [40, 38] align shapes by looking at 3D features, we propose to train VpDR-Net to directly estimate the absolute viewpoint of an object. We train our network to reconstruct *relative camera motions* and we show that this implicitly aligns different object instances together. By avoiding explicit shape comparisons in 3D space, this method is simpler and more robust than alternatives.

The second innovation (sec. 3.4) is a new network architecture that can generate a complete point cloud for the object from a partial reconstruction obtained from monocular depth estimation. This is based on a shape representation that predicts the support of a point probability distribution in 3D space, akin to a flexible voxelization, and a corresponding space occupancy map.

As a general design guideline, we demonstrate throughout the paper the utility of allowing deep networks to *express uncertainty* in their estimate by predicting probability distributions over outputs (sec. 3), yielding more robust training and useful cues (such as separating foreground and background in a depth map). We also demonstrate the significant power of *geometry-aware data augmentation*, where a deep network is used to predict the geometry of an image and the latter is used to generate new realistic views to train other components of the system (sec. 4). Each component and design choice is thoroughly evaluated in sec. 5, with significant improvements over the state-of-the-art.

## 2 Related work

Viewpoint estimation. The vast majority of methods for learning the viewpoint of object categories use manual supervision [35, 27, 11, 29, 47, 25, 42] or synthetic [39] data. In [44], a deep architecture predicts a relative camera pose and depth for a pair of images. Only a few works have used videos [40, 38]. [38] solves the shape alignment problem using a global search strategy based on the pairwise alignment of point clouds, a step we avoid by means of our Siamese viewpoint factorization network.

3D shape prediction. A traditional approach to 3D reconstruction is to use handcrafted 3D models [32, 22], and more recently 3D CAD models [5, 47]. Often the idea is to search for the 3D model in a CAD library that best fits the image [20, 1, 13, 2]. Alternatively, CAD models can be used to train a network to directly predict the 3D shape of an object [10, 45, 41, 7]. Morphable models have sometimes been used [49, 17], particularly for modeling faces [3, 21]. All these methods require 3D models at train time.

Data-driven approaches for geometry. Structure from motion (SFM) generally assumes fixed geometry between views and is difficult to apply directly to object categories due to intra-class variations. Starting from datasets of unordered images, methods such as [48] and [30] use SFM and manual annotations, such as keypoints in [4, 17], to estimate a rough 3D geometry of objects. Here, we leverage motion cues and do not need extra annotations.

## 3 Method

We propose a single Convolutional Neural Network (CNN), VpDR-Net, that learns a *3D object category* by observing it from a *variable viewpoint* in videos, without supervision (fig. 2). Videos do not solve the problem of modeling intra-class shape variations, but they provide powerful yet noisy cues about the 3D shape of individual objects.

VpDR-Net takes as input a set of video sequences of an object category (such as cars or chairs), where each video contains RGB or RGBD frames, and learns a model of the 3D category. This model has three components: i) a predictor of the *absolute viewpoint* of the object (implicitly aligning the different object instances to a common reference frame; sec. 3.2), ii) a monocular depth predictor (sec. 3.3) and iii) a shape predictor that extends the depth map to a point cloud capturing the complete shape of the object (sec. 3.4). Learning starts by preprocessing videos to extract instance-specific egomotion and shape information (sec. 3.1).

### 3.1 Sequence-specific structure and pose

Video sequences are pre-processed to extract from each frame $f_i$ a tuple consisting of: (i) the camera calibration parameters $K_i$, (ii) its pose $g_i = (R_i, T_i)$, and (iii) a depth map $D_i$ associating a depth value to each pixel of $f_i$. The camera pose consists of a rotation matrix $R_i$ and a translation vector $T_i$.^{1}

We extract this information using off-the-shelf methods: the structure-from-motion (SFM) algorithm COLMAP for RGB sequences [36, 37], and an open-source implementation [34] of KinectFusion (KF) [26] for RGBD sequences. The information extracted from RGB or RGBD data is qualitatively similar, except that the scale of SFM reconstructions is arbitrary.

^{1} We use the convention that $g_i$ transforms world-relative coordinates $p_w$ to camera-relative coordinates $p_c = R_i p_w + T_i$.

### 3.2 Intra-sequence alignment

Methods such as SFM or KF can reliably estimate camera pose and depth information for single objects and individual video sequences, but are not applicable to *different instances and sequences*. In fact, their underlying assumption is that geometry is fixed, which is true for single (rigid) objects, but false when the geometry and appearance differ due to intra-class variations.

Learning 3D object categories requires relating their variable 3D shapes by identifying and putting in correspondence analogous geometric features, such as the object's front and rear. For rigid objects, such correspondences can be expressed by rigid transformations that *align* occurrences of analogous geometric features.

The most common approach for aligning 3D shapes, also adopted by [38] for video sequences, is to extract and match 3D feature descriptors. Once objects in images or videos are aligned, the data can be used to supervise other tasks, such as learning a monocular predictor of the absolute viewpoint of an object [38].

One of our main contributions, described below, is to reverse this process by learning a viewpoint predictor *without* explicitly matching 3D shapes. Empirically (sec. 5), we show that, by skipping the intermediate 3D analysis, our method is often more effective and robust than alternatives.

Siamese network for viewpoint factorization. Geometric analogies between 3D shapes can often be detected in image space directly, based on visual similarity. Thus, we propose to train a CNN $\Phi_{vp}$ that maps a single frame $f_i$ to its *absolute viewpoint* $\hat g_i = \Phi_{vp}(f_i)$ in the globally-aligned reference frame. We wish to learn this CNN from the viewpoints estimated by the algorithms of sec. 3.1 for each video sequence. However, these estimated viewpoints are *not* absolute, but valid only within each sequence; formally, there are unknown sequence-specific motions $h_s$ that map the sequence-specific camera poses $g_i$ to global poses $g_i h_s$.^{2}

^{2} $h_s$ composes to the right: it transforms the world reference frame and then moves it to the camera reference frame.

To address this issue, we propose to supervise the network using *relative pose changes within each sequence*, which are invariant to the alignment transformation $h_s$. Formally, the transformation $h_s$ is eliminated by computing the relative pose change of the camera from frame $i$ to frame $j$:

$$g_{ij} = (g_j h_s)(g_i h_s)^{-1} = g_j g_i^{-1}. \tag{1}$$

Expanding the expression with $g_i = (R_i, T_i)$, we find equations expressing the relative rotation and translation:

$$R_{ij} = R_j R_i^\top, \tag{2}$$

$$T_{ij} = T_j - R_j R_i^\top T_i. \tag{3}$$

Eqs. (2) and (3) are used to constrain the training of a *Siamese architecture*, which, given two frames $f_i$ and $f_j$, evaluates the CNN twice to obtain viewpoint estimates $\hat g_i = \Phi_{vp}(f_i)$ and $\hat g_j = \Phi_{vp}(f_j)$. The estimated poses are then compared to the ground truth ones, $g_i$ and $g_j$, in a relative manner by using losses that enforce the estimated poses to satisfy eqs. (2) and (3):

$$\ell_{rot}(g_{ij}, \hat g_i, \hat g_j) = \big\| \log\big( R_{ij} (\hat R_j \hat R_i^\top)^\top \big) \big\|_F, \tag{4}$$

$$\ell_{tr}(g_{ij}, \hat g_i, \hat g_j) = \big\| T_{ij} - (\hat T_j - \hat R_j \hat R_i^\top \hat T_i) \big\|, \tag{5}$$

where $\log$ is the principal matrix logarithm and $\|\cdot\|_F$ the Frobenius norm.
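The cancellation of the alignment $h_s$ in eqs. (1)–(3) can be checked numerically. The NumPy sketch below (with helper names of our own choosing, not from the paper) builds two camera poses, right-composes both with an arbitrary alignment, and verifies that the relative pose is unchanged:

```python
import numpy as np

def relative_pose(R_i, T_i, R_j, T_j):
    """Relative pose change g_ij = g_j g_i^{-1}, i.e. eqs. (2) and (3)."""
    R_ij = R_j @ R_i.T
    T_ij = T_j - R_j @ R_i.T @ T_i
    return R_ij, T_ij

def compose(R_a, T_a, R_b, T_b):
    """Composition g_a g_b: apply g_b first, then g_a (p -> R_a (R_b p + T_b) + T_a)."""
    return R_a @ R_b, R_a @ T_b + T_a

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix gives an orthogonal matrix;
    # flipping the sign if needed makes it a proper rotation (det = +1).
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.linalg.det(q))

rng = np.random.default_rng(0)
R_i, R_j, R_h = (random_rotation(rng) for _ in range(3))
T_i, T_j, T_h = (rng.standard_normal(3) for _ in range(3))

# Globally-aligned poses g_i h_s, g_j h_s (h_s composes to the right).
Ri2, Ti2 = compose(R_i, T_i, R_h, T_h)
Rj2, Tj2 = compose(R_j, T_j, R_h, T_h)

R1, T1 = relative_pose(R_i, T_i, R_j, T_j)     # from sequence-specific poses
R2, T2 = relative_pose(Ri2, Ti2, Rj2, Tj2)     # from aligned poses
assert np.allclose(R1, R2) and np.allclose(T1, T2)  # h_s cancels
```

Because $h_s$ cancels in $g_j g_i^{-1}$, every sequence provides the same kind of relative-pose supervision regardless of its unknown alignment, which is what makes cross-sequence training possible.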

While this CNN is only required to correctly predict relative viewpoint changes *within each sequence*, since the *same CNN* is used for all videos, the most plausible/regular solution for the network is to assign similar viewpoint predictions to images of objects viewed from the same viewpoint, leading to a globally consistent alignment of the input sequences. Furthermore, in a large family of 3D objects, different shapes (e.g. SUVs and sedans) tend to be mediated by intermediate cases, which further encourages a consistent alignment.
This is shown empirically in sec. 5.

Scale ambiguity in SFM. For methods such as SFM, there is an additional ambiguity: reconstructions are known only up to sequence-specific scaling factors $\lambda_s$, so that the camera pose is parametrized as $g_i = (R_i, \lambda_s T_i)$. This ambiguity leaves eq. (2) unchanged, but eq. (3) becomes:

$$T_{ij} = \lambda_s (T_j - R_j R_i^\top T_i).$$

During training, the ambiguity can be removed from loss (5) by dividing the vectors $T_{ij}$ and $\hat T_j - \hat R_j \hat R_i^\top \hat T_i$ by their Euclidean norms. Note that for KF sequences $\lambda_s = 1$. As the viewpoints are learned, an estimate of $\lambda_s$ is computed using a moving average over training iterations for the other network modules to use (sec. A.1).

Probabilistic predictions. Due to intrinsic ambiguities in the images or to errors in the SFM supervision (caused for example by reflective or textureless surfaces), $\Phi_{vp}$ is occasionally unable to predict the ground truth viewpoint accurately. We found it beneficial to allow the network to explicitly learn these cases and express uncertainty as an additional input-dependent prediction. For the translation component, we modify the network to predict the absolute pose $\hat g_i$ as well as a confidence score $\hat\sigma_i$ (predicted as the output of a soft ReLU unit to ensure positivity). We then model the relative translation as a Gaussian distribution whose standard deviation $\sigma_{ij}$ is derived from the per-frame confidences, and our model is now learned by minimizing the negative log-likelihood $\ell^+_{tr}$, which replaces the loss $\ell_{tr}$:

$$\ell^+_{tr} = \frac{\ell_{tr}(g_{ij}, \hat g_i, \hat g_j)^2}{2\sigma_{ij}^2} + \log \sigma_{ij}. \tag{6}$$

The rotation component is more complex due to the non-Euclidean geometry of $SO(3)$, but it was found sufficient to assume that the error term (4) has a Laplace distribution and optimize $\ell^+_{rot} = \ell_{rot}/\sigma_{ij} + \log Z(\sigma_{ij})$, where $Z$ is a normalization term ensuring that the probability distribution integrates to one. During training, by optimizing the losses $\ell^+_{tr}$ and $\ell^+_{rot}$ instead of $\ell_{tr}$ and $\ell_{rot}$, the network can discount gross errors by dividing the losses by a large predicted variance.
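As a minimal illustration (not the exact training code), the two probabilistic losses can be written as below; `soft_relu` mirrors the positivity constraint on the confidence, and the normalization constants of eq. (6) are dropped:

```python
import numpy as np

def soft_relu(x):
    """Softplus: a smooth, strictly positive reparametrization of the confidence."""
    return np.log1p(np.exp(x))

def gaussian_nll_translation(t_err, sigma_ij):
    """Eq. (6) sketch: squared translation error scaled by the predicted std."""
    return t_err ** 2 / (2.0 * sigma_ij ** 2) + np.log(sigma_ij)

def laplace_nll_rotation(r_err, sigma_ij):
    """Laplace NLL for the rotation error; the normalizer is folded into the log term."""
    return r_err / sigma_ij + np.log(sigma_ij)

# A gross error (e.g. a reflective surface broke SFM) is cheap when the
# network also predicts a large sigma, so outliers stop dominating training.
assert gaussian_nll_translation(5.0, 5.0) < gaussian_nll_translation(5.0, 0.5)
# Conversely, a confident (small-sigma) prediction is rewarded when accurate.
assert laplace_nll_rotation(0.1, 0.1) < laplace_nll_rotation(0.1, 10.0)
```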

Architecture. The architecture of $\Phi_{vp}$ is a variant of ResNet-50 [15] with some modifications to improve its performance as a viewpoint predictor. The lower layers are used to extract a multiscale intermediate representation (denoted HC for hypercolumn [14] in fig. 2). The upper layers consist of downsampling residual blocks that predict the viewpoint (see supp. material for details).

### 3.3 Depth prediction

The depth predictor module of VpDR-Net takes individual frames $f_i$ and outputs a corresponding depth map $\hat D_i$, performing monocular depth estimation.

Estimating depth from a single image is inherently ambiguous and requires comparing the image to internal priors of the object shape. Similarly to pose, we allow the network to explicitly *learn and express uncertainty* about depth estimates by predicting a posterior distribution over possible pixel depths. For robustness to outliers from COLMAP and KF, we assume a Laplace distribution with negative log-likelihood loss

$$\ell_{depth}(d, \hat d, \hat\sigma) = \frac{|\lambda_s d - \hat d|}{\hat\sigma} + \log \hat\sigma, \tag{7}$$

where $d$ is the noisy ground truth depth output by SFM or KF for a given pixel, and $\hat d$ and $\hat\sigma$ are respectively the corresponding predicted depth mean and standard deviation. The relative scale $\lambda_s$ is 1 for KF and is estimated as explained in sec. 3.2 for SFM.
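A one-line sketch of the per-pixel loss of eq. (7), working on scalars or whole depth maps alike (illustrative, with constants dropped; not the training implementation):

```python
import numpy as np

def depth_laplace_nll(d_gt, d_pred, sigma_pred, scale=1.0):
    """Per-pixel Laplace NLL of eq. (7); `scale` is lambda_s (1 for KinectFusion)."""
    return np.abs(scale * d_gt - d_pred) / sigma_pred + np.log(sigma_pred)

# A perfect, fully confident prediction costs nothing...
assert depth_laplace_nll(2.0, 2.0, 1.0) == 0.0
# ...while a gross depth outlier is cheaper when flagged with a large sigma,
# which is how the network learns to "give up" on unreliable pixels.
assert depth_laplace_nll(10.0, 2.0, 8.0) < depth_laplace_nll(10.0, 2.0, 1.0)
```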

### 3.4 Point-cloud completion

Given any image of an object instance, its aligned 3D shape can be reconstructed by estimating and aligning its depth map using the output of the viewpoint and depth predictors of sec. 3.2 and 3.3. However, since a depth map cannot represent the occluded portions of the object, such a reconstruction can only be partial. In this section, we describe the third and last component of VpDR-Net, whose goal is to generate a full reconstruction of the object, beyond what is visible in the given view.

Partial point cloud. The first step is to convert the predicted depth map $\hat D$ into a partial point cloud $P = \{p_{uv}\}$, where $p_{uv} = \hat D(u, v)\, K^{-1} (u, v, 1)^\top$, $(u, v)$ are the coordinates of a pixel in the depth map and $K$ is the camera calibration matrix. Empirically, we have found that the reconstruction problem is much easier if the data is aligned in the global reference frame established by VpDR-Net. Thus, we transform $P$ into a globally-aligned point cloud $\bar P = \hat g^{-1} P$, where $\hat g$ is the camera pose estimated by the viewpoint-prediction network.
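The backprojection and alignment steps can be sketched with NumPy as follows (the helper names are ours, not the paper's; $\hat g^{-1}$ maps camera coordinates back to the global frame):

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map to a camera-frame cloud: p_uv = D(u,v) * K^{-1} (u, v, 1)^T."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    rays = np.linalg.inv(K) @ pix          # viewing ray per pixel
    return rays * depth.reshape(1, -1)     # shape (3, H*W)

def align(points, R, T):
    """Apply g^{-1} to camera-frame points: p_world = R^T (p_cam - T)."""
    return R.T @ (points - T[:, None])

# Toy check: focal length 2, principal point (1, 1).
K = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
pts = backproject(np.ones((2, 2)), K)
# The pixel at the principal point backprojects straight down the optical axis.
assert np.allclose(pts[:, -1], [0.0, 0.0, 1.0])
```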

Point cloud completion network. Next, our goal is to learn the point cloud completion part $\Psi$ of our network, which takes the aligned but incomplete point cloud $\bar P$ and produces a complete object reconstruction. We do so by predicting a 3D occupancy probability field. However, rather than using a volumetric method that may require a discrete and fixed voxelization of space, we propose a simple and efficient alternative. First, the network predicts a set of 3D support points $S = \{x_1, \dots, x_M\}$ that, during training, closely fit the ground truth 3D point cloud $\bar Q$. This step minimizes the fitting error:

$$\ell_{fit} = \frac{1}{|\bar Q|} \sum_{q \in \bar Q} \min_m \| x_m - q \|^2. \tag{8}$$

The 3D point cloud $S$ provides a good coverage of the ground truth object shape. However, this point cloud is conservative and distributed *in the vicinity* of the ground truth object. Thus, while $S$ is not a precise representation of the object shape, it works well as a support of a probability distribution of space occupancy. In order to estimate the occupancy probability values, the network predicts additional scalar outputs $c_m$, proportional to the number of ground truth surface points for which the support point $x_m$ is the nearest neighbor. The network is trained to compute a prediction $\hat c_m$ of the occupancy masses by minimizing the squared error loss $\sum_m (\hat c_m - c_m)^2$.

Given the network prediction $\hat c_m$, the completed point cloud is then defined as the subset of support points with sufficiently high occupancy, $\{x_m : \hat c_m \geq \tau\}$, where $\tau$ is a confidence parameter. The set can be further refined by using e.g. a 3D Laplacian filter to smooth out noise.
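A small NumPy sketch of how the target occupancy masses $c_m$ and the thresholded reconstruction can be computed (brute-force nearest neighbors, normalized masses; an illustration of the representation, not the paper's code):

```python
import numpy as np

def occupancy_masses(support, gt_points):
    """c_m proportional to the number of GT points whose nearest support point is x_m."""
    # Pairwise squared distances (|gt| x |support|), brute force for clarity.
    d2 = ((gt_points[:, None, :] - support[None, :, :]) ** 2).sum(-1)
    nn = d2.argmin(axis=1)                              # nearest support per GT point
    counts = np.bincount(nn, minlength=len(support)).astype(float)
    return counts / counts.sum()                        # normalize to a distribution

def complete_cloud(support, masses, tau):
    """Keep only support points whose occupancy exceeds the confidence threshold tau."""
    return support[masses >= tau]

support = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
gt = np.array([[0.1, 0, 0], [0, 0.1, 0], [-0.1, 0, 0], [10.2, 0, 0]])
m = occupancy_masses(support, gt)       # -> [0.75, 0.25]
kept = complete_cloud(support, m, 0.5)  # only the densely supported point survives
```

This mirrors the idea in the text: the support points act as a flexible, data-dependent "voxelization", and the masses decide which of them belong to the object.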

Architecture. The point cloud completion network $\Psi$ is modeled after PointNet [31], originally proposed to semantically *segment* point clouds. Here we adapt it to perform a completely different task, namely 3D shape reconstruction. This is made possible by our model, where shape is represented as a cloud of 3D support points $x_m$ and their occupancy masses $c_m$.
Differently from the viewpoint and depth subnetworks, $\Psi$ is *not* convolutional but uses a sequence of fully connected layers to process the 3D points in $\bar P$, after appending an appearance descriptor to each of them. A key step is to add an intermediate orderless pooling operator to remove the dependency on the order and number of input points (see the supplementary material for details). The architecture is configured to predict a fixed number $M$ of support points.

Table 1: Viewpoint estimation. Median absolute errors ($e_R$, $e_C$) and median relative errors ($e_R^{rel}$, $e_T^{rel}$), lower is better; confidence average precisions ($AP_R$, $AP_C$), higher is better.

| object class | test set | level of supervision | method | $e_R$ | $e_C$ | $e_R^{rel}$ | $e_T^{rel}$ | $AP_R$ | $AP_C$ |
|---|---|---|---|---|---|---|---|---|---|
| car | Pascal3D | unsupervised | VPNet + aligned FrC [38] | 49.62 | 32.29 | 85.45 | 0.84 | 0.15 | 0.01 |
| | | unsupervised | VpDR-Net + FrC (ours) | 29.57 | 7.29 | 62.30 | 0.65 | 0.41 | 0.91 |
| | | fully supervised | VPNet + Pascal3D | 12.49 | 1.27 | 20.34 | 0.24 | 0.77 | 0.97 |
| chair | Pascal3D | unsupervised | VPNet + aligned LDOS [38] | 64.68 | 42.46 | 89.01 | 0.95 | 0.06 | 0.00 |
| | | unsupervised | VpDR-Net + LDOS (ours) | 42.34 | 16.72 | 71.35 | 0.93 | 0.23 | 0.22 |
| | | fully supervised | VPNet + Pascal3D | 34.37 | 6.14 | 67.41 | 0.74 | 0.26 | 0.66 |
| | LDOS | unsupervised | VPNet + aligned LDOS [38] | 30.56 | 0.61 | 71.40 | 0.77 | 0.30 | 0.18 |
| | | unsupervised | VpDR-Net + LDOS (ours) | 33.92 | 0.54 | 60.90 | 0.70 | 0.40 | 0.22 |
| | | fully supervised | VPNet + Pascal3D | 61.45 | 2.55 | 82.97 | 0.96 | 0.15 | 0.00 |

Leave out. During training, the incomplete point cloud $\bar P$ is downsampled by randomly selecting a subset of its points based on their depth prediction confidence as estimated by the depth predictor. Similar to dropout, dropping points allows the network to overfit less, to become less sensitive to the size of the input point cloud, and to implicitly discard background points (as these are assigned low confidence by the depth prediction). For the latter reason, leave out is maintained at test time too.
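A minimal sketch of confidence-weighted point dropping; the particular weighting (sampling proportionally to normalized confidence) is our illustrative assumption:

```python
import numpy as np

def leave_out(points, confidence, n_keep, rng):
    """Randomly keep n_keep points, favoring high depth-prediction confidence."""
    p = confidence / confidence.sum()
    idx = rng.choice(len(points), size=n_keep, replace=False, p=p)
    return points[idx]

rng = np.random.default_rng(0)
pts = np.arange(8.0).reshape(4, 2)
conf = np.array([1.0, 1.0, 1.0, 0.0])   # last point: e.g. a background pixel
kept = leave_out(pts, conf, 3, rng)     # zero-confidence point is never sampled
```

As in dropout, the subset changes every iteration, so $\Psi$ cannot rely on any individual input point, while low-confidence (typically background) points are filtered out for free.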

## 4 Geometry-aware data augmentation

As viewpoint prediction with deep networks benefits significantly from large training sets [39], we increase the effective size of the training videos by *data augmentation*. This is trivial for tasks such as classification, where one can translate or scale an image without changing its identity. The same is true for viewpoint recognition if the task is to only estimate the viewpoint orientation as in [39, 42], as images can be scaled and translated without changing the equivalent viewpoint orientation. However, this assumption is not satisfied if, as in our case, the goal is to estimate all 6 DoF of the camera pose.

Inspired by the approach of [12], we propose to solve this problem by using the estimated scene geometry to *generate new realistic viewpoints* (fig. 3). Given a training sample, we apply a random perturbation to the viewpoint (with a forward bias to avoid disoccluding too many pixels) and use depth-image-based rendering (DIBR) [24] to generate a new sample, warping both the image and the depth map.

Sometimes the depth map from KF contains too many holes to yield satisfactory DIBR results (fig. 3, bottom); we found it preferable to use the depth estimated by the network, which is less accurate but more robust, containing almost no missing pixels (fig. 3, top).
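The warping step can be sketched as a naive forward-splatting DIBR with a z-buffer (a toy version; the actual renderer of [24] also handles hole filling and resampling):

```python
import numpy as np

def dibr_warp(image, depth, K, R, T):
    """Forward-warp an image and depth map to a perturbed viewpoint (R, T).
    Nearest-pixel splatting with a z-buffer; holes are simply left empty."""
    h, w = depth.shape
    out_img = np.zeros_like(image)
    out_depth = np.full((h, w), np.inf)
    Kinv = np.linalg.inv(K)
    for v in range(h):
        for u in range(w):
            p = depth[v, u] * (Kinv @ np.array([u, v, 1.0]))  # backproject pixel
            q = R @ p + T                                     # move to new camera
            if q[2] <= 0:
                continue                                      # behind the camera
            x = K @ q
            uu, vv = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
            if 0 <= uu < w and 0 <= vv < h and q[2] < out_depth[vv, uu]:
                out_depth[vv, uu] = q[2]                      # keep nearest surface
                out_img[vv, uu] = image[v, u]
    return out_img, out_depth

# Sanity check: the identity perturbation reproduces the input exactly.
rng = np.random.default_rng(1)
img = rng.random((4, 4))
out_img, out_depth = dibr_warp(img, np.ones((4, 4)), np.eye(3), np.eye(3), np.zeros(3))
assert np.allclose(out_img, img)
```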

Table 2: Ablation study on the chair class (same metrics as table 1).

| method | $e_R$ | $e_C$ | $e_R^{rel}$ | $e_T^{rel}$ | $AP_R$ | $AP_C$ |
|---|---|---|---|---|---|---|
| **Test set: LDOS** | | | | | | |
| VpDR-Net (ours) | 33.92 | 0.54 | 60.90 | 0.70 | 0.40 | 0.22 |
| VpDR-Net-NoProb | 45.33 | 0.67 | 69.33 | 0.85 | 0.12 | 0.07 |
| VpDR-Net-NoDepth | 68.19 | 0.85 | 82.99 | 1.01 | 0.01 | 0.01 |
| VpDR-Net-NoAug | 35.16 | 0.59 | 63.54 | 0.73 | 0.38 | 0.19 |
| **Test set: Pascal3D** | | | | | | |
| VpDR-Net (ours) | 42.34 | 16.72 | 71.35 | 0.93 | 0.23 | 0.22 |
| VpDR-Net-NoProb | 57.23 | 17.06 | 77.72 | 1.05 | 0.08 | 0.14 |
| VpDR-Net-NoDepth | 60.31 | 17.89 | 85.17 | 1.15 | 0.07 | 0.21 |
| VpDR-Net-NoAug | 43.52 | 18.80 | 72.93 | 0.92 | 0.10 | 0.17 |

## 5 Experiments

We assess viewpoint estimation in sec. 5.1, depth prediction in sec. 5.2, and point cloud prediction in sec. 5.3.

Datasets. Throughout the experimental section, we consider three datasets for training and benchmarking our network: (1) FreiburgCars (FrC) [38] which consists of RGB video sequences with the camera circling around various types of cars; (2) the Large Dataset of Object Scans (LDOS) [6] containing RGBD sequences of man-made objects; and (3) Pascal3D [47], a standard benchmark for pose estimation [42, 39].

For viewpoint estimation, Pascal3D already contains viewpoint annotations. For LDOS, experiments focus on the chair class. In order to generate ground truth pose annotations for evaluation, we manually aligned 3D reconstructions of 10 randomly-selected chair videos and used 50 randomly-selected frames for each video as a test set.

For depth estimation, we evaluate on LDOS as it provides high quality depth maps one can use as ground truth.

For point cloud reconstruction, we use FrC and LDOS. Ground truth point clouds for evaluation are obtained by merging the SFM or RGBD depth maps from all frames of a given test video sequence, sampling points and post-processing those using a 3D Laplacian filter. For FrC, five videos were randomly selected and removed from the train set, picking 60 random frames per video for evaluation. For LDOS the pose estimation test frames are used.

Learning details. VpDR-Net is trained with stochastic gradient descent with a momentum of 0.0005. The weights of the losses were empirically set to achieve convergence on the training set. Better convergence was observed by training VpDR-Net in two stages. First, the viewpoint and depth subnetworks were optimized jointly, lowering the learning rate tenfold when no further improvement in the training losses was observed. Then, $\Psi$ was optimized after initializing the bias of its last layer, which corresponds to an average point cloud of the object category, by randomly sampling points from the ground truth models.

### 5.1 Pose estimation

Pascal3D. First, we evaluate the VpDR-Net viewpoint predictor on the Pascal3D benchmark [47]. Unlike previous works [39, 42] that focus on estimating the object/camera viewpoint represented by a 3 DoF rotation matrix, we evaluate the full 6 DoF camera pose represented by the rotation matrix together with the translation vector .

In Pascal3D, the camera poses are expressed relative to the whole scene instead of the objects themselves, so we adjust the dataset annotations. We crop every object using bounding box annotations after reshaping the box to a fixed aspect ratio, and resize the crop to a fixed resolution. The camera pose is adjusted to the cropped object using the P3P algorithm, minimizing the reprojection error between the camera-projected vertices of the ground truth CAD model and the original projection after cropping and resizing.

Absolute pose evaluation. We first evaluate absolute camera pose estimation using two standard measures: the angular error $e_R$ between the ground truth camera rotation and the prediction [42, 39], as well as the camera-center distance $e_C$ between the predicted camera center and the ground truth one. Following common practice [42, 39], we report the median $e_R$ and $e_C$ over all pose predictions on each test set.

Note that, while object viewpoints in Pascal3D and our method are internally consistent for a whole category, they may still differ between them by an arbitrary global 3D similarity transformation. Thus, as detailed in the supplementary material, the two sets of annotations are aligned by a single global similarity before assessment.

Relative pose evaluation. To assess methods with measures independent of the global alignment, we also evaluate: (1) the relative rotation error $e_R^{rel}$ between pairs of ground truth relative camera motions and the corresponding predicted relative motions, and (2) the normalized relative translation error $e_T^{rel}$, where both the ground truth and predicted relative translations are $\ell^2$-normalized so the measure is invariant to the scaling component of the alignment. We report the median errors over all possible image pairs in each test set.
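The two relative measures can be computed as follows (a sketch; the angular error uses the standard trace formula for the angle of the residual rotation):

```python
import numpy as np

def rel_rotation_error_deg(R_gt, R_pred):
    """Angle (degrees) of the rotation taking the predicted relative rotation to the GT one."""
    cos = (np.trace(R_gt @ R_pred.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rel_translation_error(t_gt, t_pred):
    """Distance between unit-normalized relative translations (scale-invariant)."""
    return np.linalg.norm(t_gt / np.linalg.norm(t_gt) - t_pred / np.linalg.norm(t_pred))

# Identical rotations score 0; a 90-degree residual about z scores 90.
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
assert np.isclose(rel_rotation_error_deg(np.eye(3), Rz90), 90.0)
```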

Pose prediction confidence evaluation. A feature of our model is to produce confidence scores with its viewpoint estimates. We evaluate the reliability of these scores by correlating them with viewpoint prediction accuracy. In order to do so, predictions are divided into “accurate” and “inaccurate” by comparing their errors $e_R$ and $e_C$ to thresholds (set to $\pi/6$ for $e_R$ following [39, 42], and to dataset-specific values for $e_C$ on Pascal3D and LDOS). Predictions are then ranked by decreasing confidence scores, and the average precisions $AP_R$ and $AP_C$ of the two ranked lists are computed.

Baselines. We compare our viewpoint predictor to a strong baseline, called VPNet, trained using absolute viewpoint labels. VPNet is a ResNet-50 architecture [15] with the final softmax classifier replaced by a viewpoint estimation layer that predicts the 6 DoF pose $(R, T)$. Following [42], rotation matrices are decomposed into Euler angles, each discretized into 24 equal bins. This network is trained to predict a softmax distribution over the angular bins and to regress a 3D vector corresponding to the camera translation $T$. The average softmax value across the three max-scoring Euler angles is used as a prediction confidence score.

We test both an unsupervised and a fully-supervised variant of VPNet. VPNet-unsupervised is comparable to our setting and is trained on the global camera poses estimated from the videos by the state-of-the-art sequence-alignment method of [38]. In the fully-supervised setting, VPNet is trained instead using the ground-truth global camera poses provided by the Pascal3D training set.

Results. Table 1 compares VpDR-Net to the VPNet baselines. First, we observe that our baseline VPNet-unsupervised is very strong, as we report the error for the full rotation matrix, while the original method of [38] reports an error of 61.5 just for the azimuth component. Nevertheless, VpDR-Net outperforms VPNet in all performance metrics except for a single case ($e_R$ for LDOS chairs). Furthermore, the advantage is generally substantial, and the unsupervised VpDR-Net reduces the gap with the fully-supervised VPNet by 20% or better in the vast majority of the cases. This shows the advantage of the proposed viewpoint factorization method compared to aligning 3D shapes as in [38]. Second, we observe that the confidence scores estimated by VpDR-Net are significantly more correlated with the accuracy of the predictions than the softmax scores of VPNet, providing a reliable self-assessment mechanism. The most confident viewpoint predictions of VpDR-Net are shown in fig. 4.

Ablation study. We evaluate the importance of the different components of VpDR-Net by turning them off and measuring performance on the chair class. In table 2, VpDR-Net-NoProb replaces the robust probabilistic losses $\ell^+_{rot}$, $\ell^+_{tr}$ and $\ell_{depth}$ with their non-probabilistic counterparts, and confidence predictions are replaced with random scores for AP evaluation. VpDR-Net-NoDepth removes the depth prediction and point cloud prediction branches during training, retaining only the viewpoint subnetwork $\Phi_{vp}$. VpDR-Net-NoAug does not use the data augmentation mechanism of sec. 4.

We observe a significant performance drop when each of the components is removed, confirming the importance of all contributions in the network design. Interestingly, we observe that the depth prediction branch is crucial for pose estimation ($e_R$ degrades by 34.27 on LDOS).

### 5.2 Depth prediction

The monocular depth prediction module of VpDR-Net is compared against three baselines: VpDR-Net-Rand uses VpDR-Net to estimate depth but predicts random confidence scores. BerHu-Net is a variant of the state-of-the-art depth prediction network from [19] based on the same subnetwork as VpDR-Net (but dropping the viewpoint and point-cloud branches). Following [19], for training it uses the BerHu depth loss and a dropout layer, which allows it to produce a confidence score for the depth measurements at test time using the sampling technique of [18, 9]. Finally, BerHu-Net-Rand is the same network, but predicting random confidence scores.

Results. Fig. 5 (right) shows the cumulative root-mean-squared (RMS) depth reconstruction error for LDOS after sorting pixels by their confidence as estimated by the network. By fitting better to inlier pixels and giving up on outliers, VpDR-Net produces a much better estimate than alternatives for the vast majority of pixels. Furthermore, accuracy is well predicted by the confidence scores. Fig. 5 (left) shows the cumulative RMS by depth, demonstrating that accuracy is better for pixels closer to the camera, which are more likely to be labeled with correct depth. Qualitative results are shown in fig. 6.

### 5.3 Point cloud prediction

We evaluate the point cloud completion module of VpDR-Net by comparing ground truth point clouds to the predicted point clouds using: (1) the voxel intersection-over-union (VIoU) measure, which computes the Jaccard similarity between the volumetric representations of the two clouds, and (2) the normalized point cloud distance of [33]. We average these measures over the test set, leading to mVIoU and the mean normalized point cloud distance (see supp. material for details).
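A minimal voxel-IoU sketch, assuming an arbitrary voxel size (the paper's voxelization parameters are not reproduced here):

```python
import numpy as np

def voxel_iou(cloud_a, cloud_b, voxel=0.1):
    """Jaccard similarity between the occupied-voxel sets of two point clouds."""
    va = {tuple(v) for v in np.floor(cloud_a / voxel).astype(int)}
    vb = {tuple(v) for v in np.floor(cloud_b / voxel).astype(int)}
    return len(va & vb) / len(va | vb)

rng = np.random.default_rng(0)
cloud = rng.random((100, 3))
assert voxel_iou(cloud, cloud) == 1.0          # identical clouds
assert voxel_iou(cloud, cloud + 100.0) == 0.0  # completely disjoint clouds
```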

VpDR-Net is compared against the approach of Aubry et al. [1] using their code. [1] is a 3D CAD model retrieval method which first trains a large number of exemplar models; in our case, these are represented by individual video frames with their corresponding ground truth 3D point clouds. Then, given a test image, [1] detects the object instance and retrieves the best matching model from the database. We align the retrieved point cloud to the object location in the test image using the P3P algorithm. For VpDR-Net, we evaluate two variants: the original VpDR-Net, which predicts the completed point cloud, and VpDR-Net-Fuse, which further merges it with the partial point cloud predicted from the depth map.

## 6 Conclusion

We have demonstrated the power of motion cues in replacing manual annotations and synthetic data in learning 3D object categories. We have done so by proposing a single neural network that simultaneously performs monocular viewpoint estimation, depth estimation, and shape reconstruction. This network is based on two innovations, a new image-based viewpoint factorization method and a new probabilistic shape representation. The contribution of each component was assessed against suitable baselines.

## Appendix A Method: additional details

### A.1 Scale ambiguity in SFM

In sec. 3.2 of the paper, we explain that the scale ambiguity of structure from motion (SFM) causes each reconstruction of a sequence $s$ to be known only up to a global sequence-specific scaling factor $\lambda_s$. Since $\lambda_s$ is not required to learn the viewpoint predictor, but is important for depth prediction (as discussed in sec. 3.3 of the paper), we estimate it as well.

To do so, we note that, given a pair of frames $(i, j)$ from sequence $s$, one can estimate the sequence scale as

$$\hat\lambda_s = \frac{\|\hat T_j - \hat R_j \hat R_i^\top \hat T_i\|}{\|T_j - R_j R_i^\top T_i\|}.$$

This expression allows us to conveniently estimate $\lambda_s$ on the fly as a moving average during the SGD iterations used to learn the viewpoint predictor, as samples can be computed essentially for free during this process.
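This estimator and its moving average can be sketched as follows (the momentum value is an illustrative assumption):

```python
import numpy as np

def scale_sample(T_i, T_j, R_i, R_j, That_i, That_j, Rhat_i, Rhat_j):
    """One sample of lambda_s: ratio of predicted to SFM relative translation norms."""
    t_sfm = T_j - R_j @ R_i.T @ T_i        # relative translation from SFM poses
    t_net = That_j - Rhat_j @ Rhat_i.T @ That_i  # same quantity from network predictions
    return np.linalg.norm(t_net) / np.linalg.norm(t_sfm)

class MovingAverage:
    """Exponential moving average used to track lambda_s across SGD iterations."""
    def __init__(self, momentum=0.99):
        self.momentum, self.value = momentum, None
    def update(self, x):
        self.value = x if self.value is None else \
            self.momentum * self.value + (1.0 - self.momentum) * x
        return self.value

# If the network's frame is twice the SFM scale, the sample recovers lambda_s = 2.
I3, z = np.eye(3), np.zeros(3)
lam = scale_sample(z, np.array([1.0, 0, 0]), I3, I3,
                   z, np.array([2.0, 0, 0]), I3, I3)
assert np.isclose(lam, 2.0)
```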

### a.2 The VpDR-Net architecture: further details

This section contains additional details about the layers that compose the VpDR-Net architecture.

The core architecture. The architecture of VpDR-Net (introduced in Sec. 3.2 of the paper) is a variant of the ResNet-50 architecture [15], with some modifications, detailed below, that improve its performance as a viewpoint and depth predictor.

In order to decrease the degree of geometric invariance of the network, we first replace all downsampling filters with full convolutions. We then attach bilinear upsampling layers that resize features from 3 different layers of the architecture (res2d, res3d, res4d) into fixed-size tensors and sum them, creating a multi-scale intermediate image representation that resembles hypercolumns (HC) [14]. An extension of Fig. 2 from the paper containing a diagram of this HC module can be found in Figure H.

Architecture of the viewpoint factorization network. HC is followed by 3 modified downsampling residual layers that produce the final viewpoint prediction. While standard downsampling residual layers omit the residual skip connection, because the input and output tensors have different sizes, here we retain the skip connection by average-pooling the input tensor and summing the result with the output of the second downsampling convolution branch. We further remove the ReLU after the final residual summation layer. Figure J contains an overview of the viewpoint estimation module together with a detailed illustration of the modified downsampling residual blocks.
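The HC construction, i.e. bilinearly resizing intermediate feature maps to a common resolution and summing them, can be sketched as follows. This is an illustrative NumPy version: the output resolution and channel counts are assumptions, and the projections needed to equalize channel counts across layers are omitted (all inputs are assumed to already share a channel dimension).

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resize a (C, H, W) feature map to (C, out_h, out_w)."""
    c, h, w = feat.shape
    # sample positions in the input grid (align_corners-style convention)
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = feat[:, y0][:, :, x0] * (1 - wx) + feat[:, y0][:, :, x1] * wx
    bot = feat[:, y1][:, :, x0] * (1 - wx) + feat[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def hypercolumn(feature_maps, out_hw=(56, 56)):
    """Sum bilinearly-resized feature maps into one multi-scale tensor,
    in the spirit of the HC module (resolution is an assumption)."""
    return sum(bilinear_resize(f, *out_hw) for f in feature_maps)

# three mock intermediate tensors with matching channel count
feats = [np.random.rand(8, s, s) for s in (28, 14, 7)]
hc = hypercolumn(feats)
print(hc.shape)  # (8, 56, 56)
```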

Architecture of the depth prediction network. The depth prediction network (introduced in Sec. 3.3 of the paper) shares the early HC layers with the viewpoint factorization network. The remainder of the pipeline is based on the state-of-the-art depth estimation method of [19]. More precisely, after attaching 2 standard residual blocks to the HC layers, the network contains two 2x2 up-projection layers from [19], leading to a 64-dimensional representation of the same size as the input image. This is followed by 1x1 convolutional filters that predict the depth and confidence maps, respectively. Figure I contains an illustration of the depth prediction network.

Architecture of the point cloud completion network. Differently from the two previous networks, the point cloud completion network (introduced in Sec. 3.4 of the paper) is not convolutional; it uses a residual multi-layer perceptron (MLP), i.e. a sequence of residual fully connected layers.

In more detail, the network starts by appending an appearance descriptor to each 3D point and processes this input with an MLP that contains an intermediate pooling operator.

The intermediate pooling operator, which is permutation invariant, removes the dependency on the number and order of the input points. In practice, the pooling operator applies both max and sum pooling and stacks the two results.
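The permutation-invariant pooling can be sketched as follows; the feature dimensions are illustrative assumptions.

```python
import numpy as np

def pool_points(point_feats):
    """Permutation-invariant pooling over an (N, D) set of per-point
    features: stack max- and sum-pooled statistics into a (2*D,) vector."""
    return np.concatenate([point_feats.max(axis=0), point_feats.sum(axis=0)])

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))     # mock per-point features
pooled = pool_points(feats)

# reordering the points leaves the pooled descriptor unchanged
perm = rng.permutation(100)
assert np.allclose(pooled, pool_points(feats[perm]))
print(pooled.shape)  # (32,)
```

The pooled vector can then be broadcast back and concatenated to each per-point feature before the remaining residual fully connected layers.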

For the appearance descriptors, recall that each point is the back-projection of a certain pixel in the input image. To obtain the appearance descriptor, we reuse the HC features from the core architecture and sample a column of feature channels at the corresponding pixel location using differentiable bilinear sampling. Note that, following [41], the fully connected residual blocks use leaky ReLUs with the leak factor set to 0.2. A diagram of the point cloud completion network can be found in Figure K.

## Appendix B Experimental evaluation

In this section we provide additional details about the learning procedures of the baseline networks and about the experimental evaluation.

### b.1 Learning details of BerHu-Net and VPNet

In this section we provide learning details for the BerHu-Net and VPNet baselines. The learning rates and batch sizes were in all cases adjusted empirically so that convergence was achieved on the respective training sets.

BerHu-Net is trained with stochastic gradient descent with a momentum of 0.0005 and a batch size of 16; the learning rate was lowered tenfold whenever no further improvement in the training losses was observed. The BerHu loss uses the adaptive adjustment of the loss cut-off threshold explained in [19]. For the 2x2 up-projection layers we used the implementation of [19]. For each test image, we repeat the depth map extraction 70 times with the dropout layers turned on and compute the variance of the predictions to obtain the per-pixel depth confidence values; we empirically verified that 70 repetitions suffice for the variance estimates to converge. The final feed-forward pass turns the dropout layers off and produces the actual depth predictions.
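The Monte-Carlo dropout procedure for obtaining per-pixel confidences can be sketched as follows. The toy predictor below stands in for the actual depth network; the dropout rate and map size are illustrative assumptions.

```python
import numpy as np

def mc_dropout_confidence(predict, x, n_samples=70, rng=None):
    """Mean and per-pixel variance over repeated stochastic forward passes.

    `predict(x, rng)` must apply dropout internally; the variance over the
    passes serves as a per-pixel uncertainty estimate.
    """
    samples = np.stack([predict(x, rng) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)

# toy stochastic "depth predictor": a fixed depth map plus dropout noise
rng = np.random.default_rng(0)
base_depth = np.full((4, 4), 2.0)

def toy_predict(x, _):
    mask = rng.random(x.shape) > 0.5   # Bernoulli dropout, p = 0.5
    return x * mask / 0.5              # inverted-dropout rescaling

mean_d, var_d = mc_dropout_confidence(toy_predict, base_depth)
```

High-variance pixels correspond to low-confidence depth estimates; a final pass with dropout disabled yields the actual prediction.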

VPNet is trained with stochastic gradient descent with a momentum of 0.0005 and a batch size of 128; the learning rate was lowered tenfold whenever no further improvement in the training losses was observed. For VPNet trained on aligned FrC, we adjusted the produced bounding box and viewpoint annotations in the same fashion as done for the Pascal3D annotations in Sec. 5.1 of the paper, ensuring that the aligned FrC dataset is as compatible as possible with the target Pascal3D dataset. For LDOS, the produced dataset was adjusted in the same way, except that we did not use the bounding boxes predicted by [38], because the input video frames already focus on full or truncated views of the object category.

### b.2 Additional results

In Sec. 5.1 of the paper we compared VpDR-Net to [38] on an adjusted version of the Pascal3D dataset. In this section, we additionally report the standard AVP measure [47] on the original Pascal3D dataset, in order to better compare with the fully supervised state of the art on this dataset. Because the AVP measure requires an object detector, we extract viewpoints from the same set of R-CNN detections as in [42]. Since the AVP measure, like most other measures from Sec. 5.1 of the paper, depends on a dataset-specific global alignment transformation, we estimate this transformation from the ground truth annotations of the training set of [47], using the same method as described in Sec. 5.1 of the paper.

Due to the additional measurement noise introduced by estimating this alignment, we report results only for the coarsest resolution of 4 azimuth bins. Our VpDR-Net obtains 33.4 and 14.7 AVP for the car and chair classes, vs. 29.4 and 14.3 AVP for [38] using the same detections from [42]. Our approach performs on par with some fully supervised approaches, such as 3D DPM [28], while trailing the fully supervised state of the art by the same margin as for the other metrics reported in Table 1 of the paper.

### b.3 Absolute pose evaluation protocol

As noted in the paper, the absolute pose error metrics can be computed only after aligning the implicit global coordinate frame of the benchmarked network with that of the ground truth annotations. This procedure is explained in detail below.

Given a set of ground truth camera poses $g_i = (R_i, T_i)$ and the corresponding predictions $\hat{g}_i = (\hat{R}_i, \hat{T}_i)$, we want to estimate a global similarity transform, parametrized by a scale $s$, translation $T$ and rotation $R$, such that the coordinate frames of the predictions and of the ground truth become aligned.

In more detail, the desired global similarity transform $(s, R, T)$ satisfies the following equation:

$$\hat{R}_i \, (R\,x + T) + s\,\hat{T}_i = R_i\,x + T_i , \qquad (9)$$

i.e. given an arbitrary world-coordinate point $x$, its projection into the coordinate frame of the ground truth camera $g_i$ (the right side of eq. 9) should be equal to the projection of $x$ into the coordinate frame of the predicted camera $\hat{g}_i$ after transforming $x$ with $(R, T)$ and scaling the corresponding camera translation vector with $s$ (the left side of eq. 9). Note that for LDOS data the transform corresponds to a rigid motion, i.e. $s = 1$. Given $(s, R, T)$, the adjusted camera matrices $\hat{g}_i' = (\hat{R}_i R,\ \hat{R}_i T + s\,\hat{T}_i)$, for which eq. 9 holds, are then computed.

In order to estimate $(s, R, T)$, $x$ in eq. 9 is substituted with the center $\bar{c}_i = -R_i^\top T_i$ of the ground truth camera $g_i$, which is a valid point of the world coordinate frame. Since $R_i \bar{c}_i + T_i = 0$, after performing some additional manipulations we end up with the following constraint:

$$R\,\bar{c}_i + T = s\,\hat{c}_i , \qquad (10)$$

where $\hat{c}_i = -\hat{R}_i^\top \hat{T}_i$ is the center of the predicted camera $\hat{g}_i$. Given the corresponding camera pairs $(g_i, \hat{g}_i)$, the constraint in eq. 10 is converted to a least squares minimization problem:

$$(s^*, R^*, T^*) = \underset{s, R, T}{\operatorname{argmin}} \; \sum_i \big\| R\,\bar{c}_i + T - s\,\hat{c}_i \big\|_2^2 , \qquad (11)$$

and solved using the Umeyama algorithm [43].
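The least-squares similarity estimation from corresponding camera centres, solved with the Umeyama algorithm [43], can be sketched as follows. This is a generic NumPy implementation of Umeyama's closed-form solution, not the paper's code; the synthetic camera centres are illustrative.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity (s, R, T) mapping (N, 3) points `src` onto
    `dst`, i.e. minimizing sum_i || dst_i - (s * R @ src_i + T) ||^2 [43]."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                       # keep R a proper rotation
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    T = mu_d - s * R @ mu_s
    return s, R, T

# recover a known similarity from synthetic camera centres
rng = np.random.default_rng(1)
centres = rng.normal(size=(20, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
s_true, T_true = 2.5, np.array([1.0, -2.0, 0.5])
mapped = s_true * centres @ R_true.T + T_true
s, R, T = umeyama(centres, mapped)
```

With noiseless correspondences the similarity is recovered exactly (up to floating-point error).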

For Pascal3D, we estimate the alignment transform on the held-out training set and later use it for evaluation on the test set. For LDOS, due to the absence of a held-out annotated training set, we estimate it on the test set.

### b.4 Point cloud prediction

The normalized point cloud distance is computed following [33]. For the VIoU measure, a voxel grid is set up around each ground truth point cloud by uniformly subdividing its bounding volume into voxels.

The point clouds are compared within the local coordinate frame of each frame's camera (whose focal length is assumed to be known). Furthermore, since the SFM reconstructions are known only up to a global scaling factor, we adjust each point cloud prediction from the FrC dataset by multiplying it with a scaling factor that aligns the centroid of the prediction with the centroid of the ground truth cloud; this factor can be computed analytically from the two centroids.
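As an illustration of this mean-alignment step, one plausible closed form is sketched below: the scale is chosen by least squares so that the scaled prediction's centroid best matches the ground-truth centroid. The exact expression is our assumption, not reproduced from the paper.

```python
import numpy as np

def align_scale(pred, gt):
    """Scale a predicted (N, 3) point cloud so its centroid best matches
    the ground-truth centroid (least-squares along the centroid direction).

    The closed form s = <mu_gt, mu_pred> / ||mu_pred||^2 is an assumption
    standing in for the paper's omitted formula.
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    s = float(mu_g @ mu_p) / float(mu_p @ mu_p)
    return s * pred

gt = np.random.rand(50, 3) + 1.0
pred = 0.5 * gt                      # same shape, wrong global scale
aligned = align_scale(pred, gt)
```

For a prediction that differs from the ground truth only by a global scale, this recovers the ground-truth cloud exactly.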

Ablative study. In Table 2 of the paper, we presented a comparison of VpDR-Net to the baseline approach from [1]. Here we provide an additional ablative study that evaluates the contribution of the individual components. More exactly, Table A extends Table 2 from the paper with the following flavours of VpDR-Net: (1) VpDR-Net-partial, which predicts only the partial point cloud obtained from the depth map; (2) VpDR-Net-Chamfer, which removes the density predictions and replaces the loss with a Chamfer distance loss; and (3) VpDR-Net-raw, which predicts the raw, unfiltered and untruncated point cloud.

The drops in performance when predicting only the raw or only the partial point cloud emphasize the importance of the point cloud completion and density prediction components, respectively. The Chamfer distance loss brings marginal improvements in the point cloud distance but a significant decrease in VIoU, due to the network's inability to represent and discard outliers.

Table A: Point cloud prediction on the LDOS and FrC test sets (mVIoU: higher is better; mDist: lower is better).

| Method | LDOS mVIoU | LDOS mDist | FrC mVIoU | FrC mDist |
|---|---|---|---|---|
| Aubry et al. [1] | 0.06 | 1.30 | 0.21 | 0.41 |
| VpDR-Net-partial | 0.10 | 0.37 | 0.11 | 0.56 |
| VpDR-Net-Chamfer | 0.09 | 0.18 | 0.20 | 0.24 |
| VpDR-Net-raw | 0.12 | 0.27 | 0.18 | 0.50 |
| VpDR-Net (ours) | 0.13 | 0.20 | 0.24 | 0.28 |
| VpDR-Net-Fuse (ours) | 0.13 | 0.19 | 0.26 | 0.26 |

## References

- [1] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic. Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models. In Proc. CVPR, 2014.
- [2] A. Bansal, B. Russell, and A. Gupta. Marr Revisited: 2D-3D model alignment via surface normal prediction. In Proc. CVPR, 2016.
- [3] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. PAMI, 25(9):1063–1074, 2003.
- [4] J. Carreira, S. Vicente, L. Agapito, and J. Batista. Lifting object detection datasets into 3d. PAMI, 38(7):1342–1355, 2016.
- [5] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
- [6] S. Choi, Q. Zhou, S. Miller, and V. Koltun. A large dataset of object scans. CoRR, abs/1602.02481, 2016.
- [7] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proc. ECCV, 2016.
- [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
- [9] Y. Gal and Z. Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. In Proc. ICLR, 2016.
- [10] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In Proc. ECCV, 2016.
- [11] D. Glasner, M. Galun, S. Alpert, R. Basri, and G. Shakhnarovich. Viewpoint-aware object detection and pose estimation. In Proc. ICCV, 2011.
- [12] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In Proc. CVPR, 2016.
- [13] S. Gupta, P. A. Arbeláez, R. B. Girshick, and J. Malik. Aligning 3D models to RGB-D images of cluttered scenes. In Proc. CVPR, 2015.
- [14] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proc. CVPR, 2015.
- [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
- [16] Q. Huang, H. Wang, and V. Koltun. Single-view reconstruction via joint analysis of image and shape collections. ACM Transactions on Graphics (TOG), 34(4):87, 2015.
- [17] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Proc. CVPR, 2015.
- [18] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. CoRR, abs/1511.02680, 2015.
- [19] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3DV, 2016.
- [20] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing ikea objects: Fine pose estimation. In Proc. ICCV, 2013.
- [21] F. Liu, D. Zeng, Q. Zhao, and X. Liu. Joint face alignment and 3d face reconstruction. In Proc. ECCV, 2016.
- [22] D. G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artif. Intell., 31(3):355–395, 1987.
- [23] F. Massa, B. C. Russell, and M. Aubry. Deep exemplar 2d-3d detection by adapting from real to rendered views. In Proc. CVPR, 2016.
- [24] Y. Y. Morvan. Acquisition, compression and rendering of depth and texture for multi-view video. PhD thesis, Technische Universiteit Eindhoven, 2009.
- [25] R. Mottaghi, Y. Xiang, and S. Savarese. A coarse-to-fine model for 3d pose estimation and sub-category recognition. In Proc. CVPR, 2015.
- [26] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 2011.
- [27] M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multiview object localization. In Proc. CVPR, 2009.
- [28] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Teaching 3d geometry to deformable part models. In Proc. CVPR, 2012.
- [29] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Multi-view priors for learning detectors from sparse viewpoint data. In Proc. ICLR, 2014.
- [30] M. Prasad, A. Fitzgibbon, A. Zisserman, and L. V. Gool. Finding nemo: Deformable object class modelling using curve matching. In Proc. CVPR, 2010.
- [31] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. CoRR, abs/1612.00593, 2016.
- [32] L. G. Roberts. Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute of Technology. Dept. of Electrical Engineering, 1963.
- [33] J. Rock, T. Gupta, J. Thorsen, J. Gwak, D. Shin, and D. Hoiem. Completing 3d object shape from one depth image. In Proc. CVPR, 2015.
- [34] R. B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In Proc. ICRA, 2011.
- [35] S. Savarese and L. Fei-Fei. 3d generic object categorization, localization and pose estimation. In Proc. ICCV, 2007.
- [36] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In Proc. CVPR, 2016.
- [37] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. In Proc. ECCV, 2016.
- [38] N. Sedaghat and T. Brox. Unsupervised generation of a viewpoint annotated car dataset from videos. In Proc. ICCV, 2015.
- [39] H. Su, C. R. Qi, Y. Li, and L. J. Guibas. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In Proc. ICCV, 2015.
- [40] M. Sun, H. Su, S. Savarese, and L. Fei-Fei. A multi-view probabilistic model for 3d object classes. In Proc. CVPR, 2009.
- [41] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In Proc. ECCV, 2016.
- [42] S. Tulsiani and J. Malik. Viewpoints and keypoints. In Proc. CVPR, 2015.
- [43] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. PAMI, 13(4):376–380, 1991.
- [44] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. CoRR, abs/1612.02401, 2016.
- [45] J. Wu, C. Zhang, T. Xue, W. T. Freeman, and J. B. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Proc. NIPS, 2016.
- [46] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In Proc. ECCV, 2016.
- [47] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In WACV, 2014.
- [48] S. Zhu, L. Zhang, and B. M. Smith. Model evolution: An incremental approach to non-rigid structure from motion. In Proc. CVPR, 2010.
- [49] Z. Zia, M. Stark, B. Schiele, and K. Schindler. Detailed 3d representations for object recognition and modeling. PAMI, 35(11):2608–2623, 2013.
