RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

08/11/2022
by   Jason Y. Zhang, et al.
Carnegie Mellon University

We describe a data-driven method for inferring the camera viewpoints given multiple images of an arbitrary object. This task is a core component of classic geometric pipelines such as SfM and SLAM, and also serves as a vital pre-processing requirement for contemporary neural approaches (e.g. NeRF) to object reconstruction and view synthesis. In contrast to existing correspondence-driven methods that do not perform well given sparse views, we propose a top-down prediction based approach for estimating camera viewpoints. Our key technical insight is the use of an energy-based formulation for representing distributions over relative camera rotations, thus allowing us to explicitly represent multiple camera modes arising from object symmetries or views. Leveraging these relative predictions, we jointly estimate a consistent set of camera rotations from multiple images. We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images on both seen and unseen categories. Further, our probabilistic approach significantly outperforms directly regressing relative poses, suggesting that modeling multimodality is important for coherent joint reconstruction. We demonstrate that our system can be a stepping stone toward in-the-wild reconstruction from multi-view datasets. The project page with code and videos can be found at https://jasonyzhang.com/relpose.


1 Introduction

Recovering 3D from 2D images of an object has been a central task in vision across decades. Given multiple views, structure-from-motion (SfM) based methods can infer a 3D representation of the underlying instance while also associating each image with a camera viewpoint. However, these correspondence-driven methods cannot robustly handle sparsely sampled images that minimally overlap, and typically require many (roughly 20 or more) images for 360-degree 3D inference. Unfortunately, this requirement of densely sampled views can be prohibitive—online marketplaces often have only a few images per instance, and a user casually reconstructing a novel object would also find capturing such views tedious. Although the recently emerging neural 3D reconstruction techniques also typically leverage similarly dense views, some works have shown promise that a far smaller number of images can suffice for high-quality 3D reconstruction. These successes have, however, still relied on precisely [60] or approximately [27, 82, 20] known camera viewpoints for inference. To apply these methods at scale, we must therefore answer a fundamental question—given sparsely sampled images of a generic object, how can we obtain the associated camera viewpoints?

Existing methods do not provide a conclusive answer to this question. On the one hand, bottom-up correspondence-based techniques are not robustly applicable for sparse-view inference. On the other, recent neural multi-view methods can refine approximately known camera poses but provide no mechanism to obtain these to begin with. In this work, our goal is to fill this void and develop a method that, given a small number of unposed images of a generic object, can associate them with (approximate) camera viewpoints. Towards this goal, we focus on inferring the camera rotation matrices corresponding to each input image and propose a top-down approach to predict these. However, we note that the ‘absolute’ rotation is not well-defined given an image of a generic object—it assumes a ‘canonical’ pose which is not always known a priori (e.g. what is an identity rotation for a pen? or a plant?). In contrast, the relative rotation between two views is well-defined even if a canonical pose for the instance is not. Thus, instead of adopting the common paradigm of single-image pose prediction, we learn to estimate the relative pose given a pair of input images. We propose a system that leverages such pairwise predictions to then infer a consistent set of global rotations given multiple images of a generic object.

A key technical question that we consider is the formulation of such pairwise pose estimation. Given two informative views of a rotationally asymmetric object, a regression-based approach may be able to accurately predict their relative transformation. The general case, however, can be more challenging—given two views of a cup but with the handle visible in only one, the relative pose is ambiguous given just the two images. To capture this uncertainty, we formulate an energy-based relative pose prediction network that, given two images and a candidate relative rotation, outputs an energy corresponding to the (unnormalized) log-probability of that hypothesis. This probabilistic estimation of relative pose not only makes learning more stable but, more importantly, provides a mechanism to estimate a joint distribution over viewpoints given multiple images. We show that optimizing rotations to improve this joint likelihood yields coherent poses across multiple images and leads to significant improvements over naive approaches that do not consider the joint likelihood.

We train our system using instances from over 40 commonplace object categories, and find that not only can it infer accurate (relative) poses for novel instances of these classes, it even generalizes to instances from unseen categories. Our approach can thus be viewed as a stepping stone toward sparse-view 3D reconstruction of generic objects; just as classical techniques provide precise camera poses that (neural) multi-view reconstruction methods can leverage, our work provides a similar, albeit coarser, output that can be used to initialize inference in current (and future) sparse-view reconstruction methods. While our system only outputs camera rotations, we note that a reasonable corresponding translation can be easily initialized assuming object-facing viewpoints, and we show that this suffices in practice for bootstrapping sparse-view reconstruction.

Figure 2: Overview. From a set of images, we aim to recover the corresponding camera poses (rotations). To do this, we train a pairwise pose predictor that takes in two images and a candidate relative rotation and predicts an energy. By repeatedly querying this network, we recover a probability distribution over relative rotations conditioned on the two images (see Sec. 3.1). We use these pairwise distributions to induce a joint likelihood over the camera transformations across multiple images, and iteratively improve an initial estimate by maximizing this likelihood (see Sec. 3.2).

2 Related Work

Structure-from-Motion (SfM). At a high level, structure-from-motion aims to recover 3D geometry and camera parameters from image sets. This is done classically by computing local image features [21, 30, 2, 66], finding matches across images [31], and then estimating and verifying epipolar geometry using bundle adjustment [67]. Later works have scaled up the SfM pipeline using sequential algorithms, demonstrating results on hundreds or even thousands of images [58, 18, 55, 54, 52].

The advent of deep learning has augmented various stages of the classical SfM pipeline. Better feature descriptors [14, 57, 72, 79, 15, 46, 49] and improved feature matching [53, 9, 29, 68, 16] have significantly outperformed their hand-crafted counterparts. BA-Net [63] and DeepSFM [75] have even replaced the bundle-adjustment process by optimizing over a cost volume. Most recently, Pixel-Perfect SfM [28] uses a featuremetric error to post-process camera poses and achieve sub-pixel accuracy.

While these methods can achieve excellent localization, all of these approaches are bottom-up, beginning with local features that are matched across images. Matching features requires sufficient overlap between images, which may not be available given wide-baseline views. While our work also aims to localize camera poses given image sets, our approach fundamentally differs because it is top-down and does not rely on low-level correspondences.

Simultaneous Localization and Mapping (SLAM). Related is the task of Monocular SLAM, which aims to localize and map the surroundings from a video stream. Indirect SLAM methods, similar to SfM, match local features across different images to localize the camera [51, 5, 38, 37]. Direct SLAM methods, on the other hand, define a geometric objective function to directly optimize over a photometric error [87, 56, 11, 17].

There have also been various attempts to introduce deep learning into SLAM pipelines. As with SfM, learned feature descriptors and matching have helped improve accuracy on SLAM subproblems and increased robustness. End-to-end deep SLAM methods [84, 40, 73, 74] have improved the robustness of SLAM compared to classical methods, but have generally not closed the gap on performance. One notable exception is the recent DROID-SLAM [64], which combines the robustness of learning-based SLAM with the accuracy of classical SLAM.

These approaches all assume sequential streams and generally rely on matching or otherwise incorporating temporal locality between neighboring frames. We do not make any assumptions about the order of the image inputs nor the amount of overlap between nearby frames.

Single-view Pose Prediction. The task of predicting a (6-DoF) pose from a single image has a long and storied history, the surface of which can barely be scratched in this section. Unlike relative pose between multiple images, the (absolute) pose given a single image is only well-defined if there exists a canonical coordinate system. Most single-view pose prediction approaches therefore deal with a fixed set of categories, each of which has a canonical coordinate system defined a priori [77, 65, 43, 7, 71, 23, 4, 59, 42, 39, 24, 26]. Other methods that are category-agnostic take in a 3D mesh or point cloud as input, which provides a local coordinate system [76, 78, 81, 44].

Perhaps most relevant to us are approaches that not only predict pose but also model the inherent uncertainty in the pose prediction [3, 25, 39, 45, 10, 61, 47, 19, 12, 13, 36, 33]. Like our approach, VpDR-Net [41] uses relative poses as supervision but still predicts absolute pose (with a unimodal Gaussian uncertainty model). Implicit-PDF [39] is the most similar approach to ours and served as an inspiration. Like our approach, Implicit-PDF uses a neural network to implicitly represent probability using an energy-based formulation, which elegantly handles symmetries and multimodal distributions. Unlike our approach, Implicit-PDF (and all other single-view pose prediction methods) predicts absolute pose, which does not exist in general for generic or novel categories. Instead, we model probability distributions over relative pose given pairs of images.

Learning-based Relative Pose Prediction. When considering generic scenes, prior works have investigated the task of relative pose prediction given two images. However, these supervised [69] or self-supervised [85, 80, 32, 70] methods typically consider the prediction of motion between consecutive frames and are not easily adapted to wide-baseline prediction. While some approaches have investigated wide-baseline prediction [34, 1], regression-based inference does not effectively capture uncertainty, unlike our energy-based model. Perhaps most similar to ours is DirectionNet [8], which also predicts a camera distribution for wide-baseline views. While DirectionNet only uses the expected value of the distribution and thus ignores symmetry, we take advantage of multimodal distributions to improve our joint pose estimation.

3 Method

Figure 3: Predicted conditional distribution of image pairs from unseen categories. Here, we visualize the predicted conditional distribution for pairs of images. Inspired by [39], we visualize the rotation distribution (Alg. 1) by plotting yaw as latitude, pitch as longitude, and roll as the color. The size of each circle is proportional to the probability of that rotation. We omit rotations with negligible probability. The center of the open circle represents the ground truth. We can see that the network predicts 4 modes for the couch images, corresponding roughly to 90-degree increments, with the greatest probability assigned to the correct 90-degree rotation. The relative pose of the hot dog is unambiguous and thus has only one mode. While the relative pose for the frisbee has close to no pitch or yaw, the roll remains ambiguous, hence the variety in colors. See the supplement for a visualization of how to interpret the relative rotations.

Given a set of images {I_i} depicting a generic object in the wild, we aim to recover a set of rotation matrices {R_i} such that rotation matrix R_i corresponds to the viewpoint of the camera used to capture image I_i. Note that while we do not model translation, it can be easily initialized using object-facing viewpoints for 3D object reconstruction [27, 82] or a pose graph for SLAM [6]. We are primarily interested in settings with only sparse views and wide baselines. While bottom-up correspondence-based techniques can reliably recover camera pose given dense views, they do not adapt well to sparse views with minimal overlap. We instead propose a prediction-based top-down approach that can learn and exploit global structure directly.
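As a concrete illustration of the object-facing translation initialization mentioned above, here is a minimal NumPy sketch. The world-to-camera convention, the object being placed at the world origin, and the fixed camera distance are all our assumptions, not details specified by the paper.

```python
import numpy as np

def object_facing_translation(R: np.ndarray, distance: float = 2.0):
    """Translation for a camera that looks at the object center from a fixed distance.

    Assumes the world-to-camera convention x_cam = R @ x_world + t, with the object
    placed at the world origin. The origin then maps to (0, 0, distance) for every
    view, so t = (0, 0, distance) regardless of R; the camera center in world
    coordinates is -R.T @ t, i.e. all cameras lie on a sphere around the object.
    """
    t = np.array([0.0, 0.0, distance])
    camera_center = -R.T @ t  # useful for visualizing the recovered camera layout
    return t, camera_center
```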

The basic building block of our prediction system (visualized in Fig. 3) is a pairwise pose predictor that infers relative camera orientations given pairs of images. However, symmetries in objects and possibly uninformative viewpoints make this an inherently uncertain prediction task. To allow capturing this uncertainty, we propose an energy-based approach that models the multi-modal distribution over relative poses given two images.

Given the predicted distributions over pairwise relative rotations, we show that these can be leveraged to induce a joint distribution over the rotations. Starting with a greedy initialization, we present a coordinate-ascent approach that jointly reasons over and improves the set of inferred rotations. We describe our approach for modeling probability distributions over relative poses between two images in Sec. 3.1, and build on this in Sec. 3.2 to recover a joint set of poses across multiple images. Finally, we discuss implementation details in Sec. 3.3.

3.1 Estimating Pair-wise Relative Rotations

procedure PairwiseDistribution(I_1, I_2)
      queries ← SampleRotationsUnif(N = 50,000)
      energies ← f(I_1, I_2, queries)
      probs ← SoftMax(energies)
      return queries, probs
end procedure
Algorithm 1: Pseudo-code for recovering a pairwise distribution. We describe how to recover the distribution over the relative pose given two images.

Given a pair of images depicting an arbitrary object, we aim to predict a distribution over the relative rotation corresponding to the camera transformation between the two views. As there may be ambiguities when inferring the relative pose given two images, we introduce a formulation that can model uncertainty.

Energy-based Formulation. We wish to model the conditional distribution over a relative rotation matrix R given input images I_1 and I_2: p(R | I_1, I_2). Inspired by recent work on implicitly representing the distribution over rotations using a neural network [39], we propose an energy-based relative pose estimator. More specifically, we train a network f(I_1, I_2, R) that learns to predict the energy, or the unnormalized joint log-probability, such that p(I_1, I_2, R) = exp(f(I_1, I_2, R)) / c, where c is the constant of integration. From the product rule, we can recover the conditional probability as a function of f:

p(R | I_1, I_2) = exp(f(I_1, I_2, R)) / ∫_{SO(3)} exp(f(I_1, I_2, R')) dR'    (1)

We marginalize over sampled rotations to avoid having to compute c explicitly (see Alg. 1), but note that the number of sampled rotations should be large for the approximation to be accurate. It is therefore important to use a lightweight network, since it is queried once per sampled rotation in the denominator.
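To make the sampling-based approximation concrete, below is a minimal PyTorch sketch mirroring Alg. 1. It assumes an energy network f with signature f(I_1, I_2, rotations) -> energies; the batching convention and the quaternion-based uniform sampler are our assumptions, not the paper's released code.

```python
import torch

def random_rotations(n: int) -> torch.Tensor:
    """Sample n rotation matrices uniformly from SO(3) via random unit quaternions."""
    q = torch.randn(n, 4)
    q = q / q.norm(dim=1, keepdim=True)
    w, x, y, z = q.unbind(dim=1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(n, 3, 3)

def pairwise_distribution(f, img1, img2, num_queries=50_000):
    """Approximate p(R | I1, I2) by scoring sampled rotations and normalizing (cf. Alg. 1)."""
    queries = random_rotations(num_queries)     # candidate relative rotations, [M, 3, 3]
    energies = f(img1, img2, queries)           # assumed to return [M] unnormalized log-probs
    probs = torch.softmax(energies, dim=0)      # softmax over samples approximates the integral in (1)
    return queries, probs
```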

Training. We train our network by maximizing the log-likelihood of the conditional distribution, or equivalently minimizing the negative log-likelihood:

L = −log p(R_2 R_1^⊤ | I_1, I_2)    (2)

where R_1 and R_2 are the ground-truth poses of I_1 and I_2 respectively. Note that while the ‘absolute’ poses are in an arbitrary coordinate system (depending on e.g. SLAM system outputs), the relative pose R_2 R_1^⊤ between two views is agnostic to this incidental canonical frame. Following (1), we sample multiple candidate rotation matrices to compute the conditional probability.
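A hedged sketch of this training objective, reusing random_rotations from the sketch above. Approximating the normalizer of (1) with random samples plus the ground-truth rotation is our simplification; the paper instead uses an equivolumetric rotation grid (see Sec. 3.3).

```python
import torch
import torch.nn.functional as F

def relpose_nll_loss(f, img1, img2, R1_gt, R2_gt, num_samples=4096):
    """Negative log-likelihood of the ground-truth relative rotation, cf. (2).

    The relative ground truth is R2 @ R1.T under a world-to-camera convention.
    Reuses random_rotations() from the earlier sketch to approximate the normalizer.
    """
    R_rel_gt = R2_gt @ R1_gt.T
    candidates = torch.cat([R_rel_gt[None], random_rotations(num_samples)], dim=0)
    energies = f(img1, img2, candidates)            # [num_samples + 1]
    log_probs = F.log_softmax(energies, dim=0)      # approximate log p(R | I1, I2)
    return -log_probs[0]                            # index 0 is the ground-truth rotation
```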

Inference. Recovering the optimal transformation from the pose of I_1 to that of I_2 amounts to optimizing over the space of rotations:

R*_{1→2} = argmax_{R ∈ SO(3)} f(I_1, I_2, R)    (3)

In practice, the loss landscape of f is often non-smooth, so we find sampling and scoring rotations based on f to be more effective than gradient ascent.

We can also compute the conditional distribution of the relative rotation from I_1 to I_2 by sampling rotations over SO(3). The probability associated with each sampled rotation can be computed using a softmax function, as described in Alg. 1 and derived in (1). Inspired by [39], we visualize the distribution of rotations by projecting the rotation matrices onto a 2-sphere using pitch and yaw and coloring each rotation based on roll. See Fig. 3 and the supplement for sample results.
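For reference, a rough matplotlib sketch of this visualization. The Euler-angle convention (ZYX) and the marker scaling are our choices and may differ in detail from the figures in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_rotation_distribution(rotations, probs, prob_threshold=1e-4):
    """Scatter sampled rotations by (yaw, pitch), colored by roll, sized by probability.

    rotations: [N, 3, 3] numpy array of sampled rotation matrices; probs: [N] numpy array.
    """
    # ZYX Euler decomposition (a convention choice): R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    yaw = np.arctan2(rotations[:, 1, 0], rotations[:, 0, 0])
    pitch = -np.arcsin(np.clip(rotations[:, 2, 0], -1.0, 1.0))
    roll = np.arctan2(rotations[:, 2, 1], rotations[:, 2, 2])

    keep = probs > prob_threshold  # omit rotations with negligible probability
    plt.scatter(yaw[keep], pitch[keep], s=2000 * probs[keep],
                c=roll[keep], cmap="hsv", vmin=-np.pi, vmax=np.pi)
    plt.xlabel("yaw")
    plt.ylabel("pitch")
    plt.colorbar(label="roll")
    plt.show()
```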

3.2 Recovering Joint Poses

procedure CoordAsc(Images I_1, …, I_N)
      R_1, …, R_N ← InitializeRotations(I_1, …, I_N)        ▷ e.g. the greedy MST solution
      for iteration = 1, …, NumIterations do
            k ← RandomInteger(N)                            ▷ index of the rotation to update
            C ← SampleRotationsUnif(M = 250,000)            ▷ (M × 3 × 3) candidate replacements for R_k
            energies ← Zeros(M)
            for all i ≠ k do
                  R_rep ← Repeat(R_i, M)                    ▷ (M × 3 × 3)
                  energies ← energies + f(I_i, I_k, C R_rep^⊤)
                  energies ← energies + f(I_k, I_i, R_rep C^⊤)
            end for
            R_k ← C[ArgMax(energies)]
      end for
      return R_1, …, R_N
end procedure
Algorithm 2: Pseudo-code for joint inference using the relative pose predictor. We describe how to recover a joint set of poses over N images via coordinate ascent.

In the previous section, we described an energy-based relative pose predictor conditioned on pairs of images. Using this network, we now recover a coherent set of rotations given a set of images.

Greedy Initialization. Given predictions for relative rotations between every pair of images, we aim to associate each image with an absolute rotation. However, as the relative poses are invariant up to a global rotation, we can treat the pose of the first image as the identity matrix: R_1 = I. We note that the rotations for the other images can then be uniquely induced given any set of relative rotations that spans a tree over the images.

Sequential Chain. Perhaps the simplest way to construct such a tree is to treat the images as part of an ordered sequence. Given R_1 = I, all subsequent poses can be computed using the best-scoring relative pose from the previous image: R_{i+1} = R_{i→i+1} R_i, denoting R_{i→j} as the relative rotation matrix from camera i to camera j. However, this assumes that the images are captured sequentially (e.g. in a video) and may not be applicable in settings such as online marketplaces.

Maximum Spanning Tree. We improve over the naive linear chain by recognizing that some pairs of images may produce more confident predictions. Given N images, we construct a directed graph with N(N−1) edges, where the weight of edge (i, j) is the energy of the most confident relative rotation predicted between I_i and I_j. We then construct a Maximum Spanning Tree (MST) that covers all images with the most confident set of relative rotations.
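A sketch of this greedy initialization under stated assumptions: edge confidence is taken as the best pairwise energy, the directed graph is symmetrized so that networkx's undirected maximum spanning tree can stand in for the directed formulation in the text, and rotations follow the world-to-camera convention R_j = R_{i→j} R_i. The sequential-chain variant corresponds to simply composing best_rel_rot[i, i+1] in frame order.

```python
import networkx as nx
import numpy as np

def init_rotations_mst(best_energy, best_rel_rot):
    """Greedy initialization from pairwise predictions via a maximum spanning tree.

    best_energy:  [n, n] array, score of the best relative rotation for each (I_i, I_j).
    best_rel_rot: [n, n, 3, 3] array, that best rotation, mapping camera i to camera j.
    """
    n = best_energy.shape[0]
    graph = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            # Symmetrize the two directed scores (a simplification of the directed graph).
            graph.add_edge(i, j, weight=max(best_energy[i, j], best_energy[j, i]))
    tree = nx.maximum_spanning_tree(graph)

    # Fix R_0 = I and propagate relative rotations along tree edges (BFS from node 0).
    rotations = {0: np.eye(3)}
    for parent, child in nx.bfs_edges(tree, source=0):
        rotations[child] = best_rel_rot[parent, child] @ rotations[parent]
    return [rotations[i] for i in range(n)]
```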

Reasoning over all images jointly. Both of the previous methods, which select a subset of edges, do not perform any joint reasoning and discard all but the highest scoring mode for each pair of images. Instead, we can take advantage of our energy-based formulation to enforce global consistency.

Given our pairwise conditional probabilities, we can define a joint distribution over the set of rotations:

p(R_1, …, R_N | I_1, …, I_N) = (1/Z) ∏_{(i,j) ∈ P} exp(f(I_i, I_j, R_j R_i^⊤))    (4)

where P is the set of pairwise permutations (ordered pairs of image indices) and Z is the normalizing constant. Intuitively, this corresponds to the distribution modeled by a factor graph with a potential function corresponding to each pairwise edge.

We then aim to find the most likely set of rotations under this conditional joint distribution (assuming R_1 = I). While it is not feasible to analytically obtain the global maximum, we adopt an optimization-based approach and iteratively improve the current estimate. More specifically, we initialize the set of poses with the greedy MST solution and, at each iteration, randomly select a rotation R_k to update. Assuming fixed values for the remaining rotations {R_i}_{i ≠ k}, we then search for the rotation R_k that maximizes the overall likelihood. We show in the supplementary material that this in fact corresponds to computing the most likely hypothesis under the conditional distribution p(R_k | {R_i}_{i ≠ k}, I_1, …, I_N):

R_k ← argmax_R ∑_{i ≠ k} [ f(I_i, I_k, R R_i^⊤) + f(I_k, I_i, R_i R^⊤) ]    (5)

Analogous to our approach for finding the optimal single relative rotation, we sample multiple hypotheses for the rotation R_k and select the hypothesis that maximizes (5). We find that this search-based block coordinate ascent consistently improves over the initial solution while avoiding the local optima that a continuous optimization is susceptible to. We provide pseudo-code in Alg. 2 and visualize one iteration of coordinate ascent in Fig. 4.

Figure 4: Recovering Joint Poses with Coordinate Ascent. Given a set of images I_1, …, I_N, we initialize a set of corresponding poses R_1, …, R_N. During each iteration of coordinate ascent, we: 1) randomly select one pose R_k to update (the red camera in this case); 2) sample a large number (250k) of candidate poses; 3) score each candidate according to the joint distribution conditioned on the other poses and images, as in (5); and 4) update R_k with the highest-scoring pose. See Sec. 3.2 for more detail.
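One iteration of the coordinate-ascent update in (5) and Alg. 2 could look like the following PyTorch sketch, reusing random_rotations from Sec. 3.1. The relative-rotation convention R_{i→k} = R_k R_i^⊤ and the per-pair batching are our assumptions.

```python
import torch

def coordinate_ascent_step(f, images, rotations, k, num_candidates=250_000):
    """Update rotation k by scoring candidates under the conditional in (5).

    `rotations` is an [N, 3, 3] tensor of current estimates; `f(img_a, img_b, R)`
    scores a batch of relative rotations from view a to view b (our assumed API).
    Reuses random_rotations() from the earlier sketch.
    """
    candidates = random_rotations(num_candidates)                  # [M, 3, 3]
    scores = torch.zeros(num_candidates)
    for i in range(len(images)):
        if i == k:
            continue
        R_i = rotations[i]
        # Relative rotation i -> k is candidate @ R_i^T; k -> i is R_i @ candidate^T.
        scores += f(images[i], images[k], candidates @ R_i.T)
        scores += f(images[k], images[i], R_i @ candidates.transpose(1, 2))
    rotations[k] = candidates[scores.argmax()]                     # keep the best-scoring hypothesis
    return rotations
```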

3.3 Implementation Details

Network Architecture. We use a ResNet-50 [22] with anti-aliasing [83] to extract image features. We use a lightweight 3-layer MLP that takes in a concatenation of the two sets of image features and a rotation matrix to predict an energy. We use positional encoding [35, 62] directly on the flattened rotation matrix, similar to [39]. See the supplement for architecture diagrams.
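A rough PyTorch sketch of the described architecture. The hidden width, the positional-encoding frequencies, and the omission of the anti-aliasing modification [83] are our simplifications; the 8 positional-encoding bases follow the supplement.

```python
import math
import torch
import torch.nn as nn
import torchvision

class PairwiseEnergyPredictor(nn.Module):
    """ResNet-50 image features + positionally encoded rotation -> scalar energy."""

    def __init__(self, num_bases: int = 8, hidden: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # global-pooled 2048-d features
        self.num_bases = num_bases
        rot_dim = 9 * 2 * num_bases                                    # sin/cos per basis per matrix entry
        self.mlp = nn.Sequential(
            nn.Linear(2 * 2048 + rot_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def positional_encoding(self, R: torch.Tensor) -> torch.Tensor:
        x = R.flatten(start_dim=-2)                                    # [..., 9]
        freqs = 2.0 ** torch.arange(self.num_bases, device=R.device) * math.pi  # assumed frequencies
        angles = x[..., None] * freqs                                  # [..., 9, num_bases]
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(start_dim=-2)

    def forward(self, img1, img2, rotations):
        f1 = self.encoder(img1[None]).flatten()                        # [2048]
        f2 = self.encoder(img2[None]).flatten()                        # [2048]
        feats = torch.cat([f1, f2]).expand(rotations.shape[0], -1)     # share image features across queries
        rot_feats = self.positional_encoding(rotations)                # [M, 144] for 8 bases
        return self.mlp(torch.cat([feats, rot_feats], dim=-1)).squeeze(-1)
```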

Number of Rotation Samples. We use the equivolumetric sampling in [39] to compute query rotations (37k total rotations) during training. For each iteration of coordinate ascent, we randomly sample 250k rotation matrices. For visualizing distributions, we randomly sample 50k rotations.

Runtime. We train the pairwise estimator with a batch size of 64 images for approximately 2 days on 4 NVIDIA 2080TI GPUs. Inference for 20 images takes around 1-2 seconds to construct an MST and around 2 minutes for 200 iterations of coordinate ascent on a single 2080TI. Note that the runtime of the coordinate ascent scales linearly with the number of images.

4 Evaluation

Figure 5: Qualitative Comparison of Recovered Camera Poses with Baselines. We visualize the camera poses (rotations) predicted by DROID-SLAM, COLMAP with SuperPoint/SuperGlue, and our method given sparse image frames. The black cameras correspond to the ground truth. We only visualize the rotations predicted by each method, and set the translation such that the object center is a fixed distance away along the camera axis. As the poses are agnostic to a global rotation, we align the predicted cameras across all methods to the ground truth coordinate system by setting the recovered camera pose for the first image to the corresponding ground truth (visualized in green). Odd rows correspond to randomly sampled image frames, while even rows correspond to uniformly-spaced image frames.

Figure 6: Mean Accuracy on Seen Categories. We evaluate our approach against competitive SLAM (DROID-SLAM) and SfM (COLMAP with SuperPoint + SuperGlue) baselines in sparse-view settings. We also train a direct relative rotation predictor (Pose Regression) that is not probabilistic and uses the MST generated by our method to recover joint pose. We consider both randomly sampling and uniformly spacing frames from a video sequence. We report the proportion of pairwise relative poses that are within 15 and 30 degrees of the ground truth, averaged over all seen categories. We find that our approach shines with fewer views because it does not rely on correspondences and thus can handle wide-baseline views. Correspondence-based approaches need about 20 images to begin working.
Figure 7: Accuracy on Subset of Seen Categories. Here we compare all approaches on a representative subset of seen categories. We find that direct regression of relative poses (purple) struggles more on categories with symmetry (Car, Hydrant) than categories without symmetry (Chair, Plant), suggesting that multimodal prediction is important for resolving ambiguity.
Figure 8: Mean Accuracy on Unseen Categories. We evaluate our approach on held-out categories from CO3D.
Figure 9: Novel View Registration. Here, we evaluate the task of registering a new view given previously aligned cameras. We find that adding more views improves performance, suggesting that additional views reduce ambiguity.

4.1 Experimental Setup

Dataset. We train and test on the Common Objects in 3D dataset (CO3D) [48], a large-scale dataset consisting of turntable-style videos of 51 common object categories. We train on the subset of the dataset that has camera poses, which were acquired by running COLMAP [54] over all frames of the video.

To train our network, we sample random frames and their associated camera poses from each video sequence. We train on 12,299 video sequences (from the train-known split) from 41 categories, holding out 10 categories to test generalization. We evaluate on 1,711 video sequences (from the test-known split) over all 41 trained categories (seen) as well as the 10 held out categories (unseen). The 10 held out categories are: ball, book, couch, frisbee, hotdog, kite, remote, sandwich, skateboard, and suitcase. We selected these categories randomly after excluding some of the categories with the most training images.

Task and Metrics. We consider the task of sparse-view camera pose estimation with N = 3, 5, 10, and 20 images, subsampled from a video sequence. This is highly challenging, especially when N is small, because the ground-truth camera poses have wide baselines.

We consider two possible ways to select frames from a video sequence. First, we can randomly sample a set of N frame indices per video sequence (Random). Alternatively, we can use N uniformly-spaced frame indices (Uniform). We note that because CO3D video sequences are commonly captured in a turntable fashion, the uniformly-spaced sampling strategy may be more representative of real-world distributions of sparse-view image sets. We report metrics for both task setups.
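A small sketch of the two frame-selection strategies, assuming frames are indexed from 0 to num_frames − 1; the rounding scheme for the Uniform setup is our choice.

```python
import numpy as np

def sample_frame_indices(num_frames, n, strategy="uniform", rng=None):
    """Select n frame indices from a video of num_frames frames (Random vs. Uniform setups)."""
    if strategy == "uniform":
        # Evenly spaced indices across the whole sequence, rounded to integers.
        return np.linspace(0, num_frames - 1, n).round().astype(int)
    rng = rng or np.random.default_rng()
    return np.sort(rng.choice(num_frames, size=n, replace=False))
```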

Because the global transformation of the camera poses is ambiguous, we evaluate relative rotations between pairs of images. For each pair of images, we compute the angular difference between the predicted relative rotation and the ground-truth relative rotation using Rodrigues' formula [50]. We report the proportion of relative rotations that are within 15 and 30 degrees of the ground truth. We note that rotation errors within this range are relatively easy to handle for downstream 3D reconstruction tasks (see Fig. 10 for an example).
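The pairwise rotation metric can be sketched as follows. The geodesic-angle formula via the matrix trace is standard; forming relative rotations from world-to-camera matrices is our assumed convention.

```python
import numpy as np

def rotation_angle_deg(R1: np.ndarray, R2: np.ndarray) -> float:
    """Geodesic distance (degrees) between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def relative_rotation_accuracy(R_pred, R_gt, threshold_deg=15.0):
    """Fraction of image pairs whose relative rotation error is below the threshold.

    R_pred, R_gt: [N, 3, 3] arrays of per-image world-to-camera rotations; comparing
    relative rotations makes the metric invariant to the unknown global rotation.
    """
    n = len(R_pred)
    errors = []
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = R_pred[j] @ R_pred[i].T
            rel_gt = R_gt[j] @ R_gt[i].T
            errors.append(rotation_angle_deg(rel_pred, rel_gt))
    return float(np.mean(np.array(errors) < threshold_deg))
```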

Baselines. We compare against DROID-SLAM [64], a current state-of-the-art SLAM approach that incorporates learning in an optimization framework. Note that DROID-SLAM requires trajectories and camera intrinsics. Thus, we provide the DROID-SLAM baseline with sorted frame indices and intrinsics, but do not provide these to any other method.

We also compare with a state-of-the-art structure-from-motion pipeline that uses COLMAP [54] with SuperPoint feature extraction [14] and SuperGlue matching [53]. We use the implementation provided by [52]. For instances on which COLMAP does not converge or is unable to localize some cameras, we treat the missing poses as identity rotations for evaluation. We note that DROID-SLAM also outputs approximately identity rotations when its optimization fails.

Ablations. In the spirit of learning-based solutions that directly regress pose, we train a network that predicts the relative rotation directly given two images. Similar to our energy-based predictor, we pass the concatenated image features from a ResNet-50 into an MLP. We double the number of layers from 3 to 6 and add a skip connection to give this network increased capacity. Rotations are predicted using the 6D rotation representation [86]. See the supplement for additional architecture details. The relative pose regressor cannot directly predict poses for more than two images. To recover sets of poses from sets of images, we use the MST graph recovered by our method to link the pairs of relative rotations (we find that this performs better than linking the relative rotations sequentially).
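For reference, a sketch of the standard mapping from the 6D representation [86] back to a rotation matrix via Gram-Schmidt, as typically used with such a regression head; the function name and tensor layout are our choices.

```python
import torch
import torch.nn.functional as F

def rotation_6d_to_matrix(x6: torch.Tensor) -> torch.Tensor:
    """Convert the 6D rotation parameterization of [86] to 3x3 matrices via Gram-Schmidt.

    x6: [..., 6], interpreted as two 3-vectors (the first two columns before orthonormalization).
    """
    a1, a2 = x6[..., :3], x6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)  # columns b1, b2, b3 form a valid rotation matrix
```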

To demonstrate the benefits of joint reasoning, we additionally report the performance of our method using the greedy Maximum Spanning Tree (MST) solution. The performance of the sequential solution is in the supplement.

4.2 Quantitative Evaluation

We evaluate all approaches on sparse-view camera pose estimation by averaging over all seen categories in Fig. 6. We find that our approach outperforms all baselines in the sparse-view regime. Correspondence-based approaches (DROID-SLAM and COLMAP) do not work until roughly 20 images, at which point image frames have sufficient overlap for local correspondences. However, real-world multi-view data (e.g. marketplace images) typically have far fewer images. We find that coordinate ascent helps our approach scale with more image frames, whereas the greedy maximum spanning tree accumulates errors with more frames.

Directly predicting relative poses does not perform well, possibly because pose regression cannot model multiple modes, which is important for symmetrical objects. We visualize the performance for four categories in Fig. 7. We find that the performance gap between our approach and direct regression is larger for objects with some symmetry (car, hydrant) than for objects without symmetry (chair, plant). Moreover, unlike our energy-based approach that models a joint distribution, a regression-based method does not allow similar joint reasoning.

We also test the generalization of our approach to unseen categories in Fig. 8. We find that our method still significantly outperforms all other approaches given sparse views, even for never-before-seen object categories, indicating its ability to handle generic objects beyond training. The per-category evaluations for both seen and unseen categories are in the supplement.

Novel View Registration. In our standard SfM-inspired task setup, we aim to recover camera poses for all N images. Intuitively, adding images reduces ambiguity, but recovering additional cameras is also more challenging. To disambiguate between the two, we evaluate the task of registering new views given previously aligned images in Fig. 9. Given N images, N−1 of which have aligned cameras, we use our energy-based predictor to recover the remaining camera (equivalent to one iteration of coordinate ascent). We find that adding images improves accuracy, suggesting that additional views can reduce ambiguity.

4.3 Qualitative Results

Figure 10: Initializing 3D NeRS Reconstruction using Predicted Cameras. NeRS [82] is a representative 3D reconstruction approach that takes noisy cameras as initialization and jointly optimizes object shape, appearance, and camera poses. We run our method with coordinate ascent on 7 input images of a fire hydrant and 4 input images of a motorbike to obtain the camera initialization (green), which we provide to NeRS. NeRS then finetunes the cameras (orange) and outputs a 3D reconstruction.

We show qualitative results on the outputs of our pairwise predictor in Fig. 3. The visualized distributions suggest that our model is learning useful information about symmetry and can model multiple modes even for unseen categories.

We visualize predicted camera poses for DROID-SLAM, COLMAP, and our method with coordinate ascent in Fig. 5. Unable to bridge the domain gap from narrow-baseline video frames, DROID-SLAM often gets stuck along the trajectory. Although COLMAP sometimes fails to converge, it performs well for N = 20. Our approach consistently outputs plausible interpretations but is unable to achieve precise localization. See the supplementary material for visualizations on randomly selected sequences and more category-specific discussion.

We also validate that our camera pose estimations can be used for downstream 3D reconstruction. We use our camera poses to initialize NeRS [82], a representative sparse-view surface-based approach that requires a (noisy) camera initialization. Using our cameras, we successfully reconstruct a 3D model of a fire hydrant from 7 images and a motorbike from 4 images in Fig. 10. Note that the camera pose initialization in the original NeRS paper was manually selected.

5 Discussion

We presented a prediction-based approach for estimating camera rotations given (a sparse set of) images of a generic object. Our energy-based formulation allows capturing the underlying uncertainty in relative poses, while also enabling joint reasoning over multiple images. We believe our system's robustness under sparse views can allow it to serve as a stepping stone for initializing (neural) reconstruction methods in the wild, but we also note several open challenges. First, our work reasons about the joint distribution using only pairwise potentials, and developing efficient higher-order energy models may further improve performance. Moreover, while we outperform existing techniques given sparse views, correspondence-driven methods are more accurate given a large number of views, and we hope future efforts can unify the two approaches. Finally, our approach may not be directly applicable to reasoning about camera transformations for arbitrary scenes, where modeling camera translation would be more important than in object-centric images.

Acknowledgements. We would like to thank Gengshan Yang, Jonathon Luiten, Brian Okorn, and Elliot Wu for helpful feedback and discussion. This work was supported in part by the NSF GRFP (Grant No. DGE1745016), Singapore DSTA, and the CMU Argo AI Center for Autonomous Vehicle Research.

6 Supplementary Materials

In this section, we show that maximizing the conditional distribution of an update to a hypothesis is equivalent to maximizing the joint likelihood in Sec. 6.1. We evaluate ablations of our approach to validate the use of coordinate ascent vs gradient ascent and MST vs sequential loop in Tab. 1. To test the quality of our SLAM and SfM baselines, we also ran them with more image frames (narrower baseline) in Fig. 11. We show per-category evaluations to compare performance across seen and unseen categories of CO3D in Tab. 2. We provide a visualization of how to interpret the relative rotations in Fig. 12 and discuss the coordinate system in which we compute relative rotations in Fig. 13. We discuss the learned symmetry modes as well as some failure modes in Fig. 14. As a proof of concept, we use our energy-based predictor on a deformable object (cat) in Fig. 15. We include architecture diagrams for our energy-based pairwise pose predictor in Fig. 16 and the direct pose predictor baseline in Fig. 17. Finally, we show qualitative comparisons between our approach and the correspondence-based baselines on randomly selected sequences on both seen and unseen categories in Fig. 18 and Fig. 19 respectively.

6.1 Derivation of Conditional Distribution for Coordinate Ascent

Given our pairwise conditional probabilities, the joint distribution over a set of rotations can be computed as:

p(R_1, …, R_N | I_1, …, I_N) = (1/Z) ∏_{(i,j) ∈ P} exp(f(I_i, I_j, R_j R_i^⊤))    (6)

where Z is the normalizing constant.

We are searching for the most likely set of rotations under this joint distribution given images I_1, …, I_N. For each iteration of coordinate ascent, we have our current most likely set of rotations R_1, …, R_N and wish to update R_k. If we fix all R_i with i ≠ k, the only terms in (6) that can change are the ones involving R_k, and the rest can be folded into a scalar constant. Thus, searching for the rotation R_k that maximizes the overall likelihood is equivalent to finding the most likely hypothesis under p(R_k | {R_i}_{i ≠ k}, I_1, …, I_N):

argmax_{R_k} p(R_1, …, R_N | I_1, …, I_N) = argmax_{R_k} ∏_{i ≠ k} exp(f(I_i, I_k, R_k R_i^⊤)) · exp(f(I_k, I_i, R_i R_k^⊤))    (7)
                                          = argmax_{R_k} ∑_{i ≠ k} [ f(I_i, I_k, R_k R_i^⊤) + f(I_k, I_i, R_i R_k^⊤) ]    (8)

This simplifies each iteration of coordinate ascent from a sum over all N(N−1) pairwise terms to a sum over the 2(N−1) terms involving R_k.

Acc. @               N=3    N=5    N=10   N=20
Ours (Sequential)    0.50   0.48   0.42   0.39
Ours (MST)           0.52   0.50   0.47   0.43
Ours (Grad. Asc.)    0.52   0.51   0.49   0.47
Ours (Coord. Asc.)   0.59   0.58   0.59   0.59
Table 1: Ablations on Seen Categories in CO3D (Random Sequence Subsampling). One way to convert a set of relative pose predictions to a coherent set of joint poses is by naively linking them together in a sequence (Sequential). We find that greedily linking them by constructing a maximum spanning tree (MST) performs slightly better since it incorporates the most confident relative rotation predictions. To make better use of our energy-based relative pose predictor, we also tried directly running gradient ascent initialized from the MST solution, maximizing energy using ADAM (Grad. Asc.). Because the loss landscape is non-smooth, we observe that it does not deviate much from the MST solution. We found the scoring-based block coordinate ascent (Coord. Asc.) to be the most effective.
Acc. @ 30° (%)
Category            N=3    N=5    N=10   N=20

Seen Categories

Apple 59 60 62 61
Backpack 63 58 59 57
Banana 67 54 63 55
Baseballbat 100 67 70 73
Baseballglove 48 56 56 55
Bench 69 75 68 66
Bicycle 62 61 63 62
Bottle 59 57 60 60
Bowl 80 75 77 80
Broccoli 55 54 51 51
Cake 46 47 47 54
Car 67 71 70 62
Carrot 60 64 63 65
Cellphone 69 78 72 69
Chair 53 55 55 56
Cup 55 56 54 51
Donut 52 44 51 51
Hairdryer 58 56 58 54
Handbag 66 63 62 61
Hydrant 72 73 68 70
Keyboard 72 73 74 74
Laptop 88 87 89 89
Microwave 56 65 55 58
Motorcycle 59 60 62 61
Mouse 68 70 69 67
Orange 52 52 51 49
Parkingmeter 22 27 23 22
Pizza 50 57 57 55
Plant 46 47 49 51
Stopsign 42 49 47 47
Teddybear 47 52 49 48
Toaster 76 75 71 73
Toilet 76 80 75 77
Toybus 63 70 72 71
Toyplane 43 57 48 51
Toytrain 81 73 75 75
Toytruck 71 69 68 68
Tv 78 83 87 86
Umbrella 58 60 54 55
Vase 58 55 55 51
Wineglass 51 46 46 47
Seen Mean 61 62 61 61

Unseen Categories

Ball 45 41 43 44
Book 51 49 49 47
Couch 42 58 39 35
Frisbee 55 49 40 38
Hotdog 58 61 50 49
Kite 28 23 27 24
Remote 64 58 65 66
Sandwich 37 41 41 42
Skateboard 56 64 64 65
Suitcase 59 61 67 63
Unseen Mean 49 51 48 48
Table 2: Per-category Evaluation on CO3D with Random Sequence Sampling. We find that rotationally symmetric objects (e.g. apple, orange, wineglass) tend to be challenging. We were surprised to find that bowls worked well, likely because the bowls in the CO3D dataset tend to have a lot of texture or even stickers. Objects with distinctive shapes (e.g. toilet, laptop) tend to be easier to orient. Note that some object categories have few instances for both training and testing (e.g. baseballbat, parkingmeter).
Figure 11: Evaluation of correspondence-based approaches on large image sets (on the “Seen Categories” split). We evaluate the DROID-SLAM [64] and COLMAP (with SuperPoint features and SuperGlue matching) baselines on much longer image sequences (N = 30, 40, 50). We verify that these approaches, which rely on correspondences between images, can achieve good performance when the camera baselines are narrow. Nonetheless, their poor performance given sparse views suggests that there is a rich space for improving camera pose estimation in the low-data regime, which is the setting that we target in our work.
Figure 12: Interpreting Relative Rotations using a 2-Sphere. Given “Image 1”, we show how “Image 2” would have appeared given different relative rotations. (a), (b), and (c) show relative rotations with 60°, 120°, and 180° yaw respectively. (d) and (e) show relative rotations with 45° and −45° pitch respectively. (f) shows a relative rotation with just roll. (g) shows a relative rotation with all three components. We use a view-aligned coordinate system (see Fig. 13) when computing relative rotations. Inspired by [39], we visualize the distribution by projecting rotations onto a 2-sphere, with the x-axis representing yaw, the y-axis representing pitch, and color representing roll.
Figure 13: View-aligned vs Object-centric Coordinate System. We compute relative rotations in a coordinate system (red axes on left) aligned with the camera (red wireframe on left). Relative rotations aligned to the camera viewpoint can always be computed without reasoning about the object's alignment with respect to the camera. While possibly more intuitive, relative rotations in the object coordinate system (blue axes on left) must be defined with respect to a canonical object pose and thus cannot be computed in general. On the right, we visualize a 60° yaw relative rotation from Image 1 in the view-aligned coordinate system (red) and the object-centric coordinate system (blue).
Figure 14: Learned Pairwise Distributions on Seen Categories (Test Set). Here we visualize the learned pairwise distributions for various pairs of images. Top left: The images correspond to opposite sides of the apple, so the relative pose is ambiguous. Our approach predicts a rotationally symmetric band of possible rotations. Top middle: The images have sufficient overlap such that the relative rotation is unambiguous, and our method predicts a single mode for the apples. Top right: For rectangular objects such as microwaves, our approach often predicts 4 modes corresponding to each of the 90-degree rotations. Bottom left: Our approach predicts 2 modes for the bicycle because the first viewpoint is challenging. Bottom middle: Clashing foreground and background textures can be a challenge for our pairwise predictor. Even though the relative pose should be unambiguous, our method places low probability on the correct pose, although it does recognize the rotational symmetry of the cup category. Bottom right: Unusual object appearance is another failure mode of our method, which defaults to placing high probability mass on the identity matrix. Our method does recognize the rotational symmetry of the cake category.
Figure 15: Deformable Objects. Existing SfM and SLAM pipelines often make assumptions about rigidity or appearance constancy in order for bundle adjustment to converge. Our method has no such requirements and can be run even on deformable objects. While the ground truth poses for these images of a cat are unknown, the relative rotation of the camera w.r.t the cat is roughly -90 degrees yaw with negative pitch while the relative rotation of the camera w.r.t. the couch has no pitch or yaw but some roll in the clockwise direction (green). Although our training data does not include dynamic or deformable objects, our network outputs plausible modes.
Figure 16: Architecture Diagram for our Pairwise Energy Predictor. We use a ResNet-50 [22] with anti-aliasing [83] as our feature extractor. We directly apply positional encoding (8 bases) [62] to the elements of the flattened rotation matrix. We concatenate the image features and positionally encoded rotations into a single feature vector, which we feed into an MLP that predicts energy (corresponding to unnormalized log-probability).
Figure 17: Architecture Diagram for our Direct Pairwise Rotation Predictor. For the direct rotation regression baseline, we still input the concatenated image features. To make the baseline more competitive, we increase the capacity of the MLP to 6 layers with a skip connection. The network predicts the 6D rotation representation [86].
Figure 18: Randomly selected Qualitative Results for Seen Categories.
Figure 19: Randomly selected Qualitative Results for Unseen Categories.

References

  • [1] V. Balntas, S. Li, and V. Prisacariu (2018) RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. In ECCV, Cited by: §2.
  • [2] H. Bay, T. Tuytelaars, and L. V. Gool (2006) SURF: Speeded Up Robust Features. In ECCV, Cited by: §2.
  • [3] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, et al. (2016) Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In CVPR, Cited by: §2.
  • [4] Y. Bukschat and M. Vetter (2020) EfficientPose: An Efficient, Accurate and Scalable End-to-end 6D Multi Object Pose Estimation Approach. arXiv:2011.04307. Cited by: §2.
  • [5] C. Campos, R. Elvira, J. J. Gómez, J. M. M. Montiel, and J. D. Tardós (2021)

    ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM

    .
    T-RO 37 (6), pp. 1874–1890. Cited by: §2.
  • [6] L. Carlone, R. Tron, K. Daniilidis, and F. Dellaert (2015) Initialization Techniques for 3D SLAM: A Survey on Rotation Estimation and its Use in Pose Graph Optimization. ICRA. Cited by: §3.
  • [7] B. Chen, T. Chin, and M. Klimavicius (2022) Occlusion-Robust Object Pose Estimation with Holistic Representation. In WACV, Cited by: §2.
  • [8] K. Chen, N. Snavely, and A. Makadia (2021) Wide-Baseline Relative Camera Pose Estimation with Directional Learning. In CVPR, Cited by: §2.
  • [9] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker (2016) Universal Correspondence Network. NeurIPS. Cited by: §2.
  • [10] E. Corona, K. Kundu, and S. Fidler (2018) Pose Estimation for Objects with Rotational Symmetry. In IROS, Cited by: §2.
  • [11] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse (2007) MonoSLAM: Real-time Single Camera SLAM. TPAMI 29 (6), pp. 1052–1067. Cited by: §2.
  • [12] X. Deng, A. Mousavian, Y. Xiang, F. Xia, T. Bretl, and D. Fox (2019) PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking. In RSS, Cited by: §2.
  • [13] X. Deng, Y. Xiang, A. Mousavian, C. Eppner, T. Bretl, and D. Fox (2020) Self-supervised 6D Object Pose Estimation for Robot Manipulation. In ICRA, Cited by: §2.
  • [14] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018) SuperPoint: Self-supervised Interest Point Detection and Description. In CVPR-W, Cited by: §2, §4.1.
  • [15] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019) D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In CVPR, Cited by: §2.
  • [16] M. Dusmanu, J. L. Schönberger, and M. Pollefeys (2020) Multi-view Optimization of Local Feature Geometry. In ECCV, Cited by: §2.
  • [17] J. Engel, V. Koltun, and D. Cremers (2018) Direct Sparse Odometry. TPAMI. Cited by: §2.
  • [18] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski (2010) Towards Internet-scale Multi-view Stereo. In CVPR, Cited by: §2.
  • [19] I. Gilitschenski, R. Sahoo, W. Schwarting, A. Amini, S. Karaman, and D. Rus (2019) Deep Orientation Uncertainty Learning Based on a Bingham Loss. In ICLR, Cited by: §2.
  • [20] S. Goel, G. Gkioxari, and J. Malik (2022) Differentiable Stereopsis: Meshes from Multiple Views Using Differentiable Rendering. In CVPR, Cited by: §1.
  • [21] C. Harris and M. Stephens (1988) A Combined Corner and Edge Detector. In Alvey Vision Conference, Cited by: §2.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §3.3, Figure 16.
  • [23] S. Iwase, X. Liu, R. Khirodkar, R. Yokota, and K. M. Kitani (2021) RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering. In ICCV, Cited by: §2.
  • [24] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) SSD-6D: Making RGB-based 3D Detection and 6D Pose Estimation Great Again. In ICCV, Cited by: §2.
  • [25] A. Kendall and R. Cipolla (2016) Modelling Uncertainty in Deep Learning for Camera Relocalization. In ICRA, Cited by: §2.
  • [26] A. Kendall, M. Grimes, and R. Cipolla (2015) PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, Cited by: §2.
  • [27] C. Lin, W. Ma, A. Torralba, and S. Lucey (2021) BARF: Bundle-Adjusting Neural Radiance Fields. In ICCV, Cited by: §1, §3.
  • [28] P. Lindenberger, P. Sarlin, V. Larsson, and M. Pollefeys (2021) Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. In ICCV, Cited by: §2.
  • [29] C. Liu, J. Yuen, and A. Torralba (2010) SIFT Flow: Dense Correspondence Across Scenes and Its Applications. TPAMI 33 (5), pp. 978–994. Cited by: §2.
  • [30] D. G. Lowe (2004) Distinctive Image Features from Scale-invariant Keypoints. IJCV 60 (2), pp. 91–110. Cited by: §2.
  • [31] B. D. Lucas and T. Kanade (1981) An Iterative Image Registration Technique with an Application to Stereo Vision. In IJCAI, Cited by: §2.
  • [32] R. Mahjourian, M. Wicke, and A. Angelova (2018) Unsupervised Learning of Depth and Ego-Motion From Monocular Video Using 3D Geometric Constraints. In CVPR, Cited by: §2.
  • [33] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, T. Birdal, N. Navab, and F. Tombari (2019) Explaining the Ambiguity of Object Detection and 6D Pose from Visual Data. In ICCV, Cited by: §2.
  • [34] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu (2017)

    Relative Camera Pose Estimation Using Convolutional Neural Networks

    .
    In ACIVS, Cited by: §2.
  • [35] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, Cited by: §3.3.
  • [36] D. Mohlin, J. Sullivan, and G. Bianchi (2020) Probabilistic Orientation Estimation with Matrix Fisher Distributions. NeurIPS. Cited by: §2.
  • [37] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-SLAM: A Versatile and Accurate Monocular SLAM System. T-RO 31 (5), pp. 1147–1163. Cited by: §2.
  • [38] R. Mur-Artal and J. D. Tardós (2017) ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. T-RO 33 (5), pp. 1255–1262. Cited by: §2.
  • [39] K. A. Murphy, C. Esteves, V. Jampani, S. Ramalingam, and A. Makadia (2021) Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold. In ICML, Cited by: §2, §2, Figure 3, §3.1, §3.1, §3.3, §3.3, Figure 12.
  • [40] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison (2011) DTAM: Dense Tracking and Mapping in Real-time. In ICCV, Cited by: §2.
  • [41] D. Novotny, D. Larlus, and A. Vedaldi (2017) Learning 3D Object Categories by Looking Around Them. In ICCV, Cited by: §2.
  • [42] D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi (2019) C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion. In ICCV, Cited by: §2.
  • [43] M. Oberweger, M. Rad, and V. Lepetit (2018) Making Deep Heatmaps Robust to Partial Occlusions for 3D Object Pose Estimation. In ECCV, Cited by: §2.
  • [44] B. Okorn, Q. Gu, M. Hebert, and D. Held (2021) ZePHyR: Zero-shot Pose Hypothesis Scoring. In ICRA, Cited by: §2.
  • [45] B. Okorn, M. Xu, M. Hebert, and D. Held (2020) Learning Orientation Distributions for Object Pose Estimation. In IROS, Cited by: §2.
  • [46] R. Pautrat, V. Larsson, M. R. Oswald, and M. Pollefeys (2020) Online Invariance Selection for Local Feature Descriptors. In ECCV, Cited by: §2.
  • [47] S. Prokudin, P. Gehler, and S. Nowozin (2018) Deep Directional Statistics: Pose Estimation with Uncertainty Quantification. In ECCV, Cited by: §2.
  • [48] J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny (2021) Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In ICCV, Cited by: §4.1.
  • [49] J. Revaud, C. De Souza, M. Humenberger, and P. Weinzaepfel (2019) R2D2: Reliable and Repeatable Detector and Descriptor. NeurIPS. Cited by: §2.
  • [50] O. Rodrigues (1840) Des lois géométriques qui régissent les déplacements d’un système solide dans l’espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. Journal de Mathématiques Pures et Appliquées 5. Cited by: §4.1.
  • [51] A. Rosinol, M. Abate, Y. Chang, and L. Carlone (2020) Kimera: an Open-Source Library for Real-Time Metric-Semantic Localization and Mapping. In ICRA, Cited by: §2.
  • [52] P. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk (2019) From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, Cited by: §2, §4.1.
  • [53] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020) SuperGlue: Learning Feature Matching with Graph Neural Networks. In CVPR, Cited by: §2, §4.1.
  • [54] J. L. Schönberger and J. Frahm (2016) Structure-from-Motion Revisited. In CVPR, Cited by: §2, §4.1, §4.1.
  • [55] J. L. Schönberger, E. Zheng, M. Pollefeys, and J. Frahm (2016) Pixelwise View Selection for Unstructured Multi-View Stereo. In ECCV, Cited by: §2.
  • [56] T. Schops, T. Sattler, and M. Pollefeys (2019) BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In CVPR, Cited by: §2.
  • [57] K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Learning Local Feature Descriptors Using Convex Optimisation. TPAMI 36 (8), pp. 1573–1585. Cited by: §2.
  • [58] N. Snavely, S. M. Seitz, and R. Szeliski (2006) Photo Tourism: Exploring Photo Collections in 3D. In SIGGRAPH, Cited by: §2.
  • [59] C. Song, J. Song, and Q. Huang (2020) Hybridpose: 6D Object Pose Estimation under Hybrid Representations. In CVPR, Cited by: §2.
  • [60] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman (2018) Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling. In CVPR, Cited by: §1.
  • [61] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018) Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. In ECCV, Cited by: §2.
  • [62] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng (2020) Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS. Cited by: §3.3, Figure 16.
  • [63] C. Tang and P. Tan (2019) BA-Net: Dense Bundle Adjustment Network. In ICLR, Cited by: §2.
  • [64] Z. Teed and J. Deng (2021) DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. NeurIPS. Cited by: §2, §4.1, Figure 11.
  • [65] B. Tekin, S. N. Sinha, and P. Fua (2018) Real-Time Seamless Single Shot 6D Object Pose Prediction. In CVPR, Cited by: §2.
  • [66] E. Tola, V. Lepetit, and P. Fua (2009) Daisy: An Efficient Dense Descriptor Applied to Wide-baseline Stereo. TPAMI 32 (5), pp. 815–830. Cited by: §2.
  • [67] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon (1999) Bundle Adjustment—A Modern Synthesis. In International workshop on vision algorithms, Cited by: §2.
  • [68] P. Truong, M. Danelljan, and R. Timofte (2020) GLU-Net: Global-Local Universal Network for Dense Flow and Correspondences. In CVPR, Cited by: §2.
  • [69] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) DeMoN: Depth and Motion Network for Learning Monocular Stereo. In CVPR, Cited by: §2.
  • [70] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki (2017) SfM-Net: Learning of Structure and Motion from Video. arXiv:1704.07804. Cited by: §2.
  • [71] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese (2019) DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. In CVPR, Cited by: §2.
  • [72] Q. Wang, X. Zhou, B. Hariharan, and N. Snavely (2020) Learning Feature Descriptors Using Camera Pose Supervision. In ECCV, Cited by: §2.
  • [73] S. Wang, R. Clark, H. Wen, and N. Trigoni (2017) DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In ICRA, Cited by: §2.
  • [74] W. Wang, Y. Hu, and S. Scherer (2020) TartanVO: A Generalizable Learning-based VO. In CoRL, Cited by: §2.
  • [75] X. Wei, Y. Zhang, Z. Li, Y. Fu, and X. Xue (2020) DeepSFM: Structure From Motion Via Deep Bundle Adjustment. In ECCV, Cited by: §2.
  • [76] J. M. Wong, V. Kee, T. Le, S. Wagner, G. Mariottini, A. Schneider, L. Hamilton, R. Chipalkatty, M. Hebert, D. M. Johnson, et al. (2017) SegICP: Integrated Deep Semantic Segmentation and Pose Estimation. In IROS, Cited by: §2.
  • [77] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In RSS, Cited by: §2.
  • [78] Y. Xiao, X. Qiu, P. Langlois, M. Aubry, and R. Marlet (2019) Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects. In BMVC, Cited by: §2.
  • [79] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua (2016) LIFT: Learned Invariant Feature Transform. In ECCV, Cited by: §2.
  • [80] Z. Yin and J. Shi (2018) GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In CVPR, Cited by: §2.
  • [81] J. Y. Zhang, S. Pepose, H. Joo, D. Ramanan, J. Malik, and A. Kanazawa (2020) Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild. In ECCV, Cited by: §2.
  • [82] J. Y. Zhang, G. Yang, S. Tulsiani, and D. Ramanan (2021) NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild. In NeurIPS, Cited by: §1, §3, Figure 10, §4.3.
  • [83] R. Zhang (2019) Making Convolutional Networks Shift-Invariant Again. In ICML, Cited by: §3.3, Figure 16.
  • [84] H. Zhou, B. Ummenhofer, and T. Brox (2018) DeepTAM: Deep Tracking and Mapping. In ECCV, Cited by: §2.
  • [85] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised Learning of Depth and Ego-Motion From Video. In CVPR, Cited by: §2.
  • [86] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li (2019) On the Continuity of Rotation Representations in Neural Networks. In CVPR, Cited by: §4.1, Figure 17.
  • [87] J. Zubizarreta, I. Aguinaga, and J. M. M. Montiel (2020) Direct Sparse Mapping. T-RO. Cited by: §2.