Recovering 3D from 2D images of an object has been a central task in vision across decades. Given multiple views, structure-from-motion (SfM) based methods can infer a 3D representation of the underlying instance while also associating each image with a camera viewpoint. However, these correspondence-driven methods cannot robustly handle sparsely sampled images that minimally overlap, and typically require many (roughly 20) images for a 360-degree 3D inference. Unfortunately, this requirement of densely sampled views can be prohibitive—online marketplaces often have only a few images per instance, and a user casually reconstructing a novel object would also find capturing such views tedious. Although the recently emerging neural 3D reconstruction techniques also typically leverage similarly dense views, some works have shown promise that a far smaller number of images can suffice for high-quality 3D reconstruction. These successes have, however, still relied on precisely or approximately [27, 82, 20] known camera viewpoints for inference. To apply these methods at scale, we must therefore answer a fundamental question—given sparsely sampled images of a generic object, how can we obtain the associated camera viewpoints?
Existing methods do not provide a conclusive answer to this question. On the one hand, bottom-up correspondence-based techniques are not robustly applicable for sparse-view inference. On the other, recent neural multi-view methods can optimize already known approximate camera poses but provide no mechanism to obtain these to begin with. In this work, our goal is to fill this void and develop a method that, given a small number of unposed images of a generic object, can associate them with (approximate) camera viewpoints. Towards this goal, we focus on inferring the camera rotation matrices corresponding to each input image and propose a top-down approach to predict these. However, we note that the ‘absolute’ rotation is not well-defined given an image of a generic object—it assumes a ‘canonical’ pose which is not always known a priori (e.g., what is the identity rotation for a pen, or a plant?). In contrast, the relative rotation between two views is well-defined even if a canonical pose for the instance is not. Thus, instead of adopting the common paradigm of single-image based pose prediction, we learn to estimate the relative pose given a pair of input images. We propose a system that leverages such pairwise predictions to then infer a consistent set of global rotations given multiple images of a generic object.
A key technical question that we consider regards the formulation of such pairwise pose estimation. Given two informative views of a rotationally asymmetric object, a regression-based approach may be able to accurately predict their relative transformation. The general case, however, can be more challenging—given two views of a cup but with the handle visible in only one, the relative pose is ambiguous given just the two images. To allow capturing this uncertainty, we formulate an energy-based relative pose prediction network that, given two images and a candidate relative rotation, outputs an energy corresponding to the (unnormalized) log-probability of the hypothesis. This probabilistic estimation of relative pose not only makes the learning more stable, but, more importantly, provides a mechanism to estimate a joint distribution over viewpoints given multiple images. We show that optimizing rotations to improve this joint likelihood yields coherent poses given multiple images and leads to significant improvements over naive approaches that do not consider the joint likelihoods.
We train our system using instances from over 40 commonplace object categories, and find that not only can it infer accurate (relative) poses for novel instances of these classes, it even generalizes to instances from unseen categories. Our approach can thus be viewed as a stepping stone toward sparse-view 3D reconstruction of generic objects; just as classical techniques provide precise camera poses that (neural) multi-view reconstruction methods can leverage, our work provides a similar, albeit coarser, output that can be used to initialize inference in current (and future) sparse-view reconstruction methods. While our system only outputs camera rotations, we note that a reasonable corresponding translation can be easily initialized assuming object-facing viewpoints, and we show that this suffices in practice for bootstrapping sparse-view reconstruction.
2 Related Work
Structure-from-Motion (SfM). At a high level, structure-from-motion aims to recover 3D geometry and camera parameters from image sets. This is done classically by computing local image features [21, 30, 2, 66], finding matches across images, and then estimating and verifying epipolar geometry using bundle adjustment. Later works have scaled up the SfM pipeline using sequential algorithms, demonstrating results on hundreds or even thousands of images [58, 18, 55, 54, 52].
The advent of deep learning has augmented various stages of the classical SfM pipeline. Learned feature descriptors [14, 57, 72, 79, 15, 46, 49] and improved feature matching [53, 9, 29, 68, 16] have significantly outperformed their hand-crafted counterparts. BA-Net and DeepSFM have even replaced the bundle-adjustment process by optimizing over a cost volume. Most recently, Pixel-Perfect SfM uses a featuremetric error to post-process camera poses, achieving sub-pixel accuracy.
While these methods can achieve excellent localization, all these approaches are bottom-up: beginning with local features that are matched across images. However, matching features requires sufficient overlap between images, which may not be possible given wide baseline views. While our work also aims to localize camera poses given image sets, our approach fundamentally differs because it is top-down and does not rely on low-level correspondences.
Simultaneous Localization and Mapping (SLAM). Related is the task of Monocular SLAM, which aims to localize and map the surroundings from a video stream. Indirect SLAM methods, similar to SfM, match local features across different images to localize the camera [51, 5, 38, 37]. Direct SLAM methods, on the other hand, define a geometric objective function to directly optimize over a photometric error [87, 56, 11, 17].
There have also been various attempts to introduce deep learning into SLAM pipelines. As with SfM, learned feature descriptors and matching have helped improve accuracy on SLAM subproblems and increased robustness. End-to-end deep SLAM methods [84, 40, 73, 74] have improved the robustness of SLAM compared to classical methods, but have generally not closed the gap in performance. One notable exception is the recent DROID-SLAM, which combines the robustness of learning-based SLAM with the accuracy of classical SLAM.
These approaches all assume sequential streams and generally rely on matching or otherwise incorporating temporal locality between neighboring frames. We do not make any assumptions about the order of the image inputs nor the amount of overlap between nearby frames.
Single-view Pose Prediction. The task of predicting a (6-DoF) pose from a single image has a long and storied history, the surface of which can barely be scratched in this section. Unlike relative pose between multiple images, the (absolute) pose given a single image is only well-defined if there exists a canonical coordinate system. Most single-view pose prediction approaches therefore deal with a fixed set of categories, each of which has a canonical coordinate system defined a priori [77, 65, 43, 7, 71, 23, 4, 59, 42, 39, 24, 26]. Other methods that are category-agnostic take in a 3D mesh or point cloud as input, which provides a local coordinate system [76, 78, 81, 44].
Perhaps most relevant to us are approaches that not only predict pose but also model inherent uncertainty in the pose prediction [3, 25, 39, 45, 10, 61, 47, 19, 12, 13, 36, 33]. Like our approach, VpDR-Net uses relative poses as supervision but still predicts absolute pose (with a unimodal Gaussian uncertainty model). Implicit-PDF is the most similar approach to ours and served as an inspiration. Similar to our approach, Implicit-PDF uses a neural network to implicitly represent probability via an energy-based formulation, which elegantly handles symmetries and multimodal distributions. Unlike our approach, Implicit-PDF (and every other single-view pose prediction method) predicts absolute pose, which does not exist in general for generic or novel categories. Instead, we model probability distributions over relative pose given pairs of images.
Learning-based Relative Pose Prediction. When considering generic scenes, prior works have investigated the task of relative pose prediction given two images. However, these supervised or self-supervised [85, 80, 32, 70] methods typically consider prediction of motion between consecutive frames and are not easily adapted to wide-baseline prediction. While some approaches have investigated wide-baseline prediction [34, 1], regression-based inference does not effectively capture uncertainty, unlike our energy-based model. Perhaps most similar to ours is DirectionNet, which also predicts a camera distribution for wide-baseline views. While DirectionNet only uses the expected value of the distribution and thus ignores symmetry, we take advantage of multimodal distributions to improve our joint pose estimation.
3 Method
Given a set of n images {I_1, …, I_n} depicting a generic object in-the-wild, we aim to recover a set of rotation matrices {R_1, …, R_n} such that rotation matrix R_i corresponds to the viewpoint of the camera used to capture image I_i. Note that while we do not model translation, it can be easily initialized using object-facing viewpoints for 3D object reconstruction [27, 82] or a pose graph for SLAM. We are primarily interested in settings with only sparse views and wide baselines. While bottom-up correspondence-based techniques can reliably recover camera poses given dense views, they do not adapt well to sparse views with minimal overlap. We instead propose a prediction-based top-down approach that can learn and exploit the global structure directly.
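To make the object-facing translation initialization concrete, the sketch below assumes the object sits at the world origin and every camera observes it from a fixed distance along its optical axis. The function name and the specific convention are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def object_facing_translation(R, distance=2.0):
    """Initialize a camera translation for an object-facing viewpoint.

    Assumes the object is centered at the world origin and the camera looks
    at it along its optical (+z) axis from a fixed distance. Under the
    camera-from-world convention x_cam = R @ x_world + t, this gives
    t = [0, 0, distance] for any rotation R; the camera center in world
    coordinates is then c = -R^T @ t.
    """
    t = np.array([0.0, 0.0, distance])
    center = -R.T @ t
    return t, center

t, center = object_facing_translation(np.eye(3))
```

Any such initialization is necessarily coarse; downstream reconstruction methods are expected to refine it.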
The basic building block of our prediction system (visualized in Fig. 3) is a pairwise pose predictor that infers relative camera orientations given pairs of images. However, symmetries in objects and possibly uninformative viewpoints make this an inherently uncertain prediction task. To allow capturing this uncertainty, we propose an energy-based approach that models the multi-modal distribution over relative poses given two images.
Given the predicted distributions over pairwise relative rotations, we show that these can be leveraged to induce a joint distribution over the rotations. Starting with a greedy initialization, we present a coordinate-ascent approach that jointly reasons over and improves the set of inferred rotations. We describe our approach for modeling probability distributions over relative poses between two images in Sec. 3.1, and build on this in Sec. 3.2 to recover a joint set of poses across multiple images. Finally, we discuss implementation details in Sec. 3.3.
3.1 Estimating Pair-wise Relative Rotations
Given a pair of images depicting an arbitrary object, we aim to predict a distribution over the relative rotation corresponding to the camera transformation between the two views. As there may be ambiguities when inferring the relative pose given two images, we introduce a formulation that can model uncertainty.
Energy-based Formulation. We wish to model the conditional distribution p(R | I_1, I_2) over the relative rotation matrix R ∈ SO(3) given input images I_1 and I_2. Inspired by recent work on implicitly representing the distribution over rotations using a neural network, we propose an energy-based relative pose estimator. More specifically, we train a network f(I_1, I_2, R) that learns to predict the energy, or the unnormalized joint log-probability: p(R, I_1, I_2) = exp(f(I_1, I_2, R)) / C, where C is the constant of integration. From the product rule, we can recover the conditional probability as a function of f:

p(R | I_1, I_2) = exp(f(I_1, I_2, R)) / ∫_{R' ∈ SO(3)} exp(f(I_1, I_2, R')) dR'.   (1)

We approximate the integral by marginalizing over sampled rotations, avoiding the need to compute C directly (see Alg. 1), but note that the number of sampled rotations should be large for the approximation to be accurate. It is therefore important to use a lightweight network, since f is queried once per sampled rotation in the denominator.
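Concretely, over a finite set of sampled rotations this normalization reduces to a softmax over the predicted energies. A minimal sketch (the energy values below are placeholders standing in for network outputs):

```python
import numpy as np

def conditional_probs(energies):
    """Approximate the conditional probability over N sampled rotations.

    `energies[k]` stands in for the predicted energy of the k-th sampled
    rotation; a numerically stable softmax replaces the intractable
    integral over SO(3) with a sum over the samples.
    """
    e = energies - energies.max()  # subtract max for numerical stability
    w = np.exp(e)
    return w / w.sum()

# Three hypotheses, the first clearly preferred.
p = conditional_probs(np.array([2.0, 0.0, -1.0]))
```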
Training. We train our network by maximizing the log-likelihood of the conditional distribution, or equivalently minimizing the negative log-likelihood:

L = −log p(R_2 R_1^T | I_1, I_2),   (2)

where R_1 and R_2 are the ground-truth poses of I_1 and I_2 respectively. Note that while the ‘absolute’ poses are in an arbitrary coordinate system (depending on, e.g., SLAM system outputs), the relative pose R_2 R_1^T between two views is agnostic to this incidental canonical frame. Following (1), we sample multiple candidate rotation matrices to compute the conditional probability.
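In a sample-based implementation, this objective reduces to a cross-entropy over the candidate set when the ground-truth relative rotation is included among the samples. A simplified, self-contained sketch (`gt_index` marks the ground-truth candidate; the energies are placeholder values):

```python
import numpy as np

def nll_loss(energies, gt_index):
    """Negative log-likelihood of the ground-truth relative rotation.

    `energies[k]` stands in for the energy of the k-th sampled rotation,
    with the ground-truth relative rotation included at `gt_index`.
    Equivalent to cross-entropy over the sampled candidates.
    """
    e = energies - energies.max()  # numerically stable log-softmax
    log_probs = e - np.log(np.exp(e).sum())
    return -log_probs[gt_index]

loss = nll_loss(np.array([3.0, 0.5, -1.0]), gt_index=0)
```

Confident, correct predictions drive the loss toward zero, while mass placed on wrong candidates increases it.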
Inference. Recovering the optimal relative transformation from the pose of I_1 to I_2 amounts to optimizing over the space of rotations:

R* = argmax_{R ∈ SO(3)} f(I_1, I_2, R).   (3)

In practice, the landscape of f is often non-smooth, so we find sampling rotations and scoring them with f to be more effective than gradient ascent.
We can also compute the conditional distribution of the relative rotation from I_1 to I_2 by sampling rotations over SO(3). The probability associated with each sampled rotation can be computed using a softmax function, as described in Alg. 1 and derived in (1). Inspired by Implicit-PDF, we visualize the distribution of rotations by projecting the rotation matrices onto a 2-sphere using pitch and yaw and coloring each point according to roll. See Fig. 3 and the supplement for sample results.
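One simple way to generate such rotation samples, shown here as a stand-in for the equivolumetric grid used in practice, is to draw approximately uniform rotations from random unit quaternions:

```python
import numpy as np

def sample_rotations(n, seed=None):
    """Sample n rotation matrices approximately uniformly on SO(3).

    Normalized Gaussian 4-vectors are uniformly distributed unit
    quaternions, which map to uniformly distributed rotations; a simple
    alternative to a deterministic equivolumetric grid.
    """
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n, 4))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    w, x, y, z = q.T
    # Standard quaternion-to-matrix conversion, batched over n samples.
    return np.stack([
        np.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],     -1),
        np.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],     -1),
        np.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)], -1),
    ], axis=1)

R = sample_rotations(1000, seed=0)
```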
3.2 Recovering Joint Poses
In the previous section, we described an energy-based relative pose predictor conditioned on pairs of images. Using this network, we now recover a coherent set of rotations given a set of n images.
Given predictions for relative rotations between every pair of images, we aim to associate each image with an absolute rotation. However, as the set of absolute poses is only determined up to a global rotation, we can fix the pose of the first image to the identity matrix: R_1 = I. We note that the rotations for the other images can then be uniquely induced given any set of relative rotations that spans a tree over the images.
Sequential Chain. Perhaps the simplest way to construct such a tree is to treat the images as part of an ordered sequence. Given R_1 = I, all subsequent poses can be computed using the best-scoring relative pose from the previous image: R_i = R_{(i−1)→i} R_{i−1}, where R_{(i−1)→i} denotes the relative rotation from image I_{i−1} to image I_i. However, this assumes that the images are captured sequentially (e.g., in a video) and may not be applicable in settings such as online marketplaces.
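The chaining step amounts to composing the predicted relative rotations in order; a minimal sketch (`rel_rots` stands in for the best-scoring pairwise predictions):

```python
import numpy as np

def chain_poses(rel_rots):
    """Compose global rotations along a sequential chain.

    `rel_rots[i]` is the relative rotation from image i to image i+1;
    the first camera is fixed to the identity and each subsequent pose is
    obtained by left-multiplying the previous pose by the relative rotation.
    """
    poses = [np.eye(3)]
    for R_rel in rel_rots:
        poses.append(R_rel @ poses[-1])
    return poses

def Rz(t):
    """Rotation by angle t (radians) about the z-axis."""
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0, 0.0, 1.0]])

# Two 90-degree yaw steps compose to a 180-degree rotation.
poses = chain_poses([Rz(np.pi / 2), Rz(np.pi / 2)])
```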
Maximum Spanning Tree. We improve over the naive linear chain by recognizing that some pairs of images may yield more confident predictions. Given n images, we construct a directed graph with n(n−1) edges, where the weight of edge (i, j) is the score of the best-scoring relative rotation between I_i and I_j. We then construct a Maximum Spanning Tree (MST) that covers all images with the most confident set of relative rotations.
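A sketch of this greedy construction using Prim's algorithm; the `scores` matrix is a placeholder for the best pairwise energies (symmetrized here for brevity, whereas the actual graph is directed):

```python
import numpy as np

def max_spanning_tree(scores):
    """Prim's algorithm for a maximum spanning tree over pairwise scores.

    `scores[i, j]` is the confidence of the best relative rotation between
    images i and j. Returns (parent, child) edges; composing relative
    rotations along these edges from the root (image 0) yields a global
    rotation for every image.
    """
    n = len(scores)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Greedily add the highest-scoring edge leaving the current tree.
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: scores[e[0], e[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

scores = np.array([[0.0, 5.0, 1.0],
                   [5.0, 0.0, 4.0],
                   [1.0, 4.0, 0.0]])
edges = max_spanning_tree(scores)
```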
Reasoning over all images jointly. Both of the previous methods, which select a subset of edges, do not perform any joint reasoning and discard all but the highest scoring mode for each pair of images. Instead, we can take advantage of our energy-based formulation to enforce global consistency.
Given our pairwise conditional probabilities, we can define a joint distribution over the set of rotations:

p(R_1, …, R_n | I_1, …, I_n) = (1/Z) ∏_{(i, j) ∈ P} p(R_j R_i^T | I_i, I_j),   (4)

where P is the set of pairwise permutations of the images and Z is the normalizing constant. Intuitively, this corresponds to the distribution modeled by a factor graph with a potential function corresponding to each pairwise edge.
We then aim to find the most likely set of rotations under this conditional joint distribution (assuming R_1 = I). While it is not feasible to analytically obtain the global maximum, we adopt an optimization-based approach and iteratively improve the current estimate. More specifically, we initialize the set of poses with the greedy MST solution and, at each iteration, randomly select a rotation R_k to update. Holding the remaining rotations {R_i}_{i≠k} fixed, we then search for the rotation R_k that maximizes the overall likelihood. We show in the supplementary that this in fact corresponds to computing the most likely hypothesis under the conditional distribution p(R_k | {R_i}_{i≠k}, I_1, …, I_n):

R_k* = argmax_{R_k} ∏_{i ≠ k} p(R_k R_i^T | I_i, I_k) · p(R_i R_k^T | I_k, I_i).   (5)
Analogous to our approach for finding the optimal single relative rotation, we sample multiple hypotheses for the rotation R_k and select the one that maximizes (5). We find that this search-based block coordinate ascent consistently improves over the initial solution while avoiding the local optima that a continuous optimization is susceptible to. We provide pseudo-code in Alg. 2 and visualize one iteration of coordinate ascent in Fig. 4.
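A toy, deterministic sketch of this coordinate-ascent search (the paper updates a randomly chosen camera per iteration; here we sweep cameras cyclically for reproducibility, and `pair_energy` is a hand-crafted stand-in for the learned pairwise energy):

```python
import numpy as np

def coordinate_ascent(poses, pair_energy, candidates, sweeps=5):
    """Search-based block coordinate ascent over global rotations.

    `pair_energy(R_i, R_j, i, j)` stands in for the pairwise energy of the
    hypothesized relative rotation R_j @ R_i^T; `candidates` is a pool of
    sampled rotations. Camera 0 stays fixed; each other camera is
    re-estimated by scoring candidates against all fixed cameras.
    """
    poses = [R.copy() for R in poses]
    n = len(poses)
    for _ in range(sweeps):
        for k in range(1, n):
            def score(R):
                # Sum energies of both edge directions touching camera k.
                return sum(pair_energy(poses[i], R, i, k) +
                           pair_energy(R, poses[i], k, i)
                           for i in range(n) if i != k)
            poses[k] = max(candidates, key=score)
    return poses

def Rz(t):
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0, 0.0, 1.0]])

true = [np.eye(3), Rz(np.pi / 2), Rz(np.pi)]

def toy_energy(R_i, R_j, i, j):
    # Rewards hypotheses whose relative rotation matches the target one.
    return -np.linalg.norm(R_j @ R_i.T - true[j] @ true[i].T)

candidates = [Rz(t) for t in (0.0, np.pi / 2, np.pi, -np.pi / 2)]
poses = coordinate_ascent([np.eye(3)] * 3, toy_energy, candidates)
```

Starting from all-identity poses, the search recovers the target configuration up to the (fixed) global rotation of camera 0.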
3.3 Implementation Details
Network Architecture. We use a ResNet-50 with anti-aliasing to extract image features. We use a lightweight 3-layer MLP that takes in a concatenation of the two sets of image features and a candidate rotation matrix to predict the energy. We apply positional encoding [35, 62] directly to the flattened rotation matrix, similar to Implicit-PDF. See the supplement for architecture diagrams.
Number of Rotation Samples. We use equivolumetric sampling to generate query rotations (37k rotations in total) during training. For each iteration of coordinate ascent, we randomly sample 250k rotation matrices. For visualizing distributions, we randomly sample 50k rotations.
Runtime. We train the pairwise estimator with a batch size of 64 images for approximately 2 days on 4 NVIDIA 2080TI GPUs. Inference for 20 images takes around 1-2 seconds to construct an MST and around 2 minutes for 200 iterations of coordinate ascent on a single 2080TI. Note that the runtime of the coordinate ascent scales linearly with the number of images.
4 Experiments
4.1 Experimental Setup
Dataset. We train and test on the Common Objects in 3D (CO3D) dataset, a large-scale dataset consisting of turntable-style videos spanning 51 common object categories. We train on the subset of the dataset that has camera poses, which were acquired by running COLMAP over all frames of each video.
To train our network, we sample random frames and their associated camera poses from each video sequence. We train on 12,299 video sequences (from the train-known split) from 41 categories, holding out 10 categories to test generalization. We evaluate on 1,711 video sequences (from the test-known split) over all 41 trained categories (seen) as well as the 10 held out categories (unseen). The 10 held out categories are: ball, book, couch, frisbee, hotdog, kite, remote, sandwich, skateboard, and suitcase. We selected these categories randomly after excluding some of the categories with the most training images.
Task and Metrics. We consider the task of sparse-view camera pose estimation with 3, 5, 10, and 20 images subsampled from a video sequence. This is highly challenging, especially when the number of images is small, because the ground-truth camera poses have wide baselines.
We consider two possible ways to select frames from a video sequence. First, we can randomly sample a set of indices per video sequence (Random). Alternatively, we can use uniformly-spaced frame indices (Uniform). We note that because CO3D video sequences are commonly taken in a turntable fashion, the uniformly spaced sampling strategy may be more representative of real world distributions of sparse view image sets. We report metrics on both task setups.
Because the global transformation of the camera poses is ambiguous, we evaluate relative rotations between each pair of images. For each pair, we compute the angular difference between the predicted and ground-truth relative rotations using Rodrigues’ formula. We report the proportion of relative rotations that are within 15 and 30 degrees of the ground truth. We note that rotation errors within this range are relatively easy to handle for downstream 3D reconstruction tasks (see Fig. 10 for an example).
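This evaluation protocol can be sketched as follows; `pred` and `gt` are placeholder pose lists, and the angle uses the trace form equivalent to Rodrigues' formula:

```python
import numpy as np

def rotation_angle_deg(R1, R2):
    """Geodesic angle (degrees) between two rotations:
    angle = arccos((trace(R1^T R2) - 1) / 2)."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def accuracy_at(pred, gt, thresh_deg):
    """Fraction of pairwise relative rotations within `thresh_deg` of GT.

    Comparing relative (rather than absolute) rotations makes the metric
    invariant to the unknown global rotation of the pose set.
    """
    n = len(pred)
    errors = [rotation_angle_deg(pred[j] @ pred[i].T, gt[j] @ gt[i].T)
              for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([e <= thresh_deg for e in errors]))

def Rz(t):
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0, 0.0, 1.0]])

# A single pair whose relative rotation is off by 25 degrees.
pred = [np.eye(3), Rz(np.radians(10.0))]
gt = [np.eye(3), Rz(np.radians(35.0))]
```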
Baselines. We compare against DROID-SLAM, a current state-of-the-art SLAM approach that incorporates learning in an optimization framework. Note that DROID-SLAM requires trajectories and camera intrinsics. Thus, we provide the DROID-SLAM baseline with sorted frame indices and intrinsics, but do not provide these to any other method.
We also compare with a state-of-the-art structure-from-motion pipeline that uses COLMAP with SuperPoint feature extraction and SuperGlue matching, using a publicly available implementation. For instances for which COLMAP does not converge or is unable to localize some cameras, we treat the missing poses as identity rotations for evaluation. We note that DROID-SLAM also outputs approximately identity rotations when the optimization fails.
Ablations. In the spirit of learning-based solutions that directly regress pose, we train a network that predicts the relative rotation directly given two images. Similar to our energy-based predictor, we pass the concatenated image features from a ResNet-50 into an MLP. We double the number of layers from 3 to 6 and add a skip connection to give this network increased capacity. Rotations are predicted using the continuous 6D rotation representation. See the supplement for additional architecture details. The relative pose regressor cannot directly predict poses for more than two images. To recover sets of poses from sets of images, we use the MST graph recovered by our method to link the pairs of relative rotations (we find that this performs better than linking the relative rotations sequentially).
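For reference, the continuous 6D representation maps a 6-vector to a rotation via Gram–Schmidt orthogonalization; a minimal sketch of the mapping (not the baseline's full architecture):

```python
import numpy as np

def rotation_from_6d(x):
    """Map a 6D vector to a rotation matrix via Gram-Schmidt.

    The two 3-vectors in `x` are orthonormalized and completed with a
    cross product, yielding a valid rotation for (almost) any input; this
    continuity is what makes the representation regression-friendly.
    """
    a1, a2 = x[:3], x[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - (b1 @ a2) * b1        # remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)           # complete the orthonormal frame
    return np.stack([b1, b2, b3])

R = rotation_from_6d(np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
```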
To demonstrate the benefits of joint reasoning, we additionally report the performance of our method using the greedy Maximum Spanning Tree (MST) solution. The performance of the sequential solution is in the supplement.
4.2 Quantitative Evaluation
We evaluate all approaches on sparse-view camera pose estimation by averaging over all seen categories in Fig. 6. We find that our approach outperforms all baselines when fewer than 20 images are available. Correspondence-based approaches (DROID-SLAM and COLMAP) do not work until roughly 20 images, at which point the frames have sufficient overlap for local correspondences. However, real-world multi-view data (e.g., marketplace images) typically contain far fewer images. We find that coordinate ascent helps our approach scale with more image frames, whereas the greedy maximum spanning tree accumulates errors as frames are added.
Directly predicting relative poses does not perform well, possibly because pose regression cannot model multiple modes, which is important for symmetrical objects. We visualize the performance for four categories in Fig. 7. We find that the performance gap between our approach and direct regression is larger for objects with some symmetry (car, hydrant) than for objects without symmetry (chair, plant). Moreover, unlike our energy-based approach that models a joint distribution, a regression-based method does not allow similar joint reasoning.
We also test the generalization of our approach on unseen categories in Fig. 8. We find that our method still significantly outperforms all other approaches given sparse views, even for never-before-seen object categories, indicating its ability to handle generic objects beyond those seen in training. The per-category evaluations for both seen and unseen categories are in the supplement.
Novel View Registration. In our standard SfM-inspired task setup, we aim to recover n camera poses given n images. Intuitively, adding images reduces ambiguity, but recovering additional cameras is also more challenging. To disambiguate between the two, we evaluate the task of registering new views given previously aligned images in Fig. 9. Given n images, of which n−1 have aligned cameras, we use our energy-based predictor to recover the remaining camera (equivalent to one iteration of coordinate ascent). We find that adding images improves accuracy, suggesting that additional views can reduce ambiguity.
4.3 Qualitative Results
We show qualitative results on the outputs of our pairwise predictor in Fig. 3. The visualized distributions suggest that our model is learning useful information about symmetry and can model multiple modes even for unseen categories.
We visualize predicted camera poses for DROID-SLAM, COLMAP, and our method with coordinate ascent in Fig. 5. Unable to bridge the domain gap from narrow-baseline video frames, DROID-SLAM often gets stuck while tracking the trajectory. Although COLMAP sometimes fails to converge, it performs well for n = 20. Our approach consistently outputs plausible interpretations but is unable to achieve precise localization. See the supplementary for visualizations on randomly selected sequences and more category-specific discussion.
We also validate that our camera pose estimates can be used for downstream 3D reconstruction. We use our camera poses to initialize NeRS, a representative sparse-view surface-based approach that requires a (noisy) camera initialization. Using our cameras, we successfully reconstruct a 3D model of a fire hydrant from 7 images and a motorbike from 4 images in Fig. 10. Note that the camera pose initialization in the original NeRS paper was manually selected.
5 Discussion
We presented a prediction-based approach for estimating camera rotations given (a sparse set of) images of a generic object. Our energy-based formulation allows capturing the underlying uncertainty in relative poses, while also enabling joint reasoning over multiple images. We believe our system's robustness under sparse views can allow it to serve as a stepping stone for initializing (neural) reconstruction methods in the wild, but we also note several open challenges. First, our work reasons about the joint distribution using only pairwise potentials, and developing efficient higher-order energy models may further improve performance. Moreover, while we outperform existing techniques given sparse views, correspondence-driven methods are more accurate given a large number of views, and we hope future efforts can unify the two approaches. Finally, our approach may not be directly applicable to reasoning about camera transformations for arbitrary scenes, where modeling camera translation would be more important than in object-centric images.
Acknowledgements. We would like to thank Gengshan Yang, Jonathon Luiten, Brian Okorn, and Elliot Wu for helpful feedback and discussion. This work was supported in part by the NSF GFRP (Grant No. DGE1745016), Singapore DSTA, and CMU Argo AI Center for Autonomous Vehicle Research.
6 Supplementary Materials
In this section, we show that maximizing the conditional distribution of an update to a hypothesis is equivalent to maximizing the joint likelihood in Sec. 6.1. We evaluate ablations of our approach to validate the use of coordinate ascent vs gradient ascent and MST vs sequential loop in Tab. 1. To test the quality of our SLAM and SfM baselines, we also ran them with more image frames (narrower baseline) in Fig. 11. We show per-category evaluations to compare performance across seen and unseen categories of CO3D in Tab. 2. We provide a visualization of how to interpret the relative rotations in Fig. 12 and discuss the coordinate system in which we compute relative rotations in Fig. 13. We discuss the learned symmetry modes as well as some failure modes in Fig. 14. As a proof of concept, we use our energy-based predictor on a deformable object (cat) in Fig. 15. We include architecture diagrams for our energy-based pairwise pose predictor in Fig. 16 and the direct pose predictor baseline in Fig. 17. Finally, we show qualitative comparisons between our approach and the correspondence-based baselines on randomly selected sequences on both seen and unseen categories in Fig. 18 and Fig. 19 respectively.
6.1 Derivation of Conditional Distribution for Coordinate Ascent
Given our pairwise conditional probabilities, the joint distribution over a set of rotations can be computed as:

p(R_1, …, R_n | I_1, …, I_n) = (1/Z) ∏_{(i, j) ∈ P} p(R_j R_i^T | I_i, I_j).

We are searching for the most likely set of rotations under this joint distribution given images I_1, …, I_n. For each iteration of coordinate ascent, we have our current most likely set of rotations {R_i} and wish to update R_k. If we fix all {R_i}_{i≠k}, the only terms in the joint distribution that can change are the ones involving R_k, and the rest can be folded into a scalar constant. Thus, searching for the rotation R_k that maximizes the overall likelihood is equivalent to finding the most likely hypothesis under p(R_k | {R_i}_{i≠k}, I_1, …, I_n):

R_k* = argmax_{R_k} ∏_{i ≠ k} p(R_k R_i^T | I_i, I_k) · p(R_i R_k^T | I_k, I_i).

This simplifies each iteration of coordinate ascent from a sum over all O(n²) pairwise terms to a sum over only the O(n) terms involving R_k.
|Method|Acc. @ 30 (%)| | | |
|Ours (Grad. Asc.)|0.52|0.51|0.49|0.47|
|Ours (Coord. Asc.)|0.59|0.58|0.59|0.59|
References
-  (2018) RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. In ECCV, Cited by: §2.
-  (2006) SURF: Speeded Up Robust Features. In ECCV, Cited by: §2.
-  (2016) Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In CVPR, Cited by: §2.
-  (2020) EfficientPose: An Efficient, Accurate and Scalable End-to-end 6D Multi Object Pose Estimation Approach. arXiv:2011.04307. Cited by: §2.
-  (2021) ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. T-RO 37 (6), pp. 1874–1890. Cited by: §2.
-  (2015) Initialization Techniques for 3D SLAM: A Survey on Rotation Estimation and its Use in Pose Graph Optimization. ICRA. Cited by: §3.
-  (2022) Occlusion-Robust Object Pose Estimation with Holistic Representation. In WACV, Cited by: §2.
-  (2021) Wide-Baseline Relative Camera Pose Estimation with Directional Learning. In CVPR, Cited by: §2.
-  (2016) Universal Correspondence Network. NeurIPS. Cited by: §2.
-  (2018) Pose Estimation for Objects with Rotational Symmetry. In IROS, Cited by: §2.
-  (2007) MonoSLAM: Real-time Single Camera SLAM. TPAMI 29 (6), pp. 1052–1067. Cited by: §2.
-  (2019) PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking. In RSS, Cited by: §2.
-  (2020) Self-supervised 6D Object Pose Estimation for Robot Manipulation. In ICRA, Cited by: §2.
-  (2018) SuperPoint: Self-supervised Interest Point Detection and Description. In CVPR-W, Cited by: §2, §4.1.
-  (2019) D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In CVPR, Cited by: §2.
-  (2020) Multi-view Optimization of Local Feature Geometry. In ECCV, Cited by: §2.
-  (2018) Direct Sparse Odometry. TPAMI. Cited by: §2.
-  (2010) Towards Internet-scale Multi-view Stereo. In CVPR, Cited by: §2.
-  (2019) Deep Orientation Uncertainty Learning Based on a Bingham Loss. In ICLR, Cited by: §2.
-  (2022) Differentiable Stereopsis: Meshes from Multiple Views Using Differentiable Rendering. In CVPR, Cited by: §1.
-  (1988) A Combined Corner and Edge Detector. In Alvey Vision Conference, Cited by: §2.
-  (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §3.3, Figure 16.
-  (2021) RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering. In ICCV, Cited by: §2.
-  (2017) SSD-6D: Making RGB-based 3D Detection and 6D Pose Estimation Great Again. In ICCV, Cited by: §2.
-  (2016) Modelling Uncertainty in Deep Learning for Camera Relocalization. In ICRA, Cited by: §2.
-  (2015) PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In ICCV, Cited by: §2.
-  (2021) BARF: Bundle-Adjusting Neural Radiance Fields. In ICCV, Cited by: §1, §3.
-  (2021) Pixel-Perfect Structure-from-Motion with Featuremetric Refinement. In ICCV, Cited by: §2.
-  (2010) SIFT Flow: Dense Correspondence Across Scenes and Its Applications. TPAMI 33 (5), pp. 978–994. Cited by: §2.
-  (2004) Distinctive Image Features from Scale-invariant Keypoints. IJCV 60 (2), pp. 91–110. Cited by: §2.
-  (1981) An Iterative Image Registration Technique with an Application to Stereo Vision. In IJCAI, Cited by: §2.
-  (2018) Unsupervised Learning of Depth and Ego-Motion From Monocular Video Using 3D Geometric Constraints. In CVPR, Cited by: §2.
-  (2019) Explaining the Ambiguity of Object Detection and 6D Pose from Visual Data. In ICCV, Cited by: §2.
-  (2017) Relative Camera Pose Estimation Using Convolutional Neural Networks. In ACIVS, Cited by: §2.
-  (2020) NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, Cited by: §3.3.
-  (2020) Probabilistic Orientation Estimation with Matrix Fisher Distributions. In NeurIPS, Cited by: §2.
-  (2015) ORB-SLAM: A Versatile and Accurate Monocular SLAM System. T-RO 31 (5), pp. 1147–1163. Cited by: §2.
-  (2017) ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. T-RO 33 (5), pp. 1255–1262. Cited by: §2.
-  (2021) Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold. In ICML, Cited by: §2, §2, Figure 3, §3.1, §3.1, §3.3, §3.3, Figure 12.
-  (2011) DTAM: Dense Tracking and Mapping in Real-time. In ICCV, Cited by: §2.
-  (2017) Learning 3D Object Categories by Looking Around Them. In ICCV, Cited by: §2.
-  (2019) C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion. In ICCV, Cited by: §2.
-  (2018) Making Deep Heatmaps Robust to Partial Occlusions for 3D Object Pose Estimation. In ECCV, Cited by: §2.
-  (2021) ZePHyR: Zero-shot Pose Hypothesis Scoring. In ICRA, Cited by: §2.
-  (2020) Learning Orientation Distributions for Object Pose Estimation. In IROS, Cited by: §2.
-  (2020) Online Invariance Selection for Local Feature Descriptors. In ECCV, Cited by: §2.
-  (2018) Deep Directional Statistics: Pose Estimation with Uncertainty Quantification. In ECCV, Cited by: §2.
-  (2021) Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In ICCV, Cited by: §4.1.
-  (2019) R2D2: Reliable and Repeatable Detector and Descriptor. In NeurIPS, Cited by: §2.
-  (1840) Des lois géométriques qui régissent les déplacements d’un système solide dans l’espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire [On the geometric laws that govern the displacements of a solid system in space, and the variation of coordinates resulting from these displacements considered independently of the causes that may produce them]. Journal de Mathématiques Pures et Appliquées 5. Cited by: §4.1.
-  (2020) Kimera: an Open-Source Library for Real-Time Metric-Semantic Localization and Mapping. In ICRA, Cited by: §2.
-  (2019) From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, Cited by: §2, §4.1.
-  (2020) SuperGlue: Learning Feature Matching with Graph Neural Networks. In CVPR, Cited by: §2, §4.1.
-  (2016) Structure-from-Motion Revisited. In CVPR, Cited by: §2, §4.1, §4.1.
-  (2016) Pixelwise View Selection for Unstructured Multi-View Stereo. In ECCV, Cited by: §2.
-  (2019) BAD SLAM: Bundle Adjusted Direct RGB-D SLAM. In CVPR, Cited by: §2.
-  (2014) Learning Local Feature Descriptors Using Convex Optimisation. TPAMI 36 (8), pp. 1573–1585. Cited by: §2.
-  (2006) Photo Tourism: Exploring Photo Collections in 3D. In SIGGRAPH, Cited by: §2.
-  (2020) HybridPose: 6D Object Pose Estimation under Hybrid Representations. In CVPR, Cited by: §2.
-  (2018) Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling. In CVPR, Cited by: §1.
-  (2018) Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. In ECCV, Cited by: §2.
-  (2020) Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In NeurIPS, Cited by: §3.3, Figure 16.
-  (2019) BA-Net: Dense Bundle Adjustment Network. In ICLR, Cited by: §2.
-  (2021) DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In NeurIPS, Cited by: §2, §4.1, Figure 11.
-  (2018) Real-Time Seamless Single Shot 6D Object Pose Prediction. In CVPR, Cited by: §2.
-  (2009) Daisy: An Efficient Dense Descriptor Applied to Wide-baseline Stereo. TPAMI 32 (5), pp. 815–830. Cited by: §2.
-  (1999) Bundle Adjustment—A Modern Synthesis. In International workshop on vision algorithms, Cited by: §2.
-  (2020) GLU-Net: Global-Local Universal Network for Dense Flow and Correspondences. In CVPR, Cited by: §2.
-  (2017) DeMoN: Depth and Motion Network for Learning Monocular Stereo. In CVPR, Cited by: §2.
-  (2017) SfM-Net: Learning of Structure and Motion from Video. arXiv:1704.07804. Cited by: §2.
-  (2019) DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. In CVPR, Cited by: §2.
-  (2020) Learning Feature Descriptors Using Camera Pose Supervision. In ECCV, Cited by: §2.
-  (2017) DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In ICRA, Cited by: §2.
-  (2020) TartanVO: A Generalizable Learning-based VO. In CoRL, Cited by: §2.
-  (2020) DeepSFM: Structure From Motion Via Deep Bundle Adjustment. In ECCV, Cited by: §2.
-  (2017) SegICP: Integrated Deep Semantic Segmentation and Pose Estimation. In IROS, Cited by: §2.
-  (2018) PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In RSS, Cited by: §2.
-  (2019) Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects. In BMVC, Cited by: §2.
-  (2016) LIFT: Learned Invariant Feature Transform. In ECCV, Cited by: §2.
-  (2018) GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In CVPR, Cited by: §2.
-  (2020) Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild. In ECCV, Cited by: §2.
-  (2021) NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild. In NeurIPS, Cited by: §1, §3, Figure 10, §4.3.
-  (2019) Making Convolutional Networks Shift-Invariant Again. In ICML, Cited by: §3.3, Figure 16.
-  (2018) DeepTAM: Deep Tracking and Mapping. In ECCV, Cited by: §2.
-  (2017) Unsupervised Learning of Depth and Ego-Motion From Video. In CVPR, Cited by: §2.
-  (2019) On the Continuity of Rotation Representations in Neural Networks. In CVPR, Cited by: §4.1, Figure 17.
-  (2020) Direct Sparse Mapping. T-RO. Cited by: §2.