1 Introduction
Recovering 3D structure from 2D images of an object has been a central task in computer vision for decades. Given multiple views, structure-from-motion (SfM) based methods can infer a 3D representation of the underlying instance while also associating each image with a camera viewpoint. However, these correspondence-driven methods cannot robustly handle sparsely sampled images with minimal overlap, and typically require many (20 or more) images for 360-degree 3D inference. Unfortunately, this requirement of densely sampled views can be prohibitive: online marketplaces often have only a few images per instance, and a user casually reconstructing a novel object would likewise find capturing such views tedious. Although recently emerging neural 3D reconstruction techniques also typically leverage similarly dense views, some works have shown promise that a far smaller number of images can suffice for high-quality 3D reconstruction. These successes have, however, still relied on precisely [60] or approximately [27, 82, 20] known camera viewpoints for inference. To apply these methods at scale, we must therefore answer a fundamental question: given sparsely sampled images of a generic object, how can we obtain the associated camera viewpoints?
Existing methods do not provide a conclusive answer to this question. On the one hand, bottom-up correspondence-based techniques are not robustly applicable for sparse-view inference. On the other, recent neural multi-view methods can refine approximately known camera poses but provide no mechanism to obtain them in the first place. In this work, our goal is to fill this void and develop a method that, given a small number of unposed images of a generic object, can associate them with (approximate) camera viewpoints. Towards this goal, we focus on inferring the camera rotation matrix corresponding to each input image and propose a top-down approach to predict these. However, we note that the ‘absolute’ rotation is not well-defined given an image of a generic object: it assumes a ‘canonical’ pose which is not always known a priori (e.g., what is the identity rotation for a pen? or a plant?). In contrast, the relative rotation between two views is well-defined even if a canonical pose for the instance is not. Thus, instead of adopting the common paradigm of single-image pose prediction, we learn to estimate the relative pose given a pair of input images. We propose a system that leverages such pairwise predictions to infer a consistent set of global rotations given multiple images of a generic object.
A key technical question that we consider is the formulation of such pairwise pose estimation. Given two informative views of a rotationally asymmetric object, a regression-based approach may be able to accurately predict their relative transformation. The general case, however, can be more challenging: given two views of a cup with the handle visible in only one, the relative pose is ambiguous given just the two images. To capture this uncertainty, we formulate an energy-based relative pose prediction network that, given two images and a candidate relative rotation, outputs an energy corresponding to the (unnormalized) log-probability of the hypothesis. This probabilistic estimation of relative pose not only makes the learning more stable, but, more importantly, provides a mechanism to estimate a
joint distribution over viewpoints given multiple images. We show that optimizing rotations to improve this joint likelihood yields coherent poses across multiple images and leads to significant improvements over naive approaches that do not consider the joint likelihood.

We train our system using instances from over 40 commonplace object categories and find that not only can it infer accurate (relative) poses for novel instances of these classes, it even generalizes to instances from unseen categories. Our approach can thus be viewed as a stepping stone toward sparse-view 3D reconstruction of generic objects: just as classical techniques provide precise camera poses that (neural) multi-view reconstruction methods can leverage, our work provides a similar, albeit coarser, output that can be used to initialize inference in current (and future) sparse-view reconstruction methods. While our system only outputs camera rotations, we note that a reasonable corresponding translation can easily be initialized assuming object-facing viewpoints, and we show that this suffices in practice for bootstrapping sparse-view reconstruction.
2 Related Work
Structure-from-Motion (SfM). At a high level, structure-from-motion aims to recover 3D geometry and camera parameters from image sets. Classically, this is done by computing local image features [21, 30, 2, 66], finding matches across images [31], and then estimating and verifying epipolar geometry, refined via bundle adjustment [67]. Later works have scaled up the SfM pipeline using sequential algorithms, demonstrating results on hundreds or even thousands of images [58, 18, 55, 54, 52].
The advent of deep learning has augmented various stages of the classical SfM pipeline. Better feature descriptors [14, 57, 72, 79, 15, 46, 49] and improved feature matching [53, 9, 29, 68, 16] have significantly outperformed their handcrafted counterparts. BA-Net [63] and DeepSFM [75] have even replaced the bundle-adjustment process by optimizing over a cost volume. Most recently, Pixel-Perfect SfM [28] uses a feature-metric error to post-process camera poses to achieve sub-pixel accuracy.

While these methods can achieve excellent localization, all of these approaches are bottom-up: they begin with local features that are matched across images. However, matching features requires sufficient overlap between images, which may not be available given wide-baseline views. While our work also aims to localize camera poses given image sets, our approach fundamentally differs because it is top-down and does not rely on low-level correspondences.
Simultaneous Localization and Mapping (SLAM). Closely related is the task of monocular SLAM, which aims to localize the camera and map the surroundings from a video stream. Indirect SLAM methods, similar to SfM, match local features across different images to localize the camera [51, 5, 38, 37]. Direct SLAM methods, on the other hand, optimize camera poses directly over a photometric error [87, 56, 11, 17].
There have also been various attempts to introduce deep learning into SLAM pipelines. As with SfM, learned feature descriptors and matching have improved accuracy on SLAM subproblems and increased robustness. End-to-end deep SLAM methods [84, 40, 73, 74] have improved the robustness of SLAM compared to classical methods, but have generally not closed the gap in performance. One notable exception is the recent DROID-SLAM [64], which combines the robustness of learning-based SLAM with the accuracy of classical SLAM.
These approaches all assume sequential streams and generally rely on matching or otherwise exploiting temporal locality between neighboring frames. In contrast, we make no assumptions about the order of the input images or the amount of overlap between nearby frames.
Single-view Pose Prediction. The task of predicting a (6-DoF) pose from a single image has a long and storied history, the surface of which can barely be scratched in this section. Unlike the relative pose between multiple images, the (absolute) pose given a single image is only well-defined if there exists a canonical coordinate system. Most single-view pose prediction approaches therefore deal with a fixed set of categories, each of which has a canonical coordinate system defined a priori [77, 65, 43, 7, 71, 23, 4, 59, 42, 39, 24, 26]. Other, category-agnostic methods take a 3D mesh or point cloud as input, which provides a local coordinate system [76, 78, 81, 44].
Perhaps most relevant to us are approaches that not only predict pose but also model the inherent uncertainty in the prediction [3, 25, 39, 45, 10, 61, 47, 19, 12, 13, 36, 33]. Like our approach, VpDR-Net [41] uses relative poses as supervision but still predicts absolute pose (with a unimodal Gaussian uncertainty model). ImplicitPDF [39] is the most similar approach to ours and served as an inspiration. Similar to our approach, ImplicitPDF uses a neural network to implicitly represent probability using an energy-based formulation, which elegantly handles symmetries and multimodal distributions. Unlike our approach, ImplicitPDF (and all other single-view pose prediction methods) predicts absolute pose, which does not exist in general for generic or novel categories. Instead, we model probability distributions over relative pose given pairs of images.

Learning-based Relative Pose Prediction. When considering generic scenes, prior works have investigated the task of relative pose prediction given two images. However, these supervised [69] or self-supervised [85, 80, 32, 70] methods typically consider prediction of motion between consecutive frames and are not easily adapted to wide-baseline prediction. While some approaches have investigated wide-baseline prediction [34, 1], regression-based inference does not effectively capture uncertainty, unlike our energy-based model. Perhaps most similar to ours is DirectionNet [8], which also predicts a camera distribution for wide-baseline views. While DirectionNet only uses the expected value of the distribution and thus ignores symmetry, we take advantage of multimodal distributions to improve our joint pose estimation.

3 Method
Given a set of $n$ images $\{I_1, \dots, I_n\}$ depicting a generic object in the wild, we aim to recover a set of rotation matrices $\{R_1, \dots, R_n\}$ such that rotation matrix $R_i$ corresponds to the viewpoint of the camera used to take image $I_i$. Note that while we do not model translation, it can easily be initialized using object-facing viewpoints for 3D object reconstruction [27, 82] or a pose graph for SLAM [6]. We are primarily interested in settings with only sparse views and wide baselines. While bottom-up correspondence-based techniques can reliably recover camera pose given dense views, they do not adapt well to sparse views with minimal overlap. We instead propose a prediction-based top-down approach that can learn and exploit global structure directly.
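The object-facing translation initialization mentioned above can be made concrete: the object center (taken as the world origin) is placed at a fixed depth along each camera's optical axis. A minimal sketch, assuming the extrinsics convention $x_{cam} = R\,x_{world} + t$; the function name and the default distance are illustrative, not the paper's:

```python
import numpy as np

def object_facing_extrinsics(rotation, distance=2.0):
    """Initialize the translation assuming an object-facing viewpoint:
    place the object center (world origin) at a fixed depth along the
    camera's optical axis. With the convention x_cam = R x_world + t,
    setting t = [0, 0, d] puts the origin at depth d in every camera.
    The convention and distance=2.0 are illustrative assumptions."""
    t = np.array([0.0, 0.0, distance])
    camera_center = -rotation.T @ t    # camera position in world coordinates
    return t, camera_center

t, c = object_facing_extrinsics(np.eye(3))
assert np.allclose(c, [0.0, 0.0, -2.0])   # camera on the -z axis, facing origin
```

With this convention, any predicted rotation yields a camera on a sphere of radius `distance` around the object, which matches the turntable-style capture setups considered here.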
The basic building block of our prediction system (visualized in Fig. 3) is a pairwise pose predictor that infers relative camera orientations given pairs of images. However, symmetries in objects and possibly uninformative viewpoints make this an inherently uncertain prediction task. To capture this uncertainty, we propose an energy-based approach that models the multimodal distribution over relative poses given two images.
Given the predicted distributions over pairwise relative rotations, we show that these can be leveraged to induce a joint distribution over all the rotations. Starting with a greedy initialization, we present a coordinate-ascent approach that jointly reasons over and improves the set of inferred rotations. We describe our approach for modeling probability distributions over relative poses between two images in Sec. 3.1, and build on this in Sec. 3.2 to recover a joint set of poses across multiple images. Finally, we discuss implementation details in Sec. 3.3.
3.1 Estimating Pairwise Relative Rotations
Given a pair of images depicting an arbitrary object, we aim to predict a distribution over the relative rotation corresponding to the camera transformation between the two views. As there may be ambiguities when inferring the relative pose given two images, we introduce a formulation that can model uncertainty.
Energy-based Formulation. We wish to model the conditional distribution over a relative rotation matrix $R$ given input images $I_1$ and $I_2$: $p(R \mid I_1, I_2)$. Inspired by recent work on implicitly representing the distribution over rotations using a neural network [39], we propose using an energy-based relative pose estimator. More specifically, we train a network $f$ that learns to predict the energy, or the unnormalized joint log-probability, $f(I_1, I_2, R) = \log p(I_1, I_2, R) + \log c$, where $c$ is the constant of integration. From the product rule, we can recover the conditional probability as a function of $f$:

$p(R \mid I_1, I_2) = \dfrac{\exp f(I_1, I_2, R)}{\int_{R' \in SO(3)} \exp f(I_1, I_2, R') \, dR'}$   (1)
We approximate the integral by marginalizing over a large set of sampled rotations, which avoids having to compute $c$ (see Alg. 1); note that the number of sampled rotations must be large for the approximation to be accurate. It is therefore important to use a lightweight network, since $f$ is queried once per sampled rotation in the denominator.
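As a concrete sketch of this approximation (in the spirit of Alg. 1), the conditional probabilities over a set of sampled rotations reduce to a softmax over the predicted energies; the energies below are toy values standing in for the network's outputs:

```python
import numpy as np

def conditional_probs(energies):
    """Approximate p(R | I1, I2) over a set of sampled rotations via a
    softmax on their predicted energies f(I1, I2, R): the integral in
    Eq. (1) is replaced by a sum over the samples."""
    e = energies - energies.max()   # subtract the max for numerical stability
    w = np.exp(e)
    return w / w.sum()

# Toy energies standing in for the network's outputs on 5 candidate rotations.
probs = conditional_probs(np.array([0.1, 2.0, -1.0, 0.5, 2.0]))
assert np.isclose(probs.sum(), 1.0)   # normalized over the sample set
```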
Training. We train our network by maximizing the log-likelihood of the conditional distribution, or equivalently, minimizing the negative log-likelihood:

$\mathcal{L} = -\log p(R_2 R_1^{\top} \mid I_1, I_2)$   (2)
where $R_1$ and $R_2$ are the ground truth poses of $I_1$ and $I_2$, respectively. Note that while the ‘absolute’ poses are in an arbitrary coordinate system (depending, e.g., on SLAM system outputs), the relative pose between two views is agnostic to this incidental canonical frame. Following (1), we sample multiple candidate rotation matrices to approximate the conditional probability.
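The training objective can then be sketched as a cross-entropy over the sampled candidates, with the ground-truth relative rotation among them; `nll_loss` is an illustrative helper (toy energies stand in for network outputs), not the paper's code:

```python
import numpy as np

def nll_loss(energies, gt_index):
    """Negative log-likelihood of Eq. (2): the energy of the ground-truth
    relative rotation is normalized against all sampled candidates, making
    the loss a cross-entropy over the sample set."""
    m = energies.max()
    log_z = m + np.log(np.sum(np.exp(energies - m)))   # stable log-sum-exp
    return -(energies[gt_index] - log_z)

# Toy energies: raising the ground-truth energy (index 0) lowers the loss.
assert nll_loss(np.array([3.0, 0.0, -1.0, 0.5]), 0) > 0
assert nll_loss(np.array([10.0, 0.0, 0.0]), 0) < nll_loss(np.array([1.0, 0.0, 0.0]), 0)
```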
Inference. Recovering the optimal transformation from the pose of $I_1$ to that of $I_2$ amounts to optimizing over the space of rotations:

$\hat{R} = \arg\max_{R \in SO(3)} f(I_1, I_2, R)$   (3)
In practice, the loss landscape of $f$ is often non-smooth, so we find sampling rotations and scoring them with $f$ to be more effective than gradient ascent.
We can also compute the conditional distribution of the relative rotation from $I_1$ to $I_2$ by sampling rotations over $SO(3)$. The probability associated with each sampled rotation can be computed using a softmax function, as described in Alg. 1 and derived in (1). Inspired by [39], we visualize the distribution over rotations by projecting the rotation matrices onto a 2-sphere using pitch and yaw, coloring each point by roll. See Fig. 3 and the supplement for sample results.
3.2 Recovering Joint Poses
In the previous section, we described an energy-based relative pose predictor conditioned on pairs of images. We now use this network to recover a coherent set of rotations given a set of images.
Greedy Initialization. Given predictions for relative rotations between every pair of images, we aim to associate each image with an absolute rotation. However, as the set of poses is only defined up to a global rotation, we can fix the pose of the first image to the identity: $R_1 = I$. We note that the rotations for the other images can then be uniquely induced given any set of relative rotations that spans a tree.

Sequential Chain. Perhaps the simplest way to construct such a tree is to treat the images as an ordered sequence. Given $R_1$, all subsequent poses can be computed using the best-scoring relative pose from the previous image: $R_{i+1} = \hat{R}_{i \to i+1} R_i$, denoting $\hat{R}_{i \to j}$ as the predicted relative rotation from $I_i$ to $I_j$. However, this assumes that the images are captured sequentially (e.g., in a video) and may not be applicable in settings such as online marketplaces.
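The sequential-chain initialization can be sketched by composing the best-scoring relative rotations along the sequence; the 90-degree yaw steps below are purely illustrative:

```python
import numpy as np

def chain_rotations(relative_rots):
    """Sequential-chain initialization: R_1 = I and R_{i+1} = Rhat * R_i,
    where relative_rots[i] is the best-scoring relative rotation between
    consecutive images (a stand-in for the pairwise predictor's argmax)."""
    rots = [np.eye(3)]
    for r in relative_rots:
        rots.append(r @ rots[-1])
    return rots

# 90-degree yaw steps about the vertical axis: four steps return to identity.
yaw90 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
rots = chain_rotations([yaw90] * 4)
assert np.allclose(rots[4], np.eye(3))
```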
Maximum Spanning Tree. We improve over the naive linear chain by recognizing that some pairs of images produce more confident predictions than others. Given $n$ images, we construct a directed graph with $n(n-1)$ edges, where the weight of edge $(i, j)$ is the score of the most likely relative rotation, $\max_R f(I_i, I_j, R)$. We then construct a Maximum Spanning Tree (MST) that covers all images with the most confident set of relative rotations.
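A sketch of the MST construction using Prim's algorithm on a toy confidence matrix; the paper does not specify a particular MST algorithm, so any maximum spanning tree construction would serve:

```python
import numpy as np

def max_spanning_tree(conf):
    """Prim's algorithm over pairwise confidence scores conf[i, j]
    (e.g. the best pairwise energy, max_R f(I_i, I_j, R)); returns n - 1
    directed edges covering all images with the most confident
    relative-pose predictions."""
    n = conf.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: conf[e])
        edges.append(best)         # rotations then propagate from i to j
        in_tree.add(best[1])
    return edges

# Toy confidence matrix over 3 images: the tree keeps edges (0,1) and (1,2),
# skipping the low-confidence pair (0,2).
conf = np.array([[0., 5., 1.],
                 [5., 0., 4.],
                 [1., 4., 0.]])
assert sorted(max_spanning_tree(conf)) == [(0, 1), (1, 2)]
```

Walking the resulting tree from the root and composing the corresponding best-scoring relative rotations then yields the greedy global initialization.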
Reasoning over all images jointly. Neither of the previous methods performs any joint reasoning: both select a subset of edges and discard all but the highest-scoring mode for each pair of images. Instead, we can take advantage of our energy-based formulation to enforce global consistency.
Given our pairwise conditional probabilities, we can define a joint distribution over the set of rotations:

$p(R_1, \dots, R_n \mid I_1, \dots, I_n) = \dfrac{1}{Z} \prod_{(i,j) \in E} \exp f(I_i, I_j, R_j R_i^{\top})$   (4)

where $E$ is the set of pairwise permutations and $Z$ is the normalizing constant. Intuitively, this corresponds to the distribution modeled by a factor graph with a potential function for each pairwise edge.
We then aim to find the most likely set of rotations under this conditional joint distribution (assuming $R_1 = I$). While it is not feasible to analytically obtain the global maximum, we adopt an optimization-based approach and iteratively improve the current estimate. More specifically, we initialize the set of poses with the greedy MST solution and, at each iteration, randomly select a rotation $R_k$ to update. Assuming fixed values for $\{R_i\}_{i \neq k}$, we then search for the rotation $R_k$ that maximizes the overall likelihood. We show in the supplementary that this corresponds to computing the most likely hypothesis under the distribution $p(R_k \mid \{R_i\}_{i \neq k}, I_1, \dots, I_n)$:

$R_k^* = \arg\max_{R_k} \sum_{i \neq k} \left[ f(I_i, I_k, R_k R_i^{\top}) + f(I_k, I_i, R_i R_k^{\top}) \right]$   (5)
Analogous to our approach for finding the optimal single relative rotation, we sample multiple hypotheses for the rotation $R_k$ and select the one that maximizes (5). We find that this search-based block coordinate ascent consistently improves over the initial solution while avoiding the local optima that a continuous optimization is susceptible to. We provide pseudocode in Alg. 2 and visualize one iteration of coordinate ascent in Fig. 4.
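One such block-coordinate-ascent update can be sketched as follows; `pair_energy` is a stand-in for the trained network $f$, and in practice the candidate set would be the large pool of randomly sampled rotations described in Sec. 3.3:

```python
import numpy as np

def update_rotation(k, rots, candidates, pair_energy):
    """One block-coordinate-ascent step (Eq. 5): holding all rotations
    except R_k fixed, score each candidate R_k by summing the pairwise
    energies over every directed edge touching image k, and keep the best.
    pair_energy(i, j, R) is a stand-in for the trained network f."""
    def score(r_k):
        total = 0.0
        for i, r_i in enumerate(rots):
            if i == k:
                continue
            total += pair_energy(i, k, r_k @ r_i.T)   # edge i -> k
            total += pair_energy(k, i, r_i @ r_k.T)   # edge k -> i
        return total
    return max(candidates, key=score)

# Toy check: with R0 = I fixed and an energy that rewards matching the
# true relative rotation, the update recovers the true R1.
yaw90 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
true = [np.eye(3), yaw90]
def pair_energy(i, j, r):
    return -np.linalg.norm(r - true[j] @ true[i].T)
best = update_rotation(1, [np.eye(3), np.eye(3)], [np.eye(3), yaw90], pair_energy)
assert np.allclose(best, yaw90)
```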
3.3 Implementation Details
Network Architecture. We use a ResNet-50 [22] with anti-aliasing [83] to extract image features. A lightweight 3-layer MLP takes in a concatenation of the two sets of image features and a rotation matrix, and predicts the energy. We apply positional encoding [35, 62] directly to the flattened rotation matrix, similar to [39]. See the supplement for architecture diagrams.
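The positional encoding on the flattened rotation matrix can be sketched as below; the number of frequencies is an illustrative choice, as the paper defers exact settings to the supplement:

```python
import numpy as np

def positional_encoding(rotation, num_freqs=4):
    """Positional encoding applied to the flattened 3x3 rotation matrix:
    each entry x is mapped to [sin(2^k * pi * x), cos(2^k * pi * x)] for
    k = 0 .. num_freqs - 1 (num_freqs=4 is an illustrative choice)."""
    x = rotation.reshape(-1)                           # 9 matrix entries
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # pi, 2pi, 4pi, 8pi
    angles = x[:, None] * freqs[None, :]               # shape (9, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

feat = positional_encoding(np.eye(3))
assert feat.shape == (9 * 2 * 4,)                      # 72-dim embedding
```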
Number of Rotation Samples. We use the equivolumetric sampling of [39] to generate query rotations (37k rotations in total) during training. For each iteration of coordinate ascent, we randomly sample 250k rotation matrices. For visualizing distributions, we randomly sample 50k rotations.
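For reference, uniformly random rotations can be sampled by normalizing 4D Gaussian vectors into unit quaternions, a standard construction; note the paper's training-time equivolumetric grid of [39] is a different, deterministic sampling:

```python
import numpy as np

def random_rotations(n, rng=np.random.default_rng(0)):
    """Sample rotation matrices uniformly over SO(3) by normalizing
    4D Gaussian samples into unit quaternions (w, x, y, z), then
    converting to 3x3 matrices with the standard quaternion formula."""
    q = rng.normal(size=(n, 4))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    w, x, y, z = q.T
    return np.stack([
        np.stack([1 - 2*(y**2 + z**2), 2*(x*y - w*z),       2*(x*z + w*y)],       -1),
        np.stack([2*(x*y + w*z),       1 - 2*(x**2 + z**2), 2*(y*z - w*x)],       -1),
        np.stack([2*(x*z - w*y),       2*(y*z + w*x),       1 - 2*(x**2 + y**2)], -1),
    ], axis=1)

R = random_rotations(5)
# Valid rotations: orthogonal with determinant +1.
assert np.allclose(R @ R.transpose(0, 2, 1), np.eye(3), atol=1e-8)
assert np.allclose(np.linalg.det(R), 1.0)
```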
Runtime. We train the pairwise estimator with a batch size of 64 images for approximately 2 days on 4 NVIDIA 2080 Ti GPUs. Inference for 20 images takes around 12 seconds to construct an MST and around 2 minutes for 200 iterations of coordinate ascent on a single 2080 Ti. Note that the runtime of the coordinate ascent scales linearly with the number of images.
4 Evaluation
We visualize the camera poses (rotations) predicted by DROID-SLAM, COLMAP with SuperPoint/SuperGlue, and our method given sparse image frames. The black cameras correspond to the ground truth. We visualize only the rotations predicted by each method, and set the translation such that the object center is a fixed distance away along the camera axis. As the poses are only defined up to a global rotation, we align the predicted cameras across all methods to the ground truth coordinate system by setting the recovered camera pose for the first image to the corresponding ground truth (visualized in green). Odd rows correspond to randomly sampled image frames, while even rows correspond to uniformly spaced image frames.
4.1 Experimental Setup
Dataset. We train and test on the Common Objects in 3D dataset (CO3D) [48], a large-scale dataset consisting of turntable-style videos of 51 common object categories. We train on the subset of the dataset that has camera poses, which were acquired by running COLMAP [54] over all frames of each video.

To train our network, we sample random frames and their associated camera poses from each video sequence. We train on 12,299 video sequences (from the train-known split) from 41 categories, holding out 10 categories to test generalization. We evaluate on 1,711 video sequences (from the test-known split) over all 41 training categories (seen) as well as the 10 held-out categories (unseen). The 10 held-out categories are: ball, book, couch, frisbee, hotdog, kite, remote, sandwich, skateboard, and suitcase. We selected these categories randomly after excluding some of the categories with the most training images.
Task and Metrics. We consider the task of sparse-view camera pose estimation with $N$ = 3, 5, 10, and 20 images, subsampled from a video sequence. This is highly challenging, especially when $N$ is small, because the ground truth camera poses have wide baselines.
We consider two possible ways to select frames from a video sequence. First, we can randomly sample a set of indices per video sequence (Random). Alternatively, we can use uniformly spaced frame indices (Uniform). We note that because CO3D video sequences are commonly captured in a turntable fashion, the uniformly spaced sampling strategy may be more representative of real-world distributions of sparse-view image sets. We report metrics on both task setups.
Because the global transformation of the camera poses is ambiguous, we evaluate relative rotations between pairs of images. For each pair, we compute the angular difference between the predicted and ground truth relative rotations using Rodrigues' formula [50]. We report the proportion of relative rotations that are within 15 and 30 degrees of the ground truth. We note that rotation errors within this range are relatively easy to handle for downstream 3D reconstruction tasks (see Fig. 10 for an example).
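This metric can be sketched directly: the angular difference between two rotations follows from the trace of their relative rotation, and the accuracy threshold is applied per pair (the error values below are toy numbers for illustration):

```python
import numpy as np

def rotation_angle_deg(r_pred, r_gt):
    """Geodesic angle between two rotations: the angle of R_pred^T R_gt,
    recovered from its trace (equivalent to Rodrigues' formula)."""
    cos = (np.trace(r_pred.T @ r_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

yaw90 = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
assert np.isclose(rotation_angle_deg(np.eye(3), yaw90), 90.0)

# Accuracy@30 over a toy set of per-pair angular errors (degrees):
errors = [5.0, 12.0, 45.0, 28.0]
acc30 = np.mean([e <= 30.0 for e in errors])
assert np.isclose(acc30, 0.75)
```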
Baselines. We compare against DROID-SLAM [64], a current state-of-the-art SLAM approach that incorporates learning in an optimization framework. Note that DROID-SLAM requires ordered trajectories and camera intrinsics. Thus, we provide the DROID-SLAM baseline with sorted frame indices and intrinsics, but do not provide these to any other method.
We also compare with a state-of-the-art structure-from-motion pipeline that uses COLMAP [54] with SuperPoint feature extraction [14] and SuperGlue matching [53]. We use the implementation provided by [52]. For instances on which COLMAP does not converge or is unable to localize some cameras, we treat the missing poses as identity rotations for evaluation. We note that DROID-SLAM also outputs approximately identity rotations when the optimization fails.

Ablations. In the spirit of learning-based solutions that directly regress pose, we train a network that predicts the relative rotation directly given two images. Similar to our energy-based predictor, we pass the concatenated image features from a ResNet-50 into an MLP. We double the number of layers from 3 to 6 and add a skip connection to give this network increased capacity. Rotations are predicted using the 6D rotation representation [86]. See the supplement for additional architecture details. The relative pose regressor cannot directly predict poses for more than two images; to recover sets of poses from sets of images, we use the MST graph recovered by our method to link the pairwise relative rotations (we find that this performs better than linking the relative rotations sequentially).
To demonstrate the benefits of joint reasoning, we additionally report the performance of our method using the greedy Maximum Spanning Tree (MST) solution. The performance of the sequential solution is in the supplement.
4.2 Quantitative Evaluation
We evaluate all approaches on sparse-view camera pose estimation by averaging over all seen categories in Fig. 6. We find that our approach outperforms all baselines in the sparse-view regime. Correspondence-based approaches (DROID-SLAM and COLMAP) do not work until roughly 20 images, at which point image frames have sufficient overlap for local correspondences. However, real-world multi-view data (e.g., marketplace images) typically contain far fewer images. We find that coordinate ascent helps our approach scale with more image frames, whereas the greedy maximum spanning tree accumulates errors with more frames.
Directly regressing relative poses does not perform well, possibly because pose regression cannot model multiple modes, which is important for symmetric objects. We visualize the performance for four categories in Fig. 7 and find that the performance gap between our approach and direct regression is larger for objects with some symmetry (car, hydrant) than for objects without symmetry (chair, plant). Moreover, unlike our energy-based approach that models a joint distribution, a regression-based method does not permit similar joint reasoning.
We also test the generalization of our approach to unseen categories in Fig. 8. We find that our method significantly outperforms all other approaches given sparse views, even for never-before-seen object categories, indicating its ability to handle generic objects beyond the training distribution. Per-category evaluations for both seen and unseen categories are in the supplement.
Novel View Registration. In our standard SfM-inspired task setup, we aim to recover all camera poses given $N$ images. Intuitively, adding images reduces ambiguity, but recovering additional cameras is also more challenging. To disambiguate between the two, we evaluate the task of registering a new view given previously aligned images in Fig. 9. Given $N$ images, $N-1$ of which have aligned cameras, we use our energy-based predictor to recover the remaining camera (equivalent to one iteration of coordinate ascent). We find that adding images improves accuracy, suggesting that additional views can reduce ambiguity.
4.3 Qualitative Results
We show qualitative results on the outputs of our pairwise predictor in Fig. 3. The visualized distributions suggest that our model is learning useful information about symmetry and can model multiple modes even for unseen categories.
We visualize predicted camera poses for DROID-SLAM, COLMAP, and our method with coordinate ascent in Fig. 5. Unable to bridge the domain gap from narrow-baseline video frames, DROID-SLAM often gets stuck along the trajectory. Although COLMAP sometimes fails to converge, it performs well for $N$ = 20. Our approach consistently outputs plausible interpretations but does not achieve precise localization. See the supplementary for visualizations on randomly selected sequences and more category-specific discussion.
We also validate that our camera pose estimates can be used for downstream 3D reconstruction. We use our camera poses to initialize NeRS [82], a representative sparse-view surface-based approach that requires a (noisy) camera initialization. Using our cameras, we successfully reconstruct a 3D model of a fire hydrant from 7 images and a motorbike from 4 images in Fig. 10. Note that the camera pose initialization in the original NeRS paper was manually selected.
5 Discussion
We presented a prediction-based approach for estimating camera rotations given a sparse set of images of a generic object. Our energy-based formulation captures the underlying uncertainty in relative poses while also enabling joint reasoning over multiple images. We believe our system's robustness under sparse views can allow it to serve as a stepping stone for initializing (neural) reconstruction methods in the wild, but we also note several open challenges. First, our work reasons about the joint distribution using only pairwise potentials; developing efficient higher-order energy models may further improve performance. Moreover, while we outperform existing techniques given sparse views, correspondence-driven methods are more accurate given a large number of views, and we hope future efforts can unify the two approaches. Finally, our approach may not be directly applicable to reasoning about camera transformations for arbitrary scenes, where modeling camera translation is more important than in object-centric images.
Acknowledgements. We would like to thank Gengshan Yang, Jonathon Luiten, Brian Okorn, and Elliot Wu for helpful feedback and discussion. This work was supported in part by the NSF GRFP (Grant No. DGE1745016), Singapore DSTA, and the CMU Argo AI Center for Autonomous Vehicle Research.
6 Supplementary Materials
In this section, we show that maximizing the conditional distribution of an update to a hypothesis is equivalent to maximizing the joint likelihood in Sec. 6.1. We evaluate ablations of our approach to validate the use of coordinate ascent vs. gradient ascent and MST vs. sequential chain in Tab. 1. To test the quality of our SLAM and SfM baselines, we also run them with more image frames (narrower baselines) in Fig. 11. We show per-category evaluations to compare performance across seen and unseen categories of CO3D in Tab. 2. We provide a visualization of how to interpret the relative rotations in Fig. 12 and discuss the coordinate system in which we compute relative rotations in Fig. 13. We discuss the learned symmetry modes as well as some failure modes in Fig. 14. As a proof of concept, we apply our energy-based predictor to a deformable object (cat) in Fig. 15. We include architecture diagrams for our energy-based pairwise pose predictor in Fig. 16 and the direct pose predictor baseline in Fig. 17. Finally, we show qualitative comparisons between our approach and the correspondence-based baselines on randomly selected sequences for seen and unseen categories in Fig. 18 and Fig. 19, respectively.
6.1 Derivation of Conditional Distribution for Coordinate Ascent
Given our pairwise conditional probabilities, the joint distribution over a set of rotations can be computed as:

$p(R_1, \dots, R_n \mid I_1, \dots, I_n) = \dfrac{1}{Z} \prod_{(i,j) \in E} \exp f(I_i, I_j, R_{ij})$   (6)

where $R_{ij} = R_j R_i^{\top}$ and $Z$ is the normalizing constant.

We are searching for the most likely set of rotations under this joint distribution given images $I_1, \dots, I_n$. For each iteration of coordinate ascent, we have our current most likely set of rotations $\{R_i\}$ and wish to update $R_k$. If we fix all $R_i$ with $i \neq k$, the only terms in (6) that can change are the ones involving $R_k$, and the rest can be folded into a scalar constant. Thus, searching for the rotation $R_k$ that maximizes the overall likelihood is equivalent to finding the most likely hypothesis under $p(R_k \mid \{R_i\}_{i \neq k}, I_1, \dots, I_n)$:

$R_k^* = \arg\max_{R_k} \log p(R_1, \dots, R_n \mid I_1, \dots, I_n)$   (7)

$\phantom{R_k^*} = \arg\max_{R_k} \sum_{i \neq k} \left[ f(I_i, I_k, R_k R_i^{\top}) + f(I_k, I_i, R_i R_k^{\top}) \right]$   (8)

This simplifies each iteration of coordinate ascent from an $O(n^2)$ sum to an $O(n)$ sum.
Tab. 1: Accuracy (Acc @) of ablations for $N$ = 3, 5, 10, and 20 images.

Method                N=3    N=5    N=10   N=20
Ours (Sequential)     0.50   0.48   0.42   0.39
Ours (MST)            0.52   0.50   0.47   0.43
Ours (Grad. Asc.)     0.52   0.51   0.49   0.47
Ours (Coord. Asc.)    0.59   0.58   0.59   0.59
Tab. 2: Per-category accuracy at 30 degrees (Acc. @ 30, %) for $N$ = 3, 5, 10, and 20 images.

Category         3    5    10   20
Seen Categories
Apple            59   60   62   61
Backpack         63   58   59   57
Banana           67   54   63   55
Baseballbat      100  67   70   73
Baseballglove    48   56   56   55
Bench            69   75   68   66
Bicycle          62   61   63   62
Bottle           59   57   60   60
Bowl             80   75   77   80
Broccoli         55   54   51   51
Cake             46   47   47   54
Car              67   71   70   62
Carrot           60   64   63   65
Cellphone        69   78   72   69
Chair            53   55   55   56
Cup              55   56   54   51
Donut            52   44   51   51
Hairdryer        58   56   58   54
Handbag          66   63   62   61
Hydrant          72   73   68   70
Keyboard         72   73   74   74
Laptop           88   87   89   89
Microwave        56   65   55   58
Motorcycle       59   60   62   61
Mouse            68   70   69   67
Orange           52   52   51   49
Parkingmeter     22   27   23   22
Pizza            50   57   57   55
Plant            46   47   49   51
Stopsign         42   49   47   47
Teddybear        47   52   49   48
Toaster          76   75   71   73
Toilet           76   80   75   77
Toybus           63   70   72   71
Toyplane         43   57   48   51
Toytrain         81   73   75   75
Toytruck         71   69   68   68
Tv               78   83   87   86
Umbrella         58   60   54   55
Vase             58   55   55   51
Wineglass        51   46   46   47
Seen Mean        61   62   61   61
Unseen Categories
Ball             45   41   43   44
Book             51   49   49   47
Couch            42   58   39   35
Frisbee          55   49   40   38
Hotdog           58   61   50   49
Kite             28   23   27   24
Remote           64   58   65   66
Sandwich         37   41   41   42
Skateboard       56   64   64   65
Suitcase         59   61   67   63
Unseen Mean      49   51   48   48
Fig. 16 (caption, excerpt): we concatenate the image features and positionally encoded rotations into a feature vector, which we feed into an MLP that predicts the energy (corresponding to the unnormalized log-probability).

References
 [1] (2018) RelocNet: Continuous Metric Learning Relocalisation using Neural Nets. In ECCV, Cited by: §2.
 [2] (2006) SURF: Speeded Up Robust Features. In ECCV, Cited by: §2.
 [3] (2016) Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image. In CVPR, Cited by: §2.
 [4] (2020) EfficientPose: An Efficient, Accurate and Scalable End-to-end 6D Multi Object Pose Estimation Approach. arXiv:2011.04307. Cited by: §2.
 [5] (2021) ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM. TRO 37 (6), pp. 1874–1890. Cited by: §2.
 [6] (2015) Initialization Techniques for 3D SLAM: A Survey on Rotation Estimation and its Use in Pose Graph Optimization. ICRA. Cited by: §3.
 [7] (2022) Occlusion-Robust Object Pose Estimation with Holistic Representation. In WACV, Cited by: §2.
 [8] (2021) Wide-Baseline Relative Camera Pose Estimation with Directional Learning. In CVPR, Cited by: §2.
 [9] (2016) Universal Correspondence Network. NeurIPS. Cited by: §2.
 [10] (2018) Pose Estimation for Objects with Rotational Symmetry. In IROS, Cited by: §2.
 [11] (2007) MonoSLAM: Realtime Single Camera SLAM. TPAMI 29 (6), pp. 1052–1067. Cited by: §2.
 [12] (2019) PoseRBPF: A RaoBlackwellized Particle Filter for 6D Object Pose Tracking. In RSS, Cited by: §2.
 [13] (2020) Selfsupervised 6D Object Pose Estimation for Robot Manipulation. In ICRA, Cited by: §2.
 [14] (2018) SuperPoint: Selfsupervised Interest Point Detection and Description. In CVPRW, Cited by: §2, §4.1.
 [15] (2019) D2Net: A Trainable CNN for Joint Detection and Description of Local Features. In CVPR, Cited by: §2.
 [16] (2020) Multiview Optimization of Local Feature Geometry. In ECCV, Cited by: §2.
 [17] (2018) Direct Sparse Odometry. TPAMI. Cited by: §2.
 [18] (2010) Towards Internetscale Multiview Stereo. In CVPR, Cited by: §2.
 [19] (2019) Deep Orientation Uncertainty Learning Based on a Bingham Loss. In ICLR, Cited by: §2.
 [20] (2022) Differentiable Stereopsis: Meshes from Multiple Views Using Differentiable Rendering. In CVPR, Cited by: §1.
 [21] (1988) A Combined Corner and Edge Detector. In Alvey Vision Conference, Cited by: §2.
 [22] (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §3.3, Figure 16.
 [23] (2021) RePOSE: Fast 6D Object Pose Refinement via Deep Texture Rendering. In ICCV, Cited by: §2.
 [24] (2017) SSD6D: Making RGBbased 3D Detection and 6D Pose Estimation Great Again. In ICCV, Cited by: §2.
 [25] (2016) Modelling Uncertainty in Deep Learning for Camera Relocalization. In ICRA, Cited by: §2.
 [26] (2015) PoseNet: A Convolutional Network for RealTime 6DOF Camera Relocalization. In ICCV, Cited by: §2.
 [27] (2021) BARF: BundleAdjusting Neural Radiance Fields. In ICCV, Cited by: §1, §3.
 [28] (2021) PixelPerfect StructurefromMotion with Featuremetric Refinement. In ICCV, Cited by: §2.
 [29] (2010) SIFT Flow: Dense Correspondence Across Scenes and Its Applications. TPAMI 33 (5), pp. 978–994. Cited by: §2.
 [30] (2004) Distinctive Image Features from Scaleinvariant Keypoints. IJCV 60 (2), pp. 91–110. Cited by: §2.
 [31] (1981) An Iterative Image Registration Technique with an Application to Stereo Vision. In IJCAI, Cited by: §2.
 [32] (2018) Unsupervised Learning of Depth and EgoMotion From Monocular Video Using 3D Geometric Constraints. In CVPR, Cited by: §2.
 [33] (2019) Explaining the Ambiguity of Object Detection and 6D Pose from Visual Data. In ICCV, Cited by: §2.

[34]
(2017)
Relative Camera Pose Estimation Using Convolutional Neural Networks
. In ACIVS, Cited by: §2.  [35] (2020) NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, Cited by: §3.3.
 [36] (2020) Probabilistic Orientation Estimation with Matrix Fisher Distributions. NeurIPS. Cited by: §2.
 [37] (2015) ORBSLAM: A Versatile and Accurate Monocular SLAM System. TRO 31 (5), pp. 1147–1163. Cited by: §2.
 [38] (2017) ORBSLAM2: An OpenSource SLAM System for Monocular, Stereo and RGBD Cameras. TRO 33 (5), pp. 1255–1262. Cited by: §2.
 [39] (2021) ImplicitPDF: NonParametric Representation of Probability Distributions on the Rotation Manifold. In ICML, Cited by: §2, §2, Figure 3, §3.1, §3.1, §3.3, §3.3, Figure 12.
 [40] (2011) DTAM: Dense Tracking and Mapping in Realtime. In ICCV, Cited by: §2.
 [41] (2017) Learning 3D Object Categories by Looking Around Them. In ICCV, Cited by: §2.
 [42] (2019) C3DPO: Canonical 3D Pose Networks for NonRigid Structure From Motion. In ICCV, Cited by: §2.
 [43] (2018) Making Deep Heatmaps Robust to Partial Occlusions for 3D Object Pose Estimation. In ECCV, Cited by: §2.
 [44] (2021) ZePHyR: Zeroshot Pose Hypothesis Scoring. In ICRA, Cited by: §2.
 [45] (2020) Learning Orientation Distributions for Object Pose Estimation. In IROS, Cited by: §2.
 [46] (2020) Online Invariance Selection for Local Feature Descriptors. In ECCV, Cited by: §2.
 [47] (2018) Deep Directional Statistics: Pose Estimation with Uncertainty Quantification. In ECCV, Cited by: §2.
 [48] (2021) Common Objects in 3D: LargeScale Learning and Evaluation of Reallife 3D Category Reconstruction. In ICCV, Cited by: §4.1.
 [49] (2019) R2D2: Reliable and Repeatable Detector and Descriptor. NeurIPS. Cited by: §2.
 [50] (1840) Des lois géométriques qui régissent les déplacements d’un système solide dans l’espace, et de la variation des coordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. Journal de Mathématiques Pures et Appliquées 5. Cited by: §4.1.
 [51] (2020) Kimera: an OpenSource Library for RealTime MetricSemantic Localization and Mapping. In ICRA, Cited by: §2.
 [52] (2019) From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In CVPR, Cited by: §2, §4.1.
 [53] (2020) SuperGlue: Learning Feature Matching with Graph Neural Networks. In CVPR, Cited by: §2, §4.1.
 [54] (2016) StructurefromMotion Revisited. In CVPR, Cited by: §2, §4.1, §4.1.
 [55] (2016) Pixelwise View Selection for Unstructured MultiView Stereo. In ECCV, Cited by: §2.
 [56] (2019) BAD SLAM: Bundle Adjusted Direct RGBD SLAM. In CVPR, Cited by: §2.
 [57] (2014) Learning Local Feature Descriptors Using Convex Optimisation. TPAMI 36 (8), pp. 1573–1585. Cited by: §2.
 [58] (2006) Photo Tourism: Exploring Photo Collections in 3D. In SIGGRAPH, Cited by: §2.
 [59] (2020) Hybridpose: 6D Object Pose Estimation under Hybrid Representations. In CVPR, Cited by: §2.
 [60] (2018) Pix3D: Dataset and Methods for SingleImage 3D Shape Modeling. In CVPR, Cited by: §1.
 [61] (2018) Implicit 3D Orientation Learning for 6D Object Detection from RGB Images. In ECCV, Cited by: §2.
 [62] (2020) Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS. Cited by: §3.3, Figure 16.
 [63] (2019) BANet: Dense Bundle Adjustment Network. In ICLR, Cited by: §2.
 [64] (2021) DROIDSLAM: Deep Visual SLAM for Monocular, Stereo, and RGBD Cameras. NeurIPS. Cited by: §2, §4.1, Figure 11.
 [65] (2018) RealTime Seamless Single Shot 6D Object Pose Prediction. In CVPR, Cited by: §2.
 [66] (2009) Daisy: An Efficient Dense Descriptor Applied to Widebaseline Stereo. TPAMI 32 (5), pp. 815–830. Cited by: §2.
 [67] (1999) Bundle Adjustment—A Modern Synthesis. In International workshop on vision algorithms, Cited by: §2.
 [68] (2020) GLUNet: GlobalLocal Universal Network for Dense Flow and Correspondences. In CVPR, Cited by: §2.
 [69] (2017) DeMoN: Depth and Motion Network for Learning Monocular Stereo. In CVPR, Cited by: §2.
 [70] (2017) SfMNet: Learning of Structure and Motion from Video. arXiv:1704.07804. Cited by: §2.
 [71] (2019) DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. In CVPR, Cited by: §2.
 [72] (2020) Learning Feature Descriptors Using Camera Pose Supervision. In ECCV, Cited by: §2.
 [73] (2017) DeepVO: Towards EndtoEnd Visual Odometry with Deep Recurrent Convolutional Neural Networks. In ICRA, Cited by: §2.
 [74] (2020) TartanVO: A Generalizable Learningbased VO. In CoRL, Cited by: §2.
 [75] (2020) DeepSFM: Structure From Motion Via Deep Bundle Adjustment. In ECCV, Cited by: §2.
 [76] (2017) SegICP: Integrated Deep Semantic Segmentation and Pose Estimation. In IROS, Cited by: §2.
 [77] (2018) PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In RSS, Cited by: §2.
 [78] (2019) Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects. In BMVC, Cited by: §2.
 [79] (2016) LIFT: Learned Invariant Feature Transform. In ECCV, Cited by: §2.
 [80] (2018) GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In CVPR, Cited by: §2.
 [81] (2020) Perceiving 3D HumanObject Spatial Arrangements from a Single Image in the Wild. In ECCV, Cited by: §2.
 [82] (2021) NeRS: Neural Reflectance Surfaces for Sparseview 3D Reconstruction in the Wild. In NeurIPS, Cited by: §1, §3, Figure 10, §4.3.
 [83] (2019) Making Convolutional Networks ShiftInvariant Again. In ICML, Cited by: §3.3, Figure 16.
 [84] (2018) DeepTAM: Deep Tracking and Mapping. In ECCV, Cited by: §2.
 [85] (2017) Unsupervised Learning of Depth and EgoMotion From Video. In CVPR, Cited by: §2.
 [86] (2019) On the Continuity of Rotation Representations in Neural Networks. In CVPR, Cited by: §4.1, Figure 17.
 [87] (2020) Direct Sparse Mapping. TRO. Cited by: §2.