In many robotic vision pipelines, fiducial markers are often employed to simplify feature extraction. In particular, planar markers [25, 7, 8, 9, 17, 11], which are designed to be easily detected and associated across images, find extensive use in laboratory and commercial settings (factories, warehouses, mines, etc.). In applications that perform planar marker-based SfM or SLAM [19, 14, 5, 13], there is a basic need to estimate the 6DOF pose of an observed marker relative to the camera coordinate frame. This is often solved as a special case of planar pose estimation (PPE), which functions by determining the relative pose between a plane of known dimensions and its projection onto the image [16, 18, 3].
While in theory 6DOF pose can be determined uniquely from four non-collinear but co-planar points, the situation is less clear in non-ideal conditions where perspective effects are not apparent, e.g., when the imaged marker is small or the marker is at a distance that is significantly larger than the focal length. In such conditions there is a two-fold rotational ambiguity that corresponds to an unknown reflection of the plane about the z-axis of the camera [16, 18, 3]. For one observed planar marker (specifically its four corners), state-of-the-art PPE methods [18, 3] may return two physically plausible pose solutions, with one of them being the correct one (i.e., the one closer to the ground truth pose).
Fig. 1 shows an example from the dataset of [5]. Note that the two solutions returned by PPE can be very different; it is therefore unwise to arbitrarily choose one of the two poses, or to take the midpoint of the two solutions as the pose estimate.
A common way to disambiguate the two returned poses $T^{(0)}$ and $T^{(1)}$ is to compute the reprojection error of each pose
$$e(T) = \sum_{c=1}^{4} \left\| u_c - \pi(K, T, X_c) \right\|_2^2, \tag{1}$$
where $X_c$ and $u_c$ are the reference 3D position and 2D observation of the 4 corners of the detected marker, $K$ is the camera intrinsic parameter matrix, and $\pi(K, T, X_c)$ projects $X_c$ onto the image with camera pose $T$. The PPE pose with the lower reprojection error is then selected.
However, comparing reprojection errors is not foolproof [26, 13]: if the corner localisation is noisy, $e(T^{(0)})$ and $e(T^{(1)})$ can be very close. In fact, the correct solution can have the higher reprojection error; see Fig. 1.
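The reprojection-error comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with zero skew, a pose given as a rotation matrix `R` and translation `t` mapping marker coordinates to camera coordinates, and a sum of squared pixel distances as the error. The function names are hypothetical.

```python
def project(K, R, t, X):
    """Project 3D point X (marker frame) into the image with pose (R, t)."""
    # Transform into the camera frame: Xc = R X + t
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Perspective division, then focal lengths and principal point (zero skew assumed)
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]
    return (K[0][0] * x + K[0][2], K[1][1] * y + K[1][2])


def reprojection_error(K, R, t, corners_3d, corners_2d):
    """Sum of squared pixel distances over the marker corners."""
    total = 0.0
    for X, u in zip(corners_3d, corners_2d):
        px, py = project(K, R, t, X)
        total += (px - u[0]) ** 2 + (py - u[1]) ** 2
    return total


def pick_lower_error(K, pose_a, pose_b, corners_3d, corners_2d):
    """Return (index, errors) of the candidate pose with the lower error."""
    errors = [reprojection_error(K, R, t, corners_3d, corners_2d)
              for R, t in (pose_a, pose_b)]
    return (0 if errors[0] <= errors[1] else 1), errors
```

When the two errors are nearly equal, the index returned by `pick_lower_error` is decided by corner noise rather than geometry, which is the failure mode discussed in the text.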
While current theory and algorithms for PPE [18, 3] have characterised the ambiguity issue and are able to compute all physically plausible solutions stably, using the PPE outputs under ambiguity, particularly in marker-based SfM or SLAM pipelines, remains a fundamental challenge. In the following, we further survey efforts to deal with marker pose ambiguity, before outlining the proposed solution.
I-A Related work
Tanaka et al. [22, 21] modified the conventional planar marker design to directly incorporate orientation information. They attach two one-dimensional moire patterns onto the marker to obtain appearance variation for pose disambiguation, as well as lenticular lenses that introduce 3D deviations to the marker surface. Though this largely alleviates the ambiguity problem, the marker fabrication is non-trivial.
For planar target camera tracking, a filtering method with a well-tuned camera motion model [26, 24] can be exploited to disambiguate the marker poses. However, this assumes temporal continuity in the images, which may not be valid in SfM with wide baseline images; moreover, there are no mature filtering methods for marker SLAM. Jin et al. [12] showed improved marker pose estimation accuracy by fusing depth information. However, this requires an RGBD camera.
Marker-based SfM/SLAM is an active research area [19, 15, 14, 5, 13]. Marker ambiguity is not dealt with explicitly in [15, 19, 5], though [5] combined feature-based SfM with marker-based SfM. Munoz-Salinas et al. applied a ratio test in their marker-based SfM [14] and SLAM [13] pipelines. Basically, if the ratio
$$r = \frac{\min\left(e(T^{(0)}), e(T^{(1)})\right)}{\max\left(e(T^{(0)}), e(T^{(1)})\right)} \tag{2}$$
between the reprojection errors (1) of the two PPE solutions $T^{(0)}$ and $T^{(1)}$ is below a threshold (default is 0.6), the PPE solution with the lower reprojection error is used in subsequent SfM/SLAM processing; otherwise, the marker detection is discarded. A weakness of this approach is the sensitivity to the threshold. If it is too low, many marker detections will be excluded, leading to data wastage or even SfM/SLAM failure. On the other hand, a high threshold risks using bad marker poses (recall that the pose with the lower reprojection error may not be the correct one) for SfM/SLAM. Sec. VI will demonstrate this shortcoming.
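The ratio test can be illustrated with a short sketch. It assumes the ratio is the smaller reprojection error divided by the larger one, which matches the statement that detections are kept only when the ratio falls below the threshold; the function name and the `None` return value for discarded detections are hypothetical choices for this example.

```python
def ratio_test(err_a, err_b, threshold=0.6):
    """Return 0 or 1 for the selected PPE solution, or None to discard the detection."""
    low, high = min(err_a, err_b), max(err_a, err_b)
    if high == 0.0:
        # Both errors are zero: the two solutions cannot be told apart.
        return None
    ratio = low / high
    if ratio < threshold:
        # Clear winner: keep the solution with the lower reprojection error.
        return 0 if err_a < err_b else 1
    # Errors too similar: treat the detection as ambiguous and drop it.
    return None
```

Lowering `threshold` discards more detections, while raising it admits more detections whose lower-error pose may still be the wrong one.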
I-B Our contributions
Unlike previous works that have used a per-marker approach to resolve marker ambiguity, we exploit multi-view constraints for disambiguation. From the input marker detections, we first construct a multigraph of relative rotation measurements, which incorporates all PPE pose ambiguities. Then, we formulate a novel rotation averaging problem with clique constraints that respects consistency (details later) between subsets of relative pose measurements. We examine the combinatorial complexity of the new problem, and develop a lifted optimisation method to efficiently solve it. Then, a series of small maximal weighted clique problems are solved to make the final pose selections. Our method allows all valid PPE pose combinations to be examined, and leads to more accurate and/or complete marker-based SfM.
II Problem formulation
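Consider $N$ input images $\{I_i\}_{i=1}^{N}$ that observe a set of $K$ markers $\{M_j\}_{j=1}^{K}$ of known sizes in a static scene. We assume calibrated cameras. A standard marker detection and identification algorithm is applied to each image. Denote by
$$T_{i,j}^{(0)} = \left(R_{i,j}^{(0)}, t_{i,j}^{(0)}\right), \qquad T_{i,j}^{(1)} = \left(R_{i,j}^{(1)}, t_{i,j}^{(1)}\right)$$
the two marker-to-camera (M2C) relative poses that PPE returns for marker $M_j$ detected in image $I_i$, where $R$ denotes the rotation and $t$ the translation component.
Without loss of generality, we assume that each marker observation has exactly two relative pose solutions. Note that the pose ambiguity is due to orientation ambiguity, thus the translation component is the same, i.e.,
$$t_{i,j}^{(0)} = t_{i,j}^{(1)}.$$
Given the set of all M2C relative pose measurements
$$\mathcal{T} = \left\{ T_{i,j}^{(0)}, T_{i,j}^{(1)} \mid M_j \text{ is detected in } I_i \right\},$$
our overall aim is SfM, i.e., find the absolute poses of all markers and cameras. To do so, pose ambiguity must be resolved, i.e., for each $(i,j)$ such that $M_j$ is detected in $I_i$, choose either $T_{i,j}^{(0)}$ or $T_{i,j}^{(1)}$ for SfM computations. This yields the reduced measurement set
$$\tilde{\mathcal{T}} = \left\{ \tilde{T}_{i,j} \mid M_j \text{ is detected in } I_i \right\}, \tag{7}$$
where each $\tilde{T}_{i,j}$ is either $T_{i,j}^{(0)}$ or $T_{i,j}^{(1)}$, and $|\tilde{\mathcal{T}}| = |\mathcal{T}|/2$. The reduced measurement set is then subjected to the rest of the SfM/SLAM pipeline. Our new method exploits multi-view consistency to disambiguate the PPE marker poses in a way that avoids premature decisions; details as follows.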
III Multigraph with rotational ambiguity
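Since the ambiguity lies in the orientations, it is natural to model the ambiguity using only the M2C relative rotations
$$\left\{ R_{i,j}^{(0)}, R_{i,j}^{(1)} \mid M_j \text{ is detected in } I_i \right\}.$$
To this end, we construct a multigraph $G = (V, E)$, where the vertex set $V$ is the set of markers $\{M_j\}_{j=1}^{K}$, and the edges $E$ indicate covisibility between the markers. More specifically, if $M_j$ and $M_k$ are detected in $I_i$, four edges
$$e_{j,k}^{i,00}, \quad e_{j,k}^{i,01}, \quad e_{j,k}^{i,10}, \quad e_{j,k}^{i,11}$$
connect vertices $M_j$ and $M_k$ in $G$; assuming $j < k$, the edges correspond to the marker-to-marker (M2M) relative rotations
$$R_{j,k}^{i,ab} = \left(R_{i,k}^{(b)}\right)^{\top} R_{i,j}^{(a)},$$
where $ab$ is a bit string composed of two binary indicators $a, b \in \{0, 1\}$. The edges in $G$ are undirected; if $j > k$, the edge $e_{j,k}^{i,ab}$ has the associated M2M relative rotation
$$R_{j,k}^{i,ab} = \left(R_{k,j}^{i,ba}\right)^{\top}.$$
Thus, in our notation
$$e_{j,k}^{i,ab} = e_{k,j}^{i,ba}.$$
The set of all edges (without repetitions) is thus
$$E = \left\{ e_{j,k}^{i,ab} \mid j < k,\ M_j \text{ and } M_k \text{ are detected in } I_i,\ a, b \in \{0, 1\} \right\}.$$
Similarly, the set of unique M2M relative rotations is
$$\mathcal{R} = \left\{ R_{j,k}^{i,ab} \mid e_{j,k}^{i,ab} \in E \right\}.$$
The existence of four M2M relative rotations per pair is a direct consequence of ambiguity in marker pose estimation, and the bit string $ab$ selects a particular combination of M2C relative rotations to derive the M2M relative rotation.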
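To make the four-edge construction concrete, the sketch below composes the four M2M relative rotations for one pair of markers seen in the same image. It assumes each M2C rotation maps marker coordinates to camera coordinates, so that the rotation from marker j to marker k is the transpose of the M2C rotation of k times the M2C rotation of j; this convention and all function names are assumptions made for the example.

```python
from itertools import product


def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[r][m] * B[m][c] for m in range(3)) for c in range(3)]
            for r in range(3)]


def transpose(A):
    """3x3 matrix transpose (the inverse of a rotation matrix)."""
    return [list(row) for row in zip(*A)]


def m2m_rotations(R_ij, R_ik):
    """
    R_ij, R_ik: pairs (candidate 0, candidate 1) of M2C rotations for
    markers j and k detected in the same image i.
    Returns {(a, b): M2M rotation built from candidate a of j and b of k}.
    """
    return {(a, b): matmul(transpose(R_ik[b]), R_ij[a])
            for a, b in product((0, 1), repeat=2)}
```

Each key `(a, b)` plays the role of the bit string that labels one of the four parallel edges between the two markers.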
Note that our multigraph construction method is a significant extension of that in [14], in that our multigraph incorporates all ambiguous marker poses, whereas [14] generates its graph from the preprocessed data (7) with no ambiguities.
III-A Consistent cliques
We assume that the multigraph $G$ is connected, i.e., there is a path that connects every pair of vertices (markers) in $G$.
Definition 1 (Consistent clique). Given multigraph $G$ as defined above, a consistent clique for image $I_i$ is a fully connected subgraph $C_i$ of $G$, over the markers detected in $I_i$, such that
Every two vertices $M_j, M_k$ in $C_i$ are connected by exactly one edge $e_{j,k}^{i,ab}$, where $ab$ is one of $\{00, 01, 10, 11\}$.
For every two vertices $M_k, M_l$ that are connected to vertex $M_j$, the associated edges $e_{j,k}^{i,ab}$ and $e_{j,l}^{i,a'b'}$ satisfy the condition $a = a'$.
Fig. 3 provides examples. Intuitively, a consistent clique for image $I_i$ corresponds to a set of M2M relative rotations that are composed using a constant selection of one of the two M2C relative poses for each marker detected in $I_i$.
Since there are multiple valid combinations of constant M2C relative pose selections, there are multiple consistent cliques for an image. Assuming that $n$ markers are detected in each image, there are $2^n$ consistent cliques per image. For $N$ images, there are thus $2^{nN}$ unique combinations of consistent cliques across the images.
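The combinatorics of consistent cliques can be checked by direct enumeration. The sketch below assumes a consistent clique is fully determined by one binary choice per detected marker, so that n markers give 2^n cliques, each with one labelled edge per marker pair; the data layout (a dict from marker pair to bit pair) is a hypothetical choice for illustration.

```python
from itertools import combinations, product


def consistent_cliques(marker_ids):
    """Enumerate all consistent cliques for the markers detected in one image."""
    ids = sorted(marker_ids)
    cliques = []
    # One binary choice per marker fixes which M2C candidate is used everywhere.
    for bits in product((0, 1), repeat=len(ids)):
        choice = dict(zip(ids, bits))
        # Every pair (j, k) with j < k gets exactly one edge, labelled (a, b).
        edges = {(j, k): (choice[j], choice[k]) for j, k in combinations(ids, 2)}
        cliques.append(edges)
    return cliques
```

Because the label of every edge touching a marker reuses that marker's single choice, both conditions of a consistent clique hold by construction.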
IV Disambiguation with rotation averaging
Based on the multigraph, our technique resolves the ambiguities by first solving a novel rotation averaging formulation, then - based on the averaging results - building and solving a maximum weighted clique problem. The key outcome of this step is marker pose disambiguation; Sec. V will incorporate this step into a marker-based SfM pipeline.
IV-A Rotation averaging with clique constraints
While standard rotation averaging is defined over a graph of relative rotations [10, 2], extending the formulation to a multigraph of relative rotations is straightforward, and existing algorithms (we used [2]) can be applied with minor adjustments. Let $R_1, \dots, R_K$ be the absolute rotations of the markers. A rotation averaging problem over multigraph $G$ is
$$\min_{R_1, \dots, R_K} \sum_{e_{j,k}^{i,ab} \in E} \rho\left(R_{j,k}^{i,ab} - R_k^{\top} R_j\right), \tag{16}$$
where $R_{j,k}^{i,ab}$ is the M2M relative rotation associated with edge $e_{j,k}^{i,ab}$ and $\rho$ is a robust norm. The motivation behind (16) is to attempt to identify the incorrect poses from PPE as the contributors to outlying measurements in the averaging task.
However, our tests (Sec. VI) suggest that this approach is ineffective for disambiguation, most probably because (16) does not enforce clique consistency (Def. 1). Thus, error terms that are regarded as inliers could correspond to choosing both PPE poses for the same marker detection.
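To enforce clique consistency in rotation averaging, we introduce a set of binary indicator variables
$$\left\{ s_{i,j} \in \{0, 1\} \mid M_j \text{ is detected in } I_i \right\}, \tag{17}$$
where the setting $s_{i,j} = 0$ implies selecting M2C relative rotation $R_{i,j}^{(0)}$ for the detection of $M_j$ in $I_i$, while $s_{i,j} = 1$ implies selecting $R_{i,j}^{(1)}$. We then formulate the clique-constrained rotation averaging problem
$$\min_{\{R_j\}, \{s_{i,j}\}} \sum_{e_{j,k}^{i,ab} \in E} \mathbb{1}[s_{i,j} = a]\, \mathbb{1}[s_{i,k} = b]\, \rho\left(R_{j,k}^{i,ab} - R_k^{\top} R_j\right), \tag{18}$$
where $\mathbb{1}[\cdot]$ equals 1 if its argument holds and 0 otherwise.
Intuitively, $\{s_{i,j}\}$ selects the M2C relative rotations to compose the M2M relative rotations in a consistent way. Searching over $\{s_{i,j}\}$ thus allows different consistent cliques in all images to be examined. Finally, since the absolute rotations $\{R_j\}$ are shared across images, multi-view consistency is exploited to choose the best combinations of the PPE relative rotations.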
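The effect of the binary indicators on the averaging cost can be sketched as follows. The example evaluates the cost for fixed absolute rotations and a fixed selection, keeping only the edges whose bit labels agree with the selection. It assumes a plain Frobenius distance in place of the robust norm and the convention that the predicted M2M rotation from marker j to marker k is the transpose of the absolute rotation of k times that of j; all names and the edge tuple layout are hypothetical.

```python
import math


def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[r][m] * B[m][c] for m in range(3)) for c in range(3)]
            for r in range(3)]


def transpose(A):
    """3x3 matrix transpose."""
    return [list(row) for row in zip(*A)]


def frobenius_distance(A, B):
    """Frobenius norm of A - B (stand-in for a robust norm)."""
    return math.sqrt(sum((A[r][c] - B[r][c]) ** 2
                         for r in range(3) for c in range(3)))


def clique_constrained_cost(abs_rot, edges, selection):
    """
    abs_rot:   {marker_id: 3x3 absolute rotation}
    edges:     list of (i, j, k, a, b, R_meas), one entry per multigraph edge
    selection: {(i, marker_id): 0 or 1}, the chosen M2C candidate per detection
    """
    cost = 0.0
    for i, j, k, a, b, R_meas in edges:
        if selection[(i, j)] != a or selection[(i, k)] != b:
            continue  # term switched off by the indicator variables
        predicted = matmul(transpose(abs_rot[k]), abs_rot[j])
        cost += frobenius_distance(R_meas, predicted)
    return cost
```

For every pair of markers in an image exactly one of the four edges survives, so each selection evaluates one consistent clique per image.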
IV-B Efficient algorithm using lifting approach
A naive method to solve (18) is to enumerate the binary indicators $\{s_{i,j}\}$, and for each instantiation, collect the non-zero terms in (18) and solve the resulting rotation averaging problem. Then, return the $\{s_{i,j}\}$ with the lowest optimised error as the disambiguation decision. Since there are $2^{nN}$ possible instantiations of $\{s_{i,j}\}$ (assuming $n$ markers seen per image in $N$ images), this is infeasible.
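Instead, we relax each binary indicator $s_{i,j}$ in (17) with a sigmoid function
$$\sigma(\theta_{i,j}) = \frac{1}{1 + e^{-\theta_{i,j}}}, \qquad \theta_{i,j} \in \mathbb{R}, \tag{19}$$
which yields the “smoothed” version of (18)
$$\min_{\{R_j\}, \{\theta_{i,j}\}} \sum_{e_{j,k}^{i,ab} \in E} w_a(\theta_{i,j})\, w_b(\theta_{i,k})\, \rho\left(R_{j,k}^{i,ab} - R_k^{\top} R_j\right), \tag{20}$$
where $w_1(\theta) = \sigma(\theta)$ and $w_0(\theta) = 1 - \sigma(\theta)$.
Intuitively, the contribution of an error term in (20) is now weighted according to correctness of the corresponding M2C relative poses that define the error term.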
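A small sketch shows how a sigmoid relaxation turns the hard selection into soft weights on the four edges between two markers. It assumes that the sigmoid of a real-valued variable is read as the soft preference for candidate 1 and its complement as the preference for candidate 0, and that the weight of an edge is the product of the two per-marker preferences. This parametrisation and the function names are assumptions for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_edge_weights(theta_j, theta_k):
    """Soft weights for the four edges (a, b) between two detected markers."""
    pref_j = (1.0 - sigmoid(theta_j), sigmoid(theta_j))
    pref_k = (1.0 - sigmoid(theta_k), sigmoid(theta_k))
    return {(a, b): pref_j[a] * pref_k[b] for a in (0, 1) for b in (0, 1)}


def smoothed_pair_cost(theta_j, theta_k, residuals):
    """Weighted sum of the four residuals; residuals is keyed by (a, b)."""
    weights = soft_edge_weights(theta_j, theta_k)
    return sum(weights[ab] * residuals[ab] for ab in weights)
```

With both variables at zero, all four edges receive weight 0.25; as the variables grow in magnitude, the weights concentrate on a single edge and the cost approaches that of a hard selection.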
Problem (20) can be solved using an iterative non-linear optimiser (e.g., fmincon in MATLAB). We initialise the absolute rotations $\{R_j\}$ via a minimum spanning tree on $G$, choosing the M2M relative rotations with the lower combined reprojection errors for chaining, and the relaxed indicators $\{\theta_{i,j}\}$ are set to reflect these choices. As we will show in Sec. VI, our method is not biased by such an initialisation, since it is capable of providing more accurate disambiguation than comparing reprojection errors alone.
IV-C Selecting the marker poses
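Given the optimised relaxed indicators $\sigma(\theta_{i,j}) \in (0, 1)$, we compute the ratio
$$r_{i,j} = \frac{\min\left(\sigma(\theta_{i,j}),\ 1 - \sigma(\theta_{i,j})\right)}{\max\left(\sigma(\theta_{i,j}),\ 1 - \sigma(\theta_{i,j})\right)} \tag{21}$$
for all $(i,j)$ such that $M_j$ is detected in $I_i$. Similar to (2), the ratio (21) indicates how “disambiguable” the PPE poses are for each marker detection (smaller ratios are better), but now based on the value of $\sigma(\theta_{i,j})$. Although $\sigma(\theta_{i,j})$ is not discrete, the percentage of marker poses that are still ambiguous is now significantly reduced.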
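To conclusively select one PPE pose per detected marker, a simple solution would be to threshold each relaxed indicator $\sigma(\theta_{i,j})$ with $0.5$; however, we would like to avoid such a per-marker decision. To this end, for each image $I_i$ we construct the multigraph $G_i = (V_i, E_i)$, where $V_i$ is the set of markers detected in $I_i$, and $E_i$ is the set of edges $e_{j,k}^{i,ab}$ of the multigraph $G$ that originate from $I_i$.
Note that $G_i$ is a submultigraph of $G$, and there exist $2^n$ consistent cliques in $G_i$ for $n$ detected markers (see Sec. III-A). Further, each edge $e_{j,k}^{i,ab}$ in $G_i$ has a weight $w_{j,k}^{i,ab}$ derived from the averaging results.
Given $G_i$, define a binary edge indicator variable $z_{j,k}^{ab} \in \{0, 1\}$ for each edge $e_{j,k}^{i,ab} \in E_i$,
and the maximum weighted clique (MWC) problem
$$\max_{\{z_{j,k}^{ab}\}} \sum_{e_{j,k}^{i,ab} \in E_i} w_{j,k}^{i,ab}\, z_{j,k}^{ab} \quad \text{s.t. the edges with } z_{j,k}^{ab} = 1 \text{ form a consistent clique in } G_i.$$
Basically, the aim of the MWC problem is to find a consistent clique in $G_i$ with the largest total edge weight. Though MWC is intractable in general, each instance is small, since the number of detected markers in $I_i$ is small.
We use an efficient clique solver on each $G_i$. The optimised $\{z_{j,k}^{ab}\}$ provides a consistent selection of the PPE poses for all markers detected in $I_i$. Specifically, for each $M_j$ detected in $I_i$, find a $z_{j,k}^{ab}$ that is nonzero, and set the selection for $M_j$ to candidate $0$ if $a = 0$, or to candidate $1$ otherwise.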
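Since each per-image instance is small, the selection step can be illustrated with a brute-force search over all consistent cliques. This is a stand-in for a dedicated clique solver, not the solver used in the paper. The sketch assumes edge weights are supplied as a dict keyed by `(j, k, a, b)` with `j < k`; how the weights are computed is left to the caller, and the function name is hypothetical.

```python
from itertools import combinations, product


def best_consistent_clique(marker_ids, weight):
    """
    Exhaustively search the consistent cliques of one image.
    weight: {(j, k, a, b): float} for every marker pair j < k and bits a, b.
    Returns ({marker_id: selected candidate}, total edge weight).
    """
    ids = sorted(marker_ids)
    best_choice, best_total = None, float("-inf")
    for bits in product((0, 1), repeat=len(ids)):
        choice = dict(zip(ids, bits))
        # Sum the weights of the one edge per pair implied by this selection.
        total = sum(weight[(j, k, choice[j], choice[k])]
                    for j, k in combinations(ids, 2))
        if total > best_total:
            best_choice, best_total = choice, total
    return best_choice, best_total
```

The search visits 2^n selections for n detected markers, which is practical only because the number of markers per image is small; the returned dict gives one PPE candidate per marker, decided jointly rather than per marker.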
Algorithm 1 summarises the proposed method for marker pose disambiguation.
V Marker-based SfM pipeline
To carry out marker-based SfM using our marker pose disambiguation method, we largely follow the pipeline of the state-of-the-art MarkerMapper [14]. Briefly, a robust pose graph optimisation is first invoked on the resolved M2C relative poses (7) from Algorithm 1 to yield absolute marker poses - in our case, the absolute rotation component is initialised using the output from solving (20). Then, each camera pose is initialised using single pose averaging from the M2C poses, before all marker and camera poses are refined simultaneously by bundle adjustment on the observed corners of all detected markers. We refer to [14] for details of the SfM pipeline.
To assess the efficacy of the proposed marker pose disambiguation technique, we compared the following methods:
Reprojection error (M1): For each marker detection, select the PPE solution with the lower reprojection error.
Default ratio test (M3): The threshold of 0.6 is applied on the reprojection error ratio (the default setting in [14]).
Proposed method (Ours): As described in Sec. IV.
When applying the above disambiguation methods to perform marker-based SfM, we simply used them to preprocess the input marker detections, then executed the rest of the pipeline of MarkerMapper [14] (see Sec. V). All the experiments were conducted on a machine with a 3.5 GHz CPU and 8 GB of RAM.
VI-A Experiments on hybrid data
VI-A1 Data generation
We used the ScanNet Dataset [4], which contains a number of sequences with ground truth 6DOF camera poses and depth. A test sequence was created from an original sequence by warping a number of ArUco markers [9, 17] based on known/ground truth M2C relative poses onto parts of the images that correspond to planar surfaces; see the supplementary video (https://www.youtube.com/watch?v=LtwavEeCkQ4&t=) for a sample sequence. Using the ground truth camera absolute pose $\bar{T}^{\mathrm{cam}}_{i}$ of image $I_i$ and the ground truth M2C relative pose $\bar{T}_{i,j}$ of marker $M_j$, the ground truth marker absolute pose is $\bar{T}^{\mathrm{mk}}_{j} = \bar{T}^{\mathrm{cam}}_{i} \bar{T}_{i,j}$.
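The ground truth marker pose follows from chaining two rigid transforms. The sketch below assumes each pose is a pair `(R, t)` that maps points as `R x + t`, that the camera absolute pose maps camera coordinates to world coordinates, and that the M2C pose maps marker coordinates to camera coordinates; these conventions and the function names are assumptions for the example.

```python
def compose(pose_outer, pose_inner):
    """Return the rigid transform that applies pose_inner first, then pose_outer."""
    R1, t1 = pose_outer
    R2, t2 = pose_inner
    R = [[sum(R1[r][m] * R2[m][c] for m in range(3)) for c in range(3)]
         for r in range(3)]
    t = [sum(R1[r][m] * t2[m] for m in range(3)) + t1[r] for r in range(3)]
    return R, t


def marker_absolute_pose(camera_abs, m2c):
    """Marker-to-world pose from camera-to-world and marker-to-camera poses."""
    return compose(camera_abs, m2c)
```

Under a different pose convention (for example world-to-camera extrinsics), the order of composition or an inversion would change accordingly.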
Table I: Precision (%), number of markers mapped and number of cameras localised, per sequence.
Table II: Average marker pose error (deg, cm) and average camera pose error (deg, cm), per sequence.
VI-A2 Marker detection
Using the steps above, we generated five testing sequences from Bedroom (B), Hotel1 (H1), Hotel2 (H2), Office1 (O1) and Office2 (O2). We used [9] to detect, identify and localise the corners of each marker in each frame; see Table I for the number of frames and unique detected markers in each sequence. Though the markers were synthetically warped into the images, our analysis suggests that corner localisation suffered from errors of 1–7 pixels.
Table IV: Mean and median camera position error (m), relative to FM.
|Dataset|Mean err. (m)||||Median err. (m)||||
|ece floor4 wall|5.28|2.72|20.95|2.56|5.35|2.03|18.09|2.12|
|ece floor5 stairs|1.58|3.18|4.07|1.14|0.96|2.64|3.72|0.82|
|cee night cw|30.21|34.79|75.57|19.06|19.25|24.21|76.42|10.12|
VI-A3 Ground truth M2C pose selection
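On the noisy corner localisations, PPE [3] is invoked, which yields two M2C relative poses for each detected marker. To decide the ground truth selection, we compute the angular difference between each PPE rotation $R_{i,j}^{(a)}$, $a \in \{0, 1\}$, and the ground truth M2C rotation $\bar{R}_{i,j}$ as
$$\phi_{i,j}^{(a)} = \arccos\left(\frac{\operatorname{tr}\left(\bar{R}_{i,j}^{\top} R_{i,j}^{(a)}\right) - 1}{2}\right).$$
The ground truth selection of the PPE poses is taken as the one with the lower angular difference $\phi_{i,j}^{(a)}$.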
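The ground truth selection can be sketched with the standard geodesic angle between two rotation matrices, computed from the trace of their relative rotation. Whether the original evaluation used exactly this angle measure is an assumption; the function names are hypothetical.

```python
import math


def rotation_angle_deg(Ra, Rb):
    """Angle (degrees) of the relative rotation Ra^T Rb."""
    # trace(Ra^T Rb) equals the sum of element-wise products of Ra and Rb.
    trace = sum(Ra[r][c] * Rb[r][c] for r in range(3) for c in range(3))
    # Clamp against rounding before taking the arc cosine.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def ground_truth_selection(R_gt, candidates):
    """Return (index, angles) of the PPE rotation closer to the ground truth."""
    angles = [rotation_angle_deg(R_gt, R) for R in candidates]
    return (0 if angles[0] <= angles[1] else 1), angles
```

The clamp matters for near-identical rotations, where rounding can push the cosine slightly above 1.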
For the hybrid data experiment, we evaluated all the approaches on two main aspects; see supplementary video for demonstration of our pose disambiguation method.
Precision in pose disambiguation
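For each testing sequence, precision in pose disambiguation is defined as
$$\text{precision} = \frac{\#\ \text{correctly disambiguated marker detections}}{\#\ \text{marker detections for which a PPE pose was selected}} \times 100\%.$$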
Table I shows that Ours generally has higher precision than the others. The fact that M4 (the control method) is much poorer than Ours indicates that enforcing the proposed clique-consistency is crucial for disambiguating the PPE poses. Amongst the per-marker disambiguation methods (M1–M3), M1 has the lowest precision, validating observations in previous works that comparing reprojection errors alone is not foolproof. Adding a ratio test to avoid decisions on cases that are too ambiguous helps to improve precision in M2 and M3. In particular, the precision of M2 is on par with Ours. However, as we show next, this gain by M2 comes at a cost.
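A possible way to compute the precision figure is sketched below. It assumes precision counts correct selections only over the detections for which a method actually made a selection, so that detections discarded by a ratio test (marked `None` here) do not enter the denominator. This reading, the data layout and the function name are assumptions.

```python
def disambiguation_precision(selected, ground_truth):
    """
    selected:     {detection_key: 0, 1, or None if the detection was discarded}
    ground_truth: {detection_key: 0 or 1}
    Returns precision in percent over the detections with a selection.
    """
    decided = [(key, s) for key, s in selected.items() if s is not None]
    if not decided:
        return float("nan")  # no decisions were made
    correct = sum(1 for key, s in decided if ground_truth[key] == s)
    return 100.0 * correct / len(decided)
```

Under this definition a method can raise its precision by discarding hard cases, which is why precision is reported alongside map completeness.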
Completeness and accuracy of SfM
To assess the effects of marker pose disambiguation on SfM, we evaluate
the number of markers mapped and cameras localised; and
the error (in deg and cm) of the marker and camera poses
estimated by marker-based SfM from the disambiguated PPE poses in Tables I and II, respectively. Although M2 is precise, it yields a much sparser map than the others; moreover, as it has pruned away many useful detections, there are insufficient data to allow accurate SfM. Using our pose disambiguation technique leads to more complete and accurate maps.
VI-B Real world dataset experiment
Testing was performed on sequences from [5]. We selected 3 indoor scenes with different difficulty levels: ece floor4 wall, ece floor5 stairs and cee night cw. Unique markers are placed in the scene in each sequence. To enable comparisons, we invoked [5] (denoted as FM), which conducts both feature- and marker-based SfM on the sequences. Since SfM with M2 failed in all 3 sequences due to insufficient data for optimisation, it is excluded from the comparison.
Qualitative results in Table III show that Ours is more accurate than M1 and M3 in marker-based SfM - of course, Ours is visibly not as complete as FM, but the latter uses features on top of markers, which entails heavier computations. Using the estimated camera positions by FM as reference, we obtain the position errors (in m) computed by the marker-based SfM methods - normalised and plotted as a cumulative density in Fig. 4. It is apparent that Ours is much more accurate in camera localisation, especially in the most challenging sequence cee night cw. Table IV lists the mean and median position error, relative to FM.
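The summary statistics and the cumulative curve can be reproduced from a list of per-camera position errors along the following lines. The sketch assumes errors are normalised by the largest error in the list before the cumulative fraction is computed; the normalisation actually used for the figure is not stated, so this choice and the function names are assumptions.

```python
import statistics


def summarise(errors):
    """Mean and median of a list of position errors."""
    return statistics.mean(errors), statistics.median(errors)


def cumulative_fraction(errors, levels):
    """Fraction of max-normalised errors at or below each level in [0, 1]."""
    peak = max(errors)  # assumes at least one error is non-zero
    normalised = [e / peak for e in errors]
    return [sum(1 for e in normalised if e <= level) / len(normalised)
            for level in levels]
```

A curve that rises quickly towards 1 indicates that most cameras are localised with small error relative to the worst case.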
[1] (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
[2] (2013) Efficient and robust large-scale rotation averaging. In Proceedings of the IEEE International Conference on Computer Vision, pp. 521–528.
[3] (2014) Infinitesimal plane-based pose estimation. International Journal of Computer Vision 109 (3), pp. 252–286.
[4] (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839.
[5] (2018) Improved structure from motion using fiducial marker matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 273–288.
[6] (2011) Listing all maximal cliques in large sparse real-world graphs. In International Symposium on Experimental Algorithms, pp. 364–375.
[7] (2004) ARTag, an improved marker system based on ARToolKit. National Research Council Canada, Publication Number: NRC 47419, pp. 2004.
[8] (2005) ARTag, a fiducial marker system using digital techniques. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, pp. 590–596.
[9] (2014) Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition 47 (6), pp. 2280–2292.
[10] (2013) Rotation averaging. International Journal of Computer Vision 103 (3), pp. 267–305.
[11] (2019) Deep ChArUco: Dark ChArUco Marker Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8436–8444.
[12] (2017) Sensor fusion for fiducial tags: highly robust pose estimation from single frame RGBD. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5770–5776.
[13] (2019) SPM-SLAM: simultaneous localization and mapping with squared planar markers. Pattern Recognition 86, pp. 156–171.
[14] (2018) Mapping and localization from planar markers. Pattern Recognition 73, pp. 158–171.
[15] (2016) An open source, fiducial based, visual-inertial motion capture system. In 2016 19th International Conference on Information Fusion (FUSION), pp. 1523–1530.
[16] (1993) Iterative pose estimation using coplanar points. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 626–627.
[17] (2018) Speeded up detection of squared fiducial markers. Image and Vision Computing 76, pp. 38–47.
[18] (2006) Robust pose estimation from a planar target. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (12), pp. 2024–2030.
[19] (2012) A self-localization system with global error reduction and online map-building capabilities. In International Conference on Intelligent Robotics and Applications, pp. 13–22.
[20] (2012) Towards a robust back-end for pose graph SLAM. In 2012 IEEE International Conference on Robotics and Automation, pp. 1254–1261.
[21] (2017) Solving pose ambiguity of planar visual marker by wavelike two-tone patterns. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 568–573.
[22] (2014) A solution to pose ambiguity of visual markers using moire patterns. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3129–3134.
[23] (2003) An efficient branch-and-bound algorithm for finding a maximum clique. In International Conference on Discrete Mathematics and Theoretical Computer Science, pp. 278–289.
[24] (2007) Improvement of accuracy for 2D marker-based tracking using particle filter. In 17th International Conference on Artificial Reality and Telexistence (ICAT 2007), pp. 183–189.
[25] (2016) AprilTag 2: Efficient and robust fiducial detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4193–4198.
[26] (2012) Stable pose estimation with a motion model in real-time application. In 2012 IEEE International Conference on Multimedia and Expo, pp. 314–319.