How can we compute the orientation of a robot or a rigid body in space?
This is a fundamental question in SLAM (Simultaneous Localization And Mapping) and computer vision [21, 13, 18]. For the case of static ground cameras placed outside the robot, the standard approach is to place a few markers on the robot or object at known positions, with known distances between them, and then track them in real time via triangulation using a few cameras. Instead of markers in low-dimensional space, high-dimensional visual features, e.g., SIFT or SURF [19, 4], can be used.
The same problem arises when the cameras are mounted on an autonomous moving robot and a few points (markers/features) are observed. If the positions of these points in the real world are known from a map or a 3D model, the robot can compute its own position by matching them with the observed set of points.
By letting P denote the initial/known set of n points, and letting Q denote the observed points (after translation and rotation of the object), the localization problem can be divided into three sub-problems that are, in some sense, independent:
Detection of the set Q of markers in the observed image.
Matching which point in the observed set Q corresponds to which point in the initial set P. In general, there are n! such permutations (matchings), where n is the number of points.
Pose Estimation of the set P, i.e., computing the rotation matrix R and translation vector t of P that yield the observed set Q. The matrix R and the vector t then correspond to the orientation and position of the robot compared to its original pose. In practice, there is no exact matching due to noise in the observed set, caused by camera latency, resolution, frame rate, blurring, communication speed and quality, etc. Thus Gaussian noise is usually assumed, which implies that the sum of squared distances (RMS error) between each point in Q and its corresponding shifted point in P should be minimized; see Definition 1.
2 Related Work
The pose estimation problem is also called the alignment problem: given two 3D point sets P and Q, the task is to find the Euclidean motion that brings P into the best possible alignment with Q. The optimal translation is simply the mean of Q minus the mean of P; each mean can be computed in O(nd) time. Computing the optimal rotation requires computing the d-by-d matrix PᵀQ in O(nd²) time, where each marker is a row in P or Q, and then computing the Singular Value Decomposition (SVD) of this matrix in O(d³) time.
The standard and popular solution for solving the matching and pose-estimation problems is the Iterative Closest Point (ICP) algorithm proposed by Besl and McKay; see the references therein. This algorithm starts with a random matching (mapping) between P and Q, then (1) runs the Kabsch algorithm on this pair, (2) re-matches each marker in P to its nearest point in Q, and returns to step (1). Indeed, we run these steps on our coreset below, which allows faster tracking using low-cost hardware. Various data structures, such as k-d trees or spatial bins, are used to speed up the closest-point search. To speed up the convergence, normal vectors are sometimes considered, which is mainly helpful at the beginning of the iteration process.
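To make these steps concrete, the ICP loop can be sketched as follows. This is a minimal NumPy illustration of the general scheme, not the implementation used here: it uses a brute-force nearest-neighbour search in place of a k-d tree, and the function names are ours.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R minimizing sum_i ||q_i - R p_i||^2 for matched rows."""
    U, _, Vt = np.linalg.svd(P.T @ Q)
    D = np.eye(P.shape[1])
    D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))  # keep det(R) = +1
    return Vt.T @ D @ U.T

def icp(P, Q, iterations=20):
    """Alternate (1) Kabsch alignment and (2) nearest-neighbour matching."""
    P = P - P.mean(axis=0)                # optimal translation: align means
    Q = Q - Q.mean(axis=0)
    R = np.eye(P.shape[1])
    for _ in range(iterations):
        moved = P @ R.T
        # step (2): re-match each moved point of P to its nearest point in Q
        dists = ((moved[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
        match = dists.argmin(axis=1)
        # step (1): Kabsch on the current matching
        R = kabsch(P, Q[match])
    return R
```

As with any ICP variant, convergence to the global optimum depends on a good initial matching; the brute-force search above is O(n²) per iteration and is where spatial data structures pay off.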
References to more provable approaches that try to compute the minimum Hausdorff distance under Euclidean motion can be found, e.g., in the work of Chew et al. These techniques from computational geometry are relatively slow. Running such algorithms on our suggested coreset below might turn them into practical heuristics that are based on provable solutions.
3 Novel Approach: Core-set for Pose Estimation
In order to boost the performance of each of the three steps above (detection, matching and pose-estimation), our main theoretical result is an algorithm that computes, in time linear in n, a “smart selection” (core-set) of a small subset of the markers in P and their corresponding markers in Q, with appropriate multiplicative weights. Maybe surprisingly, we prove that for every such pair of sets we can compute a pair of core-sets that have the following guarantees (see Theorem 7.2 for details):
The size of each coreset is constant, i.e., independent of n.
The rotation and translation that minimize the sum of squared distances between the coresets also exactly minimize this sum between the original sets.
The same coresets and the above guarantees remain valid, and thus can be used, also after the observed set Q is rotated or shifted.
The coresets can be computed in time linear in n, using one pass over the input points and memory that is independent of n.
The constant behind these bounds depends only on r and d, where r is the rank of the input matrix and d is its dimension.
For example, in our experiments, where IR markers are placed on a quadcopter as in Fig. 2 (left), the coreset is of constant size. This enabled us to obtain significant improvements even for very small values of n, partially for the reasons that are explained in the next section.
Unlike a uniform random sample of markers, as used, e.g., in RANSAC-type algorithms, our core-set is guaranteed to produce an optimal rotation matrix for the original pair using only the small coreset pair. This theoretical guarantee translates into practical results in our experiments and videos, where the error scale drops from meters to centimeters.
4 Why core-set for Pose Estimation?
Pose Estimation. Given the above coresets, we can solve the pose-estimation problem in constant time per frame by solving it on the coreset. Property (iv) allows us to compute the coreset in time linear in n, which is the same time that is needed to solve the pose-estimation problem itself. However, in the following observed frames we can use Property (iii) of the coresets above and reuse them while the set of markers on the observed rigid body keeps moving. This reduces the computation time for the pose-estimation problem in each frame from linear in n to constant.
While our coreset is designed for the pose-estimation problem, the fact that the number of tracked markers is reduced from n to a constant allows us to also handle efficiently the following pre-processing problems for pose estimation that are mentioned in Section 1:
Detection of the set Q of markers in the observed image is reduced to detecting only the few markers required by the coreset.
Matching the points in Q to the points in P, instead of considering all n! permutations, now reduces to choosing one among a constant number of permutations.
The improvement in the running time and performance of these first two steps, prior to solving the pose-estimation problem, depends on their exact implementation. For example, if we apply the ICP heuristic on the coreset instead of the original set, then we can stop detecting and inserting points into Q once we have detected, for each point in the coreset, a matching (nearest-neighbour) point in Q whose distance is below the required threshold.
Memory. In tracking algorithms that are based on computer vision, such as PTAM, we need to track hundreds of visual features. Tracking only a small subset of them saves not only time but also memory.
Streaming and Distributed Computation. Our coreset construction is applied in a streaming fashion (see Algorithm 1), using one pass over the points and memory that is independent of n. In particular, the compression (reduction) of each subset of the input markers can be computed independently and in parallel on different computers or GPUs, and the results can then be merged and possibly re-compressed. This technique is common in coreset constructions, and we refer the reader to the coreset literature for details.
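The merge-and-reduce pattern can be sketched generically as follows. This is an illustrative skeleton under our own naming, with a placeholder `reduce_set` (a weight-preserving subsample) standing in for the actual coreset compression.

```python
import numpy as np

def reduce_set(points, weights, k):
    """Placeholder compression: keep at most k points and rescale their
    weights so the total weight is preserved. A real coreset construction
    (e.g., Algorithm 1) would go here instead."""
    if len(points) <= k:
        return points, weights
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    kept = weights[idx] * (weights.sum() / weights[idx].sum())
    return points[idx], kept

def merge_and_reduce(stream, k):
    """One pass over the stream with O(k) memory: buffer up to 2k weighted
    points, compress to k, and repeat; partial compressions can also run
    in parallel on disjoint chunks and be merged afterwards."""
    pts = np.empty((0, stream.shape[1]))
    wts = np.empty(0)
    for p in stream:
        pts = np.vstack([pts, p[None, :]])
        wts = np.append(wts, 1.0)
        if len(pts) >= 2 * k:
            pts, wts = reduce_set(pts, wts, k)   # merge + re-compress
    return reduce_set(pts, wts, k)
```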
5 Real Time Tracking System
While our coresets are small and optimal, they come with a price: unlike random sampling, which takes sub-linear time to compute (without going over all the markers), computing our coreset takes the same time as solving the pose estimation problem on the same frame. To this end, we use Property (iii) from Section 3 to suggest a real-time tracking and localization system that is based on the following two threads that run in parallel.
The first thread, which we run at 1-3 FPS, gets a snapshot (frame) of the markers and computes the coreset for this frame. This includes marker identification, solving the matching problem, and then computing the actual coreset for the original set of markers P and the observed set Q.
The second thread, which calculates the object’s pose, runs on every frame; in our low-cost tracking system (see Section 8) it is fast enough to handle every incoming camera frame. It does so by using the last computed coreset on the new frames, until the first thread computes a new coreset for a later frame. The assumption of this model is that, for frames that are close to each other in time, the translation and rotation of the observed set of markers will be similar to the translation and rotation of the set in the previous frame, up to a small error. Property (iii) guarantees that the coreset computed for the earlier frame remains valid for the new frame.
Our experimental results in Section 8 demonstrate the tracking system using a toy quadcopter with visual and IR features.
Table 1 shows the time-complexity comparison between solving the problem on our coreset and solving it on the entire data, for the different steps of the localization problem, as is done, e.g., in ICP. The first row of the table represents step (1), where the matching has already been computed and what is left is to compute the optimal rotation between the two sets of points. Test (B) in Section 8 demonstrates the practical improvement of the coreset in this case.
The second row represents step (2) of the localization problem, where the matching needs to be computed given the rotation. In this case, a perfect matching between two point sets can be computed in polynomial time via weighted bipartite matching. Without a coreset, the size of both sets is n. When using a coreset, the size of P is reduced to a constant, although the size of Q remains n.
The last row represents the case where we need to compute the matching between two sets of points and the correct alignment is not given. In this case there are n! possible permutations of the original set, each with its own optimal rotation. Using the coreset, the number of permutations is reduced dramatically, since it suffices to match correctly only the coreset points. Test (A) in Section 8 demonstrates an actual test in which we have a noisy matching.
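To illustrate why reducing the number of permutations matters, here is a small brute-force matcher over all n! permutations, which is only feasible when one of the sets is as small as a coreset. This is a generic sketch of ours, not the paper’s matching procedure.

```python
import itertools
import numpy as np

def match_points(P, Q):
    """Brute-force optimal matching over all n! permutations: feasible only
    for tiny n, e.g., when P is replaced by a constant-size coreset.
    Returns pi minimizing sum_i ||P[i] - Q[pi[i]]||^2."""
    n = len(P)
    best, best_cost = None, np.inf
    for pi in itertools.permutations(range(n)):
        cost = ((P - Q[list(pi)]) ** 2).sum()
        if cost < best_cost:
            best, best_cost = pi, cost
    return np.array(best)
```

For larger sets, a polynomial-time bipartite matching algorithm replaces the factorial loop; the point here is that the coreset makes even the naive search affordable.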
6 Pose Estimation
Suppose that we have a matrix P whose n rows represent a set of n markers on a rigid body. Let Q be a corresponding matrix whose rows represent the locations of the markers after rotating the body, with some additive random noise. Recall that a rotation matrix R is an orthogonal matrix (RᵀR = I) whose determinant is 1. We can now define the following pose estimation problem.
Definition 1 (Wahba’s Problem)
Find a rotation matrix R between the two paired sets of points P and Q which minimizes the following root-mean-squared (RMS) deviation between them,

  min over rotation matrices R of sqrt( (1/n) · Σᵢ ‖qᵢ − R pᵢ‖² ),

where pᵢ and qᵢ denote the i-th rows of P and Q. We denote this minimum by OPT(P, Q).
The Kabsch algorithm suggests a simple solution for Wahba’s problem. Let UDVᵀ be a Singular Value Decomposition (SVD) of the matrix PᵀQ. That is, U and V are orthogonal matrices, and D is a diagonal matrix whose entries are non-negative and non-increasing. In addition, assume that det(VUᵀ) = 1; otherwise, invert the signs of one of the columns of U. Note that D is unique, but there might be more than one such factorization.
Theorem 6.1 (Kabsch algorithm)
The matrix R = VUᵀ minimizes the RMS deviation of Definition 1 over every rotation matrix, i.e., it is an optimal solution to Wahba’s problem.
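The theorem can be checked numerically. The following sketch (assuming NumPy; the setup and names are ours) builds a noisy rotated copy of a point set, computes R = VUᵀ with the determinant sign fix described above, and verifies that it beats randomly drawn rotations on the RMS cost.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((40, 3))                  # rows: markers p_i
R_true = np.linalg.qr(rng.standard_normal((3, 3)))[0]
if np.linalg.det(R_true) < 0:                     # force a proper rotation
    R_true[:, 0] *= -1
Q = P @ R_true.T + 0.01 * rng.standard_normal((40, 3))   # q_i ~ R p_i + noise

def wahba_cost(R):
    """RMS deviation of Definition 1 for a candidate rotation R."""
    return np.sqrt(((Q - P @ R.T) ** 2).sum() / len(P))

# Kabsch: SVD of P^T Q, then R = V U^T with a determinant sign fix
U, _, Vt = np.linalg.svd(P.T @ Q)
D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
R_kabsch = Vt.T @ D @ U.T

# the Kabsch rotation should be at least as good as any random rotation
for _ in range(100):
    R_rand = np.linalg.qr(rng.standard_normal((3, 3)))[0]
    if np.linalg.det(R_rand) < 0:
        R_rand[:, 0] *= -1
    assert wahba_cost(R_kabsch) <= wahba_cost(R_rand) + 1e-12
```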
7 Coreset for Pose Estimation
A distribution is a vector in Rⁿ whose entries are non-negative and sum to one, i.e., a vector in the unit simplex.
The sparsity (or support size) of a distribution is the number of its non-zero entries.
Our goal is to compute a sparse distribution which defines a small weighted subset of the markers in P, and its corresponding markers in Q, such that solving the pose estimation problem on this pair of small sets yields an optimal solution for the original pair (P, Q). More generally, the coreset should satisfy the properties in Section 3 that are formalized in the following definition.
Definition 2 (Coreset for pose estimation)
Let w be a distribution that defines the weighted pair of matrices above. Then w is a pose coreset for the pair (P, Q) if, for every pair of rotation matrices and every pair of translation vectors, the following holds: a rotation matrix that minimizes the weighted sum of squared distances for the transformed coreset pair, over every rotation matrix, is also optimal for the correspondingly transformed original pair.
This implies that we can use the same coreset even if the set Q is translated or rotated over time. A coreset is efficient if it is also small (i.e., the distribution vector is sparse).
Algorithm 1:
Input: A pair of matrices P and Q.
Output: A sparse pose-coreset for (P, Q); see Definition 2.

Algorithm 2:
Input: A weighted set of vectors.
Output: A distribution of small sparsity whose weighted sum equals that of the input.
We prove that such a coreset of constant sparsity (independent of n) always exists. Moreover, we provide an algorithm that, given any pair P and Q of n points, returns such a coreset in time linear in n. To this end, we need to better understand the set of possible optimal solutions to the pose estimation problem, using the following lemma, and to introduce Caratheodory’s theorem later.
The solution in Theorem 6.1 is in general not unique and depends on the specific chosen SVD. In particular, suppose that a rotation R is an optimal solution for our coreset pair and is also optimal for the original pair, as desired. Still, using the Kabsch algorithm with a different SVD of the coreset matrix might yield a different matrix which is optimal for the coreset pair, but not for the original pair. We thus need to prove the following lemma, which further characterizes the set of optimal solutions.
Recall that UDVᵀ is the SVD of PᵀQ, and let r denote the rank of PᵀQ, i.e., the number of non-zero entries on the diagonal of D. Let Iᵣ denote the diagonal matrix whose diagonal is 1 in its first r entries and 0 otherwise.
Lemma 1. Let R be a rotation matrix that satisfies the orthogonality conditions above with respect to U and V. Then R is an optimal rotation. Moreover, the resulting product matrix is unique and independent of the chosen Singular Value Decomposition of PᵀQ.
It is easy to prove that a rotation R is optimal if it maximizes the trace tr(R·PᵀQ); see the references for details. Indeed, this trace splits into two terms: the first can be rewritten using the fact that the trace is invariant under cyclic permutations, and the second is handled similarly.
For the uniqueness of the matrix, observe that it can be written in terms of a matrix square root and a pseudo-inverse. Here, a square root of a matrix A is a matrix B such that BB = A, and A⁺ denotes the pseudo-inverse of A. Since the relevant matrix is positive semidefinite, it has a unique square root, and since the pseudo-inverse of a matrix is also unique, we conclude that the product is unique, which proves the claim.
Theorem 7.1 (Caratheodory’s Theorem). If a point x lies in the convex hull of a set S in d-dimensional space, then there is a subset of S consisting of d + 1 or fewer points such that x lies in the convex hull of that subset.
Overview of Algorithm 1. To obtain an object’s pose, we need to apply the Kabsch algorithm on the matrix PᵀQ, i.e., on the sum of the outer products of corresponding rows; calculating this sum takes O(nd²) time. We want to replace the summation over all n vectors by a small weighted sum defined by a pose-coreset of constant sparsity, computed in Algorithm 1. This is done by choosing the distribution such that the corresponding weighted sum of outer products is diagonal in the SVD basis. In this case, the rotation matrix of the coreset pair and of the original pair will be the same by Theorem 6.1. It thus remains to represent the original sum exactly by a weighted subset of its terms.
Algorithm 1 reads the first few points and removes one of them using Algorithm 2, which assigns new weights to the points such that the new weighted sum is the same but one of the weights is zero. Then the next point is inserted and another one is removed, until all the points have been inserted. The output is the final weighted set of points.
Overview of Algorithm 2. This algorithm gets a weighted set of vectors and returns a weighted subset of these vectors whose weighted sum is the same. The idea is based on the proof of Theorem 7.1. The weighted mean of the input vectors equals a convex combination of them. We compute (possibly negative) coefficients which realize the same mean and whose sum is zero. Adding a multiple of these coefficients to the original weight vector yields a set of weights whose sum is still one and whose weighted mean is unchanged. There are values of the multiplier for which one of the weights becomes zero; taking the smallest such value makes sure that the other entries remain non-negative; see Fig. 1. The result is the desired non-negative weighted set of vectors whose weighted mean is the same.
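The two overviews above can be sketched together in a few lines of NumPy. This is our own compact version of the insert-then-compress loop for the plain weighted-mean form of Caratheodory’s theorem; the paper’s actual algorithms apply the same idea to the outer-product vectors of Section 7.

```python
import numpy as np

def caratheodory_step(X, w):
    """Algorithm 2 (one step): given points X (m x d) with positive weights w
    and m > d + 1, return weights with the same weighted sum and total
    weight, but with at least one more zero entry."""
    m = X.shape[0]
    # any v with sum(v) = 0 and X^T v = 0 leaves both the total weight and
    # the weighted sum unchanged when a multiple of it is subtracted from w
    A = np.vstack([X.T, np.ones(m)])
    v = np.linalg.svd(A)[2][-1]                  # a null-space direction
    if not (v > 1e-12).any():
        v = -v
    pos = v > 1e-12
    alpha = (w[pos] / v[pos]).min()              # largest step keeping w >= 0
    w_new = np.clip(w - alpha * v, 0.0, None)
    w_new[np.argmin(np.abs(w_new))] = 0.0        # force the vanished entry to 0
    return w_new

def caratheodory_stream(X):
    """Algorithm 1: one pass over the rows of X, keeping at most d + 2
    weighted points in memory; returns a distribution over all n rows with
    at most d + 1 non-zero entries and the same weighted mean."""
    n, d = X.shape
    idx = np.arange(0)                           # indices of buffered points
    w = np.zeros(0)                              # their (unnormalized) weights
    for i in range(n):
        idx = np.append(idx, i)
        w = np.append(w, 1.0)
        while len(idx) > d + 1:                  # compress back to d + 1 points
            w = caratheodory_step(X[idx], w)
            keep = w > 0
            idx, w = idx[keep], w[keep]
    out = np.zeros(n)
    out[idx] = w / n                             # normalize to a distribution
    return out
```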
We are now ready to prove the main theorem of this paper.
Theorem 7.2. Let P and Q be a pair of matrices, and let r denote the rank of the matrix PᵀQ. Then a call to the procedure returns a pose-coreset of constant sparsity (depending only on r and d) for (P, Q) in time linear in n; see Definition 2.
Proof. Let w be a distribution of constant sparsity, and suppose that the corresponding weighted sum of outer products is a diagonal matrix with at most r non-zero entries. Here pᵢ and qᵢ are column vectors which represent the i-th rows of P and Q, respectively, and the weighted rows are defined accordingly. Let UDVᵀ be an SVD of PᵀQ, and let R be an optimal rotation of this pair; see Theorem 6.1. We need to prove that R is also optimal for the weighted pair.
We assume without loss of generality that both point sets have zero mean, since translating the pair of matrices does not change the optimal rotation between them.
By (8), the weighted sum admits an SVD that shares its factors with the SVD of PᵀQ up to scaling. Replacing the corresponding factors in Lemma 1, we obtain that the optimal rotation of the original pair satisfies the conditions of the lemma for the weighted pair as well; this yields that it is an optimal rotation for the weighted pair, as desired.
It remains to compute a sparse coreset as defined in (8). Since Q is of rank r, its rows are contained in an r-dimensional subspace. Without loss of generality, we thus assume that the last d − r entries of every row in Q are zeros; otherwise we rotate the coordinate system. For every i, let vᵢ be a vector that consists of the entries of the corresponding outer-product matrix, excluding its last rows and its diagonal.
After translating and scaling each vᵢ appropriately, each vector lies on the unit ball, and thus their convex combination lies in the convex hull of the set. Applying Theorem 7.1 yields a distribution of constant sparsity that realizes the same convex combination. By defining the coreset weights accordingly, we obtain a distribution with the desired weighted sum.
Hence, the non-diagonal entries of the weighted sum are the same as the non-diagonal entries of the original sum, which are all zeros. That is, there is a diagonal matrix as required, which satisfies (8), as desired.
Linear-time implementation. To compute the coreset we first need to compute an SVD of PᵀQ, to get the matrices U and V, which takes O(nd²) time. Computing the set of vectors vᵢ takes additional linear time. Next, we need to apply the Caratheodory Theorem on this set. The proof in [6, 9] is constructive and removes one vector from the set at a time until only a small set of vectors is left. Since each such iteration takes time proportional to the current number of vectors, applying it naively to all n vectors is too slow. To obtain a linear running time, we apply the algorithm on only a constant number of vectors at a time and run a single iteration to remove one of them; see Algorithm 1. Then we continue to add the next vector and re-compress, until we are left with a sparse distribution that represents the entire set, as implemented in Algorithm 2.
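As a simplified end-to-end illustration of the construction (our own sketch, assuming NumPy): instead of excluding the diagonal in the rotated basis as in the proof, we apply Caratheodory’s theorem to all d² entries of each outer product pᵢqᵢᵀ. This yields a slightly larger but still constant sparsity of d² + 1 = 10 for d = 3, and preserves the matrix PᵀQ exactly (up to scale), so the Kabsch rotation computed from the weighted subset equals the one computed from all n points.

```python
import numpy as np

def kabsch(P, Q, w=None):
    """Optimal rotation for (weighted) matched rows of P and Q."""
    if w is None:
        w = np.full(len(P), 1.0 / len(P))
    H = (P * w[:, None]).T @ Q                    # sum_i w_i p_i q_i^T
    U, _, Vt = np.linalg.svd(H)
    D = np.eye(P.shape[1])
    D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))  # proper-rotation fix
    return Vt.T @ D @ U.T

def caratheodory(X):
    """Distribution of sparsity <= dim + 1 with the same weighted mean."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    while np.count_nonzero(w) > d + 1:
        idx = np.flatnonzero(w)
        A = np.vstack([X[idx].T, np.ones(len(idx))])
        v = np.linalg.svd(A)[2][-1]               # null direction: keeps
        if not (v > 1e-12).any():                 # weight sum and mean fixed
            v = -v
        pos = v > 1e-12
        alpha = (w[idx][pos] / v[pos]).min()
        w_idx = np.clip(w[idx] - alpha * v, 0.0, None)
        w_idx[np.argmin(np.abs(w_idx))] = 0.0     # the entry that vanished
        w[idx] = w_idx
    return w

# build a noisy rotated copy of a random marker set
rng = np.random.default_rng(2)
n, d = 200, 3
P = rng.standard_normal((n, d))
R_true = np.linalg.qr(rng.standard_normal((d, d)))[0]
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
Q = P @ R_true.T + 0.05 * rng.standard_normal((n, d))

# pose-coreset: weights preserving the mean of the vectors vec(p_i q_i^T)
X = np.einsum('ni,nj->nij', P, Q).reshape(n, d * d)
w = caratheodory(X)                               # sparsity <= d^2 + 1 = 10
R_full = kabsch(P, Q)                             # rotation from all points
R_core = kabsch(P, Q, w)                          # rotation from <= 10 points
```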
8 Experimental Results
In this section we describe experiments conducted on data sets collected from real hovering tests using a sensorless, light-weight toy quadcopter. We compare the rotation error, in degrees, of computing the orientation of the quadcopter using our coreset against using a uniform random sample of markers on the quadcopter. The ground truth in the first test is an OptiTrack system, whilst in the second test the ground truth is the orientation computed from all the markers on the quadcopter. Both the coreset and the random subset are of equal size. All our experiments show that the coreset is consistently and significantly better than a uniform sample of the same size. The error in all the experiments is measured by the differences in degrees between the estimated angles (Roll, Pitch, Yaw) and the ground truth.
In both tests, the coreset was computed every k frames, and the random points were also sampled every k frames, where k is called the calculation cycle. The chosen weighted points were used for the next k frames, and then a new coreset was computed by Algorithm 1. The effective rank here is small, as the LEDs on the quadcopter are roughly placed in a planar configuration.
See attached video for demonstrations and results: https://vimeo.com/154337055.
8.1 Few Infra-Red Markers
In our experiments we developed a low-cost, home-made tracking system that can replace commercial systems in the task of tracking and controlling toy micro quadcopters without any on-board sensors, assuming no high-speed maneuvers are performed (which would require professional and expensive hardware such as Vicon or OptiTrack). Our suggested tracking system requires only two components: (A) a mini-computer, such as a Raspberry Pi, or a laptop/desktop, and (B) a pair of standard web-cams.
The workspace covered by our system, when using two webcams, is roughly the same as the OptiTrack workspace using two OptiTrack Flex3 cameras, and the cameras’ fields of view are comparable. However, the OptiTrack system obviously outperforms our low-cost system in applications that require fast maneuvers.
In order for the system to control a toy micro quadcopter, it needs to compute the pose of the quadcopter fast enough. The sensorless quadcopter requires a steady, high-rate stream of control commands in order to hover and not crash. Using a SIFT feature detector under such hardware constraints was not practical. For faster detection in each frame, we placed infrared LEDs on the quadcopter and modified the web-cams’ lenses to let only infrared spectrum rays pass. We could not place many LEDs on such a micro quadcopter, since solving the matching problem on many markers is impractical, and due to weight constraints.
We computed the 3D location of each LED using triangulation. Afterwards, we computed a coreset for those 3D locations, and sampled a random subset of the same size. The ground truth in this test was obtained from the OptiTrack system. The control of the quadcopter based on its positioning was done using a simple PID controller.
For different calculation cycles, we computed the average error throughout the whole test. The results are shown in Fig. 2(b).
8.2 Many Visual Features
In this test, we placed a simple planar pattern on a quadcopter, as shown in Fig. 1(b). An OptiTrack motion capture system was used to perform an autonomous hover with the quadcopter, whilst two other 2D grayscale cameras mounted above the quadcopter collected and tracked features from the pattern using a SIFT feature detector; see the submitted video. The matching between the SIFT features in the two images is considered a noisy matching, as discussed in Section 4. Given 2D points from two calibrated cameras, we were able to compute the 3D location of each detected feature using triangulation. A coreset, as described in Section 7, was computed from those 3D locations, alongside a uniform random sample of the same size. The quadcopter’s orientation was then estimated by computing the optimal rotation matrix, using the Kabsch algorithm, on both the coreset and the randomly sampled points. The ground truth in this test was obtained by running the Kabsch algorithm on all the points in the current frame.
For different calculation cycles, we computed the average error throughout the whole test, as shown in Fig. 2(a). The number of detected SIFT features varied from frame to frame, and most of the features did not last for more than a few consecutive frames; we therefore tested the coreset with small calculation cycles.
We can see that the average errors in this test were smaller than the average errors in the previous test. This is due to the low-cost hardware in the previous test (e.g., web-cams as compared to OptiTrack’s cameras) and to the difference between the ground-truth measurements in the two tests.
8.3 Time Complexity
In this section, time-comparison tests were conducted to demonstrate this paper’s main practical results. The data in these tests was sampled uniformly at random using MATLAB. The tests were conducted on a laptop with an Intel Core i7-4710HQ CPU @ 2.50GHz running Ubuntu 14.04.
The first test compares the time needed to compute the pose estimation given a coreset vs. using the full set of points. For this test we assume the matching between the points is given. The test has two different cases: (a) using an increasing number of points while maintaining a constant dimension, and (b) using a constant number of points in many different dimensions. The results are shown in Figures 2(c) and 2(d), respectively. This test is described by the first row of Table 1. Figure 2(c) indeed shows that when the number of points n is not larger than the coreset size, the computation times are roughly identical, and that as n grows beyond the coreset size, the computation time using the full set of points continues to grow linearly with n, while the computation time using the coreset ceases to increase since it is independent of n. Figure 2(d) shows that the coreset indeed yields smaller computation times compared to the full set of points when the dimension is large, and that both yield roughly the same computation time for small dimensions.
- IoTracker: an internet-of-things-based tracking system, 2015. To be published with a final version of this paper.
- Amazon [2015a] Amazon. Raspberry Pi Model A+ (256MB). http://www.amazon.com/Raspberry-Pi-Model-A-256MB/dp/B00PEX05TO/ , 2015a. [Online; accessed 15-September-2015].
- Amazon [2015b] Amazon. VAlinks® USB 2.0 Webcam Web Cam Camera. http://www.amazon.com/VAlinks%C2%AE-Microphone-Adjustable-Computer-Supports/dp/B0135KJWTQ/ , 2015b. [Online; accessed 15-September-2015].
- Bay et al.  Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer vision–ECCV 2006, pages 404–417. Springer, 2006.
- Besl and McKay  Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Robotics-DL tentative, pages 586–606. International Society for Optics and Photonics, 1992.
- Carathéodory  Constantin Carathéodory. Über den variabilitätsbereich der fourier’schen konstanten von positiven harmonischen funktionen. Rendiconti del Circolo Matematico di Palermo (1884-1940), 32(1):193–217, 1911.
- Chew et al.  L Paul Chew, Michael T Goodrich, Daniel P Huttenlocher, Klara Kedem, Jon M Kleinberg, and Dina Kravets. Geometric pattern matching under Euclidean motion. Computational Geometry, 7(1):113–124, 1997.
- Choi et al.  Sunglok Choi, Taemin Kim, and Wonpil Yu. Performance evaluation of ransac family. Journal of Computer Vision, 24(3):271–300, 1997.
- Eckhoff et al.  Jürgen Eckhoff et al. Helly, radon, and carathéodory type theorems. Handbook of convex geometry, pages 389–448, 1993.
- Feldman et al.  Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2142–2150. 2011.
- Friedman et al.  Jerome H Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209–226, 1977.
- Golub and Reinsch  Gene H Golub and Christian Reinsch. Singular value decomposition and least squares solutions. Numerische mathematik, 14(5):403–420, 1970.
- Guerrero-Castellanos et al.  JF Guerrero-Castellanos, H Madrigal-Sastre, S Durand, N Marchand, WF Guerrero-Sánchez, and BB Salmerón. Design and implementation of an attitude and heading reference system (ahrs). In Electrical Engineering Computing Science and Automatic Control (CCE), 2011 8th International Conference on, pages 1–5. IEEE, 2011.
- Hartley and Sturm  Richard I Hartley and Peter Sturm. Triangulation. Computer vision and image understanding, 68(2):146–157, 1997.
- Kabsch  Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 32(5):922–923, 1976.
- Kjer and Wilm  Hans Martin Kjer and Jakob Wilm. Evaluation of surface registration algorithms for PET motion correction. PhD thesis, Technical University of Denmark, DTU, DK-2800 Kgs. Lyngby, Denmark, 2010.
- Klein and Murray  Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’07), Nara, Japan, November 2007.
- Lee and Park  Jung Keun Lee and Edward J Park. A minimum-order Kalman filter for ambulatory real-time human body orientation tracking. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3565–3570. IEEE, 2009.
- Lowe  David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
- Pulli and Shapiro  Kari Pulli and Linda G Shapiro. Surface reconstruction and display from range and color data. Graphical Models, 62(3):165–201, 2000.
- Stanway and Kinsey  M Jordan Stanway and James C Kinsey. Rotation identification in geometric algebra: Theory and application to the navigation of underwater robots in the field. Journal of Field Robotics, 2015.
- Valencia and Vargas  Carlos E. Valencia and Marcos C. Vargas. Optimum matchings in weighted bipartite graphs. Boletín de la Sociedad Matemática Mexicana, 2015.
- Wahba  Grace Wahba. A least squares estimate of satellite attitude. SIAM review, 7(3):409–409, 1965.
- Wang and Sun  Lu Wang and Xiaopeng Sun. Comparisons of iterative closest point algorithms. In Ubiquitous Computing Application and Wireless Sensor, pages 649–655. Springer, 2015.
- Zhang  Zhengyou Zhang. Iterative point matching for registration of free-form curves and surfaces. International journal of computer vision, 13(2):119–152, 1994.