Minimal Solutions for Panoramic Stitching Given Gravity Prior

12/01/2020 ∙ by Yaqing Ding, et al. ∙ Nanjing University

When capturing panoramas, people tend to align their cameras with the vertical axis, i.e., the direction of gravity. Moreover, modern devices, such as smartphones and tablets, are equipped with an IMU (Inertial Measurement Unit) that can measure the gravity vector accurately. Using this prior, the y-axes of the cameras can be aligned or assumed to be already aligned, reducing their relative orientation to 1-DOF (degree of freedom). Exploiting this assumption, we propose new minimal solutions to panoramic image stitching of images taken by cameras with coinciding optical centers, i.e., undergoing pure rotation. We consider four practical camera configurations, assuming unknown fixed or varying focal length with or without radial distortion. The solvers are tested both on synthetic scenes and on more than 500k real image pairs from the Sun360 dataset and from scenes captured by us using two smartphones equipped with IMUs. It is shown that they outperform the state-of-the-art both in terms of accuracy and processing time.




1 Introduction

Panoramic image stitching is a fundamental problem in computer vision. When solving this problem, we are given a sequence of images taken from a single point in space with a camera rotating around some 3D axis. The objective is to map the images into a common reference frame and to create a larger image composed of the captured ones, thus covering a much wider field-of-view than each individual image. In other words, the goal is to estimate the unknown relative rotation and inner calibration parameters of cameras with coinciding optical centers, i.e., cameras undergoing a pure rotational motion. In the computer vision community, this problem is often considered to be solved with a number of existing solutions

[hartley2003multiple, brown2007minimal, fitzgibbon2001simultaneous, jin2008three, byrod2009minimal, kukelova2015radial]. However, in this paper, we will show that the existing solutions do not exploit all available information which can be easily obtained from recent devices. This information can be used to simplify both the problem formulation and solution.


Figure 2: Panorama stitching with known gravity vector.

A typical panorama stitching pipeline consists of the following two major steps.

  1. Pair-wise computation: In the first step, image features are obtained and tentative correspondences are matched between all image pairs. From these correspondences, inner and outer calibration parameters of pairs of cameras are estimated robustly, and the correspondences consistent with the calibration are selected. The robust estimation is usually based on solving the stitching problem from a minimal number of input correspondences, i.e., solving the problem using a minimal solver, in a RANSAC framework [brown2007minimal].

  2. Bundle adjustment (BA): The pair-wise estimates of inner and outer calibration parameters are refined jointly over all images using a non-linear optimization.

(a) Full panoramic image
(b) Image pair
(c) Stitching
Figure 6: Example scenes and stitching results from the SUN360 (top) and the created Smartphone (bottom) datasets.

Several minimal solvers exist which can be used in the pair-wise step of the panoramic stitching pipeline. One of the basic well-known solvers for panoramic stitching assumes fully calibrated pinhole cameras and estimates the unknown rotation from two point correspondences [horn1987closed]. However, the camera intrinsic parameters may not be known, which is often the case in real-world applications. Solutions to different camera configurations can be divided into two main categories: the unknown focal length case for narrow field-of-view cameras, and the unknown radial distortion case for wide-angle cameras. For the unknown focal length case, one of the most commonly used solvers is the normalized 4-point linear solution for homography estimation [hartley2003multiple]. In [brown2007minimal], the authors demonstrated that 2 and 3 points are sufficient for obtaining the homography induced by a rotation with 1 and 2 unknown focal length parameters, respectively.

In practice, almost all cameras exhibit some amount of lens distortion. Moreover, wide-angle lenses with significant radial lens distortion are now commonly used in smart devices. These wide-angle cameras introduce large distortions that cannot be ignored. In [fitzgibbon2001simultaneous, jin2008three], it was shown that modeling lens distortion inside the minimal solver is critical for obtaining high-quality homography estimates. Without modeling lens distortion in the pair-wise step, not only are the initial pose and calibration estimates less accurate, but the set of inliers used for the non-linear optimization does not cover the whole image and misses especially highly distorted points near the image borders. Without such points, the non-linear optimization in the BA stage may not be able to correctly estimate the lens distortion and to produce good stitching results. To address the case where there is lens distortion in the images, Fitzgibbon [fitzgibbon2001simultaneous] used the single-parameter division model to simultaneously estimate the homography and radial distortion using 5 point correspondences. With this division model, a minimal 3-point solver was proposed for panoramic stitching with equal and unknown radial distortion [jin2008three, byrod2009minimal]. In [kukelova2015radial], two solvers were proposed for estimating the homography between two cameras with different radial distortions.

All the mentioned minimal solvers estimate the full 3-DOF rotation. This, especially for cameras with radial distortion, makes the resulting system of polynomial equations complicated. However, panoramas are usually captured with cameras or mobile phones aligned with the direction of gravity. Moreover, recent devices are usually equipped with an IMU sensor that can measure the gravity vector accurately. Using the gravity prior, the y-axes of the cameras can be aligned, reducing their relative orientation to 1-DOF. This prior not only simplifies the geometry and the polynomial systems that have to be solved but, also, reduces the number of correspondences needed for the estimation. This is extremely important since the processing time of RANSAC-like robust estimation depends exponentially on the sample size. The gravity prior was used to simplify different minimal relative [fraundorfer2010minimal, naroditsky2012two, sweeney2014solving, 2014Relative, saurer2017homography, ding2019efficient, Ding_2020_CVPR] and absolute pose solvers [kukelova2010closed, 2015Efficient, albl2016rolling]. Surprisingly, it was not considered in existing panoramic stitching solvers.

In this paper, we present the first minimal solutions to panoramic image stitching of images taken by two cameras with coinciding optical centers, i.e., undergoing pure rotation, under the assumption of a known gravity direction. We consider four practical camera configurations:

  1. Equal focal length: The images are captured by a camera with a fixed unknown focal length and known or negligible radial distortion (narrow field-of-view).

  2. Varying focal lengths: The images are captured by cameras with different focal lengths, e.g., by a zoom camera, with negligible or known distortion.

  3. Equal focal length and radial distortion: The images are captured by a camera with fixed unknown focal length and radial distortion.

  4. Varying focal lengths and radial distortion: The most general case, where the images are captured by cameras with different focal lengths and distortions, e.g., using a wide-angle camera that was zooming during the image capture.

Experiments both on synthetic data and on more than 500k real image pairs demonstrate that our new solvers are superior to the state-of-the-art methods both in terms of accuracy and computational efficiency, i.e., the run-time.

2 Problem Statement

Let us assume that we are given a set of 3D points observed by two cameras undergoing a pure rotational motion. Let x_i and x'_i be the corresponding measured distorted image projections of the i-th 3D point in the two cameras, in their homogeneous form. The undistorted image points f(x_i) and f(x'_i), obtained with some undistortion function f, are related by

λ'_i K_2^{-1} f(x'_i) = R λ_i K_1^{-1} f(x_i),   (1)

where λ_i, λ'_i are the projective depths, K_1, K_2 are the intrinsic camera matrices, and R is the unknown relative rotation between the cameras.

We use the one-parameter division model to parameterize the radial undistortion function f. This model is especially suitable for minimal solvers, since it is able to express a wide range of distortions with a single parameter and usually results in simpler equations compared with other distortion models. This model was used, e.g., by [fitzgibbon2001simultaneous] for the simultaneous estimation of two-view geometry and lens distortion, for the estimation of radial distortion homographies [kukelova2015radial], and, also, for panorama stitching with radial distortion [byrod2009minimal, jin2008three].
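As a concrete sketch, the one-parameter division model maps a distorted pixel to homogeneous undistorted coordinates; the function name and numeric values below are illustrative, not from the paper's implementation.

```python
import numpy as np

def undistort_division(point, k):
    """One-parameter division model: map a distorted pixel (x, y) to the
    homogeneous undistorted point [x, y, 1 + k*(x^2 + y^2)]."""
    x, y = point
    return np.array([x, y, 1.0 + k * (x * x + y * y)])

# With k = 0 the homogeneous point is unchanged; a negative k models
# the barrel distortion typical of wide-angle lenses.
p = undistort_division((0.2, -0.1), k=0.0)
q = undistort_division((2.0, 1.0), k=-0.1)
```

A single scalar k is all the solvers need to estimate per camera, which keeps the polynomial systems small.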

In the one-parameter division model, the undistortion function f, which undistorts a distorted image point with homogeneous coordinates x_i = [x_i, y_i, 1]^T using the undistortion parameter k, has the following form

f(x_i) = [x_i, y_i, 1 + k(x_i^2 + y_i^2)]^T.   (2)
In this paper, we assume that the gravity direction is known. This is a reasonable assumption, since panoramas are usually captured with cameras or mobile phones aligned with the direction of gravity. Moreover, smart devices are equipped with IMU sensors that can measure the gravity vector accurately. Using the gravity direction, we can compute the roll and pitch angles and use them to align the y-axes of the cameras. Let us denote the known rotation matrices used for the alignment of the two cameras as R_1 and R_2. In this case, (1) can be written as

λ'_i R_2 K_2^{-1} f(x'_i) = R_y λ_i R_1 K_1^{-1} f(x_i),   (3)

where R_y is the unknown rotation from the yaw angle (the unknown rotation around the y-axis).

For most modern CCD or CMOS cameras, it is reasonable to assume that the camera has square-shaped pixels and that the principal point coincides with the image center [hartley2012efficient]. In such a case, the calibration matrices have the form K_1 = diag(f_1, f_1, 1) and K_2 = diag(f_2, f_2, 1). Relationship (3) between the undistorted image points f(x_i) and f(x'_i) can be expressed using a homography matrix H as

f(x'_i) ≃ H f(x_i),   (4)

where H = K_2 R_2^T R_y R_1 K_1^{-1}, and ≃ indicates equality up to a scale factor. The scale, which is equal to the ratio λ_i/λ'_i of the projective depths, can be eliminated by multiplying (4) with the skew-symmetric matrix [f(x'_i)]_×, resulting in

[f(x'_i)]_× H f(x_i) = 0.   (5)
Moreover, equation (5) can be multiplied by the scalar f_1/f_2, resulting in

[f(x'_i)]_× G f(x_i) = 0,   (6)

with G = D_2 R_2^T R_y R_1 D_1, where D_1 = diag(1, 1, f_1) and D_2 = diag(1, 1, 1/f_2). Since G contains the unknown focal lengths only in its last row and column, one equation from (6) does not contain f_2 and k_2.
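The pure-rotation homography can be verified numerically. The sketch below, with made-up focal lengths and alignment angles, composes H = K_2 R_2^T R_y R_1 K_1^{-1} as in (4) and checks that it transfers a point exactly as projecting the rotated viewing ray does; the decomposition convention is our reading of the derivation above.

```python
import numpy as np

def rot_x(a):
    """Rotation about the x-axis (roll/pitch building block)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    """Rotation about the y-axis (the unknown yaw)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1.0, 0], [-s, 0, c]])

# Gravity-alignment rotations (known from the IMU) and the unknown yaw;
# the angles and focal lengths are made up for illustration.
R1, R2 = rot_x(0.05), rot_x(-0.03)
Ry = rot_y(0.7)
K1 = np.diag([1000.0, 1000.0, 1.0])
K2 = np.diag([1200.0, 1200.0, 1.0])

# Homography induced by pure rotation: H = K2 R2^T Ry R1 K1^{-1}.
H = K2 @ R2.T @ Ry @ R1 @ np.linalg.inv(K1)

# Transfer a point with H, then compare with projecting the viewing ray
# rotated into the second camera's frame.
x1 = np.array([150.0, -80.0, 1.0])
d2 = R2.T @ Ry @ R1 @ np.linalg.inv(K1) @ x1   # ray in camera-2 frame
x2 = K2 @ d2; x2 /= x2[2]
hx = H @ x1;  hx /= hx[2]
```

Both routes give the same pixel, confirming that a single 3x3 matrix captures the whole rotation-plus-intrinsics chain.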

Rotation matrix R_y can be parameterized using the Cayley parameterization, which results in a degree-2 polynomial matrix with only one parameter q as follows:

                | 1 - q^2     0       2q    |
(1 + q^2) R_y = |    0     1 + q^2     0    |   (8)
                |  -2q        0     1 - q^2 |

In this case, q corresponds to tan(θ/2), where θ is the yaw angle. Hence, cos θ = (1 - q^2)/(1 + q^2) and sin θ = 2q/(1 + q^2). Since (5) is homogeneous, the scale factor 1 + q^2 in (8) can be ignored. Note that this formulation introduces a degeneracy for a 180° rotation. An alternative formulation uses cosines and sines, which needs two parameters and produces an extra trigonometric identity constraint. In our experiments, we found that the cosine-sine representation leads to a larger template size for the G-J (Gauss-Jordan) elimination and, also, to more solutions. Therefore, we only focus on the Cayley parameterization in this paper. Our goal is to estimate the rotation, focal lengths and, potentially, radial distortion parameters of the two cameras using the minimal number of point correspondences.
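The Cayley parameterization in (8) can be sketched and checked against the standard y-axis rotation matrix; the angle below is arbitrary.

```python
import numpy as np

def cayley_ry(q):
    """Cayley-parameterized rotation about the y-axis, q = tan(theta/2).
    Returned unnormalized (scaled by 1 + q^2), as used inside the solvers;
    divide by (1 + q^2) to obtain a proper rotation matrix."""
    s = 1.0 + q * q
    return np.array([[1 - q * q, 0.0, 2 * q],
                     [0.0,       s,   0.0],
                     [-2 * q,    0.0, 1 - q * q]])

theta = 0.4
q = np.tan(theta / 2)
R = cayley_ry(q) / (1 + q * q)   # normalized rotation matrix
```

Every entry of the unnormalized matrix is a polynomial of degree at most 2 in q, which is what keeps the final univariate polynomials of the solvers at low degree.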

3 Minimal Solutions

Solver | Reference                        | Radial distortion | No. of points | No. of solutions | Gravity prior
H4     | [hartley2003multiple]            | no                | 4             | 1                | no
H5     | [fitzgibbon2001simultaneous]     | yes               | 5             | 18               | no
H2     | [brown2007minimal]               | no                | 2             | 3                | no
H3     | [brown2007minimal]               | no                | 3             | 7                | no
H3     | [jin2008three, byrod2009minimal] | yes               | 3             | 18               | no
H5     | [kukelova2015radial]             | yes               | 5             | 5                | no
H6     | [kukelova2015radial]             | yes               | 6             | 2                | no
H1(G)  | proposed                         | no                | 1             | 4 (2)            | yes
H2(G)  | proposed                         | no                | 2             | 4 (2)            | yes
H2(G)  | proposed                         | yes               | 2             | 6 (2)            | yes
H3(G)  | proposed                         | yes               | 3             | 6 (2)            | yes
Table 1: The properties of the proposed gravity-based (G) and state-of-the-art solvers. The numbers in parentheses are the solution counts for the special case of Section 3.5.

In this section, we present new minimal solutions to four practical camera configurations, all assuming that the images are aligned with the gravity direction.

3.1 Common Focal Length Solver - H1

A simple case arises when the two cameras have the same focal length, i.e., f_1 = f_2 = f, and the cameras have known or zero radial distortion, i.e., k_1 = k_2 = 0. This happens when the images are captured by a camera with fixed focal length and negligible distortion, i.e., a narrow field-of-view. It can also be used for images captured by some wide-angle smartphone cameras, which are often undistorted automatically.

This is a 2-DOF problem with respect to the unknowns q and f. Since each correspondence gives two linearly independent constraints, we only need a single one to solve this problem. Given one point correspondence, constraint (6) leads to three equations, from which only two are linearly independent. These equations have the following form

c_1^T m_1 = 0,   (9)
c_2^T m_2 = 0,   (10)
c_3^T m_3 = 0,   (11)

where c_1, c_2, c_3 are coefficient vectors that can be computed from the point correspondence and the gravity directions, and m_1, m_2, m_3 are vectors of monomials in q and f. Note that equations (9) and (10) can be multiplied with f to obtain polynomial equations in the two unknowns q and f with 9 monomials. Using (11), f can be expressed as a rational function of q. Substituting this expression into (9) or (10) and multiplying it with the denominator gives us a univariate polynomial in q of degree 6. This polynomial, however, always has the factor 1 + q^2, which can be eliminated. In this way, we obtain a quartic equation in q, which can be solved in closed form. Finally, there are up to 4 possible solutions.

3.2 Varying Focal Length Solver - H2

The next case that we consider assumes unknown and different focal lengths. This case occurs, e.g., when the images are taken with a zooming camera. It is a 3-DOF problem with respect to the unknowns q, f_1 and f_2, and we need at least 1.5 point correspondences to solve it. In practice, we still need to sample 2 points, but we only need 3 out of the 4 linearly independent equations. The 4th equation can be used to eliminate geometrically infeasible solutions. In this case, the constraint (6) leads to three polynomial equations of the same form as (9)-(11), with f_1 and f_2 in place of the common focal length f.

Note that equation (11) is of degree 3 and does not contain the unknown f_2. In this case, two correspondences give us two equations of form (11) in the two unknowns q and f_1. These equations can be used to solve for q and f_1: f_1 can be easily eliminated, leading to a quartic equation in q, which can be solved in closed form. This is done by rewriting the two equations of form (11) as

M(q) [f_1, 1]^T = 0,   (12)

where M is a 2x2 coefficient matrix whose entries are polynomials of degree 2 in the hidden variable q. Since (12) has non-trivial solutions, matrix M must be rank-deficient. Thus, det M(q) = 0, which is a univariate polynomial equation in q of degree 4. Once q is computed, f_1 can be extracted from the null space of matrix M. Then, substituting the solutions for q and f_1 into (9) yields up to 4 possible solutions to f_2.

3.3 Common Focal Length and Distortion Solver - H2

Another practical case occurs when the images are taken by a camera with fixed unknown focal length and radial distortion, i.e., f_1 = f_2 = f and k_1 = k_2 = k in (6). In this case, there are three unknowns, q, f and k. Thus, we need 1.5 point correspondences or, in practice, 2 point correspondences to solve this problem. Actually, from the second point correspondence, we will again use only one constraint.

Two out of the three equations from (6) are of degree 6 and one is of degree 4; the equation of degree four corresponds to the last row of the matrix [f(x'_i)]_×, which does not contain k, and multiplies only the first two rows of G. Therefore, to solve the problem, we use the equations of degree 6 and 4 from the first correspondence and the equation of degree 4 from the second correspondence. The system of these three equations in three unknowns can be solved using the Gröbner basis method. Using the automatic generator [larsson2017efficient], we obtained a solver that performs G-J elimination on a small template matrix and the eigenvalue decomposition of an 8x8 action matrix, i.e., the system has 8 solutions. In practice, we found that there are always two complex solutions, so there are up to 6 possible real solutions. An alternative and more efficient way to solve such a system is to eliminate f from the original equations using a hidden variable trick similar to the one in the H2 solver. In this way, we obtain a univariate polynomial of degree 8, which can be efficiently solved using Sturm sequences [gellert2012vnr]. More details on the elimination are in the supplementary material.

3.4 Varying Focal Length and Distortion Solver - H3

The final case that we consider is the most general one, when the focal lengths and distortion parameters of the two cameras are different. This is the situation, e.g., when the images are taken using a zooming wide-angle camera. For this problem, there are a total of five unknowns: q, f_1, f_2, k_1 and k_2. The minimal case is 3 point correspondences and, similarly to the 2-point algorithms, we only need one equation from the constraints implied by the last correspondence.

In this case, similarly to the previous solver, two out of the three equations from (6) are of degree 6 and one is of degree 4, however, now in five unknowns. Again, only two of these equations are linearly independent. Moreover, the equation of degree four, i.e., the equation corresponding to the last row of the matrix [f(x'_i)]_×, has a simpler form and contains only three unknowns, as follows:

c^T m(q, f_1, k_1) = 0,   (13)

where c is a coefficient vector and m(q, f_1, k_1) is a vector of monomials. To simplify the solver, we use only this simpler equation (13), of degree four in three unknowns, from the third point correspondence. In this way, we obtain 5 polynomial equations in 5 unknowns. However, three of them contain only the three unknowns q, f_1 and k_1.

Therefore, we can first use the three equations of form (13) to solve for q, f_1 and k_1. We solve these equations using the hidden variable technique [cox2006using]. The basic idea of this technique is to consider one of the variables as a hidden variable and then compute the resultant. We treat q as the hidden variable, i.e., we hide it in the coefficient matrix. The equations from the three correspondences can then be expressed in terms of the monomials in f_1 and k_1 as

M(q) v = 0,   (14)

where M is of size 3x3 with entries that are polynomials in the hidden variable q, and v is the vector of monomials in f_1 and k_1. In this case, M has the form

       | m_11^(2)  m_12^(2)  m_13^(2) |
M(q) = | m_21^(2)  m_22^(2)  m_23^(2) |,   (15)
       | m_31^(2)  m_32^(2)  m_33^(2) |

where the upper index denotes the degree of the respective polynomial m_ij. Since (14) has a non-trivial solution, matrix M must be rank-deficient. Thus, det M(q) = 0, which is a univariate sextic equation in q that can be solved using Sturm sequences. Once q is known, f_1 and k_1 can be extracted from the null space of the matrix M. Finally, f_2 and k_2 can be extracted from the remaining two equations.
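The rank-deficiency step behind (12) and (14) can be illustrated on a toy system; the coefficients below are invented for illustration and do not come from a real solver instance.

```python
import numpy as np

# Hidden-variable trick on a toy 2x2 system M(w) @ [u, 1]^T = 0:
# hide w in the coefficient matrix so the remaining unknown appears
# linearly, then force rank deficiency via det M(w) = 0.
# Entries are polynomials in w, stored high-to-low for numpy's routines.
m11 = [1.0, 0.0, 1.0]    # w^2 + 1
m12 = [2.0, -3.0]        # 2w - 3
m21 = [3.0, -1.0]        # 3w - 1
m22 = [1.0, 0.0, -2.0]   # w^2 - 2

# det M(w) = m11*m22 - m12*m21 : a quartic in w.
det = np.polysub(np.polymul(m11, m22), np.polymul(m12, m21))
roots = np.roots(det)

# For each real root, the remaining unknown is read off the null space
# of the now rank-deficient M(w).
for r in roots[np.abs(roots.imag) < 1e-9].real:
    M = np.array([[np.polyval(m11, r), np.polyval(m12, r)],
                  [np.polyval(m21, r), np.polyval(m22, r)]])
    v = np.linalg.svd(M)[2][-1]   # null vector, v ~ [u, 1] up to scale
```

The same pattern scales to the 3x3 case of (14), where the determinant is the sextic solved with Sturm sequences.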

3.5 Special case

If the y-axis of the camera is considered to be physically aligned with the gravity direction, i.e., R_1 and R_2 are identity matrices, many coefficients in the systems that appear in our solvers become zero. In such a scenario, all four solvers become very simple and result in one quadratic equation. The properties of all the stitching solvers are shown in Table 1.

4 Experiments

In this section, we study the performance of the proposed H1, H2, H2 and H3 solvers on both synthetic and real images. For comparison, we use the state-of-the-art minimal H2 and H3 solvers for zero distortion [brown2007minimal], the H5 [fitzgibbon2001simultaneous] and H3 solvers [jin2008three, byrod2009minimal] for equal and unknown distortion, and the H5 and H6 solvers [kukelova2015radial] for different and unknown distortions.

4.1 Synthetic Evaluation

(a) No radial distortion.
(b) Radial distortion.
Figure 9: (a) Boxplots of the focal length error and rotation error for the zero-distortion case. Left column: focal length error. Right column: rotation error. (b) Boxplots of the focal length error and distortion error for the distortion solvers. Left column: focal length error. Right column: distortion error. From top to bottom: increasing image noise; increasing roll noise with constant 2-pixel standard deviation image noise; increasing pitch noise with constant 2-pixel standard deviation image noise.

We choose the following setup to generate synthetic data. 200 points are randomly distributed in a box in the first camera's local coordinate system. A random but feasible rotation was applied to the points. The focal length was set to 1,000 pixels. We generated 1,000 different pairs of images with different rotations. The focal length error was defined as ξ_f = |f_est - f_gt| / f_gt, where f_est is the estimated and f_gt the ground truth focal length. For varying focal length solvers that return multiple estimates, we compute the focal length error using the geometric mean

ξ_f = sqrt(ξ_{f_1} ξ_{f_2}).
Fig. 9(a) reports the focal length (left column) and rotation errors (right column) for the solvers assuming zero distortion. The top row shows the performance under increasing image noise with different standard deviations. The proposed solvers perform significantly better than the other solvers under varying image noise. Since, in real applications, the alignment of the camera or the gravity vector measured by the IMU is not perfect, we also added noise to the roll and pitch angles to simulate this case. The middle and bottom rows show the performance with increasing roll and pitch noise under constant image noise of 2-pixel standard deviation. Our solvers are comparable to the state-of-the-art methods even with roll and pitch noise up to 0.5°. This upper bound on the noise is chosen to follow the noise of lower-grade MEMS IMUs [2014Relative]; accelerometers used in modern smartphones and camera-IMU systems, as well as more expensive high-grade accelerometers, have even lower noise levels [fraundorfer2010minimal].

We also study the performance of the radial distortion solvers. The distortion parameter was fixed, and the error was defined as ξ_k = |k_est - k_gt| / |k_gt|, where k_est is the estimated distortion. For solvers returning multiple distortion parameters, we compute the distortion error using the geometric mean. Fig. 9(b) reports the focal length (left) and radial distortion errors (right). The H3 solver is very sensitive to noise.
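The error metrics above can be sketched as follows, assuming the standard relative-error definitions (the precise formulas in the original may differ slightly):

```python
import numpy as np

def focal_error(f_est, f_gt):
    """Relative focal length error |f_est - f_gt| / f_gt."""
    return abs(f_est - f_gt) / f_gt

def distortion_error(k_est, k_gt):
    """Relative distortion error |k_est - k_gt| / |k_gt|."""
    return abs(k_est - k_gt) / abs(k_gt)

def combined_error(errors):
    """Geometric mean, used when a solver returns several estimates
    (e.g., two focal lengths or two distortion parameters)."""
    return float(np.prod(errors) ** (1.0 / len(errors)))
```

For example, a solver that overestimates one focal length by 10% and the other by 40% gets a combined error of sqrt(0.1 * 0.4) = 0.2.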

4.2 Real-world Experiments

To test the proposed techniques on real-world data, we chose the SUN360 [xiao2012recognizing] panorama dataset. The purpose of the SUN360 database is to provide academic researchers with a comprehensive collection of annotated panoramas covering a 360-degree full view for a large variety of environmental scenes, places and the objects within them. To build the core of the dataset, high-resolution panorama images were downloaded and grouped into different place categories.

To obtain radially distorted image pairs from each 360° panoramic scene, we cut out images simulating an 80° FOV camera with a step size of 10°. Thus, the rotation around the vertical axis between two consecutive images is always 10°. Finally, image pairs were formed by pairing all images with a common field-of-view in each scene. Moreover, to test the methods also in cases where there is no distortion, we undistorted the images using the ground truth distortion parameters. In each image, SIFT keypoints are detected in order to have a reasonably dense point cloud and a stable result [IMC2020]. We combined a mutual nearest neighbor check with the standard distance ratio test [lowe1999object] to establish tentative correspondences [IMC2020]. To get ground truth correspondences which can be used to calculate the re-projection error, we composed the ground truth homography and selected its inliers given the used inlier-outlier threshold.

To test the solvers on real-world data, we chose a locally optimized RANSAC, i.e., GC-RANSAC [barath2017graph]. In GC-RANSAC (and other locally optimized RANSACs), two different solvers are used: (a) one for estimating the homography from a minimal sample and (b) one for fitting to a larger-than-minimal sample when doing the final homography polishing on all inliers or in the local optimization step. For (a), the objective is to solve the problem using as few correspondences as possible, since the processing time depends exponentially on the number of correspondences required for the estimation. The proposed solvers are included in this step. For (b), when testing the methods on the distorted images, we applied the H6 solver [kukelova2015radial] to estimate the homography and distortion parameters from a larger-than-minimal sample. In the case when the undistorted images are used, we applied the normalized H4 [hartley2003multiple] homography estimator. The inlier threshold was set to pixels.

The cumulative distribution functions (CDF) of the robust estimation's processing times (in seconds), re-projection errors and focal length errors are shown in Fig. 12. Being accurate is interpreted as a curve close to the top-left corner. The top row reports the results on the distorted images and the bottom row shows the results on the undistorted images.

Distorted images. The run-times of the robust estimation using the proposed H1(G), H2(G) and H2(G) solvers are amongst the best ones, with H1(G) being the fastest solver. All these solvers comfortably run in real-time. In terms of re-projection error, the solvers not considering radial distortion have more very accurate results – their curves go up early. However, the curves of the solvers estimating the radial distortion go higher – they are, on average, more accurate – as expected. The proposed solvers which estimate the focal length and do not consider radial distortion lead to the most accurate focal lengths. The best method, both in terms of processing time and focal length accuracy, is the proposed H1(G) solver.

Undistorted images. After the distortion parameters are used to undistort the images, the most accurate results are clearly obtained by the solvers not considering radial distortion. While, again, the fastest methods are the proposed ones, they are now also the most accurate, both in terms of re-projection and focal length errors.

Smartphone images. To further show the benefits of the proposed solvers, we tested them on data recorded from two devices (iPhone 6s, iPhone 11 Pro) with three cameras: the wide-angle cameras of the iPhone 6s and iPhone 11 Pro with focal lengths of 29mm and 26mm, respectively, and the ultra wide-angle camera of the iPhone 11 Pro with a focal length of 13mm. The sequences were captured at 1280x720@30Hz, with the IMU readings captured at 100Hz. The images and IMU data are synchronized based on their timestamps. Image pairs for the two wide-angle cameras and for the ultra wide-angle camera, with synchronized gravity directions, were obtained. Since we do not know the ground truth homographies and, thus, the ground truth inliers, we do not measure the re-projection error on these datasets. The CDFs of the focal length errors and processing times are shown in Fig. 15(a). The same trend can be seen as before, i.e., the proposed solvers lead to the fastest and most accurate robust estimation.

(a) Distorted images
(b) Undistorted images
Figure 12: The cumulative distribution functions of the processing times (in seconds), average re-projection errors (in pixels) and focal length errors of Graph-Cut RANSAC when combined with different minimal solvers. The values are calculated from a total of image pairs from the SUN360 dataset. The confidence was set to and the inlier threshold to px. Being accurate is interpreted as a curve close to the top-left corner. (Top) distorted images. (Bottom) undistorted images.
(a) The captured smartphone dataset.
(b) Theoretical RANSAC iterations.
Figure 15: (a) The cumulative distribution functions of the processing time (in seconds) and focal length errors of GC-RANSAC when combined with different minimal solvers. The values are calculated from a total of image pairs from the captured phone dataset. Being accurate is interpreted as a curve close to the top-left corner. (b) The theoretical number of RANSAC iterations for each solver plotted as the function of the outlier ratio. The confidence was set to .

4.3 Computational Complexity

The computational complexity and run-time of a single estimation of the compared solvers are reported in the following table. We only show the major steps performed by each solver. The numbers in the cells denote the matrix size to which the G-J elimination or eigen-decomposition is applied. The number in the fourth column denotes the degree of the univariate polynomial which needs to be solved.

Solver | G-J | Eigen | Poly | Time (μs)
H1(G)  |  -  |   -   |  4   |  5
H2(G)  |  -  |   -   |  4   |  5
H2(G)  |  -  |   -   |  8   |  5
H3(G)  |  -  |   -   |  6   |  5
H2     |  -  |   -   |      |  6
H3     |  -  |   -   |      |  7
H3     |  -  |       |      |  80
H5     |     |   -   |      |  40
H5     |     |   -   |      |  12
H6     |     |   -   |  2   |  14

The theoretical number of RANSAC iterations is shown in Fig. 15(b), plotted as a function of the outlier ratio in the data. It can be seen that the proposed solvers, due to requiring the fewest correspondences, lead to a reasonable number of RANSAC iterations even in the most challenging cases.
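The exponential dependence on the sample size can be made concrete with the standard RANSAC iteration-count formula; the outlier ratio and confidence below are illustrative.

```python
import math

def ransac_iterations(sample_size, outlier_ratio, confidence=0.99):
    """Theoretical number of RANSAC iterations needed to draw at least one
    all-inlier minimal sample with the requested confidence."""
    p_good = (1.0 - outlier_ratio) ** sample_size  # P(sample is all inliers)
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# Fewer points per sample means exponentially fewer iterations at high
# outlier ratios, which is why the 1- and 2-point solvers are so fast.
counts = {m: ransac_iterations(m, outlier_ratio=0.8) for m in (1, 2, 3, 4, 5, 6)}
```

At 80% outliers, the 1-point solver needs roughly twenty iterations while a 6-point solver needs tens of thousands, dominating the robust estimation time regardless of the per-call solver cost.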

5 Conclusions

In this paper, we propose four new minimal solvers for image stitching, considering different camera configurations and cameras aligned with the gravity direction. These configurations include solvers for fixed and varying focal lengths with or without radial distortion. The proposed methods are tested on synthetic scenes and on more than 500k image pairs from publicly available datasets. Since we could not find datasets for image stitching with an available gravity vector, we captured a new one, consisting of 7770 image pairs in total, with smartphones equipped with IMU sensors.