Image rectification is a crucial component of fast stereo matching, especially on smartphones and other platforms with limited computational resources. With rectified images as input, correspondence matching is restricted to the same scanline, which greatly reduces the computational cost.
Traditionally, image rectification is the process of applying a pair of projective transformations (i.e., homographies) to a pair of images (i.e., the master image and the slave image) whose epipolar geometry is known, so that the epipolar lines in the original images map to horizontally aligned lines in the transformed images. Examples of co-located image fragments before and after rectification are shown in Fig. 1(b). However, the existing image rectification paradigm has two shortcomings. First, obtaining the epipolar geometry requires offline or online calibration. Although capturing stereo images with dual-lens cameras is straightforward, calibrating them offline is cumbersome, because it involves 1) setting up the calibration environment; 2) calibrating each dual-lens camera individually; and 3) keeping the dual-lens module fixed after calibration, since otherwise the calibrated parameters become less useful or not useful at all. Second, applying a pair of projective transformations has two side effects that are undesirable for stereo vision tasks. One is that the transformed images contain undefined regions, which may cause ambiguities during stereo matching. The other is that the transformed images are geometrically distorted, which is undesirable for high-quality depth-based applications such as depth-of-field rendering (Fig. 1(d)).
Notice that laterally displaced dual-lens cameras are a commonly adopted set-up, for example on smartphones and robots. Beyond physical dual-lens cameras, if a hand-held camera moves horizontally and captures the scene at two time instants, we also interpret this setting as a dual-lens camera in a general sense. Based on this configuration, we propose a novel self-rectification algorithm for uncalibrated stereo images called DSR. Our proposal keeps the master image unchanged and applies a homography only to the slave image. Moreover, no information other than the stereo images is required. To achieve these features, we carefully examine the feasibility for laterally displaced stereo and formulate self-rectification as a regression problem that does not require explicit knowledge of the epipolar geometry. In particular, the vertical displacement error of detected correspondence pairs between the master image and the slave image is first minimized. Then a shearing transformation is computed to minimize the geometric distortion of the transformed slave image. Lastly, the transformed slave image is shifted horizontally to bring the largest disparity to a prescribed value, which facilitates the follow-up stereo matching algorithm.
To demonstrate the effectiveness of the proposed self-rectification algorithm, we evaluate DSR on synthetic stereo images rendered with settings similar to those of smartphones, where each synthesized pair has slightly different camera parameters. We also evaluate it on real stereo images captured by dual-lens smartphones in various scenarios. Our method is also applicable to vertically displaced cameras by switching the x- and y-axes in the problem formulation. Experiments on both synthetic and real stereo images show that our approach provides promising results, outperforming prior state-of-the-art solutions both quantitatively and qualitatively.
In contrast to previous approaches that target the rectification of calibrated or uncalibrated stereo, our contributions are:
- We find that for laterally displaced stereo cameras, the homographies can be computed in a novel way without requiring the epipolar geometry, thus reducing the calibration cost for every pair of dual-lens cameras.
- We present a self-rectification approach that introduces zero geometric distortion to the master image, which yields stable results for stereo matching and for depth-based image applications.
- We carefully examine the applicability and limitations of the proposed method, showing that it applies to a wide range of laterally displaced dual-lens cameras.
The paper is organized as follows. We review related work in Section 2. In Section 3, we elaborate on the proposed self-rectification method. Experimental results and conclusions are presented in Section 4 and Section 5, respectively.
2 Related Works
Rectification is a classical problem in stereo vision. If the intrinsic and extrinsic camera parameters are pre-calibrated, one may adopt the compact algorithm proposed by Fusiello et al. to find the homographies in a few lines of code. However, such camera parameters are often unavailable, and they are volatile under any mechanical misalignment of the stereo rig.
To address this problem, one alternative to full calibration is to estimate the epipolar geometry (i.e., the fundamental matrix or the essential matrix), which captures all the geometric information contained in the two images. An overview of the relevant techniques can be found in the review by Zhang. Given such epipolar geometry, Loop et al. estimated the homographies by decomposing them into a specialized projective transform, a similarity transform and a shearing transform, so as to reduce the geometric distortion of the rectified image pair. Gluckman and Nayar proposed a rectification method that minimizes the resampling effect, namely the loss of pixels due to under-sampling and the creation of new pixels due to over-sampling. Isgro and Trucco proposed an approach that avoids explicit computation of the fundamental matrix by estimating the rectifying homographies directly from image correspondences. Fusiello and Irsara proposed a quasi-Euclidean rectification algorithm that adopts the Sampson error to better capture the geometric reprojection error. More recently, Zilly et al. proposed a technique to jointly estimate the epipolar geometry and the rectification parameters for almost parallel stereo rigs. Although quite a few methods have been proposed to reduce unwanted geometric distortion, the homographies are easily affected by uncertainties in the epipolar geometry, which may result in unstable or unreasonable disparities after stereo matching.
In contrast to previous approaches that require a calibrated stereo rig or an estimated epipolar geometry, we propose a practical solution called DSR for laterally displaced stereo, where the displacement between the stereo cameras is almost purely horizontal. By estimating a homography only for the slave image while minimizing geometric distortion, we achieve state-of-the-art rectification accuracy. The method is also shown to be very effective for follow-up stereo matching algorithms.
In this section, we elaborate on the proposed self-rectification method. By convention, matrices and vectors are denoted by boldface uppercase and boldface lowercase letters, respectively, and scalars by italic uppercase or lowercase letters. For ease of representation, we adopt the homogeneous coordinate system commonly used in 3D computer vision, where 2D image points are represented by 3D column vectors, e.g., $\mathbf{p} = [x, y, 1]^T$. Since points in homogeneous coordinates are scale-invariant, $[x, y, 1]^T$ and $[\lambda x, \lambda y, \lambda]^T$ with $\lambda \neq 0$ denote the same point. We first describe dual-lens cameras and state our small-drift assumption in Section 3.1. Based on this assumption, the self-rectification approach, DSR, is presented in Section 3.2. Lastly, in Section 3.3, we use simulated experiments to quantify the small-drift assumption.
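As a small illustration of the homogeneous representation, the following Python sketch converts between 2D and homogeneous coordinates and shows that scaling a homogeneous vector leaves the image point unchanged. It assumes NumPy; the helper names and the sample point are ours and do not come from the paper.

```python
import numpy as np

def to_homogeneous(points_2d):
    """Append a 1 to each 2D point: (N, 2) -> (N, 3)."""
    points_2d = np.asarray(points_2d, dtype=float)
    ones = np.ones((points_2d.shape[0], 1))
    return np.hstack([points_2d, ones])

def from_homogeneous(points_h):
    """Divide by the last coordinate: (N, 3) -> (N, 2)."""
    points_h = np.asarray(points_h, dtype=float)
    return points_h[:, :2] / points_h[:, 2:3]

# Hypothetical image point; any non-zero scale gives the same 2D location.
p = to_homogeneous([[120.0, 45.0]])
same_point = 3.0 * p
```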
3.1 Motivations and Assumptions
Dual-lens cameras are widespread, for example on robots and mobile phones. Beyond physical stereo cameras, if a hand-held camera moves horizontally and captures the scene at two time instants, we also interpret this setting as a stereo camera in a general sense. In such a system, the line connecting the two camera centers is almost parallel to the image planes and to the scanlines (i.e., the x-axis) of the image plane. In Fig. 2(a) and Fig. 2(b), we plot the epipolar geometry of a general stereo vision system and of a laterally displaced stereo vision system, respectively. Each plot shows the master and slave image planes, the two camera centers, and a point in 3D space together with its projections on the two image planes. The line joining the two camera centers intersects the image planes at two locations termed the epipoles. If the 3D point moves along the ray through the master camera center and its master projection, its projection on the slave image lies on an epipolar line, and vice versa. The two epipolar lines thus form a pair of corresponding lines. The objective of rectification is to push both epipoles to infinity, so that corresponding lines in the two image planes fall on the same scanline.
In a laterally displaced stereo vision system, Fig. 2(b), the line joining the two camera centers is almost parallel to the master image plane, which places the master epipole far from the image center. In the perfect case, where this line is strictly parallel to the master image plane, the master image needs no projective transformation because its epipolar lines are already parallel to the scanlines. In practice, however, the stereo rig may suffer from small perturbations, leading to a relative rotation and/or a relative translation with respect to the perfect case. We analyze these perturbations below. For simplicity, we use the perfect case as a reference and distinguish the perfect slave camera center and image plane from their imperfect counterparts.
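The perfect case can be checked numerically. The sketch below builds the fundamental matrix for two cameras that differ only by a translation along the x-axis and confirms that the epipolar line of a master pixel is the scanline with the same y-coordinate. The intrinsics, baseline and pixel are hypothetical values chosen for illustration, and the convention $\mathbf{X}_{slave} = \mathbf{R}\mathbf{X}_{master} + \mathbf{t}$ is an assumption of this sketch.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K_master, K_slave, R, t):
    """F = K_slave^{-T} [t]_x R K_master^{-1} for X_slave = R X_master + t."""
    E = skew(t) @ R
    return np.linalg.inv(K_slave).T @ E @ np.linalg.inv(K_master)

# Hypothetical intrinsics and a purely lateral 10 mm baseline (in meters).
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
t = np.array([-0.01, 0.0, 0.0])
F = fundamental_matrix(K, K, np.eye(3), t)

p_master = np.array([400.0, 150.0, 1.0])
line = F @ p_master              # epipolar line (a, b, c): a*x + b*y + c = 0
scanline_y = -line[2] / line[1]  # a is zero here, so the line is y = -c / b
```

The point at infinity along the x-axis, $[1, 0, 0]^T$, lies in the null space of `F`, which is the numerical counterpart of the epipole being pushed to infinity.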
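A. Relative Rotation. In this case, the imperfect slave image plane differs from its perfect counterpart only by a rotation, as shown in Fig. 3(a), and the two camera centers are co-located. It can be proved that a homography always exists between the two image planes. For an arbitrary point $\mathbf{P}$ in 3D space, expressed in homogeneous coordinates, its projections $\bar{\mathbf{p}}'$ on the perfect slave image plane and $\mathbf{p}'$ on the imperfect one are

$$\bar{\mathbf{p}}' = \bar{\mathbf{K}}' [\mathbf{I} \mid \mathbf{0}] \mathbf{P}, \tag{1}$$

$$\mathbf{p}' = \mathbf{K}' [\mathbf{R} \mid \mathbf{0}] \mathbf{P}, \tag{2}$$

where $\bar{\mathbf{K}}'$ and $\mathbf{K}'$ are the intrinsic camera matrices, $\mathbf{I}$ is the $3 \times 3$ identity matrix, $\mathbf{0}$ is a vector of zeros indicating no translation, and $\mathbf{R}$ is the relative rotation matrix. Substituting (1) into (2), we can derive

$$\mathbf{p}' = \mathbf{K}' \mathbf{R} \bar{\mathbf{K}}'^{-1} \bar{\mathbf{p}}'.$$

Therefore, in this case the homography $\mathbf{K}' \mathbf{R} \bar{\mathbf{K}}'^{-1}$ can be computed without approximation.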
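The pure-rotation case can be verified with a few lines of NumPy. The sketch below projects random 3D points with a perfect and a rotated slave camera sharing the same center, and checks that the homography built from the two intrinsic matrices and the rotation maps one set of projections onto the other. All numeric values (intrinsics, angles, point ranges) and function names are hypothetical.

```python
import numpy as np

def rotation_from_euler(rx, ry, rz):
    """Rotation matrix Rz @ Ry @ Rx from angles in radians."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def project(K, R, points_3d):
    """Project (N, 3) points with a camera at the origin: p ~ K R P."""
    x = (K @ R @ points_3d.T).T
    return x[:, :2] / x[:, 2:3]

def apply_homography(H, points_2d):
    """Apply a 3x3 homography to (N, 2) points."""
    homog = np.hstack([points_2d, np.ones((len(points_2d), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

K_perfect = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
K_actual = np.array([[1010.0, 0.0, 322.0], [0.0, 1010.0, 238.0], [0.0, 0.0, 1.0]])
R = rotation_from_euler(*np.deg2rad([1.0, -0.5, 0.8]))

rng = np.random.default_rng(0)
points = np.column_stack([rng.uniform(-1, 1, 50),
                          rng.uniform(-1, 1, 50),
                          rng.uniform(2, 10, 50)])

perfect = project(K_perfect, np.eye(3), points)
actual = project(K_actual, R, points)
H = K_actual @ R @ np.linalg.inv(K_perfect)   # exact homography for pure rotation
mapped = apply_homography(H, perfect)
```

The mapping is exact regardless of scene depth, which is why rotation alone never violates the homography model.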
B. Relative Translation. In this case, the perfect and imperfect slave image planes are parallel, but the imperfect camera center is offset from its perfect counterpart by a tiny shift, as shown in Fig. 3(b). We ignore the shift along the x-axis, as we can always find a perfect reference with the same x-coordinate. The projection of a 3D point onto the imperfect slave image plane then differs from its projection onto the perfect plane only through the remaining shift components along the y- and z-axes. When both components are sufficiently small, the shift can be ignored. We call this the small-drift assumption. In this case, an approximate homography can be found between the two image planes. Note that since the master camera center and the perfect slave camera center have no relative translation along the y- and z-axes in 3D, these two shift components are also the relative translation between the master image and the slave image.
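To give a feel for the scale involved, the following sketch measures the raw vertical pixel displacement caused by shifting a camera center along the y- and z-axes, before any homography is fitted. It is only an illustration under assumed values: the focal length, the scene depth range (1 to 5 m) and the two shift magnitudes are hypothetical, and the measured displacement is an upper bound on what a homography would have to absorb.

```python
import numpy as np

def project(K, points_3d, center=(0.0, 0.0, 0.0)):
    """Project (N, 3) points with an unrotated camera located at `center`."""
    shifted = points_3d - np.asarray(center, dtype=float)
    x = (K @ shifted.T).T
    return x[:, :2] / x[:, 2:3]

def max_vertical_drift(K, points_3d, t_y, t_z):
    """Largest change in pixel y when the camera center moves by (0, t_y, t_z)."""
    reference = project(K, points_3d)
    drifted = project(K, points_3d, center=(0.0, t_y, t_z))
    return float(np.max(np.abs(drifted[:, 1] - reference[:, 1])))

K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
points = np.column_stack([rng.uniform(-0.5, 0.5, 200),
                          rng.uniform(-0.5, 0.5, 200),
                          rng.uniform(1.0, 5.0, 200)])

small = max_vertical_drift(K, points, t_y=1e-4, t_z=1e-4)  # 0.1 mm shifts
large = max_vertical_drift(K, points, t_y=1e-2, t_z=1e-2)  # 10 mm shifts
```

With these assumed values, the sub-millimeter shift stays well below one pixel, whereas the centimeter-scale shift produces a depth-dependent displacement of several pixels that no single homography can remove.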
C. Relative Rotation and Relative Translation. If both relative rotation and relative translation exist, one can first rotate one image plane so that the two planes are parallel, without loss of generality. The problem then reduces to the case of relative translation, with the same approximation.
3.2 The DSR Algorithm
As discussed in Section 3.1, if the small-drift assumption is satisfied, we can find an approximate homography that aligns the slave image with the master image. This assumption is easily satisfied by dual-lens smartphones, since the cameras are fixed in the phone with very small relative shifts. In contrast to prior works that use the epipolar geometry between a pair of images to find homographies for both images, we only need to find one homography for the slave image, leaving the master image unaltered.
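Given a set of corresponding points, identified either by a feature matching approach or manually, we aim to find the appropriate transformation matrix. Let $\{(\mathbf{p}_i, \mathbf{p}'_i)\}_{i=1}^{N}$ be the set of corresponding points, where $\mathbf{p}_i = [x_i, y_i, 1]^T$ lies in the master image and $\mathbf{p}'_i = [x'_i, y'_i, 1]^T$ lies in the slave image. Let $\mathbf{H}$ be the transformation matrix for the slave image, where $h_{11}, \ldots, h_{33}$ are its 9 entries:

$$\mathbf{H} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}.$$

Since multiplying a homography by a non-zero scalar does not change it in homogeneous coordinates, we simply let $h_{33} = 1$. To find the remaining 8 elements of $\mathbf{H}$, we decompose it into three matrices:

$$\mathbf{H} = \mathbf{H}_k \mathbf{H}_s \mathbf{H}_y,$$

where $\mathbf{H}_y$ aligns the y-coordinates of corresponding pixels, $\mathbf{H}_s$ is a shearing matrix that reduces the geometric distortion of the transformed slave image, and $\mathbf{H}_k$ shifts the image horizontally to guarantee negative disparities for the subsequent stereo matching. We present the computation of each of them as follows.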
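The decomposition can be sketched as a product of three 3x3 matrices applied to the slave points. The code below assumes that the vertical alignment is applied first, followed by a shear and a horizontal shift that only modify the x-coordinate; the parameter names `s_a`, `s_b` and `k` and all numeric values are hypothetical.

```python
import numpy as np

def compose_dsr(H_y, s_a, s_b, k):
    """Compose H = H_k @ H_s @ H_y from a vertical-alignment matrix,
    two shear parameters and one horizontal shift."""
    H_s = np.array([[s_a, s_b, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    H_k = np.array([[1.0, 0.0, k], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    return H_k @ H_s @ H_y

def warp_points(H, points_2d):
    """Apply a 3x3 homography to (N, 2) points."""
    points_2d = np.asarray(points_2d, dtype=float)
    homog = np.hstack([points_2d, np.ones((len(points_2d), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Hypothetical parameters, for illustration only.
H_y = np.array([[1.0, 0.0, 0.0],
                [0.01, 1.02, -3.0],
                [1e-5, 2e-5, 1.0]])
H = compose_dsr(H_y, s_a=0.98, s_b=0.03, k=-12.0)
pts = np.array([[10.0, 20.0], [300.0, 200.0], [630.0, 470.0]])
```

Because the shear and the shift leave the second and third homogeneous coordinates untouched, the y-coordinates produced by `H` equal those produced by `H_y` alone, so the later steps cannot degrade the vertical alignment.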
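Computation of $\mathbf{H}_y$. We compute the vertical-alignment matrix $\mathbf{H}_y$ by minimizing the vertical alignment error between the master image and the transformed slave image. As the y-coordinates of the transformed points are determined by the last two rows of the homography, we define it as

$$\mathbf{H}_y = \begin{bmatrix} 1 & 0 & 0 \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix}.$$

Let $\mathbf{h}_2^T = [h_{21}, h_{22}, h_{23}]$ and $\mathbf{h}_3^T = [h_{31}, h_{32}, 1]$ be the second row and the third row of $\mathbf{H}_y$, and let $y_i$ and $\mathbf{p}'_i = [x'_i, y'_i, 1]^T$, $i = 1, \ldots, N$, be the y-coordinate of the $i$-th master point and its matched slave point. Then we minimize the vertical alignment error by solving the following problem,

$$\min_{\mathbf{h}_2, \mathbf{h}_3} \sum_{i=1}^{N} \left( \frac{\mathbf{h}_2^T \mathbf{p}'_i}{\mathbf{h}_3^T \mathbf{p}'_i} - y_i \right)^2. \tag{6}$$

By changing (6) as

$$\min_{\mathbf{h}_2, \mathbf{h}_3} \sum_{i=1}^{N} \left( \mathbf{h}_2^T \mathbf{p}'_i - y_i \, \mathbf{h}_3^T \mathbf{p}'_i \right)^2,$$

it becomes a multi-variable regression problem. It follows that the regression function is

$$\mathbf{A} \mathbf{h} = \mathbf{y}, \qquad \mathbf{A} = \begin{bmatrix} x'_1 & y'_1 & 1 & -x'_1 y_1 & -y'_1 y_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x'_N & y'_N & 1 & -x'_N y_N & -y'_N y_N \end{bmatrix},$$

with $\mathbf{h} = [h_{21}, h_{22}, h_{23}, h_{31}, h_{32}]^T$ and $\mathbf{y} = [y_1, \ldots, y_N]^T$. Hence the optimal solution is given by

$$\mathbf{h}^{*} = \mathbf{A}^{+} \mathbf{y},$$

where $\mathbf{A}^{+}$ denotes the pseudo-inverse of $\mathbf{A}$.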
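The regression step can be sketched as follows. The code assumes one plausible parametrization, in which the first row of the alignment matrix is fixed to $[1, 0, 0]$ and its bottom-right entry to 1, leaving five unknowns solved with a pseudo-inverse; the function names and the synthetic test values are ours.

```python
import numpy as np

def fit_vertical_alignment(master_pts, slave_pts):
    """Least-squares fit of the five free entries of the alignment matrix.

    Each match contributes one linear equation
    h21*x' + h22*y' + h23 - y*(h31*x' + h32*y') = y,
    where (x', y') is the slave point and y the master y-coordinate.
    """
    master_pts = np.asarray(master_pts, dtype=float)
    slave_pts = np.asarray(slave_pts, dtype=float)
    xs, ys = slave_pts[:, 0], slave_pts[:, 1]
    y = master_pts[:, 1]
    A = np.column_stack([xs, ys, np.ones_like(xs), -xs * y, -ys * y])
    h = np.linalg.pinv(A) @ y
    return np.array([[1.0, 0.0, 0.0],
                     [h[0], h[1], h[2]],
                     [h[3], h[4], 1.0]])

def warped_y(H_y, slave_pts):
    """y-coordinates of the slave points after applying H_y."""
    slave_pts = np.asarray(slave_pts, dtype=float)
    xs, ys = slave_pts[:, 0], slave_pts[:, 1]
    num = H_y[1, 0] * xs + H_y[1, 1] * ys + H_y[1, 2]
    den = H_y[2, 0] * xs + H_y[2, 1] * ys + H_y[2, 2]
    return num / den

# Synthetic, noise-free matches generated from a hypothetical ground truth.
rng = np.random.default_rng(0)
slave = np.column_stack([rng.uniform(0, 640, 50), rng.uniform(0, 480, 50)])
H_true = np.array([[1.0, 0.0, 0.0], [0.01, 1.02, -3.0], [1e-5, 2e-5, 1.0]])
master = np.column_stack([slave[:, 0] - 20.0, warped_y(H_true, slave)])

H_fit = fit_vertical_alignment(master, slave)
residual = np.max(np.abs(warped_y(H_fit, slave) - master[:, 1]))
```

Note that the linearized objective weights each match by the denominator of the projective division, so on noisy data it is an approximation of the original ratio-based error rather than an exact minimizer of it.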
Additionally, we apply RANSAC to make the proposed algorithm robust to possible outliers among the matched keypoints. In particular, as described in Algorithm 1, we compute a temporary alignment matrix from a set of randomly selected pairs of matched points. Let $y_i$ be the y-coordinate of the master point $\mathbf{p}_i$, and $\hat{y}'_i$ be the y-coordinate of the warped slave point. A pair of points is counted as an inlier if its vertical displacement error $|y_i - \hat{y}'_i|$ is less than a threshold $\epsilon$. We iterate this process at most a fixed number of times, and among the computed matrices select the one with the largest percentage of inliers as $\mathbf{H}_y$.
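A minimal RANSAC loop in the spirit of this paragraph might look as follows. Algorithm 1 itself is not reproduced here, so the sample size, threshold and iteration count below are hypothetical defaults, and the fitting routine repeats the linear regression under the same assumed parametrization (first row $[1, 0, 0]$, bottom-right entry 1).

```python
import numpy as np

def fit_hy(master_pts, slave_pts):
    """Linear least-squares fit of the vertical-alignment matrix."""
    xs, ys = slave_pts[:, 0], slave_pts[:, 1]
    y = master_pts[:, 1]
    A = np.column_stack([xs, ys, np.ones_like(xs), -xs * y, -ys * y])
    h = np.linalg.pinv(A) @ y
    return np.array([[1.0, 0.0, 0.0], [h[0], h[1], h[2]], [h[3], h[4], 1.0]])

def warped_y(H_y, slave_pts):
    """y-coordinates of the slave points after applying H_y."""
    xs, ys = slave_pts[:, 0], slave_pts[:, 1]
    num = H_y[1, 0] * xs + H_y[1, 1] * ys + H_y[1, 2]
    den = H_y[2, 0] * xs + H_y[2, 1] * ys + H_y[2, 2]
    return num / den

def ransac_vertical_alignment(master_pts, slave_pts, n_sample=8,
                              eps=1.0, max_iter=200, seed=0):
    """Return the candidate matrix with the largest inlier ratio.

    n_sample, eps and max_iter are illustrative defaults, not values
    taken from the paper.
    """
    master_pts = np.asarray(master_pts, dtype=float)
    slave_pts = np.asarray(slave_pts, dtype=float)
    rng = np.random.default_rng(seed)
    best_H, best_ratio = None, -1.0
    for _ in range(max_iter):
        idx = rng.choice(len(master_pts), size=n_sample, replace=False)
        H = fit_hy(master_pts[idx], slave_pts[idx])
        # Vertical displacement error of every match under this candidate.
        err = np.abs(warped_y(H, slave_pts) - master_pts[:, 1])
        ratio = float(np.mean(err < eps))
        if ratio > best_ratio:
            best_H, best_ratio = H, ratio
    return best_H, best_ratio
```

A common extension, not described in the text, is to refit the matrix on all inliers of the winning candidate; it is omitted here to stay close to the procedure as stated.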
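Computation of $\mathbf{H}_s$. After the vertical alignment, the image may suffer from geometric distortion. As suggested by Loop et al., we reduce this distortion by applying a shearing matrix, which is defined as

$$\mathbf{H}_s = \begin{bmatrix} s_a & s_b & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

Since the shearing transformation only changes the x-coordinate of a point, it does not affect the rectification accuracy of an image. We denote by $w$ and $h$ the width and height of the images, respectively. On the slave image, let $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$ and $\mathbf{d}$ be the midpoints of its top, right, bottom and left edges, respectively. Let $\hat{\mathbf{a}}$, $\hat{\mathbf{b}}$, $\hat{\mathbf{c}}$ and $\hat{\mathbf{d}}$ be the mapped points after applying the transformation $\mathbf{H}_y$, expressed in inhomogeneous coordinates, and we introduce

$$\mathbf{u} = \hat{\mathbf{b}} - \hat{\mathbf{d}} = [u_x, u_y]^T, \qquad \mathbf{v} = \hat{\mathbf{a}} - \hat{\mathbf{c}} = [v_x, v_y]^T$$

to ease the presentation. Then the parameters $s_a$ and $s_b$ are estimated by preserving the perpendicularity and the aspect ratio of the lines $\hat{\mathbf{b}}\hat{\mathbf{d}}$ and $\hat{\mathbf{a}}\hat{\mathbf{c}}$. In particular, writing $\tilde{\mathbf{u}} = [s_a u_x + s_b u_y, \; u_y]^T$ and $\tilde{\mathbf{v}} = [s_a v_x + s_b v_y, \; v_y]^T$ for the sheared vectors, we solve for $(s_a, s_b)$ with

$$\tilde{\mathbf{u}}^T \tilde{\mathbf{v}} = 0, \qquad \frac{\tilde{\mathbf{u}}^T \tilde{\mathbf{u}}}{\tilde{\mathbf{v}}^T \tilde{\mathbf{v}}} = \frac{w^2}{h^2}.$$

According to Loop et al., the solution is given, up to a common sign, by

$$s_a = \frac{h^2 u_y^2 + w^2 v_y^2}{h w \, (u_y v_x - u_x v_y)}, \qquad s_b = \frac{h^2 u_x u_y + w^2 v_x v_y}{h w \, (u_x v_y - u_y v_x)}.$$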
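The shear estimation can be sketched as below. The closed form follows the shearing step of Loop and Zhang; the exact midpoint coordinates (here `w/2`, `h/2` rather than pixel-center conventions), the orientation of the two difference vectors and the choice of the positive-`s_a` solution are assumptions of this sketch.

```python
import numpy as np

def apply_homography(H, points_2d):
    """Apply a 3x3 homography to (N, 2) points."""
    points_2d = np.asarray(points_2d, dtype=float)
    homog = np.hstack([points_2d, np.ones((len(points_2d), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def edge_vectors(H, w, h):
    """Vectors joining the mapped midpoints of opposite image edges."""
    midpoints = np.array([[w / 2, 0.0],   # top
                          [w, h / 2],     # right
                          [w / 2, h],     # bottom
                          [0.0, h / 2]])  # left
    a, b, c, d = apply_homography(H, midpoints)
    return b - d, a - c

def shear_from_alignment(H_y, w, h):
    """Shear that restores perpendicularity and the w:h aspect ratio."""
    u, v = edge_vectors(H_y, w, h)
    den = h * w * (u[1] * v[0] - u[0] * v[1])
    s_a = (h**2 * u[1]**2 + w**2 * v[1]**2) / den
    s_b = -(h**2 * u[0] * u[1] + w**2 * v[0] * v[1]) / den
    if s_a < 0:  # both signs solve the constraints; keep the non-mirroring one
        s_a, s_b = -s_a, -s_b
    return np.array([[s_a, s_b, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical alignment matrix and image size.
H_y = np.array([[1.0, 0.0, 0.0], [0.01, 1.02, -3.0], [1e-5, 2e-5, 1.0]])
H_s = shear_from_alignment(H_y, 640, 480)
u_s, v_s = edge_vectors(H_s @ H_y, 640, 480)
```

After the shear, `u_s` and `v_s` are perpendicular and their length ratio equals `w / h`; with an identity alignment matrix the function returns the identity.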
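Computation of $\mathbf{H}_k$. Finally, a shifting matrix $\mathbf{H}_k$ is introduced to facilitate the follow-up stereo matching. It shifts the slave image horizontally so that the maximum disparity takes a prescribed value, and has the following form,

$$\mathbf{H}_k = \begin{bmatrix} 1 & 0 & k \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

Let $\hat{x}'_i$ be the x-coordinate of the point obtained by mapping the slave point $\mathbf{p}'_i$ with the transformation $\mathbf{H}_s \mathbf{H}_y$. Then $k$ is simply the horizontal offset that brings the maximum disparity between the master x-coordinates $x_i$ and the shifted coordinates $\hat{x}'_i + k$ to this prescribed value.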
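The shift can be sketched in a few lines. The sign convention for disparity (warped slave x minus master x) and the default target of zero are assumptions made for this example so that all resulting disparities are non-positive; the source does not state these specifics in the surviving text.

```python
import numpy as np

def apply_homography(H, points_2d):
    """Apply a 3x3 homography to (N, 2) points."""
    points_2d = np.asarray(points_2d, dtype=float)
    homog = np.hstack([points_2d, np.ones((len(points_2d), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def shift_matrix(H_sy, master_pts, slave_pts, target_max_disparity=0.0):
    """Horizontal shift that sets the largest disparity to the target.

    Disparity is taken as (warped slave x) - (master x); this convention
    and the default target are assumptions of the sketch.
    """
    master_pts = np.asarray(master_pts, dtype=float)
    x_hat = apply_homography(H_sy, slave_pts)[:, 0]
    disparity = x_hat - master_pts[:, 0]
    k = target_max_disparity - np.max(disparity)
    return np.array([[1.0, 0.0, k], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical matches; the identity stands in for the product H_s @ H_y.
master = np.array([[100.0, 50.0], [200.0, 80.0], [300.0, 120.0]])
slave = np.array([[110.0, 50.0], [195.0, 80.0], [320.0, 120.0]])
H_k = shift_matrix(np.eye(3), master, slave)
shifted_disparity = apply_homography(H_k, slave)[:, 0] - master[:, 0]
```

In practice the maximum would be taken over RANSAC inliers only, since a single wrong match could otherwise dominate the shift.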
3.3 The Small-drift: How small is small?
To experimentally verify the validity of the small-drift assumption presented in Section 3.1, we generate synthetic stereo image pairs with a fixed baseline in a fixed scene (to reduce the variations in keypoint detection and matching), varying one of five variables at a time: three rotation angles and two translations. We run DSR with fixed parameter settings and calculate the proportion of aligned points, a metric described in Section 4.2. The corresponding accuracy curves are shown as solid lines in Fig. 4, where the left three sub-figures plot the accuracy under varying rotation, and the right two sub-figures plot the accuracy under varying translation. Since the corresponding points themselves contain alignment errors, we plot as a reference the accuracy curves obtained by running calibrated rectification with ground-truth camera parameters, shown as dotted lines. From these figures, we observe that 1) the alignment accuracy does not drop much as the rotation angles increase; and 2) the alignment accuracy is not equally sensitive to translation along the two axes. These observations correspond well to our theoretical analysis in Section 3.1. Based on these experiments, we empirically fix the translation ranges that we regard as small drift in the following experiments.
We introduce our evaluation datasets and experiment settings in Section 4.1. The proposed method DSR is then evaluated in Section 4.2. In Section 4.3, we further employ DSR as a pre-processing stage for other applications and demonstrate its effectiveness.
4.1 Experiment Settings
We use two datasets to evaluate our method: i) a synthetic dataset with simulated dual-lens camera settings and ii) a realistic dataset collected in various real-world scenarios. First, to verify the stability of the method across different mobile phone modules, we generated 1000 stereo pairs using the Unity software, under settings similar to those of a real dual-lens camera on a smartphone. The baseline is fixed, and to simulate fabrication randomness we vary the rotation angles and translations within a reasonable range. Second, to validate the effectiveness on realistic images, we use a dual-lens smartphone with two rear-facing cameras to collect another 1000 pairs of images in various scenarios. Some image samples are shown in Fig. 5.
4.2 Performance of DSR
Evaluation Metrics. We evaluate the rectification algorithms with two metrics. The first metric is the vertical alignment accuracy, defined as the proportion of well-aligned points (PAP):

$$\mathrm{PAP} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( |y_i - y'_i| < \epsilon \right),$$

where $y_i$ and $y'_i$ are the y-coordinates of the $i$-th pair of corresponding points in the two rectified images, $N$ is the total number of corresponding pairs, $\mathbb{1}(\cdot)$ is the indicator function, and the threshold $\epsilon$ has the same value as in the DSR algorithm. Conceptually similar to the commonly used reprojection error in camera calibration, the PAP metric also evaluates alignment accuracy. However, since a small portion of outliers inevitably exists during evaluation, we use the proportion of aligned points rather than the distance between re-projected points, which reduces the influence of these outliers. The second metric, the normalized vertex distance (NVD), measures the geometric distortion of an image. Let $w$ and $h$ be the width and height of the image, respectively. NVD is computed from the displacements of the four vertices of the image, normalized by the image size, where each displacement is the Euclidean distance between a vertex and its transformed point.
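Both metrics are straightforward to compute. In the sketch below, PAP follows the definition in the text; for NVD, the exact normalization is not recoverable from the text, so dividing the summed vertex displacements by the image diagonal is an assumption of this example, as are the function names and the default threshold of one pixel.

```python
import numpy as np

def proportion_aligned_points(y_master, y_slave, eps=1.0):
    """Fraction of matches whose vertical error is below eps (PAP)."""
    y_master = np.asarray(y_master, dtype=float)
    y_slave = np.asarray(y_slave, dtype=float)
    return float(np.mean(np.abs(y_master - y_slave) < eps))

def normalized_vertex_distance(H, w, h):
    """Summed displacement of the four image corners under H,
    divided by the image diagonal (assumed normalization)."""
    corners = np.array([[0.0, 0.0], [w, 0.0], [w, h], [0.0, h]])
    homog = np.hstack([corners, np.ones((4, 1))]) @ H.T
    warped = homog[:, :2] / homog[:, 2:3]
    distances = np.linalg.norm(warped - corners, axis=1)
    return float(distances.sum() / np.hypot(w, h))

pap = proportion_aligned_points([0.0, 10.0, 20.0, 30.0],
                                [0.5, 10.2, 25.0, 30.9])
shift = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
nvd_identity = normalized_vertex_distance(np.eye(3), 640, 480)
nvd_shift = normalized_vertex_distance(shift, 640, 480)
```

An untouched image, such as the master image in DSR, has an NVD of exactly zero under any normalization of this kind.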
Comparison and Evaluation. We compare with two other methods. The first is the widely used OpenCV implementation of calibrated rectification, which takes a set of offline-calibrated camera parameters and is referred to as CalRec in the rest of this paper. The second is the classic rectification algorithm proposed by Loop et al. We also tried to evaluate the newer algorithm proposed by Fusiello et al. However, when running it over the whole dataset, we observed that the implementation provided by the authors is unstable, because it uses Levenberg-Marquardt with all unknown variables initialized to zero. We therefore do not report its results here. For a fair comparison, both uncalibrated methods (i.e., Loop's and ours) share the same strategy and the same parameters for extracting corresponding points. In addition, the same RANSAC is applied to make both methods robust to possible outliers. The evaluation results on the synthetic and realistic datasets are shown in Table 1. As observed from the table, our method greatly reduces both the vertical alignment error and the geometric distortion compared with the method of Loop et al. Ideally, CalRec should have the best PAP on both datasets, given that the camera parameters are calibrated. However, the camera parameters of a smartphone commonly differ from those measured at fabrication, due to changes of focal length, movement of the camera modules, etc., which reduces the usefulness of the calibrated data.
Table 1: Evaluation on the synthetic dataset and on the realistic dataset.
On a desktop equipped with an Intel i5 CPU, the running times of the two methods, Loop et al. and our DSR, are reported in Table 2 for an input image resolution of 720×960. The homography estimation in DSR is extremely fast, with feature matching being the bottleneck. We believe it can be further accelerated by parallel processing or by optimization for customized platforms.
4.3 Applications on Depth-of-field Rendering
Applied to Stereo Matching. As rectification usually serves as a pre-processing step for stereo matching algorithms, it is worth knowing whether the rectified stereo images yield satisfactory stereo matching results. To this end, we choose the commonly adopted semi-global matching (SGM) from a list of stereo matching algorithms [17, 16, 18, 10], and compare the three rectification algorithms mentioned above. For convenience, we denote the combined pipelines as CalRec+SGM, Loop+SGM, and DSR+SGM. The estimated disparity maps are shown in Fig. 6. Note that, to analyze the quality of the rectification algorithms qualitatively, we did not apply any refinement technique to the results of SGM. As can be observed, DSR+SGM has fewer mis-calculated pixels than Loop+SGM and CalRec+SGM. Note also that, because image warping in CalRec and Loop introduces geometric distortion, the pixels at image boundaries are erroneously estimated. To test the ability to compensate for calibration errors arising after fabrication (also mentioned in Section 4.2), we also show the disparity maps of CalRec+DSR+SGM in Fig. 6. Clearly, the noise and the inaccurate regions in the disparity map are greatly reduced.
In Fig. 7, we compare DSR+SGM with the method proposed by Ha et al. for uncalibrated small motion clips, on their collected dataset. To run DSR, we select only two frames with relatively high alignment accuracy. Compared with Ha et al., DSR+SGM generates more stable and accurate disparity maps. A quantitative assessment of the estimated disparities is not performed, since ground-truth disparities are not available.
5 Discussion and Conclusion
We present a self-rectification approach called DSR for uncalibrated dual-lens smartphone cameras. The proposed DSR achieves high accuracy in terms of the proportion of aligned points (PAP) and zero geometric distortion of the master image in terms of normalized vertex distance (NVD). Its effectiveness is further validated by applying stereo matching and depth-of-field rendering to the rectified image pairs. As DSR is designed for dual-lens cameras that satisfy the small-drift assumption, it is not recommended for rectifying stereo image pairs with large translation. Fortunately, almost all types of dual-lens smartphones can benefit from the proposed algorithm. Even when a dual-lens camera combines wide and tele lenses or color and gray sensors, DSR can be employed to rectify the stereo images as long as sufficient keypoints can be matched. DSR may fail when too few correct keypoints are detected, for example on a stereo image pair of an entirely textureless white wall. The proposed DSR can be applied as a pre-processing step to stereo matching for a wide range of applications, such as depth-of-field rendering, 3D segmentation and portrait relighting. It can also be applied to generate training samples for unsupervised or semi-supervised stereo matching networks.
-  Unity. https://unity3d.com. Accessed: 2017-11-14.
-  H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features, pages 404–417. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
-  G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
-  M. Calonder, V. Lepetit, C. Strecha, and P. Fua. Brief: Binary robust independent elementary features. Computer Vision–ECCV 2010, pages 778–792, 2010.
-  A. Fusiello and L. Irsara. Quasi-euclidean epipolar rectification of uncalibrated images. Machine Vision and Applications, 22(4):663–670, 2011.
-  A. Fusiello, E. Trucco, and A. Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12(1):16–22, 2000.
-  J. Gluckman and S. K. Nayar. Rectifying transformations that minimize resampling effects. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
-  H. Ha, S. Im, J. Park, H.-G. Jeon, and I. So Kweon. High-quality depth from uncalibrated small motion clip. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5413–5421, 2016.
-  R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
-  H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341, 2008.
-  F. Isgro and E. Trucco. Projective rectification without epipolar geometry. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., volume 1, pages 94–99. IEEE, 1999.
-  M. Kraus and M. Strengert. Depth-of-field rendering by pyramidal image processing. In Computer Graphics Forum, volume 26, pages 645–654, 2007.
-  C. Loop and Z. Zhang. Computing rectifying homographies for stereo vision. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., volume 1, pages 125–131. IEEE, 1999.
-  D. Min, S. Choi, J. Lu, B. Ham, K. Sohn, and M. N. Do. Fast global image smoothing based on weighted least squares. IEEE Transactions on Image Processing, 23(12):5638–5653, 2014.
-  P. C. Ng and S. Henikoff. Sift: predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13):3812–3814, 2003.
-  J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCV Workshop on Geometry Meets Deep Learning, Oct 2017.
-  D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47(1-3):7–42, 2002.
-  S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4353–4361, 2015.
-  Z. Zhang. Determining the epipolar geometry and its uncertainty: A review. International journal of computer vision, 27(2):161–195, 1998.
-  F. Zilly, M. Müller, P. Eisert, and P. Kauff. Joint estimation of epipolar geometry and rectification parameters using point correspondences for stereoscopic tv sequences. In Proceedings of 3DPVT, 2010.