SeFM: A Sequential Feature Point Matching Algorithm for Object 3D Reconstruction

12/07/2018, by Zhihao Fang, et al.

3D reconstruction is a fundamental issue in many applications, and the feature point matching problem is a key step in reconstructing target objects. Conventional algorithms can only find a small number of feature points from two images, which is quite insufficient for reconstruction. To overcome this problem, we propose SeFM, a sequential feature point matching algorithm. We first utilize epipolar geometry to find the epipole of each image. Rotating around the epipole, we generate a set of epipolar lines and reserve those intersecting the input image. Next, a rough matching phase, followed by a dense matching phase, is applied to find the matching dot-pairs using dynamic programming. Furthermore, we remove wrong matching dot-pairs by calculating their validity. Experimental results illustrate that SeFM can achieve around 1,000 to 10,000 times more matching dot-pairs than conventional algorithms, depending on the individual image, and that the object reconstruction from only two images is semantically visible. Moreover, it outperforms conventional algorithms such as SIFT and SURF regarding precision and recall.




1 Introduction

During recent decades, 3D reconstruction has been one of the key technologies in many promising fields such as robot vision navigation, computer-graphics-based 3D games, video animation, Internet virtual roaming, e-commerce, digital libraries, visual communication, and virtual reality. In general, the 3D model is built from a group of images captured at different angles and scales. One challenging task hence lies in estimating the spatial relationship between different images by matching feature points and achieving seamless reconstruction results. Much existing research has made outstanding contributions to this field, e.g., SIFT [1], SURF [2] and PCA-SIFT [3]. However, these methods can only utilize a small percentage of the information in the images, which is inefficient. Moreover, in conventional methods, even though two images can be mathematically matched by dot-pairs, this pairing relationship is rough and error-prone.

In order to overcome the above problems, this paper presents a novel algorithm for feature point matching between different images, named SeFM. The feature points are first obtained from the images using the conventional SIFT and SURF algorithms. Instead of searching for feature points randomly and blindly, we search for them by calculating the intersections between the epipolar lines of the cameras and the edges of the target object. Thus, a pair of sequences of feature points is derived from the two images, which we define as the rough matching phase. Due to the different scales of the paired images, a linear interpolation algorithm is applied to find more matching feature points, and this is defined as the dense matching phase. Since a large number of wrong matchings exist due to noise and interpolation, we compute the maximum and minimum intensity of each point to retain a high level of reliability.

The main contribution of SeFM is its capability of generating the 3D model of a specific target with a fairly small set of images, while conventional algorithms require a large number of images from various angles. SeFM increases the amount of usable information, raising the number of pixels exploited from each image by three orders of magnitude.

The rest of this paper is organized as follows: Section 2 investigates the related literature. Section 3 briefly introduces concepts and theorems of epipolar geometry in stereo vision as preliminary. Section 4 presents the principle of point matching in detail. Section 5 demonstrates the experimental results. Section 6 concludes the whole paper.

2 Related Research

The 3D reconstruction problem, based on Marr's visual theory framework, has given rise to a variety of theoretical methods. For example, according to the number of cameras, methods can be classified into categories such as monocular, binocular, trinocular, or multi-view vision [4, 5, 6]. Based on the underlying principle, they can be divided into region-based, feature-based, model-based, and rule-based visual methods, etc. [7]. According to the way data is obtained, they can be divided into active and passive visual methods [6, 7, 8].

The reconstruction problem can mainly be divided into two categories: volumetric approaches and surface-based approaches. Volumetric approaches are usually used in the reconstruction of medical data, while surface-based approaches are widely used for object reconstruction. One classic application is building 3D city models from millions of images [9]. Another typical method is the structure-from-motion (SfM for short) algorithm [10] and its variations. These methods generate the 3D point cloud by matching feature points, calculating the cameras' fundamental and essential matrices, and performing bundle adjustment.

One key step in 3D reconstruction is to find sufficient matching feature points, and there exists a large body of research on this topic [11, 12]. At present, the two conventional families of methods are designed around corner and region characteristics, respectively. The earliest corner detection algorithm was Moravec's corner descriptor [13]. Another well-known approach was the Harris corner detector [14], which is simple and accurate but sensitive to scale variation. This problem was overcome by Schmid and Mohr [15], who used the Harris corner detector to identify interest points and then created a local image descriptor around them. The other category comprises region detection algorithms. The most typical and widely used algorithms, such as SIFT [1, 16] and SURF [2, 17], fall into this category. The SIFT detector is robust to changes in image scale, illumination, rotation and noise. SIFT has many optimized variants aimed at improving efficiency, including PCA-SIFT [3], Affine-SIFT (ASIFT) [18], etc. Other methods aiming at improving effectiveness and efficiency employ semantic scene segmentation [19] and local context [20].

All the above-mentioned methods can be used to reconstruct the 3D model of a target object. However, to the best of our knowledge, the percentage of information they utilize is low, and this forces the reconstruction procedure to require a large number of images. This paper proposes a novel feature extraction algorithm with sequence matching, which can achieve a satisfactory reconstruction result with a small dataset.

3 Review of Epipolar Geometry

The prerequisite for SeFM is using epipolar geometry to identify the spatial relationship between two photos taken at different angles and positions. Epipolar geometry is the geometry of stereo vision between two views. It depends on the cameras' fundamental matrix, essential matrix and relative pose, and is independent of the scene structure [21].

As shown in Figure 1, a point p photographed by two cameras at different locations may appear at different positions in each image, where X_i and x_i denote the homogeneous and inhomogeneous coordinates of p in camera i's view, respectively. Note that the epipole is the intersection of the line between the two camera centers with each image plane, denoted as e_i and e_j respectively. The location of e_j can be calculated from that of e_i as

    e_j = f(R e_i + t),                                             (1)

where f is a function which converts a homogeneous coordinate into an inhomogeneous one, t is the translation vector, and R is the rotation matrix from camera i to camera j. Note that f depends on the intrinsic parameters of the camera.

Figure 1: Illustration of views by two cameras: (a) shows the location relationship between Camera i and Camera j. (b) shows that the possible points p in the world coordinate determined by x_i correspond to a series of points x_j on line l_j.

In order to describe the location relationship between cameras in the world coordinate, the fundamental matrix F should be used (i.e., a 3×3 matrix), where F contains the information of translation, rotation and the intrinsics of both cameras. For calibrated views in epipolar geometry, Formula (2)

    x_j^T F x_i = 0                                                 (2)

can be extended to Formula (3), from points to lines:

    l_j = F x_i.                                                    (3)

As Figure 1 shows, Formulas (2) and (3) reveal that a given location x_i of point p implies that x_j must lie on a certain line l_j in the corresponding image, and vice versa. However, even if x_i is known, the precise coordinate x_j is difficult to ascertain by geometric computation alone, while an arbitrary line l_i and its corresponding line l_j in the other view can be confirmed. Since F has seven degrees of freedom, a sufficient condition for a unique solution is obtaining at least eight matching dot-pairs between the images [22]. Combining some information about the scene, the requirement can even be reduced to five matching dot-pairs [23].
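As a small numerical illustration of these relations, the sketch below computes the epipoles as the null vectors of F (since F e_i = 0 and F^T e_j = 0) and an epipolar line l_j = F x_i. The matrix values are illustrative only, not from the paper.

```python
import numpy as np

# A hypothetical rank-2 fundamental matrix (any valid F has rank 2);
# the entries are illustrative, not values from the paper.
F = np.array([[ 0.0, -0.1,  0.2],
              [ 0.1,  0.0, -0.3],
              [-0.2,  0.3,  0.0]])

def epipole(F):
    """Epipole e_i in image i: the right null vector of F, since F e_i = 0."""
    _, _, Vt = np.linalg.svd(F)
    e = Vt[-1]
    return e / e[-1]                 # normalize the homogeneous coordinate

def epipolar_line(F, x_i):
    """Formula (3): the epipolar line l_j = F x_i in image j (homogeneous)."""
    return F @ x_i

e_i = epipole(F)                     # epipole of image i
e_j = epipole(F.T)                   # epipole of image j (left null vector)
l_j = epipolar_line(F, np.array([1.0, 2.0, 1.0]))
```

Note that e_j lies on every epipolar line l_j in image j, which is exactly why a traversal scan around the epipole (Section 4.1) enumerates all epipolar lines.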

4 SeFM Algorithm

For simplicity, we assume the input images are all taken by calibrated cameras. The basic idea of SeFM is to find a maximum number of matching dot-pairs between different images, and the most challenging part is solving the scale-variance problem. Figure 2 illustrates the whole procedure of SeFM. The SURF algorithm, followed by the random sample consensus algorithm (RANSAC for short), is first applied to find a few matching feature points. We then calculate the fundamental matrix and generate the sequences of points. Afterwards, the rough matching phase and the dense matching phase are processed using dynamic programming. We hence obtain the matching points between two images. Finally, all the matching points are obtained after removing invalid matches. Each step is detailed as follows.

Figure 2: The whole procedure of using SeFM to search for matching dot-pairs.

4.1 Sequence Generation

SIFT, as a point descriptor, possesses the scale-invariant feature [1]. SURF, based on SIFT but faster, is another approach to search for dot-pairs [2]. Using a k-d tree, the processing time of SURF for feature point matching can be reduced to about one second. As stated in Section 3, it is possible to calculate the fundamental matrix F since sufficient dot-pairs can easily be obtained using SURF. Although SURF has higher accuracy than most existing matching algorithms, it is still necessary to adopt RANSAC to get rid of wrong matchings.

As shown in Formula (3), the corresponding relationship between the epipolar lines of the two images can be obtained. Note that every epipolar line is bound to pass through the epipole. In order to obtain the maximum number of epipolar lines, a traversal scan around the epipole is applied to obtain every pair of sequences (as shown in Figure 3).

Figure 3: The traversal scan of epipolar lines. In this example, the epipole is in the bottom-left corner area. The epipolar line sequences are those lines that intersect with the captured image.
Figure 4: The sequence of line l_i is above the graph and the sequence of line l_j is on its left side, where the numbers refer to intensity. Arrows based on the scores imply the process of tracing back.

Suppose there exist two images I_i and I_j; an epipolar line can be represented as a set of points:

    l_i = {p_1, p_2, ..., p_n},                                     (4)

where l_i is an epipolar line of I_i, and l_j is the corresponding line satisfying Formula (3). The p_k (k = 1, ..., n) are consecutive sequential points on the line l_i.

4.2 Dynamic Programming

As epipolar lines are sequential, searching for the optimal point matches between two epipolar lines from different cameras can be treated as a longest common subsequence (LCS for short) problem. Dynamic programming (DP for short), a basic method in the algorithms field, is suitable for solving such problems by establishing a recurrence [24]. Once the recursive expression is obtained, the optimal solution becomes available by tracing back (shown in Figure 4). For push-broom multiple views, semi-global matching [25] performs well. However, in other circumstances, the relationship between pixels can only be confirmed through the epipolar line. Therefore, SeFM employs the naive dynamic programming algorithm to find the edge of the target object.
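The DP recurrence and its trace-back can be sketched as follows. Here the sequence elements stand in for (quantized) point intensities; the trace-back follows the arrows of Figure 4 and recovers the matched index pairs.

```python
def lcs_match(a, b):
    """Longest common subsequence via the classic DP recurrence, plus the
    trace-back that recovers the matched index pairs (i, j)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Trace back from dp[n][m] to collect the matching dot-pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return dp[n][m], pairs[::-1]

length, pairs = lcs_match([1, 3, 5, 9, 9, 5], [3, 9, 5])
```

The O(n·m) table cost is what makes restricting the search to one epipolar line pair at a time essential.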

4.3 The Rough Matching Phase

It is tricky to match points directly when the camera locations are distant. Though the scales of the images vary from view to view, the edge of the target object is distinct and easy to find because the intensity changes significantly there. Therefore, we first find the edge of the target object using the method stated in Section 4.2 along a certain epipolar line, and treat the two intersections as the end points of a section. We define this process as the rough matching phase. Define the change of intensity at a certain point as

    Δa_k = |a_{p_k} − a_{p_{k−1}}|,                                 (5)

where a_{p_k} is the intensity of the point p_k. Combining with Formula (4), this yields an intensity-change sequence for the whole epipolar line.

Figure 5 is part of a line graph describing the intensity of points on l_i (in blue) and l_j (in green). Positions marked with red lines are points whose Δa is larger than the threshold, which are considered key points of the target object.

Figure 5: Line graph of an epipolar line-pair and the rough-matching key points. The x-axis represents position along an epipolar line and the y-axis represents intensity. The two curves show the intensities of the points in the two sequences. The intensity jumps are the locations of key points in rough matching.
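The thresholding of Formula (5) can be sketched in a few lines; the threshold value here is a tuning choice for illustration, not one fixed by the paper.

```python
import numpy as np

def rough_key_points(intensity, threshold):
    """Return indices k where the intensity jump |a_k - a_{k-1}| exceeds
    the threshold -- candidate object-edge key points along one epipolar
    line, per Formula (5)."""
    delta = np.abs(np.diff(intensity))
    return (np.where(delta > threshold)[0] + 1).tolist()  # +1: later point

# A toy intensity profile with two sharp edges (an "object" on a background).
line = np.array([10, 11, 10, 80, 81, 80, 12, 11], dtype=float)
keys = rough_key_points(line, threshold=30.0)
```

The two returned indices bracket the high-intensity section, mirroring how the two edge intersections bound a section of the epipolar line.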

Afterwards, the rough-matching point sets

    RM_i = {P_{i,1}, ..., P_{i,s}},   RM_j = {P_{j,1}, ..., P_{j,t}}     (6)

are obtained, where the P_{i,k} and P_{j,k} are key points of l_i and l_j found in rough matching. Note that RM_i and RM_j are subsets of l_i and l_j.

To match the points in RM, appropriate descriptors take on extreme importance. In this process, not only the intensity of the key point itself, but also that of the points around it should be considered. We recommend using the Fourier expansion of the adjacent points to describe each key point. The Fourier expansion is

    f(x) = a_0/2 + Σ_{n≥1} [ a_n cos(nωx) + b_n sin(nωx) ],         (7)

where the coefficients a_n and b_n provide a good description of the overall adjacent points. Considering the discreteness of the points and the computational efficiency of the algorithm, a fast Fourier transform (FFT) is applied.
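A minimal sketch of such an FFT-based descriptor follows; the window size and the number of retained coefficients are illustrative choices of ours, not values fixed by the paper.

```python
import numpy as np

def fft_descriptor(intensity, center, half_width=8, n_coeffs=4):
    """Describe a rough-matching key point by the low-frequency Fourier
    coefficients of the intensities in a window around it. Real parts play
    the role of the cosine terms a_n, imaginary parts the sine terms b_n."""
    lo = max(0, center - half_width)
    window = np.asarray(intensity[lo:center + half_width], dtype=float)
    spectrum = np.fft.rfft(window - window.mean())   # drop the DC offset
    coeffs = spectrum[:n_coeffs]
    return np.concatenate([coeffs.real, coeffs.imag])

signal = np.sin(np.linspace(0.0, 6.0, 64))
d = fft_descriptor(signal, 32)
```

Two key points can then be compared by the Euclidean distance between their descriptor vectors, which is robust to small per-pixel noise because only low frequencies are kept.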

4.4 Dense Matching

Once the rough matching is completed, adjusting the scale of the original sequences becomes practical. In order to establish the corresponding relationship between l_i and l_j, each original sequence can be tailored into several subsequences. For example, in Figure 6, Sequences 1 and 2 are taken from l_i and l_j, and each rough-matching key point of Sequence 1 matches its counterpart in Sequence 2, so that the key points bound corresponding subsequences. To normalize the scale, a linear interpolation algorithm is used to equalize the lengths of Sequences 1 and 2.

Figure 6: Illustration of sequence interpolation. Since Sequence 1 is shorter, a linear interpolation is applied to Sequence 1 to make its length equal to that of Sequence 2.
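The length-equalization step can be sketched with numpy's linear interpolation:

```python
import numpy as np

def equalize_length(seq, target_len):
    """Linearly resample `seq` to `target_len` samples, mirroring the
    interpolation used to normalize the scale of two subsequences."""
    old = np.linspace(0.0, 1.0, num=len(seq))
    new = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new, old, seq)

short = np.array([0.0, 10.0, 20.0])      # a shorter subsequence
stretched = equalize_length(short, 5)    # [0, 5, 10, 15, 20]
```

After resampling, the two subsequences can be compared point-by-point.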

In this part, since the matching procedure is straightforward, it may produce wrong matchings due to noise and the interpolation process. Defining ⌈x⌉ and ⌊x⌋ as the integer positions bracketing a location x, the maximum and minimum intensity at a point can be defined as

    I_max(x) = max( I(⌊x⌋), I(⌈x⌉) ),   I_min(x) = min( I(⌊x⌋), I(⌈x⌉) ),

where I is the intensity of a point, and x is the point location in the subsequence after interpolating. In this way, the cost of matching an intensity v against position x can be written as

    C(x, v) = max( 0, v − I_max(x), I_min(x) − v ),

which effectively reduces the amount of computation and the influence of noise.
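A minimal sketch of such a band-based cost (our reading of the I_max / I_min construction; the paper's exact formulas may differ): the cost is zero when the candidate intensity lies inside the band spanned by the two samples bracketing position x, and otherwise the distance to that band.

```python
import math

def band_cost(seq, x, value):
    """Sampling-insensitive matching cost: distance from `value` to the
    intensity band [I_min(x), I_max(x)] of the two samples bracketing x."""
    lo, hi = int(math.floor(x)), int(math.ceil(x))
    i_min = min(seq[lo], seq[hi])
    i_max = max(seq[lo], seq[hi])
    if i_min <= value <= i_max:
        return 0.0
    return min(abs(value - i_min), abs(value - i_max))

seq = [10.0, 20.0, 15.0]   # intensities of an interpolated subsequence
```

Because a value anywhere inside the band costs nothing, small shifts introduced by interpolation do not get penalized.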

4.5 Inconsistency Problem

In reality, the occlusion problem must be taken into consideration. SeFM handles this problem at both the image-wise and pixel-wise matching levels. As shown in Figure 7(a), due to an occluder in front of the object, two redundant key points are found in View 2 during the rough matching process. In another case, shown in Figure 7(b), because of the object's shape, one key point vanishes in View 2.

Figure 7: Occlusion situations. (a) shows the situation where the object is shaded by a shelter, in which View 2 finds two more key points, (3) and (4), than View 1. (b) shows another situation where part of the object is shaded due to its own shape, and only View 1 can see position (2).

To solve this problem, the validity of the tailored sequences must be considered: if there exists no other key point between two matched key points, the dense sequence between them is valid.

From the perspective of pixel-wise matching, the depth information, which can be calculated from the dot-pairs and the matrices R and t, may contribute to the validation of matches. Given the depth D of each point, the validity S of each point can be determined accordingly.

5 Experimental Results

Figure 8: Exhibition of results from part of the datasets, with columns (a)-(e) indexing the five input image pairs. Rows 1 and 2 are the input images; the remaining rows visualize the point matching in 3D reconstruction. Rows 3 and 4 are the direct and amplified results using SIFT+SURF, and Row 5 shows the results using SeFM.

The image dataset of our experiment came from various objects, each of which was photographed from two directions by a Mi 5 smartphone with a Sony IMX298 camera sensor. All images were originally captured at a resolution of 4608×3456, and were resized to 2000×1500 for computational efficiency. Our experiments were conducted on a PC with a 2.7 GHz Intel Core i7-7500U CPU and 16 GB of memory, and our Python implementation of point matching takes around 11 minutes.

Number of matches

In Figure 8, we exhibit some typical 3D reconstruction results (without meshing) using only two input images from the dataset. The selected images cover multiple cases, e.g., objects with different scales, backgrounds, and textures. It is very difficult to reconstruct an object from two images using SIFT and SURF alone, since the number of matching dot-pairs is too small. As observed in Figure 8, the point clouds matched by SIFT and SURF are too sparse to be distinguished, even when we enlarge the figures from Row 3 to Row 4. On the contrary, SeFM can reconstruct the target object with a fairly good shape. SeFM outperforms the SIFT and SURF matching algorithms in the number of matching points, as also listed in Table 1. The reason is that SeFM obtains such dense points by exploiting the epipolar relationship between the images during matching.

index   matches (SeFM)   matches (SIFT+SURF)   precision   recall
a       2,023,399        186                   0.965063    0.8567428
b       249,782          13                    0.997638    0.5983016
c       1,058,315        171                   0.968988    0.7091526
d       725,602          311                   0.826024    0.4216738
e       140,264          114                   0.999537    0.8300612
Table 1: The statistics of matching on SeFM and SIFT+SURF.

Precision and Recall

Precision and recall are widely used in algorithm evaluation. In order to evaluate the accuracy of SeFM, we use precision and recall, defined as follows. "# of positives" stands for the number of key points in the input image; given the difference in numbers and scales between the two input images, we use the average over both. "# of matches" records the total number of matching points between the two images. "# of correct-positives" is the number of right matches, where the two points are exactly the same. Thus, precision and recall can be calculated as below:

    precision = # of correct-positives / # of matches,
    recall    = # of correct-positives / # of positives.
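These definitions translate directly to code; the counts below are illustrative numbers, not values from Table 1.

```python
def precision_recall(n_correct, n_matches, n_positives):
    """Precision and recall as defined above: precision is the fraction of
    reported matches that are correct; recall is the fraction of key points
    (averaged over the two images) that get correctly matched."""
    return n_correct / n_matches, n_correct / n_positives

# Illustrative counts only.
p, r = precision_recall(n_correct=180, n_matches=200, n_positives=240)
```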

As presented in Table 1, the input images (a) and (e) are simple: input (a) is a pair of photos of two cuboids on a table, and input (e) is a pair of images of a wall. Thus, they achieve higher recall. Conversely, the input images (b) and (d) have more complex object surfaces, which leads to lower recall. In input image (d), there are trees at the upper right of the image which are not the main focus of the target, so its precision and recall are lower compared with the other inputs. Consequently, SeFM achieves a high precision and a good recall, which proves that most of the matchings found by SeFM are correct. Moreover, it can find around 1,000 to 10,000 times as many matching points between two input images as the conventional SIFT+SURF algorithm, while its time consumption is around 500 times that of SIFT+SURF due to the sequential point matching. As Figure 9 shows, even when the scales of the two input images differ, the point matching result is still dense. SeFM thus performs outstandingly when matching as few as two images.

Figure 9: Point matching result where only one epipolar line pair is applied.
Figure 10: Examples of two wrong matchings. The left one is the missing-key-point situation and the right one is a dislocation-match.

Wrong Matching Case Analysis

In SeFM, the most probable wrong matches are due to key points missed in the rough matching phase and dislocation-matches in the dense matching phase. As illustrated in Figure 10, missing key points tend to occur on objects with a large number of similar features, e.g., windows and doors. These objects are sensitive to the lighting when photographed. In fact, this kind of wrong matching hardly arises when the images are taken at virtually the same time of day. The dislocation-match is a critical fault in the process, triggering a chain reaction of continual deterioration in the dynamic programming. Fortunately, the validity check reduces the probability of this phenomenon.

6 Conclusion

This paper introduced a novel point matching algorithm which can utilize a maximal amount of valid information from images. Based on the spatial relation of the images, it can reconstruct an object well from two or more arbitrary images. Compared to previous point matching methods, our approach finds a significantly larger number of matching dot-pairs. Experiments show that SeFM is effective for 3D reconstruction, panoramic imaging, and related fields.


  • [1] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • [2] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Proceedings of the ninth European Conference on Computer Vision, 2006.
  • [3] Yan Ke and R. Sukthankar. PCA-SIFT: a more distinctive representation for local image descriptors. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
  • [4] Anthony Lobay and D. A. Forsyth. Shape from texture without boundaries. International Journal of Computer Vision, 67(1), 2006.
  • [5] Wei Fang and Bingwei He. Automatic view planning for 3d reconstruction and occlusion handling based on the integration of active and passive vision. In International Symposium on Industrial Electronics, 2012.
  • [6] D. J. Crandall, A. Owens, N. Snavely, and D. P. Huttenlocher. Sfm with mrfs: Discrete-continuous optimization for large-scale structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2841–2853, 2013.
  • [7] L. Ladicky, O. Saurer, S. Jeong, F. Maninchedda, and M. Pollefeys. From point clouds to mesh using regression. In International Conference on Computer Vision, 2017.
  • [8] H. H. Vu, P. Labatut, J. P. Pons, and R. Keriven. High accuracy and visibility-consistent dense multiview stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):889–901, 2012.
  • [9] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011.
  • [10] J. J. Koenderink and A. J. van Doorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, 1991.
  • [11] M. Donoser and H. Bischof. 3d segmentation by maximally stable volumes (msvs). In International Conference on Pattern Recognition, 2006.
  • [12] Pierre Moreels and Pietro Perona. Evaluation of features detectors and descriptors based on 3d objects. In International Conference on Computer Vision, 2007.
  • [13] Hans P. Moravec. Rover visual obstacle avoidance. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, 1981.
  • [14] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, 1988.
  • [15] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530–535, 1997.
  • [16] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
  • [17] H Bay, A Ess, T Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. Computer Vision and Image Understanding, 110:346–359, 01 2008.
  • [18] Jean-Michel Morel and Guoshen Yu. Asift: A new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences, 2(2):438–469, 2009.
  • [19] N. Kobyshev, H. Riemenschneider, and L. V. Gool. Matching features correctly through semantic understanding. In International Conference on 3D Vision, 2014.
  • [20] Kyle Wilson and Noah Snavely. Network principles for sfm: Disambiguating repeated structures with local context. In IEEE International Conference on Computer Vision, pages 513–520, 2014.
  • [21] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521623049, 2000.
  • [22] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133–135, 1981.
  • [23] D. Nister. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.
  • [24] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-7(2):139–154, 1985.
  • [25] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.