1 Introduction
Over the recent decades, 3D reconstruction has become one of the key technologies in many promising fields such as robot vision and navigation, computer-graphics-based 3D games, video animation, Internet virtual roaming, e-commerce, digital libraries, visual communication, and virtual reality. In general, a 3D model is built from a group of images captured at different angles and scales. One challenging task therefore lies in estimating the spatial relationship between different images by matching feature points, so as to achieve seamless reconstruction results. Much existing research has made outstanding contributions to this field, e.g., SIFT [1], SURF [2] and PCA-SIFT [3]. However, these methods utilize only a small percentage of the information in the images, which is inefficient. Moreover, even though conventional methods can mathematically match two images by dot-pairs, this pairing relationship is rough and error-prone.
To overcome these problems, this paper presents a novel algorithm, named SeFM, for matching feature points between different images. Feature points are first obtained from the images using the conventional SIFT and SURF algorithms. Instead of searching for feature points randomly and blindly, we search along the intersections between the epipolar lines of the cameras and the edges of the target object. Thus, a pair of sequences of feature points is derived from the two images; we define this as the rough matching phase. Because the paired images may differ in scale, a linear interpolation algorithm is then applied to find more matching feature points; we define this as the dense matching phase. Since noise and interpolation introduce a quantity of wrong matches, the maximum and minimum intensity of each point are computed to retain a high level of reliability.
The main contribution of SeFM is its capability of generating the 3D model of a specific target from a fairly small set of images, whereas conventional algorithms require a large number of images from various angles. SeFM increases the amount of usable information, raising the number of pixels exploited from each image by three orders of magnitude.
The rest of this paper is organized as follows: Section 2 reviews the related literature; Section 3 briefly introduces the concepts and theorems of epipolar geometry in stereo vision as preliminaries; Section 4 presents the principle of point matching in detail; Section 5 demonstrates the experimental results; and Section 6 concludes the paper.
2 Related Research
The 3D reconstruction problem, framed by Marr's visual theory, has given rise to a variety of theoretical methods. For example, according to the number of cameras, approaches can be classified into monocular, binocular, trinocular, or multi-view vision [4, 5, 6]. Based on the underlying principle, they can be divided into region-based and feature-based visual methods, model-based and rule-based visual methods, etc. [7]. According to the way data are obtained, they can be divided into active and passive visual methods [6, 7, 8].
The reconstruction problem can broadly be divided into two categories: volumetric approaches and surface-based approaches. Volumetric approaches are usually used for reconstructing medical data, while surface-based approaches are widely used for object reconstruction. One classic application is building 3D city models from millions of images [9]. Another typical method is the structure-from-motion (SfM) algorithm [10] and its variations. These all generate the 3D point cloud by matching feature points, calculating the cameras' fundamental and essential matrices, and performing bundle adjustment.
One key step in 3D reconstruction is finding sufficient matching feature points, and there is a large body of research on this topic [11, 12]. At present, two conventional families of methods are designed around the characteristics of corners and regions respectively. The earliest corner detection algorithm was Moravec's corner descriptor [13]. Another well-known approach was the Harris corner detector [14], which is simple and accurate but sensitive to scale variation. This problem was overcome by Schmid and Mohr [15], who used the Harris corner detector to identify interest points and then created a local image descriptor for them. The other category is region detection algorithms. The most typical and widely used algorithms, such as SIFT [1, 16] and SURF [2, 17], fall into this category. SIFT detectors are robust to image scale, illumination, rotation and noise, and SIFT has many optimized variations that improve its efficiency, including PCA-SIFT [3], Affine-SIFT (ASIFT) [18], etc. Other methods aiming at improving effectiveness and efficiency employ semantic scene segmentation [19] or local context [20].
All the above-mentioned methods can be used to reconstruct the 3D model of a target object. However, to the best of our knowledge, the percentage of information they utilize is low, which forces the reconstruction procedure to require a large number of images. This paper proposes a novel feature extraction algorithm with sequence matching that can achieve a satisfactory reconstruction result with a small dataset.
3 Review of Epipolar Geometry
The premise of SeFM is to use epipolar geometry to identify the spatial relationship between two photos taken from different angles and positions. Epipolar geometry is the geometry of stereo vision between two views. It depends on the cameras' fundamental matrix, essential matrix, and relative pose, and is independent of the scene structure [21].
As shown in Figure 1, a point p captured by two cameras at different locations may appear at different positions in each image, where X_i and x_i denote the homogeneous and inhomogeneous coordinates in camera i's view respectively. Note that the epipole is the intersection of the line between the two camera centers with each image plane, denoted as e_i and e_j respectively. The location of e_j can be calculated from that of e_i as
(1) 
where the function converts a homogeneous coordinate into an inhomogeneous one, t is the translation vector, and R is the rotation matrix from camera i to camera j. Note that f relates to the internal reference (intrinsics) of the camera.
In order to describe the location relationship between cameras in world coordinates, the fundamental matrix F (a 3×3 matrix) should be used; it contains the information of translation, rotation, and the intrinsics of both cameras. For calibrated views in epipolar geometry, Formula (2)
x_j^T F x_i = 0 (2)
can be extended to Formula (3) from points to lines:
l_j = F x_i (3)
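Formula (3) can be exercised numerically. The following sketch is illustrative only: the fundamental matrix shown is that of a hypothetical rectified stereo pair (pure horizontal translation), not one estimated from real images.

```python
import numpy as np

# Fundamental matrix of a rectified pair with pure horizontal
# translation (an illustrative assumption, not an estimated matrix).
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])

x_i = np.array([3.0, 2.0, 1.0])   # homogeneous point in view i
l_j = F @ x_i                      # epipolar line (a, b, c): ax + by + c = 0

# Any candidate match x_j must satisfy x_j^T F x_i = 0, i.e. lie on l_j.
x_j_on = np.array([7.0, 2.0, 1.0])    # same image row -> on the line
x_j_off = np.array([7.0, 5.0, 1.0])   # different row -> off the line
print(x_j_on @ l_j, x_j_off @ l_j)    # on-line value is 0
```

This makes concrete why a point in one view only constrains its match to a line, not a point, in the other view.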
As Figure 1 shows, Formulas (2) and (3) reveal that a given location x_i of point p implies that x_j must lie on a certain line l_j in the corresponding view, and vice versa. However, even when x_i is known, the precise coordinate x_j is difficult to ascertain by geometric computation alone, although the line l_i and its corresponding line l_j in the other view can be determined. Since F has seven degrees of freedom, a sufficient condition for a unique solution is obtaining at least eight matching dot-pairs between the images [22]. When some information about the scene is available, this can even be reduced to five matching dot-pairs [23].
4 SeFM Algorithm
For simplicity, all input images in the algorithm are taken by calibrated cameras. The basic idea of SeFM is to find the maximum number of matching dot-pairs between different images; the most challenging part is the scale-variance problem. Figure 2 illustrates the whole procedure of SeFM. The SURF algorithm, followed by the random sample consensus algorithm (RANSAC for short), is first applied to find a few matching feature points. We then calculate the fundamental matrix and generate the sequences of points. Afterwards, the rough matching phase and the dense matching phase are processed using dynamic programming, which yields the matching points between the two images. Finally, all the matching points are obtained after removing invalid matches. Each step is detailed as follows.
4.1 Sequence Generation
SIFT, as a point descriptor, possesses the scale-invariant feature [1]. SURF, based on SIFT but faster, is another approach to searching for dot-pairs [2]. Using a k-d tree, the processing time of SURF for feature point matching can be reduced to about one second. As stated in Section 3, it is possible to calculate the fundamental matrix since sufficient dot-pairs can easily be obtained using SURF. Although SURF has higher accuracy than most existing matching algorithms, it is still necessary to adopt RANSAC to get rid of wrong matches.
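As a sanity check on this step, the fundamental matrix can be recovered from eight or more dot-pairs with the classic eight-point algorithm [22]. The sketch below uses synthetic, noise-free correspondences from a hypothetical rectified pair rather than SURF output, and omits the RANSAC loop, so it only illustrates the linear estimation step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth fundamental matrix of a rectified pair (illustrative).
F_true = np.array([[0.0, 0.0, 0.0],
                   [0.0, 0.0, -1.0],
                   [0.0, 1.0, 0.0]])

# Synthesize matching dot-pairs satisfying x_j^T F x_i = 0; for a
# rectified pair this simply means equal rows in both views.
n = 20
xs = rng.uniform(0, 100, n)
ys = rng.uniform(0, 100, n)
pts_i = np.stack([xs, ys, np.ones(n)], axis=1)
pts_j = np.stack([xs + rng.uniform(-5, 5, n), ys, np.ones(n)], axis=1)

# Each pair gives one linear constraint on the nine entries of F;
# the solution is the null vector of the stacked constraint matrix A.
A = np.stack([np.kron(pj, pi) for pi, pj in zip(pts_i, pts_j)])
_, _, Vt = np.linalg.svd(A)
F_est = Vt[-1].reshape(3, 3)

# Enforce rank 2 (a fundamental matrix is singular).
U, S, Vt = np.linalg.svd(F_est)
S[2] = 0.0
F_est = U @ np.diag(S) @ Vt

# Residuals x_j^T F x_i should be ~0 for all pairs.
residuals = [abs(pj @ F_est @ pi) for pi, pj in zip(pts_i, pts_j)]
```

In the real pipeline the correspondences come from SURF and are filtered by RANSAC before this estimation.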
As shown in Formula (3), the corresponding relationship between the epipolar lines of the two images can be obtained. Note that every epipolar line must pass through the epipole. To obtain the maximum number of epipolar lines, a traversal scan centered at the epipole is applied to obtain every pair of line sequences (as shown in Figure 4).
Suppose there exist two images I_i and I_j; the epipolar lines can be represented with sets of points as
(4) 
where l_i is the set of epipolar lines of I_i, and l_j is the corresponding set of lines satisfying Formula (3); the p_k (k = 1, 2, ...) are consecutive sequential points on the line l_i.
4.2 Dynamic Programming
Since the points along an epipolar line are sequential, searching for the optimal point matches between two epipolar lines from different cameras can be treated as a longest common subsequence (LCS for short) problem. Dynamic programming (DP for short), a basic method in the algorithms field, is suitable for solving such problems by establishing a recurrence [24]. Once the recursive expression is obtained, the optimal solution is available by tracing back (shown in Figure 4). For push-broom multiple views, semi-global matching [25] performs well; in other circumstances, however, the relationship between pixels can only be confirmed through the epipolar line. SeFM therefore employs the naive dynamic programming algorithm to find the edge of the target object.
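As a concrete illustration, the LCS recurrence relied on here can be sketched in a few lines; the toy below operates on character strings rather than sequences of point descriptors:

```python
def lcs_length(a, b):
    """Longest common subsequence length via dynamic programming.

    A toy stand-in for the sequence alignment SeFM performs along
    paired epipolar lines; real inputs would be point descriptors,
    not characters.
    """
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(lcs_length("epipolar", "bipolar"))  # -> 6 ("ipolar")
```

The optimal alignment itself is recovered by tracing back through the dp table, exactly as described above.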
4.3 The Rough Matching Phase
It is tricky to match points directly along an epipolar line when the camera locations are distant. Although the scales of the images vary from view to view, the edge of the target object is distinct and easy to find because the intensity changes significantly there. Therefore, we first find the edge of the target object along a certain epipolar line using the method stated in Section 4.2 and treat the two intersections as the endpoints of a section. We define this process as the rough matching phase. Define the change of intensity of a certain point as
(5) 
where a_k is the intensity of the point p_k. Combining with Formula (4), it can be expressed as
(6) 
Figure 5 shows part of a line graph describing the intensities of points on l_i (in blue) and l_j (in green). Positions marked with red lines are points whose intensity changes are larger than the threshold; these are considered key points of the target object.
Afterwards, the rough-matching point sets
(7) 
are obtained, where P_i and P_j are the key points of l_i and l_j found in rough matching. Note that RM_i and RM_j are subsets of l_i and l_j.
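A minimal sketch of this key-point search along one epipolar line follows. Since the exact form of Formula (5) is not reproduced in this copy, the sketch assumes a simple forward intensity difference compared against a threshold:

```python
def rough_key_points(intensities, threshold):
    """Flag points on an epipolar line whose intensity change exceeds
    a threshold, as in the rough matching phase.

    Assumes the intensity change of Formula (5) is the forward
    difference |a_{k+1} - a_k| (an illustrative simplification).
    """
    return [k for k in range(len(intensities) - 1)
            if abs(intensities[k + 1] - intensities[k]) > threshold]

# A flat background with a bright object between indices 3 and 5:
line = [10, 11, 10, 200, 205, 201, 12, 10]
print(rough_key_points(line, threshold=50))  # -> [2, 5]
```

The two flagged indices bracket the object, playing the role of the section endpoints described above.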
To match the points in RM, appropriate descriptors are extremely important. In this process, not only the intensity of the key point itself, but also that of the surrounding points on the epipolar line, should be considered. We recommend using a Fourier expansion of the adjacent points to describe P_i and P_j. The Fourier expansion is
(8) 
where the coefficients a_n and b_n provide a good description of the overall adjacent points. Considering the discreteness of the points and the computational efficiency of the algorithm, a fast Fourier transform (FFT) is applied.
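One way to realize such a descriptor is sketched below, assuming a fixed-size intensity window around each key point and keeping only the lowest-frequency coefficients; both the window size and the coefficient count are illustrative choices, not parameters fixed by the paper:

```python
import numpy as np

def fft_descriptor(intensities, center, half_width=4, n_coeffs=3):
    """Describe a key point by the low-frequency Fourier coefficients
    of the intensities in a window around it (illustrative parameters)."""
    window = np.asarray(
        intensities[center - half_width:center + half_width + 1],
        dtype=float)
    spectrum = np.fft.rfft(window)   # FFT of the real-valued window
    return spectrum[:n_coeffs]       # keep only low frequencies

# Two key points with similar surroundings yield similar descriptors.
line_i = [10, 12, 11, 90, 200, 95, 12, 11, 10, 10]
line_j = [11, 11, 12, 92, 198, 93, 13, 10, 11, 12]
d_i = fft_descriptor(line_i, center=4)
d_j = fft_descriptor(line_j, center=4)
print(np.abs(d_i - d_j))  # small entry-wise difference
```

Comparing descriptors then reduces to a distance between short coefficient vectors, which is cheap to evaluate for every candidate pair.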
4.4 Dense Matching
Once the rough matching is completed, adjusting the scale of the original sequences becomes practical. To establish the corresponding relationship between l_i and l_j, the original sequences can be tailored into several subsequences. For example, in Figure 6, Sequences 1 and 2 come from l_i and l_j, and the rough-matched key points divide them into corresponding subsections. To normalize the scale, a linear interpolation algorithm is used to equalize the lengths of Sequences 1 and 2.
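The length-equalization step can be sketched with numpy's one-dimensional linear interpolation; the sample values below are made up for illustration:

```python
import numpy as np

def equalize_length(intensities, target_len):
    """Resample a subsequence of intensities to a target length by
    linear interpolation, so that two rough-matched subsequences of
    different scales can be compared point by point."""
    src = np.asarray(intensities, dtype=float)
    old = np.linspace(0.0, 1.0, num=len(src))   # source sample positions
    new = np.linspace(0.0, 1.0, num=target_len) # target sample positions
    return np.interp(new, old, src)

# A short subsequence from view j stretched to the length of its
# counterpart in view i:
seq_j = [0.0, 10.0, 20.0]
print(equalize_length(seq_j, 5))  # -> [ 0.  5. 10. 15. 20.]
```

After this step, the two subsequences have the same number of samples and can be matched index by index.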
In this part, since the matching procedure is straightforward, wrong matches may arise from noise and the interpolation process. With the definitions
(9) 
the maximum and minimum intensity of the points can be defined as
(10) 
where I is the intensity of a point, and x is the point's location in the subsequence after interpolation. In this way, the cost function can be written as
(11) 
which effectively reduces the amount of computation and the influence of noise.
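Since Formulas (9) to (11) are not reproduced in this copy, the following is only one plausible reading of the idea, not the paper's exact cost: an interpolated intensity inside the [minimum, maximum] band of its neighbouring original points costs nothing, and is penalized by its distance to the band otherwise.

```python
def band_cost(i_query, i_min, i_max):
    """Hypothetical band-tolerance cost: zero while the queried
    intensity stays inside the [i_min, i_max] band of the neighbouring
    original points, otherwise the distance to the nearer band edge.
    This tolerates small noise and interpolation error."""
    if i_min <= i_query <= i_max:
        return 0.0
    return min(abs(i_query - i_min), abs(i_query - i_max))

print(band_cost(105, 100, 110))  # inside the band -> 0.0
print(band_cost(130, 100, 110))  # outside -> 20.0
```

Any monotone penalty outside the band would serve the same purpose; the band itself is what absorbs noise and interpolation error.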
4.5 Inconsistency Problem
In reality, the occlusion problem must be taken into consideration. SeFM handles this problem at both the image-wise and the pixel-wise matching level. As shown in Figure 7, occlusion in front of the object causes two redundant key points to be found in View 2 during the rough matching process. In another case, also shown in Figure 7, one key point vanishes in View 2 because of the object's shape.
To solve this problem, the validity of the tailored sequences must be considered. If no other key point exists between two matched key points, the dense sequence between them is valid, which can be concluded as
From the perspective of pixel-wise matching, the depth information, which can be calculated from the dot-pairs together with the rotation and translation between the cameras, may contribute to validating the matches. Given the depth D of each point, the validity S of each point can be determined from
5 Experimental Results
[Figure 8. Rows from top to bottom, for each input index (a)–(e): input image 1; input image 2; matching by SIFT+SURF; matching by SIFT+SURF (amplified); matching by SeFM.]
The image dataset for our experiments came from various objects, each photographed from two directions by a Mi 5 smartphone with a Sony IMX298 camera sensor. All images were originally at a resolution of 4608×3456 and were resized to 2000×1500 for computational efficiency. Our experiments were conducted on a PC with a 2.7 GHz Intel Core i7-7500U CPU and 16 GB of memory; the Python implementation of point matching takes around 11 minutes.
Number of matches
In Figure 8 we exhibit some typical 3D reconstruction results, without meshing, using only two input images from the dataset. The selected images cover multiple cases, e.g., objects at different scales, with different backgrounds, and with different textures. It is very difficult to reconstruct an object from two images using SIFT and SURF alone, since the number of matching dot-pairs is too small. As observed in Figure 8, the point clouds matched by SIFT and SURF are too sparse to be distinguished, even when the figures in Row 3 are enlarged in Row 4. On the contrary, SeFM reconstructs the target object with a fairly good shape. SeFM outperforms the SIFT and SURF matching algorithms in the number of matching points, as also listed in Table 1. The reason is that SeFM obtains such dense points by exploiting the epipolar relationship between the images during matching.
Table 1: Matching results of SeFM and SIFT+SURF on each input image pair.

index   matches from SeFM   matches from SIFT+SURF   precision   recall
a       2,023,399           186                      0.965063    0.8567428
b       249,782             13                       0.997638    0.5983016
c       1,058,315           171                      0.968988    0.7091526
d       725,602             311                      0.826024    0.4216738
e       140,264             114                      0.999537    0.8300612
Precision and Recall
Precision and recall are widely used in algorithm evaluation. To evaluate the accuracy of SeFM, we use precision and recall, defined as follows: # of positives is the number of key points in an input image; given the differing numbers and scales of the two input images, we use the average of both. # of matches records the total number of matching points between the two images. # of correct positives is the number of right matches, where the two points are exactly the same. Thus, precision and recall can be calculated as below:
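These two ratios are trivial to compute; a tiny helper, with made-up counts rather than the numbers from Table 1, illustrates the definitions:

```python
def precision_recall(n_matches, n_correct, n_positives):
    """Precision and recall as defined above: precision is the fraction
    of reported matches that are correct; recall is the fraction of key
    points (averaged over both images) that were correctly matched."""
    precision = n_correct / n_matches
    recall = n_correct / n_positives
    return precision, recall

# Illustrative counts only (not taken from Table 1):
p, r = precision_recall(n_matches=1000, n_correct=950, n_positives=1200)
print(p, r)  # -> 0.95 0.7916666666666666
```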
As presented in Table 1, input images (a) and (e) are simple: input (a) shows two cuboids on a table and input (e) shows a wall. They therefore achieve higher recall. Conversely, input images (b) and (d) have more complex object surfaces, which leads to lower recall. In input image (d), the trees at the upper right of the image are not the main focus of the target, so its precision and recall are lower than those of the other inputs. Overall, SeFM achieves a high precision and a good recall, which shows that most of the matches found by SeFM are correct. Moreover, it finds on the order of 10^5 to 10^6 matching points between two input images (Table 1), around three to four orders of magnitude more than the conventional SIFT+SURF algorithm, although it takes around 500 times longer due to the sequential point matching. As Figure 10 shows, even when the scales of the two input images differ, the point matching result is still dense. SeFM thus performs outstandingly when matching as few as two images.
Wrong Matching Case Analysis
In SeFM, the most probable wrong matches come from key points missed in the rough matching phase and dislocation matches in the dense matching phase. As illustrated in Figure 10, missing key points tend to occur on objects with a large number of similar features, e.g., windows and doors. Such objects are sensitive to the lighting when photographed. In fact, this kind of wrong matching hardly arises when the images are taken at virtually the same time of day. A dislocation match is a critical fault in the process, since it triggers a chain reaction of continual deterioration in the dynamic programming. The cost function of Section 4.4 reduces the probability of this phenomenon.
6 Conclusion
This paper introduced a novel point matching algorithm that utilizes the maximum amount of valid information from images. Based on the spatial relation between the images, it can reconstruct an object well from two or more arbitrary images. Compared with previous point matching approaches, our approach finds a significantly larger number of matching dot-pairs. The results show that SeFM is effective for 3D reconstruction, panoramic imaging, and related fields.
References

[1] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[2] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Proceedings of the ninth European Conference on Computer Vision, 2006.
[3] Yan Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[4] Anthony Lobay and D. A. Forsyth. Shape from texture without boundaries. International Journal of Computer Vision, 67(1), 2006.
[5] Wei Fang and Bingwei He. Automatic view planning for 3D reconstruction and occlusion handling based on the integration of active and passive vision. In International Symposium on Industrial Electronics, 2012.
[6] D. J. Crandall, A. Owens, N. Snavely, and D. P. Huttenlocher. SfM with MRFs: Discrete-continuous optimization for large-scale structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2841–2853, 2013.
[7] L. Ladicky, O. Saurer, S. Jeong, F. Maninchedda, and M. Pollefeys. From point clouds to mesh using regression. In International Conference on Computer Vision, 2017.
[8] H. H. Vu, P. Labatut, J. P. Pons, and R. Keriven. High accuracy and visibility-consistent dense multiview stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):889–901, 2012.
[9] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
[10] J. J. Koenderink and A. J. van Doorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, 1991.
[11] M. Donoser and H. Bischof. 3D segmentation by maximally stable volumes (MSVs). In International Conference on Pattern Recognition, 2006.
[12] Pierre Moreels and Pietro Perona. Evaluation of features detectors and descriptors based on 3D objects. In International Conference on Computer Vision, 2007.
[13] Hans P. Moravec. Rover visual obstacle avoidance. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, 1981.
[14] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, 1988.
[15] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530–535, 1997.
[16] David G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.
[17] H. Bay, A. Ess, T. Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
[18] Jean-Michel Morel and Guoshen Yu. ASIFT: A new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences, 2(2):438–469, 2009.
[19] N. Kobyshev, H. Riemenschneider, and L. V. Gool. Matching features correctly through semantic understanding. In International Conference on 3D Vision, 2014.
[20] Kyle Wilson and Noah Snavely. Network principles for SfM: Disambiguating repeated structures with local context. In IEEE International Conference on Computer Vision, pages 513–520, 2014.
[21] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.
[22] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133–135, 1981.
[23] D. Nister. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.
[24] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-7(2):139–154, 1985.
[25] Heiko Hirschmüller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.