1 Introduction
This paper addresses the problem of estimating the scale parameter of monocular Structure from Motion (SfM) for a multimodal stereo camera system (Fig. 1). There has been growing interest in scene modeling with the development of mobile digital devices. In particular, researchers in the fields of computer vision and robotics have exhaustively investigated scale estimation methods for monocular SfM to benefit from the simplicity of the camera system [5, 14]. There are several ways to estimate the scale parameter: for example, integration with other sensors such as inertial measurement units (IMUs) [19] or navigation satellite systems (NSSs), such as the Global Positioning System (GPS). Also, some methods utilize prior knowledge of the sensor setup [13, 23]. In this paper, the scale parameter of monocular SfM is estimated by integrating the information of different spectral images, such as those taken by RGB and far-infrared (FIR) cameras in a stereo camera setup, whose feature points are difficult to match directly by using descriptors (e.g., SIFT [15], SURF [2], and ORB [22]).
With the development of production techniques for FIR cameras, they have been widely utilized to derive the benefits of thermal information, in the form of infrared radiation emitted by objects, for tasks such as infrastructure inspection [8, 11, 16, 29, 30], pedestrian detection in the dark [3], and monitoring of volcanic activity [27]. Especially for unmanned aerial vehicles (UAVs), a stereo pair of RGB and FIR cameras, which we call a multimodal stereo camera, is often mounted on the UAV for such inspection and monitoring. Although the multimodal stereo camera can capture different spectral images simultaneously, in the case of structural inspection, for example, it is labor-intensive to compare a large number of image pairs. To improve the efficiency of the inspection, SfM [1, 24] and Multi-View Stereo (MVS) [7, 12, 25] can be used for thermal 3D reconstruction (Fig. 1). The estimation of the absolute scale of the monocular SfM is needed in order to project FIR image information onto the 3D model (Fig. 2). However, it is difficult to match feature points between RGB and FIR images directly. Moreover, the number of matching points between FIR images is comparatively small due to the low resolution and the lack of thermal texture in a scene. Although machine learning methods, such as deep neural networks (DNNs)
[6, 9, 31], can be used to match feature points between different types of images, the cost of dataset creation for every camera and scene is quite expensive. To estimate the scale parameter from only the information of the multimodal camera system, we leverage the stereo setup with a constant extrinsic parameter and a small number of feature correspondences between same-modal images other than the visible ones (Fig. 1). More concretely, the proposed method is based on a least-squares method over residuals of the epipolar constraint between same-modal images. The main contribution of this paper is threefold: first, the formulation of the scale estimation for a multimodal stereo camera system; second, the verification of the effectiveness of the formulation through synthetic and real image experiments; and third, experimental thermal 3D mappings as one of the applications of the proposed method.
2 Related work
2.1 Thermal 3D reconstruction
The FIR camera is utilized with other types of sensors for thermal 3D reconstruction because the texture of FIR images is poorer than that of visible ones, especially for indoor scenes. Oreifej et al. [20] developed a fully automatic 3D thermal mapping system for building interiors using light detection and ranging (LiDAR) sensors to directly measure the depth of a scene. Additionally, depth image sensors are utilized to estimate the dense 3D model of a scene based on the Kinect Fusion algorithm [17] in the works of [16, 29].
A combination of SfM and MVS is an alternative method for 3D scene reconstruction. Ham et al. [8] developed a method to directly match feature points between RGB and FIR images, which works only in environments rich in thermal texture. Under similar conditions, the method proposed by Truong et al. [21] performs SfM independently on the RGB images and on the FIR images, and then aligns the two sparse point clouds.
Although the measurement range of a LiDAR sensor is longer than that of a depth image sensor, LiDAR has disadvantages in sensor size and weight, and is more expensive compared to RGB and depth cameras. Additionally, a depth image sensor can directly obtain dense 3D point clouds of a scene; however, it is unsuitable for wide-area measurement tasks because its measurement range is comparatively short. As mentioned, this study assumes thermal 3D reconstruction of wide areas for structural inspection by UAVs as an application. Thus, this paper proposes a scale estimation method of monocular SfM for a multimodal stereo camera with the aim of thermal 3D reconstruction using an RGB–FIR camera system.
2.2 Scale estimation for monocular SfM
There are several types of scale estimation methods for monocular SfM based on other sensors and prior knowledge.
To estimate the absolute scale parameter of monocular SfM, an IMU can be utilized as an internal sensor to integrate acceleration and angular velocity information with vision-based estimation using the extended Kalman filter (EKF) [19]. As an external sensor, location information from NSSs (e.g., GPS) can be used to estimate the similarity transformation between the trajectory of monocular SfM and the GPS trajectory based on a least-squares method. Alternatively, prior knowledge of the sensor setup is utilized for scale estimation. Scaramuzza et al. [23] exploit the nonholonomic constraints of a vehicle on which a camera is mounted. The work by Kitt et al. [13] utilizes ground-plane detection and the height of a camera from the ground.
The objective of this study is to estimate the scale parameter of monocular SfM from multimodal stereo camera images alone, without other sensor information, for versatility. For example, in the case of structural inspection using UAVs, IMUs mounted on the drones suffer from vibration noise, and the GPS signal may not be receivable near or inside the structure. Additionally, assumptions about the sensor setup restrict the applicability of scale estimation. Therefore, the proposed method utilizes only the input image information and pre-calibrated parameters.
As one of the scale estimation methods for a multimodal stereo camera that uses the information only from such a camera system, Truong et al. [21] proposed a method based on an alignment of RGB and FIR point clouds. This method requires a point cloud created only from FIR images. Thus, it is not applicable to scenes with poor thermal texture, such as indoor scenes. Otherwise, considering a multimodal stereo camera as a multi-camera cluster with non-overlapping fields of view, we can in principle apply scale estimation methods of monocular SfM for such a multi-camera cluster to a multimodal stereo camera. The work by Clipp et al. [4] estimates the absolute scale of monocular SfM for a multi-camera cluster with non-overlapping fields of view by minimizing the residual based on the epipolar constraint between two viewpoints. However, this method does not perform batch optimization utilizing multiple image pairs, and does not take the scale parameter into account when performing bundle adjustment (BA) [28].
3 Scale estimation
3.1 Problem formulation
In this section, we describe a novel method of estimating the scale parameter of reconstruction results from monocular SfM. Here we use a stereo system of RGB and FIR cameras (i.e., RGB–FIR) as an example of a multimodal stereo camera system. Fig. 2 illustrates the global and relative transformation matrices of a system composed of two viewpoints with an RGB–FIR camera system.
We start with a given set of RGB images $\{I_i^{\mathrm{rgb}}\}_{i=1}^{N}$ and FIR images $\{I_i^{\mathrm{fir}}\}_{i=1}^{N}$, whose images $I_i^{\mathrm{rgb}}$ and $I_i^{\mathrm{fir}}$ are taken simultaneously using an RGB–FIR camera system whose constant extrinsic parameter is

$T_{\mathrm{rf}} = \begin{bmatrix} R_{\mathrm{rf}} & t_{\mathrm{rf}} \\ \mathbf{0}^\top & 1 \end{bmatrix}.$ (1)

$R_{\mathrm{rf}}$ and $t_{\mathrm{rf}}$ represent the rotation matrix and the translation vector between the two cameras of the camera system, respectively. This matrix and vector are estimated via calibration in advance. Additionally, we assume that the images $I_i^{\mathrm{rgb}}$ and $I_i^{\mathrm{fir}}$ are taken by the cameras $C_i^{\mathrm{rgb}}$ (RGB) and $C_i^{\mathrm{fir}}$ (FIR) with the global extrinsic parameters $T_i^{\mathrm{rgb}}$ and $T_i^{\mathrm{fir}}$, respectively. Note that $C_i^{\mathrm{rgb}}$ and $C_i^{\mathrm{fir}}$ comprise the $i$-th pair of cameras in the RGB–FIR camera system. $T_i^{\mathrm{rgb}}$ can be estimated, except for its absolute scale, by monocular SfM of the RGB images. Using $T_i^{\mathrm{rgb}}$ and $T_j^{\mathrm{rgb}}$, the relative transformation between $C_i^{\mathrm{rgb}}$ and $C_j^{\mathrm{rgb}}$ is computed by $T_{ij} = T_j^{\mathrm{rgb}} (T_i^{\mathrm{rgb}})^{-1}$. To solve the scale ambiguity, a scale parameter $s$ is introduced. Then, the relative transformation between $C_i^{\mathrm{rgb}}$ and $C_j^{\mathrm{rgb}}$ including the scale parameter is expressed by

$T_{ij}(s) = \begin{bmatrix} R_{ij} & s\,t_{ij} \\ \mathbf{0}^\top & 1 \end{bmatrix},$ (2)

where $R_{ij}$ and $t_{ij}$ are the rotation matrix block and the translation vector block of $T_{ij}$, respectively. The goal is to estimate the correct $s$.
3.2 Derivation of scale parameter
With $T_{ij}(s)$ and $T_{\mathrm{rf}}$, the relative transformation between the two FIR cameras, $C_i^{\mathrm{fir}}$ and $C_j^{\mathrm{fir}}$, can be computed as

$T_{ij}^{\mathrm{fir}}(s) = T_{\mathrm{rf}}\, T_{ij}(s)\, T_{\mathrm{rf}}^{-1}$ (3)

$= \begin{bmatrix} R'_{ij} & s\,t'_{ij} + t''_{ij} \\ \mathbf{0}^\top & 1 \end{bmatrix},$ (4)

where $R'_{ij} = R_{\mathrm{rf}}\, R_{ij}\, R_{\mathrm{rf}}^{-1}$, $t'_{ij} = R_{\mathrm{rf}}\, t_{ij}$, and $t''_{ij} = (I_3 - R'_{ij})\, t_{\mathrm{rf}}$. An essential matrix between $C_i^{\mathrm{fir}}$ and $C_j^{\mathrm{fir}}$ can be derived from $T_{ij}^{\mathrm{fir}}(s)$ and expressed as

$E_{ij}(s) = [\,s\,t'_{ij} + t''_{ij}\,]_{\times}\, R'_{ij}$ (5)

$= s\,[t'_{ij}]_{\times}\, R'_{ij} + [t''_{ij}]_{\times}\, R'_{ij}.$ (6)
The epipolar constraint between the two FIR images, $I_i^{\mathrm{fir}}$ and $I_j^{\mathrm{fir}}$, corresponding to the FIR cameras $C_i^{\mathrm{fir}}$ and $C_j^{\mathrm{fir}}$, is formulated as

$\hat{x}_{j,k}^{\top}\, E_{ij}(s)\, \hat{x}_{i,k} = 0,$ (7)

where $\hat{x}_{i,k}$ and $\hat{x}_{j,k}$ are the $k$-th corresponding feature points between $I_i^{\mathrm{fir}}$ and $I_j^{\mathrm{fir}}$, in the form of normalized image coordinates [10]. A normalized image point is defined as

$\hat{x}_{i,k} = K_{\mathrm{fir}}^{-1}\, \tilde{x}_{i,k},$ (8)

where $K_{\mathrm{fir}}$ is the intrinsic parameter matrix of the FIR camera, $\tilde{x}_{i,k} = (u_{i,k}, v_{i,k}, 1)^{\top}$ is the feature point in pixels in $I_i^{\mathrm{fir}}$, and $\tilde{x}_{j,k}$ is the corresponding feature point with $\tilde{x}_{i,k}$ in $I_j^{\mathrm{fir}}$. Additionally, the normalized image point is also defined as

$\hat{x}_{i,k} = \frac{1}{Z_{i,k}} \begin{bmatrix} X_{i,k} \\ Y_{i,k} \\ Z_{i,k} \end{bmatrix},$ (9)

where $(X_{i,k}, Y_{i,k}, Z_{i,k})^{\top}$ is the 3D point in the coordinate system of the FIR camera $C_i^{\mathrm{fir}}$. Here, the 3D point corresponds to the feature point $\hat{x}_{i,k}$ on $I_i^{\mathrm{fir}}$.
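As a concrete illustration, Equation (8) corresponds to the following minimal conversion (a numpy sketch with our own function name):

```python
import numpy as np

def normalize_points(pix, K_fir):
    """Pixel -> normalized image coordinates, Eq. (8):
    x_hat = K_fir^{-1} (u, v, 1)^T for each row (u, v) of pix."""
    pts_h = np.column_stack([pix, np.ones(len(pix))])  # homogeneous pixels
    return pts_h @ np.linalg.inv(K_fir).T              # rows (x, y, 1)
```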
The epipolar constraint of Equation (7) can be expanded to

$s\,a_{k} + b_{k} = 0,$ (10)

with

$a_{k} = \hat{x}_{j,k}^{\top}\, [t'_{ij}]_{\times}\, R'_{ij}\, \hat{x}_{i,k},$ (11)

$b_{k} = \hat{x}_{j,k}^{\top}\, [t''_{ij}]_{\times}\, R'_{ij}\, \hat{x}_{i,k},$ (12)

$[t]_{\times} = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix},$ (13)

where $[t]_{\times}$ denotes the skew-symmetric matrix of a vector $t = (t_x, t_y, t_z)^{\top}$.
If the coordinates of the feature points have no error, Equation (10) is completely satisfied. However, in reality, the equation is not completely satisfied because the coordinates of the feature points usually contain some error and the scale is unknown. In such a case, the scalar residual is defined as

$r_{k}(s) = s\,a_{k} + b_{k}.$ (14)

Likewise, the residual vector can be defined by

$\mathbf{r}(s) = (r_1(s), \ldots, r_K(s))^{\top} = s\,\mathbf{a} + \mathbf{b},$ (15)

where $K$ is the number of corresponding feature points between $I_i^{\mathrm{fir}}$ and $I_j^{\mathrm{fir}}$, $\mathbf{a} = (a_1, \ldots, a_K)^{\top}$, and $\mathbf{b} = (b_1, \ldots, b_K)^{\top}$. Using a least-squares method, the scale parameter can be estimated by

$\hat{s} = \operatorname*{arg\,min}_{s}\, \|\mathbf{r}(s)\|^2.$ (16)

Collectively, the scale estimation problem comes down to determining $s$ such that the error function

$E(s) = \|\mathbf{r}(s)\|^2 = \sum_{k=1}^{K} (s\,a_k + b_k)^2$ (17)

is minimized. Thus, the scale parameter is determined by solving the equation $\mathrm{d}E(s)/\mathrm{d}s = 0$ in terms of $s$. Therefore, the scale is computed by

$\hat{s} = -\frac{\mathbf{a}^{\top}\mathbf{b}}{\mathbf{a}^{\top}\mathbf{a}}.$ (18)
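The closed-form solution of Equation (18) can be condensed into a short numpy sketch for one image pair; the function and variable names are ours, and the correspondences are assumed to be given as arrays of normalized image points:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x of Eq. (13), so [t]_x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def estimate_scale(R_rf, t_rf, R_ij, t_ij, x_i, x_j):
    """Closed-form scale of Eq. (18) for one image pair.

    R_ij, t_ij: relative RGB motion from monocular SfM (t_ij is
    scale-ambiguous); x_i, x_j: (K, 3) arrays of corresponding
    normalized FIR image points in views i and j."""
    R_p = R_rf @ R_ij @ R_rf.T                  # R'_ij, Eq. (4)
    t_p = R_rf @ t_ij                           # t'_ij (carries s)
    t_pp = (np.eye(3) - R_p) @ t_rf             # t''_ij (scale-free)
    a = np.einsum('ki,ij,kj->k', x_j, skew(t_p) @ R_p, x_i)   # Eq. (11)
    b = np.einsum('ki,ij,kj->k', x_j, skew(t_pp) @ R_p, x_i)  # Eq. (12)
    return -(a @ b) / (a @ a)                   # Eq. (18)
```

In a batch setting, the residuals of multiple image pairs can simply be concatenated into $\mathbf{a}$ and $\mathbf{b}$ before applying the same formula.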
3.3 Alternative derivation
In Equation (2), the scale parameter $s$ and the relative translation vector $t_{ij}$ between the two RGB cameras, $C_i^{\mathrm{rgb}}$ and $C_j^{\mathrm{rgb}}$, are multiplied. The scale parameter can alternatively be applied to the translation vector $t_{\mathrm{rf}}$ in $T_{\mathrm{rf}}$, in contrast to Equations (1) and (2). This introduction of the scale parameter is reasonable because multiplying $t_{ij}$ by $s$ is geometrically equivalent to multiplying $t_{\mathrm{rf}}$ by $1/s$, since the epipolar constraint is invariant to the overall scale of the translation. Therefore, we can also estimate the scale parameter of monocular SfM, which has scale ambiguity, from

$T_{\mathrm{rf}}(s') = \begin{bmatrix} R_{\mathrm{rf}} & s'\,t_{\mathrm{rf}} \\ \mathbf{0}^\top & 1 \end{bmatrix}, \quad s' = \frac{1}{s}.$ (19)

When using Equation (19) for scale estimation, the $R'_{ij}$, $t'_{ij}$, and $t''_{ij}$ in Equation (4) are

$R'_{ij} = R_{\mathrm{rf}}\, R_{ij}\, R_{\mathrm{rf}}^{-1}, \quad t'_{ij} = (I_3 - R'_{ij})\, t_{\mathrm{rf}}, \quad t''_{ij} = R_{\mathrm{rf}}\, t_{ij},$ (20)

and the parameter estimated in place of $s$ is $s'$. The rest of the derivation procedure remains the same; the scale of the SfM reconstruction is then recovered as $s = 1/\hat{s}'$.
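This alternative derivation admits the same closed-form treatment; a self-contained numpy sketch (names are ours) that estimates $s' = 1/s$ internally and returns $s$:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x, so [t]_x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def estimate_scale_alt(R_rf, t_rf, R_ij, t_ij, x_i, x_j):
    """Algorithm (2): the unknown multiplies the extrinsic baseline,
    Eq. (19)-(20); the estimated parameter is s' = 1/s, and the SfM
    scale s = 1/s' is returned."""
    R_p = R_rf @ R_ij @ R_rf.T                 # R'_ij (unchanged)
    t_p = (np.eye(3) - R_p) @ t_rf             # t'_ij of Eq. (20), carries s'
    t_pp = R_rf @ t_ij                         # t''_ij of Eq. (20), scale-free
    a = np.einsum('ki,ij,kj->k', x_j, skew(t_p) @ R_p, x_i)
    b = np.einsum('ki,ij,kj->k', x_j, skew(t_pp) @ R_p, x_i)
    s_prime = -(a @ b) / (a @ a)               # Eq. (18) with swapped terms
    return 1.0 / s_prime
```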
3.4 Scaleoriented bundle adjustment
After an initial estimation of the scale parameter by Equation (18) of Algorithm (1) or Algorithm (2), we perform bundle adjustment (BA) [28]. Before the scale estimation, the camera poses of the RGB cameras are precisely estimated via monocular SfM, except for their absolute scale. Thus, our BA optimizes the scale parameter rather than the translation vectors of the RGB cameras.

Using the scale parameter $s$, the reprojection error of the FIR 3D point $X_k$ (in the world coordinate system) in the FIR image $I_i^{\mathrm{fir}}$ is defined as

$e_{i,k}(s, X_k) = \tilde{x}_{i,k} - \pi(s, T_i^{\mathrm{rgb}}, X_k),$ (21)

where $\tilde{x}_{i,k}$ represents the feature point in the FIR image $I_i^{\mathrm{fir}}$ and corresponds to $X_k$. The projection function $\pi(\cdot)$ for the FIR camera is

$\pi(s, T_i^{\mathrm{rgb}}, X_k) = \frac{1}{Z^{\mathrm{fir}}_{i,k}}\, K_{\mathrm{fir}}\, X^{\mathrm{fir}}_{i,k},$ (22)

where the point $X^{\mathrm{fir}}_{i,k}$, with depth $Z^{\mathrm{fir}}_{i,k}$, is the 3D point $X_k$ expressed in the coordinate system of the FIR camera $C_i^{\mathrm{fir}}$ and is computed by

$\begin{bmatrix} X^{\mathrm{fir}}_{i,k} \\ 1 \end{bmatrix} = T_{\mathrm{rf}}\, T_i^{\mathrm{rgb}}(s) \begin{bmatrix} X_k \\ 1 \end{bmatrix},$ (23)

$T_i^{\mathrm{rgb}}(s) = \begin{bmatrix} R_i & s\,t_i \\ \mathbf{0}^\top & 1 \end{bmatrix},$ (24)

where $R_i$ and $t_i$ are the rotation and translation blocks of $T_i^{\mathrm{rgb}}$. The cost function composed of the reprojection errors is defined by

$E_{\mathrm{BA}}(s, \{X_k\}) = \sum_{i}\sum_{k} \rho\!\left(\frac{\|e_{i,k}(s, X_k)\|^2}{\sigma^2}\right),$ (25)

where $\rho(\cdot)$ is the Huber loss function and $\sigma$ is the standard deviation of the reprojection errors. The optimized scale parameter $\hat{s}$ is estimated as follows:

$\hat{s}, \{\hat{X}_k\} = \operatorname*{arg\,min}_{s,\,\{X_k\}}\, E_{\mathrm{BA}}(s, \{X_k\}).$ (26)

Equation (26) is a non-convex optimization problem. Thus, it should be solved using iterative methods such as the Levenberg–Marquardt algorithm, for which an initial value is acquired by Equation (18) of Algorithm (1) or (2). See the details of the derivation above in Section 1 of the supplementary material.
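A hedged sketch of this refinement step follows; as our own simplification, only the scalar $s$ is optimized while the triangulated FIR points are held fixed, and a one-dimensional golden-section search stands in for Levenberg–Marquardt:

```python
import numpy as np

def project_fir(s, R_i, t_i, R_rf, t_rf, K_fir, X):
    """Project FIR 3D points X (N, 3, world frame) into FIR view i,
    following Eq. (21)-(24): world -> RGB camera frame with the scaled
    translation s * t_i, then -> FIR frame via the extrinsic, then
    pinhole projection with K_fir."""
    Xc = X @ R_i.T + s * t_i             # RGB camera frame, T_i(s)
    Xf = Xc @ R_rf.T + t_rf              # FIR camera frame, T_rf
    uv = (Xf / Xf[:, 2:3]) @ K_fir.T     # homogeneous pixel coordinates
    return uv[:, :2]

def huber(r2, delta=2.0):
    """Huber loss applied to squared residual norms r2 (pixels^2)."""
    r = np.sqrt(r2)
    return np.where(r <= delta, r2, 2.0 * delta * r - delta ** 2)

def refine_scale(s0, obs, K_fir, R_rf, t_rf, span=0.5, iters=60):
    """Refine the scale by golden-section search on the robust
    reprojection cost of Eq. (25)-(26). obs is a list of tuples
    (R_i, t_i, X, uv), one per FIR image, with observed pixels uv."""
    def cost(s):
        total = 0.0
        for R_i, t_i, X, uv in obs:
            e = project_fir(s, R_i, t_i, R_rf, t_rf, K_fir, X) - uv
            total += huber(np.sum(e ** 2, axis=1)).sum()
        return total
    lo, hi = s0 * (1.0 - span), s0 * (1.0 + span)  # bracket around s0
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        m1 = hi - phi * (hi - lo)
        m2 = lo + phi * (hi - lo)
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```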
4 Synthetic image experiments
In Section 3, we described two approaches to resolving scale ambiguity, which differ in the placement of the scale parameter $s$. In this section, we investigate, via simulation, the effect of noise added to feature points on scale estimation accuracy when varying the baseline length between the two cameras of the multimodal stereo camera system.
The scale parameter is estimated in the synthetic environment with noise for both Algorithms (1) and (2). Preliminary experiments in the synthetic environment show that the scale parameter can be estimated correctly by the proposed method when no noise is added to the feature points. See the details of the noise-free settings in Section 2 of the supplementary material.
4.1 Experimental settings
The procedure for the synthetic image experiments is as follows:

Scatter $N_{\mathrm{p}}$ 3D points $X_k$ randomly in a cubic space with a side length of $L$.

Arrange $N_{\mathrm{v}}$ RGB–FIR camera systems in the 3D space randomly. More concretely, a constant relative transformation $T_{\mathrm{rf}}$ of an RGB–FIR camera system is given, and the absolute camera poses $T_i^{\mathrm{rgb}}$ of the RGB cameras are set randomly. Then, the absolute camera poses of the FIR cameras are computed by $T_i^{\mathrm{fir}} = T_{\mathrm{rf}}\, T_i^{\mathrm{rgb}}$.

For all $i$, reproject the 3D points $X_k$ ($k = 1, \ldots, N_{\mathrm{p}}$) to the FIR camera $C_i^{\mathrm{fir}}$ using $T_i^{\mathrm{fir}}$. Then, determine the normalized image points using Equation (9). Gaussian noise with a standard deviation $\sigma$ can be added to all of the reprojected points.
In this paper, the values of $N_{\mathrm{p}}$, $N_{\mathrm{v}}$, and $L$ are fixed across all trials. In addition, the relative pose between the two cameras of the camera system is set as

$R_{\mathrm{rf}} = I_3, \quad t_{\mathrm{rf}} = (d,\, 0,\, 0)^{\top},$ (27)

where $d$ is the distance between the two cameras of the RGB–FIR camera system. $d$ and $\sigma$ are set depending on the simulation.
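The steps above can be sketched as follows; the numeric defaults (point count, number of views, cube size, rotation range) are illustrative placeholders, not the settings used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(7)

def random_rotation(max_angle=0.3):
    """Random rotation via Rodrigues' formula, with a random axis and
    an angle drawn from [-max_angle, max_angle]."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    a = rng.uniform(-max_angle, max_angle)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(a) * K + (1.0 - np.cos(a)) * (K @ K)

def make_synthetic(n_pts=200, n_views=10, side=10.0, d=0.3, sigma=0.0):
    """Generate the synthetic setup of Section 4.1: 3D points in a cube,
    random RGB poses, FIR poses via the constant extrinsic of Eq. (27),
    and (optionally noisy) normalized FIR points, Eq. (9)."""
    X = rng.uniform(-side / 2, side / 2, size=(n_pts, 3))
    R_rf, t_rf = np.eye(3), np.array([d, 0.0, 0.0])   # Eq. (27)
    views = []
    for _ in range(n_views):
        R = random_rotation()
        # Place each camera so the whole cube stays in front of it.
        t = rng.uniform(-1, 1, size=3) + np.array([0.0, 0.0, side])
        Xf = (X @ R.T + t) @ R_rf.T + t_rf   # world -> RGB -> FIR frame
        x = Xf / Xf[:, 2:3]                  # normalized points, Eq. (9)
        x[:, :2] += rng.normal(0.0, sigma, size=(n_pts, 2))
        views.append((R, t, x))
    return X, (R_rf, t_rf), views
```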
4.2 Effects of feature point detection error
We consider the effect of noise added to feature points on scale estimation accuracy when varying the baseline length of the stereo camera system. Fixing the noise standard deviation $\sigma$, we estimate the scale parameter 100 times and compute a mean and a standard deviation (SD) of the estimated scale for Algorithm (1) and for Algorithm (2), with respect to each of various baseline lengths $d$ between the two cameras of the camera system. Fig. 3 shows the relationship between $d$ and the means and SDs of the estimated scales for both Algorithms (1) and (2).
In Fig. 3, the scale parameters are stably estimated in the region where $d$ is relatively large, because the means stay at the ground-truth value and the SDs converge to small values. On the contrary, in the region where $d$ is relatively small, the SD increases as $d$ decreases, although the means maintain the correct value. Meanwhile, the means of the scale parameters estimated by Algorithm (1) are less accurate than those estimated by Algorithm (2) in the region where $d$ is relatively small. In addition, the SDs for Algorithm (1) are larger than those for Algorithm (2).
Hence, it is concluded that the estimated scales obtained by Algorithm (2) are more accurate and stable than those obtained by Algorithm (1). Additionally, the baseline length between the two cameras of a multimodal stereo camera system should be as long as possible for scale estimation.
5 Real image experiments
5.1 Evaluation method
We apply the proposed method in an experimental environment to verify that it is capable of estimating the absolute scale of outputs from monocular SfM using a multimodal stereo camera. For this verification, we need to prepare results of monocular SfM in which the actual distances between the cameras are already known. Therefore, in this experiment, the multimodal stereo camera system is fixed to the stage on the camera mount shown in Fig. 4, and we capture RGB and FIR images while moving the camera system on a grid of fixed intervals. The stage of the camera mount, to which the camera system is fixed, can be moved in both vertical and horizontal directions. Fig. 4 also shows an example of grid-aligned camera poses estimated by SfM, whose images are captured using this camera mount.
Let $d^{\mathrm{act}}_{ij}$ be the actual distance between the two RGB cameras $C_i^{\mathrm{rgb}}$ and $C_j^{\mathrm{rgb}}$, and $d_{ij}$ be the distance between them in the result of the monocular SfM, which has scale ambiguity. The estimated actual distance $\hat{d}_{ij}$ is computed by

$\hat{d}_{ij} = \hat{s}\, d_{ij},$ (28)

where $\hat{s}$ is the estimate of the scale parameter in Equation (2) or Equation (19), respectively. Additionally, the relative error of $\hat{d}_{ij}$ can be defined as

$e_{ij} = \frac{|\hat{d}_{ij} - d^{\mathrm{act}}_{ij}|}{d^{\mathrm{act}}_{ij}}.$ (29)
The RGB–FIR camera system used in our experiment is shown in Fig. 4. The RGB camera in the camera system is a LUMIX DMC–G8 (Panasonic Corp.) or the RGB camera part of a FLIR Duo R (FLIR Systems, Inc.), depending on the experimental setting of the baseline length. The FIR camera is the FIR camera part of the FLIR Duo R.
The procedure for the experiment is as follows:

Capture the RGB and FIR image pairs using the camera system and its mount shown in Fig. 4. Additionally, some supplementary RGB and FIR images are added to stabilize the process of monocular SfM and scale estimation.

Perform a process of monocular SfM using the captured RGB images.

Estimate the scale parameter by Algorithms (1) and (2).

Compute a mean of the relative errors $e_{ij}$ over all the combinations of camera pairs, which is defined as

$\bar{e} = \frac{1}{\binom{M}{2}} \sum_{i < j} e_{ij},$ (30)

where $M$ is the number of RGB images taken in the grid.
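Equations (28)-(30) amount to the following small routine (names are ours; camera centers are assumed to be given as arrays):

```python
import numpy as np
from itertools import combinations

def mean_relative_error(s, C_sfm, C_act):
    """Mean relative distance error over all camera pairs,
    Eq. (28)-(30): SfM distances are scaled by s and compared with
    the actual grid distances. C_sfm, C_act: (M, 3) camera centers."""
    errs = []
    for i, j in combinations(range(len(C_act)), 2):
        d_act = np.linalg.norm(C_act[i] - C_act[j])
        d_est = s * np.linalg.norm(C_sfm[i] - C_sfm[j])  # Eq. (28)
        errs.append(abs(d_est - d_act) / d_act)          # Eq. (29)
    return float(np.mean(errs))                          # Eq. (30)
```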
When detecting and describing feature points, FIR images are converted to grayscale images. The FLIR Duo R outputs FIR images whose pixels contain values of radiation temperature. To convert them to grayscale images, a mean and a standard deviation of the pixels of each image are computed, and then pixel values within a fixed range around the mean, determined by the standard deviation, are linearly mapped to the 8-bit grayscale range.
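A sketch of this conversion, assuming the mapped range is the mean plus or minus $k$ standard deviations with $k = 2$ (the exact bound used in the experiments is an assumption here):

```python
import numpy as np

def fir_to_grayscale(temp, k=2.0):
    """Convert a radiometric FIR image (temperature per pixel) to
    8-bit grayscale: values in [mean - k*sigma, mean + k*sigma] are
    linearly mapped to [0, 255] and clipped. The range factor k is
    an assumed value, not taken from the experiments."""
    mu, sigma = temp.mean(), temp.std()
    lo, hi = mu - k * sigma, mu + k * sigma
    g = (temp - lo) / max(hi - lo, 1e-12) * 255.0
    return np.clip(g, 0.0, 255.0).astype(np.uint8)
```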
To confirm the effect of the difference in baseline lengths between the RGB and FIR cameras, datasets of RGB and FIR images are taken with each of four baseline lengths of the camera system. The systems with the first, second, and third baseline lengths use the LUMIX DMC–G8 as the RGB camera; the system with the fourth baseline length uses the RGB camera equipped on the FLIR Duo R. Considering the randomness of RANSAC, for each of the four baseline lengths, the scale estimation and the computation of $\bar{e}$ are performed 100 times. Then, a mean and a standard deviation of $\bar{e}$ are calculated.
Also, pre-calibration of the RGB–FIR stereo camera system is needed to perform the proposed scale estimation procedure. Thus, we adopt the stereo calibration method in which a planar pattern such as a chessboard is used [32]. See the details in Section 3 of the supplementary material.
5.2 Evaluation with a real scene
The experimental environment used in the evaluation is shown in Fig. 5. The grid pattern along which the camera system is moved has 8 vertical and 10 horizontal grid points. Thus, there are 80 RGB camera poses used for the evaluation. Additionally, 50 supplementary pairs of RGB and FIR images are included to stabilize the process of monocular SfM and scale estimation. Considering the randomness of RANSAC, we show the means and standard deviations of $\bar{e}$ over 100 trials of scale estimation. Fig. 5 shows the results when using Algorithms (1) and (2).
In Fig. 5, the means of $\bar{e}$ before BA decrease as the baseline length becomes larger for both algorithms. Additionally, the mean values for Algorithm (1) are larger than those for Algorithm (2) across the whole range of baseline lengths. These results exhibit the same pattern as the experiments in the synthetic environment in Section 4. Consequently, without BA, the smaller scale estimation error occurs when using the camera system with the longer baseline, as indicated by the simulation in Section 4. In addition, the difference in numerical stability between the two algorithms appears in the experiments with both synthetic and real images.
On the contrary, after BA, the means of $\bar{e}$ approach nearly zero for both algorithms, even though large errors occurred before BA. In particular, at the smallest baseline lengths, the mean of $\bar{e}$ after BA is far smaller than the one before BA, and high accuracy of the scale estimation is achieved under both Algorithms (1) and (2). The SDs also decrease after BA compared to the ones before BA. Summarizing the above, we conclude that the scale parameters estimated by both Algorithms (1) and (2) are suitable as initial values for BA, and that our BA effectively refines the scale parameters with respect to both accuracy and variance.
5.3 Comparison with the existing method
As mentioned in Section 2.2, we compare the proposed scale estimation method with those by Truong et al. [21] and by Clipp et al. [4]. We apply the two methods of [21] and [4] to the RGB–FIR image datasets used in Section 5.2, and then evaluate the estimated scale parameters by calculating $\bar{e}$ accordingly. Fig. 7 shows the comparison of the accuracies of the scale parameters estimated by Algorithm (2) of the proposed method, by [21], and by [4]. The results of the proposed method and of [4] present the means of $\bar{e}$ over 100 trials both before and after BA. For the result by [21], we adopt the value computed by Equation (6) in the paper of [21] as the scale parameter.
As shown in Fig. 7, the values of $\bar{e}$ by [21] and [4] are much larger than the means of $\bar{e}$ by the proposed method throughout the whole range of baseline lengths. As for [21], the low accuracy mainly results from the erroneous 3D points reconstructed via SfM using only the FIR images. On the other hand, unlike our method, the method by [4] cannot deal with the epipolar residuals of multiple FIR image pairs. Thus, before BA, the means of $\bar{e}$ by [21] and [4] are much larger than those by the proposed method. Additionally, the BA in [4] optimizes not a scale parameter but rather rotations and translations. Thus, after BA, coupled with the poor initial estimation by [4], the BA is unstable, as shown in Fig. 7.
5.4 Practical examples
Fig. 8 presents temporal thermal 3D mappings as a practical example of thermal 3D reconstruction. RGB and FIR images are captured by a smartphone-based RGB–FIR camera system, composed of a FLIR One (FLIR Systems, Inc.) and a smartphone, with a fixed baseline between the two cameras.
A 3D mesh model shown in Fig. 8 is reconstructed from the RGB images using monocular SfM and MVS, and is then resized to the absolute scale estimated by the proposed method. The thermal 3D models shown in Fig. 8 are built by reprojecting FIR images onto the 3D mesh model on a sunny day and on a rainy day, respectively. The thermal information is reprojected well, as shown in Fig. 8. In addition, as shown in Fig. 7, we measure the size of road surface markings in the 3D model (estimated) and in the real world (actual) as an evaluation of the estimated scales in practical scenes. The relative errors of the estimated sizes in the scene in Fig. 8 are small. See the additional results in Section 4 of the supplementary material.
6 Conclusion
In this paper, we have presented a novel method of estimating the scale parameter of monocular SfM for a multimodal stereo camera system, which is composed of different spectral cameras (e.g., RGB and FIR) in a stereo setup. Owing to the difficulty of matching feature points directly between RGB and FIR images, we have leveraged the constant extrinsic parameter of the stereo setup and a small number of feature correspondences between same-modal images. Two types of formulae for scale parameter estimation, both of which are based on the epipolar constraint, were proposed in this paper. We have also verified the difference in scale estimation accuracy and stability between the two formulae in the synthetic and real image experiments. The cause of the difference in scale estimation stability requires further investigation.
Additionally, we have demonstrated scale estimation of monocular SfM in an experimental environment using an RGB–FIR stereo camera, and we have verified its accuracy both before and after BA. The results show that the proposed method can estimate an appropriate scale parameter and that its accuracy depends on the baseline length between the RGB and FIR cameras of the stereo camera system. Moreover, we have presented thermal 3D modeling as an application of the proposed scale estimation method.
These results suggest that the proposed method is applicable to the construction of thermal 3D mappings using payload-limited vehicles, such as UAVs, on which an RGB–FIR camera system is mounted. Therefore, we conclude that the proposed method is suitable for scale estimation of monocular SfM.
Acknowledgements. This research is supported by the Hori Sciences & Arts Foundation, the New Energy and Industrial Technology Development Organization (NEDO) and JSPS KAKENHI Grant Number 18K18071.
References
 [1] Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building Rome in a day. In: International Conference on Computer Vision (ICCV). pp. 72–79 (2009)
 [2] Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: European Conference on Computer Vision (ECCV). pp. 404–417 (2006)
 [3] Bertozzi, M., Broggi, A., Caraffi, C., Rose, M.D., Felisa, M., Vezzoni, G.: Pedestrian detection by means of far-infrared stereo vision. Computer Vision and Image Understanding 106(2), 194–204 (2007)
 [4] Clipp, B., Kim, J.H., Frahm, J.M., Pollefeys, M., Hartley, R.: Robust 6dof motion estimation for non-overlapping, multi-camera systems. In: IEEE Workshop on Applications of Computer Vision (WACV) (2008)
 [5] Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: Monoslam: Realtime single camera slam. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(6), 1052–1067 (2007)
 [6] DeTone, D., Malisiewicz, T., Rabinovich, A.: Toward geometric deep slam. arXiv preprint arXiv:1707.07410 (2017)
 [7] Furukawa, Y., Ponce, J.: Accurate, Dense, and Robust MultiView Stereopsis. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32(8), 1362–1376 (2010)
 [8] Ham, Y., GolparvarFard, M.: An automated visionbased method for rapid 3d energy performance modeling of existing buildings using thermal and digital imagery. Advanced Engineering Informatics 27(3), 395–409 (2013)

 [9] Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: Matchnet: Unifying feature and metric learning for patch-based matching. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3279–3286 (2015)
 [10] Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edn. (2004)
 [11] Iwaszczuk, D., Stilla, U.: Camera pose refinement by matching uncertain 3d building models with thermal infrared image sequences for high quality texture extraction. ISPRS Journal of Photogrammetry and Remote Sensing 132, 33–47 (2017)
 [12] Jancosek, M., Pajdla, T.: Multi-view reconstruction preserving weakly-supported surfaces. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3121–3128 (2011)
 [13] Kitt, B.M., Rehder, J., Chambers, A.D., Schonbein, M., Lategahn, H., Singh, S.: Monocular visual odometry using a planar road model to solve scale ambiguity. In: European Conference on Mobile Robots (2011)
 [14] Klein, G., Murray, D.: Parallel tracking and mapping for small ar workspaces. In: International Symposium on Mixed and Augmented Reality (ISMAR). pp. 225–234 (2007)
 [15] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV) 60(2), 91–110 (2004)
 [16] Müller, A.O., Kroll, A.: Generating high fidelity 3d thermograms with a handheld real-time thermal imaging system. IEEE Sensors Journal 17(3), 774–783 (2017)
 [17] Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Realtime dense surface mapping and tracking. In: International symposium on Mixed and augmented reality (ISMAR). pp. 127–136 (2011)
 [18] Nistér, D.: An efficient solution to the fivepoint relative pose problem. IEEE transactions on pattern analysis and machine intelligence 26(6), 756–770 (2004)
 [19] Nützi, G., Weiss, S., Scaramuzza, D., Siegwart, R.: Fusion of imu and vision for absolute scale estimation in monocular slam. Journal of Intelligent & Robotic Systems 61(1), 287–299 (2011)
 [20] Oreifej, O., Cramer, J., Zakhor, A.: Automatic generation of 3d thermal maps of building interiors. ASHRAE transactions 120, C1 (2014)
 [21] Truong, T.P., Yamaguchi, M., Mori, S., Nozick, V., Saito, H.: Registration of RGB and thermal point clouds generated by structure from motion. In: International Conference on Computer Vision Workshop (ICCVW) (2017)
 [22] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: International Conference on Computer Vision (ICCV). pp. 2564–2571 (2011)
 [23] Scaramuzza, D., Fraundorfer, F., Pollefeys, M., Siegwart, R.: Absolute scale in structure from motion from a single vehicle mounted camera by exploiting nonholonomic constraints. In: International Conference on Computer Vision (ICCV). pp. 1413–1419 (2009)
 [24] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4104–4113 (2016)
 [25] Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV). pp. 501–518 (2016)
 [26] Stewénius, H., Engels, C., Nistér, D.: Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing 60, 284–294 (2006)
 [27] Thiele, S.T., Varley, N., James, M.R.: Thermal photogrammetric imaging: A new technique for monitoring dome eruptions. Journal of Volcanology and Geothermal Research 337(Supplement C), 140–145 (2017)
 [28] Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment — a modern synthesis. In: Vision Algorithms: Theory and Practice. pp. 298–372 (1999)
 [29] Vidas, S., Moghadam, P., Bosse, M.: 3d thermal mapping of building interiors using an rgbd and thermal camera. In: International Conference on Robotics and Automation (ICRA). pp. 2311–2318 (2013)
 [30] Weinmann, M., Leitloff, J., Hoegner, L., Jutzi, B., Stilla, U., Hinz, S.: Thermal 3d mapping for object detection in dynamic scenes. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2(1), 53 (2014)

 [31] Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4353–4361 (2015)
 [32] Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22(11), 1330–1334 (2000)