I. Introduction
Structure from motion (SfM) is the process of estimating 3D structure and camera motion from a series of 2D images. A well-known limitation of this process is the scale ambiguity: with a monocular camera, the reconstruction is only possible up to an unknown scale factor. However, the scale information would often be useful, for example, when making body size measurements for online shopping. Similarly, the scale information could be utilized in 3D printing. The user could first scan an object with a smart device and then print it in exact dimensions with a 3D printer.
The scale ambiguity can be resolved by using at least two calibrated cameras or a depth camera. Besides having a limited operational range, stereo and depth cameras are more expensive and rarely included in mobile devices. The global positioning system (GPS) can also be used for obtaining the metric scale of the reconstruction. However, GPS is relatively inaccurate and only works outdoors.
Some scale estimation methods avoid the need for extra hardware by making assumptions about the scene content. For instance, the smart device application [1] allows the user to make metric measurements of the scene by using the known dimensions of a credit card, which is placed in the scene. Similarly, the application [2] detects the device itself in a mirror in order to make body size measurements of the user. The method [3] solves the problem by assuming that the ground is flat and that the approximate height of the camera above the ground is known.
In this paper, we propose a method for recovering the metric scale of a visual SfM reconstruction by using inertial measurements recorded with an IMU that is rigidly attached to the camera. An example of a scaled reconstruction is shown in Figure 1. The proposed approach outperforms the state-of-the-art, especially when dealing with noisy measurements, e.g. due to motion blur, rolling shutter artifacts or a low-quality IMU. The estimated scale is typically within about 1% of the ground truth, depending on the recording. We also present a calibration method, which aligns the inertial and visual measurements both temporally and spatially. Therefore, the algorithm can be easily bundled with any structure from motion software that outputs the camera poses. Our method and data will be made available as open source upon publication of the paper.
II. Related Work
The fusion of visual and inertial measurements has been a popular research topic in the robotics community. Most previous systems focus on real-time tracking and navigation, e.g. [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. These approaches require tightly integrated sensor fusion, which places requirements on the hardware. For example, the synchronization of individual video frames and IMU sensor timestamps must be relatively accurate, and often the IMUs used are of notably better quality than standard smartphone IMUs, which are not designed for inertial navigation purposes. In fact, many of the previous approaches utilize special hardware setups. For instance, both [5] and [11] use a similar synchronized IMU and stereo camera hardware prototype. Further, [7, 9, 12, 13] also use custom-made camera-IMU hardware. Finally, perhaps the most well-known example of a specialized hardware platform for visual-inertial odometry is the Google Tango tablet device, which utilizes a fisheye lens camera [14]. Regarding the Google Tango device, it should be noted that the implementation is proprietary and not openly documented, and hence it is difficult to analyze whether similar performance could be realized with more conventional smartphone hardware.
Nevertheless, there are some approaches which utilize standard smartphone sensors for motion tracking and metric reconstruction [6, 15]. In [15] the authors report that the recovered scale was estimated to have an error of up to 10–15%, which is a significantly larger error than what we report in this paper. On the other hand, [6] focuses on real-time visual-inertial odometry in the navigation context, and not on precise metric reconstruction of objects like [15]. Thus, [6] does not present a quantitative evaluation of the obtained scale accuracy and, as their implementation is not publicly available, detailed comparisons are not possible. Interestingly, although [6] aims at real-time tracking, the authors report that in their experiments the data was recorded and stored onboard a phone and then processed offline on the same device (but the average processing time per frame was small enough for real-time operation). Thus, it seems that there are not many publicly documented visual-inertial odometry systems that would be both accurate and capable of truly real-time operation on a smartphone.
Besides placing specific requirements on the hardware, tightly integrated fusion of visual and inertial measurements is a challenging task and leads to relatively complex designs, as one needs to solve two difficult problems, visual odometry and inertial navigation, simultaneously. We believe that this complexity partially explains why many of the aforementioned state-of-the-art visual-inertial odometry methods are not available as open-source implementations.
In contrast to the approaches that utilize inertial sensors, purely visual reconstruction approaches are more mature and many solutions are available, both as open-source [16, 17, 18, 19, 20, 21] and commercial software [22, 23, 24]. In particular, given the good visual quality and high accuracy and completeness of the reconstructed models that some of the commercial software packages provide, it appears that the only feature missing from these solutions is the absolute metric scale, which cannot be obtained using images alone but which would be needed for certain measurement and modeling applications (e.g. body size measurements for clothing or reproduction of objects via 3D printing).
Therefore, instead of aiming at tight integration of visual and inertial measurements for real-time odometry, we argue in favour of a decoupled approach, where one may use any visual structure-from-motion tool for capturing a 3D reconstruction from a smartphone video, and thereafter determine the metric scale of the reconstruction by applying the proposed approach to the inertial measurements, which were recorded simultaneously with the video capture. In fact, our experiments show that the proposed approach provides accurate results even when the precise temporal and spatial alignment between the video and IMU signals is not known a priori, or the camera poses provided by structure from motion are noisy and inaccurate (e.g. due to motion blur or rolling shutter effects). Further, besides having a wider application potential due to milder hardware requirements than most of the tightly integrated visual-inertial solutions, we believe that our batch-based approach also has potential for better accuracy thanks to the fact that it is able to utilize all the noisy IMU measurements for estimating the scale factor correction. Thus, if online or real-time scale estimation is not necessary, one may get a reasonably accurate scale even from the low-quality IMUs of commodity devices.
The closest previous works to ours are the papers by Ham et al. [25, 26], which address the same problem in a similar context. That is, they also apply off-the-shelf visual tracking software to recover the camera poses up to scale and thereafter fix the metric scale based on inertial measurements. However, in contrast to Ham's approach, we do not assume that the relative orientation of the camera and IMU is known a priori. Further, we propose scale estimation by matching accelerations from visual and inertial sensors in the frequency domain instead of the time domain. We compare our frequency-based approach both to Ham's original implementation and to our own implementation of Ham's time domain approach. The results show that our approach has better accuracy and faster convergence of the scale estimate.
III. Background
III-A. Measurements
Structure from motion software such as VisualSFM [16] outputs the camera pose for each image in the sequence. The pose defines the camera's position and orientation in the world coordinate frame. Subscripts $w$ and $c$ denote the world and camera frames, respectively. The IMU measures the acceleration and angular rate of the device in the sensor coordinate frame, which is denoted with subscript $s$. Note that we use different superscripts for the visual and inertial measurements to avoid confusion.
III-B. Transformations
The orientation of the device can be represented by a rotation matrix $\mathbf{R}$. To transform a vector given in the world coordinates, e.g. the gravity vector $\mathbf{g}_w$, to the camera coordinate frame, we apply the rotation

$$ \mathbf{g}_c = \mathbf{R}\,\mathbf{g}_w \qquad (1) $$
Assuming that the camera and the IMU are part of the same rigid structure and close together, there exists an orthogonal transformation $\mathbf{R}_{cs}$ that rotates the inertial measurements $\mathbf{a}_s$ to the camera coordinate frame

$$ \mathbf{a}_c = \mathbf{R}_{cs}\,\mathbf{a}_s \qquad (2) $$
In Section IV-B, we propose a method for finding this rotation when the Euclidean transformation between the camera and IMU is unknown. Note that the centripetal accelerations caused by rotations are assumed to be negligibly small due to the small distance between the camera and the IMU.
III-C. Overview of the algorithm
The processing steps of the proposed algorithm are visualized in Figure 2. Before the scale estimation, we align the camera and IMU measurements both temporally and spatially. This is achieved by comparing gyroscope readings with visual angular velocities, which are computed from the camera orientations. The scale estimation itself is performed by matching visual and inertial accelerations. More specifically, we differentiate the camera positions and match the accelerations in the frequency domain. Structure from motion algorithms typically assume that the input is a collection of unordered images. Thus, the continuity of motion is ignored when reconstructing from a video. This may lead to noisy position estimates, reducing the accuracy of the scale estimate. Since we can expect the device to follow the physical laws of motion, we employ the Rauch-Tung-Striebel (RTS) smoother to refine the position estimates.
IV. Visual-IMU Alignment
The scale estimation relies on the assumption that the visual and inertial measurements are temporally aligned. Furthermore, we want to ensure that the camera coordinate frame is spatially aligned with the coordinate frame of the inertial measurement unit.
IV-A. Temporal alignment
The timestamps of the IMU and visual data are a natural starting point for temporal alignment. However, camera timestamps are not always available, and even when they are, the camera and IMU may be using different clocks. This prevents direct comparison of the measurements since there is an unknown temporal offset between the timestamps. Both the literature and our experiments confirm that the mapping between the IMU and visual data is not just a constant shift. The time offset can slightly change over time, due to inaccuracies in the sensors' clocks or clock jitter caused by CPU overloading [27].
Our solution to temporal alignment is based on the idea of comparing gyroscope readings with the visual angular velocities, which are computed from the camera orientations. The offset value that minimizes the least-squares error is chosen as the best estimate. We perform the temporal alignment concurrently with the spatial alignment since the angular velocities cannot be compared without knowing the spatial alignment between the coordinate frames. This topic is discussed in Section IV-B. Figure 3 shows the angular velocities before and after the temporal alignment. Note that the angular velocities have been spatially aligned for visualization.
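As an illustrative sketch of such an offset search (not our actual implementation, which estimates the offset jointly with the spatial alignment as described in Section IV-B), one can grid-search the offset using the rotation-invariant magnitudes of the angular velocities, so that no spatial alignment is needed. The function name and parameters below are hypothetical:

```python
import numpy as np

def estimate_offset(t, w_visual, w_imu, max_shift=2.0, step=0.01):
    """Grid search for the temporal offset (in seconds) that best aligns two
    angular velocity sequences sampled on a common time grid t.
    Compares rotation-invariant magnitudes, so no spatial alignment is needed."""
    mag_v = np.linalg.norm(w_visual, axis=1)
    mag_i = np.linalg.norm(w_imu, axis=1)
    best_offset, best_err = 0.0, np.inf
    for d in np.arange(-max_shift, max_shift + step, step):
        # Evaluate the IMU magnitude signal shifted by d on the visual grid
        mag_shifted = np.interp(t, t + d, mag_i)
        err = np.mean((mag_v - mag_shifted) ** 2)
        if err < best_err:
            best_offset, best_err = d, err
    return best_offset
```

The grid resolution bounds the achievable accuracy; a subsequent local refinement (as in our golden-section scheme) can sharpen the estimate.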
Previous methods, such as [26], assume a constant offset but, as mentioned, the offset can vary slightly over time. In fact, we can see from Figure 3 that the signals are not perfectly aligned after compensating for a constant delay. Even though the difference is barely noticeable, we can take it into account by performing the scale estimation in the frequency domain, as will be shown in Section V-D.
IV-B. Spatial alignment
The rotation matrix $\mathbf{R}_{cs}$, which aligns the sensor coordinate frame with the camera coordinate frame, may often be deduced from the documentation of the development platform and the reconstruction software. There is, however, no guarantee that such a transformation is precisely correct. Our approach does not require that the transformation is known in advance. Instead, we estimate the transformation by utilizing gyroscope measurements. The aim is to find the optimal rotation between two sets of angular velocities. First we compute the visual angular velocities $\boldsymbol{\omega}^v$ from the camera orientations $\mathbf{R}$. The relation between the angular velocity and the derivative of the rotation matrix is given by the equation

$$ [\boldsymbol{\omega}^v]_\times = \dot{\mathbf{R}}\,\mathbf{R}^\top \qquad (3) $$
The components of the angular velocity $\boldsymbol{\omega}^v$ can be extracted from the $3 \times 3$ skew-symmetric matrix $[\boldsymbol{\omega}^v]_\times$. We can find the transformation between the angular velocities $\boldsymbol{\omega}^i$ and $\boldsymbol{\omega}^v$ by minimizing

$$ \min_{\mathbf{R}_{cs},\,\mathbf{t}}\ \sum_k \left\| \mathbf{R}_{cs}\,\boldsymbol{\omega}^i_k + \mathbf{t} - \boldsymbol{\omega}^v_k \right\|^2 \qquad (4) $$
where $\mathbf{t}$ is the translation vector, which equals the gyroscope bias in the camera coordinate frame. It has been shown [28, 29] that this problem can be solved optimally in closed form. Indeed, we may first translate the angular velocities so that their centroids are at the origin of the coordinate frame, then solve the optimal rotation $\mathbf{R}_{cs}$, and thereafter $\mathbf{t}$, as in [28].
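A minimal sketch of this closed-form solution, following the centroid-subtraction and SVD construction of [28] (the function name is ours):

```python
import numpy as np

def align_rotation(w_imu, w_vis):
    """Closed-form least-squares rotation R and bias t such that
    R @ w_imu[k] + t ~ w_vis[k], following Arun et al. [28].
    w_imu, w_vis: (n, 3) angular velocity samples."""
    mu_i = w_imu.mean(axis=0)
    mu_v = w_vis.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (w_imu - mu_i).T @ (w_vis - mu_v)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections so the result is a proper rotation
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    R = Vt.T @ D @ U.T
    t = mu_v - R @ mu_i   # gyroscope bias in the camera frame
    return R, t
```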
Thus far, we have assumed that the visual and inertial angular velocities are temporally aligned. Since this may not be true, we include the offset term $d$ in the objective function

$$ \min_{\mathbf{R}_{cs},\,\mathbf{t},\,d}\ \sum_k \left\| \mathbf{R}_{cs}\,\boldsymbol{\omega}^i(t_k + d) + \mathbf{t} - \boldsymbol{\omega}^v(t_k) \right\|^2 \qquad (5) $$
which is minimized using alternating optimization. That is, we find the optimal offset $d$ iteratively using golden-section search, where we solve $\mathbf{R}_{cs}$ and $\mathbf{t}$ in closed form at each iteration.
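The alternating scheme can be sketched as follows. The names `golden_section` and `alignment_residual` are hypothetical, and the inner solve reuses the closed-form construction of [28]:

```python
import numpy as np

def golden_section(f, a, b, tol=1e-4):
    """Minimize a unimodal function f on [a, b] by golden-section search.
    Re-evaluates both interior points each iteration for clarity."""
    gr = (np.sqrt(5) - 1) / 2
    c, d = b - gr * (b - a), a + gr * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - gr * (b - a)
        else:
            a, c = c, d
            d = a + gr * (b - a)
    return (a + b) / 2

def alignment_residual(d, t, w_vis, w_imu):
    """Residual of (5) for a candidate offset d, with the rotation and
    bias solved in closed form (Arun et al. [28])."""
    # Shift the IMU angular velocities by d via linear interpolation
    w_shift = np.column_stack([np.interp(t, t + d, w_imu[:, k]) for k in range(3)])
    mu_s, mu_v = w_shift.mean(axis=0), w_vis.mean(axis=0)
    H = (w_shift - mu_s).T @ (w_vis - mu_v)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])
    R = Vt.T @ D @ U.T
    bias = mu_v - R @ mu_s
    return np.sum((w_vis - (w_shift @ R.T + bias)) ** 2)
```

Usage would be, e.g., `d_opt = golden_section(lambda d: alignment_residual(d, t, w_vis, w_imu), -0.5, 0.5)`.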
Before the scale estimation, we align the inertial accelerations to the camera frame both temporally and spatially using the estimated values for $d$ and $\mathbf{R}_{cs}$ (cf. Figure 2).
V. Scale Estimation
The aim of the scale estimation is to find a scale factor that fixes the metric scale of the reconstruction. Our work is similar to [26] in the sense that we compare accelerations instead of positions. What makes the problem more complicated is the fact that the accelerometer also measures the earth's gravity, which is not observed by the camera. Furthermore, the measurements are corrupted by noise and the accelerometer readings may be biased. In practice, we estimate not only the scale but also the gravity and the accelerometer bias.
V-A. Visual accelerations and RTS smoothing
Assuming that we know the time interval between each frame in the video, we can compute the visual accelerations by taking the second derivative of the position. Figure 4 shows the original camera positions and the corresponding accelerations. As can be seen, the differentiation amplifies the noise in the original signal. Using these noisy accelerations in the scale estimation would cause the scale to be severely underestimated.
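The noise amplification can be illustrated with a short simulation (the numbers are illustrative, not from our data):

```python
import numpy as np

# Illustrative: differentiating noisy positions twice greatly amplifies noise.
rng = np.random.default_rng(1)
dt = 1.0 / 30.0                      # 30 fps video
t = np.arange(0, 10, dt)
pos_true = np.sin(t)                 # smooth 1-D trajectory (meters)
pos_noisy = pos_true + rng.normal(scale=1e-3, size=t.shape)  # ~1 mm noise

# Second derivative via repeated central differences
acc_true = np.gradient(np.gradient(pos_true, dt), dt)
acc_noisy = np.gradient(np.gradient(pos_noisy, dt), dt)

noise = acc_noisy - acc_true
# Millimeter-level position noise becomes acceleration noise on the order
# of 0.5 m/s^2, comparable to the true signal amplitude of 1 m/s^2.
print(np.std(noise))
```

This is why smoothing the positions before differentiation (Section V-A) matters so much for the scale estimate.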
The method in [26] addresses this problem by applying a low-pass filter to the visual and inertial accelerations. Based on our experiments, the choice of the cutoff frequency has a significant impact on the accuracy of the scale estimation. The difficulty is that some sequences may require more smoothing than others. The amount of noise may also vary within different parts of the sequence.
The Kalman filter [30] can also be used to filter the position data. Its advantage is that it accounts for the fact that the device is expected to follow the physical laws of motion. The algorithm weights the measurements and predicted states based on their uncertainties. However, it does not, as such, solve the problem of the varying amount of smoothing required by the sequences.
In order to cope with the varying amount of smoothing required, we adapt the process noise parameters of the Kalman filter state-space model using the marginal maximum likelihood method (e.g., [31]). We compute the likelihood curve on a one-dimensional grid and automatically select the best process noise value based on the marginal likelihood obtained by the prediction error decomposition.
To further improve the estimates, we employ the Rauch-Tung-Striebel (RTS) smoother [32]. Unlike the Kalman filter, the RTS smoother also utilizes future samples to determine the optimal smoothing. It is a two-pass algorithm for fixed-interval smoothing, where the forward pass corresponds to the Kalman filter. The state estimates and covariances are stored for the backward pass. Figure 4 shows the smoothed positions and accelerations, which are computed in the backward pass.
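A sketch of the forward-backward procedure for a single axis, assuming a constant-acceleration state with piecewise-constant jerk noise; the actual state-space model and noise parameters in our implementation may differ:

```python
import numpy as np

def rts_smooth_positions(z, dt, q, r):
    """Kalman filter + RTS smoother for noisy 1-D positions z.
    State x = [position, velocity, acceleration]; q is the jerk noise
    intensity, r the measurement noise variance. Returns smoothed states (n, 3)."""
    F = np.array([[1, dt, 0.5 * dt**2],
                  [0, 1, dt],
                  [0, 0, 1]])
    G = np.array([0.5 * dt**2, dt, 1.0]).reshape(3, 1)
    Q = q * (G @ G.T)                 # piecewise-constant jerk process noise
    H = np.array([[1.0, 0.0, 0.0]])   # we observe position only

    n = len(z)
    xs = np.zeros((n, 3)); Ps = np.zeros((n, 3, 3))   # filtered
    xp = np.zeros((n, 3)); Pp = np.zeros((n, 3, 3))   # predicted
    x = np.array([z[0], 0.0, 0.0]); P = np.eye(3) * 10.0

    for k in range(n):                # forward pass (Kalman filter)
        if k > 0:
            x = F @ x
            P = F @ P @ F.T + Q
        xp[k], Pp[k] = x, P
        S = H @ P @ H.T + r
        K = P @ H.T / S
        x = x + (K * (z[k] - H @ x)).ravel()
        P = (np.eye(3) - K @ H) @ P
        xs[k], Ps[k] = x, P

    for k in range(n - 2, -1, -1):    # backward pass (RTS smoother)
        C = Ps[k] @ F.T @ np.linalg.inv(Pp[k + 1])
        xs[k] = xs[k] + C @ (xs[k + 1] - xp[k + 1])
        Ps[k] = Ps[k] + C @ (Ps[k + 1] - Pp[k + 1]) @ C.T

    return xs
```

The smoothed acceleration is read directly from the third state component, avoiding explicit numerical differentiation of the positions.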
V-B. Scale and bias estimation
When comparing visual and inertial accelerations, we have to take into account that the accelerometer not only measures the acceleration caused by the motion but also the acceleration due to the earth's gravity. Most smart device APIs provide estimates of the linear acceleration in addition to the raw accelerometer readings. In the case of linear acceleration, the scale estimation is simpler because the readings do not contain the gravity component. Later on, we show that we do not have to rely on the built-in black-box gravity estimation, but let us first consider the case of linear acceleration. Given the visual accelerations $\mathbf{a}^v_k$ and the linear accelerometer readings $\mathbf{a}^i_k$, the scale factor $s$ can be estimated by minimizing

$$ \min_{s}\ \sum_k \left\| s\,\mathbf{a}^v_k - \mathbf{a}^i_k \right\|^2 \qquad (6) $$
Note that visual accelerations have been rotated to the camera coordinate frame using (1). The inertial accelerations have also been aligned with the camera coordinate frame using (2).
The cost function (6) disregards the fact that accelerometer readings are typically biased. Let us denote the accelerometer bias in the camera coordinate frame by $\mathbf{b}$. It is known that the bias may depend on the temperature of the sensor [33]. According to our experiments, we may assume that the bias stays constant during the recording. We also experimented with a bias term that varies linearly with time but did not observe any noticeable improvement. The objective function with the constant bias can be written as

$$ \min_{s,\,\mathbf{b}}\ \sum_k \left\| s\,\mathbf{a}^v_k + \mathbf{b} - \mathbf{a}^i_k \right\|^2 \qquad (7) $$
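Since (7) is linear in the unknowns, it reduces to a single linear least-squares solve. A sketch, assuming both acceleration sets are already aligned to the camera frame (the function name is ours):

```python
import numpy as np

def estimate_scale_and_bias(acc_vis, acc_imu):
    """Solve min_{s,b} sum_k || s * a_vis_k + b - a_imu_k ||^2 in closed form.
    acc_vis, acc_imu: (n, 3) accelerations in the camera frame."""
    n = acc_vis.shape[0]
    # Unknowns x = [s, bx, by, bz]; stack the 3 axes of every sample as rows.
    A = np.zeros((3 * n, 4))
    A[:, 0] = acc_vis.ravel()
    A[:, 1:] = np.tile(np.eye(3), (n, 1))   # identity block selects the bias axis
    x, *_ = np.linalg.lstsq(A, acc_imu.ravel(), rcond=None)
    return x[0], x[1:]                       # scale, bias
```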
V-C. Gravity estimation
As mentioned, the gravity is included in the raw accelerometer readings $\mathbf{a}^i_k$. In order to compare the accelerations, we need to subtract the gravity component from the IMU measurements. Since the gravity vector $\mathbf{g}$ is constant in the world coordinate frame, we can transform it to the camera coordinate frame using (1). Therefore, we only need to estimate the three parameters of the gravity vector. The objective function with the gravity term can be written as

$$ \min_{s,\,\mathbf{b},\,\mathbf{g}}\ \sum_k \left\| s\,\mathbf{a}^v_k + \mathbf{R}_k\,\mathbf{g} + \mathbf{b} - \mathbf{a}^i_k \right\|^2 \quad \text{s.t.} \quad \|\mathbf{g}\| = 9.81 \qquad (8) $$
Note that the system is linear except for the nonlinear gravity constraint that needs to be satisfied. In practice, the accuracy of the scale is almost as good even if the gravity constraint is ignored.
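Dropping the gravity constraint, the problem becomes linear in all seven unknowns and can be solved in one least-squares step. A sketch with a hypothetical function name:

```python
import numpy as np

def estimate_scale_bias_gravity(acc_vis, acc_raw, Rs):
    """Unconstrained linear version of the gravity-aware objective:
    min over (s, b, g) of sum_k || s*a_vis_k + R_k g + b - a_raw_k ||^2.
    Rs: (n, 3, 3) world-to-camera rotations. The ||g|| = 9.81 constraint
    is dropped here, which costs little accuracy in practice."""
    n = acc_vis.shape[0]
    A = np.zeros((3 * n, 7))            # unknowns x = [s, b(3), g(3)]
    A[:, 0] = acc_vis.ravel()
    A[:, 1:4] = np.tile(np.eye(3), (n, 1))
    for k in range(n):
        A[3 * k:3 * k + 3, 4:7] = Rs[k]  # gravity rotated into the camera frame
    x, *_ = np.linalg.lstsq(A, acc_raw.ravel(), rcond=None)
    return x[0], x[1:4], x[4:7]          # scale, bias, gravity (world frame)
```

This solution also serves as the initial guess for the constrained optimization described next.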
V-D. Frequency domain representation
In this section, we show that the scale estimation can also be performed in the frequency domain. This representation has clear advantages over the time domain, as will be demonstrated in the experiments. It was discussed earlier that the visual accelerations need to be smoothed for accurate results. The frequency domain representation allows us to simply disregard the high-frequency components, i.e. the noise in the visual and inertial accelerations. This way we can avoid the difficulty of choosing the right cutoff frequency for the low-pass filter. Another advantage of the frequency domain representation is its robustness against the phase difference between the inertial and visual data. As discussed in Section IV-A, the temporal offset of the inertial and visual measurements may slightly vary over time. This may cause problems when the scale estimation is performed in the time domain.
Let us denote the Fourier transform by the operator $\mathcal{F}$. From (8), we get the visual and inertial accelerations

$$ \mathbf{F}^v = \mathcal{F}\{\mathbf{a}^v\} \qquad (9) $$

$$ \mathbf{F}^i = \mathcal{F}\{\mathbf{a}^i - \mathbf{R}\,\mathbf{g} - \mathbf{b}\} \qquad (10) $$
where $\mathbf{F}^v$ and $\mathbf{F}^i$ are matrices that contain the Fourier transforms of the visual and inertial accelerations, respectively. Note that the transform is taken separately for each of the three axes. We wish to minimize the difference between the amplitude spectra $|\mathbf{F}^v|$ and $|\mathbf{F}^i|$

$$ \min_{s,\,\mathbf{b},\,\mathbf{g}}\ \sum_{f=0}^{f_{\max}} \left\| s\,|\mathbf{F}^v_f| - |\mathbf{F}^i_f| \right\|^2 \quad \text{s.t.} \quad \|\mathbf{g}\| = 9.81 \qquad (11) $$
The upper limit for the frequencies is denoted by $f_{\max}$. In our experiments, this value was set to 1.2 Hz, and we did not find it necessary to tune the parameter. To ensure a good initialization for the minimization, we use the closed-form solution of (8) as an initial guess (i.e. we drop the nonlinear constraint in (8) but not in (11)). The problem is then solved with the fmincon function provided by the optimization toolbox of Matlab. Figure 5 shows the time and frequency domain representations of the inertial and visual accelerations.
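A simplified sketch of the frequency domain matching: unlike (11), it assumes the gravity and bias have already been compensated, in which case the scale minimizing the amplitude-spectrum difference has a closed form (the function name and cutoff default are ours):

```python
import numpy as np

def scale_from_spectra(acc_vis, acc_imu, fs, f_max=1.2):
    """Estimate the scale by matching amplitude spectra below f_max (Hz).
    acc_vis, acc_imu: (n, 3) gravity- and bias-compensated accelerations
    sampled at fs Hz. Minimizing sum || s*|F_v| - |F_i| ||^2 over the kept
    frequency bins gives s in closed form."""
    freqs = np.fft.rfftfreq(acc_vis.shape[0], d=1.0 / fs)
    keep = freqs <= f_max                        # discard high-frequency noise
    Fv = np.abs(np.fft.rfft(acc_vis, axis=0))[keep]   # per-axis spectra
    Fi = np.abs(np.fft.rfft(acc_imu, axis=0))[keep]
    return np.sum(Fv * Fi) / np.sum(Fv ** 2)
```

Because high-frequency bins are simply dropped, broadband differentiation noise barely influences the estimate, which is the main benefit over time domain matching.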
VI. Evaluation
We evaluate our method on several datasets captured with the NVIDIA Shield tablet [34] and the Project Tango Development Kit [14]. The main difference between these devices is that the Tango has a built-in motion tracking system, which provides pose estimates in real time. In the case of the NVIDIA Shield, the camera poses were obtained using the VisualSFM software [16]. We compare our solution against the state-of-the-art method proposed by Ham et al. [26] and the Tango's built-in motion tracking.
The method [26] has a parameter that determines the amount of smoothing applied to the accelerations. This parameter was tuned here so that the average error of the scale with respect to the ground truth was as low as possible. It should be noted that this may give over-optimistic results since the same datasets were used for training and evaluation.
It should be also noted that our method does not have parameters that would need to be manually adjusted per sequence. That is, all parameters are either set automatically or kept at a fixed value for all test sequences.
Further, in contrast to our method, the method [26] needs the IMU-to-camera transformation as an input. This transformation was set as defined in the Android documentation.
Finally, although our implementation is not optimized for speed, the scale estimation is computationally light compared to the visual reconstruction. For example, even the long test sequences were processed in a fraction of the reconstruction time with our Matlab implementation on a laptop.
VI-A. Static scene experiments
In these experiments, we captured 15 datasets with the NVIDIA Shield tablet. The IMU was logging at 100 Hz, which is the fastest possible rate for this device. The camera was recording at 30 fps with a resolution of 1080x1920 pixels. The camera pose trajectory was upsampled to match the sample rate of the IMU. Timestamps were available for both measurements, although the camera and IMU were using different clocks. The camera poses were obtained using the VisualSFM software. The ground truth scale was determined using the software's built-in feature for ground control point registration. For this purpose, we placed a checkerboard pattern (or other known 3D points) in the scene.
Figure 6 shows individual frames from the datasets. The objects in the scenes are viewed from different distances and angles. In general, the camera motion is not particularly fast, but many of the video frames clearly suffer from motion blur. Each sequence is 60 seconds long and the total distance the camera travels varies between 14 and 20 meters. In the experiments, we wanted to make sure that there is no visual drift. Therefore, at least some parts of the scene remain visible for the entire length of the recording. In the case of visual drift, the scale of the reconstruction might change over time.
In practice, we only need to run the scale estimation once using all the available samples. However, we are also interested in how quickly the scale converges towards the true value. The green line in Figure 7 shows the average error of the scale after the camera has traveled a certain distance. We can clearly see that the accuracy improves the farther the camera travels. The average error of the scale is already below 3% after the camera has traveled 2 meters.
A comparison of the RTS smoother and a low-pass filter is shown in Figure 7. Here we also compare the frequency domain method against the time domain representation. We can see that the RTS smoother outperforms the low-pass filter, regardless of the method used and distance traveled. The best results are obtained when the RTS smoother is used together with the frequency domain method. As a reference, the graph also shows the results when smoothing is not applied. In that case, the error of the time domain method is near 90% while the frequency domain method is still able to estimate the scale with around 5% accuracy.
We compare our method against the algorithm proposed by Ham et al. [26]. Table I summarizes the results for the individual datasets. The average error is visualized in Figure 8. With our method, the average error is around 1% after the camera has traveled 14 meters. The corresponding error for Ham et al. is around 2%. We can also see that our algorithm converges more quickly towards the true scale.
In the previous experiments, the spatial alignment of the camera and IMU was estimated from the input data. For comparison, Figure 8 shows the average errors in case the spatial alignment is defined in advance. Here we have used the same IMUtocamera calibration matrix as with the algorithm [26]. The results show that the accuracy of the scale is equally good when the transformation is estimated from the input data. This is true even when the sequence is short.
Table I. Error of the estimated scale for datasets #1–#15 after the camera has traveled 1 m, 2 m, 6 m and 14 m, for the method of Ham et al. [26] and the proposed method.
VI-B. Height measuring experiments
Here we intend to find out how accurately we can estimate the height of a person using our method. Unlike in the previous experiments, the scene is not completely static. The person was, however, advised to remain stationary during the recording. The datasets were captured with the NVIDIA Shield tablet and consist of one indoor sequence (50 seconds) and one outdoor sequence (40 seconds).
Figure 1 shows the scaled point clouds. Most of the points in the background were removed for clarity, especially the points above the head and below the ground plane. The point clouds were also rotated so that the gravity vector is aligned with the yaxis.
For these experiments, we did not place any artificial objects (e.g. the checkerboard) in the scene. Instead, the true height of the person was measured to be approximately 174 cm. The estimated heights were 176 cm (indoor sequence) and 181 cm (outdoor sequence). These values correspond to errors of 1.1% and 4%, respectively. It should be noted that factors such as the person's posture, clothing and the accuracy of the physical height measurement affect the results.
VI-C. Project Tango experiments
The sequences in these experiments were recorded with the Project Tango Development Kit [14]. In contrast to the previous experiments, there is no need to use any structure from motion software since the Tango provides pose estimates in real time at a rate of 100 Hz. Even though these estimates are already in metric units, we intend to find out whether our method can further improve the accuracy of the scale.
In order to evaluate the accuracy of the Tango’s scale, we set up a test room with known ground point locations. While recording, the device was briefly stopped at each of the four locations and the corresponding points were extracted from the Tango’s trajectory. The points were then rotated and translated to fit the ground points. The optimal transformation was estimated using the algorithm [28]. After the fitting, the rootmeansquare error (RMSE) was computed to evaluate the goodness of the fit. To find out if the results can be improved by our method, we scaled the estimated points with the estimated scale factor and fit the points again.
Table II shows the RMS errors between the ground points and the estimated points, with and without the scale adjustment. Even though the Tango's scale is already quite accurate, we can see that our method improves the results in almost every case. A common observation in all the Tango experiments is that the estimated scale factor is always slightly above one. This indicates that, in these experiments, the Tango tends to underestimate the scale. In fact, when the error sometimes increases, the cause is not an inaccurate scale estimate but rather the fact that the Tango's motion tracking sometimes clearly fails. In such a case, the error cannot be reduced by simply adjusting the scale, because the pose tracking also suffers from other errors.
Dataset  RMSE (built-in)  RMSE (proposed) 

#1  0.0087  0.0074 
#2  0.0201  0.0176 
#3  0.0057  0.0055 
#4  0.0364  0.0392 
#5  0.0272  0.0249 
#6  0.0145  0.0146 
#7  0.0176  0.0141 
#8  0.0126  0.0074 
#9  0.0223  0.0178 
#10  0.0075  0.0057 
VII. Conclusion
We have proposed a method that recovers the metric scale of a visual structure-from-motion reconstruction by utilizing inertial measurements. The algorithm can easily be bundled with existing reconstruction software since the spatial and temporal alignment of the camera and IMU does not have to be known in advance. The evaluation shows that our method outperforms the current state-of-the-art in both accuracy and convergence rate of the scale estimate. The accuracy of the scale estimate is typically around 1%, depending on the distance traveled. We have also demonstrated that our method can improve the scale estimate of the Project Tango's built-in motion tracking.
References
 [1] B. Kamens, “RulerPhone - Photo Measuring,” Apple App Store, https://itunes.apple.com/au/app/rulerphonecamerameasuring/id288774794?mt=8, accessed: 2017-02-28.
 [2] “ThirdLove - Lingerie and Virtual Sizing Technology,” https://www.thirdlove.com/pages/iosapp.
 [3] Smart Tools Co., “Smart Measure Pro,” Google Play Store, https://play.google.com/store/apps/details?id=kr.aboy.measure, accessed: 2017-02-28.
 [4] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in IEEE International Conference on Robotics and Automation (ICRA), 2007.
 [5] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial odometry using nonlinear optimization,” International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
 [6] M. Li, B. H. Kim, and A. I. Mourikis, “Real-time motion tracking on a cellphone using inertial sensing and a rolling-shutter camera,” in IEEE International Conference on Robotics and Automation (ICRA), 2013.
 [7] J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis, “Consistency analysis and improvement of vision-aided inertial navigation,” IEEE Trans. Robotics, vol. 30, no. 1, pp. 158–176, 2014.
 [8] T. Liu and S. Shen, “High altitude monocular visual-inertial state estimation: Initialization and sensor fusion,” 2017.
 [9] J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis, “Camera-IMU-based localization: Observability analysis and consistency improvement,” International Journal of Robotics Research, vol. 33, no. 1, pp. 182–201, 2014.
 [10] P. Tanskanen, T. Nägeli, M. Pollefeys, and O. Hilliges, “Semi-direct EKF-based monocular visual-inertial odometry,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015.
 [11] V. Usenko, J. Engel, J. Stückler, and D. Cremers, “Direct visual-inertial odometry with stereo cameras,” in IEEE International Conference on Robotics and Automation (ICRA), 2016.
 [12] A. Concha, G. Loianno, V. Kumar, and J. Civera, “Visual-inertial direct SLAM,” in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 1331–1338.
 [13] E. Jones, A. Vedaldi, and S. Soatto, “Inertial structure from motion with autocalibration,” in Workshop on Dynamical Vision, vol. 25, 2007.
 [14] “Project Tango Development Kit,” https://developers.google.com/tango/, accessed: 2017-02-28.

 [15] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys, “Live metric 3D reconstruction on mobile phones,” in International Conference on Computer Vision (ICCV), 2013.
 [16] C. Wu, “VisualSfM: A visual structure from motion system,” http://ccwu.me/vsfm/, accessed: 2017-02-28.
 [17] P. Moulon, P. Monasse, R. Marlet, and Others, “OpenMVG. An Open Multiple View Geometry library.” https://github.com/openMVG/openMVG.
 [18] N. Snavely, “Bundler: Structure from motion for unordered image collections,” https://www.cs.cornell.edu/~snavely/bundler/, accessed: 2017-02-28.
 [19] J. E. Solem and Others, “OpenSfM: open source structure from motion pipeline,” https://github.com/mapillary/OpenSfM.
 [20] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in European Conference on Computer Vision (ECCV), 2014.
 [21] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” in IEEE International Conference on Robotics and Automation (ICRA), 2016.
 [22] “Pix4D,” https://pix4d.com/.
 [23] “Acute3D,” https://www.acute3d.com/.
 [24] “CapturingReality,” https://www.capturingreality.com/.
 [25] C. Ham, S. Lucey, and S. Singh, “Handwaving away scale,” in European Conference on Computer Vision (ECCV), 2014.
 [26] ——, “Absolute scale estimation of 3D monocular vision on smart devices,” in Mobile Cloud Visual Media Computing. Springer, 2015, pp. 329–353.
 [27] E. Mair, M. Fleps, M. Suppa, and D. Burschka, “Spatiotemporal initialization for IMU to camera registration,” in Robotics and Biomimetics (ROBIO), 2011 IEEE International Conference on. IEEE, 2011, pp. 557–564.
 [28] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-squares fitting of two 3-D point sets,” IEEE Trans. Pattern Anal. Machine Intell., no. 5, pp. 698–700, 1987.
 [29] K. Kanatani, “Analysis of 3D rotation fitting,” IEEE Trans. Pattern Anal. Machine Intell., vol. 16, no. 5, pp. 543–549, 1994.
 [30] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Transactions of the ASME, Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
 [31] S. Särkkä, Bayesian filtering and smoothing. Cambridge University Press, 2013.
 [32] H. E. Rauch, C. Striebel, and F. Tung, “Maximum likelihood estimates of linear dynamic systems,” AIAA journal, vol. 3, no. 8, pp. 1445–1450, 1965.
 [33] P. Aggarwal, Z. Syed, X. Niu, and N. El-Sheimy, “A standard testing and calibration procedure for low cost MEMS inertial sensors and units,” Journal of Navigation, vol. 61, no. 2, pp. 323–336, 2008.
 [34] “NVIDIA Shield tablet,” https://shield.nvidia.com/tablet/k1, accessed: 2016-10-20.