Accurately estimating the body position and orientation in indoor scenes is a challenging problem and a basic requirement for many applications, such as autonomous parking, AGVs, UAVs, and augmented/virtual reality. SLAM (Simultaneous Localization and Mapping) techniques are usually applied to solve this problem. Among all SLAM techniques, visual SLAM is the most favorable one for systems where cost, energy, or weight is limited.
A large number of methods have been developed in the past decade to solve the SLAM problem using video cameras. Some of them exhibit impressive results in both small-scale and large-scale scenes, even in dynamic environments. With extra measurement data from inertial measurement units (IMUs), so-called visual-inertial odometry (VIO) systems achieve remarkably better accuracy and robustness than pure vision systems.
Most existing approaches focus on handling general scenes, paying less attention to particular scenes such as man-made environments. Those environments exhibit strong structural regularities, and most of them can be abstracted as a box world, known as a Manhattan world . In such worlds, planes or lines in perpendicular directions are predominant. This characteristic has been applied to indoor modeling  and heading estimation . With the help of the Manhattan world assumption, the robustness and accuracy of visual SLAM can be improved as shown in .
However, the Manhattan world assumption is too restrictive to be applied to general man-made scenes, which may include oblique or curvy structures. Most common scenes are those that contain multiple box worlds, each of which has a different heading. If the detected line features are forced to align with a single Manhattan world in such cases, the performance may become worse. In this work, we address this issue and extend the idea of exploiting the structural regularity of man-made scenes  to visual-inertial odometry (VIO) systems.
The key idea is to model the scene as an Atlanta world  rather than a single Manhattan world. An Atlanta world contains multiple local Manhattan worlds, i.e., a set of box worlds with different orientations. Each local Manhattan world is detected on-the-fly, and its heading is gradually refined by the state estimator as new observations arrive. A detected local Manhattan world is not necessarily a real box world. As we will see, we allow a local Manhattan world to be detected even if the lines are found to align with only a single horizontal direction. This makes our algorithm flexible for irregular scenes with triangular or polygonal shapes.
The benefit of using structural regularity in a VIO system is apparent: even when no Manhattan world has been detected, a vertical line indicates the gravity direction and immediately renders the roll and pitch of the sensor pose observable; horizontal lines aligned with one of the local Manhattan worlds give rise to a global constraint on heading, so the orientation error does not grow while moving in this local area.
Based on the above ideas, we present a novel visual-inertial odometry method, which is built upon a state-of-the-art VIO framework with several extensions to incorporate structural information, including structural line features and local Manhattan worlds. We describe the extensions in detail, including the Atlanta world representation, structural line parameterization, filter design, line initialization and tracking, triangulation and marginalization of line tracks, and detection of Manhattan worlds.
We have conducted extensive experiments on both public benchmark datasets and challenging datasets collected in different buildings. The results show that our method, which incorporates the structural regularities of man-made environments by exploiting both structural features and the multiple Manhattan world assumption, achieves better performance than state-of-the-art visual-inertial methods in all tests. We highlight the major technical contributions as follows.
1) A novel line representation with a minimal number of parameters seamlessly integrates the Atlanta world assumption and the structural features of man-made buildings. Each local Manhattan world is represented by a heading direction and refined by the estimator as a state variable.
2) Structural lines (lines with dominant directions) and the local Manhattan worlds are automatically recognized on-the-fly. If no structural line has been detected, our approach works just like a point-based system. Note that even when no Manhattan world has been detected, vertical lines still help the estimation if they can be found. This makes our method flexible for different kinds of scenes beyond indoor scenes.
3) We also make several improvements to the estimator and line tracking: a novel marginalization method for long feature tracks enables better feature estimates, and a line tracking method that samples multiple points combined with a delayed EKF update makes the tracker more reliable.
II. Related Work
The structural regularity of man-made environments is well known as the Manhattan world model . The first implication of this regularity is that line features are predominant in man-made environments. Researchers have attempted to use straight lines as landmarks since the early days of visual SLAM research . Recent works that use straight line features in visual SLAM  and visual-inertial odometry  can also be found. However, most visual SLAM  or visual-inertial
systems prefer to use only point features, for several reasons. First, points are ubiquitous features that can be found in nearly any scene. Second, compared with line features, point features are well studied, easy to detect, and reliable to track. Another reason is that a straight line has more degrees of freedom (5 DoF) than a single point (3 DoF), which makes a line more difficult to initialize (especially its orientation) and to estimate (usually 6 parameters are required, as in Plücker coordinates) than a point. It has been shown that adopting line features in a SLAM system may sometimes lead to worse performance than using only points . Therefore, the aforementioned issues need to be carefully addressed, e.g., by using stereo camera settings  or delayed initialization over multiple key frames . Nevertheless, lines are still a good complement to points, since they add extra measurements to the estimator for more accurate results. This is particularly helpful when not enough point features can be extracted, as in texture-less indoor scenes.
Another implication of structural regularity is that structural lines are aligned with the three axes of the coordinate frame of a Manhattan world. The directional information encoded in these lines offers a clue about the camera orientation, appearing as vanishing points in the image. The vanishing points of parallel lines in the image relate directly to the camera orientation. It has been shown that using vanishing points can improve visual SLAM  and visual-inertial odometry . However, in those methods the line features are used only as intermediate results for extracting vanishing points and are simply discarded afterwards. It should also be helpful to integrate them into the estimator in the same way as points.
Most existing methods explore only partial information of structural regularity - they either use straight lines without considering their orientation prior, or use the orientation prior without including lines as extra measurements for better estimation. A few existing methods consider both aspects . In , lines with an orientation prior are named structural lines and treated as landmarks in the same way as point features for visual SLAM. The method in  has a similar spirit but focuses on visual-inertial odometry. The assumption of only three dominant directions limits the application of those methods to simple scenes without oblique structures. Both methods rely on rigid initialization to detect the three directions, requiring at least one (horizontal) vanishing point to be captured in the image for visual-inertial systems, and two vanishing points for pure vision systems, before the algorithm can start.
In this work, we take a step further and present a more powerful visual-inertial odometry method that solves the aforementioned issues. We propose to use an Atlanta world  to allow multiple Manhattan worlds with different orientations. We detect each Manhattan world on-the-fly and estimate its heading in the state. Therefore, our method does not need to capture any vanishing points at the initialization stage. Our novel line parameterization anchors each line to a local Manhattan world, which reduces the degrees of freedom of the line parameters and enables line directions to be refined along with the heading of the Manhattan world as more evidence is collected.
III. Structural Lines and Atlanta Worlds
To better model general man-made scenes, we adopt the Atlanta world  assumption in our approach. It is an extension of the Manhattan world assumption - the world is considered as a superposition of multiple Manhattan worlds with different horizontal orientations, as shown in Figure 1. Note that each local world is not necessarily a real box world containing three perpendicular axes. A single horizontal direction can determine a local world, as shown in Figure 1(c).
Structural lines may lie in different local worlds. We establish the global world coordinate system, with its z-axis pointing up (opposite to the direction of gravity), at the location where the odometry starts. The IMU coordinate system and the camera coordinate system each have a pose consisting of a rotation, represented as a unit quaternion with a corresponding rotation matrix, and an origin expressed in the world coordinate system.
Each line is anchored to the local coordinate system in which the line is first detected in the image. We call this anchored coordinate system the starting frame, whose axes are aligned with the world frame or one of the local Manhattan worlds, and whose origin is the camera position at the time the line is detected.
For a given line anchored to the starting frame, we can find a rotation that transforms this line into a parameter space where the line is aligned with the Z axis, as shown in Figure 2. In this parameter space, the line can be simply represented by its intersection point with the XY plane. We use the inverse depth representation of this intersection point to describe the line, namely a direction angle and an inverse depth. The inverse depth representation is known as a good parameterization that can describe features at infinity and minimize nonlinearity in feature initialization .
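As a concrete sketch of this representation, the intersection point on the XY plane of the parameter space can be recovered from the two inverse-depth parameters. The function name and the exact angle/depth convention below are illustrative, not the paper's notation:

```python
import numpy as np

def line_point_from_inverse_depth(theta, rho):
    """Recover the line's intersection point with the XY plane of the
    parameter space from its inverse-depth parameters: theta is the
    direction angle of the point on the XY plane, and rho = 1/d is the
    inverse of its distance from the origin. The line itself runs
    through this point along the Z axis of the parameter space."""
    d = 1.0 / rho                       # depth (distance to the line)
    return np.array([d * np.cos(theta), d * np.sin(theta), 0.0])
```

With only two parameters, a structural line is fully described once its Manhattan world and axis are known, which is the source of the parameterization's minimality.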
The line in the starting frame is computed from a rotation transformation and the intersection point ,
For structural lines aligned with any of the three axes of the local Manhattan world, the rotation is one of the following constants:
which correspond to lines aligned with the three axes.
The transformation from the starting frame to the world frame is determined by a rotation and the camera position, where the rotation is about the gravity direction, namely,
For a starting frame whose axes are aligned with those of the world frame , we have . We simply use to represent this kind of starting frame.
To obtain the projection of a structural line on the image, we need to project both the intersection point and the direction in the parameter space onto the image plane. The coordinates of the intersection point in the world frame are computed as
which can be further transformed into the camera frame by
where represents the transformation from the world frame to the camera frame. From (1), (3), (4), and (5), by replacing with the inverse depth representation, we get the homogeneous coordinates of the 2D projection of the intersection point as
where . The vanishing point projected by the direction of the parameter space is computed as:
Here is the third column of . Taking the camera intrinsics () into account, we get the line equation on the image by:
From the above definitions, we are able to establish the relationship between a 3D line and its 2D observations, given the two parameters of the inverse depth representation, the Manhattan world in which the line lies, and the direction to which the line belongs.
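The projection just described can be sketched compactly: in homogeneous coordinates, the image line is the cross product of the projected intersection point and the vanishing point of the line's direction. The sketch below assumes the point and direction are already expressed in the camera frame; function and variable names are illustrative:

```python
import numpy as np

def image_line(K, X_c, d_c):
    """Project a 3D line, given a point X_c on it and its direction d_c
    (both in the camera frame), to a 2D line l = (a, b, c) on the image,
    where the line satisfies a*u + b*v + c = 0. The image line passes
    through the projected point and the vanishing point of the
    direction, so it is their cross product in homogeneous coordinates."""
    x = K @ X_c                        # homogeneous projection of the point
    v = K @ d_c                        # vanishing point of the direction
    l = np.cross(x, v)                 # line through both homogeneous points
    return l / np.linalg.norm(l[:2])   # normalize so point-line distance is in pixels
```

Normalizing by the first two components makes the signed point-to-line distance used in the measurement model directly readable in pixels.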
The line projection can be written as a function
where denotes the camera pose. If we use the IMU pose, , instead of the camera pose, we also have
where represents the relative pose between the IMU and camera frames.
IV. System Overview
As shown in Figure 3
, we adopt the EKF framework in our visual-inertial system. The state vector of our filter is defined as follows (for clarity, we switch the state between a row and a column vector accordingly in the following text):
where indicates the IMU state at time step , including its pose, velocity, and the biases of the gyroscope and the accelerometer,
where is the IMU pose at the -th time step. We also put the relative pose between the IMU and camera frames into the filter to allow it to be updated in case the provided calibration is inaccurate.
By adopting the Atlanta world model, we detect each box world (or local Manhattan world) on the fly and include the heading of each local world in the state. Those headings are gradually refined as more observations are gathered.
The historical IMU poses are cloned from the current IMU state at different time steps. Some of them are removed from the state when the number of historical poses exceeds the maximum allowed.
The covariance matrix at the -th time step is denoted by . Our VIO estimator recursively estimates the state and the covariance starting from the initial state and covariance.
We follow the approach of  to design our filter. None of the features, neither points nor structural lines, are included in the state. They are estimated separately outside of the filter and only used to derive geometric constraints among the IMU poses in the state. The pipeline of our filter is shown in Figure 3. We present the details of our filter design in the following sections.
The dynamical model involves state prediction and covariance propagation. The state is predicted by IMU integration, namely
Here, represents IMU integration, where we apply the Runge-Kutta method. To compute the slope values more accurately, we use the gyroscope and accelerometer measurements at both ends of the time interval to linearly approximate the angular velocity and acceleration inside the interval.
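A minimal sketch of one such propagation step is given below, using the midpoint of the two samples (an RK2-style approximation). It assumes biases are already subtracted and uses a rotation-matrix state for brevity; the paper's exact integrator and notation may differ:

```python
import numpy as np

def integrate_imu(p, v, R_wb, w0, w1, a0, a1, dt,
                  g=np.array([0.0, 0.0, -9.81])):
    """One IMU propagation step with linearly interpolated measurements:
    w0/w1 are gyro samples and a0/a1 accelerometer samples at the two
    ends of the interval. R_wb rotates body-frame vectors into the world
    frame. Illustrative sketch, not the paper's implementation."""
    w_mid = 0.5 * (w0 + w1)            # midpoint angular velocity
    a_mid = 0.5 * (a0 + a1)            # midpoint specific force
    # Rotation increment via the exponential map (Rodrigues' formula)
    phi = w_mid * dt
    angle = np.linalg.norm(phi)
    if angle > 1e-12:
        k = phi / angle
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        dR = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)
    else:
        dR = np.eye(3)
    a_world = R_wb @ a_mid + g         # acceleration in the world frame
    p_new = p + v * dt + 0.5 * a_world * dt ** 2
    v_new = v + a_world * dt
    return p_new, v_new, R_wb @ dR
```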
Let be the error state  corresponding to the state vector . The predicted error state is given by
where is the state transition matrix and is the noise matrix. The variable represents the random noise of the IMU, including the measurement noise and the random walk noise of the biases. The covariance of the error state is then computed as
where is the covariance matrix corresponding to the IMU noise vector . From (13), we know that, except for the IMU error state, the error states of the other variables remain unchanged. So the transition matrix has the following form , where is computed as
is the rotation matrix corresponding to the quaternion representing the IMU's orientation, and is the time interval between the two time steps.
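The covariance propagation step above has the standard EKF form; a one-line sketch (with illustrative variable names) is:

```python
import numpy as np

def propagate_covariance(P, Phi, G, Q):
    """EKF covariance propagation: P' = Phi P Phi^T + G Q G^T, where Phi
    is the error-state transition matrix (identity blocks for all
    non-IMU variables, per the text) and Q is the covariance of the IMU
    noise vector mapped into the state through the noise matrix G."""
    return Phi @ P @ Phi.T + G @ Q @ G.T
```

Because the non-IMU blocks of Phi are identity, only the IMU block and its cross-covariances actually change in this step.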
The measurement model involves both point and line features. Here we only describe the measurement model of structural lines, as the measurement model of points is the same as the one described in . To derive the measurement model of structural lines, we need to compute the projection of a given structural line on the image first.
The measurement model of structural lines can be derived from (8). Let the projection of a structural line on the image be , and let the two end points of the associated line segment in the -th view be (in homogeneous coordinates). We adopt a signed distance to describe the closeness between the observation (the line segment detected in the image) and the line predicted by perspective projection,
By linearization about the last estimate of the line parameters, the residual in the -th view is approximated as:
where are the Jacobian matrices with respect to . By stacking the measurements from all visible views, we get the following measurement equation for a single structural line:
We then project the residual to the left null space of to yield a new residual defined as
We write it in a brief form:
By doing this, structural lines are decoupled from the state estimation, significantly reducing the number of variables in the state.
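This null-space projection is the standard MSCKF-style feature elimination; a sketch under that interpretation (illustrative names) is:

```python
import numpy as np

def project_to_left_nullspace(r, H_x, H_f):
    """Eliminate the feature term from the stacked measurement equation
    r = H_x dx + H_f df + n by projecting onto the left null space of
    H_f (assumed full column rank): the resulting residual depends only
    on the pose error dx, so the feature need not enter the state."""
    m, n = H_f.shape
    Q, _ = np.linalg.qr(H_f, mode='complete')  # full orthonormal basis of R^m
    A = Q[:, n:]                               # columns spanning the left null space
    return A.T @ r, A.T @ H_x                  # projected residual and Jacobian
```

The projected system has m - n rows, so each structural line contributes constraints on the poses without growing the state.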
Note that, unlike points, the measurement model of structural lines has a novel part related to the horizontal direction () of a given Manhattan world. This means the horizontal direction of a Manhattan world can be estimated by the filter, allowing us to use multiple box worlds to model complicated environments.
In our implementation, we adopt numerical differentiation to compute all those Jacobian matrices, since their analytic forms are too complicated to compute. Taking the measurement noise into account, we have , where describes the noise level of the line measurement.
Two events trigger EKF updates. The first is when a structural line being tracked is no longer detected in the image. All the measurements of this structural line are used for the EKF update, and the historical poses where the line is visible are involved in the computation of the measurement equation (21). To account for occasional tracker failure, we do not trigger the update immediately, but wait until the tracker has been unable to recover for a number of frames. This delayed update strategy significantly improves the performance of line tracking, as we observed in tests.
The second event that triggers an EKF update is when the number of poses in the state exceeds the maximum allowed. In this case, we select one third of the poses, evenly distributed in the state starting from the second oldest pose, and use all the features (both points and lines) visible in those poses to construct the measurement equation, similar to the approach described in .
The EKF update process follows the standard procedure, where the Joseph form is used to update the covariance matrix to ensure numerical stability.
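The Joseph form mentioned above can be sketched as follows (illustrative names; K is the Kalman gain, H the stacked measurement Jacobian, R the measurement noise covariance):

```python
import numpy as np

def joseph_update(P, K, H, R):
    """Joseph-form covariance update:
        P' = (I - K H) P (I - K H)^T + K R K^T.
    Unlike the simpler P' = (I - K H) P, this form preserves symmetry
    and positive semi-definiteness under round-off, which is why it is
    preferred for numerical stability."""
    I = np.eye(P.shape[0])
    A = I - K @ H
    return A @ P @ A.T + K @ R @ K.T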
State management involves adding new variables to the state and removing old ones. Adding new variables to the state, or state augmentation, is caused by two events. The first is when a new image has been received. In this case, the current IMU pose is appended to the state and the covariance matrix is augmented accordingly,
where represents the operation of cloning and inserting the IMU variables. The second event is when a new Manhattan world has been detected in the environment. Let the heading of the newly detected Manhattan world be . Similarly, we have
Note that the uncertainty of the heading depends on many factors in the process of Manhattan world detection, which involves pose estimation, line detection, and calibration error. Though we could compute Jacobian matrices with respect to all related error factors to obtain accurate values, we found that it works well to simply neglect the correlation between the heading and the rest of the state and to treat the heading variance as a preset constant. In our implementation, we let , where .
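Under this simplification, augmenting the state with a new heading reduces to appending one variable and growing the covariance by an uncorrelated diagonal entry. A sketch (the preset uncertainty value below is illustrative, not the paper's):

```python
import numpy as np

def augment_with_heading(x, P, phi_new, sigma_phi=0.5 * np.pi / 180.0):
    """Append a newly detected Manhattan-world heading phi_new to the
    state vector and grow the covariance, neglecting cross-correlations
    with the existing state as described in the text. sigma_phi is an
    assumed preset standard deviation of the heading."""
    x_aug = np.append(x, phi_new)
    n = P.shape[0]
    P_aug = np.zeros((n + 1, n + 1))
    P_aug[:n, :n] = P                 # existing covariance unchanged
    P_aug[n, n] = sigma_phi ** 2      # uncorrelated preset heading variance
    return x_aug, P_aug
```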
V. Implementation of Key Components
V-A. Line Detection & Tracking
We use the LSD line detector  to detect line features in the image and use 3D-2D matching to track the measurements of each structural line. The advantage of 3D-2D tracking is that it utilizes the camera motion to predict the possible position of the structural line in the image, reducing the search range for correspondences. Another advantage is that it is more convenient for handling occasional tracking loss.
For each structural line, apart from the geometric parameters introduced in Sec. III, we also introduce a variable to represent the range of the structural line in 3D space, corresponding to its two end points in the parameter space. When a new image arrives, the structural line is projected onto the image using the camera pose predicted by IMU integration. The next step searches among the line segments detected in the new image for those close to the line projection.
This is done by checking the positional and directional differences between the line projection and each detected line segment, which yields a set of candidate line segments. After that, we attempt to identify the true line correspondence among those candidates based on image appearance. Instead of considering only the image appearance around the middle point of a line segment as described in , we adopt a strategy of considering multiple points on the line to improve tracking robustness.
We sample points on the structural line by dividing its range into equally distributed values. For each sample point, we keep an image patch around its projection in the last video frame, and search for its corresponding points on the candidate line segments by ZNCC image matching with a preset threshold, as shown in Figure 4. Finally, we choose the line segment with the largest number of corresponding points as the associated one. The proposed two-phase line tracking method proved very effective in extensive tests, where lines are stably tracked.
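For reference, the ZNCC score used in this matching step can be sketched as below. ZNCC is invariant to affine intensity changes (gain and offset), which is what makes it suitable for patch matching across frames:

```python
import numpy as np

def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation between two equally sized
    image patches. Returns a score in [-1, 1]; 1 means a perfect match
    up to an affine intensity change."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()                      # remove offset
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < 1e-12:                  # flat patch: correlation undefined
        return 0.0
    return float(np.dot(a, b) / denom)
```

A candidate point is accepted when its score exceeds the preset threshold; the segment accumulating the most accepted points wins the association.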
V-B. Recognition of Structural Lines
We attempt to recognize the structural lines among all the line segments newly extracted in the image and classify them into different directions. For horizontal lines, we also try to figure out in which local Manhattan world they lie. In the first step, we compute all the vanishing points related to those directions. From (7), the vanishing point of a direction is
where is the camera intrinsic matrix and represents the rotation matrix of the current camera pose. Similarly, for the horizontal directions, we have
Note that only the vanishing points of the horizontal directions depend on the heading of a local Manhattan world. We can therefore recognize vertical lines even if no local Manhattan world has been detected.
To recognize the structural lines, we draw a ray from each vanishing point to the middle point of each line segment. We then check the consistency between the ray and the line segment, in terms of both closeness and directional difference. We set thresholds on the closeness and the directional difference for evaluating consistency, and evaluate all vanishing points for each line segment. A line segment is recognized as a structural line if it is consistent with one of the vanishing points.
Sometimes a line segment may be consistent with multiple vanishing points. In this case, the vanishing point with the best consistency is chosen. The remaining line segments that are not consistent with any vanishing point are treated as general lines and simply excluded from our state estimator. Note that if no Manhattan world has been detected yet, vertical structural lines can still be recognized using the vanishing point of the vertical direction as described above.
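The consistency check above can be sketched as follows for a finite vanishing point; the threshold values and the exact distance measure are illustrative assumptions, not the paper's:

```python
import numpy as np

def consistent_with_vp(seg_p0, seg_p1, vp,
                       dist_thresh=2.0, angle_thresh_deg=5.0):
    """Check whether a 2D line segment (seg_p0, seg_p1) is consistent
    with a vanishing point vp (homogeneous coordinates): the ray from
    the vanishing point through the segment's midpoint should be close
    to the segment both positionally and directionally."""
    mid = 0.5 * (seg_p0 + seg_p1)
    ray_dir = mid - vp[:2] / vp[2]     # direction of the ray from the VP
    seg_dir = seg_p1 - seg_p0
    # directional difference between the ray and the segment
    cosang = abs(np.dot(ray_dir, seg_dir)) / (
        np.linalg.norm(ray_dir) * np.linalg.norm(seg_dir) + 1e-12)
    ang = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    # distance from an endpoint to the ray through the midpoint
    n = np.array([-ray_dir[1], ray_dir[0]])
    n /= np.linalg.norm(n) + 1e-12
    dist = abs(np.dot(seg_p0 - mid, n))
    return dist < dist_thresh and ang < angle_thresh_deg
```

When a segment passes this test for several vanishing points, the one with the smallest combined distance/angle error would be kept, matching the "best consistency" rule in the text.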
V-C. Initialization of Structural Lines
After recognizing structural lines in the current image, instead of initializing all the recognized line segments, we choose to initialize only some of them, to avoid redundant initialization (multiple line segments from a single 3D line) and to keep the initialized lines well distributed in the image.
We found the following two rules work well for selecting informative line segments for initialization: 1) the selected line segments are among the longest ones; 2) the selected line segments are not close to those segments already initialized.
Following the above rules, we first remove line segments close to those already initialized, sort the remaining line segments by length in decreasing order, and put them in a queue. We then use the following iterations to select line segments for initialization:
1) pop the line segment at the head of queue and remove it from the queue;
2) initialize a new structural line from ;
3) remove the line segments in the queue that are close to the initialized one, and go to Step 1 until the queue is empty or the number of structural lines has reached the maximum allowed.
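The greedy selection above can be sketched as follows; measuring "closeness" by midpoint distance and the threshold values are illustrative choices, not the paper's:

```python
import numpy as np

def select_segments(segments, initialized, min_sep=20.0, max_new=10):
    """Greedy selection of line segments for initialization, following
    the two rules in the text: prefer long segments and avoid segments
    close to already-initialized ones. Each segment is a pair of image
    points ((x0, y0), (x1, y1))."""
    def midpoint(s):
        return 0.5 * (np.asarray(s[0], float) + np.asarray(s[1], float))
    def length(s):
        return np.linalg.norm(np.asarray(s[1], float) - np.asarray(s[0], float))
    def close(s, t):
        return np.linalg.norm(midpoint(s) - midpoint(t)) < min_sep

    # drop segments near already-initialized lines, then sort by length
    queue = [s for s in segments if not any(close(s, t) for t in initialized)]
    queue.sort(key=length, reverse=True)
    selected = []
    while queue and len(selected) < max_new:
        s = queue.pop(0)                           # pop the longest remaining
        selected.append(s)
        queue = [t for t in queue if not close(t, s)]  # suppress neighbors
    return selected
```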
The remaining issue is initializing new structural lines from the chosen segments. The key is to find the angular parameter, while the inverse depth can be set to a preset value. The first step of initialization is to establish a starting frame for the structural line. For all structural lines in the vertical direction, we choose the starting frame as the one whose axes are aligned with the world frame, i.e., a virtual Manhattan world with zero heading. This choice makes it convenient to represent vertical lines when no Manhattan world has been detected. For structural lines in a horizontal direction, the starting frame is selected as the one whose axes are aligned with the local Manhattan world.
The angular parameter is determined by the direction from the camera center to the line, projected onto the XY plane of the local parameter space. This direction can be approximated by the ray from the camera center to some point on the line segment.
Let the middle point of the segment be given in homogeneous coordinates. The back-projection ray of this point in the camera frame is transformed into the local parameter space by
For structural lines in the vertical direction, the computation simplifies to
The angular parameter is therefore determined by the horizontal heading of the ray in the local parameter space, and the inverse depth is initialized to a preset value for all newly detected structural lines.
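A minimal sketch of this initialization is given below. R_sc stands for the rotation from the camera frame into the local parameter space, and the preset inverse depth rho0 is an illustrative value, not the paper's:

```python
import numpy as np

def init_line_params(K, R_sc, mid_px, rho0=0.2):
    """Initialize a structural line: back-project the midpoint of the
    detected segment (pixel coordinates mid_px), rotate the ray into
    the local parameter space with R_sc, and take its horizontal
    heading as the angular parameter. The inverse depth starts at a
    preset rho0 and is refined later by triangulation."""
    ray_c = np.linalg.inv(K) @ np.array([mid_px[0], mid_px[1], 1.0])
    ray_s = R_sc @ ray_c                    # ray in the local parameter space
    theta = np.arctan2(ray_s[1], ray_s[0])  # heading on the XY plane
    return theta, rho0
```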
V-D. Triangulation of Structural Lines with Prior
Triangulation of a structural line is done by minimizing the sum of squared re-projection errors (18) over all views where the line is visible. As we describe later, the time span of a line track usually exceeds that of the historical views stored in the state. If we use only the observations in the visible views within the state, the motion parallax is usually small and the triangulation result inaccurate. If we use all visible views for triangulation, the computational cost may increase significantly, and the obsolete pose estimates of the views outside the state may also cause large errors.
To address this problem, we maintain a prior distribution for each structural line, whose mean and covariance store the marginalization information from the historical measurements as described later. The overall objective function is:
where is the signed distance between the line segments and the projected lines defined in (18), and denotes the visible views in the state.
is the standard deviation describing the measurement noise.
This nonlinear least squares problem is solved iteratively by the Gauss-Newton algorithm. After triangulation, in order to track lines more reliably as described in Section V-A, we also update the range of each structural line by intersecting it with the back-projection rays of the two end points of the line segment in the last visible view.
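The Gauss-Newton iteration with a Gaussian prior can be sketched generically as below; residual_fn and jac_fn stand in for the signed line-reprojection distances of (18) and their Jacobians, which are problem-specific:

```python
import numpy as np

def gauss_newton_with_prior(x0, residual_fn, jac_fn, mu, Lambda_inv,
                            sigma=1.0, iters=10):
    """Minimize  sum(r(x)^2)/sigma^2 + (x - mu)^T Lambda_inv (x - mu)
    by Gauss-Newton. The prior N(mu, Lambda) keeps the solution well
    constrained when the in-state views provide little parallax, which
    is the role of the marginalization prior in the text."""
    x = np.asarray(x0, float).copy()
    for _ in range(iters):
        r = residual_fn(x) / sigma          # whitened residuals
        J = jac_fn(x) / sigma               # whitened Jacobian
        A = J.T @ J + Lambda_inv            # Gauss-Newton normal matrix + prior
        b = J.T @ r + Lambda_inv @ (x - mu)
        x -= np.linalg.solve(A, b)
    return x
```

Note that the normal matrix A assembled in the last iteration is exactly what Section V-E inverts to obtain the covariance of the updated prior.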
V-E. Marginalization of Long Line Tracks
Similar to sliding window estimators, one problem of our estimator is that features can be tracked over a period of time longer than that covered by the views stored in the state. In existing sliding window estimators, measurements outside the sliding window are simply discarded, in both key-frame based  and filter-based  frameworks. This can lead to inaccurate estimates of line parameters, as measurements outside the sliding window still carry rich information about a line's geometry. In , the authors put features tracked longer than the sliding window into the filter state. This is similar to the classic EKF-SLAM framework  - the disadvantage is that the number of points put into the state needs to be strictly controlled so that the state dimension does not become too high. We propose a novel marginalization method that gathers the measurements outside the sliding window to form a prior on the line geometry. This marginalization approach can also be applied to point features; we describe here the details for lines only.
Let be the set of poses to be deleted from the sliding window, and let the old prior mean and covariance be and . After the frames are removed from the sliding window, the new prior is updated by minimizing the objective function:
which is also solved by the Gauss-Newton algorithm. Let be the normal equation solved in the last Gauss-Newton iteration. The covariance of the new prior is computed as .
Note that each structural line is anchored to one of the camera poses (its starting frame) in the state vector. If the starting frame is about to be removed from the state vector, we first need to change the starting frame to one of the remaining frames in the state. Let be the transformation of the line parameters from the old starting frame into the new starting frame, and be the Jacobian matrix of this coordinate transformation. We have . The covariance matrices are updated accordingly. The process of marginalization is illustrated in Figure 5.
As shown in the experiments (Section VI-C), the RMSE is reduced to a fraction of the original one when we adopt the proposed marginalization method in our VIO implementation.
V-F. Detecting and Merging Manhattan Worlds
Detecting a Manhattan world in the image involves identifying vanishing points by clustering parallel lines into different groups . The vanishing points of those parallel groups are then extracted to determine the orientation of the three axes in 3D space. The process becomes much simpler if an IMU is available, since the accelerometer renders the vertical direction observable through gravity. We adopt a similar approach to  to detect new Manhattan worlds in the image. We start Manhattan world detection whenever vertical lines are identified as described in Section V-B. The vanishing line of the ground plane (the horizontal plane of the world frame) is computed as
After that, we run a 1-line RANSAC algorithm to detect possible Manhattan worlds in the following steps:
1) randomly select a line segment that has not been identified as a structural line, and extend it to intersect the vanishing line at a vanishing point, which we assume to be the projection of one horizontal direction of a possible Manhattan world. Since the vertical direction is already known, we can obtain the other horizontal direction of the Manhattan world and its vanishing point.
2) count the number of line segments consistent with the two vanishing points, in a similar way as described in Section V-B;
3) repeat the above steps until the maximum number of iterations is reached.
Finally, the cluster with the largest number of consistent line segments is considered a possible Manhattan world. We further check whether the number of consistent line segments is larger than a threshold ( in our system) and larger than the number of existing structural lines (horizontal lines only) in the image.
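The RANSAC loop above can be sketched as follows. Because the vertical direction is already known from gravity, a single unclassified segment fixes a full hypothesis; `hypothesize` and `count_inliers` are placeholders for the geometry in Sections V-B and V-F:

```python
import random

def detect_manhattan_world(segments, non_structural, hypothesize,
                           count_inliers, max_iters=100):
    """Sketch of the sampling loop: each hypothesis is generated from a
    single line segment that is not yet classified as structural.
    `hypothesize` maps a segment to (heading, vp_x, vp_y), i.e., the
    candidate world's heading and its two horizontal vanishing points;
    `count_inliers` counts the segments consistent with them."""
    best = (None, 0)
    for _ in range(max_iters):
        seg = random.choice(non_structural)         # 1) sample one segment
        heading, vp_x, vp_y = hypothesize(seg)      #    build a hypothesis
        n = count_inliers(segments, vp_x, vp_y)     # 2) score it
        if n > best[1]:
            best = (heading, n)
    return best   # heading of the best candidate world and its inlier count
```

The returned inlier count is then compared against the acceptance thresholds described in the text before the new world enters the state.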
Let be the orientation of the Manhattan world under detection. It is also required not to be close to the orientation of any existing Manhattan world, namely , where is set to in our implementation. Once all the conditions are satisfied, the detected Manhattan world is added to the state and the covariance is updated as described in Section III.
Sometimes the orientation difference between two Manhattan worlds may become smaller than this threshold after a series of EKF updates. In that case, we merge the two Manhattan worlds by removing the newer one from the state and adjusting the covariance accordingly. The structural lines are also reassigned to the older Manhattan world.
V-G Outlier Rejection
where corresponds to the confidence level of . Structural lines that fail the gating test are excluded from the EKF update. After the EKF update, we re-triangulate all structural lines and further check their reprojection errors (18) at all visible views. Any structural line with a reprojection error larger than a threshold ( pixels in our system) is discarded. This two-phase outlier rejection makes our system much more robust against outliers compared with using only chi-squared gating tests.
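The two phases can be sketched as follows. All names here are ours, and 5.991 is used purely as an example threshold (the 95% quantile of a chi-squared distribution with 2 degrees of freedom); the paper's residual dimension and confidence level may differ.

```python
def mahalanobis_sq(r, S_inv):
    """r^T S^{-1} r for a small residual vector `r` and inverse
    innovation covariance `S_inv` given as plain nested lists."""
    n = len(r)
    return sum(r[i] * sum(S_inv[i][j] * r[j] for j in range(n))
               for i in range(n))

def passes_gate(residual, S_inv, chi2_thresh=5.991):
    """Phase 1: chi-squared gating applied before the EKF update."""
    return mahalanobis_sq(residual, S_inv) < chi2_thresh

def survives_reprojection(errors_per_view, pix_thresh):
    """Phase 2: after re-triangulation, keep a structural line only if its
    reprojection error at every visible view is within the pixel threshold."""
    return max(errors_per_view) <= pix_thresh

# Usage with an identity innovation covariance:
kept = passes_gate([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])       # squared distance 2.0
rejected = not passes_gate([3.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # 9.0 > 5.991
```

The point of the second phase is that a line may pass the per-update gate yet still be a bad landmark overall; re-checking reprojection at all views after the update catches these.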
VI Experimental Results
VI-A Benchmark Tests
We first evaluate the proposed method on the EuRoC dataset . This dataset was collected by a visual-inertial sensor mounted on a quadrotor flying in three different environments, classified into a machine hall and VICON rooms. In the machine hall, the ground truth trajectories were obtained by a laser tracker, while in the VICON rooms a motion capture system was used.
We name our method StructVIO and compare it with two state-of-the-art VIO methods: OKVIS  and VINS . Both OKVIS and VINS use only point features and adopt an optimization framework to estimate the ego motion. We use the default parameters provided by their authors throughout the tests and disable loop closing in VINS to test only the odometry performance. For StructVIO, we set the maximum number of point features to and the maximum number of lines to . All parameters were kept fixed during the benchmark tests. We use the RMSE and maximum error to measure VIO performance and exclude the first three seconds from evaluation to skip the initialization stage.
The benchmark scenes are relatively small and full of textures. In such cluttered, texture-rich scenes, point-only approaches should work well. Nevertheless, we observe that exploiting structural regularity still helps.
As shown in Table I, StructVIO performs better than the state-of-the-art VIO methods on all the sequences except V01_02_Medium, where StructVIO’s RMSE is slightly larger than VINS’s. StructVIO correctly finds the Manhattan regularity in the machine hall, as shown in Figure 6(a). In the VICON room, our system reports multiple Manhattan worlds, as shown in Figure 6(b), due to cluttered line features on the ground, but they still help, since those horizontal lines still encode heading constraints that reduce the overall drift. Even when no Manhattan world has been detected, vertical lines still help, since they immediately reveal the gravity direction and improve the pose estimates for a moving IMU.
We also quantitatively analyzed the performance of different numbers and combinations of features, denoted point-only, point+line, and StructVIO, within the same filter framework adopted in this work. Here point+line denotes the VIO method that uses a combination of point and line features, where the lines are general lines without orientation priors.
We repeatedly run VIO on the same sequence while changing the maximum number of features from to , and obtain the average RMSEs (and standard deviations) for each setting. During these tests, we kept the maximum number of line features at and changed the number of points for the methods that involve lines.
To better understand how the structural information helps, we plot the average RMSEs separately for the machine hall and the VICON room, as shown in Figures 7(a) and 7(b). Since the machine hall exhibits stronger structural regularity and contains fewer textures, StructVIO yields a lower RMSE than the point-only and point+line methods when the same number of features is used. In contrast, the VICON rooms are highly cluttered and full of textures, where all the methods perform very similarly, as shown in Figure 7(b).
From these results, we may roughly conclude that structural information helps more in environments with strong regularities and few textures. Nevertheless, as the scenes in the EuRoC dataset are relatively small, this conclusion requires further testing. In the next section, we conduct experiments in large indoor scenes.
Line features extracted in different scenes. Blue lines are classified as vertical. Red and green lines are classified as horizontal, though not necessarily in the same Manhattan world. In the machine hall, the majority of lines are aligned with three orthogonal directions and are well described by a single Manhattan world. In the VICON room, the line features are more cluttered, so multiple Manhattan worlds are used for the parameterization of lines.
VI-B Large Indoor Scenes
In this section, we conduct experiments to test our method in large indoor scenes. We use a Project Tango tablet to collect image and IMU data for evaluation. Grayscale images are recorded at about Hz and IMU data at Hz. Each data collection starts and ends at the same location while traveling along different routes. We use Kalibr  to calibrate the IMU and camera parameters. To run our algorithm, we remove the distortion and extract line features from the distortion-free images, while point features are extracted from the original images.
Data collection was conducted within three different buildings, where the camera experiences rapid rotation, significant changes in lighting conditions, distant features, and a lack of textures (the StructVIO executable and datasets can be downloaded from http://drone.sjtu.edu.cn/dpzou/project/structvio/structvio.html). The buildings are referred to as Micro, Soft, and Mech, respectively. The Micro building fits the Manhattan world assumption well since it has only three orthogonal directions, while the Mech and Soft buildings have oblique corridors and curvy structures that cannot be modeled by a single box world, as shown in Figure 8.
During data collection, the collector also went out of the building and walked around from time to time. Each data collection lasts about minutes, and the traveling distances are usually several hundred meters. Some of the captured images are shown in Figure 9. Our datasets exhibit many challenging cases, including over/under-exposed images, distant features, quick camera rotations, and texture-less walls.
Unlike in a small indoor environment, we cannot use a motion capture system to obtain the ground truth of the whole trajectory in such large indoor scenes. Instead, we obtain ground truth trajectories at the beginning and at the end. The first way is to start the data collection from a VICON room and return to it when finished. The second way is to use a printed QR pattern (ArUco marker ) to get the camera trajectory when no motion capture system is available. Though the ArUco marker is less accurate, it still produces trajectories with about cm accuracy (validated by the VICON system as shown in Figure 10), which is sufficient to be treated as ground truth in our tests.
We align the estimated trajectory with the ground truth trajectory acquired at the beginning and compare the difference between the estimated trajectory and the ground truth trajectories (acquired both at the beginning and at the end of data collection). The difference is described by the RMSE and maximum of the absolute pose error after alignment.
Let be the camera trajectory estimated by the VIO algorithm and , be the ground truth trajectories estimated at the beginning and at the end, respectively, from either VICON or ArUco. First we obtain the transformation such that
The RMSE and Max. are computed as
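As a concrete sketch of this evaluation procedure, the following computes a least-squares rigid alignment and then the RMSE and maximum error. It is a 2D stand-in for the SE(3) alignment used in the paper, with all names being our own; trajectories are lists of (x, y) positions with matched timestamps.

```python
import math

def align_2d(est, gt):
    """Least-squares rotation + translation aligning `est` to `gt`
    (2D Procrustes without scale)."""
    n = len(est)
    ex = sum(p[0] for p in est) / n
    ey = sum(p[1] for p in est) / n
    gx = sum(q[0] for q in gt) / n
    gy = sum(q[1] for q in gt) / n
    # Cross terms of the centered point sets give the optimal angle.
    s = sum((p[0]-ex)*(q[1]-gy) - (p[1]-ey)*(q[0]-gx) for p, q in zip(est, gt))
    c = sum((p[0]-ex)*(q[0]-gx) + (p[1]-ey)*(q[1]-gy) for p, q in zip(est, gt))
    th = math.atan2(s, c)
    R = (math.cos(th), -math.sin(th), math.sin(th), math.cos(th))
    t = (gx - (R[0]*ex + R[1]*ey), gy - (R[2]*ex + R[3]*ey))
    return R, t

def rmse_and_max(est, gt, R, t):
    """RMSE and maximum absolute pose error after alignment."""
    errs = [math.hypot(R[0]*p[0] + R[1]*p[1] + t[0] - q[0],
                       R[2]*p[0] + R[3]*p[1] + t[1] - q[1])
            for p, q in zip(est, gt)]
    return math.sqrt(sum(e * e for e in errs) / len(errs)), max(errs)

# Sanity check: a trajectory that is an exactly rotated and translated
# copy of the ground truth aligns perfectly, so both errors vanish.
est = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
gt = [(-p[1] + 1.0, p[0] + 2.0) for p in est]   # 90-degree rotation, shift (1, 2)
rmse, mx = rmse_and_max(est, gt, *align_2d(est, gt))
```

In practice the alignment is computed against the trajectory segment covered by the beginning ground truth, and the same transform is then applied when comparing against the end ground truth, which is what exposes the accumulated drift.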
Note that these two errors do not fully describe the performance of a visual-inertial system. It occasionally happens that a visually bad trajectory (with poor intermediate pose estimates) ends up with a nearly closed loop. We therefore run repeated tests in each scene to reduce bias.
We conduct experiments to test approaches using different combinations of features, with and without structural constraints. The first approach (Point-only) uses only points. The second approach (Point+Line) uses both points and lines but without the structural constraints. The last approach (StructVIO) uses both points and lines with the Atlanta world assumption. We keep the maximum number of points at and the maximum number of lines at during all tests.
We also present results from the Tango system and the other two state-of-the-art VIO algorithms, OKVIS  and VINS , for comparison. We use the authors' default parameters and implementations and disable loop closing in VINS to test only the odometry performance. We also add the FOV distortion model  to the OKVIS and VINS software to enable them to process the raw image data from Tango, as Tango uses the FOV model to describe the camera distortion. Parameters are kept constant for all algorithms during the tests.
The results are presented in Table II. We list the traveling distance of each sequence and the positional errors of all algorithms. At the bottom of the table, we also compute the mean/median drift error as the average/median RMSE position error divided by the average traveling distance.
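The drift statistics can be computed as below; the numbers in the usage example are made up for illustration and do not come from Table II.

```python
def drift_errors(rmses, distances):
    """Mean and median drift error as defined above: the average (median)
    RMSE position error divided by the average traveling distance."""
    rs = sorted(rmses)
    n = len(rs)
    median = rs[n // 2] if n % 2 else 0.5 * (rs[n // 2 - 1] + rs[n // 2])
    avg_dist = sum(distances) / len(distances)
    return sum(rmses) / n / avg_dist, median / avg_dist

# Illustrative usage: RMSEs of 1.2 m, 0.8 m, 2.0 m over runs whose
# traveling distances average 400 m.
mean_drift, median_drift = drift_errors([1.2, 0.8, 2.0], [350.0, 420.0, 430.0])
```

Normalizing by distance makes sequences of different lengths comparable, which is why the table reports drift rather than raw RMSE alone.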
From the results, we can see that our approach (StructVIO) using structural lines achieves the best performance among all methods in these extensive tests in large indoor scenes. This is largely due to two reasons: 1) structural lines are a good complement to point features in low-textured man-made environments; 2) structural lines encode the global orientations of the local environments and render the horizontal heading observable. The limited heading error therefore reduces the overall position error. In the results, the mean drift error reduces from to with the help of structural lines under the Atlanta world assumption.
However, if we use line features without considering whether they are structural lines, the results (Point+Line) show little improvement in accuracy in our tests. Note that the average/median drift errors of the point-only and point+line approaches are almost the same ( versus ). A similar phenomenon has been observed in . The reason could be that general lines have more degrees of freedom and are less distinguishable than points. Both facts can sometimes have a negative impact on the VIO system, as discussed in .
|Seq. Name||Atlanta world||Manhattan world|
Another interesting observation is that both optimization-based methods (OKVIS and VINS) perform worse than our point-only approach and sometimes fail in our tests, though they are theoretically more accurate than filter-based approaches. The first reason may be a lack of feature points in the low-textured scenes, so that many feature points last only a few video frames and are easily neglected in key-frame selection. The filter-based approach instead takes every feature track into the EKF update. Another reason is that we adopt a novel feature marginalization to obtain better triangulation for long tracks even though historical measurements outside the sliding window are discarded. The effectiveness of feature marginalization is demonstrated in the next section.
VI-C Feature Marginalization
We evaluate the feature marginalization proposed in Section V-E. In our implementation, we apply marginalization to both point and line features. We check the performance difference of our StructVIO system with and without feature marginalization. As shown in Figure 14, if we disable feature marginalization, the drift error of StructVIO increases by about . This suggests that marginalization should be adopted for long feature tracks to retain the historical information encoded in old video frames outside the sliding window used for estimation.
VI-D Atlanta World vs. Manhattan World
We also conduct experiments to evaluate the advantage of the Atlanta world assumption over the Manhattan world assumption. The tests were conducted on the 'Soft' and 'Mech' sequences, since both buildings contain oblique corridors or curvy structures, as shown in Figure 8. As we can see, both scenes consist of two Manhattan worlds and are better described as an Atlanta world. To test the performance under the Manhattan world assumption, we keep all the parameters the same but disable the detection of multiple Manhattan worlds. The benefit of using the Atlanta world assumption instead of the Manhattan world assumption as in  is clearly shown in Table III. If we use the Manhattan world assumption in these two scenes with irregular structures, the RMSE errors increase by about on average.
In this paper, we propose a novel visual-inertial navigation approach that exploits the structural regularity of man-made environments by using line features with orientation priors. The structural regularity is modeled as an Atlanta world, which consists of multiple Manhattan worlds. The orientation prior is encoded in each local Manhattan world, which is detected on-the-fly and updated in the state variable over time. To realize such a visual-inertial navigation system, we made several technical contributions, including a novel parameterization that integrates lines and Manhattan worlds together, a flexible strategy for the detection and management of Manhattan worlds, a reliable line tracking method, and a marginalization method for long line tracks.
We compared our method with existing algorithms on both benchmark datasets and real-world tests with a Project Tango tablet. The results show that our approach outperforms existing state-of-the-art visual-inertial methods, even though the test data are challenging because of a lack of textures, bad lighting conditions, and fast camera motions. This indicates that incorporating structural regularity helps to build a better visual-inertial system. Our system is implemented in C++ without any optimization or parallel processing, and runs on an i7 laptop at about frames per second. The bottleneck is line extraction and tracking. However, feature extraction (both points and lines) can be significantly sped up by parallel processing.
-  G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in IEEE & ACM Proc. of Int’l Sym. on Mixed and Augmented Reality. IEEE, 2007, pp. 225–234.
-  R. Mur-Artal, J. Montiel, and J. D. Tardós, “Orb-slam: a versatile and accurate monocular slam system,” IEEE Trans. on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
-  D. Zou and P. Tan, “Coslam: Collaborative visual slam in dynamic environments,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 2, pp. 354–366, 2013.
-  A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint kalman filter for vision-aided inertial navigation,” in Proc. IEEE Int. Conf. Robotics and Automation. IEEE, 2007, pp. 3565–3572.
-  S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” Int. J. Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
-  T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” arXiv preprint arXiv:1708.03852, 2017.
-  J. M. Coughlan and A. L. Yuille, “Manhattan world: Compass direction from a single image by bayesian inference,” in Proc. IEEE Int. Conf. Computer Vision, vol. 2. IEEE, 1999, pp. 941–947.
-  Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski, “Manhattan-world stereo,” in Proc. IEEE Conf. Computer Vision & Pattern Recognition. IEEE, 2009, pp. 1422–1429.
-  A. Gupta, A. A. Efros, and M. Hebert, “Blocks world revisited: Image understanding using qualitative geometry and mechanics,” in Euro. Conf. Computer Vision. Springer, 2010, pp. 482–496.
-  L. Ruotsalainen, J. Bancroft, and G. Lachapelle, “Mitigation of attitude and gyro errors through vision aiding,” in IEEE Proc. of Indoor Positioning and Indoor Navigation. IEEE, 2012, pp. 1–9.
-  H. Zhou, D. Zou, L. Pei, R. Ying, P. Liu, and W. Yu, “Structslam: Visual slam with building structure lines,” IEEE Trans. on Vehicular Tech., vol. 64, no. 4, pp. 1364–1375, 2015.
-  G. Schindler and F. Dellaert, “Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments,” in Proc. IEEE Conf. Computer Vision & Pattern Recognition, vol. 1. IEEE, 2004, pp. I–I.
-  P. Smith, I. D. Reid, and A. J. Davison, “Real-time monocular slam with straight lines,” in Proc. British Machine Vision Conf., 2006, pp. 17–26.
-  J. Sola, T. Vidal-Calleja, and M. Devy, “Undelayed initialization of line segments in monocular slam,” in IEEE/RSJ Proc. of Intelligent Robots and Systems. IEEE, 2009, pp. 1553–1558.
-  E. Perdices, L. M. López, and J. M. Cañas, “Lineslam: Visual real time localization using lines and ukf,” in ROBOT2013: First Iberian Robotics Conference. Springer, 2014, pp. 663–678.
-  A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, “Pl-slam: Real-time monocular visual slam with points and lines,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 4503–4508.
-  Y. He, J. Zhao, Y. Guo, W. He, and K. Yuan, “Pl-vio: Tightly-coupled monocular visual–inertial odometry using point and line features,” Sensors, vol. 18, no. 4, p. 1159, 2018.
-  J. Sola, T. Vidal-Calleja, J. Civera, and J. M. M. Montiel, “Impact of landmark parametrization on monocular ekf-slam with points and lines,” Int. J. Computer Vision, vol. 97, no. 3, pp. 339–368, 2012.
-  R. Gomez-Ojeda and J. Gonzalez-Jimenez, “Robust stereo visual odometry through a probabilistic combination of points and line segments,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 2521–2526.
-  Y. H. Lee, C. Nam, K. Y. Lee, Y. S. Li, S. Y. Yeon, and N. L. Doh, “Vpass: Algorithmic compass using vanishing points in indoor environments,” in IEEE/RSJ Proc. of Intelligent Robots and Systems. IEEE, 2009, pp. 936–941.
-  G. Zhang, D. H. Kang, and I. H. Suh, “Loop closure through vanishing points in a line-based monocular slam,” in Proc. IEEE Int. Conf. Robotics and Automation. IEEE, 2012, pp. 4565–4570.
-  F. Camposeco and M. Pollefeys, “Using vanishing points to improve visual-inertial odometry,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 5219–5225.
-  D. G. Kottas and S. I. Roumeliotis, “Exploiting urban scenes for vision-aided inertial navigation.” in Robotics: Science and Systems, 2013.
-  J. Civera, A. J. Davison, and J. M. Montiel, “Inverse depth parametrization for monocular slam,” IEEE Trans. on Robotics, vol. 24, no. 5, pp. 932–945, 2008.
-  J. Sola, “Quaternion kinematics for the error-state kalman filter,” arXiv preprint arXiv:1711.02508, 2017.
-  R. G. Von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “Lsd: A fast line segment detector with a false detection control,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 4, pp. 722–732, 2010.
-  M. Li and A. I. Mourikis, “Optimization-based estimator design for vision-aided inertial navigation,” in Robotics: Science and Systems, 2013, pp. 241–248.
-  A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “Monoslam: Real-time single camera slam,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052–1067, 2007.
-  R. Toldo and A. Fusiello, “Robust multiple structures estimation with j-linkage,” in Euro. Conf. Computer Vision. Springer, 2008, pp. 537–547.
-  M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
-  P. Furgale, J. Rehder, and R. Siegwart, “Unified temporal and spatial calibration for multi-sensor systems,” in IEEE/RSJ Proc. of Intelligent Robots and Systems. IEEE, 2013, pp. 1280–1286.
-  Aplicaciones de la Visión Artificial (AVA), “ArUco: a minimal library for augmented reality applications based on OpenCV,” available: http://www.uco.es/investiga/grupos/ava/node/26, accessed: Apr. 16, 2016.
-  F. Devernay and O. Faugeras, “Straight lines have to be straight,” Machine vision and applications, vol. 13, no. 1, pp. 14–24, 2001.