Estimation of ego-motion and scene geometry is one of the key challenges in many engineering fields such as robotics and autonomous driving. In the last few decades, visual odometry (VO) systems have attracted a substantial amount of attention due to low-cost hardware setups and rich visual representations. However, monocular VO is confronted with numerous challenges, such as scale ambiguity, the need for hand-crafted mathematical features, strict parameter tuning, and image blur caused by abrupt camera motion, which can corrupt VO algorithms deployed in low-textured areas and under variable ambient lighting conditions [2, 3]. For such cases, visual-inertial odometry (VIO) systems increase the robustness of VO systems by incorporating information from an inertial measurement unit (IMU) to improve motion-tracking performance [4, 5].
Supervised deep learning methods have achieved state-of-the-art results on various computer vision problems using large amounts of labeled data [6, 7, 8]. Moreover, supervised deep VIO and depth recovery techniques have shown promising performance in challenging environments and successfully alleviate issues such as scale drift, the need for feature extraction, and parameter fine-tuning [9, 10, 11, 12]. Most existing deep learning approaches in the literature treat VIO and depth recovery as a supervised learning problem, in which color input images, corresponding target depth values, and the relative transformations between images are available at training time. Treating VIO as a regression problem in supervised deep learning exploits the capability of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to estimate camera motion, calculate optical flow, and extract efficient feature representations from raw RGB and IMU input [9, 10, 13, 11]. However, for many vision-aided localization and navigation problems requiring dense, continuous-valued outputs (e.g., VIO and depth map reconstruction), it is either impractical or expensive to acquire ground truth data for a large variety of scenes. Even when ground truth depth data are available, they can be imperfect and cause distinct prediction artifacts. For example, systems employing rotating LIDAR scanners suffer from the need for tight temporal alignment between laser scans and corresponding camera images, even if the camera and LIDAR are carefully synchronized. In addition, structured-light depth sensors, and to a lesser extent LIDAR and time-of-flight sensors, suffer from noise and structural artifacts, especially in the presence of reflective, transparent, or dark surfaces. Finally, there is usually an offset between the depth sensor and the camera, which causes shifts in the point cloud projection onto the camera viewpoint. These problems may lead to degraded performance and even failure for learning-based models trained on such data [16, 17].
In recent years, unsupervised deep learning approaches have emerged to address the problem of limited training data [18, 19, 20]. As an alternative, these approaches treat depth estimation as an image reconstruction problem during training. The intuition here is that, given a sequence of monocular images, we can learn a function that is able to reconstruct a target image from source images by exploiting the 3D geometry of the scene. Learning a mapping from pixels to depth and camera motion without ground truth is challenging because each of these problems is highly ambiguous. To address this issue, recent studies imposed additional constraints and exploited the geometric relations between ego-motion and the depth map [21, 16]. Recently, optical flow has been widely studied and used as a self-supervisory signal for learning an unsupervised ego-motion system, but it suffers from the aperture problem due to missing structure within local regions observed by a single camera. Moreover, most unsupervised methods learn only from photometric and temporal consistency between consecutive frames in monocular videos, which is prone to producing overly smooth depth map estimates.
To overcome these limitations, we propose a self-supervised VIO and depth map reconstruction system based on adversarial training and attentive sensor fusion (see Fig. 1), extending our GANVO work. We introduce a novel sensor fusion technique to incorporate loosely synchronized motion information from interoceptive and mostly environment-agnostic raw inertial data. Furthermore, we conduct experiments on the publicly available EuRoC MAV dataset to measure the robustness of the fusion system against miscalibration. Additionally, we separate the effects of the VO module from the pose estimates extracted from IMU measurements to test the effectiveness of each module. Moreover, we perform ablation studies to compare the performance of convolutional and recurrent networks. In addition to the results presented in that work, here we thoroughly evaluate the benefit of the adversarial generative approach. In summary, the main contributions of the approach are as follows:
To the best of our knowledge, this is the first self-supervised deep joint monocular VIO and depth reconstruction method in the literature;
We propose a novel unsupervised sensor fusion technique for the camera and the IMU, which extracts and fuses motion features from raw IMU measurements and RGB camera images using convolutional and recurrent modules based on an attention mechanism;
No strict temporal or spatial calibration between camera and IMU is necessary for pose and depth estimation, contrary to traditional VO approaches.
Evaluations made on the KITTI, EuRoC, and Cityscapes datasets prove the effectiveness of SelfVIO. The organization of this paper is as follows. Previous work in this domain is discussed in Section II. Section III describes the proposed unsupervised deep learning architecture and its mathematical background in detail. Section IV describes the experimental setup and evaluation methods. Section V shows and discusses detailed quantitative and qualitative results with comprehensive comparisons to existing methods in the literature. Finally, Section VI concludes the study with some interesting future directions.
II Related Work
In this section, we briefly outline the related works focused on VIO including traditional and learning-based methods.
II-A Traditional Methods
Traditional VIO solutions combine visual and inertial data in a single pose estimator, leading to more robust and more accurate estimates than VO, even in complex and dynamic environments. The fusion of camera images and IMU measurements is typically accomplished by filter-based or optimization-based approaches. The multi-state constraint Kalman filter (MSCKF) is a standard for filtering-based VIO approaches. It has low computational complexity that is linear in the number of features used for ego-motion estimation. While MSCKF-based approaches are generally more robust than optimization-based approaches, especially in large-scale real environments, they suffer from lower accuracy in comparison, as has been recently reported. ROVIO is another filtering-based VIO algorithm for monocular cameras; it utilizes the intensity errors in the update step of an extended Kalman filter (EKF) to fuse visual and inertial data. ROVIO uses a robocentric approach that estimates 3D landmark positions relative to the current camera pose. OKVIS is a widely used, optimization-based visual-inertial SLAM approach for monocular and stereo cameras. OKVIS uses nonlinear batch optimization on saved keyframes consisting of an image and an estimated camera pose. It updates a local map of landmarks to estimate camera motion without any loop-closure constraint. VINS-Mono is a tightly coupled, nonlinear optimization-based method for monocular cameras. It uses pose graph optimization to enforce global consistency, which is constrained by a loop-detection module. VINS-Mono requires IMU-camera extrinsic calibration to translate pose values from one sensor frame to the other.
II-B Learning-Based Methods
VINet was the first end-to-end trainable visual-inertial deep network. However, VINet was trained in a supervised manner and thus required ground truth pose differences for each exemplar in the training set. Recently, there have been several successful unsupervised approaches to depth estimation that are trained using a reconstruction loss from image warping, similar to our own network. Garg et al., Godard et al., and Zhan et al. used such methods with stereo image pairs with known camera baselines and a reconstruction loss for training. Thus, while technically unsupervised, the stereo baseline effectively provides a known transform between the two images.
Several recent works have approached odometry and depth estimation by coupling two or more of these problems together in an unsupervised learning framework. Zhou et al. introduced joint unsupervised learning of ego-motion and depth from multiple unlabeled RGB frames. They input a consecutive sequence of images and output the change in pose between the middle image of the sequence and every other image in the sequence, as well as the estimated depth of the middle image. A recent work used a more explicit geometric loss to jointly learn depth and camera motion for rigid scenes with a semidifferentiable iterative closest point (ICP) module. These VO approaches estimate ego-motion using only the spatial information present in several frames, which means the temporal information across the frames is not fully utilized. As a result, the estimates are inaccurate and discontinuous.
UnDeepVO is another unsupervised depth and ego-motion estimation work. It differs from the aforementioned approaches in that it can generate properly scaled trajectory estimates. However, similar to [31, 32], it uses stereo image pairs for training, where the baseline between images is known; thus, UnDeepVO can only be trained on datasets where stereo image pairs are available. Additionally, the network architecture of UnDeepVO cannot be extended to include motion estimates derived from inertial measurements because the spatial transformation between paired images from stereo cameras is unobservable by an IMU (the stereo images are recorded simultaneously). Finally, VIOLearner is a recent unsupervised learning-based approach to VIO using multiview RGB-depth (RGB-D) images. It uses a learned optimizer that minimizes photometric loss for ego-motion estimation, leveraging the Jacobians of scaled image-projection errors with respect to a spatial grid of pixel coordinates. Although no ground truth odometry data are needed, the depth input to the system provides external supervision to the network, which may not always be available.
One critical issue with these unsupervised works is that they use traditional encoder-decoder-based depth estimators, which tend to generate overly smooth images. GANVO is the first unsupervised adversarial generative approach to jointly estimate multiview pose and a monocular depth map; extending that work, we apply GANs to produce sharper and more accurate depth maps. The second issue with the aforementioned unsupervised techniques is that they employ CNNs that analyze only instantaneous information to estimate camera pose [9, 11, 42]. We address this issue by employing a CNN-RNN architecture to capture temporal relations across frames. Furthermore, existing VIO works use a direct fusion approach that concatenates all features extracted from different modalities, resulting in suboptimal performance, as not all features are useful and necessary. We introduce an attention mechanism to self-adaptively fuse the different modalities conditioned on the input data. We discuss the reasoning behind these design choices in the related sections.
III Self-Supervised Monocular VIO and Depth Estimation Architecture
Given unlabeled monocular RGB image sequences and raw IMU measurements, the proposed approach learns a function that regresses 6-DoF camera motion and predicts the per-pixel scene depth. An overview of our SelfVIO architecture is depicted in Fig. 1. We stack the monocular RGB sequences, consisting of a target view and source views, to form an input batch to the multiview visual odometry module. The VO module, consisting of convolutional layers, regresses the relative 6-DoF poses of the source views with respect to the target view. We form an IMU input tensor using the raw linear acceleration and angular velocity values measured by the IMU between the source and target frames, which is processed in the inertial odometry module to estimate the relative motion of the source views. We fuse the 6-DoF pose values estimated by the visual and inertial odometry modules in a self-adaptive fusion module, attentively selecting the features that are most significant for pose regression. In parallel, the depth generator module estimates a depth map of the target view by inferring the disparities that warp the source views to the target. The spatial transformer module synthesizes the target image using the generated depth map and nearby color pixels in a source image, sampled at locations determined by the fused 3D affine transformation. The geometric constraints provide a supervision signal that causes the neural network to synthesize a target image from multiple source images acquired from different camera poses. The view discriminator module learns to distinguish between a fake (synthesized by the spatial transformer) and a real target image. In this way, each subnetwork targets a specific subtask, and the complex scene-geometry understanding goal is decomposed into smaller subgoals.
In the overall adversarial paradigm, a generator network is trained to produce output that cannot be distinguished from the original image by an adversarially optimized discriminator network. The objective of the generator is to trick the discriminator, i.e. to generate a depth map of the target view such that the discriminator cannot distinguish the reconstructed view from the original view. Unlike the typical use of GANs, the spatial transformer module maps the output image of the generator to the color space of the target view and the discriminator classifies this reconstructed colored view rather than the direct output of the generator. The proposed scheme enables us to predict the relative motion and depth map in an unsupervised manner, which is explained in the following sections in detail.
III-A Depth Estimation
The first part of the architecture is the depth generator network, which synthesizes a single-view depth map by translating the target RGB frame. A defining feature of image-to-depth translation problems is that they map a high-resolution input tensor to a high-resolution output tensor that differs in surface appearance. However, both images are renderings of the same underlying structure. Therefore, the structure in the RGB frame is roughly aligned with the structure in the depth map.
The depth generator network is based on a GAN design that learns the underlying generative model of the input image. Three subnetworks are involved in the adversarial depth generation process: an encoder network, a generator network, and a discriminator network. The encoder extracts a feature vector from the input target image. The generator maps this vector to the depth image space, which is used in the spatial transformer module to reconstruct the original target view. The discriminator classifies the reconstructed view as synthesized or real.
Many previous solutions [21, 37, 44] to single-view depth estimation are based on an encoder-decoder network. Such a network passes the input through a series of layers that progressively downsample until a bottleneck layer, and then the process is reversed by upsampling. All information flow passes through all the layers, including the bottleneck. For the image-to-depth translation problem, a great deal of low-level information is shared between the input and output, and the network should make use of this information by directly sharing it across the layers. For example, the RGB image input and the depth map output share the locations of prominent edges. To enable the generator to circumvent the bottleneck for such shared low-level information, we add skip connections similar to the general shape of a U-Net. Specifically, these connections are placed between each layer i and layer n−i, where n is the total number of layers; each connection concatenates all channels at layer i with those at layer n−i.
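To illustrate the skip-connection bookkeeping described above, the following is a minimal NumPy sketch with toy (non-learned) pooling and upsampling standing in for convolutions; the layer count and channel sizes are illustrative assumptions, not the actual SelfVIO configuration.

```python
import numpy as np

def downsample(x):
    """Stride-2 mean pooling over spatial dims; x has shape (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour 2x upsampling; x has shape (C, H, W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def unet_forward(image):
    """Toy encoder-decoder with skip connections (no learned weights).
    Each decoder stage concatenates the mirrored encoder feature map
    along the channel axis, as in a U-Net."""
    skips = []
    x = image
    for _ in range(3):            # encoder: three stride-2 stages
        skips.append(x)
        x = downsample(x)
    for _ in range(3):            # decoder: mirror with skip concatenation
        x = upsample(x)
        skip = skips.pop()
        x = np.concatenate([x, skip], axis=0)  # channels grow at layer i <-> n-i
    return x

depth_features = unet_forward(np.random.rand(4, 32, 32))  # shape (16, 32, 32)
```

The channel count grows at every decoder stage because the mirrored encoder activations are concatenated rather than added, which is exactly the property that lets low-level edge information bypass the bottleneck.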
III-B Visual Odometry
The VO module (see Fig. 2) is designed to take two source views and a target view, concatenated along the color channels, as input and to output a 6D motion vector induced by the motion and temporal dynamics across the frames. The network is composed of six stride-2 convolutions followed by three parallel fully connected (FC) layers. We decouple the FC layers for translation and rotation, as separate branches have been shown to work better. We also use dropout between the FC layers to help regularization. The last fully connected layers output a 6-DoF pose vector, which defines the 3D affine transformation between the target image and the source images.
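The decoupled translation and rotation branches can be sketched as follows; the feature dimension and the single linear heads (in place of multilayer FC branches) are placeholder assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pose_head(features, w_trans, w_rot):
    """Decoupled FC branches: one regresses translation, one rotation.
    `features` stands in for the flattened output of the stride-2 conv stack."""
    translation = features @ w_trans   # (3,) translation parameters
    rotation = features @ w_rot        # (3,) rotation parameters (e.g. Euler angles)
    return np.concatenate([translation, rotation])  # 6-DoF pose vector

feat = rng.standard_normal(128)                       # hypothetical feature size
pose = pose_head(feat,
                 rng.standard_normal((128, 3)) * 0.01,
                 rng.standard_normal((128, 3)) * 0.01)
```

Keeping the two branches separate lets each head learn its own scale, which is the motivation the text cites for decoupling translation from rotation.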
III-C Inertial Odometry
SelfVIO takes raw IMU measurements as an N×6 input window stacking the linear acceleration a and angular velocity ω samples, where N is the number of IMU samples obtained between the timestamps of the two frames. The IMU processing module of SelfVIO uses two parallel branches consisting of convolutional layers, one for the IMU angular velocity and one for the linear acceleration (see Fig. 2 for more detail). Each branch has the following convolutional layers:
two layers: single-stride filters with kernel size ,
one layer: filters of stride with kernel size ,
one layer: filters of stride with kernel size , and
one layer: filters of stride with kernel size .
The outputs of the final angular velocity and linear acceleration branches are flattened using a convolutional layer before being concatenated into a single tensor. Thus, the module learns to estimate the 3D affine transformation between the source and target frames.
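A minimal sketch of how the IMU window might be assembled and processed in two parallel branches; the window length, the averaging kernel, and the toy per-channel convolution are illustrative stand-ins for the learned layers listed above.

```python
import numpy as np

def make_imu_window(accel, gyro):
    """Stack N accelerometer and N gyroscope samples between two frames
    into a single (N, 6) tensor (this layout is an assumption)."""
    return np.concatenate([accel, gyro], axis=1)

def conv1d_per_channel(x, kernel):
    """Toy single-stride 1-D convolution applied per channel, standing in
    for a learned convolutional branch."""
    return np.stack([np.convolve(x[:, c], kernel, mode="valid")
                     for c in range(x.shape[1])], axis=1)

n = 100
accel = np.random.randn(n, 3)   # linear acceleration samples (m/s^2)
gyro = np.random.randn(n, 3)    # angular velocity samples (rad/s)
window = make_imu_window(accel, gyro)               # (100, 6)

# Parallel branches: one over the accelerometer channels, one over the gyro.
k = np.ones(5) / 5.0
accel_feat = conv1d_per_channel(window[:, :3], k)   # (96, 3)
gyro_feat = conv1d_per_channel(window[:, 3:], k)    # (96, 3)
fused = np.concatenate([accel_feat.ravel(), gyro_feat.ravel()])  # flattened
```

Processing the gyroscope and accelerometer in separate branches mirrors the text: the two signals have different noise characteristics, so they get dedicated filters before concatenation.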
III-D Self-Adaptive Visual-Inertial Fusion
In learning-based VIO, a standard method for fusion is the concatenation of feature vectors coming from different modalities, which may result in suboptimal performance, as not all features are equally reliable. For example, the fusion is plagued by the intrinsic noise distribution of each modality, such as white random noise and sensor bias in IMU data. Moreover, many real-world applications suffer from poor calibration and synchronization between different modalities. To eliminate the effects of these factors, we employ an attention mechanism, which allows the network to automatically learn the most suitable feature combination given visual-inertial feature inputs.
The convolutional layers of the VO and IMU processing modules extract features from the input sequences and estimate ego-motion, which is propagated to the self-adaptive fusion module. In our attention mechanism, we use a deterministic soft fusion approach to attentively fuse features. The adaptive fusion module learns visual and inertial masks to reweight each feature by conditioning on all channels:

m_v = σ(W_v [a_v; a_i]),   m_i = σ(W_i [a_v; a_i]),

where σ is the sigmoid function, [a_v; a_i] is the concatenation of all channel features, and W_v and W_i are the learned weights for each modality. We multiply the visual and inertial features with these masks to weight the relative importance of the features:

a_fused = [a_v ⊙ m_v ; a_i ⊙ m_i],

where ⊙ denotes elementwise multiplication. The resulting weighted feature vector is fed into the RNN part (a two-layer bidirectional LSTM). The LSTM takes the combined feature representation and its previous hidden states as input, and models the dynamics and connections between a sequence of features. After the recurrent network, a fully connected layer regresses the fused pose, mapping the features to 6-DoF pose vectors for the translation and rotation parameters of each source view relative to the target, representing the motion over the corresponding time window. The LSTM improves the sequential learning capacity of the network, resulting in more accurate pose estimation.
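The deterministic soft fusion described above can be sketched in NumPy as follows; the feature dimensions and random weights are placeholders for the learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_fusion(a_vis, a_imu, w_vis, w_imu):
    """Deterministic soft fusion: each modality is reweighted by a sigmoid
    mask conditioned on the concatenation of both feature vectors."""
    concat = np.concatenate([a_vis, a_imu])
    m_vis = sigmoid(w_vis @ concat)   # mask for visual features, in (0, 1)
    m_imu = sigmoid(w_imu @ concat)   # mask for inertial features, in (0, 1)
    return np.concatenate([a_vis * m_vis, a_imu * m_imu])

rng = np.random.default_rng(1)
a_vis, a_imu = rng.standard_normal(8), rng.standard_normal(4)  # toy features
w_vis = rng.standard_normal((8, 12)) * 0.1
w_imu = rng.standard_normal((4, 12)) * 0.1
fused = soft_fusion(a_vis, a_imu, w_vis, w_imu)   # fed to the LSTM in the paper
```

Because each mask entry lies strictly between 0 and 1, the fusion can only attenuate unreliable channels, never amplify them; the network learns which channels to suppress conditioned on both modalities jointly.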
III-E Spatial Transformer
A sequence of consecutive frames is given to the pose network as input. An input sequence is denoted by ⟨I_1, …, I_N⟩, where I_t is the target view and the remaining frames are source views I_s that are used to render the target image according to the objective function

L_p = Σ_s Σ_p | I_t(p) − Î_s(p) |,

where p is the pixel coordinate index and Î_s is the source view projected onto the target coordinate frame using a depth image-based rendering module. For the rendering, we define the static scene geometry by a collection of depth maps D_t for each frame and the relative camera motion T_{t→s} from the target to the source frame. The relative 2D rigid flow from the target image I_t to the source image I_s can be represented by¹

p_s ∼ K T_{t→s} D_t(p_t) K⁻¹ p_t,

where K denotes the camera intrinsic matrix, T_{t→s} denotes the camera transformation matrix, and p_t denotes the homogeneous coordinates of pixels in the target frame.

¹Similar to prior work, we omit the necessary conversion to homogeneous coordinates for notational brevity.
We interpolate the nondiscrete values p_s to find the expected intensity value at that position, using bilinear interpolation with the four discrete neighbors of p_s. The expected intensity value for the projected pixel is estimated as

Î_s(p_t) = Σ_{i∈{t,b}, j∈{l,r}} w^{ij} I_s(p_s^{ij}),

where w^{ij} is the proximity value between the projected pixel and its neighboring pixels, and the weights sum to 1. Guided by these positional constraints, we can apply differentiable inverse warping between nearby frames, which later becomes the foundation of our self-supervised learning scheme.
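A minimal NumPy sketch of the bilinear sampling step: the intensity at a nondiscrete projected coordinate is a proximity-weighted average of its four discrete neighbors, with weights that sum to 1.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample img at a non-integer (x, y) using its four discrete
    neighbours, weighted by proximity (the four weights sum to 1)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] +
            wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] +
            wx * wy * img[y1, x1])

img = np.arange(16, dtype=float).reshape(4, 4)
v = bilinear_sample(img, 1.5, 2.5)   # midpoint of pixels (2,1),(2,2),(3,1),(3,2)
```

Because the output is a smooth function of the sampling coordinates, gradients can flow from the photometric loss back through the projected pixel locations to the pose and depth estimates, which is what makes the inverse warping differentiable.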
III-F View Discriminator
The L2 and L1 losses produce blurry results on image generation problems. Although these losses fail to encourage high-frequency crispness, in many cases they nonetheless accurately capture the low frequencies. This motivates restricting the GAN discriminator to model high-frequency structure while relying on an L1 term to enforce low-frequency correctness. To model high frequencies, it is sufficient to restrict attention to the structure in local image patches. Therefore, we employ the PatchGAN discriminator architecture, which only penalizes structure at the scale of patches. This discriminator tries to classify each patch in an image as real or fake. We run the discriminator convolutionally across the image, averaging all responses to produce its ultimate output. Such a discriminator effectively models the image as a Markov random field, assuming independence between pixels separated by more than a patch diameter. This connection was explored in previous work and is also a common assumption in models of texture; the discriminator can thus be interpreted as a form of texture loss.
The spatial transformer module synthesizes a realistic image via the view reconstruction algorithm, using the depth image produced by the generator and the estimated pose. The discriminator classifies input images sampled from the target data distribution into fake and real categories, playing an adversarial role. These networks are trained by optimizing the objective loss function

min_G max_D  E_{I∼p_data}[log D(I)] + E_z[log(1 − D(G(z)))],

where I is a sample image from the target distribution and z is a feature encoding of the target view on the latent space.
III-G Adversarial Training
In contrast to the original GAN, we remove fully connected hidden layers for deeper architectures and use batch normalization in both the generator and discriminator networks. We replace pooling layers with strided convolutions in the discriminator and fractional-strided convolutions in the generator. For the layer activations, we use LeakyReLU in the discriminator and ReLU in the generator, except for the generator output layer, which uses a tanh nonlinearity. The GAN with these modifications and loss functions generates nonblurry depth maps and resolves the convergence problem during training. The final objective for the optimization during training is a weighted sum of the reconstruction and adversarial losses, where the balance factor is experimentally found to be optimal as the ratio between the expected values of the two loss terms at the end of training.
IV Experimental Setup
In this section, the datasets used in the experiments, the network training protocol, and the evaluation methods are introduced, including ablation studies and performance evaluation under poor intersensor calibration.
| Method | Dataset | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Eigen et al. (coarse) | k | 0.214 | 1.605 | 6.563 | 0.292 | 0.673 | 0.884 | 0.957 |
| Eigen et al. (fine) | k | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958 |
| Liu et al. | k | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965 |
| Mahjourian et al. | cs+k | 0.159 | 1.231 | 5.912 | 0.243 | 0.784 | 0.923 | 0.970 |
| SelfVIO (ours, no-IMU) | cs+k | 0.138 | 1.013 | 4.317 | 0.231 | 0.849 | 0.958 | 0.979 |
| Mahjourian et al. | k | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
The KITTI odometry dataset  is a benchmark for depth and odometry evaluations including vision and LIDAR-based approaches. Images are recorded at Hz via an onboard camera mounted on a Volkswagen Passat B6. Frames are recorded in various environments such as residential, road, and campus scenes adding up to a km travel length. Ground truth pose values at each camera exposure are determined using an OXTS RT 3003 GPS solution with an accuracy of cm and corresponding depth ground truth data are acquired via a Velodyne laser scanner. A temporal synchronization between sensors is provided using a software-based calibration approach, which causes issues for VIO approaches that require strict time synchronization between RGB frames and IMU data.
We evaluate SelfVIO on the KITTI odometry dataset using Eigen et al.’s split . We use sequences for training and for the test set that is consistent across related works [58, 59, 23, 16, 35, 21, 60]. Additionally, of KITTI sequences are withheld as a validation set, which leaves a total of training images, testing images, and validation images. Input images are scaled to for training, whereas they are not limited to any specific image size at test time. In all experiments, we randomly select an image for the target and use consecutive images for the source. Corresponding Hz IMU data are collected from the KITTI raw datasets and for each target image, the preceding ms and the following ms of IMU data are combined yielding a tensor of size ( ms between the source images and target). Thus, the network learns how to implicitly estimate a temporal offset between camera and IMU as well as determine an estimate of the initial velocity at the time of target image capture by looking to corresponding IMU data.
The EuRoC dataset contains 11 sequences recorded onboard an AscTec Firefly micro aerial vehicle (MAV) while it was manually piloted around three different indoor environments executing 6-DoF motions. Within each environment, the sequences increase qualitatively in difficulty with increasing sequence numbers. For example, Machine Hall 01 is "easy," while Machine Hall 05 is a more challenging sequence in the same environment, containing faster and loopier motions, poor illumination conditions, etc. These sequences were recorded by a front-facing visual-inertial sensor unit with tight synchronization between the stereo camera and IMU timestamps. Accurate ground truth is provided by laser or motion-capture tracking depending on the sequence, which has been used in many existing comparative evaluations of VIO methods. The dataset provides synchronized global-shutter WVGA stereo images, of which we use only the left camera images, and acceleration and angular rate measurements from a Skybotix VI IMU sensor. In the Vicon Room sequences, ground truth is provided by a Vicon motion capture system, while in the Machine Hall sequences, ground truth positioning measurements are provided by a Leica MS50 laser tracker. The dataset, containing sequences, ground truth, and sensor calibration data, is publicly available (http://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets). The EuRoC dataset, being recorded indoors on unstructured paths, exhibits motion blur, and its trajectories follow highly irregular paths, unlike the KITTI dataset.
The Cityscapes Urban Scene 2016 dataset is a large-scale dataset mainly used for semantic urban scene understanding. It contains stereo images for autonomous driving in an urban environment, collected in street scenes from 50 different cities across Germany and spanning several months. The dataset also provides precomputed disparity depth maps associated with the RGB images. Although its setting is similar to the KITTI dataset, the Cityscapes dataset has higher resolution, better image quality, and more variety. We cropped the input images to keep only the top 80% of the image, removing the highly reflective car hoods.
: average translational RMSE drift (%) over trajectory lengths of 100 m to 800 m.
: average rotational RMSE drift (°/100 m) over trajectory lengths of 100 m to 800 m.
|SelfVIO (no IMU)||Med.||2.25||4.3||7.29||13.11||17.29|
Comparisons to VIO approaches on KITTI Odometry sequence 10. We present medians, first quartiles, and third quartiles. Results reproduced from [10, 38]. We report errors over the same distances from KITTI Odometry sequence 10 to have identical metrics with [10, 38]. Full results for SelfVIO on sequence 10 can be found in Tab. IV.
|SelfVIO (no IMU)||2.49||2.33||2.41|
|Zhan et al.||11.92||12.62||12.27|
: average translational RMSE drift (%) over trajectory lengths of 100 m to 800 m.
: average rotational RMSE drift (°/100 m) over trajectory lengths of 100 m to 800 m.
IV-B Network Training
We implement the architecture with the publicly available TensorFlow framework. Batch normalization is employed for all of the layers except for the output layers. Three consecutive images are stacked together to form the input batch, where the central frame is the target view for the depth estimation. We augment the data with random scaling, cropping, and horizontal flips. During training, we calculate the error on the validation set at regular intervals. We use the ADAM solver with an exponential learning rate policy. The network is trained using single-precision (FP32) arithmetic on a desktop computer with a 3.00 GHz Intel i7-6950X processor and NVIDIA Titan V GPUs.
We compare our approach to a collection of recent VO, VIO, and VSLAM approaches described earlier in Section II:
We include monocular versions of competing algorithms to have a common setup with our method. SFMLearner, Mahjourian et al., Zhan et al., and VINet optimize over multiple consecutive monocular images or stereo image pairs, while OKVIS and ORB-SLAM perform bundle adjustment. Similarly, we include the RGB version of VIOLearner for all comparisons, which uses RGB image input and the monocular depth generation subnetwork from SFMLearner rather than RGB-depth data. We perform 6-DoF least-squares Umeyama alignment for trajectory alignment on monocular approaches, as they lack scale information. For SFMLearner, we estimate the scale from the ground truth for each estimate, following prior work. It should be noted that OKVIS and ORB-SLAM are evaluated on images scaled down to match the image resolution used by SelfVIO.
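The Umeyama alignment used above to align scale-free monocular trajectories can be sketched as follows; this is a direct NumPy implementation of the well-known closed-form least-squares similarity transform (variable names are ours).

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity transform (s, R, t) aligning src to dst,
    both (N, 3): minimizes ||dst - (s * R @ src + t)||^2 (Umeyama, 1991)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                 # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)         # source variance
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Recover a known similarity transform from noiseless correspondences.
rng = np.random.default_rng(2)
pts = rng.standard_normal((50, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
dst = 2.0 * pts @ R_true.T + np.array([1.0, -2.0, 0.5])
s, R, t = umeyama_alignment(pts, dst)   # recovers s=2, R_true, t
```

For monocular methods the recovered scale s absorbs the unobservable global scale, after which translational drift can be compared fairly against metric-scale methods.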
We train separate networks on the KITTI and EuRoC datasets for benchmarking, and on the Cityscapes dataset for evaluating the cross-dataset generalization ability of the model. SelfVIO implicitly learns to estimate camera-IMU extrinsics and IMU intrinsics directly from raw data, enabling SelfVIO to translate from one dataset (with a given camera-IMU configuration) to another (with a different camera-IMU configuration).
|GeoNet ||0.012 ± 0.007||0.012 ± 0.009|
|CC ||0.012 ± 0.007||0.012 ± 0.008|
|GANVO ||0.009 ± 0.005||0.010 ± 0.013|
|SelfVIO (ours)||0.008 ± 0.006||0.009 ± 0.008|
IV-C1 Ablation Studies
We perform two ablation studies on our proposed network and call these SelfVIO (no IMU) and SelfVIO (LSTM).
Visual vs. Visual-Inertial
We disable the inertial odometry module and omit IMU data; instead, we use a vision-only odometry to estimate the initial warp. This version of the network is referred to as SelfVIO (no IMU) and results are only included to provide additional perspective on the vision-only performance of our architecture (and specifically the adversarial training) compared to other vision-only approaches.
CNN vs. RNN
Additionally, we perform an ablation study in which we replace the convolutional network described in Sec. III-C with a recurrent neural network, specifically a bidirectional LSTM, to process the IMU input at the cost of more parameters and, hence, higher computational cost. This version of the network is referred to as SelfVIO (LSTM).
IV-C2 Spatial Misalignments
We test the robustness of our method against camera-IMU miscalibration. We introduce calibration errors by applying a rotation of a chosen magnitude about a random axis to the camera-IMU rotation matrices, where the axis is drawn from the von Mises-Fisher distribution $\mathrm{vMF}(\mu, \kappa)$, with $\mu$ the directional mean and $\kappa$ the concentration parameter of the distribution.
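The miscalibration procedure can be sketched as follows, assuming the perturbation axis is drawn from $\mathrm{vMF}(\mu,\kappa)$ on the unit sphere (via the closed-form radial sampler of Wood, 1994, which is exact on $S^2$) and the rotation magnitude is fixed; all function names are illustrative:

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` rad about unit `axis`."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * K @ K

def _rotate_to(mu):
    """Rotation taking the z-axis e3 onto unit vector mu."""
    e3 = np.array([0.0, 0.0, 1.0])
    c = float(e3 @ mu)
    if np.isclose(c, 1.0):
        return np.eye(3)
    if np.isclose(c, -1.0):
        return np.diag([1.0, -1.0, -1.0])
    return rodrigues(np.cross(e3, mu), np.arccos(c))

def sample_vmf_s2(mu, kappa, rng):
    """Draw a unit vector on S^2 from vMF(mu, kappa)."""
    u = rng.uniform()
    # Closed-form inverse-CDF for the component along mu (S^2 only).
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa
    phi = rng.uniform(0.0, 2.0 * np.pi)
    r = np.sqrt(max(1.0 - w * w, 0.0))
    x = np.array([r * np.cos(phi), r * np.sin(phi), w])
    return _rotate_to(np.asarray(mu, float)) @ x

def perturb_extrinsics(R_cam_imu, angle_deg, kappa, rng,
                       mu=np.array([0.0, 0.0, 1.0])):
    """Apply a fixed-magnitude rotation about a vMF-sampled axis."""
    axis = sample_vmf_s2(mu, kappa, rng)
    return rodrigues(axis, np.deg2rad(angle_deg)) @ R_cam_imu
```

Large $\kappa$ concentrates the perturbation axis near $\mu$, while small $\kappa$ approaches a uniformly random axis, so a single parameter sweeps between structured and isotropic miscalibration.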
IV-C3 Evaluation Metrics
We evaluate our trajectories primarily using the standard KITTI relative error metric, reproduced below:
$E_{\mathrm{rel}}(\mathcal{F}) = \frac{1}{|\mathcal{F}|} \sum_{(i,j) \in \mathcal{F}} \left\| \left( \hat{p}_j \ominus \hat{p}_i \right) \ominus \left( p_j \ominus p_i \right) \right\|$
where $\mathcal{F}$ is a set of frame pairs, $\ominus$ is the inverse compositional operator, and $\hat{p}$ and $p$ are estimated and true pose values as elements of the Lie group $\mathrm{SE}(3)$, respectively.
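As a sketch, the translational component of this relative error can be computed from 4x4 SE(3) pose matrices as follows (illustrative code, not the official KITTI evaluation script):

```python
import numpy as np

def relative_error(poses_est, poses_gt, delta=1):
    """KITTI-style relative pose error over frame pairs (i, i + delta).

    poses_*: sequences of 4x4 SE(3) matrices.
    Returns the mean translational norm of the relative-motion residual
    (p_hat_j (-) p_hat_i) (-) (p_j (-) p_i), where (-) is the inverse
    compositional operator A (-) B = inv(B) @ A.
    """
    errs = []
    for i in range(len(poses_gt) - delta):
        j = i + delta
        rel_est = np.linalg.inv(poses_est[i]) @ poses_est[j]
        rel_gt = np.linalg.inv(poses_gt[i]) @ poses_gt[j]
        residual = np.linalg.inv(rel_gt) @ rel_est
        errs.append(np.linalg.norm(residual[:3, 3]))
    return float(np.mean(errs))
```

A rotational variant replaces the translation norm with the angle recovered from the residual's rotation block.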
For the KITTI dataset, we also evaluate the errors at several path lengths. Additionally, we compute the root mean squared error (RMSE) of trajectory estimates on five-frame snippets, as has been done recently in [21, 16].
We evaluate depth estimation performance of each method using several error and accuracy metrics from prior works :
Threshold: % of $y$ s.t. $\max(y/y^{*},\, y^{*}/y) = \delta < \mathrm{thr}$;
Abs relative difference: $\frac{1}{|T|}\sum_{y \in T} |y - y^{*}| / y^{*}$;
Squared relative difference: $\frac{1}{|T|}\sum_{y \in T} \|y - y^{*}\|^{2} / y^{*}$;
RMSE (linear): $\sqrt{\frac{1}{|T|}\sum_{y \in T} \|y - y^{*}\|^{2}}$;
RMSE (log): $\sqrt{\frac{1}{|T|}\sum_{y \in T} \|\log y - \log y^{*}\|^{2}}$.
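These standard depth metrics can be computed with a short NumPy sketch (a hedged illustration over valid ground-truth pixels, with the ground truth playing the role of $y^{*}$):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Eigen-style monocular depth metrics over valid (positive) pixels."""
    pred = np.asarray(pred, float)
    gt = np.asarray(gt, float)
    mask = gt > 0                       # keep only valid ground truth
    pred, gt = pred[mask], gt[mask]
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = float((thresh < 1.25).mean())          # delta < 1.25
    a2 = float((thresh < 1.25 ** 2).mean())     # delta < 1.25^2
    a3 = float((thresh < 1.25 ** 3).mean())     # delta < 1.25^3
    abs_rel = float(np.mean(np.abs(gt - pred) / gt))
    sq_rel = float(np.mean((gt - pred) ** 2 / gt))
    rmse = float(np.sqrt(np.mean((gt - pred) ** 2)))
    rmse_log = float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)))
    return dict(abs_rel=abs_rel, sq_rel=sq_rel, rmse=rmse,
                rmse_log=rmse_log, a1=a1, a2=a2, a3=a3)
```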
V Results and Discussion
In this section, we critically analyse and comparatively discuss our qualitative and quantitative results for depth and motion estimation.
V-A Monocular Depth Estimation
We obtain state-of-the-art results on single-view depth prediction, as quantitatively shown in Table I. The depth reconstruction performance is evaluated on the Eigen et al. split of the raw KITTI dataset, consistent with previous work [58, 59, 16, 35]. All depth maps are capped at 80 meters. To resolve the scale ambiguity, the predicted depth map $\hat{D}$ is multiplied by a scaling factor $s$ that matches its median to that of the ground truth depth map $D$, i.e. $s = \mathrm{median}(D) / \mathrm{median}(\hat{D})$.
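The median-matching step can be sketched as follows (illustrative; the 80 m cap mirrors the evaluation protocol above):

```python
import numpy as np

def median_scale(pred_depth, gt_depth, cap=80.0):
    """Scale a predicted depth map so its median matches ground truth."""
    pred_depth = np.asarray(pred_depth, float)
    gt_depth = np.asarray(gt_depth, float)
    mask = gt_depth > 0                     # valid ground-truth returns only
    s = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
    return np.clip(pred_depth * s, 0.0, cap)  # cap depths at 80 m
```

The factor is computed per image, so a prediction that is correct only up to scale incurs no penalty from this step.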
Figure 3 shows examples of depth maps reconstructed by the proposed method, GeoNet, and Competitive Collaboration (CC). SelfVIO clearly outputs sharper and more accurate depth maps than the other methods, which fundamentally use an encoder-decoder network with various implementations. An explanation for this result is that adversarial training using the convolutional domain-related feature set of the discriminator distinguishes reconstructed images from real images, leading to less blurry results. Fig. 3 also shows that the depth reconstruction module of SelfVIO is capable of capturing small objects in the scene, whereas the other methods tend to ignore them. A loss function in the image space smooths out likely detail locations, whereas an adversarial loss function in feature space with a natural image prior makes SelfVIO more sensitive to details in the scene. SelfVIO also performs better in low-textured areas caused by shading inconsistencies in a scene and predicts the depth values of the corresponding objects much more accurately in such cases. In Fig. 4, we demonstrate typical performance degradation of the compared unsupervised methods caused by challenges such as poorly visible road signs in rural areas and large objects occluding most of the visual input. Even in these cases, SelfVIO performs slightly better than the existing methods.
Moreover, we test the adaptability of the proposed approach by training on the Cityscapes dataset and fine-tuning on the KITTI dataset (cs+k in Table I). As Cityscapes is an RGB-depth dataset, we remove the inertial odometry part and perform an ablation study (SelfVIO (no IMU)) on depth estimation. The results in Table I show a clear advantage of fine-tuning on data that is related to the test set. In this mode (SelfVIO (no IMU)), our network architecture for depth estimation is most similar to GANVO. However, the features shared between the encoder and generator networks give the network access to low-level information as well. In addition, the PatchGAN structure in SelfVIO restricts the discriminator to capturing high-frequency structure in depth map estimation. We observe that using the SelfVIO framework with inertial odometry results in larger performance gains even when it is trained on the KITTI dataset only.
[Table VI: results on the EuRoC sequences MH01 (E), MH02 (E), MH03 (M), MH04 (D), MH05 (D), V101 (E), V102 (M), V103 (D), V201 (E), V202 (M), V203 (D)]
V-B Motion Estimation
In this section, we comparatively discuss the motion estimation performance of the proposed method in terms of both vision-only and visual-inertial estimation modes.
V-B1 Visual Odometry
SelfVIO (no IMU) outperforms the VO approaches listed in Sec. IV-C as seen in Tab. II, which confirms that our results are not due solely to our inclusion of IMU data. It should be noted that the results in Tab. II for SelfVIO, VIOLearner, UnDeepVO, and SFMLearner are for networks that are tested on data on which they are also trained, which corresponds with the results presented in [37, 38]. We compare SelfVIO against UnDeepVO and VIOLearner using these results.
V-B2 Visual-Inertial Odometry
The authors of VINet provide the errors as boxplots compared to several state-of-the-art approaches on the KITTI odometry dataset. We reproduced the median, first quartile, and third quartile from [10, 38] to the best of our ability and include them in Tab. III. SelfVIO outperforms VIOLearner and VINet for longer trajectories on KITTI sequence 10. Although SelfVIO (LSTM) is slightly outperformed by SelfVIO, it still performs better than VIOLearner and VINet, which shows that the convolutional architecture in SelfVIO improves the estimation performance. It should again be noted that our network can implicitly learn the camera-IMU extrinsic calibration from the data. We also compare SelfVIO against traditional state-of-the-art VIO methods and include a custom EKF with VISO2, as in VINet.
We successfully run SelfVIO on the KITTI odometry sequences 09 and 10 and include the results in Tab. IV and Fig. 5. SelfVIO outperforms OKVIS and ROVIO on KITTI sequences 09 and 10. However, both OKVIS and ROVIO require tight synchronization between the IMU measurements and the images that KITTI does not provide. This is most likely the reason for the poor performance of both approaches on KITTI. This also highlights a strength of SelfVIO in that it can compensate for loosely temporally synchronized sensors without explicitly estimating their temporal offsets.
In addition to evaluating with relative error over the entire trajectory, we also evaluated SelfVIO RGB using RMSE over five-frame snippets, as was done in [21, 16, 38] for their similar monocular approaches. As shown in Tab. V, SelfVIO surpasses the RMSE performance of SFMLearner, Mahjourian et al., and VIOLearner on KITTI trajectories 09 and 10.
The results on the EuRoC sequences are shown in Tab. VI and sample trajectory plots are shown in Fig. 6. SelfVIO produces the most accurate trajectories for many of the sequences, even without explicit loop closing. To provide an objective comparison to the existing related methods in the literature, we use the following methods for evaluation described earlier in Section II:
MSCKF  - multistate constraint EKF,
OKVIS  - a keyframe optimization-based method using landmark reprojection errors,
ROVIO  - an EKF with tracking of both 3D landmarks and image patch features, and
VINS-Mono  - a nonlinear-optimization-based sliding window estimator using preintegrated IMU factors.
As we are interested in evaluating the odometry performance of the methods, no loop closure is performed. In the difficult sequences (marked with D), the continuous brightness inconsistency between images causes feature-matching failures in the filter-based approaches, which can result in divergence of the filter. On the easy sequences (marked with E), although OKVIS and VINS-Mono slightly outperform the other methods, the accuracies of SVO+MSF, ROVIO, and SelfVIO are similar, except that MSCKF has a larger error in the machine hall datasets, which may be caused by the larger scene depth compared to the Vicon room datasets.
As shown in Fig. 7, orientation offsets within a realistic range yield low errors, demonstrating the applicability of SelfVIO to sensor setups with high calibration noise. Furthermore, larger offsets display a modestly sloped plateau that suggests successful learning of the calibration. In contrast, OKVIS shows surprising robustness to small rotation errors but is unable to handle large orientation offsets, where the error measures increase drastically. This is expected, because deviations of this magnitude result in a large misalignment between the sensors, and, unsurprisingly, OKVIS appears unable to compensate.
In this work, we presented the SelfVIO architecture and demonstrated superior performance against state-of-the-art VO, VIO, and even VSLAM approaches. Despite using only monocular source-target image pairs, SelfVIO surpasses the state-of-the-art depth and motion estimation performance of both traditional and learning-based VO, VIO, and VSLAM approaches that use image sequences, keyframe-based bundle adjustment, or full bundle adjustment and loop closure. This is enabled by a novel adversarial training and visual-inertial sensor fusion technique embedded in our end-to-end trainable deep visual-inertial architecture. Even when IMU data are not provided, SelfVIO with RGB data outperforms deep monocular approaches in the same domain. In future work, we plan to develop a stereo version of SelfVIO that utilizes the disparity map.
-  F. Fraundorfer and D. Scaramuzza, “Visual odometry: Part ii: Matching, robustness, optimization, and applications,” IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 78–90, 2012.
-  J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European Conference on Computer Vision. Springer, 2014, pp. 834–849.
-  R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
-  ——, “Visual-inertial monocular slam with map reuse,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796–803, 2017.
-  T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  S. Wang, R. Clark, H. Wen, and N. Trigoni, “Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2043–2050.
-  R. Clark, S. Wang, H. Wen, A. Markham, and N. Trigoni, “Vinet: Visual-inertial odometry as a sequence-to-sequence learning problem.” in AAAI, 2017, pp. 3995–4001.
-  M. Turan, Y. Almalioglu, H. Araujo, E. Konukoglu, and M. Sitti, “Deep endovo: A recurrent convolutional neural network (rcnn) based visual odometry approach for endoscopic capsule robots,” Neurocomputing, vol. 275, pp. 1861–1870, 2018.
-  M. Turan, Y. Almalioglu, H. B. Gilbert, F. Mahmood, N. J. Durr, H. Araujo, A. E. Sarı, A. Ajay, and M. Sitti, “Learning to navigate endoscopic capsule robots,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3075–3082, 2019.
-  P. Muller and A. Savakis, “Flowdometry: An optical flow and deep learning based approach to visual odometry,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 624–631.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
-  A. Asvadi, L. Garrote, C. Premebida, P. Peixoto, and U. J. Nunes, “Multimodal vehicle detection: fusing 3d-lidar and color camera data,” Pattern Recognition Letters, vol. 115, pp. 20–29, 2018.
-  R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5667–5675.
-  M. Turan, Y. Almalioglu, H. B. Gilbert, A. E. Sari, U. Soylu, and M. Sitti, “Endo-vmfusenet: A deep visual-magnetic sensor fusion approach for endoscopic capsule robots,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–7.
-  M. Artetxe, G. Labaka, E. Agirre, and K. Cho, “Unsupervised neural machine translation,” arXiv preprint arXiv:1710.11041, 2017.
-  J. Y. Jason, A. W. Harley, and K. G. Derpanis, “Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness,” in European Conference on Computer Vision. Springer, 2016, pp. 3–10.
-  S. Meister, J. Hur, and S. Roth, “Unflow: Unsupervised learning of optical flow with a bidirectional census loss,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, vol. 2, no. 6, 2017, p. 7.
-  D. Fortun, P. Bouthemy, and C. Kervrann, “Optical flow modeling and computation: a survey,” Computer Vision and Image Understanding, vol. 134, pp. 1–21, 2015.
-  Y. Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni, “Ganvo: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5474–5480.
-  M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
-  A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint kalman filter for vision-aided inertial navigation,” in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 3565–3572.
-  J. Delmerico and D. Scaramuzza, “A benchmark comparison of monocular visual-inertial odometry algorithms for flying robots,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 2502–2509.
-  M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1053–1072, 2017.
-  S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
-  R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in European Conference on Computer Vision. Springer, 2016, pp. 740–756.
-  C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in CVPR, vol. 2, no. 6, 2017, p. 7.
-  H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 340–349.
-  B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “Demon: Depth and motion network for learning monocular stereo,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5038–5047.
-  Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2018.
-  J. Wulff and M. J. Black, “Temporal interpolation as an unsupervised pretraining task for optical flow estimation,” in German Conference on Pattern Recognition. Springer, 2018, pp. 567–582.
-  R. Li, S. Wang, Z. Long, and D. Gu, “Undeepvo: Monocular visual odometry through unsupervised deep learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7286–7291.
-  E. J. Shamwell, K. Lindgren, S. Leung, and W. D. Nothwang, “Unsupervised deep visual-inertial odometry with online error correction for rgb-d imagery,” IEEE transactions on pattern analysis and machine intelligence, 2019.
-  E. J. Shamwell, S. Leung, and W. D. Nothwang, “Vision-aided absolute trajectory estimation using an unsupervised deep network with online error correction,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 2524–2531.
-  R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison, “Learning to solve nonlinear least squares for monocular stereo,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 284–299.
-  A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
-  Y. Almalioglu, M. Turan, C. X. Lu, N. Trigoni, and A. Markham, “Milli-rio: Ego-motion estimation with millimetre-wave radar and inertial measurement unit sensor,” arXiv preprint arXiv:1909.05774, 2019.
-  C. Chen, S. Rosa, Y. Miao, C. X. Lu, W. Wu, A. Markham, and N. Trigoni, “Selective sensor fusion for neural visual-inertial odometry,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 542–10 551.
-  A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black, “Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 240–12 249.
-  G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  M. R. U. Saputra, P. P. de Gusmao, C. X. Lu, Y. Almalioglu, S. Rosa, C. Chen, J. Wahlström, W. Wang, A. Markham, and N. Trigoni, “Deeptio: A deep thermal-inertial odometry with visual hallucination,” arXiv preprint arXiv:1909.07231, 2019.
-  N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in European conference on computer vision. Springer, 2016, pp. 286–301.
-  M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
-  A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” in International Conference on Machine Learning, 2016, pp. 1558–1566.
-  C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European Conference on Computer Vision. Springer, 2016, pp. 702–716.
-  L. Gatys, A. S. Ecker, and M. Bethge, “Texture synthesis using convolutional neural networks,” in Advances in neural information processing systems, 2015, pp. 262–270.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
-  D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366–2374.
-  F. Liu, C. Shen, G. Lin, and I. D. Reid, “Learning depth from single monocular images using deep convolutional neural fields.” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2024–2039, 2016.
-  Y. Zou, Z. Luo, and J.-B. Huang, “Df-net: Unsupervised joint learning of depth and flow using cross-task consistency,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 36–53.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  M. Faessler, F. Fontana, C. Forster, E. Mueggler, M. Pizzoli, and D. Scaramuzza, “Autonomous, vision-based flight and live dense 3d mapping with a quadrotor micro aerial vehicle,” Journal of Field Robotics, vol. 33, no. 4, pp. 431–450, 2016.
-  C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2016.
-  S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 4, pp. 376–380, 1991.
-  A. T. Wood, “Simulation of the von mises fisher distribution,” Communications in statistics-simulation and computation, vol. 23, no. 1, pp. 157–164, 1994.
-  S. Lynen, M. W. Achtelik, S. Weiss, M. Chli, and R. Siegwart, “A robust and modular multi-sensor fusion approach applied to mav navigation,” in 2013 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2013, pp. 3923–3929.