Simultaneous localization and mapping (SLAM) and visual odometry (VO) play a vital role in many real-world applications, such as autonomous driving, robotics and mixed reality. Classic SLAM/VO methods [13, 14, 17, 29] perform well in regular scenes but fail in challenging conditions (e.g. dynamic objects, occlusions, textureless regions) due to their reliance on low-level features. Since deep learning can extract high-level features and infer in an end-to-end fashion, learning-based VO methods [22, 40, 41, 44] have been proposed in recent years to alleviate the limitations of classic hand-engineered algorithms.
However, learning-based VO suffers from a notable decrease in accuracy when confronted with scenes different from the training dataset [8, 37] (Fig. 1). When a pre-trained VO network is applied to the open world, its inability to generalize to new scenes presents a serious problem for practical applications. This requires the VO network to continuously adapt to the new environment.
Unlike fine-tuning a pre-trained network with ground truth data on the target domain [37], running VO in the open world makes it unlikely that enough data can be collected in advance. The network must therefore adapt itself in real time to changing environments. In this online learning setting, there is no explicit distinction between training and testing phases: we learn as we perform. This is very different from conventional learning methods, where a pre-trained model is fixed during inference.
During online adaptation, the VO network can only learn from the current data instead of the entire training data with batch training and multiple epochs. The learning objective is to find an optimal model that is well adapted to the current data. However, because of the limited temporal perceptive field, the current optimal model may not be well suited for subsequent frames. This makes the optimal parameters oscillate with time, leading to slow convergence during online adaptation [9, 11, 20].
To address these issues, we propose an online meta-learning scheme for self-supervised VO that achieves online adaptation. The proposed method motivates the network to perform consistently well at different times by incorporating the online adaptation process into the learning objective. Besides, past experience can be used to accelerate adaptation to a new environment. Therefore, instead of learning only from the current data, we employ a convolutional long short-term memory (convLSTM) to aggregate rich spatial-temporal information in the video, which enables the network to use past experience for better estimation and to adapt quickly to the current frame. To achieve fast adaptation in changing environments, we propose a feature alignment method to align non-stationary feature distributions at different times. The proposed network automatically adapts to changing environments without ground truth data collected in advance for external supervision. Our contributions can be summarized as follows:
- We propose an online meta-learning algorithm for VO to continuously adapt to unseen environments in a self-supervised manner.
- The VO network utilizes past experience incorporated by convLSTM to achieve better estimation and adapt quickly to the current frame.
- We propose a feature alignment method to deal with the changing data distributions in the open world.
- Our VO network achieves 32 FPS on a GeForce 1080Ti GPU with online refinement, allowing it to adapt in real time for practical applications. We evaluate our algorithm across different domains, including outdoor, indoor and synthetic environments, where it consistently outperforms state-of-the-art self-supervised VO baselines.
2 Related works
Learning-based VO has been widely studied in recent years with the advent of deep learning, and many methods with promising results have been proposed. Inspired by the framework of parallel tracking and mapping in classic SLAM/VO, DeepTAM [43] utilizes two networks for pose and depth estimation simultaneously. DeepVO [38] uses a recurrent neural network (RNN) to leverage sequential correlations and estimate poses recurrently. However, these methods require ground truth, which is expensive or impractical to obtain. To avoid the need for annotated data, self-supervised VO has recently been developed. SfMLearner [44] utilizes the 3D geometric constraint between pose and depth and learns by minimizing a photometric loss. Yin et al. [41] and Ranjan et al. [32] extend this idea to the joint estimation of pose, depth and optical flow to handle non-rigid cases that violate the static-scene assumption. These methods focus on mimicking local structure from motion (SfM) with image pairs, but fail to exploit spatial-temporal correlations over long sequences. SAVO [22] formulates VO as a sequential generative task and utilizes an RNN to reduce scale drift significantly. In this paper, we adopt the same idea as SfMLearner [44] and SAVO [22].
Most machine learning models suffer from a significant reduction in performance when the test data differ from the training set. An effective solution to alleviate this domain shift issue is online learning, where data are processed sequentially and the data distribution changes continuously. Previous methods use online gradient updates [12] and probabilistic filtering [6]. Recently, domain adaptation has been widely studied in computer vision. Long et al. [23] propose a Maximum Mean Discrepancy loss to reduce the domain shift. Several works [5, 33] utilize Generative Adversarial Networks (GAN) to directly transfer images in the target domain to the source domain (e.g. day to night or winter to summer). Inspired by [5, 7], we propose a feature alignment method for online adaptation.
Meta-learning, or learning to learn, is a long-standing interest in machine learning. It exploits inherent structures in data to learn more effective learning rules for fast domain adaptation [27, 35]. A popular approach is to train a meta-learner that learns how to update the network [4, 15]. Finn et al. [15, 16] proposed Model-Agnostic Meta-Learning (MAML), which constrains the learning rule for the model and uses stochastic gradient descent to quickly adapt networks to new tasks. This simple yet effective formulation has been widely used to adapt deep networks to unseen environments [1, 2, 21, 30, 39]. Our proposed method is most closely related to MAML and extends it to the self-supervised, online learning setting.
3 Problem setup
3.1 Self-supervised VO
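Our self-supervised VO follows a similar idea to SfMLearner [44] and SAVO [22] (shown in Fig. 2). The DepthNet predicts the depth $\hat{D}_t$ of the current frame $I_t$. The PoseNet takes the stacked monocular images $I_{t-1}$ and $I_t$ to regress the relative pose $\hat{T}_{t \to t-1}$. Then view synthesis is applied to reconstruct $\hat{I}_t$ by differentiable image warping:
$$p_{t-1} \sim K \, \hat{T}_{t \to t-1} \, \hat{D}_t(p_t) \, K^{-1} p_t \tag{1}$$
where $p_{t-1}$ and $p_t$ are the homogeneous coordinates of a pixel in $I_{t-1}$ and $I_t$, respectively, and $K$ denotes the camera intrinsics. The MaskNet predicts a per-pixel mask $\hat{M}$ according to the warping residuals between $\hat{I}_t$ and $I_t$.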
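As a concrete illustration of the warping step, the sketch below maps one pixel from the current frame into the previous frame by back-projecting it with the inverse intrinsics and its depth, applying the relative pose, and projecting it again. The function name, the 4x4 pose convention and all numeric values are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np

def reproject_pixel(p_t, depth, K, T):
    """Map pixel p_t = (u, v) in frame t to pixel coordinates in frame t-1.

    depth: predicted depth at p_t.
    K:     3x3 camera intrinsics.
    T:     4x4 relative pose taking frame-t camera coordinates to frame t-1
           (assumed convention).
    """
    p_h = np.array([p_t[0], p_t[1], 1.0])        # homogeneous pixel
    cam_t = depth * (np.linalg.inv(K) @ p_h)     # back-project to a 3D point
    cam_prev = T[:3, :3] @ cam_t + T[:3, 3]      # express the point in frame t-1
    proj = K @ cam_prev                          # project with the intrinsics
    return proj[:2] / proj[2]                    # dehomogenize

# Hypothetical intrinsics and poses.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
identity = np.eye(4)
shift = np.eye(4)
shift[0, 3] = 0.5  # 0.5 units of translation along x

p_same = reproject_pixel((70.0, 40.0), 10.0, K, identity)
p_moved = reproject_pixel((70.0, 40.0), 10.0, K, shift)
```

With the identity pose the pixel maps to itself; with the translated pose it shifts horizontally by an amount that shrinks as the depth grows, which is the signal the photometric loss exploits.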
3.2 Online adaptation
As shown in Fig. 1, the performance of VO networks is fundamentally limited by their generalization ability when confronted with scenes different from the training data. The reason is that they are designed under a closed-world assumption: the training data and test data are i.i.d. samples from a common dataset with a fixed distribution. However, when running a pre-trained VO network in the open world, images are continuously collected in changing scenes. In this sense, the training and test data no longer share similar visual appearances, and the data at the current view may differ from previous views. This requires the network to adapt online to changing environments.
Given a model $\theta$ pretrained on $\mathcal{D}$, a naive approach for online learning is to update the parameters by computing the loss on the current data $\mathcal{D}_i$:
$$\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i} \mathcal{L}(\theta_i, \mathcal{D}_i) \tag{2}$$
where $\theta_0 = \theta$ and $\alpha$ is the learning rate. Despite its simplicity, this approach has several drawbacks. The temporal perceptive field of the learning objective is 1, which means it accounts only for the current input and has no correlation with previous data. The optimal solution for the current $\mathcal{D}_i$ is likely to be unsuitable for subsequent inputs. Therefore, the gradients at different iterations are stochastic and lack consistency [9, 26]. This leads to slow convergence and may introduce negative bias in the learning procedure.
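The naive update can be sketched on a toy problem. The block below is a minimal illustration, assuming a quadratic stand-in loss whose minimizer drifts from window to window; the loss, the target values and the learning rate are hypothetical and only serve to show that each step looks at the current data alone.

```python
import numpy as np

def loss(theta, target):
    """Toy stand-in for the self-supervised loss on the current data."""
    return 0.5 * float(np.sum((theta - target) ** 2))

def grad(theta, target):
    """Analytic gradient of the toy loss with respect to theta."""
    return theta - target

def naive_online_step(theta, target, alpha):
    """One plain gradient step computed only from the current data."""
    return theta - alpha * grad(theta, target)

theta = np.zeros(2)                      # stands in for the pretrained weights
targets = [np.array([1.0, 0.0]),         # minimizer of the loss at each step
           np.array([1.0, 0.5]),
           np.array([1.5, 0.5])]
alpha = 0.5

history = []
for target in targets:
    theta = naive_online_step(theta, target, alpha)
    history.append(theta.copy())
```

Each update chases the current minimizer without regard to where the next one will be, which is the limitation of a temporal perceptive field of 1 described above.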
To address these issues, we propose to exploit correlations across time for fast online adaptation. Our framework is illustrated in Fig. 2. The VO network takes $N$ consecutive frames in the sliding window $\mathcal{D}_i$ to estimate pose and depth in a self-supervised manner (Sec. 3.1). Then it is updated according to the loss $\mathcal{L}$ and performs inference on the frames $\mathcal{D}_{i+1}$ at the next time. The network learns to find a set of weights $\theta_i$ that performs well on both $\mathcal{D}_i$ and $\mathcal{D}_{i+1}$ (Sec. 4.1). During online learning, spatial-temporal information is incorporated by convLSTM (Sec. 4.2) and feature alignment is adopted (Sec. 4.3) for fast adaptation.
4.1 Self-supervised online meta-learning
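In contrast to $\mathcal{L}(\theta_i, \mathcal{D}_i)$, we extend the online learning objective to $\mathcal{L}(\theta_{i+1}, \mathcal{D}_{i+1})$, which can be written as:
$$\min_{\theta_i} \mathcal{L}\big(\theta_i - \alpha \nabla_{\theta_i} \mathcal{L}(\theta_i, \mathcal{D}_i),\ \mathcal{D}_{i+1}\big) \tag{3}$$
Different from naive online learning, the temporal perceptive field of Eq. 3 becomes 2. It optimizes the performance on $\mathcal{D}_{i+1}$ after adapting to the task on $\mathcal{D}_i$. The insight is that, instead of minimizing the training error on the current iteration $i$, we try to minimize the test error on the next iteration. Our formulation directly incorporates online adaptation into the learning objective, which motivates the network to learn at $i$ so as to perform better at the next time $i+1$.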
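Our objective of learning to adapt is similar in spirit to that of Model-Agnostic Meta-Learning (MAML) [15]:
$$\min_{\theta} \sum_{\tau \in \mathcal{T}} \mathcal{L}\big(\theta - \alpha \nabla_{\theta} \mathcal{L}(\theta, \mathcal{D}_{\tau}^{train}),\ \mathcal{D}_{\tau}^{val}\big) \tag{4}$$
which aims to minimize the evaluation (adaptation) error on the validation set instead of minimizing the training error on the training set. $\tau$ denotes tasks sampled from the task set $\mathcal{T}$. More details of MAML can be found in [15].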
As a nested optimization problem, our objective function is optimized via a two-stage gradient descent. At each iteration $i$, we take $N$ consecutive frames in the sliding window as a mini-dataset $\mathcal{D}_i$ (shown within the blue area in Fig. 2):
$$\mathcal{D}_i = \{I_t, I_{t-1}, \dots, I_{t-N+1}\} \tag{5}$$
In the inner loop of Eq. 3, we evaluate the performance of VO on $\mathcal{D}_i$ by the self-supervised loss $\mathcal{L}$ and update the parameters $\theta_i$ according to Eq. 2. Then, in the outer loop, we evaluate the performance of the updated model $\theta_{i+1}$ on the subsequent frames $\mathcal{D}_{i+1}$. We mimic this continuous adaptation process in both the training and online test phases. During training, we minimize the sum of losses given by Eq. 3 across all sequences in the training dataset, which motivates the network to learn base weights $\theta$ that enable fast online adaptation.
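The two-stage update can be sketched with the same kind of toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic loss stands in for the self-supervised loss, its Hessian is the identity so the chain-rule factor is written out by hand, and the names `meta_outer_gradient`, `meta_step` and the outer step size `eta` are hypothetical.

```python
import numpy as np

def loss(theta, target):
    """Toy stand-in for the self-supervised loss on one sliding window."""
    return 0.5 * float(np.sum((theta - target) ** 2))

def grad(theta, target):
    """Analytic gradient of the toy loss with respect to theta."""
    return theta - target

def meta_outer_gradient(theta, d_cur, d_next, alpha):
    """Gradient of L(theta - alpha * grad L(theta, d_cur), d_next) w.r.t. theta."""
    theta_adapted = theta - alpha * grad(theta, d_cur)  # inner loop (Eq. 2)
    g_next = grad(theta_adapted, d_next)                # evaluated on the next window
    # Chain rule: d(theta_adapted)/d(theta) = I - alpha * Hessian.
    # The toy loss has Hessian = I, so the Jacobian is (1 - alpha) * I.
    return (1.0 - alpha) * g_next

def meta_step(theta, d_cur, d_next, alpha, eta):
    """One outer-loop update of the base weights."""
    return theta - eta * meta_outer_gradient(theta, d_cur, d_next, alpha)

theta = np.zeros(2)
windows = [np.array([1.0, 0.0]), np.array([1.0, 0.5]), np.array([1.5, 0.5])]
for d_cur, d_next in zip(windows[:-1], windows[1:]):
    theta = meta_step(theta, d_cur, d_next, alpha=0.5, eta=0.5)
```

In a real network the Jacobian factor would come from automatic differentiation through the inner update rather than from a closed form; the structure of the two loops is the point of the sketch.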
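To provide more intuition on what the network learns and the reason for fast adaptation, we take the Taylor expansion of our training objective:
$$\min_{\theta_i} \mathcal{L}\big(\theta_i - \alpha \nabla_{\theta_i} \mathcal{L}(\theta_i, \mathcal{D}_i),\ \mathcal{D}_{i+1}\big) \approx \min_{\theta_i}\ \mathcal{L}(\theta_i, \mathcal{D}_{i+1}) - \alpha \big\langle \nabla_{\theta_i} \mathcal{L}(\theta_i, \mathcal{D}_i),\ \nabla_{\theta_i} \mathcal{L}(\theta_i, \mathcal{D}_{i+1}) \big\rangle + \frac{\alpha^2}{2}\, \nabla_{\theta_i} \mathcal{L}(\theta_i, \mathcal{D}_i)^{\top} H_{\theta_i} \nabla_{\theta_i} \mathcal{L}(\theta_i, \mathcal{D}_i) + \dots \tag{6}$$
where $H_{\theta_i}$ denotes the Hessian matrix of $\mathcal{L}(\theta_i, \mathcal{D}_{i+1})$ and $\langle \cdot, \cdot \rangle$ denotes the inner product. Since most neural networks use ReLU activations, the networks are locally linear, so the second-order derivative equals 0 in most cases. Therefore, $H_{\theta_i}$ and the higher-order terms are omitted.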
As shown in Eq. 6, the network learns to minimize the prediction error $\mathcal{L}(\theta_i, \mathcal{D}_{i+1})$ while maximizing the similarity between the gradients at $\mathcal{D}_i$ and $\mathcal{D}_{i+1}$. Since the camera is continuously moving, the scenes may vary over time. Naive online learning treats different scenes independently by fitting only the current scene, but ignores that the ways to perform VO in different scenes are similar. As the gradient indicates the direction in which to update the network, this leads to inconsistent gradients across iterations and slow convergence. In contrast, the second term enforces consistent gradient directions by aligning the gradient for $\mathcal{D}_{i+1}$ with previous information, indicating that we are training the network $\theta_i$ to perform consistently well on both $\mathcal{D}_i$ and $\mathcal{D}_{i+1}$. This meta-learning scheme alleviates the stochastic gradient problem in online learning. Eq. 6 describes the dynamics of sequential learning in non-stationary scenes. The network learns to adjust $\theta_i$ at the current state by $\mathcal{L}(\theta_i, \mathcal{D}_i)$ to perform better at the next time. Consequently, the learned $\theta$ is less sensitive to the non-stationary data distributions of sequential inputs, enabling fast adaptation to unseen environments.
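The gradient-alignment reading of the expansion can be checked numerically. The sketch below assumes a toy quadratic loss (hypothetical, chosen because its Hessian is the identity) and compares the exact post-update loss on the next window with its first-order approximation, the loss at the current weights minus the step size times the inner product of the two gradients. Unlike the locally linear ReLU case discussed above, the toy loss keeps a visible second-order term, so it is computed explicitly.

```python
import numpy as np

def loss(theta, target):
    """Toy quadratic stand-in for the loss on one window (Hessian = I)."""
    return 0.5 * float(np.sum((theta - target) ** 2))

def grad(theta, target):
    """Analytic gradient of the toy loss with respect to theta."""
    return theta - target

theta = np.array([0.2, -0.1])
d_cur = np.array([1.0, 0.0])
d_next = np.array([1.0, 0.5])
alpha = 0.1

g_cur = grad(theta, d_cur)
g_next = grad(theta, d_next)

# Exact value of the meta objective after one inner step.
exact = loss(theta - alpha * g_cur, d_next)
# First-order expansion: current loss minus alpha times the gradient inner product.
first_order = loss(theta, d_next) - alpha * float(g_cur @ g_next)
# Second-order term for a loss whose Hessian is the identity.
second_order_term = 0.5 * alpha ** 2 * float(g_cur @ g_cur)
# Cosine similarity of the two gradients; positive means they agree in direction.
cosine = float(g_cur @ g_next) / (np.linalg.norm(g_cur) * np.linalg.norm(g_next))
```

When the two gradients point in similar directions the inner product is positive and the first-order value drops below the unadapted loss, which is why minimizing the objective rewards consistent gradients across consecutive windows.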
4.2 Spatial-temporal aggregation
As stated in Sec. 1, online learning suffers from slow convergence due to the inherent limitation of the temporal perceptive field. To make online updating more effective, we let the network perform the current estimation based on previous information. Besides, predicting pose from image pairs alone is prone to error accumulation. This trajectory drift problem can be mitigated by exploiting spatial-temporal correlations over long sequences [22, 40].
In this paper, we use convolutional LSTM (convLSTM) to achieve fast adaptation and reduce accumulated error. As shown in Fig. 3, we embed recurrent units into the encoders of DepthNet and PoseNet to allow the convolutional network to leverage not only spatial but also temporal information for depth and pose estimation. The length of the convLSTM is the number of frames in the sliding window $\mathcal{D}_i$. ConvLSTM acts as the memory of the network. As new frames are processed, the network is able to memorize and learn from its past experience, so as to update its parameters and quickly adapt to unseen environments. This approach not only enforces correlations among different time steps, but also learns the temporally dynamic nature of the moving camera from video inputs.
4.3 Feature alignment
One basic assumption of conventional machine learning is that the training and test data are independent and identically distributed (i.i.d.) samples from the same distribution. However, this assumption does not hold when running VO in the open world, since the test data (target domain) are usually different from the training data (source domain). Besides, as the camera is continuously moving in the changing environment, the captured scenes also vary over time. As highlighted in [7, 25], aligning the feature distributions of two domains improves performance in domain adaptation.
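Inspired by [7], we extend this domain adaptation method to the online learning setting by aligning feature distributions at different times. When training on the source domain, we collect the statistics of the features $f_j \in \{f_1, \dots, f_n\}$ in a feature map tensor by Layer Normalization (LN):
$$\mu_s = \frac{1}{n} \sum_{j=1}^{n} f_j, \qquad \sigma_s^2 = \frac{1}{n} \sum_{j=1}^{n} (f_j - \mu_s)^2, \qquad n = H \times W \times C \tag{7}$$
where $H$, $W$ and $C$ are the height, width and number of channels of each feature map. When adapted to the target domain, we initialize the feature statistics at $i = 0$:
$$\mu_0 = \mu_s, \qquad \sigma_0^2 = \sigma_s^2 \tag{8}$$
Then at each iteration $i$, the feature statistics $\hat{\mu}_i, \hat{\sigma}_i^2$ are computed by Eq. 7. Given the previous statistics $\mu_{i-1}, \sigma_{i-1}^2$, the feature distribution at $i$ is aligned by:
$$\mu_i = (1 - \beta)\, \mu_{i-1} + \beta\, \hat{\mu}_i, \qquad \sigma_i^2 = (1 - \beta)\, \sigma_{i-1}^2 + \beta\, \hat{\sigma}_i^2 \tag{9}$$
where $\beta$ is a hyperparameter. After feature alignment, the features $f_j$ are normalized to $\hat{f}_j$:
$$\hat{f}_j = \gamma\, \frac{f_j - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}} + \delta \tag{10}$$
where $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\delta$ are the learnable scale and shift in normalization layers [3].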
The insight of this approach is to enforce correlation between non-stationary feature distributions in changing environments. Learning algorithms perform well when the feature distribution of the test data is the same as that of the training data. When moving to a new environment, although the extracted features are different, we deem that the feature distributions of the two domains should be the same (Eq. 8). Although the view changes when running VO in the open world, $\mathcal{D}_i$ and $\mathcal{D}_{i+1}$ are observed continuously in time, so their feature distributions should be similar (Eq. 9). This feature normalization and alignment approach acts as a regularizer that simplifies the learning process and makes the learned weights consistent across non-stationary environments.
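The alignment procedure can be sketched in a few lines. This is a minimal illustration assuming that statistics are taken over a whole feature tensor, that old and new statistics are blended linearly with a single hyperparameter `beta`, and that the scale and shift are scalars; the function names and the random feature tensors are hypothetical.

```python
import numpy as np

def layer_stats(features):
    """Mean and variance over all H x W x C entries of one feature tensor."""
    return float(features.mean()), float(features.var())

def align_stats(prev, cur, beta):
    """Blend the previous statistics with those of the current iteration."""
    mu = (1.0 - beta) * prev[0] + beta * cur[0]
    var = (1.0 - beta) * prev[1] + beta * cur[1]
    return mu, var

def normalize(features, stats, gamma=1.0, delta=0.0, eps=1e-5):
    """Normalize features with the aligned statistics, then scale and shift."""
    mu, var = stats
    return gamma * (features - mu) / np.sqrt(var + eps) + delta

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(4, 8, 16))   # features seen during training
target = rng.normal(2.0, 3.0, size=(4, 8, 16))   # features in the new environment

stats = layer_stats(source)                       # initialize from the source domain
stats = align_stats(stats, layer_stats(target), beta=0.5)
aligned = normalize(target, stats)
```

Because the statistics move only part of the way toward the new data at each step, the normalized features change smoothly as the scene changes instead of jumping with every frame.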
4.4 Loss functions
Our self-supervised loss is the same as in most previous methods. It consists of:
Appearance loss. We measure the reconstructed image by a photometric loss and the structural similarity metric (SSIM). A regularization term on the mask prevents the learned mask from converging to a trivial solution. The filter size of SSIM is set to 5×5 and the balancing weight between the two terms is set to 0.85.
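One common way to combine a masked photometric term with SSIM is sketched below. It is an assumption-laden illustration rather than the paper's exact loss: the SSIM here uses a plain 5×5 mean filter without padding, the constants `c1` and `c2` are the usual choices for images in the range [0, 1], the weight 0.85 is placed on the SSIM term, and the mask regularizer is omitted. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(x, y, win=5, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-window SSIM of two grayscale images using a win x win mean filter."""
    xw = sliding_window_view(x, (win, win))
    yw = sliding_window_view(y, (win, win))
    mu_x = xw.mean(axis=(-2, -1))
    mu_y = yw.mean(axis=(-2, -1))
    var_x = (xw * xw).mean(axis=(-2, -1)) - mu_x * mu_x
    var_y = (yw * yw).mean(axis=(-2, -1)) - mu_y * mu_y
    cov = (xw * yw).mean(axis=(-2, -1)) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return num / den

def appearance_loss(pred, target, mask, alpha=0.85):
    """Weighted sum of a masked L1 term and a structural dissimilarity term."""
    l1 = float(np.mean(mask * np.abs(pred - target)))
    dssim = float(np.mean(np.clip((1.0 - ssim_map(pred, target)) / 2.0, 0.0, 1.0)))
    return (1.0 - alpha) * l1 + alpha * dssim

rng = np.random.default_rng(0)
image = rng.random((16, 16))
mask = np.ones_like(image)
loss_identical = appearance_loss(image, image, mask)
loss_inverted = appearance_loss(image, 1.0 - image, mask)
```

A perfect reconstruction gives zero loss, while any photometric or structural mismatch raises it; the mask lets the network down-weight pixels that violate the static-scene assumption.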
Depth regularization. We introduce an edge-aware loss to enforce discontinuity and local smoothness in depth.
The self-supervised loss is thus the weighted sum of the appearance loss and the depth regularization.
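An edge-aware smoothness term is commonly written as depth gradients weighted by the negative exponential of image gradients. The sketch below follows that common form as an assumption; the exact formulation used in the paper may differ, and the function name and test images are hypothetical.

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """Penalize depth gradients, down-weighted where the image has strong edges."""
    dx_depth = np.abs(depth[:, 1:] - depth[:, :-1])
    dy_depth = np.abs(depth[1:, :] - depth[:-1, :])
    dx_image = np.abs(image[:, 1:] - image[:, :-1])
    dy_image = np.abs(image[1:, :] - image[:-1, :])
    loss_x = np.mean(dx_depth * np.exp(-dx_image))
    loss_y = np.mean(dy_depth * np.exp(-dy_image))
    return float(loss_x + loss_y)

# A depth map with one vertical discontinuity.
depth = np.zeros((8, 8))
depth[:, 4:] = 1.0

flat_image = np.zeros((8, 8))          # no image edge at the discontinuity
edge_image = np.zeros((8, 8))
edge_image[:, 4:] = 1.0                # image edge aligned with the depth edge

loss_flat = edge_aware_smoothness(depth, flat_image)
loss_edge = edge_aware_smoothness(depth, edge_image)
loss_constant = edge_aware_smoothness(np.ones((8, 8)), flat_image)
```

The same depth discontinuity costs less when it coincides with an image edge, so the term smooths depth inside objects while tolerating jumps at their boundaries.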
5.1 Implementation details
The architecture of our network is shown in Fig. 3. The DepthNet uses a U-shaped architecture similar to [44]. The PoseNet is split into 2 parts followed by fully-connected layers to regress the Euler angles and translations of the 6-DoF pose, respectively. The length of the convLSTM is set to 9. Layer Normalization and ReLUs are adopted in each layer except for the output layers. The detailed network architecture can be found in the supplementary materials.
Our model is implemented in PyTorch [31] on a single NVIDIA GTX 1080Ti GPU. All sub-networks are jointly trained in a self-supervised manner. Images are resized to 128×416 during both training and online adaptation. The Adam [19] optimizer is used with weight decay. The three loss weighting factors are set to 0.01, 1 and 0.5, respectively. The feature alignment hyperparameter is set to 0.5. The batch size is 4 for training and 1 for online adaptation. The learning objective (Eq. 3) is used for both training and online adaptation. We pre-train the network for 20,000 iterations. The learning rates of the inner loop and outer loop are initialized to the same value and halved every 5,000 iterations.
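The step schedule described above, in which the learning rate is halved every 5,000 iterations, can be written as a one-line function. The initial learning rate in the example is a hypothetical placeholder chosen only for illustration, and the function name is an assumption.

```python
def step_halved_lr(initial_lr, iteration, period=5000):
    """Learning rate after `iteration` updates when halved every `period` updates."""
    return initial_lr * 0.5 ** (iteration // period)

initial_lr = 1e-4  # hypothetical starting value, chosen only for illustration
schedule = {it: step_halved_lr(initial_lr, it)
            for it in (0, 4999, 5000, 10000, 19999)}
```

Over the 20,000 pre-training iterations this gives four plateaus, ending at one eighth of the starting value.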
5.2 Outdoor KITTI
First, we test our method on the KITTI odometry [18] dataset. It contains 11 driving scenes with ground truth poses. We follow the same train/test split as [22, 41, 44], using sequences 00-08 for training and 09-10 for online testing.
|Method|Seq. 09 translation error|Seq. 09 rotation error|Seq. 10 translation error|Seq. 10 rotation error|
|Zhan et al. [42]|11.89|3.62|12.82|3.40|
Instead of calculating the absolute trajectory error (ATE) on image pairs as in previous methods, we recover full trajectories and compute the translation error and rotation error with the KITTI evaluation toolkit. We compare our method with several state-of-the-art self-supervised VO baselines: SfMLearner [44], GeoNet [41], Zhan et al. [42], Vid2Depth [24] and SAVO [22]. Following common practice in monocular VO, a scaling factor is used to align trajectories with the ground truth to resolve the scale ambiguity. The estimated trajectories of sequences 09-10 are plotted in Fig. 4 and quantitative evaluations are shown in Table 1. Our method outperforms all the other baselines by a clear margin, and the accumulated error is reduced by online adaptation.
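The scale alignment mentioned above is often done with a single least-squares factor between predicted and ground-truth positions. The sketch below shows that common choice as an assumption; the function name and the toy trajectory are hypothetical, and the KITTI toolkit's own error metrics are not reproduced here.

```python
import numpy as np

def align_scale(pred_xyz, gt_xyz):
    """Least-squares scale s minimizing ||s * pred - gt||^2, and the scaled prediction."""
    scale = float(np.sum(gt_xyz * pred_xyz) / np.sum(pred_xyz ** 2))
    return scale, scale * pred_xyz

# Toy trajectory: the prediction has the right shape but half the true scale.
gt = np.array([[0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.1, 0.0, 2.0],
               [0.3, 0.0, 3.0]])
pred = 0.5 * gt

scale, aligned = align_scale(pred, gt)
rmse = float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

After alignment, any remaining error reflects the shape of the trajectory rather than the unknown metric scale of monocular estimation.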
The comparison of the running speed with other VO methods can be found in Table 2. Since we are studying the online learning problem, the running time includes forward propagation, loss computing, back propagation and network updating. Our method achieves real-time online adaptation and outperforms state-of-the-art baselines considerably.
5.3 Synthetic to real
Synthetic datasets (e.g. virtual KITTI, Synthia and Carla) have been widely used for research since they provide ground truth labels and controllable environment settings. However, there is a large gap between synthetic and real-world data. To test the domain adaptation ability, we use the Carla simulator [10] to collect synthetic images under different weather conditions in the virtual city for training, and use KITTI 00-10 for online testing.
It can be seen from Table 3 that previous methods all fail when shifted to real-world environments. This is probably because the features of virtual scenes are very different from those of the real world, even though both are collected in driving scenarios. In contrast, our method significantly outperforms previous methods: it is able to bridge the domain gap and quickly adapt to the real-world data.
|SfMLearner [44]|Vid2Depth [24]|Zhan et al. [42]|GeoNet [41]|SAVO [22]|Ours|
5.4 Outdoor KITTI to indoor TUM
To further evaluate the adaptability of our method, we test various baselines on the TUM-RGBD [34] dataset. KITTI is captured by moving cars with planar motion, high-quality images and sufficient disparity. In contrast, the TUM dataset is collected by handheld cameras in indoor scenes with much more complicated motion patterns, which makes it significantly different from KITTI. It includes various challenging conditions (Fig. 6) such as dynamic objects, textureless scenes, abrupt motions and large occlusions.
We pretrain these methods on KITTI 00-08 and test on the TUM dataset. Although ground truth depth is available, we only use monocular RGB images during testing. It can be seen (Table 4 and Fig. 6) that our method consistently outperforms all the other baselines. Despite the large domain shift and the significant difference in motion patterns (i.e. large, planar motion vs. small motion along 3 axes), our method can still recover trajectories well. In contrast, GeoNet [41] and Zhan et al. [42] tend to fail. Although SAVO [22] utilizes LSTM to alleviate accumulated error to some extent, our method performs better due to online adaptation.
|Sequence|Structure|Texture|Abrupt motion|Zhan et al. [42]|GeoNet [41]|SAVO [22]|Ours|
5.5 Ablation studies
In order to demonstrate the effectiveness of each component, we present ablation studies on various versions of our method on KITTI dataset (shown in Table 5).
First, we evaluate the backbone of our method (the first row), which includes convLSTM and feature alignment but no meta-learning process during training and online testing. It can be seen from Table 1 and Table 5 that, even without meta-learning and online adaptation, our network backbone still outperforms most previous methods. The results indicate that convLSTM is able to reduce accumulated error and that feature alignment improves the performance when confronted with unseen environments.
|Seq. 09||Seq. 10|
Then we compare the efficiency of naive online learning (the second row) and meta-learning (the last row). It can be seen that, although naive online learning is able to reduce estimation error to some extent, it converges much slower than the meta-learning scheme, indicating that it takes much longer time to adapt the network to the new environment.
Finally, we study the effect of convLSTM and feature alignment during meta-learning (last four rows). Compared with the baseline meta-learning scheme, convLSTM and feature alignment give the VO performance a further boost. Besides, convLSTM tends to perform better than feature alignment during online adaptation. One possible explanation is that convLSTM incorporates spatial-temporal correlations and past experience over long sequences. It associates different states recurrently, making the gradient computation graph more densely connected during back propagation. Meanwhile, convLSTM correlates the VO network at different times, forcing it to learn a set of weights that are consistent in the dynamic environment.
Besides, we study how the size $N$ of the sliding window influences the VO performance. Changing $N$ has little impact on the running speed (30-32 FPS), but as $N$ increases, the adaptation gets faster and better. When $N$ is greater than 15, the adaptation speed and accuracy decrease. Therefore, we set $N$ to the value that performs best.
In this paper, we propose an online meta-learning scheme for self-supervised VO to achieve fast online adaptation in the open world. We use convLSTM to aggregate spatial-temporal information in the past, enabling the network to use past experience for better estimation and fast adaptation to the current frame. Besides, we put forward a feature alignment method to deal with changing feature distributions in the unconstrained open world setting. Our network dynamically evolves in time to continuously adapt to changing environments on-the-fly. Extensive experiments on outdoor, virtual and indoor datasets demonstrate that our network with online adaptation ability outperforms state-of-the-art self-supervised VO methods.
Acknowledgments. This work is supported by the National Key Research and Development Program of China (2017YFB1002601) and the National Natural Science Foundation of China (61632003, 61771026).
- [1] (2018) Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In ICLR.
- [2] (2016) Learning to Learn by Gradient Descent by Gradient Descent. In NeurIPS.
- [3] (2016) Layer Normalization. arXiv preprint arXiv:1607.06450.
- [4] (1992) On the Optimization of a Synaptic Learning Rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8.
- [5] (2017) Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks. In CVPR.
- [6] (2013) Streaming Variational Bayes. In NeurIPS.
- [7] (2017) AutoDIAL: Automatic Domain Alignment Layers. In ICCV.
- [8] (2019) Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos. In AAAI.
- [9] (2019) Self-Supervised GANs via Auxiliary Rotation Loss. In CVPR.
- [10] (2017) CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16.
- [11] (2018) Online Deep Learning: Learning Deep Neural Networks on the Fly. In IJCAI.
- [12] (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159.
- [13] (2018) Direct Sparse Odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 611–625.
- [14] (2014) LSD-SLAM: Large-Scale Direct Monocular SLAM. In ECCV.
- [15] (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML.
- [16] (2018) Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate Any Learning Algorithm. In ICLR.
- [17] (2014) SVO: Fast Semi-Direct Monocular Visual Odometry. In ICRA.
- [18] (2012) Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR.
- [19] (2015) Adam: A Method for Stochastic Optimization. In ICLR.
- [20] (2017) Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
- [21] (2017) Learning to Optimize. In ICLR.
- [22] (2019) Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry. In ICCV.
- [23] (2017) Deep Transfer Learning with Joint Adaptation Networks. In ICML.
- [24] (2018) Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In CVPR.
- [25] (2018) Kitting in the Wild through Online Domain Adaptation. In IROS.
- [26] (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
- [27] (2018) A Simple Neural Attentive Meta-Learner. In ICLR.
- [28] (2014) On the Number of Linear Regions of Deep Neural Networks. In NeurIPS.
- [29] (2015) ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics 31 (5), pp. 1147–1163.
- [30] (2019) Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning. In ICLR.
- [31] (2017) PyTorch. Note: https://github.com/pytorch/pytorch
- [32] (2019) Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation. In CVPR.
- [33] (2018) Generate to Adapt: Aligning Domains Using Generative Adversarial Networks. In CVPR.
- [34] (2012) A Benchmark for the Evaluation of RGB-D SLAM Systems. In IROS.
- [35] (1998) Learning to Learn: Introduction and Overview. pp. 3–17.
- [36] (1998) Lifelong Learning Algorithms. In Learning to Learn, pp. 181–209.
- [37] (2019) Real-Time Self-Adaptive Deep Stereo. In CVPR.
- [38] (2017) DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In ICRA.
- [39] (2019) Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning. In CVPR.
- [40] (2019) Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry. In CVPR.
- [41] (2018) GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In CVPR.
- [42] (2018) Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction. In CVPR.
- [43] (2018) DeepTAM: Deep Tracking and Mapping. In ECCV.
- [44] (2017) Unsupervised Learning of Depth and Ego-Motion from Video. In CVPR.