
Self-Supervised Deep Visual Odometry with Online Adaptation

Self-supervised VO methods have shown great success in jointly estimating camera pose and depth from videos. However, like most data-driven methods, existing VO networks suffer from a notable decrease in performance when confronted with scenes different from the training data, which makes them unsuitable for practical applications. In this paper, we propose an online meta-learning algorithm that enables VO networks to continuously adapt to new environments in a self-supervised manner. The proposed method utilizes convolutional long short-term memory (convLSTM) to aggregate rich spatial-temporal information from the past. The network is able to memorize and learn from its past experience for better estimation and fast adaptation to the current frame. To cope with changing environments when running VO in the open world, we further propose an online feature alignment method that aligns feature distributions at different times. As a result, our VO network is able to seamlessly adapt to different environments. Extensive experiments on unseen outdoor scenes, virtual-to-real-world and outdoor-to-indoor settings demonstrate that our method consistently and considerably outperforms state-of-the-art self-supervised VO baselines.


1 Introduction

Simultaneous localization and mapping (SLAM) and visual odometry (VO) play a vital role in many real-world applications, such as autonomous driving, robotics and mixed reality. Classic SLAM/VO methods [13, 14, 17, 29] perform well in regular scenes but fail in challenging conditions (e.g. dynamic objects, occlusions, textureless regions) due to their reliance on low-level features. Since deep learning is able to extract high-level features and infer in an end-to-end fashion, learning-based VO methods [22, 40, 41, 44] have been proposed in recent years to alleviate the limitations of classic hand-engineered algorithms.

Figure 1: We illustrate the domain shift problem for self-supervised VO. Previous methods fail to generalize when the test data differ from the training data. In contrast, our method performs well in changing environments, demonstrating the advantage of fast online adaptation.

However, learning-based VO suffers from a notable decrease in accuracy when confronted with scenes different from the training dataset [8, 37] (Fig. 1). When a pre-trained VO network is deployed in the open world, its inability to generalize to new scenes presents a serious problem for practical applications. This requires the VO network to continuously adapt to new environments.

Unlike fine-tuning a pre-trained network with ground truth data from the target domain [37], it is impractical to collect enough data in advance when running VO in the open world. The network must instead adapt itself in real time to changing environments. In this online learning setting, there is no explicit distinction between training and testing phases: we learn as we perform. This differs markedly from conventional learning, where a pre-trained model is fixed during inference.

During online adaptation, the VO network can only learn from the current data, rather than from the entire training set with batch training and multiple epochs [11]. The learning objective is to find an optimal model that is well adapted to the current data. However, because of the limited temporal perceptive field [26], the current optimal model may not be well suited to subsequent frames. This makes the optimal parameters oscillate over time, leading to slow convergence during online adaptation [9, 11, 20].

In order to address these issues, we propose an online meta-learning scheme for self-supervised VO that achieves online adaptation. The proposed method motivates the network to perform consistently well at different times by incorporating the online adaptation process into the learning objective. In addition, past experience can be used to accelerate adaptation to a new environment. Therefore, instead of learning only from the current data, we employ convolutional long short-term memory (convLSTM) to aggregate rich spatial-temporal information in the video, which enables the network to use past experience for better estimation and to adapt quickly to the current frame. To achieve fast adaptation in changing environments, we further propose a feature alignment method that aligns the non-stationary feature distributions at different times. The proposed network automatically adapts to changing environments without ground truth data collected in advance for external supervision. Our contributions can be summarized as follows:

  • We propose an online meta-learning algorithm for VO to continuously adapt to unseen environments in a self-supervised manner.

  • The VO network utilizes past experience incorporated by convLSTM to achieve better estimation and adapt quickly to the current frame.

  • We propose a feature alignment method to deal with the changing data distributions in the open world.

Our VO network runs at 32 FPS on a GeForce 1080Ti GPU with online refinement, enabling real-time adaptation for practical applications. We evaluate our algorithm across different domains, including outdoor, indoor and synthetic environments, and it consistently outperforms state-of-the-art self-supervised VO baselines.

2 Related works

Learning-based VO has been widely studied in recent years with the advent of deep learning, and many methods with promising results have been proposed. Inspired by the parallel tracking and mapping framework of classic SLAM/VO, DeepTAM [43] utilizes two networks for pose and depth estimation simultaneously. DeepVO [38] uses a recurrent neural network (RNN) to leverage sequential correlations and estimate poses recurrently. However, these methods require ground truth, which is expensive or impractical to obtain. To avoid the need for annotated data, self-supervised VO has recently been developed. SfMLearner [44] exploits the 3D geometric constraint between pose and depth and learns by minimizing a photometric loss. Yin et al. [41] and Ranjan et al. [32] extend this idea to the joint estimation of pose, depth and optical flow to handle non-rigid cases that violate the static-scene assumption. These methods focus on mimicking local structure from motion (SfM) with image pairs, but fail to exploit spatial-temporal correlations over long sequences. SAVO [22] formulates VO as a sequential generative task and utilizes an RNN to significantly reduce scale drift. In this paper, we adopt the same idea as SfMLearner [44] and SAVO [22].

Online adaptation

Most machine learning models suffer from a significant reduction in performance when the test data differ from the training set. An effective way to alleviate this domain shift issue is online learning [36], where data are processed sequentially and the data distribution changes continuously. Previous methods use online gradient updates [12] and probabilistic filtering [6]. Recently, domain adaptation has been widely studied in computer vision. Long et al. [23] propose a Maximum Mean Discrepancy loss to reduce the domain shift. Several works [5, 33] utilize Generative Adversarial Networks (GANs) to directly transfer images from the target domain to the source domain (e.g. day to night or winter to summer). Inspired by [5, 7], we propose a feature alignment method for online adaptation.

Figure 2: The framework of our method. The VO network estimates pose, depth and mask from image sequences. At each iteration, the network parameters are updated according to the loss, and the updated model performs inference at the next time step. The network learns to find a set of weights that performs well at both the current and the next iteration. During online learning, spatial-temporal information is aggregated by convLSTM, and feature alignment is adopted to align feature distributions at different times for fast adaptation.

Meta-learning, or learning to learn, is a long-standing interest in machine learning. It exploits inherent structures in data to learn more effective learning rules for fast domain adaptation [27, 35]. A popular approach is to train a meta-learner that learns how to update the network [4, 15]. Finn et al. [15, 16] proposed Model-Agnostic Meta-Learning (MAML), which constrains the learning rule of the model and uses stochastic gradient descent to quickly adapt networks to new tasks. This simple yet effective formulation has been widely used to adapt deep networks to unseen environments [1, 2, 21, 30, 39]. Our proposed method is most closely related to MAML, extending it to the self-supervised, online learning setting.

3 Problem setup

3.1 Self-supervised VO

Our self-supervised VO follows a similar idea to SfMLearner [44] and SAVO [22] (shown in Fig. 2). The DepthNet predicts the depth $D_t$ of the current frame $I_t$. The PoseNet takes the stacked monocular images $I_t$ and $I_s$ (a neighboring source frame) to regress the relative pose $T_{t\to s}$. Then view synthesis is applied to reconstruct $\hat{I}_t$ from $I_s$ by differentiable image warping:

$$p_s \sim K\, T_{t\to s}\, D_t(p_t)\, K^{-1}\, p_t, \qquad (1)$$

where $p_t$ and $p_s$ are the homogeneous coordinates of a pixel in $I_t$ and $I_s$, respectively, and $K$ denotes the camera intrinsics. The MaskNet predicts a per-pixel mask $M_t$ [44] according to the warping residuals $|I_t - \hat{I}_t|$.
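To make the warping step concrete, here is a minimal PyTorch sketch of differentiable inverse warping in the spirit of Eq. 1. The function name, tensor layout and simplified pinhole model are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def inverse_warp(img_src, depth_tgt, pose_tgt2src, K):
    """Reconstruct the target view by sampling the source image (cf. Eq. 1).

    img_src:      (B, 3, H, W) source frame I_s
    depth_tgt:    (B, 1, H, W) predicted depth D_t of the target frame
    pose_tgt2src: (B, 4, 4)    relative pose T_{t->s} as a homogeneous matrix
    K:            (B, 3, 3)    camera intrinsics
    """
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device

    # Pixel grid of the target frame in homogeneous coordinates, shape (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Back-project: X = D_t(p_t) * K^{-1} p_t
    cam_pts = torch.inverse(K) @ pix.unsqueeze(0).expand(B, -1, -1)
    cam_pts = cam_pts * depth_tgt.reshape(B, 1, -1)

    # Transform into the source frame and project: p_s ~ K T_{t->s} X
    cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_pts = K @ (pose_tgt2src @ cam_pts_h)[:, :3, :]
    uv = src_pts[:, :2, :] / src_pts[:, 2:3, :].clamp(min=1e-6)

    # Normalize pixel coordinates to [-1, 1] and bilinearly sample the source image.
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(img_src, grid, padding_mode="zeros", align_corners=True)
```

The returned warped image plays the role of $\hat{I}_t$ and is compared against $I_t$ by the appearance loss of Sec. 4.4, weighted by the predicted mask.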

3.2 Online adaptation

As shown in Fig. 1, the performance of VO networks is fundamentally limited by their generalization ability when confronted with scenes different from the training data. The reason is that they are designed under a closed-world assumption: the training and test data are i.i.d. samples from a common dataset with a fixed distribution. However, when running a pre-trained VO network in the open world, images are continuously collected in changing scenes. The training and test data then no longer share similar visual appearances, and the data at the current view may differ from those at previous views. This requires the network to adapt online to changing environments.

Given a model pretrained on the source data, a naive approach to online learning is to update the parameters $\theta$ by computing the loss on the current data $\mathcal{D}_t$:

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t; \mathcal{D}_t), \qquad (2)$$

where $\theta_t$ denotes the parameters at iteration $t$ and $\alpha$ is the learning rate. Despite its simplicity, this approach has several drawbacks. The temporal perceptive field of the learning objective is 1, which means it accounts only for the current input and has no correlation with previous data. The optimal solution for the current $\mathcal{D}_t$ is likely to be unsuitable for subsequent inputs. Therefore, the gradients at different iterations are stochastic and inconsistent [9, 26]. This leads to slow convergence and may introduce negative bias into the learning procedure.
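For contrast with the meta-learning objective introduced in Sec. 4, the sketch below implements this naive online update (Eq. 2) in PyTorch; `vo_model` and `loss_fn` are hypothetical stand-ins for the networks of Sec. 3.1 and the self-supervised loss of Sec. 4.4.

```python
import torch


def naive_online_adaptation(vo_model, loss_fn, frame_stream, lr=1e-4):
    """Naive online update (Eq. 2): fit only the current window D_t at each step.

    Because the temporal perceptive field is 1, consecutive updates may pull the
    parameters in inconsistent directions, which slows convergence.
    """
    optimizer = torch.optim.SGD(vo_model.parameters(), lr=lr)
    for window in frame_stream:                   # D_t: current sliding-window frames
        pose, depth, mask = vo_model(window)      # inference with the current weights
        loss = loss_fn(window, pose, depth, mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                          # theta_{t+1} = theta_t - lr * grad_t
        yield pose.detach(), depth.detach()
```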

4 Method

In order to address these issues, we propose to exploit correlations across time for fast online adaptation. Our framework is illustrated in Fig. 2. The VO network takes the consecutive frames in the sliding window $\mathcal{D}_t$ to estimate pose and depth in a self-supervised manner (Sec. 3.1). It is then updated according to the loss on $\mathcal{D}_t$ and performs inference on the frames $\mathcal{D}_{t+1}$ at the next time step. The network learns to find a set of weights that performs well on both $\mathcal{D}_t$ and $\mathcal{D}_{t+1}$ (Sec. 4.1). During online learning, spatial-temporal information is incorporated by convLSTM (Sec. 4.2) and feature alignment is adopted (Sec. 4.3) for fast adaptation.

4.1 Self-supervised online meta-learning

In contrast to Eq. 2, we extend the online learning objective from $\mathcal{D}_t$ to $\mathcal{D}_{t+1}$, which can be written as:

$$\min_\theta\ \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta; \mathcal{D}_t);\ \mathcal{D}_{t+1}\big). \qquad (3)$$

Different from naive online learning, the temporal perceptive field of Eq. 3 becomes 2. It optimizes the performance on $\mathcal{D}_{t+1}$ after adapting to the task on $\mathcal{D}_t$. The insight is that instead of minimizing the training error at the current iteration $t$, we try to minimize the test error at the next iteration. Our formulation directly incorporates online adaptation into the learning objective, which motivates the network to learn at time $t$ so as to perform better at the next time $t+1$.

Our objective of learning to adapt is similar in spirit to that of Model-Agnostic Meta-Learning (MAML) [15]:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\theta)\big), \qquad (4)$$

which aims to minimize the evaluation (adaptation) error on the validation set instead of the training error on the training set. Here $\mathcal{T}_i$ denotes a task sampled from the task distribution $p(\mathcal{T})$. More details of MAML can be found in [15].

As a nested optimization problem, our objective function is optimized via a two-stage gradient descent. At each iteration $t$, we take the $K$ consecutive frames in the sliding window as a mini-dataset (shown within the blue area in Fig. 2):

$$\mathcal{D}_t = \{I_t, I_{t+1}, \ldots, I_{t+K-1}\}. \qquad (5)$$

In the inner loop of Eq. 3, we evaluate the performance of VO on $\mathcal{D}_t$ with the self-supervised loss and update the parameters according to Eq. 2. Then, in the outer loop, we evaluate the performance of the updated model on the subsequent frames $\mathcal{D}_{t+1}$. We mimic this continuous adaptation process during both training and online testing. During training, we minimize the sum of losses in Eq. 3 across all sequences in the training dataset, which motivates the network to learn base weights that enable fast online adaptation.
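A minimal, first-order sketch of this two-stage update is shown below; we deliberately omit second-order gradients through the inner step, in line with the first-order argument of Eq. 6, and the helper names and plain SGD updates are assumptions rather than the authors' implementation.

```python
import copy
import torch


def meta_update(vo_model, loss_fn, window_t, window_next, inner_lr=1e-4, outer_lr=1e-4):
    """One online meta-learning step (Eq. 3), first-order variant.

    Inner loop : adapt a copy of the weights to the current window D_t (Eq. 2).
    Outer loop : evaluate the adapted weights on the next window D_{t+1} and
                 apply the resulting gradient to the base weights.
    """
    # Inner loop: adapt to D_t.
    adapted = copy.deepcopy(vo_model)
    pose, depth, mask = adapted(window_t)
    inner_loss = loss_fn(window_t, pose, depth, mask)
    inner_grads = torch.autograd.grad(inner_loss, adapted.parameters())
    with torch.no_grad():
        for p, g in zip(adapted.parameters(), inner_grads):
            p -= inner_lr * g

    # Outer loop: evaluate the adapted weights on D_{t+1}.
    pose, depth, mask = adapted(window_next)
    outer_loss = loss_fn(window_next, pose, depth, mask)
    outer_grads = torch.autograd.grad(outer_loss, adapted.parameters())

    # First-order approximation: apply the outer gradient to the base weights.
    with torch.no_grad():
        for p, g in zip(vo_model.parameters(), outer_grads):
            p -= outer_lr * g
    return inner_loss.item(), outer_loss.item()
```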

In order to provide more intuition about what is learned and why adaptation is fast, we take a Taylor expansion of our training objective:

$$\mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta; \mathcal{D}_t);\ \mathcal{D}_{t+1}\big) \approx \mathcal{L}(\theta; \mathcal{D}_{t+1}) - \alpha \big\langle \nabla_\theta \mathcal{L}(\theta; \mathcal{D}_t),\ \nabla_\theta \mathcal{L}(\theta; \mathcal{D}_{t+1}) \big\rangle + O(\alpha^2 H), \qquad (6)$$

where $H$ denotes the Hessian matrix and $\langle \cdot, \cdot \rangle$ denotes the inner product. Since most neural networks use ReLU activations, the networks are locally linear and the second-order derivative equals 0 in most cases [28]. Therefore, the Hessian term and higher-order terms are omitted.

As shown in Eq. 6, the network learns to minimize the prediction error on $\mathcal{D}_{t+1}$ while maximizing the similarity between the gradients at $t$ and $t+1$. Since the camera is continuously moving, the scenes may vary over time. Naive online learning treats different scenes independently by fitting only the current scene, ignoring that the way VO is performed in different scenes is similar. As the gradient indicates the direction in which to update the network, this leads to inconsistent gradients and slow convergence. In contrast, the second term enforces consistent gradient directions by aligning the gradient for $\mathcal{D}_{t+1}$ with previous information, indicating that we are training the network at time $t$ to perform consistently well on both $\mathcal{D}_t$ and $\mathcal{D}_{t+1}$. This meta-learning scheme alleviates the stochastic gradient problem in online learning. Eq. 6 describes the dynamics of sequential learning in non-stationary scenes: the network learns to adjust its current state so as to perform better at the next time step. Consequently, the learned weights are less sensitive to the non-stationary data distributions of sequential inputs, enabling fast adaptation to unseen environments.
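One way to observe this effect in practice is to measure the cosine similarity between the gradients computed on consecutive windows; the diagnostic below is our own illustration, not part of the training objective.

```python
import torch
import torch.nn.functional as F


def gradient_alignment(vo_model, loss_fn, window_t, window_next):
    """Cosine similarity between the gradients on D_t and D_{t+1}.

    Values close to 1 mean the two windows ask for consistent updates, which is
    what the second term of the expanded objective (Eq. 6) encourages.
    """
    def flat_grad(window):
        pose, depth, mask = vo_model(window)
        loss = loss_fn(window, pose, depth, mask)
        grads = torch.autograd.grad(loss, vo_model.parameters())
        return torch.cat([g.reshape(-1) for g in grads])

    return F.cosine_similarity(flat_grad(window_t), flat_grad(window_next), dim=0).item()
```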

4.2 Spatial-temporal aggregation

As stated in Sec. 1, online learning suffers from slow convergence due to the inherent limitation of its temporal perceptive field. In order to make online updating more effective, we let the network make its current estimate based on previous information. In addition, predicting pose from image pairs alone is prone to error accumulation. This trajectory drift problem can be mitigated by exploiting spatial-temporal correlations over long sequences [22, 40].

In this paper, we use convolutional LSTM (convLSTM) to achieve fast adaptation and reduce accumulated error. As shown in Fig. 3, we embed recurrent units into the encoders of the DepthNet and PoseNet, allowing the convolutional network to leverage not only spatial but also temporal information for depth and pose estimation. The length of the convLSTM is the number of frames in $\mathcal{D}_t$. The convLSTM acts as the memory of the network: as new frames are processed, the network memorizes and learns from its past experience, updating its parameters to quickly adapt to unseen environments. This approach not only enforces correlations among different time steps, but also captures the temporally dynamic nature of the moving camera from video inputs.
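For reference, a compact ConvLSTM cell of the kind that could be embedded into the DepthNet/PoseNet encoders is sketched below; the gating follows the standard ConvLSTM formulation, while the kernel size, channel counts and exact placement inside the encoders are assumptions.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: the hidden state keeps its spatial layout, so the
    recurrent memory preserves both spatial and temporal structure."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state=None):
        if state is None:                      # zero-initialize memory for the first frame
            b, _, h, w = x.shape
            zeros = x.new_zeros(b, self.hidden_ch, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c_prev + i * g                 # update the cell (long-term memory)
        h = o * torch.tanh(c)                  # emit the hidden state (short-term memory)
        return h, (h, c)
```

The encoder would call such a cell once per frame of the sliding window, carrying the hidden and cell states across time steps so that past experience informs the current estimate.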

Figure 3: Network architecture of DepthNet, PoseNet and MaskNet in self-supervised VO framework. The height of each block represents the size of its feature maps

4.3 Feature alignment

One basic assumption of conventional machine learning is that the training and test data are drawn independently and identically distributed (i.i.d.) from the same distribution. However, this assumption does not hold when running VO in the open world, since the test data (target domain) usually differ from the training data (source domain). Moreover, as the camera moves continuously through a changing environment, the captured scenes also vary over time. As highlighted in [7, 25], aligning the feature distributions of the two domains improves performance in domain adaptation.

Inspired by [7], we extend this domain adaptation method to the online learning setting by aligning feature distributions across time. When training on the source domain, we collect the statistics of the features $f$ in a feature map tensor by Layer Normalization (LN) [3]:

$$\mu = \frac{1}{HWC} \sum_{h,w,c} f_{h,w,c}, \qquad \sigma^2 = \frac{1}{HWC} \sum_{h,w,c} (f_{h,w,c} - \mu)^2, \qquad (7)$$

where $H$, $W$ and $C$ are the height, width and number of channels of each feature map. When adapting to the target domain, we initialize the feature statistics at $t = 0$ with the source-domain statistics:

$$\hat{\mu}_0 = \mu_s, \qquad \hat{\sigma}_0^2 = \sigma_s^2. \qquad (8)$$

Then, at each iteration $t$, the feature statistics $(\mu_t, \sigma_t^2)$ are computed by Eq. 7. Given the previous statistics $(\hat{\mu}_{t-1}, \hat{\sigma}_{t-1}^2)$, the feature distribution at $t$ is aligned by:

$$\hat{\mu}_t = \lambda \hat{\mu}_{t-1} + (1 - \lambda)\, \mu_t, \qquad \hat{\sigma}_t^2 = \lambda \hat{\sigma}_{t-1}^2 + (1 - \lambda)\, \sigma_t^2, \qquad (9)$$

where $\lambda$ is a hyperparameter. After feature alignment, the features are normalized to [3]:

$$\hat{f}_{h,w,c} = \gamma\, \frac{f_{h,w,c} - \hat{\mu}_t}{\sqrt{\hat{\sigma}_t^2 + \epsilon}} + \beta, \qquad (10)$$

where $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are the learnable scale and shift of the normalization layers [3].

The insight behind this approach is to enforce correlation between the non-stationary feature distributions in changing environments. Learning algorithms perform well when the feature distribution of the test data matches that of the training data. When moving to a new environment, although the extracted features are different, we expect the feature distributions of the two domains to be the same (Eq. 8). Likewise, although the view keeps changing when running VO in the open world, consecutive frames are observed continuously in time, so their feature distributions should be similar (Eq. 9). This feature normalization and alignment acts as a regularizer that simplifies the learning process, making the learned weights consistent across non-stationary environments.
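A minimal sketch of how the aligned statistics of Eqs. 8-10 could be maintained at test time is given below: source-domain statistics initialize the running mean and variance, which are then blended with the statistics of each new input before normalization. The module name, the per-sample reduction and the exact blending rule are our reading of the text, not the authors' implementation.

```python
import torch
import torch.nn as nn


class OnlineFeatureAlignment(nn.Module):
    """Align layer-normalization statistics over time (Eqs. 7-10)."""

    def __init__(self, channels, lam=0.5, eps=1e-5):
        super().__init__()
        self.lam, self.eps = lam, eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # learnable shift
        # Running statistics, initialized from the source domain (Eq. 8).
        self.register_buffer("running_mean", torch.zeros(1))
        self.register_buffer("running_var", torch.ones(1))

    def forward(self, feat):
        # Statistics over channel and spatial dimensions of the feature map (Eq. 7).
        mean = feat.mean(dim=(1, 2, 3)).mean().detach().reshape(1)
        var = feat.var(dim=(1, 2, 3), unbiased=False).mean().detach().reshape(1)
        # Blend the current statistics with the previous aligned ones (Eq. 9).
        self.running_mean = self.lam * self.running_mean + (1 - self.lam) * mean
        self.running_var = self.lam * self.running_var + (1 - self.lam) * var
        # Normalize with the aligned statistics (Eq. 10).
        norm = (feat - self.running_mean) / torch.sqrt(self.running_var + self.eps)
        return self.gamma * norm + self.beta
```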

4.4 Loss functions

Our self-supervised loss is the same as that of most previous methods. It consists of the following terms:

Appearance loss We measure the quality of the reconstructed image $\hat{I}_t$ with a photometric loss and the structural similarity metric (SSIM):

$$\mathcal{L}_{app} = \frac{1}{|P|} \sum_{p \in P} M_t(p) \left( \alpha\, \frac{1 - \mathrm{SSIM}\big(I_t(p), \hat{I}_t(p)\big)}{2} + (1 - \alpha)\, \big\| I_t(p) - \hat{I}_t(p) \big\|_1 \right). \qquad (11)$$

A regularization term $\mathcal{L}_{reg}(M_t)$ prevents the learned mask from converging to the trivial all-zero solution [44]. The filter size of SSIM is set to 5×5 and $\alpha$ is set to 0.85.

Depth regularization We introduce an edge-aware loss that encourages local smoothness of the depth while allowing discontinuities at image edges:

$$\mathcal{L}_{smooth} = \frac{1}{|P|} \sum_{p \in P} \big| \nabla D_t(p) \big|\, e^{-|\nabla I_t(p)|}. \qquad (12)$$

Thus the overall self-supervised loss is:

$$\mathcal{L} = \lambda_{reg}\, \mathcal{L}_{reg} + \lambda_{app}\, \mathcal{L}_{app} + \lambda_{smooth}\, \mathcal{L}_{smooth}. \qquad (13)$$
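The sketch below assembles the loss terms of Eqs. 11-13 in PyTorch; the 5x5 SSIM window and the 0.85 SSIM weight are stated in the paper, while the mask-regularization form and the default weights in the code are illustrative assumptions rather than the paper's exact values.

```python
import torch
import torch.nn.functional as F


def ssim_dist(x, y, window=5, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM dissimilarity map with a 5x5 average-pooling window."""
    pad = window // 2
    mu_x, mu_y = F.avg_pool2d(x, window, 1, pad), F.avg_pool2d(y, window, 1, pad)
    sig_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sig_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sig_x + sig_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)


def appearance_loss(img_tgt, img_warped, mask, alpha=0.85):
    """Eq. 11: mask-weighted SSIM + photometric residuals."""
    photo = (img_tgt - img_warped).abs().mean(dim=1, keepdim=True)
    struct = ssim_dist(img_tgt, img_warped).mean(dim=1, keepdim=True)
    return (mask * (alpha * struct + (1 - alpha) * photo)).mean()


def mask_regularization(mask):
    """Cross-entropy with an all-ones target keeps the mask from collapsing to 0 [44]."""
    return -torch.log(mask.clamp(min=1e-6)).mean()


def smoothness_loss(depth, img):
    """Eq. 12: edge-aware first-order smoothness on the predicted depth."""
    dx_d = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dy_d = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()


def total_loss(img_tgt, img_warped, mask, depth, w_reg=0.01, w_app=1.0, w_smooth=0.5):
    """Eq. 13: weighted sum of the loss terms (weights here are illustrative)."""
    return (w_reg * mask_regularization(mask)
            + w_app * appearance_loss(img_tgt, img_warped, mask)
            + w_smooth * smoothness_loss(depth, img_tgt))
```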

5 Experiments

5.1 Implementation details

The architecture of our network is shown in Fig. 3. The DepthNet uses a U-shaped architecture similar to [44]. The PoseNet is split into two parts followed by fully-connected layers that regress the Euler angles and translations of the 6-DoF pose, respectively. The length of the convLSTM is set to 9. Layer Normalization and ReLUs are adopted in each layer except for the output layers. The detailed network architecture can be found in the supplementary materials.

Our model is implemented in PyTorch [31] on a single NVIDIA GTX 1080Ti GPU. All sub-networks are jointly trained in a self-supervised manner. Images are resized to 128×416 for both training and online adaptation. The Adam [19] optimizer with weight decay is used. The weighting factors of the loss terms in Eq. 13 are set to 0.01, 1 and 0.5, respectively. The feature alignment parameter $\lambda$ is set to 0.5. The batch size is 4 for training and 1 for online adaptation. The learning objective (Eq. 3) is used for both training and online adaptation. We pre-train the network for 20,000 iterations. The learning rates of the inner and outer loops are initialized to the same value and reduced by half every 5,000 iterations.
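Assuming a standard PyTorch setup, the optimizer and schedule described above might look like the following; the initial learning rate and weight decay are placeholders, since their exact values are not recoverable from this text.

```python
import torch


def make_optimizer(model, init_lr=1e-4, weight_decay=1e-5):
    """Adam with the learning rate halved every 5,000 iterations (Sec. 5.1).

    init_lr and weight_decay are placeholder values, not the paper's numbers.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=init_lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.5)
    return optimizer, scheduler
```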

5.2 Outdoor KITTI

First, we test our method on the KITTI odometry dataset [18]. It contains 11 driving sequences with ground truth poses. We follow the same train/test split as [22, 41, 44], using sequences 00-08 for training and 09-10 for online testing.

Figure 4: Trajectories of different methods on KITTI dataset. Our method shows a better odometry estimation due to online updating
Method             Seq. 09 ($t_{err}$ / $r_{err}$)   Seq. 10 ($t_{err}$ / $r_{err}$)
SfMLearner [44]    11.15 / 3.72                      5.98 / 3.40
Vid2Depth [24]     44.52 / 12.11                     21.45 / 12.50
Zhan et al. [42]   11.89 / 3.62                      12.82 / 3.40
GeoNet [41]        23.94 / 9.81                      20.73 / 9.10
SAVO [22]          9.52 / 3.64                       6.45 / 2.41
Ours               5.89 / 3.34                       4.79 / 0.83
Table 1: Quantitative comparison of visual odometry results on the KITTI dataset. $t_{err}$: average translational root mean square error (RMSE) drift (%); $r_{err}$: average rotational RMSE drift (°/100m)

Instead of calculating the absolute trajectory error (ATE) on image pairs as in previous methods, we recover full trajectories and compute the translation error $t_{err}$ and rotation error $r_{err}$ with the KITTI evaluation toolkit. We compare our method with several state-of-the-art self-supervised VO baselines: SfMLearner [44], GeoNet [41], Zhan et al. [42], Vid2Depth [24] and SAVO [22]. As stated in [24], a scaling factor is used to align trajectories with the ground truth to resolve the scale ambiguity of monocular VO. The estimated trajectories of sequences 09-10 are plotted in Fig. 4 and quantitative evaluations are shown in Table 1. Our method outperforms all the other baselines by a clear margin; the accumulated error is reduced by online adaptation.

A comparison of the running speed with other VO methods can be found in Table 2. Since we study the online learning problem, the reported running time includes forward propagation, loss computation, backpropagation and network updating. Our method achieves real-time online adaptation and outperforms state-of-the-art baselines considerably.

Method SfMLearner GeoNet Vid2Depth SAVO Ours
FPS 24 21 37 17 32
Table 2: Running speed of different VO methods.

5.3 Synthetic to real

Synthetic datasets (e.g. Virtual KITTI, SYNTHIA and Carla) have been widely used for research since they provide ground truth labels and controllable environment settings. However, there is a large gap between synthetic and real-world data. In order to test the domain adaptation ability, we use the Carla simulator [10] to collect synthetic images under different weather conditions in a virtual city for training, and use KITTI sequences 00-10 for online testing.

Figure 5: Trajectories of different methods pretrained on Carla and tested on the KITTI dataset. Our method significantly outperforms all the other baselines when moving from virtual to real-world data.

It can be seen from Fig. 5 and Table 3 that previous methods all fail when shifted to real-world environments. This is probably because the features of virtual scenes differ considerably from those of the real world, even though both are collected in driving scenarios. In contrast, our method significantly outperforms previous art: it is able to bridge the domain gap and quickly adapt to real-world data.

Seq   frames   SfMLearner [44]   Vid2Depth [24]   Zhan et al. [42]   GeoNet [41]   SAVO [22]   Ours
00 4541 61.55 27.13 61.69 28.41 63.30 28.24 44.08 14.89 60.10 28.43 14.21 5.93
01 1101 83.91 10.36 48.44 10.30 35.68 9.78 43.21 8.42 64.68 9.91 21.36 4.62
02 4661 71.48 27.80 70.56 25.72 84.63 24.67 73.59 12.53 69.15 24.78 16.21 2.60
03 801 49.51 36.81 41.92 27.31 50.05 16.44 43.36 14.56 66.34 16.45 18.41 0.89
04 271 23.80 10.52 39.34 3.42 12.08 1.56 17.91 9.95 25.28 1.84 9.08 4.41
05 2761 87.72 30.71 63.62 30.71 89.03 29.66 32.47 13.12 59.90 29.67 24.82 6.33
06 1101 59.53 12.70 84.33 32.75 93.66 30.91 40.28 16.68 63.18 31.04 9.77 3.58
07 1101 51.77 18.94 74.62 48.89 99.69 49.08 37.13 17.20 63.04 49.25 12.85 2.30
08 4701 86.51 28.13 70.20 28.14 87.57 28.13 33.41 11.45 62.45 27.11 27.10 7.81
09 1591 58.18 20.03 69.20 26.18 83.48 25.07 51.97 13.02 67.06 25.76 15.21 5.28
10 1201 45.33 16.91 49.10 23.96 53.70 22.93 46.63 13.80 58.52 23.02 25.63 7.69
Table 3: Quantitative comparisons of different methods pretrained on synthetic data from the Carla simulator and tested on KITTI. Each method column reports $t_{err}$ (%) followed by $r_{err}$ (°/100m), as defined in Table 1.

5.4 Outdoor KITTI to indoor TUM

In order to further evaluate the adaptability of our method, we test various baselines on the TUM-RGBD [34] dataset. KITTI is captured by moving cars with planar motion, high-quality images and sufficient disparity. In contrast, the TUM dataset is collected by handheld cameras in indoor scenes with much more complicated motion patterns, which is significantly different from KITTI. It includes various challenging conditions (Fig. 6) such as dynamic objects, textureless scenes, abrupt motions and large occlusions.

We pretrain these methods on KITTI 00-08 and test them on the TUM dataset. Although ground truth depth is available, we only use monocular RGB images during testing. It can be seen (Table 4 and Fig. 6) that our method consistently outperforms all the other baselines. Despite the large domain shift and the significant difference in motion patterns (i.e. large, planar motion vs. small motion along three axes), our method can still recover trajectories well. On the contrary, GeoNet [41] and Zhan et al. [42] tend to fail. Although SAVO [22] utilizes an LSTM to alleviate accumulated error to some extent, our method performs better due to online adaptation.

Figure 6: Raw images (top) and trajectories (bottom) recovered by different methods on TUM-RGBD dataset
Sequence   Structure   Texture   Abrupt motion   Zhan et al. [42]   GeoNet [41]   SAVO [22]   Ours
fr2/desk - 0.361 0.287 0.269 0.214
fr2/pioneer_360 0.306 0.410 0.383 0.218
fr2/pioneer_slam 0.309 0.301 0.338 0.190
fr2/360_kidnap 0.367 0.325 0.311 0.298
fr3/cabinet - - 0.316 0.282 0.281 0.272
fr3/long_off_hou_valid - 0.327 0.316 0.297 0.237
fr3/nstr_tex_near_loop - - 0.340 0.277 0.440 0.255
fr3/str_ntex_far - - 0.235 0.258 0.216 0.177
fr3/str_ntex_near - - 0.217 0.198 0.204 0.128
Table 4: Quantitative evaluation of different methods pretrained on KITTI and tested on the TUM-RGBD dataset. We evaluate the relative pose error (RPE), reported as translational RMSE in [m/s]

5.5 Ablation studies

In order to demonstrate the effectiveness of each component, we present ablation studies on various versions of our method on the KITTI dataset (shown in Table 5).

First, we evaluate the backbone of our method (the first row), which includes convLSTM and feature alignment but no meta-learning during training or online testing. It can be seen from Table 1 and Table 5 that, even without meta-learning and online adaptation, our network backbone still outperforms most previous methods. The results indicate that convLSTM is able to reduce accumulated error and that feature alignment improves the performance when confronted with unseen environments.

Online   Pretrain   LSTM   FA   Seq. 09 ($t_{err}$ / $r_{err}$)   Seq. 10 ($t_{err}$ / $r_{err}$)
-        Standard   ✓      ✓    10.93 / 3.91                      11.65 / 4.11
Naive    Standard   ✓      ✓    10.22 / 5.33                      8.24 / 3.22
Meta     Meta       -      -    9.25 / 4.20                       7.58 / 3.13
Meta     Meta       ✓      -    6.36 / 3.84                       5.37 / 1.41
Meta     Meta       -      ✓    7.52 / 4.12                       5.98 / 2.72
Meta     Meta       ✓      ✓    5.89 / 3.34                       4.79 / 0.83
Table 5: Ablation study of various versions of our method on the KITTI dataset. FA: feature alignment; ✓ indicates the component is used, - that it is not

Then we compare the efficiency of naive online learning (the second row) and meta-learning (the last row). Although naive online learning is able to reduce the estimation error to some extent, it converges much more slowly than the meta-learning scheme, indicating that it takes much longer to adapt the network to a new environment.

Finally, we study the effect of convLSTM and feature alignment during meta-learning (last four rows). Compared with the baseline meta-learning scheme, convLSTM and feature alignment give the VO performance a further boost. In addition, convLSTM tends to perform better than feature alignment during online adaptation. One possible explanation is that convLSTM incorporates spatial-temporal correlations and past experience over long sequences. It associates different states recurrently, making the gradient computation graph more densely connected during backpropagation. Meanwhile, convLSTM correlates the VO network at different times, enforcing a set of weights that is consistent in the dynamic environment.

In addition, we study how the size of the sliding window influences VO performance. Changing the window size has little impact on the running speed (30-32 FPS), and as the window grows the adaptation becomes faster and more accurate. When the window contains more than 15 frames, however, adaptation speed and accuracy drop again. We therefore use the window length given in Sec. 5.1 (9 frames, matching the convLSTM length) as the best choice.

6 Conclusions

In this paper, we propose an online meta-learning scheme for self-supervised VO to achieve fast online adaptation in the open world. We use convLSTM to aggregate spatial-temporal information in the past, enabling the network to use past experience for better estimation and fast adaptation to the current frame. Besides, we put forward a feature alignment method to deal with changing feature distributions in the unconstrained open world setting. Our network dynamically evolves in time to continuously adapt to changing environments on-the-fly. Extensive experiments on outdoor, virtual and indoor datasets demonstrate that our network with online adaptation ability outperforms state-of-the-art self-supervised VO methods.

Acknowledgments This work is supported by the National Key Research and Development Program of China (2017YFB1002601) and National Natural Science Foundation of China (61632003, 61771026).

References

  • [1] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel (2018) Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In ICLR, Cited by: §2.
  • [2] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas (2016) Learning to Learn by Gradient Descent by Gradient Descent. In NeurIPS, Cited by: §2.
  • [3] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer Normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.3.
  • [4] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei (1992) On the Optimization of a Synaptic Learning Rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pp. 6–8. Cited by: §2.
  • [5] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks. In CVPR, Cited by: §2.
  • [6] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan (2013) Streaming Variational Bayes. In NeurIPS, Cited by: §2.
  • [7] F. M. Cariucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò (2017) Autodial: Automatic Domain Alignment Layers. In ICCV, Cited by: §2, §4.3, §4.3.
  • [8] V. Casser, S. Pirk, R. Mahjourian, and A. Angelova (2019) Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos. In AAAI, Cited by: §1.
  • [9] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby (2019) Self-Supervised GANs via Auxiliary Rotation Loss. In CVPR, Cited by: §1, §3.2.
  • [10] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16. Cited by: §5.3.
  • [11] D. Sahoo, Q. Pham, J. Lu, and S. C. Hoi (2018) Online Deep Learning: Learning Deep Neural Networks on the Fly. In IJCAI, Cited by: §1.
  • [12] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §2.
  • [13] J. Engel, V. Koltun, and D. Cremers (2018) Direct Sparse Odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (3), pp. 611–625. Cited by: §1.
  • [14] J. Engel, T. Schöps, and D. Cremers (2014) LSD-SLAM: Large-Scale Direct Monocular SLAM. In ECCV, Cited by: §1.
  • [15] C. Finn, P. Abbeel, and S. Levine (2017) Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, Cited by: §2, §4.1.
  • [16] C. Finn and S. Levine (2018) Meta-Learning and Universality: Deep Representations and Gradient Descent can Approximate Any Learning Algorithm. In ICLR, Cited by: §2.
  • [17] C. Forster, M. Pizzoli, and D. Scaramuzza (2014) SVO: Fast Semi-Direct Monocular Visual Odometry. In ICRA, Cited by: §1.
  • [18] A. Geiger, P. Lenz, and R. Urtasun (2012) Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, Cited by: §5.2.
  • [19] D. P. Kingma and J. Ba (2015) Adam: A method for Stochastic Optimization. In ICLR, Cited by: §5.1.
  • [20] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1.
  • [21] K. Li and J. Malik (2017) Learning to optimize. In ICLR, Cited by: §2.
  • [22] S. Li, F. Xue, X. Wang, Z. Yan, and H. Zha (2019) Sequential Adversarial Learning for Self-Supervised Deep Visual Odometry. In ICCV, Cited by: §1, §2, §3.1, §4.2, §5.2, §5.2, §5.4, Table 1, Table 3, Table 4.
  • [23] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep Transfer Learning with Joint Adaptation Networks. In ICML, Cited by: §2.
  • [24] R. Mahjourian, M. Wicke, and A. Angelova (2018) Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In CVPR, Cited by: §5.2, Table 1, Table 3.
  • [25] M. Mancini, H. Karaoguz, E. Ricci, P. Jensfelt, and B. Caputo (2018) Kitting in the Wild through Online Domain Adaptation. In IROS, Cited by: §4.3.
  • [26] M. McCloskey and N. J. Cohen (1989) Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1, §3.2.
  • [27] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2018) A Simple Neural Attentive Meta-Learner. In ICLR, Cited by: §2.
  • [28] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio (2014) On the Number of Linear Regions of Deep Neural Networks. In NeurIPS, Cited by: §4.1.
  • [29] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics 31 (5), pp. 1147–1163. Cited by: §1.
  • [30] A. Nagabandi, I. Clavera, S. Liu, R. S. Fearing, P. Abbeel, S. Levine, and C. Finn (2019) Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning. In ICLR, Cited by: §2.
  • [31] A. Paszke, S. Gross, S. Chintala, and G. Chanan (2017) PyTorch. Note: https://github.com/pytorch/pytorch Cited by: §5.1.
  • [32] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black (2019) Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation. In CVPR, Cited by: §2.
  • [33] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa (2018) Generate to Adapt: Aligning Domains Using Generative Adversarial Networks. In CVPR, Cited by: §2.
  • [34] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A Benchmark for the Evaluation of RGB-D SLAM Systems. In IROS, Cited by: §5.4.
  • [35] S. Thrun and L. Pratt (1998) Learning to Learn: Introduction and Overview. pp. 3–17. Cited by: §2.
  • [36] S. Thrun (1998) Lifelong learning algorithms. In Learning to learn, pp. 181–209. Cited by: §2.
  • [37] A. Tonioni, F. Tosi, M. Poggi, S. Mattoccia, and L. D. Stefano (2019) Real-Time Self-Adaptive Deep Stereo. In CVPR, Cited by: §1, §1.
  • [38] S. Wang, R. Clark, H. Wen, and N. Trigoni (2017) DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In ICRA, Cited by: §2.
  • [39] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi (2019) Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning. In CVPR, Cited by: §2.
  • [40] F. Xue, X. Wang, S. Li, Q. Wang, J. Wang, and H. Zha (2019) Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry. In CVPR, Cited by: §1, §4.2.
  • [41] Z. Yin and J. Shi (2018) GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In CVPR, Cited by: §1, §2, §5.2, §5.2, §5.4, Table 1, Table 3, Table 4.
  • [42] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid (2018) Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction. In CVPR, Cited by: §5.2, §5.4, Table 1, Table 3, Table 4.
  • [43] H. Zhou, B. Ummenhofer, and T. Brox (2018) DeepTAM: Deep Tracking and Mapping. In ECCV, Cited by: §2.
  • [44] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised Learning of Depth and Ego-Motion from Video. In CVPR, Cited by: §1, §2, §3.1, §4.4, §5.1, §5.2, §5.2, Table 1, Table 3.