1 Introduction
Accurate and robust localization with low-cost Inertial Measurement Units (IMUs) is an ideal solution for a wide range of applications, from augmented reality[zhou2020integrated] to indoor positioning services[wang2016indoor, zhou2021mining]. An IMU usually consists of accelerometers and gyroscopes, and sometimes magnetometers, and can sample linear acceleration, angular velocity and magnetic flux density in an energy-efficient way. It is lightweight and cheap enough that many mobile devices, such as smartphones and VR headsets, are instrumented with one. In many scenarios, such as indoor or underground environments where global navigation satellite systems are unavailable, the ubiquitous IMU is a promising signal source that can provide reliable and continuous location services. Unlike Visual-Inertial Odometry (VIO)[forster2016manifold], which is sensitive to its surroundings and cannot work under extreme lighting, IMU-only inertial odometry is more desirable and can potentially perform accurate and robust localization anytime and anywhere[harle2013survey, liu2020tlio].
Recent advances in data-driven approaches (e.g., IONet[chen2018ionet], RoNIN[herath2020ronin], TLIO[liu2020tlio]) based on machine learning and deep learning have pushed the limits of traditional inertial odometry[park2010zero, ho2016step] with the help of kinematic models and prior knowledge. However, to the best of our knowledge, all of them are based on purely supervised learning, which is notoriously weak under distribution shifts. IMU sensor data varies widely across devices and users, and sometimes the sensor data drifts over time. It is hard to control this distribution variability when supervised algorithms are deployed in diverse applications. A rich and diverse database such as the RoNIN[herath2020ronin] database can alleviate the problem to some extent, but collecting such a large database is cumbersome, and there will always be scenarios the database does not cover, whose characteristics the supervised model therefore cannot capture. To mitigate the challenge of distribution shift in the real world, we propose a geometric constraint, rotation-equivariance, that improves the generalizability of the deep model in the training phase and helps it learn from shifted sensor data at inference time. Inspired by the Heading-Agnostic Coordinate Frame (HACF) presented in RoNIN[herath2020ronin], we define rotation-equivariance as follows: when an IMU sequence in the HACF is rotated around the z-axis by a random angle, the corresponding ground-truth trajectory should be transformed by the same horizontal rotation. Under this assumption, we propose an auxiliary task that minimizes the angle error between the deep model's prediction for the rotated IMU data and the rotated prediction for the original data. In experiments we validate that the auxiliary task improves model robustness in the training phase when it is jointly optimized with the supervised velocity loss. At inference time, we formulate the auxiliary task as a standalone self-supervised learning problem: we update model parameters based on auxiliary losses computed on test samples to adapt the model to the distribution of the given test data. This process is known as Test-Time Training (TTT)[sun2020test].
Empirical results indicate the proposed self-supervision task brings substantial improvements at inference time. Furthermore, we introduce deep ensembles, a promising approach for simple and scalable predictive uncertainty estimation[lakshminarayanan2016simple]. We show in experiments that the uncertainty estimated using deep ensembles is consistent with the error distribution. This enables adaptive TTT, in which model parameters are updated only when the uncertainty of a prediction reaches a certain level. We compare different TTT strategies and study the relationship between update frequency and model precision.
In summary, our paper has the following three main contributions:

We propose Rotation-equivariance-supervised Inertial Odometry (RIO) and demonstrate that rotation-equivariance can be formulated as an auxiliary task with a powerful supervisory signal in the training phase.

We employ TTT based on rotation-equivariance for learning-based inertial odometry and validate that it helps improve the generalizability of RIO.

We introduce deep ensembles as a practical approach for uncertainty estimation, and utilize the uncertainty estimates as an indicator for adaptively triggering TTT.
The remainder of this paper is organized as follows: we first give an overview of previous work on inertial odometry algorithms and related self-supervised tasks. We then introduce our method, and finally present experiments and evaluations.
2 Related work
Roughly, there are three types of inertial odometry algorithms: i) double-integration-based analytical solutions[titterton2004strapdown, wu2005strapdown, bortz1971new]; ii) constrained models with additional assumptions[park2010zero, foxlin2005pedestrian, ho2016step, jimenez2009comparison, solin2018inertial, hostettler2016imu]; and iii) data-driven methods[yan2018ridi, chen2018ionet, herath2020ronin, liu2020tlio, sun2021idol, wang2021pose].
A conventional strapdown inertial navigation system computes positions by double integration of IMU readings[titterton2004strapdown]. Many analytical solutions[wu2005strapdown, bortz1971new] have been studied to improve the performance of such systems. However, double integration leads to exploding cumulative errors in the presence of signal biases. It requires high-precision sensors, which are expensive and heavy and are typically installed in aircraft, automobiles and submarines.
For consumer-grade IMUs, which are small and cheap but less accurate, a variety of constrained models with different assumptions have emerged[jimenez2009comparison] and mitigate error drift to some extent. [park2010zero, foxlin2005pedestrian] resort to shoe-mounted sensors to detect zero velocity for limiting velocity errors. [ho2016step] proposes step-detection and step-length estimation algorithms to estimate walking distance under a regular-gait hypothesis. Inertial odometry models fused with available measurements by an Extended Kalman Filter (EKF) are presented in [solin2018inertial, hostettler2016imu]. [solin2018inertial] requires observations such as position fixes or loop closures, while [hostettler2016imu] assumes negligible acceleration of the device equipped with the IMU.
Data-driven methods further broaden the applicable scenarios of IMUs and relax condition limitations. RIDI[yan2018ridi] and PDRNet[asraf2021pdrnet] estimate robust trajectories of natural human motion with supervised training in a hierarchical way. RIDI[yan2018ridi] develops a cascaded regression model that first uses a support vector machine to classify IMU placements and then applies type-specific support vector regression models to estimate velocities. PDRNet[asraf2021pdrnet] employs a smartphone location recognition network to distinguish smartphone locations and then uses different models, trained for different locations, for inference. IONet[chen2018ionet] and RoNIN[herath2020ronin], using unified deep neural networks, provide more robust solutions that work in highly dynamic conditions. They show that direct integration of estimated velocities helps limit error drift and that a unified deep neural network model is capable of generalizing to various motions. TLIO[liu2020tlio] introduces a stochastic-cloning EKF coupled with the neural network to further reduce position drift. IDOL[sun2021idol] and [wang2021pose] are recent deep-learning-based works that relax the heavy dependence on device orientation. IDOL designs an explicit orientation estimation module that relies on magnetometer readings, and [wang2021pose] proposes a novel loss formulation to regress velocity from raw inertial measurements. Our work is in line with data-driven inertial odometry research and focuses on mitigating the challenge of distribution shift in the real world. We propose rotation-equivariance as a self-supervision scheme to improve model generalizability and learn from unlabeled data. [chen2019motiontransformer] proposes the MotionTransformer framework, which uses a shared encoder with generative adversarial networks to transform inertial sequences into a domain-invariant hidden representation. It focuses on domain adaptation for long sensory sequences from different domains, whereas our method mainly deals with distribution shifts within one sensory sequence, and we show clear improvements with the help of the proposed self-supervision task. Notably, our work is a flexible module that can be combined with many other deep-learning-based approaches such as RoNIN, TLIO and IDOL.
Self-supervised tasks provide surrogate supervision signals for representation learning. Learning with self-supervision has gained increasing interest as a way to improve model performance and avoid intensive manual labeling effort. Many vision tasks utilize self-supervision for pre-training[Newell_2020_CVPR] or multi-task learning[ren2018cross]. [Zhou_2017_CVPR] uses view synthesis as supervision to learn depth and ego-motion from unstructured video. [agrawal2015learning] shows that ego-motion-based supervision learns useful features for multiple vision problems. [komodakis2018unsupervised] demonstrates that predicting image rotations is a promising self-supervised task for unsupervised representation learning. [sun2020test] uses the image rotation task to create a self-supervised learning problem at test time; they validate their approach on object recognition and show substantial improvements under distribution shifts.
3 RIO
3.1 Rotation-equivariance
Our goal is to develop a self-supervised method that improves the robustness of inertial odometry and makes the model perform well in various scenarios. We observe that when IMU data in the HACF is rotated around the z-axis by a certain angle, the trajectory should be rotated in the same way. We name this property rotation-equivariance. It provides a useful signal for learning robust inertial odometry.
Specifically, consider a sequence of accelerometer data in a world coordinate frame, the accelerations $\mathbf{a} = (a_1, \dots, a_n)$ with $a_i \in \mathbb{R}^3$, and the gyroscope data for the same period in the same coordinate frame, the angular velocities $\mathbf{w} = (w_1, \dots, w_n)$ with $w_i \in \mathbb{R}^3$. We randomly select a horizontal rotation $R_z(\alpha)$ that rotates $\mathbf{a}$ and $\mathbf{w}$ by $\alpha$ degrees around the z-axis, notated as $R_z(\alpha)\mathbf{a}$ and $R_z(\alpha)\mathbf{w}$. The neural network model takes acceleration and angular velocity as input and yields a velocity estimation as output:

$\hat{v} = f_\theta(\mathbf{a}, \mathbf{w})$, (1)

where $\theta$ are the learnable parameters of model $f$. With rotation-equivariance, given the velocity estimations $\hat{v} = f_\theta(\mathbf{a}, \mathbf{w})$ and $\hat{v}' = f_\theta(R_z(\alpha)\mathbf{a}, R_z(\alpha)\mathbf{w})$, there should be only a horizontal rotation between $\hat{v}$ and $\hat{v}'$. That is, if the operator $R_z(\alpha)$ is applied to the velocity $\hat{v}$ to obtain the rotated velocity $R_z(\alpha)\hat{v}$, we expect $\hat{v}' \approx R_z(\alpha)\hat{v}$. Negative cosine similarity[chen2021exploring] is employed to evaluate the difference between the two velocities:

$\mathcal{D}(v_1, v_2) = -\dfrac{\langle v_1, v_2 \rangle}{\lVert v_1 \rVert_2 \, \lVert v_2 \rVert_2}$, (2)

where $\langle \cdot, \cdot \rangle$ denotes the inner product between vectors. Therefore, we define a self-supervised auxiliary task: given a set of training IMU samples $\{(\mathbf{a}_i, \mathbf{w}_i)\}_{i=1}^{N}$, the neural network model must learn to solve the self-supervised training objective

$\min_\theta \dfrac{1}{N} \sum_{i=1}^{N} \ell(\mathbf{a}_i, \mathbf{w}_i; \theta)$, (3)

where the loss function $\ell$ is defined as

$\ell(\mathbf{a}, \mathbf{w}; \theta) = \mathcal{D}\big(R_z(\alpha)\, f_\theta(\mathbf{a}, \mathbf{w}),\ f_\theta(R_z(\alpha)\mathbf{a}, R_z(\alpha)\mathbf{w})\big)$. (4)
In the following subsections, we describe how the self-supervised auxiliary task helps with model training and inference.
3.2 Joint Training
In the training phase, we optimize the auxiliary loss (see Eq. (4)) jointly with velocity losses. With the auxiliary task, the neural network model is encouraged to produce velocity estimations with a certain relative geometric relationship. However, it is unrealistic for the model to learn the magnitude and direction of velocity in a consistent coordinate frame from the auxiliary task alone. We therefore jointly adopt the robust stride velocity loss to supervise the model. The stride velocity loss computes the mean squared error between the model output at a time frame and the ground-truth velocity, which is calculated as the average velocity over the sensor input time stride. In practice, we take one second of sensor data as input and calculate the corresponding average velocity as the supervision signal. To train the model on both tasks, we create conjugate data for each training input and organize them as data pairs. IMU data is sampled at 200 Hz and we take every 200 continuous frames as input; in other words, each input covers 1 second of IMU data. IMU data and ground-truth trajectories are transformed into the same HACF as in RoNIN[herath2020ronin], and the ground-truth velocity is then calculated as the displacement over the corresponding time window divided by the window length, according to the ground-truth trajectory in the HACF. More precisely, at frame $i$, the input is $x_i = (a_{i-199:i}, w_{i-199:i})$ and the ground-truth velocity is $v_i = (p_i - p_{i-200}) / 1\,\mathrm{s}$, where $p_i$ is the position on the trajectory. For each input $x_i$, we select a random angle $\alpha_i$ to horizontally rotate the accelerations and angular velocities in $x_i$, obtaining the conjugate data $x_i'$. Both $x_i$ and its conjugate $x_i'$ are processed by the neural network model, yielding two outputs $\hat{v}_i$ and $\hat{v}_i'$. For the output $\hat{v}_i$, we calculate the stride velocity loss against $v_i$. As shown in Sec. 3.1, we rotate $\hat{v}_i$ around the z-axis by $\alpha_i$ and calculate the negative cosine similarity between $R_z(\alpha_i)\hat{v}_i$ and $\hat{v}_i'$ as the loss for the self-supervised auxiliary task. To avoid ambiguous velocity orientation when stationary, we ignore the auxiliary loss when the velocity magnitude falls below a small threshold. The pseudocode of joint training can be found in Algorithm 1.
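The loss computation for one training window described above can be sketched as follows (a NumPy sketch; `model` stands in for the network, and the 0.1 m/s stationarity threshold is an illustrative placeholder, since the paper elides the exact value):

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix about the z-axis, angle in degrees."""
    r = np.deg2rad(deg)
    c, s = np.cos(r), np.sin(r)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def joint_losses(model, acc, gyr, gt_vel, rng, min_speed=0.1):
    """Losses for one 200-frame window: (stride velocity MSE, auxiliary loss).

    A random conjugate rotation angle is drawn per window; the auxiliary
    term is skipped when the subject is (nearly) stationary, because the
    heading is then ambiguous.
    """
    alpha = rng.uniform(0.0, 360.0)
    R = rot_z(alpha)
    v = model(acc, gyr)                    # prediction on original data
    v_conj = model(acc @ R.T, gyr @ R.T)   # prediction on conjugate data
    mse = float(np.mean((v - gt_vel) ** 2))
    if np.linalg.norm(gt_vel) <= min_speed:
        aux = 0.0                          # ambiguous orientation: skip aux loss
    else:
        rv = R @ v
        aux = -float(rv @ v_conj) / (np.linalg.norm(rv) * np.linalg.norm(v_conj) + 1e-8)
    return mse, aux
```

In actual training the two terms would be weighted and back-propagated together; the weighting scheme is not shown here.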
3.3 Adaptive TTT
At test time, we propose adaptive TTT based on rotation-equivariance and uncertainty estimation. It helps improve model performance on unseen data that has a large gap with the training data. For test samples, we create conjugate data pairs in the same way as in the training phase. With the self-supervised auxiliary task presented in Sec. 3.1, we calculate the auxiliary loss and update the parameters $\theta$ of the neural network model before making predictions. For IMU data arriving in an online stream, we adopt an online version that keeps the updated parameters for a while and restores the initial parameters in specific situations.
While properly updated models can make substantial improvements under distribution shifts, their performance on the original distribution may drop dramatically if TTT updates the parameters in an inappropriate way. The proposed auxiliary task cannot capture accurate losses when the subject moves with an ambiguous orientation, e.g., moving slowly or standing still. At inference time, the velocity threshold used in the training phase is not enough to ensure stable and reliable updates, since the batch of data used to optimize the model comes from a continuous period of time and its samples tend to have ambiguous orientation at the same time. Therefore, we introduce uncertainty estimation to help determine the right time to update or restore model parameters.
Uncertainty estimation We use deep ensembles to provide predictive uncertainty estimations that are able to express higher uncertainty on out-of-distribution examples[lakshminarayanan2016simple]. We adopt a randomization-based approach: individual ensemble models are obtained through random initialization of the neural network parameters and random shuffling of the training data. Formally, we randomly initialize $M$ neural network models with different parameters $\theta_1, \dots, \theta_M$, each of which parameterizes a different distribution over the outputs. Each model converges along an independent optimization path with randomly shuffled training data. For convenience, we assume the ensemble follows a Gaussian distribution and each model prediction $f_{\theta_m}(x)$ represents a sample from that distribution. We approximate the prediction uncertainty as the variance of the sampled predictions, $\sigma^2(x) = \frac{1}{M} \sum_{m=1}^{M} \big(f_{\theta_m}(x) - \mu(x)\big)^2$, where $\mu(x) = \frac{1}{M} \sum_{m=1}^{M} f_{\theta_m}(x)$. For our inertial odometry model, we obtain a velocity from each model, and the velocity variance can be calculated from the corresponding sampled estimations. We show in experiments that the velocity variance based on deep ensembles is a good indicator of the model's confidence in its estimation. The velocity variance is used as a prediction uncertainty indicator to determine when to update or restore model parameters.
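The ensemble statistics above reduce to a mean and a variance over model outputs; a sketch (names are ours), where each element of `models` is an independently trained predictor:

```python
import numpy as np

def ensemble_uncertainty(models, acc, gyr):
    """Velocity mean and a scalar variance across an ensemble of
    independently trained models (each a prediction function)."""
    preds = np.stack([m(acc, gyr) for m in models])  # shape (M, dim)
    mean = preds.mean(axis=0)        # ensemble prediction
    var = preds.var(axis=0).sum()    # summed per-axis variance as uncertainty
    return mean, var
```

When the ensemble members agree, the variance collapses toward zero; disagreement on out-of-distribution inputs drives it up, which is what the adaptive policy exploits.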
TTT strategy Further, we propose an adaptive TTT strategy based on uncertainty estimations. First, we stop updating model parameters when velocity estimations have a high confidence level. When the subject moves with ambiguous orientation, the auxiliary loss tends to be large; however, the velocity variance is not necessarily large, and in fact tends to be small. This avoids unnecessary updates: the model is updated only when needed.
Second, we need to know when to reset the model. The model can drift a lot after inappropriate updates. We hope to keep the updated parameters while the motion is continuous; however, if the motion switches to a different mode, the IMU data distribution changes substantially and the updated model may perform worse than the original model on the unseen data. Meanwhile, based on the simple observation that in most cases there is a stationary or nearly stationary zone between two different motion modes, we propose to restore the original model parameters when the subject is stationary or nearly stationary. We use velocity uncertainty to capture these moments, since the inertial odometry model tends to have a very high confidence level when stationary or nearly stationary.
To do inference at test time, the neural network model is first initialized with pretrained parameters $\theta_0$. Test samples, each consisting of 200 frames of IMU data, are sampled every 10 frames (i.e., at 20 Hz), and once 128 test samples arrive we group them into a batch for test-time training. For convenience, we select four angles to create conjugate inputs in the same way as in the training phase. With the original and conjugate inputs, we obtain velocity estimations from the model. Denoting the original outputs as $\hat{v}_i$ and the conjugate outputs as $\hat{v}_i'$, we rotate the original outputs by the corresponding angles to get $R_z(\alpha_i)\hat{v}_i$. The auxiliary loss is calculated in the same way as in the training phase:

$\ell_{\mathrm{TTT}}(\theta) = \dfrac{1}{B} \sum_{i=1}^{B} \mathcal{D}\big(R_z(\alpha_i)\hat{v}_i,\ \hat{v}_i'\big)$, (5)

where $B$ is the batch size. For every batch of data, we update the model at most 5 times. With deep-ensemble-based uncertainty estimation, the velocity uncertainty is estimated as the output variance of three independent pretrained models. According to the adaptive TTT policy, we stop updating the model if the average velocity variance is smaller than a certain value, and restore the original parameters if the minimal velocity variance is very small. In practice, we stop updating when the average velocity variance falls below a fixed threshold, and restore the parameters when any velocity variance falls below a smaller threshold. The pseudocode of adaptive TTT can be found in the supplementary.
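The update/restore decision can be sketched as a simple gate on the per-sample ensemble variances (the threshold values below are illustrative placeholders, since the paper elides the exact values):

```python
import numpy as np

# Illustrative thresholds, not the paper's actual values.
STOP_VAR = 0.05      # skip updates when the mean variance is below this
RESTORE_VAR = 1e-3   # restore pretrained weights when any variance is below this

def ttt_action(variances):
    """Decide what adaptive TTT does for one batch, given per-sample
    ensemble velocity variances."""
    v = np.asarray(variances, dtype=float)
    if (v < RESTORE_VAR).any():
        return "restore"   # likely stationary: reset to pretrained parameters
    if v.mean() < STOP_VAR:
        return "skip"      # confident predictions: no update needed
    return "update"        # run up to 5 self-supervised update steps
```

The ordering matters: near-zero variance (stationarity) triggers a reset even if the batch average is otherwise moderate.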
4 Evaluations
We evaluate our proposed method in this section. Our main purpose is to verify that the proposed auxiliary task based on rotation-equivariance helps improve the model's robustness and accuracy. To eliminate the influence of model architecture and training datasets, we adopt a consistent, mature architecture and the same training data for all evaluated models. All models use a ResNet-18 [he2016deep] backbone, and we train them on the largest smartphone-based inertial navigation database, provided by RoNIN[herath2020ronin]. With different supervision tasks in the training phase and different strategies at inference time, we demonstrate that the proposed auxiliary task helps the model outperform the existing state-of-the-art method.
Table 1. Evaluation results on four databases.

Database | Metric  | R-ResNet | B-ResNet | J-ResNet | B-ResNet-TTT | J-ResNet-TTT
RoNIN    | ATE ()  | 5.14     | 5.57     | 5.02     | 5.05         | 5.07
         | RTE ()  | 4.37     | 4.38     | 4.23     | 4.14         | 4.17
         | Ddrift  | 11.54    | 9.79     | 9.59     | 8.49         | 9.10
OXIOD    | ATE ()  | 3.46     | 3.52     | 3.59     | 2.92         | 2.96
         | RTE ()  | 4.39     | 4.42     | 4.43     | 3.67         | 3.74
         | Ddrift  | 20.67    | 19.68    | 17.43    | 15.50        | 15.98
RIDI     | ATE ()  | 1.33     | 1.19     | 1.13     | 1.04         | 1.03
         | RTE ()  | 2.01     | 1.75     | 1.65     | 1.53         | 1.51
         | Ddrift  | 10.50    | 7.99     | 7.61     | 6.89         | 6.93
IPS      | ATE ()  | 1.60     | 1.84     | 1.67     | 1.55         | 1.55
         | RTE ()  | 1.52     | 1.68     | 1.65     | 1.46         | 1.47
         | Ddrift  | 8.38     | 7.66     | 7.96     | 5.93         | 6.75
Network details We adopt a ResNet-18 backbone since the ResNet-18 model achieved the highest accuracy on multiple datasets in RoNIN. We replace Batch Normalization (BN) with Group Normalization (GN) because the trained model will be used in TTT, where training happens with small batches. BN, which uses estimated batch statistics, has been shown to be ineffective with small batches whose statistics are inaccurate. GN, which uses channel-group statistics, is not influenced by batch size and achieves results similar to BN on the inertial odometry problem. As proposed in Sec. 3.2, we train a model, denoted J-ResNet, in the joint-training setting following Algorithm 1. We use the RoNIN model with a ResNet-18 backbone as a baseline. While RoNIN publishes a pretrained ResNet model, denoted R-ResNet, which is exactly the one reported in [herath2020ronin], that model uses BN and is trained with the whole RoNIN dataset, of which only half is public due to privacy limitations. For a fair comparison, we retrain a model using GN on the public portion as a baseline; all other implementation details are exactly as in [herath2020ronin]. We denote the retrained model B-ResNet.
Databases Models are evaluated on three popular public databases for inertial odometry, OXIOD[chen2018oxiod], RoNIN[herath2020ronin] and RIDI[yan2018ridi], and one database collected in different scenarios by ourselves, IPS. Collection details are presented in the supplementary. For trajectory sequences in OXIOD, RIDI and IPS, the whole estimated trajectory is aligned to the ground-truth trajectory with the Umeyama algorithm [umeyama1991least] before evaluation. For RoNIN, whose sensor data and ground-truth trajectories are well calibrated to the same global frame, we directly compare the reconstructed trajectory with the ground truth.
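For reference, the Umeyama alignment used above has a compact closed form; a NumPy sketch of the least-squares similarity transform (gt ≈ s·R·est + t) is:

```python
import numpy as np

def umeyama_align(est, gt):
    """Least-squares similarity transform (Umeyama, 1991) mapping the
    estimated trajectory onto the ground truth: gt ≈ s * R @ est + t.

    est, gt: (N, d) arrays of trajectory points.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    cov = G.T @ E / len(est)                 # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(len(D))
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0                     # enforce a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (E ** 2).sum(axis=1).mean()
    t = mu_g - s * R @ mu_e
    return s, R, t
```

For trajectory evaluation the recovered transform is applied to the whole estimate before computing the error metrics.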
We evaluate the neural networks J-ResNet and B-ResNet with two different approaches: one is the standard neural network inference pipeline, the same as in IONet[chen2018ionet] and RoNIN[herath2020ronin]; the other uses the adaptive TTT proposed in Sec. 3.3. R-ResNet is evaluated only with the standard pipeline, since it uses BN normalization layers and cannot be optimized with small data batches.
4.1 Metrics definitions
Three metrics are used for quantitative trajectory evaluation of inertial odometry models: Absolute Trajectory Error (ATE), Relative Trajectory Error (RTE) and Distance drift (Ddrift). ATE and RTE are standard metrics proposed in [sturm2012benchmark].
ATE (), calculated as the average Root Mean Squared Error (RMSE) between the estimated and ground-truth trajectories as a whole.
RTE (), calculated as the average RMSE between the estimated and ground-truth trajectories over a fixed length or time interval. We use time-based RTE as in RoNIN, evaluated over 1 minute.
Ddrift, calculated as the absolute difference between the estimated and ground-truth trajectory lengths, divided by the length of the ground-truth trajectory.
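Under the definitions above, the three metrics can be sketched as follows (a NumPy sketch of the common conventions; exact windowing and averaging details may differ from the paper's implementation):

```python
import numpy as np

def ate(est, gt):
    """Absolute Trajectory Error: RMSE over the whole (aligned) trajectory."""
    return float(np.sqrt(np.mean(np.sum((est - gt) ** 2, axis=1))))

def rte(est, gt, hz=20, interval_s=60):
    """Relative Trajectory Error: RMSE of relative displacements over a
    fixed time interval (1 minute here, as in RoNIN)."""
    n = hz * interval_s
    d_est = est[n:] - est[:-n]
    d_gt = gt[n:] - gt[:-n]
    return float(np.sqrt(np.mean(np.sum((d_est - d_gt) ** 2, axis=1))))

def d_drift(est, gt):
    """Distance drift: |estimated length - true length| / true length."""
    length = lambda p: np.linalg.norm(np.diff(p, axis=0), axis=1).sum()
    return abs(length(est) - length(gt)) / length(gt)
```

ATE is sensitive to global misalignment, RTE isolates medium-term drift, and Ddrift captures scale errors in the traveled distance.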
4.2 Performance
Tab. 1 presents our main results. None of the subjects used for evaluation appear in the training sets. Our evaluation of R-ResNet on the RoNIN test datasets is consistent with the RoNIN unseen-set results reported in [herath2020ronin]. The other three datasets are not used in the training phase. R-ResNet is trained with the full RoNIN dataset, whereas we only use the published half to train B-ResNet and J-ResNet; therefore, we report R-ResNet performance for reference only. B-ResNet is the fair baseline against which we compare the other methods.
The results show that J-ResNet outperforms B-ResNet on most databases. J-ResNet reduces ATE by , and on the RoNIN, RIDI and IPS databases, respectively. Notably, on the RoNIN database, J-ResNet outperforms R-ResNet, which is trained with twice as much training data.
B-ResNet-TTT outperforms B-ResNet on all databases by a significant margin. It reduces ATE by , , and on the RoNIN, RIDI, OXIOD and IPS databases, respectively. J-ResNet-TTT reduces ATE by , and on OXIOD, RIDI and IPS, and performs comparably to J-ResNet on RoNIN. In short, the adaptive TTT strategy proposed in Sec. 3.3 can further improve the performance of both B-ResNet and J-ResNet.
Both models are trained on the RoNIN training database. J-ResNet is trained with the auxiliary task, which already improves performance on the RoNIN test database. We assume that, because the auxiliary task is optimized for the RoNIN database in the training phase, it does not significantly improve model performance further with test-time training there. For OXIOD, RIDI and IPS, which are novel databases for both models, adaptive TTT further improves performance on all metrics.
Fig. 3 shows visualizations of selected trajectories for J-ResNet and J-ResNet-TTT. It plots the estimated trajectories of both models against the ground truth, along with a comparison of velocity estimation losses. The velocity estimation losses are greatly reduced precisely where the original model's losses are large, which demonstrates that the auxiliary task used in adaptive TTT can optimize the model at pivotal steps and yield a better trajectory estimation.
4.3 Performance on Multiple Scenarios
The proposed rotation-equivariance contributes differently in different scenarios. Although the RoNIN training database is the largest public inertial odometry database with rich diversity, model performance in certain scenarios can still be improved by a large margin using RIO.
We compare model performance by scenario on the IPS database and present the results in Fig. 4. In different scenarios, the devices carrying the IMU sensors are mounted at different placements and held in different ways. Fig. 4 shows that in all scenarios J-ResNet outperforms B-ResNet, and the TTT versions of both models improve even further. Under the calling, back pocket, photo portrait and photo landscape scenarios, the ATE of B-ResNet can be reduced by , , and , and for J-ResNet, the TTT version reduces ATE by , , and , respectively. These four scenarios are not very common in daily life and may not appear as frequently as other poses in the RoNIN training database. The large improvements in these unusual scenarios therefore indicate that adaptive TTT helps trained models learn from novel data distributions and improves their performance under distribution shifts.
4.4 TTT Strategy Analysis
In this section, we evaluate the TTT strategy in isolation and explain why uncertainty estimations help improve model performance.
Table 2. Comparison of adaptive TTT (A-TTT) and naive TTT (N-TTT).

Database | Metric  | A-TTT | N-TTT
RoNIN    | ATE ()  | 5.05  | 4.94
         | RTE ()  | 4.14  | 4.27
         | Ddrift  | 8.49  | 9.43
OXIOD    | ATE ()  | 2.92  | 3.50
         | RTE ()  | 3.67  | 4.39
         | Ddrift  | 15.50 | 19.55
RIDI     | ATE ()  | 1.04  | 1.11
         | RTE ()  | 1.53  | 1.64
         | Ddrift  | 6.89  | 7.56
IPS      | ATE ()  | 1.55  | 1.63
         | RTE ()  | 1.46  | 1.54
         | Ddrift  | 5.93  | 6.73
1) Is uncertainty estimation with deep ensembles reliable? Fig. 5 shows the velocity estimations for one trajectory against the ground truth, together with their uncertainty estimations and velocity estimation losses. It demonstrates that our predictive uncertainty closely follows the estimation losses: as expected, the predictive uncertainty decreases when the velocity estimation losses decrease. With this uncertainty estimate, we can stop updating when it falls below a certain level, since the model is already accurate at that point. Notably, the predictive uncertainty drops to zero when the velocity magnitude is around zero. As mentioned in Sec. 3.3, detecting stationary or nearly stationary zones is important for adaptive TTT because they mark the right time to restore the original model parameters.
2) Comparing adaptive TTT with alternatives: The last row of Fig. 5 compares the ATE of Adaptive TTT (A-TTT) over time with Naive TTT (N-TTT). N-TTT always updates the model according to the losses of the self-supervised task and ignores velocity uncertainty estimations; it keeps the latest updated model and never restores the original parameters over one continuous trajectory. Fig. 5 shows that the ATE of N-TTT increases faster than that of A-TTT. There are two noticeable time windows in which ATE increases more steeply with N-TTT; during these windows the velocity is decreasing toward zero, meaning the subject is coming to a stop. With the adaptive strategy, the model restores the original parameters and the ATE increase is suppressed. Tab. 2 compares A-TTT and N-TTT on the four databases and shows that A-TTT has a clear advantage over N-TTT.
5 Ablation Studies
We conducted additional experiments on the joint-training and TTT settings for ablation analysis.
Model performance vs. size of training data We trained joint-training models with different sizes of training data. We denote the neural network provided and published by RoNIN[herath2020ronin] as 100% B-ResNet, since it is trained with the whole RoNIN database. We trained models with 50%, 30% and 10% of the whole database in the two ways mentioned before, and evaluated their performance under different settings. Fig. 6 compares the different models. While the performance of B-ResNet and J-ResNet drops considerably as the training database becomes smaller, J-ResNet-TTT with 30% of the training database is still comparable to 100% B-ResNet. However, J-ResNet-TTT performance also drops substantially with only 10% of the training database.
Influence of update iterations At test time, the model can be updated multiple times with one batch of data. Fig. 7 shows the results of one model with the number of update iterations varying from 1 to 15. There are clear improvements when increasing iterations from 1 to 5. However, more than 5 updates show no clear advantage, and model performance even degrades slightly at 15 updates per batch. More iterations also cost more time and computing resources. Therefore, we recommend at most 5 updates per batch during TTT.
6 Discussion and Conclusion
In this paper we present Rotation-equivariance-supervised Inertial Odometry (RIO) to improve the performance and robustness of inertial odometry. Rotation-equivariance can be formulated as a self-supervised auxiliary task and applied both in the training phase and at inference. Extensive experimental results demonstrate that the rotation-equivariance task advances model performance in the joint-training setting and improves the model further with the Test-Time Training (TTT) strategy. Beyond rotation-equivariance, other properties (e.g., time reversal, masked autoencoding of time series) may also be formulated as self-supervised tasks for inertial odometry. We hope our observations will inspire future work on self-supervised learning for inertial odometry.
Further, we propose to employ deep ensembles to estimate the uncertainty of RIO. With uncertainty estimation, we develop adaptive TTT for evolving RIO at inference time, which largely improves the generalizability of RIO. Adaptive TTT with the auxiliary task enables a model trained with less than one-third of the data to outperform the state-of-the-art deep inertial odometry model, especially in scenarios unseen during the training phase. Adaptive online model updating with uncertainty estimation is a practical way to improve deep model performance in real-life applications, and uncertainty estimation based on deep ensembles gives a reliable judgment on the outputs of deep models. Adaptive TTT can be implemented in different ways, either conservative or aggressive in updating the model, depending on the application scenario.
References
Supplementary Materials
Selected visualizations We select 3 examples from each open-sourced dataset and visualize the ground-truth trajectories together with those of J-ResNet and J-ResNet-TTT, along with velocity loss comparisons between J-ResNet and J-ResNet-TTT.