Neural Networks Versus Conventional Filters for Inertial-Sensor-based Attitude Estimation

Inertial measurement units are commonly used to estimate the attitude of moving objects. Numerous nonlinear filter approaches have been proposed for solving the inherent sensor fusion problem. However, when a large range of different dynamic and static rotational and translational motions is considered, the attainable accuracy is limited by the need for situation-dependent adjustment of accelerometer and gyroscope fusion weights. We investigate to what extent these limitations can be overcome by means of artificial neural networks and how much domain-specific optimization of the neural network model is required to outperform the conventional filter solution. A diverse set of motion recordings with a marker-based optical ground truth is used for performance evaluation and comparison. The proposed neural networks are found to outperform the conventional filter across all motions only if domain-specific optimizations are introduced. We conclude that they are a promising tool for inertial-sensor-based real-time attitude estimation, but both expert knowledge and rich data sets are required to achieve top performance.




I Introduction

Inertial sensors have been used for several decades in aerospace systems for attitude control and navigation. Drastic advances in microelectromechanical systems (MEMS) have led to the development of miniaturized strapdown inertial measurement units (IMUs), which have entered a multitude of new application domains from autonomous drones to ambulatory human motion tracking.

In strapdown IMUs, the angular rate and acceleration – and sometimes also the magnetic field vector – are measured in a sensor-intrinsic three-dimensional coordinate system, which moves along with the sensor. Estimating the orientation, velocity or position of the sensor with respect to some inertial frame requires strapdown integration of the angular rates and sensor fusion of the aforementioned raw measurement signals (cf. Figure 1).


Fig. 1: Attitude estimation workflow (graphic based on [1])

Estimating the orientation of an IMU from its raw measurement signals in real time is a fundamental problem of inertial sensor fusion. A large variety of filter algorithms have been proposed previously, some of which are implemented in motion processing units of modern miniature IMUs. It is well known that the attitude of the sensor can be determined by 6D sensor fusion, i.e. fusing 3D gyroscope and 3D accelerometer readings, while estimating the full orientation (attitude and heading) requires 9D sensor fusion, i.e. using 3D magnetometer readings in addition to the 6D signals.

Existing solutions to inertial attitude estimation are typically model-based and heuristically parameterized. They use mathematical models of measurement errors and three-dimensional rotations and transformations of the gravitational acceleration. They require a reasonable choice of covariance matrices, fusion weights or parameters that define how weights are adjusted. While high accuracies have been achieved with such approaches in many application domains, it is also well known that the best parameterization depends on the type of motion and disturbance. In fact, to the best of our knowledge, there is to date no filter algorithm that yields consistently small errors across all types of motion that a MEMS-based IMU might perform.

Abundant research has demonstrated the capabilities of artificial neural networks in providing data-based solutions to problems that have conventionally been addressed by model-based approaches. If sufficiently large amounts of data and computation capability are available, generally usable solutions may be found for most problems. While ample work has shown that a number of problems can also be solved using neural networks, the practically more relevant question whether neural networks can outperform conventional solutions often remains unanswered.

In the present work, we investigate whether a neural network can solve the real-time attitude estimation task with similar or even better performance than a state-of-the-art inertial orientation estimation filter. Moreover, we analyze at which cost this can be achieved, in terms of required number of data sets, required complexity and application-specific structure of the neural network.

II Related Work

We first briefly review the state-of-the-art in real-time attitude estimation from inertial sensor signals and then describe previous work on the use of artificial neural networks for inertial motion analysis.

II-A Inertial Attitude Estimation

As mentioned above, the attitude of an IMU can be determined by sensor fusion of the accelerometer and gyroscope readings. Accelerometers yield accurate attitude information in static conditions, i.e. when the sensor moves with constant velocity. Under dynamic conditions, however, their readings are only useful under certain assumptions, for example that the average change of velocity is zero on sufficiently large time scales. Gyroscopes yield highly accurate information on the change of attitude. However, pure strapdown integration of the angular rates is prone to drift resulting from measurement bias, noise, clipping and undersampling. Accurate attitude estimation under non-static conditions requires sensor fusion of both 3D signals.
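The drift of pure strapdown integration can be illustrated with a minimal sketch. It assumes quaternions stored as `(w, x, y, z)` tuples, angular rates in rad/s expressed in sensor axes, and a constant rate within each sampling interval; the function names are hypothetical and not taken from any of the cited filters.

```python
import math

def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )

def strapdown_step(q, gyr, dt):
    """Propagate orientation q by one angular rate sample (assumed constant over dt)."""
    rate = math.sqrt(sum(g * g for g in gyr))
    if rate * dt < 1e-12:
        return q
    half = 0.5 * rate * dt
    dq = (math.cos(half), *(math.sin(half) * g / rate for g in gyr))
    # Rates are measured in sensor axes, hence multiplication from the right.
    return quat_mul(q, dq)

def integrate(gyr_samples, dt, q0=(1.0, 0.0, 0.0, 0.0)):
    """Integrate a sequence of angular rate samples starting from q0."""
    q = q0
    for gyr in gyr_samples:
        q = strapdown_step(q, gyr, dt)
    return q

# A resting sensor with a constant bias of 0.01 rad/s drifts by 0.1 rad in 10 s.
drifted = integrate([(0.01, 0.0, 0.0)] * 1000, 0.01)
print(2 * math.acos(drifted[0]))
```

Without an accelerometer-based correction, nothing in this loop can remove the accumulated bias, which is why both signals have to be fused.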

A number of different solutions have been proposed for this task. Categorizations and comparisons of different algorithms can be found, for example, in [2, 3]. Most filters use either an extended Kalman filter scheme or a complementary filter scheme, and unit quaternions are a common choice for mathematical representation of the three-dimensional orientation. The balance between gyroscope-based strapdown integration and accelerometer-based drift correction is typically adjusted to the specific application by manual tuning of covariance matrices or other fusion weights. Methods have been proposed that analyze the accelerometer norm to distinguish static and dynamic motion phases and adjust the fusion weights in real time.

A rather recently developed quaternion-based orientation estimation filter is described in [4]. It uses geodetic accelerometer-based correction steps and optional magnetometer-based correction steps for heading estimation. The correction steps are parametrized by intuitively interpretable time constants, which are adjusted automatically if the accelerometer norm is far from the static value or has been close to that value for several consecutive time steps. The performance of this filter and five other state-of-the-art filters has recently been evaluated across a wide range of motions. For all filters, errors between two and five degrees were found for different speeds of motion [5]. To the best of our knowledge, a significantly more accurate solution for attitude estimation in MEMS-based IMUs does not exist.

II-B Neural Networks for Attitude Estimation

In inertial motion tracking, neural networks have mostly been applied to augment existing conventional filter solutions. In [6] a Recurrent Neural Network (RNN) is used for movement detection in order to decide which Kalman filter should be applied to the current system state. In [7] a feed-forward neural network is used for smoothing the output of a Kalman filter, while an RNN is used for pre-processing of Kalman filter inputs in [8]. A similar approach is used in [9], where a convolutional neural network is used for error correction of the gyroscope signal as part of a strapdown integration.

In [10] and [11] RNNs are used as black boxes for the orientation integration over time. While the former uses a combination of gyroscope and visual data, the latter relies only on the gyroscope and achieves similar results. In a few more recent works, neural networks have been applied directly as black boxes for angle estimation problems. In [12] an RNN is used for human limb assignment and orientation estimation of IMUs that are attached to human limbs. It achieved a high accuracy at the assignment problem but was only partially successful at the orientation estimation problem. In [13] a bidirectional RNN is used for velocity and heading estimation on a two-dimensional plane in polar coordinates.

To conclude, an end-to-end neural network model for IMU-based attitude estimation has not been developed yet. All of the presented neural networks are either an addition to classical filters for attitude estimation or they address different problems.

III Problem Statement

Consider an inertial sensor with an intrinsic right-handed coordinate system $\mathcal{S}$. Neglect the rotation of the Earth and define an inertial frame of reference $\mathcal{E}$ with vertical z-axis. The orientation of the sensor with respect to the reference frame is then described by the rotation between both coordinate systems, which can be expressed as a unit quaternion, a rotation matrix, a set of three Euler angles or a single angle and a corresponding rotation axis. Both frames are said to have the same attitude if the axis of that rotation is vertical.

If the true orientation of the sensor is given by the unit quaternion $q$ and an attitude estimation algorithm yields an estimate $\hat{q}$, then $q_e = q \otimes \hat{q}^{-1}$ is the estimation error quaternion expressed in reference frame axes. The attitude estimation is said to be perfect if the estimated orientation is correct up to a rotation around the vertical axis. This is the case if the rotation axis of $q_e$ is vertical. If that axis is not vertical, then $q_e$ can be decomposed into a rotation around the vertical axis and a rotation around a horizontal axis. For any given $q_e$ with real part $w_e$ and third imaginary part $z_e$, the smallest possible rotation angle of the horizontal-axis rotation is $e_\alpha(q, \hat{q}) = 2 \arccos\left(\sqrt{w_e^2 + z_e^2}\right)$. This corresponds to the smallest rotation by which one would need to correct the estimate to make its attitude error zero in the aforementioned sense.

These definitions allow us to formulate the following attitude estimation problem: Given a sampled sequence of three-dimensional accelerometer and gyroscope readings of a MEMS-based IMU moving freely in three-dimensional space, estimate the attitude of that IMU with respect to the reference frame at each sampling instant based only on current and previous samples. Denote the sensor readings by $a(t)$ and $g(t)$, respectively, with $t \in \{1, \dots, N\}$ being the discrete time and $N$ the number of samples. The desired algorithm should then yield a sampled sequence of estimates $\hat{q}(t)$ with the smallest possible cumulative attitude estimation error defined by

$$e_{\alpha,\mathrm{RMS}} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} e_\alpha\big(q(t), \hat{q}(t)\big)^2},$$

where $q(t)$ is the true orientation of the sensor at time $t$. In the following sections, we aim to develop an artificial neural network that solves the given problem and compare it to an established attitude estimation filter.
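The attitude error described above can be sketched in a few lines. The sketch assumes quaternions stored as `(w, x, y, z)` tuples that map sensor coordinates to reference coordinates, an error quaternion formed as the true orientation times the inverse estimate, and a root-mean-square aggregation over time (an assumption suggested by the RMSE figures later in the paper). All function names are hypothetical.

```python
import math

def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )

def quat_conj(q):
    """Conjugate, which equals the inverse for unit quaternions."""
    return (q[0], -q[1], -q[2], -q[3])

def attitude_error(q_true, q_est):
    """Smallest tilt angle (rad) left after removing any heading rotation."""
    w, _, _, z = quat_mul(q_true, quat_conj(q_est))
    # Clamp to 1 so that rounding errors cannot push acos out of its domain.
    return 2.0 * math.acos(min(1.0, math.sqrt(w * w + z * z)))

def rms_attitude_error(q_true_seq, q_est_seq):
    """Root-mean-square attitude error over a sequence (assumed aggregation)."""
    errs = [attitude_error(a, b) for a, b in zip(q_true_seq, q_est_seq)]
    return math.sqrt(sum(e * e for e in errs) / len(errs))

identity = (1.0, 0.0, 0.0, 0.0)
heading_only = (math.cos(0.35), 0.0, 0.0, math.sin(0.35))  # 0.7 rad about vertical
tilted = (math.cos(0.15), math.sin(0.15), 0.0, 0.0)        # 0.3 rad about horizontal
print(attitude_error(identity, heading_only))  # close to zero
print(attitude_error(identity, tilted))        # close to 0.3
```

A pure heading offset yields an error near zero, which reflects that the heading cannot be observed from accelerometer and gyroscope readings alone.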

IV Neural Network Model

In this work a neural network model with state-of-the-art best practices for time series will be implemented. Building upon that, further optimizations are introduced that utilize domain-specific knowledge.

IV-A Neural Network Structure with General Best Practices

The performance of a neural network model depends on the model architecture and the training process. First we identify potential model architectures for attitude estimation. After that we develop an optimized training process for these architectures.

Fig. 2: RNN model for attitude estimation

The model architecture consists of multiple layers that may be connected in multiple ways leading to different characteristics. First a method for modelling the dynamic system states has to be chosen. A common practice is to connect the model output to the model input creating an autoregressive model that stores the system state information in the single autoregressive connection. For longer sequences, the autoregressive model’s inherent sequential nature prevents parallelization and therefore an efficient use of hardware acceleration, which slows down the training. Using neural network layers that are able to model system states avoids the need of autoregression for dynamic systems. The most commonly used ones are Recurrent Neural Networks (RNNs) and Temporal Convolutional Networks (TCNs).

RNNs have recurrent connections between samples in their hidden activations for modelling the state of a dynamic system. There are different variants of RNNs, with Long Short-Term Memories (LSTMs) being the most prevalent [14]. LSTMs add three gates to the classical RNN, which regulate the information flow of its hidden activations. This stabilizes the stored state, enabling the application to systems with long-term dependencies, like integrating movements over a long period of time. Because LSTMs are prone to overfitting, several regularization methods for sequential neural networks have been developed [15]. Increasing the amount of regularization together with the model size is the main approach for improving a neural network without domain-specific knowledge. In the present work, we use a two-layer LSTM model with a hidden size of 200 for each layer and a final linear layer that reduces the hidden activation count to four. These four activations represent the elements of the estimated attitude quaternion. In order to always generate a unit quaternion, the elements are divided by their Euclidean norm. The structure of the RNN model used in this work is visualized in Figure 2.
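The final normalization step is simple enough to sketch directly. The function name and the small guard value `eps` are assumptions added for illustration; the text only states that the four outputs are divided by their Euclidean norm.

```python
import math

def to_unit_quaternion(raw, eps=1e-12):
    """Divide four raw network outputs by their Euclidean norm.

    eps is a hypothetical guard against division by zero for an all-zero output.
    """
    norm = math.sqrt(sum(v * v for v in raw))
    return tuple(v / max(norm, eps) for v in raw)

q = to_unit_quaternion((0.5, -1.5, 2.0, 0.25))
print(q, math.sqrt(sum(v * v for v in q)))
```

Normalizing inside the model means that every output is a valid rotation, so the loss never has to penalize non-unit quaternions.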

Fig. 3: TCN model for attitude estimation

An alternative approach to RNNs for sequential data are TCNs. TCNs are causal one-dimensional dilated convolutional neural networks with receptive fields big enough to model the system dynamics [16]. The main advantage of TCNs compared to RNNs is their pure feed-forward nature. Having no sequential dependencies leads to parallelizability and therefore fast training on hardware accelerators [17]. The TCN’s receptive field describes the number of samples taken into account for predicting a sample. Because TCNs are stateless, the receptive field needs to be large enough to implicitly estimate the system state from the input signals. Because of the dilated convolutional layers, the receptive field grows exponentially with the depth of the neural network, allowing for large windows using a manageable number of layers. In the present work, we use a 10-layer TCN with a receptive field of samples and a hidden size of 200 for each layer. The structure of that TCN model is visualized in Figure 3.
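The exponential growth of the receptive field can be checked with a short calculation. The sketch assumes one causal convolution per layer, a kernel size of 2 and dilations that double from layer to layer; these are common TCN defaults and are not stated in the text.

```python
def tcn_receptive_field(num_layers, kernel_size=2, dilation_base=2):
    """Receptive field in samples of a stack of causal dilated convolutions.

    Assumes one convolution per layer with dilation dilation_base ** layer_index.
    """
    field = 1
    for layer in range(num_layers):
        field += (kernel_size - 1) * dilation_base ** layer
    return field

for depth in (1, 5, 10):
    print(depth, tcn_receptive_field(depth))
```

Under these assumptions each additional layer doubles the window, so ten layers already cover 1024 samples, whereas undilated convolutions with the same kernel size would cover only 11.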

For linear and convolutional layers, batchnorm [18] is used. Batchnorm standardizes the layer activations, enabling larger learning rates and better generalization. Instead of the commonly used sigmoid or rectified linear unit activation functions, we use Mish, which achieved state-of-the-art results in multiple domains. Mish combines the advantages of both activation functions. On the one hand, it is unbounded in positive direction and thus avoids saturation like rectified linear units. On the other hand, it is smooth like sigmoid functions, which improves gradient-based optimization.
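For reference, Mish is commonly defined as x · tanh(softplus(x)). The following scalar sketch uses a numerically stable softplus; in practice a deep learning framework would apply the same function elementwise to tensors.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

for x in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(x, mish(x))
```

For large positive inputs the function approaches the identity, as a rectified linear unit does, while it remains smooth around zero and tends to zero for large negative inputs.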

For training, long overlapping sequences are extracted from the measured sequences, so that the neural network is initialized with different states. Because RNNs can only be reasonably trained with a limited number of time steps for every minibatch, truncated backpropagation through time is used. That means that the long sequence is split into shorter windows that are used for training, with the hidden state of the RNN transferred between consecutive minibatches. The measured sequences are standardized with the same mean and standard deviation values to improve training stability.
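The data preparation described here can be sketched with plain index arithmetic. Sequence length, stride and window length below are hypothetical values chosen for illustration; the text does not state the ones used in training.

```python
def overlapping_sequences(n_samples, seq_len, stride):
    """(start, stop) indices of long, overlapping training sequences."""
    return [(s, s + seq_len) for s in range(0, n_samples - seq_len + 1, stride)]

def tbptt_windows(start, stop, window_len):
    """Split one long sequence into consecutive windows for truncated BPTT.

    The hidden state would be carried over from one window to the next,
    while gradients are only propagated within a window.
    """
    return [(s, min(s + window_len, stop)) for s in range(start, stop, window_len)]

def standardize(values, mean, std):
    """Apply one fixed mean and standard deviation to every recording."""
    return [(v - mean) / std for v in values]

sequences = overlapping_sequences(n_samples=1000, seq_len=400, stride=200)
print(sequences)
print(tbptt_windows(*sequences[0], window_len=150))
```

Because the sequences start at different offsets within a recording, the network begins each of them from a different true system state.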


The main component of the training process is the optimizer. We use a combination of RAdam and Lookahead, which has proven to be effective at several tasks [21], [22]. For the training process we used the Fastai 2 API, which is built upon PyTorch. One of the most important hyperparameters for training a neural network is the learning rate of the optimizer. We choose the maximum learning rate with the learning rate finder heuristic [24] and use cosine annealing for faster convergence [25]. The learning rate finder heuristic determines the maximum learning rate by exponentially increasing the learning rate in a dummy training and finding the point at which the loss has the steepest gradient. Cosine annealing starts with the maximum learning rate, keeps it constant for a given number of epochs and then decreases it following a cosine curve.
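Both heuristics can be sketched independently of any training framework. The sketch below is not the Fastai implementation; the function names, the flat fraction of 0.5 and the example loss values are assumptions. The schedule follows the usual meaning of cosine annealing, in which the learning rate falls along half a cosine period after the constant phase.

```python
import math

def steepest_lr(lrs, losses):
    """Learning rate at which the loss falls fastest per unit of log(lr)."""
    best_index, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log(lrs[i]) - math.log(lrs[i - 1]))
        if best_index is None or slope < best_slope:
            best_index, best_slope = i, slope
    return lrs[best_index]

def flat_cosine_lr(step, total_steps, lr_max, flat_fraction=0.5):
    """Constant learning rate first, then half a cosine period down to zero."""
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return lr_max
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return 0.5 * lr_max * (1.0 + math.cos(math.pi * progress))

# Hypothetical sweep: the loss drops fastest around 1e-3 and diverges at 1e-1.
lr_max = steepest_lr([1e-5, 1e-4, 1e-3, 1e-2, 1e-1], [1.0, 0.95, 0.6, 0.5, 3.0])
print(lr_max, [round(flat_cosine_lr(s, 100, lr_max), 6) for s in (0, 50, 75, 100)])
```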

The other hyperparameters of the neural network model, such as activation dropout and weight dropout, form a vast optimization space. To find a well-performing configuration, we use population-based training [26]. It is an evolutionary hyperparameter optimization algorithm that is parallelizable and computationally efficient. It creates a population of neural networks with different hyperparameters and trains them for some epochs. Then the hyperparameters and weights of the best-performing models overwrite those of the worst ones, and minor hyperparameter variations are introduced. Repeating this process quickly yields a well-performing solution.
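One exploit-and-explore step of such a scheme might look as follows. This is a simplified sketch, not the algorithm of [26] in full: the population is a list of dictionaries, lower scores are better, and the replaced fraction and perturbation factors are hypothetical.

```python
import random

def pbt_step(population, rng, fraction=0.25, perturb=(0.8, 1.2)):
    """Overwrite the worst members with perturbed copies of the best ones.

    Each member is a dict with 'score' (lower is better), 'hparams' and 'weights'.
    """
    ranked = sorted(population, key=lambda member: member["score"])
    n = max(1, int(len(ranked) * fraction))
    best, worst = ranked[:n], ranked[-n:]
    for loser in worst:
        winner = rng.choice(best)
        # Exploit: copy the weights (a real implementation would deep-copy tensors).
        loser["weights"] = winner["weights"]
        # Explore: randomly scale every hyperparameter of the winner.
        loser["hparams"] = {k: v * rng.choice(perturb) for k, v in winner["hparams"].items()}
    return population

rng = random.Random(0)
population = [
    {"score": s, "hparams": {"dropout": 0.1 * (i + 1)}, "weights": f"w{i}"}
    for i, s in enumerate([1.0, 2.0, 3.0, 4.0])
]
pbt_step(population, rng)
print(population[3])
```

Between two such steps every member is trained for a few epochs and its score is re-evaluated, so good hyperparameter schedules emerge during a single training run.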

IV-B Loss Function

(a) Function Comparison
(b) Gradient Comparison
Fig. 4: Comparison of the values and gradients of $\arccos(x)$ and $1 - x$

The output of the model is a quaternion that describes the attitude of the sensor. The loss function describes the accumulated error between the estimated and the ground truth values. In most cases, the mean-squared error between the estimated and reference values is taken. In the present case, an elementwise mean-squared error of the quaternion is not a reasonable choice, since the orientation cannot be estimated unambiguously with only accelerometer and gyroscope signals; a magnetometer would be necessary. An obvious solution would be to choose the loss function equal to the attitude error function

$$e_\alpha(q, \hat{q}) = 2 \arccos\left(\sqrt{w_e^2 + z_e^2}\right),$$

where $w_e$ and $z_e$ are the real part and the third imaginary part of the error quaternion. However, experiments show that using this error definition leads to unstable training resulting from an exploding-gradient problem. This is caused by the $\arccos$ function, whose derivative explodes for arguments approaching 1, which is the target of the optimization problem:

$$\frac{\mathrm{d}}{\mathrm{d}x} \arccos(x) = -\frac{1}{\sqrt{1 - x^2}}.$$

Truncating the argument close to 1 leads to a solution that is numerically stable with rare exceptions. Replacing the $\arccos$ function with a linear term avoids the exploding gradient completely while keeping the monotonicity and correlation with the attitude error:

$$e_{\mathrm{lin}}(q, \hat{q}) = 1 - \sqrt{w_e^2 + z_e^2}.$$

Figure 4 visualizes the differences between both functions and their gradients.
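The difference between the two gradients can also be checked numerically. The sketch assumes that the attitude error has the form 2·arccos(x) and the linear replacement the form 1 − x, where x stands for the square root of the sum of the squared real part and squared third imaginary part of the error quaternion; the function names are hypothetical.

```python
import math

def arccos_error(x):
    """Angle-based error, assumed form 2 * arccos(x)."""
    return 2.0 * math.acos(x)

def arccos_error_grad(x):
    """Analytic derivative of 2 * arccos(x), unbounded as x approaches 1."""
    return -2.0 / math.sqrt(1.0 - x * x)

def linear_error(x):
    """Linear surrogate, assumed form 1 - x."""
    return 1.0 - x

def linear_error_grad(x):
    """The surrogate has a constant derivative."""
    return -1.0

for x in (0.9, 0.999, 0.999999):
    print(x, arccos_error_grad(x), linear_error_grad(x))
```

Both functions decrease monotonically towards zero at x = 1, but only the linear one keeps a bounded gradient exactly where a well-trained model operates.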

(a) Linear Slow Nonstop
(b) Rotation Fast Paused
(c) Arbitrary Fast Paused
Fig. 5: IMU signal and attitude error comparison for three different measurements

Another difficulty of many datasets is the presence of outliers that result from measurement errors. Therefore, we use the smooth-l1-loss function, which is less prone to outliers than the mean-squared-error.
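The robustness to outliers follows from the shape of the function, which is quadratic for small errors and linear for large ones. The sketch below uses the common definition with a transition point `beta`; the value of 1.0 and the example errors are assumptions.

```python
def smooth_l1(err, beta=1.0):
    """Quadratic below beta, linear above, with a continuous first derivative."""
    a = abs(err)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def mean_loss(errors, loss):
    """Average a per-sample loss over a batch of errors."""
    return sum(loss(e) for e in errors) / len(errors)

errors = [0.1, -0.2, 0.05, 3.0]  # the last value plays the role of an outlier
print(mean_loss(errors, smooth_l1))
print(mean_loss(errors, lambda e: e * e))
```

The outlier contributes 2.5 to the smooth-l1 sum but 9.0 to the squared-error sum, so it dominates the averaged loss and its gradient far less.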


IV-C Data Augmentation

Data augmentation is a method for increasing the size of a given dataset by introducing domain knowledge. This is a regularization method that improves the generalizability of a model and has already been applied successfully in computer vision [28] and audio modelling [29]. In case of the present attitude estimation task, we virtually rotate the IMU by transforming the measured accelerometer, gyroscope and reference attitude data by a randomly generated unit quaternion. Thereby, orientation invariance for sensor measurements is introduced to the model.
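A virtual rotation of the IMU can be sketched as follows. The sketch assumes quaternions stored as `(w, x, y, z)` tuples, a reference quaternion that maps sensor coordinates to reference coordinates, and a random quaternion `r` that describes the virtual sensor frame relative to the real one; the function names and conventions are illustrative assumptions.

```python
import math
import random

def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )

def quat_conj(q):
    return (q[0], -q[1], -q[2], -q[3])

def rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q."""
    return quat_mul(quat_mul(q, (0.0, *v)), quat_conj(q))[1:]

def random_unit_quaternion(rng):
    """Normalized 4D Gaussian sample, uniformly distributed over rotations."""
    while True:
        q = [rng.gauss(0.0, 1.0) for _ in range(4)]
        norm = math.sqrt(sum(c * c for c in q))
        if norm > 1e-9:
            return tuple(c / norm for c in q)

def augment(acc, gyr, q_ref, r):
    """Express all signals in a virtual sensor frame rotated by r."""
    r_inv = quat_conj(r)
    acc_aug = [rotate(r_inv, a) for a in acc]
    gyr_aug = [rotate(r_inv, g) for g in gyr]
    q_aug = [quat_mul(q, r) for q in q_ref]
    return acc_aug, gyr_aug, q_aug

rng = random.Random(0)
r, q = random_unit_quaternion(rng), random_unit_quaternion(rng)
acc_aug, gyr_aug, q_aug = augment([(0.3, -9.6, 1.2)], [(0.1, 0.2, -0.3)], [q], r)
print(rotate(q, (0.3, -9.6, 1.2)), rotate(q_aug[0], acc_aug[0]))
```

The two printed vectors agree: the acceleration seen in the reference frame is unchanged, so the augmented sample is physically consistent with the original recording.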

IV-D Grouped Input Channels

Fig. 6: Grouped input version of the RNN model

The default way of processing a multivariate time series is to put all the input signals into the same layer. An alternative way is to create groups of signals that interact with each other and disconnect them from those they do not need to interact with. The idea is to alleviate the neural network’s effort in finding interactions between signals. This method has been applied previously to other tasks but without analysis of its impact on the performance [30], [10]. In the present application, the accelerometer and gyroscope are grouped separately, with the accelerometer providing attitude information at large time scales and the gyroscope providing accurate information on the change of orientation, as visualized in Figure 6.
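The effect of disconnecting signal groups can be illustrated with a toy first layer. This is only a sketch of the idea with plain linear maps and hypothetical weights; the model in Figure 6 may group other layer types and sizes.

```python
def linear(weights, x):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def grouped_features(sample, w_acc, w_gyr):
    """Process accelerometer and gyroscope channels in separate branches.

    sample holds six values: three accelerometer axes, then three gyroscope axes.
    """
    acc, gyr = sample[:3], sample[3:]
    return linear(w_acc, acc), linear(w_gyr, gyr)

w_acc = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical weights
w_gyr = [[1.0, 1.0, 1.0]]
print(grouped_features([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], w_acc, w_gyr))
print(grouped_features([1.0, 2.0, 3.0, 0.0, 0.0, 0.0], w_acc, w_gyr))
```

Changing the gyroscope values leaves the accelerometer branch untouched, which corresponds to a block-diagonal weight matrix in an ungrouped layer of the same size.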

V Experiments

The performance of the proposed neural network is compared to the performance of an established attitude estimation filter in experiments with a ground truth based on marker-based optical motion tracking. A MEMS-based IMU (aktos-t, Myon AG, Switzerland) is rigidly attached to a 3D-printed X-shaped structure with three reflective markers whose position is tracked at millimeter accuracy by a multi-camera system (OptiTrack, Natural Point Inc., USA). For each moment in time, the three-dimensional marker positions are used to determine a ground-truth sensor orientation with sub-degree accuracy.

To analyze the algorithm performance across different types of motions and different levels of static or dynamic activity, we consider a large number of data sets from different experiments with the following characteristics:

  • rotation: The IMU is rotating freely in three-dimensional space while remaining close to the same point in space.

  • translation: The IMU is translating freely in three-dimensional space while remaining in almost the same orientation.

  • arbitrary: The IMU is rotating and translating freely in three-dimensional space.

  • slow versus medium versus fast: The speed of the motion is varied between three different levels.

  • paused versus nonstop: The motion is paused every thirty seconds and continued after a ten-second break, or it is performed non-stop for the entire duration of the five-minute recordings.

Different combinations of these characteristics lead to a diverse data set of 15 recordings, each of which contains more than 50,000 samples of accelerometer and gyroscope readings and ground-truth orientation at a sampling rate of 286 Hz. Figure 5 shows the Euclidean norms of the three-axis acceleration (acc) and angular rate (gyr) signals over time for three experiments with different combinations of the described characteristics.

The experimental data is used to validate and compare the following two attitude estimation algorithms:

  • Baseline: a quaternion-based attitude estimation filter with accelerometer-based correction steps and automatic fusion weight adaptation [4]. The filter time constant and weight adaptation gain are numerically optimized to yield the best performance across all data sets.

  • Neural Network (NN): The proposed neural network is trained on a subset of the available (augmented) data sets and validated on the complementary set of data.

The characteristics of applying neural networks to the attitude estimation problem are analyzed in three experiments. The first one compares the performance of the optimised neural network with the filter. The second one is an ablation study that quantifies the effect of every optimization and compares the performance of the RNN and TCN model. The last experiment analyzes the effect of scaling the size of the neural network.

V-A Performance Analysis

In order to compare the performance of the proposed neural network model with the filter, the 15 recordings are used for a leave-one-out cross-validation. That means that the model is trained with 14 recordings and validated on the one that was left out. This leads to an increase in computation time because for every recording a new independent model has to be trained, but it provides a better view on the generalizability of the model architecture. The neural network used is the RNN with all the proposed optimizations applied.
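The splitting scheme can be sketched with a small generator. The recording names are placeholders; in each fold one recording is held out for validation and a fresh model would be trained on all others.

```python
def leave_one_out(recordings):
    """Yield (training recordings, held-out recording) for every recording."""
    for i, held_out in enumerate(recordings):
        yield recordings[:i] + recordings[i + 1:], held_out

recordings = [f"recording_{i:02d}" for i in range(15)]  # placeholder names
for train, held_out in leave_one_out(recordings):
    print(held_out, len(train))
```

With 15 recordings this produces 15 folds of 14 training recordings each, which is why 15 independent models have to be trained.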

Fig. 7: RMSE comparison between the best neural network and the baseline
Fig. 8: RMSE comparison for every recording between the best neural network and the baseline

The boxplot in Figure 7 compares the error distribution of the 15 recordings between the neural network and the baseline filter. It visualizes that (1) the neural network has a better average performance and (2) that it performs more consistently in difficult cases, exhibiting clearly smaller maximum errors. The performance comparison for each individual recording is visualized in Figure 8. It shows that, in the slow cases, both methods perform similarly, while the baseline filter sometimes diverges in the fast- and arbitrary-motion cases. The diverging behaviour may be observed in Figure 5 in the fast arbitrary-motion case. Between the movements, when the IMU is resting, the algorithms use the gravitational acceleration to quickly converge towards the true attitude. Overall, the neural network outperforms the baseline filter significantly, which is even more remarkable in light of the fact that the baseline filter has been optimized on the whole dataset, while the neural network has never seen any of the validation data.

V-B Ablation Study

In the ablation study, the effect of every domain-specific optimization on the performance of the neural network is analyzed. Furthermore, the performance of the RNN and TCN architectures on the attitude estimation problem is compared. In this study, the 15 recordings are split into 12 training recordings and 3 validation recordings. In order to be representative, the validation recordings are the ones that yielded the maximum, minimum and median error in the performance analysis. Starting from the RNN and the TCN architecture with current best practices for time series as base models, the three domain-specific optimizations are added iteratively. First, the elementwise mean-squared-error loss is replaced by the optimized attitude error with smooth-l1-loss. In the second step, the data augmentation, which simulates a rotated IMU, is added. In the last step, the input layers are grouped into acceleration and gyroscope signals.

Fig. 9: Ablation Study with BM: Basemodel, LO: Loss Optimization, DA: Data Augmentation and GI: Grouped Input

The results of the study are visualized in Figure 9. Without the optimizations, the RNN and TCN model perform at a similar level. However, after adding the optimizations, the RNN has a much smaller error. This is plausible with the TCN being limited to its receptive field, while the RNN can track the IMU movement for an indefinite time with its hidden states. Even extending the TCN’s receptive field to samples, which is a time window of more than seconds, the results stay the same. When the IMU moves for a longer duration than the time window, the estimation diverges. For this application, especially with real time applications in mind, the RNN is the better approach.

The second result is that all the optimizations improve both the RNN and the TCN. Grouping the input consistently leads to minor improvements, while the loss optimization and data augmentation have a significant impact on the performance. When the data augmentation is added to the model, the other general regularization methods need to be reduced or deactivated in order to avoid over-regularization. With data augmentation, training and validation loss drop at the same pace, which shows that it is very effective at regularizing the model. The same effect could probably be achieved by increasing the size of the dataset by several orders of magnitude, which would require more costly recordings.

The final result is that both the loss optimization and the data augmentation are necessary to outperform the baseline filter. Without these domain-specific optimizations, even the highly optimized general purpose neural networks do not generalize well enough. If all aforementioned optimizations are applied, the neural network performs significantly better than the baseline filter.

V-C Model Size Analysis

In order to analyze the effect of the model size on the attitude error, the RNN model of the first experiment is applied to the 12 training and 3 validation recordings of the second experiment. The number of neurons of each layer of the RNN is scaled from 10 to 200, and the attitude error is compared.

Fig. 10: Model Size Analysis

The results of the study are visualized in Figure 10. As expected, the error decreases with increasing hidden size, with diminishing returns at larger neuron counts. In this example, 20 neurons per layer are already enough to achieve the same mean attitude error as the baseline filter. Decreasing the hidden size of RNNs helps to reduce the memory footprint and overall computation time, which is important for embedded systems. However, because of the sequential nature of RNNs, it only marginally reduces the training and prediction time on hardware accelerators with high parallelization capabilities.
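The impact of the hidden size on the memory footprint can be estimated by counting parameters. The sketch assumes six input channels, two LSTM layers, a linear output layer with four outputs, and the parameter layout used by PyTorch's `nn.LSTM` (four gates, each with input weights, recurrent weights and two bias vectors). Batchnorm and any grouped input layers are ignored.

```python
def lstm_params(input_size, hidden_size):
    """Parameter count of one LSTM layer with two bias vectors per gate."""
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size)

def model_params(hidden_size, input_size=6, num_layers=2, output_size=4):
    """Stacked LSTM layers followed by a linear layer with bias."""
    total, size_in = 0, input_size
    for _ in range(num_layers):
        total += lstm_params(size_in, hidden_size)
        size_in = hidden_size
    return total + hidden_size * output_size + output_size

for hidden in (10, 20, 200):
    print(hidden, model_params(hidden))
```

Under these assumptions the model with 20 neurons per layer has 5,684 parameters compared with 488,804 for 200 neurons per layer, a reduction by a factor of roughly 86.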

VI Conclusion

This work has shown that neural networks are a potent tool for IMU-based real-time attitude estimation. If domain-specific optimizations are in place, then large recurrent neural networks can outperform state-of-the-art nonlinear attitude estimation filters. These optimizations require knowledge about the process that the neural network identifies. However, they do not require the specific knowledge (equations, signal characteristics, parameters) that is needed for implementing a well-performing filter. Another requirement for the neural-network-based solution is a sufficiently rich set of data with ground truth attitude. However, data augmentation was shown to reduce this demand significantly.

Leave-one-out cross-validation was used to show that the trained network performs well on new data from motions that were used for training. Future research will focus on applying the trained network to data from different IMUs with different sampling rates and different error characteristics. This will answer the question of whether a sufficiently trained neural network can be used as a competitive solution in new sensor and environment settings without the need for collecting and using new training data.


This work was partly funded by the German Federal Ministry of Education and Research (BMBF, Funding number: 16EMO0262).