Inertial measurement units (IMUs) are now ubiquitous in wearable and mobile devices. An important category of IMU-enabled applications is the monitoring and assessment of human mobility, which aims to continuously track people’s daily activities, analyze motion patterns, and extract digital mobility biomarkers such as gait parameters in the wild. Increasingly, data-driven deep learning models have been developed for human activity recognition (HAR) [25, 3]. Despite their impressive performance, these models generally require a large amount of sensor data for training. Unfortunately, it is challenging to collect high-quality IMU data in the wild, while data collected in controlled settings, where subjects are asked to perform certain activities, often have very different characteristics from those of freestyle motions [24].
The scarcity of IMU data for HAR is evident when compared with the richness of other data sources. PAMAP2 [16], a popular dataset for HAR, includes 8 subjects with 59.67 minutes of samples per person. In contrast, AMASS [12], a motion capture (MoCap) dataset, includes 2420.86 minutes of data and is growing, not to mention YouTube videos, which offer a practically unlimited amount of action data. Therefore, to mitigate the “small data” problem, one possible solution is to convert data from other modalities to IMU data, a process called cross-modality simulation.
Though several previous works explored the feasibility of simulating IMU sensor data from other data modalities (see Section II), a number of challenges remain. First, sensors are attached to human skin rather than directly to bone joints during data collection, so skeleton models are inadequate for representing human poses and shapes. Second, even with state-of-the-art (SOTA) solutions in computer vision, the 3D human motion trajectories extracted from monocular video clips are far from perfect. Analytically computing IMU readings from such imperfect input sequences results in large errors. However, if a deep learning model is adopted to learn the mapping between noisy motion trajectories and measured sensor readings, it is unclear how well such a model generalizes to arbitrary unseen on-body positions.
To tackle the above-mentioned challenges, we design and implement CROMOSim, a cross-modality IMU sensor simulator that simulates high-fidelity virtual IMU sensor data from motion capture systems and monocular RGB cameras. It differs from existing work in several important aspects. First, it is based on 3D skinned multi-person linear (SMPL) models [10]. SMPL is capable of modeling muscle and soft tissue artifacts, while the 2D or 3D skeleton representations adopted by other works are segment models without volumetric information. Second, we empirically demonstrate (in Section IV) that the direct computation of IMU readings from motion trajectories extracted from videos is unreliable, even with filtering and interpolation techniques as in the case of IMUSim [27]. We instead design and train a neural network to learn the relationship between measured IMU readings and the noisy motion trajectories. Special care must be taken to ensure that the trajectories are represented in a global coordinate frame even if the videos are captured by moving cameras. Experiments show that CROMOSim achieves higher fidelity than existing IMU simulators and superior performance in HAR tasks.
II Related Work
IMUSim [27] is among the first open-source tools to simulate IMU data from either MoCap data in the Biovision Hierarchy (BVH) format or a user-provided 3D position and orientation sequence. Given motion trajectories in a global frame, acceleration can be calculated by taking the second derivatives of positions over time. The resulting data has been used in existing work to pre-train human pose estimation (HPE) and HAR models [26, 22]. One drawback of this type of method is that none of these studies aims to simulate realistic IMU sensor readings, and gyroscope data is omitted.
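As a minimal sketch of the second-derivative computation mentioned above, the snippet below applies a central finite difference to uniformly sampled 3D positions. The function name `central_second_difference`, the 100 Hz sampling rate, and the test trajectory are illustrative assumptions and are not taken from IMUSim or any other tool.

```python
def central_second_difference(positions, dt):
    """Approximate acceleration from uniformly sampled 3D positions.

    positions: list of (x, y, z) tuples in a global frame, one every dt seconds.
    Returns one acceleration tuple per interior sample (len(positions) - 2 values).
    """
    accelerations = []
    for i in range(1, len(positions) - 1):
        prev, cur, nxt = positions[i - 1], positions[i], positions[i + 1]
        accelerations.append(tuple(
            (n - 2.0 * c + p) / (dt * dt) for p, c, n in zip(prev, cur, nxt)
        ))
    return accelerations


# Hypothetical example: constant acceleration of 2 m/s^2 along x, sampled at 100 Hz.
dt = 0.01
track = [(0.5 * 2.0 * (k * dt) ** 2, 0.0, 0.0) for k in range(5)]
acc = central_second_difference(track, dt)
```

On clean trajectories this recovers the acceleration well; on noisy trajectories the division by `dt * dt` amplifies position errors, which is why smoothing or a learned mapping is usually needed.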
More recently, simulating IMU readings from monocular RGB videos for data augmentation has attracted some attention. ZeroNet [9] extracted finger motion data from videos and transformed them into the acceleration and orientation information measured by IMU sensors. The authors of [18] and its follow-up work [17] simulated acceleration norms and/or angular velocity norms from human 2D poses for HAR. These approaches differ from CROMOSim in that both ZeroNet and Rey’s sensor simulators avoid the video-based global motion tracking problem by limiting the human subjects’ movement to a fixed camera scene (in-place motion), whereas the CROMOSim pipeline addresses this problem. Closest to our work are IMUTube [7] and its extension in [8], which aim at simulating full-body IMU data from moving-camera videos captured in the wild. However, confined by the skeleton body representation adopted, neither work can simulate realistic sensor readings from arbitrary on-body locations. Moreover, IMUTube estimates view depth and camera ego-motion in two independent steps, although the two are intrinsically coupled [6, 13]. A wrongly predicted camera pose can lead to inaccurate view depth estimation and vice versa [11]. In addition, the module that lifts 2D postures to 3D poses in the IMUTube pipeline is more compute-intensive and error-prone, as it is a simple combination of existing technologies.
III System Design
CROMOSim is designed with several requirements in mind: i) allowing arbitrary user-specified placement and orientation of target sensors, ii) extensibility to different input data modalities and configurations, iii) flexibility to incorporate SOTA models to extract motion trajectories, and iv) high fidelity. To meet these requirements, the CROMOSim pipeline contains three function modules, as shown in Fig. 1: an input data processing module that extracts global human motion sequences from source data, a human body model that can fully represent the extracted sequences and can be sampled from any on-body location, and a simulator module that transforms noisy motion sequences to high-fidelity 3-axis accelerometer and gyroscope readings. Though the pipeline is extensible to other possible input data modalities such as millimeter wave radar and depth camera, we will focus on MoCap and monocular camera video here. A detailed illustration of each component will be provided in the following sections.
III-A SMPL Model
An SMPL model represents 3D human body poses and shapes with a fine-grained full-body tri-mesh. Unlike skeleton or cylinder models that only capture joint poses, this parametric 3D representation provides a widely applicable and differentiable way to visualize a realistic 3D human body. There are three reasons to choose SMPL over other body models in CROMOSim. First, instead of measuring the movements of bones, IMU readings reflect the soft tissue dynamics at the location that a sensor is attached to. Second, SMPL provides a pose and shape-dependent full-body tri-mesh that can be sampled at any on-body location. Third, since it is widely used in HPE research, many off-the-shelf models are available to extract SMPL representations from different data sources.
To see the difference between movements of joints in a skeleton model and those of the SMPL skin mesh, we compare accelerations computed by taking second-order derivatives of the corresponding motion trajectories with ground-truth accelerometer readings over time. In Fig. 2, red curves denote the calculated 3-axis accelerations while the black ones are the accelerometer ground truth. Figures in the left column compare the accelerations at a pelvis joint in a skeleton model, while figures in the right column compare those at an SMPL lower back skin mesh vertex. Clearly, the use of the SMPL skin mesh provides better agreement with the ground truth (e.g., in the interval [100,300]). Simulated data from the pelvic joint, on the other hand, fails to capture high-frequency acceleration components, which are most likely due to muscle and soft tissue movements. This observation indicates that SMPL is a good candidate for an intermediate data representation in the CROMOSim pipeline.
III-B Input Data Processing
III-B1 From MoCap Data to SMPL Models
MoCap data consists of raw marker sequences collected by an optical motion capture system of high precision (usually with a position error mm). MoSH++ [12] allows the fit of an SMPL model to MoCap data from a set of sparse markers. Prior to motion capture, a global coordinate system needs to be established during the calibration phase. As a result, the collected motion trajectories are expressed in the global frame. Under the assumption that the global frame is aligned with the inertial frame, the SMPL mesh model can be used directly in subsequent processing. (Such an assumption is not restrictive, as random rotation can be applied in further data augmentation to obtain data if the global and inertial frames differ.)
III-B2 From Video Clips to SMPL Models
Extracting 3D human poses and shapes from monocular RGB videos is not trivial, especially when they are captured from moving cameras with unknown parameters, which is common in a locomotion-related video recorded in the wild. We propose to decompose such a problem into two sub-problems: a reconstruction of human global displacement and rotation; and an estimation of 3D in-place human motion and body shape.
Estimating root joint global trajectory
A precise calculation of the global displacement of the human subject is essential for a high-fidelity simulation of IMU data from RGB videos. This requirement can be met by reconstructing the 3D motion trajectory of a fixed body position (a.k.a. the root joint), which can be inferred from the depth map of the human subject and the camera parameters [15].
In CROMOSim, we adopt the robust consistent video depth estimation (Robust CVD) method [6], a SOTA model that estimates consistent dense depth maps and camera poses from a monocular video. Robust CVD jointly estimates both outputs by solving an optimization problem over the entire video sequence. This is advantageous because the two outputs are intrinsically coupled, and joint estimation thus leads to higher accuracy (compared to the pipeline adopted by IMUTube). In the implementation, we locate the 2D torso joint positions in video frames using OpenPose [1], and designate the pelvis as our root joint. The depth reconstructed by Robust CVD is reasonably accurate up to scale. To resolve the scale ambiguity, an object of known size in the scene is needed. Prior knowledge regarding the heights of subjects in the video or the dimensions of fixtures (e.g., street lamps, road lanes) can be utilized. Subsequently, the predicted depth of the pelvis joint is re-scaled by the estimated scale factor. Since in some frames the root joint is not visible or cannot be located well due to occlusion or poor lighting, we only extract root joint coordinates from the frames with high confidence scores from OpenPose. Root joint coordinates in the remaining frames are then interpolated from the estimated ones, and a Kalman filter is applied to further smooth the resulting trajectory.
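The gap-filling and re-scaling steps described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `fill_low_confidence` and `rescale`, the confidence threshold of 0.5, and the use of plain linear interpolation are hypothetical choices, and the Kalman smoothing step is omitted.

```python
def fill_low_confidence(coords, scores, threshold=0.5):
    """Linearly interpolate 3D root-joint coordinates over unreliable frames.

    coords: list of (x, y, z) tuples, one per frame.
    scores: per-frame detection confidence (assumed to lie in [0, 1]).
    Frames before the first or after the last reliable frame copy the nearest
    reliable value.
    """
    good = [i for i, s in enumerate(scores) if s >= threshold]
    if not good:
        raise ValueError("no frame passes the confidence threshold")
    filled = []
    for i in range(len(coords)):
        before = [g for g in good if g <= i]
        after = [g for g in good if g >= i]
        if not before:
            filled.append(coords[after[0]])
        elif not after:
            filled.append(coords[before[-1]])
        elif before[-1] == after[0]:
            filled.append(coords[i])  # frame i is itself reliable
        else:
            lo, hi = before[-1], after[0]
            w = (i - lo) / (hi - lo)
            filled.append(tuple((1.0 - w) * a + w * b
                                for a, b in zip(coords[lo], coords[hi])))
    return filled


def rescale(coords, known_size_m, estimated_size):
    """Resolve scale ambiguity using an object of known metric size."""
    scale = known_size_m / estimated_size
    return [tuple(scale * c for c in p) for p in coords]


# Hypothetical example: the middle frame is unreliable and is interpolated.
raw = [(0.0, 0.0, 1.0), (9.0, 9.0, 9.0), (2.0, 0.0, 3.0)]
conf = [0.9, 0.1, 0.8]
filled = fill_low_confidence(raw, conf)
# A subject known to be 1.8 m tall measures 0.9 units in the reconstruction.
metric = rescale(filled, known_size_m=1.8, estimated_size=0.9)
```

Keeping the two steps as separate functions makes it easy to swap in a different interpolation scheme or to add a smoother afterwards.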
Body pose and shape estimation in camera frames
We adopt VIBE [5], a SOTA method that directly estimates realistic 3D human poses and shapes from monocular videos. In the implementation, we make two extensions to VIBE. First, VIBE assumes a fixed camera configuration and in-place human motion only, losing track of the human subject’s global motion trajectory. As elaborated in the previous paragraph, Robust CVD is adopted to recover the missing information. Second, VIBE estimates body shapes for every video frame. This is unnecessary since people’s body shapes are unlikely to change in a short period. Instead, we take the average body shape of the same subject over a video sequence.
Finally, by combining the aforementioned steps, we can extract 3D body poses in a global frame and shape parameters from monocular RGB video, which can serve as input to generate SMPL body meshes.
III-C From SMPL Models to IMU Data
Given the 3D human pose and shape represented by an SMPL tri-mesh over time, accelerations and angular velocities in a global frame can be computed analytically. In particular, accelerations can be calculated by taking second derivatives of positions over time; angular velocities can be determined from the changes in the normal vector of a plane associated with three non-collinear mesh points (e.g., the vertices of a mesh triangle). However, SMPL tri-meshes generated by the models in Section III-B2 tend to be noisy, erroneous and incomplete. Furthermore, accelerations and angular velocities measured by IMUs are subject to hardware imperfections such as noise, biases, and non-orthogonal axes, which are not easily replicated by analytical calculation.
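To illustrate the analytic route for angular velocity, the sketch below tracks the unit normal n of one mesh triangle across two frames and uses the rigid-body relation dn/dt = ω × n, which gives the component of ω perpendicular to n as n × dn/dt. Rotation about the normal itself cannot be observed from the normal alone. All function names and the test triangle are illustrative assumptions, not the implementation used in this work.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit_normal(p0, p1, p2):
    """Unit normal of the triangle (p0, p1, p2), right-hand rule."""
    n = cross(sub(p1, p0), sub(p2, p0))
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)


def angular_velocity(tri_prev, tri_next, dt):
    """Component of angular velocity perpendicular to the triangle normal."""
    n0 = unit_normal(*tri_prev)
    n1 = unit_normal(*tri_next)
    dn = tuple((b - a) / dt for a, b in zip(n0, n1))  # forward difference
    return cross(n0, dn)


def rotate_z(p, theta):
    """Rotate point p about the global z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])


# Hypothetical example: a triangle whose normal points along +x spins about z
# at 1 rad/s; two frames 10 ms apart are compared.
dt = 0.01
triangle = ((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
moved = tuple(rotate_z(p, 1.0 * dt) for p in triangle)
omega = angular_velocity(triangle, moved, dt)
```

In this example the recovered angular velocity is close to 1 rad/s about the z axis, with a small bias from the forward difference.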
To address the aforementioned issues, we design two neural network models, an accelerometer network and a gyroscope network, to learn the mapping between motion trajectories of SMPL tri-mesh points and the actual acceleration or angular velocity measured by IMUs in a global frame, respectively. By training with real data from selected on-body positions with various motion ranges (such as the head, the chest, and the wrist and ankle on one side), the neural networks are capable of generating data for arbitrary unseen regions of the human body. Both models share the same design, with three convolutional layers and two bidirectional long short-term memory (LSTM) layers as the feature extractor, followed by a linear layer as the regression output. The model is fed a user-specified skin area, with three mesh triangles chosen near the area’s center as input. In each triangle, the vertices are traversed counter-clockwise to ensure that the normal direction always points outside of the human body.
The collected IMU data are usually in the local sensor frame while the predictions of CROMOSim are in the global frame. Therefore, a coordinate transformation step is required. A user needs to select the skin region to which a virtual sensor is affixed and define its alignment, represented as a rotation matrix ($R_{BS}$). With the rotation matrix from the bone frame to the sensor frame, $R_{BS}$, we can calculate IMU data in the sensor frame from the accelerations $a_G$ and angular velocities $\omega_G$ in the global frame as follows:
$$a_S = R_{BS} R_{GB}\, a_G, \qquad \omega_S = R_{BS} R_{GB}\, \omega_G,$$
where $R_{GB}$, the rotation from the global frame to the bone frame, is obtained from the SMPL model for the corresponding skin region.
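The change of frame described above amounts to two successive rotations: from the global frame to the bone frame, then from the bone frame to the sensor frame. The snippet below is a minimal sketch of that composition; the names `r_global_to_bone` and `r_bone_to_sensor` and the example matrices are hypothetical and chosen only for illustration.

```python
def matvec(m, v):
    """Multiply a 3x3 matrix (nested tuples, row-major) by a 3-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


def to_sensor_frame(vec_global, r_global_to_bone, r_bone_to_sensor):
    """Express a global-frame vector (acceleration or angular velocity)
    in the sensor frame by applying the two rotations in sequence."""
    return matvec(r_bone_to_sensor, matvec(r_global_to_bone, vec_global))


# Hypothetical example: the bone frame is rotated 90 degrees about z relative
# to the global frame, and the sensor is aligned with the bone.
r_global_to_bone = ((0.0, 1.0, 0.0),
                    (-1.0, 0.0, 0.0),
                    (0.0, 0.0, 1.0))
r_bone_to_sensor = ((1.0, 0.0, 0.0),
                    (0.0, 1.0, 0.0),
                    (0.0, 0.0, 1.0))
a_global = (1.0, 0.0, 9.81)
a_sensor = to_sensor_frame(a_global, r_global_to_bone, r_bone_to_sensor)
```

The same function applies to both accelerations and angular velocities, since both are rotated as ordinary vectors here.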
Due to noisy data sources and modeling errors, domain gaps exist between simulated and real data. Such gaps are more pronounced in data simulated from videos. To mitigate these gaps, we adopt the same distribution mapping technique [2] as IMUTube.
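One simple reading of rank-based distribution mapping is sketched below: each simulated sample is replaced by the real-data value at the same empirical quantile. This is an assumption made for illustration; the function `rank_map` is hypothetical and is not claimed to match the exact procedure used by IMUTube or in this work.

```python
def rank_map(simulated, real):
    """Map simulated samples onto the empirical distribution of real samples.

    Each simulated value keeps its rank, but takes the value found at the
    matching quantile of the sorted real data (linear interpolation between
    neighbouring real samples).
    """
    real_sorted = sorted(real)
    order = sorted(range(len(simulated)), key=lambda i: simulated[i])
    n, m = len(simulated), len(real_sorted)
    mapped = [0.0] * n
    for rank, idx in enumerate(order):
        q = rank / (n - 1) if n > 1 else 0.0  # quantile in [0, 1]
        pos = q * (m - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        mapped[idx] = (1.0 - frac) * real_sorted[lo] + frac * real_sorted[hi]
    return mapped


# Hypothetical example: the ordering of the simulated values is preserved
# while their range is moved onto that of the real values.
simulated = [0.2, 0.0, 0.1]
real = [10.0, 30.0, 20.0]
mapped = rank_map(simulated, real)
```

Because only ranks are used, the mapping is insensitive to the scale and offset of the simulated signal.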
In this section, we will evaluate CROMOSim in two sets of experiments. Firstly, we evaluate the fidelity of simulated sensor data both qualitatively and quantitatively. Then, we evaluate the utility of CROMOSim in data augmentation for downstream HAR tasks.
IV-A Experimental Setup
To train the simulator network and evaluate the fidelity of simulated data, we use the TotalCapture dataset [23], which has all three data modalities (MoCap, IMU and video). For HAR evaluation, the RealWorld [21] and the Physical Activity Monitoring version 2 (PAMAP2) [16] datasets are used in task model training and testing.
In the fidelity evaluation, we divide data from all modalities into 2-second sliding windows with 80% overlap for model training and prediction. For HAR, to make the results directly comparable to baseline approaches, we follow the same procedure described in IMUTube, where simulated and real IMU data are low-pass filtered, normalized and divided into sliding windows of 1-second length with 50% overlap.
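The two windowing configurations above can be sketched with a single helper. The function name `sliding_windows` and the 50 Hz sampling rate in the example are assumptions made for illustration only.

```python
def sliding_windows(samples, rate_hz, window_s, overlap):
    """Split a sample sequence into fixed-length, overlapping windows.

    overlap is the fraction of a window shared with its successor
    (0.8 for 80% overlap). Trailing samples that do not fill a window
    are dropped.
    """
    size = int(round(window_s * rate_hz))
    step = max(1, int(round(size * (1.0 - overlap))))
    return [samples[s:s + size]
            for s in range(0, len(samples) - size + 1, step)]


# Hypothetical example: 10 seconds of data at an assumed 50 Hz.
signal = list(range(500))
fidelity_windows = sliding_windows(signal, 50, window_s=2.0, overlap=0.8)
har_windows = sliding_windows(signal, 50, window_s=1.0, overlap=0.5)
```

With these settings, consecutive fidelity windows start 20 samples apart and consecutive HAR windows start 25 samples apart.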
To evaluate the fidelity of CROMOSim, we compute the root mean square error (RMSE) between simulated IMU data and the ground truth. In HAR tasks, as the classes in the datasets are imbalanced, we adopt the F1 score to evaluate random single-subject-out experiments.
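For reference, the two metrics can be computed as in the sketch below. The text does not state which F1 averaging is used; macro averaging over classes is assumed here because it is a common choice for imbalanced classes. The function names and the example labels are hypothetical.

```python
import math


def rmse(predicted, truth):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth)
    )


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical examples.
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
score = macro_f1(["walk", "walk", "walk", "run"],
                 ["walk", "walk", "run", "run"])
```

Macro averaging gives the rare class the same weight as the frequent one, so a model that ignores minority activities is penalized.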
We consider IMUSim and an analytic method as baselines for comparing the fidelity of simulated data, because IMUTube also utilizes IMUSim to generate IMU data from 3D global motion trajectories. The analytic method we adopt to compute linear acceleration is Richardson’s extrapolation [20, 19]. Compared to directly taking second-order finite differences, it gives a more accurate estimate with a 4th-order error term (as opposed to 2nd-order). The angular velocity of a selected skin region on an SMPL body mesh is calculated by tracking the rotation of its normal vector. For HAR tasks, we take IMUTube as the baseline, but due to the lack of an open-source implementation, we include the performance on the PAMAP2 dataset reported in [7].
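The gain from Richardson extrapolation can be seen in a small sketch. The central difference D(h) = (x(t+h) − 2x(t) + x(t−h)) / h² has an O(h²) error, and the combination (4·D(h) − D(2h)) / 3 cancels that leading term, leaving an O(h⁴) error. The function names and the sinusoidal test signal below are illustrative assumptions.

```python
import math


def second_difference(x, i, dt, stride=1):
    """Central second difference at sample i with spacing stride * dt."""
    h = stride * dt
    return (x[i + stride] - 2.0 * x[i] + x[i - stride]) / (h * h)


def richardson_acceleration(x, i, dt):
    """Richardson-extrapolated second derivative at sample i.

    Combines estimates at step sizes dt and 2*dt so that the O(dt^2)
    error terms cancel. Requires two samples on each side of i.
    """
    return (4.0 * second_difference(x, i, dt, 1)
            - second_difference(x, i, dt, 2)) / 3.0


# Hypothetical example: x(t) = sin(t) sampled at 10 Hz; the exact second
# derivative at t = 1 s is -sin(1).
dt = 0.1
samples = [math.sin(k * dt) for k in range(21)]
exact = -math.sin(1.0)
plain = second_difference(samples, 10, dt)
refined = richardson_acceleration(samples, 10, dt)
```

On this smooth signal the extrapolated estimate is roughly two to three orders of magnitude closer to the exact value than the plain difference. Like any finite-difference scheme, it still amplifies noise in the input positions.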
IV-B Fidelity of CROMOSim
In this section, we provide qualitative and quantitative comparisons between CROMOSim and two baseline methods, namely, the analytical method (IMUCal) and IMUSim in terms of fidelity. We use TotalCapture in this experiment since it contains data from all three required modalities. Two CROMOSim models are trained using MoCap and video data from Subjects 1 – 3 with sensor positions at their right wrist, right foot and pelvis. The models are used to predict accelerometer and gyroscope data on both left and right wrists of Subject 5 from the respective data sources.
Figures 3 and 4 show the simulated IMU readings from different methods with MoCap and RGB video data, respectively. From the figures, we observe that the fidelity of IMUSim is low across the board. This is because the default setting of IMUSim filters out too many high-frequency components. IMUCal works well for simulating accelerometer and gyroscope data with MoCap inputs. However, its performance degrades significantly when monocular RGB videos are taken as the source modality. This can be attributed to the large noise and relatively low accuracy of the extracted SMPL body tri-mesh. In contrast, CROMOSim consistently outperforms the baseline methods for both data modalities.
[Table I: columns Acceleration and Angular velocity]
Table I reports the case where both the subject and the sensor position are unseen by the simulator networks. The quantitative results are consistent with the qualitative ones. With MoCap data, the accuracy of CROMOSim is 187.5% and 11% higher than that of IMUSim and IMUCal for accelerations, respectively, and 87% and 58% higher for gyroscope data. The advantage of CROMOSim is more pronounced with monocular RGB videos, where it outperforms the next best method (IMUSim) by 84% and 67% for accelerometer and gyroscope data, respectively.
IV-C Applications of CROMOSim in HAR Tasks
In this section, we evaluate the utility of CROMOSim in data augmentation for training HAR models. Here we consider three settings: i) R2R, where models are both trained and tested with real IMU data; ii) V2R, where models are trained with simulated data but tested with real data; iii) Mix2R, where models are trained using a mixture of real and simulated data while tested with real data.
We adopt the DeepConvLSTM network proposed in [14] as the task model, while the same simulator neural network trained on the TotalCapture dataset is used here to simulate sensor readings from videos. Evaluations are made on the RealWorld and PAMAP2 datasets, respectively, with data simulated from the same video source (the RealWorld dataset). An ablation study was conducted by removing Robust CVD from the proposed pipeline; the resulting approach is called CROMOSim Lite. To make the results directly comparable, we followed the experiment protocol in IMUTube [7].
| CROMOSim Lite | 0.729 ± 0.007 | 0.580 ± 0.047 | 0.802 ± 0.013 |
Table II reports the average F1 scores of five single-subject-hold-out experiments on the RealWorld dataset. Since the authors of IMUTube provide their simulated data on this dataset, we directly replicated their experiments, and the results are in the second row. For comparison, we also include the scores reported in [7] as the first row. It can be seen that the two are quite similar to one another. Even CROMOSim Lite outperforms IMUTube in the V2R and Mix2R experiments, while CROMOSim works the best. Moreover, Mix2R achieves much higher F1 scores than R2R and V2R, demonstrating the utility of data augmentation with simulated data.
Table III summarizes the results from CROMOSim and those reported in [7]. Due to the different sensor placements in the PAMAP2 dataset, the simulated data provided by the authors of IMUTube cannot be used. As on the RealWorld dataset, CROMOSim outperforms IMUTube on the PAMAP2 dataset, but by a larger margin; the HAR model trained in the Mix2R setting is still superior to those trained in the R2R and V2R settings.
V Conclusion and Future Work
In this paper, we implemented CROMOSim, a pipeline that simulates accelerometer and gyroscope readings at arbitrary user-designated on-body positions from MoCap data and monocular RGB camera videos. A DNN model is trained to learn the functional mapping from imperfect trajectory estimations in a 3D body tri-mesh to IMU data. Experiments showed that CROMOSim can generate higher-fidelity data than baseline methods and is useful for downstream HAR tasks. As future work, we are implementing a graphical user interface and packaging CROMOSim as an easy-to-use tool, which we hope to open-source by this summer.
- [1] (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (1), pp. 172–186.
- [2] (1981) Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician 35 (3), pp. 124–129.
- [3] (2021) A survey on deep learning for human activity recognition. ACM Computing Surveys (CSUR) 54 (8), pp. 1–34.
- [4] (2018) Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–15.
- [5] Vibe: video inference for human body pose and shape estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5253–5263.
- [6] (2021) Robust consistent video depth estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [7] (2020) IMUTube: automatic extraction of virtual on-body accelerometry from video for human activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 4 (3), pp. 1–29.
- [8] (2021) Approaching the real-world: supporting activity recognition training with virtual imu data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 5 (3), pp. 1–32.
- [9] (2021) When video meets inertial sensors: zero-shot domain adaptation for finger motion analytics with inertial sensors. In Proceedings of the International Conference on Internet-of-Things Design and Implementation, pp. 182–194.
- [10] (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34 (6), pp. 1–16.
- [11] (2020) Consistent video depth estimation. ACM Transactions on Graphics (TOG) 39 (4), pp. 71–1.
- [12] (2019) AMASS: archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5442–5451.
- [13] (2019) Unsupervised learning for depth, ego-motion, and optical flow estimation using coupled consistency conditions. Sensors 19 (11), pp. 2459.
- [14] Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 16 (1), pp. 115.
- [15] (2012) Computer vision: models, learning, and inference. Cambridge University Press.
- [16] (2012) Introducing a new benchmarked dataset for activity monitoring. In 2012 16th International Symposium on Wearable Computers, pp. 108–109.
- [17] (2020) Yet it moves: learning from generic motions to generate imu data from youtube videos. arXiv preprint arXiv:2011.11600.
- [18] (2019) Let there be imu data: generating training data for wearable, motion sensor based activity recognition from monocular rgb videos. In Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, pp. 699–708.
- [19] (1927) VIII. the deferred approach to the limit. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 226 (636-646), pp. 299–361.
- [20] (1911) The approximate arithmetical solution by finite differences with an application to stresses in masonry dams. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 210, pp. 307–357.
- [21] (2016) On-body localization of wearable devices: an investigation of position-aware activity recognition. In 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom), pp. 1–9.
- [22] (2018) A multi-sensor setting activity recognition simulation tool. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 1444–1448.
- [23] (2017) Total capture: 3d human pose estimation fusing video and inertial sensors. In BMVC, Vol. 2, pp. 1–13.
- [24] (2017) Recognizing detailed human context in the wild from smartphones and smartwatches. IEEE Pervasive Computing 16 (4), pp. 62–74.
- [25] (2019) Deep learning for sensor-based activity recognition: a survey. Pattern Recognition Letters 119, pp. 3–11.
- [26] (2020) A deep learning method for complex human activity recognition using virtual wearable sensors. In International Conference on Spatial Data and Intelligence, pp. 261–270.
- [27] (2011) IMUSim: a simulation environment for inertial sensing algorithm design and evaluation. In Proceedings of the 10th ACM/IEEE International Conference on Information Processing in Sensor Networks, pp. 199–210.