Musculoskeletal Disorders (MSD) are recognised as a primary contributor to disease burden in developed countries. Finding technological solutions for the prevention or self-management of MSD is a research area that has emerged over the last few years. Maintaining a regular self-managed exercise routine while adhering to correct execution is an essential part of living with MSD. An effective digital intervention should therefore be capable of interception, recognition and quality assessment of exercises in real time.
Exercises can be viewed as a sub-category of human activities that comprise complex sequences of human movements. Capturing these movements with sensors requires much more detail than a single accelerometer on a smart watch can provide: multiple limb movements must be captured with multiple, strategically placed sensors. Recent literature has looked at sensor fusion for Human Activity Recognition (HAR) with multiple sensors of a single modality [6, 4, 3], but not in a qualitative manner. Evidently, challenges arise when reasoning with multi-modal sensors of heterogeneous data types. To name a few: sensor fusion that exploits multiple sensors for improved recognition performance, non-invasive and open-ended HAR for improved deployability, and qualitative HAR.
We identify the need for a heterogeneous sensory dataset of physiotherapy exercises in order to address these challenges. Importantly, we collaborated with physiotherapy researchers to identify exercises and sensors that are effective for MSD, and implemented a data collection task. The outcome of this task is MEx, the Multi-modal Exercises dataset, which is presented with this report. In summary, MEx consists of seven exercises recorded with four sensors: a pressure mat, a depth camera and two accelerometers. It is publicly available for download from the Mendeley Data repository (https://data.mendeley.com/datasets/p89fwbzmkd/2).
This paper is organised as follows. Section 2 presents the exercises, the sensors and the process followed in the data collection. Section 3 explores different data visualisation techniques with the pressure mat sensor data, followed by Section 4 and Section 5, where details of the experiments and results for the single-sensor recognition task can be found. Finally, Section 6 presents the conclusions and plans for future work.
2 Multi-modal Exercise Dataset (MEx)
In this section we present the list of exercises, sensor specifications and the data collection methodology.
Table 1 presents the exercises that were selected by a physiotherapist for this data collection. They are frequently used for prevention or management of musculoskeletal pain. Each exercise is annotated with a starting position and an action; the action comprises several guidelines and steps for performing the exercise accurately.
|Starting Position||Lying on back, knees together and bent, feet flat on floor|
|Action||Slowly roll knees to the right, back to the centre, then to the left, keeping upper trunk still|
|Starting Position||Lying on back with knees bent and slightly apart, feet flat on floor and arms by side|
|Action||Squeeze buttock muscles and lift hips off floor. Hold approximately 5-seconds and lower slowly.|
|Starting Position||Lying on back with knees bent and slightly apart, feet flat on floor and arms by side|
|Action||Tighten stomach muscles and press small of back against the floor letting your bottom rise. Hold approximately 5 seconds then relax.|
|Starting Position||Lying on right side with hips and shoulders in straight line. Bend knees so thighs are at 90 degrees angle. Rest head on top arm (stretched overhead or bent depending on comfort). Bend top arm and place hand on floor for stability. Stack hips directly on top of each other (same for shoulders)|
|Action||Keep big toes together and slowly rotate leg in hip socket so the top knee opens. Open knee as far as you can without disturbing alignment of hips. Slowly return to starting position|
|Repeated Extension in Lying|
|Starting Position||Lying face down, place palms on floor and elbows under shoulders|
|Action||Straighten elbows as far as you can and push top half of body up as far as you can. Pelvis, hips and legs must stay relaxed. Maintain position for approximately 2-seconds then slowly lower to starting position|
|Starting Position||On all 4’s with hands directly beneath shoulders, knees slightly apart and straight back.|
|Action||Punch right arm in front and lower to floor. Repeat with left arm. Keep trunk as still as possible|
|Starting Position||On all 4’s with hands directly beneath shoulders, knees slightly apart and straight back.|
|Action||Extend right arm straight in front of you and left leg straight behind you, keeping trunk as still as possible. Hold approximately 5-seconds then lower and repeat with other arm and leg.|
Exercises are repeated, controlled movements of multiple parts of the body. We explored state-of-the-art sensor technologies and selected three sensors to capture these movements: the Orbbec Astra Depth Camera (https://orbbec3d.com/product-astra-pro/), the Sensing Tex Pressure Mat (http://sensingtex.com/sensing-mats/pressure-mat/) and the Axivity AX3 3-Axis Logging Accelerometer (https://axivity.com/product/ax3).
The aim is to explore their capabilities to capture exercises both independently and as an ensemble, while considering non-invasive use in the real world. Accordingly, we selected the following sensor placements: two accelerometers are placed on the wrist and the thigh of the user; the pressure mat is used as an exercise mat on which the user lies to perform the exercises; and the depth camera is placed above the user, facing downwards and recording an aerial view. In addition, the top of the depth camera frame is aligned with the top of the pressure mat, and the user is asked to align their shoulders with it so that the face is not recorded in the depth camera or pressure mat data. The four sensors (accelerometer on the thigh, accelerometer on the wrist, pressure mat and depth camera) are referred to as ACT, ACW, PM and DC in the rest of this paper.
|Depth Camera Sensor|
|Product||Orbbec Astra Depth Camera|
|Frame rate||15 fps|
|Pressure Mat Sensor|
|Product||Sensing Tex Pressure Mat|
|Frame rate||75 fps|
|Accelerometer Sensor|
|Product||Axivity AX3 3-Axis Logging Accelerometer|
|Accelerometer Range||±8 g|
2.3 Data Collection Methodology
The data collection task was performed with 30 volunteers. Figure 1 shows the age and sex statistics of the group: 60% of the participants were female and 40% male; 47% were in the 18-24 age category and the rest were spread across ages from 24 to 54. The volunteers were recruited through internal advertising in the university, and the majority were students from the School of Computing or the School of Health Sciences. Eight of the 30 had a good understanding of the exercises as they were either physiotherapists or physiotherapy students. A Physical Activity Readiness Questionnaire (PAR-Q) was given to each user prior to participation to evaluate their physical fitness, and only those who passed the PAR-Q performed the exercises.
At the beginning of each exercise the user was given an instruction sheet describing the starting position and the action, and the researcher demonstrated the exercise. The user then performed the exercise for a maximum of 60 seconds while being recorded by all sensors. During the recording, the researcher neither gave advice nor kept count or time to enforce a rhythm. For exercises that involve holding a position for 2 or 5 seconds, the user was instructed at the beginning to keep count themselves so as to preserve their natural rhythm. Our goal was to capture the individual nuances of each user, replicating a scenario where a patient performs these exercises at home without the guidance of a physiotherapist.
3 Data visualisation
In this section we visualise the collected data: first the raw data from each sensor, then the PM data using two dimensionality reduction methods.
3.1 Raw data
Figure 2(a) visualises the thigh and wrist accelerometer data over 5 exercises. Noisy outliers are evidently more common in the wrist sensor than in the thigh sensor. It is also evident that some exercises require little or no limb movement, which makes recognising exercises from a single accelerometer challenging. Figure 2(b) shows five depth camera frames within 2 seconds of the Knee-rolling exercise, and Figure 2(c) shows the frames captured by the pressure mat for the same exercise at the same timestamps. The depth camera visibly captures a larger amount of data than the pressure mat for the Knee-rolling exercise; in contrast, an exercise such as Pelvic tilt appears to be better captured by the pressure mat.
3.2 PCA visualisation
3.3 t-SNE visualisation
Secondly, we use t-Distributed Stochastic Neighbour Embedding (t-SNE), a visualisation method for high-dimensional data. t-SNE builds a probability distribution over pairs of data points in which similar pairs receive high probability and dissimilar pairs receive low probability, using Euclidean distance as the similarity measure. Figure 4 illustrates the t-SNE reduction to two components, grouped by exercise class, for all users, for a physiotherapy user and for a regular user. Similar to PCA, and more markedly, we observe clear cluster boundaries in the physiotherapy user's data, whereas clear cluster boundaries are difficult to find in the regular user's data.
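For reference, the pairwise probabilities used by t-SNE can be written in the standard form of the published method. This is a restatement of the general algorithm, not of any setting specific to this study; here $x_i$ denotes a high-dimensional data point (for example a PM data instance), $y_i$ its two-dimensional embedding, $N$ the number of points, and $\sigma_i$ a per-point bandwidth derived from the perplexity parameter.

```latex
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The embedding is found by minimising $C$ with respect to the $y_i$, so that pairs with a small Euclidean distance in the original space stay close in the two-dimensional plot.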
The visualisation of PM data emphasises that automated feature learning is better suited to visual data than manual feature engineering or raw data. The visualisations also suggest that PM data can capture exercise performance quality. With this insight, PM data presents great potential for qualitative assessment of exercise performance, which is essential for determining how a patient deviates from the correct execution once they are away from the supervision of the physiotherapist.
4 Exercise Recognition with MEx
In this section we present experiments designed to obtain reference performances for each individual sensor with standard machine learning (ML) and deep learning (DL) algorithms. We describe the pre-processing steps, the algorithms and the experiment design.
First, a few data cleaning steps were performed on the sensors that recorded continuously, to remove data recorded between exercises. The result is a dataset in which data from the four sensors are annotated with the respective activity and user at each time-stamp.
To prepare this dataset for ML algorithms we use a sliding window with overlap. We use a 5-second window and a 3-second overlap in these experiments, and the values are kept constant across all sensors for comparability and consistency.
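The windowing step can be illustrated with a minimal sketch. The function name `sliding_windows`, the 100 Hz sampling rate and the synthetic recording are assumptions made for illustration; only the 5-second window and 3-second overlap come from the text, and the published pre-processing code may differ.

```python
def sliding_windows(samples, rate_hz, window_s=5.0, overlap_s=3.0):
    """Split one recording into fixed-length, overlapping windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    if overlap_s >= window_s:
        raise ValueError("overlap must be shorter than the window")
    size = int(round(window_s * rate_hz))                 # samples per window
    step = int(round((window_s - overlap_s) * rate_hz))   # shift between windows
    return [samples[start:start + size]
            for start in range(0, len(samples) - size + 1, step)]


# Hypothetical 60-second recording sampled at an assumed 100 Hz.
recording = list(range(60 * 100))
windows = sliding_windows(recording, rate_hz=100)
print(len(windows), len(windows[0]))
```

With a 5-second window and a 3-second overlap, consecutive windows start 2 seconds apart, so each sample away from the edges of a recording appears in more than one window.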
We used a reduced frame rate of 1 frame/second for DC and PM data. Frames were selected by incrementing the time-stamp by 1 second and taking the nearest frame (in contrast to averaging the frames over the one second). Additionally, DC data frames were down-scaled with the OpenCV resize function, considering computational memory requirements. The resulting raw feature vector for DC or PM is formed from the flattened frames of a window. An exploratory study of the hyper-parameters window size, overlap, frame rate, frame selection method and frame size with regard to PM data can be found in the Appendix.
A similar windowing method was applied to the accelerometer data (ACW and ACT), followed by a Discrete Cosine Transform (DCT). The DCT was applied to each axis individually and the first 60 DCT components were kept. The three per-axis vectors were appended, resulting in a final vector of length 180. Table 3 details the meta-data of the pre-processed dataset. All values are means, averaged across folds or users.
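The DCT feature construction can be sketched as follows. This is a dependency-free illustration and not the authors' implementation: `dct_ii` is a hypothetical helper that computes an unnormalised DCT-II directly from its definition, so its scaling may differ from the scipy routine used in the study, and the synthetic window of 500 samples is an assumption.

```python
import math


def dct_ii(signal, n_coeffs):
    """First n_coeffs coefficients of an unnormalised DCT-II (direct summation)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (t + 0.5) * k / n) for t, x in enumerate(signal))
        for k in range(n_coeffs)
    ]


def dct_features(window_xyz, n_components=60):
    """Per-axis DCT, keep the first n_components, then append the three axes."""
    features = []
    for axis in zip(*window_xyz):          # iterate over the x, y and z series
        features.extend(dct_ii(axis, n_components))
    return features


# Synthetic window of 500 (x, y, z) samples: a sine wave, a constant and zeros.
window = [(math.sin(2 * math.pi * t / 100), 1.0, 0.0) for t in range(500)]
features = dct_features(window)
print(len(features))
```

Keeping only the leading coefficients retains the low-frequency content of each axis, which is one plausible reason a 180-value vector can stand in for the full window.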
|Total number of instances||6262|
|Instance per user||208|
|Instance per class per user||30|
|5-user fold cross validation||Training instances||5218|
|Input vector sizes||ACT/ACW|
|ACT/ACW raw data||Minimum value||-8.0|
4.2 Evaluation Methodology
Each experiment was evaluated with 5-user-fold cross validation, which creates 6 folds. Each fold was repeated 10 times for non-deterministic algorithms. Every fold uses data from 25 of the 30 users for training and the remaining 5 users for testing. This methodology emulates a real-life deployment setting where the end-user data are not seen during training.
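A minimal sketch of how such user-level folds could be built is given below. The function name `user_folds`, the user identifiers 1 to 30 and the seeded shuffle are assumptions for illustration; the text does not state how users were assigned to folds.

```python
import random


def user_folds(user_ids, users_per_fold=5, seed=0):
    """Partition users into disjoint test groups; the rest form the training set."""
    users = sorted(user_ids)
    random.Random(seed).shuffle(users)     # assumed: random assignment to folds
    folds = []
    for i in range(0, len(users), users_per_fold):
        test_users = users[i:i + users_per_fold]
        train_users = [u for u in users if u not in test_users]
        folds.append((train_users, test_users))
    return folds


folds = user_folds(range(1, 31))
for train_users, test_users in folds:
    print(len(train_users), len(test_users))
```

Because the split is made over users rather than over windows, no window from a test user can appear in the training set of the same fold.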
Macro F-measure is the performance measure selected for the experiments, as it represents precision and recall better than accuracy does. The F-measure is calculated for each label, and their unweighted mean is presented as the final value (Equation 1). A weighted F-measure is not required in these experiments as the dataset is class balanced (i.e. contains a similar number of data instances for each class).
$$F_{macro} = \frac{1}{|C|} \sum_{c \in C} \frac{2 \, P_c \, R_c}{P_c + R_c} \qquad (1)$$
where $C$ is the set of class labels and $P_c$ and $R_c$ are the precision and recall of class $c$.
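As an illustration of the measure, the sketch below computes a macro F-measure from scratch. The function name `macro_f_measure` and the toy labels are hypothetical, and a class with no true or predicted instances is scored 0 here, which is a convention chosen for this sketch.

```python
def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the per-class F-measure."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Toy example with three classes and two instances per class.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f_measure(y_true, y_pred)
print(round(score, 4))
```

In this toy example the per-class scores are 0.5, 0.8 and 2/3, and the macro value is their plain average, so every class contributes equally regardless of its size.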
Here we list the classification algorithms and the different feature representations considered in the experiments.
|kNN||K-Nearest Neighbours algorithm; we present results with k=1 and k=3|
|SVM||Support Vector Machine classifier with a Radial Basis Function kernel|
|MLP||Multi-layer Perceptron classifier with a softmax activation layer that selects the maximum probability class as the predicted class|
4.3.1 Feature representations
Raw sensor data (flattened if necessary) as the feature representation.
For ACT and ACW, the three axial time-series are converted into a DCT feature vector by applying the DCT to each axis, keeping the first 60 elements of each resulting vector and appending the three vectors to form a feature vector of length 180.
For PM and DC, an Auto-encoder (AE) with 5 hidden layers (Figure 5) is trained to reconstruct the sensor data, and the centre hidden layer, which has the lowest dimension, is used as the feature representation. Here the AE performs dimensionality reduction and produces a feature vector of size 64 as the input to the classifier.
Artificial Neural Network (ANN), comprising one or more densely connected hidden layers. Each hidden layer is followed by a Batch Normalisation layer to normalise the output and to avoid over-fitting. Both variations below output a feature vector of size 100 as the input to the classifier.
Figure 6: ANN models
Convolutional Neural Networks (CNN). As with the ANN, we use Batch Normalisation at the output of each hidden layer for regularisation. We explore three variations to suit the different sensor modalities, as follows; they all produce an output vector of size 100 as the input to the classifier.
Raw-1D: Comprises 1-dimensional convolution and max pooling layers, as in Figure 7(a). For ACT and ACW, the input is raw data of length 500 (one 5-second window) with 3 channels (i.e. the 3 axes of the accelerometer data). For PM and DC data, each frame is flattened to form a vector, and the frames from a time window are appended to create a one-dimensional input vector with 1 channel.
2D: Consists of 2-dimensional convolution and max pooling layers (Figure 7(b)). For PM and DC data, the frames within a time window are appended to form a 2D input with 1 channel.
Figure 7: CNN models
Long Short-Term Memory (LSTM) Neural Networks. As with the ANN and CNN, we use Batch Normalisation for regularisation. Here, time-distributed CNN architectures that learn low-level feature embeddings are followed by an LSTM layer that learns temporal dependencies, as in Figure 8. Again we explore three variations, as follows.
DCT-1D-CNN: Time-distributed 1D convolution architecture (similar to Figure 7(a)) for the DCT data of ACT and ACW.
Raw-1D-CNN: Time-distributed 1D convolution architecture (similar to Figure 7(a)) suited to all sensor modalities.
2D-CNN: Time-distributed 2D convolution architecture (similar to Figure 7(b)) for PM and DC data.
All k-NN and SVM models were implemented using the scikit-learn Python library (https://scikit-learn.org/stable/), and the DCT transformation was done using the scipy Python library (https://docs.scipy.org/doc/scipy/reference/fftpack.html). The Adadelta optimiser (https://arxiv.org/abs/1212.5701) with default settings was used to optimise the models, and the training batch size was set to 32. All the pre-processing and model implementation code is publicly available on GitHub (https://github.com/anjanaw/MEx).
In addition to the above experiments, an exploratory study of the dataset is included in the Appendix. The first subsection of the Appendix compares each sensor in personalised and non-personalised evaluation settings. With these experiments we observe how each sensor performs when end-user data is introduced during training. The following subsections explore the hyper-parameters we considered during pre-processing of the dataset: first different window and overlap values with PM data, then different frame rates and compression ratios with PM data. These studies gave us insights into how to improve performance, which will be explored in future work.
For each sensor, the best performing algorithm is highlighted in bold, and algorithms whose performance is not significantly different from the best (t-test at 95% confidence) are marked with an asterisk.
In general, the MLP classifier achieved the best performance across the different feature embedding functions. In addition, deep models consistently performed better than shallow models. For the PM and DC sensors, which produce visual data, 2D convolutional feature embedding functions were more advantageous than 1D feature embedding functions or learning temporal dependencies. In contrast, for the ACW and ACT sensors, which produce time-series data, learning temporal dependencies with a 1D feature embedding was more advantageous.
ACT and ACW data are best represented for classification by the LSTM model with 1D-CNN as the feature embedding function. It is evident that learning temporal dependencies with LSTM models results in better feature representations for accelerometer data, but it is also noteworthy that the 1D-CNN models and ANN representations perform close to the LSTM models. In addition, there is a notable difference between the performance of DCT and raw accelerometer data with deep models. Specifically, ACT and ACW DCT feature representations significantly outperform the same model with raw data, by 18.66% and 22.41% respectively for LSTM models and by 11.10% and 15.09% respectively for CNN models. These results confirm that the DCT feature transformation achieves better performance than raw data for accelerometer sensors. This is in line with evidence in the literature comparing raw data with lazy feature engineering methods.
Considering PM and DC raw data, kNN and SVM perform poorly, suggesting the importance of learning a feature representation for visual data. Accordingly, we compare the classification performance of raw data against a feature representation learnt with an AE. The feature representation learnt at the narrow hidden layer of the AE through dimensionality reduction improved performance significantly for both DC and PM data. With the 3-NN algorithm, DC and PM data achieved 6.30% and 8.68% performance improvements respectively, and with SVM the improvements were 3.06% and 31.77%.
With DC data, the best performance was observed with the 2D-CNN model, which significantly outperformed all other models. LSTM models with 1D-CNN feature embedding perform second best with DC data. PM data was best represented by the LSTM model with the 1D-CNN feature embedding function, but it is noteworthy that the 2D-CNN and 2D-CNN-LSTM models also perform well. These results suggest that visual sensor data are best represented for classification with convolutional models. In general, we observe that learning temporal dependencies with LSTM contributes to increased performance with PM and DC data.
In summary, these results emphasise the characteristics of different classification models and feature representation methods with different sensor data types. In future, we plan to optimise these models for single-sensor classification as well as for multi-sensor fusion classification.
This paper presents MEx, the Multi-modal Exercises Dataset for Human Activity Recognition, and benchmark performance with standard classification algorithms. The dataset contains 7 exercises recorded with four sensors: a depth camera, a pressure mat and two accelerometers. These sensors generate different data types, and we explore how classifier performance is affected by different shallow and deep feature representations of the sensor data. Our study concluded that visual data such as pressure mat or depth camera data are better represented with 2-D convolutional architectures (2D-CNN), while time-series data from accelerometers favoured the combination of shallow feature transformations (DCT) and learning temporal dependencies with recurrent architectures (LSTM). With these results, we plan to explore multi-modal sensor fusion methods with attention mechanisms to find algorithms with improved performance, towards implementing an exercise recognition system for patients with Musculoskeletal Disorders. In addition, the data visualisation suggests that exercise performance quality is distinctly captured by pressure mat data, which we will exploit in future to guide users to perform exercises with precision.
This work is part funded by SelfBACK. The SelfBACK project is funded by the European Union’s H2020 research and innovation programme under grant agreement No. 689043. More details available at http://www.selfback.eu.
1. Abajobir, A.A., Abate, K.H., Abbafati, C., Abbas, K.M., Abd-Allah, F., Abdulkader, R.S., Abdulle, A.M., Abebo, T.A., Abera, S.F., Aboyans, V., et al.: Global, regional, and national incidence, prevalence, and years lived with disability for 328 diseases and injuries for 195 countries, 1990–2016: a systematic analysis for the global burden of disease study 2016. The Lancet 390(10100), 1211–1259 (2017)
2. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. pp. 448–456 (2015)
3. Ordóñez, F.J., Roggen, D.: Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 16(1), 115 (2016)
4. Radu, V., Lane, N.D., Bhattacharya, S., Mascolo, C., Marina, M.K., Kawsar, F.: Towards multimodal deep learning for activity recognition on mobile devices. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. pp. 185–188. ACM (2016)
5. Sani, S., Massie, S., Wiratunga, N., Cooper, K.: Learning deep and shallow features for human activity recognition. In: International Conference on Knowledge Science, Engineering and Management. pp. 469–482. Springer (2017)
6. Yao, S., Hu, S., Zhao, Y., Zhang, A., Abdelzaher, T.: Deepsense: A unified deep learning framework for time-series mobile sensing data processing. In: Proceedings of the 26th International Conference on World Wide Web. pp. 351–360. International World Wide Web Conferences Steering Committee (2017)
Personalised vs. Non-personalised settings
We explore two settings, personalised and non-personalised, to measure the susceptibility of each sensor to personal nuances.
Personalised: The classification model is tested on the same set of users used during training.
Non-personalised: The model is trained and tested on data from two mutually exclusive sets of users.
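The difference between the two settings can be sketched as two ways of splitting the same instances. Both function names are hypothetical, and the personalised split shown here (holding out every fifth instance of each user) is an assumption; the text does not describe how the personalised train and test sets were drawn.

```python
def non_personalised_split(instances, test_users):
    """Train and test on mutually exclusive sets of users.

    Each instance is a (user, features, label) tuple.
    """
    test_users = set(test_users)
    train = [x for x in instances if x[0] not in test_users]
    test = [x for x in instances if x[0] in test_users]
    return train, test


def personalised_split(instances, test_every=5):
    """Assumed scheme: hold out every test_every-th instance of each user,
    so that every user contributes to both training and testing."""
    train, test, seen = [], [], {}
    for inst in instances:
        count = seen.get(inst[0], 0)
        if count % test_every == test_every - 1:
            test.append(inst)
        else:
            train.append(inst)
        seen[inst[0]] = count + 1
    return train, test


# Synthetic data: 6 users with 10 instances each; features are left empty.
instances = [(user, None, i % 7) for user in range(1, 7) for i in range(10)]
np_train, np_test = non_personalised_split(instances, test_users=[6])
p_train, p_test = personalised_split(instances)
print(len(np_train), len(np_test), len(p_train), len(p_test))
```

Note that with overlapping windows a held-out instance in a personalised split may share raw samples with neighbouring training instances, which is one reason personalised scores are not directly comparable to non-personalised ones.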
Non-personalised evaluation is the established method of evaluation for HAR models, as it resembles a real-life scenario where end-users (test users) are not seen during training. With the personalised setting, we aim to observe the impact of introducing end-user data to the learning process. For these experiments we used the LSTM model with 1D-CNN embedding for ACW and ACT data (which achieved the best performance in Section 5), and the 2D-CNN model for DC and PM data. We follow the same methodology as in Section 4, repeating each 6-fold test 10 times. The mean F-measure is presented in Table 6.
The results suggest that using end-user data during training achieves a significant performance improvement with all sensors. The highest improvements are observed with ACW data (27.78%) and PM data (26.94%), followed by DC and ACT with 13.22% and 9.80% respectively.
The results suggest that it is highly advantageous for PM and ACW models to learn from end-user data. These sensors also performed poorly in the non-personalised setting. This suggests that the feature embedding function learnt from the training users has difficulty representing test-user data correctly (i.e. it is unable to generalise to new data), and once some end-user data is introduced in training, the models perform significantly better. These observations indicate that PM and ACW data carry the highest amount of personal nuances, resulting in the largest differences between data from different users.
In summary, a sensor such as PM is most inclined to capture personal nuances such as weight and shape, while the ACW sensor may capture more personal quirks and habits. In contrast, data from sensors such as ACT or DC are indifferent to personal characteristics and capture movements that generalise across users. This is an essential insight when selecting sensors for tasks beyond activity recognition, such as exercise quality assessment or counting, where capturing personal characteristics is essential.
Window and overlap
An empirical study was conducted to understand the effect of different window and overlap settings on classification performance. We conducted a set of experiments with the PM data on the MLP classifier with the 2D-CNN feature embedding function. Window sizes of 3, 5 and 8 seconds were considered, with a number of overlap values for each window size. Experiments were done in a non-personalised setting; each 6-fold experiment was repeated 5 times and the mean F1-measure is presented in Table 7. The best performing overlap for each window setting is highlighted in bold.
Overall, we observe that classification performance improves with window size, which is intuitive given that with a larger window a data instance carries more information for recognising an activity. But there are limitations to using a larger window size in practice. Firstly, a larger window size results in larger data instances and a smaller number of training instances; a limited number of training instances may cause parametric models to over-fit or fail to optimise. In addition, at test time, the time elapsed between two predictions is longer, which is not user friendly for real-time use. Therefore it is important to find window and overlap sizes that are practical in real-world applications while preserving classification performance.
Frame Selection, Frame Rate and Compression
Achieving the highest performance with a small amount of data can be challenging due to information loss. An empirical study was carried out to observe the impact on performance when frames are down-scaled to reduce frame size and when frame rates are reduced. A number of experiments were designed exploring three frame sizes and three frame rates with the following frame selection techniques.
Average (AVG): A new frame is created by pixel wise averaging all frames within the time-period.
Increment (INC): Increment time-stamp by the time-period and select the nearest time-stamp and respective frame.
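The two frame selection techniques can be sketched as follows. This is an illustrative reading of the descriptions above rather than the published code: the function names, the choice of the first time-stamp as the starting tick, and the representation of a frame as a flat list of pixel values are all assumptions.

```python
def select_inc(timestamps, frames, period_s=1.0):
    """INC: advance a clock by period_s and keep the frame nearest to each tick."""
    selected = []
    k = 0
    while timestamps[0] + k * period_s <= timestamps[-1]:
        tick = timestamps[0] + k * period_s
        nearest = min(range(len(timestamps)),
                      key=lambda i: abs(timestamps[i] - tick))
        selected.append(frames[nearest])
        k += 1
    return selected


def select_avg(timestamps, frames, period_s=1.0):
    """AVG: pixel-wise mean of all frames that fall inside each period."""
    start = timestamps[0]
    buckets = {}
    for t, frame in zip(timestamps, frames):
        buckets.setdefault(int((t - start) // period_s), []).append(frame)
    return [[sum(pixels) / len(group) for pixels in zip(*group)]
            for _, group in sorted(buckets.items())]


# Synthetic stream: 8 two-pixel frames captured at 4 frames per second.
timestamps = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
frames = [[float(i), 2.0 * i] for i in range(8)]
inc = select_inc(timestamps, frames)
avg = select_avg(timestamps, frames)
print(inc)
print(avg)
```

INC keeps frames that were actually recorded, whereas AVG produces synthetic frames that smooth over movement within the period; both reduce the stream to one frame per period.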
Each 6-fold experiment was repeated 5 times and the mean F1-measure is presented in Table 8. The best performing setting for each frame size is highlighted in bold.
|Frame Size||Frame Selection||Frames per second|
Considering the frame size, we observe that the original frame size 1 and the down-scaled frame size 2 perform similarly, and that performance degrades with frame size 3. It is also observed that an increased frame rate does not necessarily improve performance, as frame sizes 1 and 2 achieve their best performance at 1 frame per second. But once the frame size is reduced further, it is evident that an increased frame rate improves performance. In general, the INC method yielded better performance across all three frame sizes, but we observe no statistically significant performance difference between the AVG and INC methods.
The important takeaway from this study is that frame size and frame rate can be selected such that they compensate for each other and, more importantly, that selecting a higher frame rate and a larger frame size does not automatically improve performance. Lazy feature augmentation techniques such as frame down-scaling and frame selection can be pivotal to achieving better performance with reduced memory and computational power.