Classifying action correctness in physical rehabilitation exercises

by Alina Miron et al.
Brunel University London

The work in this paper focuses on the role of machine learning in assessing the correctness of a human motion or action. This task proves to be more challenging than the gesture and action recognition ones. We will demonstrate, through a set of experiments on a recent dataset, that machine learning algorithms can produce good results for certain actions, but can also fall into the trap of classifying an incorrect execution of an action as a correct execution of another action.






1 Introduction

Analyzing human motion has been an intensely studied problem in the computer vision community. While most works focus on the challenging task of action detection and recognition, there is still limited work in the domain of human motion quality assessment from a functional point of view. Several potential applications for this exist, including health care for patient rehabilitation and sports for athlete performance improvement [Pirsiavash et al.2014].

Assessing the quality of human motion/actions is a difficult problem. Human experts such as coaches, physiotherapists, or doctors have been trained extensively to discover the rules required to assess different types of motions.

In this work, we concentrate on motion quality assessment from a correctness perspective, using machine learning methods. We investigate whether machine learning methods can, given a set of actions executed both correctly and incorrectly, classify each execution as valid or not in a binary manner.

2 Related work

The tasks of everyday activity recognition [Pirsiavash and Ramanan2012, Shahroudy et al.2016] and action recognition [Idrees et al.2017] have been discussed in several papers, with a large number of datasets being publicly available [Escalera et al.2017].

There are a few articles where the task of action correctness is approached from different perspectives. Some approaches use accelerometer sensors, while others rely on depth or colour cameras.

[Ebert et al.2017] tackles the task of qualitative assessment of human motion through the use of an accelerometer, assigning a quality class label to each motion. Other authors, such as [Parisi et al.2016], focus on computing how closely a performed movement, recorded using a depth camera, matches the correct continuation of a learned sequence. [Paiement et al.2014] uses the recorded gait movement of six healthy subjects going up the stairs to train a model; the model's ability to detect abnormalities is then tested on six other patients with three types of simulated knee injuries.

Action correctness is also closely related to action completeness [Heidarivincheh et al.2016]. In this context, an action is considered complete if the action goal was achieved: i.e. the drinking action is complete when one actually consumes a beverage from a cup. The authors used six types of actions, each involving interaction with different objects, to test completeness. In this article, we do not aim for action completeness; instead, we focus on the specific task of how correctly an action is performed.

For the action correctness task, the only publicly available dataset that we found is UI-PRMD, proposed by [Vakanski et al.2018]. They have recorded 10 subjects performing 10 types of actions, with each action being performed in an optimal and non-optimal way. The dataset was not recorded with a particular type of injury in mind, but focuses instead on healthy subjects performing a few types of exercises.

At the moment, no public baseline benchmark has been published on the UI-PRMD dataset. Nevertheless, we are studying the feasibility of training an action correctness model on this dataset. We construct a binary classifier for every type of exercise, with the purpose of differentiating between a correctly and an incorrectly executed action. Different subjects might perform the non-optimal movement in several ways. For example, for the "Deep squat" exercise, the non-optimal movement is defined by [Vakanski et al.2018] as "Subject does not maintain upright trunk posture, unable to squat past parallel, demonstrates knee valgus collapse or trunk flexion greater than 30 degrees". This definition allows a certain degree of subjectivity in assessing the correctness of the movement.

Figure 1: The structure of the Res-TCN, where BRC stands for Batch Normalization, ReLU and Convolution

Instead of using hand-crafted features for every type of action, our purpose is to use a machine learning system to learn what makes a movement non-optimal. We use the Temporal Convolutional Neural Network (Res-TCN) proposed by [Kim and Reiter2017]. This classifier is one of the top performing methods on a large-scale action recognition dataset [Shahroudy et al.2016], giving slightly better performance than the STA-LSTM used by [Song et al.2017].

3 Machine learning for action correctness

The results presented in this section are obtained with Convolutional Neural Networks. It is worth mentioning that a few other standard machine learning methods have been tested, including Support Vector Machines and Random Forests, but their classification accuracy was not significantly higher than 50%, which is close to a random decision.

3.1 Overview of the Res-TCN classifier

In this section we provide a short overview of the structure of the network proposed by [Kim and Reiter2017].

Figure 1 depicts the structure of the Res-TCN network. The input to the network is the concatenated skeleton features from every frame of the video sequence. This is followed by a first convolution with a filter length of eight, a stride of one, and eight filters.

The following nine blocks are Residual Units introduced by [He et al.2016], consisting of batch normalization, ReLU, and convolution operations, with the number of convolution filters increasing from 64 to 128 and 256. After the last layer, global average pooling is applied across the entire temporal sequence. A final softmax layer, with the number of neurons equal to the number of classes, performs the classification.
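As a rough illustration of the data flow described above, the following is a minimal numpy sketch of one BRC residual unit followed by global average pooling and softmax. Random weights, a single residual unit, and a fixed width of 64 are simplifications for illustration; the actual network stacks nine units with 64 to 256 filters.

```python
import numpy as np

def conv1d(x, w, stride=1):
    """'Same'-padded temporal convolution. x: (T, C_in), w: (k, C_in, C_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    t_out = (x.shape[0] - 1) // stride + 1
    out = np.zeros((t_out, w.shape[2]))
    for t in range(t_out):
        out[t] = np.tensordot(xp[t * stride: t * stride + k], w, axes=([0, 1], [0, 1]))
    return out

def brc(x, w):
    """One BRC step: (simplified) batch normalization, ReLU, then convolution."""
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)
    return conv1d(np.maximum(x, 0.0), w)

rng = np.random.default_rng(0)
T, C = 50, 66                               # padded sequence length, 66-dim skeleton features
x = rng.normal(size=(T, C))
h = conv1d(x, 0.1 * rng.normal(size=(8, C, 64)))          # first conv: filter length 8, stride 1
w1 = 0.1 * rng.normal(size=(8, 64, 64))
w2 = 0.1 * rng.normal(size=(8, 64, 64))
h = h + brc(brc(h, w1), w2)                 # one residual unit: two BRC steps plus skip connection
pooled = h.mean(axis=0)                     # global average pooling over time
logits = pooled @ (0.1 * rng.normal(size=(64, 2)))
probs = np.exp(logits) / np.exp(logits).sum()             # softmax over the two classes
```

The residual (skip) connection requires the unit's input and output channel counts to match, which is why the sketch keeps a fixed width.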

The advantage of the Res-TCN architecture over recurrent alternatives such as LSTMs is its potential for model interpretability, as shown by [Kim and Reiter2017].

3.2 Model parameters

In [Kim and Reiter2017], the authors use the raw 3D skeleton points as input to the TCN. We used a similar setup, but also tested the system's performance when it receives as input the angles between different joints. For the 3D skeleton points setup, we take the computed (X, Y, Z) values of each skeleton joint and concatenate all values to form a skeleton feature. The skeleton feature per frame is a 66-dimensional vector, obtained by multiplying the number of joints (22) by the data per point (3), as shown in Figure 2.


Figure 2: The raw feature extraction for a single sample from the UI-PRMD dataset
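The per-frame feature construction reduces to a reshape; a minimal numpy sketch with synthetic joint data (the array shapes, not the values, are what matter here):

```python
import numpy as np

# Hypothetical sample: per frame, 22 joints with (X, Y, Z) coordinates each.
rng = np.random.default_rng(1)
sample = rng.normal(size=(100, 22, 3))            # (frames, joints, coordinates)
features = sample.reshape(sample.shape[0], -1)    # concatenate per frame -> (frames, 66)
```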

One disadvantage of a TCN architecture over alternatives such as LSTMs is that the input size has to be consistent across all examples. We overcome this limitation by finding the maximum video length across all segmented movements and zero-padding each 2D feature array to that length.
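A minimal numpy sketch of this zero-padding step (feature dimension 66 as above; the sequence lengths are illustrative):

```python
import numpy as np

def zero_pad(sequences, max_len=None):
    """Zero-pad a list of (T_i, D) feature arrays along time to a common length."""
    if max_len is None:
        max_len = max(s.shape[0] for s in sequences)
    dim = sequences[0].shape[1]
    out = np.zeros((len(sequences), max_len, dim))
    for i, s in enumerate(sequences):
        out[i, : s.shape[0]] = s        # original frames first, zeros trail
    return out

seqs = [np.ones((40, 66)), np.ones((55, 66)), np.ones((31, 66))]
batch = zero_pad(seqs)                  # (3, 55, 66): padded to the longest sequence
```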

For the Res-TCN parameters, we used the same configuration as proposed by [Kim and Reiter2017]: stochastic gradient descent with Nesterov acceleration and a momentum of 0.9, a weight regularizer applied to all convolution layers, and, to prevent overfitting, a dropout of 0.5 after every ReLU. The model is trained for 500 epochs with a batch size of 128.
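For concreteness, a minimal numpy sketch of a single SGD step with Nesterov momentum (momentum 0.9, as in the configuration above), applied to a toy quadratic; the learning rate, objective, and step count are illustrative only:

```python
import numpy as np

def nesterov_step(w, v, grad, lr=0.1, mu=0.9):
    """One SGD step with Nesterov momentum: the gradient is taken at the look-ahead point."""
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

# Minimize f(w) = w^2 (gradient 2w); the iterate converges toward 0.
w, v = np.array([5.0]), np.array([0.0])
for _ in range(100):
    w, v = nesterov_step(w, v, lambda u: 2.0 * u)
```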

3.3 Data and experiments

As mentioned above, we use the UI-PRMD dataset for our machine learning investigation. The data has been recorded using both a Kinect camera and a Vicon optical tracker [Vicon2018]. The Vicon optical tracker is a system designed for capturing human motion with high accuracy, consisting of eight high-speed cameras that track a set of retroreflective markers. We focused only on the data recorded with the Kinect, due to the low cost of the sensor compared with the Vicon.

UI-PRMD data consists of 10 movements: deep squat, hurdle step, inline lunge, side lunge, sit to stand, standing active straight leg raise, standing shoulder abduction, standing shoulder extension, standing shoulder internal-external rotation, and standing shoulder scaption. Each movement is repeated 10 times by each of the 10 individuals recorded.

The Kinect data is provided as 22 YXZ triplets of Euler angles together with positions. The values for the waist joint are given in absolute coordinates, while those of the remaining joints are given in relative coordinates with respect to the parent joint [Vakanski et al.2018]. Based on the local angle and position data, we computed the transformation matrices to obtain the absolute joint coordinates as 3D points.
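A sketch of this local-to-absolute conversion, assuming a YXZ rotation order and a parent-indexed skeleton; the toy two-joint chain at the end is purely illustrative, not the dataset's actual joint layout:

```python
import numpy as np

def euler_to_rot(y, x, z):
    """YXZ Euler angles (degrees) to a rotation matrix (rotation order assumed)."""
    cy, sy = np.cos(np.radians(y)), np.sin(np.radians(y))
    cx, sx = np.cos(np.radians(x)), np.sin(np.radians(x))
    cz, sz = np.cos(np.radians(z)), np.sin(np.radians(z))
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Ry @ Rx @ Rz

def absolute_positions(angles, offsets, parents):
    """Chain local rotations and offsets down the skeleton tree to absolute 3D positions.

    angles: (J, 3) local YXZ Euler angles; offsets: (J, 3) positions relative to the
    parent joint (absolute for the root); parents: parent index per joint, -1 for root.
    Assumes each parent index precedes its children.
    """
    n_joints = len(parents)
    rot = [None] * n_joints
    pos = np.zeros((n_joints, 3))
    for j in range(n_joints):
        local = euler_to_rot(*angles[j])
        if parents[j] < 0:
            rot[j], pos[j] = local, offsets[j]          # root (waist): absolute coordinates
        else:
            rot[j] = rot[parents[j]] @ local            # accumulate rotations down the chain
            pos[j] = pos[parents[j]] + rot[parents[j]] @ offsets[j]
    return pos

# Toy chain: rotating the root 90 degrees about Y moves the child from (1,0,0) to (0,0,-1).
p = absolute_positions(np.array([[90.0, 0.0, 0.0], [0.0, 0.0, 0.0]]),
                       np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]), [-1, 0])
```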

We followed a cross-subject evaluation, splitting the data into training, validation, and testing sets. We used 6 subjects for the training phase, 3 for validation, and 1 for testing. To validate our model we used 10-fold cross-validation, each time placing a different subject in the testing set.

The authors of [Shahroudy et al.2016] defined the testing protocol for the NTU RGB+D dataset as a training/testing split. Due to the smaller size of the UI-PRMD data, we find a training/validation/testing protocol more appropriate. We use the validation set to avoid overfitting on the training set: we save the model that generalizes best, i.e. obtains the best accuracy on the validation set, and use that model on the testing set.
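The protocol above can be sketched as follows; the exact subject grouping per fold is an assumption, since the text fixes only the 6/3/1 counts:

```python
# Cross-subject 10-fold protocol: each fold holds out one subject for testing,
# three for validation, and trains on the remaining six.
subjects = list(range(1, 11))
folds = []
for test_subject in subjects:
    rest = [s for s in subjects if s != test_subject]
    folds.append({"train": rest[:6], "val": rest[6:], "test": [test_subject]})
```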

3.4 Results and discussions

Figure 3: Average accuracy over 10-folds for every action

For every movement type, we trained a different model with the explicit purpose of differentiating between optimal and non-optimal movements. Figure 3 presents the average results for every movement type, using as input to the neural network either the absolute 3D point positions or the relative joint angles.

When using the 3D points, the average accuracy is 62.3%; when using the relative joint angles, it is 71.2%.

The movement types for which the non-optimal execution was consistently detected across subjects are deep squat, with 83.5% accuracy for relative joint angles, and sit to stand, with 82% accuracy. The movements that proved most difficult are standing shoulder extension, with 62% accuracy, and standing shoulder internal-external rotation, with 63% accuracy.

Figure 4: Accuracy for side lunge, standing shoulder internal-external rotation and standing shoulder scaption for each test case

In Figure 4 we present a detailed view of the classifier performance for three actions (side lunge, standing shoulder internal-external rotation, and standing shoulder scaption) across all subjects in the dataset. For example, we obtained an accuracy of 50% for Person 1 on side lunge by training a binary classifier where the training set consisted of Persons 2 to 7, the validation set of Persons 8 and 9, and the test set of Person 1. An accuracy of 50% is equivalent to a random decision, therefore the classifier was not able to generalize for this particular subject. On the other hand, for Persons 2 and 8 the classifier achieved perfect accuracy for certain types of actions.

One reason for this discrepancy in performance between subjects might be the way the actions were actually performed. For the three analyzed actions (side lunge, standing shoulder internal-external rotation, and standing shoulder scaption), most subjects perform the motion using the right side of the body (right leg, right shoulder). The exceptions are Persons 7 and 10, who perform the motion using the left side of the body.

We also trained a general model that classifies whether any input movement, regardless of type, is performed correctly; it obtained an average accuracy of 63.3% when using joint angles as input. This is much lower than the 71.2% accuracy obtained when training specialized models for every movement type.

4 Conclusion

The work presented here is an initial investigation of the applicability of machine learning to human motion quality assessment. In particular, we looked into the task of training a model to recognize an action or movement as correct or not. Our results show high variability across action types. Angle features appear more relevant than raw 3D joint positions. More work has to be done in identifying, adapting, and improving machine learning methods, such as the convolutional neural networks that proved effective at least for some classes of actions.

The results show that actions involving a movement symmetry of the body, such as sit to stand and deep squat, were easier to train. For certain classes of actions, the variability of movement is too high; therefore, we believe a data augmentation process might help especially with these action types. As future work, we plan to augment the training set with simulated data, by mirroring the motion performed by one side of the body to the other.


  • [Ebert et al.2017] Andre Ebert, Michael Till Beck, Andy Mattausch, Lenz Belzner, and Claudia Linnhoff Popien. Qualitative assessment of recurrent human motion. arXiv preprint arXiv:1703.02363, 2017.
  • [Escalera et al.2017] Sergio Escalera, Vassilis Athitsos, and Isabelle Guyon. Challenges in multi-modal gesture recognition. In Gesture Recognition, pages 1–60. Springer, 2017.
  • [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [Heidarivincheh et al.2016] Farnoosh Heidarivincheh, Majid Mirmehdi, and Dima Damen. Beyond action recognition: Action completion in rgb-d data. In BMVC, 2016.
  • [Idrees et al.2017] Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.
  • [Kim and Reiter2017] Tae Soo Kim and Austin Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1623–1631. IEEE, 2017.
  • [Paiement et al.2014] Adeline Paiement, Lili Tao, Sion Hannuna, Massimo Camplani, Dima Damen, and Majid Mirmehdi. Online quality assessment of human movement from skeleton data. In British Machine Vision Conference, pages 153–166. BMVA press, 2014.
  • [Parisi et al.2016] German I Parisi, Sven Magg, and Stefan Wermter. Human motion assessment in real time using recurrent self-organization. In Robot and Human Interactive Communication (RO-MAN), 2016 25th IEEE International Symposium on, pages 71–76. IEEE, 2016.
  • [Pirsiavash and Ramanan2012] Hamed Pirsiavash and Deva Ramanan. Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2847–2854. IEEE, 2012.
  • [Pirsiavash et al.2014] Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. Assessing the quality of actions. In European Conference on Computer Vision, pages 556–571. Springer, 2014.
  • [Shahroudy et al.2016] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. arXiv preprint arXiv:1604.02808, 2016.
  • [Song et al.2017] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, volume 1, page 7, 2017.
  • [Vakanski et al.2018] Aleksandar Vakanski, Hyung-pil Jun, David Paul, and Russell Baker. A data set of human body movements for physical rehabilitation exercises. Data, 3(1):2, 2018.
  • [Vicon2018] Vicon. Vicon optical tracker system, 2018. [Online; accessed 25-April-2018].