A Survey of Visual Analysis of Human Motion and Its Applications

08/02/2016 ∙ by Qifei Wang, et al. ∙ berkeley college 0

This paper summarizes recent progress in human motion analysis and its applications. We first review motion capture systems and the representation models of human motion data. Next, we sketch advanced human motion data processing technologies, including motion data filtering, temporal alignment, and segmentation. We then give an overview of state-of-the-art approaches to action recognition and dynamics measuring, since these two are the most active research areas in human motion analysis. The last part discusses emerging applications of human motion analysis in healthcare, human-robot interaction, security surveillance, virtual reality, and animation, and summarizes promising future research topics.




I Introduction

Human motion analysis is one of the most active research areas in artificial intelligence. Body motion is the basis of human activities. Analyzing human motion helps us learn the semantic information of human activities, including but not limited to what the subject did (studied as motion recognition), how the subject performed (investigated in human performance analysis), and what the subject will do next (known as motion prediction). These research topics are widely applied in healthcare, security, and robotics.

In 2007, Poppe [poppe2007vision] gave a comprehensive overview of vision-based human motion analysis. The past decade has witnessed significant progress in human motion analysis and a rapid increase in its applications. In this paper, we review the most recent advances in human motion analysis, including motion capture, motion representation models, data processing, action recognition, and dynamics measuring. Moreover, we introduce some typical applications of human motion analysis in healthcare, human-robot interaction, security surveillance, virtual reality, and animation. This overview can help the reader catch up with the state-of-the-art technologies and learn about the future directions of human motion analysis.

The rest of this paper is organized as follows: Section II reviews the motion capture systems and the representation models of human motion data; Section III sketches motion data processing, including motion data filtering, temporal alignment, and temporal segmentation; Section IV discusses the recent progress in action recognition; Section V covers work on human dynamics measuring and performance analysis; Section VI summarizes the emerging applications of human motion analysis and points out promising future research directions.

II Human Motion Capture and Modelling

In anthropometry, the human articulated rigid body (HARB) model [kajzer2000human] is widely applied in human motion analysis. Based on the HARB model, researchers have proposed multiple human motion capture approaches and motion representation models. In this section, three popular human motion capture approaches and the representation models are briefly reviewed.

II-A Human Motion Capture

In the early days, human motion was captured by sensors attached to the body, e.g., accelerometers and gyroscopes [aminian2004capturing]. However, attaching sensors makes data collection inconvenient and limits its use in certain applications. The emergence of optical sensors introduced new approaches that record human motion from visual information [moeslund2006survey]. One basic solution is to capture human motion video with a monocular 2D camera [agarwal2005monocular]. Since humans move in 3D space, a monocular 2D camera cannot record the motion of every degree of freedom (DoF). With the development of multi-view stereo technologies, multi-camera systems [wang2015overview] were introduced to record human motion in 3D space. One class of solutions uses multiple cameras to track markers attached to the human body and fits the marker coordinates to skeletal joint trajectories [herda2000skeleton]. Commercial marker-based systems can provide motion data with sub-millimeter accuracy at frequencies of up to 960 Hz [wang2015evaluation]. To avoid the burden of attaching markers, markerless multi-view motion capture systems were proposed, which track feature points in the video to fit the skeletal joint trajectories [gall2009motion]. Although multi-view motion capture systems can capture human motion with high accuracy and frame rate, the requirements on the capture stage limit their applications.

With the emergence of infrared sensors, depth cameras were introduced to build affordable human motion capture systems [wang2015computational]. The most significant milestone is the release of Microsoft Kinect. The Kinect SDK features real-time full-body tracking of human limbs based on an algorithm that extracts body parts from depth maps [Shotton_2011]. Most recently, Ding and Fan [Meng_HumanPoseEstimation_TIP] proposed a generalized Gaussian kernel correlation function as the similarity measure to facilitate the optimization of body joint angles in a complex articulated structure. The proposed algorithm also achieves real-time performance with competitive accuracy, which gives it great potential in mobile applications.

Compared to the multi-view motion capture systems, the human pose extracted with Kinect still suffers from the noise introduced by the depth sensor. In the first version of Kinect, depth is obtained based on the structured light principle, with typical accuracy of about 1-4 cm in the range of 1-4 m. Kinect version 2 acquires depth by the time-of-flight principle with improved accuracy: the depth error is under 2 mm in the central viewing cone [Zhang_2012]. Wang et al. [wang2015evaluation] extensively studied the skeletal joint tracking error of Kinect version 1, Kinect version 2, and a marker-based motion capture system. According to their analysis, the RGBD camera can obtain relatively stable tracking of human skeletal motion but also introduces jitter on certain joints compared with the high-accuracy motion capture system.

II-B Modeling

Human motion is produced by a sophisticated mechanical system composed of bones, muscles, tendons, etc. Fully modelling the mechanism of human motion is still an open problem in biomechanics. In the context of computer vision and robotics, a skeletal model was proposed to transform human motion data into a skeletal joint chain, which is represented by the root trajectories and the rotation/translation of each DoF [herda2001using]. The skeletal joint chain model adapts to the number of DoFs and can therefore be applied to various types of motion. To model the mechanics of muscles and tendons, some elastic models have been proposed on top of the skeletal model [geijtenbeek2013flexible]. The model-driven approaches can produce robust analysis of human motion but may lack smoothness in motion reconstruction due to the constraints on joint DoFs.
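
To make the skeletal joint chain representation concrete, the following is a minimal sketch that recovers joint positions from a root position and per-joint rotations. It assumes a planar (2D) chain with one rotational DoF per joint; the function name `forward_kinematics` and the segment lengths are illustrative and are not taken from the cited works.

```python
import math


def forward_kinematics(root, joint_angles, segment_lengths):
    """Return the 2D joint positions of a planar joint chain.

    root: (x, y) position of the chain root.
    joint_angles: rotation of each joint relative to its parent, in radians.
    segment_lengths: length of the rigid segment that follows each joint.
    """
    if len(joint_angles) != len(segment_lengths):
        raise ValueError("need exactly one angle per segment")
    x, y = root
    heading = 0.0
    positions = [(x, y)]
    for angle, length in zip(joint_angles, segment_lengths):
        heading += angle  # relative rotations accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((x, y))
    return positions


# Hypothetical two-segment arm: upper arm points up, forearm points forward.
arm = forward_kinematics((0.0, 0.0), [math.pi / 2, -math.pi / 2], [0.3, 0.25])
```

Because each segment length is fixed, the pose is fully described by the root position and the joint angles, which is what keeps the parametric representation compact.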

Besides the model-driven approaches, model-free approaches, also called data-driven approaches, have been proposed to investigate human motion. A typical application of the data-driven approaches is modeling skin and muscle deformation in human motion without fitting the motion data to a skeletal model [park2008data]. Compared to the model-driven approaches, the data-driven approaches avoid the cost of fitting the motion data to a skeletal model, which is sometimes not an efficient representation of the human pose. However, the data-driven approaches suffer from the high dimensionality of the data compared to the parametric model-driven approaches.

A detailed comparison of the model-driven and data-driven approaches is beyond the scope of this paper. In the rest of this paper, the discussion is narrowed to the skeletal model approaches.

III Human Motion Data Processing

III-A Motion Data Filtering

As discussed in Section II, the motion data captured by multi-view camera systems are more accurate than the data captured by RGBD cameras such as Microsoft Kinect. The analysis in [wang2015evaluation] reveals that the noise in motion data captured by Microsoft Kinect is due to tracking loss of human body parts. A similar problem exists in motion data estimated from a single 2D camera when the segmentation of human body parts is inaccurate.

To reduce the outliers in motion data, Wang et al. proposed a mixture of Gaussian and uniform distributions to fit the joint motion data, which assigns the regular motion data to the Gaussian component and the outliers to the uniform component. By applying this mixture model, the variance of the joint position error is reduced by 90%.
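
The idea of a Gaussian component for regular data plus a uniform component for outliers can be sketched with a small expectation-maximization (EM) loop. This is only an illustration under simplifying assumptions (one scalar coordinate, known outlier range `[lo, hi]`, a fixed number of iterations); the function name `fit_gaussian_uniform` and the sample values are hypothetical, and the sketch does not reproduce the cited authors' implementation.

```python
import math


def fit_gaussian_uniform(samples, lo, hi, iterations=50):
    """EM fit of a 1D Gaussian (inliers) plus a uniform on [lo, hi] (outliers).

    Returns (mean, variance, inlier_weight, responsibilities), where each
    responsibility is the posterior probability that a sample is an inlier.
    """
    n = len(samples)
    uniform_density = 1.0 / (hi - lo)
    mean = sorted(samples)[n // 2]  # median as a robust starting point
    var = max(sum((s - mean) ** 2 for s in samples) / n, 1e-9)
    weight = 0.9  # assumed initial share of inliers
    resp = [1.0] * n
    for _ in range(iterations):
        # E-step: posterior probability that each sample is an inlier.
        resp = []
        for s in samples:
            g = weight * math.exp(-0.5 * (s - mean) ** 2 / var)
            g /= math.sqrt(2.0 * math.pi * var)
            u = (1.0 - weight) * uniform_density
            resp.append(g / (g + u))
        total = sum(resp)
        if total == 0.0:
            break
        # M-step: re-estimate the Gaussian from responsibility-weighted data.
        mean = sum(r * s for r, s in zip(resp, samples)) / total
        var = max(sum(r * (s - mean) ** 2 for r, s in zip(resp, samples)) / total, 1e-9)
        weight = min(max(total / n, 1e-6), 1.0 - 1e-6)
    return mean, var, weight, resp


# Hypothetical joint coordinate (metres) with two tracking failures.
coords = [1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 3.50, 1.00, -1.20]
mean, var, weight, resp = fit_gaussian_uniform(coords, lo=-2.0, hi=4.0)
outliers = [i for i, r in enumerate(resp) if r < 0.5]
```

Samples with a low inlier responsibility can then be discarded or re-estimated before further processing.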

Beyond outlier removal, Wang et al. [wang2015unsupervised] proposed a kinematic filter based on the unscented Kalman filter and the rigid segment length assumption to smooth the jitter in motion data. Applying this filter to the raw data captured by Kinect significantly reduces the jitter and produces smooth joint trajectories.

The analysis in [wang2015unsupervised] showed that motion data filtering helps the subsequent data processing and analysis modules produce reliable results from noisy input motion data.

III-B Temporal Alignment and Dynamic Time Warping

Temporal alignment is a challenging problem in the analysis of temporal signals in areas such as speech recognition, computer vision, and bioinformatics. In human motion analysis, dynamic time warping (DTW) is a widely adopted solution to normalize the pace of motion sequences for the purpose of sequence comparison and motion recognition. The classical DTW problem can be solved by dynamic programming in polynomial time [zhou2012generalized]. Zhou et al. [zhou2012generalized] proposed generalized time warping to align motion data of different dimensionalities. Kurtek et al. [kurtek2011signal] proposed a random time warping for nonlinear signal alignment, which shows good results on repetitive signals such as human motion.
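
The dynamic programming solution of classical DTW can be sketched as follows. The sketch assumes scalar sequences and an absolute-difference local cost; the function name `dtw_distance` and the example sequences are illustrative. For two sequences of lengths n and m it fills an (n+1) x (m+1) table, so the cost is O(nm).

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classical DTW cost between two sequences via dynamic programming."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its sample
                cost[i][j - 1],      # b advances, a repeats its sample
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same hypothetical joint-angle pattern performed at two different paces.
fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 2, 2, 1, 0]
pace_invariant_cost = dtw_distance(fast, slow)
```

A plain sample-by-sample comparison would penalize the slower performance, whereas the warped cost is zero here because the two sequences differ only in pace.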

III-C Temporal Motion Segmentation

Temporally partitioning continuous motion sequences into atomic actions is another extensively studied topic that facilitates further motion analysis. General motion segmentation aims at partitioning a motion sequence into the primitive atomic actions that constitute it. For high-precision motion capture data, Barbič et al. [barbivc2004segmenting] proposed three segmentation methods based on principal component analysis (PCA) to distinguish one primitive action from another. Their first two methods can perform the segmentation in real time using PCA and probabilistic PCA, respectively, whereas the third method (PCA-GMM) fits a Gaussian mixture model to the data of the entire exercise sequence offline. Zhou et al. [zhou2013hierarchical] proposed a bottom-up hierarchical aligned cluster analysis (HACA) algorithm that combines kernel k-means with a generalized dynamic time alignment kernel to cluster motion data into motion primitives. For the noisy motion data captured by RGBD cameras, Sung et al. [sung2012unstructured] proposed an algorithm based on a neighbor graph to obtain robust segmentation.

Another interesting problem is partitioning repetitive motion sequences into segments, each of which represents one temporal repetition of the primitive action. Because the features of each repetition are almost the same, general motion segmentation is not practical for this purpose. Fod et al. [fod2002automated] proposed a zero-velocity crossing (ZVC) detection algorithm based on joint angle velocities to partition the motion data of repetitive arm exercises into individual repetitions. Lu and Ferrier [lu2004repetitive] introduced a multi-dimensional segmentation algorithm to automatically decompose a complex motion into a sequence of simple linear dynamical models. Recently, Wang et al. [wang2015unsupervised] proposed a fully unsupervised algorithm based on most informative joint selection and adaptive k-means clustering.
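
The zero-velocity crossing idea can be illustrated with a short sketch that marks the samples where the angular velocity of one joint changes sign. It assumes a single joint angle series, finite-difference velocities, and an optional `min_speed` threshold for suppressing jitter; these names and the threshold are assumptions for illustration, not details of the cited algorithm.

```python
def zero_velocity_crossings(angles, dt=1.0, min_speed=0.0):
    """Return sample indices where the joint angular velocity changes sign."""
    velocities = [(angles[i + 1] - angles[i]) / dt for i in range(len(angles) - 1)]
    crossings = []
    previous_sign = 0
    for i, v in enumerate(velocities):
        if abs(v) <= min_speed:
            continue  # skip pauses and small jitter around zero velocity
        sign = 1 if v > 0 else -1
        if previous_sign != 0 and sign != previous_sign:
            # The turning point lies at sample i (or just before it if
            # low-speed samples were skipped).
            crossings.append(i)
        previous_sign = sign
    return crossings


# Hypothetical elbow angle during two repetitions of an arm raise.
elbow = [0, 1, 2, 1, 0, 1, 2, 1, 0]
turning_points = zero_velocity_crossings(elbow)
```

In this toy series the crossings alternate between peaks and valleys, so every second crossing delimits one full repetition.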

IV Human Action Recognition

Action recognition is one of the most active research areas in computer vision. Based on the input data format, action recognition approaches can be classified into two classes: video-based approaches and skeleton-based approaches. Despite the different input data formats, the two types of approaches share a similar framework, which usually includes two modules: feature generation and classifier training. In this section, we briefly sketch the recent progress in skeleton-based motion recognition. A more detailed overview can be found in [poppe2010survey].

IV-A Feature Generation

Compared to the video-based approaches, the skeleton-based approaches can naturally access the skeletal joint trajectories. The basic solution extracts statistical features of the joint trajectories and feeds them to a classical classifier, such as nearest neighbour (NN) or support vector machine (SVM). Most algorithms in this class use either joint positions or joint angles to generate the features. In [gowayyed2013histogram], a histogram records the displacements of joint orientations over the whole trajectory. In [ohn2013joint], pairwise affinities of joint angle trajectories were proposed to represent the motion. However, these approaches are sensitive to noise in the motion data. Observing that not all joints participate in a motion, Ofli et al. [ofli2014sequence] proposed a novel feature representation based on the selection of the most informative joints. The experimental results demonstrate that this algorithm improves recognition performance by reducing the noise from unrelated joints, illustrating the “less is more” principle in motion recognition. In [vemulapalli2014human], human skeletons are transformed into the rotations and translations between different body segments, which form a Lie group. Consequently, human motion is represented as a curve on the resulting manifold for further classification. This novel representation outperforms most of the existing approaches based on skeletal representation but suffers from the high complexity of the transformation. Taking advantage of deep learning, researchers fed the skeletal joint trajectories into a convolutional neural network (CNN) to learn the features automatically. The features learned by the CNN mostly outperform manually designed features. To further exploit temporal features, a hierarchical recurrent neural network (HRNN) [du2015hierarchical] was proposed with high computational efficiency.
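
As a rough illustration of selecting informative joints, the sketch below ranks joints by the variance of their joint angle series and keeps the top few. Using variance as the informativeness measure, as well as the function name `most_informative_joints` and the joint names, are assumptions made for this example and may differ from the measure used in the cited work.

```python
from statistics import pvariance


def most_informative_joints(joint_angles, top_k=2):
    """Rank joints by the variance of their angle series, highest first.

    joint_angles: dict mapping a joint name to its angle time series.
    """
    ranked = sorted(
        joint_angles,
        key=lambda name: pvariance(joint_angles[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical arm exercise: the elbow moves, the knee jitters, the hip is still.
recording = {
    "hip": [0.5, 0.5, 0.5, 0.5, 0.5],
    "knee": [0.0, 0.1, 0.0, 0.1, 0.0],
    "elbow": [0.0, 1.0, 2.0, 1.0, 0.0],
}
selected = most_informative_joints(recording, top_k=2)
```

Features would then be computed only from the selected joints, so that noise on joints that barely move does not reach the classifier.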

IV-B Probabilistic Graphical Model in Motion Recognition

In the basic motion recognition solution, the classical classifiers do not take advantage of temporal structural information, which makes them impractical for complex action recognition. Probabilistic graphical models are therefore adopted to exploit structural features for human action recognition. They can be divided into two categories: generative models and discriminative models. The hidden Markov model (HMM) is a typical generative model with three assumptions: 1) each action stage is associated with a hidden state, so the action is a state transition chain; 2) the current state is conditioned only on the most recent state; 3) the current observation depends only on the current state. An HMM uses training data to model the state transition and observation probabilities. In testing, the HMM infers the states of a motion sequence and classifies it based on the state sequence. Sung et al. [sung2011human] applied a hierarchical maximum entropy Markov model to action recognition. However, the independence assumption in the HMM, which assumes that the observations are temporally independent, often does not hold.
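
State inference under the three HMM assumptions listed above can be sketched with the Viterbi algorithm for a discrete HMM. The two-stage "raise, then lower" model, its probabilities, and the function name `viterbi` are hypothetical values chosen for illustration; in practice the probabilities would be estimated from training data.

```python
import math


def viterbi(observations, start_p, trans_p, emit_p):
    """Most likely hidden state sequence of a discrete HMM.

    start_p[s]: probability of starting in state s.
    trans_p[s][t]: probability of moving from state s to state t.
    emit_p[s][o]: probability of observing symbol o in state s.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    states = range(len(start_p))
    scores = [log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states]
    back_pointers = []
    for obs in observations[1:]:
        new_scores, pointers = [], []
        for t in states:
            # Markov assumption: only the previous state matters.
            best = max(states, key=lambda s: scores[s] + log(trans_p[s][t]))
            pointers.append(best)
            # The observation depends only on the current state t.
            new_scores.append(scores[best] + log(trans_p[best][t]) + log(emit_p[t][obs]))
        scores = new_scores
        back_pointers.append(pointers)
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# States: 0 = raising the arm, 1 = lowering the arm (left-to-right chain).
# Observations: 0 = wrist moving up, 1 = wrist moving down.
start_p = [0.9, 0.1]
trans_p = [[0.7, 0.3], [0.0, 1.0]]
emit_p = [[0.9, 0.1], [0.2, 0.8]]
stages = viterbi([0, 0, 1, 1], start_p, trans_p, emit_p)
```

For classification, one common arrangement is to train one such model per action class and pick the class whose model explains the sequence best.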

As opposed to the generative model, the discriminative model infers the posterior probabilities of the latent action labels given the observations. Therefore, the discriminative model is trained to distinguish the classes rather than to learn the parameters of a generative model. The discriminative model is more effective than the generative model when the actions are similar and the training dataset is large. The conditional random field (CRF) is a typical discriminative model that is widely used in human action recognition [wang2009max]. Zhang and Gong [zhang2010action] proposed a hidden CRF (HCRF) which uses a single state to label the whole sequence. The HMM pathing introduced in this algorithm can obtain globally optimized parameters of the learned HCRF. Hu et al. [hu2014learning] extended the CRF model with an augmented hidden layer which represents the subtypes of the activities. This additional layer helps to further distinguish semantic differences, such as actions with similar motions but different targets. Experimental results demonstrate that this model is more efficient in inference than other existing CRF-based approaches.

V Learning Dynamics From Motion

Besides learning what people do through action recognition, learning how well people perform an action (e.g., dynamics, stability, flexibility, endurance, etc.) is also an active research area [ofli2016design]. To the best of our knowledge, non-interventional dynamics measuring is still an open challenge. Since motion and dynamics are highly intertwined in the physical world, extracting dynamics from motion provides a promising solution to non-interventional dynamics measuring.

Based on the HARB model, Brubaker et al. [brubaker2009estimating] proposed a physical model based on the Newtonian equations of motion to estimate contact dynamics. In this model, the dynamics of each rigid body part are modelled by the Newtonian equations. By applying the principle of virtual work and some assumptions about the contact friction force, they proposed a TMT model that associates the joint forces and torques with the body mass, inertia, and the motion data. The joint forces and torques are then solved by L2-norm optimization. Experimental results verified that the estimated torque curves are smooth with a consistent standard deviation. Besides, this approach was extended to track humans in motion [brubaker2008kneed]. The results also verify the effectiveness of dynamics in motion tracking. In [zhang2014leveraging], Zhang et al. leveraged wearable pressure sensors to estimate the contact force and extracted the dynamics based on the same principle. They concluded that the dynamics enhanced the visual realism of human motion animation with object interaction. Agarwal et al. [agarwal2014estimating] also applied the same model to human motion tracking in video and achieved robust performance compared to existing tracking algorithms without dynamics.
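
To show how joint torques follow from motion data in the simplest possible case, the sketch below applies the rotational Newton-Euler equation to a single rigid segment rotating about a fixed joint in a vertical plane. It is far simpler than the TMT formulation described above: there are no contact forces and no coupling between segments, and the function name `joint_torques`, the segment parameters, and the finite-difference acceleration are all assumptions made for illustration.

```python
import math


def joint_torques(angles, dt, mass, com_distance, inertia_about_joint, gravity=9.81):
    """Inverse dynamics of one rigid segment rotating about a fixed joint.

    angles: joint angle series in radians, measured from the horizontal.
    Returns the torque at each interior sample using
        torque = I * angular_acceleration + m * g * l_c * cos(angle).
    """
    torques = []
    for i in range(1, len(angles) - 1):
        # Central finite difference for the angular acceleration.
        accel = (angles[i + 1] - 2.0 * angles[i] + angles[i - 1]) / dt ** 2
        gravity_torque = mass * gravity * com_distance * math.cos(angles[i])
        torques.append(inertia_about_joint * accel + gravity_torque)
    return torques


# Hypothetical forearm held still in a horizontal and in a vertical pose.
hold_horizontal = joint_torques([0.0, 0.0, 0.0], dt=0.01, mass=2.0,
                                com_distance=0.15, inertia_about_joint=0.06)
hold_vertical = joint_torques([math.pi / 2] * 3, dt=0.01, mass=2.0,
                              com_distance=0.15, inertia_about_joint=0.06)
```

Holding the segment horizontally requires the full gravity torque m g l_c, while the vertical pose requires almost none, which matches intuition. Because the acceleration is a second difference of the angles, noisy motion data should be filtered first.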

VI Applications and Discussions

Over the past decades, the applications of human motion analysis have grown fast in areas including but not limited to healthcare [rao2016anterior], security, virtual reality [wang2012free], animation [cao20123d], and human social interaction [ji2014online]. In the first part of this section, we briefly introduce some typical applications of human motion analysis in remote healthcare and human-robot interaction. Next, we point out future research topics of human motion analysis and its emerging applications.

In smart healthcare for the elderly and patients with physical problems, remotely monitoring physical performance can reduce both the cost and the risk of the physical training process. In this process, doctors assign physical training exercises to the subjects and evaluate their performance based on human motion analysis. To facilitate this process, Ofli et al. [ofli2016design] proposed a remote health coaching system which records human motion with an RGBD camera [wang2010region] and performs online motion performance analysis. The subjects receive instant feedback from the system, followed by professional feedback from the doctor. Although this kind of system can provide a solution for remote smart healthcare, there is still a long way to go before achieving unsupervised analysis of human motion for medical purposes [wang2015remote]. Motion data accuracy and communication delay [zhai2014content] would be the main obstacles to the adoption of such systems in medical applications.

Another application scenario of human motion analysis is human-robot interaction, which is desired in both daily life and healthcare. Building robots to serve humans is one of the main objectives of the robotics industry. Understanding human language, actions, and emotions can improve the intelligence of robots and make them serve people better. Koppula and Saxena [koppula2016anticipating] proposed a human motion prediction algorithm to predict the goal and the trajectories of human actions. By learning these, robots can help humans accomplish jobs that are dangerous or arduous. However, due to its high complexity, learning the way humans handle objects is still a challenging problem.

Besides healthcare and human-robot interaction, human motion analysis has also been widely applied to produce highly realistic human animation, enhance the immersion of virtual reality [wang2012complexity] and 3D video [wang2011reduced][wang2013complexity], recognize harmful actions to improve security, and control objects in remote surgery. Generally speaking, human motion analysis can be adopted in all applications in which humans participate.

Although extensive efforts have addressed the problems of motion recognition and reconstruction, there are still many unsolved problems in human motion analysis. First of all, capturing motion data with high accuracy using cost-efficient devices is still a challenging problem. Although the existing RGBD cameras provide a balanced solution, their lack of consistency and low frame rate still call for new solutions that come with improvements in sensors or algorithms [chen2008wyner]. Human motion understanding is another hot research topic beyond action recognition. Human motion contains much high-level semantic information, including purpose, emotion, interaction, etc. Extracting this semantic information will be very useful in applications such as human-robot interaction, security surveillance, and sociology. Some of the latest artificial intelligence tools, like CNNs, would play an important role in this area. Measuring human dynamics and applying it in healthcare is still an open research area. Improving the accuracy of dynamics measuring and building a comprehensive model of human physical functionality with respect to the measured dynamics would be the most desired advances in this area.

Motion is one of the basic channels through which humans interact with the world. Human motion conveys various information related to both physiology and psychology. Learning human motion can help improve healthcare, social functionality, and security. Without doubt, the study of human motion analysis will attract much attention and find plenty of applications in the future.