Sit-to-Stand Analysis in the Wild using Silhouettes for Longitudinal Health Monitoring

by Alessandro Masullo et al.
University of Bristol

We present the first fully automated Sit-to-Stand and Stand-to-Sit (StS) analysis framework for the long-term monitoring of patients in free-living environments using video silhouettes. Our method adopts a coarse-to-fine time localisation approach, in which a deep learning classifier identifies possible StS sequences from silhouettes and a smart peak detection stage provides fine localisation based on 3D bounding boxes. We tested our method on data from the real homes of participants, including patients monitored while undergoing total hip or knee replacement. Our results show a 94.4% overall accuracy and an error of 0.026 m/s in the speed-of-ascent measurement, highlighting important trends in the recuperation of patients who underwent surgery.




1 Introduction

Novel concepts and technologies like the Internet of Things (IoT) for Ambient Assisted Living (AAL) or specific health monitoring enable people to live independently, to be aided in their recuperation, and to improve their quality of life. Such systems often include multiple sensors and monitoring devices, producing large amounts of data that need to be analysed and summarised into a few clinically relevant parameters [22]. The transition from a sitting position to a standing one (StS; in this work, StS denotes both 'Sit-to-Stand' and 'Stand-to-Sit', and we specify which of the two where necessary) is one of the most essential movements in daily activities [6], especially for older patients suffering from musculoskeletal illnesses. StS performance has been linked to recurrent falls [4], sedentary behaviour [7] and fall histories [20]. Continuous monitoring of the StS action over a long period of time can therefore highlight important trends, particularly for subjects undergoing physical rehabilitation.

To the best of our knowledge, the automatic analysis of StS has not been attempted for long-term monitoring and trend analysis. Some previous works have focused on automating the Sit-to-Stand clinical test, performed under supervised conditions and often in the presence of a clinician, e.g. [3]. Shia et al. [16] suggested modelling the physics of the human body during stand-up transitions by using a motion capture suit. Their method was tested in the lab on 10 healthy individuals, but this approach is clearly impractical for long-term monitoring. In [9], Galna et al. investigated the suitability of skeleton data extracted by the Kinect sensor for assessing clinically relevant movements, showing that StS timing can be captured with errors comparable to the VICON motion capture system. Their method was applied in the lab on 9 individuals with Parkinson's Disease and 10 control subjects. Skeleton data was also used in [8] to estimate StS timing from the vertical displacement of the head joint and a manual threshold. That method was tested in the laboratory on 94 subjects and in participants' own homes on 20 individuals.

The detection of StS transitions can be seen as an action classification problem, and a large body of research has investigated the application of deep convolutional neural networks (CNNs) to this task, e.g. [17, 5]. However, while these works achieve high accuracy in action classification, they rely on RGB or depth data, which is not compatible with the privacy requirements of home monitoring systems [23, 2]. As addressed in [13], silhouettes constitute a valid alternative form of data that allows action recognition to be performed whilst respecting privacy requirements.

The aim of this work is to propose a novel approach to continuously monitor StS transitions in the wild and, while addressing privacy issues, to generate automatic trend analyses. For each StS transition, we measure the speed of ascent/descent as an indicator of physical function. We installed RGBD cameras (PrimeSense) in participants' own houses and recorded silhouette video data from 9 subjects in 4 different homes, for a minimum period of 4 months and up to 1 year, under the auspices of the SPHERE and HEmiSPHERE projects [10, 22]. Two of the participants, aged between 65 and 90, underwent total hip or knee replacement, and we monitored them before and after their intervention. The remaining 7 participants, aged between 40 and 60, did not report any particular health condition that could affect their mobility. We show that our method can identify StS transitions in the wild with 94.4% overall accuracy and that our measurement of the speed of ascent is comparable with the VICON motion capture gold standard in a supervised setting. Moreover, our analyses highlight important trends linked to the rehabilitation process, potentially allowing surgeons to follow the progress of their patients remotely and anticipate possible complications.

2 Methodology

Monitoring people in their homes poses stringent ethical restrictions on the type of data that can be recorded, analysed and shared, e.g. prohibiting the use of RGB data [14, 21]. To provide a privacy-compatible monitoring system (informed by a user study [22]), we generate silhouettes and 3D bounding boxes from the RGBD data and discard the raw pixel values immediately thereafter. We deployed one camera in each house (in the living room) and set it up at a similar height in each case, so as to obtain comparable fields of view.

Figure 1: Network architecture of the proposed method.

Our proposed pipeline can be divided into three steps: pre-processing of the videos, classification, and StS measurement. First, the incoming silhouettes are cropped at the detected bounding boxes and resized, producing one video per individual. These videos are subdivided into short clips of 10 seconds each (the frame-rate of the silhouette recorder varied with conditions, producing 10 fps on average), which are then classified with a deep CNN (detailed in Section 2.1) into one of three categories: "Sit-to-Stand", "Stand-to-Sit" or "Other". Only the StS video clips are then further analysed to measure the speed of ascent/descent using the 3D bounding boxes, as described in Section 2.2.
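The pre-processing step can be sketched as below. This is a minimal illustration in NumPy, not the authors' implementation: the bounding-box tuple layout, the nearest-neighbour resize, and the function names are our own assumptions.

```python
import numpy as np

def crop_and_resize(frame, bbox, out_size=100):
    """Crop a binary silhouette frame at its 2D bounding box and
    resize to out_size x out_size. bbox = (x0, y0, x1, y1) is an
    assumed layout; nearest-neighbour sampling avoids external
    image libraries."""
    x0, y0, x1, y1 = bbox
    crop = frame[y0:y1, x0:x1]
    rows = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    return crop[np.ix_(rows, cols)]

def make_clips(frames, clip_len=100):
    """Split a per-person frame sequence into fixed-length clips
    (~10 s at the average 10 fps); the trailing remainder is dropped."""
    n = len(frames) // clip_len
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n)]
```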

Contrary to previous works that have focused on StS duration [3, 16], our method measures the speed of ascent/descent, defined as the maximal transferring velocity of the centre of gravity (CG) between the start and the completion of the StS movement [15]. The speed of ascent/descent does not depend on a specific beginning or end of the movement, but rather on the maximum velocity. Thanks to this property, the speed of ascent/descent shows no significant difference between the Sit-to-Stand and the Sit-to-Walk movements [12], or the Stand-to-Sit and the Walk-to-Sit movements, making it a more suitable measurement for free-living monitoring.

2.1 Classification

Inspired by the work of Carreira et al. [5], we built our classifier network using Inception modules with 3D convolutions, as presented in Figure 1. It was shown in our previous work [13] that using very deep networks on silhouette data increases the computational cost without bringing any advantage. We therefore adopted a shallow architecture composed of 4 stacks of Inception modules, followed by a Long Short-Term Memory (LSTM) layer located between the last convolutional layer and the final fully connected layer. In our experiments, we found that the use of an LSTM module in addition to the 3D convolutions produced the best classification accuracy.

The video sequences recorded from the participants' homes contained highly varied data, with video clips of StS transitions constituting less than 1% of the whole dataset. To tackle this class imbalance problem [11], we under-sampled the "Other" class to match the size of the minority classes "Sit-to-Stand" and "Stand-to-Sit", drawing new random elements for each epoch. This ensured balanced training while preventing the potential loss of useful data from the "Other" class.
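The per-epoch under-sampling can be sketched as follows; the class names, the data layout, and the choice of matching the larger of the two minority classes are illustrative assumptions, not the authors' code.

```python
import random

def balanced_epoch(indices_by_class, rng=random):
    """Build one training epoch: keep every sample of the minority
    StS classes and draw a fresh random subset of the majority
    'Other' class of matching size, so that each epoch sees a
    different set of negatives (assumption: 'matching size' here
    means the size of the larger minority class)."""
    minority = [i for c in ("sit_to_stand", "stand_to_sit")
                for i in indices_by_class[c]]
    n = max(len(indices_by_class["sit_to_stand"]),
            len(indices_by_class["stand_to_sit"]))
    negatives = rng.sample(indices_by_class["other"], n)
    epoch = minority + negatives
    rng.shuffle(epoch)
    return epoch
```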

2.2 Speed of Ascent Measurement

The 10-second clip classifier provides a coarse time localisation of the StS transitions. To narrow down the exact frame of the transition and measure the speed of ascent (although we refer here to the speed of ascent, the methodology applies identically to the speed of descent by simply flipping the sign in Eq. 1), we employ data from the 3D bounding boxes, in particular the evolution in time of the upper edge. Let B(t) = [x_1, y_1, z_1, x_2, y_2, z_2](t) be the 3D bounding box over the time interval of a clip, where the indices 1 and 2 respectively represent the 'right', 'top', 'front' and the 'left', 'bottom', 'back' vertices of the 3D box. Calling y_1(t) the vertical component of the top vertex, the vertical speed of the subject can then be estimated as

    v(t) = ± dy_1(t)/dt,    (1)

where the sign is + for the "Sit-to-Stand" and − for the "Stand-to-Sit" class. Using the definition of the speed of ascent as the maximum vertical velocity during the StS movement, we can then compute the speed of ascent as

    s = max_t v(t).    (2)
It is important to note that the computation of Eq. (2) is only performed on those clips classified earlier as StS. Indeed, its simplicity relies on the accuracy of the classifier, which filters out all the other possible movements that contain a vertical motion but are not StS transitions. A visualisation of this computation can be seen in Figure 2, showing a strong correlation between the vertical speed of the bounding box and the Sit-to-Stand action.

Figure 2: Example computation of the speed of ascent: (top) video frames of a Sit-to-Stand sequence from the SPHERE data, colour coded with intensity of the vertical derivative; (bottom) 3D bounding box vertical coordinate and derivative. The maximum intensity of the vertical speed corresponds to the speed of ascent.

In order to reduce the noise of the 3D bounding boxes, we adopted a Savitzky-Golay (savgol) filter as implemented in SciPy. The advantage of the savgol filter is that it replaces each data point with the least-squares polynomial fit of its neighbours, allowing both noise reduction and a simple analytical derivative of the polynomial. We used a kernel window of 11 points and a 3rd-order polynomial. The vertical velocity can then be computed as the ratio between the derivative of the filtered y_1(t) and the derivative of the filtered time vector:

    v(t) = dŷ_1/dt̂,    (3)

where ŷ_1 and t̂ denote the savgol-filtered vertical coordinate and time vector.
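The speed-of-ascent computation described above (savgol-filtered derivatives of the top-edge coordinate and of the time vector, then the maximum of their ratio) can be sketched as follows, using SciPy's `savgol_filter` with the stated window of 11 points and 3rd-order polynomial; the variable names are our own.

```python
import numpy as np
from scipy.signal import savgol_filter

def speed_of_ascent(y_top, t, window=11, order=3):
    """Estimate the speed of ascent from the vertical coordinate of
    the 3D bounding box's top edge. Both the position and the time
    vector are filtered with a Savitzky-Golay filter; the vertical
    velocity is the ratio of their first derivatives, and the speed
    of ascent is its maximum over the clip."""
    dy = savgol_filter(y_top, window, order, deriv=1)  # per-sample derivative
    dt = savgol_filter(t, window, order, deriv=1)
    v = dy / dt                                        # m/s
    return v.max()
```

For a Sit-to-Stand clip, flipping the sign of `v` before taking the maximum yields the speed of descent instead.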
3 Experiments

The architecture was built with 4 Inception modules [18], each composed of a sequence of (1) 3D convolution, (2) batch normalisation and (3) ReLU activation, using respectively 16, 32, 64 and 128 filters. The last layer produces a set of convolutional features which, once reshaped, are 512-dimensional over 25 pseudo-time steps. The resulting features are fed into an LSTM module with 128 units, whose output is then fed into a fully connected layer with softmax activation. The input comprises video clips of 100 frames, each 100 by 100 pixels, while the output is a 3-by-1 classification vector.
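A minimal sketch of such an architecture in PyTorch is given below. The Inception-style block is heavily simplified (two branches only), and the pooling and reshape choices are our own assumptions made to reproduce the 25-step, 512-dimensional feature sequence described above; this is not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MiniInception3D(nn.Module):
    """Simplified Inception-style 3D block: parallel 1x1x1 and 3x3x3
    convolution branches with batch norm and ReLU, concatenated and
    spatially pooled (branch layout is illustrative)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv3d(c_in, c_out // 2, kernel_size=1),
                                nn.BatchNorm3d(c_out // 2), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv3d(c_in, c_out // 2, kernel_size=3, padding=1),
                                nn.BatchNorm3d(c_out // 2), nn.ReLU())
        self.pool = nn.MaxPool3d((1, 2, 2))  # halve space, keep time

    def forward(self, x):
        return self.pool(torch.cat([self.b1(x), self.b2(x)], dim=1))

class StSClassifier(nn.Module):
    """Shallow 3D-CNN + LSTM classifier over silhouette clips:
    four Inception-style stacks (16/32/64/128 filters), a reshape to
    25 pseudo-time steps of 512 features, a 128-unit LSTM, and a
    softmax head over the 3 classes."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.blocks = nn.Sequential(MiniInception3D(1, 16), MiniInception3D(16, 32),
                                    MiniInception3D(32, 64), MiniInception3D(64, 128))
        self.squeeze = nn.AdaptiveAvgPool3d((25, 2, 2))  # -> 128 x 25 x 2 x 2
        self.lstm = nn.LSTM(512, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, clips):                 # clips: (B, 1, T, H, W)
        f = self.squeeze(self.blocks(clips))  # (B, 128, 25, 2, 2)
        f = f.permute(0, 2, 1, 3, 4).reshape(f.size(0), 25, 512)
        out, _ = self.lstm(f)
        return torch.softmax(self.head(out[:, -1]), dim=1)
```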

We demonstrate the validity of our algorithm by assessing the StS video classifier and the speed of ascent/descent computation independently on two different datasets.

3.1 Physical Rehabilitation Movements Data Set

The UI-PRMD dataset includes skeleton data from typical exercises and movements which are performed by patients during therapy and rehabilitation programs [19]. It consists of 10 healthy subjects, performing 10 different movements 10 times each, and recorded simultaneously using a Kinect and a VICON (gold standard) motion-capture system.

For our work in particular, we extracted the Sit-to-Stand movement from the dataset and used the VICON motion capture data to validate our proposed approach. We generated 3D bounding boxes using the extent of the Kinect skeleton joints and compared the speed of ascent with the one computed using the centre of gravity (CG) from the VICON data (the CG was estimated as the average of the Left and Right Anterior and Posterior Superior Iliac skeletal joints).

The curves in Figure 5a show a comparison of the true speed of ascent, computed using the VICON CG (blue curve), and our estimation using the Kinect head joint (orange). In both cases, the vertical derivative was obtained for all the available StS transitions and averaged to highlight possible discrepancies, while time was normalised using the beginning and the end of the StS transition. The two curves exhibit a very similar pattern, with a maximum value (i.e. the speed of ascent) differing by 23.3%. This amplification of the maximum vertical speed results in a bias error in the speed of ascent of about 0.026 m/s, or 28.3% of the average measurement. In spite of this bias, the correlation between our estimated speed of ascent and the ground truth exceeds 92.8%, as shown in Figure 5b. While the bias could be mitigated by appropriate calibration, the aim of this work is to investigate trends in the speed of ascent/descent, and the high correlation between our measurement and the ground truth is more than sufficient for this application.

Figure 5: Comparison of speed of ascent computed with our algorithm using the Kinect data and the VICON system

3.2 SPHERE data

The SPHERE project (Sensor Platform for Healthcare in a Residential Environment) [22] developed a multi-modal sensing platform aimed at recording data for healthcare monitoring from up to 100 houses in the Bristol (UK) area. Each house was equipped with a variety of sensors, including RGBD cameras, which were used to generate human silhouettes and 2D/3D bounding boxes via the OpenNI API [1] in different communal spaces: living room, kitchen and hall. The HEmiSPHERE (Hip and knEe study of SPHERE) project [10] is a UK National Health Service application of SPHERE sensors within the homes of patients undergoing total hip or knee replacement.

In this work, we present data collected from the living rooms of 4 different houses, described in Table 1: two belonging to the HEmiSPHERE cohort and two to the SPHERE one. This subset includes a total of 1,177,082 video clips, of which 5,645 are StS transitions and the rest belong to the "Other" class. The videos were manually labelled by the authors using the MuViLab annotator tool (available on GitHub) and were used for cross-validation as per Table 2. The discrepancy between the number of Sit-to-Stand and Stand-to-Sit transitions can be explained by the type of silhouette detector adopted (OpenNI), which was optimised for standing poses. This increases the chances of detecting a person while walking and then sitting down, and hence the number of Stand-to-Sit transitions recorded.

Id       Duration  Occup.  #Other  #Sit-to-Stand  #Stand-to-Sit
House A  4 months  2       107404  339            491
House B  3 months  2       266853  1289           2051
House C  9 months  4       416628  297            1054
House D  6 months  1       380552  54             70
Table 1: Description of the data from the 4 houses: the HEmiSPHERE cohort (top two rows) and the SPHERE cohort (bottom two rows).
Figure 9: Confusion matrices for each validation fold: (a) Fold 1, (b) Fold 2, (c) Fold 3.

3.3 Classification

Data from homes A, B and C was used to train and validate the network (described in Section 2.1) using a cross-validation strategy, as depicted in Table 2. Data from House D was left out of this procedure and was only used to generate the trend plot. Results are presented in Table 2 and show an overall accuracy of 94.8%, 95.0% and 93.5% for the three validation folds, computed by averaging the accuracy of the three classes. The average accuracy across the three folds is 94.4%. Details of the classification results are presented in Figure 9, showing the confusion matrices for each validation fold.
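The overall accuracy figures above are unweighted means of the per-class accuracies, which can be computed from a confusion matrix as in the generic sketch below (not tied to the paper's data).

```python
import numpy as np

def per_class_accuracy(confusion):
    """Per-class accuracy from a confusion matrix (true classes on
    rows, predictions on columns): the diagonal divided by the row
    sums. The overall score is their unweighted mean, matching the
    paper's 'averaging the accuracy of the three classes'."""
    confusion = np.asarray(confusion, dtype=float)
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    return per_class, per_class.mean()
```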

Particular attention must be paid to the false positives. The proportion of "Other" videos mis-classified as StS was 1.63%, producing 28,119 false positives against the 6,548 correctly identified StS transitions. While these values might appear to undermine our scores, a manual inspection of the false positives revealed that many of the mis-classified videos are, indeed, visually similar to StS transitions. These included subjects interacting with the environment for long periods of time while standing up, rising from the floor, or kneeling while doing exercises or housekeeping chores. Although these movements are not strictly StS transitions, they still involve a vertical motion that requires physical effort. As we show in the next section, although the presence of these false detections increases the uncertainty of our measurements, it does not hamper the calculation of the trend plots.

Fold Train Validate Stand-to-Sit Sit-to-Stand Other Overall
1 House C, B House A 97.2% 91.2% 96.0% 94.8%
2 House C, A House B 95.2% 93.2% 96.5% 95.0%
3 House A, B House C 96.7% 86.0% 97.9% 93.5%
Average 94.4%
Table 2: Cross-validation accuracy results, with overall average accuracy.

3.4 Trend plots

Following the classification, the speed of ascent/descent was computed for all the video clips detected as StS transitions and averaged per week. The resulting trend plot, for Fold 2 as an example, is presented for the manually labelled video clips (Manual trend) in Figure 12a and for the automatic labels (Automatic trend) in Figure 12b. The reader is reminded that one of the occupants of this house underwent a total hip or knee replacement; the day of surgery is marked with a solid black line. Before surgery, the speed of ascent lies between 0.35 and 0.45 m/s, followed by a sudden drop soon after the operation. This is due to the pain and discomfort following surgery, which impair the physical ability of the patient and hence their speed of ascent. In the following weeks, the speed of ascent shows a slow but steady increase, with a slope of around 0.04 m/s per month. Finally, 14 weeks after the surgery, the speed of ascent reaches a value just shy of 0.5 m/s, indicating a full recovery. The presence of the trend is also corroborated by a high coefficient of determination.
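The weekly averaging and the slope estimate (expressed in m/s per month) can be sketched as follows; the day/speed arrays and function names are hypothetical, not the authors' pipeline.

```python
import numpy as np

def weekly_trend(day_of_study, speeds):
    """Average per-transition speed-of-ascent measurements by week
    of the study; returns the week indices and the weekly means."""
    weeks = np.asarray(day_of_study) // 7
    speeds = np.asarray(speeds, dtype=float)
    uniq = np.unique(weeks)
    means = np.array([speeds[weeks == w].mean() for w in uniq])
    return uniq, means

def monthly_slope(weeks, means):
    """Linear trend of the weekly means, converted from per-week to
    per-month (average month of 365.25/12 days = ~4.35 weeks)."""
    slope_per_week = np.polyfit(weeks, means, 1)[0]
    return slope_per_week * (365.25 / 12 / 7)
```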

The comparison between the Manual trend and the Automatic trend from Figure 12 shows a very similar pattern, with a correlation coefficient between the two plots of 0.88. In spite of the higher error bars, due to false positives, the main characteristic aspects of the plot are preserved, including the drop in the speed of ascent following the surgery and the full recovery after 14 weeks.

Figure 12: Comparison of the speed of ascent trend for Fold 2, extracted from (a) the manually labelled StS transitions and (b) the video clips automatically labelled as StS. The correlation between the plots is 0.88.

For comparison, Figure 15 presents the Automatic trends generated for Houses C and D, which were occupied by healthy participants. As expected, no particular trend can be observed for these houses, as confirmed by the low coefficients of determination of -0.21 and -0.45, respectively.

Figure 15: Comparison of the speed of ascent trends for (a) House C and (b) House D from the SPHERE cohort.

Although the trend plots presented in this section refer only to the speed of ascent (i.e. Sit-to-Stand), the trend plot computed using the speed of descent (i.e. Stand-to-Sit) showed a very similar behaviour and was omitted from this paper for brevity.

4 Conclusions

The demand for AAL technologies for home monitoring is continuously increasing. We presented a simple and efficient approach for the detection and analysis of StS transitions for home monitoring in completely unsupervised environments. We implemented and tested our method in 4 different houses, 2 of which were occupied by patients undergoing total hip or knee replacement. We showed that we can reliably identify StS transitions in video clips of binary silhouettes and confidently measure the speed of ascent for each transition, as an indicator of improving or deteriorating physical function. Plots of the average speed of ascent estimated by our method highlight important trends in the recovery process of the surgery patients.

5 Acknowledgements

This work was performed under the SPHERE IRC funded by the UK Engineering and Physical Sciences Research Council (EPSRC), Grant EP/K031910/1. The authors wish to thank all the study subjects for their participation in this project and Rachael Gooberman-Hill, Andrew Judge, Ian Craddock, Ashley Blom, Michael Whitehouse and Sabrina Grant for their support with the HEmiSPHERE project. The HEmiSPHERE project was approved by the Research Ethics Committee (reference number: 17/SW/0121).


  • [1] OpenNI.
  • [2] Giles Birchley, Richard Huxtable, Madeleine Murtagh, Ruud Ter Meulen, Peter Flach, and Rachael Gooberman-Hill. Smart homes, private homes? An empirical study of technology researchers’ perceptions of ethical issues in developing smart-home health technologies. BMC Medical Ethics, 18(1):1–13, 2017.
  • [3] Richard W Bohannon. Sit-to-Stand Test for Measuring Performance of Lower Extremity Muscles. Perceptual and Motor Skills, 80(1):163–166, 1995.
  • [4] Severine Buatois, Darko Miljkovic, Patrick Manckoundia, Rene Gueguen, Patrick Miget, Guy Vançon, Philippe Perrin, and Athanase Benetos. Five times sit to stand test is a predictor of recurrent falls in healthy community-living subjects aged 65 and older. Journal of the American Geriatrics Society, 56(8):1575–1577, 2008.
  • [5] João Carreira and Andrew Zisserman. Quo Vadis, action recognition? A new model and the kinetics dataset. CVPR, pages 4724–4733, 2017.
  • [6] Yuan Yang Cheng, Shun Hwa Wei, Po Yin Chen, Mei Wun Tsai, I. Chung Cheng, Ding Hao Liu, and Chung Lan Kao. Can sit-to-stand lower limb muscle power predict fall status? Gait and Posture, 40(3):403–407, 2014.
  • [7] Philippa M. Dall and Andrew Kerr. Frequency of the sit to stand task: An observational study of free-living adults. Applied Ergonomics, 41(1):58–61, 2010.
  • [8] Andreas Ejupi, Matthew Brodie, Yves J Gschwind, Stephen R Lord, Wolfgang L Zagler, and Kim Delbaere. Kinect-Based Five-Times-Sit-to-Stand Test for Clinical and In-Home Assessment of Fall Risk in Older People. Gerontology, 62(1):118–124, 2015.
  • [9] Brook Galna, Gillian Barry, Dan Jackson, Dadirayi Mhiripiri, Patrick Olivier, and Lynn Rochester. Accuracy of the Microsoft Kinect sensor for measuring movement in people with Parkinson’s disease. Gait & Posture, 39(4):1062–1068, 2014.
  • [10] Sabrina Grant, A. W. Blom, Michael R. Whitehouse, Ian Craddock, Andrew Judge, Emma L. Tonkin, and Rachael Gooberman-Hill. Using home sensing technology to assess outcome and recovery after hip and knee replacement in the UK: The HEmiSPHERE study protocol. BMJ Open, 8(7):1–11, 2018.
  • [11] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.
  • [12] Munetsugu Kouta, Koichi Shinkoda, and Naohiko Kanemura. Sit-to-Walk versus Sit-to-Stand or Gait Initiation: Biomechanical Analysis of Young Men. Journal of Physical Therapy Science, 18(2):201–206, 2006.
  • [13] Alessandro Masullo, Tilo Burghardt, Dima Damen, Sion Hannuna, Victor Ponce-Lopez, and Majid Mirmehdi. CaloriNet: From silhouettes to calorie estimation in private environments. British Machine Vision Conference, pages 1–14, 2018.
  • [14] Veralia Gabriela Sánchez, Ingrid Taylor, and Pia Cecilie Bing-Jonsson. Ethics of smart house welfare technology for older adults: A systematic literature review. International Journal of Technology Assessment in Health Care, 33(06):691–699, 2017.
  • [15] Philip K. Schot, Kathleen M. Knutzen, Susan M. Poole, and Leigh A. Mrotek. Sit-to-stand performance of older adults following strength training. Research Quarterly for Exercise and Sport, 74(1):1–8, 2003.
  • [16] Victor Shia and Ruzena Bajcsy. Vision-based Event Detection of the Sit-to-Stand Transition. In Int. Conf. on Wireless Mobile Communication and Healthcare. ICST, 2015.
  • [17] Karen Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems, 2014.
  • [18] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9. IEEE, 2015.
  • [19] Aleksandar Vakanski, Hyung-pil Jun, David Paul, and Russell Baker. A Data Set of Human Body Movements for Physical Rehabilitation Exercises. Data, 3(1):2, 2018.
  • [20] Takayoshi Yamada, Shinichi Demura, and Kenji Takahashi. Center of gravity transfer velocity during sit-to-stand is closely related to physical functions regarding fall experience of the elderly living in community dwelling. Health, 05(12):2097–2103, 2013.
  • [21] Wl Zagler, Paul Panek, and Marjo Rauhala. Ambient Assisted Living Systems - The Conflicts between Technology, Acceptance, Ethics and Privacy. Assisted Living Systems - Models, Architectures and Engineering Approaches, pages 1–4, 2008.
  • [22] Ni Zhu, Tom Diethe, Massimo Camplani, Lili Tao, Alison Burrows, Niall Twomey, Dritan Kaleshi, Majid Mirmehdi, Peter Flach, and Ian Craddock. Bridging e-Health and the Internet of Things: The SPHERE Project. IEEE Intelligent Systems, 30(4):39–46, 2015.
  • [23] Martina Ziefle, Carsten Röcker, and Andreas Holzinger. Medical technology in smart homes: Exploring the user’s perspective on privacy, intimacy and trust. International Computer Software and Applications Conference, pages 410–415, 2011.