Falls Prediction Based on Body Keypoints and Seq2Seq Architecture

by   Minjie Hua, et al.
CloudMinds Technologies Co. Ltd

This paper presents a novel approach for predicting fall events in advance based on human pose. First, every person in consecutive frames is detected and tracked, and their body keypoints are extracted and then normalized for later processing. Next, the observed keypoint sequence of each person is fed into a sequence-to-sequence (seq2seq) based model to predict the future keypoint sequence, which is then used by a falls classifier to judge whether the person will fall down in the future. The action prediction module and the falls classifier are trained separately and tuned jointly. The proposed model is evaluated on the Le2i dataset, which is composed of 191 videos of various normal daily activities and falls performed by actors. Comparative experiments are conducted against algorithms that use RGB information directly and against our model with the action prediction module disabled. Experimental results show that our model improves the accuracy of fall recognition by utilizing body keypoints, with the added ability of predicting falls in advance.




1 Introduction

Falls and fall-induced injuries among the elderly population are a major cause of morbidity, disability and increased health care utilization [30]. Falls prediction is therefore one of the most meaningful applications for elder care and home monitoring, and can largely decrease the risk after a fall. Since falls detection is a sub-problem of human action recognition (HAR), general HAR algorithms can be used to recognize falls as one of the action classes. As mentioned in [13], convolutional networks (ConvNets) and temporal modeling are the two major variables for action recognition. Karpathy presents a 2D ConvNet pretrained on ImageNet [6] in [11]. However, 2D ConvNets do not utilize temporal information. The widely used ConvNets in action recognition are 3D ConvNets like [31, 10], which adopt 3D convolutional kernels to extract spatiotemporal features from video data. Temporal modeling methods like Two-Stream [24] and TSN [34] use 2D convolution to extract spatial features and recurrent neural networks (RNN) to encode temporal information. In a word, both 3D ConvNets and temporal models aim to extract both spatial and temporal features from raw images.

However, the following characteristics make falls distinct from other actions:

  1. Falls are highly relevant to the status of body keypoints, i.e., the body skeleton of a fallen person is obviously different from others.

  2. Unlike smoking or handshaking, falls generally do not involve interaction with objects or other people.

  3. A fall is not a normal action but an accident, so it is expected to predict its happening in advance and alert emergency services as soon as possible.

According to characteristics 1 and 2, falls can be recognized based on human body keypoints instead of raw RGB images, which decreases the dimensionality of the features while retaining the key information. And for the third characteristic, we propose a sequence-to-sequence (seq2seq) [28] based action prediction module to predict the future keypoint sequence based on the observed keypoint sequence.

However, body keypoints extracted by mainstream algorithms like OpenPose [4] are represented as image coordinates. This representation is not robust to the body's absolute position and scale, which are irrelevant to the action. So we transform and normalize the keypoint coordinates, after which only the direction information remains in the feature vector.

Since the action prediction module already encodes temporal information, the falls classifier can focus on the spatial cues in the keypoints vector. Different from video action recognition datasets that tag each clip with a single label, we re-annotated the Le2i dataset [5] to assign a label to each frame. This operation hugely expands the amount of training data, which helps our model converge better.

The main contributions of this paper are as follows:

  • We propose an end-to-end falls prediction model, which consists of a seq2seq-based action prediction module and a falls classifier.

  • We develop a keypoints vectorization method to normalize human body keypoints.

  • We conduct experiments to compare our model with popular action recognition networks as well as our own model with action prediction module disabled.

The rest of the paper is organized as follows: Section 2 reviews the related work on falls detection and action recognition. The proposed network is presented in Section 3. In Section 4, the dataset and experiments are described. Finally, Section 5 gives a conclusion.

2 Related Work

2.1 Falls Detection

Many early methods for falls detection relied on an accelerometer worn by the person to be monitored [3, 22, 1, 2]. Bourke proposed a method in [3] to detect falls based on peak acceleration. Narayanan developed a distributed falls management system capable of real-time falls detection using a waist-mounted triaxial accelerometer [22]. Bianchi enhanced a previous falls detection system with an extra barometric pressure sensor in [1, 2] and found that the algorithm incorporating pressure information achieved higher accuracy. Acceleration-based algorithms are simple, but they require users to wear several sensors, and each sensor can only serve one person, which makes them expensive to deploy widely.

With the huge success of convolutional neural networks (CNN) in image recognition, Karpathy pioneered the use of CNNs for human action recognition [11]. They transferred a model pre-trained on ImageNet [6] to UCF-101 [25], a large-scale dataset containing 13320 videos and 101 classes of human actions. Later, 3D ConvNets [31, 10] and temporal modeling methods [24, 34] were developed. However, support for falls recognition depends on the dataset used for training. For example, the aforementioned UCF-101 dataset does not include falls, while the HMDB-51 dataset [16] includes 'fall on the floor' as a class.

For the specialized falls detection task, most former efforts were based on depth sensors [26, 20, 27], such as Kinect. However, due to the nature of the sensor itself, these works may not apply to outdoor scenes, which limits their application.

There are few monocular vision-based algorithms specially optimized for falls detection. Quero proposed a CNN-based method to detect falls from non-invasive thermal vision sensors rather than a monocular camera [23]. However, none of the aforementioned methods involve an action prediction module.

2.2 Action Recognition and Prediction

Action recognition models predict an action label after observing the entire video. However, sometimes we need to predict the action based on a partial clip. There are roughly two types of action prediction methods based on different goals: early classification and motion prediction. Given an incomplete video clip composed of frames $(I_1, \dots, I_t)$, early classification tries to infer the action label of this clip, while motion prediction tries to infer the future motions in the next $k$ frames $(I_{t+1}, \dots, I_{t+k})$.

The work in [19] designed novel ranking losses to learn activity progression in LSTMs for early classification. Kong adopted an auto-encoder to reconstruct missing features from observed frames by learning from complete action videos [15]. In [14], a mem-LSTM model was proposed to store several hard-to-predict samples and a variety of early observations in order to improve prediction performance at an early stage. However, sometimes multiple actions are present in the same video, even at the same time. Since many action recognition models are trained on datasets in which a video clip is annotated with only one label, they cannot predict multiple actions for a video, which may lead to missed detections. As the example in Fig. 1 shows, we used TSN [34] trained on UCF-101 [25] to predict its label and got 'swing', with the man's falling being ignored.

Figure 1: An example of multiple actions in a single video clip. The TSN model trained on the UCF-101 dataset predicts the label 'swing'. However, the man's falling is ignored because only one label is produced.

In recent years, motion prediction has attracted much attention. Fragkiadaki proposed the encoder-recurrent-decoder (ERD) model [7], which extended long short-term memory (LSTM) [8] to jointly learn representations and their dynamics. Jain proposed structural-RNN (SRNN) based on spatiotemporal graphs (st-graphs) in [9] to capture interactions between humans and objects. Martinez modified standard RNN models with a sampling-based loss and residual architectures for better motion prediction [21]. Tang proposed the modified highway unit (MHU) to filter still keypoints and adopted a gram matrix loss in [29] for long-term motion prediction. For the same purpose, Li proposed the convolutional encoding module (CEM) in [17] to learn the correlations between keypoints, which is hard for an RNN model.

Figure 2: The major workflow of our model. A sequence of observed frames is input to the network. Then the body keypoints of each frame are extracted to form a keypoints sequence, which is used to predict the corresponding future keypoints sequence. At last, the predicted body keypoints are passed into a falls classifier to judge whether the person in the video will fall down in the future.

2.3 Sequence-to-Sequence Models

The authors in [28] proposed the seq2seq framework, which was applied to machine translation and achieved excellent performance. Later, they introduced the same approach to conversational modeling in [33]. In analogy to mapping a sentence from one language to another in machine translation, the conversational model maps a query sentence to a response sentence. Generally, the seq2seq framework uses an LSTM [8] layer to encode the input sentence into a vector of fixed dimensionality, and then another LSTM layer to decode the target sentence from that vector. This encoder-decoder architecture is widely used in sequence mapping problems like machine translation [28], conversational modeling [33] and even video captioning [32] due to its powerful capabilities.

In this paper, the seq2seq framework is adopted to construct an action prediction module, which works together with a falls classifier to predict and classify fall events.

3 Methodology

3.1 Overview of the Proposed Model

The problem to be solved in this paper is formulated as follows: given $T_{in}$ observed frames $(I_1, \dots, I_{T_{in}})$, we need to predict whether the human(s) present in the video will fall down within the next $T_{out}$ frames (i.e., by frame $I_{T_{in}+T_{out}}$).

The skeleton framework of our model is presented in Fig. 2. The input is a sequence of observed frames. We first used OpenPose [4] to extract each person's keypoint coordinates from each observed frame. The bounding boxes of detected persons were passed to DeepSort [35], a tracking algorithm, to cluster keypoint groups for each person. After that, the $i$-th person corresponds to a sequence of keypoints $(K_1^i, \dots, K_{T_{in}}^i)$, where $K_t^i$ denotes the keypoints vector of the $i$-th person in frame $t$.

Based on the observation that falls are highly correlated with the relative positions of body keypoints, we exploited a keypoints vectorization method to emphasize the crucial features in the representation. Formally, the transformed sequence of the $i$-th person is $(V_1^i, \dots, V_{T_{in}}^i)$. Then we adopted the seq2seq model [28] to predict the body motion in future frames. To avoid stacking too many LSTM units, which makes convergence very hard, we encoded several keypoints vectors of consecutive frames in one LSTM unit. Consequently, the lengths of both the encoder and decoder LSTMs were shortened, so the mean-pose problem caused by long-term prediction [29, 17] was also suppressed.

After that, we unpacked the last keypoints prediction and obtained $\hat{V}_{T_{in}+T_{out}}^i$, the feature vector of the $i$-th person at frame $T_{in}+T_{out}$, for falls classification. The classifier was trained on the Le2i [5] dataset, in which each frame is labeled either 'falls' or 'no falls'. After passing the predicted body motion to the classifier, our model is capable of judging whether the corresponding person will fall down.

3.2 Keypoints Vectorization

The keypoint coordinates extracted by OpenPose [4] cannot reflect the correlation between different keypoints, and suffer from the effects of the body skeleton's absolute position and scale. We want the same pose to be encoded to the same feature vector, so we exploited a keypoints vectorization method.

Recall that the 18 keypoints of MS COCO [18] are the nose, neck, left and right shoulders, elbows, wrists, hips, knees, ankles, eyes and ears. We focus on the body keypoints and thus ignore the 5 keypoints on the face: the nose and the left and right eyes and ears. For the remaining 13 body keypoints, we transformed their coordinates into adjacent vectors. As illustrated in Fig. 3, the left and right shoulders are connected to the neck, the left/right elbow is connected to the left/right shoulder, and the left/right wrist is connected to the left/right elbow. Similarly, the left and right hips are connected to the neck, the left/right knee is connected to the left/right hip, and the left/right ankle is connected to the left/right knee. As a result, we obtained 12 vectors of adjacent keypoints and normalized all of them to unit length.

Figure 3: The illustration of the keypoints vectorization method. First, the adjacent keypoints are connected as the arrow indicates. Then, each connection is normalized to unit length with only direction information remained.

Formally, suppose that the $i$-th keypoints sequence extracted by OpenPose [4] and tracked by DeepSort [35] is $(K_1^i, \dots, K_{T_{in}}^i)$, where

$$K_t^i = \left(x_{t,1}^i, y_{t,1}^i, \dots, x_{t,13}^i, y_{t,13}^i\right).$$

For the $j$-th connection pointing from keypoint $a_j$ to keypoint $b_j$, the $j$-th transformed feature vector is calculated by:

$$v_{t,j}^i = \frac{\left(x_{t,b_j}^i - x_{t,a_j}^i,\; y_{t,b_j}^i - y_{t,a_j}^i\right)}{\left\lVert \left(x_{t,b_j}^i - x_{t,a_j}^i,\; y_{t,b_j}^i - y_{t,a_j}^i\right) \right\rVert_2},$$

thus the transformed keypoints vector equals:

$$V_t^i = \left(v_{t,1}^i, \dots, v_{t,12}^i\right).$$

Figure 4: The architecture of our seq2seq-based falls prediction module, which is composed of an encoder (colored green) and a decoder (colored blue). Each LSTM unit in the encoder accepts a transformed keypoints vector (visualized in the figure for a more intuitive presentation) and produces a hidden vector. The last hidden vector is passed to the decoder to generate the first prediction, and later LSTM units receive the previous prediction and produce a new one. Note that the vector packing is reflected in the figure.

Obviously, the transformed keypoints vector eliminates the absolute position and scale of the body skeleton and keeps the directions between adjacent keypoints, which are the salient features for falls classification.
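As a concrete illustration of the vectorization above, here is a minimal numpy sketch. The keypoint ordering and the exact connection list are our own assumptions (the paper only fixes the 13 body keypoints and the 12 connections of Fig. 3):

```python
import numpy as np

# Assumed indexing of the 13 body keypoints (COCO order minus the 5 face points):
# 0 neck, 1 r-shoulder, 2 r-elbow, 3 r-wrist, 4 l-shoulder, 5 l-elbow,
# 6 l-wrist, 7 r-hip, 8 r-knee, 9 r-ankle, 10 l-hip, 11 l-knee, 12 l-ankle
CONNECTIONS = [
    (0, 1), (1, 2), (2, 3),      # neck -> right arm chain
    (0, 4), (4, 5), (5, 6),      # neck -> left arm chain
    (0, 7), (7, 8), (8, 9),      # neck -> right leg chain
    (0, 10), (10, 11), (11, 12)  # neck -> left leg chain
]

def vectorize_keypoints(kps, eps=1e-8):
    """Turn (13, 2) keypoint coordinates into a 24-d vector of unit
    directions along the 12 skeleton connections."""
    kps = np.asarray(kps, dtype=float)
    vecs = []
    for a, b in CONNECTIONS:
        d = kps[b] - kps[a]                         # vector between adjacent joints
        vecs.append(d / (np.linalg.norm(d) + eps))  # normalize to unit length
    return np.concatenate(vecs)                     # shape (24,)
```

Note that the output is invariant to translation and uniform scaling of the skeleton, which is exactly the property motivating the transformation.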

3.3 Seq2Seq-based Action Prediction Module

We implemented a seq2seq architecture to predict the keypoints vectors in future frames (i.e., from frame $T_{in}+1$ to frame $T_{in}+T_{out}$). As illustrated in Fig. 4, the seq2seq architecture is composed of two LSTM [8] layers serving as the encoder and decoder respectively. The encoder reads the input sequence of observed keypoints vectors, with each LSTM unit parsing a single vector. Then a hidden vector is produced and passed into the decoder. In the decoder, one keypoints vector is generated by each LSTM unit: the first LSTM unit receives the hidden vector and outputs the first prediction; later LSTM units take the previous prediction as input and generate a new prediction. The predicted sequence is complete after the last LSTM unit in the decoder produces its prediction.

Recall the mechanism of the LSTM network proposed in [8]. Assume that the input sequence is $(x_1, \dots, x_T)$; the $t$-th LSTM unit updates its states based on the states at $t-1$:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

$\sigma$ denotes the sigmoid function, $\tanh$ is the hyperbolic tangent function, $\odot$ denotes element-wise multiplication, and $i_t$, $f_t$, $o_t$ represent the input gate, forget gate and output gate of the $t$-th LSTM unit respectively. $c_t$ and $h_t$ are the $t$-th cell state and hidden state. $W_*$, $U_*$ and $b_*$ are trainable weights.
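The standard LSTM update above can be sketched as a single numpy step; the stacked-weight layout and gate ordering here are implementation assumptions, not part of the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,), with the four
    gate blocks stacked in the order [input, forget, output, candidate]."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate cell update
    c = f * c_prev + i * g     # new cell state
    h = o * np.tanh(c)         # new hidden state
    return h, c
```

Iterating this step over the packed keypoints vectors plays the role of one encoder (or decoder) LSTM layer.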

We first attempted to apply $T_{in}$ LSTM units to the encoder and $T_{out}$ LSTM units to the decoder (i.e., each LSTM unit deals with a single keypoints vector). However, when we increased $T_{in}$ or $T_{out}$, the network became harder to converge. So we packed every $p$ consecutive keypoints vectors into one to decrease the sequence length and the number of LSTM units.

Formally, recall that the observed sequence of the $i$-th person is $(V_1^i, \dots, V_{T_{in}}^i)$. After the above packing, the new sequence is $(P_1^i, \dots, P_{\lceil T_{in}/p \rceil}^i)$, where

$$P_k^i = V_{(k-1)p+1}^i \oplus V_{(k-1)p+2}^i \oplus \dots \oplus V_{kp}^i,$$

and $\oplus$ denotes the concatenation of two vectors. When there are not enough keypoints vectors for the last package (this happens whenever $T_{in}$ is not divisible by $p$), zero-paddings are concatenated to its tail for dimensional equality. The same operation is also applied to the output sequence.

Benefiting from this vector packing method, our network needs less time to converge, and the mean-pose problem raised in [29, 17] is also suppressed.
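The packing scheme described above can be sketched as follows, assuming a packing size `p` and zero-padding of the tail package:

```python
import numpy as np

def pack_sequence(vectors, p):
    """Pack every p consecutive keypoints vectors into one, zero-padding
    the last package when len(vectors) is not divisible by p."""
    vectors = [np.asarray(v, dtype=float) for v in vectors]
    d = vectors[0].shape[0]
    packed = []
    for k in range(0, len(vectors), p):
        chunk = vectors[k:k + p]
        while len(chunk) < p:                 # pad the tail package with zeros
            chunk.append(np.zeros(d))
        packed.append(np.concatenate(chunk))  # shape (p * d,)
    return packed
```

For example, packing 5 vectors of dimension 24 with `p=2` yields 3 packages of dimension 48, the last one zero-padded.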

3.4 Falls Classifier

Since the action prediction module already captures the spatiotemporal features, the falls classifier is trained to predict the label of a single keypoints vector. We adopted a traditional fully connected neural network for classification. The input layer has 24 neurons, which equals the length of an unpacked keypoints vector. We set up five hidden layers with 96, 192, 192, 96 and 24 neurons respectively, and the output layer with 2 neurons produces the binary prediction: 'falls' or 'no falls'.
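A minimal numpy sketch of this classifier's forward pass with the stated layer widths follows. The ReLU activations and softmax output are assumptions (the paper does not specify them), and the weights here are random rather than trained:

```python
import numpy as np

LAYER_SIZES = [24, 96, 192, 192, 96, 24, 2]  # input, 5 hidden layers, output

def init_classifier(seed=0):
    """Random (untrained) weights and biases for each layer."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(m))
            for n, m in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])]

def classify(params, v):
    """Forward pass: ReLU hidden layers, softmax over {'no falls', 'falls'}."""
    a = np.asarray(v, dtype=float)
    for W, b in params[:-1]:
        a = np.maximum(0.0, W @ a + b)       # ReLU hidden layer
    W, b = params[-1]
    logits = W @ a + b
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()                       # 2-way probability vector
```

In practice the network would be trained with cross-entropy loss as described in Section 4.2.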

4 Experiments

4.1 Dataset Overview

We trained and evaluated our model on the Le2i fall detection dataset [5]. The dataset consists of 191 videos captured in four different scenes: home, coffee room, office and lecture room. The frame rate is 25 frames per second and the resolution is 320×240 pixels. In each video, an actor performs various normal activities and falls (falls may be absent in several videos). The official annotations include the falling-start frame stamp and the falling-end frame stamp for each video. If there is no fall in a video, both frame stamps are annotated as 0.

For the requirements of classifier training, we reviewed all the videos and added an extra getting-up frame stamp to the annotation. If there is no fall, this value is set to 0. And if the actor does not get up before the video ends, this value is set to the last frame.

Suppose that the frame stamps of falling start, falling end and getting up are $t_s$, $t_e$ and $t_u$ respectively. We attempted three annotation principles in our experiments:

  1. Frames between $t_s$ and $t_e$ are labeled 'falls',

  2. Frames between $t_s$ and $t_u$ are labeled 'falls',

  3. Frames between $t_e$ and $t_u$ are labeled 'falls'.

The first principle failed because only the falling process was regarded as falls; it is unreasonable to label a fallen person as 'no falls'. The second one suffered from a high false positive rate: many pre-actions of falling were predicted as 'falls' even when they did not lead to falls. So we adopted the third principle and label 'falls' only when the actor has already fallen. Combined with the action prediction module, our model is still capable of giving advance predictions.
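The adopted labeling principle can be sketched as a small helper; the 1-indexed, inclusive frame stamps are an assumption:

```python
def label_frames(num_frames, t_start, t_end, t_up):
    """Label each frame 'falls' / 'no falls' under the adopted principle:
    frames between the falling-end stamp and the getting-up stamp are
    'falls'. Stamps of 0 mean the video contains no fall."""
    if t_end == 0 and t_up == 0:
        return ['no falls'] * num_frames
    return ['falls' if t_end <= t <= t_up else 'no falls'
            for t in range(1, num_frames + 1)]
```

For example, with `t_start=3`, `t_end=5`, `t_up=8` in a 10-frame video, frames 5 through 8 are labeled 'falls' and the rest 'no falls'.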

4.2 Experiments Setup

We implemented our model on a workstation with two Nvidia 1080Ti GPUs. The seq2seq-based action prediction module and the falls classifier were trained separately and tuned jointly.

For training the action prediction module, we preprocessed all the videos in the Le2i dataset. For each video, we applied OpenPose frame by frame to extract keypoint coordinates, which were transformed into keypoints vectors using the method proposed in Section 3.2. The algorithm may miss the actor in some frames due to illumination or camera perspective. To ensure the coherence of a keypoints sequence, a keypoints vector more than 10 frames away from the previous one starts a new sequence. Then we discarded all sequences containing fewer than 10 frames and finally obtained 139 sequences. The maximum, minimum and average numbers of frames among these sequences are 1773, 13 and 241.26 respectively.
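The sequence-splitting rule above can be sketched as follows; the `(frame_index, keypoints_vector)` input format is an assumption:

```python
def split_sequences(detections, max_gap=10, min_len=10):
    """Split (frame_index, keypoints_vector) detections into coherent
    sequences: a gap of more than max_gap frames starts a new sequence,
    and sequences shorter than min_len frames are discarded."""
    sequences, current = [], []
    prev_frame = None
    for frame, vec in detections:
        if prev_frame is not None and frame - prev_frame > max_gap:
            sequences.append(current)          # close current sequence at gap
            current = []
        current.append((frame, vec))
        prev_frame = frame
    if current:
        sequences.append(current)
    return [s for s in sequences if len(s) >= min_len]
```

A detection gap of, say, 18 frames splits the stream into two sequences, and any fragment under 10 frames is dropped.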

During the training phase, the network read the processed sequences and acquired training samples based on the configuration of $T_{in}$ and $T_{out}$. At each step, a sub-sequence with a length of $T_{in}+T_{out}$ frames was segmented: the former $T_{in}$ frames were regarded as input and the latter $T_{out}$ as ground truth. We used a pointer to mark the start frame of the current sub-sequence; it was initialized to the first frame and moved one frame forward after each step.
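The sliding-window sampling described above can be sketched as a generator; the defaults reflect the $T_{in}=25$, $T_{out}=50$ configuration adopted in our experiments:

```python
def sliding_windows(sequence, t_in=25, t_out=50):
    """Yield (input, target) training pairs from one keypoints sequence:
    a window of t_in + t_out frames is cut at each start position, and
    the start pointer moves one frame forward after each step."""
    total = t_in + t_out
    for start in range(len(sequence) - total + 1):
        window = sequence[start:start + total]
        yield window[:t_in], window[t_in:]
```

An 80-frame sequence thus yields 80 - 75 + 1 = 6 training pairs.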

We used the mean cosine similarity (MCS) to measure the accuracy of the trained model. Specifically, suppose that there are $N$ test samples, and the ground truth and the model's predicted sequence of the $n$-th sample are $G_n$ and $\hat{G}_n$ respectively. Note that $G_n$ and $\hat{G}_n$ do not include the input sequence, thus both contain $T_{out}$ keypoints vectors. $g_{n,t}$ and $\hat{g}_{n,t}$ denote the $t$-th vector of $G_n$ and $\hat{G}_n$. Then the MCS is calculated as follows:

$$\mathrm{MCS} = \frac{1}{N \, T_{out}} \sum_{n=1}^{N} \sum_{t=1}^{T_{out}} \frac{g_{n,t} \cdot \hat{g}_{n,t}}{\lVert g_{n,t} \rVert_2 \, \lVert \hat{g}_{n,t} \rVert_2}$$
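The MCS metric can be sketched in numpy as:

```python
import numpy as np

def mean_cosine_similarity(truth, pred, eps=1e-8):
    """MCS over N samples of T_out keypoints vectors each: the average
    cosine similarity between matching ground-truth and predicted
    vectors."""
    total, count = 0.0, 0
    for g_seq, p_seq in zip(truth, pred):
        for g, p in zip(g_seq, p_seq):
            g, p = np.asarray(g, float), np.asarray(p, float)
            cos = g @ p / (np.linalg.norm(g) * np.linalg.norm(p) + eps)
            total += cos
            count += 1
    return total / count
```

Identical sequences give an MCS of 1, and opposite directions give -1.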



Figure 5: The MCS comparison of different parameter configurations. We selected five combinations of $T_{in}$ and $T_{out}$, denoted by $(T_{in}, T_{out})$ in the figure, then trained and evaluated our model with different packing sizes $p$.

We adopted the mean square error (MSE) as the loss function and used Adam [12] for optimization, with the learning rate set to 0.001. We tried several configurations of $T_{in}$, $T_{out}$ and $p$ to see their effects on the network. It can be seen from Fig. 5 that a 10-frame sequence was insufficient for the network to give accurate predictions, especially for long predictions. Using 25 frames to predict 25 frames or using 50 frames to predict 50 frames both obtained a high MCS, and the network was capable of predicting the future 50 frames based on 25 observed frames with a tolerable MCS decrease. One value of $p$ performed the best among all the tested values.

Finally, we adopted $T_{in} = 25$ and $T_{out} = 50$ in our experiments. Although this configuration did not present the best performance, we hoped that our model could observe less but predict more with an acceptable accuracy.

As for the falls classifier, we adopted a 7-layer fully connected neural network including 1 input layer, 5 hidden layers and 1 output layer. The input layer contains 24 neurons to receive a single keypoints vector as input, the hidden layers have 96, 192, 192, 96 and 24 neurons respectively, and the output layer has 2 neurons to predict whether the person falls or not.

As in many classification models, we adopted the cross-entropy loss function during the training phase, and again used Adam [12] to optimize the falls classifier with a learning rate of 0.001.

4.3 Evaluation

We evaluated our model from two aspects:

  1. Comparison with models that utilize raw RGB rather than body keypoint information.

  2. Comparison with our own model with the action prediction module disabled.

4.3.1 Raw RGB vs. body keypoints on falls recognition

Popular RGB-based action recognition models include C3D [31], Two-Stream [24] and TSN [34]. However, considering that all these models predict a label for a complete video clip rather than a single frame, we adjusted our annotations for a fair comparison.

Recall that each video in our dataset has three frame stamps: $t_s$ marks the falling start, $t_e$ marks the falling end, and $t_u$ marks the getting up. Frames between $t_e$ and $t_u$ are labeled 'falls', with the others labeled 'no falls'. All 3-second (75-frame) clips whose clip-level label is consistent with the original label of their last frame are segmented from the videos, where $c_s$ and $c_e$ denote the start and end frames of a clip with $c_e = c_s + 74$. This was feasible because we analysed all the videos and found that the average duration of the falling process ($t_e - t_s$) was 31.5 frames (1.26 seconds).

Through the above rules, we finally obtained 9549 samples, which were randomly divided into a training set and a test set with a ratio of 7:3. We fine-tuned C3D (3 nets) [31], Two-Stream [24] and TSN (2 modalities) [34] on the training set and evaluated their accuracy on the test set. To emphasize the comparison between RGB-based and keypoints-based methods on falls recognition, our model directly classified the keypoints vector extracted and transformed from the last frame of each clip. All the data were checked to ensure the consistency of each clip's label and its last frame's original label.

  Model Acc. Prec. Rec. F1
C3D [31] 89.4% 66.1% 87.9% 0.755
Two-Stream [24] 91.6% 71.6% 91.0% 0.801
TSN [34] 94.6% 79.8% 94.5% 0.866
Ours 97.8% 90.8% 98.3% 0.944
Table 1: Comparison between RGB-based models and our keypoints-based model on falls classification problem

The test set consists of 2865 samples, including 531 positive samples (falls) and 2334 negative samples (no falls). We adopted accuracy, precision, recall and F1-score to evaluate each model. As illustrated in Table 1, our keypoints-based model showed better performance than the RGB-based models, which proves that using keypoint information to recognize falls is effective.

Figure 6: Some experimental results of our falls prediction model. In cases (a)-(c), our model successfully predicts the future keypoints sequence based on the observed video clip and produces the correct label in advance. (d) and (e) are two failure cases. In (d), there are many missing detections in the extracted keypoints due to the camera perspective; consequently, our model generates an absurd keypoints prediction, which is not enough for classification (the falls classifier returns 'unknown' if it receives insufficient information). In (e), the observed keypoints sequence does not include any 'precursors' of falling, which leads to a wrong prediction and classification.

4.3.2 Influence of the action prediction module

We designed another experiment to evaluate the action prediction module by a self-comparison with this module enabled and disabled. For a concise description, the model with the action prediction module is denoted as $M_{pred}$ and the model without it as $M_{direct}$.

These two contrasted models work in different ways. While predicting the label of the $t$-th frame in a video, $M_{pred}$ reads the 25 frames from $t-74$ to $t-50$ and then predicts the keypoints vectors from frame $t-49$ to frame $t$. The predicted keypoints vector of frame $t$ is then passed to the falls classifier to produce a label. As for $M_{direct}$, it is directly given the $t$-th frame, whose keypoints vector is extracted and transformed for classification.

  Model Acc. Prec. Rec. F1
$M_{pred}$ 98.7% 92.6% 98.0% 0.952
$M_{direct}$ 99.2% 95.1% 98.8% 0.970
Table 2: Comparison between $M_{pred}$ and $M_{direct}$

The comparison result is shown in Table 2. However, since $M_{pred}$ can predict falls in advance, for a fairer comparison, the annotations for training were modified by also labeling the frames of the falling process (i.e., between $t_s$ and $t_e$) as 'falls'.

  Model Acc. Prec. Rec. F1
$M_{pred}$ 98.7% 92.6% 98.0% 0.952
$M_{direct}$ 98.1% 87.7% 99.7% 0.933
Table 3: Comparison between $M_{pred}$ and $M_{direct}$ with modified annotations

We trained $M_{direct}$ on the modified annotations and re-evaluated its performance. As shown in Table 3, although regarding the falling process as falls allows falls to be predicted 'in advance' and improves the recall, the precision drops noticeably due to the increasing false positive rate.

4.4 Results

In Fig. 6, we present several experimental results of our model. In most cases, our model correctly predicts the falling event in advance. However, (d) and (e) show two failure cases with different causes. In (d), OpenPose [4] misses many keypoints because the upper body is invisible from that camera perspective. As a consequence, the action prediction module produces an absurd keypoints prediction, which contains insufficient features for classification. In (e), there is no 'precursor' of falling in the observed frame sequence, which misleads the action prediction module, so the falls classifier also predicts a wrong label.

5 Conclusion

In this work, we proposed a model composed of an action prediction module and a falls classifier to predict fall events in advance. The action prediction module is based on the seq2seq architecture; it takes a sequence of vectorized body keypoints as input and predicts the future keypoints sequence. The predicted keypoints vector is then passed into the falls classifier to produce a label. Our model was evaluated on the Le2i dataset, and the experimental results proved its capability of predicting falls in advance.


  • [1] F. Bianchi et al. (2009) Falls event detection using triaxial accelerometry and barometric pressure measurement. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6111–6114. Cited by: §2.1.
  • [2] F. Bianchi et al. (2010) Barometric pressure and triaxial accelerometry-based falls event detection. IEEE Transactions on Neural Systems and Rehabilitation Engineering 18 (6), pp. 619–627. Cited by: §2.1.
  • [3] A. K. Bourke, J. V. O’brien, and G. M. Lyons (2007) Evaluation of a threshold-based tri-axial accelerometer fall detection algorithm. Gait & posture 26 (2), pp. 194–199. Cited by: §2.1.
  • [4] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh (2018) OpenPose: realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008. Cited by: §1, §3.1, §3.2, §3.2, §4.4.
  • [5] I. Charfi, J. Miteran, J. Dubois, M. Atri, and R. Tourki (2013) Optimized spatio-temporal descriptors for real-time fall detection: comparison of support vector machine and adaboost-based classification. Journal of Electronic Imaging 22 (4), pp. 041106. Cited by: §1, §3.1, §4.1.
  • [6] J. Deng et al. (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1, §2.1.
  • [7] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik (2015) Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4346–4354. Cited by: §2.2.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.2, §2.3, §3.3, §3.3.
  • [9] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena (2016) Structural-RNN: deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5308–5317. Cited by: §2.2.
  • [10] S. Ji, W. Xu, M. Yang, and K. Yu (2012) 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 221–231. Cited by: §1, §2.1.
  • [11] A. Karpathy et al. (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §1, §2.1.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2, §4.2.
  • [13] Y. Kong and Y. Fu (2018) Human action recognition and prediction: a survey. arXiv preprint arXiv:1806.11230. Cited by: §1.
  • [14] Y. Kong, S. Gao, B. Sun, and Y. Fu (2018) Action prediction from videos via memorizing hard-to-predict samples. In AAAI Conference on Artificial Intelligence. Cited by: §2.2.
  • [15] Y. Kong, Z. Tao, and Y. Fu (2017) Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1473–1481. Cited by: §2.2.
  • [16] H. Kuehne et al. (2011) HMDB: a large video database for human motion recognition. In International Conference on Computer Vision, pp. 2556–2563. Cited by: §2.1.
  • [17] C. Li, Z. Zhang, L. W. Sun, and L. G. Hee (2018) Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5226–5234. Cited by: §2.2, §3.1, §3.3.
  • [18] T. Y. Lin et al. (2014) Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.2.
  • [19] S. Ma, L. Sigal, and S. Sclaroff (2016) Learning activity progression in lstms for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1942–1950. Cited by: §2.2.
  • [20] X. Ma et al. (2014) Depth-based human fall detection via shape features and improved extreme learning machine. IEEE journal of biomedical and health informatics 18 (6), pp. 1915–1922. Cited by: §2.1.
  • [21] J. Martinez, M. J. Black, and J. Romero (2017) On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2891–2900. Cited by: §2.2.
  • [22] M. R. Narayanan et al. (2007) Falls management: detection and prevention, using a waist-mounted triaxial accelerometer. In 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4037–4040. Cited by: §2.1.
  • [23] J. Quero et al. (2018) Detection of falls from non-invasive thermal vision sensors using convolutional neural networks. In Multidisciplinary Digital Publishing Institute Proceedings, Vol. 2, pp. 1236. Cited by: §2.1.
  • [24] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576. Cited by: §1, §2.1, §4.3.1, §4.3.1, Table 1.
  • [25] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §2.1, §2.2.
  • [26] E. E. Stone and M. Skubic (2011) Evaluation of an inexpensive depth camera for passive in-home fall risk assessment. In 5th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops, pp. 71–77. Cited by: §2.1.
  • [27] E. E. Stone and M. Skubic (2014) Fall detection in homes of older adults using the microsoft kinect. IEEE journal of biomedical and health informatics 19 (1), pp. 290–301. Cited by: §2.1.
  • [28] I. Sutskever, O. Vinyals, and Q. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1, §2.3, §3.1.
  • [29] Y. Tang, L. Ma, W. Liu, and W. Zheng (2018) Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513. Cited by: §2.2, §3.1, §3.3.
  • [30] M. E. Tinetti (1994) Prevention of falls and fall injuries in elderly persons: a research agenda. Preventive medicine 23 (5), pp. 756–762. Cited by: §1.
  • [31] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §1, §2.1, §4.3.1, §4.3.1, Table 1.
  • [32] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence - video to text. In ICCV, pp. 4534–4542. Cited by: §2.3.
  • [33] O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §2.3.
  • [34] L. Wang et al. (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §1, §2.1, §2.2, §4.3.1, §4.3.1, Table 1.
  • [35] N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. Cited by: §3.1, §3.2.