Code repo for LSTM Pose Machines
We observed that recent state-of-the-art results on single image human pose estimation were achieved by multi-stage Convolution Neural Networks (CNN). Notwithstanding the superior performance on static images, the application of these models on videos is not only computationally intensive, it also suffers from performance degeneration and flicking. Such suboptimal results are mainly attributed to the inability of imposing sequential geometric consistency, handling severe image quality degradation (e.g. motion blur and occlusion) as well as the inability of capturing the temporal correlation among video frames. In this paper, we proposed a novel recurrent network to tackle these problems. We showed that if we were to impose the weight sharing scheme to the multi-stage CNN, it could be re-written as a Recurrent Neural Network (RNN). This property decouples the relationship among multiple network stages and results in significantly faster speed in invoking the network for videos. It also enables the adoption of Long Short-Term Memory (LSTM) units between video frames. We found such memory augmented RNN is very effective in imposing geometric consistency among frames. It also well handles input quality degradation in videos while successfully stabilizes the sequential outputs. The experiments showed that our approach significantly outperformed current state-of-the-art methods on two large scale video pose estimation benchmarks. We also explored the memory cells inside the LSTM and provided insights on why such mechanism would benefit the prediction for video-based pose estimations.READ FULL TEXT VIEW PDF
Code repo for LSTM Pose Machines
Estimating joint locations of human bodies is a challenging problem in computer vision which finds many real applications in areas including augmented reality, animation and automatic photo editing. Previous methods[6, 2, 34] mainly addressed this problem by well designed graphical models. Newly developed approaches [4, 32, 19] achieved higher performance with deep Convolutional Neural Networks (CNN).
Nevertheless, those state-of-the-art models were trained on still images, limiting their performance on videos. Figure 1 demonstrates some unsatisfactory situations. For instance, the lack of geometric consistency makes the previous methods prone to making obvious errors. Mistakes caused by serious occlusion and large motion are not uncommon as well. In addition, those models usually have a deep architecture and would be computationally very intensive for real-time applications. Therefore, a relatively light-weight model is preferable if we want to deploy it in a real-time video processing system.
An ideal model of such kind must be able to model the geometric consistency as well as the temporal dependency among video frames. One way to address this is to calculate the flow between every two frames and use this additional cue to improve the prediction [22, 28]. This approach is effective when the flow can be accurately calculated. However, this is not always the case because the calculation of optical flow suffers from image quality degradation as well.
In this paper, we adopted a data-driven approach to better tackle this problem. We showed that a multi-stage CNN could be re-written as a Recurrent Neural Network (RNN) if we impose the weight sharing scheme. This new formulation decouples the relationship among multiple network stages and results in significantly faster speed in invoking the network for videos. It also enables the adoption of Long Short-Term Memory (LSTM) units between video frames. By effectively learning the temporal dependency among video frames, this novel architecture well captures the geometric relationships of joints in time and increases the stability of joint predictions on moving bodies. We evaluated our method on two large scale video pose estimation benchmarks namely, Penn Action  and sub-JHMDB . Our method significantly outperformed all previous methods both in performance and speed.
To well justify our findings, we also investigated the internal dynamics of the memory cells inside our LSTM and explained why and how LSTM units would improve the video pose estimation performance. The memory cells were visualized and insights were provided.
The contributions of our work can be summarized as follows.
First, we built a novel recurrent architecture with LSTM to capture temporal geometric consistency and dependency among video frames for pose estimation. Our method surpassed all the existing approaches on two large scale benchmarks.
Second, the new architecture decouples the relationship among network stages and results in much faster inference speed for videos.
Third, we probed into the LSTM memory cells and visualized how the memory cell would help to improve the joint predictions on videos. It provides insights and justifies our findings.
Early works on single-image pose estimation started from building graphical structures [6, 2, 34, 29, 24] to model the relations between joints. However, those methods rely heavily on hand-crafted features which restrict their generality on varied human poses in reality. The performance of these methods has recently been surpassed by CNN based methods [4, 31, 30, 32, 19, 5]. Those deep models had the capacity to generalize from unseen scenes by learning various spatial relations from data. Recent works [32, 19] employed the strategy of iteratively refining the output of each network stage and achieved state-of-the-art results in many image-based benchmarks. In , a recurrent model was proposed to reduce training parameters, but it was designed for images rather than videos.
Directly applying the existing image-based methods on video sequences produces sub-optimal results. There are two major problems. First, these models failed to capture temporal dependency among video frames and they were unable to keep the geometric consistency. It can be shown that the image-based models can easily suffer from motion blur and occlusion and usually generate inconsistent results for neighbouring frames. Second, the image-based models are usually very deep and computationally expensive. It is problematic when adopting them in real-time applications.
A few previous studies integrated temporal cues into pose estimation [11, 23, 22, 28, 7, 20, 17]. Modeep  first tried to merge motion features into ConvNet, and Pfister et al.  made a creative attempt to insert consecutive frames at different color channels as input. In later works [22, 28], dense optical flow  was produced and used to adjust the predicted positions in order to let the movement smooth across frames. Good results were achieved by Thin-Slicing Network  which relied on both adjustment from optical flow and a spatial-temporal model. However, this system is computationally very intensive and slower than the previous image-based method. Our method is similar to the Chained Model , which is a simple recurrent architecture that can capture temporal dependencies. Unlike Chained Model , our model better captured geometric consistency and temporal dependency of video frames by memory augmented RNN and it achieved better performance. We are able to outperform current state-of-the-art methods while keeping a concise architecture.
Understanding the underlying mechanism behind neural networks is important and of great interests among many researchers. Several works [35, 18] aimed to explain what the convolution models had learned by reconstructing the features into original images. Likewise,  studied the long-range interactions captured by recurrent neural network in text processing. And in particular, it interpreted the function of LSTM in text-based works. In this paper, we combined the analysis from these two sides, and visualized how our model learned and helped the work of locating moving joints in videos.
Pose Machine  was first brought up as a method to predict joint locations in a sequentially refined manner. The model was built on the inference machine framework to learn strong interconnections between body parts. Convolutional Pose Machine (CPM)  inherited the idea from pose machine with implementing it in a deep architecture. At the same time, it adopted a fully convolutional design by producing predicted heat maps at the end of the system. As a critical strategy exploited in pose machines, passing prior beliefs into next stages and supervising the loss in all stages benefit the training of such a deep ConvNet by addressing the problem of gradient vanishing. Following the descriptions in , we can formulate the model mathematically in the following way: Denote (P joints plus one background channel with size ) as the beliefs in stage , they can be calculated iteratively by:
where is the original image sent into every stage. is a ConvNet used to extract valuable features from input image. Those features will be concatenated (indicated by operation ) with prior beliefs (i.e. ) and sent into another ConvNet to produce refined belief maps. It is easy to observe that CPM does a great job on pose estimation because and are not identical across different stages even though they share the same architecture (in fact uses a deeper structure compared with in order to produce more precise confidence maps for further refinements since its unprocessed input contains only local evidences). It repetitively modifies the confidence maps by adding intermediate supervisions at the end of each stage. However, applying this deep structure for video-based pose estimation is not practical because it does not integrate any temporal information.
Chained model  provided us a motivation to construct a RNN style model for this problem. And we were also inspired by the design of CPM to reform it into a recurrent one. Referring to Eq. (1), we found that CPM could be easily transformed into a recurrent structure by sharing the weights of those two functions and across stages. Mathematically, a new Recurrent Pose Machine derived from CPM can be formulated as:
Here, is no longer the belief maps in a certain stage as described in Eq. (1), but it represents the produced belief maps matched with frame where is now the length of frames in this video. The input ’s are not the same in different stages, but they are consecutive frames from a video sequence. Similarly, at the initial place is still different from , and now all the following stages share an exactly identical function. With this implementation, the model is rebuilt with recurrent design and it can be used to predict joint locations from a variable-length video. Apart from its recurrent property, it also accomplishes another notable achievement which is lessening the parameters for predicting locations from a single frame.
Training of the model described in Eq. (2) can now be proceeded collectively on a set of successive frames. However, this RNN model can not achieve optimal performance on video-based pose estimation. We found that it was beneficial to include a LSTM unit  because of its special gate designs and memory implementation. This modification can be achieved by further adapting Eq. (2). In other words, our new memory-enabled recurrent pose machines become:
is a function controlling memory’s inflow and outflow procedures. In Eq. (2), contains two parts, namely a feature encoder and a prediction generator. Since directly receives processed features, we separate these two parts and plug the LSTM between them as shown in Eq. (3). The extractor acts like in other stages but it is much deeper, so we denote it as . Now we can also see that the generators are identical across all stages. Since nothing is in LSTM’s memory at the first stage, will be a little bit different from that in subsequent stages, but they all perform similar functionality. We will discuss the implementation in detail in later sections, and more importantly, we will explain how the LSTM can robustly boost the performance of our recurrent pose machines.
Figure 2 illustrates our structure stated in Eq. (3) for pose estimation on video. Consecutive frames in the same video clip will be sent into the network as input in different stages. As shown in the figure, when , can be decomposed as , where is the ConvNet1 aiming at processing raw input and is the encoder ConvNet2 consistently used in all stages. produces preliminary belief maps associated with the first frame. Since the prediction does not have a high confidence level, it will be concatenated with again to generate a more accurate result. LSTM is the most critical components in this architecture. It can be referred to as the function we mentioned above. In reality, it takes multiple steps to forget the old memory, absorb new information and create the output. ConvNet3 is the generator we described in Eq. (3) and it is connected to the output from LSTM. All those ConvNet segments comprise several convolution layers, activation layers and pooling layers. They inherit the design of Convolutional Pose Machines , and the architectures of them are the same as the counterparts used in the CPM model. The difference is that our model allows weight sharing for all these components across stages. Following CPM , we add an extra slice containing a central Gaussian peak during input concatenation for better performance. Dropout is also included in the last layers of ConvNet1.
The structure and functionality of LSTM have been discussed in many prior works [9, 8, 27]. A vanilla LSTM is defined in  and it is the most commonly used LSTM implementation. In , Greff et al. conducted a comprehensive study on the components of LSTM, and they found out that this vanilla LSTM with forget gate, input gate and output gate already outperformed other variants of LSTM. Eq. (4) illustrates the operations inside a vanilla LSTM unit that we used in our recurrent model:
Unlike traditional LSTM, ’*’ here does not refer to a matrix multiplication but to a convolution operation similar as that in  and . As a result, all the ’+’ in Eq. (4) represent the element-wise addition. The ’s here denote the bias terms. These settings result in our convolutional LSTM design. , , are the input gate, forget gate and output gate at time respectively. They are controlled by new input and hidden state from last stage mutually. Note that here is not the same as that in Eq. (3). Here it is already the concatenated inputs (i.e. in Eq. (3)). Convolutional design of the gates focuses more on regional context rather than global information, and it pays more attention to the changes of joints in smaller local areas. One convolution layer with kernel is found to be best for performance. is the memory cell which preserves knowledges in a long range by forgetting old memory and taking in new information continuously. Hidden state will be outputted from the newly formed memory and it will be used to generate current beliefs via the generator . The first memory cell is calculated by only since forget operation is unavailable.
Our LSTM Pose Machine is implemented in Caffe , and functions in LSTM are simply implemented by convolutions and element-wise operations. Labels in Cartesian coordinates are transformed into heat maps with Gaussian peaks centred at the joint positions. The network has stages, where T is the number of consecutive frames in the training sequence. Loss will be added at the end of each stage to supervise the learning periodically. Training aims to reduce the total
distance between prediction and ground truth for all joints and all frames jointly. Loss function is defined as:
where is the produced belief and is the ground truth heat map for part in stage t.
In this section, we present our experiments and quantitative results on two widely used datasets. Our method achieved state-of-the-art results in both of them. Qualitative results will be also provided in this part. At last, we will explore and visualize the dynamics inside LSTM units.
Penn Action Dataset  is a large dataset containing in total 2326 video clips, with 1258 clips for training and 1068 clips for testing. On average each clip contains 70 frames, but the number in fact varies a lot for different cases. 13 joints including head, shoulders, elbows, wrists, hips, knees and ankles are annotated in all the frames. An additional label indicates whether a joint is visible or not in a single image. Following previous works, evaluation will be only conducted on visible joints.
JHMDB  is another video-based dataset for pose estimation. For comparison purpose, we only conduct our experiment on a subset of JHMDB called sub-JHMDB dataset to maintain consistency with previous works. This subset contains only complete bodies and no invisible joint is annotated. Sub-JHMDB has 3 different split schemes, so we trained our model separately and reported the average result over these three splits. This subset has 316 clips with all 11200 frames in the same size. Split results in a train/test ratio which is roughly equal to 3.
is randomly performed to increase variation of input. Since a set of frames will be sent into the network at the same time, the transformation will be consistent within a patch. Images will be randomly scaled by a factor. For Penn this factor is between 0.8 to 1.4 while for sub-JHMDB it is between 1.2 to 1.8 since the bodies are originally smaller. Images will then be rotated with degree [,] and flipped with randomness. At last, all the images will be cropped to a fixed size () with bodies set at center.
Since we directly revised the architecture of Convolutional Pose Machines , we can easily initialize the weights based on the pre-trained CPM model. Instead of directly copying weights from it, we first built a single image model which used the same structure as our model trained on video sequences. The difference is that we set
for this single image model and the inputs are identical in all stages. We only copied the weights in the first two stages of CPM model since weights in our model are shareable across stages. This model was fine-tuned for several epochs on the combination of LSP and MPII  datasets, which is the same data source for training the CPM model from scratch.
Our models for training on Penn and sub-JHMDB started by copying the weights from our single image models described above. During training, length of our recurrent model is set to be 5 (i.e.
=5), which is large enough to observe sufficient changes from a video sequence. Stochastic gradient descent with momentum of 0.9 and weight decay of 0.0005 is used to optimize the learning process. Batch size is selected to be 4. The initial learning rate is set to be
and it will drop by multiplying a factor of 0.333 every 40k iterations. Gradient clipping is used and set as 100 to prevent gradient explosion. Dropout ratio is 0.5 in the first stage.
Similar to many prior works, beliefs for joints are produced at the end of each stage. Positions in x,y coordinates can then be interpolated from finding the maximum confidence. During testing, we first rescaled the input into different sizes, and averaged the outputs to produce a more reliable belief. In our experiments, we rescaled the images into 7 scales and the scaling factors are within the corresponding regions that we used for augmentation during training. To evaluate the results, we adopt the PCK metric introduced in. An estimation is considered correct if it lies within from the true position, where and are the height and width of the bounding box. In order to consistently compare with other methods, is chosen to be 0.2 for evaluation on both datasets. Penn already annotates the bounding box within each image, but the bounding boxes for sub-JHMDB are deduced from the puppet masks used for segmentation.
Table 1 and table 2 show the performance of our models and previous works on Penn dataset as well as sub-JHMDB dataset. Apart from LSTM Pose Machines (LSTM PM) stated in Eq. (3), we also present a simplified Recurrent Pose Machine model (RPM) as described in Eq. (2). It simply takes off the LSTM modules and it was trained using the same parameters in order to study the contribution of LSTM component. By considering long-term temporal information in our models, we achieved improved results in both benchmarks. Comparing our state-of-the-art LSTM Pose Machines with previous video-based pose estimation methods such as Thin-Slicing Net , we observe an overall improvement of 1.2% which is evenly distributed in all body parts in the case of Penn benchmark. Among all those parts, we find that the greatest boost of 1.9% increase comes from the wrist. Similarly, for sub-JHMDB dataset, we achieved improvements in almost all the joints. It is worth noticing that the biggest increases come from elbow and wrist. This is a significant result since we have robustly improved the predictive accuracy of the joints that are subject to drastic movements and occlusion. In our experiments, we trained a CPM model  on these two datasets with the same training scheme as well. We can see that it has already surpassed all existing methods on both benchmarks but it still can not compete with us. Qualitative results are presented in figure 3. We can see that our method is especially suitable to cope with big changes across frames through its strong predictive power. Even though the body is in motion or it suffers from an occlusion in the middle of the video, positions can be inferred from their past trajectories smoothly.
From table 1 and table 2, we can see that our recurrent models without LSTM module also provided improved results. It performed similarly as the CPM model by taking the temporal cues into account, while it only used a shorter structure. After adding LSTM, our models received extra gain in performance. For Penn, the average increment is 0.7%, and for sub-JHMDB, it is 1.4%. For those easy parts such as head, shoulder and hips, recurrent model is already able to perform well. But for those joints that are easily subject to occlusion or motion, the memory cells help to robustly promote the estimation accuracy of them by better utilizing their historical locations. With the help of our LSTM module, we can conclude that our approach increased overall stability in predicting joints from moving frames.
In this part, we explore the effect of using different iterations T. We train our model with different number of stages, i.e., T=1, 2, 5, 10, on the sub-JHMDB dataset, and report the experimental results in Table 3. When there is just one iteration in the LSTM, the performance drops a lot, even worse than CPM, since there is no temporal information or refine operations like CPM model. When iterations increase to 2, the performance has a notable improvement, since current frame would keep information about the joints which are nearly static compared to the last frame from the last stage, and just learn the joints which move a litter faster. It makes the preference more stable among video frames. What’s more, the performance still increases when we add iterations from 2 to 5, which means long-term temporal information is good for video pose estimation. However, it doesn’t mean the more iterations, the higher performance. The experiment in T=10 tells us that the information of the frames which are very long before current frame is helpless. In order to balance the performance and training computation consumption, we set T=5.
Inference time is critical for real-time applications. Previous methods are relatively time-consuming in producing the results because they need to go through many stages for a single frame. Our method only needs to go through a single stage for every video frame thus performs significantly faster than the previous multi-stage CNN based methods. Note that for the first frame, our method needs to go through a longer stage to get started. For fair comparison, we randomly pick a video clip with 100 frames and send them into the CPM model and our model for testing separately. The experiment result shows that the CPM model needs 48.4ms per-frame, but we only need 25.6ms per-frame which means that our model runs about 2x faster than the CPM model. Comparing to the flow based methods such as Thin-Slicing Net , which is based on CPM and needs to generate flow map, our model has greater advantages in speed. Thus our model is especially preferable for real-time video-based pose estimation applications.
In order to better understand the mechanism behind LSTM, exploring the content of memory supplies substantial cues. Sharma et al.  and Li et al.  have made an attempt on relevant issues recently. In their works, they focused more on the static attentions in each stage, but we are going to address the transition of memory content resulted from the changing positions.
Figure 4 displays the results of our exploration. We first up-sampled the channels in memory and mapped them back to original image space. Following our setup, there are 48 channels in each memory cell and we only selected some representative ones here for visualization. From the figure, we can see that memories in different channels are the attentions on distinct parts. Some of them are the global views on trunks or edges (the first three samples), and some just focus on a particular joint (the other three show the memory attentions on elbow, hip and head). Remember that those memories will be selectively outputted and processed by a network for estimation. Therefore, the memory cell containing both global and local information helps the prediction of spatially correlated joints on a single frame.
A more important property of LSTM is that it maintains its memory by using both useful prior information and new knowledges. As described in Eq. (4), LSTM goes through the process of forgetting and remembering during each iteration. In each row of Figure 5 illustrates different phases of the memory cell within one iteration. It captures the evolution of our LSTM inside the iteration (only represented by one selected channel). Each column represents a single phase according to the figure’s description. We can observe from the first sample that the forget operation selectively retains useful information for the prediction in next stage, such as wrists and head, which are nearly static in the three consecutive frames (col. 3), while new input of this stage brings more emphasis on the regions containing latest appearance of joints, such as knees, which have movement in the three consecutive frames (col. 4). These two parts combin to be a new memory and produc the predictions on a new frame with high confidence (col. 5). That is why our model can capture temporal geometric consistency and prevent the mistakes in videos as illustrated in Figure 1. For the second sample, in the first frames, the left wrist still can be seen, but it is occluded in the next two frames. In our model, since the left wrist has been recognized in the first frame, the following frames can infer the location of it by the memory cell of the last stage though it has been occluded. What’s more, the movement of elbows in the third sample is flicking, but our model can keep the static joints (e.g. hip and keen), and quickly track the new information of rapidly moving joints (e.g. elbows) by memory cells and new inputs.
In conclusion, those mechanisms can help to make the predictions more accurate and robust for post estimations on video.
In this paper, we presented a novel recurrent CNN model with LSTM for video pose estimation. We achieved significant improvement in terms of both accuracy and efficiency. We showed that the LSTM module indeed contributed to the better utilization of temporal information. Meanwhile, we explored and visualized the memory cells inside the LSTM and explained the underlying dynamics of the memory during pose estimation on changing frames.
Modeep: A deep learning framework using motion features for human pose estimation.In ACCV, 2014.
Convolutional lstm network: A machine learning approach for precipitation nowcasting.In NIPS, 2015.