1 Introduction
From a psychological standpoint, it has been argued that humans detect real-world structures by detecting changes along physical dimensions (contrast values) and representing these changes (with respect to time) as relations (differences) along subjective dimensions [9]. More directly, it has been suggested that the temporal dimension is necessary and is coupled with the spatial dimensions in human mental representations of the world [3]. This implies that there is merit in incorporating time into a definition of structure from a computer vision modelling point of view, which forms the inspiration for this work.
This work deals with a longstanding task in computer vision: modelling human pose in 3D from monocular videos. The challenges of this task include large variability in poses, movements, appearance and background, as well as occlusions and changes in illumination.
This paper proposes a method to estimate the body pose of a human (in terms of 3D body joint locations) from video captured by a single monocular 2D camera, using a deep three-dimensional convolutional neural network. The key idea behind this approach is that time can be encoded as the third dimension of the 3D convolution operation, while the other two dimensions run along the height and width of the image. The underlying hypothesis is that temporal information can be efficiently represented as an additional dimension in deep convolutional neural networks (see [16, 5] for a detailed description of 3D convolution). It is important to note that no depth information is provided to the network as input: the system is expected to infer the body joint positions in all three spatial dimensions based only on the stream of 2D frames in the video. A more detailed and complete description of this work can be found in [5].
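As a sketch of how such an input can be laid out, the snippet below stacks a short clip of RGB frames into a single tensor whose third spatial axis is time. The channel-first ordering is an assumption for illustration, not a detail taken from the original implementation:

```python
import numpy as np

# A training sample: 5 consecutive RGB frames of 128x128 pixels.
# Time is treated as a third spatial axis, so one sample becomes a
# 4D tensor (channels, time, height, width); a 3D kernel then slides
# along time as well as along height and width.
frames = [np.zeros((128, 128, 3)) for _ in range(5)]  # 5 HxWxC frames
clip = np.stack(frames, axis=0)                       # (T, H, W, C)
clip = np.transpose(clip, (3, 0, 1, 2))               # (C, T, H, W)
print(clip.shape)  # (3, 5, 128, 128)
```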
Such a system can have applications in areas such as visual surveillance, human action prediction, emotional state recognition, human-computer interfaces, video coding, ergonomics, video indexing and retrieval, etc.
2 Related Work
A number of studies in the human pose estimation field have used different generative and discriminative approaches. Most published works deal with single still images [17] or depth images [10], and most often attempt to estimate full-body [1], upper-body [15] or single-joint [2] positions in 2D, in the image plane. Additionally, many approaches incorporate 2D pose estimates or features to retrieve 3D poses [18, 19]. The work in [15] formulates 2D pose estimation as a joint regression problem using a conventional deep CNN architecture; the predictions are further refined iteratively by analysing relevant regions of the images at higher resolution. [14] introduces a heatmap-based approach, where a spatial pyramid input is used to generate a heat map describing the spatial likelihood of joint positions. [11] presents an architecture similar to [15], with the key difference that multiple consecutive video frames are encoded as separate colour channels in the input. Although this appears similar to a 3D CNN, the key difference is that this approach constrains the third dimension of the '3D' kernel to equal the number of channels, so the kernel has no room to convolve along this dimension. The first architecture utilizing 3D CNNs was proposed in 2013 and applied to human action recognition in [8]. As in our proposed work, the third spatial dimension of the convolution operation is used to encode the time dimension of the video stream. That work also utilizes recurrent neural networks to finally predict the human action category; however, it does not explore the use of 3D CNNs for predicting the precise locations of body joints. Recent methods tested on the Human3.6M dataset include a discriminative approach to 3D human pose estimation using spatiotemporal features (HOG-KDE)
[13], as well as a 2D CNN based 3D pose estimation framework (2D-CNN-EM) [19]. However, one drawback of these approaches is that they use a large number of frames per sequence compared to our proposed 3D CNN method. Our approach studies the suitability of 3D convolutional networks for the task of 3D pose estimation from 2D videos; to the best of our knowledge, this is the first work to do so. More fundamentally, this work explores the effects of processing spatiotemporal data using three-dimensional convolutions, where the temporal dimension of the data is represented as an additional dimension of the convolutions.
3 Dataset
The Human3.6M dataset [7] is the largest publicly available motion capture dataset to date. It consists of high-resolution 50 Hz video sequences from 4 calibrated cameras capturing 10 subjects performing 15 different actions ('eating', 'posing', etc.). 3D ground-truth joint locations as well as bounding boxes of human bodies are provided. Note that we consider videos from the 4 camera positions independently and do not combine them in any way. Our evaluation was done on 17 core joints out of the available 32 joint locations. For official testing, the ground-truth data for 3 subjects is withheld and used for result evaluation on the server.
4 Method
4.1 Preprocessing
The original Human3.6M video frames are cropped using the provided bounding-box binary masks, with each crop extended along its shorter side to make it square. Cropped images are resized to a resolution of 128×128 (chosen arbitrarily). The results of the cropping can be seen in Figure 1.
Data sampling
Due to the large amount of available data and to memory and time constraints, the data is subsampled. One training sample is composed of 5 sequential colour images with a resolution of 128×128, sampled from the original video to obtain a frame rate of 13 Hz. Samples were selected at random from the videos of every chosen training, validation and testing subject to ensure that all possible poses are covered.
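The subsampling step can be sketched as nearest-index resampling from 50 Hz down to roughly 13 Hz, from which 5 consecutive frames form one sample. The exact resampling scheme and the helper below are assumptions for illustration:

```python
import numpy as np

# Subsample a 50 Hz frame stream to ~13 Hz by nearest-index resampling,
# then take 5 consecutive subsampled frames as one training sample.
# The exact scheme used in the paper is an assumption here.
SRC_HZ, DST_HZ, CLIP_LEN = 50, 13, 5

def sample_clip(num_frames, start):
    """Return CLIP_LEN source-frame indices starting at subsampled position `start`."""
    idx = np.round(np.arange(num_frames * DST_HZ / SRC_HZ) * SRC_HZ / DST_HZ).astype(int)
    idx = idx[idx < num_frames]  # guard against rounding past the last frame
    return idx[start:start + CLIP_LEN]

print(sample_clip(100, 0))  # [ 0  4  8 12 15]
```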
Data alignment
Ground-truth joint positions were centred on the pelvis position (the first joint).
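A minimal sketch of this alignment, with the pelvis as joint 0 as described above:

```python
import numpy as np

# Centre ground-truth joints on the pelvis (joint 0), per frame.
def centre_on_pelvis(joints):
    """joints: (num_joints, 3) array of 3D positions; returns centred copy."""
    return joints - joints[0]  # broadcasts the pelvis position over all joints

joints = np.array([[10., 20., 30.],   # pelvis
                   [13., 24., 30.],
                   [ 8., 18., 41.]])
print(centre_on_pelvis(joints)[0])  # [0. 0. 0.]
```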
Contrast normalization
To reduce the variability the network needs to account for during training, global contrast normalization (GCN) was applied to the network's input data (per colour channel).
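A simple per-channel GCN can be sketched as follows; the exact variant used (e.g. any scaling constant) is an assumption:

```python
import numpy as np

# Global contrast normalization per colour channel: subtract the channel
# mean and divide by the channel standard deviation (epsilon guards
# against a zero-variance channel).
def gcn_per_channel(img, eps=1e-8):
    """img: (H, W, C) float array; returns the normalized image."""
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True)
    return (img - mean) / (std + eps)

img = np.random.rand(128, 128, 3)
out = gcn_per_channel(img)
print(np.allclose(out.mean(axis=(0, 1)), 0, atol=1e-6))  # True
```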
4.2 Deep 3D Convolutional Neural Network
The final network architecture was developed by starting with a small base network with only three hidden 3D convolutional layers and building it up while testing on a small subset of the data. Decisions on the building blocks and hyperparameter selection were made by analysing experimental results and following similar choices reported in the related work reviewed in Section 2. In this network, all activations are PReLUs [6] with p set to 0.01.
The following equation provides a mathematical expression of the discrete convolution (denoted by ∗) applied to three-dimensional data V (of dimensions W × H × T), using a three-dimensional flipped kernel K (of dimensions w × h × t):

  (V ∗ K)(x, y, z) = Σᵢ Σⱼ Σₖ V(x − i, y − j, z − k) · K(i, j, k)   (1)
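The 3D convolution described above can be checked with a direct, unoptimized NumPy sketch ('valid' output region, stride 1, flipped kernel):

```python
import numpy as np

# Direct 3D convolution with a flipped kernel ('valid' region, stride 1),
# matching the textbook definition of discrete convolution.
def conv3d(volume, kernel):
    kd, kh, kw = kernel.shape
    d, h, w = volume.shape
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1))
    flipped = kernel[::-1, ::-1, ::-1]  # convolution flips the kernel
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = np.sum(volume[z:z+kd, y:y+kh, x:x+kw] * flipped)
    return out

vol = np.arange(27, dtype=float).reshape(3, 3, 3)
k = np.zeros((2, 2, 2)); k[0, 0, 0] = 1.0  # a shifted-identity kernel
print(conv3d(vol, k).shape)  # (2, 2, 2)
```

With this kernel, each output voxel picks up the input voxel one step further along every axis, which makes the result easy to verify by hand.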
In our implementation, the stride is always equal to 1 and no zero-padding is performed. Experiments were carried out with different kernel sizes and different numbers of convolutional layers. The best performance was achieved with 5 convolutional layers, each with its own kernel size. Max pooling is performed after the first, second and fifth convolutional layers, and only in the image plane (not along the third, temporal dimension). In our proposed architecture, the output of the last pooling layer is flattened to a one-dimensional vector of size 9680, which is fully connected to the output layer of size 255 (5 frames × 17 joints × 3 dimensions). The complete 3D CNN architecture is shown in Figure 2.
Training
The network was trained using minibatch (of size 10) stochastic gradient descent with Nesterov momentum [12]. The Xavier initialization method [4] was used to set the initial weights, while the biases in the convolutional layers were set to zero. Due to memory and time limitations, the maximum number of batches used was 20,000 for training, 2,000 for validation and 2,000 for testing (approximately half of the available data). The cost function minimized during training was the mean per joint position error (MPJPE) [7], the mean Euclidean distance between the true and predicted joint locations; this also serves as a good performance measure during testing. Early stopping was used to avoid overfitting: training was terminated when performance on the validation set failed to improve for 15 consecutive epochs.
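The MPJPE cost can be sketched directly in NumPy:

```python
import numpy as np

# Mean per joint position error (MPJPE): the mean Euclidean distance
# between predicted and ground-truth 3D joint positions.
def mpjpe(pred, gt):
    """pred, gt: (..., num_joints, 3) arrays in millimetres."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

gt = np.zeros((17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])  # every joint off by a 3-4-5 offset
print(mpjpe(pred, gt))  # 5.0
```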
4.3 Postprocessing
The network output contains estimated 3D joint positions for 5 consecutive frames. At inference time, this makes it possible to feed each video frame through the network 5 times, at 5 different positions in the input sequence, yielding 5 outputs per frame. To obtain a more robust estimate, these overlapping outputs are averaged together.
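The averaging of overlapping window outputs can be sketched as follows; the exact merging scheme is an assumption consistent with the description:

```python
import numpy as np

# Average the overlapping per-window outputs: each frame of an N-frame
# video appears at up to 5 positions within the 5-frame input windows,
# so up to 5 predictions per frame are averaged.
def average_overlaps(window_preds):
    """window_preds: (num_windows, 5, J, 3); window w covers frames w..w+4.
    Returns (num_frames, J, 3) averaged predictions."""
    num_windows, clip_len = window_preds.shape[:2]
    num_frames = num_windows + clip_len - 1
    acc = np.zeros((num_frames,) + window_preds.shape[2:])
    counts = np.zeros(num_frames)
    for w in range(num_windows):
        acc[w:w + clip_len] += window_preds[w]
        counts[w:w + clip_len] += 1
    return acc / counts[:, None, None]

preds = np.ones((6, 5, 17, 3)) * 2.0  # constant predictions
print(average_overlaps(preds).shape)  # (10, 17, 3)
```

Frames near the start and end of the video fall inside fewer windows, which is why the per-frame counts are tracked explicitly rather than dividing by 5 everywhere.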
5 Results
In Table 4 our best results are compared with the state of the art reported on the dataset website; all numbers are MPJPEs in millimetres. The network performs better on 11 actions, and its MPJPE is 11% smaller on average. However, the model performs worse on actions where people are sitting on a chair or on the ground, showing difficulty in dealing with body-part occlusions. This could also be because the temporal window of 5 frames is too short to capture these joint positions; expanding the window or incorporating recurrent neural networks into this architecture could capture longer-term trajectories and handle such cases better. Figure 4 shows selected examples of pose estimation by the network.
On further investigation, it was also found that the positions of freely moving upper-body joints such as the hands were predicted relatively poorly. To counter this, a further improvement in performance was obtained by training a separate network to estimate only the upper-body joints and merging the outputs together.
Unfortunately, the two most recent works on 3D pose estimation on the Human3.6M dataset [13, 19] do not report scores on the official test set, making it hard to compare our works directly. However, they do report average MPJPE scores of 124 mm ([19]) and 113 mm ([13]) on two male subjects (S9 and S11, which are in our training set).
Additionally, a comparison was performed with a 2D-convolution-based model with an otherwise identical architecture and training procedure. Our 3D CNN architecture outperforms this 2D CNN network even without the postprocessing step, suggesting that modelling temporal dynamics improves 3D human pose estimation, perhaps due to inherent body-joint trajectory tracking.
The average processing time per 5-frame sample during testing was about 1 ms on an Nvidia GTX 1080 GPU and about 13 ms on an Intel Xeon E5 CPU, implying real-time frame rates.
6 Conclusions
A discriminative 3D CNN model was implemented for the task of human pose estimation in 3D coordinate space from 2D RGB video data. To the best of our knowledge, this is the first attempt to utilize 3D convolutions for this task. It was shown that such a model can cope with 3D human pose estimation in videos and outperform existing methods on the Human3.6M dataset. The proposed model was officially tested on the dataset provider's evaluation server and outperformed the other reported results while running at real-time processing speeds. These results suggest that time can be successfully encoded as an additional convolutional dimension for the task of modelling real-world objects from sequences of 2D images.
Future Work
There are several directions in which this work could be extended: further hyperparameter tuning and greater computational resources could lead to more accurate estimations; the model's capabilities could be tested on other available datasets; and the temporal window could be expanded and/or the proposed model combined with recurrent neural networks, which are known for their ability to process temporal information.
References
 [1] Du, Y., Huang, Y., Peng, J.: Full-Body human pose estimation from monocular video sequence via Multi-Dimensional boosting regression. In: Computer Vision - ACCV 2014 Workshops. pp. 531–544. Springer (2014)
 [2] Fan, X., Zheng, K., Lin, Y., Wang, S.: Combining local appearance and holistic view: Dual-Source deep neural networks for human pose estimation. CoRR abs/1504.07159 (2015), http://arxiv.org/abs/1504.07159
 [3] Freyd, J.J.: Dynamic mental representations. Psychological review 94(4), 427 (1987)
 [4] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International conference on artificial intelligence and statistics. pp. 249–256 (2010)
 [5] Grinciunaite, A.: Development of a Deep Learning Model for 3D Human Pose Estimation in Monocular Videos. Master's thesis, Vilniaus Gedimino Technikos Universitetas (2016)
 [6] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing Human-Level performance on ImageNet classification. CoRR abs/1502.01852 (2015), http://arxiv.org/abs/1502.01852
 [7] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)
 [8] Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (Jan 2013), http://dx.doi.org/10.1109/TPAMI.2012.59
 [9] Jones, M.R.: Time, our lost dimension: Toward a new theory of perception, attention, and memory. Psychological Review 83, 323–355 (1976)
 [10] Oberweger, M., Wohlhart, P., Lepetit, V.: Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807 (2015)
 [11] Pfister, T., Simonyan, K., Charles, J., Zisserman, A.: Deep convolutional neural networks for efficient pose estimation in gesture videos. In: Asian Conference on Computer Vision (ACCV) (2014)
 [12] Qian, N.: On the momentum term in gradient descent learning algorithms. Neural networks 12(1), 145–151 (1999)
 [13] Tekin, B., Sun, X., Wang, X., Lepetit, V., Fua, P.: Predicting people’s 3D poses from short sequences. arXiv preprint arXiv:1504.08200 (2015)
 [14] Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. CoRR abs/1411.4280 (2014), http://arxiv.org/abs/1411.4280
 [15] Toshev, A., Szegedy, C.: DeepPose: Human pose estimation via deep neural networks. CoRR abs/1312.4659 (2013), http://arxiv.org/abs/1312.4659
 [16] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV). pp. 4489–4497. IEEE (2015)
 [17] Wang, C., Wang, Y., Lin, Z., Yuille, A., Gao, W.: Robust estimation of 3D human poses from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2361–2368 (2014)
 [18] Zhou, F., De la Torre, F.: Spatio-Temporal matching for human pose estimation in video (2016)
 [19] Zhou, X., Zhu, M., Leonardos, S., Derpanis, K., Daniilidis, K.: Sparseness meets deepness: 3D human pose estimation from monocular video. arXiv preprint arXiv:1511.09439 (2015)