Estimating 3D Trajectories from 2D Projections via Disjunctive Factored Four-Way Conditional Restricted Boltzmann Machines

04/20/2016 ∙ by Decebal Constantin Mocanu, et al. ∙ TU Eindhoven

Estimation, recognition, and near-future prediction of 3D trajectories based on their two-dimensional projections available from one camera source is an exceptionally difficult problem due to uncertainty in the trajectories and environment, the high dimensionality of the trajectory states, the lack of sufficient labeled data, and so on. In this article, we propose a solution to this problem based on a novel deep learning model dubbed Disjunctive Factored Four-Way Conditional Restricted Boltzmann Machine (DFFW-CRBM). Our method improves state-of-the-art deep learning techniques for high-dimensional time-series modeling by introducing a novel tensor factorization capable of driving fourth-order Boltzmann machines to considerably lower energy levels, at no additional computational cost. DFFW-CRBMs are capable of accurately estimating, recognizing, and performing near-future prediction of three-dimensional trajectories from their 2D projections while requiring a limited amount of labeled data. We evaluate our method on both simulated and real-world data, showing its effectiveness in predicting and classifying complex ball trajectories and human activities.




1 Introduction

Estimating and predicting trajectories in three-dimensional spaces based on two-dimensional projections available from one camera source is an open problem with wide-ranging applicability including entertainment 3dtrajrobotics , medicine 3dtrajmedicie , biology  3dtrajbiology , physics 3dtrajdarkmatter , etc. Unfortunately, solving this problem is exceptionally difficult due to a variety of challenges, such as the variability of states of the trajectories, partial occlusions due to self articulation and layering of objects in the scene, and the loss of 3D information resulting from observing trajectories through 2D planar image projections. A variety of techniques have considered variants of this problem by incorporating additional sensors, e.g., cameras 3dtrajwith3cameras , radars 3dtrajwithradar , which provide new data for geometric solvers allowing for accurate estimation and prediction. Though compelling, the success of these methods arrives at increased costs (e.g., incorporating new sensors) and computational complexities (e.g., handling more inputs geometrically).

The problem above, however, can be framed as a time-series estimation and prediction one, for which numerous machine learning algorithms can be applied. An emerging trend in machine learning for computer vision and pattern recognition is deep learning (DL), which has been successfully applied in a variety of fields, e.g., multi-class classification Larochelle+Bengio-2008 , collaborative filtering Salakhutdinov07restrictedboltzmann , image quality assessment mocanu2014deep , reinforcement learning, transfer learning ecml2013dec , information retrieval Gehler06therate , depth estimation 3ddepthnips2014 , face recognition escalerafacerecognition , and activity recognition escaleraactivityrecognition .

Most related to this work are temporal-based deep learners, e.g., Shotton13 ; temporalrbm , which we briefly review next. Extending standard restricted Boltzmann machines (RBMs) originalrbm , temporal RBMs (TRBMs) consider a succession of RBMs, one for each time frame, allowing them to perform accurate prediction and estimation of time series. Due to their complexity, such naive extensions require high computational effort before acquiring acceptable behavior. Conditional RBMs (CRBMs) remedy this problem by proposing an alternative extension of RBMs taylorcrbmicml . Here, the architecture consists of two separate visible layers, representing history (i.e., values from previous time frames) and current values, and a hidden layer for latent correlation discovery. Though successful, CRBMs are only capable of modeling time series data with relatively “smooth” variations and, like other state-of-the-art neural network architectures for time series (e.g., recurrent neural networks), they cannot learn different types of time series within the same model. Thus, to model different types of non-linear time variations within the same model, the authors in taylorcrbmicml extend CRBMs by allowing for a three-way weight tensor connection among the different layers. Computational complexity is then reduced by adopting a factored version (i.e., FCRBMs) of the weight tensor, which leads to a construction exhibiting accurate modeling and prediction results in a variety of experiments, including human motion styles gwtaylorhdts . However, these methods fail to perform both classification and regression in one unified framework. Recently, Factored Four-Way Conditional Restricted Boltzmann Machines (FFW-CRBMs) have been proposed ffwcrbmprl . These extend FCRBMs by incorporating a label layer and a four-way weight tensor connection among the layers to modulate the weights for capturing subtle temporal differences. This construction allowed FFW-CRBMs to perform both classification and real-valued prediction within the same model, and to outperform state-of-the-art specialized methods for classification or prediction ffwcrbmprl .

Contributions: In this paper we first propose the use of FFW-CRBMs to estimate 3D trajectories from their 2D projections while simultaneously classifying those trajectories. Though successful, we discovered that FFW-CRBMs require a substantial amount of labeled data before achieving acceptable performance when predicting three-dimensional trajectories from two-dimensional projections. Because such three-dimensional labeled information is not typically available, we secondly remedy this problem by proposing an extension of FFW-CRBMs, dubbed Disjunctive FFW-CRBMs (DFFW-CRBMs). Our extension refines the factoring of the four-way weight tensor connecting the machine layers for settings where labeled data is scarce. Adopting such a factorization “specializes” FFW-CRBMs and ensures lower energy levels (approximately three times less energy on the overall dataset). As a result, a reduced training dataset suffices for DFFW-CRBMs to reach classification performance similar to state-of-the-art methods and to at least double the performance on real-valued predictions. Importantly, these accuracy improvements come at the same computational cost as FFW-CRBMs. Precisely, our machine requires limited labeled data (less than 10% of the overall dataset) for: i) simultaneously classifying and predicting three-dimensional trajectories based on their two-dimensional projections, and ii) accurately estimating three-dimensional postures up to an arbitrary number of time-steps in the future.

We have extensively tested DFFW-CRBMs on both simulated and real-world data to show that they are capable of outperforming state-of-the-art methods in real-valued prediction and classification. In the first set of experiments we evaluate their performance by predicting and classifying simulated three-dimensional ball trajectories (based on a real-world physics simulator) thrown with different initial spins. Given these successes, in the second set of experiments we predict and classify high-dimensional human poses and activities (up to 32 human skeleton joints in 2D and 3D coordinate systems, corresponding to 160 dimensions) using real-world data, showing that DFFW-CRBMs double the accuracy of competing methods at reduced labeled data sizes.

2 Background

This section provides relevant background knowledge essential to the remainder of the paper. Firstly, restricted Boltzmann machines (RBMs), being at the basis of our proposed method, are surveyed. Secondly, Contrastive Divergence, a training algorithm for Deep Learning methods, is presented. The section concludes with a brief description of deep-learning based models for time series prediction and classification.

2.1 Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) originalrbm are energy-based models for unsupervised learning. They use a generative model of the distribution of training data for prediction mocanugenerativereplay . These models employ stochastic nodes and layers, making them less vulnerable to local minima gwtaylorhdts . Further, due to their stochastic neural configurations, RBMs possess excellent generalization and density estimation capabilities bengiodl ; mocanumljxbm .

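Formally, an RBM consists of visible and hidden binary layers connected by an undirected bipartite graph. More exactly, the visible layer $\mathbf{v} = [v_1, \dots, v_{n_v}]$ collects all visible units $v_i$ and represents the real data, while the hidden layer $\mathbf{h} = [h_1, \dots, h_{n_h}]$, representing all the hidden units $h_j$, increases the learning capability by enlarging the class of distributions that can be represented to an arbitrary complexity. $n_v$ and $n_h$ are the number of neurons in the visible and hidden layers, respectively. $W_{ij}$ denotes the weight connection between the $i$-th visible and $j$-th hidden unit, and $v_i$ and $h_j$ denote the state of the $i$-th visible and $j$-th hidden unit, respectively. The matrix of all weights between the layers is given by $\mathbf{W} \in \mathbb{R}^{n_v \times n_h}$. The energy function of RBMs is given by

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{n_v} \sum_{j=1}^{n_h} v_i h_j W_{ij} - \sum_{i=1}^{n_v} v_i a_i - \sum_{j=1}^{n_h} h_j b_j \tag{1}$$

where $a_i$ and $b_j$ represent the biases of the visible and hidden layers, respectively. The joint probability of a visible and hidden configuration can be written as

$$P(\mathbf{v}, \mathbf{h}) = \frac{\exp(-E(\mathbf{v}, \mathbf{h}))}{Z}$$

with $Z = \sum_{\mathbf{x}, \mathbf{y}} \exp(-E(\mathbf{x}, \mathbf{y}))$. The marginal distribution, $p(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h})$, can be used to determine the probability of a data point represented by a state $\mathbf{v}$.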
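To make the RBM energy, the partition function, and the marginal over visible states concrete, here is a minimal plain-Python sketch for a toy binary RBM. The function names and the toy parameters are illustrative assumptions, and the brute-force enumeration of the partition function is only feasible for a handful of units.

```python
import itertools
import math

def rbm_energy(v, h, W, a, b):
    """E(v, h) = -sum_ij v_i h_j W_ij - sum_i v_i a_i - sum_j h_j b_j."""
    pair = sum(v[i] * h[j] * W[i][j] for i in range(len(v)) for j in range(len(h)))
    return -pair - sum(vi * ai for vi, ai in zip(v, a)) - sum(hj * bj for hj, bj in zip(h, b))

def partition_function(W, a, b):
    """Z: sum of exp(-E) over every binary visible/hidden configuration."""
    return sum(
        math.exp(-rbm_energy(x, y, W, a, b))
        for x in itertools.product((0, 1), repeat=len(a))
        for y in itertools.product((0, 1), repeat=len(b))
    )

def joint_probability(v, h, W, a, b):
    return math.exp(-rbm_energy(v, h, W, a, b)) / partition_function(W, a, b)

def marginal_visible(v, W, a, b):
    """p(v): the joint probability summed over all hidden configurations."""
    return sum(
        joint_probability(v, h, W, a, b)
        for h in itertools.product((0, 1), repeat=len(b))
    )

# Toy parameters (hypothetical): 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
a = [0.0, 0.1, -0.1]
b = [0.2, -0.3]
```

Summing `marginal_visible` over all eight visible configurations returns 1 up to rounding, which is a quick sanity check on the normalization.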

2.2 Training an RBM via Contrastive Divergence

The RBM's parameters are trained by maximizing the likelihood function, typically by following the gradient of the energy function. Unfortunately, in RBMs, maximum likelihood estimation cannot be applied directly due to intractability problems. These problems can be circumvented by using Contrastive Divergence (CD) hintoncd to train the RBM. In CD, learning follows the gradient of:

$$CD_n \propto D_{KL}\big(p_0(\mathbf{x}) \,\|\, p_\infty(\mathbf{x})\big) - D_{KL}\big(p_n(\mathbf{x}) \,\|\, p_\infty(\mathbf{x})\big)$$

where $p_n(\cdot)$ is the distribution of a Markov chain running for $n$ steps and $D_{KL}(\cdot \| \cdot)$ symbolizes the Kullback-Leibler divergence Ponti2017470 . To find the update rules for the free parameters of the RBM (i.e., weights and biases), the RBM's energy function from Equation 1 has to be differentiated with respect to those parameters. Thus, in $CD_n$ the weight updates are done as follows:

$$W_{ij}^{\tau+1} = W_{ij}^{\tau} + \alpha \left( \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_n \right)$$

where $\tau$ is the iteration number, $\alpha$ is the learning rate, $\langle v_i h_j \rangle_0 = \frac{1}{N_I} \sum_{k=1}^{N_I} v_i^{(k)} P(h_j^{(k)} = 1 \mid \mathbf{v}^{(k)}; \mathbf{W})$, and $\langle v_i h_j \rangle_n = \frac{1}{N_I} \sum_{k=1}^{N_I} v_i^{(k)(n)} P(h_j^{(k)(n)} = 1 \mid \mathbf{v}^{(k)(n)}; \mathbf{W})$, where $N_I$ is the total number of input instances, and the superscript $(k)$ shows the $k$-th input instance. The superscript $(n)$ indicates that the states are obtained after $n$ steps of Gibbs sampling on a Markov chain which starts at the original data distribution $p_0(\cdot)$. In practice, learning can be performed using just one step of Gibbs sampling, which is carried out in four sub-steps: (1) initialize the visible units, (2) infer all the hidden units, (3) infer all the visible units, and (4) update the weights and the biases.
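The four sub-steps of one-step CD can be sketched for a binary RBM as follows. This is a simplified illustration on a single training vector, not the authors' implementation; the helper names, the in-place update style, and the use of hidden probabilities (rather than samples) in the statistics are assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v)))) for j in range(len(b))]

def visible_probs(h, W, a):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h)))) for i in range(len(a))]

def sample(probs, rng):
    """Draw one binary state per unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v0, W, a, b, lr, rng):
    """One CD-1 update on a single binary vector v0; W, a, b are modified in place."""
    # (1) the visible units are initialized with the data, (2) infer the hidden units
    ph0 = hidden_probs(v0, W, b)
    h0 = sample(ph0, rng)
    # (3) infer the visible units (reconstruction) and the hidden probabilities again
    v1 = sample(visible_probs(h0, W, a), rng)
    ph1 = hidden_probs(v1, W, b)
    # (4) update weights and biases with data statistics minus reconstruction statistics
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return v1
```

In practice the statistics would be averaged over a mini-batch of instances before the update, matching the averages over input instances in the text.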

2.3 Factored Conditional Restricted Boltzmann Machine

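Conditional Restricted Boltzmann Machines (CRBM) gwtaylorhdts are an extension of RBMs used to model time series data, for example, human activities. They use an undirected model with binary hidden variables $\mathbf{h}$ connected to real-valued visible ones $\mathbf{v}$. At each time step $t$, the hidden and visible nodes receive a connection from the visible variables at the last $N$ time-steps. The history of the real-world values until time $t$ is collected in the real-valued history vector $\mathbf{v}_{<t}$, with $n_{v_{<t}}$ being the number of elements in $\mathbf{v}_{<t}$. The total energy of the CRBM is given by:

$$E(\mathbf{v}_t, \mathbf{h}_t \mid \mathbf{v}_{<t}, \mathbf{W}) = \sum_{i=1}^{n_v} \frac{(v_{i,t} - \hat{a}_{i,t})^2}{2} - \sum_{j=1}^{n_h} \hat{b}_{j,t} h_{j,t} - \sum_{i=1}^{n_v} \sum_{j=1}^{n_h} W_{ij} v_{i,t} h_{j,t}$$

where $\hat{a}_{i,t} = a_i + \sum_{k} A_{ki} v_{k,<t}$ and $\hat{b}_{j,t} = b_j + \sum_{k} B_{kj} v_{k,<t}$ represent the “dynamic biases”, with $k$ being the index of the elements from $\mathbf{v}_{<t}$, and $A_{ki}$, $B_{kj}$ the directed weights from the history to the visible and hidden units.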
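The “dynamic biases” are the static biases shifted by a linear function of the history vector. A minimal sketch follows, assuming directed weight matrices `A` (history to visible) and `B` (history to hidden) stored as nested lists; these names follow common CRBM notation and are assumptions here.

```python
def dynamic_biases(a, b, A, B, history):
    """Return (a_hat, b_hat), the history-dependent biases of a CRBM.

    a_hat[i] = a[i] + sum_k A[k][i] * history[k]
    b_hat[j] = b[j] + sum_k B[k][j] * history[k]
    """
    n_hist = len(history)
    a_hat = [a[i] + sum(A[k][i] * history[k] for k in range(n_hist)) for i in range(len(a))]
    b_hat = [b[j] + sum(B[k][j] * history[k] for k in range(n_hist)) for j in range(len(b))]
    return a_hat, b_hat

# Toy example (hypothetical values): 3 history elements, 2 visible units, 1 hidden unit.
a_hat, b_hat = dynamic_biases(
    a=[0.0, 1.0],
    b=[0.5],
    A=[[1, 0], [0, 2], [1, 1]],
    B=[[1], [1], [1]],
    history=[1.0, 2.0, 3.0],
)
```

Because the history only enters through these biases, inference at time $t$ proceeds as in an ordinary RBM once `a_hat` and `b_hat` have been computed.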

Taylor and Hinton introduced the Factored Conditional Restricted Boltzmann Machine (FCRBM) gwtaylorhdts , which permits the modeling of different styles of time series within the same model, due to the introduction of multiplicative, three-way interactions and of a preset style label, $\mathbf{y}_t$. To reduce the computational complexity of this model, they factored the third-order tensors between layers into products of matrices. Formally, the FCRBM defines a joint probability distribution over the visible $\mathbf{v}_t$ and hidden $\mathbf{h}_t$ neurons. The joint distribution is conditioned on the past $N$ observations, $\mathbf{v}_{<t}$, model parameters, $\theta$, and the preset style label, $\mathbf{y}_t$. Interested readers are referred to gwtaylorhdts for a more comprehensive discussion on CRBMs and FCRBMs.

2.4 Four-Way Conditional Restricted Boltzmann Machines

Due to the limitations exhibited by FCRBMs, e.g., the impossibility of performing classification without extensions, we proposed the four-way conditional restricted Boltzmann machines (FW-CRBMs) for performing prediction and classification in one unified framework ffwcrbmprl . FW-CRBMs introduced an additional layer and a four-way multiplicative weight tensor interaction between neurons. Please note that other four-way models have been proposed since, but they can perform only classification and not prediction Elaiwat2016152 .

Figure 1: A high level depiction of the FFW-CRBM showing the four layer configuration and the factored weight tensor connection among them. Gaussian nodes shown on the history and visible layers represent real-valued inputs, while sigmoidal nodes on the hidden and label layers demonstrate binary values.

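FW-CRBMs extended FCRBMs to include a label layer and a fourth-order weight tensor connection $\mathbf{W} \in \mathbb{R}^{n_v \times n_h \times n_{v_{<t}} \times n_l}$, where $n_v$, $n_h$, $n_{v_{<t}}$, $n_l$ represent the number of neurons from the present, hidden, history and label layers, respectively. Though successful, FW-CRBMs exhibited high computational complexities (i.e., $\mathcal{O}(n^4)$) for tuning free parameters. Circumventing these problems, we factored the weight tensor into sums of products leading to more efficient machines (i.e., $\mathcal{O}(n^2)$) labeled as factored four-way conditional restricted Boltzmann machines (FFW-CRBMs). FFW-CRBMs, shown in Figure 1, minimize the following energy functional

$$E(\mathbf{v}_t, \mathbf{h}_t, \mathbf{l}_t \mid \mathbf{v}_{<t}, \Theta) = \sum_{i=1}^{n_v} \frac{(v_{i,t} - a_i)^2}{2} - \sum_{j=1}^{n_h} h_{j,t} b_j - \sum_{o=1}^{n_l} l_{o,t} c_o - \sum_{f=1}^{F} \Big( \sum_{i} W^{v}_{if} v_{i,t} \Big) \Big( \sum_{j} W^{h}_{jf} h_{j,t} \Big) \Big( \sum_{k} W^{v_{<t}}_{kf} v_{k,<t} \Big) \Big( \sum_{o} W^{l}_{of} l_{o,t} \Big) \tag{4}$$

where $F$ is the number of factors and $i$, $j$, $k$, and $o$ are the indices of the visible layer neurons $\mathbf{v}_t$, the hidden layer neurons $\mathbf{h}_t$, the history layer neurons $\mathbf{v}_{<t}$ and the label layer neurons $\mathbf{l}_t$, respectively, with $a_i$, $b_j$, and $c_o$ the biases of the visible, hidden, and label units. $\mathbf{W}^{v}$, $\mathbf{W}^{h}$, $\mathbf{W}^{l}$ symbolize the bidirectional and symmetric weights from the visible, hidden and label layers to the factors, respectively, while $\mathbf{W}^{v_{<t}}$ represents the directed weights from the history layer to the factors. As in the case of the three-way models modellingjointdensities , standard CD is also unsuccessful in training the four-way models, due to the need of predicting two output layers (i.e., label and present layers). Thus, in ffwcrbmprl we proposed a sequential variant of CD, named sequential Markov chain contrastive divergence, more suitable for tuning the free parameters in FW-CRBMs.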
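To illustrate why factoring the four-way tensor helps, the sketch below checks numerically that a sum over factors of four per-layer projections equals the contraction with the implied full fourth-order tensor, and compares the number of weights in each form. The function names are hypothetical and the layer sizes in the usage note are chosen only for illustration.

```python
import random

def factor_inputs(W, x):
    """Project a layer onto the factors: sum_u W[u][f] * x[u] for each factor f."""
    return [sum(W[u][f] * x[u] for u in range(len(x))) for f in range(len(W[0]))]

def factored_interaction(Wv, Wh, Wp, Wl, v, h, past, l):
    """sum_f (W^v v)_f (W^h h)_f (W^{v<t} v<t)_f (W^l l)_f, one projection per layer."""
    fv, fh, fp, fl = (
        factor_inputs(W, x) for W, x in ((Wv, v), (Wh, h), (Wp, past), (Wl, l))
    )
    return sum(fv[f] * fh[f] * fp[f] * fl[f] for f in range(len(fv)))

def full_tensor_interaction(Wv, Wh, Wp, Wl, v, h, past, l):
    """Same quantity via the explicit tensor W_ijko = sum_f Wv_if Wh_jf Wp_kf Wl_of."""
    n_f = len(Wv[0])
    total = 0.0
    for i in range(len(v)):
        for j in range(len(h)):
            for k in range(len(past)):
                for o in range(len(l)):
                    w_ijko = sum(
                        Wv[i][f] * Wh[j][f] * Wp[k][f] * Wl[o][f] for f in range(n_f)
                    )
                    total += w_ijko * v[i] * h[j] * past[k] * l[o]
    return total

def parameter_counts(n_v, n_h, n_past, n_l, n_f):
    """Number of weights in the (unfactored, factored) parameterization."""
    return n_v * n_h * n_past * n_l, n_f * (n_v + n_h + n_past + n_l)
```

For example, `parameter_counts(5, 10, 100, 4, 40)` gives 20000 weights for the full tensor against 4760 for the factored form; the hidden size of 10 and the 40 factors are hypothetical values, not the settings used in the experiments.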

FFW-CRBMs have shown good generalization and time series latent feature learning capabilities compared to state-of-the-art techniques including, but not limited to, support vector machines, CRBMs, and FCRBMs ffwcrbmprl . It is for these reasons that we believe that FFW-CRBMs can serve as a basis for predicting three-dimensional trajectories from two-dimensional projections. Unfortunately, FFW-CRBMs are not readily applicable to such a problem as they require a substantial amount of labeled data for successful tuning. In this paper, we extend FFW-CRBMs to Disjunctive FFW-CRBMs (DFFW-CRBMs) by proposing a novel factoring process essential for predicting and classifying 3D trajectories from 2D projections. Our model, detailed next, reduces the sample complexity of current methods and allows for lower energy levels compared to FFW-CRBMs, leading to improved performance.

3 Disjunctive Factored Four-Way Conditional Restricted Boltzmann Machines

Figure 2: A high level depiction of DFFW-CRBMs showing the four layer configuration and the refined tensors factoring for increased accuracy and efficiency.

This section details disjunctive factored four-way conditional restricted Boltzmann machines (DFFW-CRBMs), shown in Figure 2. Similarly to FFW-CRBMs, our model consists of four layers to represent visible, history, hidden, and label units. Contrary to the factoring adopted by FFW-CRBMs, however, our model incorporates two new factoring layers. The first, denoted $F_1$ here, is responsible for specializing the machine to real-valued predictions through $\mathbf{W}^{v1}$, $\mathbf{W}^{h1}$, $\mathbf{W}^{l1}$, and $\mathbf{W}^{v_{<t}1}$, while the second, $F_2$, specializes the machine to classification through the corresponding weight tensor collections. Such a specialization is responsible for reducing the sample complexity needed by DFFW-CRBMs for successful parameter tuning, as demonstrated in Section 4, while the computational complexity of DFFW-CRBM remains the same as for FFW-CRBM (i.e., $\mathcal{O}(n^2)$). Given our novel construction, DFFW-CRBMs require their own special mathematical treatment. Next, we detail the energy functional and the learning rules needed by DFFW-CRBMs.

3.1 DFFW-CRBM’s Energy Function

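The energy function of DFFW-CRBMs consists of three major terms. The first, i.e., $E_{\mathrm{std}}$, corresponds to the standard energy representing a specific submachine of DFFW-CRBMs (i.e., the energy given by the neurons of each layer and their biases), while the second two, $E_{F_1}$ and $E_{F_2}$, denote energies related to the first and second factoring layers, respectively:

$$E(\mathbf{v}_t, \mathbf{h}_t, \mathbf{l}_t \mid \mathbf{v}_{<t}, \Theta) = E_{\mathrm{std}} + E_{F_1} + E_{F_2} \tag{5}$$

$$E_{\mathrm{std}} = \sum_{i=1}^{n_v} \frac{(v_{i,t} - a_i)^2}{2} - \sum_{j=1}^{n_h} h_{j,t} b_j - \sum_{o=1}^{n_l} l_{o,t} c_o$$

$$E_{F_1} = -\sum_{f=1}^{F_1} \Big( \sum_{i} W^{v1}_{if} v_{i,t} \Big) \Big( \sum_{j} W^{h1}_{jf} h_{j,t} \Big) \Big( \sum_{k} W^{v_{<t}1}_{kf} v_{k,<t} \Big) \Big( \sum_{o} W^{l1}_{of} l_{o,t} \Big)$$

$$E_{F_2} = -\sum_{g=1}^{F_2} \Big( \sum_{i} W^{v2}_{ig} v_{i,t} \Big) \Big( \sum_{j} W^{h2}_{jg} h_{j,t} \Big) \Big( \sum_{k} W^{v_{<t}2}_{kg} v_{k,<t} \Big) \Big( \sum_{o} W^{l2}_{og} l_{o,t} \Big)$$

Here, $F_1$ denotes the total number of factors for the weight tensor collection specializing DFFW-CRBMs to regression, while $F_2$ is the total number of factors responsible for classification. $i$, $j$, $k$, and $o$ represent the indices of the visible layer neurons $\mathbf{v}_t$, the hidden layer neurons $\mathbf{h}_t$, the history layer neurons $\mathbf{v}_{<t}$ and the label layer neurons $\mathbf{l}_t$, respectively. Furthermore, $\mathbf{W}^{v1}$ and $\mathbf{W}^{h1}$ represent the bidirectional and symmetric weight connections from the visible and hidden layers to the factors, while $\mathbf{W}^{l1}$ and $\mathbf{W}^{v_{<t}1}$ denote the directed weights from the label and history layers to the factors. Similarly, $\mathbf{W}^{l2}$ and $\mathbf{W}^{h2}$ represent the bidirectional and symmetric weights from the label and hidden layers to the factors, while $\mathbf{W}^{v2}$ and $\mathbf{W}^{v_{<t}2}$ denote the directed weights from the visible and history layers to the factors. Finally, the two groups of four weight matrices, marked with the superscripts $1$ and $2$, belong to the factorized tensor specialization in regression and classification, respectively.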
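The three-term structure of the energy (bias energy plus one factored four-way term per factoring layer) can be sketched as below. This is a hedged illustration, not the authors' code: unit-variance Gaussian visible units and the sign convention are assumptions, and `group1` and `group2` are hypothetical names for the regression and classification weight collections, each ordered as (visible, hidden, history, label).

```python
def factor_inputs(W, x):
    """Project a layer onto the factors: sum_u W[u][f] * x[u] for each factor f."""
    return [sum(W[u][f] * x[u] for u in range(len(x))) for f in range(len(W[0]))]

def factored_term(weights, layers):
    """Return -sum_f of the product of the four layer projections onto factor f."""
    projections = [factor_inputs(W, x) for W, x in zip(weights, layers)]
    total = 0.0
    for per_factor in zip(*projections):
        product = 1.0
        for value in per_factor:
            product *= value
        total += product
    return -total

def dffw_energy(v, h, past, l, biases, group1, group2):
    """Bias energy plus the two factored four-way terms (assumed sign convention)."""
    a, b, c = biases
    bias_energy = (
        sum((vi - ai) ** 2 / 2.0 for vi, ai in zip(v, a))  # Gaussian visible units, unit variance assumed
        - sum(hj * bj for hj, bj in zip(h, b))             # binary hidden units
        - sum(lo * co for lo, co in zip(l, c))             # binary label units
    )
    layers = (v, h, past, l)
    return bias_energy + factored_term(group1, layers) + factored_term(group2, layers)
```

Each weight group may use a different number of factors, since the two factored terms are summed independently.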

3.2 DFFW-CRBM’s Activation Probabilities

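Inference for DFFW-CRBM corresponds to determining values of the activation probabilities for each of the units. As shown in Figure 2, units within the same layer do not share connections. This allows for parallel probability computation for all units within the same layer. The overall input of each hidden $s^{h}_{j,t}$, visible $s^{v}_{i,t}$, and label $s^{l}_{o,t}$ unit is given by:

$$s^{h}_{j,t} = b_j + \sum_{f=1}^{F_1} W^{h1}_{jf} \Big( \sum_{i} W^{v1}_{if} v_{i,t} \Big) \Big( \sum_{k} W^{v_{<t}1}_{kf} v_{k,<t} \Big) \Big( \sum_{o} W^{l1}_{of} l_{o,t} \Big) + \sum_{g=1}^{F_2} W^{h2}_{jg} \Big( \sum_{i} W^{v2}_{ig} v_{i,t} \Big) \Big( \sum_{k} W^{v_{<t}2}_{kg} v_{k,<t} \Big) \Big( \sum_{o} W^{l2}_{og} l_{o,t} \Big)$$

$$s^{v}_{i,t} = a_i + \sum_{f=1}^{F_1} W^{v1}_{if} \Big( \sum_{j} W^{h1}_{jf} h_{j,t} \Big) \Big( \sum_{k} W^{v_{<t}1}_{kf} v_{k,<t} \Big) \Big( \sum_{o} W^{l1}_{of} l_{o,t} \Big) + \sum_{g=1}^{F_2} W^{v2}_{ig} \Big( \sum_{j} W^{h2}_{jg} h_{j,t} \Big) \Big( \sum_{k} W^{v_{<t}2}_{kg} v_{k,<t} \Big) \Big( \sum_{o} W^{l2}_{og} l_{o,t} \Big)$$

$$s^{l}_{o,t} = c_o + \sum_{f=1}^{F_1} W^{l1}_{of} \Big( \sum_{i} W^{v1}_{if} v_{i,t} \Big) \Big( \sum_{j} W^{h1}_{jf} h_{j,t} \Big) \Big( \sum_{k} W^{v_{<t}1}_{kf} v_{k,<t} \Big) + \sum_{g=1}^{F_2} W^{l2}_{og} \Big( \sum_{i} W^{v2}_{ig} v_{i,t} \Big) \Big( \sum_{j} W^{h2}_{jg} h_{j,t} \Big) \Big( \sum_{k} W^{v_{<t}2}_{kg} v_{k,<t} \Big)$$

Consequently, for each of the hidden, visible, and label units, the activation probabilities can be determined as

$$p(h_{j,t} = 1 \mid \mathbf{v}_t, \mathbf{v}_{<t}, \mathbf{l}_t) = \mathrm{sig}(s^{h}_{j,t}), \qquad p(v_{i,t} = x \mid \mathbf{h}_t, \mathbf{v}_{<t}, \mathbf{l}_t) = \mathcal{N}(x;\, s^{v}_{i,t},\, 1), \qquad p(l_{o,t} = 1 \mid \mathbf{v}_t, \mathbf{h}_t, \mathbf{v}_{<t}) = \mathrm{sig}(s^{l}_{o,t})$$

where $\mathrm{sig}(x) = 1 / (1 + e^{-x})$ and $\mathcal{N}(x;\, \mu,\, 1)$ represents the Gaussian distribution with mean $\mu$ and unit variance.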

3.3 Parameter Tuning: Update Rules & Algorithm

3.3.1 Update Rules

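Generally, parameters, $\Theta$, are updated according to:

$$\Delta\Theta_{\tau+1} = \rho \, \Delta\Theta_{\tau} + \alpha \left( \nabla\Theta_{\tau+1} - \gamma \, \Theta_{\tau} \right)$$

where $\tau$ represents the update iteration, $\rho$ is the momentum, $\alpha$ denotes the learning rate, and $\gamma$ is the weight decay. A more detailed discussion on the choice of these parameters is provided by Hinton in hintontrain . The update rules are obtained by differentiating the energy functional with respect to the free parameters (i.e., the weight matrices and the biases of each of the layers). In DFFW-CRBMs, a set of eight free weight matrices, corresponding to the connections between the factors and each of the layers, has to be inferred. These are presented below. Intuitively, each of these update equations aims at minimizing the reconstruction error (i.e., the error between the original inputs and those reconstructed through the model). Moreover, each of the update equations includes three main terms representing the connections between the factored weights and the corresponding layers of the machine, as per Figure 2. For instance, connections to only the hidden, history, and label layers suffice for updating $\mathbf{W}^{v1}$. Writing $\phi^{v1}_{f} = \sum_{i} W^{v1}_{if} v_{i,t}$, $\phi^{h1}_{f} = \sum_{j} W^{h1}_{jf} h_{j,t}$, $\phi^{v_{<t}1}_{f} = \sum_{k} W^{v_{<t}1}_{kf} v_{k,<t}$, and $\phi^{l1}_{f} = \sum_{o} W^{l1}_{of} l_{o,t}$ for the inputs that factor $f$ receives from each layer (and analogously with superscript $2$ and index $g$), the update rules for each of the weights corresponding to the first factored layer can be computed as:

$$\Delta W^{v1}_{if} \propto \langle v_{i,t} \, \phi^{h1}_{f} \phi^{v_{<t}1}_{f} \phi^{l1}_{f} \rangle_0 - \langle v_{i,t} \, \phi^{h1}_{f} \phi^{v_{<t}1}_{f} \phi^{l1}_{f} \rangle_\lambda$$

$$\Delta W^{h1}_{jf} \propto \langle h_{j,t} \, \phi^{v1}_{f} \phi^{v_{<t}1}_{f} \phi^{l1}_{f} \rangle_0 - \langle h_{j,t} \, \phi^{v1}_{f} \phi^{v_{<t}1}_{f} \phi^{l1}_{f} \rangle_\lambda$$

$$\Delta W^{l1}_{of} \propto \langle l_{o,t} \, \phi^{v1}_{f} \phi^{h1}_{f} \phi^{v_{<t}1}_{f} \rangle_0 - \langle l_{o,t} \, \phi^{v1}_{f} \phi^{h1}_{f} \phi^{v_{<t}1}_{f} \rangle_\lambda$$

$$\Delta W^{v_{<t}1}_{kf} \propto \langle v_{k,<t} \, \phi^{v1}_{f} \phi^{h1}_{f} \phi^{l1}_{f} \rangle_0 - \langle v_{k,<t} \, \phi^{v1}_{f} \phi^{h1}_{f} \phi^{l1}_{f} \rangle_\lambda$$

while for the second factoring we have:

$$\Delta W^{v2}_{ig} \propto \langle v_{i,t} \, \phi^{h2}_{g} \phi^{v_{<t}2}_{g} \phi^{l2}_{g} \rangle_0 - \langle v_{i,t} \, \phi^{h2}_{g} \phi^{v_{<t}2}_{g} \phi^{l2}_{g} \rangle_\lambda$$

$$\Delta W^{h2}_{jg} \propto \langle h_{j,t} \, \phi^{v2}_{g} \phi^{v_{<t}2}_{g} \phi^{l2}_{g} \rangle_0 - \langle h_{j,t} \, \phi^{v2}_{g} \phi^{v_{<t}2}_{g} \phi^{l2}_{g} \rangle_\lambda$$

$$\Delta W^{l2}_{og} \propto \langle l_{o,t} \, \phi^{v2}_{g} \phi^{h2}_{g} \phi^{v_{<t}2}_{g} \rangle_0 - \langle l_{o,t} \, \phi^{v2}_{g} \phi^{h2}_{g} \phi^{v_{<t}2}_{g} \rangle_\lambda$$

$$\Delta W^{v_{<t}2}_{kg} \propto \langle v_{k,<t} \, \phi^{v2}_{g} \phi^{h2}_{g} \phi^{l2}_{g} \rangle_0 - \langle v_{k,<t} \, \phi^{v2}_{g} \phi^{h2}_{g} \phi^{l2}_{g} \rangle_\lambda$$

and for the biases are:

$$\Delta a_i \propto \langle v_{i,t} \rangle_0 - \langle v_{i,t} \rangle_\lambda, \qquad \Delta b_j \propto \langle h_{j,t} \rangle_0 - \langle h_{j,t} \rangle_\lambda, \qquad \Delta c_o \propto \langle l_{o,t} \rangle_0 - \langle l_{o,t} \rangle_\lambda$$

where $\lambda$ indexes the steps of a Markov chain running for a total of $n$ steps and starting at the original data distribution, $\langle \cdot \rangle_0$ denotes the expectation under the input data, and $\langle \cdot \rangle_\lambda$ represents the model's expectation.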
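The generic parameter update with momentum, learning rate, and weight decay described at the start of this subsection can be sketched as follows. The exact placement of the weight decay inside the learning-rate term is an assumption (one common form), and `grad` stands for the contrastive-divergence estimate, i.e., the data expectation minus the model expectation.

```python
def momentum_update(theta, delta_prev, grad, lr, momentum, weight_decay):
    """Apply delta = momentum * delta_prev + lr * (grad - weight_decay * theta).

    theta, delta_prev and grad are flat lists of equal length.
    Returns the updated parameters and the new delta (to be reused at the next step).
    """
    delta = [
        momentum * d + lr * (g - weight_decay * t)
        for t, d, g in zip(theta, delta_prev, grad)
    ]
    new_theta = [t + d for t, d in zip(theta, delta)]
    return new_theta, delta

# Toy call with hypothetical hyperparameter values.
theta, delta = momentum_update(
    theta=[1.0], delta_prev=[0.5], grad=[2.0],
    lr=0.1, momentum=0.9, weight_decay=0.01,
)
```

The same routine would be applied to each of the eight weight matrices and the three bias vectors, with `grad` computed from the corresponding update rule.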

3.3.2 Sequential CD for DFFW-CRBMs

Algorithm 1 presents a high-level description of the sequential Markov chain contrastive divergence ffwcrbmprl adapted to train DFFW-CRBMs. It shows the two main steps needed for training such machines. First, the visible layer is inferred by fixing the history and label layers. Second, the label layer is reconstructed by fixing the history and present layers. Updating the weights involves the implementation of the rules derived in the previous section. These two procedures are then repeated for a pre-specified number of epochs; at each epoch the reconstruction error decreases toward the minimum of the energy function, so as to minimize the divergence between the original data distribution and the one given by the model.

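Inputs: TD - training data, $n$ - number of Markov chain steps;
Initialization: initialize $\Theta$; set $\alpha$, $\rho$, $\gamma$;
for all epochs do
    for each Sample $\in$ TD do
        %% First Markov chain to reconstruct $\mathbf{v}_t$;
        Init $\mathbf{v}_t$ = 0, $\mathbf{l}_t$ = Sample.Label, $\mathbf{v}_{<t}$ = Sample.History;
        $\mathbf{h}_t$ = InferHiddenLayer($\mathbf{v}_t$, $\mathbf{l}_t$, $\mathbf{v}_{<t}$, $\Theta$);
        for $\lambda = 0$; $\lambda < n$; $\lambda$++ do
            %% Positive phase;
            %% Negative phase;
            $\mathbf{h}_t$ = InferHiddenLayer($\mathbf{v}_t$, $\mathbf{l}_t$, $\mathbf{v}_{<t}$, $\Theta$);
        end for
        %% Second Markov chain to reconstruct $\mathbf{l}_t$;
        Init $\mathbf{l}_t$ = 0, $\mathbf{v}_t$ = Sample.Present, $\mathbf{v}_{<t}$ = Sample.History;
        $\mathbf{h}_t$ = InferHiddenLayer($\mathbf{v}_t$, $\mathbf{l}_t$, $\mathbf{v}_{<t}$, $\Theta$);
        for $\lambda = 0$; $\lambda < n$; $\lambda$++ do
            %% Positive phase;
            %% Negative phase;
            $\mathbf{h}_t$ = InferHiddenLayer($\mathbf{v}_t$, $\mathbf{l}_t$, $\mathbf{v}_{<t}$, $\Theta$);
        end for
    end for
end for
Algorithm 1 Sequential Contrastive Divergence for DFFW-CRBMs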

4 Experiments and Results

This section extensively tests the performance of DFFW-CRBMs on both simulated and real-world datasets. The major goal of these experiments was to assess the capability of DFFW-CRBMs to predict three-dimensional trajectories from two-dimensional projections, given small amounts of labeled data (i.e., in the order of 9-10% of the total dataset). As a secondary objective, the goal was to classify such trajectories into different spins (ball trajectories) or activities (human pose estimation). In the real-valued prediction setting, we compared our method to state-of-the-art FFW-CRBMs and FCRBMs, while for classification our method's performance was tested against FFW-CRBMs and support vector machines with radial basis functions (SVM-RBFs) vapniksvm .

Evaluation Metrics: To assess the models’ performance, a variety of standard metrics were used. For classification, we used accuracy roc in percentages, while for estimation tasks, we used the Normalized Root Mean Square Error (NRMSE) estimating distance between the prediction and ground truth, Pearson Correlation Coefficient (PCC) reflecting the correlations between predictions and ground truth, and the P-value to arrive at statistically significant predictions.
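As a rough illustration of the regression metrics, the sketch below computes an NRMSE and the Pearson correlation coefficient in plain Python. The normalization of the RMSE by the range of the ground truth is an assumption, since the text does not state which normalization was used; the P-value is omitted here, and SciPy's `scipy.stats.pearsonr` reports it alongside the coefficient.

```python
import math

def nrmse_percent(pred, truth):
    """RMSE divided by the range of the ground truth, in percent (assumed normalization)."""
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))
    return 100.0 * rmse / (max(truth) - min(truth))

def pearson(pred, truth):
    """Pearson correlation coefficient between predictions and ground truth."""
    n = len(truth)
    mean_p = sum(pred) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in truth))
    return cov / (std_p * std_t)

# Toy example: a prediction offset by a constant is perfectly correlated
# with the ground truth but still has a non-zero NRMSE.
truth = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted = [1.0, 2.0, 3.0, 4.0, 5.0]
```

The toy example shows why both metrics are reported: correlation captures the shape of a trajectory, while NRMSE captures its offset from the ground truth.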

4.1 Ball Trajectory Experiments

We generated different ball trajectories thrown with different spins using the Bullet Physics Library (last accessed in November 2016). With this simulated dataset we targeted three objectives using small amounts (9%) of labeled training data. First, we estimated 3D ball coordinates based on their 2D projections at each time-step $t$ (i.e., one-step prediction). Second, we aimed at predicting near-future (i.e., a couple of time steps in the future) 3D ball coordinates recursively, while giving a limited 2D sequence of coordinates as a starting point. Third, we classified various ball spins based on just 2D coordinates. We used four trajectory classes corresponding to four different ball spin types. For each class, a set of 11 trajectories, each containing approximately 400 time-steps (amounting to a total of 17211 data instances), was sampled. To assess the performance of DFFW-CRBM, we performed 11-fold cross validation and reported mean and standard deviation results. Precisely, from each class of trajectories we used only one labeled trajectory to train the models, and the other 10 were used for testing. A labeled trajectory has complete information: the 3D ball coordinates, their 2D projections, and the spin (i.e., class).
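The cross-validation protocol (one labeled trajectory per class for training, the remaining ten for testing, repeated over 11 folds) can be sketched as a generator of index pairs. Pairing the same trajectory index across all classes within a fold is an assumption made for illustration.

```python
def one_trajectory_folds(n_classes=4, n_trajectories=11):
    """Yield (train, test) lists of (class_index, trajectory_index) pairs.

    In fold k, trajectory k of every class is used for training and the
    remaining trajectories of every class are used for testing.
    """
    for fold in range(n_trajectories):
        train = [(c, fold) for c in range(n_classes)]
        test = [
            (c, t)
            for c in range(n_classes)
            for t in range(n_trajectories)
            if t != fold
        ]
        yield train, test

folds = list(one_trajectory_folds())
labeled_fraction = len(folds[0][0]) / (len(folds[0][0]) + len(folds[0][1]))
```

With 4 classes and 11 trajectories per class, each fold trains on 4 of the 44 trajectories, i.e., about 9% of the data, which matches the labeled fraction quoted above.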

Figure 3: Averaged energy levels of FFW-CRBM and DFFW-CRBM over all ball trajectories when the parameters (i.e. number of hidden neurons and factors) are varying. The training was done for 100 epochs.
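| Task | Horizon | Metric | SVM-RBF | FCRBM | FFW-CRBM | DFFW-CRBM |
|---|---|---|---|---|---|---|
| Classification | | Accuracy [%] | 39.26 ± 4.63 | N/A | 37.49 ± 3.66 | 39.51 ± 4.47 |
| Present-step 3D estimation | | NRMSE [%] | N/A | 18.38 ± 8.07 | 19.53 ± 33.24 | 11.24 ± 8.53 |
| Present-step 3D estimation | | PCC | N/A | -0.06 ± 0.70 | 0.31 ± 0.72 | 0.62 ± 0.61 |
| Present-step 3D estimation | | P-value | N/A | 0.51 ± 0.29 | 0.40 ± 0.29 | 0.28 ± 0.27 |
| Multi-step 3D prediction | After 1 step | NRMSE [%] | N/A | 25.61 ± 3.25 | 23.53 ± 2.48 | 9.52 ± 6.12 |
| Multi-step 3D prediction | After 1 step | PCC | N/A | 0.14 ± 0.69 | 0.31 ± 0.74 | 0.95 ± 0.14 |
| Multi-step 3D prediction | After 1 step | P-value | N/A | 0.50 ± 0.29 | 0.38 ± 0.25 | 0.12 ± 0.18 |
| Multi-step 3D prediction | After 50 steps | NRMSE [%] | N/A | 31.38 ± 7.99 | 29.49 ± 8.14 | 19.93 ± 10.27 |
| Multi-step 3D prediction | After 50 steps | PCC | N/A | 0.05 ± 0.69 | -0.05 ± 0.72 | 0.20 ± 0.66 |
| Multi-step 3D prediction | After 50 steps | P-value | N/A | 0.51 ± 0.26 | 0.47 ± 0.26 | 0.51 ± 0.26 |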
Table 1: Classification, present step 3D estimation, and multi-step 3D prediction for the balls trajectories experiment. Results, cross-validated and presented with mean and standard deviation, show that our method is capable of outperforming state-of-the-art techniques on all evaluation metrics.

Deep Learner Setting: The visible layers of both models (i.e., FFW-CRBM and DFFW-CRBM) were set to 5 neurons, three denoting the 3D ball center coordinates (i.e., x, y, z), and two for its 2D projection at time $t$. The label layer consisted of 4 neurons (one for each of the different spin classes), while the history layers included 100 neurons corresponding to the last 50 history frames. One frame incorporates the 2D coordinates of the center of the ball projected in a two-dimensional space. The number of hidden neurons was set to , and the number of factors to , as discussed in the next paragraph, and in Subsection 4.2. A learning rate of and momentum of were chosen. Weight decay factors were set to , and the number of the Markov chain steps for CD in the training phase, but also for the Gibbs sampling in the testing phase, was set to 3. All weights were initialized with . Finally, data were normalized to have 0 mean and unit variance as explained in hintontrain , and the models were trained for 100 epochs.
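The layer sizes quoted above follow from how a trajectory is sliced into training instances: 5 visible values (3D position plus 2D projection at time $t$) and 100 history values (the previous 50 frames of 2D coordinates). A minimal sketch follows, with hypothetical function names and population statistics assumed for the normalization.

```python
import math

def make_samples(coords3d, coords2d, n_history=50):
    """Slice one trajectory into (visible, history) pairs.

    visible: 3D position and 2D projection at time t (5 values)
    history: flattened 2D projections of the previous n_history frames
    """
    samples = []
    for t in range(n_history, len(coords2d)):
        history = [c for frame in coords2d[t - n_history:t] for c in frame]
        visible = list(coords3d[t]) + list(coords2d[t])
        samples.append((visible, history))
    return samples

def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

# Toy trajectory with 60 frames (hypothetical values).
coords3d = [(0.1 * t, 0.2 * t, 0.3 * t) for t in range(60)]
coords2d = [(0.1 * t, 0.2 * t) for t in range(60)]
samples = make_samples(coords3d, coords2d)
```

The first 50 frames of each trajectory only serve as history, so a trajectory of 60 frames yields 10 instances.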

Importance of disjunctive Factoring: To find the optimal number of hidden neurons and factors, we performed an exhaustive search by varying the number of hidden neurons from 10 to 100 and the number of factors from 10 to 160. To gain some insight into the behavioral differences between FFW-CRBMs and DFFW-CRBMs, even though the energy equation of DFFW-CRBM has an extra tensor, in Figure 3 we illustrate on the same scale the heat-map of the averaged energy levels. They were computed using Equation 4 for FFW-CRBM and Equation 5 for DFFW-CRBM, after both models were trained for 100 epochs. Though both models acquire the lowest energy levels in a configuration starting with 10-20 hidden neurons and a number of factors larger than 100, these results highlight the importance of the disjunctive factoring introduced in the paper. Namely, DFFW-CRBMs always acquire lower energy levels compared to FFW-CRBMs due to their “specialized” tensor factoring. Moreover, by averaging the energy levels from the aforementioned figure, we found that the average energy level of DFFW-CRBM is approximately three times smaller than the one of FFW-CRBM (i.e. for DFFW-CRBM, and for FFW-CRBM), thus anticipating the more accurate performance results, as shown next.

Figures 4 and 5 compare the capabilities of DFFW-CRBMs and FFW-CRBMs on estimating different 3D trajectories of balls picked at random, showing that our method is capable of achieving transitions closely correlated to the real trajectory. Interestingly, DFFW-CRBMs can handle discontinuities “less abruptly” compared to FFW-CRBMs. The cross-validation results showing the performance of all models on all ball trajectories are summarized in Table 1. In terms of classification, SVM-RBF, FFW-CRBM, and DFFW-CRBM perform almost similarly, with a slight advantage for DFFW-CRBM (it is worth noting that in this scenario, with four classes, a random guess would have an accuracy of 25%). In the case of 3D coordinate estimation from the 2D projection at a time-step $t$, DFFW-CRBM clearly outperforms state-of-the-art methods, with an NRMSE almost half that of FCRBMs and FFW-CRBMs. Besides that, in this case, the mean value of the correlation coefficient for DFFW-CRBM is 0.62, double that for FFW-CRBM, while the one for FCRBM shows no positive correlation (i.e., it is below zero). For the multi-step prediction of near-future 3D point coordinates, DFFW-CRBM has an even more significant improvement. It is worth highlighting that in this scenario, the average PCC value after one-step prediction is almost perfect (0.95), while after 50 steps predicted into the future the mean PCC value is still positive and larger than those of the other methods. In a final set of experiments we tested the change in classification accuracy as a function of the number of data points used. These are summarized in the bar graph in Figure 6, showing that our method slightly outperforms the state-of-the-art techniques in all cases.

Figure 4: Estimation of different 3D balls trajectories from their 2D counterparts with DFFW-CRBM (top) and FFW-CRBM (bottom) showing that our method outperforms state-of-the-art techniques while requiring less data.
Figure 5: Estimation of the 3D trajectory for the center of one ball from its 2D projection using FFW-CRBM (left) and DFFW-CRBM (right). The top figure presents the trajectory in the 3D space, while the bottom figure presents the Ox, Oy, Oz coordinates of the same trajectory in a 2D plot.
Figure 6: Average classification accuracies with mean and standard deviation, over all balls trajectories, when the amount of training data is increased.

4.2 Human Activity Recognition

Given the above successes, next we evaluate the performance of our method on real-world data representing a variety of human activities. In each set of experiments, we targeted two main objectives and a third secondary one. The first two corresponded to estimating three-dimensional joint coordinates from two-dimensional projections as well as predicting such coordinates in near future, while the third involved classifying activities based on only two-dimensional joint coordinates. Please note that the third experiment is exceptionally hard due to the loss of three-dimensional information making different activities more similar.

Human3.6M dataset. For all experiments, we used the real-world comprehensive benchmark database IonescuSminchisescu11 ; h36m_pami , containing 17 activities performed by 11 professional actors (6 males and 5 females) with over 3.6 million 3D human poses and their corresponding images. Further, for 7 actors, the database accurately reports 32 human skeleton joint positions in 3D space, together with their 2D projections acquired at 50 frames per second (FPS).

Figure 7: Averaged energy levels of FFW-CRBM and DFFW-CRBM for the human activities experiments when the parameters (i.e. number of hidden neurons and factors) are varying. The training was done for 100 epochs.

We used these seven actors, namely Subject 1 (S1), Subject 5 (S5), Subject 6 (S6), Subject 7 (S7), Subject 8 (S8), Subject 9 (S9), and Subject 11 (S11), together with their corresponding activities: Purchasing (A1), Smoking (A2), Phoning (A3), Sitting-Down (A4), Eating (A5), Walking-Together (A6), Greeting (A7), Sitting (A8), Posing (A9), Discussing (A10), Directing (A11), Walking (A12), and Waiting (A13). To avoid computational overhead, we also reduced the temporal resolution of the data to 5 FPS, leading to a total of 46446 training and testing instances. The instances were split between the subjects as follows: S1 (5514 instances), S5 (8748 instances), S6 (5402 instances), S7 (9081 instances), S8 (5657 instances), S9 (6975 instances), and S11 (5069 instances).
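The reduction from 50 FPS to 5 FPS and the per-subject instance counts can be checked with a few lines of Python. Keeping every tenth frame is an assumption about how the temporal resolution was reduced; the instance counts are those listed above.

```python
def downsample(frames, source_fps=50, target_fps=5):
    """Keep every (source_fps // target_fps)-th frame of a sequence."""
    step = source_fps // target_fps
    return frames[::step]

# Instances per subject after downsampling, as reported in the text.
instances = {
    "S1": 5514, "S5": 8748, "S6": 5402, "S7": 9081,
    "S8": 5657, "S9": 6975, "S11": 5069,
}
total_instances = sum(instances.values())
```

The per-subject counts sum to the 46446 instances quoted in the text.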

Deep Learner Setting: The visible layers of both the FFW-CRBM and DFFW-CRBM were set to 160 neurons corresponding to 96 neurons for the 3D coordinates of the joints, and 64 for their 2D projections at time . The label layer consisted of 13 neurons (one for each of the activities), and the history layers included 320 neurons corresponding to 5 history frames each incorporating 2D joint coordinates. The size of the hidden layer was set to neurons, and the number of factors to , as explained in the next paragraph. Furthermore, a learning rate of was used to guarantee bounded reconstruction errors. The number of the Markov Chain steps in the training phase and of the Gibbs sampling in the testing phase were set to 3, and the weights were initialized with . Further particularities, such as momentum and weight decay were set to and . Also, all data were normalized to have a 0 mean and unit standard deviation.
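The layer sizes quoted above follow from the 32 skeleton joints and the 5 history frames. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
N_JOINTS = 32        # skeleton joints reported by the dataset
HISTORY_FRAMES = 5   # past frames fed to the history layer
N_ACTIVITIES = 13    # activity classes A1..A13

present_3d = N_JOINTS * 3                  # x, y, z per joint
present_2d = N_JOINTS * 2                  # projected x, y per joint
visible_size = present_3d + present_2d     # visible layer
history_size = HISTORY_FRAMES * present_2d # history layer (2D only)
label_size = N_ACTIVITIES                  # one neuron per activity

print(visible_size, history_size, label_size)
```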

Importance of Disjunctive Factoring: As in the previous experiment on simulated ball trajectories, we searched for the optimal number of hidden neurons and factors by performing an exhaustive search, varying the number of hidden neurons from 10 to 100 and the number of factors from 10 to 160. Figure 7 depicts, on the same scale, the averaged energy levels of both FFW-CRBM and DFFW-CRBM after 100 training epochs. As in the balls experiment, the energy levels of both models are affected more by the number of factors than by the number of hidden neurons. Even though we are scrutinizing unnormalized energy levels, the fact that the energy levels of DFFW-CRBM are always much lower than those of FFW-CRBM reflects the importance of the disjunctive factoring. Averaging all the energy levels for each model, we observe that DFFW-CRBM has on average approximately three times less energy than FFW-CRBM (i.e. for DFFW-CRBM, and for FFW-CRBM).
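The exhaustive search over the two hyperparameters can be sketched as a plain grid search. Everything below is an illustrative assumption: the step size of 10, the `grid_search` helper, and in particular `toy_energy`, which is a placeholder score and not the energy of any of the models discussed here. A real run would train a model for each configuration and return its averaged energy.

```python
from itertools import product


def grid_search(evaluate, hidden_sizes, factor_sizes):
    """Return (best_score, n_hidden, n_factors), assuming lower is better."""
    best = None
    for n_hidden, n_factors in product(hidden_sizes, factor_sizes):
        score = evaluate(n_hidden, n_factors)
        if best is None or score < best[0]:
            best = (score, n_hidden, n_factors)
    return best


# Ranges from the text; the step of 10 is an assumption.
hidden_sizes = range(10, 101, 10)
factor_sizes = range(10, 161, 10)


def toy_energy(n_hidden, n_factors):
    # Placeholder score that depends more on factors than on hidden units.
    return (n_factors - 80) ** 2 + 0.1 * (n_hidden - 50) ** 2


best = grid_search(toy_energy, hidden_sizes, factor_sizes)
n_configurations = len(hidden_sizes) * len(factor_sizes)
```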

4.2.1 Training and Testing on The Same Person

Here, data from the same subject were used for both training and testing. Emulating real-world 3D trajectory prediction settings where labeled data is scarce, we used only 10% of the available data for training and 90% for testing, with the aim of performing accurate one-step and multi-step 3D trajectory predictions.

Results in Tables 2, 3, and 4 show that DFFW-CRBMs match state-of-the-art techniques in classification and outperform them in prediction, even when only a small amount of training data is used. These results provide a proof of concept that DFFW-CRBMs can accurately predict (in both one-step and multi-step scenarios) 3D trajectories from their 2D projections using only 10% of the data for training and 90% for testing.

Subject   SVM      FFW-CRBM   DFFW-CRBM
S1        49.77    50.53      49.34
S5        36.92    38.82      40.21
S6        30.68    31.51      30.18
S7        38.50    37.03      37.94
S8        26.49    30.41      31.32
S9        24.69    28.12      22.63
S11       34.56    34.21      32.16
Average   34.51    35.80      34.83
Table 2: Classification accuracies in percentages for the human activities experiments, when training and testing data belong to the same person.
          FCRBM                                     FFW-CRBM                                  DFFW-CRBM
Subject   NRMSE [%]     PCC           P-value       NRMSE [%]     PCC           P-value       NRMSE [%]     PCC           P-value
S1        8.41 ± 3.75   0.02 ± 0.10   0.49 ± 0.29   9.93 ± 7.47   0.13 ± 0.37   0.15 ± 0.26   6.36 ± 3.45   0.54 ± 0.29   0.05 ± 0.16
S5        6.70 ± 2.44   -0.03 ± 0.09  0.54 ± 0.28   6.95 ± 3.21   0.10 ± 0.33   0.16 ± 0.26   4.30 ± 2.30   0.68 ± 0.25   0.02 ± 0.10
S6        4.41 ± 1.93   0.03 ± 0.09   0.53 ± 0.28   4.50 ± 2.37   0.01 ± 0.28   0.21 ± 0.29   3.19 ± 1.64   0.50 ± 0.32   0.05 ± 0.16
S7        9.14 ± 3.46   0.02 ± 0.10   0.49 ± 0.29   9.16 ± 4.30   0.13 ± 0.35   0.14 ± 0.25   6.19 ± 3.11   0.71 ± 0.24   0.01 ± 0.09
S8        8.31 ± 3.37   -0.00 ± 0.11  0.47 ± 0.29   8.23 ± 4.42   0.02 ± 0.27   0.26 ± 0.31   4.96 ± 2.57   0.62 ± 0.26   0.05 ± 0.18
S9        7.25 ± 2.74   0.00 ± 0.09   0.55 ± 0.28   8.40 ± 5.05   0.01 ± 0.27   0.22 ± 0.29   4.63 ± 2.34   0.54 ± 0.32   0.05 ± 0.17
S11       9.62 ± 4.05   -0.00 ± 0.10  0.54 ± 0.28   9.89 ± 5.94   0.06 ± 0.32   0.15 ± 0.25   6.82 ± 3.82   0.53 ± 0.35   0.04 ± 0.14
Average   7.69 ± 3.10   0.01 ± 0.09   0.51 ± 0.28   8.15 ± 4.68   0.07 ± 0.31   0.18 ± 0.27   5.21 ± 2.75   0.59 ± 0.29   0.04 ± 0.14
Table 3: The 3D estimation of the human joints from their 2D counterpart at the present time, when the training and testing are done on the same person. The results are presented with mean and standard deviation.
          FCRBM                                     FFW-CRBM                                  DFFW-CRBM
Subject   NRMSE [%]     PCC           P-value       NRMSE [%]     PCC           P-value       NRMSE [%]     PCC           P-value
Predicted after 1 step
S1        7.68 ± 3.71   0.02 ± 0.09   0.51 ± 0.28   7.78 ± 4.95   0.17 ± 0.32   0.17 ± 0.27   5.91 ± 3.45   0.55 ± 0.26   0.06 ± 0.16
S5        6.72 ± 2.48   -0.05 ± 0.09  0.50 ± 0.28   6.41 ± 2.70   0.19 ± 0.37   0.18 ± 0.29   3.88 ± 1.68   0.68 ± 0.25   0.01 ± 0.08
S6        4.40 ± 2.17   0.04 ± 0.09   0.52 ± 0.28   4.27 ± 2.31   0.11 ± 0.31   0.21 ± 0.30   3.17 ± 1.83   0.48 ± 0.32   0.05 ± 0.16
S7        9.07 ± 3.20   0.02 ± 0.11   0.46 ± 0.29   8.78 ± 3.52   0.27 ± 0.35   0.06 ± 0.17   6.52 ± 3.05   0.73 ± 0.17   0.00 ± 0.02
S8        7.16 ± 3.08   0.01 ± 0.12   0.48 ± 0.31   6.42 ± 3.51   0.04 ± 0.23   0.30 ± 0.31   3.93 ± 1.87   0.69 ± 0.19   0.01 ± 0.05
S9        6.98 ± 2.64   -0.01 ± 0.09  0.54 ± 0.29   6.91 ± 3.17   0.08 ± 0.26   0.22 ± 0.28   4.40 ± 1.69   0.64 ± 0.20   0.01 ± 0.08
S11       9.55 ± 4.05   -0.00 ± 0.08  0.56 ± 0.25   8.92 ± 4.66   0.10 ± 0.31   0.18 ± 0.27   7.02 ± 4.00   0.51 ± 0.44   0.04 ± 0.14
Average   7.37 ± 3.05   0.00 ± 0.09   0.51 ± 0.28   7.07 ± 3.55   0.14 ± 0.31   0.18 ± 0.27   4.96 ± 2.51   0.61 ± 0.26   0.03 ± 0.1
Predicted after 50 steps
S1        10.88 ± 3.17  0.01 ± 0.10   0.55 ± 0.32   10.23 ± 5.39  0.10 ± 0.41   0.17 ± 0.26   8.22 ± 4.56   -0.03 ± 0.29  0.14 ± 0.24
S5        7.50 ± 2.20   0.02 ± 0.10   0.53 ± 0.28   8.40 ± 3.20   0.01 ± 0.43   0.19 ± 0.30   6.86 ± 2.47   0.12 ± 0.43   0.08 ± 0.19
S6        5.38 ± 1.92   0.01 ± 0.12   0.45 ± 0.29   4.77 ± 2.65   -0.02 ± 0.24  0.22 ± 0.30   4.44 ± 2.17   0.12 ± 0.32   0.12 ± 0.23
S7        11.07 ± 3.61  -0.03 ± 0.09  0.52 ± 0.30   10.68 ± 4.04  0.11 ± 0.44   0.15 ± 0.26   9.31 ± 3.60   0.17 ± 0.33   0.12 ± 0.25
S8        15.41 ± 1.66  0.01 ± 0.11   0.45 ± 0.29   11.33 ± 5.81  0.09 ± 0.24   0.26 ± 0.29   9.91 ± 5.73   0.08 ± 0.38   0.10 ± 0.21
S9        9.25 ± 2.10   0.01 ± 0.11   0.48 ± 0.28   8.56 ± 3.23   0.03 ± 0.25   0.25 ± 0.28   7.48 ± 3.60   -0.02 ± 0.39  0.12 ± 0.22
S11       14.39 ± 2.71  -0.01 ± 0.09  0.52 ± 0.28   10.93 ± 5.78  0.17 ± 0.37   0.18 ± 0.29   8.56 ± 4.47   0.05 ± 0.51   0.12 ± 0.26
Average   10.55 ± 2.48  0.00 ± 0.10   0.5 ± 0.29    9.27 ± 4.3    0.07 ± 0.34   0.20 ± 0.28   7.82 ± 3.8    0.07 ± 0.37   0.11 ± 0.23
Table 4: Multi-step 3D prediction for the human activities experiments, when the training and testing are done on the same person. The results are presented with mean and standard deviation.
Figure 8: Average classification accuracies with mean and standard deviation for the human activities experiments, over all subjects, when the data for training and testing the models come from the same person and the amount of training data is increased.

Activity Recognition (Classification): The goal in this set of experiments was to classify the 13 activities based only on their 2D projections. Note that this task is difficult because of the information lost in the projection: activities that differ in 3D space can look highly similar in their 2D projections, leading to low classification accuracies. Table 2 reports the accuracy of DFFW-CRBMs against state-of-the-art methods, including SVMs and FFW-CRBMs. Averaging the results over all subjects, we observe that all three models perform comparably. It is worth mentioning that the classification accuracy for random choice in this scenario would be 1/13 (about 7.7%), and all models perform roughly 4.5 times better.
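To put the averaged accuracies of Table 2 in context, the sketch below computes the chance level for 13 classes and the ratio of each averaged accuracy to it. It assumes uniform random guessing over the 13 activities; the three averages are copied from the last row of Table 2 in column order.

```python
N_ACTIVITIES = 13

# Chance-level accuracy in percent, assuming uniform guessing over 13 classes.
chance = 100.0 / N_ACTIVITIES

# Averaged accuracies from the last row of Table 2, in column order.
averages = [34.51, 35.80, 34.83]

# How many times better than chance each averaged accuracy is.
ratios = [accuracy / chance for accuracy in averages]

for accuracy, ratio in zip(averages, ratios):
    print(f"{accuracy:.2f}% is {ratio:.2f}x the chance level of {chance:.2f}%")
```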

Figure 9: Multi-step 3D prediction on the worst performer (S11) and best performer (S6) subjects, when the data for training and testing the models come from the same person.

We also performed two more experiments with more training data, to check that the presented methods behave as expected and to show that DFFW-CRBMs are capable of achieving state-of-the-art classification results. Here, we used 33% and 66% of the data to train the models, and the remainder for testing. Figure 8 shows that all models improve as the amount of training data increases, reaching around 55% accuracy when 66% of the data is used for training.

Figure 10: Multi-step 3D prediction using cross-validation on all subjects, when the data for training and testing the models come from different persons.

Estimating 3D Skeleton Coordinates from 2D Projections (Present Step Prediction): In this task, we estimate the 3D joint coordinates from their 2D counterparts while using 10% of the data for training. The results in Table 3 show that DFFW-CRBMs achieve better performance than FFW-CRBMs and FCRBMs. Though FFW-CRBMs perform comparably, the PCC and P-values show that DFFW-CRBMs drastically outperform FFW-CRBMs in the sense that their predictions are correlated with the ground truth, a property essential for accurate and reliable predictions.
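The following sketch illustrates why an error measure and a correlation measure are worth reporting together: a prediction that hovers around the mean can keep the error bounded while being uncorrelated with the ground truth. The normalisation used here (RMSE divided by the range of the ground truth) is an assumption, since this section does not restate the exact NRMSE definition, and the P-value is omitted because it needs a t-distribution that is not in the standard library.

```python
import math


def nrmse_percent(truth, pred):
    """RMSE divided by the ground-truth range, in percent (assumed definition)."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)
    span = max(truth) - min(truth)
    if span == 0:
        raise ValueError("ground truth has zero range")
    return 100.0 * math.sqrt(mse) / span


def pearson(truth, pred):
    """Pearson correlation coefficient (PCC) between two sequences."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in truth))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    if std_t == 0 or std_p == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / (std_t * std_p)


truth = [0.0, 1.0, 2.0, 3.0, 4.0]
tracking = [0.1, 0.9, 2.1, 2.9, 4.1]   # follows the trajectory
flat = [2.0, 2.1, 1.9, 2.0, 2.0]       # stays near the mean

print(nrmse_percent(truth, tracking), pearson(truth, tracking))
print(nrmse_percent(truth, flat), pearson(truth, flat))
```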

Prediction of 3D Skeleton Trajectories (Multi-Step Prediction): Here, the goal was to perform multi-step predictions of the 3D skeleton joints based only on 2D projections. Starting from a 2D initial state, the model was executed autonomously by recursively feeding back its 2D outputs to perform next-step predictions. Performance is expected to degrade, since prediction errors accumulate over time. Table 4, which shows the performance of the models after 1 and 50 prediction steps, confirms this: the average NRMSE of every model is higher after 50 steps than after 1 step. Table 4 also shows that DFFW-CRBMs outperform FFW-CRBMs in both one-step and multi-step predictions, achieving an average NRMSE of 7.82% after 50 steps compared to 9.27% for FFW-CRBMs. Further results are summarized in Figure 9, which shows the worst-performing (S11) and best-performing (S6) subjects. In these experiments, DFFW-CRBM is clearly the best performer in both prediction error and correlation.
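The recursive feedback loop described above can be sketched as a generic autoregressive rollout. The `rollout` helper and the `toy_step` function are hypothetical stand-ins: `toy_step` is a linear extrapolation on scalar "frames" and has nothing to do with the actual models; it only shows how each predicted 2D frame replaces the oldest entry of the history window.

```python
from collections import deque


def rollout(step_fn, history_2d, n_steps, history_len=5):
    """Autoregressive prediction: feed each predicted 2D frame back as input.

    step_fn(window) must return (next_2d, next_3d); it is a hypothetical
    interface used only for this sketch.
    """
    window = deque(history_2d[-history_len:], maxlen=history_len)
    if len(window) < history_len:
        raise ValueError("not enough history frames")
    predictions_3d = []
    for _ in range(n_steps):
        next_2d, next_3d = step_fn(list(window))
        predictions_3d.append(next_3d)
        window.append(next_2d)  # oldest frame is dropped automatically
    return predictions_3d


def toy_step(window):
    # Placeholder model: linear extrapolation of a scalar "2D frame",
    # with the "3D estimate" defined as ten times the 2D value.
    next_2d = window[-1] + (window[-1] - window[-2])
    return next_2d, next_2d * 10


preds = rollout(toy_step, [0.0, 1.0, 2.0, 3.0, 4.0], n_steps=3)
```

Because every step consumes the previous step's output, any error in `next_2d` propagates into all later predictions, which is the accumulation effect discussed in the text.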

4.2.2 Testing Generalization Capabilities

Motivation: In the second set of human activities experiments, our goal was to determine to what extent DFFW-CRBMs can generalize across different human subjects and activities. The main motivation is that, in reality, subject-specific data is scarce, while data from different users or domains is abundant. The results reported in Table 5 and Figure 10 show that DFFW-CRBMs are capable of generalizing beyond specific subjects thanks to their ability to learn latent features shared among a variety of tasks.

Task                                      Metric         SVM             FCRBM           FFW-CRBM        DFFW-CRBM
Classification                            Accuracy [%]   37.93 ± 5.04    N/A             44.96 ± 2.68    44.49 ± 6.60
Present step 3D estimation                NRMSE [%]      N/A             7.58 ± 3.62     7.52 ± 3.63     3.93 ± 1.75
                                          PCC            N/A             -0.00 ± 0.09    0.14 ± 0.24     0.79 ± 0.16
                                          P-value        N/A             0.52 ± 0.28     0.21 ± 0.28     0.01 ± 0.03
Multi-step 3D prediction, after 1 step    NRMSE [%]      N/A             6.60 ± 3.53     6.52 ± 3.54     3.95 ± 1.99
                                          PCC            N/A             -0.01 ± 0.11    0.21 ± 0.27     0.81 ± 0.14
                                          P-value        N/A             0.49 ± 0.29     0.14 ± 0.24     0.01 ± 0.03
Multi-step 3D prediction, after 50 steps  NRMSE [%]      N/A             7.27 ± 3.81     7.24 ± 3.84     7.34 ± 3.84
                                          PCC            N/A             0.01 ± 0.11     0.13 ± 0.46     0.16 ± 0.50
                                          P-value        N/A             0.49 ± 0.31     0.10 ± 0.20     0.10 ± 0.22
Table 5: Classification, present step 3D estimation, and multi-step 3D prediction, for the human activities experiments, when the training and the testing are done on different persons. The results are cross-validated and presented with mean and standard deviation.

Experiments: Here, data from 6 subjects was used to train the models, and predictions were made for an unseen subject. The procedure was then repeated to cross-validate the results. Further, to emulate real-world settings, only 10% of the data was used for training. During testing, however, all data from the testing subjects was used, increasing the difficulty of the tasks. The same three goals as in the previous experiments were targeted.
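One way to read this protocol is as leave-one-subject-out cross-validation, sketched below with the per-subject instance counts reported earlier. Two points are assumptions made for illustration: that every subject is held out exactly once, and that the 10% training fraction is taken from the pooled data of the six training subjects. The `leave_one_subject_out` helper is hypothetical.

```python
# Instance counts per subject, as reported earlier in the text.
INSTANCES = {"S1": 5514, "S5": 8748, "S6": 5402, "S7": 9081,
             "S8": 5657, "S9": 6975, "S11": 5069}


def leave_one_subject_out(instances, train_fraction=0.10):
    """Build one fold per held-out subject (assumed protocol)."""
    folds = []
    for held_out in instances:
        train_subjects = [s for s in instances if s != held_out]
        pool = sum(instances[s] for s in train_subjects)
        folds.append({
            "test_subject": held_out,
            "train_subjects": train_subjects,
            # Assumed: 10% of the pooled training subjects' data.
            "n_train": round(train_fraction * pool),
            # All data of the held-out subject is used for testing.
            "n_test": instances[held_out],
        })
    return folds


folds = leave_one_subject_out(INSTANCES)
for fold in folds:
    print(fold["test_subject"], fold["n_train"], fold["n_test"])
```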

Activity Recognition (Classification): The results reported in Table 5 show that DFFW-CRBMs achieve results comparable to FFW-CRBMs, at an accuracy of about 44.5%, with both outperforming SVMs. These accuracies are higher than those in Table 2. This can be attributed to the availability of similar-domain data from other subjects, which points to latent feature similarities automatically learned by DFFW-CRBMs.

Estimation of 3D Skeleton Coordinates from 2D Projections (Present Step Prediction): Again, DFFW-CRBMs achieve better performance than FFW-CRBMs in present step estimation of the 3D skeleton joints from 2D projections, while both outperform FCRBM. It is worth highlighting that DFFW-CRBMs are capable of attaining a high average prediction correlation to ground-truth of almost 0.8.

Prediction of 3D Skeleton Trajectories (Multi-Step Prediction): Finally, Figure 10 shows that DFFW-CRBMs are capable of surpassing FFW-CRBMs in multi-step predictions on unseen subjects, achieving low prediction errors and high ground truth correlation.

5 Conclusion

In this paper we proposed disjunctive factored four-way conditional restricted Boltzmann machines (DFFW-CRBMs). These novel machine learning techniques can be used for estimating 3D trajectories from their 2D projections using limited amounts of labeled data. Due to the new tensor factoring introduced by DFFW-CRBMs, these machines are capable of achieving substantially lower energy levels than state-of-the-art techniques leading to more accurate predictions and classification results. Furthermore, DFFW-CRBMs are capable of performing classification and accurate near-future predictions simultaneously in one unified framework.

Two sets of experiments, one on a simulated ball trajectories dataset and one on a real-world benchmark database, demonstrate the effectiveness of DFFW-CRBMs. The empirical evaluation showed that our method is capable of outperforming state-of-the-art machine learning algorithms in both classification and regression. Specifically, DFFW-CRBMs achieved substantially lower energy levels than FFW-CRBMs (approximately three times less energy over all datasets, independently of the number of factors or hidden neurons). This leads to at least twice the accuracy for real-valued predictions, with similar classification performance, at no increased computational cost.


  • (1) H. Silva, A. Dias, J. Almeida, A. Martins, E. Silva, Real-time 3d ball trajectory estimation for robocup middle size league using a single camera, in: T. Röfer, N. Mayer, J. Savage, U. Saranlı (Eds.), RoboCup 2011: Robot Soccer World Cup XV, Vol. 7416 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2012, pp. 586–597. doi:10.1007/978-3-642-32060-6_50.
  • (2) S. Qiu, Y. Yang, J. Hou, R. Ji, H. Hu, Z. Wang, Ambulatory estimation of 3d walking trajectory and knee joint angle using marg sensors, in: Innovative Computing Technology (INTECH), 2014 Fourth International Conference on, 2014, pp. 191–196. doi:10.1109/INTECH.2014.6927742.
  • (3) S. Maleschlijski, G. Sendra, A. Di Fino, L. Leal-Taixé, I. Thome, A. Terfort, N. Aldred, M. Grunze, A. Clare, B. Rosenhahn, A. Rosenhahn, Three dimensional tracking of exploratory behavior of barnacle cyprids using stereoscopy, Biointerphases 7 (1-4). doi:10.1007/s13758-012-0050-x.
  • (4) M. Jauzac, E. Jullo, J.-P. Kneib, H. Ebeling, A. Leauthaud, C.-J. Ma, M. Limousin, R. Massey, J. Richard, A weak lensing mass reconstruction of the large-scale filament feeding the massive galaxy cluster macs j0717.5+3745, Monthly Notices of the Royal Astronomical Society 426 (4) (2012) 3369–3384. doi:10.1111/j.1365-2966.2012.21966.x.
  • (5) J. Tao, B. Risse, X. Jiang, R. Klette, 3d trajectory estimation of simulated fruit flies, in: Proceedings of the 27th Conference on Image and Vision Computing New Zealand, IVCNZ ’12, ACM, New York, NY, USA, 2012, pp. 31–36. doi:10.1145/2425836.2425844.
  • (6) J. Pinezich, J. Heller, T. Lu, Ballistic projectile tracking using cw doppler radar, Aerospace and Electronic Systems, IEEE Transactions on 46 (3) (2010) 1302–1311. doi:10.1109/TAES.2010.5545190.
  • (7) H. Larochelle, Y. Bengio, Classification using discriminative restricted boltzmann machines, in: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, ACM, 2008, pp. 536–543. doi:10.1145/1390156.1390224.
  • (8) R. Salakhutdinov, A. Mnih, G. Hinton, Restricted boltzmann machines for collaborative filtering, in: Machine Learning, Proceedings of the Twenty-fourth International Conference (ICML 2007), ACM, AAAI Press, 2007, pp. 791–798.
  • (9) D. C. Mocanu, G. Exarchakos, A. Liotta, Deep learning for objective quality assessment of 3d images, in: Image Processing (ICIP), 2014 IEEE International Conference on, 2014, pp. 758–762. doi:10.1109/ICIP.2014.7025152.
  • (10) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. doi:10.1038/nature14236.
  • (11) H. Bou-Ammar, D. C. Mocanu, M. E. Taylor, K. Driessens, K. Tuyls, G. Weiss, Automatically mapped transfer between reinforcement learning tasks via three-way restricted boltzmann machines, in: Machine Learning and Knowledge Discovery in Databases, Vol. 8189 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 449–464. doi:10.1007/978-3-642-40991-2_29.
  • (12) P. V. Gehler, A. D. Holub, M. Welling, The rate adapting poisson model for information retrieval and object recognition, in: In Proceedings of 23rd International Conference on Machine Learning (ICML06), ACM Press, 2006, p. 2006.
  • (13) D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, in: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 2366–2374.
  • (14) P. Rasti, T. Uiboupin, S. Escalera, G. Anbarjafari, Convolutional Neural Network Super Resolution for Face Recognition in Surveillance Monitoring, Springer International Publishing, Cham, 2016, pp. 175–184.
  • (15) K. Nasrollahi, S. Escalera, P. Rasti, G. Anbarjafari, X. Baro, H. J. Escalante, T. B. Moeslund, Deep learning based super-resolution for improved action recognition, in: 2015 International Conference on Image Processing Theory, Tools and Applications (IPTA), 2015, pp. 67–72. doi:10.1109/IPTA.2015.7367098.
  • (16) J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, A. Blake, Efficient human pose estimation from single depth images, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12) (2013) 2821–2840.
  • (17) I. Sutskever, G. E. Hinton, Learning multilevel distributed representations for high-dimensional sequences, in: M. Meila, X. Shen (Eds.), AISTATS, Vol. 2 of JMLR Proceedings, 2007, pp. 548–555.
  • (18) P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in: D. E. Rumelhart, J. L. McClelland, et al. (Eds.), Parallel Distributed Processing: Volume 1: Foundations, MIT Press, Cambridge, 1987, pp. 194–281.
  • (19) G. W. Taylor, G. E. Hinton, Factored conditional restricted boltzmann machines for modeling motion style, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 2009, pp. 1025–1032.
  • (20) G. W. Taylor, G. E. Hinton, S. T. Roweis, Two distributed-state models for generating high-dimensional time series, Journal of Machine Learning Research 12 (2011) 1025–1068.
  • (21) D. C. Mocanu, H. Bou-Ammar, D. Lowet, K. Driessens, A. Liotta, G. Weiss, K. Tuyls, Factored four way conditional restricted boltzmann machines for activity recognition, Pattern Recognition Letters.
  • (22) D. C. Mocanu, M. T. Vega, E. Eaton, P. Stone, A. Liotta, Online contrastive divergence with generative replay: Experience replay without storing data, CoRR abs/1610.05555.
  • (23) Y. Bengio, Learning deep architectures for ai, Found. Trends Mach. Learn. 2 (1) (2009) 1–127.
  • (24) D. C. Mocanu, E. Mocanu, P. H. Nguyen, M. Gibescu, A. Liotta, A topological insight into restricted boltzmann machines, Machine Learning 104 (2) (2016) 243–270. doi:10.1007/s10994-016-5570-z.
  • (25) G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation 14 (8) (2002) 1771–1800.
  • (26) M. Ponti, J. Kittler, M. Riva, T. de Campos, C. Zor, A decision cognizant kullback–leibler divergence, Pattern Recognition 61 (2017) 470–478.
  • (27) S. Elaiwat, M. Bennamoun, F. Boussaid, A spatio-temporal rbm-based model for facial expression recognition, Pattern Recognition 49 (2016) 152–161.
  • (28) G. Hinton, M. Pollefeys, J. Susskind, R. Memisevic, Modeling the joint density of two images under a variety of transformations, 2013 IEEE Conference on Computer Vision and Pattern Recognition (2011) 2793–2800.
  • (29) G. Hinton, A practical guide to training restricted boltzmann machines, in: Neural Networks: Tricks of the Trade, Vol. 7700 of Lecture Notes in Computer Science, Springer, 2012, pp. 599–619. doi:10.1007/978-3-642-35289-8_32.
  • (30) C. Cortes, V. Vapnik, Support-Vector Networks, Mach. Learn. 20 (3) (1995) 273–297.
  • (31) T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
  • (32) C. Ionescu, F. Li, C. Sminchisescu, Latent structured models for human pose estimation, in: International Conference on Computer Vision, 2011.
  • (33) C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu, Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7) (2014) 1325–1339.