1 Introduction
Estimating and predicting trajectories in three-dimensional space based on two-dimensional projections available from a single camera source is an open problem with wide-ranging applicability, including entertainment 3dtrajrobotics , medicine 3dtrajmedicie , biology 3dtrajbiology , and physics 3dtrajdarkmatter . Unfortunately, solving this problem is exceptionally difficult due to a variety of challenges, such as the variability of states of the trajectories, partial occlusions due to self-articulation and layering of objects in the scene, and the loss of 3D information resulting from observing trajectories through 2D planar image projections. A variety of techniques have considered variants of this problem by incorporating additional sensors, e.g., cameras 3dtrajwith3cameras or radars 3dtrajwithradar , which provide new data for geometric solvers, allowing for accurate estimation and prediction. Though compelling, the success of these methods comes at increased cost (e.g., incorporating new sensors) and computational complexity (e.g., handling more inputs geometrically).
The problem above, however, can be framed as a time-series estimation and prediction problem, to which numerous machine learning algorithms can be applied. An emerging trend in machine learning for computer vision and pattern recognition is deep learning (DL), which has been successfully applied in a variety of fields, e.g., multi-class classification Larochelle+Bengio2008 , collaborative filtering Salakhutdinov07restrictedboltzmann , image quality assessment mocanu2014deep mnih15 ecml2013dec , information retrieval Gehler06therate , depth estimation 3ddepthnips2014 escalerafacerecognition , and activity recognition escaleraactivityrecognition . Most related to this work are temporal deep learners, e.g., Shotton13 ; temporalrbm , which we briefly review next. Extending standard restricted Boltzmann machines (RBMs) originalrbm , temporal RBMs (TRBMs) consider a succession of RBMs, one for each time frame, allowing them to perform accurate prediction and estimation of time series. Due to their complexity, such naive extensions require high computational effort before acquiring acceptable behavior. Conditional RBMs (CRBMs) remedy this problem by proposing an alternative extension of RBMs taylorcrbmicml . Here, the architecture consists of two separate visible layers, representing the history (i.e., values from previous time frames) and the current values, and a hidden layer for latent correlation discovery. Though successful, CRBMs are only capable of modeling time-series data with relatively "smooth" variations; similarly to other state-of-the-art neural network architectures for time series, e.g., recurrent neural networks, they cannot learn different types of time series within the same model. Thus, to model different types of nonlinear time variations within the same model, the authors in
taylorcrbmicml extend CRBMs by allowing a three-way weight tensor connection among the different layers. Computational complexity is then reduced by adopting a factored version (i.e., FCRBMs) of the weight tensor, which leads to a construction exhibiting accurate modeling and prediction results in a variety of experiments, including human motion styles gwtaylorhdts . However, these methods fail to perform both classification and regression in one unified framework. Recently, Factored Four-Way Conditional Restricted Boltzmann Machines (FFWCRBMs) have been proposed ffwcrbmprl . These extend FCRBMs by incorporating a label layer and a four-way weight tensor connection among the layers to modulate the weights for capturing subtle temporal differences. This construction allows FFWCRBMs to perform both classification and real-valued prediction within the same model, and to outperform state-of-the-art specialized methods for classification or prediction ffwcrbmprl .

Contributions: In this paper we first propose the use of FFWCRBMs to estimate 3D trajectories from their 2D projections while simultaneously classifying those trajectories. Though successful, we discovered that FFWCRBMs require a substantial amount of labeled data before achieving acceptable performance when predicting three-dimensional trajectories from two-dimensional projections. Since FFWCRBMs require three-dimensional labeled information for accurate predictions, which is not typically available, we secondly remedy these problems by proposing an extension of FFWCRBMs, dubbed Disjunctive FFWCRBMs (DFFWCRBMs). Our extension refines the factoring of the four-way weight tensor connecting the machine's layers, adapting it to settings where labeled data is scarce. Adopting such a factorization "specializes" FFWCRBMs and ensures lower energy levels (approximately three times less energy over the overall dataset).
As a result, a reduced training dataset suffices for DFFWCRBMs to reach classification performance similar to state-of-the-art methods and to at least double the performance on real-valued predictions. Importantly, such accuracy improvements come at the same computational cost as FFWCRBMs. Precisely, our machine requires limited labeled data (less than 10% of the overall dataset) for: i) simultaneously classifying and predicting three-dimensional trajectories based on their two-dimensional projections, and ii) accurately estimating three-dimensional postures up to an arbitrary number of time steps in the future.
We have extensively tested DFFWCRBMs on both simulated and real-world data to show that they are capable of outperforming state-of-the-art methods in real-valued prediction and classification. In the first set of experiments, we evaluated their performance by predicting and classifying simulated three-dimensional ball trajectories (based on a real-world physics simulator) thrown with different initial spins. Given these successes, in the second set of experiments we predicted and classified high-dimensional human poses and activities (up to 32 human skeleton joints in 2D and 3D coordinate systems, corresponding to 160 dimensions) using real-world data, showing that DFFWCRBMs achieve roughly double the accuracy at reduced labeled-data sizes.
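To make the problem setup concrete, the following sketch (our own illustration, not taken from the paper) simulates a simple 3D ballistic trajectory and reduces it to the 2D pinhole projection a single camera would observe. The focal length, initial conditions, and integration scheme are all illustrative assumptions; the point is that the depth coordinate is irreversibly lost at projection time, which is exactly what the learner must recover.

```python
import numpy as np

def simulate_trajectory(n_steps=400, dt=0.01):
    """Toy 3D ball trajectory under gravity; returns an (n_steps, 3) array."""
    p = np.array([0.0, 1.0, 5.0])   # initial position (x, y, z); z = depth
    v = np.array([2.0, 3.0, 1.0])   # initial velocity (illustrative values)
    g = np.array([0.0, -9.81, 0.0])
    traj = np.empty((n_steps, 3))
    for t in range(n_steps):
        traj[t] = p
        v = v + g * dt              # simple Euler integration
        p = p + v * dt
    return traj

def project_2d(traj_3d, focal=1.0):
    """Pinhole projection: (x, y, z) -> (f*x/z, f*y/z); depth z is lost."""
    return focal * traj_3d[:, :2] / traj_3d[:, 2:3]

traj = simulate_trajectory()
proj = project_2d(traj)   # shape (400, 2): the only observed input
```

A learner such as DFFWCRBM receives only `proj` (plus, for a small fraction of trajectories, labels) and must reconstruct `traj`.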
2 Background
This section provides the background knowledge essential to the remainder of the paper. Firstly, restricted Boltzmann machines (RBMs), which form the basis of our proposed method, are surveyed. Secondly, Contrastive Divergence, a training algorithm for deep learning methods, is presented. The section concludes with a brief description of deep-learning-based models for time-series prediction and classification.
2.1 Restricted Boltzmann Machines
Restricted Boltzmann machines (RBMs) originalrbm
are energybased models for unsupervised learning. They use a generative model of the distribution of training data for prediction
mocanugenerativereplay . These models employ stochastic nodes and layers, making them less vulnerable to local minima gwtaylorhdts . Further, due to their stochastic neural configurations, RBMs possess excellent generalization and density estimation capabilities bengiodl ; mocanumljxbm .

Formally, an RBM consists of visible and hidden binary layers connected by an undirected bipartite graph. More exactly, the visible layer collects all the visible units and represents the real data, while the hidden layer, comprising all the hidden units, increases the learning capability by enlarging the class of distributions that can be represented to an arbitrary complexity. Let $n_v$ and $n_h$ be the number of neurons in the visible and hidden layers, respectively. $W_{ij}$ denotes the weight connection between visible unit $i$ and hidden unit $j$, while $v_i$ and $h_j$ denote the states of visible unit $i$ and hidden unit $j$, respectively. The matrix of all weights between the layers is given by $\mathbf{W} = [W_{ij}]$. The energy function of RBMs is given by

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{n_v} a_i v_i - \sum_{j=1}^{n_h} b_j h_j - \sum_{i=1}^{n_v}\sum_{j=1}^{n_h} v_i W_{ij} h_j, \qquad (1)$$

where $a_i$ and $b_j$ represent the biases of the visible and hidden layers, respectively. The joint probability of a visible and hidden configuration can be written as $P(\mathbf{v}, \mathbf{h}) = e^{-E(\mathbf{v}, \mathbf{h})}/Z$, with $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$. The marginal distribution, $P(\mathbf{v}) = \sum_{\mathbf{h}} P(\mathbf{v}, \mathbf{h})$, can be used to determine the probability of a data point represented by a state $\mathbf{v}$.

2.2 Training an RBM via Contrastive Divergence
The RBM's parameters are trained by maximizing the likelihood function, typically by following the gradient of the energy function. Unfortunately, in RBMs, maximum likelihood estimation cannot be applied directly due to intractability problems. These problems can be circumvented by using Contrastive Divergence (CD) hintoncd to train the RBM. In CD, learning follows the gradient of

$$CD_n \propto KL(p_0 \,\|\, p_\infty) - KL(p_n \,\|\, p_\infty), \qquad (2)$$

where $p_n$ is the distribution of a Markov chain running for $n$ steps and $KL(\cdot\|\cdot)$ symbolizes the Kullback-Leibler divergence Ponti2017470 . To find the update rules for the free parameters of the RBM (i.e., weights and biases), the RBM's energy function from Equation 1 has to be differentiated with respect to those parameters. Thus, the weight updates are done as follows: $W_{ij}^{\tau+1} = W_{ij}^{\tau} + \alpha\big(\langle v_i h_j\rangle_0 - \langle v_i h_j\rangle_n\big)$, where $\tau$ is the iteration number and $\alpha$ is the learning rate; here $\langle v_i h_j\rangle_0 = \frac{1}{N}\sum_{k=1}^{N} v_i^{(k)} h_j^{(k)}$, where $N$ is the total number of input instances and the superscript $(k)$ denotes the input instance. The subscript $n$ indicates that the states are obtained after $n$ steps of Gibbs sampling on a Markov chain which starts at the original data distribution $p_0$. In practice, learning can be performed using just one step of Gibbs sampling, which is carried out in four sub-steps: (1) initialize the visible units, (2) infer all the hidden units, (3) infer all the visible units, and (4) update the weights and biases.

2.3 Factored Conditional Restricted Boltzmann Machine
Conditional Restricted Boltzmann Machines (CRBMs) gwtaylorhdts are an extension of RBMs used to model time-series data, for example, human activities. They use an undirected model with binary hidden variables connected to real-valued visible ones. At each time step $t$, the hidden and visible nodes receive a connection from the visible variables at the last $N$ time steps. The history of the real-world values until time $t$ is collected in the real-valued history vector $\mathbf{v}_{<t}$, with $n_{<t}$ being the number of elements in $\mathbf{v}_{<t}$. The total energy of the CRBM is given by

$$E(\mathbf{v}_t, \mathbf{h}_t \,|\, \mathbf{v}_{<t}) = \sum_i \frac{(v_{i,t} - \hat{a}_{i,t})^2}{2} - \sum_j \hat{b}_{j,t} h_{j,t} - \sum_{i,j} v_{i,t} W_{ij} h_{j,t}, \qquad (3)$$

where $\hat{a}_{i,t} = a_i + \sum_k A_{ki} v_{k,<t}$ and $\hat{b}_{j,t} = b_j + \sum_k B_{kj} v_{k,<t}$ represent the "dynamic biases", with $k$ being the index of the elements from $\mathbf{v}_{<t}$.
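The dynamic-bias mechanism above can be sketched in a few lines. This is our own illustration, with arbitrary layer sizes and randomly initialized parameters; the key idea is that the history vector linearly shifts the static biases at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h, n_hist = 6, 8, 12   # illustrative layer sizes (ours, not the paper's)

a = rng.normal(size=n_v)              # static visible biases
b = rng.normal(size=n_h)              # static hidden biases
A = rng.normal(size=(n_hist, n_v))    # history -> visible autoregressive weights
B = rng.normal(size=(n_hist, n_h))    # history -> hidden autoregressive weights

def dynamic_biases(v_hist):
    """CRBM-style dynamic biases: the static bias plus a linear
    contribution from the concatenated history vector v_<t."""
    return a + v_hist @ A, b + v_hist @ B

a_hat, b_hat = dynamic_biases(rng.standard_normal(n_hist))
```

The resulting `a_hat` and `b_hat` then play the role of the (per-time-step) biases in the otherwise standard RBM energy.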
Taylor and Hinton introduced the Factored Conditional Restricted Boltzmann Machine (FCRBM) gwtaylorhdts , which permits the modeling of different styles of time series within the same model, due to the introduction of multiplicative three-way interactions and of a preset style label. To reduce the computational complexity of this model, they factored the third-order tensors between layers into products of matrices. Formally, the FCRBM defines a joint probability distribution over the visible and hidden neurons. The joint distribution is conditioned on the past observations, the model parameters, and the preset style label. Interested readers are referred to gwtaylorhdts for a more comprehensive discussion on CRBMs and FCRBMs.

2.4 Four-Way Conditional Restricted Boltzmann Machines
Due to the limitations exhibited by FCRBMs, e.g., the impossibility of performing classification without extensions, we proposed the four-way conditional restricted Boltzmann machines (FWCRBMs) for performing prediction and classification in one unified framework ffwcrbmprl . FWCRBMs introduced an additional layer and a four-way multiplicative weight tensor interaction between neurons. Please note that, later on, other four-way models have been proposed, but they can perform only classification and no prediction Elaiwat2016152 .

FWCRBMs extended FCRBMs to include a label layer and a fourth-order weight tensor connection $W \in \mathbb{R}^{n_v \times n_h \times n_z \times n_y}$, where $n_v$, $n_h$, $n_z$, and $n_y$ represent the number of neurons in the present, hidden, history, and label layers, respectively. Though successful, FWCRBMs exhibited a high computational complexity for tuning the free parameters, scaling with the product of the four layer sizes. Circumventing this problem, we factored the weight tensor into sums of products, leading to more efficient machines whose cost scales only with the sum of the layer sizes times the number of factors, labeled factored four-way conditional restricted Boltzmann machines (FFWCRBMs). FFWCRBMs, shown in Figure 1, minimize the following energy functional
(4) 
where $F$ is the number of factors and $i$, $j$, $k$, and $l$ are the indices of the visible layer neurons $v_i$, the hidden layer neurons $h_j$, the history layer neurons $z_k$, and the label layer neurons $y_l$, respectively. The weights from the visible, hidden, and label layers to the factors are bidirectional and symmetric, while the weights from the history layer to the factors are directed. As in the case of the three-way models modellingjointdensities , standard CD is also unsuccessful in training the four-way models, due to the need of predicting two output layers (i.e., the label and present layers). Thus, in ffwcrbmprl we proposed a sequential variant of CD, named sequential Markov chain contrastive divergence, which is more suitable for tuning the free parameters of FWCRBMs.
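The computational benefit of replacing the full four-way tensor with sums of products can be verified numerically. The sketch below (our notation; layer sizes are arbitrary) computes the same four-way interaction term both ways and checks they agree, while the factored path never materializes the full tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
n_v, n_h, n_z, n_y, n_f = 5, 6, 10, 4, 8  # visible/hidden/history/label/factors

# Per-layer factor matrices: the full tensor W[i,j,k,l] is approximated
# by sum_f Wv[i,f] * Wh[j,f] * Wz[k,f] * Wy[l,f].
Wv = rng.normal(size=(n_v, n_f)) * 0.1
Wh = rng.normal(size=(n_h, n_f)) * 0.1
Wz = rng.normal(size=(n_z, n_f)) * 0.1
Wy = rng.normal(size=(n_y, n_f)) * 0.1

def factored_energy_term(v, h, z, y):
    """Four-way interaction via factors: cost O(F*(n_v+n_h+n_z+n_y))
    instead of O(n_v*n_h*n_z*n_y) for the unfactored tensor."""
    return -np.sum((v @ Wv) * (h @ Wh) * (z @ Wz) * (y @ Wy))

def full_tensor_energy_term(v, h, z, y):
    """Reference: materialize the rank-F tensor and contract it (slow)."""
    W = np.einsum('if,jf,kf,lf->ijkl', Wv, Wh, Wz, Wy)
    return -np.einsum('ijkl,i,j,k,l->', W, v, h, z, y)

v, h = rng.normal(size=n_v), rng.integers(0, 2, n_h).astype(float)
z, y = rng.normal(size=n_z), np.eye(n_y)[0]
assert np.isclose(factored_energy_term(v, h, z, y),
                  full_tensor_energy_term(v, h, z, y))
```

The equality holds exactly because both expressions sum the same rank-one contributions, just in a different order.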
FFWCRBMs have shown good generalization and time-series latent feature learning capabilities compared to state-of-the-art techniques, including but not limited to support vector machines, CRBMs, and FCRBMs ffwcrbmprl . It is for these reasons that we believe that FFWCRBMs can serve as a basis for predicting three-dimensional trajectories from two-dimensional projections. Unfortunately, FFWCRBMs are not readily applicable to such a problem, as they require a substantial amount of labeled data for successful tuning. In this paper, we extend FFWCRBMs to Disjunctive FFWCRBMs (DFFWCRBMs) by proposing a novel factoring process essential for predicting and classifying 3D trajectories from 2D projections. Our model, detailed next, reduces the sample complexity of current methods and allows for lower energy levels compared to FFWCRBMs, leading to improved performance.

3 Disjunctive Factored Four-Way Conditional Restricted Boltzmann Machines
This section details disjunctive factored four-way conditional restricted Boltzmann machines (DFFWCRBMs), shown in Figure 2. Similarly to FFWCRBMs, our model consists of four layers representing the visible, history, hidden, and label units. Contrary to the factoring adopted by FFWCRBMs, however, our model incorporates two new factoring layers. The first factoring layer, shown in the figure, is responsible for specializing the machine to real-valued predictions through its four weight matrices, while the second specializes the machine to classification through the corresponding weight tensor collection. Such a specialization reduces the amount of labeled samples needed by DFFWCRBMs for successful parameter tuning, as demonstrated in Section 4, while the computational complexity of DFFWCRBMs remains the same as for FFWCRBMs. Given our novel construction, DFFWCRBMs require their own mathematical treatment. Next, we detail the energy functional and learning rules needed by DFFWCRBMs.
3.1 DFFWCRBM’s Energy Function
The energy function of DFFWCRBMs consists of three major terms. The first corresponds to the standard energy of a specific sub-machine of DFFWCRBMs (i.e., the energy given by the neurons of each layer and their biases), while the latter two denote energies related to the first and second factoring layers, respectively:
(5) 
Here, $F_1$ denotes the total number of factors for the weight tensor collection specializing DFFWCRBMs to regression, while $F_2$ is the total number of factors responsible for classification. The indices $i$, $j$, $k$, and $l$ refer to the visible layer neurons $v_i$, the hidden layer neurons $h_j$, the history layer neurons $z_k$, and the label layer neurons $y_l$, respectively. Furthermore, in the regression factoring, the weights from the visible and hidden layers to the factors are bidirectional and symmetric, while those from the label and history layers to the factors are directed. Similarly, in the classification factoring, the weights from the label and hidden layers to the factors are bidirectional and symmetric, while those from the visible and history layers are directed. Finally, the two groups of four weight matrices each constitute the factorized tensor specializations for regression and classification, respectively.
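The overall structure of this energy, a bias (sub-machine) term plus two disjoint factored four-way interaction terms, can be sketched as follows. The sums-of-products form of each factored term, the Gaussian-style visible bias term, and all layer sizes are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_v, n_h, n_z, n_y = 5, 6, 10, 4   # illustrative layer sizes (ours)
n_f1, n_f2 = 8, 8                  # factors: regression vs. classification

def make_group(n_f):
    """One factored four-way weight group: four per-layer factor matrices."""
    return [rng.normal(size=(n, n_f)) * 0.1 for n in (n_v, n_h, n_z, n_y)]

reg_group = make_group(n_f1)   # specializes real-valued prediction
cls_group = make_group(n_f2)   # specializes classification

a = rng.normal(size=n_v)       # visible biases
b = rng.normal(size=n_h)       # hidden biases
c = rng.normal(size=n_y)       # label biases

def group_energy(group, v, h, z, y):
    """Factored (sums-of-products) four-way interaction term."""
    Wv, Wh, Wz, Wy = group
    return -np.sum((v @ Wv) * (h @ Wh) * (z @ Wz) * (y @ Wy))

def energy(v, h, z, y):
    """Three terms: sub-machine (bias) energy plus the two disjoint
    factored interaction energies (bias form is our assumption)."""
    e_bias = 0.5 * np.sum((v - a) ** 2) - h @ b - y @ c
    return (e_bias
            + group_energy(reg_group, v, h, z, y)
            + group_energy(cls_group, v, h, z, y))
```

Because the two factor groups are disjoint, gradients with respect to the regression group never involve the classification group's matrices, which is the "specialization" discussed above.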
3.2 DFFWCRBM’s Activation Probabilities
Inference for DFFWCRBMs corresponds to determining the values of the activation probabilities for each of the units. As shown in Figure 2, units within the same layer do not share connections, which allows the probabilities of all units within a layer to be computed in parallel. The overall input of each hidden, visible, and label unit is given by:
(6) 
Consequently, for each of the hidden, visible, and label units, the activation probabilities can be determined as

(7)

where $\mathcal{N}(\cdot)$ represents the standard Gaussian distribution.
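A common realization of such activation probabilities is sketched below: sigmoid-Bernoulli hidden units and Gaussian real-valued visible units. This is our illustration of the standard choices; the unit-variance Gaussian and the exact treatment of the label layer are assumptions, not the paper's exact Equation (7).

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(total_input_h):
    """Binary hidden units: p(h_j = 1 | ...) is the sigmoid of the
    unit's total input; states are sampled from that probability."""
    p = sigmoid(total_input_h)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_visible(total_input_v):
    """Real-valued visible units: drawn from a Gaussian with unit
    variance centred on the total input (variance is our assumption)."""
    return total_input_v + rng.standard_normal(total_input_v.shape)
```

Since units within a layer are conditionally independent, both functions operate on whole layers at once.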
3.3 Parameter Tuning: Update Rules & Algorithm
3.3.1 Update Rules
Generally, a parameter $\theta$ is updated according to

$$\theta_{\tau+1} = \theta_\tau + \Delta\theta_{\tau+1}, \qquad \Delta\theta_{\tau+1} = \rho\,\Delta\theta_\tau + \alpha\left(\nabla_\theta - \gamma\,\theta_\tau\right), \qquad (8)$$

where $\tau$ represents the update iteration, $\rho$ is the momentum, $\alpha$ denotes the learning rate, $\nabla_\theta$ is the gradient with respect to $\theta$, and $\gamma$ is the weight decay. A more detailed discussion on the choice of these parameters is provided by Hinton in hintontrain . The update rules are obtained by differentiating the energy functional with respect to the free parameters (i.e., the weight matrices and the biases of each of the layers). In DFFWCRBMs, a set of eight free parameters, corresponding to the connections between the factors and each of the layers, has to be inferred; these are presented below. Intuitively, each of these update equations aims at minimizing the reconstruction error (i.e., the error between the original inputs and those reconstructed through the model). Moreover, each update equation includes three main terms representing the connections between the factored weights and the corresponding layers of the machine, as per Figure 2. For instance, connections to only the hidden, history, and label layers suffice for updating the corresponding factored weight matrix. Thus, the update rules for each of the weights corresponding to the first factored layer can be computed as:
(9) 
while for the second factoring we have:
(10) 
and for the biases:
(11) 
where the subscript $n$ denotes quantities obtained after a Markov chain running for a total of $n$ steps starting at the original data distribution, $\langle\cdot\rangle_0$ denotes the expectation under the input data, and $\langle\cdot\rangle_n$ represents the model's expectation.
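The generic update step with momentum, learning rate, and weight decay described above can be sketched as a small helper. The exact placement of the weight-decay term inside the learning-rate bracket follows Hinton's practical recommendations and is our assumption; the hyperparameter values are illustrative.

```python
def update_param(theta, delta_prev, grad, lr=1e-3, momentum=0.9, decay=2e-4):
    """One momentum + weight-decay update step:
    delta = momentum * previous delta + lr * (gradient - decay * theta).
    Returns the new parameter value and the new delta (to be reused
    as `delta_prev` at the next iteration)."""
    delta = momentum * delta_prev + lr * (grad - decay * theta)
    return theta + delta, delta
```

In the machine, `grad` would be the difference between the data-driven and model-driven expectations from the update rules above, computed per weight matrix or bias vector.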
3.3.2 Sequential CD for DFFWCRBMs
Algorithm 1 presents a high-level description of the sequential Markov chain contrastive divergence ffwcrbmprl adapted to train DFFWCRBMs. It shows the two main steps needed for training such machines. First, the visible layer is inferred while the history and label layers are fixed; second, the label layer is reconstructed while the history and present layers are fixed. Updating the weights then follows the rules derived in the previous section. These two procedures are repeated for a prespecified number of epochs, with the reconstruction error decreasing at each epoch toward the minimum of the energy function, guaranteeing a minimized divergence between the original data distribution and the one given by the model.
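The two-phase structure of the training loop can be sketched as follows. The Gibbs reconstructions here are random placeholders standing in for the machine's actual conditionals (the real implementation would use the activation probabilities of Section 3.2, and the omitted comment marks where the update rules of Section 3.3.1 apply); only the control flow is faithful to the algorithm's description.

```python
import numpy as np

rng = np.random.default_rng(4)
N_V, N_Y, K = 5, 4, 3   # visible size, label size, Gibbs steps (illustrative)

# Schematic stand-ins for the model's conditionals.
def gibbs_visible(history, label, k=K):
    """Step 1: history and label clamped; reconstruct the present layer
    with k Gibbs steps (placeholder reconstruction)."""
    return rng.standard_normal(N_V)

def gibbs_label(history, visible, k=K):
    """Step 2: history and present clamped; reconstruct the label layer
    (placeholder posterior over classes)."""
    p = rng.random(N_Y) + 1e-6
    return p / p.sum()

def train(dataset, epochs=2):
    """Sequential Markov chain CD: two clamped reconstruction phases
    per instance, followed by the updates of Section 3.3.1 (omitted)."""
    for _ in range(epochs):
        for history, visible, label in dataset:
            v_rec = gibbs_visible(history, label)   # phase 1
            y_rec = gibbs_label(history, visible)   # phase 2
            # ...apply the weight/bias update rules here...
    return v_rec, y_rec
```

The alternation matters: standard CD cannot be applied directly because two output layers (present and label) must both be predicted, so each is reconstructed while the other is clamped.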
4 Experiments and Results
This section extensively tests the performance of DFFWCRBMs on both simulated and real-world datasets. The major goal of these experiments was to assess the capability of DFFWCRBMs to predict three-dimensional trajectories from two-dimensional projections given small amounts of labeled data (i.e., on the order of 9-10% of the total dataset). As a secondary objective, the goal was to classify such trajectories into different spins (ball trajectories) or activities (human pose estimation). In the real-valued prediction setting, we compared our method to state-of-the-art FFWCRBMs and FCRBMs, while for classification our method's performance was tested against FFWCRBMs and support vector machines with radial basis functions (SVMRBFs) vapniksvm .

Evaluation Metrics: To assess the models' performance, a variety of standard metrics were used. For classification, we used accuracy roc in percentages, while for estimation tasks, we used the Normalized Root Mean Square Error (NRMSE), estimating the distance between the prediction and the ground truth; the Pearson Correlation Coefficient (PCC), reflecting the correlation between predictions and ground truth; and the P-value, to assess the statistical significance of the predictions.
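The two regression metrics can be computed as follows. The range-based normalization of the NRMSE is one common convention and is our assumption; the paper may normalize differently.

```python
import numpy as np

def nrmse(pred, truth):
    """Root mean square error normalized by the ground-truth range,
    in percent (one common convention; the exact normalization used
    in the paper may differ)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    return 100.0 * rmse / (np.max(truth) - np.min(truth))

def pcc(pred, truth):
    """Pearson correlation coefficient between prediction and truth."""
    return np.corrcoef(pred, truth)[0, 1]
```

A perfect prediction gives NRMSE 0% and PCC 1; an uncorrelated prediction gives a PCC near 0, matching how the results tables below should be read.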
4.1 Ball Trajectory Experiments
We generated different ball trajectories thrown with different spins using the Bullet Physics Library^{1}^{1}1http://bulletphysics.org, Last accessed on November 2016. With this simulated dataset we targeted three objectives using small amounts (9%) of labeled training data. First, we estimated 3D ball coordinates based on their 2D projections at each time step (i.e., one-step prediction). Second, we aimed at predicting near-future (i.e., a couple of time steps into the future) 3D ball coordinates recursively, given a limited 2D sequence of coordinates as a starting point. Third, we classified various ball spins based on just the 2D coordinates. We used four trajectory classes corresponding to four different ball spin types. For each class, a set of 11 trajectories, each containing approximately 400 time steps (amounting to a total of 17211 data instances), was sampled. To assess the performance of DFFWCRBMs, we performed 11-fold cross validation and report mean and standard deviation results. Precisely, from each class of trajectories we used only
one labeled trajectory^{2}^{2}2A labeled trajectory has complete information: the 3D ball coordinates, their 2D projections, and the spin (i.e., class). to train the models, and the other 10 were used for testing.

Table 1: Cross-validation results (mean±standard deviation) on the simulated ball trajectories.

Task                        Metrics      SVMRBF      FCRBM       FFWCRBM      DFFWCRBM
Classification              Accuracy[%]  39.26±4.63  N/A         37.49±3.66   39.51±4.47
Present-step 3D estimation  NRMSE[%]     N/A         18.38±8.07  19.53±33.24  11.24±8.53
                            PCC          N/A         0.06±0.70   0.31±0.72    0.62±0.61
                            P-value      N/A         0.51±0.29   0.40±0.29    0.28±0.27
Multi-step 3D prediction    NRMSE[%]     N/A         25.61±3.25  23.53±2.48   9.52±6.12
(after 1 step)              PCC          N/A         0.14±0.69   0.31±0.74    0.95±0.14
                            P-value      N/A         0.50±0.29   0.38±0.25    0.12±0.18
Multi-step 3D prediction    NRMSE[%]     N/A         31.38±7.99  29.49±8.14   19.93±10.27
(after 50 steps)            PCC          N/A         0.05±0.69   0.05±0.72    0.20±0.66
                            P-value      N/A         0.51±0.26   0.47±0.26    0.51±0.26
Deep Learner Setting: The visible layers of both models (i.e., FFWCRBM and DFFWCRBM) were set to 5 neurons: three denoting the 3D ball center coordinates (i.e., x, y, z), and two for its 2D projection at time $t$. The label layer consisted of 4 neurons (one for each of the spin classes), while the history layer included 100 neurons corresponding to the last 50 history frames; one frame incorporates the 2D coordinates of the center of the ball projected onto a two-dimensional plane. The numbers of hidden neurons and of factors were set as discussed in the next paragraph and in Subsection 4.2. The learning rate, momentum, and weight decay factors were fixed beforehand, and the number of Markov chain steps for CD in the training phase, as well as for the Gibbs sampling in the testing phase, was set to 3. All weights were initialized with small random values. Finally, the data were normalized to have zero mean and unit variance, as explained in hintontrain , and the models were trained for 100 epochs.

Importance of Disjunctive Factoring: To find the optimal number of hidden neurons and factors, we performed an exhaustive search by varying the number of hidden neurons from 10 to 100 and the number of factors from 10 to 160. To gain insight into the behavioral differences between FFWCRBMs and DFFWCRBMs, even though the energy equation of DFFWCRBM has an extra tensor, Figure 3 illustrates on the same scale the heatmap of the averaged energy levels, computed using Equation 4 for FFWCRBM and Equation 5 for DFFWCRBM after both models were trained for 100 epochs. Though both models acquire their lowest energy levels in configurations with 10-20 hidden neurons and more than 100 factors, these results highlight the importance of the disjunctive factoring introduced in this paper: DFFWCRBMs always acquire lower energy levels than FFWCRBMs due to their "specialized" tensor factoring. Moreover, by averaging the energy levels from the aforementioned figure, we found that the average energy level of DFFWCRBM is approximately three times smaller than that of FFWCRBM, anticipating the more accurate performance results shown next.
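The exhaustive search over architecture sizes described above can be sketched generically. Here `avg_energy` is a hypothetical callable (our placeholder, not the paper's code) that trains a model with the given sizes and returns its average energy after training.

```python
import itertools

def grid_search(avg_energy, hidden_range, factor_range):
    """Exhaustive search over (n_hidden, n_factors) configurations,
    keeping the configuration with the lowest average energy.
    Returns (best_energy, best_n_hidden, best_n_factors)."""
    best = None
    for n_h, n_f in itertools.product(hidden_range, factor_range):
        e = avg_energy(n_h, n_f)
        if best is None or e < best[0]:
            best = (e, n_h, n_f)
    return best

# Mirroring the ranges used in the paper:
# grid_search(avg_energy, range(10, 101, 10), range(10, 161, 10))
```

With the paper's ranges this evaluates 10 x 16 = 160 configurations per model, which is how the energy heatmaps of Figure 3 are populated.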
Figures 4 and 5 compare DFFWCRBMs and FFWCRBMs on estimating different 3D trajectories of balls picked at random, showing that our method is capable of achieving transitions closely correlated with the real trajectory. Interestingly, DFFWCRBMs handle discontinuities "less abruptly" than FFWCRBMs. The cross-validation results showing the performance of all models on all ball trajectories are summarized in Table 1. In terms of classification, SVMRBF, FFWCRBM, and DFFWCRBM perform almost similarly, with a slight advantage for DFFWCRBM^{3}^{3}3It is worth noting that in this scenario a random guess for classification would have an accuracy of 25%.. In the case of 3D coordinate estimation from 2D projections at a given time step, DFFWCRBM clearly outperforms the state-of-the-art methods, with an NRMSE almost twice as small as those of FCRBMs and FFWCRBMs. Moreover, in this case, the mean value of the correlation coefficient for DFFWCRBM is double that of FFWCRBM, while the one for FCRBM is powerless (i.e., below zero). For the multi-step prediction of near-future 3D point coordinates, DFFWCRBM shows an even more significant improvement. It is worth highlighting that in this scenario the average PCC value after one-step prediction is almost perfect, while after 50 steps predicted into the future the mean PCC value is still positive and larger than those of the other methods. In a final set of experiments, we tested how the classification accuracy changes with the number of labeled data points used. These results are summarized in the bar graph in Figure 6, showing that our method slightly outperforms the state-of-the-art techniques in all cases.
4.2 Human Activity Recognition
Given the above successes, we next evaluate the performance of our method on real-world data representing a variety of human activities. In each set of experiments, we targeted two main objectives and a third, secondary one. The first two corresponded to estimating three-dimensional joint coordinates from two-dimensional projections and predicting such coordinates in the near future, while the third involved classifying activities based only on two-dimensional joint coordinates. Please note that the third experiment is exceptionally hard due to the loss of three-dimensional information, which makes different activities appear more similar.
Human3.6M dataset. For all experiments, we used the comprehensive real-world benchmark database IonescuSminchisescu11 ; h36m_pami , containing 17 activities performed by 11 professional actors (6 males and 5 females), with over 3.6 million 3D human poses and their corresponding images. Further, for 7 actors, the database accurately reports 32 human skeleton joint positions in 3D space, together with their 2D projections, acquired at 50 frames per second (FPS).
We used these seven actors, namely Subject 1 (S1), Subject 5 (S5), Subject 6 (S6), Subject 7 (S7), Subject 8 (S8), Subject 9 (S9), and Subject 11 (S11), together with their corresponding activities, such as Purchasing (A1), Smoking (A2), Phoning (A3), SittingDown (A4), Eating (A5), WalkingTogether (A6), Greeting (A7), Sitting (A8), Posing (A9), Discussing (A10), Directing (A11), Walking (A12), and Waiting (A13). To avoid computational overhead, we also reduced the temporal resolution of the data to 5 FPS, leading to a total of 46446 training and testing instances. The instances were split between the different subjects as: S1 (5514 instances), S5 (8748 instances), S6 (5402 instances), S7 (9081 instances), S8 (5657 instances), S9 (6975 instances), and S11 (5069 instances).
Deep Learner Setting: The visible layers of both the FFWCRBM and DFFWCRBM were set to 160 neurons, corresponding to 96 neurons for the 3D coordinates of the joints and 64 for their 2D projections at time $t$. The label layer consisted of 13 neurons (one for each of the activities), and the history layer included 320 neurons corresponding to 5 history frames, each incorporating the 2D joint coordinates. The size of the hidden layer and the number of factors were set as explained in the next paragraph. Furthermore, the learning rate was chosen to guarantee bounded reconstruction errors. The number of Markov chain steps in the training phase and of the Gibbs sampling steps in the testing phase was set to 3, and the weights were initialized with small random values. Further particularities, such as the momentum and weight decay, were fixed beforehand. Also, all data were normalized to have zero mean and unit standard deviation.
Importance of Disjunctive Factoring: Similarly to the previous experiment on simulated ball trajectories, we searched for the optimal numbers of hidden neurons and factors by performing an exhaustive search, varying the number of hidden neurons from 10 to 100 and the number of factors from 10 to 160. Figure 7 depicts on the same scale the averaged energy levels for both FFWCRBM and DFFWCRBM after being trained for 100 epochs. As in the ball experiments, the energy levels of both models are more affected by the number of factors than by the number of hidden neurons. Even though we are scrutinizing unnormalized energy levels, the fact that the energy levels of DFFWCRBM are always much lower than those of FFWCRBM reflects the importance of the disjunctive factoring. By quantifying and averaging all the energy levels for each model, we observe that DFFWCRBM has on average approximately three times lower energy than FFWCRBM.
4.2.1 Training and Testing on The Same Person
Here, data from the same subject was used for both training and testing. Emulating real-world 3D trajectory prediction settings where labeled data is scarce, we used only 10% of the available data for training and 90% for testing, with the aim of performing accurate one-step and multi-step 3D trajectory predictions.
Results in Tables 2, 3, and 4 show that DFFWCRBMs are capable of achieving better performance than state-of-the-art techniques in both classification and prediction, even when using only a small amount of training data. These results provide a proof of concept that DFFWCRBMs are capable of accurately predicting (in both one-step and multi-step scenarios) 3D trajectories from their 2D projections while using only 10% of the data for training and 90% for testing.
Table 2: Activity classification accuracy [%] when training and testing on the same person.

Persons  SVMRBF  FFWCRBM  DFFWCRBM
S1       49.77   50.53    49.34
S5       36.92   38.82    40.21
S6       30.68   31.51    30.18
S7       38.50   37.03    37.94
S8       26.49   30.41    31.32
S9       24.69   28.12    22.63
S11      34.56   34.21    32.16
Average  34.51   35.80    34.83
Table 3: Present-step 3D estimation results (mean±standard deviation) when training and testing on the same person.

Persons  FCRBM                             FFWCRBM                           DFFWCRBM
         NRMSE [%]  PCC        P-value    NRMSE [%]  PCC        P-value    NRMSE [%]  PCC        P-value
S1       8.41±3.75  0.02±0.10  0.49±0.29  9.93±7.47  0.13±0.37  0.15±0.26  6.36±3.45  0.54±0.29  0.05±0.16
S5       6.70±2.44  0.03±0.09  0.54±0.28  6.95±3.21  0.10±0.33  0.16±0.26  4.30±2.30  0.68±0.25  0.02±0.10
S6       4.41±1.93  0.03±0.09  0.53±0.28  4.50±2.37  0.01±0.28  0.21±0.29  3.19±1.64  0.50±0.32  0.05±0.16
S7       9.14±3.46  0.02±0.10  0.49±0.29  9.16±4.30  0.13±0.35  0.14±0.25  6.19±3.11  0.71±0.24  0.01±0.09
S8       8.31±3.37  0.00±0.11  0.47±0.29  8.23±4.42  0.02±0.27  0.26±0.31  4.96±2.57  0.62±0.26  0.05±0.18
S9       7.25±2.74  0.00±0.09  0.55±0.28  8.40±5.05  0.01±0.27  0.22±0.29  4.63±2.34  0.54±0.32  0.05±0.17
S11      9.62±4.05  0.00±0.10  0.54±0.28  9.89±5.94  0.06±0.32  0.15±0.25  6.82±3.82  0.53±0.35  0.04±0.14
Average  7.69±3.10  0.01±0.09  0.51±0.28  8.15±4.68  0.07±0.31  0.18±0.27  5.21±2.75  0.59±0.29  0.04±0.14
Table 4: Multi-step 3D prediction results (mean±standard deviation) when training and testing on the same person.

Steps           Persons  FCRBM                              FFWCRBM                            DFFWCRBM
predicted                NRMSE [%]   PCC        P-value     NRMSE [%]   PCC        P-value     NRMSE [%]  PCC        P-value
After 1 step    S1       7.68±3.71   0.02±0.09  0.51±0.28   7.78±4.95   0.17±0.32  0.17±0.27   5.91±3.45  0.55±0.26  0.06±0.16
                S5       6.72±2.48   0.05±0.09  0.50±0.28   6.41±2.70   0.19±0.37  0.18±0.29   3.88±1.68  0.68±0.25  0.01±0.08
                S6       4.40±2.17   0.04±0.09  0.52±0.28   4.27±2.31   0.11±0.31  0.21±0.30   3.17±1.83  0.48±0.32  0.05±0.16
                S7       9.07±3.20   0.02±0.11  0.46±0.29   8.78±3.52   0.27±0.35  0.06±0.17   6.52±3.05  0.73±0.17  0.00±0.02
                S8       7.16±3.08   0.01±0.12  0.48±0.31   6.42±3.51   0.04±0.23  0.30±0.31   3.93±1.87  0.69±0.19  0.01±0.05
                S9       6.98±2.64   0.01±0.09  0.54±0.29   6.91±3.17   0.08±0.26  0.22±0.28   4.40±1.69  0.64±0.20  0.01±0.08
                S11      9.55±4.05   0.00±0.08  0.56±0.25   8.92±4.66   0.10±0.31  0.18±0.27   7.02±4.00  0.51±0.44  0.04±0.14
                Average  7.37±3.05   0.00±0.09  0.51±0.28   7.07±3.55   0.14±0.31  0.18±0.27   4.96±2.51  0.61±0.26  0.03±0.1
After 50 steps  S1       10.88±3.17  0.01±0.10  0.55±0.32   10.23±5.39  0.10±0.41  0.17±0.26   8.22±4.56  0.03±0.29  0.14±0.24
                S5       7.50±2.20   0.02±0.10  0.53±0.28   8.40±3.20   0.01±0.43  0.19±0.30   6.86±2.47  0.12±0.43  0.08±0.19
                S6       5.38±1.92   0.01±0.12  0.45±0.29   4.77±2.65   0.02±0.24  0.22±0.30   4.44±2.17  0.12±0.32  0.12±0.23
                S7       11.07±3.61  0.03±0.09  0.52±0.30   10.68±4.04  0.11±0.44  0.15±0.26   9.31±3.60  0.17±0.33  0.12±0.25
                S8       15.41±1.66  0.01±0.11  0.45±0.29   11.33±5.81  0.09±0.24  0.26±0.29   9.91±5.73  0.08±0.38  0.10±0.21
                S9       9.25±2.10   0.01±0.11  0.48±0.28   8.56±3.23   0.03±0.25  0.25±0.28   7.48±3.60  0.02±0.39  0.12±0.22
                S11      14.39±2.71  0.01±0.09  0.52±0.28   10.93±5.78  0.17±0.37  0.18±0.29   8.56±4.47  0.05±0.51  0.12±0.26
                Average  10.55±2.48  0.00±0.10  0.5±0.29    9.27±4.3    0.07±0.34  0.20±0.28   7.82±3.8   0.07±0.37  0.11±0.23
Activity Recognition (Classification) The goal in this set of experiments was to classify the 13 activities based only on their 2D projections. Note that this task is substantially difficult due to the loss of information caused by the projection: activities that differ in 3D space may appear highly similar in their 2D projections, leading to low classification accuracies. Table 2 reports the accuracy of DFFW-CRBMs against state-of-the-art methods, including SVMs and FFW-CRBMs. Averaging the results over all subjects, we observe that all three models perform comparably. It is worth mentioning that the classification accuracy of random choice in this scenario would be approximately 7.7% (one out of 13 classes), and all models perform approximately 5 times better.
We also performed two more experiments classifying activities with more training data, to further validate the presented methods and show that DFFW-CRBMs are capable of achieving state-of-the-art classification results. Here, we used 33% and 66% of the data to train the models, and the remainder for testing. It is clear from Figure 8 that all models improve as the amount of training data increases, reaching around 55% accuracy when 66% of the data is used for training.
Estimating 3D Skeleton Coordinates from 2D Projections (Present-Step Prediction) In this task, we estimate the 3D joint coordinates from their 2D counterparts while using 10% of the data for training. Results in Table 3 show that DFFW-CRBMs achieve better performance than FFW-CRBMs and FCRBM. Although FFW-CRBMs perform comparably on prediction error, the PCC and P-values indicate that DFFW-CRBMs drastically outperform FFW-CRBMs in the sense that their predictions are correlated with the ground truth, a property essential for accurate and reliable prediction.
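For reference, the two reported regression metrics can be computed as in the sketch below. The paper does not restate its formulas here, so normalization of the RMSE by the ground-truth range is an assumption, and `nrmse`/`pcc` are hypothetical helper names (SciPy's `pearsonr` would additionally give the P-value).

```python
import numpy as np

def nrmse(y_true, y_pred):
    # Root mean squared error, normalized (by assumption) by the
    # range of the ground truth and expressed in percent.
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * rmse / (np.max(y_true) - np.min(y_true))

def pcc(y_true, y_pred):
    # Pearson correlation coefficient between prediction and truth.
    return np.corrcoef(y_true, y_pred)[0, 1]

# Toy check: a constant offset gives small NRMSE and perfect correlation.
t = np.linspace(0, 2 * np.pi, 100)
truth = np.sin(t)
pred = np.sin(t) + 0.05
print(round(nrmse(truth, pred), 2))  # 2.5
print(round(pcc(truth, pred), 2))    # 1.0
```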
Prediction of 3D Skeleton Trajectories (Multi-Step Prediction) Here, the goal was to perform multi-step predictions of the 3D skeleton joints based only on 2D projections. Starting from an initial 2D state, the model was executed autonomously by recursively feeding back its 2D outputs to perform next-step predictions. Naturally, performance is expected to degrade over time, since prediction errors accumulate. Table 4, which reports the performance of the models after 1-step and 50-step predictions, confirms this phenomenon: all metrics show a decrease in both models' performance over time. However, Table 4 also shows that DFFW-CRBMs outperform FFW-CRBMs in both one-step and multi-step predictions, achieving an average NRMSE of 7.82% compared to 9.27% for FFW-CRBMs. Further results are summarized in Figure 9, which shows the minimum and maximum performance of both models. In these experiments, DFFW-CRBM is clearly the best performer in both prediction errors and correlations.
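The recursive feedback loop described above can be sketched as follows; `model_step` stands in for a trained model's one-step inference pass (a hypothetical interface, not the authors' implementation).

```python
import numpy as np

def multi_step_predict(model_step, x0, n_steps):
    # Autonomous rollout: each prediction is fed back as the next
    # input, so errors accumulate as the horizon grows.
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        xs.append(model_step(xs[-1]))
    return np.stack(xs[1:])

# Toy one-step "model": a contractive linear map on a 4-dim state.
A = 0.99 * np.eye(4)
traj = multi_step_predict(lambda x: A @ x, np.ones(4), n_steps=50)
print(traj.shape)  # (50, 4)
```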
4.2.2 Testing Generalization Capabilities
Motivation: In the second set of human-activity experiments, our goal was to determine to what extent DFFW-CRBMs can generalize across different human subjects and activities. The main motivation is that, in reality, subject-specific data is scarce, while data from other users or domains is abundant. Results reported in Table 5 and Figure 10 show that DFFW-CRBMs are capable of generalizing beyond specific subjects, thanks to their ability to learn latent features shared among a variety of tasks.
Table 5: Generalization across subjects (training on 6 subjects, testing on the held-out subject; mean ± standard deviation).

Task                        | Metrics      | SVM-RBF    | FCRBM     | FFW-CRBM   | DFFW-CRBM
Classification              | Accuracy [%] | 37.93±5.04 | N/A       | 44.96±2.68 | 44.49±6.60
Present-step 3D estimation  | NRMSE [%]    | N/A        | 7.58±3.62 | 7.52±3.63  | 3.93±1.75
                            | PCC          | N/A        | 0.00±0.09 | 0.14±0.24  | 0.79±0.16
                            | P-value      | N/A        | 0.52±0.28 | 0.21±0.28  | 0.01±0.03
Multi-step 3D prediction:
  After 1 step              | NRMSE [%]    | N/A        | 6.60±3.53 | 6.52±3.54  | 3.95±1.99
                            | PCC          | N/A        | 0.01±0.11 | 0.21±0.27  | 0.81±0.14
                            | P-value      | N/A        | 0.49±0.29 | 0.14±0.24  | 0.01±0.03
  After 50 steps            | NRMSE [%]    | N/A        | 7.27±3.81 | 7.24±3.84  | 7.34±3.84
                            | PCC          | N/A        | 0.01±0.11 | 0.13±0.46  | 0.16±0.50
                            | P-value      | N/A        | 0.49±0.31 | 0.10±0.20  | 0.10±0.22
Experiments: Here, data from 6 subjects was used to train the models, and predictions were performed on an unseen subject. The procedure was then repeated, holding out each subject in turn, to cross-validate the results. Further, to emulate real-world settings, only 10% of the data was used for training. During testing, however, all data from the testing subjects was used, increasing the task's difficulty. The same three goals as in the previous experiments were targeted.
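The cross-validation procedure above amounts to a leave-one-subject-out loop, which can be sketched as follows (a hypothetical helper, using the subject labels from the tables):

```python
def leave_one_subject_out(subjects):
    # Hold out each subject once; train on the remaining six.
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        yield train, held_out

subjects = ["S1", "S5", "S6", "S7", "S8", "S9", "S11"]
splits = list(leave_one_subject_out(subjects))
print(len(splits))                      # 7
print(splits[0][1], len(splits[0][0]))  # S1 6
```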
Activity Recognition (Classification): Results reported in Table 5 show that DFFW-CRBMs achieve results comparable to FFW-CRBMs, at an accuracy of about 44.5%, with both outperforming SVMs. Notably, these classification accuracies are higher than those in Table 2. This can be attributed to the availability of similar-domain data from other subjects, highlighting the latent feature similarities automatically learned by DFFW-CRBMs.
Estimation of 3D Skeleton Coordinates from 2D Projections (Present-Step Prediction): Again, DFFW-CRBMs achieve better performance than FFW-CRBMs in present-step estimation of the 3D skeleton joints from 2D projections, while both outperform FCRBM. It is worth highlighting that DFFW-CRBMs attain a high average prediction correlation to the ground truth of almost 0.8.
Prediction of 3D Skeleton Trajectories (Multi-Step Prediction): Finally, Figure 10 shows that DFFW-CRBMs surpass FFW-CRBMs in multi-step predictions on unseen subjects, achieving low prediction errors and high ground-truth correlation.
5 Conclusion
In this paper we proposed disjunctive factored four-way conditional restricted Boltzmann machines (DFFW-CRBMs), a novel machine learning technique for estimating 3D trajectories from their 2D projections using limited amounts of labeled data. Due to the new tensor factoring introduced by DFFW-CRBMs, these machines are capable of reaching substantially lower energy levels than state-of-the-art techniques, leading to more accurate prediction and classification results. Furthermore, DFFW-CRBMs can perform classification and accurate near-future prediction simultaneously, within one unified framework.
Two sets of experiments, one on a simulated ball-trajectory dataset and one on a real-world benchmark database, demonstrate the effectiveness of DFFW-CRBMs. The empirical evaluation showed that our method is capable of outperforming state-of-the-art machine learning algorithms in both classification and regression. Specifically, DFFW-CRBMs were capable of reaching substantially lower energy levels than FFW-CRBMs (approximately three times less energy over the full datasets, independently of the number of factors or hidden neurons). This leads to at least double the accuracy for real-valued predictions, while attaining similar classification performance, at no increase in computational complexity.
References
 (1) H. Silva, A. Dias, J. Almeida, A. Martins, E. Silva, Realtime 3d ball trajectory estimation for robocup middle size league using a single camera, in: T. Röfer, N. Mayer, J. Savage, U. Saranlı (Eds.), RoboCup 2011: Robot Soccer World Cup XV, Vol. 7416 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2012, pp. 586–597. doi:10.1007/9783642320606_50.
 (2) S. Qiu, Y. Yang, J. Hou, R. Ji, H. Hu, Z. Wang, Ambulatory estimation of 3d walking trajectory and knee joint angle using marg sensors, in: Innovative Computing Technology (INTECH), 2014 Fourth International Conference on, 2014, pp. 191–196. doi:10.1109/INTECH.2014.6927742.
 (3) S. Maleschlijski, G. Sendra, A. Di Fino, L. LealTaixé, I. Thome, A. Terfort, N. Aldred, M. Grunze, A. Clare, B. Rosenhahn, A. Rosenhahn, Three dimensional tracking of exploratory behavior of barnacle cyprids using stereoscopy, Biointerphases 7 (14). doi:10.1007/s137580120050x.
 (4) M. Jauzac, E. Jullo, J.P. Kneib, H. Ebeling, A. Leauthaud, C.J. Ma, M. Limousin, R. Massey, J. Richard, A weak lensing mass reconstruction of the largescale filament feeding the massive galaxy cluster macs j0717.5+3745, Monthly Notices of the Royal Astronomical Society 426 (4) (2012) 3369–3384. doi:10.1111/j.13652966.2012.21966.x.
 (5) J. Tao, B. Risse, X. Jiang, R. Klette, 3d trajectory estimation of simulated fruit flies, in: Proceedings of the 27th Conference on Image and Vision Computing New Zealand, IVCNZ ’12, ACM, New York, NY, USA, 2012, pp. 31–36. doi:10.1145/2425836.2425844.
 (6) J. Pinezich, J. Heller, T. Lu, Ballistic projectile tracking using cw doppler radar, Aerospace and Electronic Systems, IEEE Transactions on 46 (3) (2010) 1302–1311. doi:10.1109/TAES.2010.5545190.
 (7) H. Larochelle, Y. Bengio, Classification using discriminative restricted boltzmann machines, in: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, ACM, 2008, pp. 536–543. doi:10.1145/1390156.1390224.
 (8) R. Salakhutdinov, A. Mnih, G. Hinton, Restricted boltzmann machines for collaborative filtering, in: Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML 2007), ACM, 2007, pp. 791–798.
 (9) D. C. Mocanu, G. Exarchakos, A. Liotta, Deep learning for objective quality assessment of 3d images, in: Image Processing (ICIP), 2014 IEEE International Conference on, 2014, pp. 758–762. doi:10.1109/ICIP.2014.7025152.
 (10) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Humanlevel control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. doi:10.1038/nature14236.
 (11) H. BouAmmar, D. C. Mocanu, M. E. Taylor, K. Driessens, K. Tuyls, G. Weiss, Automatically mapped transfer between reinforcement learning tasks via threeway restricted boltzmann machines, in: Machine Learning and Knowledge Discovery in Databases, Vol. 8189 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 449–464. doi:10.1007/9783642409912_29.
 (12) P. V. Gehler, A. D. Holub, M. Welling, The rate adapting poisson model for information retrieval and object recognition, in: In Proceedings of 23rd International Conference on Machine Learning (ICML06), ACM Press, 2006, p. 2006.
 (13) D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multiscale deep network, in: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 813 2014, Montreal, Quebec, Canada, 2014, pp. 2366–2374.

 (14) P. Rasti, T. Uiboupin, S. Escalera, G. Anbarjafari, Convolutional Neural Network Super Resolution for Face Recognition in Surveillance Monitoring, Springer International Publishing, Cham, 2016, pp. 175–184. doi:10.1007/9783319417783_18.
 (15) K. Nasrollahi, S. Escalera, P. Rasti, G. Anbarjafari, X. Baro, H. J. Escalante, T. B. Moeslund, Deep learning based superresolution for improved action recognition, in: 2015 International Conference on Image Processing Theory, Tools and Applications (IPTA), 2015, pp. 67–72. doi:10.1109/IPTA.2015.7367098.
 (16) J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, A. Blake, Efficient human pose estimation from single depth images, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12) (2013) 2821–2840. doi:http://doi.ieeecomputersociety.org/10.1109/TPAMI.2012.241.

 (17) I. Sutskever, G. E. Hinton, Learning multilevel distributed representations for highdimensional sequences, in: M. Meila, X. Shen (Eds.), AISTATS, Vol. 2 of JMLR Proceedings, JMLR.org, 2007, pp. 548–555.
 (18) P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in: D. E. Rumelhart, J. L. McClelland, et al. (Eds.), Parallel Distributed Processing: Volume 1: Foundations, MIT Press, Cambridge, 1987, pp. 194–281.
 (19) G. W. Taylor, G. E. Hinton, Factored conditional restricted boltzmann machines for modeling motion style, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 2009, pp. 1025–1032.
 (20) G. W. Taylor, G. E. Hinton, S. T. Roweis, Two distributedstate models for generating highdimensional time series, Journal of Machine Learning Research 12 (2011) 1025–1068.
 (21) D. C. Mocanu, H. BouAmmar, D. Lowet, K. Driessens, A. Liotta, G. Weiss, K. Tuyls, Factored four way conditional restricted boltzmann machines for activity recognition, Pattern Recognition Letters. doi:10.1016/j.patrec.2015.01.013.
 (22) D. C. Mocanu, M. T. Vega, E. Eaton, P. Stone, A. Liotta, Online contrastive divergence with generative replay: Experience replay without storing data, CoRR abs/1610.05555.
 (23) Y. Bengio, Learning deep architectures for ai, Found. Trends Mach. Learn. 2 (1) (2009) 1–127.
 (24) D. C. Mocanu, E. Mocanu, P. H. Nguyen, M. Gibescu, A. Liotta, A topological insight into restricted boltzmann machines, Machine Learning 104 (2) (2016) 243–270. doi:10.1007/s109940165570z.
 (25) G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence, Neural Computation 14 (8) (2002) 1771–1800.
 (26) M. Ponti, J. Kittler, M. Riva, T. de Campos, C. Zor, A decision cognizant kullback–leibler divergence, Pattern Recognition 61 (2017) 470 – 478. doi:http://dx.doi.org/10.1016/j.patcog.2016.08.018.
 (27) S. Elaiwat, M. Bennamoun, F. Boussaid, A spatiotemporal rbmbased model for facial expression recognition, Pattern Recognition 49 (2016) 152 – 161. doi:http://dx.doi.org/10.1016/j.patcog.2015.07.006.
 (28) G. Hinton, M. Pollefeys, J. Susskind, R. Memisevic, Modeling the joint density of two images under a variety of transformations, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2793–2800. doi:10.1109/CVPR.2011.5995541.
 (29) G. Hinton, A practical guide to training restricted boltzmann machines, in: Neural Networks: Tricks of the Trade, Vol. 7700 of Lecture Notes in Computer Science, Springer, 2012, pp. 599–619. doi:10.1007/9783642352898_32.
 (30) C. Cortes, V. Vapnik, SupportVector Networks, Mach. Learn. 20 (3) (1995) 273–297.
 (31) T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
 (32) C. Ionescu, F. Li, C. Sminchisescu, Latent structured models for human pose estimation, in: International Conference on Computer Vision, 2011.
 (33) C. Ionescu, D. Papava, V. Olaru, C. Sminchisescu, Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7) (2014) 1325–1339.