1 Introduction
The statistical modelling of temporal data is a problem of great interest to machine learning, not only because of the number of data sources that are intrinsically temporal, but also because of the growing number of applications that interact with users in real time and require efficient and scalable handling of large streams of temporal data. Good models of the statistical structure of datasets are also generally thought to yield good representations for discriminative or predictive tasks on these data. One class of statistical model which has received a great deal of attention in the recent literature is the Restricted Boltzmann Machine (Hinton and Salakhutdinov, 2006). The Restricted Boltzmann Machine (RBM) is a simple graphical model which is easily trainable using contrastive divergence (CD) learning (Carreira-Perpinan and Hinton, 2005). There are two canonical ways in which RBMs have been extended to model temporal data: the Temporal RBM (TRBM) (Sutskever and Hinton, 2007) and the Conditional RBM (CRBM) (Taylor et al., 2007), both of which have had notable success. The TRBM learns temporal correlations between latent representations for each temporal sample, while the CRBM learns a latent representation for the whole data sequence (see Section 2 below).
One marked advantage of these methods is that they allow for the generation of samples from the learned data distribution. However, although these methods have had success generating data from a number of data sources, they have only done so in relatively narrow contexts. Here we improve on the training methods of temporal and conditional RBMs to allow for better and more robust generation from general datasets.
The TRBM and CRBM seek to model the structure of the data, but the learning methods usually employed disregard the causality in its structure. For these models, contrastive divergence learning seeks to approximately maximize the likelihood of sequences of observed data (in the case of the TRBM) or the conditional likelihood of the present data given the past (in the case of the CRBM), without any regard to the underlying dynamics. Naturally, the method learns a dynamical model of the data, but explicitly training it to do so could result in better models of the system's dynamics. We propose a simple method to enforce the dynamics of the data in the model's learnt representations. We achieve this by training the model as a neural network for prediction, similar to what is done in denoising autoencoders (Vincent et al., 2010). We refer to this approach as Temporal Autoencoding (TA). By itself, TA does not yield good generative models. However, by initializing the model through Temporal Autoencoding and then applying contrastive divergence training, one can bias the structure of the models towards the dynamics of the data, resulting in better generative performance.
One natural way to measure the quality of a generative model is to take samples from it and compare them to samples from the dataset. A simple way to quantify the fidelity of these samples is to provide partial samples from the data where certain dimensions are left out and then to generate the missing dimensions by sampling from the model. This approach is generally called filling-in and is particularly well-suited to temporal applications, as we can condition on the observations up to a certain time and fill in the missing frames by sampling from the model. One can then compare the generated sample to the true data, for example by taking the mean squared error (MSE) or the mean absolute percentage error (MAPE) between them. We will use these measures in a filling-in-frames task to quantify the quality of our models throughout this paper.
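As a minimal sketch of the two error measures named above, the following assumes the plain textbook definitions of MSE and MAPE (the exact variant used in the experiments is not specified here). The function names and the toy arrays are purely illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, averaged over all frames and dimensions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (assumes no true value is zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

# Hypothetical example: one true frame and one filled-in frame.
true_frame = np.array([2.0, 4.0, 5.0])
filled_in = np.array([1.0, 4.0, 6.0])
print(mse(true_frame, filled_in), mape(true_frame, filled_in))
```

MAPE is scale-free, which makes it convenient for comparing series of very different magnitudes, whereas MSE is expressed in the squared units of the data.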
Temporal Autoencoding pretraining improves the performance of both generative models across all datasets considered, by as much as 80%, with approximately the same training time as models trained in the conventional manner. These findings hold across different modalities of data, such as human motion capture data and a number of datasets taken from the M3 forecasting competition (Makridakis and Hibon, 2000), which encompass yearly, quarterly, monthly and a few unspecified types of temporal data. The fact that the proposed pretraining improves model performance across datasets for both the CRBM and TRBM indicates that the method provides a robust improvement in the generative performance of both RBM models. Furthermore, the performance increase is not limited to short timescales, but can be seen to hold even for longer periods of time ranging over the memory encoded directly by the method.
Autoencoders have recently been cast into a new light by considering them as generative models (Bengio et al., 2013). Though we do not take that approach here, we firmly believe that autoencoder training can improve the performance of generative models greatly. This has been shown for the temporal models considered here, and we expect this to lead to a significant improvement towards training temporal generative models.
2 Methods
We propose a new pretraining method for both the TRBM and the CRBM, based on a denoising autoencoder approach through time. To this end we briefly discuss the RBM, the denoising autoencoder and the temporal models used. Throughout the paper we will denote the activation of visible layers by $v = (v_1, \dots, v_{N_v})$ and the activation of hidden layers by $h = (h_1, \dots, h_{N_h})$, where $N_v$ is the number of visible units and $N_h$ the number of hidden units. In the case of temporal models we will denote the present state of the visible and hidden layers by $v_0$ and $h_0$, and the successively delayed units by $v_m$ and $h_m$, where $m = 1, \dots, M$ and $M$ is the number of delayed units considered. The naming convention is shown in Figure 1.
2.1 Restricted Boltzmann Machines
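Restricted Boltzmann Machines are generative models which assume all-to-all symmetric connectivity between the visible and hidden variables (see Figure 1a) and seek to model the structure of a given dataset. They are energy-based models, parametrized by an $N_v$-by-$N_h$-dimensional weight matrix $W$, a bias for the visible layer $b_v$ and a bias for the hidden layer $b_h$, where $N_v$ and $N_h$ are the numbers of visible and hidden units. The energy of a given configuration of visible activations $v$ and hidden activations $h$ is given by
$$E(v, h) = -v^\top W h - b_v^\top v - b_h^\top h,$$
and the probability of a given configuration is given by
$$P(v, h) = \frac{1}{Z} \exp\left(-E(v, h)\right),$$
where $Z = \sum_{v, h} \exp\left(-E(v, h)\right)$ is the partition function. One noted advantage of the RBM is that the visible units are independent of each other when conditioned on the hidden units and vice versa. This allows for efficient sampling, and for the exact calculation of a number of averages. Namely, we can evaluate exactly the conditional distributions
$$P(h_j = 1 \mid v) = \sigma\Big(\sum_i W_{ij} v_i + b_{h,j}\Big)$$
and
$$P(v_i = 1 \mid h) = \sigma\Big(\sum_j W_{ij} h_j + b_{v,i}\Big),$$
where
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
is the sigmoid function.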
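One can extend the RBM to continuous-valued visible variables by modifying the energy function, to obtain the Gaussian-binary RBM
$$E(v, h) = \sum_i \frac{(v_i - b_{v,i})^2}{2\sigma_i^2} - \sum_{i,j} \frac{v_i}{\sigma_i^2} W_{ij} h_j - \sum_j b_{h,j} h_j,$$
with weights $W_{ij}$, visible biases $b_{v,i}$, hidden biases $b_{h,j}$ and visible variances $\sigma_i^2$. This then leads to the conditional distributions
$$P(v_i \mid h) = \mathcal{N}(v_i; \mu_i, \sigma_i^2), \qquad P(h_j = 1 \mid v) = \sigma\Big(\sum_i W_{ij} \frac{v_i}{\sigma_i^2} + b_{h,j}\Big),$$
where $\sigma(\cdot)$ is the sigmoid function, $\mathcal{N}(\cdot\,; \mu, \sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$ and
$$\mu_i = b_{v,i} + \sum_j W_{ij} h_j.$$
Often the variances $\sigma_i^2$ are constrained to have the same value across dimensions, or simply taken to be constant. To learn them from the data, however, one must take extra care to deal with vanishingly small variances.
Like most statistical models, RBMs can be trained by maximizing the log likelihood of the data. This, however, proves to be intractable even for the case of the RBM, and we are left with maximizing surrogate functions. The derivative of the log likelihood of an observed visible state $v$ can be written as
$$\frac{\partial \log P(v)}{\partial \theta} = -\left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{P(h \mid v)} + \left\langle \frac{\partial E(v', h)}{\partial \theta} \right\rangle_{P(v', h)},$$
where $\theta$ is any of the parameters of the model. Note that the first term is easy to compute, but the second one involves averages over the full distribution $P(v, h)$, which is intractable. RBMs are therefore usually trained through contrastive divergence, which approximately follows the gradient of the cost function
$$\mathrm{CD}_n = \mathrm{KL}(p_0 \,\|\, p_\infty) - \mathrm{KL}(p_n \,\|\, p_\infty),$$
where $p_0$ is the data distribution, $p_n$ is the distribution of the visible layer after $n$ Markov chain Monte Carlo (MCMC) steps, $p_\infty$ is the equilibrium distribution of the model and
$$\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
is the Kullback-Leibler divergence (Carreira-Perpinan and Hinton, 2005). The samples from the data distribution are simply taken from the data, whereas the samples from $p_n$ are taken by running an MCMC chain for $n$ steps. The function $\mathrm{CD}_n$ gives an approximation to maximum-likelihood (ML) estimation of the weight matrix $W$. Further approximation is still needed, as the cost still involves intractable averages, but it is generally found that the approximate parameter update given by
$$\Delta \theta \propto -\left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p_0} + \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p_n}$$
already gives very good results. For the binary RBM, the weight updates then become
$$\Delta W_{ij} \propto \langle v_i h_j \rangle_{p_0} - \langle v_i h_j \rangle_{p_n}.$$
In general, $n = 1$ is already sufficient for practical purposes (Hinton and Salakhutdinov, 2006).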
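To make the contrastive divergence update concrete, here is a minimal NumPy sketch of a single CD-1 step for a binary RBM. It is an illustration under assumptions, not the authors' implementation: the function name `cd1_update`, the learning rate, the use of hidden probabilities instead of samples in the statistics, and the random toy data are all choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, rng, lr=0.1):
    """One CD-1 step on a minibatch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden probabilities given the data are exact.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step down to the visibles and up again.
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    n = v0.shape[0]
    # Data-driven correlations minus correlations after one Gibbs step.
    W_new = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b_v_new = b_v + lr * np.mean(v0 - v1, axis=0)
    b_h_new = b_h + lr * np.mean(ph0 - ph1, axis=0)
    return W_new, b_v_new, b_h_new

# Toy usage on random binary data (hypothetical sizes).
rng = np.random.default_rng(0)
data = (rng.random((32, 6)) < 0.5).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
b_v = np.zeros(6)
b_h = np.zeros(4)
W, b_v, b_h = cd1_update(data, W, b_v, b_h, rng)
print(W.shape, b_v.shape, b_h.shape)
```

Using the hidden probabilities rather than binary samples when accumulating the statistics is a common variance-reduction choice; sampling the hidden states there instead would also be a valid CD-1 variant.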
2.2 Autoencoders
Autoencoders (AEs) are deterministic models with two weight matrices $W_1$ and $W_2$ representing the flow of data from the visible-to-hidden and hidden-to-visible layers respectively (see Figure 1b) [1]. AEs are trained to perform optimal reconstruction of the visible layer, often by minimizing the mean-squared error (MSE) in a reconstruction task. This is usually evaluated as follows: given an activation pattern in the visible layer $v$, we evaluate the activation of the hidden layer by $h = \sigma(W_1 v + b_h)$, where $\sigma$ is the sigmoid function. These activations are then propagated back to the visible layer through $\hat{v} = \sigma(W_2 h + b_v)$, and the weights $W_1$ and $W_2$ are trained to minimize the distance measure between the original and reconstructed visible layers. Therefore, given a set of image samples $\{v^d\}$ we can define the cost function. For example, using the squared Euclidean distance between the original data $v^d$ and the reconstructed data $\hat{v}^d$, we have the loss function
$$L(W_1, W_2, b_v, b_h) = \sum_d \left\| v^d - \hat{v}^d \right\|^2.$$
The weights can then be learned through stochastic gradient descent on the cost function. Autoencoders often yield better representations when trained on corrupted versions of the original data, performing gradient descent on the distance to the uncorrupted data. This approach is called a denoising autoencoder (Vincent et al., 2010). Note that in the AE, the activations of all units are continuous and not binary, and usually take values between 0 and 1.
[1] Often one uses only one matrix $W$, propagating up through $W$ and down through its transpose.
2.3 Temporal Restricted Boltzmann Machine
Temporal Restricted Boltzmann Machines (TRBMs) are a temporal extension of the standard RBM whereby connections are included from previous time steps between hidden layers, from visible to hidden layers and from visible to visible layers. Learning is conducted in the same manner as for a normal RBM, using contrastive divergence, and it has been shown that such a model can be used to learn nonlinear system evolutions such as the dynamics of a ball bouncing in a box (Sutskever and Hinton, 2007). A more restricted version of this model, discussed by Sutskever et al. (2008), can be seen in Figure 1c and only contains temporal connections between the hidden layers. We restrict ourselves to this model architecture throughout the paper.
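The energy of the model for a given configuration of the visible layers $V = (v_0, \dots, v_M)$ and hidden layers $H = (h_0, \dots, h_M)$, with index $0$ denoting the present time step and $m = 1, \dots, M$ the delayed ones, is given by
$$E(H, V \mid \mathcal{W}) = \sum_{m=0}^{M} E_{\mathrm{RBM}}(h_m, v_m \mid W, b) - \sum_{m=1}^{M} h_0^\top W_m h_m, \qquad (1)$$
where we have used $E_{\mathrm{RBM}}$ for the energy of a single RBM with biases $b = (b_v, b_h)$ and $\mathcal{W} = (W, W_1, \dots, W_M)$, where $W$ are the static weights and $W_1, \dots, W_M$ are the delayed weights for the temporally delayed hidden layers (see Figure 1c). Note that because the hidden layers are coupled, the expectations in the CD cost cannot be simply evaluated as in the RBM, and must be estimated by MCMC sampling, making training and sampling in this model more difficult. More specifically, note that the conditional distribution $P(H \mid V)$ is already intractable. A simple way to deal with this is the so-called filtering approximation, where we sample from the past hidden layers ignoring the present hidden layer and then sample from the present hidden layer conditioned on the past.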
2.4 Conditional Restricted Boltzmann Machines
One way to overcome the problems of the TRBM has been proposed in the Conditional Restricted Boltzmann Machine (CRBM) (Taylor et al., 2007). The CRBM has only one hidden layer, which receives input from all visible layers, past and present. Additionally, the present visible layer receives input from past visible layers. Unlike the TRBM, only the present hidden and visible layers are considered to be free, whereas the past visible states are conditioned on. The energy of the model can be written as
$$E(v_0, h \mid v_1, \dots, v_M) = E_{\mathrm{RBM}}(v_0, h \mid W, b) - \sum_{m=1}^{M} v_m^\top B_m h - \sum_{m=1}^{M} v_m^\top A_m v_0,$$
where $v_0$ is the present visible layer, $v_1, \dots, v_M$ are the $M$ past visible layers, $B_m$ are the weights from the past visible layers to the hidden layer and $A_m$ are the visible-to-visible weights. The model architecture can be seen in Figure 1d. Using this formulation, the hidden layer can still be easily marginalized over, allowing for more efficient training using contrastive divergence. The CRBM is possibly the most successful of the temporal RBM models to date and has been shown to both model and generate data from complex dynamical systems such as human motion capture data and video textures (Taylor, 2009).
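The conditioning on past frames described above can be read as adding history-dependent ("dynamic") biases to an ordinary RBM. The sketch below illustrates this for real-valued visibles with unit variance. The array layouts, the names `A` (past-visible to present-visible weights) and `B` (past-visible to hidden weights) and the helper functions are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_hidden_probs(v_now, v_past, W, B, b_h):
    """P(h_j = 1 | present frame, past frames).

    v_now: (n_vis,), v_past: (n_delays, n_vis),
    W: (n_vis, n_hid), B: (n_delays, n_vis, n_hid), b_h: (n_hid,)
    """
    # The past frames shift the hidden bias.
    dynamic_bias = b_h + np.einsum("ti,tij->j", v_past, B)
    return sigmoid(v_now @ W + dynamic_bias)

def crbm_visible_mean(h, v_past, W, A, b_v):
    """Mean of the unit-variance Gaussian P(v_now | h, past frames).

    A: (n_delays, n_vis, n_vis), b_v: (n_vis,)
    """
    # The past frames shift the visible bias (autoregressive part).
    dynamic_bias = b_v + np.einsum("ti,tik->k", v_past, A)
    return W @ h + dynamic_bias

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
n_vis, n_hid, n_delays = 3, 2, 2
W = rng.standard_normal((n_vis, n_hid))
A = rng.standard_normal((n_delays, n_vis, n_vis))
B = rng.standard_normal((n_delays, n_vis, n_hid))
v_past = rng.standard_normal((n_delays, n_vis))
v_now = rng.standard_normal(n_vis)
p_h = crbm_hidden_probs(v_now, v_past, W, B, np.zeros(n_hid))
h = (rng.random(n_hid) < p_h).astype(float)
v_mean = crbm_visible_mean(h, v_past, W, A, np.zeros(n_vis))
print(p_h.shape, v_mean.shape)
```

Because the past only enters through these bias terms, the hidden units remain conditionally independent given the present and past frames, which is what keeps contrastive divergence training as simple as for a static RBM.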
2.5 Temporal Autoencoding Training
The usual CD training for the TRBM and CRBM seeks to maximize the likelihood of the data observed. This usually works quite well and has been shown to allow the trained models to reproduce complex temporal data such as video of a bouncing ball or human motion capture. However, there is one bit of essential information which these training methods overlook. They ignore that the current time frame has a causal dependence on the previous frames. If the data comes from a time series it is a natural assumption that the future states are given by some function of the past states, latent variables and possibly noise. We seek to explore this property, by explicitly learning a representation which represents these dynamics.
We do so by treating the hidden layers of the model as an information bottleneck, similar to what is done in the training of the denoising autoencoder (Vincent et al., 2010). We treat the past states of the time series up to a number of delays as a noisy representation of the present state and propagate the activations through the model, considering it as a neural network with sigmoidal activation functions, and perform gradient descent on the quadratic error of the reconstructed present state. In this way, we explicitly constrain the model to represent the dynamic structure of the data.
This essentially amounts to performing supervised learning for reconstruction using the architectures shown in Figure 2. Though the idea behind the training procedure is the same for both models, the specifics are slightly different and as such we consider them separately below.
2.5.1 Temporal Autoencoding for the TRBM
Let us first consider the TRBM. The energy of the model is given by Eq. (1); the model is essentially an $M$-th order autoregressive RBM (with $M$ the number of delays), which is usually trained by standard contrastive divergence. Here we propose to train it with a novel approach, highlighting the temporal structure of the stimulus. First, the individual RBM visible-to-hidden weights $W$ are initialized through contrastive divergence learning with a sparsity constraint on static samples of the dataset. After that, to ensure that the weights representing the hidden-to-hidden connections ($W_m$) encode the dynamic structure of the ensemble, we initialize them by pretraining in the fashion of a denoising autoencoder. For that, we consider the model to be a deterministic multilayer perceptron with continuous activation in the hidden layers. We then consider the $M$ delayed visible layers as features and try to predict the current visible layer by projecting through the hidden layers. In essence, we are considering the model to be a feedforward network, where the delayed visible layers would form the input layer, the delayed hidden layers would constitute the first hidden layer, the current hidden layer would be the second hidden layer and the current visible layer would be the output, as is pictured in Figure 2. Given sample activations of the visible layers given by $v_0^d, \dots, v_M^d$, with $v_0^d$ the current frame and $v_1^d, \dots, v_M^d$ the delayed frames, we can then write the prediction of the network as $\hat{v}_0^d(v_1^d, \dots, v_M^d)$, where the index $d$ runs over the data points. The exact format of this function is described in Algorithm 1. We therefore minimize the reconstruction error given by
$$L = \sum_d \left\| v_0^d - \hat{v}_0^d(v_1^d, \dots, v_M^d) \right\|^2,$$
where the sum over $d$ goes over the entire dataset. After the Temporal Autoencoding is completed, the whole model (both visible-to-hidden and hidden-to-hidden weights) is trained together using contrastive divergence (CD) training. A summary of the training method is given in Table 1.
Step | Action
1. Static RBM Training | Constrain the static weights using CD on single-frame samples of the training data
2. Temporal Autoencoding | Constrain the temporal weights $W_1$ to $W_M$ using a denoising autoencoder on multi-frame samples of the data
3. Model Finalisation | Train all model weights together using CD on multi-frame samples of the data
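The feedforward reading of the TRBM used in the Temporal Autoencoding step (delayed visibles, then delayed hiddens, then the current hidden layer, then the current visible layer) can be sketched as follows. This is only an illustration under assumptions: the linear output layer (appropriate for unit-variance Gaussian visibles), the array layouts and the names `trbm_ta_predict` and `W_delay` are choices made here, not taken from the original algorithm listing.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def trbm_ta_predict(v_past, W, W_delay, b_v, b_h):
    """Predict the current visible frame from the delayed frames.

    v_past: (n_delays, n_vis), W: (n_vis, n_hid),
    W_delay: (n_delays, n_hid, n_hid), b_v: (n_vis,), b_h: (n_hid,)
    """
    # First hidden layer: each delayed frame is encoded with the static weights.
    h_past = sigmoid(v_past @ W + b_h)
    # Second hidden layer: current hidden units driven by the delayed hiddens.
    h_now = sigmoid(b_h + np.einsum("mjk,mk->j", W_delay, h_past))
    # Output layer: project back to the visible space with the static weights.
    return W @ h_now + b_v

def reconstruction_error(v_now, v_pred):
    """Squared Euclidean distance between the true and predicted frame."""
    return float(np.sum((v_now - v_pred) ** 2))

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
n_vis, n_hid, n_delays = 4, 3, 2
W = 0.1 * rng.standard_normal((n_vis, n_hid))
W_delay = 0.1 * rng.standard_normal((n_delays, n_hid, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)
v_past = rng.standard_normal((n_delays, n_vis))
v_now = rng.standard_normal(n_vis)
v_pred = trbm_ta_predict(v_past, W, W_delay, b_v, b_h)
print(v_pred.shape, reconstruction_error(v_now, v_pred))
```

In step 2 of the procedure one would differentiate this error with respect to the delayed weights, for example with an automatic differentiation package, and update them by gradient descent.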
2.5.2 Temporal Autoencoding for the CRBM
The procedure is very similar for the CRBM. First the static weights are initialized with contrastive divergence training. After that, we reconstruct the present frame from its past observations by passing them through the hidden layer. The obtained reconstruction is then a function of the past observations, the weight matrices and the biases, which we can write as $\hat{v}_0^d(v_1^d, \dots, v_M^d)$ for a data point $d$ with present frame $v_0^d$ and $M$ past frames $v_1^d, \dots, v_M^d$. We then perform stochastic gradient descent on the reconstruction error
$$L = \sum_d \left\| v_0^d - \hat{v}_0^d(v_1^d, \dots, v_M^d) \right\|^2.$$
After this step is finished we proceed to train the CRBM with normal contrastive divergence to fine-tune the weights for better generation. A summary of the training procedure is given in Table 1 and a complete description of the temporal autoencoding step is given in Algorithm 2.
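As a rough illustration of this pretraining step, the sketch below trains a one-hidden-layer network to predict the present frame from stacked past frames with minibatch gradient descent and hand-written backpropagation. Everything specific here is an assumption for the example: the toy sinusoidal data, the linear output layer, the omission of direct visible-to-visible connections, the hyperparameters and all function names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_windows(series, n_delays):
    """Stack n_delays past frames as input and the following frame as target."""
    inputs = np.stack([series[i:i + n_delays].ravel()
                       for i in range(len(series) - n_delays)])
    return inputs, series[n_delays:]

def forward(x, params):
    B, W, b_h, b_v = params
    h = sigmoid(x @ B + b_h)        # hidden layer driven by the past frames
    return h, h @ W.T + b_v         # linear output for real-valued visibles

def loss(x, y, params):
    _, y_hat = forward(x, params)
    return float(np.mean(np.sum((y - y_hat) ** 2, axis=1)))

def sgd_step(x, y, params, lr):
    B, W, b_h, b_v = params
    h, y_hat = forward(x, params)
    err = 2.0 * (y_hat - y) / x.shape[0]    # gradient of the loss w.r.t. y_hat
    dz = (err @ W) * h * (1.0 - h)          # backpropagate through the sigmoid
    return (B - lr * (x.T @ dz), W - lr * (err.T @ h),
            b_h - lr * dz.sum(axis=0), b_v - lr * err.sum(axis=0))

# Toy 2-D time series (hypothetical stand-in for real data).
rng = np.random.default_rng(0)
t = np.arange(300)
series = np.stack([np.sin(0.1 * t), np.cos(0.1 * t)], axis=1)
x, y = make_windows(series, n_delays=3)
n_hid = 8
params = (0.1 * rng.standard_normal((x.shape[1], n_hid)),
          0.1 * rng.standard_normal((y.shape[1], n_hid)),
          np.zeros(n_hid), np.zeros(y.shape[1]))
loss_before = loss(x, y, params)
for epoch in range(100):
    order = rng.permutation(len(x))
    for start in range(0, len(x), 32):
        batch = order[start:start + 32]
        params = sgd_step(x[batch], y[batch], params, lr=0.05)
loss_after = loss(x, y, params)
print(loss_before, loss_after)
```

After such a pretraining phase, the learned weights would serve as the initialization for the contrastive divergence fine-tuning described above.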
2.5.3 Implementation
Gradient descent on the cost functions explained above involves backpropagation through the hidden layers. This has been made relatively simple by automatic differentiation packages such as Theano (Bergstra et al., 2010). We have implemented the temporal autoencoding training as an MLP and then proceeded to perform stochastic gradient descent on the loss using minibatches.
3 Experiments
We have applied our pretraining method to the CRBM and TRBM using two datasets: the motion-capture data described by Taylor et al. (2007) and the M3 competition dataset (Makridakis and Hibon, 2000). For both datasets we separated the data into a training and a validation set, then trained our models on the training set and evaluated them on a filling-in-frames task on the validation set. For all experiments we used a Gaussian-binary RBM model with variance fixed to 1.
3.1 Motion-Capture Data
We assessed the impact of our pretraining method by applying it to the 49-dimensional human motion capture data described by Taylor et al. (2007), using this as a benchmark and comparing the performance to the models without pretraining [2]. All the models were implemented using Theano (Bergstra et al., 2010), have a temporal dependence of 6 frames and were trained using minibatches of 100 samples for 500 epochs [3]. The training time for the models was approximately equal. Training was performed on the first 2000 samples of the dataset, after which the models were presented with 1000 snippets of the data not included in the training set and required to generate the next frame in the sequence. The generation in the TRBM is done using the filtering approximation, that is, by taking a sample from the hidden layers at the delayed time steps and then Gibbs sampling from the RBM at the present time step while keeping the others fixed as biases. The visible layer at the present time step is initialized with noise and we sample for 100 Gibbs steps from the model. The results of a single trial prediction for 4 random dimensions of the dataset can be seen in Figure 3, and the mean squared error and standard deviations of the model predictions over 100 repetitions of the task can be seen in Table 2.
[2] In this section we refer to the reduced TRBM model referenced in (Sutskever et al., 2008), with only hidden-to-hidden temporal connections.
[3] For the TRBM and CRBM, training epochs were broken up into 100 static pretraining epochs and 400 epochs for all the temporal weights together. For the TA pretrained models, aTRBM and aCRBM, training epochs were broken up into 100 static pretraining epochs, 50 autoencoding epochs per delay and 100 epochs for all the temporal weights together, totalling the same number of training epochs (500).
The models trained with Temporal Autoencoding significantly outperform their CD-only trained counterparts. The CRBM shows an improvement of approximately 56%, while the TRBM shows an improvement of almost 80% on this dataset. The performance can be further improved by sampling from the hidden layer multiple times and taking the average prediction. This is akin to taking the Bayesian posterior mean estimator and leads to a further decrease in the MSE, to 78% for the CRBM and 91% for the TRBM relative to straight CD training.
One could argue that the improved performance of the TA pretrained model simply shows that a deterministic neural network is better suited to the task at hand. To make sure that the improvement in performance is due to the interplay of both training approaches, we have also trained a deterministic multilayer perceptron (MLP) with the architecture shown in Figure 2. This is shown in the rightmost column in fig:awesome. As is shown, this simple deterministic approach outperforms the CD-trained model, but not the model trained with Temporal Autoencoding.
These improvements also hold for longer time scales if we keep feeding the model's predictions back into it and let it generate autonomously. The TA pretraining significantly lowers the prediction error. Even after 6 frames, when all the visible layer frames were generated by the model, the MSE is still approximately as low as or lower than when filling in one frame from the data without pretraining. The prediction errors for our models are shown in fig:prediction.
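The autonomous generation described above, where each generated frame is fed back as input for the next prediction, can be sketched generically as below. The `rollout` helper and the toy linear-extrapolation predictor are hypothetical stand-ins; in practice `predict_next` would wrap a trained TRBM or CRBM.

```python
import numpy as np

def rollout(predict_next, history, n_steps):
    """Generate n_steps frames, feeding each prediction back as input.

    predict_next: callable mapping a (n_delays, n_vis) window to a (n_vis,) frame.
    history: the last n_delays observed frames, oldest first.
    """
    window = np.array(history, dtype=float)
    generated = []
    for _ in range(n_steps):
        frame = np.asarray(predict_next(window), dtype=float)
        generated.append(frame)
        # Drop the oldest frame and append the newly generated one.
        window = np.vstack([window[1:], frame])
    return np.array(generated)

def linear_extrapolation(window):
    """Toy predictor: continue the trend of the last two frames."""
    return 2.0 * window[-1] - window[-2]

history = np.array([[0.0], [1.0], [2.0]])
frames = rollout(linear_extrapolation, history, n_steps=4)
print(frames.ravel())
```

Once the number of generated steps reaches the number of delays, the window contains only model-generated frames, which is the regime referred to in the text for 6 frames.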
Model | Architecture and Training | MSE (± SD)
TRBM | 100 hidden units, 6 frame delay | 1.59 ()
TRBM (TA) | 100 hidden units, 6 frame delay | 0.32 ()
TRBM (TA), 50 sample mean | 100 hidden units, 6 frame delay | 0.14 ()
CRBM | 100 hidden units, 6 frame delay | 0.40 ()
CRBM (TA) | 100 hidden units, 6 frame delay | 0.17 ()
CRBM (TA), 50 sample mean | 100 hidden units, 6 frame delay | 0.08 ()
3.2 M3 Forecasting Competition Data
The motion capture experiments have shown great results for our proposed training method, but these data reflect a lot of structure specific to their origin. To assess how the method works on more general data, we applied it to the datasets of the M3 forecasting competition. The M3 forecasting competition (Makridakis and Hibon, 2000) pitted forecasting algorithms against one another on 3003 different datasets, ranging from microeconomic to financial and industrial data. The data are univariate, but through state augmentation we can use our method to generate predictions for future data points. We have done so by taking chunks of 4 observations and using successive chunks as our multivariate data. With these we have trained the model to generate forecasts.
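The state augmentation step can be illustrated with a short sketch. It assumes that the chunks are non-overlapping and that any trailing observations which do not fill a whole chunk are dropped; the text does not specify either detail, and the function name is illustrative.

```python
import numpy as np

def chunk_series(series, chunk_size=4):
    """Turn a univariate series into a sequence of chunk_size-dimensional frames."""
    series = np.asarray(series, dtype=float)
    n_chunks = len(series) // chunk_size
    # Trailing observations that do not fill a whole chunk are discarded.
    return series[:n_chunks * chunk_size].reshape(n_chunks, chunk_size)

frames = chunk_series(np.arange(10))
print(frames)
```

Each row can then be treated as one 4-dimensional visible frame, so that predicting the next frame amounts to forecasting the next 4 observations of the original series.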
fig:m3_prediction shows the average performance of our algorithm on the four different kinds of data. They are separated into yearly, quarterly, monthly and other, the main categories of the competition. Here we measure the model performance using MAPE as was used in the competition. Although the datasets are generally small if compared to the usual unsupervised learning case, our training method still fares relatively well. Furthermore, TA pretraining continues to show a strong improvement over straight CD learning across the board. The robust performance of the TA pretraining on these datasets strongly suggests our method will generally yield improvements.
4 Discussion and Future Work
We have introduced a new training method for temporal RBMs, which we call Temporal Autoencoding, and have shown that it can achieve a significant performance increase in a filling-in-frames task across a number of datasets. The gain in performance from our pretraining holds for both the CRBM and the TRBM, allowing for more efficient training of generative models.
Our approach combines the supervised approach of backpropagating prediction errors through the network with the unsupervised approach of contrastive divergence learning. We have also shown that neither method by itself can achieve the performance we achieve by combining both.
The approach shows significant improvement in the performance of the generative models, for filling-in-frames as well as for prediction tasks. This is shown to hold across a number of datasets. In the M3 contest dataset, specifically, the approach is shown to consistently improve the MAPE in a forecasting task, across a number of different types of data. On motion capture data, on the other hand, we were able to improve the MSE of the generative model by as much as 90% in some cases.
It is our opinion that the approach of autoencoding the temporal dependencies gives the model a more meaningful temporal representation than is achievable through contrastive divergence training alone. The TA training seeks to constrain the model to reproduce the dynamics observed in the data; as such, it is not surprising that the improvement in generation also leads to an improvement in the prediction performance of the models considered. We believe the inclusion of autoencoder training in temporal learning tasks will be beneficial in a number of contexts, as it enforces the causal structure of the data on the learned model.
References
Bengio et al. (2013) Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized Denoising Auto-Encoders as Generative Models. ArXiv e-prints, May 2013.
Bergstra et al. (2010) James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. URL http://www.iro.umontreal.ca/~lisa/pointeurs/theano_scipy2010.pdf. Oral Presentation.
Carreira-Perpinan and Hinton (2005) M.A. Carreira-Perpinan and G.E. Hinton. On contrastive divergence learning. In Artificial Intelligence and Statistics, volume 2005, page 17, 2005.
Hinton and Salakhutdinov (2006) G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
Makridakis and Hibon (2000) Spyros Makridakis and Michele Hibon. The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4):451–476, 2000.
Sutskever and Hinton (2007) I. Sutskever and G.E. Hinton. Learning multilevel distributed representations for high-dimensional sequences. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, pages 544–551, 2007.
Sutskever et al. (2008) I. Sutskever, G. Hinton, and G. Taylor. The recurrent temporal restricted Boltzmann machine. Advances in Neural Information Processing Systems, 21, 2008.
Taylor (2009) G.W. Taylor. Composable, distributed-state models for high-dimensional time series. PhD thesis, 2009.
Taylor et al. (2007) G.W. Taylor, G.E. Hinton, and S.T. Roweis. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems, 19:1345, 2007.
Vincent et al. (2010) P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.