I Introduction
Deep learning algorithms have received much attention in the machine learning and applied statistics literature in recent years, owing to the state-of-the-art results achieved in applications such as object recognition [1, 2], speech recognition [3, 4, 5], [6], etc. With the exception of convolutional neural networks (CNNs), the training of multilayered networks was in general unsuccessful until 2006, when breakthroughs were made in three seminal works: Hinton et al. [7], Bengio et al. [8] and Ranzato et al. [9]. They introduced greedy layerwise pretraining, which initializes the weights of an entire network in an unsupervised manner and is followed by a supervised backpropagation step. The inclusion of the unsupervised pretraining step appeared to be the missing ingredient, which then led to significant improvements over conventional training schemes.
Recent progress in deep learning has leaned more towards supervised learning algorithms [1, 10, 11]. While these algorithms obviate the need for the additional pretraining step in supervised settings, large amounts of labeled data are still critical. Given the ever-growing volumes of unlabeled data and the cost of labeling, it remains a challenge to develop better unsupervised learning techniques to exploit the copious amounts of unlabeled data.
The unsupervised training of a network can be interpreted as learning the data distribution in a probabilistic generative model of the data. A typical method for accomplishing this is to decompose the generative model into a latent conditional generative model and a prior distribution over the hidden variables. When this interpretation is extended to a deep neural network [7, 12, 13], it implies that while the lower layers model the conditional distribution $p(x \mid h)$, the higher layers model the distribution over the latent variables $p(h)$. It has been shown that local-level learning accomplished via pretraining is important when training deep architectures: initializing every layer of an unsupervised deep Boltzmann machine (DBM) improves performance [14], and initializing every layer of a supervised multilayer network likewise improves performance [8]. But this greedy layerwise scheme has a major disadvantage: the higher layers have significantly less knowledge about the original data distribution than the bottom layers. Hence, if the bottom layer does not capture the data representation sufficiently well, the higher layers may learn something that is not useful. Furthermore, this error/bias propagates through the layers, i.e. the higher the layer, the more error it incurs. To summarize, greedy layerwise training focuses on the local constraints introduced by the learning algorithm (such as in autoencoders), but loses sight of the original data distribution when training higher layers. To compensate for this disadvantage, all the levels of the deep architecture should be trained simultaneously. But this type of joint training can be very challenging and, if done naively, will fail to learn [14].
Thus in this paper, we present an effective method for jointly training a multilayer autoencoder end-to-end in an unsupervised fashion, and analyze its performance against greedy layerwise training in various settings. This unsupervised joint training method consists of a global reconstruction objective, thereby learning a good data representation without losing sight of the original input data distribution. In addition, it can easily accommodate local regularization on the parameters of each hidden layer in the same way as layerwise training, and therefore allows us to use the more powerful regularizers proposed more recently. This method can also be viewed as a generalization of single-layer to multilayer autoencoders. Because our approach achieves a global reconstruction in a deep network, the feature representations learned are better. This attribute also makes it a good feature extractor, and we confirm this with our extensive analysis. The representations learned with the joint training approach consistently outperform those obtained from greedy layerwise pretraining algorithms in unsupervised settings. In the supervised setting, this joint training scheme also demonstrates superior performance for deeper models.
I-A Motivation
Because supervised methods typically require large amounts of labeled data, and acquiring these labels can be expensive and time consuming, it remains a challenge to develop improved unsupervised learning techniques that can exploit large volumes of unlabeled data. Although the greedy layerwise pretraining procedure has to date been very successful, from an engineering perspective it can be challenging to train, and monitoring the training process can be difficult. For the layerwise method, apart from the bottom layer, where the unsupervised model learns directly from the input, the training cost is measured with respect to the layer below. Hence, changes in error values from one layer to the next have little meaning to the user. With one global objective, by contrast, the joint training technique has a training cost that is consistently measured with respect to the input layer. This way, one can readily monitor the changes in training error, and even relate them to post-training tasks such as classification or prediction.
Also, as stated earlier, the unsupervised training of a network can be interpreted as learning the data distribution in a probabilistic generative model of the data. But in order for the unsupervised method to learn $p(x)$, a common strategy is to decompose $p(x)$ into $p(x \mid h)$ and $p(h)$ [13, 12]. There are now two models to optimize: the conditional generative model $p(x \mid h)$ and the prior model $p(h)$. Since $p(h)$ covers a wide range of distributions and is hard to optimize in general, one tends to place more emphasis on the optimization of $p(x \mid h)$ and assume that the later learning of $p(h)$ can compensate for the loss incurred due to imperfect modeling. Note that the prior $p(h)$ can also be decomposed in exactly the same way as $p(x)$, resulting in additional hidden layers. Thus, one can recursively apply this trick and delay the learning of the prior. The motivation behind this recursion is the expectation that, as learning progresses through the layers, the prior gets simpler, which makes the learning easier. Greedy layerwise training employs this idea, but it may fail to learn the optimal data distribution for the following reason: in layerwise training, the parameters of the bottom layers are fixed after training, and the prior model $p(h)$ can only observe $x$ through its fixed hidden representations. Hence, if the learning of $p(x \mid h)$ does not preserve all information regarding $x$ (which is very likely), then it is not possible to learn the prior $p(h)$ that leads to the optimal model for the data distribution.
For these reasons, we explore the possibility of jointly training deep autoencoders, where the joint training scheme makes it possible to adjust $p(x \mid h)$ and $p(h)$ (and consequently $p(x)$) with respect to each other, thus alleviating the burden on both models. As a consequence, the prior model $p(h)$ can now observe the input $x$, making it possible to fit the prior distribution better. Hence, joint training makes global optimization more feasible.
II Background
In this section, we briefly review the concept of autoencoders and some of its variants, and expand on the notion of the deep autoencoder, the primary focus of this paper.
II-A Basic Autoencoders (AE)
A basic autoencoder is a one-hidden-layer neural network [15, 16], and its objective is to reconstruct the input using its hidden activations so that the reconstruction error is as small as possible. It takes the input and puts it through an encoding function to get the encoding of the input, and then it decodes the encoding through a decoding function to recover (an approximation of) the original input. More formally, let $x \in \mathbb{R}^{d_x}$ be the input,
$$h = f_e(x) = s_e(W_e x + b_e)$$
$$x_r = f_d(h) = s_d(W_d h + b_d)$$
where $f_e: \mathbb{R}^{d_x} \to \mathbb{R}^{d_h}$ and $f_d: \mathbb{R}^{d_h} \to \mathbb{R}^{d_x}$ are encoding and decoding functions respectively, $W_e$ and $W_d$ are the weights of the encoding and decoding layers, and $b_e$ and $b_d$ are the biases for the two layers. $s_e$ and $s_d$ are elementwise nonlinear functions in general, and common choices are sigmoidal functions like $\tanh$ or logistic. For training, we want to find a set of parameters $\Theta = \{W_e, W_d, b_e, b_d\}$ that minimize the reconstruction error $\mathcal{J}_{AE}(\Theta)$:
$$\mathcal{J}_{AE}(\Theta) = \sum_{x \in \mathcal{D}} \mathcal{L}(x, x_r) \tag{1}$$
where $\mathcal{L}: \mathbb{R}^{d_x} \times \mathbb{R}^{d_x} \to \mathbb{R}$ is a loss function that measures the error between the reconstructed input $x_r$ and the actual input $x$, and $\mathcal{D}$ denotes the training dataset. We can also tie the encoding and decoding weights by setting $W_d = W_e^\top$. Common choices of $\mathcal{L}$ include the sum of squared errors for real-valued inputs, cross-entropy for binary-valued inputs, etc. However, this model has a significant drawback: if the number of hidden units $d_h$ is greater than the dimensionality of the input data $d_x$, the model can perform very well during training but fail at test time, because it trivially copies the input layer to the hidden one and then copies it back. As a workaround, one can set $d_h < d_x$ to force the model to learn something meaningful, but this remedy is still not very effective.
II-B Denoising Autoencoder (DAE)
Vincent et al. [17] proposed a more effective way to overcome the shortcomings of the basic autoencoder, namely the denoising autoencoder. The idea is to corrupt the input before passing it to the network, but still require the model to reconstruct the uncorrupted input. In this way, the model is forced to learn useful representations, since trivially copying the input will not optimize this denoising objective (Equation 2). Formally, let $\tilde{x} \sim Q(\tilde{x} \mid x)$ be the corrupted version of the input, where $Q(\tilde{x} \mid x)$ is some corruption process over the input $x$; then the objective it tries to optimize is:
$$\mathcal{J}_{DAE}(\Theta) = \sum_{x \in \mathcal{D}} \mathbb{E}_{Q(\tilde{x} \mid x)}\big[\mathcal{L}\big(x, f_d(f_e(\tilde{x}))\big)\big] \tag{2}$$
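As an illustration of the denoising objective, the following minimal sketch evaluates a one-sample estimate of the loss for a single input. It is not the authors' implementation: the helper names (`corrupt`, `reconstruct`, `dae_loss`), the tiny layer sizes, the tied weights, the logistic units, the cross-entropy loss and the additive Gaussian corruption are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b_j for row, b_j in zip(W, b)]


def transpose(W):
    return [list(col) for col in zip(*W)]


def corrupt(x, sigma, rng):
    """Sample a corrupted input: additive Gaussian noise (an assumption)."""
    return [x_i + rng.gauss(0.0, sigma) for x_i in x]


def reconstruct(x_in, W, b_e, b_d):
    """Encode then decode with tied weights (the decoder uses W transposed)."""
    h = [sigmoid(z) for z in affine(W, x_in, b_e)]
    return [sigmoid(z) for z in affine(transpose(W), h, b_d)]


def cross_entropy(x, x_r, eps=1e-12):
    return -sum(
        x_i * math.log(r + eps) + (1.0 - x_i) * math.log(1.0 - r + eps)
        for x_i, r in zip(x, x_r)
    )


def dae_loss(x, W, b_e, b_d, sigma, rng):
    """One-sample estimate of the denoising loss for a single input x."""
    x_tilde = corrupt(x, sigma, rng)       # the network sees the corrupted input
    x_r = reconstruct(x_tilde, W, b_e, b_d)
    return cross_entropy(x, x_r)           # but is scored against the clean input


rng = random.Random(0)
W = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(3)]  # 3 hidden, 4 visible
b_e, b_d = [0.0] * 3, [0.0] * 4
x = [0.0, 1.0, 1.0, 0.0]
print(dae_loss(x, W, b_e, b_d, sigma=0.3, rng=rng))
```

With `sigma=0.0` the corruption is the identity and the function reduces to the basic autoencoder loss for that input.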
II-C Contractive Autoencoders (CAE)
More recently, Rifai et al. [18] proposed adding a special penalty term to the original autoencoder objective to achieve robustness to small local perturbations. The penalty term is the squared Frobenius norm of the Jacobian matrix of the hidden activations with respect to the input, and their modified objective is:
$$\mathcal{J}_{CAE}(\Theta) = \sum_{x \in \mathcal{D}} \Big( \mathcal{L}(x, x_r) + \lambda \Big\| \frac{\partial f_e(x)}{\partial x} \Big\|_F^2 \Big) \tag{3}$$
where $\lambda$ controls the strength of the penalty. This penalty term measures the change in the hidden activations with respect to the input. Thus, the penalty term alone would prefer hidden activations that stay constant when the input varies (i.e. the Frobenius norm would be zero). The loss term, however, will be small only if the reconstruction is close to the original, which is not possible if the hidden activation does not change according to the input. Therefore, a hidden representation that tries to represent the data manifold is preferred, since otherwise we would have high costs in both terms.
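For a logistic encoder the Jacobian has the closed form $J_{ji} = h_j (1 - h_j) W_{ji}$, so the penalty can be computed without forming the Jacobian explicitly. The sketch below illustrates this under that assumption (logistic units, a single layer); the function names and the finite-difference check are illustrative and not part of the original work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, W, b):
    """h = sigmoid(W x + b), with W stored as a list of rows."""
    return [sigmoid(sum(w * a for w, a in zip(row, x)) + b_j) for row, b_j in zip(W, b)]


def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for a logistic encoder.

    Uses J[j][i] = h_j * (1 - h_j) * W[j][i], so the norm factorizes per hidden unit.
    """
    h = encode(x, W, b)
    return sum((h_j * (1.0 - h_j)) ** 2 * sum(w * w for w in row) for h_j, row in zip(h, W))


def numeric_penalty(x, W, b, step=1e-6):
    """Same quantity via central finite differences, for checking only."""
    total = 0.0
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += step
        x_minus[i] -= step
        h_plus, h_minus = encode(x_plus, W, b), encode(x_minus, W, b)
        total += sum(((p - m) / (2.0 * step)) ** 2 for p, m in zip(h_plus, h_minus))
    return total


W = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]
b = [0.1, -0.2]
x = [0.2, 0.7, 0.4]
print(contractive_penalty(x, W, b), numeric_penalty(x, W, b))
```

The closed form costs about as much as one forward pass, which is why the penalty is cheap for a single layer but becomes expensive when taken through several layers with respect to the input.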
II-D Deep Autoencoders
A deep autoencoder is a multilayer neural network that tries to reconstruct its input (see Figure 1). In general, an $N$-layer deep autoencoder with parameters $\Theta = \{\Theta_i \mid i \in \{1, \dots, N\}\}$, where $\Theta_i = \{W_e^i, W_d^i, b_e^i, b_d^i\}$, can be formulated as follows:
$$h^i = f_e^i(h^{i-1}) = s_e^i(W_e^i h^{i-1} + b_e^i) \tag{4}$$
$$h_r^{i-1} = f_d^i(h_r^i) = s_d^i(W_d^i h_r^i + b_d^i) \tag{5}$$
$$h^0 = x, \qquad h_r^N = h^N, \qquad x_r = h_r^0 \tag{6}$$
The deep autoencoder architecture therefore contains multiple encoding and decoding stages, made up of a sequence of encoding layers followed by a stack of decoding layers. Notice that the deep autoencoder therefore has a total of $2N$ layers. This type of deep autoencoder has been investigated in several previous works (see [19, 20, 21] for examples). A deep autoencoder can also be viewed as an unwrapped stack of multiple autoencoders, with the higher-layer autoencoder taking the lower-layer encoding as its input. Hence, when viewed as a stack of autoencoders, one could train the stack from bottom to top.
II-E General Autoencoder Objective
By observing the formulations of Equations (1)–(3), we can rewrite the single-layer autoencoder objective in a more general form as:
$$\mathcal{J}(\Theta) = \sum_{x \in \mathcal{D}} \mathbb{E}_{Q(\tilde{x} \mid x)}\big[\mathcal{L}(x, x_r)\big] + \lambda \mathcal{R}(\Theta) \tag{7}$$
where $x_r = f_d(f_e(\tilde{x}))$, $Q(\tilde{x} \mid x)$ is some conditional distribution over the input and $\tilde{x} \sim Q(\tilde{x} \mid x)$, and $\mathcal{R}(\Theta)$ is an arbitrary regularization function over the parameters. The choice of a good regularization term seems to be the ingredient that has led to the recent successes of several autoencoder variants (the models proposed in [22, 23], for example).
It is straightforward to see that we can recover the previous autoencoder objectives from Equation 7. For example, the basic autoencoder can be obtained by setting $Q(\tilde{x} \mid x)$ to be a Dirac delta at the data point, and $\lambda = 0$. Since this objective is more general, going forward we will refer to it rather than to the specific autoencoder objectives presented earlier.
III Our Joint Training Method
As mentioned previously, a deep autoencoder can be optimized via layerwise training, i.e. the objective of a lower layer, $\mathcal{J}(\Theta_i)$, is optimized before that of the layer above, $\mathcal{J}(\Theta_{i+1})$ (from bottom to top, see also Figure 1, left), until the desired depth is reached. If one would like to train all layers jointly, a natural question is: why not combine these objectives into one? Directly combining all the objectives (i.e. optimizing $\sum_{i=1}^{N} \mathcal{J}(\Theta_i)$) is not appropriate, since the end goal is to reconstruct the input as accurately as possible, not the intermediate representations. The aggregated loss and regularization may hinder the training process, since the objective has deviated from this goal. Furthermore, during training the representations of the lower layers vary continuously (since the parameters are changing), making the optimization very difficult (we tried this scheme on several datasets, but the results were very poor).
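So, focusing on the goal of reconstructing the input, we explore the following joint training objective for an $N$-layered deep autoencoder:
$$\mathcal{J}_{joint}(\Theta) = \sum_{x \in \mathcal{D}} \mathbb{E}_{Q(\tilde{h}^0, \dots, \tilde{h}^{N-1} \mid x)}\big[\mathcal{L}(x, x_r)\big] + \sum_{i=1}^{N} \lambda_i \mathcal{R}_i(\Theta_i) \tag{8}$$
where $h^0 = x$, $\tilde{h}^{i-1} \sim Q(\tilde{h}^{i-1} \mid h^{i-1})$; and $h^i = f_e^i(\tilde{h}^{i-1})$ with $h_r^N = h^N$, $h_r^{i-1} = f_d^i(h_r^i)$ and $x_r = h_r^0$.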
In other words, this method optimizes the reconstruction error directly through a multilayer neural network, while at the same time allowing us to conveniently apply more powerful regularizers to the single autoencoders at each layer. For example, to train a two-layered denoising autoencoder using this method, we corrupt the input and feed it into the first layer, then corrupt the hidden output of the first layer and feed it into the second layer; we then reconstruct from the second-layer hidden activations, followed by a reconstruction back to the original input space using the first-layer reconstruction weights; finally, we measure the loss between the reconstruction and the input and apply a gradient update to all the parameters (see Figure 1).
In this formulation, we not only train all layers jointly by optimizing one global objective (the first part of Equation 8), so that all hidden activations adjust according to the original input, but also take into account the local constraints of the single-layer autoencoders (the second part of Equation 8). Locally, if we isolate each autoencoder from the stack and look at it individually, this training process optimizes almost exactly the same single-layer objective as in the layerwise case, but without the unnecessary constraint of reconstructing the intermediate representations. Globally, because the loss is measured between the reconstruction and the input, all parameters must be tuned to minimize the global loss, and thus the resulting higher-level hidden representations are more representative of the input.
This approach addresses some of the drawbacks of layerwise training: since all the parameters are tuned simultaneously toward the reconstruction error of the original input, the optimization is no longer local for each layer. In addition, the reconstruction error in the input space makes the training process much easier to monitor and interpret. To sum up, this formulation provides an easier and more modular way to train deep autoencoders.
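The difference between the global loss and the greedy per-layer loss can be made concrete with a small sketch. It follows the two-layer procedure described above (corrupt, encode, corrupt, encode, then decode back to input space), but the details are assumptions made for illustration: tied weights, logistic units, cross-entropy loss, Gaussian corruption with one shared noise level, and no regularization terms. The function names `joint_loss` and `layerwise_loss` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, v, b):
    """sigmoid(W v + b), with W stored as a list of rows."""
    return [sigmoid(sum(w * a for w, a in zip(row, v)) + b_j) for row, b_j in zip(W, b)]


def transpose(W):
    return [list(col) for col in zip(*W)]


def corrupt(v, sigma, rng):
    return [a + rng.gauss(0.0, sigma) for a in v]


def cross_entropy(target, output, eps=1e-12):
    return -sum(
        t * math.log(o + eps) + (1.0 - t) * math.log(1.0 - o + eps)
        for t, o in zip(target, output)
    )


def joint_loss(x, weights, enc_biases, dec_biases, sigma, rng):
    """Global loss: corrupt and encode layer by layer, then decode back to input space."""
    h = x
    for W, b in zip(weights, enc_biases):
        h = layer(W, corrupt(h, sigma, rng), b)
    r = h
    for W, b in zip(reversed(weights), reversed(dec_biases)):
        r = layer(transpose(W), r, b)      # tied weights
    return cross_entropy(x, r)             # measured against the original input


def layerwise_loss(x, weights, enc_biases, dec_biases, sigma, rng):
    """Greedy loss for the top layer only: the target is the layer below, not x."""
    h_below = x
    for W, b in zip(weights[:-1], enc_biases[:-1]):
        h_below = layer(W, h_below, b)     # lower layers are treated as fixed
    h_top = layer(weights[-1], corrupt(h_below, sigma, rng), enc_biases[-1])
    r = layer(transpose(weights[-1]), h_top, dec_biases[-1])
    return cross_entropy(h_below, r)


rng = random.Random(0)
sizes = [4, 3, 2]                          # input, first hidden, second hidden
weights = [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[k])]
            for _ in range(sizes[k + 1])] for k in range(2)]
enc_biases = [[0.0] * sizes[k + 1] for k in range(2)]
dec_biases = [[0.0] * sizes[k] for k in range(2)]
x = [1.0, 0.0, 0.0, 1.0]
print(joint_loss(x, weights, enc_biases, dec_biases, 0.3, rng))
print(layerwise_loss(x, weights, enc_biases, dec_biases, 0.3, rng))
```

The two printed numbers are not comparable in scale: the first is measured in the input space, the second in the space of the first hidden layer, which is the monitoring difficulty discussed in the motivation.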
We now relate this joint objective to some of the existing, well-known deep learning paradigms that implement deterministic objective functions. We also relate this work to techniques in the literature that explain deep learning from a probabilistic perspective.
III-A Relationship to Weight Decay in Multilayer Perceptron
It is easy to see that if we replace all the regularizers with $L_1$ or $L_2$ norms, and set $Q$ to a Dirac delta distribution at the point $x$ in Equation (8), we recover the standard multilayer perceptron (MLP) with $L_1$ or $L_2$ weight decay, with one exception: the MLP is not commonly used for unsupervised learning. So if we replace the loss term with some supervised loss, the objective is then identical to that of the ordinary MLP with the corresponding weight decay.
III-B Relationship to Greedy Layerwise Pretraining
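It is straightforward to see that in the single-layer case, the proposed training objective in Equation (8) is equivalent to greedy layerwise training. It would be interesting to investigate the relationship between this method and layerwise training in multilayer cases. Therefore, we construct a scenario that makes these two methods very similar by modifying the joint training algorithm slightly: we set a training schedule with learning rates $\alpha_t^i$ and regularizer weights $\lambda_t^i$, where $i$ and $t$ indicate the corresponding layer and training iteration. The objective in Equation 8 and the gradient update are then as follows:
$$\mathcal{J}_t(\Theta) = \sum_{x \in \mathcal{D}} \mathbb{E}_{Q}\big[\mathcal{L}(x, x_r)\big] + \sum_{i=1}^{N} \lambda_t^i \mathcal{R}_i(\Theta_i) \tag{9}$$
$$\Theta_i^{t+1} = \Theta_i^{t} - \alpha_t^i \nabla_{\Theta_i} \mathcal{J}_t(\Theta) \tag{10}$$
Let $\alpha_t^i \in \{0, \epsilon\}$ and $\lambda_t^i \in \{0, \lambda_i\}$, where $\epsilon$ indicates the learning rate. Let us set the value of $\alpha_t^i$ in the following way: $\alpha_t^i = 0$ for $t \le (i-1)T$, $\alpha_t^i = \epsilon$ for $(i-1)T < t \le iT$, and $\alpha_t^i = 0$ for $t > iT$, where $T$ is the number of iterations used for each layer in the greedy layerwise training.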
In this way, the joint training scheme is set up very similarly to the layerwise training scheme, i.e. only the parameters of a single layer are tuned at a time, from bottom to top. However, since the loss is measured in the domain of $x$ instead of $h^{i-1}$, the learning will still behave differently. This becomes more apparent when we write down both the joint loss ($\mathcal{L}_{joint}$) and the layerwise loss ($\mathcal{L}_{layer}$) for training a given layer $i$:
$$\mathcal{L}_{joint} = \mathcal{L}\Big(x, \hat{f}_d\big(f_d^i(f_e^i(\hat{f}_e(x)))\big)\Big) \tag{11}$$
$$\mathcal{L}_{layer} = \mathcal{L}\Big(h^{i-1}, f_d^i\big(f_e^i(h^{i-1})\big)\Big), \qquad h^{i-1} = \hat{f}_e(x) \tag{12}$$
where $\hat{f}_d = f_d^1 \circ \dots \circ f_d^{i-1}$ and $\hat{f}_e = f_e^{i-1} \circ \dots \circ f_e^1$ represent the already-trained decoding and encoding functions of the layers below, respectively. In other words, we do not optimize the parameters of these functions during training. Note that the loss and the resulting parameter update will in general be very different between Equations 11 and 12, since the gradient of the joint loss is modified by the parameters of the previous layers, which is not the case for the layerwise loss. It is somewhat surprising that even in such a deliberately similar setup the two methods still behave very differently. Note that even in the special case where one uses linear activations (which is not very practical), these two losses are still not equivalent. Hence, joint training will in general perform very differently from layerwise training.
III-C Relationship to Learning the Data Distribution
If we take a probabilistic view of the learning of a deep model as learning the data distribution $p(x)$, a common interpretation of the model is to decompose $p(x)$ into:
$$p(x) = \sum_{h} p(x \mid h)\, p(h), \qquad p(h) = \sum_{x'} p(h \mid x')\, p_{\mathcal{D}}(x') \tag{13}$$
where $p_{\mathcal{D}}$ is the empirical distribution over data. So the model is decomposed such that the bottom layer models the distribution $p(x \mid h)$ and the higher layers model the prior $p(h)$. Notice that if we apply layerwise training, it is only possible to learn the prior $p(h)$ through a fixed $p(h \mid x)$, and thus the prior will not be optimal with respect to $p(x)$ if $p(h \mid x)$ does not preserve all information regarding $x$. On the other hand, in joint training, both the generative distribution $p(x \mid h)$ and the prior $p(h)$ are tuned together, and joint training is therefore more likely to obtain better estimates of the true $p(x)$. In addition, while training layers greedily, we do not take into account the fact that more capacity may be added later to improve the prior for the hidden units. This problem is also alleviated by joint training, since the architecture is fixed at the beginning of training and all the parameters are tuned towards a better representation of the data.
III-D Relationship to Generative Stochastic Networks
Bengio et al. [24] recently proposed an alternative to maximum likelihood for training probabilistic models: generative stochastic networks (GSNs). The idea is that learning the data distribution $p(x)$ directly is hard, since it is highly multimodal in general. Instead, one can try to learn to approximate the Markov chain transition operator (e.g. $P_{\theta_1}(H_t \mid H_{t-1}, X_{t-1})$, $P_{\theta_2}(X_t \mid H_t)$). The intuition is that the moves in the Markov chain are mostly local, and thus these distributions are likely to be less complex or even unimodal, making this an easier learning problem. For example, the denoising autoencoder learns $p(x \mid \tilde{x})$, where $\tilde{x}$ is a corrupted example. They show that if one can get a consistent estimator of $p(x \mid \tilde{x})$, then following the implied Markov chain, the stationary distribution of this chain will converge to the true data distribution $p(x)$.
Like the denoising autoencoder, a deep denoising autoencoder also defines a Markov chain:
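$$H_{t+1} \sim P_{\theta_1}(H \mid H_t, X_t) \tag{14}$$
$$X_{t+1} \sim P_{\theta_2}(X \mid H_{t+1}) \tag{15}$$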
Therefore, a deep denoising autoencoder also learns a data generating distribution within the GSN framework, and we can generate samples from it.
IV Experiments and Results
In this section, we empirically analyse the unsupervised joint training method with the following questions: 1) does joint training lead to better data models? 2) does joint training result in better representations that would be helpful for other tasks? 3) what role does the more recent usage of regularizers for autoencoder play in joint training? 4) does joint training affect the performance of supervised finetuning?
We tested our approach on MNIST [25], a digit classification dataset containing 60,000 training and 10,000 test examples, for which we used the standard split of 50,000 training and 10,000 validation examples. In addition, we used the MNIST variation datasets [26], each with 10,000 data points for training, 2,000 for validation and 50,000 for test. The additional shape datasets employed in [26] are also used; this set contains two shape classification tasks. One is to classify short and tall rectangles, the other is to classify convex and non-convex shapes. All of these datasets have 50,000 test examples. The rectangle dataset has 1,000 training and 200 validation examples, and the convex dataset has 6,000 training and 2,000 validation examples. The rectangle dataset also has a variation that uses images as the foreground rectangles, and it has 10,000 training and 2,000 validation examples (see Figure 2, left, for visual examples from these datasets).
We tied the weights of the deep autoencoders (i.e. $W_d^i = (W_e^i)^\top$) in this set of experiments, set each layer to 1,000 hidden units with logistic activations, and applied the cross-entropy loss as the training cost. We optimized the deep network using rmsprop [27] with a decay factor of 0.9 for the RMS estimate and 100 samples per mini-batch. The hyperparameters were chosen on the validation set, and the model that obtained the best validation result was used to obtain the test set result. The hyperparameters we considered were the learning rate (from the set {0.001, 0.005, 0.01, 0.02}), the noise level (from the set {0.1, 0.3, 0.5, 0.7, 0.9}) for deep denoising autoencoders (deep-DAE), and the contraction level (from the set {0.01, 0.05, 0.15, 0.3, 0.6}) for deep contractive autoencoders (deep-CAE). Gaussian noise is applied for the DAE.
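For readers unfamiliar with rmsprop, the following sketch shows one common form of the update with the 0.9 decay factor mentioned above. The exact variant used in the experiments is not specified in the text, so the placement of the stabilizing constant `eps` and the function name `rmsprop_step` are assumptions.

```python
import math


def rmsprop_step(params, grads, mean_squares, lr=0.001, decay=0.9, eps=1e-8):
    """Apply one rmsprop update to a flat list of parameters.

    mean_squares holds the running average of squared gradients (the RMS estimate);
    each gradient is divided by the square root of that average before the step.
    """
    new_params, new_mean_squares = [], []
    for p, g, m in zip(params, grads, mean_squares):
        m = decay * m + (1.0 - decay) * g * g
        new_mean_squares.append(m)
        new_params.append(p - lr * g / (math.sqrt(m) + eps))
    return new_params, new_mean_squares


# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, ms = [1.0], [0.0]
for _ in range(200):
    theta, ms = rmsprop_step(theta, [2.0 * theta[0]], ms, lr=0.01)
print(theta[0])
```

Because the gradient is normalized by its running RMS, the effective step size is roughly the learning rate itself, which is why the candidate learning rates listed above are small.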
IV-A Does Joint Training Lead to Better Data Models?
As mentioned in previous sections, we hypothesize that joint training will alleviate the burden on both the bottom-layer distribution $p(x \mid h)$ and the top-layer prior $p(h)$, and hence result in a better data model $p(x)$. In this experiment, we inspect how well the data distribution is modeled through samples from the model. Since the deep denoising autoencoder follows the GSN framework, we can follow the implied Markov chain to generate samples. The models are trained for 300 epochs using both the layerwise and joint training methods (each layer is trained for 300 epochs in layerwise training), and the quality of the samples is then estimated by measuring the log-likelihood of the test set under a Parzen window density estimator [28, 29]. This measurement can be seen as a lower bound on the true log-likelihood, and will converge to the true likelihood as the number of samples increases, given an appropriate Parzen window parameter. 10,000 consecutive samples were generated for each of the datasets with models trained using the layerwise and joint training methods, and we used a Gaussian Parzen window for the density estimation. The estimated log-likelihoods on the respective test sets are shown in Table I (we note that this estimate has somewhat high variance, but it is to our knowledge the best available method for evaluating generative models that can generate samples but cannot estimate the data likelihood directly). We use the suffixes 'L' and 'J' to denote the layerwise and joint training methods, respectively.
Table I: Estimated test-set log-likelihoods under the Parzen window density estimator.
Dataset/Method  DAE2L  DAE2J  DAE3L  DAE3J
MNIST  204 ± 1.59  266 ± 1.32  181 ± 1.68  270 ± 1.26
basic  201 ± 1.69  205 ± 1.59  183 ± 1.62  217 ± 1.51
rot  174 ± 1.41  172 ± 1.49  163 ± 1.60  187 ± 1.36
bgimg  156 ± 1.41  154 ± 1.45  142 ± 1.56  155 ± 1.48
bgrand  267 ± 0.38  252 ± 0.36  275 ± 0.37  249 ± 0.36
bgimgrot  151 ± 1.46  152 ± 1.50  139 ± 1.52  149 ± 1.31
rect  160 ± 1.08  161 ± 1.12  152 ± 1.15  154 ± 1.12
rectimg  275 ± 1.35  273 ± 1.30  269 ± 1.40  272 ± 1.37
convex  967 ± 8.99  853 ± 8.85  1011 ± 10.79  704 ± 8.50
[Figure 2: Training samples (left), layerwise samples, and joint training samples.]
For qualitative purposes, the generated samples from each dataset are shown in Figure 2. The quality of the samples is comparable in both cases; however, the models trained through joint training show faster mixing with fewer spurious samples in general. It is also interesting to note that in most cases (see Table I) the log-likelihood on the test set improved with deeper models in the joint training case, whereas in the layerwise setting the likelihood dropped with additional layers. This illustrates one advantage of using the joint training scheme to model data: it accommodates the additional capacity of the hidden prior while training the whole model.
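The Parzen window estimate used for Table I can be sketched as follows: a Gaussian kernel is centered on every generated sample and the test points are scored under the resulting mixture. This is a minimal illustration rather than the evaluation code behind the reported numbers; the function names are hypothetical and the bandwidth `sigma` is assumed to be chosen separately (for example on a validation set).

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def parzen_log_density(x, samples, sigma):
    """Log-density of x under an isotropic Gaussian Parzen window on `samples`."""
    d = len(x)
    exponents = [
        -sum((a - b) ** 2 for a, b in zip(x, s)) / (2.0 * sigma ** 2)
        for s in samples
    ]
    log_norm = math.log(len(samples)) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    return log_sum_exp(exponents) - log_norm


def mean_log_likelihood(test_points, samples, sigma):
    """Average Parzen log-likelihood over a test set."""
    return sum(parzen_log_density(x, samples, sigma) for x in test_points) / len(test_points)


generated = [[-1.0], [1.0]]        # stand-in for samples drawn from the model
test_set = [[0.0], [0.5]]
print(mean_log_likelihood(test_set, generated, sigma=1.0))
```

The log-sum-exp form matters in practice: with thousands of samples in hundreds of dimensions, the individual kernel values underflow if exponentiated directly.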
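[Figure 3: Panels (a)–(i) showing training and testing reconstruction errors as training progresses, for layerwise and joint training.]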
Since the joint training objective in Equation 8 focuses on reconstruction, the reconstruction errors of models trained by joint training are expected to be lower than those obtained from layerwise training. To confirm this, we also recorded the training and testing errors as training progresses. As can be seen from Figure 3, joint training clearly achieves better performance in all cases (we do not show the remaining models, since the trends are very similar). It is also interesting to note that the models from joint training are less prone to overfitting than in the layerwise case; this is true even with fewer training examples (see Figure 3g and 3i).
IV-B Does Joint Training Result in Better Representations?
In the previous experiment we illustrated the advantage of joint training over greedy layerwise training when learning the data distribution $p(x)$. We also confirmed that joint training has a better reconstruction error in both training and testing than the layerwise case, since it focuses on reconstruction. A natural follow-up question is whether tuning all the parameters of a deep unsupervised model towards better representing the input leads to better higher-level representations. To answer this question, the following experiment was conducted. We trained two- and three-layered deep autoencoders (note that the actual depth of the network is double the number of hidden layers because of the presence of intermediate reconstruction layers) using both the layerwise and our joint training methods for 300 epochs (each layer is trained for 300 epochs in layerwise training). We then fixed all the weights and used the deep autoencoders as feature extractors on the data, i.e. we used the top encoding layer representation as the feature for the corresponding data. A support vector machine (SVM) with a linear kernel was then trained on those features and the test set classification performance was evaluated. The model that achieved the best validation performance was used to report the test set error, and the performance of the two different models is shown in Tables II and III. We report the test set classification error (all in percentages) along with its 95% confidence interval for all classification experiments. Since the computation of the contractive penalty of higher layers with respect to the input is expensive for the deep contractive autoencoder, we set  to save some computations. The results in Tables II and III suggest that the representations learned from joint training were generally better. It is also important to note that the model achieved a low error rate without any fine-tuning using the labeled data; in other words, the feature extraction was completely unsupervised. Interestingly, the three-layered models seemed to perform worse than the two-layered models in most cases for all methods, which contrasts with the generative performance. This is because the goal of a good unsupervised data model is to capture all information regarding $p(x)$, whereas for supervised tasks the goal is to learn a good model of $p(y \mid x)$. $p(x)$ contains useful information regarding $p(y \mid x)$, but not all the information in $p(x)$ is necessarily helpful. Therefore, good generative performance does not necessarily translate into good discriminative performance.
Table II: Test-set classification error (%) with 95% confidence interval for a linear SVM on features from deep denoising autoencoders.
Dataset/Method  DAE2L  DAE2U  DAE2J  DAE2UJ  DAE3L  DAE3U  DAE3J  DAE3UJ
MNIST  1.48 ± 0.24  1.94 ± 0.27  1.39 ± 0.23  1.47 ± 0.24  1.40 ± 0.23  1.85 ± 0.26  1.41 ± 0.23  1.71 ± 0.25
basic  2.75 ± 0.14  2.72 ± 0.14  2.66 ± 0.14  2.66 ± 0.14  2.65 ± 0.14  2.44 ± 0.14  2.98 ± 0.15  2.57 ± 0.14
rot  15.75 ± 0.32  17.09 ± 0.33  14.23 ± 0.31  14.98 ± 0.31  14.33 ± 0.31  15.89 ± 0.32  15.05 ± 0.31  13.25 ± 0.30
bgimg  20.69 ± 0.36  17.01 ± 0.33  17.23 ± 0.33  17.44 ± 0.33  21.36 ± 0.36  17.75 ± 0.33  21.77 ± 0.36  19.00 ± 0.34
bgrand  12.72 ± 0.29  13.65 ± 0.30  8.52 ± 0.24  9.51 ± 0.26  10.72 ± 0.27  12.11 ± 0.29  8.03 ± 0.24  8.70 ± 0.25
bgimgrot  52.44 ± 0.44  52.92 ± 0.44  49.61 ± 0.44  50.93 ± 0.44  56.19 ± 0.43  56.61 ± 0.43  57.04 ± 0.43  52.49 ± 0.44
rect  1.14 ± 0.09  0.97 ± 0.09  0.83 ± 0.08  0.87 ± 0.08  1.30 ± 0.10  1.85 ± 0.12  0.98 ± 0.09  1.26 ± 0.10
rectimg  22.84 ± 0.37  22.81 ± 0.37  21.96 ± 0.36  21.98 ± 0.36  24.22 ± 0.38  23.46 ± 0.37  23.04 ± 0.37  21.86 ± 0.36
convex  28.65 ± 0.40  25.78 ± 0.38  22.18 ± 0.36  25.65 ± 0.38  27.24 ± 0.39  25.00 ± 0.38  20.86 ± 0.36  24.42 ± 0.38
Table III: Test-set classification error (%) with 95% confidence interval for a linear SVM on features from deep contractive autoencoders.
Dataset/Method  CAE2L  CAE2U  CAE2J  CAE2UJ  CAE3L  CAE3U  CAE3J  CAE3UJ
MNIST  1.55 ± 0.24  1.91 ± 0.27  1.33 ± 0.22  1.65 ± 0.25  1.47 ± 0.24  1.98 ± 0.27  1.55 ± 0.24  1.74 ± 0.26
basic  2.96 ± 0.15  2.80 ± 0.14  2.38 ± 0.13  2.70 ± 0.14  2.90 ± 0.15  2.92 ± 0.15  2.80 ± 0.14  2.95 ± 0.15
rot  15.10 ± 0.31  15.29 ± 0.32  13.17 ± 0.30  15.38 ± 0.32  15.23 ± 0.31  16.23 ± 0.32  15.14 ± 0.31  15.59 ± 0.32
bgimg  19.80 ± 0.35  17.48 ± 0.33  19.05 ± 0.34  19.63 ± 0.35  21.31 ± 0.36  19.65 ± 0.35  21.68 ± 0.36  19.49 ± 0.35
bgrand  15.00 ± 0.31  12.50 ± 0.29  14.00 ± 0.30  13.18 ± 0.30  14.30 ± 0.31  12.05 ± 0.29  11.53 ± 0.28  12.68 ± 0.29
bgimgrot  52.57 ± 0.44  53.92 ± 0.44  52.00 ± 0.44  52.29 ± 0.44  53.66 ± 0.44  55.61 ± 0.44  54.32 ± 0.44  53.57 ± 0.44
rect  1.62 ± 0.11  1.77 ± 0.12  1.07 ± 0.09  1.85 ± 0.12  1.36 ± 0.10  2.40 ± 0.13  0.93 ± 0.08  1.99 ± 0.12
rectimg  22.62 ± 0.37  22.40 ± 0.37  22.23 ± 0.36  22.76 ± 0.37  22.37 ± 0.37  22.56 ± 0.37  23.06 ± 0.37  22.53 ± 0.37
convex  27.45 ± 0.39  25.90 ± 0.38  24.91 ± 0.38  25.31 ± 0.38  28.49 ± 0.40  26.66 ± 0.39  24.66 ± 0.38  25.26 ± 0.38
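The text does not state how the 95% confidence intervals were computed. The half-widths in the classification tables are, however, consistent with the usual normal-approximation interval for a binomial proportion, which the following sketch reproduces for two of the reported entries. The formula choice and the function name are assumptions, not a statement of the authors' procedure.

```python
import math


def error_interval_half_width(error_percent, n_test, z=1.96):
    """Half-width (in percent) of a normal-approximation binomial interval."""
    p = error_percent / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_test)


# MNIST has 10,000 test examples; the variation datasets have 50,000.
print(round(error_interval_half_width(1.48, 10000), 2))
print(round(error_interval_half_width(2.75, 50000), 2))
```

Under this assumption the interval depends only on the error rate and the test-set size, which explains why entries with similar errors on the same dataset share the same half-width.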
Apart from performing joint training directly from random initialization as in Equation 8, it is also possible to first apply layerwise training and then jointly train starting from the pretrained weights. To investigate whether joint training is beneficial in this situation, we performed another set of experiments in which we initialized the weights of the deep autoencoders with the pretrained weights from denoising and contractive autoencoders, and then performed unsupervised joint training for 300 epochs, with and without the corresponding regularization. The results are shown in Tables II and III. We use the suffix 'U' to indicate the results of this procedure when the subsequent joint training is performed without any regularization, and 'UJ' to indicate the case where the subsequent joint training is performed with the corresponding regularization. Results from the 'U' scheme are similar to the layerwise case, whereas the performance of 'UJ' is clearly better than the layerwise case. These results also suggest that the performance improvement from joint training is more significant when it is combined with the corresponding regularization, irrespective of the parameter initialization.
In summary, the representation learned through unsupervised joint training is better as compared to the layerwise case. In addition, it is important to apply appropriate regularizations during joint training in order to get the most benefit.
IV-C How Important is the Regularization?
From the previous experiments, it is clear that the joint training scheme has several advantages over the layerwise training scheme, and they also suggest that the use of proper regularization is important (compare, for example, the performance of ‘U’ and ‘UJ’ in Tables II and III). However, the role of regularization in training deep autoencoders is still unclear: is it possible to achieve good performance by applying joint training without any regularization? To investigate this, we trained two- and three-layered deep autoencoders for 300 epochs with L2 constraints on the weights, for fair comparison with the previous experiments. (Footnote 7: Training without any regularization resulted in very poor performance.) We again trained a linear SVM on the top-layer representation and report the results in Table IV. Performance was significantly worse than in the case where the more powerful regularizers from autoencoders were employed (see Tables II and III), especially when noise was present in the dataset. Hence, the results strongly suggest that, for unsupervised feature learning, more powerful regularization is required to achieve superior performance. In addition, it is more beneficial to incorporate such powerful regularization, for example denoising or contractive, during joint training (see Tables II and III).
Dataset/Method  AE2  AE3

MNIST  1.96 ± 0.27  2.36 ± 0.30
basic  3.20 ± 0.15  3.20 ± 0.15
rot  18.75 ± 0.34  66.70 ± 0.41
bgimg  26.71 ± 0.39  86.30 ± 0.30
bgrand  84.24 ± 0.32  85.95 ± 0.30
bgimgrot  58.33 ± 0.43  81.00 ± 0.34
rect  4.00 ± 0.17  4.15 ± 0.17
rectimg  24.30 ± 0.38  48.70 ± 0.44
convex  43.70 ± 0.43  45.08 ± 0.44
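The second number in each table cell appears consistent with the half-width of a 95% confidence interval on the test error under a normal approximation to the binomial. The sketch below checks this reading on two cells of the table above. The test-set sizes (10,000 for MNIST, 50,000 for the MNIST variation datasets) are assumptions taken from the standard versions of these benchmarks, and `ci_halfwidth` is a hypothetical helper, not something defined in the paper.

```python
import math


def ci_halfwidth(error_percent: float, n_test: int, z: float = 1.96) -> float:
    """Half-width, in percentage points, of a normal-approximation
    confidence interval for an error rate measured on n_test examples."""
    p = error_percent / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_test)


# AE2 on MNIST: 1.96% error, assuming 10,000 test examples.
print(round(ci_halfwidth(1.96, 10_000), 2))  # 0.27

# AE2 on 'basic': 3.20% error, assuming 50,000 test examples.
print(round(ci_halfwidth(3.20, 50_000), 2))  # 0.15
```

Under this reading, two entries differ meaningfully only when their intervals do not overlap, which is worth keeping in mind when comparing the closely spaced 2-layer results.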
Dataset/Method  DAE2L  DAE2U  DAE2J  DAE2UJ  DAE3L  DAE3U  DAE3J  DAE3UJ

MNIST  1.05 ± 0.20  1.17 ± 0.21  1.17 ± 0.21  1.21 ± 0.21  1.20 ± 0.21  1.15 ± 0.21  1.10 ± 0.21  1.26 ± 0.22
basic  2.55 ± 0.14  2.42 ± 0.13  2.44 ± 0.14  2.68 ± 0.14  3.14 ± 0.15  2.50 ± 0.14  2.65 ± 0.14  2.70 ± 0.14
rot  9.51 ± 0.26  9.77 ± 0.26  8.40 ± 0.24  9.57 ± 0.26  9.65 ± 0.26  9.32 ± 0.25  7.87 ± 0.24  8.96 ± 0.25
bgimg  15.21 ± 0.31  17.08 ± 0.33  17.11 ± 0.33  17.50 ± 0.33  24.01 ± 0.37  15.47 ± 0.32  18.65 ± 0.34  17.02 ± 0.33
bgrand  12.16 ± 0.29  11.91 ± 0.28  8.98 ± 0.25  10.88 ± 0.27  18.11 ± 0.34  11.64 ± 0.28  8.04 ± 0.24  8.48 ± 0.24
bgimgrot  46.35 ± 0.44  47.90 ± 0.44  46.94 ± 0.44  47.71 ± 0.44  56.69 ± 0.43  45.16 ± 0.44  46.95 ± 0.44  46.51 ± 0.44
rect  1.60 ± 0.11  1.45 ± 0.10  0.98 ± 0.09  0.98 ± 0.09  1.39 ± 0.10  1.89 ± 0.12  0.92 ± 0.08  0.82 ± 0.08
rectimg  21.87 ± 0.36  21.09 ± 0.36  21.87 ± 0.36  22.53 ± 0.37  24.79 ± 0.38  22.17 ± 0.36  22.49 ± 0.37  22.48 ± 0.37
convex  19.33 ± 0.35  19.24 ± 0.35  18.60 ± 0.34  19.88 ± 0.35  23.17 ± 0.37  18.03 ± 0.34  18.33 ± 0.34  19.20 ± 0.35
Dataset/Method  CAE2L  CAE2U  CAE2J  CAE2UJ  CAE3L  CAE3U  CAE3J  CAE3UJ

MNIST  1.25 ± 0.22  1.35 ± 0.23  1.09 ± 0.20  1.26 ± 0.22  1.47 ± 0.24  1.53 ± 0.24  1.11 ± 0.21  1.24 ± 0.22
basic  2.74 ± 0.14  2.85 ± 0.15  2.63 ± 0.14  2.74 ± 0.14  3.23 ± 0.15  3.01 ± 0.15  2.78 ± 0.14  2.84 ± 0.15
rot  10.31 ± 0.27  10.47 ± 0.27  8.56 ± 0.25  10.18 ± 0.27  10.85 ± 0.27  10.04 ± 0.26  7.91 ± 0.24  9.36 ± 0.26
bgimg  15.82 ± 0.32  18.45 ± 0.34  18.34 ± 0.34  18.45 ± 0.34  25.32 ± 0.38  16.58 ± 0.33  17.24 ± 0.33  17.01 ± 0.33
bgrand  13.12 ± 0.30  10.53 ± 0.27  12.99 ± 0.29  12.53 ± 0.29  18.21 ± 0.34  11.19 ± 0.28  13.67 ± 0.30  12.25 ± 0.29
bgimgrot  46.37 ± 0.44  47.38 ± 0.44  47.77 ± 0.44  48.00 ± 0.44  58.93 ± 0.43  46.80 ± 0.44  47.19 ± 0.44  47.50 ± 0.44
rect  2.17 ± 0.13  1.99 ± 0.12  1.17 ± 0.09  1.39 ± 0.10  2.44 ± 0.14  2.15 ± 0.13  1.04 ± 0.09  1.51 ± 0.11
rectimg  21.83 ± 0.36  21.72 ± 0.36  22.13 ± 0.36  22.22 ± 0.36  24.56 ± 0.38  22.03 ± 0.36  23.22 ± 0.37  23.58 ± 0.37
convex  18.60 ± 0.34  18.63 ± 0.34  18.01 ± 0.34  19.23 ± 0.35  23.20 ± 0.37  18.60 ± 0.34  18.00 ± 0.34  19.20 ± 0.35
IV-D How does Joint Training Affect Fine-tuning?
So far, we have compared joint with layerwise training in unsupervised representation learning settings. We now turn our attention to the supervised setting and investigate how joint training affects fine-tuning. In this experiment, the unsupervised deep autoencoders were used to initialize the parameters of a multilayer perceptron for supervised fine-tuning (in the same way as one would use layerwise pretraining for supervised tasks). Fine-tuning was performed for all previously trained models for a maximum of 1,000 epochs, with early stopping on the validation set error. As expected, the performance of the standard deep autoencoder (see Table VII) was not very impressive, except on MNIST, which contains ‘cleaner’ samples and significantly more training examples. It is also reasonable to expect similar performance from layerwise and joint training, since the supervised fine-tuning process adjusts all parameters to better fit the supervised task. This is partially true, as can be observed from the results in Tables V and VI. The performance of the 2-layer models is close in almost all cases. However, in the 3-layer case the models trained with joint training appear to perform better. This holds for the models pretrained via joint training without or with regularization (i.e., schemes ‘U’ and ‘UJ’, respectively), which might suggest that joint training is more beneficial for deeper models. Hence, even when all parameters of the model are tuned for a supervised task, unsupervised joint training can still be beneficial, especially for deeper models. The results also suggest that, as long as appropriate regularization is employed in joint pretraining, the initialization does not significantly influence the supervised performance.
Dataset/Method  AE2  AE3

MNIST  1.36 ± 0.23  1.30 ± 0.22
basic  3.18 ± 0.15  3.04 ± 0.15
rot  9.95 ± 0.26  9.43 ± 0.26
bgimg  24.38 ± 0.38  24.00 ± 0.37
bgrand  14.47 ± 0.31  18.53 ± 0.34
bgimgrot  52.00 ± 0.44  54.91 ± 0.44
rect  2.91 ± 0.15  3.08 ± 0.15
rectimg  23.45 ± 0.37  24.00 ± 0.37
convex  21.27 ± 0.36  22.25 ± 0.36
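The fine-tuning protocol described in this section (at most 1,000 epochs, with early stopping on the validation error) can be sketched as a generic loop. This is a minimal illustration, not the authors' implementation: the `patience` rule and the two callbacks `train_one_epoch` and `validation_error` are assumptions, since the text does not state the exact stopping criterion.

```python
from typing import Callable, Tuple


def finetune_with_early_stopping(
    train_one_epoch: Callable[[], None],
    validation_error: Callable[[], float],
    max_epochs: int = 1000,  # upper bound taken from the text
    patience: int = 20,      # assumed; not specified in the text
) -> Tuple[int, float]:
    """Train until the validation error has not improved for `patience`
    consecutive epochs, or until `max_epochs` is reached.
    Returns (best_epoch, best_validation_error)."""
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            # A real training script would also checkpoint the parameters here.
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error


# Toy usage with a pre-recorded validation curve instead of a real model.
curve = iter([5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8])
result = finetune_with_early_stopping(lambda: None, lambda: next(curve), patience=2)
print(result)  # (3, 3.0)
```

In the toy run the validation error stops improving after epoch 3, so with a patience of 2 the loop ends at epoch 5 and reports epoch 3 as the best one.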
V Conclusion
In this paper we presented an unsupervised method for jointly training all layers of a deep autoencoder and analysed its performance against greedy layerwise training in various circumstances. We formulated a single objective for the deep autoencoder, consisting of a global reconstruction objective with local constraints on the hidden layers, so that all layers could be trained jointly. This can also be viewed as a generalization of single-layer autoencoder training to multilayer autoencoders, and it provides a straightforward way to stack the different variants of autoencoders. Empirically, we showed that the joint training method not only learned better data models, but also learned more representative features for classification than the layerwise method, which highlights its potential for unsupervised feature learning. In addition, the experiments showed that the success of the joint training technique depends on the more powerful regularizations proposed in the more recent variants of autoencoders. In the supervised setting, joint training also showed superior performance when training deeper models. Going forward, this framework of jointly training deep autoencoders can provide a platform for investigating more efficient usage of different types of regularizers, especially in light of the growing volumes of available unlabeled data.
Appendix A Additional Qualitative Samples
References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1106–1114.
[2] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014, pp. 818–833.
[3] A. Mohamed, G. E. Dahl, and G. E. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[4] G. E. Dahl, M. Ranzato, A. Mohamed, and G. E. Hinton, “Phone recognition with the mean-covariance restricted Boltzmann machine,” in NIPS, 2010, pp. 469–477.
[5] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. E. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” in INTERSPEECH, 2010, pp. 1692–1695.
[6] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in CVPR, 2014.
[7] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[8] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in NIPS, 2006, pp. 153–160.
[9] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, “Efficient learning of sparse representations with an energy-based model,” in NIPS. MIT Press, 2006.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012.
[11] L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, “Regularization of neural networks using DropConnect,” in ICML, 2013, pp. 1058–1066.
[12] N. L. Roux and Y. Bengio, “Representational power of restricted Boltzmann machines and deep belief networks,” Neural Computation, vol. 20, no. 6, pp. 1631–1649, 2008.
[13] L. Arnold and Y. Ollivier, “Layer-wise learning of deep generative models,” CoRR, vol. abs/1212.1524, 2012.
[14] R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009, pp. 448–455.
[15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, pp. 533–536, 1986.
[16] H. Bourlard and Y. Kamp, “Auto-association by multilayer perceptrons and singular value decomposition,” Biological Cybernetics, vol. 59, no. 4-5, pp. 291–294, 1988.
[17] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in ICML, 2008, pp. 1096–1103.
[18] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in ICML, 2011, pp. 833–840.
[19] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[20] J. Martens, “Deep learning via Hessian-free optimization,” in ICML, 2010, pp. 735–742.
[21] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, “On the importance of initialization and momentum in deep learning,” in ICML, 2013, pp. 1139–1147.
[22] J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in NIPS, 2012, pp. 350–358.
[23] M. Chen, K. Q. Weinberger, F. Sha, and Y. Bengio, “Marginalized denoising auto-encoders for nonlinear representations,” in ICML, 2014, pp. 1476–1484.
[24] Y. Bengio, E. Thibodeau-Laufer, and J. Yosinski, “Deep generative stochastic networks trainable by backprop,” in ICML, 2014.
[25] Y. LeCun and C. Cortes, “The MNIST database of handwritten digits.” [Online]. Available: http://yann.lecun.com/exdb/mnist/
[26] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio, “An empirical evaluation of deep architectures on problems with many factors of variation,” in ICML, 2007, pp. 473–480.
[27] T. Tieleman and G. E. Hinton, “Lecture 6.5 - rmsprop,” COURSERA: Neural Networks for Machine Learning, 2012.
[28] G. Desjardins, A. C. Courville, and Y. Bengio, “Adaptive parallel tempering for stochastic maximum likelihood learning of RBMs,” CoRR, vol. abs/1012.3476, 2010.
[29] O. Breuleux, Y. Bengio, and P. Vincent, “Unlearning for better mixing,” Université de Montréal/DIRO, Tech. Rep. 1349, 2010.