I Introduction
Learning algorithms for deep multilayer neural networks have been known for a long time [3], though they usually could not outperform simpler, shallow networks. As a result, deep multilayer networks were not widely used to solve large-scale real-world problems until the last decade [4]. In 2006, Deep Belief Networks (DBNs) [5] came out as a real breakthrough in this field, since the proposed learning algorithms turned out to be a feasible and practical method to train deep networks, with spectacular results [6, 7, 8, 9]. DBNs have Restricted Boltzmann Machines (RBMs) [10] as their building blocks.
RBMs are topologically constrained Boltzmann Machines (BMs) with two layers, one of hidden and another of visible neurons, and no intralayer connections. This property makes working with RBMs simpler than with regular BMs, and in particular the stochastic computation of the log-likelihood gradient may be performed more efficiently by means of Gibbs sampling [11, 4].
In 2002, the Contrastive Divergence (CD) learning algorithm was proposed as an efficient training method for product-of-experts models, of which RBMs are a special case [12]. It was observed that using CD to train RBMs worked quite well in practice. This fact was important for deep learning since some authors suggested that a multilayer deep neural network is better trained when each layer is pre-trained separately as if it were a single RBM [6, 13, 7]. Thus, training RBMs with CD and stacking them seems to be a good way to go when designing deep learning architectures.
However, the picture is not as nice as it looks, since CD is not a flawless training algorithm. Although CD approximates the true log-likelihood gradient [14], it is biased and it may not converge in some cases [15, 16, 17]. Moreover, it has been observed that CD, and variants such as Persistent CD [18] or Fast Persistent CD [19], can lead to a steady decrease of the log-likelihood during learning [2, 20]. Therefore, the risk of learning divergence imposes the requirement of a stopping criterion. There are two main methods used to decide when to stop the learning process. One is based on monitoring the reconstruction error [21]. The other is based on the estimation of the log-likelihood with Annealed Importance Sampling (AIS) [22, 23]. The reconstruction error is easy to compute and has often been used in practice, though its adequacy remains unclear because it usually decreases monotonically [2]. AIS seems to work better than the reconstruction error in most cases, though it is considerably more expensive to compute, and may also fail [1].
In this work we approach this problem from a completely different perspective. Based on the fact that the energy is a continuous and smooth function of its variables, the close neighborhood of the high-probability states is also expected to acquire a significant amount of probability. In this sense, we argue that the information contained in the neighborhood of the training data is valuable, and that it can be incorporated in the learning process of RBMs. In particular, we propose to use it to monitor the log-likelihood of the model by means of a new quantity that depends on the information contained in the training set and its neighbors. Furthermore, in order to make it computationally tractable, we build it in such a way that it becomes independent of the partition function of the model. In this way, we propose a neighborhood-based stopping criterion for CD and show its performance on several data sets.
II Learning in Restricted Boltzmann Machines
II-A Energy-based Probabilistic Models
Energy-based probabilistic models define a probability distribution from an energy function $E(x, h)$, as follows:
$$P(x, h) = \frac{e^{-E(x, h)}}{Z} \tag{1}$$
where $x$ and $h$ stand for (typically binary) visible and hidden variables, respectively. The normalization factor $Z$ is called the partition function and reads
$$Z = \sum_{x, h} e^{-E(x, h)} \tag{2}$$
Since only $x$ is observed, one is interested in the marginal distribution
$$P(x) = \frac{\sum_{h} e^{-E(x, h)}}{Z} \tag{3}$$
but the evaluation of the partition function $Z$ is computationally prohibitive since it involves an exponentially large number of terms. As a consequence, one cannot measure $P(x)$ directly.
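To give a sense of why the partition function is the bottleneck, the following minimal sketch evaluates the normalization in (2) and the marginal in (3) by brute force for a toy model. The function names and the toy energy are illustrative assumptions, not part of the original work. The nested loops visit every joint state, so the approach is only feasible for a handful of units.

```python
import itertools
import math


def partition_function(energy, n_visible, n_hidden):
    """Sum exp(-energy(x, h)) over all 2**(n_visible + n_hidden) binary states."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=n_visible):
        for h in itertools.product((0, 1), repeat=n_hidden):
            total += math.exp(-energy(x, h))
    return total


def marginal(energy, x, n_hidden, Z):
    """P(x): sum of exp(-energy(x, h)) over all hidden states, divided by Z."""
    unnorm = sum(math.exp(-energy(x, h))
                 for h in itertools.product((0, 1), repeat=n_hidden))
    return unnorm / Z


def toy_energy(x, h):
    """Hypothetical toy energy: every visible unit is coupled to every hidden unit."""
    return -0.5 * sum(xi * hj for xi in x for hj in h)


Z = partition_function(toy_energy, n_visible=3, n_hidden=2)
probs = [marginal(toy_energy, x, 2, Z)
         for x in itertools.product((0, 1), repeat=3)]
print(sum(probs))  # the marginals over all visible states add up to one
```

Each additional binary unit doubles the number of terms in the sum, which is why the exact evaluation is restricted to small problems such as the ones studied later in the experiments.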
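The energy function depends on several parameters $\theta$ that are adjusted at the learning stage. This is done by maximizing the likelihood of the data. In energy-based models, the derivative of the log-likelihood can be expressed as
$$\frac{\partial \log P(x; \theta)}{\partial \theta} = -\mathbb{E}_{P(h \mid x)}\left[\frac{\partial E(x, h)}{\partial \theta}\right] + \mathbb{E}_{P(\tilde{x})}\left[\mathbb{E}_{P(h \mid \tilde{x})}\left[\frac{\partial E(\tilde{x}, h)}{\partial \theta}\right]\right] \tag{4}$$
where $E$ is the energy, $\mathbb{E}$ denotes an expectation, the first term is called the positive phase and the second term the negative phase. Similar to (3), the exact computation of the derivative of the log-likelihood is usually unfeasible because of the negative phase in (4), which comes from the derivative of the partition function.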
II-B Restricted Boltzmann Machines
Restricted Boltzmann Machines are energy-based probabilistic models whose energy function is:
$$E(x, h) = -b^{\top} x - c^{\top} h - h^{\top} W x \tag{5}$$
RBMs are at the core of DBNs [5] and other deep architectures that use RBMs for unsupervised pre-training prior to the supervised step [6, 13, 7].
The consequence of the particular form of the energy function is that in RBMs both $P(h \mid x)$ and $P(x \mid h)$ factorize. In this way it is possible to compute $P(h \mid x)$ and $P(x \mid h)$ in one step, making it possible to perform Gibbs sampling efficiently, in contrast to more general models like Boltzmann Machines [24].
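The factorization can be illustrated with a minimal sketch of one Gibbs step in a binary RBM. It assumes the usual bilinear energy with a weight matrix `W` (one row per hidden unit), visible biases `b` and hidden biases `c`; these names and the helper functions are illustrative choices, not taken from the original text.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def p_h_given_x(x, W, c):
    """P(h_j = 1 | x) for every hidden unit j; W[j][i] couples h_j and x_i."""
    return [sigmoid(c[j] + sum(W[j][i] * x[i] for i in range(len(x))))
            for j in range(len(c))]


def p_x_given_h(h, W, b):
    """P(x_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample_bernoulli(probs, rng):
    """Draw one independent binary value per probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(x, W, b, c, rng):
    """One full Gibbs step x -> h -> x', sampling each layer in a single pass."""
    h = sample_bernoulli(p_h_given_x(x, W, c), rng)
    x_new = sample_bernoulli(p_x_given_h(h, W, b), rng)
    return x_new, h


rng = random.Random(0)
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]  # 2 hidden units, 3 visible units
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
x_new, h = gibbs_step([1, 0, 1], W, b, c, rng)
```

Because there are no intralayer connections, all units of a layer are sampled independently given the other layer, which is what makes each step cheap.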
II-C Contrastive Divergence
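The most common learning algorithm for RBMs relies on an estimate of the derivative of the log-likelihood that was originally proposed for Product of Experts models. This algorithm is called Contrastive Divergence [12].
Contrastive Divergence CD$_n$ estimates the derivative of the log-likelihood for a given point $x$ as
$$\mathrm{CD}_n(x) = -\mathbb{E}_{P(h \mid x)}\left[\frac{\partial E(x, h)}{\partial \theta}\right] + \mathbb{E}_{P(h \mid x_{n+1})}\left[\frac{\partial E(x_{n+1}, h)}{\partial \theta}\right] \tag{6}$$
where $x_{n+1}$ is the last sample from the Gibbs chain starting from $x_1 = x$, obtained after $n$ steps:
$h_1 \sim P(h \mid x_1)$
$x_2 \sim P(x \mid h_1)$
$h_2 \sim P(h \mid x_2)$
…
$h_n \sim P(h \mid x_n)$
$x_{n+1} \sim P(x \mid h_n)$.
Usually, $\mathbb{E}_{P(h \mid x)}\left[\partial E(x, h) / \partial \theta\right]$ can be easily computed.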
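The chain above can be sketched in a few lines. This is a minimal illustration for a binary RBM, assuming the bilinear energy with weights `W`, visible biases `b` and hidden biases `c`; the function names are hypothetical and the update for a single data point is shown without learning rate, momentum or mini-batches.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_probs(x, W, c):
    return [sigmoid(c[j] + sum(w * xi for w, xi in zip(W[j], x)))
            for j in range(len(c))]


def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def cd_n(x1, W, b, c, n, rng):
    """CD-n estimate of the log-likelihood gradient at the data point x1."""
    x = list(x1)
    for _ in range(n):  # x_1 -> h_1 -> x_2 -> ... -> x_{n+1}
        h = sample(hidden_probs(x, W, c), rng)
        x = sample(visible_probs(h, W, b), rng)
    ph_data = hidden_probs(x1, W, c)   # positive phase, computed exactly
    ph_chain = hidden_probs(x, W, c)   # negative phase, at the end of the chain
    dW = [[ph_data[j] * x1[i] - ph_chain[j] * x[i] for i in range(len(b))]
          for j in range(len(c))]
    db = [x1[i] - x[i] for i in range(len(b))]
    dc = [ph_data[j] - ph_chain[j] for j in range(len(c))]
    return dW, db, dc


rng = random.Random(0)
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
dW, db, dc = cd_n([1, 0, 1], W, b, c, n=1, rng=rng)
```

The expectation over the hidden units is taken analytically in both phases, so the only stochastic part is the short Gibbs chain that produces the final visible sample.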
II-D Monitoring the Learning Process in RBMs
Learning in RBMs is a delicate procedure involving a lot of data processing that one seeks to perform at a reasonable speed in order to be able to handle large spaces with a huge number of states. In doing so, drastic approximations that can only be understood in a statistically averaged sense are performed [25].
One of the most relevant points to consider at the learning stage is how to determine whether a good solution has been found, and so to decide when the learning process should stop. One of the most widely used stopping criteria is based on monitoring the reconstruction error, which is a measure of the capability of the network to produce an output that is consistent with the data at input. Since RBMs are probabilistic models, the reconstruction error of a data point $x^{(i)}$ is computed as the probability of $x^{(i)}$ given the expected value of $h$ for $x^{(i)}$:
$$R(x^{(i)}) = -\log P\left(x^{(i)} \mid \mathbb{E}\left[h \mid x^{(i)}\right]\right) \tag{7}$$
which is a probabilistic extension of the sum-of-squares reconstruction error for deterministic networks
$$\mathcal{E}(x^{(i)}) = \left\| x^{(i)} - \hat{x}^{(i)} \right\|^2 \tag{8}$$
with $\hat{x}^{(i)}$ the output produced by the network for the input $x^{(i)}$.
Some authors have shown that, in some cases, learning induces an undesirable decrease in likelihood that goes undetected by the reconstruction error [1, 2]. It has been shown [2] that the reconstruction error defined in (7) usually decreases monotonically. Since no increase in the reconstruction error takes place during training, there is no apparent way to detect the change of behavior of the log-likelihood for CD.
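As a rough illustration of the two reconstruction errors referred to as (7) and (8), the sketch below computes both for a binary RBM. It is an assumed reading of those definitions: the hidden layer is replaced by its expected value, the visible probabilities are obtained from it, and the error is either the negative log-probability of the input or the squared difference. The parameter names `W`, `b`, `c` and the function names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def reconstruct(x, W, b, c):
    """Visible probabilities given the expected hidden activations for x."""
    h_mean = [sigmoid(c[j] + sum(w * xi for w, xi in zip(W[j], x)))
              for j in range(len(c))]
    return [sigmoid(b[i] + sum(W[j][i] * h_mean[j] for j in range(len(c))))
            for i in range(len(b))]


def probabilistic_error(x, W, b, c):
    """Negative log-probability of x under the reconstruction (cross-entropy)."""
    p = reconstruct(x, W, b, c)
    return -sum(math.log(pi) if xi else math.log(1.0 - pi)
                for xi, pi in zip(x, p))


def squared_error(x, W, b, c):
    """Sum-of-squares distance between x and its reconstruction."""
    p = reconstruct(x, W, b, c)
    return sum((xi - pi) ** 2 for xi, pi in zip(x, p))


# With all parameters at zero every reconstructed probability is 0.5.
W0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b0 = [0.0, 0.0, 0.0]
c0 = [0.0, 0.0]
r_prob = probabilistic_error([1, 0, 1], W0, b0, c0)
r_sq = squared_error([1, 0, 1], W0, b0, c0)
```

Neither quantity involves the partition function, which explains why they are cheap to monitor, but also why they carry no information about how probability mass is distributed outside the training points.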
III Proposed Stopping Criterion
The proposed stopping criterion is based on monitoring the ratio of two quantities: the geometric average of the probabilities of the training set, and the sum of probabilities of points in a given neighbourhood of the training set. More formally, what we monitor is
$$\xi_d = \frac{\sqrt[N]{\prod_{i=1}^{N} P(x^{(i)})}}{\sum_{y \in D} P(y)} \tag{9}$$
where $x^{(1)}, \dots, x^{(N)}$ are the training points and $D$ is a subset of points at a Hamming distance from the training set less than or equal to $d$. The idea behind the definition is that the evolution of $\xi_d$ at the learning stage is expected to closely resemble that of the log-likelihood for certain values of $d$ and $D$. For that reason we propose as the stopping criterion to find the maximum of $\xi_d$, which will be close to the one shown by the log-likelihood of the data, as shown by the experiments in the next sections.
The reason for that is twofold. On the one hand, the numerator and denominator monitor different things. The numerator, which is essentially the likelihood of the data, is sensitive to the accumulation of most of the probability mass by a reduced subset of the training data, a typical feature of CD. For continuity reasons, the denominator is strongly correlated with the sum of probabilities of the training data. Once the problem has been learnt, the probabilities in a close neighborhood of the training set will be high. The value of $\xi_d$ results from a delicate equilibrium between these two quantities (see section IV), which we propose to use as a stopping criterion for learning. On the other hand, due to the structure of $\xi_d$, the partition functions involved in the numerator and the denominator cancel out, which is a necessary condition in the design of the quantity being monitored. In other words, $\xi_d$ can be equivalently defined as
$$\xi_d = \frac{\sqrt[N]{\prod_{i=1}^{N} \sum_{h} e^{-E(x^{(i)}, h)}}}{\sum_{y \in D} \sum_{h} e^{-E(y, h)}} \tag{10}$$
The particular topology of RBMs allows the sum over hidden states $\sum_{h} e^{-E(x, h)}$ to be computed efficiently. This fact dramatically decreases the computational cost involved in the calculation, which would otherwise become unfeasible in most real-world problems where RBMs could be successfully applied.
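A minimal sketch of how the ratio could be evaluated without the partition function is given below. It assumes a binary RBM with the bilinear energy, for which the sum over hidden states has a closed form, and it works in log space for numerical convenience. The names `log_unnorm_prob` and `log_xi`, and the parameters `W`, `b`, `c`, are illustrative assumptions rather than the authors' implementation.

```python
import math


def log_unnorm_prob(x, W, b, c):
    """log of sum_h exp(-E(x, h)) for a binary RBM; no partition function needed."""
    s = sum(bi * xi for bi, xi in zip(b, x))
    for j in range(len(c)):
        a = c[j] + sum(w * xi for w, xi in zip(W[j], x))
        s += math.log1p(math.exp(a))  # each hidden unit contributes 1 + e^a
    return s


def log_xi(train, neighbors, W, b, c):
    """log of the ratio: geometric mean over train / sum over neighbors."""
    log_num = sum(log_unnorm_prob(x, W, b, c) for x in train) / len(train)
    logs = [log_unnorm_prob(y, W, b, c) for y in neighbors]
    m = max(logs)  # log-sum-exp for the denominator
    log_den = m + math.log(sum(math.exp(v - m) for v in logs))
    return log_num - log_den


# Toy check with all parameters at zero: every state has the same weight,
# so the ratio reduces to 1 / (number of neighbors).
W0 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b0 = [0.0, 0.0, 0.0]
c0 = [0.0, 0.0]
train = [(1, 0, 1), (0, 1, 0)]
neighbors = [(0, 0, 1), (1, 1, 1), (1, 0, 0), (0, 0, 0)]
value = log_xi(train, neighbors, W0, b0, c0)
```

The cost is linear in the number of training points plus neighbors, and each term needs only one pass over the weights, so the quantity can be tracked during training at a cost comparable to the reconstruction error when the neighbor set is small.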
While the numerator in $\xi_d$ is directly evaluated from the data in the training set, the problem of finding suitable values for $D$ still remains. Indeed, the set of points at a given Hamming distance from the training set is independent of the weights and biases of the network. In this way, it can be built once at the very beginning of the process and be used as required during learning. Therefore, two issues have to be sorted out before the criterion can be applied. The first one is to decide a suitable value of $d$. Experiments with different problems show that this is indeed problem dependent, as is illustrated in the experimental section below. The second one is the choice of the subset $D$, which strongly depends on the size of the space being explored. For small spaces one can safely use the complete set of points at a distance less than or equal to $d$, but that can be prohibitively large in real-world problems. For this reason we explore two possibilities: one including all points and another including only a random subset of the same size as the training set, which is only as expensive as dealing with the training set.
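One possible way to build such a random subset once, before training starts, is sketched below. The sampling scheme (pick a training point, flip between 1 and `d` randomly chosen bits, discard points that belong to the training set) is an assumption made for illustration; the text does not specify how the subset was drawn, and the function names are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions in which two binary tuples differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def random_neighbors(train, d, size, rng):
    """Draw `size` distinct non-training states within Hamming distance d
    of at least one training point. Loops forever if fewer than `size`
    such states exist, so it should only be used when neighbors are plentiful."""
    train_set = {tuple(x) for x in train}
    n_bits = len(train[0])
    chosen = set()
    while len(chosen) < size:
        base = list(rng.choice(train))
        k = rng.randint(1, d)                 # how many bits to flip
        for i in rng.sample(range(n_bits), k):
            base[i] ^= 1
        y = tuple(base)
        if y not in train_set:
            chosen.add(y)
    return sorted(chosen)


train = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1)]
subset = random_neighbors(train, d=2, size=len(train), rng=random.Random(0))
```

Since the subset does not depend on the weights and biases, it can be stored and reused every time the criterion is evaluated.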
IV Experiments
We performed several experiments to explore the criterion defined in section III and to study the behavior of $\xi_d$ in comparison with the log-likelihood and the reconstruction error of the data in several problems. We have explored problems of a size such that the log-likelihood can be exactly evaluated and compared with the proposed parameter.
The first problem, denoted Bars and Stripes (BS), tries to identify vertical and horizontal lines in 4×4 pixel images. The training set consists of the whole set of images containing all possible horizontal or vertical lines (but not both), ranging from no lines (blank image) to completely filled images (black image), thus producing 30 different images (avoiding the repetition of the fully black and fully white images) out of the space of $2^{16}$ possible images with black or white pixels. The second problem, named Labeled Shifter Ensemble (LSE), consists of learning 19-bit states formed as follows: given an initial 8-bit pattern, generate three new states by concatenating to it the bit sequences 001, 010 or 100. The final 8-bit pattern of the state is the original one shifted one bit to the left if the intermediate code is 001, copied unchanged if the code is 010, or shifted one bit to the right if the code is 100. One thus generates the training set using all possible states that can be created in this form, while the system space consists of all possible different states one can build with 19 bits. These two problems have already been explored in [2] and are adequate in the current context since, while still large, the dimensionality of the space allows for direct monitoring of the partition function and the log-likelihood during learning. For the sake of completeness, we have also tested the proposed criterion on randomly generated problems with different space dimensions, where the number of states to be learnt is significantly smaller than the size of the space. In particular, we have generated four different data sets (RAN10, RAN12, RAN14 and RAN16) consisting of binary input units and examples to be learnt, as suggested in [26].
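The two structured data sets are small enough to be generated exhaustively. The sketch below is one way to do so, under two assumptions that the description leaves open: images are flattened row by row, and the LSE shifts wrap around cyclically. The function names are illustrative.

```python
import itertools


def bars_and_stripes(n=4):
    """All n x n images made only of full horizontal or only of full vertical lines."""
    images = set()
    for mask in itertools.product((0, 1), repeat=n):
        rows = tuple(mask[r] for r in range(n) for _ in range(n))  # horizontal lines
        cols = tuple(mask[c] for _ in range(n) for c in range(n))  # vertical lines
        images.add(rows)
        images.add(cols)  # blank and full images coincide and are stored once
    return sorted(images)


def labeled_shifter_ensemble(n=8):
    """All states: n-bit pattern + 3-bit code + shifted copy of the pattern."""
    states = []
    for pattern in itertools.product((0, 1), repeat=n):
        left = pattern[1:] + pattern[:1]      # cyclic shift left (assumed wraparound)
        right = pattern[-1:] + pattern[:-1]   # cyclic shift right (assumed wraparound)
        for code, shifted in (((0, 0, 1), left),
                              ((0, 1, 0), pattern),
                              ((1, 0, 0), right)):
            states.append(pattern + code + shifted)
    return states


bs = bars_and_stripes()
lse = labeled_shifter_ensemble()
print(len(bs), len(lse))  # sizes of the two training sets
```

The Bars and Stripes count follows from 16 row patterns plus 16 column patterns minus the two images shared by both families, and the LSE count from 256 patterns times three codes.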
In the following we discuss the learning processes of these problems with binary RBMs, employing the Contrastive Divergence algorithm CD with and as described in section IIC. In the BS case the RBM had 16 visible and 8 hidden units, while in the LSE problem these numbers were 19 and 10, respectively. For the random data sets we have used 10 hidden units in each case.
Every simulation was carried out for a total of 50000 epochs, with measures being taken every 50 epochs. Moreover, every point in the subsequent plots is the average of ten different simulations starting from different random values of the weights and biases. Other parameters affecting the results that were changed along the analysis are the learning rates involved in the weight and bias update rules. No weight decay was used, and momentum was set to 0.8. The learning rates were chosen in order to make sure that the log-likelihood degenerates, in such a way that it presents a clear maximum that should be detected by $\xi_d$.
In the following we perform two series of experiments that are reported in the next two subsections. In the first one (section IV-A) we analyze the case where all states in the neighborhood are included. In the second one (section IV-B) we reduce the computational cost of the evaluation of $\xi_d$ by selecting only a small subset of all the states in the neighborhood.
IV-A Complete Neighborhoods
We present the results for the problems at hand, showing for each analyzed instance different plots corresponding to the actual log-likelihood of the problem and $\xi_d$ for different values of $d$, among other things. In order to identify the contributions to $\xi_d$ from the different neighborhoods of the training set, we define two different sets: one containing all states at a distance less than or equal to $d$, and another accounting for those states at a distance exactly equal to $d$. We have computed $\xi_d$ for both sets in all the experiments discussed in the following.
Figure 1 shows our results for the RAN10 data set. The upper left panel shows the log-likelihood of the data during training. As can be seen, there is a clear maximum that should be identified as the stopping point. The panels below show the reconstruction errors (7) and (8), which clearly fail to identify the desired extremum. The rest of the columns show results for distances and . The first row depicts $\xi_d$ for , where all states at the required distances are taken into account. As can be seen, starting at the criterion is robust and consistently detects the maximum of the log-likelihood at the right place, thus reinforcing the idea that the neighborhood of the data contains valuable information. The second row shows the denominator of $\xi_d$ corresponding to the first row, that is, the sum of probabilities of the states included in each case. Notice that when the included states cover the whole space this sum equals one and $\xi_d$ is exactly equal to the likelihood of the data. More interestingly, even when the sum is still far away from one, as it happens for , $\xi_d$ consistently finds the desired point. This behavior is also observed in the rest of the data sets analyzed. Finally, the third row shows $\xi_d$ for , thus showing the behavior of the criterion applied to different shells. For and the criterion detects the maximum of the log-likelihood reasonably well and can be used to identify the desired stopping point. Notice, though, that the data alone, entirely contained at $d = 0$, is not capable of reproducing this behavior. Moreover, for $d$ larger than the criterion also fails, as it is expected that starting at a certain distance the information regarding the model is lost. Please notice that the initial transient behavior of some of the plots above is meaningless and has therefore been cut.
Equivalent results for the RAN14 case are shown in figure 2. The log-likelihood and the probabilistic reconstruction error in (7) are depicted in the upper and lower panels in the first column, respectively. The other panels show $\xi_d$ for and , with (top and bottom rows, second to fifth columns). As in the previous case, the reconstruction error fails to detect the maximum of the likelihood, and is thus not very useful in the present context. On the contrary, a stopping point obtained from $\xi_d$ selects a near-optimal model. Notice that the criterion is robust along all distances explored, as desired. Similar results are found for the RAN12 and RAN16 cases. As can be inferred from these results, the optimal value of $d$ cannot be fixed beforehand and is problem-dependent.
The same plots for the BS and LSE problems are found in figures 3 and 4. Once again, the reconstruction error decreases monotonically and is therefore useless in the present context, while $\xi_d$ for $d$ larger than 1 successfully does the task for , while for the criterion does not work in the BS problem.
Table I: Number of states neighboring the data set at each Hamming distance.
Data Set                  Hamming Distance
                          1     2      3       4       5       6      7      8    9   10
Bars and Stripes          480   3216   11360   20744   19296   8688   1632   90
Labeled Shifter Ensemble  8434  41160  110326  165088  132976  54160  10368  966  40  2
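For the Bars and Stripes problem the neighbor counts can be checked by brute force, since the whole space has only $2^{16}$ states. The sketch below assigns every non-training state to its minimum Hamming distance from the training set and counts the states per distance. States are encoded as integers so that the distance is a bit count; the function names and this encoding are illustrative choices, and the same loop would be too slow in pure Python for the 19-bit LSE problem.

```python
import itertools
from collections import Counter


def bars_and_stripes_codes(n=4):
    """Bars and Stripes images encoded as integers (one bit per pixel)."""
    codes = set()
    for mask in itertools.product((0, 1), repeat=n):
        rows = [mask[r] for r in range(n) for _ in range(n)]
        cols = [mask[c] for _ in range(n) for c in range(n)]
        for bits in (rows, cols):
            codes.add(int("".join(map(str, bits)), 2))
    return sorted(codes)


def neighbor_counts(train_codes, n_bits):
    """Number of non-training states at each minimum Hamming distance."""
    counts = Counter()
    for state in range(2 ** n_bits):
        d = min(bin(state ^ t).count("1") for t in train_codes)
        if d > 0:  # distance 0 means the state is a training point
            counts[d] += 1
    return dict(sorted(counts.items()))


counts = neighbor_counts(bars_and_stripes_codes(), 16)
print(counts)  # counts[1] == 480 and counts[2] == 3216, as in the first BS entries
```

The counts add up to $2^{16} - 30$, that is, every state outside the training set is assigned to exactly one distance.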
IV-B Incomplete Neighborhoods
Despite the success of the criterion built as proposed, it is clear that for large spaces it can be impractical if the number of states in the neighborhood of the training set is very large. For that reason, we have tested the criterion on randomly selected subsets of the same size as the training set, which is always computationally tractable. In what follows, we refer to the evaluation of $\xi_d$ on such a random subset as the subset estimator, and to its evaluation on the complete neighborhood as the complete estimator. Figure 5 shows the subset estimator compared with the complete estimator from the previous figures for the BS (first row) and LSE (second row) problems. More precisely, the first column shows the log-likelihood of the data along the training process, while the rest of the columns plot both estimators for and . Notice that the absolute scales of the two estimators may vary, mainly due to the value of the sum of probabilities in the denominators. However, since the precise value of these quantities is irrelevant, we have decided to scale them properly for the sake of comparison. Although the subset estimator is built from a much smaller set than the complete one, it captures all its significant features and can therefore be used instead of it. In this sense, the subset estimator provides a good stopping criterion for CD, although it is not as robust as the complete estimator due to the strong reduction in the number of contributing states. This reduction is illustrated in table I, where we show the number of neighboring states to the data set at different distances for the BS and LSE problems. By increasing the number of states included in the subset, convergence to the complete estimator is expected at the expense of an increase in computational cost. However, the present results indicate that, at least for the problems at hand, a number of examples similar to that of the training set is enough to detect the maximum of the log-likelihood of the data.
All the results presented up to this point show the goodness of the proposed stopping criterion for learning in CD. However, the underlying idea can be applied to different learning algorithms that try to maximize the log-likelihood of the data. In this way we have repeated all the previous experiments for CD with very similar results to the ones above. As an example, figure 6 shows the log-likelihood, and with and CD for the LSE data set, which is the largest one analyzed in this work. As can be clearly seen, the quality of the results is very similar to the CD case, thus stressing the robustness of the criterion.
As a final remark, we note that for the BS problem the RBM trained and stopped using the proposed criterion is able to generate samples qualitatively similar to those in the training set. We show in figure 7 the complete training set (two upper rows) and the same number of generated samples (two lower rows) obtained from the RBM trained with CD and stopped after 5000 epochs, around the maximum shown by the proposed criterion, which approximately coincides with the optimal value of the log-likelihood. It is important to realize that, ultimately, the quality of the model is a direct measure of the quality of CD learning, and that the model used to generate the plots is the one with the largest value of the criterion, which is quite close to the one with the largest likelihood.
V Conclusions
In this work we have used the contribution of points neighboring the training set to build a stopping criterion for learning in CD. We have shown that not only the training set but also the neighboring states contain valuable information that can be used to follow the evolution of the network along training.
Since learning tries to increase the contribution of the relevant states while decreasing the contribution of the rest, continuity and smoothness of the energy function assign more probability to states close to the training data. This is the key idea behind the proposed stopping criterion. In fact, two different but related estimators (depending on the number of states used to compute them) have been proposed and tested experimentally. The first one includes all states close to the training set, while the second one takes only a fraction of these states as small as the size of the training set. The first estimator is robust but may require a prohibitively large number of states, while the second one is always tractable and captures most of the features of the first one, thus providing a suitable stopping criterion for learning. This second estimator could be used in problems with larger data sets, where an exact computation of the log-likelihood is not possible. Additionally, the main idea of proximity to the training set will be explored in other aspects related to learning in future work.
Acknowledgments
ER: This research is partially funded by Spanish research project TIN201231377.
FM: This work has been supported by grant No. FIS201456257C21P from DGI (Spain).
JD: This work was partially supported by SGR2014890 (MACDA) of the Generalitat de Catalunya, MICINN project BASMATI (TIN201127479C0403) and MINECO project APCOM (TIN201457226P)
References

 [1] H. Schulz, A. Müller, and S. Behnke, “Investigating Convergence of Restricted Boltzmann Machine Learning,” in NIPS 2010 Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
 [2] A. Fischer and C. Igel, “Empirical Analysis of the Divergence of Gibbs Sampling Based Learning Algorithms for Restricted Boltzmann Machines,” in International Conference on Artificial Neural Networks (ICANN), vol. 3, 2010, pp. 208–217.
 [3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Internal Representations by Error Propagation,” in Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP research group., Eds. MIT Press, 1986.
 [4] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
 [5] G. E. Hinton, S. Osindero, and Y. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
 [6] G. E. Hinton and R. R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [7] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring Strategies for Training Deep Neural Networks,” Journal of Machine Learning Research, vol. 10, pp. 1–40, 2009.
 [8] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations,” in International Conference on Machine Learning, 2009, pp. 609–616.
 [9] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, and A. Y. Ng, “Building Highlevel Features Using Large Scale Unsupervised Learning,” in 29th International Conference on Machine Learning, 2012.
 [10] P. Smolensky, “Information Processing in Dynamical Systems: Foundations of Harmony Theory,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition (vol. 1), D. E. Rumelhart and J. L. McClelland, Eds. MIT Press, 1986, pp. 194–281.
 [11] S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 6, pp. 721–741, 1984.
 [12] G. E. Hinton, “Training Products of Experts by Minimizing Contrastive Divergence,” Neural Computation, vol. 14, pp. 1771–1800, 2002.
 [13] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy Layerwise Training of Deep Networks,” in Advances in Neural Information Processing (NIPS’06), vol. 19. MIT Press, 2007, pp. 153–160.
 [14] Y. Bengio and O. Delalleau, “Justifying and Generalizing Contrastive Divergence,” Neural Computation, vol. 21, no. 6, pp. 1601–1621, 2009.

 [15] M. A. Carreira-Perpiñán and G. E. Hinton, “On Contrastive Divergence Learning,” in International Workshop on Artificial Intelligence and Statistics, 2005, pp. 33–40.
 [16] A. Yuille, “The Convergence of Contrastive Divergence,” in Advances in Neural Information Processing Systems (NIPS’04), vol. 17. MIT Press, 2005, pp. 1593–1600.
 [17] D. J. C. MacKay, “Failures of the onestep learning algorithm,” 2001, unpublished Technical Report.
 [18] T. Tieleman, “Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient,” in 25th International Conference on Machine Learning, 2008, pp. 1064–1071.
 [19] T. Tieleman and G. E. Hinton, “Using Fast Weights to Improve Persistent Contrastive Divergence,” in 26th International Conference on Machine Learning, 2009, pp. 1033–1040.
 [20] G. Desjardins, A. Courville, Y. Bengio, P. Vincent, and O. Delalleau, “Parallel Tempering for Training of Restricted Boltzmann Machines,” in 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 145–152.
 [21] G. E. Hinton, “A Practical Guide to Training Restricted Boltzmann Machines,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 599–619.
 [22] R. M. Neal, “Annealed Importance Sampling,” 1998, technical Report 9805, Dept. Statistics, University of Toronto.
 [23] R. Salakhutdinov and I. Murray, “On the Quantitative Analysis of Deep Belief Networks,” in International Conference on Machine Learning, 2008, pp. 872–879.

 [24] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines. A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley, 1990.
 [25] A. Fischer and C. Igel, “Training Restricted Boltzmann Machines: An Introduction,” Pattern Recognition, vol. 47, pp. 25–39, 2014.

 [26] P. Bühlmann and S. Van De Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.