Techniques for Learning Binary Stochastic Feedforward Neural Networks

06/11/2014 · by Tapani Raiko et al. · Aalto University, Université de Montréal

Stochastic binary hidden units in a multi-layer perceptron (MLP) network give at least three potential benefits when compared to deterministic MLP networks. (1) They allow the network to learn one-to-many mappings. (2) They can be used in structured prediction problems, where modeling the internal structure of the output is important. (3) Stochasticity has been shown to be an excellent regularizer, potentially improving generalization performance. However, training stochastic networks is considerably more difficult. We study training using M samples of hidden activations per input. We show that the case M=1 leads to a fundamentally different behavior where the network tries to avoid stochasticity. We propose two new estimators for the training gradient and propose benchmark tests for comparing training algorithms. Our experiments confirm that training stochastic networks is difficult and show that the proposed two estimators perform favorably among all the five known estimators.




1 Introduction

Feedforward neural networks, or multi-layer perceptron (MLP) networks, model mappings from inputs x to outputs y through hidden units h. Typically the network output defines a simple (unimodal) distribution P(y | x) such as an isotropic Gaussian or a fully factorial Bernoulli distribution. In case the hidden units are deterministic (using a function h = f(x) as opposed to a distribution P(h | x)), the conditionals P(y | x) belong to the same family of simple distributions.

Stochastic feedforward neural networks (SFNN) (Neal, 1990, 1992) have the advantage when the conditionals are more complicated. While each configuration of the hidden units h produces a simple output distribution, the mixture over them can approximate any distribution, including the multimodal distributions required for one-to-many mappings. In the extreme case of using empty vectors as the input x, they can be used for unsupervised learning of the outputs y.

Another potential advantage of stochastic networks is in generalization performance. Adding noise or stochasticity to the inputs of a deterministic neural network has been found useful as a regularization method (Sietsma & Dow, 1991). Introducing multiplicative binary noise to the hidden units (dropout, Hinton et al., 2012) regularizes even better.

Binary units have additional advantages in certain settings. For instance, conditional computation requires hard decisions (Bengio et al., 2013). In addition, some hardware solutions are restricted to binary outputs (e.g. the IBM SyNAPSE, Esser et al.).

The early work on SFNNs approached the inference of h using Gibbs sampling (Neal, 1990, 1992) or a mean-field approximation (Saul et al., 1996), both of which have their downsides. Gibbs sampling can mix poorly, and the mean-field approximation can be inefficient while optimizing a lower bound on the likelihood that may be too loose. More recent work proposes simply drawing samples from P(h | x) during the feedforward phase (Hinton et al., 2012; Bengio et al., 2013; Tang & Salakhutdinov, 2013). This guarantees independent samples and an unbiased estimate of P(y | x).

We can use standard back-propagation when using stochastic continuous-valued units (e.g. with additive noise or dropout), but back-propagation is no longer possible with discrete units. There are several ways of estimating the gradient in that case. Bengio et al. (2013) propose two such estimators: an unbiased estimator with large variance, and a biased version that approximates back-propagation.

Tang & Salakhutdinov (2013) propose an unbiased estimator of a lower bound that works reasonably well in a hybrid network containing both deterministic and stochastic units. Their approach relies on using more than one sample of h for each training example, and in this paper we provide theory to show that using more than one sample is an important requirement. They also demonstrate interesting applications such as mapping the face of a person into varying expressions, or mapping a silhouette of an object into a color image of the object.

Tang & Salakhutdinov (2013) argue for the choice of a hybrid network structure based on the finite (and thus limited) number of hidden configurations in a fully discrete h. However, we offer an alternative hypothesis: it is much easier to learn a deterministic network around a small number of stochastic units, so it might not even be important to train the stochastic units properly. In the extreme case, the stochastic units are not trained at all, and the deterministic units do all the work.

In this work, we take a step back and study more rigorously the training problem with fully stochastic networks. We compare different methods for estimating the gradient and propose two new estimators. One is an approximate back-propagation with less bias than the one by Bengio et al. (2013), and the other is a modification of the estimator by Tang & Salakhutdinov (2013) with less variance. We propose a benchmark test setting based on the well-known MNIST data and the Toronto Face Database.

2 Stochastic feedforward neural networks

We study a model that maps inputs x to outputs y through stochastic binary hidden units h. The equations are given for just one hidden layer, but the extension to multiple layers is straightforward.¹ The activation probability is computed just like the activation function in deterministic multi-layer perceptron (MLP) networks:

P(h_i = 1 | x) = φ(W_i x + b_i),    (1)

where W_i denotes the i-th row vector of the weight matrix W and φ(z) = 1 / (1 + exp(−z)) is the sigmoid function. For classification problems, we use a softmax for the output probability:

P(y = k | h) ∝ exp(V_k h + c_k).    (2)

For predicting binary vectors y, we use a product of Bernoulli distributions:

P(y | h) = Π_i φ(V_i h + c_i)^{y_i} (1 − φ(V_i h + c_i))^{1 − y_i}.    (3)

¹ Apply the equations separately for each layer such that h denotes the layer below instead of the original input x.
The probabilistic training criterion for deterministic MLP networks is C = log P(y | x). Its gradient with respect to the model parameters θ can be computed using the back-propagation algorithm, which is based on the chain rule of derivatives. Stochasticity brings difficulties both in estimating the training criterion and in estimating the gradient. The training criterion of the stochastic network,

C(x, y) = log P(y | x)    (4)
        = log Σ_h P(y | h) P(h | x),    (5)

requires summation over an exponential number of configurations of h. Also, derivatives with respect to discrete variables cannot be directly defined. We review and propose solutions to both problems below.

2.1 Proposed estimator of the training criterion

We propose to estimate the training criterion in Equation (4) by

Ĉ_M(x, y) = log (1/M) Σ_{m=1}^{M} P(y | h^{(m)}),    h^{(m)} ~ P(h | x).    (6)

This can be interpreted as the performance of a finite mixture model over the M samples h^{(m)} drawn from P(h | x).
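As a concrete sketch, the estimator can be computed by sampling M hidden configurations and averaging the output likelihoods in a numerically stable way. This is our own minimal NumPy illustration, not code from the paper; the parameter names W, b, V, c and the one-hidden-layer Bernoulli-output setup are assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def estimate_criterion(x, y, W, b, V, c, M, rng):
    """Estimate C^hat_M = log (1/M) sum_m P(y | h^(m)), h^(m) ~ P(h | x),
    for a one-hidden-layer network with Bernoulli outputs."""
    p = sigmoid(W @ x + b)                             # activation probabilities
    H = (rng.random((M, p.size)) < p).astype(float)    # M binary samples of h
    q = sigmoid(H @ V.T + c)                           # output probabilities per sample
    # log P(y | h^(m)) for a product of Bernoullis
    log_p = (y * np.log(q) + (1 - y) * np.log(1 - q)).sum(axis=1)
    # log-mean-exp for numerical stability
    mx = log_p.max()
    return mx + np.log(np.mean(np.exp(log_p - mx)))
```

The log-mean-exp trick avoids underflow when the individual likelihoods P(y | h^(m)) are tiny, which is the typical case for high-dimensional outputs.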

One could hope that using just M = 1 sample, just as in many other stochastic networks (e.g. Hinton et al., 2012), would work well enough. However, we show below that in that case the network always prefers to minimize the stochasticity, for instance by increasing the input weights of a stochastic sigmoid unit such that it behaves as a deterministic step-function nonlinearity.

Theorem 1.

When maximizing the expectation of Ĉ_1 in Equation (6) using M = 1, a hidden unit never prefers a stochastic output over a deterministic one. However, when maximizing the expectation of C in Equation (4), a hidden unit may prefer a stochastic output over any of the deterministic ones.


Proof. The expected Ĉ_1 over the data distribution can be upper-bounded as

E_{d(x,y)} [Ĉ_1] = E_{d(x)} Σ_h P(h | x) E_{d(y|x)} [log P(y | h)]
                 ≤ E_{d(x)} max_h E_{d(y|x)} [log P(y | h)],

where d denotes the data distribution. The value in the last inequality is achievable by selecting the distribution of h to be a Dirac delta around the value which maximizes the deterministic performance E_{d(y|x)} [log P(y | h)]. This can be done independently for every x under the expectation. This is analogous to an idea from game theory: since the performance achieved with P(h | x) is a linear combination of the performances of the deterministic choices h, no mixed strategy can be better than the best deterministic choice.

Let us now look at the situation for the expectation of C and see how it differs from the case of Ĉ_1 that we had with one particle. We can see that the original training criterion can be written in terms of the expectation of a negative KL-divergence:

E_{d(x,y)} [C] = E_{d(x)} [ −D_KL( d(y | x) ∥ P(y | x) ) − H( d(y | x) ) ].    (12)

The fact that this expression features a negative KL-divergence means that the maximum is achieved when the conditionals match exactly. That is, it is maximized when we have P(y | x) = d(y | x) for each value of x.

We give a simple example in which y and each h_i take values in {0, 1}. By defining suitable conditions on P(y | h) and the data distribution d(y | x), one can show that any deterministic P(h | x) does a bad job at maximizing (12). A deterministic P(h | x) is one in which the probabilities P(h_i = 1 | x) take values in {0, 1}. Criterion (12) is maximized by matching P(y | x) to the data distribution d(y | x), which in general requires a stochastic P(h | x): the expected C it attains is strictly higher than the value yielded by any of the deterministic solutions.
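A tiny numerical illustration of Theorem 1, with our own toy numbers rather than the example from the paper: one binary hidden unit with q = P(h = 1), an output model with P(y = 1 | h = 1) = 0.8 and P(y = 1 | h = 0) = 0.2, and data distribution d(y = 1) = 0.5.

```python
import math

def expected_C(q):
    """True criterion E_y[log P(y)], marginalizing h out of the model."""
    p1 = q * 0.8 + (1 - q) * 0.2                         # P(y = 1)
    return 0.5 * math.log(p1) + 0.5 * math.log(1 - p1)

def expected_C1(q):
    """M = 1 criterion E_h E_y[log P(y | h)]."""
    ll_h1 = 0.5 * math.log(0.8) + 0.5 * math.log(0.2)    # performance of h = 1
    ll_h0 = 0.5 * math.log(0.2) + 0.5 * math.log(0.8)    # performance of h = 0
    return q * ll_h1 + (1 - q) * ll_h0

# Under the true criterion the stochastic choice q = 0.5 strictly beats both
# deterministic choices, while under M = 1 stochasticity never helps: the
# M = 1 objective is a linear mixture of the two deterministic performances.
```

Here expected_C(0.5) ≈ log 0.5 exceeds expected_C(1.0), while expected_C1 is the same for every q, exactly as the theorem predicts.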

2.2 Gradient for training the output mapping

We will be exploring five different estimators for the gradient of the training criterion with respect to the parameters θ. However, all of them will share the following gradient for training the output mapping P(y | h).

Let o^{(m)} be the incoming signal to the activation function in the final output layer for the m-th sample h^{(m)}. For training the output mapping, we compute the gradient of the training criterion Ĉ_M in Equation (6) as

∂Ĉ_M / ∂o^{(m)} = ω_m · ∂ log P(y | h^{(m)}) / ∂o^{(m)},    with    ω_m = ω'_m / Σ_{m'} ω'_{m'},

where ω'_m = P(y | h^{(m)}) are unnormalized weights. In other words, we get the gradient in the mixture by computing the gradient of each individual contribution and multiplying it by the normalized weight ω_m. The normalized weights can be interpreted as responsibilities in a mixture model (see e.g. Bishop, 2006, Section 2.3.9).

The gradients with respect to the output-layer weights and biases are computed from ∂Ĉ_M / ∂o^{(m)} using the chain rule of derivatives, just like in standard back-propagation.
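The normalized weights can be computed stably from log-likelihoods; a minimal sketch (the function name is ours):

```python
import numpy as np

def responsibilities(log_p_y_given_h):
    """Normalized weights omega_m from log P(y | h^(m)), shape (M,).
    Equivalent to a softmax over the log-likelihoods of the M samples."""
    z = log_p_y_given_h - log_p_y_given_h.max()   # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

Multiplying each sample's back-propagated gradient by its ω_m then yields the mixture gradient described above.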

2.3 First estimators of the gradient for training P(h | x)

Bengio et al. (2013) proposed two estimators of the gradient for training P(h | x). The first one is unbiased but has high variance. It is defined as

∂C/∂a_i ≈ (h_i − φ(a_i)) (C − c̄_i),    with    c̄_i = E[ (h_i − φ(a_i))² C ] / E[ (h_i − φ(a_i))² ],

where we plug in Ĉ_M as the training criterion C. We estimate the numerator and the denominator of the baseline c̄_i with an exponential moving average.

The second estimator is biased but has lower variance. It is based on back-propagation, where we set ∂h_i/∂a_i := 1, resulting in

∂C/∂a_i := ∂C/∂h_i.
2.4 Proposed biased estimator of the gradient for training P(h | x)

We propose a new way of propagating the gradient of the training criterion through discrete hidden units h. Let us consider continuous random variables with additive noise,

h_i = φ(a_i) + ε_i,

where a_i = W_i x + b_i is the incoming signal and the noise ε_i takes the value 1 − φ(a_i) with probability φ(a_i) and the value −φ(a_i) otherwise. Note that h_i has the same distribution as in Equation (1), that is, it only takes values 0 and 1. With this formulation, we propose to back-propagate derivatives through h_i by treating the noise ε_i as constant:

∂h_i/∂a_i := φ'(a_i) = φ(a_i) (1 − φ(a_i)).

This gives us a biased estimate of the gradient, since we ignore the fact that the structure of the noise ε_i depends on the input signal a_i. One should note, however, that the noise is zero-mean for any input, which should help keep the bias relatively small.
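A sketch of this propagation rule for a single stochastic layer (our own minimal implementation; the two-function interface is an assumption for the example): the forward pass draws a binary sample, while the backward pass uses the derivative of the sigmoid, treating the noise as constant.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def stochastic_layer_forward(a, rng):
    """Forward: h = phi(a) + eps, which is a binary sample in {0, 1}."""
    p = sigmoid(a)
    h = (rng.random(p.shape) < p).astype(float)
    return h, p

def stochastic_layer_backward(grad_h, p):
    """Backward: dC/da := dC/dh * phi'(a), with phi'(a) = p (1 - p)."""
    return grad_h * p * (1.0 - p)
```

Note that the forward pass stays fully binary; only the backward pass behaves as if the unit were a deterministic sigmoid plus constant noise.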

2.5 Variational training

Tang & Salakhutdinov (2013) use a variational lower bound on the training criterion:

C = log P(y | x) ≥ Σ_h Q(h) log [ P(y | h) P(h | x) / Q(h) ].

The above inequality holds for any distribution Q(h), but we get more usefulness out of it by choosing Q(h) so that it serves as a good approximation of P(h | x, y).

We start by noting that we can use importance sampling to express P(y | x) in terms of a proposal distribution Q'(h) from which we can draw samples:

P(y | x) = Σ_h Q'(h) P(y | h) P(h | x) / Q'(h) ≈ (1/M) Σ_{m=1}^{M} P(y | h^{(m)}) P(h^{(m)} | x) / Q'(h^{(m)}),    h^{(m)} ~ Q'(h).

Let δ_{h^{(m)}} be the Dirac delta function centered at h^{(m)}. We construct Q(h) based on this expansion:

Q(h) = Σ_{m=1}^{M} ω_m δ_{h^{(m)}}(h),    ω_m = ω'_m / Σ_{m'} ω'_{m'},    ω'_m = P(y | h^{(m)}) P(h^{(m)} | x) / Q'(h^{(m)}),

where ω'_m and ω_m are called the unnormalized and normalized importance weights.

It would be an interesting line of research to train an auxiliary model for the proposal distribution Q'(h), following ideas from Kingma & Welling (2013), Rezende et al. (2014), and Mnih & Gregor (2014), who call the equivalent of Q'(h) the recognition model or the inference network. However, we do not pursue that line further in this paper and follow Tang & Salakhutdinov (2013), who chose Q'(h) = P(h | x), in which case the importance weights simplify to ω'_m = P(y | h^{(m)}).

Tang & Salakhutdinov (2013) use a generalized EM algorithm, where they compute the gradient of the lower bound given that Q is fixed:

∂/∂θ Σ_h Q(h) log [ P(y | h) P(h | x) / Q(h) ] = Σ_{m=1}^{M} ω_m ∂/∂θ log [ P(y | h^{(m)}) P(h^{(m)} | x) ].

Thus, we train P(h | x) using the samples h^{(m)}, weighted by ω_m, as target outputs.

It turns out that the resulting gradient for the output mapping is exactly the same as in Section 2.2, despite the rather different way of obtaining it. The importance weights have the same role as the responsibilities in the mixture model, so we can use the same notation ω_m for them.
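For sigmoid hidden units, ∂ log P(h | x)/∂a_i = h_i − p_i with p_i = P(h_i = 1 | x), so the weighted gradient with respect to the hidden pre-activations can be sketched as follows (a hedged illustration with our own function name, not the paper's code):

```python
import numpy as np

def grad_preactivations(H, p, w):
    """Gradient sum_m w_m (h^(m) - p) of sum_m w_m log P(h^(m) | x)
    with respect to the pre-activations a, for sigmoid hidden units.
    H: (M, n_hidden) binary samples, p: (n_hidden,) probabilities,
    w: (M,) normalized (or centered) weights."""
    return (w[:, None] * (H - p)).sum(axis=0)
```

The same routine accepts the centered weights of the variance-reduced estimator discussed below, since only the weight vector changes.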

Proposed Unbiased Estimator of the Gradient. We propose a new gradient estimator by applying a variance reduction technique (Weaver & Tao, 2001; Mnih & Gregor, 2014) to the estimator by Tang & Salakhutdinov (2013). First we note that

E_{h ~ P(h | x)} [ ∂/∂θ log P(h | x) ] = 0.

That is, when training P(h | x) with samples drawn from the model distribution itself, the gradient is on average zero. Therefore we can change the estimator by subtracting any constant c from the weights ω_m without introducing any bias. We choose c = 1/M, which is empirically shown to be sufficiently close to the optimum (see Figure 1, left). Finally, the proposed estimator becomes

Σ_{m=1}^{M} (ω_m − 1/M) ∂/∂θ log P(h^{(m)} | x).    (32)

Figure 1: Left: The norm of the gradient for the weights of the first hidden layer as a function of the subtracted constant c, where the proposed estimator in Equation (32) corresponds to c = 1/M. The norm is averaged over a mini-batch after {1, 7, 50} epochs of training (curves from top to bottom) in the MNIST classification experiment (see Appendix A). Varying c only changes the variance of the estimator, so the minimum norm corresponds to the minimum variance. Right: The estimated test objective as a function of the number of particles used during test time for the MNIST structured prediction task, for the two proposed models.
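The variance-reduced weighting of Equation (32) simply subtracts 1/M from the normalized weights; a minimal sketch (our own helper name):

```python
import numpy as np

def centered_weights(log_p):
    """Return omega_m - 1/M from log-likelihoods log P(y | h^(m))."""
    z = np.exp(log_p - log_p.max())
    w = z / z.sum()               # normalized weights, sum to 1
    return w - 1.0 / w.size      # centered weights, sum to 0
```

Because the centered weights sum to zero, samples that explain y better than average pull probability mass towards themselves, while the rest push mass away.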

3 Experiments

We propose two experiments as benchmarks for stochastic feedforward networks based on the MNIST handwritten digit dataset (LeCun et al., 1998) and the Toronto Face Database (Susskind et al., 2010). In both experiments, the output distribution is likely to be complex and multimodal.

In the first experiment, we predicted the lower half of the MNIST digits using the upper half as input. The MNIST dataset used in the experiments was binarized as a preprocessing step by sampling each pixel independently, using the grey-scale value as its expectation. In the second experiment, we followed Tang & Salakhutdinov (2013) and predicted different facial expressions in the Toronto Face Database (Susskind et al., 2010). As data, we used all individuals with at least 10 different facial expression pictures, which we did not binarize. We set the input to the mean of these images per subject, and as output predicted the distribution of the different expressions of the same subject.² We randomly chose 100 subjects for the training data (1372 images) and the remaining 31 subjects for the test data (427 images). As the data in the second problem is continuous, we assumed unit-variance Gaussian noise and thus trained the network using the sum-of-squares error. We used network structures of 392-200-200-392 and 2304-200-200-2304 in the first and second problem, respectively.

² We hence discarded the expression labels.
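The stochastic binarization of the MNIST inputs described above can be sketched as (our own helper, not the paper's code):

```python
import numpy as np

def binarize(images, rng):
    """Sample each pixel independently, using the grey-scale value
    in [0, 1] as its expectation."""
    return (rng.random(images.shape) < images).astype(np.uint8)
```

Sampling (rather than thresholding at 0.5) keeps the binary data consistent in expectation with the original grey-scale intensities and with the Bernoulli output model.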

Before running the experiments, we did a simple viability check of the gradient estimators by training a network to do MNIST classification. Based on the results, we kept the two proposed estimators and the estimator of Tang & Salakhutdinov (2013), which performed significantly better than the two estimators of Bengio et al. (2013). The results of the viability experiment can be found in Appendix A.

For comparison, we also trained four additional networks (labeled A-D) in addition to the stochastic feedforward networks. Network A is a deterministic network, corresponding to replacing the binary samples h_i by their expectations φ(a_i). In network B, we used the weights trained to produce deterministic values for the hidden units, but instead of using these deterministic values at test time we used their stochastic equivalents. We therefore trained the network in the same way as network A, but ran the tests as if the network were a stochastic network. Network C is a hybrid network inspired by Tang & Salakhutdinov (2013), where each hidden layer consists of 40 binary stochastic neurons and 160 deterministic neurons. The stochastic neurons have incoming connections from the deterministic neurons in the previous layer, and outgoing connections to the deterministic neurons in the same layer. As in the original paper, the network was trained using the variational gradient estimator of Tang & Salakhutdinov (2013). Network D is the same as the hybrid network C with one difference: the stochastic neurons have a constant activation probability of 0.5, and hence do not have any incoming weights or biases to learn.

In all of the experiments, we used stochastic gradient descent with a mini-batch size of 100 and momentum of 0.9. We used a learning-rate schedule where the learning rate increases linearly from zero to its maximum during the first five epochs and decreases back to zero during the remaining epochs. The maximum learning rate was chosen from a small grid, and the best test error for each method is reported.³

³ In the MNIST experiments we used a separate validation set to select the learning rate. However, as we chose just one hyperparameter with a fairly sparse grid, we only report the best test error in the TFD experiments without a separate validation set.
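The learning-rate schedule can be sketched as follows (a minimal version under the stated five-epoch warm-up; the function name is ours):

```python
def learning_rate(epoch, n_epochs, lr_max, warmup=5):
    """Linear ramp from 0 to lr_max over the first `warmup` epochs,
    then linearly back down to 0 over the remaining epochs."""
    if epoch <= warmup:
        return lr_max * epoch / warmup
    return lr_max * (n_epochs - epoch) / (n_epochs - warmup)
```

The ramp-up reduces the impact of the initially noisy responsibility estimates, while the long decay lets the parameters settle.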

The models were trained with various numbers of samples M, and during test time we always used the same, larger number of samples for all models.

As can be seen in Table 1, excluding the comparison methods, the proposed biased estimator performs the best in both tasks. It is notable that performance increased significantly when using more than one particle (M > 1), as could be predicted from Theorem 1. In Figure 1 (right) we plot the objective at test time as a function of the number of particles. In theory, a larger number of particles is always better (given infinite computational resources), but Figure 1 (right) shows that the objective is estimated very accurately with only a modest number of particles.

Of all the networks tested, the best-performing network in both tasks was, however, comparison network D, i.e. the deterministic network with added binary stochastic neurons that have a constant activation probability of 0.5. It is especially interesting to note that this network also outperformed the hybrid network C, where the output probabilities of the stochastic neurons are learned. Network D seems to gain from being able to model stochasticity without the need to propagate errors through the binary stochastic variables. The results give some support to the hypothesis that a hybrid network outperforms a stochastic network because it is easier to learn a deterministic network around a small number of stochastic units than to learn a fully stochastic network, even when the stochastic units are not trained at all.

The results could possibly be improved by making the networks larger and training longer, given enough computational capacity. This might be the case especially in the experiments with the Toronto Face Database, where the deterministic network A outperforms some of the stochastic networks. However, the critical difference between the stochastic networks and the deterministic network can be observed in Figure 2, where the stochastic networks are able to generate reconstructions that correspond to different digits for an ambiguous input. Clearly, the deterministic network cannot model such a distribution.

Figure 2: Samples drawn from the prediction of the lower half of MNIST test digits based on the upper half, with models trained using the two proposed estimators (left and middle) and the deterministic network (right). The leftmost column is the original MNIST digit, followed by the masked-out image and ten samples. The figures illustrate how the stochastic networks are able to model different digits in the case of ambiguous inputs.

Method | MNIST neg. test log-likelihood | TFD test sum of squared errors

(proposed biased)
(Tang et al., 2013) na na
(proposed unbiased) na na
deterministic (A) na na
deterministic as stochastic (B) na na
hybrid (C) na na
deterministic, binary noise (D)

Table 1: Results obtained on MNIST and TFD structured prediction using various numbers of samples M during training and various estimators of the gradient. Error margins are two standard deviations from 10 runs.

4 Discussion

In the proposed estimator of the gradient in Equation (32), there are both positive and negative weights ω_m − 1/M for the various particles h^{(m)}. Positive weights can be interpreted as pulling probability mass towards the particle, and negative weights as pushing probability mass away from it. Although we showed that the variance of the gradient estimate is smaller when using both positive and negative weights, the difference in the final performance of the two estimators was not substantial.

One challenge with structured outputs is to find samples h^{(m)} that give a reasonably large probability P(y | h^{(m)}) with a reasonably small sample size M. Training a separate model Q'(h) as a proposal distribution looks like a promising direction for addressing that issue. It might still be useful to use a mix of particles from Q'(h) and P(h | x), and subtract a constant from the weights of the latter ones. This approach would yield both particles that explain y well, and particles that have negative weights.

5 Conclusion

Using stochastic neurons in a feedforward network is more than just a computational trick to train deterministic models. The model itself can be defined in terms of stochastic particles in the hidden layers, and we have shown many valid alternatives to the usual gradient formulation.

These proposals for the gradient involve particles in the hidden layers with normalized weights that represent how well the particles explain the output targets. We showed both theoretically and experimentally how involving more than one particle significantly enhances the modeling capacity.

We demonstrated the validity of these techniques in three sets of experiments: we trained a classifier on MNIST that achieved a reasonable performance, a network that could fill in the missing information when we deleted the bottom part of the MNIST digits, and a network that could output individual expressions of face images based on the mean expression.

We hope that we have provided some insight into the properties of stochastic feedforward neural networks, and that the theory can be applied to other contexts such as the study of Dropout or other important techniques that give a stochastic flavor to deterministic models.


Acknowledgements

The authors would like to acknowledge NSERC, Nokia Labs, and the Academy of Finland as sources of funding, in addition to the developers of Theano (Bastien et al., 2012; Bergstra et al., 2010).

Appendix A Classification experiment

MNIST classification is a well studied problem where performances of a huge variety of approaches are known. Since the output is just a class label, the advantage of being able to model complex output distributions is not applicable. Still, the benchmark is useful for comparing training algorithms against each other, and was used in this paper to test the viability of the gradient estimators.

We used a network structure with dimensionalities 784-200-200-10. The input data was first scaled, and the mean of each pixel was then subtracted. As a regularization method, Gaussian noise with standard deviation 0.4 was added to each pixel separately in each epoch (Raiko et al., 2012). The models were trained for 50 epochs.

Table 2 gives the test set error rate for each method. As can be seen from the table, the deterministic networks give the best results. Excluding the comparison networks, the best result is obtained with the proposed biased gradient, followed by the proposed unbiased gradient. Based on the results, the two gradient estimators of Bengio et al. (2013) were left out from the structured prediction experiments.

Method | Test error (%)
(Bengio et al., 2013, unbiased) | 7.85 | 11.30
(Bengio et al., 2013, biased) | 7.97 | 7.86
(proposed biased) | 1.82 | 1.63
(Tang et al., 2013) | na | 3.99
(proposed unbiased) | na | 2.72
deterministic (A) | 1.51 | na
deterministic as stochastic (B) | 1.80 | na
hybrid (C) | na | 2.19
deterministic, binary noise (D) | 1.80 | 1.92

Table 2: Results obtained on MNIST classification using various numbers of samples M during training and various estimators of the gradient.