Unsupervised Pretraining Encourages Moderate-Sparseness

12/20/2013 ∙ by Jun Li, et al.

It is well known that training deep neural networks directly generally leads to poor results. A major advance in recent years is the invention of various pretraining methods for initializing network parameters, and such methods have been shown to lead to good prediction performance. However, the reason for the success of pretraining is not fully understood, although it has been argued that regularization and better optimization play certain roles. This paper provides another explanation for the effectiveness of pretraining: we show that pretraining leads to sparseness of the hidden unit activations in the resulting neural networks. The main reason is that pretraining models can be interpreted as adaptive sparse coding. Our experimental results on MNIST and Birdsong, which compare pretrained networks against deep neural networks with sigmoid activations, further support this sparseness observation.




1 Introduction

Deep neural networks (DNNs) have found many successful applications in recent years. However, it is well known that if one trains such networks with the standard back-propagation algorithm from randomly initialized parameters, one typically ends up with models that have poor prediction performance. A major advance in DNN research is the invention of pretraining techniques for deep learning (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Bengio et al., 2006; Bengio, 2009; Bengio et al., 2012). The main strategy is to employ layer-wise unsupervised learning procedures to initialize the DNN model parameters. A number of such unsupervised training techniques have been proposed, such as restricted Boltzmann machines (RBMs) (Hinton et al., 2006) and denoising autoencoders (DAEs) (Vincent et al., 2008). Although these methods show strong empirical performance, the reason for their success is not fully understood.

Two explanations have been offered in the literature for the advantages of the unsupervised learning procedure (Erhan et al., 2010; Larochelle et al., 2009): the regularization effect and the optimization effect. The regularization effect says that pretraining acts as a regularizer that initializes the parameters in the basin of attraction of a “good” local minimum. The optimization effect says that pretraining leads to better optimization, so that the initial value is close to a local minimum with a lower objective value than can be achieved with random initialization. Based on experimental evidence, Goodfellow et al. (2009) also find that pretraining can learn invariant representations and selective units.

Our Contributions: We study why pretraining encourages moderate-sparseness. The main reason is that pretraining models can be interpreted as adaptive sparse coding. This coding is approximated by a sparse encoder, which adaptively filters out many features that are not present in the input and suppresses the responses of features that are not significant in the input. We further conduct experiments to demonstrate that pretraining acts as a sparse regularization (the hidden units become more sparsely activated).

2 Previous Works

In this part we review some advantages of the pretraining methods.

Distributed representations and deep architectures play an important role in deep learning methods. A distributed representation (an old idea) can capture a very large number of possible input configurations (Bengio, 2009). Deep architectures promote the re-use of features and lead to more abstract features that are invariant to most local changes of the input (Bengio et al., 2012). However, it is hard to use back-propagation to train DNNs with the two traditional activation functions (the sigmoid function $1/(1+e^{-x})$ and the hyperbolic tangent $\tanh(x)$). Fortunately, Hinton et al. (2006) propose an unsupervised pretraining method to initialize the DNN model parameters and learn good representations. The regularization effect and the optimization effect are used to explain the main advantages of the pretraining method (Erhan et al., 2010; Larochelle et al., 2009).

To better understand what pretraining models learn in deep architectures, Goodfellow et al. (2009) find that pretraining methods can learn invariant representations and selective units. Some researchers use linear combinations of previous units (Lee et al., 2009) and activation maximization (Erhan et al., 2010) to visualize the feature detectors (or invariance manifolds or filters) in an arbitrary layer. Fig. 1 of Erhan et al. (2010) and Fig. 3 of Lee et al. (2009) show that the first, second and third layers can learn edge detectors, object parts, and objects, respectively. Based on distributed and invariant representations, Larochelle et al. (2009) and Bengio et al. (2012, 2013) further confirm that pretraining methods tend to do a better job at disentangling the underlying factors of variation, such as objects or object parts.

Compared to DNNs with the sigmoid function, we confirm that pretraining methods encourage moderate-sparseness, as the detectors filter out many features that are not present in the input. There is a common illusion that unsupervised pretraining methods tend to learn non-sparse representations, because they do not match conventional sparse methods (Zhang et al., 2011; Yang et al., 2012a, 2013). The conventional methods introduce a form of sparsity regularization, and most of them directly penalize the outputs of the hidden units, for example with norm penalties or the Student-t penalty. Pretraining methods, by contrast, implement sparseness by filtering out many irrelevant features.

3 Pretraining Model

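We consider a classic pretraining model, the RBM. An RBM is an energy-based generative model defined over a visible layer and a hidden layer. The visible layer is fully connected to the hidden layer via symmetric weights $W$, while there are no connections between units of the same layer. The numbers of visible units $x$ and hidden units $h$ are denoted by $d_x$ and $d_h$, respectively. Additionally, the visible units and hidden units receive input from the biases $c$ and $b$, respectively. The energy function is denoted by $E(x, h)$:

$$E(x, h) = -h^\top W x - c^\top x - b^\top h \qquad (1)$$

The probability that the network assigns to the visible units $x$ is

$$p(x) = \frac{1}{Z} \sum_{h} e^{-E(x, h)}, \qquad Z = \sum_{x, h} e^{-E(x, h)} \qquad (2)$$

where $Z$ is the partition function or normalizing constant. Because there are no direct connections between hidden (or visible) units, it is very easy to sample from the conditional distributions, which take the form:

$$p(x_i = 1 \mid h) = \sigma\Big(c_i + \sum_{j} W_{ji} h_j\Big), \qquad p(h_j = 1 \mid x) = \sigma\Big(b_j + \sum_{i} W_{ji} x_i\Big) \qquad (3)$$

where $\sigma$ is the logistic sigmoid function: $\sigma(z) = 1/(1+e^{-z})$.

Training uses the contrastive divergence algorithm (Hinton, 2002) to minimize the negative log-likelihood of the data, $-\log p(x)$.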
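The conditionals and the contrastive divergence update described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `W`, `b`, `c`, the shape convention (`W` has one row per hidden unit), the learning rate, and the use of a single Gibbs step (CD-1) with mean-field values in the negative phase are all choices made here for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(x, W, b, c, lr, rng):
    """One CD-1 update on a mini-batch x of shape (n, d_x).

    Assumed shapes: W is (d_h, d_x), b is the hidden bias (d_h,),
    c is the visible bias (d_x,).
    """
    # Positive phase: p(h = 1 | x) and a binary sample drawn from it.
    ph = sigmoid(x @ W.T + b)
    h = (rng.random(ph.shape) < ph).astype(x.dtype)
    # Negative phase: mean-field reconstruction of the visible units,
    # then the hidden probabilities given that reconstruction.
    px = sigmoid(h @ W + c)
    ph_neg = sigmoid(px @ W.T + b)
    n = x.shape[0]
    # Approximate gradient: data statistics minus reconstruction statistics.
    W_new = W + lr * (ph.T @ x - ph_neg.T @ px) / n
    b_new = b + lr * (ph - ph_neg).mean(axis=0)
    c_new = c + lr * (x - px).mean(axis=0)
    return W_new, b_new, c_new

# Toy usage on random binary data (hypothetical sizes).
rng = np.random.default_rng(0)
x = (rng.random((100, 20)) < 0.3).astype(float)
W = 0.01 * rng.standard_normal((8, 20))
b = np.zeros(8)
c = np.zeros(20)
W, b, c = cd1_step(x, W, b, c, lr=0.1, rng=rng)
```

Stacking such layers, each trained on the hidden probabilities of the layer below, gives the layer-wise pretraining discussed in the introduction.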

4 Unsupervised Pretraining Encourages Moderate-Sparseness

In this section, we call sparse regularization with more overlapping groups in the low layer and fewer in the high layer moderate-sparseness. We mainly consider the multi-class problem to explain why unsupervised pretraining encourages moderate-sparseness, since DNNs with pretraining have been used to achieve state-of-the-art results on classification tasks. There are two reasons. First, we present a new view in which the pretraining model is an adaptive sparse coding. Second, because pretraining can train “good” feature detectors, we discuss how the feature detectors lead to moderate-sparseness. Finally, we measure the moderate-sparseness.

To start off the discussion, we make two natural assumptions about a $K$-class training set. Assumption 1: Every class has a balanced number of samples, and samples of the same class share many common raw features (pixels or MFCC features) (Zhang et al., 2012). Assumption 2: There are some similar raw features among samples of different classes, since they share some common ones (Amit et al., 2007).

4.1 A New View of the Pretraining Model

A pretraining model (such as an RBM) is an adaptive sparse coding. The explanation is as follows. By the results of Bengio and Delalleau (2009), training the pretraining model (RBM training minimizes the negative log-likelihood of the data) is also approximated by minimizing a reconstruction error criterion. In this criterion, the decoder is the mean-field output of the visible units given a representation sampled from the hidden units' conditional distribution, and the encoder is the mean-field output of the hidden units given the observed input.
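The encoder/decoder reading of the reconstruction criterion can be sketched as follows. This is only an illustration under assumptions: the exact form of the criterion is not recoverable from the text here, so a cross-entropy reconstruction error is assumed, and the names `encode`, `decode`, `W`, `b`, `c` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    # Encoder: mean-field output of the hidden units given the input.
    return sigmoid(x @ W.T + b)

def decode(h, W, c):
    # Decoder: mean-field output of the visible units given a representation.
    return sigmoid(h @ W + c)

def reconstruction_error(x, W, b, c, rng):
    h_mean = encode(x, W, b)
    # Sample a binary representation from the hidden conditional.
    h_sample = (rng.random(h_mean.shape) < h_mean).astype(x.dtype)
    x_hat = decode(h_sample, W, c)
    eps = 1e-12  # avoids log(0)
    # Assumed error: cross-entropy between input and reconstruction.
    ce = -(x * np.log(x_hat + eps) + (1.0 - x) * np.log(1.0 - x_hat + eps))
    return ce.sum(axis=1).mean()

rng = np.random.default_rng(0)
x = (rng.random((50, 20)) < 0.3).astype(float)
W = 0.01 * rng.standard_normal((8, 20))
err = reconstruction_error(x, W, np.zeros(8), np.zeros(20), rng)
```

The point of the sketch is structural: the hidden representation is produced by a parametric encoder rather than by solving an optimization problem per input.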

From the second parts of (3) and (4), every hidden unit can be further interpreted as a feature detector (or invariance manifold or filter): a hidden unit being active (or inactive) means that the detector responds strongly (or weakly) when the corresponding feature is present (or absent) in the input (Goodfellow et al., 2009).

Remarkably, pretraining can train edge feature detectors in the low layer and object (or object-part) detectors in the high layer (Lee et al., 2009; Larochelle et al., 2009; Bengio et al., 2012). Given an input, the feature detectors naturally filter out many features that are not present in the input and suppress the responses of features that are not significant in the input. These detectors therefore result in sparseness.

Relationship with sparse coding: Sparse coding finds the dictionary $D$ and the sparse representation $h$ that minimize an objective whose most popular form is:

$$\min_{D, h} \; \|x - D h\|_2^2 + \lambda \|h\|_1 \qquad (5)$$

where $\lambda$ is a hyper-parameter. The first part of the RBM reconstruction criterion is similar to the first part of sparse coding (5), as both are decoders. Sparse coding directly penalizes the $\ell_1$ norm of the hidden representation $h$. In the RBM reconstruction criterion, by contrast, $h$ is approximated by sparse encoders (feature detectors), which filter out many irrelevant features. In the next subsection we discuss how the feature detectors lead to moderate-sparseness.
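For contrast with the encoder-based view, conventional sparse coding with a fixed dictionary can be sketched as below. This assumes the common squared-error plus $\ell_1$ objective and uses ISTA (iterative soft-thresholding), a standard solver chosen here for illustration; the paper does not say which solver, if any, it has in mind, and the names `D`, `lam`, and `ista` are hypothetical.

```python
import numpy as np

def sparse_coding_objective(x, D, h, lam):
    # Squared reconstruction error (decoder term) plus l1 penalty on the code.
    return np.sum((x - D @ h) ** 2) + lam * np.sum(np.abs(h))

def ista(x, D, lam, n_iter=200):
    """Minimize ||x - D h||^2 + lam * ||h||_1 over h for a fixed dictionary D."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    h = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ h - x)
        z = h - grad / L
        # Soft-thresholding: the proximal operator of the l1 penalty.
        h = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return h

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
x = rng.standard_normal(20)
h = ista(x, D, lam=0.5)
```

Here sparsity comes from an explicit penalty and an iterative optimization for every input, whereas the pretrained encoder produces its representation in a single forward pass.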

4.2 Lead To Moderate-Sparseness

Low-layer: Under the assumptions, the pretraining model distributes the edge feature detectors evenly over the hidden units in the low layer, as every class has the same number of samples. Assumption 1 implies that every class has the same number of edge feature detectors and that samples of the same class share many common ones. The edge feature detectors thus pick out the edge features belonging to their own class, suppress the responses of nonsignificant edge features, and filter out many edge features related to the other classes. Suppose that there are $d_h$ hidden units (edge feature detectors) and a $K$-class dataset; every class then ideally has $d_h / K$ of them. Given input samples of a class, the hidden units belonging to that class are therefore activated or weakly responding, and the remaining units are not activated (this corresponds to the sparseness measured by (6)). Moreover, there are common activation units (these correspond to a group).

Simultaneously, Assumption 2 implies that there are some similar edge feature detectors among different classes. Different classes share some edge feature detectors, and the corresponding hidden units are also activated. These shared units result in more overlapping activation units in the low layer. The activation overlapping degree is measured by (10). Combined with the regularization effect (Erhan et al., 2010), we therefore obtain the first result (A1): unsupervised pretraining is a sparse regularization with more-overlapping groups in the low layer.

High-layer: In the high layer, pretraining goes on to train object or object-part feature detectors from the edge features. Similarly to the analysis for the low layer, the hidden units are more sparsely activated or weakly responding in the high layer. Moreover, the activation overlapping degree is lower than that in the low layer, because pretraining can potentially do a better job at disentangling the objects or object parts (Larochelle et al., 2007; Bengio et al., 2013). Thus, we obtain the second result (A2): unsupervised pretraining is a sparse regularization with less (or no)-overlapping groups in the high layer.

In DNNs without pretraining, most hidden units of every layer are always activated and correspond to poor feature detectors, which is an important cause of difficulty in classification. For classification tasks, it is always desirable to extract features that are most effective for preserving class separability (Wong and Sun, 2011) and for collaboratively representing objects (Zhang et al., 2011; Yang et al., 2012b). Pretraining captures both benefits: the larger number of overlapping activation units captures collaborative features in the low layer, and the few or no overlapping activation units capture separability in the high layer.

4.3 Sparseness Measure

To better understand pretraining, we look for the sparseness, more-overlapping and no-overlapping characteristics of DNNs with and without pretraining. To this end, Hoyer’s sparseness measure and the activation overlapping degree are defined as follows.

Hoyer’s sparseness measure (HSPM) (Hoyer, 2004) is based on the relationship between the $\ell_1$ norm and the $\ell_2$ norm. The HSPM of a $D$-dimensional vector $y$ is defined as follows:

$$\mathrm{HSPM}(y) = \frac{\sqrt{D} - \|y\|_1 / \|y\|_2}{\sqrt{D} - 1} \qquad (6)$$

This measure has good properties: it lies in the interval $[0, 1]$ and is on a normalized scale. A value closer to $1$ means that there are more zero components in the vector $y$. We denote by $|\cdot|$ the absolute value of a real number and give the following definitions for the activation overlapping degree (AOD).
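Hoyer's measure is short enough to state directly in code. The sketch below follows the definition in Hoyer (2004); the function name and the choice to raise an error on an all-zero vector (where the ratio of norms is undefined) are assumptions made for this example.

```python
import numpy as np

def hoyer_sparseness(y):
    """Hoyer's sparseness of a vector with at least two components."""
    y = np.asarray(y, dtype=float)
    d = y.size
    l1 = np.abs(y).sum()
    l2 = np.sqrt((y ** 2).sum())
    if l2 == 0.0:
        raise ValueError("sparseness is undefined for the zero vector")
    return (np.sqrt(d) - l1 / l2) / (np.sqrt(d) - 1.0)

one_hot = hoyer_sparseness([1.0, 0.0, 0.0, 0.0])   # a single non-zero component
uniform = hoyer_sparseness([1.0, 1.0, 1.0, 1.0])   # all components equal
```

A vector with a single non-zero component attains the maximum value of the measure, and a vector whose components all have the same magnitude attains the minimum.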

Definition 1: A hidden unit $h_j$ is said to be active if the absolute value of its activation is above a threshold $\tau$, that is, $|h_j| > \tau$. A hidden unit is called inactive if $|h_j| \le \tau$.

Definition 2: A vector $a(x)$ is said to be an activation binary-vector of a $d_h$-dimensional representation $h(x)$ if its representation units are active when the corresponding features are present in $x$, and inactive when they are absent. Formally, the activation binary-vector is defined as:

$$a_j(x) = \begin{cases} 1, & |h_j(x)| > \tau \\ 0, & \text{otherwise} \end{cases} \qquad j = 1, \dots, d_h \qquad (7)$$

where $\tau$ is a threshold.
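Definitions 1 and 2 amount to an element-wise threshold test. A minimal sketch, in which the function name and the threshold value used in the example are hypothetical:

```python
import numpy as np

def activation_binary_vector(h, tau):
    """Return 1 for units whose absolute activation exceeds tau, else 0."""
    h = np.asarray(h, dtype=float)
    return (np.abs(h) > tau).astype(int)

# Hypothetical activations and threshold, for illustration only.
a = activation_binary_vector([0.9, 0.001, -0.5, 0.0], tau=0.01)
```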

To indicate the features present in the input, we select a threshold that does not change the reconstructed data; the selected value is small enough for this purpose. The vector of a sample is defined as:


Definition 3: The activation binary-vector of a sample set is the activation binary-vector, as defined in (7), of the mean value over all samples in the set.

The activation overlapping degree (AOD) calculates the percentage of activation units that are simultaneously selected by different classes. It is computed from a binary vector that is the logical conjunction of the activation binary-vectors of the classes involved.

AOD, which lies in the interval $[0, 1]$, measures the percentage of activation overlapping units in different classes. A value closer to $0$ means that there are few activation overlapping units and that it is easier to separate the different classes.
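The AOD computation can be sketched as follows. Two points are assumptions made here because the defining equations are not recoverable from the text: the set-level vector is taken from the mean of the raw hidden representations of a class, and the overlap count is normalized by the total number of hidden units. The function names and the toy values are hypothetical.

```python
import numpy as np

def set_activation_vector(H, tau):
    """Activation binary-vector of a sample set.

    H has shape (n_samples, d_h); the vector is computed from the mean
    representation over the set (Definition 3).
    """
    return np.abs(np.asarray(H, dtype=float).mean(axis=0)) > tau

def aod(class_reps, tau):
    """Fraction of units that are active for every class in class_reps."""
    vectors = [set_activation_vector(H, tau) for H in class_reps]
    # Logical conjunction over the classes, unit by unit.
    overlap = np.logical_and.reduce(vectors)
    # Assumed normalization: divide by the total number of units.
    return float(overlap.mean())

# Two toy classes with four hidden units each.
class_a = np.array([[1.0, 1.0, 0.0, 0.0], [0.8, 0.6, 0.0, 0.0]])
class_b = np.array([[1.0, 0.0, 1.0, 0.0]])
value = aod([class_a, class_b], tau=0.1)
```

In this toy case only the first unit is active for both classes, so one of four units overlaps.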

5 Experiments

In this section, we use deep neural networks in our experiments. A standard architecture for DNNs consists of multiple layers of units in a directed graph, with each layer fully connected to the next one. The nodes of the intermediate layers are called hidden units. The input of each hidden unit is passed through a standard sigmoid function. The objective of learning is to find the optimal network parameters so that the network output matches the target closely. The output can be compared to a target vector through a squared loss function or a negative log-likelihood loss function. We employ the standard back-propagation algorithm to train the model parameters (the connection weights) (Bishop, 2006).

We use the following notation. Dsigm: DNNs with standard sigmoid functions; DpRBMs: DNNs only pretrained with RBMs; DBNs: deep belief networks pretrained with RBMs and fine-tuned.

Datasets: We present experimental results on two standard benchmark datasets: MNIST (http://yann.lecun.com/exdb/mnist/) and Birdsong (http://sabiod.univ-tln.fr/icml2013/BIRD-SAMPLES/). The pixel intensities of all datasets are normalized. The MNIST dataset has 60,000 training samples and 10,000 test samples of 28×28 pixel greyscale images of handwritten digits 0-9. The Birdsong dataset has 70,000 training samples and 200,690 test samples with 16 MFCC features. (In Birdsong, the TRAIN SET has 30 sec × 35 bird recordings and the TEST SET has 150 sec × 3 mics × 90 recordings. There are no labels in the TEST SET, so we divide the TRAIN SET into a new train set and a new test set: in every recording we randomly select 3,000 train samples with 16 MFCC features, and the rest are test samples.)

To speed up training, we subdivide the training sets into mini-batches, each containing 100 cases, and the model parameters are updated after each mini-batch using the mini-batch averages. Weights are initialized with small random values sampled from a zero-mean normal distribution. Biases are initialized to zero. For simplicity, we use a constant learning rate chosen from a fixed set of candidate values. Momentum is also used to speed up learning: it starts at a low value, increases linearly over half of the epochs, and stays at its final value thereafter. The regularization parameter for the weights is held fixed.
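The mini-batch loop and the momentum schedule can be sketched as below. The start and end momentum values, the reading of "over half epochs" as the first half of training, and the function names are assumptions for illustration; the paper's actual hyper-parameter values are not reproduced here.

```python
import numpy as np

def momentum_at(epoch, n_epochs, start, end):
    """Linear ramp from start to end over the first half of training."""
    ramp = n_epochs / 2.0
    if epoch >= ramp:
        return end
    return start + (end - start) * epoch / ramp

def minibatches(n_samples, batch_size, rng):
    """Yield index arrays covering a random permutation of the data."""
    perm = rng.permutation(n_samples)
    for i in range(0, n_samples, batch_size):
        yield perm[i:i + batch_size]

# Hypothetical settings: 200 epochs, momentum ramping from 0.5 to 0.9.
rng = np.random.default_rng(0)
schedule = [momentum_at(e, 200, 0.5, 0.9) for e in (0, 50, 100, 150)]
n_batches = sum(1 for _ in minibatches(1000, 100, rng))
```

Inside the training loop, the parameter update after each mini-batch would combine the averaged gradient with the previous update scaled by the current momentum.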

Model    dataset  1st   2nd   3rd   error
DBNs     0.63     0.39  0.53  0.63  1.17
DpRBMs   0.63     0.39  0.58  0.67  -
Dsigm    0.63     0.17  0.18  0.06  2.01
Table 1: Hoyer’s sparseness measures (HSPM) of DNNs (500-500-2000) on MNIST.

5.1 Sparseness Comparison

Figure 1: (a-c) Hoyer’s sparseness measures (HSPM) of RBMs only pretrained on MNIST. (a) HSPM of three-layer RBMs as the number of pretraining epochs increases (the momentum is 0.5 in the first 25 epochs and 0.9 in the remaining 25 epochs). From bottom to top: RBMs from the 1st, 2nd and 3rd layers, respectively. (b) HSPM of RBMs with 500-5000 hidden units after 1000 training epochs. (c) HSPM of five-layer RBMs with 500 and 1000 hidden units after 1000 training epochs.

Figure 2: Activation overlapping degree (AOD) on MNIST. The left, middle and right panels plot the average AOD among classes, for a varying number of classes, in the first, second and third layers, respectively.

Before presenting the comparison of activation overlapping units, we first show the sparseness obtained with pretraining compared to the more traditional sigmoid activation function. The reported HSPM is the average of the values given by (6).
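A layer-level HSPM of this kind can be sketched as the mean of per-sample Hoyer values. Averaging over samples (rather than, say, over units) is an assumption made here, as are the function name and the requirement that no row is all zeros.

```python
import numpy as np

def mean_hspm(H):
    """Average Hoyer sparseness over the rows of H.

    H has shape (n_samples, d) with d >= 2; each row is the representation
    of one sample and is assumed to be non-zero.
    """
    H = np.abs(np.asarray(H, dtype=float))
    d = H.shape[1]
    l1 = H.sum(axis=1)
    l2 = np.sqrt((H ** 2).sum(axis=1))
    per_sample = (np.sqrt(d) - l1 / l2) / (np.sqrt(d) - 1.0)
    return float(per_sample.mean())

# One maximally sparse row and one maximally dense row.
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 1.0]])
layer_hspm = mean_hspm(H)
```

Applying this to the hidden activations of each layer on a dataset would give one number per layer, which is the kind of quantity a per-layer sparseness table reports.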

We perform comparisons on MNIST; results after fine-tuning for 200 epochs are reported in Table 1. The results show that, compared to Dsigm, pretraining leads to models with higher sparseness and smaller test errors. Table 1 compares the network HSPM of DBNs and DpRBMs to that of Dsigm. From Table 1, we observe that the average sparseness of the three-layer DpRBMs is about 0.55 (the mean of 0.39, 0.58 and 0.67); the resulting DBNs have similar sparseness. Fig. 1(a) also shows that the features of every RBM layer become more sparse as the number of training epochs increases. In contrast, the HSPM of Dsigm is below 0.2 in every layer.

When pretraining is run long enough and the number of hidden units increases, the pretrained models become more sparse, but their HSPM has an upper bound. Fig. 1(b) shows that when the number of hidden units changes from 500 to 5000, the upper bound of the HSPM of the RBMs is 0.68 after 1000 training epochs.

As the number of layers increases, the HSPM of the pretraining models also has an upper bound, as Fig. 1(c) shows for five-layer RBMs with 500 and 1000 hidden units. We observe that the HSPM of the third pretrained layer is lower than that of the second layer. Empirically, we obtain a high HSPM by increasing the number of hidden units of the high layer, for example in DBNs (500-500-2000). This observation may explain why the top layer should be large.

Model    dataset  1st   2nd   3rd   error
DBNs     0.29     0.12  0.12  0.35  9.6
DpRBMs   0.29     0.62  0.68  0.59  -
Dsigm    0.29     0.08  0.11  0.43  13.7
Table 2: Hoyer’s sparseness measures (HSPM) of DNNs (50-100-100) on Birdsong.

We perform comparisons on Birdsong; results after fine-tuning for 100 epochs are reported in Table 2. The results also show that pretraining leads to models with higher sparseness and smaller test errors. From Table 2, we observe that the sparseness of the three-layer DpRBMs is higher than that of Dsigm and of the raw data. Although the sparseness after fine-tuning is close to that of Dsigm, pretraining learns “good” initial values for the DNN model parameters. This illustrates that pretraining also has an optimization effect.

The HSPM of the 3rd layer is lower than that of the 2nd layer. When two-layer networks are trained, the resulting models have similar test errors. This suggests that the HSPM can be used to guide the choice of the number of layers and the number of hidden units.

5.2 Comparison of Selective-Overlapping Units

We perform comparisons on the test set of MNIST. For convenience, the test set is denoted by $X = \{X_0, \dots, X_9\}$, where $X_k$ represents the set of all digits $k$. The $m$-combinations of the set $X$ are denoted by a set $\{S_1, \dots, S_C\}$, where $C$ is the number of $m$-combinations and each $S_i$ is a subset of $m$ distinct elements of $X$. The average AOD among $m$ classes is the average of the AOD over all subsets $S_i$.
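Averaging AOD over all subsets of a given number of classes can be sketched with `itertools.combinations`. As before, normalizing the overlap by the total number of units is an assumption, and the function name, the dictionary layout, and the toy vectors are hypothetical.

```python
from itertools import combinations

import numpy as np

def average_aod(class_vectors, m):
    """Average AOD over all m-combinations of classes.

    class_vectors maps a class label to its boolean activation
    binary-vector (one entry per hidden unit).
    """
    values = []
    for subset in combinations(sorted(class_vectors), m):
        # Units that are active for every class in this subset.
        overlap = np.logical_and.reduce([class_vectors[k] for k in subset])
        values.append(overlap.mean())
    return float(np.mean(values))

# Three toy classes with four hidden units each.
vectors = {
    0: np.array([True, True, False, False]),
    1: np.array([True, False, True, False]),
    2: np.array([True, True, True, False]),
}
pairwise = average_aod(vectors, 2)
all_three = average_aod(vectors, 3)
```

For the ten MNIST digit classes, the same function would iterate over every pair, triple, and so on, which is how a curve of average AOD against the number of classes can be produced.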

We compare the average AOD of DpRBMs to that of Dsigm. We find that pretraining captures the characteristics (A1) that there are many overlapping units in the low layer and (A2) that there are few (or no) overlapping units in the high layer. Fig. 2 shows that the average AOD among classes is high in the low layer and low (or zero) in the high layer. In DpRBMs, the average AOD gets closer to $0$ as the layer index increases. In particular, in the third layer the average AOD is closer to $0$ than that of the data itself, which indicates that the classes are easier to separate. In every layer of Dsigm, by contrast, it is very close to $1$.

6 Conclusion

Since pretraining is known to perform well on MNIST, this paper mainly discusses why unsupervised pretraining encourages moderate-sparseness. Our observations lead us to suspect that sparseness and the activation overlapping degree play important roles in deep neural networks. Table 1, Table 2 and Fig. 2 show that pretraining yields sparse hidden units, more activation overlapping units in the low layer, and fewer (or no) activation overlapping units in the high layers.


References

  • Y. Amit, M. Fink, N. Srebro, and S. Ullman (2007) Uncovering shared structures in multiclass classification. In ICML.
  • Y. Bengio, A. Courville, and P. Vincent (2012) Unsupervised feature learning and deep learning: a review and new perspectives. arXiv preprint arXiv:1206.5538v1.
  • Y. Bengio and O. Delalleau (2009) Justifying and generalizing contrastive divergence. Neural Computation 21, pp. 1601–1621.
  • Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2006) Greedy layer-wise training of deep networks. In NIPS, pp. 153–160.
  • Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai (2013) Better mixing via deep representations. In ICML.
  • Y. Bengio (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, pp. 1–127.
  • C. M. Bishop (2006) Pattern Recognition and Machine Learning.
  • D. Erhan, Y. Bengio, P. Manzagol, A. Courville, and P. Vincent (2010) Why does unsupervised pre-training help deep learning. Journal of Machine Learning Research 11, pp. 625–660.
  • I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, and A.Y. Ng (2009) Measuring invariances in deep networks. In NIPS.
  • G. Hinton, S. Osindero, and Y. W. Teh (2006) A fast learning algorithm for deep belief nets. Neural Computation 18, pp. 1527–1554.
  • G. Hinton and R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. Science 313, pp. 504–507.
  • G. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural Computation 14, pp. 1771–1800.
  • P.O. Hoyer (2004) Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 5, pp. 1457–1469.
  • H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin (2009) Exploring strategies for training deep neural networks. Journal of Machine Learning Research 10, pp. 1–40.
  • H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio (2007) An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pp. 473–480.
  • H. Lee, R. Grosse, R. Ranganath, and A. Ng (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pp. 609–616.
  • P. Vincent, H. Larochelle, Y. Bengio, and P.A. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103.
  • W.K. Wong and M.M. Sun (2011) Deep learning regularized Fisher mappings. IEEE Trans. on Neural Networks 22, pp. 1668–1675.
  • J. Yang, D.L. Chu, L. Zhang, Y. Xu, and J.Y. Yang (2013) Sparse representation classifier steered discriminative projection with applications to face recognition. IEEE Trans. on Neural Networks and Learning Systems 24, pp. 1023–1035.
  • J. Yang, L. Zhang, Y. Xu, and J.Y. Yang (2012a) Beyond sparsity: the role of l1-optimizer in pattern classification. Pattern Recognition 45, pp. 1104–1118.
  • M. Yang, L. Zhang, D. Zhang, and S. Wang (2012b) Relaxed collaborative representation for pattern classification. In CVPR.
  • C.J. Zhang, J. Liu, Q. Tian, C.S. Xu, H.Q. Lu, and S.D. Ma (2012) Image classification by non-negative sparse coding, low-rank and sparse decomposition. In CVPR.
  • L. Zhang, M. Yang, and X. Feng (2011) Sparse representation or collaborative representation: which helps face recognition?. In ICCV.