1 Introduction
Based on an underlying premise that DNNs establish a complex probabilistic model [10, 28, 29, 39], numerous theories, such as representation learning [3, 11, 27] and the information bottleneck [1, 22, 32, 33], have been proposed to explore the working mechanism of deep learning. Though the proposed theories reveal some important properties of deep learning, such as hierarchy [3, 11] and sufficiency [1, 33], a fundamental problem is that the underlying premise is still not explicitly formulated.
In the context of probabilistic modeling for deep learning, most previous works focus on finding a probabilistic model to explain a single hidden layer of DNNs. It is known that every hidden layer of the Deep Boltzmann Machine (DBM) is equivalent to the restricted Boltzmann distribution [20, 31]. Some works demonstrate that a convolutional layer can be explained as an undirected probabilistic graphical model, namely the Markov Random Field (MRF) [16, 40]. In addition, the softmax layer has been proved to be a discrete Gibbs distribution [6]. However, some hidden layers, such as the fully connected layer, still lack a clear probabilistic explanation. Although it is known that DNNs stack hidden layers in a hierarchical way [3, 20, 27, 36], establishing an explicit probabilistic explanation for the whole architecture of DNNs has never been successfully attempted. In summary, we still do not know exactly which probabilistic model corresponds to DNNs.
The obscurity of the premise impedes clearly formulating some important properties of deep learning. First, it remains unclear what exact principle governs assembling various hidden layers into a hierarchical neural network for a specific application [3, 11]. Second, though DNNs achieve great generalization performance, we cannot convincingly formulate the generalization property of DNNs based on traditional complexity measures, e.g., the VC dimension [2] and the uniform stability [5]. Recent works claim that DNNs perform implicit regularization by the Stochastic Gradient Descent (SGD) [7, 23], but they cannot clarify some empirical phenomena of DNNs presented in [24, 38].
To establish an explicitly probabilistic premise for deep learning, we introduce a novel probabilistic representation of deep learning based on the Markov chain [8, 33] and the Energy Based Model (EBM) [9, 17]. More specifically, we provide an explicitly probabilistic explanation for DNNs in three aspects: (i) neurons define the energy of a Gibbs distribution; (ii) hidden layers formulate Gibbs distributions; and (iii) the whole architecture of DNNs can be interpreted as a Bayesian Hierarchical Model (BHM). To the best of our knowledge, this is the first probabilistic representation that can comprehensively interpret every component and the whole architecture of DNNs.
Based on the proposed probabilistic representation, we provide novel insights into two properties of DNNs: hierarchy and generalization. Above all, we explicitly formulate the hierarchy property of deep learning from the Bayesian perspective, namely that the hidden layers close to the training dataset model a prior distribution and the remaining layers model a likelihood distribution for the training labels. Second, unlike previous work claiming that DNNs perform implicit regularization by SGD [23, 38], we demonstrate that DNNs have an explicit regularization by learning based on the Bayesian regularization theory [35] and prove that SGD is a reason for decreasing the generalization ability of DNNs from the perspective of the variational inference [7, 14].
Moreover, we clarify two empirical phenomena of DNNs that are inconsistent with traditional theories of generalization [23]. First, increasing the number of hidden units can decrease the generalization error without resulting in overfitting, even in an overparametrized DNN [24]. That is because more hidden units enable DNNs to formulate a better prior distribution to regularize the likelihood distribution, thereby guaranteeing the generalization performance. Second, DNNs can achieve zero training error but high generalization error for random labels [38]. We demonstrate that the DNN still has good generalization performance in terms of learning an accurate prior distribution. The high generalization error is due to the fact that no DNN can classify random labels, because a DNN can only model two dependent random variables.
2 Related work
2.1 Energy based model
The Energy Based Model (EBM) describes the dependencies within the input $x$ by associating an energy to each configuration of $x$ [17]. We commonly formulate EBM as a Gibbs distribution
$p(x; \theta) = \frac{1}{Z(\theta)} \exp[-E(x; \theta)]$ (1)
where $E(x; \theta)$ is the energy function, $Z(\theta) = \sum_{x} \exp[-E(x; \theta)]$ is the partition function, and $\theta$ denotes all parameters. A classical example of EBM in deep learning is the Boltzmann machine [31]. In particular, the Gibbs distribution belongs to the exponential family [26] and can be expressed as
$p(x; \theta) = \exp[\langle t(x), \theta \rangle - F(\theta)]$ (2)
where $t(x)$ is the sufficient statistic for $p(x; \theta)$, $F(\theta)$ is called the log-normalizer, and $\langle \cdot, \cdot \rangle$ denotes the inner product [8]. We can derive that $E(x; \theta)$ is a sufficient statistic for $p(x; \theta)$ as well because equating Equations (1) and (2) gives $E(x; \theta) = -\langle t(x), \theta \rangle$ with $F(\theta) = \log Z(\theta)$. In addition, conjugacy is an important property of the Gibbs distribution, which indicates that the posterior distribution would be a Gibbs distribution if the prior and the likelihood distributions are both Gibbs distributions [12, 21].
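As a minimal sketch of the Gibbs distribution over a finite set of configurations, the snippet below normalizes exp(-energy) by the partition function. The function name `gibbs_distribution` and the three energy values are illustrative assumptions, not part of the original work.

```python
import numpy as np

def gibbs_distribution(energies):
    """Return p_i = exp(-E_i) / Z for a finite set of configurations."""
    energies = np.asarray(energies, dtype=float)
    # Shifting all energies by their minimum rescales the numerator and the
    # normalizer by the same factor, so the probabilities are unchanged while
    # exp() is protected from overflow.
    unnormalized = np.exp(-(energies - energies.min()))
    return unnormalized / unnormalized.sum()

# Hypothetical energies of three configurations (illustrative values only).
energies = np.array([0.5, 1.0, 3.0])
p = gibbs_distribution(energies)
print(p)
```

Lower energy yields higher probability, and the log-ratio of two probabilities equals the corresponding energy difference, which is why the energy function alone determines the distribution up to normalization.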
2.2 Stochastic variational inference
As a dominant paradigm for posterior inference $p(H|E) \propto p(E|H)\,p(H)$, the variational inference converts the inference problem into an optimization problem [4], where the prior distribution $p(H)$ is the probability of arbitrary hypothesis $H$ with respect to the observation $E$ and the likelihood distribution $p(E|H)$ is the probability of $E$ given $H$. More specifically, variational inference posits a family of approximate distributions $\mathcal{Q}$ and aims to find a distribution $q^{*}(H)$ that minimizes the Kullback-Leibler (KL) divergence between $q(H)$ and $p(H|E)$.
$q^{*}(H) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}(q(H) \,\|\, p(H|E))$ (3)
A typical method to solve the above optimization problem is the stochastic variational inference [14], which iteratively optimizes each random variable in $H$ based on the samples of $E$ while holding other random variables fixed until achieving a local minimum of $\mathrm{KL}(q(H) \,\|\, p(H|E))$. In particular, previous works prove that SGD performs variational inference during training DNNs [7, 15].
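The optimization view of variational inference can be illustrated on a toy discrete problem. The sketch below is an assumption-laden example: the prior, the likelihood, and the one-parameter family are invented for illustration, and a plain grid search stands in for stochastic variational inference.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions with strictly positive entries."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

# Illustrative prior over three hypotheses and likelihood of one fixed observation.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([0.1, 0.6, 0.3])
posterior = prior * likelihood
posterior /= posterior.sum()          # Bayes' rule

# A deliberately restricted family: q_a = [a, (1 - a) / 2, (1 - a) / 2].
grid = np.linspace(0.01, 0.99, 99)
family = [np.array([a, (1 - a) / 2, (1 - a) / 2]) for a in grid]
divergences = [kl_divergence(q, posterior) for q in family]
best = int(np.argmin(divergences))
q_star = family[best]
print(q_star, divergences[best])
```

Because the true posterior lies outside this restricted family, the minimum divergence stays strictly positive, which mirrors the gap between an approximate and an exact posterior.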
3 The probabilistic representation of deep learning
We assume that
is an unknown joint distribution between
and , where describes the prior knowledge of , describes the statistical connection between and , and denote the parameters of . In addition, a training dataset is composed of i.i.d. samples generated from . A neural network with hidden layers is denoted as and trained by . is the random variable for the hidden layer , and is the estimation of the distribution .
Proposition 1: The whole architecture of DNNs can be explained as a Bayesian hierarchical model.
Since the input of the hidden layer in the is the output of its previous layer , we can derive that the forms a Markov chain as
(4) 
As a result, the distribution of the DNN can be formulated as
(5) 
where if the output layer is defined as softmax. Notably, Proposition 2 shows that one or more hidden layers could be used to formulate a single conditional distribution in some cases. Since we can still derive a joint distribution in these cases, is still used to indicate the joint distribution of DNNs for subsequent discussion. An example is shown in Figure 1.
The joint distribution demonstrates that the DNN can be explained as a BHM with levels, in which the hidden layer formulates a conditional distribution to process the features in and serves as a prior distribution for the higher level . After establishing the probabilistic representation for the whole architecture of DNNs, we demonstrate that the hidden layers of DNNs can be explained as Gibbs distributions.
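The factorization implied by the Markov chain can be sketched with small discrete tables. The variable names (`X`, `F1`, `FY`) and all probabilities below are hypothetical and chosen only to show that the joint distribution is a product of layer-wise conditionals.

```python
import numpy as np

# Hypothetical conditional tables; each row is a distribution over the next variable.
p_f1_given_x = np.array([[0.7, 0.3],
                         [0.2, 0.8]])    # P(F1 | X), with binary X and F1
p_fy_given_f1 = np.array([[0.9, 0.1],
                          [0.4, 0.6]])   # P(FY | F1), with binary FY

def joint_given_x(x):
    """P(F1, FY | X = x) = P(F1 | X = x) * P(FY | F1) under the Markov chain."""
    return p_f1_given_x[x][:, None] * p_fy_given_f1

joint = joint_given_x(0)
marginal_fy = joint.sum(axis=0)          # sum out the hidden variable F1
print(joint)
print(marginal_fy)
```

Summing out the hidden variable recovers the output distribution, which is the same quantity obtained by multiplying the conditional tables along the chain.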
Proposition 2: The hidden layers of a neural network formulate Gibbs distributions through defining the corresponding energy functions.
The intuition of Proposition 2 can be demonstrated by the shallow neural network shown in Figure 1, in which the hidden layer has neurons and the output layer is softmax with nodes. Therefore, we can formulate each output node as , where
is an input vector,
formulates the th neuron, and denotes the weight of the edge between and . The partition function is .
Previous works prove that is equivalent to a discrete Gibbs distribution [6]. Specifically, assumes that there are configurations of , and the energy of each configuration is expressed as . Since is a linear combination of all neurons , we can reformulate as the Product of Experts (PoE) model [13]
(6) 
where and .
It is noteworthy that all experts are Gibbs distributions expressed as
(7) 
where the energy function is equivalent to the negative of the th neuron, i.e., . In other words, the energy function of is entirely dependent on all the neurons in , namely the functionality of the hidden layer . Since an energy function is a sufficient statistics of a Gibbs distribution [8], we can conclude that arbitrary hidden layers can be formulated as Gibbs distributions by defining the corresponding energy functions based on the functionality of the hidden layers.
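The equivalence between a softmax over a linear combination of neurons and a normalized product of experts can be checked numerically. In this sketch the neuron outputs `f`, the weights `w`, and the convention of folding each weight into its expert are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_neurons, num_classes = 4, 3
# Hypothetical hidden-layer outputs f[k] and output-layer weights w[l, k].
f = rng.normal(size=num_neurons)
w = rng.normal(size=(num_classes, num_neurons))

# Softmax over the linear combination of neurons.
logits = w @ f
softmax = np.exp(logits) / np.exp(logits).sum()

# Product of Experts: expert k contributes exp(w[l, k] * f[k]) to class l,
# i.e. an unnormalized Gibbs factor whose energy is -w[l, k] * f[k].
experts = np.exp(w * f)                  # shape (num_classes, num_neurons)
product = experts.prod(axis=1)
poe = product / product.sum()
print(softmax, poe)
```

The two vectors agree because the exponential of a sum equals the product of exponentials, so each neuron acts as one multiplicative expert.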
A straightforward example is DBM [31], in which each hidden layer defines an energy function as
, thereby formulating a special Gibbs distribution, namely Restricted Boltzmann Machine (RBM),
(8) 
where and are vectors of weights for the hidden nodes and the input vector , respectively. is the matrix of connection weights. The partition function is .
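For a very small RBM the partition function can be computed exactly by enumeration. The sketch below assumes the standard RBM energy with binary units; the symbols `b`, `c`, `W` and their values are illustrative choices rather than the notation of the original equation.

```python
import itertools
import numpy as np

def rbm_energy(x, h, b, c, W):
    """Standard RBM energy E(x, h) = -(b.h + c.x + x^T W h)."""
    return -(b @ h + c @ x + x @ W @ h)

b = np.array([0.1, -0.2])                # weights (biases) of the hidden nodes
c = np.array([0.3, 0.0])                 # weights (biases) of the input vector
W = np.array([[0.5, -0.4],
              [0.2, 0.1]])               # connection weights, W[i, j]: input i to hidden j

states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=2)]
energies = np.array([[rbm_energy(x, h, b, c, W) for h in states] for x in states])
Z = np.exp(-energies).sum()              # exact partition function (16 terms)
joint = np.exp(-energies) / Z            # joint[i, j] = P(x = states[i], h = states[j])
print(Z)
```

Enumeration is feasible only for toy sizes, since the number of terms in the partition function doubles with every added unit.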
In some cases, we should use multiple hidden layers to formulate a single Gibbs distribution. For example, a convolutional layer with nonlinear layers has been proved to formulate the MRF model in Convolutional Neural Networks (CNNs) [16, 40]. Since MRF is a special Gibbs distribution [9, 19], we can conclude that a convolutional layer with nonlinear layers formulates the energy function as and defines a Gibbs distribution expressed as
(9) 
where is a high-dimensional input, is a convolutional filter, and denotes the nonlinear layer(s), such as ReLU. The partition function is .
It needs to be emphasized that hidden layers only formulate the corresponding energy functions, rather than directly formulating Gibbs distributions. We can conclude that hidden layers formulate Gibbs distributions because an energy function is a sufficient statistics of a Gibbs distribution [8].
Overall, the above two propositions provide an explicitly probabilistic representation of DNNs in three aspects: (i) neurons define the energy of a Gibbs distribution; (ii) the hidden layers of DNNs formulate Gibbs distributions; and (iii) the whole architecture of DNNs can be interpreted as a BHM. Based on the probabilistic representation, we provide insights into two fundamental properties of deep learning, i.e., hierarchy and generalization, in the next section.
4 Insights into deep learning
4.1 Hierarchy
Based on Proposition 1, we can explicitly formulate the hierarchy property of deep learning. More specifically, the describes a BHM as to simulate the joint distribution given , which can be expressed as
(10) 
This equation indicates that the DNN uses some hidden layers (i.e., ) to learn a prior distribution and the other layers (i.e., ) to learn a likelihood distribution . For simplicity, DNNs formulate a joint distribution to model .
Compared to traditional Bayesian models, there are two characteristics of DNNs. First, there is no clear boundary to separate DNNs into two parts, i.e., and , because the architecture of DNNs is much more complex than an ordinary BHM [18]. Second, unlike the naive Bayes classifier independently inferring the parameters of and [25] from , the learning algorithm of DNNs, e.g., backpropagation [30], infers the parameters of based on that of . These characteristics lead to both pros and cons of DNNs. On the one hand, they enable DNNs to freely learn various features from to formulate . On the other hand, they result in some inherent problems of DNNs, such as overfitting, which is discussed in the next section.
4.2 Generalization
Based on the hierarchy property of DNNs, we can demonstrate that DNNs have an explicit regularization, because the Bayesian theory indicates that a prior distribution corresponds to the regularization [35]. This novel insight explains why an overparametrized DNN can still achieve great generalization performance [24]. More specifically, though an overparametrized DNN has a much more complex , it can simultaneously use many hidden units to formulate a powerful to regularize , thereby guaranteeing the generalization performance.
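The correspondence between a prior distribution and a regularizer can be made concrete in the linear-Gaussian setting, which is far simpler than a DNN and is used here only as an assumed illustration. With a zero-mean Gaussian prior on the weights, the maximum a posteriori estimate coincides with ridge regression; the data and variances below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

noise_var, prior_var = 0.1 ** 2, 1.0
lam = noise_var / prior_var              # regularization strength implied by the prior

# Ridge regression: argmin ||y - Xw||^2 + lam * ||w||^2.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Posterior mean (equal to the MAP estimate) for the likelihood N(y | Xw, noise_var I)
# combined with the prior N(w | 0, prior_var I).
posterior_cov = np.linalg.inv(X.T @ X / noise_var + np.eye(3) / prior_var)
w_map = posterior_cov @ (X.T @ y / noise_var)
print(w_ridge, w_map)
```

A tighter prior (smaller `prior_var`) produces a larger `lam`, so a stronger prior acts as a stronger penalty on the likelihood fit.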
Moreover, we demonstrate that the learning algorithm, e.g., the backpropagation [30], is the reason for decreasing the generalization ability of DNNs. Given a , we can formulate a BHM as . From the perspective of variational inference [7, 14], the backpropagation aims to find an optimal distribution that minimizes the KL divergence to the true posterior distribution .
(11) 
Ideally, this optimization problem is expected to be solved by iteratively optimizing each random variable while holding other random variables fixed.
(12) 
However, we cannot derive in practice because is intractable.
To design a feasible learning algorithm for DNNs, the loss function is alternatively relaxed to
(13) 
because is known to us. Nevertheless, the cost of this relaxation is that we cannot precisely infer . For simplicity, the true posterior distribution can be expressed as , thus the loss function for DNNs should be formulated as . However, the relaxed loss function merely corresponds to , which implies that it cannot guarantee that the learned DNNs satisfy the generalization property of DNNs.
It is noteworthy that the conjugacy property, namely that both and are Gibbs distributions derived from the same DNN, and the backpropagation inferring parameters in the backward direction enable us to infer via based on the relaxed loss function. However, the primary goal of the learned is to derive the that is close to , not to precisely model the true prior distribution .
5 Experiments
In this section, we first demonstrate the proposed probabilistic representation and the hierarchy property based on a simple but comprehensive CNN on a synthetic dataset. Subsequently, we validate the proposed insights into the generalization property and clarify two notable empirical phenomena of deep learning that cannot be explained by the traditional theories of generalization.
5.1 The proposed probabilistic representation
Since the distributions of most benchmark datasets are unknown, it is impossible to use them to demonstrate the proposed probabilistic representation. Alternatively, we generate a synthetic dataset obeying the Gaussian distribution based on the NIST dataset of handwritten digits (https://www.nist.gov/srd/nistspecialdatabase19). The synthetic dataset consists of 20,000 grayscale images in 10 classes (digits from 0 to 9). All grayscale images are sampled from the Gaussian distribution . Each class has 1,000 training images and 1,000 testing images. Figure 2 shows five synthetic images and their respective histograms. The method for generating the synthetic dataset is reported in Supplement A.
We choose CNN1 from Table 1 to classify the synthetic dataset. Based on the proposed probabilistic representation, we can identify the functionality of each hidden layer as follows. Above all, should model the true prior distribution, i.e., , because the max pooling layer compresses too much information of for dimension reduction. The subsequent hidden layers formulate and , thereby modeling the likelihood distribution . In summary, the whole architecture of CNN1 formulates a BHM as .
R.V.  Layer  Description  CNN1  CNN2 
Input  
Conv ()  
Maxpool + ReLU  
Conv ()  
Maxpool + ReLU  
Fully connected  
Output (softmax)
Table 1: R.V. is the random variable of the hidden layer(s). The only difference between CNN1 and CNN2 is the number of convolutional filters in .
Since and are convolutional layers, we can formulate and as
(14) 
where is a convolutional filter in , is a convolutional filter in , and indicate the max pooling and ReLU operators in and . In addition, , , , and . Since the output layer is defined as softmax, the output nodes can be expressed as
(15) 
where is the linear filter in , and is the weight of the edge between and .
Though we obtain the formulas of and , it is hard to calculate and because and are intractable for the high dimensional datasets and . Alternatively, we use the histograms of and to estimate and , respectively, because an energy function is a sufficient statistics of a Gibbs distribution [8, 34, 37].
After CNN1 is well trained (i.e., the training error becomes zero), we randomly choose a testing image as the input of CNN1 for deriving , , and . Since is a sample generated from , i.e., , we can examine the proposed probabilistic representation by calculating the distance between and , i.e., , to check whether precisely models . All distributions are shown in Figure 3. We see that is very close to () and outputs the correct classification probability.
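The histogram-based comparison of distributions can be sketched as follows. This is a minimal example under stated assumptions: Gaussian samples stand in for the layer outputs, and the bin count, value range, and the small constant `eps` are illustrative choices, not the settings of the experiment.

```python
import numpy as np

def histogram_distribution(values, bins, value_range, eps=1e-12):
    """Normalized histogram used as a discrete estimate of a distribution."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    probs = counts.astype(float) + eps   # eps keeps log() finite for empty bins
    return probs / probs.sum()

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions on the same bins."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=10_000)   # stands in for samples of the true prior
close = rng.normal(0.0, 1.0, size=10_000)       # drawn from the same distribution
shifted = rng.normal(1.5, 1.0, size=10_000)     # drawn from a different distribution

bins, value_range = 30, (-5.0, 6.5)
p_ref = histogram_distribution(reference, bins, value_range)
kl_close = kl_divergence(p_ref, histogram_distribution(close, bins, value_range))
kl_shifted = kl_divergence(p_ref, histogram_distribution(shifted, bins, value_range))
print(kl_close, kl_shifted)
```

A small divergence indicates that the estimated distribution matches the reference, whereas a mismatched distribution produces a clearly larger value.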
This experiment validates the proposed probabilistic representation in two aspects. First, since we can theoretically prove CNN1 formulating a joint distribution and empirically show modeling the prior distribution , we can conclude that formulates a BHM as , thereby explaining the hierarchy property of DNNs. Second, the hidden layers of CNN1, e.g., , formulate Gibbs distributions by defining the corresponding energy function. Moreover, it shows that DNNs have an explicit regularization by learning a prior distribution and preliminarily validates the novel insight into the generalization property of DNNs.
5.2 Generalization
5.2.1 Analyzing the generalization property of overparametrized DNNs
We use CNN1 and CNN2 to further validate the insights into the generalization property by comparing their performances on the synthetic dataset. Notably, both CNN1 and CNN2 are overparametrized because a synthetic image has 1024 pixels but they have 1680 and 1330 parameters, respectively.
Based on the proposed probabilistic representation, Table 1 indicates that CNN1 and CNN2 have the same convolutional layer to formulate their respective prior distributions, i.e., and . Meanwhile, CNN1 formulates a much more complex likelihood distribution than CNN2 because the former has many more convolutional filters than the latter in . Intuitively, given the same complexity of the prior distributions, a more complex likelihood distribution is more prone to overfitting, but Figure 4 shows that CNN1 has better generalization performance than CNN2. Also note that and . This means that CNN1 learns the better prior distribution, which regularizes the likelihood distribution of CNN1 better and guarantees its superiority over CNN2 even though both are overparametrized networks.
Moreover, this experiment shows that the learning algorithm limits the generalization ability of DNNs. Equation (14) indicates that and have the same formula. Intuitively, given the same training dataset , the learned should be as good as . However, Figure 4 shows that is worse than though both CNN1 and CNN2 achieve zero training error. It implies that the relaxed loss function cannot guarantee that the backpropagation accurately infers the prior distribution. Specifically, since the backpropagation infers the parameters of DNNs in the backward direction [30], it has to infer the prior distributions via the likelihood distributions rather than directly from . As a result, the hidden layers corresponding to the likelihood distributions, especially , have a great effect on inferring and . In particular, Table 1 shows that CNN1 has 60 convolutional filters in but CNN2 only has 36, thus the backpropagation cannot infer as accurately as , thereby limiting the generalization ability of CNN2.
5.2.2 Analyzing the generalization property of DNNs on random labels
Similar to the experiment presented in [38], we use CNN1 to classify the synthetic dataset with random labels. Figure 6 shows that CNN1 achieves zero training error but very high testing error. We also visualize the distribution of each hidden layer in CNN1 given a testing image with a random label, and Figure 5(C) shows that CNN1 still can learn an accurate prior distribution . In this sense, CNN1 still achieves good generalization performance for this experiment.
We use a simple example to demonstrate that it is impossible for a DNN to classify random labels because it can only model two dependent random variables, i.e., , but random label implies . In this example, we assume that humans can only distinguish the shape of the object (triangle or square), but the color feature (blue or green) and whether the object is filled (full or empty) are hidden features for humans. Since we can only detect the shape feature, there are only two label values (1 and 2). Labels are randomly assigned to 12 objects (8 for training and 4 for testing) in Figure 6. It is known that DNNs can detect many features that are imperceptible for humans, thus we assume that DNNs can detect all the features. Given the training objects with random labels, DNNs extract all three features as prior knowledge, i.e., , and find that objects can be classified based on whether they are filled, e.g., . It can be understood that the training labels categorize the objects into two groups and DNNs can extract the hidden feature to generate for precisely formulating the categorization, though it is imperceptible or indecipherable for humans. However, the categorization indicated by the training labels is obviously not consistent with the testing labels, because labels are random, so the testing error becomes . That explains why DNNs have a high testing error on random labels. Since DNNs can still learn an accurate prior distribution , we conclude that this experiment does not contradict the generalization property of deep learning.
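The effect of label independence can be reproduced with a pure memorizer. In the sketch below a 1-nearest-neighbor rule is a stand-in for a DNN that fits the training set perfectly; the data sizes, the Gaussian inputs, and the binary labels are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, dim = 200, 2000, 5
x_train = rng.normal(size=(n_train, dim))
x_test = rng.normal(size=(n_test, dim))
# Labels are drawn independently of the inputs, so inputs carry no label information.
y_train = rng.integers(0, 2, size=n_train)
y_test = rng.integers(0, 2, size=n_test)

def predict_1nn(x_query):
    """Label of the nearest training point (a pure memorizer, not a DNN)."""
    sq_dists = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[sq_dists.argmin(axis=1)]

train_error = float(np.mean(predict_1nn(x_train) != y_train))
test_error = float(np.mean(predict_1nn(x_test) != y_test))
print(train_error, test_error)
```

The memorizer reaches zero training error, yet its test error stays near chance level for two classes, because nothing learned from the training labels transfers to independently drawn test labels.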
6 Conclusion
In this work, we present a novel probabilistic representation for explaining DNNs and investigate two fundamental properties of deep learning: hierarchy and generalization. First, we explicitly formulate the hierarchy property from the Bayesian perspective. Second, we demonstrate that DNNs have an explicit regularization by learning a prior distribution and clarify some empirical phenomena of DNNs that cannot be explained by traditional theories of generalization. Simulation results validate the proposed probabilistic representation and the insights based on a synthetic dataset.
References
 [1] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. arXiv preprint arXiv:1706.01350, 2017.
 [2] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.
 [3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

 [4] David Blei, Alp Kucukelbir, and Jon McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112:859–877, 2017.
 [5] Olivier Bousquet and Andre Elisseeff. Stability and generalization. Journal of Machine Learning Research, pages 499–526, 2002.
 [6] John Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In NeurIPS, 1990.
 [7] Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. In ITA, 2018.
 [8] Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley-Interscience, Hoboken, New Jersey, 2006.
 [9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 721–741, June 1984.
 [10] Herbert Gish. A probabilistic approach to the understanding and training of neural network classifiers. In IEEE ICASSP, pages 1361–1364, 1990.
 [11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 [12] James Hensman, Magnus Rattray, and Neil D. Lawrence. Fast variational inference in the conjugate exponential family. In NeurIPS, 2012.

 [13] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
 [14] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.
 [15] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.
 [16] Xinjie Lan and Kenneth E. Barner. From mrfs to cnns: A novel image restoration method. In 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1–5, 2018.
 [17] Yann LeCun, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang. A tutorial on energy-based learning. MIT Press, 2006.
 [18] Fei-Fei Li and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.
 [19] Stan Z. Li. Markov Random Field Modeling in Image Analysis 2nd ed. Springer, New York, 2001.
 [20] Pankaj Mehta and David J. Schwab. An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831, 2014.
 [21] Kevin P. Murphy. Conjugate bayesian analysis of the gaussian distribution. Technical report, University of British Columbia, 2007.
 [22] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406, 2015.
 [23] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. In NeurIPS, 2017.
 [24] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR, 2015.

 [25] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In NeurIPS, pages 841–848, 2002.
 [26] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2011.
 [27] Ankit Patel, Minh Nguyen, and Richard Baraniuk. A probabilistic framework for deep learning. In NeurIPS, 2016.
 [28] Judea Pearl. Theoretical impediments to machine learning with seven sparks from the causal revolution. arXiv preprint arXiv:1801.04016, 2018.
 [29] M.D. Richard and R.P. Lippmann. Neural network classifiers estimate bayesian a posteriori probabilities. Neural Computation, pages 461–483, 1991.
 [30] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, October 1986.
 [31] Ruslan Salakhutdinov and Geoffrey Hinton. Deep boltzmann machines. In AISTATS 2009, pages 448–455, 2009.
 [32] Andrew Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Tracey, and David Cox. On the information bottleneck theory of deep learning. In ICLR, 2018.
 [33] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
 [34] E. P. Simoncelli. Statistical models for images: Compression, restoration and synthesis. In Proc 31st Asilomar Conf on Signals, Systems and Computers, pages 673–678, November 1997.
 [35] Harald Steck and Tommi S. Jaakkola. On the dirichlet prior and bayesian regularization. In NeurIPS, 2003.
 [36] Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635, 2015.
 [37] Martin. J. Wainwright and Eero. P. Simoncelli. Scale mixtures of gaussians and the statistics of natural images. In NeurIPS, pages 855–861, 2000.
 [38] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2016.
 [39] G. Zhang. Neural networks for classification: a survey. IEEE Transactions on Systems, Man, and Cybernetics, 30:451–462, 2000.

 [40] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), pages 1529–1537, 2015.