Deep Neural Networks (DNN) have achieved tremendous success in AI fields such as computer vision, natural language processing and reinforcement learning. One crucial factor in this success is that a DNN possesses a highly complex and nonlinear model architecture, which allows it to approximate almost any complicated function (Cybenko, 1989; Rolnick and Tegmark, 2018; Mhaskar et al., 2017). However, large and deep fully connected networks are slow to implement and memory demanding (Srivastava et al., 2014), which motivates the use of sparse neural nets. Additionally, sparse neural nets have been shown to have accurate approximation and strong generalization power (Glorot et al., 2011; Goodfellow et al., 2016). For example, the popular Dropout regularization (Srivastava et al., 2014) can be interpreted as averaging over regularized sparse neural nets. From a theoretical perspective, Schmidt-Hieber (2017) showed that a sparse DNN with a ReLU activation function can achieve a nearly minimax rate in the nonparametric regression setup. This theoretically justifies the use of sparse neural nets.
Bayesian neural nets are perceived to perform well against overfitting because of the regularization enforced by their prior distribution. The study of Bayesian neural nets dates back to MacKay (1992) and Neal (1993). In particular, a spike-and-slab prior (George and McCulloch, 1993) can switch a given neuron off, and thus by nature imposes regularization and encourages network sparsity. Polson and Rockova (2018) introduced Spike-and-Slab Deep Learning as a fully Bayesian alternative to Dropout for improving the generalizability of DNNs with ReLU activation, where the posterior distribution is proven to concentrate at the nearly minimax rate.
However, a well-known obstacle for Bayesian inference is the high computational cost of drawing samples from the posterior distribution via Markov chain Monte Carlo (MCMC). A popular alternative, Variational Inference (VI) or Variational Bayes (VB) (Jordan et al., 1999), approximates the true posterior distribution by a simpler family of distributions through optimization of the Evidence Lower Bound (ELBO). Some recent works have explored VI for neural networks (Graves, 2011; Kingma and Welling, 2014; Rezende et al., 2014; Blundell et al., 2015). However, the statistical properties of VI were not carefully studied until recently (Pati et al., 2018; Alquier and Ridgway, 2017; Wang and Blei, 2019), and the generalization property of variational Bayesian neural networks (BNN) remains unexplored. Specifically, it would be interesting to examine whether variational inference attains the same rates of convergence as the Bayesian posterior distribution and frequentist estimators. Cherief-Abdellatif (2019) attempts to provide theoretical justifications for variational inference on BNN, but only for the inflated tempered posterior (Bhattacharya et al., 2019) rather than the true posterior.
In this paper, we directly investigate the theoretical behavior of the variational posterior for Bayesian DNN under spike-and-slab modeling. Our specific goals are to understand how fast the variational posterior converges to the truth and how accurate the predictions produced by variational inference are. It turns out that the choice of the network structure, i.e., network depth, width and sparsity level, plays a crucial role in the success of variational inference. There is a trade-off in the choice of network architecture: an overly complex structure leads to a large variational approximation error, while an overly simplified network may not be able to capture the nonlinear features of the true underlying regression function.
When the structure of the network is optimally tuned according to the smoothness level of the true regression function, we show that the variational posterior has a minimax optimal contraction rate (up to a logarithmic factor), and that variational inference attains the optimal generalization error. In addition, when the smoothness level is unknown, rate (near-)optimal variational inference is still achievable. By imposing a well-designed prior on the network structure, we develop an automatic variational architecture selection procedure based on a penalized ELBO criterion. This selection procedure is adaptive to the unknown smoothness, and thus leads to the same theoretical guarantee. As a by-product, our work also characterizes the performance of variational inference when the truth is instead a ReLU network.
2 Nonparametric Regression via Deep Learning
Consider a nonparametric regression model with random covariates and
denotes the uniform distribution, the noise independently, and is the underlying true function. In this paper, we assume the unknown belongs to the class of -Hölder smooth functions , which is defined as
2.1 Deep neural networks
An -hidden-layer ReLU neural network is used to model the data. The number of neurons in each layer is denoted by for . The weight matrix and bias parameters in each layer are denoted by and for . Let be the ReLU activation function, and for any and any , we define as
Therefore, given parameters and , the output of this DNN model can be written as
In what follows, with slight abuse of notation, is also viewed as a vector that contains all the coefficients in ’s and ’s, and its length is denoted by , i.e., .
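As a hedged illustration of the model just described, the following Python sketch evaluates a small fully connected ReLU network and counts its coefficients. The names `forward`, `count_parameters` and `widths` are illustrative assumptions, as is the convention that every hidden layer applies an affine map followed by ReLU and that the output layer is affine; the exact placement of the bias terms in the paper's definition may differ.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Sequence[float]) -> Vector:
    """Apply the ReLU activation max(0, x) coordinate-wise."""
    return [max(0.0, x) for x in v]


def affine(W: Matrix, b: Sequence[float], x: Sequence[float]) -> Vector:
    """Compute W x + b for a weight matrix W and bias vector b."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(weights: List[Matrix], biases: List[Vector], x: Sequence[float]) -> Vector:
    """Evaluate the network: ReLU after every hidden layer, affine output layer."""
    h = list(x)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(affine(W, b, h))
    return affine(weights[-1], biases[-1], h)


def count_parameters(widths: Sequence[int]) -> int:
    """Total number of weights and biases.

    widths = [input_dim, hidden_1, ..., hidden_L, output_dim] (assumed layout).
    """
    return sum(widths[l] * widths[l + 1] + widths[l + 1] for l in range(len(widths) - 1))


# One hidden layer of width 2 acting on a 2-dimensional input.
example_weights = [[[1.0, -1.0], [-1.0, 1.0]], [[1.0, 1.0]]]
example_biases = [[0.0, 0.0], [0.5]]
example_output = forward(example_weights, example_biases, [3.0, 1.0])
example_count = count_parameters([2, 2, 1])
```

Under this convention the coefficient vector simply stacks every entry of the weight matrices and bias vectors, so its length equals the value returned by `count_parameters`.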
2.2 Regularization via spike-and-slab prior
Instead of using a fully connected neural net, i.e., is a dense vector, we consider a sparse NN with bounded coefficients, that is ,
where upper bounds the weights, controls the sparsity level of NN connectivity and . The set of under the constraint is denoted as .
Given this specified sparse network structure, we impose a fully Bayesian model with a spike-and-slab prior on . Denote as the Dirac at and as a binary vector indicating the inclusion of each edge in the network. The prior distribution thus has the following hierarchical structure:
for . In other words, we assign a uniform prior over all possible -sparse network structures, and a uniform distribution to the coefficient of each selected edge. As will be discussed later, the uniform slab distribution (i.e., ) in (3) can be replaced by a centered Gaussian distribution, without altering our theoretical conclusions.
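The hierarchical prior described above can be sketched as a two-stage sampler: first draw a sparsity pattern uniformly among all patterns with a fixed number of active edges, then draw each active coefficient uniformly on a bounded interval. In this minimal sketch the names `T` (total number of coefficients), `s` (number of active edges) and `B` (bound on the coefficients) are assumed labels, and the slab is taken to be uniform on `[-B, B]`.

```python
import random
from typing import List, Tuple


def sample_spike_and_slab_prior(
    T: int, s: int, B: float, rng: random.Random
) -> Tuple[List[int], List[float]]:
    """Draw (gamma, theta) from a spike-and-slab prior with a uniform slab.

    gamma is a binary inclusion vector with exactly s ones, drawn uniformly
    over all such vectors; theta_i is Uniform(-B, B) when gamma_i = 1 and is
    exactly 0 (the Dirac spike) when gamma_i = 0.
    """
    active = set(rng.sample(range(T), s))  # uniform over s-subsets of edges
    gamma = [1 if i in active else 0 for i in range(T)]
    theta = [rng.uniform(-B, B) if g == 1 else 0.0 for g in gamma]
    return gamma, theta


rng = random.Random(0)
gamma, theta = sample_spike_and_slab_prior(T=20, s=5, B=2.0, rng=rng)
```

Replacing `rng.uniform(-B, B)` by `rng.gauss(0.0, sd)` for some hypothetical standard deviation `sd` would give the centered Gaussian slab mentioned above, at the cost of no longer enforcing the bound on the coefficients.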
We denote and as the observations. Let denote the underlying probability measure of the data, and denote the corresponding density function, i.e., where is the normal pdf. Similarly, let and be the distribution and density functions induced by the parametric NN model (2). Thus, the posterior distribution is written as
3 Variational Inference
In the framework of variational inference, one seeks a good approximation of the posterior via optimization rather than simulating the posterior distribution by long-run Markov chain Monte Carlo. Given a variational family of distributions, denoted by , the goal is to minimize the KL divergence between distributions in and the true posterior distribution:
and the variational posterior is subsequently used for approximate inference.
This minimization is usually carried out by a gradient-descent-type algorithm.
An inspiring representation of is
where the first term in (6) can be viewed as the reconstruction error (Kingma and Welling, 2014) and the second term serves as regularization. Hence the variational inference procedure tends to minimize the reconstruction error while being penalized, in the sense of KL divergence, for departing from the prior distribution.
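For readers who want the intermediate step, the decomposition into a reconstruction term and a regularization term follows from a standard identity. The notation below is assumed for illustration: $q$ is a member of the variational family, $\pi$ the prior, $p_\theta(D)$ the likelihood, and $m(D) = \int p_\theta(D)\,\pi(d\theta)$ the marginal likelihood, which does not depend on $q$.

```latex
\begin{align*}
\mathrm{KL}\bigl(q \,\|\, \pi(\cdot \mid D)\bigr)
  &= \int \log\!\left[ \frac{dq}{d\pi}(\theta)\,\frac{m(D)}{p_\theta(D)} \right] q(d\theta) \\
  &= \underbrace{-\int \log p_\theta(D)\, q(d\theta)}_{\text{reconstruction error}}
   \;+\; \underbrace{\mathrm{KL}(q \,\|\, \pi)}_{\text{regularization}}
   \;+\; \log m(D).
\end{align*}
```

Since $\log m(D)$ is constant in $q$, minimizing the KL divergence to the posterior is equivalent to minimizing the sum of the first two terms, that is, the negative ELBO.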
Technically, the variational family could be chosen freely. But for the sake of efficient implementation and easy optimization, it is often selected as some simple distribution family. In our case, is chosen as the spike-and-slab distribution to resemble the prior distribution, i.e. for ,
where and .
One major contribution of this work is to show that, under a proper choice of network structure, the variational Bayes procedure achieves the minimax rate of convergence (up to a logarithmic factor) for the variational generalization error (8) defined below. To study generalization ability, we are interested in evaluating how well the proposed VB procedure can predict a new observation, which can be measured by
where is some VB estimator such as VB posterior mean. Alternatively, the generalizability can be measured by
where the random estimator follows the VB posterior distribution. Note that by Jensen’s inequality,
Hence, our objective is to investigate the convergence rate of .
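The Jensen step can be written out for the VB posterior mean. The symbols here are assumed for illustration: $\hat q$ is the VB posterior, $\hat f = \int f_\theta\,\hat q(d\theta)$ the VB posterior mean, $f_0$ the true function, and $\|\cdot\|_2$ the $L_2$ norm with respect to the covariate distribution. Because $g \mapsto \|g - f_0\|_2^2$ is convex,

```latex
\[
\bigl\| \hat f - f_0 \bigr\|_2^2
  = \Bigl\| \int f_\theta\, \hat q(d\theta) - f_0 \Bigr\|_2^2
  \;\le\; \int \bigl\| f_\theta - f_0 \bigr\|_2^2\, \hat q(d\theta).
\]
```

A bound on the posterior-averaged error on the right therefore also controls the error of the posterior-mean estimator on the left.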
4.1 VB posterior asymptotics
In this section, we establish the distributional convergence of the variational Bayes posterior , towards the true regression function . As an intermediate step, we first study the convergence under the Hellinger distance (rather than under norm, i.e., (8)).
Denote the log-likelihood ratio between and as
then the negative ELBO can be expressed as
where is a constant with respect to . In the framework of variational inference, one seeks the largest ELBO, since it measures the similarity between the true Bayesian posterior and the variational posterior distributions. Our first lemma provides an upper bound on the negative ELBO for the sparse DNN model, under our prior specification (3) and variational family .
Lemma 4.1. Under any choice of network family with equal width , we have that, with dominating probability,
holds, where and is any diverging sequence.
The RHS of (9) consists of two terms: the first term is an error caused by the variational Bayes approximation; the second term is an approximation error of ReLU DNN model. Specifically, the term depends on the complexity of network structure, with an order
which depends nearly linearly on the sparsity and depth of the network structure. On the other hand, the approximation error decreases as one increases the complexity of the network configuration (i.e., the choice of , and ). Therefore, it reveals a trade-off in the choice of network structure. Note that this trade-off echoes those observed in the nonparametric statistics literature: as one enlarges the parameter space (e.g., increases the number of basis functions in spline regression modeling), it usually leads to smaller bias but larger variance.
To quantify , certain knowledge of approximation theory is required. There is a rich literature on the approximation properties of neural networks. For instance, Cheang and Barron (2000) and Cheang (2010) provided tight approximation error bounds for simple indicator functions; Ismailov (2017) studied the approximation efficiency of shallow neural networks. Some recent works characterize the approximation accuracy of sparsely connected deep nets (Bölcskei et al., 2019; Schmidt-Hieber, 2017; Bauer and Kohler, 2019). The following lemma is due to Schmidt-Hieber (2017, Theorem 3).
Lemma 4.2. Assume for some , then there exists a neural net with
and for some positive constant , such that
Lemma 4.2 summarizes the expressibility of sparse ReLU DNN in terms of its depth, width and sparsity, which will serve as an important building block for our subsequent analysis. In particular, when network depth, width and sparsity are chosen as in Lemma 4.2, the two terms discussed in Lemma 4.1 (i.e., and ) strike a balance which later yields an optimal rate.
Therefore, in the rest of this section, our network architecture follows the choice of , , and , and we study the generalization ability of variational Bayes inference with prior support being . The results obtained in this section will be extended in Section 5 to the case where the smoothness parameter is unknown (so that and become unavailable).
Our next lemma links the contraction rate of variational Bayes posterior with the negative ELBO discussed in Lemma 4.1.
Lemma 4.3. Under the network structure specified in Lemma 4.2, let for any and some large constant , then with probability at least ,
where is some positive constant.
It is worth mentioning that Lemma 4.3 holds regardless of the choice of prior specification and variational family .
The LHS of (11) is the variational Bayes posterior mean of the squared Hellinger distance. On the RHS, the first term comes from the construction of a certain testing function (refer to Lemma 7.2), which is completely determined by the nature of our sparse DNN modeling; the second term, as discussed above, is the negative ELBO (up to a constant), which is completely determined by the choices of and .
Theorem 4.1. Assume for some , where and . Choose , and as in Lemma 4.2. Let with some , then, for any diverging
with dominating probability.
Theorem 4.1 establishes the rate minimaxity (up to a logarithmic factor) of variational sparse DNN inference under the squared Hellinger distance. The established rate matches the contraction rate of the true Bayesian posterior (Polson and Rockova, 2018) and therefore implies that variational inference sacrifices nothing in statistical rate.
Result (12) also implies that in probability for any , hence almost all of the VB posterior mass contracts towards a small Hellinger ball with radius centered at .
4.2 VB generalization error
So far the results established in Section 4.1 concern only the Hellinger distance, which is
By our assumption, for some constant , therefore,
This, combined with Theorem 4.1, implies that, with high probability, the VB generalization error is rate (near-)minimax, as
for any .
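The passage from the Hellinger distance to the squared error can be checked numerically at a single covariate value. The sketch below assumes Gaussian noise with a common standard deviation (named `sigma` here) and the convention $d^2(P,Q) = 1 - \int \sqrt{pq}$; under these assumptions the squared Hellinger distance between $N(a,\sigma^2)$ and $N(b,\sigma^2)$ equals $1 - \exp\{-(a-b)^2/(8\sigma^2)\}$. If the two means differ by at most $2F$ for a hypothetical bound `F` on the regression functions, this quantity is bounded below by a constant multiple of $(a-b)^2$, which is what `lower_constant` computes. All function names are illustrative.

```python
import math


def normal_pdf(y: float, mean: float, sigma: float) -> float:
    """Density of N(mean, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def hellinger_sq_closed_form(delta: float, sigma: float) -> float:
    """1 - exp(-delta^2 / (8 sigma^2)) for two Gaussians whose means differ by delta."""
    return 1.0 - math.exp(-(delta ** 2) / (8.0 * sigma ** 2))


def hellinger_sq_numeric(a: float, b: float, sigma: float,
                         half_width: float = 12.0, steps: int = 20000) -> float:
    """1 - integral of sqrt(p q), approximated with the trapezoid rule."""
    lo = min(a, b) - half_width * sigma
    hi = max(a, b) + half_width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        y = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.sqrt(normal_pdf(y, a, sigma) * normal_pdf(y, b, sigma))
    return 1.0 - h * total


def lower_constant(F: float, sigma: float) -> float:
    """Constant C with hellinger_sq >= C * delta^2 whenever |delta| <= 2F."""
    c = 4.0 * F ** 2 / (8.0 * sigma ** 2)
    return (1.0 - math.exp(-c)) / (4.0 * F ** 2)


gap = abs(hellinger_sq_numeric(0.0, 1.0, 1.0) - hellinger_sq_closed_form(1.0, 1.0))
```

The lower bound uses the fact that $x \mapsto (1 - e^{-x})/x$ is decreasing, so on a bounded range the squared Hellinger distance and the squared difference of the means are equivalent up to constants.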
4.3 Data generated from neural network
Apart from our assumption that the truth is some unknown -smooth function, it is also common to assume that is exactly an unknown sparse ReLU network, that is for some , and . Importantly, our variational Bayes modeling with a spike-and-slab prior can still be used to estimate , as long as certain information about , and is available.
Let the prior support satisfy , , for some such that . In other words, we allow the networks to be wider and denser than the true sparse network, but with the same depth . We can show that if
our variational inference leads to the following contraction result:
with dominating probability, where
for some positive . The above convergence result follows the same proof of Theorem 4.1. In particular, Lemma 4.1 still holds with the approximation error term vanishing to 0, and Lemma 4.3 still holds when the satisfies the inequality (16).
We want to point out that the above analysis holds as long as (13) is satisfied, even when is not necessarily a sparse neural network. Note that it is meaningless to compare against as the two parameter spaces are not directly comparable. However, the minimax lower bound result is still valuable in this setup, and will be developed as a future direction.
5 Adapt to Unknown Smoothness
In Section 4.1, we specify the network structure using the choices of , and . However, both and depend on the smoothness parameter , which in general is not available for real data analysis. In this section, we will develop an adaptive variational Bayes inference procedure, under which the variational posterior achieves the same optimal convergence rate as if were known.
Note that doesn’t depend on . Therefore, we continue restricting the network structure to be -layers. As for network sparsity and width, we expand the prior support to , where and is the total possible number of edges in the -hidden-layer network with layer width .
The prior specification on the network structure follows Polson and Rockova (2018), that is
for some given and . Conditional on and , the priors for the weight matrix and bias parameters still follow (3). As proved by Polson and Rockova (2018), the posterior distribution induced by the above Bayesian modeling possesses an optimal posterior contraction rate.
To implement variational inference, we consider the variational family that restricts the VB marginal posterior of and to be a degenerate measure, i.e.,
where and . This choice of variational family means that the VB posterior will adaptively select one particular network structure by minimizing .
for some constant . Let
be the maximized ELBO given the network structure determined by parameters and . Then , in other words, the above VB modeling leads to a variational network structure selection based on a penalized ELBO criterion, where the penalty term is the logarithm of the prior of and .
In Bayesian analysis, model selection depends on the (log-)posterior: . Thus, the proposed variational structure selection procedure is an approximation to the maximum a posteriori (MAP) estimator, obtained by replacing the model evidence term with the ELBO .
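The selection rule described above, which picks the structure maximizing the ELBO plus the log-prior of the structure, can be sketched as follows. Everything concrete here is an assumption for illustration: the candidate structures are `(width, sparsity)` pairs, the ELBO values are made-up numbers standing in for the result of an optimization over the weights, and `placeholder_log_prior` is a simple linear penalty rather than the prior specified in the paper.

```python
from typing import Callable, Dict, Tuple

Structure = Tuple[int, int]  # (layer width, number of active edges); assumed encoding


def select_structure(
    max_elbo: Dict[Structure, float],
    log_prior: Callable[[int, int], float],
) -> Structure:
    """Return the structure maximizing (maximized ELBO) + log prior(structure)."""
    return max(max_elbo, key=lambda ns: max_elbo[ns] + log_prior(ns[0], ns[1]))


def placeholder_log_prior(width: int, sparsity: int) -> float:
    """Hypothetical penalty that grows with width and sparsity (not the paper's prior)."""
    return -(0.1 * width + 0.5 * sparsity)


# Made-up maximized ELBO values: richer structures fit better but are penalized more.
hypothetical_elbo = {(4, 10): -120.0, (8, 20): -100.0, (16, 60): -95.0}
chosen = select_structure(hypothetical_elbo, placeholder_log_prior)
```

In this toy example the largest structure has the best ELBO, yet the penalty moves the choice to the intermediate one, which mirrors how the criterion guards against overly complex networks.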
Our next theorem shows that the proposed variational model is rate optimal without knowing .
Theorem 5.1. Under the variational Bayes modeling described above, we still achieve that
holds with dominating probability, where is any sequence satisfying .
The proof of Theorem 5.1 shows that our adaptive VB inference procedure will avoid over-complicated network structure, in the sense that the selected and will not be overwhelmingly larger than the optimal choice and . Therefore, the optimal rate of convergence still holds.
In this work, we investigate theoretical aspects of variational Bayes inference for sparse DNN models in the nonparametric regression setup. With certain spike-and-slab prior and variational family, we are able to achieve (near-)optimal variational posterior contraction rate under Hellinger distance, as well as (near-)optimal VB generalization error.
Although theoretically sound, spike-and-slab modeling with a Dirac spike is difficult to implement in practice. One has to incorporate a certain trans-structure step in the minimization of the negative ELBO; said differently, the algorithm may need to search exhaustively over all possible sparse network structures. Thus, developing an efficient computational algorithm for the spike-and-slab prior is very challenging. An alternative choice could be Bayesian shrinkage modeling. For example, Ghosh and Doshi-Velez (2017) demonstrated promising empirical results for variational inference on ReLU DNN via the horseshoe prior. We foresee that carefully designed variational inference with a shrinkage prior can inherit the theoretical optimality of spike-and-slab modeling, while enjoying the fast computation that is crucially important for deep learning problems.
We include the detailed proofs for our main theorems in this section.
Lemma 7.1. For any probability measure and any measurable function with ,
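A standard identity of this kind is the Donsker-Varadhan variational formula, $\log \int e^{h}\,d\mu = \sup_{\rho}\bigl\{\int h\,d\rho - \mathrm{KL}(\rho\,\|\,\mu)\bigr\}$, where the supremum runs over probability measures $\rho$; treating it as the intended statement is an assumption made here. The sketch below checks the formula on a finite space, where the supremum is attained by the Gibbs measure proportional to $\mu e^{h}$. All names are illustrative.

```python
import math
from typing import List, Sequence


def log_mgf(mu: Sequence[float], h: Sequence[float]) -> float:
    """log of sum_i mu_i * exp(h_i)."""
    return math.log(sum(m * math.exp(v) for m, v in zip(mu, h)))


def kl(rho: Sequence[float], mu: Sequence[float]) -> float:
    """KL(rho || mu) on a finite space, with the convention 0 * log 0 = 0."""
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0.0)


def dv_objective(rho: Sequence[float], mu: Sequence[float], h: Sequence[float]) -> float:
    """sum_i rho_i * h_i - KL(rho || mu)."""
    return sum(r * v for r, v in zip(rho, h)) - kl(rho, mu)


def gibbs_maximizer(mu: Sequence[float], h: Sequence[float]) -> List[float]:
    """Probability vector proportional to mu_i * exp(h_i)."""
    w = [m * math.exp(v) for m, v in zip(mu, h)]
    z = sum(w)
    return [x / z for x in w]


mu = [0.2, 0.3, 0.5]
h = [1.0, -0.5, 0.3]
rho_star = gibbs_maximizer(mu, h)
attained = dv_objective(rho_star, mu, h)   # matches log_mgf(mu, h) up to rounding
other = dv_objective([0.6, 0.3, 0.1], mu, h)  # any other rho gives a smaller value
```

In the proofs this type of identity is useful because it converts an exponential-moment bound under the prior into a bound on the expectation under an arbitrary distribution, at the price of a KL term.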
The next lemma proves the existence of a testing function that can exponentially separate and . The existence of such a testing function is crucial for Lemma 4.3.
Lemma 7.2. Let for any and some large constant . Then there exists some testing function and , , such that
for all satisfying that .
Let denote the covering number of set , i.e., there exist Hellinger balls with radius that completely cover . For any (w.l.o.g., we assume belongs to the th Hellinger ball centered at ), if , then we must have that and there exists a testing function , such that
Now we define . Thus we must have
where the first inequality is due to the inequality
and , the second inequality is due to Lemma 10 of Schmidt-Hieber (2017) and the third inequality is due to the choices of , and , and the fact that is sufficiently large. Therefore,
for some . On the other hand, for any , such that and belongs to the th Hellinger ball, we have
where . Hence we conclude the proof. ∎
Proof of Lemma 4.1
It suffices to construct some , such that for any , w.h.p,
Let and we choose the same which has been used in the proof of Theorem 2 of Cherief-Abdellatif (2019), i.e.,
for all , where and with .
Then, according to the proof of Theorem 2 in Cherief-Abdellatif (2019),
Noting that , then by Fubini’s theorem,
which in turn, by Markov's inequality, implies that w.h.p. for any . Together with (18), this proves the lemma. ∎
Proof of Lemma 4.3
The proof is adapted from the proof of Theorem 3.1 in Pati et al. (2018).
We claim that with high probability (w.h.p),
for some , where . Thus by Lemma 7.1, w.h.p.,
holds for any distribution . The last inequality holds because is the negative ELBO function up to a constant, which is minimized at . This concludes Lemma 4.3.
To prove (20), we define