Bayesian learning is a framework for combining the information internal to the data with background knowledge about the specific learning task (the external information) in a logically consistent way through Bayes' Theorem draper2006bayesian; draper2013bayesian. In the Bayesian framework, the parameters of the model are not fixed constants but random variables whose distributions are specified according to the background information. From an information perspective, the background knowledge external to the data set is 'injected' into the learning model optimally by Bayes' Theorem zellner1988optimal.
In this framework, parameters are not estimated by a point estimator such as the Maximum Likelihood Estimator (M.L.E.); instead, the posterior distribution of the parameters is given. The technical difficulty in this step is the computation of the normalization constant. In some applications, the posterior distribution is approximated by a Dirac distribution located at its mode, which is equivalent to using the Maximum A Posteriori (M.A.P.) estimator as a point estimate of the model parameters nowlan1992simplifying. To get a better approximation of the posterior distribution, higher-order terms should be retained: keeping terms up to second order in the expansion of the log posterior around the mode yields a Normal distribution, which is the Laplace approximation.
In our work, we propose a robust initialization method that allows the parameters of the Bayesian Tensor Network model to be inferred more quickly. Because the Tensor Network has a special structure, a sum of products of tensor chains, its output easily blows up or decays to zero, which makes the inference process unstable. With our initialization strategy, the parameters and their gradients stay in a stable region where gradient descent behaves well. To predict new observations, we obtain the predictive marginal posterior distribution of the Bayesian Tensor Network model by the Laplace approximation, namely a second-order approximation. We observe that the Hessian matrix of the Tensor Network model has a simple analytical expression, which can be computed more efficiently than that of a Bayesian Neural Network. We hope that the Bayesian Tensor Network can be scaled to much larger, enterprise-level applications.
In practical classification problems, labels must be assigned to new observations. To determine the decision boundary, the utility matrix (the negative loss matrix) is written down, and the decision threshold is then determined by minimizing the expected loss.
2 Related Work
Deep Neural Networks have achieved great success in recent years lecun2015deep; hinton2006fast; lecun1995convolutional; krizhevsky2012imagenet; hochreiter1997long; srivastava2014dropout. Several works have developed a Bayesian framework for Neural Networks. In this framework, the key and most difficult part is approximating the posterior distribution and the predictive marginal distribution.
In his work on hierarchical modeling for model uncertainty, Prof. David Draper, a co-author of this paper, proposed a general framework for accounting for model uncertainty and discussed computational techniques for the inference and prediction steps of Bayesian modeling under model uncertainty draper1995assessment. At the same time, other early pioneers of Bayesian Neural Networks also contributed greatly to this area. David MacKay focused on Bayesian Neural Networks as adaptive learning models: he used the Bayesian framework to compare Neural Network models and to select hyperparameters and model structures through the evidence framework mackay1992interpolation. Radford Neal focused on constructing appropriate priors for Neural Networks based on their properties. Neal suggested using networks with infinitely many hidden units, for which the priors converge to stochastic processes. Neal also proposed using Hybrid Monte Carlo to simulate the posterior predictive distribution (an integral) neal2012bayesian. Instead of approximating the posterior integral directly, the variational approach, which searches a parameterized function space by maximizing the Evidence Lower Bound, has been widely explored blei2006variational; kingma2013auto. More recently, many works have followed and extended these ideas buntine1991bayesian; tran2019bayesian; lee2017deep; arora2019exact; arora2019harnessing; blundell2015weight; xiong2011bayesian; salakhutdinov2008bayesian; balan2015bayesian; hernandez2015probabilistic.
Tensor Networks were originally proposed to describe quantum many-body states orus2019tensor; chabuda2020tensor; glasser2020probabilistic. In the last several years, several works have used them, for example Matrix Product States (MPS), to construct new learning models novikov2015tensorizing; stoudenmire2016supervised; han2018unsupervised.
3 Initialization of Tensor Network
3.1 Background and Set Up
Although a Tensor Network has powerful representation ability because of the bond dimension between every pair of neighboring tensor nodes, it is difficult to train long Matrix Product State chains. Since every tensor node in the chain contributes one multiplicative factor, the output increases or decreases roughly exponentially with the number of nodes.
The setup of the Tensor Network is as follows,
We can get a rough estimation of the amplitude of as
With the assumption that , and , where represents the bond dimension, represents the dimension of the kernel space and represents the number of the nodes.
The network will work poorly, or not at all, if the tensor nodes are not well initialized, since a small perturbation of the tensor weights is amplified at an exponential rate with respect to the number of nodes.
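The exponential sensitivity described above can be illustrated with a minimal sketch. It assumes an open-boundary MPS with one core of shape (D, d, D) per feature, fixed unit boundary vectors, and the local feature map phi(x) = (cos(pi x / 2), sin(pi x / 2)). These choices, and the names `feature_map`, `mps_output` and `median_abs_output`, are illustrative assumptions rather than the exact setup of this paper.

```python
import numpy as np

def feature_map(x):
    # Assumed local kernel phi(x) = (cos(pi x / 2), sin(pi x / 2)); its squared norm is 1.
    return np.stack([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)], axis=-1)

def mps_output(cores, x):
    # Contract a chain of (D, d, D) cores with the feature vectors of one input x.
    phi = feature_map(np.asarray(x, dtype=float))   # shape (N, d)
    v = np.zeros(cores[0].shape[0])
    v[0] = 1.0                                       # assumed left boundary vector e_0
    for core, p in zip(cores, phi):
        v = v @ np.einsum("asb,s->ab", core, p)      # one multiplicative factor per node
    return v[0]                                      # assumed right boundary vector e_0

def median_abs_output(std, n_nodes=50, bond_dim=10, trials=21, seed=0):
    # Median |f(x)| over random Gaussian initializations with the given standard deviation.
    rng = np.random.default_rng(seed)
    outs = []
    for _ in range(trials):
        cores = [rng.normal(0.0, std, size=(bond_dim, 2, bond_dim))
                 for _ in range(n_nodes)]
        x = rng.uniform(0.0, 1.0, size=n_nodes)
        outs.append(abs(mps_output(cores, x)))
    return float(np.median(outs))

if __name__ == "__main__":
    for std in (0.1, 0.3, 0.5):
        print(f"std={std}: median |f(x)| = {median_abs_output(std):.3e}")
```

In this toy setting, changing the standard deviation from 0.1 to 0.5 moves the typical output magnitude across many orders of magnitude for a chain of 50 nodes, which is the instability the initialization is meant to avoid.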
To solve the problem of exploding or vanishing responses and gradients, we focus on initializing the parameters with an appropriate distribution. For training deep neural networks, Glorot and Bengio glorot2010understanding and He et al. he2015delving proposed variance-based initialization schemes, known as Xavier and He initialization.
3.2 Variance Analysis
We use a Gaussian distribution to initialize the Bayesian Tensor Network model and set its mean to zero, since we want the response to stay around zero. For the variance, we need to analyze the variance of the response as a function of the variance of the tensor weights.
In the above derivation, we assume that all components of the tensor weights are independent and identically distributed (I.I.D.) random variables. For the kernel part, we assume that all components of the input kernel are also I.I.D.
Considering the particular kernel function , we have
From the gradient perspective, we can write down the gradient with respect to a tensor node as follows.
We get the variance of the gradients as
A good initialization sets the variance of the tensor nodes to be of the same order as the output variance, which avoids an exponentially increasing or decreasing response, so we get
From the gradient perspective, requiring the variance of the gradients to be of the same order as the variance of the tensor weights, we obtain
We analyze the asymptotic behavior of our initialization formula. From the tensor weights perspective,
From the gradient perspective,
In the large limit, we get the same initialization variance of the tensor weights from the weights perspective and from the gradient perspective, which means our initialization method is logically consistent.
Unlike Xavier initialization glorot2010understanding and He initialization he2015delving, our initialization formula does not depend heavily on the number of nodes: the number of nodes enters only the exponent of the variance of each data feature as a fractional factor, and this factor quickly converges to a constant. However, the bond dimension and the physical dimension, namely the dimension of the kernel, have a large effect on the initialization.
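The weak dependence on the number of nodes can be checked numerically under simplified assumptions. The sketch below is not the formula of this paper: it assumes an open-boundary MPS with unit boundary vectors, I.I.D. zero-mean Gaussian cores, and a feature map of unit norm (here cos/sin), for which the second moment of the output is std^(2N) · D^(N-1), with D the bond dimension and N the number of nodes. Setting this to one gives a variance of D^(-(N-1)/N), whose exponent approaches -1 as N grows. All function names are hypothetical.

```python
import numpy as np

def stable_std(bond_dim, n_nodes):
    # Solve std^(2N) * D^(N-1) = 1 for std (simplified model, unit-norm features).
    return bond_dim ** (-(n_nodes - 1) / (2.0 * n_nodes))

def predicted_second_moment(std, bond_dim, n_nodes):
    # E[f(x)^2] for i.i.d. N(0, std^2) cores under the assumptions stated above.
    return std ** (2 * n_nodes) * float(bond_dim) ** (n_nodes - 1)

def sampled_second_moment(std, bond_dim, n_nodes, trials=20000, seed=0):
    # Monte Carlo estimate of E[f(x)^2], batching all trials in one contraction.
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(trials, n_nodes))
    phi = np.stack([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)], axis=-1)  # (T, N, 2)
    v = np.zeros((trials, bond_dim))
    v[:, 0] = 1.0                                    # assumed left boundary vector e_0
    for i in range(n_nodes):
        cores = rng.normal(0.0, std, size=(trials, bond_dim, 2, bond_dim))
        mats = np.einsum("tasb,ts->tab", cores, phi[:, i])
        v = np.einsum("ta,tab->tb", v, mats)
    return float(np.mean(v[:, 0] ** 2))              # assumed right boundary vector e_0

if __name__ == "__main__":
    D, N = 4, 8
    std = stable_std(D, N)
    print("std:", std)
    print("predicted E[f^2]:", predicted_second_moment(std, D, N))
    print("sampled   E[f^2]:", sampled_second_moment(std, D, N))
    # The exponent (N - 1) / N tends to 1, so std tends to D ** -0.5 for long chains.
    print([round(stable_std(D, n), 4) for n in (2, 8, 32, 128)])
```

In this simplified model the bond dimension sets the scale of the standard deviation, while the chain length only changes the exponent slightly, which mirrors the qualitative behavior described above.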
We analyze the mean of the parameters following the same idea,
In most practical cases, the training data set is pre-processed so that its mean is shifted to zero. So we have
then we get
Similarly, we can get
This means that the means of the responses and the gradients always stay in a healthy region, no matter what the initialization distribution is. So we do not need to consider the mean when designing the initialization method.
3.3 Numerical Results
In this section, we study the performance of our formula on the MNIST lecun1998mnist, Phishing Website and Breast Cancer data sets Dua:2019, and we compare our initialization method with the Xavier and He initialization methods.
Figure 1: In panels (a), (c) and (e), we compare the accuracy curves of different initialization methods on the MNIST, Phishing and Breast Cancer data sets. In panels (b), (d) and (f), we test the convergence of the Bayesian Tensor Network model when the standard deviation given by our formula is perturbed.
In Fig. 1, we compare the accuracy on the MNIST data set in the first epochs for our initialization method, the Xavier initialization method and the He initialization method. Our initialization method converges much more quickly than the other two on the MNIST data set.
In Fig. 1(a), Fig. 1(c) and Fig. 1(e), we compare the convergence of the Bayesian Tensor Network with our initialization method, Xavier initialization and He initialization on the MNIST, Phishing and Breast Cancer data sets. On all three data sets, our initialization method works much better than the other two methods. Since the Breast Cancer data set has only a small number of features, the training of the Bayesian Tensor Network on it is not as sensitive to the initialization method as on the Phishing or MNIST data sets.
In Fig. 1(b), Fig. 1(d) and Fig. 1(f), we show the accuracy curves for the standard deviation obtained by our formula and for slightly scaled standard deviations. On data sets with more features, the training process is more sensitive to the initialization, and a small deviation from the standard deviation given by our formula leads to poor convergence.
4 Bayesian Framework For Tensor Networks
4.1 The General Framework
We write down our Bayesian Tensor Networks model as follows,
In the inference step, we write down the posterior distribution of the parameters , namely as
If we use the Maximum A Posteriori (M.A.P.) estimator to estimate the parameters at the optimal mode, the model is the same as the normal Tensor Network model,
In our work, we focus on the full Bayesian analysis. We make predictions and decisions based on the predictive posterior marginal distribution.
However, because the predictive posterior marginal distribution is analytically intractable, we approximate it around the M.A.P. mode of the posterior distribution up to second order. We note that if we approximate the posterior distribution roughly by the Dirac distribution at the M.A.P. mode,
Clearly, information contained in the data set is lost in this point approximation.
Given the predictive posterior marginal distribution obtained in the prediction step, assigning a label to every new observation in the test set is a decision problem. When the goal is to minimize the misclassification rate in the classification problem, the decision boundary is determined as
In practical cases where a utility matrix (negative loss matrix) is specially designed, we can instead maximize the expected utility to determine the action. For the regression problem, we focus on the mean squared loss; minimizing the expected loss shows that the optimal decision is the conditional expectation of the label given the features, which is just the value of the regression function.
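The decision step can be sketched in a few lines. The example below assumes the predictive class probabilities are already available as an array and that the loss matrix is indexed as `loss[action, true_class]`; the function name `bayes_decision` and the numerical values are illustrative only.

```python
import numpy as np

def bayes_decision(probs, loss):
    # probs[n, k]: predictive probability of class k for observation n.
    # loss[a, k]: loss of taking action a when the true class is k (negative utility).
    expected_loss = np.asarray(probs, dtype=float) @ np.asarray(loss, dtype=float).T
    return expected_loss.argmin(axis=1)              # action with minimal expected loss

if __name__ == "__main__":
    probs = np.array([[0.8, 0.2], [0.9, 0.1]])
    zero_one = np.array([[0.0, 1.0], [1.0, 0.0]])    # minimizes the misclassification rate
    asymmetric = np.array([[0.0, 5.0], [1.0, 0.0]])  # missing class 1 costs five times more
    print(bayes_decision(probs, zero_one))
    print(bayes_decision(probs, asymmetric))
```

With the 0-1 loss the rule reduces to picking the most probable class. With the asymmetric loss, the first observation is assigned to class 1 although its probability is only 0.2, because 0.2 · 5 exceeds 0.8 · 1; the decision threshold moves from 1/2 to 1/6.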
In our convention, the data set is denoted as
in the classification problem. In our setup,
is encoded as a one-hot vector which indicates the category the data point belongs to.
For the binary classification problem, the response of the Bayesian Tensor Network model is the logit, namely
If we treat every component of the encoded vector independently and model each component with the above logit formula, the model extends easily to the multi-class case.
In the multi-class case, we use the Softmax activation function and we get
For the binary case, we can write down the cost function as
For the multi-class case, we have
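As a hedged illustration, the standard forms of these quantities can be written in code. The sketch assumes a Bernoulli likelihood with a sigmoid link in the binary case, a categorical likelihood with a Softmax link in the multi-class case, and an isotropic zero-mean Normal prior with standard deviation `prior_std`, so that each cost is a negative log posterior up to an additive constant. The function names and the exact form of the prior term are assumptions, not taken from this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def log_prior_penalty(params, prior_std):
    # Negative log of an isotropic N(0, prior_std^2) prior, dropping constants.
    return 0.5 * np.sum(params ** 2) / prior_std ** 2

def binary_cost(logits, y, params, prior_std):
    # Bernoulli negative log likelihood plus the Gaussian prior penalty.
    p = sigmoid(logits)
    nll = -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return nll + log_prior_penalty(params, prior_std)

def multiclass_cost(logits, onehot, params, prior_std):
    # Categorical negative log likelihood plus the Gaussian prior penalty.
    nll = -np.sum(onehot * np.log(softmax(logits)))
    return nll + log_prior_penalty(params, prior_std)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = rng.normal(size=10)
    labels = np.eye(3)[[0, 1, 2, 0]]                  # one-hot encoded labels
    print(multiclass_cost(rng.normal(size=(4, 3)), labels, params, prior_std=1.0))
    print(binary_cost(rng.normal(size=4), np.array([0.0, 1.0, 1.0, 0.0]), params, 1.0))
```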
In the Bayesian Tensor Network, we would not need to do inference (training) if we could solve the intractable posterior marginal distribution, as long as the prior distribution is chosen wisely according to the background knowledge. However, we need to find the M.A.P. mode of the posterior distribution in order to expand the posterior distribution around it. We write down the posterior distribution and use stochastic gradient optimization to reach the M.A.P. mode. In practice, our objective function is the negative log posterior distribution. Around the optimal mode, we get the normal distribution as
The Hessian matrix contains the geometric information (curvature) of the posterior distribution, so the Hessian matrix captures more information and gives a better approximation than the Dirac distribution.
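The difference between the Dirac and the second-order approximation can be seen on a small stand-in model. The sketch below uses Bayesian logistic regression rather than a tensor network, finds the M.A.P. mode with Newton iterations instead of stochastic gradients, and averages the likelihood over the Gaussian approximate posterior by Monte Carlo. The toy data, the function names and these substitutions are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_neg_log_posterior(w, X, y, prior_std):
    # Gradient of the negative log posterior: Bernoulli likelihood + N(0, prior_std^2 I) prior.
    return X.T @ (sigmoid(X @ w) - y) + w / prior_std ** 2

def hessian_neg_log_posterior(w, X, prior_std):
    # Exact Hessian of the negative log posterior for this linear-logit model.
    mu = sigmoid(X @ w)
    return (X * (mu * (1.0 - mu))[:, None]).T @ X + np.eye(X.shape[1]) / prior_std ** 2

def laplace_approximation(X, y, prior_std=1.0, steps=50):
    # Newton iterations to the M.A.P. mode, then covariance = inverse Hessian at the mode.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - np.linalg.solve(hessian_neg_log_posterior(w, X, prior_std),
                                grad_neg_log_posterior(w, X, y, prior_std))
    return w, np.linalg.inv(hessian_neg_log_posterior(w, X, prior_std))

def predictive_probability(x_new, w_map, cov, n_samples=5000, seed=0):
    # Monte Carlo average of the likelihood over the Gaussian approximate posterior.
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(w_map, cov, size=n_samples)
    return float(sigmoid(samples @ x_new).mean())

def make_blobs(n=200, seed=0):
    # Hypothetical toy data: two Gaussian blobs plus a constant bias feature.
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n).astype(float)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 1.0, -1.0)
    return np.hstack([X, np.ones((n, 1))]), y

if __name__ == "__main__":
    X, y = make_blobs()
    w_map, cov = laplace_approximation(X, y)
    x_new = np.array([3.0, 3.0, 1.0])
    print("Dirac (plug-in) prediction:", sigmoid(x_new @ w_map))
    print("Laplace-averaged prediction:", predictive_probability(x_new, w_map, cov))
```

The plug-in prediction uses only the mode, whereas the averaged prediction also uses the curvature at the mode through the covariance matrix.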
The covariance matrix of the Normal distribution is
where is the second-derivative matrix of the log likelihood function. The time complexity of computing the inverse of the Hessian matrix is , which is time consuming, so we use the outer-product approximation to decrease the time complexity to .
Since the prior distribution and the approximated posterior distribution are both Normal distributions, we can obtain the predictive marginal posterior distribution analytically as
By plugging in the approximated posterior distribution, we get
4.3 Hessian Matrix
The approximation of the Hessian matrix has been widely studied. The idea of the outer-product approximation is that in a trained network the labels and the outputs are close to each other, so the term containing the second-derivative matrix is very small and can be ignored. We get the outer-product approximation of the Hessian matrix of the Bayesian Tensor Network model as
The result above contains only first derivatives, which means the time complexity is almost . Here means the component of the output of the Bayesian Tensor Network. Unlike for Neural Networks, the first derivative of the logits can be calculated analytically, so we obtain an analytical expression for the Hessian matrix of the Bayesian Tensor Network.
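A minimal sketch of this idea for a binary classifier follows. It assumes an open-boundary MPS with unit boundary vectors and a cos/sin feature map, for which the output is linear in each core, so the derivative with respect to a core is the outer product of its left environment, the local feature vector and its right environment. The outer-product Hessian of the negative log posterior is then the sum of p(1 - p) g gᵀ over the data plus the prior precision. The names, the feature map and the Normal prior with standard deviation `prior_std` are illustrative assumptions.

```python
import numpy as np

def feature_map(x):
    # Assumed local kernel phi(x) = (cos(pi x / 2), sin(pi x / 2)).
    return np.stack([np.cos(np.pi * x / 2), np.sin(np.pi * x / 2)], axis=-1)

def logit_and_grad(cores, x):
    # Return the MPS output f(x) and its analytic gradient w.r.t. all core entries.
    phi = feature_map(np.asarray(x, dtype=float))            # shape (N, d)
    mats = [np.einsum("asb,s->ab", c, p) for c, p in zip(cores, phi)]
    e0 = np.eye(cores[0].shape[0])[0]                        # assumed boundary vector
    left = [e0]                                              # left[k] = e0^T M_0 ... M_{k-1}
    for m in mats[:-1]:
        left.append(left[-1] @ m)
    right = [e0]                                             # built from the last site backwards
    for m in reversed(mats[1:]):
        right.append(m @ right[-1])
    right = right[::-1]                                      # right[k] = M_{k+1} ... M_{N-1} e0
    f = left[-1] @ mats[-1] @ right[-1]
    grads = [np.einsum("a,s,b->asb", left[k], phi[k], right[k])
             for k in range(len(cores))]
    return f, np.concatenate([g.ravel() for g in grads])

def outer_product_hessian(cores, X, prior_std):
    # Sum over data of p (1 - p) g g^T plus the prior precision; no second derivatives.
    rows, weights = [], []
    for x in X:
        f, g = logit_and_grad(cores, x)
        p = 1.0 / (1.0 + np.exp(-f))
        rows.append(g)
        weights.append(p * (1.0 - p))
    J = np.array(rows)
    H = (J * np.array(weights)[:, None]).T @ J
    return H + np.eye(J.shape[1]) / prior_std ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cores = [rng.normal(0.0, 0.6, size=(3, 2, 3)) for _ in range(4)]
    H = outer_product_hessian(cores, rng.uniform(0.0, 1.0, size=(5, 4)), prior_std=2.0)
    print(H.shape, np.linalg.eigvalsh(H).min())
```

Because the approximation is a sum of weighted outer products plus a positive multiple of the identity, the resulting matrix is symmetric and positive definite, so it can be inverted to give a valid covariance matrix.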
4.4 Numerical Results
We study the performance of the Bayesian Tensor Networks on several data sets.
From small to large, we used the following data sets:
Synthetic Data Set: Two dimensional Gaussian Blobs with two classes.
Breast Cancer Wisconsin Data Set: A toy binary classification data set.
Phishing Website Data Set: A small binary classification data set.
MNIST Data Set: A standard multi-class classification data set in the computer vision community.
To study the Bayesian effects, we visualize the parameters of the Bayesian Tensor Network and the decision boundary on the two-dimensional synthetic data set.
We study the performance of the Bayesian Tensor Network with different standard deviations and different bond dimensions on the Breast Cancer Wisconsin, Phishing Website and MNIST data sets.
We train the Bayesian Tensor Networks and Bayesian Neural Networks on the synthetic blobs data set, which contains samples in two classes. We use relatively large tensor network and neural network models that overfit the data set in order to study the Bayesian shrinkage effect under different prior distributions in the Bayesian Tensor Network and the Bayesian Neural Network in Fig. 2. As we use a greater standard deviation in the prior Normal distribution, the decision boundary becomes smoother. From our numerical experiments, we find that the neural network is slightly more sensitive to the prior distribution than the tensor network.
In Fig. 3, we show the histograms of the parameters in the trained Bayesian Tensor Network and Bayesian Neural Network. We observe Bayesian shrinkage in the parameter histograms of both models. In the Bayesian Tensor Network, the shape of the parameter distribution is not heavily affected; instead, its standard deviation decreases as the prior standard deviation decreases. For the Bayesian Neural Network, the shrinkage effect is stronger and the parameter distribution becomes heavy-tailed, which means the parameters in the model become sparse.
Bond dimension is a key hyperparameter of the MPS which controls the 'description' ability of the model collura2019descriptive. In Fig. 4, we show the test accuracy of the Bayesian Tensor Network model with different bond dimensions on different data sets. We observe that as the bond dimension increases, the generalization ability of the model improves, namely the Bayesian Tensor Network model achieves better prediction accuracy.
We study the Bayesian framework for Tensor Networks and propose a robust initialization method. We use toy, small and standard data sets (Breast Cancer, Phishing Website and MNIST) to evaluate our initialization method and to study the performance of the Bayesian Tensor Network model. We observe Bayesian shrinkage in the histograms of the parameters and study the decision boundary of the Bayesian Tensor Network. We also explore the effect of the bond dimension in the Bayesian Tensor Network model. In practical applications, we expect our model to be most advantageous on small data sets, where the overfitting problem can be addressed by introducing prior information.
The authors wish to thank David Helmbold, Hongyun Wang, Qi Gong, Torsten Ehrhardt and Francois Monard for their helpful discussions.