The Bayesian Method of Tensor Networks

01/01/2021
by   Erdong Guo, et al.
University of California Santa Cruz

Bayesian learning is a powerful learning framework which combines the external information about the data (background information) with the internal information (training data) in a logically consistent way for inference and prediction. By Bayes' rule, the external information (prior distribution) and the internal information (training data likelihood) are combined coherently, and the resulting posterior distribution and posterior predictive (marginal) distribution summarize the total information needed for inference and prediction, respectively. In this paper, we study the Bayesian framework of the Tensor Network from two perspectives. First, we introduce a prior distribution on the weights of the Tensor Network and predict the labels of new observations with the posterior predictive (marginal) distribution. Because the parameter integral in the normalization constant is intractable, we approximate the posterior predictive distribution by the Laplace approximation and obtain the outer-product approximation of the Hessian matrix of the posterior distribution of the Tensor Network model. Second, to estimate the parameters at the stationary mode, we propose a stable initialization trick that accelerates inference, so that the Tensor Network converges to the stationary path more efficiently and stably under gradient descent. We verify our work on the MNIST, Phishing Website and Breast Cancer data sets. We study the Bayesian properties of the Bayesian Tensor Network by visualizing the parameters of the model and the decision boundaries on a two-dimensional synthetic data set. From an application perspective, our work can reduce overfitting and improve the performance of the normal Tensor Network model.


1 Introduction

Bayesian learning is a framework that combines the internal information in the data with background knowledge about the specific learning task, namely the external information, in a logically consistent way by Bayes' Theorem draper2006bayesian; draper2013bayesian. In the Bayesian framework, the parameters of the model are not fixed constants but random variables whose distributions should be introduced according to the background information. From an information perspective, the background information (the knowledge) external to the data set is ’injected’ into the learning model optimally by Bayes' Theorem zellner1988optimal.

In this framework, parameters are not estimated by a point estimator such as the Maximum Likelihood Estimator (M.L.E.); instead, the posterior distribution of the parameters is given. The technical difficulty in this step is the computation of the normalization constant. In some applications, the posterior distribution is approximated by a Dirac distribution at the optimal mode, which is equivalent to using the Maximum A Posteriori (M.A.P.) estimator as a point estimate of the model parameters nowlan1992simplifying. To get a better approximation of the posterior distribution, higher-order terms should be retained: keeping terms up to second order yields a Normal distribution, which is the Laplace approximation.

In our work, we propose a robust initialization method to infer the parameters of the Bayesian Tensor Network model more quickly. Because of the special structure of the Tensor Network, which is a sum of tensor chain products, the output of the model easily blows up or decays to zero, which makes the inference process unstable. With our initialization strategy, the parameters and their gradients stay in a stable region where gradient descent optimization behaves well. To predict new observations, we obtain the predictive marginal posterior distribution of the Bayesian Tensor Network model by the Laplace approximation, namely a second-order approximation. We observe that the Hessian matrix of the Tensor Network model has a nice analytical expression, which can be computed more efficiently than that of a Bayesian Neural Network. We hope that the Bayesian Tensor Network can be scaled to much larger, enterprise-level applications.

In practical classification problems, labels must be assigned to the new observations. To determine the decision boundary, the utility matrix (negative loss matrix) is written down, and the decision threshold is then determined by minimizing the expected loss.

2 Related Work

Deep Neural Networks have made great achievements in recent years lecun2015deep; hinton2006fast; lecun1995convolutional; krizhevsky2012imagenet; hochreiter1997long; srivastava2014dropout. Several works have been proposed to develop a Bayesian framework for Neural Networks. In this framework, the key and also most difficult part is to approximate the posterior distribution and the predictive marginal distribution.

In his work on hierarchical modeling for model uncertainty, Prof. David Draper, a co-author of this paper, proposed a general framework to account for model uncertainty and discussed the computational techniques for the inference and prediction steps in Bayesian modeling under model uncertainty draper1995assessment. At the same time, other early pioneers of Bayesian Neural Networks also contributed greatly to this area. David MacKay focused on Bayesian Neural Networks as adaptive learning models: he used the Bayesian framework for model comparison and for the selection of hyperparameters and model structures through the evidence framework mackay1992interpolation. Radford Neal focused on constructing appropriate priors for Neural Networks based on their properties. Neal suggested considering networks with infinitely many hidden units, for which the priors converge to stochastic processes. Neal also proposed using Hybrid Monte Carlo to simulate the posterior predictive distribution (integral) neal2012bayesian. Instead of approximating the posterior integral, the variational approach, namely searching a parameterized function space by maximizing the Evidence Lower Bound, has been widely explored blei2006variational; kingma2013auto. Recently, many works have followed and extended the above ideas buntine1991bayesian; tran2019bayesian; lee2017deep; arora2019exact; arora2019harnessing; blundell2015weight; xiong2011bayesian; salakhutdinov2008bayesian; balan2015bayesian; hernandez2015probabilistic.

Tensor Networks were originally proposed to describe quantum many-body states orus2019tensor; chabuda2020tensor; glasser2020probabilistic. In the last several years, several works have used Matrix Product States (MPS) to construct new learning models novikov2015tensorizing; stoudenmire2016supervised; han2018unsupervised.

3 Initialization of Tensor Network

3.1 Background and Set Up

Although the Tensor Network has powerful representation ability because of the bond dimension between every pair of neighboring tensor nodes, it is difficult to train long Matrix Product State chains. Since every tensor node in the chain contributes one multiplicative factor, the output increases or decreases roughly exponentially with the number of nodes.

The setup of the Tensor Network is as follows,

(1)

where

We can get a rough estimation of the amplitude of as

(2)

where

With the assumption that , and , where represents the bond dimension, represents the dimension of the kernel space and represents the number of the nodes.

The network will work poorly, or not at all, if the tensor nodes are not well initialized, since a small perturbation of the tensor weights is amplified exponentially with the number of nodes.
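The exponential sensitivity described above can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes an open chain of i.i.d. Gaussian tensors, a unit-norm local feature map (cos, sin) and a uniform boundary vector, all of which are illustrative choices. The function name `log10_chain_norm` is hypothetical. The norm is accumulated in log space so that the sketch itself does not overflow.

```python
import numpy as np

def feature_map(x):
    # Assumed local feature map (cos, sin); it has unit norm for every x.
    return np.array([np.cos(np.pi * x / 2.0), np.sin(np.pi * x / 2.0)])

def log10_chain_norm(n_nodes, bond_dim, std, seed=0):
    """Return log10 of the norm of a random tensor chain contracted with random inputs."""
    rng = np.random.default_rng(seed)
    v = np.ones(bond_dim) / np.sqrt(bond_dim)      # hypothetical left boundary vector
    log10_norm = 0.0
    for _ in range(n_nodes):
        A = rng.normal(0.0, std, size=(bond_dim, 2, bond_dim))  # one tensor node
        phi = feature_map(rng.uniform(0.0, 1.0))                  # one input feature
        v = v @ np.einsum("asb,s->ab", A, phi)     # contract the physical index
        norm = np.linalg.norm(v)
        log10_norm += np.log10(norm)               # accumulate in log space
        v = v / norm                               # renormalize to avoid overflow
    return log10_norm

for std in (0.1, 1.0):
    for n in (50, 100, 200):
        value = log10_chain_norm(n, bond_dim=8, std=std)
        print(f"std={std:4.1f}  nodes={n:3d}  log10(norm) = {value:8.1f}")
```

In this toy setting the logarithm of the norm grows or shrinks roughly linearly with the number of nodes, which is the exponential behavior that motivates a careful choice of the initialization variance.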

To solve the problem of exploding or vanishing responses and gradients, we focus on initializing the parameters with an appropriate distribution. To train deep neural networks, Glorot and Bengio glorot2010understanding proposed initializing the parameters with a scaled uniform distribution in the Sigmoid neuron case. For the ReLU neuron, He and his collaborators he2015delving proposed using a Gaussian distribution whose variance is determined by the number of neurons. In the following section, we derive the variance of the Gaussian distribution used in our initialization method.

3.2 Variance Analysis

We use a Gaussian distribution to initialize the Bayesian Tensor Network model and set the mean to zero, since we want the response to stay around zero. For the variance, we need to analyze the variance of the response as a function of the variance of the tensor weights.

(3)
(4)

In the above derivation, we assume that all the components of the tensor weights are independent and identically distributed (I.I.D.) random variables. For the kernel part, we assume that all the components of the input kernel are also I.I.D.

Considering the particular kernel function , we have

(5)

From the gradient perspective, we can write down the gradient with respect to a tensor node as follows.

(6)

We get the variance of the gradients as

(7)

In a good initialization method, the variance of the output should be of the same order as the variance of the tensor nodes, which avoids increasing or decreasing the response exponentially, so we get

(8)

From the gradient perspective, the variance of the gradients should be of the same order as the variance of the tensor weights, so we obtain

(9)

We analyze the asymptotic behavior of our initialization formula. From the tensor weights perspective,

(10)

From the gradient perspective,

(11)

In this limit, the weight perspective and the gradient perspective give the same initialization variance for the tensor weights, which means our initialization method is logically consistent.
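The idea of keeping the per-node scale factor close to one can be checked with a small experiment. The sketch below does not reproduce the initialization formula of this section, which is not restated here; it uses a simple variance-preserving heuristic under the same I.I.D. assumptions, namely choosing the standard deviation so that `std**2 * bond_dim * ||phi||^2 = 1`. The unit-norm (cos, sin) feature map, the boundary vector and the names `chain_log10_norm` and `balanced_std` are assumptions made for illustration.

```python
import numpy as np

def chain_log10_norm(n_nodes, bond_dim, std, seed=0):
    """log10 of the norm of a random tensor chain, accumulated in log space."""
    rng = np.random.default_rng(seed)
    v = np.ones(bond_dim) / np.sqrt(bond_dim)      # hypothetical boundary vector
    total = 0.0
    for _ in range(n_nodes):
        A = rng.normal(0.0, std, size=(bond_dim, 2, bond_dim))
        x = rng.uniform(0.0, 1.0)
        phi = np.array([np.cos(np.pi * x / 2.0), np.sin(np.pi * x / 2.0)])  # unit norm
        v = v @ np.einsum("asb,s->ab", A, phi)
        norm = np.linalg.norm(v)
        total += np.log10(norm)
        v = v / norm
    return total

def balanced_std(bond_dim, feature_sq_norm=1.0):
    # Each node multiplies the expected squared norm by std**2 * bond_dim * ||phi||^2;
    # this heuristic sets that factor to one.
    return 1.0 / np.sqrt(bond_dim * feature_sq_norm)

bond_dim, n_nodes = 16, 100
base = balanced_std(bond_dim)
for scale in (0.5, 1.0, 2.0):
    value = chain_log10_norm(n_nodes, bond_dim, scale * base)
    print(f"std = {scale:3.1f} x balanced  ->  log10(norm) = {value:7.1f}")
```

With the balanced value the norm stays within a few orders of magnitude over 100 nodes, while halving or doubling the standard deviation shifts it by roughly 30 orders of magnitude, which matches the sensitivity to perturbed standard deviations discussed in the numerical results.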

Unlike Xavier initialization glorot2010understanding and He initialization he2015delving, our initialization formula does not depend heavily on the number of nodes: the number of nodes enters only the exponent of the variance of each feature of the data, as a fractional factor that converges quickly. However, the bond dimension and the physical dimension, namely the dimension of the kernel, have a large effect on the initialization.

We analyze the mean of the parameters following the same idea,

(12)
(13)

In most practical cases, the training data set is pre-processed and its mean is shifted to zero. So we have

(14)

then we get

(15)

Similarly, we can get

(16)

This means that the mean of the responses and of the gradients always stays in a healthy region, no matter what the initialization distribution is. So we do not need to consider the mean when designing the initialization method.

3.3 Numerical Results

In this section, we study the performance of our formula on the MNIST lecun1998mnist, Phishing Website and Breast Cancer data sets Dua:2019, and we compare our initialization method with the Xavier and He initialization methods.

Figure 1: Panels (a), (c) and (e) compare the accuracy curves obtained with different initialization methods on the MNIST, Phishing and Breast Cancer data sets, respectively. Panels (b), (d) and (f) test the convergence of the Bayesian Tensor Network model when the standard deviation given by our formula is perturbed.

In Fig. 1, we compare the accuracy on the MNIST data set over the first epochs for our initialization method, the Xavier initialization method and the He initialization method. Our initialization method converges much more quickly than the other two on the MNIST data set.

In Fig. 1(a), Fig. 1(c) and Fig. 1(e), we compare the convergence of the Bayesian Tensor Network with our initialization method, Xavier initialization and He initialization on the MNIST, Phishing and Breast Cancer data sets. On all three data sets, our initialization method works much better than the other two methods. Because the Breast Cancer data set has only a small number of features, the training of the Bayesian Tensor Network on it is less sensitive to the initialization method than on the Phishing or MNIST data sets.

In Fig. 1(b), Fig. 1(d) and Fig. 1(f), we show the accuracy curves obtained with the standard deviation given by our formula and with slightly scaled standard deviations. On data sets with more features, the training process is more sensitive to the initialization, and a small deviation from the standard deviation given by our formula leads to poor convergence.

4 Bayesian Framework For Tensor Networks

4.1 The General Framework

We write down our Bayesian Tensor Networks model as follows,

(17)

In the inference step, we write down the posterior distribution of the parameters , namely as

(18)

If we use the Maximum A Posteriori (M.A.P.) estimator to estimate the parameters at the optimal mode of the model, which is the same as in the normal Tensor Network model, we get

(19)

In our work, we focus on the full Bayesian analysis. We make predictions and decisions based on the predictive posterior marginal distribution.

(20)

However, since the predictive posterior marginal distribution is analytically intractable, we approximate it around the M.A.P. mode of the posterior distribution up to second order. We note that if we approximate the posterior distribution roughly by the Dirac distribution at the M.A.P. mode, we get

(21)

Obviously, information contained in the data set is lost in the point approximation.

According to the predictive posterior marginal distribution obtained in the prediction step, we assign a label to every new observation in the test set, which is a decision problem. To minimize the misclassification rate in a classification problem, the decision boundary is determined as

(22)

For practical cases where the utility matrix (negative loss matrix) is specially designed, we can also maximize the expected utility to determine the action. For the regression problem, we focus on the mean square loss; by minimizing the expected loss, it can be shown that the optimal decision is the conditional expectation of the label given the features, namely E[y | x], which is just the value of the regression function at x.
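The decision step can be sketched in a few lines. The example below is illustrative only: the function name `bayes_decision`, the predictive probabilities and the asymmetric loss matrix are made-up values, not results from our experiments. It shows that the 0-1 loss recovers the most probable class, while a specially designed loss matrix shifts the decision.

```python
import numpy as np

def bayes_decision(class_probs, loss_matrix):
    """Pick the action with the smallest posterior expected loss.

    class_probs: shape (n_samples, n_classes), predictive probabilities.
    loss_matrix: shape (n_classes, n_actions); loss_matrix[k, a] is the loss
                 of taking action a when the true class is k.
    """
    expected_loss = class_probs @ loss_matrix
    return np.argmin(expected_loss, axis=1)

probs = np.array([[0.7, 0.3],
                  [0.4, 0.6]])                     # hypothetical predictive probabilities
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])                  # misclassification-rate loss
asymmetric = np.array([[0.0, 1.0],
                       [5.0, 0.0]])                # missing class 1 costs five times more

print(bayes_decision(probs, zero_one))    # → [0 1]
print(bayes_decision(probs, asymmetric))  # → [1 1]
```

Under the 0-1 loss the rule reduces to choosing the class with the largest predictive probability; under the asymmetric loss the first sample is assigned to class 1 even though its probability is only 0.3.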

4.2 Classification

In our convention, the data set in the classification problem is denoted as

(23)

In our setup, the label is encoded as a one-hot vector that indicates which category the data point belongs to.

For the binary classification problem, the response of the Bayesian Tensor Network model is the logits, namely

(24)

If we treat every component of the encoded vector independently and model each component with the above logit formula, the model can easily be extended to the multi-class case.

In the multi-class case, we use the Softmax activation function and we get

(25)

For the binary case, we can write down the cost function as

(26)

For the multi-class case, we have

(27)
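As a concrete illustration of such a cost function, the sketch below computes a Softmax cross-entropy plus the penalty that a zero-mean Gaussian prior contributes to the negative log posterior, up to an additive constant. It is a generic form stated under the assumption of an isotropic Normal prior with standard deviation `prior_std`; the function names are hypothetical and the exact normalization may differ from the equations above.

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def negative_log_posterior(logits, onehot, weights, prior_std):
    """Cross-entropy (negative log-likelihood) plus Gaussian prior term, up to a constant."""
    nll = -np.sum(onehot * log_softmax(logits))
    neg_log_prior = np.sum(weights ** 2) / (2.0 * prior_std ** 2)
    return nll + neg_log_prior

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])               # made-up model outputs
onehot = np.eye(3)[[0, 2]]                         # true classes 0 and 2
weights = np.array([0.3, -0.2, 0.1])               # made-up flattened parameters
print(negative_log_posterior(logits, onehot, weights, prior_std=1.0))
```

A smaller `prior_std` increases the weight of the penalty term, which is the mechanism behind the shrinkage effect studied in the numerical results.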

In the Bayesian Tensor Network, no inference (training) would be needed if we could solve the intractable posterior marginal distribution, as long as the prior distribution is wisely introduced according to the background knowledge. However, to expand the posterior distribution around the M.A.P. mode, we first need to find that mode. We write down the posterior distribution and use the stochastic gradient optimization method to reach the M.A.P. mode. In practice, our objective function is the negative log posterior distribution. Around the optimal mode, we get the Normal distribution as

(28)

where

The Hessian matrix contains the geometric information (curvature) of the posterior distribution, so the Normal approximation extracts more information and is a better approximation than the Dirac distribution.

The covariance matrix of the Normal distribution is
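The two steps of the Laplace approximation, finding the M.A.P. mode by gradient descent and then using the Hessian of the negative log posterior at that mode, can be sketched on a small stand-in model. The example below uses Bayesian logistic regression with a zero-mean Normal prior instead of a Tensor Network, purely to keep the code short; the synthetic data, learning rate and function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_logistic(X, y, prior_std, lr=0.01, n_steps=5000):
    """Return the M.A.P. weights and the Laplace covariance for logistic regression."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) + w / prior_std ** 2   # gradient of the negative log posterior
        w = w - lr * grad                           # plain gradient descent step
    p = sigmoid(X @ w)
    # Hessian of the negative log posterior at the mode.
    hessian = (X * (p * (1.0 - p))[:, None]).T @ X + np.eye(d) / prior_std ** 2
    return w, np.linalg.inv(hessian)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # synthetic features
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)

w_map, cov = laplace_logistic(X, y, prior_std=1.0)
print("M.A.P. weights:", w_map)
print("Laplace covariance:\n", cov)
```

The resulting Normal distribution is centered at `w_map` with covariance `cov`; the diagonal of `cov` gives a per-parameter uncertainty that a point estimate alone does not provide.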

(29)

Here the covariance involves the second-derivative matrix of the log-likelihood function. Computing and inverting the full Hessian matrix is time-consuming, so we use the outer-product approximation to reduce the time complexity.

Since the prior distribution and the approximated posterior distribution are both Normal distributions, we can get the predictive marginal posterior distribution analytically as

(30)
(31)

By plugging in the approximated posterior distribution, we get

(32)

where
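For binary classification, one standard way to evaluate such a predictive integral under a Gaussian posterior is the probit-based approximation, in which the logit mean is rescaled by a factor depending on the logit variance. The sketch below shows that approximation for a logit that is linear in the weights; it is a textbook form and may differ in detail from the expression above. For a Tensor Network the logit is not linear in the weights, so `x` would be replaced by the gradient of the logit at the M.A.P. mode. The function name and the numbers are hypothetical.

```python
import numpy as np

def laplace_predictive(x, w_map, cov):
    """Approximate p(y = 1 | x, data) under a Gaussian posterior N(w_map, cov)."""
    mu_a = x @ w_map                    # mean of the logit
    var_a = x @ cov @ x                 # variance of the logit
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))

x = np.array([1.0, 0.5])                # made-up input
w_map = np.array([2.0, -1.0])           # made-up M.A.P. weights
print(laplace_predictive(x, w_map, np.zeros((2, 2))))   # point (Dirac) approximation
print(laplace_predictive(x, w_map, 4.0 * np.eye(2)))    # with posterior uncertainty
```

With zero covariance the result equals the plug-in sigmoid of the M.A.P. logit; with a broad posterior the predictive probability moves toward 0.5, reflecting the extra uncertainty.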

4.3 Hessian Matrix

The approximation of the Hessian matrix has been widely studied. The idea of the outer-product approximation is that in a trained network the label and the output are close to each other, so the term containing the second-derivative matrix is very small and can be ignored. We get the outer-product approximation of the Hessian matrix of the Bayesian Tensor Network model as

(33)

Our result above contains only first derivatives, which keeps the time complexity low. The formula is written in terms of the components of the output of the Bayesian Tensor Network. Unlike in Neural Networks, the first derivative of the logits can be calculated analytically, so we obtain an analytical result for the Hessian matrix of the Bayesian Tensor Network.
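To make the role of the analytic first derivatives concrete, the sketch below builds an outer-product Hessian for a toy two-site tensor chain with a binary logit. The model `a(x) = sum_k (U[k] . phi1)(V[k] . phi2)`, the weighting `p(1 - p)` for the sigmoid likelihood and the isotropic prior precision are assumptions chosen for illustration; the exact form may differ from the expression above. All function names are hypothetical.

```python
import numpy as np

def logits_and_jacobian(U, V, phi1, phi2):
    """Two-site chain: a(x) = sum_k (U[k] . phi1) * (V[k] . phi2).

    Returns the logit of each sample and its analytic gradient with respect
    to (U, V), flattened to shape (n_samples, n_params).
    """
    left = phi1 @ U.T                               # (n, D): U[k] . phi1
    right = phi2 @ V.T                              # (n, D): V[k] . phi2
    a = np.sum(left * right, axis=1)
    dU = right[:, :, None] * phi1[:, None, :]       # (n, D, d): d a / d U[k, s]
    dV = left[:, :, None] * phi2[:, None, :]        # (n, D, d): d a / d V[k, s]
    n = phi1.shape[0]
    J = np.concatenate([dU.reshape(n, -1), dV.reshape(n, -1)], axis=1)
    return a, J

def outer_product_hessian(U, V, phi1, phi2, prior_std):
    a, J = logits_and_jacobian(U, V, phi1, phi2)
    p = 1.0 / (1.0 + np.exp(-a))
    H = (J * (p * (1.0 - p))[:, None]).T @ J        # sum_n p_n (1 - p_n) g_n g_n^T
    return H + np.eye(J.shape[1]) / prior_std ** 2  # add the prior precision

rng = np.random.default_rng(1)
U, V = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))        # bond dim 3, physical dim 2
phi1, phi2 = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))  # 5 made-up samples
H = outer_product_hessian(U, V, phi1, phi2, prior_std=2.0)
print(H.shape)
```

Only first derivatives of the logit appear, and for a tensor chain they are products of the remaining tensors contracted with the inputs, so no automatic differentiation or second-derivative computation is needed in this sketch.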

4.4 Numerical Results

We study the performance of the Bayesian Tensor Network on several data sets. From small to large, we use the following data sets:

  • Synthetic Data Set: Two dimensional Gaussian Blobs with two classes.

  • Breast Cancer Wisconsin Data Set: A toy binary classification data set.

  • Phishing Website Data Set: A small binary classification data set.

  • MNIST Data Set: A standard multi-class classification data set in the computer vision community.

To study the Bayesian effects, we visualize the parameters of the Bayesian Tensor Network and the decision boundary on the two-dimensional synthetic data set.

We study the performance of the Bayesian Tensor Network with different standard deviations and different bond dimensions on the Breast Cancer Wisconsin, Phishing Website and MNIST data sets.

Figure 2: Decision boundaries on the synthetic data set. Panels (a), (b) and (c) show the normal MPS model and the Bayesian Tensor Network with two regularization settings. Panels (d), (e) and (f) show the normal Neural Network and the Bayesian Neural Network with two regularization settings.

We train the Bayesian Tensor Networks and Bayesian Neural Networks on the synthetic blobs data set, which contains samples in two classes. We use relatively large tensor network and neural network models that overfit the data set in order to study the Bayes shrinkage effect under different prior distributions in the Bayesian Tensor Network and the Bayesian Neural Network (Fig. 2). As we use greater standard deviation in the prior Normal distribution, the decision boundary becomes smoother. From our numerical experiments, we find that the neural network is slightly more sensitive to the prior distribution than the tensor network.

Figure 3: Histograms of the trained parameters. Panels (a), (b) and (c) show the normal Tensor Network and the Bayesian Tensor Network with two regularization settings. Panels (d), (e) and (f) show the normal Neural Network and the Bayesian Neural Network with two regularization settings.

In Fig. 3, we show histograms of the parameters in the trained Bayesian Tensor Network and Bayesian Neural Network. We observe Bayesian shrinkage in the parameter histograms of both models. In the Bayesian Tensor Network, the shape of the parameter distribution is not heavily affected; instead, its standard deviation decreases as the prior standard deviation decreases. For the Bayesian Neural Network, the Bayesian shrinkage effect is stronger and the parameter distribution becomes heavy-tailed, which means the parameters in the model become sparse.
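The shrinkage mechanism itself can be seen in a model where the M.A.P. estimate has a closed form. The sketch below uses linear regression with a zero-mean Normal prior (ridge regression) as a stand-in, not a Tensor Network or Neural Network; the random data and the function name `map_weights` are made up for illustration.

```python
import numpy as np

def map_weights(X, y, prior_std, noise_std=1.0):
    """M.A.P. weights of linear regression with a zero-mean Gaussian prior."""
    lam = (noise_std / prior_std) ** 2              # effective L2 regularization strength
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))                       # few samples, many parameters
y = rng.normal(size=30)

for prior_std in (10.0, 1.0, 0.1):
    w = map_weights(X, y, prior_std)
    print(f"prior std {prior_std:5.1f}: weight norm {np.linalg.norm(w):.4f}")
```

As the prior standard deviation decreases, the norm of the M.A.P. weights decreases, which is the same qualitative effect as the narrowing parameter histograms described above.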

Figure 4: Test accuracy of the Bayesian Tensor Network with different bond dimensions on (a) the MNIST, (b) the Phishing and (c) the Breast Cancer Wisconsin data sets.

The bond dimension is a key hyperparameter of the MPS which controls the ’description’ ability of the model collura2019descriptive. In Fig. 4, we show the test accuracy of the Bayesian Tensor Network model with different bond dimensions on different data sets. We observe that as the bond dimension increases, the generalization ability of the model improves, namely the Bayesian Tensor Network model achieves better prediction accuracy.

5 Conclusion

We study the Bayesian framework of the Tensor Network and propose a robust initialization method. We use a toy, a small and a standard data set (Breast Cancer, Phishing Website and MNIST, respectively) to evaluate our initialization method and to study the performance of the Bayesian Tensor Network model. We observe Bayesian shrinkage in the parameter histograms and study the decision boundary of the Bayesian Tensor Network. We also explore the effect of the bond dimension in the Bayesian Tensor Network model. In practical applications, we expect our model to be most useful on small data sets, where the overfitting problem can be addressed by introducing prior information.

Acknowledgements

The authors wish to thank David Helmbold, Hongyun Wang, Qi Gong, Torsten Ehrhardt and Francois Monard for their helpful discussions.

References