## 1 Introduction

### 1.1 Algorithms for Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) [16, 4] has been applied to text mining [21], signal processing [14], bioinformatics [11], consumer analysis [12], and recommender systems [3]. NMF is used to discover latent structure in data and to predict unknown structure in the real world; however, the method suffers from many local minima and seldom reaches the global minimum. In addition, the results of numerical experiments depend strongly on the initial values; a rigorous method has yet to be established.

In order to resolve this difficulty, Bayesian inference for NMF has been established [4]. For the numerical calculation of the Bayesian posterior distribution, it uses Gibbs sampling, a kind of Markov chain Monte Carlo (MCMC) method. Bayesian NMF is more robust than the usual recursive NMF methods since it numerically realizes the posterior distribution; the parameters are subject to a probability distribution, which makes it possible to grasp the degree of fluctuation of the learning/inference result. As described later, Bayesian inference generally has higher estimation accuracy than maximum likelihood and maximum a posteriori estimation when the model has hierarchical structure or hidden variables, as NMF does.
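To make the Gibbs-sampling approach concrete, the following sketch (our illustration, not the algorithm of [4] verbatim) implements one common augmentation scheme for Bayesian Poisson NMF: each observed count is allocated among the latent components by a multinomial draw, after which the gamma priors are conjugate and the factor matrices can be resampled in closed form. The function name, the hyperparameters `a`, `b`, and all implementation details are assumptions for illustration.

```python
import numpy as np

def gibbs_poisson_nmf(X, K, a=1.0, b=1.0, n_iter=200, rng=None):
    """Sketch of a Gibbs sampler for Bayesian Poisson NMF with Gamma(a, b)
    priors, via latent-source augmentation (a, b and all names illustrative)."""
    rng = np.random.default_rng(rng)
    I, J = X.shape
    U = rng.gamma(a, 1.0 / b, size=(I, K))   # factor matrices, entrywise gamma
    V = rng.gamma(a, 1.0 / b, size=(K, J))
    samples = []
    for _ in range(n_iter):
        # Allocate each count X[i, j] among the K components (multinomial step).
        P = U[:, None, :] * V.T[None, :, :]          # (I, J, K) intensities
        P /= P.sum(axis=2, keepdims=True)
        S = np.zeros((I, J, K))
        for i in range(I):
            for j in range(J):
                S[i, j] = rng.multinomial(X[i, j], P[i, j])
        # Conjugate gamma updates for the factors given the allocations.
        U = rng.gamma(a + S.sum(axis=1), 1.0 / (b + V.sum(axis=1)))
        V = rng.gamma(a + S.sum(axis=0).T, 1.0 / (b + U.sum(axis=0))[:, None])
        samples.append((U.copy(), V.copy()))
    return samples
```

The retained samples realize the posterior numerically; posterior means and fluctuations of any functional of (U, V) can then be read off from them.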

On the other hand, a variational Bayesian (VB) algorithm for NMF, inspired by the mean field approximation, has also been established [4]. The variational Bayesian NMF algorithm (VBNMF) is likewise more numerically stable than the usual recursive algorithms, since VB approximates the Bayesian posterior distribution. Moreover, VBNMF computes faster than exact Bayesian inference such as MCMC. However, its free energy (called the variational free energy) is larger than the Bayesian free energy, since VB maximizes the evidence lower bound (ELBO), which is not the true model evidence. Note that the marginal likelihood is also called the model evidence, and the negative log ELBO is equal to the variational free energy. It is therefore important to clarify the approximation accuracy of the variational free energy, from both theoretical and practical points of view.
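For contrast with the sampling approach, here is a minimal mean-field VB sketch for the same Poisson NMF model (again our own illustration, not the updates of [4] verbatim): the factorized posteriors over the two factor matrices remain entrywise gamma-distributed, and their shape/rate parameters are iterated in closed form. A small digamma helper is included so the sketch is self-contained.

```python
import numpy as np

def _digamma(x):
    # Digamma via recurrence plus asymptotic series (accurate enough here).
    x = np.array(x, dtype=float)
    res = np.zeros_like(x)
    while np.any(x < 6.0):
        mask = x < 6.0
        res[mask] -= 1.0 / x[mask]
        x[mask] += 1.0
    inv = 1.0 / x
    return res + np.log(x) - 0.5 * inv - inv**2 * (
        1.0 / 12 - inv**2 * (1.0 / 120 - inv**2 / 252))

def vb_poisson_nmf(X, K, a=1.0, b=1.0, n_iter=100, rng=None):
    """Sketch of mean-field VB for Poisson NMF with Gamma(a, b) priors:
    q(U) and q(V) stay entrywise gamma with the shape/rate parameters below."""
    rng = np.random.default_rng(rng)
    I, J = X.shape
    aU = a + rng.random((I, K)); bU = np.full((I, K), b)
    aV = a + rng.random((K, J)); bV = np.full((K, J), b)
    for _ in range(n_iter):
        # Responsibilities use geometric means exp(E[log U]), exp(E[log V]).
        gU = np.exp(_digamma(aU)) / bU
        gV = np.exp(_digamma(aV)) / bV
        R = gU[:, None, :] * gV.T[None, :, :]        # (I, J, K)
        R /= R.sum(axis=2, keepdims=True)
        S = X[:, :, None] * R                        # expected latent counts
        aU = a + S.sum(axis=1); bU = b + (aV / bV).sum(axis=1)[None, :]
        aV = a + S.sum(axis=0).T; bV = b + (aU / bU).sum(axis=0)[:, None]
    return aU / bU, aV / bV                          # posterior means
```

Each sweep costs a few matrix operations and no sampling, which is why VB is typically much faster than MCMC; the price is the factorization assumption discussed below.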

### 1.2 Learning Theory of Bayesian and Variational Inference

A statistical model is called regular if the map from the parameter set to the set of probability density functions is one-to-one and if the likelihood function can be approximated by a Gaussian function. It is proved that, if a statistical model is regular and if the true distribution is realizable by the statistical model, then the generalization error is asymptotically equal to $d/(2n)$, where $d$, $n$, and the generalization error are the dimension of the parameter, the sample size, and the expected Kullback-Leibler divergence between the true distribution and the estimated learning machine, respectively. Moreover, the negative log marginal likelihood, or free energy, asymptotically behaves as $(d/2)\log n$ up to a constant-order term. However, the statistical model used in NMF is not regular, because the map from a parameter to a probability density function is not injective. As a result, its generalization error and free energy had been unknown.

There are many non-regular statistical models in machine learning; reduced rank regressions, normal mixture models, neural networks, hidden Markov models, and Boltzmann machines are such examples. From the theoretical point of view, the free energy and the generalization error of a non-regular learning machine in Bayesian learning were proved to be asymptotically equal to $\lambda \log n$ (up to the empirical-entropy term) and $\lambda/n$, where $\lambda$ is the real log canonical threshold (RLCT), respectively [18, 19]. Moreover, in non-regular cases, $\lambda \leq d/2$ holds, and $\lambda$ is also much smaller than the learning coefficients of maximum likelihood/posterior methods [20]. The RLCTs of several learning machines have been clarified; in fact, for reduced rank regressions [1], normal mixture models [22], and hidden Markov models [23], they were clarified by using resolution of singularities. A statistical model selection method, sBIC, which uses RLCTs has also been proposed [6].

On the other hand, for several statistical models, the variational free energy was proved to asymptotically equal $\overline{\lambda}\log n$ (up to the empirical-entropy term), where $\overline{\lambda}$ is a learning coefficient that depends on the model. Normal mixture models [17], hidden Markov models [10], and NMF [13] are such examples. In general, the learning coefficient of VB may not be equal to the RLCT but becomes an upper bound of it: $\overline{\lambda} \geq \lambda$, since the variational free energy is larger than the usual free energy even as the sample size diverges to infinity. Unfortunately, the variational generalization error is not asymptotically equal to $\overline{\lambda}/n$.

VBNMF has been devised [4], and the exact learning coefficient of VBNMF has been derived [13]. Nevertheless, the accuracy of the variational approximation has not been clarified, since the RLCT of NMF has been unknown. If the prior distribution is a strictly positive and bounded analytic function on the domain, an upper bound of the RLCT of NMF has been proved [9, 8]. If the non-negativity restriction is not imposed on the matrix factorization, the exact value of the RLCT has been clarified, as for reduced rank regression models [1]. However, the RLCT has been unknown in the case that the prior consists of gamma distributions, which may take the value zero.

In this paper, we theoretically derive a lower bound of the difference between the variational learning coefficient and the RLCT of NMF, by which we can derive a lower bound of the approximation error of VBNMF relative to Bayesian NMF.

The rest of this paper is organized as follows. In the second section, we briefly explain Bayesian inference and the variational Bayesian algorithm. In the third section, we present the main theorems and their proofs. In the fourth section, we conclude this paper.

## 2 Inference Frameworks

In this section, we explain the frameworks of Bayesian inference and the VB algorithm.

### 2.1 Framework of Bayesian Inference

First, we explain the general theory of Bayesian inference.

Let $q(x)$ and $p(x|w)$ be probability density functions on a finite dimensional real Euclidean space, where $w$ is a parameter. In learning theory, $q(x)$ and $p(x|w)$ represent a true distribution and a learning machine with $w$, respectively. A probability density function $\varphi(w)$ on a set of parameters $W$ is called a prior. Usually, the prior has parameters of its own, called hyperparameters. Let

$$X^n = (X_1, \ldots, X_n)$$

be a set of random variables that are independently subject to $q(x)$, where $n$ and $X^n$ are referred to as the sample size and the training data. The posterior of $w$ is defined by

$$p(w|X^n) = \frac{1}{Z_n}\varphi(w)\prod_{i=1}^{n}p(X_i|w),$$

where $Z_n$ is the normalizing constant determined by the condition $\int p(w|X^n)\,dw = 1$. $Z_n$ is called the marginal likelihood, evidence, or partition function, and it is also a probability density function of the training data: $Z_n = Z_n(X^n)$. The Bayesian predictive distribution is also defined by

$$p(x|X^n) = \int p(x|w)\,p(w|X^n)\,dw.$$

The negative log marginal likelihood, or free energy, is defined by

$$F_n = -\log Z_n.$$

The generalization error $G_n$ is also defined by the Kullback-Leibler divergence between the true distribution $q(x)$ and the predictive one $p(x|X^n)$:

$$G_n = \int q(x)\log\frac{q(x)}{p(x|X^n)}\,dx.$$

Note that $F_n$ and $G_n$ are functions of the training data $X^n$; hence they are also random variables. Their expected values over all training data are called the expected free energy and the expected generalization error, respectively, where the expectation $\mathbb{E}[\cdot]$ is taken with respect to the true distribution of the training data

$$q(X^n) = \prod_{i=1}^{n}q(X_i)$$

and defined by

$$\mathbb{E}[F_n] = \int q(X^n)\,F_n\,dX^n, \quad \mathbb{E}[G_n] = \int q(X^n)\,G_n\,dX^n.$$
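These definitions can be checked in a small conjugate case. The sketch below (our own illustration; the Beta-Bernoulli model is not part of this paper) computes the free energy $F_n = -\log Z_n$ exactly, and the generalization error of the Bayesian predictive distribution, for a Bernoulli model with a Beta prior; all names and hyperparameter defaults are assumptions.

```python
import math

def free_energy(xs, alpha=1.0, beta=1.0):
    """Exact F_n = -log Z_n for a Bernoulli model with a Beta(alpha, beta)
    prior, a conjugate case where the marginal likelihood is closed-form."""
    n, k = len(xs), sum(xs)
    return -(math.lgamma(alpha + k) + math.lgamma(beta + n - k)
             + math.lgamma(alpha + beta)
             - math.lgamma(alpha + beta + n)
             - math.lgamma(alpha) - math.lgamma(beta))

def generalization_error(xs, q1, alpha=1.0, beta=1.0):
    """G_n = KL(q || predictive) when the true distribution is Bernoulli(q1)."""
    n, k = len(xs), sum(xs)
    p1 = (alpha + k) / (alpha + beta + n)   # Bayesian predictive probability
    return (q1 * math.log(q1 / p1)
            + (1 - q1) * math.log((1 - q1) / (1 - p1)))
```

For instance, with training data $(1, 0)$ and the uniform prior, $Z_n = B(2,2)/B(1,1) = 1/6$, so $F_n = \log 6$; and the predictive probability $1/2$ makes $G_n = 0$ when the true distribution is Bernoulli$(1/2)$.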

Assume there exists a parameter $w_0$ that satisfies $q(x) = p(x|w_0)$. By the singular learning theory [18, 19, 20], it was proven that

$$F_n = nS_n + \lambda\log n - (m-1)\log\log n + O_p(1), \tag{1}$$

$$\mathbb{E}[G_n] = \frac{\lambda}{n} - \frac{m-1}{n\log n} + o\!\left(\frac{1}{n\log n}\right) \tag{2}$$

hold when $n$ tends to infinity, even if the posterior distribution cannot be approximated by any normal distribution, where

$$S_n = -\frac{1}{n}\sum_{i=1}^{n}\log q(X_i)$$

is the empirical entropy. The constant $\lambda$ is the real log canonical threshold (RLCT), which is an important birational invariant in algebraic geometry. The constant $m$ is called the multiplicity and is also a birational invariant. From the mathematical point of view, the RLCT is characterized by the following property. We define a zeta function by

$$\zeta(z) = \int K(w)^z\,\varphi(w)\,dw,$$

where

$$K(w) = \int q(x)\log\frac{q(x)}{p(x|w)}\,dx.$$

Then $\zeta(z)$ is holomorphic in $\mathrm{Re}(z) > 0$ and can be analytically continued to a unique meromorphic function on the entire complex plane [2]. The poles of this extended function are all negative rational numbers. Let $-\lambda$ be the nearest pole to the origin; $\lambda$ is then equal to the RLCT. The multiplicity $m$ is defined as the order of that nearest pole. If the model is regular, then $\lambda = d/2$ and $m = 1$; in general, however, this does not hold.
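As a small worked example (ours, not the paper's), the pole structure can be computed directly for a one-dimensional regular case and a two-dimensional singular case, both with uniform priors:

```latex
% Regular: d = 1, K(w) = w^2, uniform prior on [-1, 1].
\zeta(z) = \int_{-1}^{1} (w^2)^z \,\tfrac{1}{2}\, dw = \frac{1}{2z+1}
\quad\Rightarrow\quad \lambda = \tfrac{1}{2} = \tfrac{d}{2},\quad m = 1.
% Singular: d = 2, K(w_1, w_2) = w_1^2 w_2^2, uniform prior on [-1, 1]^2.
\zeta(z) = \Bigl(\frac{1}{2z+1}\Bigr)^{2}
\quad\Rightarrow\quad \lambda = \tfrac{1}{2} < \tfrac{d}{2} = 1,\quad m = 2.
```

In the singular case the parameter dimension is $d = 2$ yet $\lambda = 1/2 < d/2$, with multiplicity $m = 2$; this is exactly the gap between regular and singular models described above.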

### 2.2 Variational Bayesian Algorithm

The variational Bayesian algorithm (VB), or variational approximation, is an approximation method for Bayesian inference. VB is based on the mean field approximation, and it makes the numerical calculation cost less than that of exact Bayesian inference.

Let the training data be $X^n$ and the posterior be $p(w|X^n)$. In general, the posterior distribution cannot be found analytically; thus we assume that the parameters are independent:

$$\hat{q}(w) = \prod_{j}\hat{q}_j(w_j).$$

This assumption is just an approximation, but it is useful if the factors $\hat{q}_j$ have simpler forms than the original posterior.

This approximation is evaluated by means of the Kullback-Leibler divergence: take the factorized distribution $\hat{q}(w)$ above and minimize

$$\mathrm{KL}(\hat{q}\,\|\,p(\cdot|X^n)) = \int\hat{q}(w)\log\frac{\hat{q}(w)}{p(w|X^n)}\,dw.$$

The problem of Bayesian inference is the numerical realization of the posterior, and VB solves this as the above optimization problem. In practical cases, the parameters are often decomposed into several parts, typically just two, which are assumed to be independent:

$$\hat{q}(w) = \hat{q}_1(w_1)\,\hat{q}_2(w_2),$$

where $w = (w_1, w_2)$.

VB optimizes the above Kullback-Leibler divergence by searching over $\hat{q}$; however, the objective function cannot be computed analytically, because it contains the marginal likelihood:

$$\mathrm{KL}(\hat{q}\,\|\,p(\cdot|X^n)) = \int\hat{q}(w)\log\frac{\hat{q}(w)}{\varphi(w)\prod_{i=1}^{n}p(X_i|w)}\,dw + \log Z_n,$$

where $Z_n$ is the marginal likelihood. The model $p(x|w)$ and the prior $\varphi(w)$ are designed by the analyst; thus the first term contains only known functions, while $\log Z_n$ does not depend on $\hat{q}$. The minimization problem therefore reduces to minimizing the following functional of the mean field approximation $\hat{q}$:

$$\overline{F}(\hat{q}) = \int\hat{q}(w)\log\frac{\hat{q}(w)}{\varphi(w)\prod_{i=1}^{n}p(X_i|w)}\,dw.$$

If there exists a factorized $\hat{q}$ such that $\hat{q}(w) = p(w|X^n)$, then the Kullback-Leibler divergence is equal to 0, i.e. the functional is equal to the free energy: $\overline{F}(\hat{q}) = F_n$. Nevertheless, in general, there may be no factorized $\hat{q}$ such that $\hat{q}(w) = p(w|X^n)$. The variational free energy $\overline{F}_n$ is defined by the minimum of the functional:

$$\overline{F}_n = \min_{\hat{q}}\overline{F}(\hat{q}).$$

The inequality $\overline{F}_n \geq F_n$ is immediately derived from the definition.
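This inequality can be checked numerically on a toy discrete model. The sketch below (our illustration, with all names assumed) represents an unnormalized posterior on a finite grid, runs the mean-field coordinate updates, and compares the resulting variational free energy with the exact free energy, the negative log of the normalizing sum.

```python
import numpy as np

def variational_free_energy(rho, n_iter=100):
    """Mean-field coordinate descent on a discrete grid.  rho (strictly
    positive) is the unnormalized posterior over two discrete parameters;
    returns (F, Fbar) with Fbar >= F = -log(sum rho)."""
    log_rho = np.log(rho)
    n1, n2 = rho.shape
    q1 = np.full(n1, 1.0 / n1)
    q2 = np.full(n2, 1.0 / n2)
    for _ in range(n_iter):
        # Self-consistency: log q1 = E_{q2}[log rho] + const (and vice versa).
        g1 = log_rho @ q2; q1 = np.exp(g1 - g1.max()); q1 /= q1.sum()
        g2 = log_rho.T @ q1; q2 = np.exp(g2 - g2.max()); q2 /= q2.sum()
    Q = np.outer(q1, q2)
    Fbar = float((Q * (np.log(Q) - log_rho)).sum())   # variational free energy
    F = float(-np.log(rho.sum()))                     # exact free energy
    return F, Fbar
```

For a "correlated" table such as `[[4, 1], [1, 4]]` the factorization assumption is wrong and the variational free energy strictly exceeds the exact one, whereas for any rank-one table the posterior factorizes and the two coincide.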

For example, if the approximation

$$\hat{q}(w) = \hat{q}_1(w_1)\,\hat{q}_2(w_2)$$

is applied, then

$$\overline{F}(\hat{q}) = \int\hat{q}_1(w_1)\,\hat{q}_2(w_2)\log\frac{\hat{q}_1(w_1)\,\hat{q}_2(w_2)}{\varphi(w)\prod_{i=1}^{n}p(X_i|w)}\,dw_1\,dw_2.$$

Besides, the optimal $\hat{q}_1$ and $\hat{q}_2$ satisfy the following self-consistency conditions:

$$\hat{q}_1(w_1) = \frac{1}{C_1}\exp\left(\int\hat{q}_2(w_2)\log\Bigl(\varphi(w)\prod_{i=1}^{n}p(X_i|w)\Bigr)dw_2\right),$$

$$\hat{q}_2(w_2) = \frac{1}{C_2}\exp\left(\int\hat{q}_1(w_1)\log\Bigl(\varphi(w)\prod_{i=1}^{n}p(X_i|w)\Bigr)dw_1\right),$$

where $C_1$ and $C_2$ are the normalizing constants.

In this way, VB is an approximation of Bayesian inference, and its accuracy can be expressed by the difference $\overline{F}_n - F_n$: the smaller the difference is, the more valid the approximation is. From the theoretical point of view, $\overline{F}_n$ has an asymptotic behavior similar to that of $F_n$. For example, for VBNMF, the following theorem has been proved [13].

###### Theorem 2.1 (Kohjima).

Let the elements of the data matrices

$$X^n = (X_1, \ldots, X_n)$$

be independently generated from the Poisson distribution whose mean is equal to the corresponding element of $A_0B_0$, where the number of columns in $A_0$ is $H_0$, called the non-negative rank of $A_0B_0$ [5]. Let the likelihood model and the prior be the following Poisson and gamma distributions, respectively:

$$p(X|A,B) = \prod_{i=1}^{M}\prod_{j=1}^{N}\frac{\bigl((AB)_{ij}\bigr)^{X_{ij}}}{X_{ij}!}e^{-(AB)_{ij}},$$

$$\varphi(A,B) \propto \prod_{i,k}A_{ik}^{\phi_A-1}e^{-\theta_A A_{ik}}\prod_{k,j}B_{kj}^{\phi_B-1}e^{-\theta_B B_{kj}},$$

where $\phi_A, \theta_A, \phi_B, \theta_B > 0$ are hyperparameters, the sizes of $A$ and $B$ are $M \times H$ and $H \times N$, and $(AB)_{ij}$ is the $(i,j)$ entry of $AB$, respectively.

Then, the variational free energy satisfies the following asymptotic equality:

$$\overline{F}_n = nS_n + \overline{\lambda}\log n + O_p(1),$$

where $\overline{\lambda}$ is a constant explicitly determined in [13] by $M$, $N$, $H$, $H_0$, and the hyperparameters.

In this paper, we mathematically show an upper bound of the RLCT of NMF in the same case as VBNMF: the model consists of Poisson distributions and the prior consists of gamma distributions, which may have zeros or divergence points in the parameter space. By using this upper bound, we also derive a lower bound of the approximation accuracy of VBNMF.
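The point about zeros and divergence points can be seen directly from the gamma density: near the origin it vanishes when the shape parameter exceeds 1 and blows up when it is below 1, so such priors are not "strictly positive and bounded" and fall outside the assumption of the earlier bounds [9, 8]. A quick check (our illustration):

```python
import math

def gamma_pdf(x, shape, rate=1.0):
    """Gamma density in the shape/rate parametrization."""
    return rate**shape * x**(shape - 1) * math.exp(-rate * x) / math.gamma(shape)

# Near x = 0 the prior vanishes for shape > 1 and diverges for shape < 1.
near_zero_shape2 = gamma_pdf(1e-8, shape=2.0)   # tends to 0
near_zero_shape_half = gamma_pdf(1e-8, shape=0.5)   # grows without bound
```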

## 3 Main Theorems and Proof

In this section, we introduce our main results and prove them.

### 3.1 Main Theorems

We now introduce the main results of this paper. In the following, $w = (A, B)$ is a parameter and $X$ is an observed random variable.

Let $M(M, N, S)$ be the set of $M \times N$ matrices whose elements are in $S$, where $S$ is a subset of $\mathbb{R}$. Let $K_A$ be a compact subset of $M(M, H, [0, \infty))$ and let $K_B$ be a compact subset of $M(H, N, [0, \infty))$. We denote by $(A, B) \in K_A \times K_B$ the parameter, and by $A_0$ and $B_0$ the NMF result of $A_0B_0$ such that $H_0$ is the non-negative rank [5] of $A_0B_0$, i.e. they give the minimal $H_0$, where $A_0 \in M(M, H_0, [0, \infty))$ and $B_0 \in M(H_0, N, [0, \infty))$.

###### Definition 3.1 (RLCT of NMF).

Assume that the largest pole of the function of one complex variable $z$,

$$\zeta(z) = \int \|AB - A_0B_0\|^{2z}\,\varphi(A,B)\,dA\,dB,$$

is equal to $-\lambda$. Then $\lambda$ is said to be the RLCT of NMF.

We have already derived an upper bound of the RLCT of NMF in the case that the prior is strictly positive and bounded [9, 8]. However, to compare with VBNMF, we set the prior to the gamma distributions

$$\varphi(A,B) \propto \prod_{i,k}A_{ik}^{\phi_A-1}e^{-\theta_A A_{ik}}\prod_{k,j}B_{kj}^{\phi_B-1}e^{-\theta_B B_{kj}},$$

where $\phi_A, \theta_A, \phi_B, \theta_B > 0$ are hyperparameters and the normalizing constant is omitted.

In this paper, we prove the following theorems.

###### Theorem 3.1.

If the prior consists of the above gamma distributions, then the RLCT of NMF satisfies the following inequality:

The equality holds when .

We prove this theorem in the next subsection. As two applications of this theorem, we obtain upper bounds of the free energy and the Bayesian generalization error of NMF in this case. The following theorem shows a statistical bound for Bayesian estimation of NMF.

###### Theorem 3.2.

Let the probability density functions of $X$ be $q(X)$ and $p(X|A, B)$, which represent a true distribution and a learning machine, respectively, defined by

where

Also let . Then, the free energy and the expected generalization error satisfy the following inequalities:

In Theorem 3.2, we study the case when a set of random matrices is observed and the true decomposition $A_0$ and $B_0$ is statistically estimated. NMF has sometimes been studied in the case when only one target matrix is decomposed; in general, however, decomposition of a set of independent matrices should be studied, because target matrices are often obtained daily, monthly, or at different places [12]. In such cases, decomposition of a set of matrices results in statistical inference. We consider this situation, which is common to [13] and Theorem 2.1. A statistical model with parameters $(A, B)$ is employed for estimation. Then the free energy and generalization error of Bayesian estimation are given by this theorem.

###### Theorem 3.3.

Let the variational free energy of VBNMF be $\overline{F}_n$. Then, the following inequality holds:

where

Theorem 3.3 gives a lower bound of the difference in free energy between the variational approximation and true Bayesian inference.

### 3.2 Proof of Main Theorems

In order to prove Theorem 3.1, we use the following three lemmas, which are proved in Appendix A.

###### Lemma 3.1.

Let be the absolute value of the maximum pole of

Then,

holds; this is the equality of Theorem 3.1 in the case .

###### Lemma 3.2.

If , the equality of Theorem 3.1 holds:

###### Lemma 3.3.

If , the inequality of Theorem 3.1 is attained:

Let the entries of the matrices be

and the ones of be

respectively.

Now, we prove Theorem 3.1 using the above lemmas.

###### Proof of Theorem 3.1.

Let and be non-negative analytic functions on a finite dimensional Euclidean space. A binomial relation is defined by

We have

Denote the first and second terms of the above by

and

respectively. Because of the prior , all we have to do is find the RLCT of

to derive an upper bound.

Let be the RLCT of the right-most side, the RLCT of the first term of the right-most side, and the RLCT of the second one. If , then the first prior term can be ignored when calculating the RLCT; we have

Because the variables are independent and RLCTs are order-isomorphic,

(3)

Since corresponds to the proof of Lemma 3.3 in the case of ,

In contrast, corresponds to Lemma 3.1 in the case of . That yields

Applying the above inequality and equality to inequality (3),

∎

###### Remark 3.1.

###### Proof.

Owing to ,

Thus,

Using the above relation,

∎