Matrix factorization (MF) is a well-known classical machine learning method. MF decomposes a data matrix into a product of two matrices and discovers hidden structures or patterns; hence, it has been applied to knowledge discovery in many fields. However, as is well known, ordinary MF is not guaranteed to yield a unique factorization, and it is sensitive to the initial values of the numerical computation. This non-uniqueness interferes with data-driven inference and interpretation of the results. In addition, the sensitivity to initial values lowers the reliability of the factorization result. From the viewpoint of data-based prediction, this instability may lead to incorrect prediction results. In order to improve interpretability, non-negative matrix factorization (NMF) (Paatero and Tapper, 1994; Lee and Seung, 1999) was devised; it is a restricted MF in which the elements of the matrices are non-negative. Thanks to the non-negativity constraint, the extracted factors are readily interpretable. NMF is frequently used for extracting latent structures and patterns, for instance, in image recognition (Lee and Seung, 1999), audio signal processing (Schmidt and Olsson, 2006), and consumer analysis (Kohjima et al., 2015). However, the problems of non-uniqueness and initial value sensitivity have not yet been settled.
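The initial-value sensitivity mentioned above can be observed in a few lines of code. The following sketch (ours, not from the cited works) runs Lee–Seung multiplicative updates from two random initializations: both runs fit the data matrix well, yet the recovered factors differ, illustrating the non-uniqueness of the factorization.

```python
import numpy as np

def nmf(V, H, seed, iters=1000):
    """Lee-Seung multiplicative updates minimizing the Frobenius loss ||V - W @ B||."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], H)) + 0.1
    B = rng.random((H, V.shape[1])) + 0.1
    for _ in range(iters):
        B *= (W.T @ V) / (W.T @ W @ B + 1e-12)
        W *= (V @ B.T) / (W @ B @ B.T + 1e-12)
    return W, B

rng = np.random.default_rng(0)
V = rng.random((20, 2)) @ rng.random((2, 30))  # an exactly rank-2 non-negative matrix

W1, B1 = nmf(V, 2, seed=1)
W2, B2 = nmf(V, 2, seed=2)

# Both runs reconstruct V well...
print(np.linalg.norm(V - W1 @ B1), np.linalg.norm(V - W2 @ B2))
# ...but the factors themselves depend on the initial values.
print(np.linalg.norm(W1 - W2))
```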
Stochastic matrix factorization (SMF) was devised by Adams (Adams, 2016b); it can be understood as a restriction of NMF in which at least one matrix factor is “stochastic”: the elements of the matrix factors are non-negative and the columns of one matrix factor sum to 1. We call a matrix whose columns are non-negative and each sum to 1 a “stochastic” matrix. By making two further assumptions, Adams proved the uniqueness of the factorization that the proposed method reaches (Adams, 2016a, b). To state the two conditions, we consider a data matrix $V$ of size $M \times N$ and “stochastic” factor matrices $A$ and $B$ whose sizes are $M \times H$ and $H \times N$, respectively. $H$ might be the rank of $V$, but the “stochastic” condition makes this non-trivial. In other words, SMF can be understood as the method that finds a factor matrix pair $(A, B)$ such that $V = AB$, from given $V$ and $H$. The non-uniqueness problem can be paraphrased as the existence of a regular matrix $P$ such that
i.e., the elements of $AP$ and $P^{-1}B$ are non-negative, and $P$ satisfies $(AP)(P^{-1}B) = V$. Adams imposed two assumptions excluding such a $P$ and claimed that these assumptions are “natural” (Adams, 2016b). From a practical point of view, SMF was proposed with applications to image reduction problems and to topic models for analyzing unstructured data (Adams, 2016b). Topic models are known to be useful, for instance, for storing and retrieving pictures.
So far, the MF methods we have introduced, including Adams’s SMF, are deterministic. As described later, for hierarchical learning machines such as MF, it has been proved that Bayesian inference achieves higher prediction accuracy than deterministic methods or maximum likelihood estimation. This is also true for the accuracy of the discovered knowledge. Moreover, the probabilistic view allows wider applications. Indeed, Bayesian NMF (Virtanen et al., 2008; Cemgil, 2009) has been applied to image recognition (Cemgil, 2009), audio signal processing (Virtanen et al., 2008), overlapping community detection (Psorakis et al., 2011), and recommender systems (Bobadilla et al., 2018). From the statistical point of view, the data matrices are random variables subject to the true distribution. MF is sometimes considered for decomposing only one target stochastic matrix; in general, however, factorization of a set of independent matrices should be studied, because the target matrices are often obtained daily, monthly, or in different places (Kohjima et al., 2015). Besides, SMF can be applied to Bayesian networks via Markov chains, as explained later. Bayesian networks are often estimated using the Bayesian method. In such cases, decomposition of a set of matrices results in statistical inference.
We show some potential applications of statistical or probabilistic SMF. First, it can be applied to NMF for binary data (Larsen and Clemmensen, 2015), because binary matrices can be observed as random variables subject to a Bernoulli distribution. Second, SMF for topic models can also be viewed as a statistical learning machine. Third, if the transition stochastic matrix of a Markov chain or Bayesian network can be represented by a matrix of lower rank, reduced rank regression can be performed on this transition matrix. Hence, it is important to study the theoretical prediction accuracy of SMF, not only for statistical learning theory but also for applications to real data.
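As an illustration of the first application (our sketch, not an implementation from Larsen and Clemmensen, 2015): when the Bernoulli mean matrix is a product of column-stochastic factors, every entry of the product lies in $[0, 1]$, so a binary data matrix can be drawn elementwise.

```python
import numpy as np

# Sketch: a binary data matrix generated from a low-rank stochastic factorization.
# Each column of A and B is non-negative and sums to 1, so every entry of
# P = A @ B is a valid Bernoulli parameter in [0, 1].
rng = np.random.default_rng(0)
M, H, N = 6, 2, 8

A = rng.random((M, H)); A /= A.sum(axis=0)   # M x H column-stochastic factor
B = rng.random((H, N)); B /= B.sum(axis=0)   # H x N column-stochastic factor
P = A @ B                                    # mean matrix, entries in [0, 1]

X = rng.binomial(1, P)                       # observed binary matrix ~ Bernoulli(P)
print(X)
```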
A statistical model or learning machine is called regular if the map from the parameter set to the set of probability density functions is injective and its likelihood function can be approximated by a Gaussian function. It has been proved that if a statistical model is regular and a true distribution is realizable by the model, then the expected generalization error is asymptotically equal to $d/(2n)$, where $d$ is the dimension of the parameter, $n$ is the sample size (the number of data), and the generalization error is the Kullback-Leibler divergence from the true distribution $q(x)$ to the predicted one $p^*(x)$,
$$G_n = \int q(x) \log \frac{q(x)}{p^*(x)}\, dx,$$
respectively (Watanabe, 2000). However, the learning machine used in SMF is not regular, because the map from a parameter to a probability density function is not one-to-one. Such a model is called a singular learning machine. As a result, its theoretical generalization error has been unknown, and we could not confirm the correctness of the results of numerical experiments.
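A minimal sketch (ours) of why MF-type models are singular: the parameter-to-model map is many-to-one, since rescaling the factors by any positive diagonal matrix leaves the product, and hence the modeled distribution, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 2))
B = rng.random((2, 5))

D = np.diag([2.0, 0.5])                 # any positive diagonal rescaling
A2, B2 = A @ D, np.linalg.inv(D) @ B    # a different parameter pair...

print(np.allclose(A @ B, A2 @ B2))      # ...with exactly the same product
```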
We would like to stress that we consider case (1), in which all matrix factors are stochastic, rather than case (2), in which at least one matrix factor is stochastic. Adams proved that SMF reaches a unique factorization under some assumptions in case (2) (Adams, 2016a, b), as mentioned above. However, in general, stochastic matrices do not satisfy these assumptions. The term “stochastic matrix” usually refers to case (1). In addition, a stochastic matrix is a point in a Cartesian product of simplices; thus it is not clear that Adams’s assumptions (2) and (3) are mathematically “natural”. Here, we suppose more general applications such as NMF for binary data and reduced rank regression applied to a Markov chain. For simplicity, we call the model an SMF even in case (1).
There are many singular learning machines that are practical, for example, Gaussian mixture models, reduced rank regressions, neural networks, hidden Markov models, and Boltzmann machines. NMF and SMF are also statistically singular. The expected generalization error of a singular learning machine in Bayesian learning has the asymptotic expansion
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\left(\frac{1}{n}\right),$$
where $\lambda$ is the real log canonical threshold (RLCT), a birational invariant in algebraic geometry (Watanabe, 2000, 2010). The RLCT is also called the learning coefficient (Drton and Plummer, 2017; Aoyagi, 2010), as it is the coefficient of the main term in the above expansion Eq. (4). In addition, the negative log Bayesian marginal likelihood has the asymptotic expansion
$$F_n = nS_n + \lambda \log n + O_p(\log \log n),$$
where $S_n$ is the empirical entropy. Note that RLCTs are different from the usual log canonical thresholds (Hironaka, 1964), since the real field is not algebraically closed and the usual log canonical threshold is defined over an algebraically closed field such as the complex field. Thus, we cannot directly apply research results obtained over an algebraically closed field to the problem of SMF. The RLCTs of several learning machines have been clarified. For example, they have been found for mixture models (Yamazaki and Watanabe, 2003a), reduced rank regression (Aoyagi and Watanabe, 2005), three-layered neural networks (Watanabe, 2001), naive Bayesian networks (Rusakov and Geiger, 2005), Bayesian networks (Yamazaki and Watanabe, 2003b), Boltzmann machines (Yamazaki and Watanabe, 2005b; Aoyagi, 2010, 2013), Markov models (Zwiernik, 2011), hidden Markov models (Yamazaki and Watanabe, 2005a), Gaussian latent tree and forest models (Drton et al., 2017), and NMF (Hayashi and Watanabe, 2017a, b), by using resolution of singularities (Hironaka, 1964; Atiyah, 1970). Finding the RLCT of a model amounts to deriving the theoretical value of its generalization error. In addition, a statistical model selection method called the singular Bayesian information criterion (sBIC), which uses RLCTs to approximate the negative log Bayesian marginal likelihood, has been proposed (Drton and Plummer, 2017). Thus, clarifying the RLCTs of concrete learning machines is important not only for algebraic geometrical reasons but also for statistical and practical ones.
In this paper, we consider SMF as a restriction of NMF and theoretically derive an upper bound of the RLCT of SMF, from which we can derive an upper bound of the expected Bayesian generalization error of SMF. We would like to emphasize that the bound cannot be immediately proved in the same way as for NMF and other learning machines. There is no standard method for finding the RLCT of a given family of functions; thus, researchers study RLCTs by devising different methods for each learning machine or collection of functions. The details of this difference, together with the novelty of our proof, are discussed in Section 5. Prior researchers’ methods cannot be directly applied to the SMF problem.
This paper consists of five parts. In the second section, we describe the upper bound of the RLCT of SMF (Main Theorem). In the third section, we carry out the mathematical preparation for the proof of the Main Theorem. In the fourth section, we sketch the proof of the Main Theorem. Finally, in the fifth section, we describe a theoretical application of the Main Theorem to Bayesian learning. In the appendices, we rigorously prove the Main Theorem and the lemmas used to derive it.
2 Framework and Main Result
In this section, we explain the framework of Bayesian learning and of analyzing the RLCTs of learning machines, and introduce the main result of this paper.
2.1 Framework of Bayesian Learning
First, we explain the general theory of Bayesian learning.
Let $q(x)$ and $p(x|w)$ be probability density functions on a finite-dimensional real Euclidean space, where $w$ is a parameter. In learning theory, $q(x)$ and $p(x|w)$ represent a true distribution and a learning machine given $w$, respectively. A probability density function $\varphi(w)$ whose domain is the set of parameters is called a prior. Let $X^n = (X_1, \ldots, X_n)$ be a set of random variables that are independently subject to $q(x)$, where $n$ and $X^n$ are the sample size and the training data, respectively. The probability density function of $w$ defined by
$$p(w|X^n) = \frac{1}{Z_n} \varphi(w) \prod_{i=1}^n p(X_i|w)$$
is called a posterior, where $Z_n$ is the normalizing constant determined by the condition $\int p(w|X^n)\, dw = 1$:
$$Z_n = \int \varphi(w) \prod_{i=1}^n p(X_i|w)\, dw.$$
This is called the marginal likelihood or the partition function. The Bayesian predictive distribution is defined by
$$p^*(x) = \int p(x|w) p(w|X^n)\, dw.$$
Bayesian inference (learning) means estimating the true distribution by the predictive distribution.
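The definitions above can be sketched numerically. Assuming a Bernoulli model $p(x|w) = w^x (1-w)^{1-x}$ with a uniform prior on $(0, 1)$ (our toy example, not the SMF model), a grid approximation of the predictive distribution reproduces the conjugate closed form $p^*(x=1) = (k+1)/(n+2)$, where $k$ is the number of ones among $n$ observations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.3, size=50)          # training data X^n from the true distribution
n, k = len(X), int(X.sum())

w = np.linspace(1e-6, 1 - 1e-6, 200_000)   # parameter grid on (0, 1)
dw = w[1] - w[0]
log_lik = k * np.log(w) + (n - k) * np.log1p(-w)   # log of the product of p(X_i|w)
post = np.exp(log_lik - log_lik.max())
post /= post.sum() * dw                    # normalized posterior p(w|X^n), uniform prior

pred_one = (w * post).sum() * dw           # p*(x=1) = integral of w * p(w|X^n) dw
print(pred_one, (k + 1) / (n + 2))         # the grid value matches the closed form
```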
Bayesian inference is statistical; hence, its estimation accuracy should be verified. There are mainly two criteria for this verification. The first is the negative log marginal likelihood:
$$F_n = -\log Z_n.$$
This is also called the free energy or the stochastic complexity (Watanabe, 2009). The second is the generalization error $G_n$. It is defined as the Kullback-Leibler divergence from the true distribution $q(x)$ to the predictive one $p^*(x)$:
$$G_n = \int q(x) \log \frac{q(x)}{p^*(x)}\, dx.$$
Note that $F_n$ and $G_n$ are functions of $X^n$; hence, they are also random variables. The expected value of $G_n$ over all training data is called the expected generalization error. Assume that there exists at least one parameter $w_0$ satisfying $q(x) = p(x|w_0)$ and that the parameter set $W$ is compact. Using singular learning theory (Watanabe, 2000, 2009), it has been proven that
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\left(\frac{1}{n}\right), \quad F_n = nS_n + \lambda \log n + O_p(\log \log n)$$
as $n$ tends to infinity, even if the posterior distribution cannot be approximated by any normal distribution, where $S_n$ is the empirical entropy:
$$S_n = -\frac{1}{n} \sum_{i=1}^n \log q(X_i).$$
The constant $\lambda$ is the RLCT, an important birational invariant in algebraic geometry. From a mathematical point of view, the RLCT is characterized by the following property. We define the zeta function of learning theory by
$$\zeta(z) = \int K(w)^z \varphi(w)\, dw,$$
where $K(w)$ is the Kullback-Leibler divergence from $q(x)$ to $p(x|w)$; $K(w) = 0$ if and only if $p(x|w) = q(x)$ for almost every $x$. Let $(-\lambda)$ be the pole of $\zeta(z)$ nearest to the origin; $\lambda$ is then equal to the RLCT. If the model is regular, then $\lambda = d/2$. However, this does not hold in general. The details of the general case are explained in the next section.
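The pole characterization above can be checked symbolically for toy cases (our sketch; these are not the SMF zeta functions). With a uniform prior on the unit interval, a regular one-dimensional model with $K(w) = w^2$ gives $\lambda = 1/2 = d/2$, while the singular two-dimensional function $K(w_1, w_2) = w_1^2 w_2^2$ gives $\lambda = 1/2 < d/2 = 1$.

```python
import sympy as sp

z = sp.symbols('z')
w1, w2 = sp.symbols('w1 w2', positive=True)

# Regular case: zeta(z) = integral of w^(2z) over (0,1) = 1/(2z+1),
# whose nearest pole to the origin is z = -1/2, i.e. lambda = 1/2.
zeta_reg = sp.integrate(w1**(2 * z), (w1, 0, 1), conds='none')
pole_reg = sp.solve(sp.denom(sp.together(zeta_reg)), z)[0]

# Singular case: zeta(z) = 1/(2z+1)^2; same pole z = -1/2, but of order 2,
# and lambda = 1/2 is strictly smaller than d/2 = 1.
inner = sp.integrate(w1**(2 * z) * w2**(2 * z), (w1, 0, 1), conds='none')
zeta_sing = sp.integrate(inner, (w2, 0, 1), conds='none')

print(sp.simplify(zeta_reg), pole_reg, sp.simplify(zeta_sing))
```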
2.2 Relationship between Algebraic Geometry and Learning Theory
Second, we outline the relationship between algebraic geometry and statistical learning theory.
First in this subsection, we introduce the motivation for applying algebraic geometry to learning theory. As described above, statistical learning concerns the situation in which the true distribution $q(x)$ is not known although data $X^n$ can be obtained, where $n$ is the sample size. Researchers and practitioners design learning machines or statistical models to estimate $q(x)$ by constructing the predictive distribution $p^*(x)$. There is a problem: “how different are our model and the true distribution?” This issue can be characterized as the model selection problem: “which model is suitable?” The “suitability” criteria are the negative log marginal likelihood $F_n$ and the generalization error $G_n$, as mentioned above. However, calculating $F_n$ is computationally very expensive, and $G_n$ cannot be computed since $q(x)$ is unknown. We must therefore estimate them from the data. If the likelihood function and the posterior distribution can be approximated by a Gaussian function of $w$, we can estimate $F_n$ and $G_n$ by using the Bayesian information criterion (BIC) (Schwarz, 1978) and the Akaike information criterion (AIC) (Akaike, 1980), respectively. BIC and AIC are respectively defined by
$$\mathrm{BIC} = -\sum_{i=1}^n \log p(X_i|\hat{w}) + \frac{d}{2} \log n, \quad \mathrm{AIC} = -\sum_{i=1}^n \log p(X_i|\hat{w}) + d,$$
where $\hat{w}$ is the maximum likelihood estimator or the maximum a posteriori estimator and $d$ is the parameter dimension. AIC and BIC are derived without using algebraic geometry; however, they are asymptotically related to $G_n$ and $F_n$ only if the likelihood and the posterior can be approximated by a normal distribution. In general, we cannot estimate $G_n$ and $F_n$ by using AIC and BIC; thus, we need algebraic geometry.
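As a toy illustration (ours), these criteria can be computed for a one-parameter Gaussian model $N(\mu, 1)$, whose maximum likelihood estimator is the sample mean; we use the common convention that both criteria are the negative log-likelihood at the estimator plus a penalty, $d$ for AIC and $(d/2)\log n$ for BIC.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.5, 1.0, size=200)   # i.i.d. sample from the (here known) truth
n, d = len(X), 1                     # sample size and parameter dimension

mu_hat = X.mean()                    # maximum likelihood estimator of the mean
neg_log_lik = 0.5 * np.sum((X - mu_hat)**2) + 0.5 * n * np.log(2 * np.pi)

aic = neg_log_lik + d                # AIC in the convention above
bic = neg_log_lik + 0.5 * d * np.log(n)   # BIC in the convention above
print(aic, bic)
```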
Second, we describe the framework for analyzing $K(w)$ using algebraic geometry. We consider $K(w)$ in Eq. (6) and its set of zero points: this set is an algebraic variety. We use the following form, due to Atiyah (Atiyah, 1970), of the resolution of singularities theorem (Hironaka, 1964). This form was originally derived by Atiyah for the analysis of distributions (hyperfunctions); however, Watanabe proved that it is useful for constructing singular learning theory (Watanabe, 2000).
Theorem 1 (Resolution of Singularities)
Let $K$ be a non-negative analytic function on an open set $W \subset \mathbb{R}^d$ and assume that there exists $w_0 \in W$ such that $K(w_0) = 0$. Then, there exist a $d$-dimensional manifold $M$ and a proper analytic map $g: M \to W$ such that, for each local chart of $M$ with coordinates $u = (u_1, \ldots, u_d)$,
$$K(g(u)) = u_1^{2k_1} \cdots u_d^{2k_d},$$
where the Jacobian of $g$ satisfies $|g'(u)| = b(u) |u_1^{h_1} \cdots u_d^{h_d}|$ and $b(u)$ is strictly positive analytic: $b(u) > 0$.
Theorem 2
Let $K$ be a non-negative analytic function of a variable $w \in W$ and let $\psi(w)$ be a $C^\infty$-function with compact support $W$. Then
$$\zeta(z) = \int_W K(w)^z \psi(w)\, dw$$
is a holomorphic function in $\mathrm{Re}(z) > 0$. Moreover, $\zeta(z)$ can be analytically continued to a unique meromorphic function on the entire complex plane $\mathbb{C}$. The poles of the extended function are all negative rational numbers.
The Kullback-Leibler divergence $K(w)$ is non-negative; thus, applying Theorem 1 to $K(w)$ on $W$, we obtain the normal crossing form $K(g(u)) = u_1^{2k_1} \cdots u_d^{2k_d}$ on each local chart. Assuming that the domain of the prior is $W$ and that $\varphi(w) > 0$ on $W$, we can also apply Theorem 2 to $K(w)$ and obtain Eq. (5). In this equation, $\zeta(z)$ is called the zeta function of learning theory, and it can be analytically continued to $\mathbb{C}$ as a unique meromorphic function. The RLCT $\lambda$ of $K$ is defined via the maximum pole $(-\lambda)$ of $\zeta(z)$ (Watanabe, 2009). Furthermore, it has been proved that the RLCT does not depend on $\varphi$ provided $\varphi > 0$ on $W$ (Watanabe, 2009).
Theorem 3
Let $q(x)$, $p(x|w)$, and $\varphi(w)$ be the true distribution, the learning machine, and the prior distribution, where $x$ is a point of a finite-dimensional real Euclidean space and $w$ is an element of a compact subset $W$ of $\mathbb{R}^d$. Define $K(w)$ as in Eq. (6) and let $\lambda$ denote the RLCT of $K$. If there exists at least one $w_0 \in W$ such that $q(x) = p(x|w_0)$, then the asymptotic behaviors of the generalization error and the free energy are as follows:
$$\mathbb{E}[G_n] = \frac{\lambda}{n} + o\left(\frac{1}{n}\right), \quad F_n = nS_n + \lambda \log n + O_p(\log \log n).$$
$K(w)$ depends on $q$ and $p$; thus, Theorem 3 can be understood as follows: we can clarify $\mathbb{E}[G_n]$ and $F_n$ once the RLCT determined by $(q, p)$ is clarified. As introduced above, there are several studies that find the RLCT of a statistical model by analyzing the maximum pole of the zeta function. These studies are based on Theorem 3 and the zeta function derived from Theorem 2. Researchers have constructed resolution maps yielding the exact value or an upper bound of the RLCT, using the fact that the RLCT is order-preserving: if $K_1 \leq K_2$, then $\lambda_1 \leq \lambda_2$, where $(-\lambda_1)$ and $(-\lambda_2)$ are the maximum poles of the corresponding zeta functions $\zeta_1(z)$ and $\zeta_2(z)$, respectively (Watanabe, 2009).
As discussed in Section 5, there is no standard method for finding the RLCT of a given learning machine (family of functions). Here, we show a fundamental method for finding the RLCT of a non-negative analytic function, called blowing-up (Hironaka, 1964). We explain the blowing-up used to study learning machines through a concrete example, following (Watanabe, 2009). If a reader needs the rigorous definition of blowing-up, see (Hironaka, 1964). Let $K(x_1, \ldots, x_d) = x_1^2 + \cdots + x_d^2$, where $x_1, \ldots, x_d$ are independent variables. In particular, we treat the case $d = 2$ and write $(x, y) := (x_1, x_2)$. The blowing-up of the origin is a transformation of the coordinates defined by
$$(x, y) = (u, uv) \quad \text{or} \quad (x, y) = (u'v', v').$$
Using this blowing-up,
$$K = u^2 (1 + v^2) \quad \text{or} \quad K = v'^2 (1 + u'^2),$$
and the absolute value of the Jacobian of this transformation is
$$|u| \quad \text{or} \quad |v'|.$$
From the applied mathematical point of view, $1 + v^2$ is strictly positive; thus, the RLCT can be calculated. The zeta function on the first chart is
$$\zeta_1(z) = \iint \left(u^2 (1 + v^2)\right)^z |u|\, \varphi(u, uv)\, du\, dv,$$
and it is immediately proved that the strictly positive factor $(1 + v^2)^z$ does not affect the RLCT; the same holds for $\varphi$. Then, all we have to consider are the functions
$$\int_0^1 u^{2z+1}\, du, \quad \int_0^1 v'^{2z+1}\, dv',$$
which are each analytically continued to $\mathbb{C}$ as the unique meromorphic function $\frac{1}{2z+2}$. Therefore, we get
$$\zeta(z) = \frac{c_1}{2z+2} + c_2(z),$$
where $c_1$ is a positive constant and $c_2(z)$ is holomorphic around the maximum pole $z = -1$; hence $\lambda = 1$. In the same way, for general $d$, the RLCT of $K = x_1^2 + \cdots + x_d^2$ is equal to $d/2$.
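An RLCT obtained by such a calculation can also be checked numerically (our sketch, not from the paper): for a non-negative analytic $K$, the Laplace-type integral $v(n) = \int e^{-nK(w)}\, dw$ behaves like $C n^{-\lambda} (\log n)^{m-1}$, so the log-log slope of $v(n)$ estimates $-\lambda$. For the simple example $K(x, y) = x^2 + y^2$ on $[0,1]^2$ the slope approaches $-\lambda = -d/2 = -1$.

```python
import numpy as np

def v(n, m=4000):
    """Midpoint-rule approximation of the Laplace integral of exp(-n*(x^2+y^2))."""
    t = (np.arange(m) + 0.5) / m            # midpoints on [0, 1]
    one_dim = np.exp(-n * t**2).sum() / m   # integral of exp(-n t^2) over [0, 1]
    return one_dim**2                       # the integrand factorizes over x and y

n1, n2 = 1e3, 1e5
slope = (np.log(v(n2)) - np.log(v(n1))) / (np.log(n2) - np.log(n1))
print(slope)   # close to -lambda = -1
```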
2.3 Main Theorem
Third, we introduce the main result of this paper. In the following, $w$ is a parameter and $X$ is an observed random matrix.
A stochastic matrix is defined as a matrix in which the sum of the elements in each column is equal to 1 and every entry is non-negative. For example,
$$\begin{pmatrix} 1/2 & 0 \\ 1/2 & 1 \end{pmatrix}$$
is a stochastic matrix. It is known that a product of stochastic matrices is also a stochastic matrix. Let be a set of stochastic matrices whose elements are in , where is a subset of , and . In addition, we set and . Let be a compact subset of and let be a compact subset of . We define and , and assume that and are SMFs such that they give the minimal factorization of . We also assume that .
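The closure property just stated has a one-line proof: the $j$-th column sum of $AB$ is $\sum_i \sum_k A_{ik} B_{kj} = \sum_k B_{kj} \sum_i A_{ik} = \sum_k B_{kj} = 1$. It is also easy to verify numerically (our sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_stochastic(rows, cols):
    """Draw a matrix with non-negative entries whose columns each sum to 1."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0)

A = random_stochastic(4, 3)   # 4x3 column-stochastic factor
B = random_stochastic(3, 5)   # 3x5 column-stochastic factor
C = A @ B                     # their product

# The product is again column-stochastic: non-negative, columns sum to 1.
print(C.sum(axis=0), (C >= 0).all())
```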
Definition 4 (RLCT of SMF)
Set . Then the holomorphic function of one complex variable
can be analytically continued to a unique meromorphic function on the entire complex plane, and all of its poles are rational and negative. If the largest pole is $(-\lambda)$, then $\lambda$ is said to be the RLCT of SMF.
In this paper, we prove the following theorem.
Theorem 5 (Main Theorem)
If , , and , then the RLCT of SMF satisfies the following inequality:
In particular, if or , then equality holds:
Also if and , then
We prove the Main Theorem in the third and fourth sections. As an application, we obtain an upper bound of the Bayesian generalization error of SMF.
Theorem 6
Assume that , , and . Let and be the probability density functions of , representing a true distribution and a learning machine, respectively,
Also, let be a probability density function that is positive on a compact subset of including , i.e., . Then, has the same RLCT as , and the negative log marginal likelihood and the expected generalization error satisfy the following inequalities:
where is the upper bound of the RLCT of SMF given in Theorem 5.
Regarding this theorem, we consider the case in which a set of random matrices is observed and the true decomposition and are statistically estimated. A statistical model , which has parameters , is used for the estimation. Then, the theorem gives an upper bound of the Bayesian generalization error. Once Theorem 5 is proved, Theorem 6 immediately follows; therefore, we subsequently prove Main Theorem 5. Other models are considered in the Discussion.
Let and be
and and be
are stochastic matrices thus
We need the following four lemmas and two propositions to prove the Main Theorem. Before proving the lemmas and the Main Theorem, we explain the notation used in this paper. We often transform coordinates by linear transformations and blowing-ups; hence, for the sake of simplicity, we sometimes reuse the same symbols rather than introducing new ones. For example,
Let $f$ and $g$ be non-negative analytic functions from a subset of Euclidean space to $\mathbb{R}$. The RLCT of $f$ is defined by $\lambda_f$, where $(-\lambda_f)$ is the largest pole of the function
$$\zeta_f(z) = \int f(w)^z\, dw,$$
which is analytically continued to the entire complex plane as a unique meromorphic function. When the RLCT of $f$ is equal to that of $g$, we write $f \sim g$. Regarding this binary relation, the following propositions are known.
Suppose and let be real polynomials. Furthermore let
be the ideals generated by and , respectively. Also, we put
Then, if and only if .
This is immediately proved by using the Cauchy–Schwarz inequality.
The above proposition gives the following corollary.
Assume that . Then
We can easily prove this owing to and Proposition 7.
Put and , where and are compact sets that do not include 0. Let be and be . Then,
(Sketch of Proof) It is sufficient to prove that
for (constant). Using mathematical induction,
holds for (constant). This means
since Corollary 8 implies that one holds if and only if the other does. We rigorously proved Proposition 9 in our previous research (see Hayashi and Watanabe, 2017a, Lemma 3.4). In addition, it is easily verified that the RLCT of equals by using blowing-up (Eq. (7)) and Proposition 9.
Based on the above propositions, we state the following four lemmas. These lemmas are proved in Appendix C.
If ( is constant),
Let be the RLCT of . If , , , and ,
If , , and , then the equality of Main Theorem holds:
Suppose and . In the case of , Main Theorem holds:
4 Proof of Main Theorem
First, we expand and obtain
Thanks to Corollary 8, we obtain
Second, we calculate the RLCTs of the terms of the bound. Using linear transformations and the triangle inequality,
for and . Therefore, considering blowing-ups of respective variables and and applying Lemma 13, we get
Equality in the Main Theorem holds if or . If and , the bound is not equal to the exact value.
Under the same assumptions as in the Main Theorem, suppose
Then, in the proof of Main Theorem is equal to
and, more tightly, satisfies the same inequality as in the Main Theorem.
Owing to , .
Then we have and
Using the above relation,