Matrices and tensors are natural candidates for the representation of relations between two or more entities (Cichocki et al., 2009; Kolda and Bader, 2009). In matrix factorization, the goal is computing an exact or approximate decomposition of a matrix of form , possibly subject to various constraints on factor matrices ,
or both. The decomposition, also called a factorization due to its algebraic structure, provides a latent representation of data that can be used for prediction, completion, anomaly detection, denoising or separation tasks to name a few.
The difficulty of the matrix decomposition problem may depend on the nature of the constraints on factor matrices
or the nature of the error function used for measuring the approximation quality. For many matrix decomposition problems, exact algorithms are not known. An important exception is the ubiquitous singular value decomposition where efficient and numerically stable algorithms are known for the computation of a decomposition of formwith orthonormal and and nonnegative diagonal for any matrix (Golub and Van Loan, 2013). Instead, when both factors and are required to be nonnegative, we obtain the nonnegative matrix factorization (NMF) (Cohen and Rothblum, 1993; Paatero and Tapper, 1994; Lee and Seung, 2001) and the problem becomes intractable in general (Vavasis, 1999; Gillis, 2017). However, under further conditions on the matrices and , such as separability (Donoho and Stodden, 2004) or orthogonality (Asteris et al., 2015a), it becomes possible to obtain provable algorithms with certain optimality guarantees (Gillis, 2012; Arora et al., 2016; Gillis, 2014; Asteris et al., 2015b). Despite the lack of practical general exact algorithms, NMF and its various variants have been extensively investigated in the last decades (Gillis, 2017).
In domains where it is natural to represent data as a multi-way array (e.g., a three-way array is denoted as , where ) involving more then two entities, tensor decomposition methods are more natural (Acar and Yener, 2009; Kolda and Bader, 2009; Cichocki et al., 2009), along with several structured extensions such as coupled/collective factorizations (Paatero, 1999; Yılmaz et al., 2011), tensor trains and tensor networks (Cichocki et al., 2016, 2017). Unfortunately, in contrast to matrix decompositions, computing exact or approximate tensor factorizations is a much harder task that is known to be an NP-hard problem in general (Håstad, 1990; Hillar and Lim, 2013). Algorithms used in practice mostly rely on non-convex optimization, and depend on iterative minimization of a discrepancy measure between the target data and the desired decomposition. Despite their lack of optimality guarantees, tensor decomposition algorithms have found many modern applications as surveyed in Sidiropoulos et al. (2017); Papalexakis et al. (2016), psychometrics (Tucker, 1966; Harshman, 1970), knowledge bases (Nickel et al., 2016), graph analysis, social networks, (Papalexakis et al., 2016), music, audio, source separation (Virtanen et al., 2015; Simsekli et al., 2015)
, communications, channel estimation(Kofidis and Regalia, 2001; Sidiropoulos et al., 2017), bioinformatics (Mørup et al., 2008; Acar et al., 2011), link prediction (Ermiş et al., 2015), and computational social science (Schein et al., 2016).
Another closely related set of methods for relational data are the probabilistic topic models (Blei, 2012)
. Topic models have emerged even back in the early 1990’s primarily from the need of understanding vector space models for the analysis and organizing large text document corpora(Deerwester et al., 1990; Papadimitriou et al., 2000; Arora et al., 2015), but quickly applied to other data modalities such as images, video, audio (Signoretto et al., 2011; Liu et al., 2013; Abdallah et al., 2007; Virtanen et al., 2008). Research in this direction is mostly influenced by the seminal works Blei et al. (2003); Minka and Lafferty (2002) and shortly thereafter many related models have been proposed (Canny, 2004; Li and McCallum, 2006; Airoldi et al., 2008; Schein et al., 2015). The empirical success of topic models has also triggered further research in several alternative inference strategies (Griffiths and Steyvers, 2004; Teh et al., 2007).
It has been a folkloric knowledge that factorizing nonnegative matrices and fitting topic models to document corpora are closely related problems where both problems can be viewed as inference and learning in graphical models (Gopalan et al., 2013)
. In this paper, we formalize this observation and illustrate that nonnegative tensor factorization (NTF) models and probabilistic topic models have the same algebraic form as an undirected graphical model, which ‘factorizes’ a joint distribution as a product of local compatibility functions. Based on this observation, we develop a general modeling framework that allows us computing tensor factorizations as well as approximating the marginal likelihood of a tensor factorization model for Bayesian model selection.
In our construction, the central object is a dynamic model that we coin as the Bayesian allocation model (BAM). The model first defines a homogeneous Poisson process on the real line (Kingman, 1993; Daley and Vere-Jones, 2007) and we imagine the increments of this Poisson process as ‘tokens’, drawing from the topic modeling literature. Each token is then marked in an independently random fashion using mark probabilities . The mark probabilities
are assumed to obey a certain factorization that respects a Bayesian network of discrete random variables, with the directed graph111For developing a dynamical process view, we find working with directed graphical models more convenient as compared to undirected graphs. A directed graph-based factorization of a joint distribution, also known as a Bayesian network, is a standard tool in modeling domains containing several interacting variables (Lauritzen, 1996). (Lauritzen, 1996; Barber, 2012) where the probability tables corresponding to
are also random with a Dirichlet prior. Thanks to the conjugacy properties of the Poisson process and the Dirichlet-Gamma distributions, we can integrate out the random probability tables to arrive at a marginal model that turns out to be a Pólya urn process. The Markov properties of the directed graphallows us to understand the nature of this Pólya urn model.
For general graphs, exact inference, i.e., the computation of certain marginal probabilities conditioned on the observations, can be accomplished using the junction tree algorithm (Lauritzen and Spiegelhalter, 1988; Lauritzen, 1992; Spiegelhalter et al., 1993). For graphical models with discrete probability tables, under suitable assumptions, it is also possible to integrate out the probability tables to calculate the marginal likelihood. In this paper, we will show that computing the marginal likelihood of BAM is equivalent to computing the probability that the aforementioned Pólya urn process hits a certain set in its state space. This dynamic view leads to a class of novel sequential algorithms for the calculation of the marginal likelihood.
Note that hierarchical probability models, including topic models and tensor factorizations can be always expressed as Bayesian networks that include both the involved random variables as well as the probability tables. Indeed in the topic modeling literature, it is a common practice to express a joint distribution induced by a topic model using a plate notation. We argue that this detailed description may sometime blur the direct connection of topic models to Bayesian networks. We will illustrate that many known topic models, including Latent Dirichlet allocation, mixed membership stochastic blockmodels, or various NTF models such as nonnegative Tucker or canonical Polyadic decompositions can be viewed in a much simpler way, by just focusing on the discrete random variables and analytically integrating out the probability tables.
The key contribution of the present paper is a surprisingly simple generic sequential Monte Carlo (SMC) algorithm that exploits and respects the sparsity of observed tensors. The computational complexity of the algorithm scales with the total sum of the elements in the observed tensor to be decomposed and the treewidth of the graph
that defines the mark distribution. Unlike other approaches that are based on standard techniques such as non-convex optimization, variational Bayes, or Markov chain Monte Carlo(Acar and Yener, 2009; Şimşekli and Cemgil, 2012; Ermis et al., 2014; Nguyen et al., 2018), the proposed algorithm does not depend on the size of the observed tensor. This is a direct consequence of the junction tree factorization that implies also a factorized representation of the Pólya urn. Moreover, since the tokens are allocated one by one to the urn (due to the Poisson process formulation), the representation is typically sparse and this allows the use of efficient data structures for implementation.
The paper is organized as follows. In Section 2, we give a review of NTF models and setup our notation to highlight the connections to topic models and Bayesian networks. Then, Section 3 introduces BAM as a generative model for tensors and discusses the properties of the resulting model. In the subsequent Section 4
, we show how the generic model can be viewed as a Pólya urn model and derive an SMC algorithm for unbiased estimation of the marginal likelihood. Finally, Section5 contains a simulation study to illustrate the performance and practical utility of the derived algorithms when compared with variational method. We illustrate with examples where our algorithm is practical in the sparse data regime. To make the paper self contained, we also provide a generic derivation of VB (Beal et al., 2006) and related optimization techniques in the same section. We conclude with the Section 6.
In this section, we will give a short review of matrix and tensor factorizations, topic models and probabilistic graphical models. All of these subjects are quite mature where excellent and extensive reviews are available, so our purpose will be setting up a common notation and building motivation for the rest of the paper.
2.1 Nonnegative Matrix and Nonnegative Tensor Factorizations
First, we provide a review of nonnegative matrix and tensor factorizations, the latter being a generalization of the former.
2.1.1 Nonnegative matrix factorization
Nonnegative matrix factorization (NMF) is a statistical model that has been a widely adopted method for data analysis (Paatero and Tapper, 1994; Cohen and Rothblum, 1993; Lee and Seung, 2001; Gillis, 2017). In its original formulation, the NMF problem is typically cast as a minimization problem:
where is an observed matrix and and are factor matrices of sizes and respectively with nonnegative entries such that a divergence is minimized. One geometric interpretation is that the model seeks nonnegative basis vectors for represent each data vector by a conic (nonnegative) combination of . Here, replacing an index with a denotes a vector as in . As nonnegativity prevents cancellation, typical solutions of the matrices and
are empirically known to have qualitatively different behaviour than other matrix decomposition models such as principal component analysis, that can be obtained by singular value decomposition.
One possible divergence choice for is the information divergence,
We will refer to the corresponding model as KL-NMF, following the established terminology in the literature. We note that the abbreviation KL refers to the Kullback-Leibler divergence that is actually the information divergence between suitably normalized measures and where . In the sequel we will assume that the matrix has nonnegative integer entries, i.e., it is a count matrix. For practical considerations, this is not a major restriction as one can always scale finite precision rational numbers to integers by appropriate scaling.
When is a count matrix, the model is sometimes referred as the Poisson factorization model: the negative log-likelihood of the Poisson intensity is equal, up to constant terms, to the information divergence (Cemgil, 2009; Gopalan et al., 2013). More precisely, if
the log-likelihood is
Here, the statistical interpretation of NMF is estimating a low rank intensity matrix that maximizes the likelihood of observed data.
2.1.2 Nonnegative tensor factorization
Nonnegative tensor factorization (NTF) models can be viewed as structured extensions of NMF, in a sense we will now describe. Note that the matrix product is a particular matrix valued bilinear function , the element-wise products and followed by a contraction on index as . To express general models, we will introduce a multi-index notation where for a set of integers we let and for any we let denote . In the sequel, we will use NMF as a running example.
In the NMF model, we have factor matrices and indices. We redefine the factor matrices , and rename the indices as and define the sets , and . We let denote (that is ), and denote (which is ). The contraction index set is so the bilinear function can be expressed as (which is just ).
In tensor factorization, the goal is finding an approximate decomposition of a target tensor as a product of factor tensors , on a total of indices. The target tensor has indices where , where each element is denoted as and the order of this target tensor is the cardinality of , denoted by . Each of the factor tensors have indices where and each element is denoted as . As such, a tensor factorization model is
and we use the bar notation to denote the complement of an index set as . Note that this is simply a particular marginal sum over the indices that are not members of the target tensor. For any set we will refer to the corresponding marginal using the same notation .
Using the notation above, we can define some known models easily. For example, a canonical Polyadic decomposition (also known as a PARAFAC) model is defined by
and can be encoded as , , , , , where we decompose an order tensor as a product of factor matrices. Another common model is the Tucker model
that corresponds to , , , , , , where we decompose an order tensor as a product of tensors, three factor matrices and a so-called core-tensor. Many other tensor models, such as tensor trains (Oseledets, 2011), tensor networks (Orus, 2013; Cichocki et al., 2016, 2017) can also be expressed in this framework.
There exist a plethora of divergences that can be used for measuring the quality of an approximate tensor factorization (Fevotte and Idier, 2011; Finesso and Spreij, 2006; Cichocki et al., 2006). In this paper, we focus on Poisson tensor factorization, that has a particularly convenient form due to the properties of the information divergence and is a natural choice in fitting probability models. The derivative of the scalar information divergence with respect to a scalar parameter is . Thanks to the multilinear structure of a tensor model, the partial derivatives of the divergence can be written explicitly as the difference of two positive terms
After writing the Lagrangian of the objective function (2.3) for all factors for , necessary optimality criteria can be derived for each from the Karusch-Kuhn-Tucker (KKT) conditions (see, e.g., Kazemipour et al. (2017)) as
for all , . For solving the KKT conditions, one can derive a multiplicative algorithm for each factor with the ratio of the negative and positive parts of the gradient
It can be verified that this algorithm is a local search algorithm that chooses a descent direction as increases when and decreases when
. In the KL case, the same algorithm can also be derived as an expectation-maximization (EM) algorithm by data augmentation(Cemgil, 2009; Yılmaz et al., 2011). For the matrix (NMF) case, the convergence properties of the multiplicative updates are established in Finesso and Spreij (2006); Lin (2007a).
An alternative interpretation of the KKT optimality conditions is a balance condition where the marginals on must match on both sides for all
where we let
and use the letter to highlight the fact that this quantity can be interpreted as a conditional probability measure.
For a general tensor model, calculating these marginals, hence the gradient required for learning can be easily intractable. The theory of probabilistic graphical models (Lauritzen and Spiegelhalter, 1988; Maathuis et al., 2018) provides a precise characterization of the difficulty of this computation. In fact, the NTF model in (2.4) is equivalent to an undirected graphical model with hidden random variables. Here, the undirected graph is defined with a node for each index, and for each pair of nodes there is an undirected edge in the edge set if two indices appear together in an index set for . Exact inference turns out to be exponential in the treewidth of this undirected graph. The relations between tensor networks and graphical models have also been noted (Robeva and Seigal, 2018). Due to this close connection, we will review graphical models and closely related topic models, that have a well understood probabilistic interpretation and highlight their connections to NTF models. It will turn out, that the concept of conditional independence translates directly to a ‘low rank’ structure often employed in construction of tensor models.
2.2 Probabilistic Graphical Models
Probabilistic graphical models are a graph based formalism for specifying joint distributions of several random variables and for developing computation algorithms (Maathuis et al., 2018). For discrete random variables, when each for takes values in a finite set with elements, one natural parametrization of the distribution is via an -way table , where each element for specifies the probability of the event , expressed as . More formally, we have
where denotes the expectation with respect to the joint distribution and is the indicator of the condition, that is if the condition inside the bracket is true and otherwise.
For large , explicitly storing the table is not feasible due to the fact that its size is growing exponentially with , so alternative representations are of great practical interest. One such representation is the undirected graphical models briefly mentioned in the previous section. Another alternative representation is a Bayesian network (Pearl, 1988). In this representation, the probability model is specified with a reference to a graph with the set of vertices and directed edges . This graph is a directed acyclic graph (DAG) meaning that the set of directed edges are chosen such that there are no directed cycles in . Each vertex of the graph corresponds to a random variable and missing edges represent conditional independence relations.
Given a directed acyclic graph , for each vertex , we define the set of parents as the set of vertices from which there is a directed edge incident to node , defined by . We will also define the family of the node as . With these definitions, the Bayesian network encodes nothing but a factored representation of the original probability table as:
The factors are typically specified during model construction or need to be estimated from data once a structure is fixed. In theory, to obtain such a factorized representation, we can associate each factor with a conditional distribution of form defined by the ratio of two marginals
where the notation refers to a collection of random variables, indexed by as for any
. Formally, we define the (moment) parameters of a marginal distributionas
as such a marginal is, in the tensor terminology introduced in the previous section, a contraction on where we have . We can define a conditional probability table as:
We use the notation just to highlight the fact that , for all .
One key problem in probabilistic graphical models is the inference problem, that is the computation of marginal probabilities conditioned on a subset of random variables observed to be in a specific state, where is the set of visible indices. Exact inference is feasible, but this depends critically on the structure of the graph as well as the particular set of visible indices . To highlight the relation to tensor factorization, we will describe the conditioning using tensor operations and define an observation as a tensor with a single nonzero entry
The inferential goal is, given (so that ) and , computing posterior marginals of form
A general method for exact inference in graphical models is the junction tree algorithm (see, e.g., Lauritzen and Spiegelhalter (1988)) that proceeds by combining groups of nodes into cliques by moralization and triangulation steps to arrive at a representation as
where and are collection of sets named cliques and separators satisfying the running intersection property. This factorization allows a propagation algorithm on a tree for efficiently computing desired marginals. In a sense, the junction tree can be viewed as a compact representation of the joint distribution from which desired posterior marginals can still be computed efficiently. In general, each clique will contain at least one of the families of so the junction tree algorithm provides a practical method that facilitates the computation of for all , including cases where some variables are observed, i.e., when is not empty.
We note that these posterior marginals are in fact closely related to the required gradients when solving the KKT conditions in (2.7) for the NTF model. In other words, the gradients required when fitting a tensor model to data can be computed in principle by the junction tree algorithm. To see this, we assume that the observed tensor
is in fact a contingency table ofobservations, (with corresponding tensors ), where each cell of gives the total count of observing the index
In this case, given (so that ) for , the total gradient, also known as the expected sufficient statistics, is given by
2.3 Topic Models
Topic models are a class of hierarchical probabilistic generative models for relational data with the latent Dirichlet allocation (LDA) as the prototypical example (Blei et al., 2003). The LDA is introduced as a generative model for document collections and can be viewed as a full Bayesian treatment of a closely related model, probabilistic latent semantic indexing (PLSI) (Hofmann, 1999). Suppose we are given a corpus of documents with a total of words from a dictionary of size . To avoid confusion, we will refer to the individual instances of words as tokens. Formally, for each token where , we are given a data set of document labels and word labels from a fixed dictionary of size . This information is encoded as a product of two indicators and : if and only if token comes from document and it is assigned to word .
LDA is typically presented as a mixture model where one assumes that, conditioned on the document indicator , the token first chooses a topic among different topics, with document specific topic probability , then chooses the word conditioned on the topic with probability . This model can be summarized by the following hierarchical model for all
The inferential goal of LDA is, given and estimating the posterior of as well as the tables . The corresponding simplified graphical model is illustrated in Figure 1 Following the seminal work of Blei et al. (2003), many more variations have been proposed (Li and McCallum, 2006; Airoldi et al., 2008). One can view LDA as a hierarchical Bayesian model for the conditional distribution . As such, the approach can be viewed naturally as a Bayesian treatment of the graphical model with observed and and latent ; see Figure 1. In this paper, we will view topic models and structured tensor decompositions from the lens of discrete graphical models.
In topic modeling, it is common to select independent priors on factor parameters, for example in (2.12) the hyper-parameters could be specified freely. While this choice seems to be intuitive, it turns out having a quite dramatic effect on posterior inference and may even lead to possibly misleading conclusions. The choice of a Dirichlet may also seem to be arbitrary and merely due to convenience, but Geiger and Heckerman (1997) prove that under certain plausible assumptions the Dirichlet choice is inevitable. If the parameters of a Bayesian network are assumed to be globally and locally independent, and there are no extra assumptions about the conditional independence structure of the model other than the ones directly encoded by the graph, i.e., any Markov equivalent graph structures could not be discriminated, then the only possible choice of priors happens to be a collection of Dirichlet distributions satisfying equivalent sample size principle. In plain terms, this requires that all Dirichlet hyper parameters should be chosen consistently as marginal pseudo-counts from a fixed, common imaginary data set (Heckerman et al., 1995). These results, initially stated only for Bayesian structure learning problems in discrete graphical models with observable nodes directly apply to the much more general setting of topic models and tensor factorizations. Hence, in the following section, we will establish the link of topic models with tensor decomposition models and Bayesian networks. Our strategy will be introducing a dynamic model, that we coin as BAM and showing that this model can be used for constructing all the related models.
3 Allocation Model
We first define an allocation process as a collection of homogeneous Poisson processes where each index for has the cardinality of , i.e., , so we have processes. All processes are obtained by marking an homogeneous Poisson process with constant intensity that we name as the base process. The marking is done by randomly assigning each event (token) generated by the base process independently to the processes with probability . Let the event times of the base process be and define for . Note that is random also. We will also define the increments
with the convention . When , we let and refer to this object as the allocation tensor. It is useful to view the allocation tensor as a collection of cells where cell contains tokens at time and the increments as indicators of the allocation of the ’th token to cell . Later, it will be also useful to think of as a color, where each index encodes a color component.
As the assignments of tokens to cells are random, we can define the sequence of random variables , where is the index of the Poisson process that is incremented at time , i.e.,
In the general case, we can imagine the joint distribution of as an -way array (an order tensor) with entries, and specifies the probability that a token is placed into the cell with index ,
For modeling purposes, we will further assume that respects a factorization implied by a Bayesian network (see Section 2.2). As such, we have a factorization of that has the form
We will generally use , , , and for realizations of the random variables , , , , and preserve the relation between the random variables for their realizations as well. (For example, for some and we will use directly, with reference to the relation between the random variables they correspond to). Finally, we will resort to the bold-faced notation when the distinction between random variables and realizations is essential in the notation; otherwise we will stick to the lighter notation , , , and , etc. in the flow of our discussion.
3.1 Bayesian Allocation Model
To complete the probabilistic description, we let . For , we start with an order- tensor with nonnegative entries for all . In practice, we do not need to explicitly store or construct ; we will use it to consistently define the contractions for all
We model each factor in (3.2) as an independent222Though we introduced the probability tables as independent, it is straightforward to generalize the generative framework to the cases where tables are dependent as well; see subsection 5.3 for an example. random conditional probability table, and is used to assign the prior distribution on those tables.
Specifically, given the hyperparameters, the hierarchical generative model for variables , and is constructed as:
Here, is an vector obtained by fixing and varying only, and is the Dirichlet distirbution with parameter vector . Given the hierarchical model, the joint distribution for the Bayesian allocation model (BAM) is expressed as
where the distributions (densities) , and are induced by (3.4). To make our reference to the assumed probability measure clear, we will use the notation denote distributions regarding the model in (3.4).
Our construction of defining the prior parameters from a common seems restrictive, however this is required for consistency: the factorization in (3.2) is not unique; there are alternative factorizations of that are all Markov equivalent (Murphy, 2012). If we require that the distribution of should be identical for all equivalent models then defining each measure as in (3.3) is necessary and sufficient to ensure consistency by the equivalent sample size principle (Geiger and Heckerman, 1997). We discuss this further in Section 3.6, where we provide a numerical demonstration in Example 3.2.
3.1.1 Observed data as contractions
In most applications, the elements of the allocation tensor are not directly observed. In this paper, we will focus on specific types of constraints where we assume that we observe particular contractions of , of form
Here is the set of ‘visible’ indices, and are the latent indices; see Figure 2 for an illustration. Many hidden variable models such as topic models and tensor factorization models have observations of this form.