Probabilistic graphical models (PGMs) are a framework for encoding conditional independence information about multivariate distributions as graph-based representations, whose generality and interpretability have made them an indispensable tool for probabilistic modeling. Undirected graphical models (UGMs), also known as Markov random fields, form a general class of PGMs with a diverse range of applications in fields such as computer vision wang2013markov , natural language processing wang2019bert , and biology mora2011biological . More recently, the graphical structure of discrete UGMs has been shown to be closely related to that of tensor networks (TNs) duality2018 , a state-of-the-art modeling framework first developed for quantum many-body physics verstraete2008matrix ; Orus2014 , whose use in machine learning has been a subject of growing interest, for example in model compression novikov2015tensorizing ; cichocki2016tensor , in proving separations in expressivity between deep and shallow learning methods cohen2016expressive ; levine2018deep , and as standalone learning models milesdavid ; novikov2017exponential .
In this work, we explore the correspondence between UGMs and TNs in the setting of probabilistic modeling. Whereas UGMs are specifically designed to represent probability distributions, general TNs represent high-dimensional tensors whose values can be positive, negative, or even complex. While restricting TN parameters to take on non-negative values results in an exact equivalence with UGMs duality2018 , it also limits their expressivity. More general probabilistic models built from TNs, as exemplified by the Born machine (BM) bornmachine2018 model family, employ complex latent states that permit them to utilize novel forms of interference phenomena in structuring their learned distributions. While this provides a new resource for probabilistic modeling, it also limits the applicability of foundational PGM concepts such as conditional independence.
We make use of the physics-inspired concept of decoherence zurek2003decoherence to develop a hybrid framework for probabilistic modeling, which allows for the coexistence of tools and concepts from UGMs alongside quantum-like interference behavior. We use this framework to define the family of decohered Born machines (DBMs), which we prove is sufficiently expressive to reproduce any probability distribution expressible by discrete UGMs or BMs, along with more general families of TN-based models. We further show that DBMs satisfy a conditional independence property relative to their decohered regions, with the operation of decoherence permitting the values of latent random variables to be conditioned on in the same manner as in UGMs. Finally, we verify the empirical benefits of such models on a sequential modeling task.
Our work builds on the duality results of duality2018 , which establish a graphical correspondence between discrete UGMs and TNs, by further accounting for the distinct probabilistic behavior of both model classes. Much work across physics, machine learning, stochastic modeling, and automata theory has introduced and explored novel properties of quantum-inspired probabilistic models zhao2010norm ; bailly2011quadratic ; FV2012 ; gao2017efficient ; pestun2017tensor ; pestun2017language ; bornmachine2018 ; stoudenmire2018learning ; stokes_terilla ; benedetti2019generative ; BST_2020 ; miller2021tensor ; gao2021enhancing , almost all of which explicitly or implicitly employ tensor networks. The relative expressivity of these models was explored in glasser2019 ; adhikary2021quantum , where quantum-inspired models were proven to be inequivalent to graphical models. Fully-quantum generalizations of various graphical models were investigated in leifer2008quantum .
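We work with real and complex finite-dimensional vector spaces, where $\mathbb{F}$ denotes one of $\mathbb{R}$ or $\mathbb{C}$ when the distinction is not needed. We take an $n$th order tensor, or $n$-tensor, over $\mathbb{F}$ to be a scalar-valued map $T: [d_1] \times \cdots \times [d_n] \to \mathbb{F}$ from an $n$-fold Cartesian product of index sets, where $[d] = \{1, \ldots, d\}$ and where the vector space of all $n$-tensors is denoted by $\mathbb{F}^{d_1 \times \cdots \times d_n}$. Matrices, vectors, and scalars over $\mathbb{F}$ respectively correspond to 2-tensors, 1-tensors, and 0-tensors, whereas higher-order tensors refers to any $n$-tensor for $n \geq 3$. The elements of $T$ are the individual values of $T$ on input tuples, written as $T_{i_1 i_2 \cdots i_n}$, while the $k$th mode of $T$ refers to the $k$th argument of $T$. The contraction of a vector $v \in \mathbb{F}^{d_k}$ with the $k$th mode of an $n$-tensor $T$ is the $(n-1)$-tensor $T'$ whose elements satisfy $T'_{i_1 \cdots i_{k-1} i_{k+1} \cdots i_n} = \sum_{i_k=1}^{d_k} T_{i_1 \cdots i_k \cdots i_n} v_{i_k}$. Although dense representations of $n$-tensors require $\mathcal{O}(d^n)$ parameters to specify, where $d = \max_k d_k$, we will see later how tensor networks bypass this exponential scaling for many families of higher-order tensors. As one simple example, the tensor product of any $n$-tensor $T$ and $m$-tensor $T'$ is the $(n+m)$-tensor $T \otimes T'$ whose elements are given by $(T \otimes T')_{i_1 \cdots i_n j_1 \cdots j_m} = T_{i_1 \cdots i_n} T'_{j_1 \cdots j_m}$. We use $\mathbb{R}_{\geq 0}$ to indicate the non-negative real numbers, and take the 2-norm of a tensor $T$ to be the scalar $\|T\|_2 = \sqrt{\sum_{i_1, \ldots, i_n} |T_{i_1 \cdots i_n}|^2}$. Finally, we use $M^\dagger$ to indicate the conjugate transpose of a complex vector or matrix $M$.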
We focus exclusively on undirected graphs $G = (V, E)$, whose vertex and edge sets are denoted by $V$ and $E$. In anticipating the needs of tensor networks, we allow the existence of edges which are incident to only one node, which we refer to as visible (i.e. dangling) edges. We use $E_V \subseteq E$ to indicate the set of all visible edges, and $E_H = E \setminus E_V$ to indicate the set of all hidden edges, which are edges incident to two nodes. Graphs without dangling edges will be called proper graphs. For any node $v \in V$, we denote the set of edges incident to $v$ by $E(v)$. A clique of $G$ is a maximal subset $C \subseteq V$ such that every pair of nodes in $C$ is connected by an edge, and we use $\mathcal{C}(G)$ to denote the set of all cliques of $G$. We define a cut set of $G$ to be any set of edges $E_C \subseteq E$ such that the removal of all edges in $E_C$ from $G$ partitions the graph into two disjoint non-empty sub-graphs.
We work with random variables (RVs), indicated by uppercase letters such as $X$, and their possible outcomes, indicated by lowercase equivalents such as $x$. RVs and their outcomes are frequently indexed with values chosen from an index set, for example $I = \{1, \ldots, n\}$, in which case the notation $X_I$ indicates the joint RV $(X_1, \ldots, X_n)$. A similar notation is used for multivariate functions $f(x_I) = f(x_1, \ldots, x_n)$, as well as for tensor elements $T_{x_I} = T_{x_1 \cdots x_n}$, and a related notation is used for spaces of tensors. Given three disjoint sets of random variables $X_A$, $X_B$, $X_C$, we use $X_A \perp X_B \mid X_C$ to indicate the conditional independence of $X_A$ and $X_B$ given $X_C$, and $X_A \perp X_B$ to indicate the (unconditional) independence of $X_A$ and $X_B$.
2.1 Undirected Graphical Models
Probabilistic graphical models (PGMs) represent multivariate probability distributions using a proper graph $G = (V, E)$ whose nodes $v \in V$ each correspond to distinct RVs $X_v$. We focus on undirected graphical models (UGMs), whose probability distributions are determined by a collection of clique potentials $\phi_C$, non-negative valued functions of the RVs $X_C$ associated with the nodes in $C$, where $C$ ranges over all cliques of $G$. Given a UGM with clique potentials $\phi_C$ defined on a graph with $n$ nodes, the probability distribution represented by the UGM is
$$P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{C} \phi_C(x_C), \tag{1}$$
where the product runs over all cliques $C$ of $G$ and $Z$ is a normalization factor ensuring that the probabilities sum to one.
For brevity, we will often omit normalization factors such as $Z$ in the following, with the understanding that such terms must ultimately be added to ensure a valid probability distribution. UGMs satisfy an intuitive conditional independence property involving disjoint subsets of nodes $A, B, C \subseteq V$ for which the removal of $C$ leaves the nodes of $A$ and $B$ in separate disconnected subgraphs of $G$. In this case, the RVs associated with these nodes satisfy $X_A \perp X_B \mid X_C$.
While the definition above is the standard presentation of UGMs, to permit an easier comparison with tensor networks we will more frequently view them using a dual graphical formulation. In this dual picture, nodes represent clique potentials and edges represent RVs.
3 Tensor Networks
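Tensor networks (TNs) provide a general means to efficiently represent higher-order tensors in terms of smaller tensor cores, in much the same way as UGMs efficiently represent multivariate probability distributions in terms of smaller clique potentials. Tensor contraction is crucial in the structure of TNs, and generally involves the multiplication of an $n$-tensor $T$ and an $m$-tensor $T'$ along modes of equal dimension $d$, to yield a single output $(n+m-2)$-tensor $T''$. Taking the contracted modes to be the last mode of $T$ and the first mode of $T'$ for concreteness, its elements are given by
$$T''_{i_1 \cdots i_{n-1} j_2 \cdots j_m} = \sum_{k=1}^{d} T_{i_1 \cdots i_{n-1} k} \, T'_{k j_2 \cdots j_m}. \tag{2}$$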
Although Equation (2) appears complex in its general form, it is worth verifying that tensor contraction generalizes matrix-vector and matrix-matrix multiplication, along with vector inner product and scalar multiplication. Tensor contraction is associative, in the sense that multiple contractions between multiple tensors yield the same output regardless of the order of contraction, although different contraction orderings often have vastly different memory and compute requirements.
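The special cases mentioned above can be checked numerically. The following is a minimal sketch using NumPy's `einsum`; the array names, shapes, and random seed are illustrative assumptions and do not come from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))  # 2-tensor (matrix)
B = rng.normal(size=(4, 5))  # 2-tensor (matrix)
C = rng.normal(size=(5, 2))  # 2-tensor (matrix)
v = rng.normal(size=4)       # 1-tensor (vector)
w = rng.normal(size=4)       # 1-tensor (vector)

# Matrix-matrix product: contract the last mode of A with the first mode of B.
AB = np.einsum("ik,kj->ij", A, B)

# Matrix-vector product: a 2-tensor and a 1-tensor give a (2 + 1 - 2) = 1-tensor.
Av = np.einsum("ik,k->i", A, v)

# Bilinear inner product (no complex conjugation): two 1-tensors give a 0-tensor.
vw = np.einsum("k,k->", v, w)

# Contracting a 3-tensor with a matrix gives a (3 + 2 - 2) = 3-tensor.
T = rng.normal(size=(2, 3, 4))
TB = np.einsum("abk,kj->abj", T, B)

# Associativity: both contraction orders for A, B, C give the same matrix,
# although the intermediate results have different shapes and costs.
left_first = np.einsum("ij,jl->il", np.einsum("ik,kj->ij", A, B), C)
right_first = np.einsum("ik,kl->il", A, np.einsum("kj,jl->kl", B, C))
```

Here the intermediate result of `left_first` is a 3-by-5 matrix while that of `right_first` is a 4-by-2 matrix, a small instance of how the contraction order changes memory use without changing the output.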
Tensor network diagrams penrose71 provide an intuitive formalism for reasoning about computations involving tensor contraction using undirected graphs. In a TN diagram, each $n$-tensor $T$ is represented as a node of degree $n$, and each mode of $T$ is represented as an edge incident to that node. Tensor contraction between two tensors along a pair of modes is depicted by connecting the corresponding edges of the nodes, with the actual operation of tensor contraction depicted by merging the nodes representing both input tensors into a single node which shares the visible edges of both input nodes. In this manner, a TN diagram with $n$ visible edges and any number of hidden edges specifies a sequence of tensor contractions whose output will always be an $n$-tensor. For example, a TN diagram of three nodes joined in a line by two hidden edges expresses the tensor contraction used in the SVD to express a matrix as the product of three smaller matrices. As a degenerate case, the tensor product of two tensors is depicted by drawing them adjacent to each other, with no connected edges.
Tensor networks use a fixed TN diagram to efficiently parameterize a family of higher-order tensors in terms of a family of smaller dense tensor cores, as stated in the following:
Definition 1. A tensor network consists of a graph $G = (V, E)$, along with a positive integer valued map $d$ assigning each edge $e \in E$ to a vector space of dimension $d(e)$, and a map assigning each node $v \in V$ to a tensor core $A_v$ whose shape is determined by the dimensions assigned to edges incident to $v$. The tensor represented by a tensor network with nodes $V$ is that resulting from a contraction of all tensor cores according to the tensor network diagram defined by $G$.
Dimensions assigned to hidden edges are referred to as bond dimensions, and for a fixed graph they represent the primary hyperparameters setting the tradeoff between a TN’s compute/memory efficiency and its expressivity. A simple but illustrative example of a TN is a low-rank matrix factorization, whose graph is the line graph on two nodes, and whose single hidden edge is associated with a bond dimension equal to the rank of the parameterized matrices.
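The low-rank factorization example can be made concrete as follows. This is an illustrative sketch only: the sizes `m`, `n`, and the bond dimension `r` are arbitrary assumptions, and the cores are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 40, 3  # matrix shape and bond dimension (illustrative values)

U = rng.normal(size=(m, r))  # core on the first node: one visible and one hidden edge
W = rng.normal(size=(r, n))  # core on the second node: one hidden and one visible edge

# Contract the single hidden edge to obtain the represented matrix.
M = np.einsum("ih,hj->ij", U, W)

dense_params = m * n      # parameters of a dense m-by-n matrix
tn_params = r * (m + n)   # parameters stored in the two cores
rank = np.linalg.matrix_rank(M)
```

For generic cores the rank of `M` equals the bond dimension `r`, while the parameter count grows linearly rather than quadratically in the matrix size.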
A simple family of tensors plays a key role in understanding the relationship between UGMs and TNs. Given an orthonormal basis $\{e_i\}_{i=1}^{d}$ for a vector space $\mathbb{F}^d$, for each $n \geq 1$ we define the $n$th order copy tensor associated with this basis to be $\Delta_n = \sum_{i=1}^{d} e_i^{\otimes n}$. When $\Delta_n$ is contracted with any of the basis vectors $e_i$, the result is a tensor product of $n - 1$ independent copies of $e_i$ (Figure 1f). This convenient property only holds for vectors chosen from the basis defining the copy tensor, leading to a one-to-one correspondence between copy tensor families and orthonormal bases coecke_pavlovic_vicary_2013 . The copy tensors $\Delta_1$ and $\Delta_2$ respectively correspond to the $d$-dimensional all-ones vector and identity matrix, with the former allowing the expression of sums over tensor elements.
A general copy tensor $\Delta_n$ is depicted graphically as a single black dot with $n$ edges (Figure 1e), while $\Delta_2$ is depicted as an undecorated edge. These satisfy a useful closure property under tensor contraction, whereby any connected network of copy tensors is identical to a single copy tensor with the same number of visible edges (fong2019invitation, Theorem 6.45). This property allows connected networks of copy tensors to be rearranged in any manner, so long as the number of visible edges remains unchanged (Figure 1g). While general tensor network diagrams remain unaffected by changes of basis in the hidden edges, the use of copy tensors permits the graphical depiction of a larger family of basis-dependent linear algebraic operations (Figure 1h–j).
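The copying and closure properties can be verified directly for copy tensors defined in the standard basis. The helper `copy_tensor` below is a hypothetical name introduced for this sketch, and the dimension is an arbitrary choice.

```python
import numpy as np

def copy_tensor(order, dim):
    """Copy tensor in the standard basis: 1 where all indices agree, 0 elsewhere."""
    delta = np.zeros((dim,) * order)
    for i in range(dim):
        delta[(i,) * order] = 1.0
    return delta

d = 3
delta1 = copy_tensor(1, d)
delta2 = copy_tensor(2, d)
delta3 = copy_tensor(3, d)
delta4 = copy_tensor(4, d)

# Copying property: contracting with a basis vector yields two copies of it.
e1 = np.eye(d)[1]
copied = np.einsum("ijk,i->jk", delta3, e1)

# The property fails for a generic vector: the result is diag(v), not v (x) v.
v = np.array([1.0, 2.0, 3.0])
not_copied = np.einsum("ijk,i->jk", delta3, v)

# Closure: two third-order copy tensors joined by one hidden edge
# equal a single fourth-order copy tensor.
merged = np.einsum("ijh,hkl->ijkl", delta3, delta3)
```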
3.1 Undirected Graphical Models as Non-Negative Tensor Networks
Discrete multivariate probability distributions are an important example of higher-order tensors, with the individual probabilities of a distribution over $n$ discrete RVs forming the elements of an $n$-tensor. More generally any non-negative tensor $T$, whose elements all satisfy $T_{x_1 \cdots x_n} \geq 0$, can be converted into a probability distribution by normalizing as $P(x_1, \ldots, x_n) = \frac{1}{Z} T_{x_1 \cdots x_n}$, where $Z = \sum_{x_1, \ldots, x_n} T_{x_1 \cdots x_n}$.
The connection between multivariate probability distributions and the structure of higher-order tensors extends further, with the independence relation $X_A \perp X_B$ between two disjoint sets of RVs being equivalent to the factorization of their joint probability distribution as the tensor product $P(x_A, x_B) = P(x_A) \otimes P(x_B)$. Methods used to efficiently represent and learn higher-order tensors, such as tensor networks, can be applied to probabilistic modeling, provided that some means of parameterizing only non-negative tensors is employed. We discuss two important approaches for achieving this probabilistic parameterization, one equivalent to undirected graphical models and the other to Born machines.
It was shown in (duality2018, Theorem 2.1) that the data defining a UGM is equivalent to that defining a TN, but with dual graphical notations that interchange the roles of nodes and edges. Converting from a UGM to an equivalent TN involves expressing each clique potential on a clique of size $k$ as a $k$th-order tensor core, depicted as a degree-$k$ node of the TN diagram. Meanwhile, each UGM node representing a discrete RV is replaced by a copy tensor of degree equal to the number of clique potentials the RV occurs in, plus one additional visible edge permitting the values of the RV to appear in the probability distribution described by the TN (Figure 2a). (Copy tensors were used implicitly in duality2018 , in the form of hyperedges within a defining hypergraph.) Since every tensor core consists of non-negative elements, the resultant TN is guaranteed to describe a non-negative higher-order tensor. We refer to this family of TN models as non-negative tensor networks.
In the dual graphical notation of TNs, marginalization of and conditioning on RVs in UGMs is achieved by contracting each visible edge of the associated TN with either a first-order copy tensor $\Delta_1$ (marginalization) or an outcome-dependent basis vector $e_x$ (conditioning), respectively. Computing the resulting distribution over the remaining RVs is then a straightforward application of tensor contraction duality2018 , where any nodes of the TN with no remaining visible edges are merged together (Figure 2b). Furthermore, since variables are associated to copy tensors, the conditional independence property of UGMs can be proven using the copying property of copy tensors (Figure 2c). The appropriate formulation of conditional independence is slightly different in the dual TN notation, owing to the association of RVs to edges rather than nodes. In this graphical framework, conditional independence arises when a conditioning set of RVs $X_C$ forms a cut set of the underlying TN graph, in which case the RVs $X_A$ and $X_B$ associated with the two partitions of the graph induced by this cut set will satisfy $X_A \perp X_B \mid X_C$.
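As a small worked instance of these operations, consider a chain UGM on three RVs with cliques $\{X_1, X_2\}$ and $\{X_2, X_3\}$, written as a non-negative TN with a copy tensor on the shared variable. The potentials below are random placeholders chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
phi12 = rng.uniform(size=(d, d))  # clique potential on (X1, X2)
phi23 = rng.uniform(size=(d, d))  # clique potential on (X2, X3)

# Third-order copy tensor for X2: two hidden edges plus one visible edge.
delta3 = np.zeros((d, d, d))
delta3[np.arange(d), np.arange(d), np.arange(d)] = 1.0

# Contract the network and normalize to obtain P(x1, x2, x3).
T = np.einsum("ai,ibj,jc->abc", phi12, delta3, phi23)
P = T / T.sum()

# Marginalization: contract the visible edge of X3 with the all-ones vector.
P12 = np.einsum("abc,c->ab", P, np.ones(d))

# Conditioning: contract the visible edge of X2 with a basis vector.
x2 = 1
cond = np.einsum("abc,b->ac", P, np.eye(d)[x2])
cond = cond / cond.sum()  # P(x1, x3 | X2 = x2)

# X2 is a cut set, so the conditional factorizes into its two marginals.
p1 = cond.sum(axis=1)
p3 = cond.sum(axis=0)
```

The conditional table `cond` equals the outer product of `p1` and `p3`, which is the statement $X_1 \perp X_3 \mid X_2$ for this chain.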
The reverse direction of converting non-negative TNs into UGMs is also straightforward, though care is needed with the treatment of hidden edges that aren’t connected to copy tensors. In such cases, we can replace any hidden edge with a third-order copy tensor , yielding a new visible edge which encodes a latent RV in an enlarged distribution. This enlargement process is reversible, in the sense that marginalizing over all latent RVs associated with hidden edges yields the original distribution, allowing hidden edges of a non-negative TN to be treated as visible edges without any loss of generality (Figure 2d). We will see shortly that this property is not shared by more general probabilistic TN models.
4 Born Machines
While UGMs represent one means of parameterizing non-negative tensors for probabilistic modeling, an alternate approach is suggested by quantum physics. Discrete quantum systems are fully described by complex-valued wavefunctions, higher-order tensors which yield probabilities under the Born rule of quantum mechanics. The efficacy of TNs in learning quantum wavefunctions inspired the Born machine (BM) model bornmachine2018 .
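Definition 2. A Born machine consists of a tensor network over a graph $G$ containing $n$ visible edges, whose associated tensor $\psi$ is converted into a probability distribution via the Born rule $P(x_1, \ldots, x_n) = |\psi_{x_1 \cdots x_n}|^2 / \|\psi\|_2^2$, where $\|\psi\|_2$ denotes the 2-norm of $\psi$.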
The Born rule permits the (unnormalized) probability distribution associated with a BM to be expressed as a single composite TN, consisting of two copies of the TN parameterizing $\psi$, one with all core tensor values complex-conjugated, and where all pairs of visible edges have been merged via third-order copy tensors (Figure 3a). Expressing the BM distribution as a single composite TN allows efficient marginal and conditional inference procedures to be applied in a manner analogous to UGMs, namely by contracting the visible edges of the composite TN with all-ones vectors $\Delta_1$ (marginalization) or outcome-dependent basis vectors $e_x$ (conditioning), and then contracting regions of the TN with no remaining visible edges. The “doubled up” nature of the composite TN means that intermediate states occurring during inference are described by density matrices, which are positive semidefinite matrices employed in quantum mechanics whose non-negative diagonal entries correspond to (unnormalized) probabilities, and whose off-diagonal elements are referred to as “coherences.”
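The composite TN and the appearance of density matrices can be illustrated on the smallest non-trivial BM, two cores joined by one hidden edge. The core names `A` and `B`, the dimensions, and the random complex values are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 2  # visible dimension and bond dimension (illustrative)
A = rng.normal(size=(d, D)) + 1j * rng.normal(size=(d, D))
B = rng.normal(size=(D, d)) + 1j * rng.normal(size=(D, d))

# Born rule applied to the contracted tensor psi.
psi = np.einsum("ah,hb->ab", A, B)
P = np.abs(psi) ** 2
P = P / P.sum()

# Composite TN: two copies of the network, one conjugated, with each pair
# of visible edges merged (repeated output indices a and b).
composite = np.einsum("ah,ag,hb,gb->ab", A, A.conj(), B, B.conj()).real
composite = composite / composite.sum()

# Marginalizing X1 first leaves a density matrix on the doubled hidden edge.
rho = np.einsum("ah,ag->hg", A, A.conj())
p_x2 = np.einsum("hg,hb,gb->b", rho, B, B.conj()).real
p_x2 = p_x2 / p_x2.sum()
```

The intermediate `rho` is Hermitian and positive semidefinite, and its off-diagonal entries are the coherences that contribute to `p_x2`.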
The existence of non-zero coherences gives BMs the ability to utilize quantum-like interference phenomena in modeling probability distributions, but also makes it difficult to interpret the operation of a BM by assigning latent RVs to its edges, as was possible with non-negative TNs. While we can force a new RV into existence by extracting the diagonal elements of intermediate density matrices using copy tensors (Figure 3b), this causes the elimination of all coherences in density matrices passing through the edge, with the result that the distribution after marginalizing the new latent variable differs from the original BM distribution (Figure 3c). This fact, which can be seen as an analogue of the measurement-induced “observer effect” in quantum mechanics, represents a tradeoff between expressivity and interpretability in BMs that isn’t available in PGMs.
5 A Hybrid Framework
While the graphical structure of Born machines is useful for defining area laws, which characterize the attainable mutual information between subsets of RVs eisert2010colloquium ; lu2021 , they do not permit the formulation of conditional independence results, something which is a major benefit of PGMs. A primary reason for this is the inability to freely assign latent RVs to the hidden edges of BMs without disturbing the original distribution. However, by accounting for this disturbance in a principled manner, it is possible to combine the representational advantages available to BMs with the conditional independence guarantees available to PGMs.
A crucial tool is the concept of decoherence, whereby all off-diagonal coherences of a hidden density matrix are set to zero, leaving only a probability distribution on the diagonal of the operator. This can be carried out by the action of a decoherence operator, a fourth-order copy tensor acting on density matrices which we write as $\mathcal{D}$ (Figure 4a). The operator $\mathcal{D}$ is the natural result of converting a hidden edge of a BM into a latent RV and then marginalizing. We can use this idea to decohere certain edges of a BM in advance, leading to the notion of decohered Born machine models (Figure 4b).
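The action of the decoherence operator on a density matrix, and its effect on the resulting distribution, can be sketched as follows. The two-core model, the helper names `copy_tensor` and `marginal_x2`, and the random values are hypothetical choices for illustration.

```python
import numpy as np

def copy_tensor(order, dim):
    """Copy tensor in the standard basis: 1 where all indices agree, 0 elsewhere."""
    delta = np.zeros((dim,) * order)
    for i in range(dim):
        delta[(i,) * order] = 1.0
    return delta

rng = np.random.default_rng(0)
d, D = 3, 2
A = rng.normal(size=(d, D)) + 1j * rng.normal(size=(d, D))
B = rng.normal(size=(D, d)) + 1j * rng.normal(size=(D, d))

# Density matrix on the doubled hidden edge after marginalizing X1.
rho = np.einsum("ah,ag->hg", A, A.conj())

# Decoherence operator: a fourth-order copy tensor applied to both indices of rho.
delta4 = copy_tensor(4, D)
rho_dec = np.einsum("hgkl,hg->kl", delta4, rho)

def marginal_x2(density):
    """Distribution of X2 obtained by pushing a hidden density matrix through B."""
    p = np.einsum("hg,hb,gb->b", density, B, B.conj()).real
    return p / p.sum()

p_bm = marginal_x2(rho)       # original Born machine
p_dbm = marginal_x2(rho_dec)  # same cores with the hidden edge decohered
```

Decoherence keeps the diagonal of `rho` and removes its coherences, so `p_bm` and `p_dbm` generally differ even though both are valid distributions.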
Definition 3. A decohered Born machine (DBM) consists of a Born machine over a graph $G$ along with a subset $E_D$ of its hidden edges, referred to as the model’s decohered edges. The probability distribution represented by a DBM is given by the composite TN associated to the original BM, but with each pair of hidden edges in $E_D$ replaced by a decoherence tensor $\mathcal{D}$. A DBM for which $E_D$ contains every hidden edge is referred to as a fully-decohered Born machine.
Having Definition 3 in hand, we would like to first characterize the expressivity of DBMs. It is clear that standard BMs are a special case of DBMs, where the decohered edge set is taken to be empty. On the other hand, we show in the following two results that fully-decohered BMs are equivalent in expressive power to discrete UGMs.
Theorem 1. The probability distribution expressed by a fully-decohered Born machine with tensor cores $A_v$, one for each node $v$, is identical to that of a discrete undirected graphical model with clique potentials $\phi_v$ of the same shape, and whose values are given by $\phi_v(x_{E(v)}) = |(A_v)_{x_{E(v)}}|^2$, where $x_{E(v)}$ contains the RVs from the set $E(v)$ of all edges incident to $v$.
The proof of Theorem 1 is given in the supplemental material, with the basic idea illustrated in Figure 4c. Each decoherence operator can be written as the product of two third-order copy tensors, each of which can be assigned to one pair of TN cores adjacent to the decohered edge. In the case that all edges of a DBM are decohered, these copy tensors allow each pair of cores $A_v$ and $\overline{A_v}$ to be replaced by their element-wise product, giving an effective clique potential with non-negative values. The UGM formed by these clique potentials has the same graphical structure as the TN describing the original BM (up to graphical duality). Conversely, the correspondence given in Theorem 1 suggests a simple method for representing any discrete UGM as a fully-decohered BM.
Corollary 1. The probability distribution of any discrete undirected graphical model with clique potentials $\phi_C$ is identical to that of any fully-decohered Born machine with tensor cores $A_{v_C}$ of the same shape, and whose elements are given by $(A_{v_C})_{x_C} = e^{i \theta_C(x_C)} \sqrt{\phi_C(x_C)}$, where $\theta_C$ can be any real-valued function, and where $v_C$ indicates the TN node dual to the clique $C$.
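Both directions of this correspondence can be checked numerically on a two-core model whose single hidden edge is decohered and then marginalized. The function names and random inputs below are assumptions for this sketch, which is a sanity check rather than a substitute for the proofs in the supplemental material.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 2

# Fourth-order copy tensor acting as the decoherence operator on the hidden edge.
delta4 = np.zeros((D, D, D, D))
for h in range(D):
    delta4[h, h, h, h] = 1.0

def fully_decohered_bm(A, B):
    """Composite TN of a two-core BM with its hidden edge decohered."""
    T = np.einsum("ah,ag,hgkl,kb,lb->ab", A, A.conj(), delta4, B, B.conj()).real
    return T / T.sum()

def ugm(phi1, phi2):
    """Two-potential UGM in the dual picture, with the shared RV summed out."""
    T = np.einsum("ah,hb->ab", phi1, phi2)
    return T / T.sum()

# Theorem 1 direction: potentials are the squared magnitudes of the cores.
A = rng.normal(size=(d, D)) + 1j * rng.normal(size=(d, D))
B = rng.normal(size=(D, d)) + 1j * rng.normal(size=(D, d))
p_dbm = fully_decohered_bm(A, B)
p_ugm = ugm(np.abs(A) ** 2, np.abs(B) ** 2)

# Corollary 1 direction: cores are square roots of potentials with arbitrary phases.
phi1 = rng.uniform(size=(d, D))
phi2 = rng.uniform(size=(D, d))
theta1 = rng.uniform(0.0, 2.0 * np.pi, size=(d, D))
theta2 = rng.uniform(0.0, 2.0 * np.pi, size=(D, d))
A2 = np.sqrt(phi1) * np.exp(1j * theta1)
B2 = np.sqrt(phi2) * np.exp(1j * theta2)
p_dbm2 = fully_decohered_bm(A2, B2)
p_ugm2 = ugm(phi1, phi2)
```

The phases `theta1` and `theta2` drop out entirely once every hidden edge is decohered, which is why the corollary permits an arbitrary real-valued phase function.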
Although standard BMs and UGMs operate very differently—and in the case of line graphs have been proven to have inequivalent expressive power glasser2019 —we see that DBMs offer a unified means of representing both families of models with an identical parameterization. We further prove in the supplemental material that DBMs are equivalent in expressivity to locally purified states, a model family generalizing both BMs and UGMs.
Another motivation for the use of DBMs is in enabling conditional independence guarantees within the setting of quantum-inspired TN models. The ability to replace any decoherence operator by a fifth-order copy tensor with a visible edge lets us assign RVs to all decohered edges, such that marginalizing over these new RVs yields the original DBM distribution (Figure 4a). These new RVs behave identically to those of a UGM, letting us demonstrate a conditional independence property.
Theorem 2. Consider a DBM with underlying graph $G$ and decohered edges $E_D$, along with a subset $E_C \subseteq E_D$ which forms a cut set for $G$. If we denote by $X_C$ the set of RVs associated to $E_C$, and denote by $X_A$ and $X_B$ the sets of RVs associated to the two partitions of $G$ induced by the cut set $E_C$, then the DBM distribution satisfies the conditional independence property $X_A \perp X_B \mid X_C$.
While the complete proof of Theorem 2 is given in the supplemental material, the idea is simple (Figure 4d). The insertion of decoherence operators, which are examples of copy tensors, into the composite TN giving the DBM distribution allows any basis vector used for conditioning to be copied to all edges incident to the copy tensor. This in turn removes any direct correlations between the nodes on either side of the decohered edge, so that conditioning on a collection of RVs associated with a cut set of decohered edges results in a factorization of the conditional composite TN into a tensor product of two independent pieces.
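The factorization argument can be illustrated on a three-core chain in which only the first hidden edge is decohered, so that the second hidden edge still carries coherences. The core names, dimensions, and random values are hypothetical; the fifth-order copy tensor exposes the decohered edge as a visible RV.

```python
import numpy as np

def copy_tensor(order, dim):
    """Copy tensor in the standard basis: 1 where all indices agree, 0 elsewhere."""
    delta = np.zeros((dim,) * order)
    for i in range(dim):
        delta[(i,) * order] = 1.0
    return delta

rng = np.random.default_rng(0)
d, D = 3, 2

def random_core(*shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

A = random_core(d, D)     # visible X1, hidden edge h1 (decohered)
M = random_core(D, d, D)  # hidden h1, visible X2, hidden h2 (coherent)
B = random_core(D, d)     # hidden h2, visible X3

# Joint over (X1, H1, X2, X3): the decoherence operator on h1 is replaced by
# a fifth-order copy tensor whose extra edge k is the new visible RV H1.
delta5 = copy_tensor(5, D)
joint = np.einsum("ap,aq,pqkrs,rbh,sbg,hc,gc->akbc",
                  A, A.conj(), delta5, M, M.conj(), B, B.conj()).real
joint = joint / joint.sum()

# DBM distribution over (X1, X2, X3), using the fourth-order decoherence tensor.
delta4 = copy_tensor(4, D)
p_dbm = np.einsum("ap,aq,pqrs,rbh,sbg,hc,gc->abc",
                  A, A.conj(), delta4, M, M.conj(), B, B.conj()).real
p_dbm = p_dbm / p_dbm.sum()

# Condition on H1 = 0: the cut set {h1} separates X1 from (X2, X3).
cond = joint[:, 0, :, :]
cond = cond / cond.sum()
factorized = np.einsum("a,bc->abc", cond.sum(axis=(1, 2)), cond.sum(axis=0))

# Without conditioning, X1 and (X2, X3) are generally dependent.
unconditioned = np.einsum("a,bc->abc", p_dbm.sum(axis=(1, 2)), p_dbm.sum(axis=0))
```

Marginalizing `joint` over the new RV recovers `p_dbm`, and `cond` matches `factorized`, which is the conditional independence property of Theorem 2 for this chain.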
Having discussed the representational advantages of DBMs for structured discrete probability distributions, we now investigate their empirical advantages over UGMs with a similar graphical structure. We define both DBM and UGM models on an undirected version of the graph defining a hidden Markov model (HMM), and evaluate their relative performance on a sequential (flattened) version of the Bars and Stripes dataset mackay2003information . Bars and Stripes consists of two kinds of binary images: ones with only horizontal bars, and ones with only vertical stripes. Intuitively, we expect that the interference behavior available in DBMs can capture correlations more efficiently when the distribution being modeled exhibits regular periodic behavior. The 1D periodic patterns seen in the vertical stripe images give us an ideal setting for testing this hypothesis.
Figure 5 shows the results of optimizing a UGM and DBM to maximize marginal likelihood over observations of flattened images. Our models were implemented and trained with JAX jax2018github . The results are favorable to the DBM, which achieves both a lower average NLL on held-out data than the UGM and progressively shorter training times as the hidden dimensions of the models are increased. We conjecture that DBM models will more broadly have performance advantages in modeling real-world data processes that exhibit long-term periodic behavior, such as geophysical processes, mechanical systems like human gait, and more general stochastic processes with some cyclical dynamics.
We use the physically-motivated notion of decoherence to define decohered Born machines (DBMs), a new family of probabilistic models that serve as a bridge between PGMs and TNs. As shown in Theorem 1 and Corollary 1, fully decohering a BM gives rise to a UGM, and conversely any subgraph of a UGM can be viewed as the decohered version of some BM. Crucial to this back-and-forth passage is the use of copy tensors, which further allows conditional independence guarantees in the context of TN modeling and provides an additional correspondence between the two modeling frameworks. An immediate limitation of our results surrounding DBMs is the focus on UGMs only. An extension to directed graphical models is left for future work, as is a deeper investigation into what kinds of problems could most benefit from utilizing quantum interference effects in the manner proposed. It is possible that DBMs would improve the performance of existing graphical model inference and learning algorithms by replacing sub-regions of the model with quantum-style ingredients, although a more systematic exploration of this question is needed. The integration of “classical” and “quantum” ingredients represented by a DBM further makes it a natural candidate for quantum machine learning, as decoherence represents a natural form of noise present in quantum hardware in the noisy intermediate-scale quantum (NISQ) era Preskill2018quantumcomputing .
The authors thank Guillaume Verdon, Antonio Martinez, and Stefan Leichenauer for helpful discussions, and Jae Hyeon Yoo for engineering support. Geoffrey Roeder is supported in part by the Natural Sciences and Engineering Research Council of Canada (grant no. PGSD3-518716-2018).
-  Sandesh Adhikary, Siddarth Srinivasan, Jacob Miller, Guillaume Rabusseau, et al. Quantum tensor networks, stochastic processes, and weighted automata. In International Conference on Artificial Intelligence and Statistics, pages 2080–2088. PMLR, 2021.
-  Raphael Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. In Asian Conference on Machine Learning, pages 147–163. PMLR, 2011.
-  Marcello Benedetti, Delfina Garcia-Pintos, Oscar Perdomo, Vicente Leyton-Ortega, Yunseong Nam, and Alejandro Perdomo-Ortiz. A generative modeling approach for benchmarking and training shallow quantum circuits. npj Quantum Information, 5(1):1–9, 2019.
-  James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
-  Tai-Danae Bradley, E M Stoudenmire, and John Terilla. Modeling sequences with quantum states: a look under the hood. Machine Learning: Science and Technology, 1(3):035008, 2020.
-  Andrzej Cichocki, Namgil Lee, Ivan Oseledets, Anh-Huy Phan, Qibin Zhao, and Danilo P Mandic. Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends in Machine Learning, 9(4-5):249–429, 2016.
-  Bob Coecke, Dusko Pavlovic, and Jamie Vicary. A new description of orthogonal bases. Mathematical Structures in Computer Science, 23(3):555–567, 2013.
-  Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728. PMLR, 2016.
-  Ralph DeMarr. Nonnegative matrices with nonnegative inverses. Proceedings of the American Mathematical Society, 35(1):307–308, 1972.
-  Jens Eisert, Marcus Cramer, and Martin B Plenio. Colloquium: Area laws for the entanglement entropy. Reviews of Modern Physics, 82(1):277, 2010.
-  Andrew J. Ferris and Guifre Vidal. Perfect sampling with unitary tensor networks. Phys. Rev. B, 85:165146, 2012.
-  Brendan Fong and David I. Spivak. An Invitation to Applied Category Theory: Seven Sketches in Compositionality. Cambridge University Press, 2019.
-  Xun Gao, Eric R Anschuetz, Sheng-Tao Wang, J Ignacio Cirac, and Mikhail D Lukin. Enhancing generative models via quantum correlations. arXiv preprint arXiv:2101.08354, 2021.
-  Xun Gao, Zhengyu Zhang, and Luming Duan. An efficient quantum algorithm for generative machine learning. arXiv preprint arXiv:1711.02038, 2017.
-  Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, and J. Ignacio Cirac. Expressive power of tensor-network factorizations for probabilistic modeling. In Advances in Neural Information Processing Systems 32, pages 1496–1508, 2019.
-  Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. Unsupervised generative modeling using matrix product states. Phys. Rev. X, 8:031012, 2018.
-  Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
-  J. D. Hunter. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Matthew S Leifer and David Poulin. Quantum graphical models and belief propagation. Annals of Physics, 323(8):1899–1946, 2008.
-  Yoav Levine, David Yakira, Nadav Cohen, and Amnon Shashua. Deep learning and quantum entanglement: Fundamental connections with implications to network design. In International Conference on Learning Representations, 2018.
-  Sirui Lu, Márton Kanász-Nagy, Ivan Kukuljan, and J Ignacio Cirac. Tensor networks and efficient descriptions of classical data. arXiv preprint arXiv:2103.06872, 2021.
-  David JC MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
-  Jacob Miller, Guillaume Rabusseau, and John Terilla. Tensor networks for probabilistic sequence modeling. In International Conference on Artificial Intelligence and Statistics, 2021.
-  Thierry Mora and William Bialek. Are biological systems poised at criticality? Journal of Statistical Physics, 144(2):268–302, 2011.
-  Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pages 442–450, 2015.
-  Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. Exponential machines. In International Conference on Learning Representations, 2017.
-  Román Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117–158, 2014.
-  Roger Penrose. Applications of negative dimensional tensors. Combinatorial Mathematics and its Applications, 1:221–244, 1971.
-  Vasily Pestun, John Terilla, and Yiannis Vlassopoulos. Language as a matrix product state. arXiv preprint arXiv:1711.01416, 2017.
-  Vasily Pestun and Yiannis Vlassopoulos. Tensor network language model. arXiv preprint arXiv:1710.10248, 2017.
-  John Preskill. Quantum Computing in the NISQ era and beyond. Quantum, 2:79, 2018.
-  Elina Robeva and Anna Seigal. Duality of Graphical Models and Tensor Networks. Information and Inference: A Journal of the IMA, 8(2):273–288, 06 2018.
-  James Stokes and John Terilla. Probabilistic modeling with matrix product states. Entropy, 21(12), 2019.
-  E Miles Stoudenmire. Learning relevant features of data with multi-scale tensor networks. Quantum Science and Technology, 3(3):034003, 2018.
-  EM Stoudenmire and David J Schwab. Supervised learning with tensor networks. Advances in Neural Information Processing Systems, pages 4806–4814, 2016.
-  Frank Verstraete, Valentin Murg, and J Ignacio Cirac. Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems. Advances in Physics, 57(2):143–224, 2008.
-  Alex Wang and Kyunghyun Cho. Bert has a mouth, and it must speak: Bert as a markov random field language model. NAACL HLT 2019, page 30, 2019.
-  Chaohui Wang, Nikos Komodakis, and Nikos Paragios. Markov random field modeling, inference & learning in computer vision & image understanding: A survey. Computer Vision and Image Understanding, 117(11):1610–1627, 2013.
-  Ming-Jie Zhao and Herbert Jaeger. Norm-observable operator models. Neural Computation, 22(7):1927–1959, 2010.
-  Wojciech Hubert Zurek. Decoherence, einselection, and the quantum origins of the classical. Reviews of Modern Physics, 75(3):715, 2003.
Appendix A Decohered Born Machines Compute Unnormalized Probability Distributions
Here we show that every decohered Born machine (DBM) defines a valid (unnormalized) probability distribution, that is, that the tensor elements of a DBM are non-negative. This follows from the fact that the probability distribution represented by a DBM can be obtained by marginalizing the distribution represented by a larger Born machine (BM) model. By definition, a DBM is a BM in which a subset of the hidden edges of the underlying graph is decohered. Recall that decoherence here involves the contraction of third-order copy tensors, with the number of such tensors set by the number of decohered edges, and observe that such a contraction can be achieved by marginalizing over new latent variables introduced in the corresponding hidden edges. The tensor elements of the DBM then take the form of a sum, over the values of these latent variables, of the squared magnitudes of the elements of a tensor associated with a larger BM containing all copy tensors associated with decohered edges. These tensor elements are clearly non-negative, proving that DBMs always describe non-negative tensors.
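The marginalization argument can be checked numerically on a toy example. The following is a minimal sketch, assuming a hypothetical two-node underlying TN with cores `A` and `B` joined by a single decohered bond; the names, shapes, and random initialization are illustrative and not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_complex(*shape):
    """Random complex core (illustrative initialization only)."""
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

# Hypothetical two-node underlying TN: visible indices x, y and one hidden bond k.
d, D = 2, 3
A = random_complex(d, D)   # A[x, k]
B = random_complex(D, d)   # B[k, y]

# DBM with the bond decohered: the bond index is forced to agree on both
# layers of the composite TN before being summed over.
dbm = np.einsum("xk,xk,ky,ky->xy", A, A.conj(), B, B.conj())

# Larger BM: a third-order copy tensor on the bond exposes k as a new
# visible variable h, giving psi[x, h, y] = A[x, h] * B[h, y].
psi = np.einsum("xh,hy->xhy", A, B)
marginal = (np.abs(psi) ** 2).sum(axis=1)   # Born rule, then marginalize h

print(np.allclose(dbm, marginal), bool(np.all(dbm.real >= 0)))
```

The two arrays agree because the decoherence operator identifies the bond index across the two layers, which is exactly what summing the squared magnitudes over `h` does.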
Appendix B Expressivity of Decohered Born Machines
We prove several results that characterize the expressivity of decohered Born machines (DBMs), showing them to be capable of reproducing a range of classical and quantum-inspired probabilistic models. In Section B.1, we prove that fully-decohered Born machines are equivalent in expressivity to undirected graphical models (UGMs), with the equivalence preserving the number of parameters of the two model classes. This result, which parallels the tautological equivalence of non-decohered DBMs and standard BMs, is followed in Section B.3 by a result showing the equivalence of DBMs and locally purified states (LPS), an expressive model class introduced in [15].
We first review the definition of a DBM and some terminology for its graphical structure. Tensor networks (TNs) are defined in terms of a graph whose edges are allowed to be incident to either two nodes or one node, and we refer to the respective disjoint sets of edges as hidden edges and visible edges. Visible edges are associated with the modes of the tensor described by the TN, with the number of visible edges equal to the order of the tensor. Nodes which are not incident to any visible edges are referred to as hidden nodes of the TN. Every node is associated with a tensor core of the TN, with the order of the core being equal to the degree of the node within the graph.
Recall that every BM is completely determined by a TN description of a higher-order tensor, whose values are then converted into probabilities via the Born rule. We call the TN describing this tensor the underlying TN, and the Born rule implies that the probability distribution can be described as a single composite TN formed from two copies of the underlying TN, with all pairs of visible edges joined by copy tensors. We sometimes use the phrase composite edge to refer to any pair of “doubled up” edges in the composite TN, in which case the composite TN can be seen as occupying the same graph as the underlying TN, but where each hidden edge corresponds to a composite edge. The probability distribution defined by a DBM is given by replacing certain composite edges in the composite TN by decoherence operators, according to whether those edges belong to a set of decohered edges (Figure 6).
B.1 Proof of Theorem 1
We first restate Theorem 1, before providing a complete proof.
We show that the composite TN defining the probability distribution of a fully-decohered BM can be rewritten as a TN on the same graph as the underlying TN, with identical bond dimensions but where all cores take non-negative values. By virtue of the equivalence of non-negative TNs and UGMs [33, Theorem 2.1], this suffices to prove Theorem 1.
Fully-decohered BMs are defined as DBMs in which every hidden edge is decohered, so that every composite edge within the composite TN has been replaced by a decoherence operator. Since each decoherence operator is a fourth-order copy tensor, we can use the equality of different connected networks of copy tensors (Figure 1g) to express it as a contraction of two third-order copy tensors along a single edge. Decohered edges are hidden edges and are therefore incident to two distinct (pairs of) nodes in the composite TN. This allows us to move each third-order copy tensor onto a separate pair of nodes incident to the composite edge (Figure 7). We group together each pair of corresponding nodes in the two copies of the underlying TN, along with all copy tensors incident to it, and contract each of these groups into a single tensor.
It is clear that each contracted tensor consists of a core of the underlying TN and its complex conjugate, with all pairs of corresponding edges joined together by separate third-order copy tensors. Since this arrangement of copy tensors corresponds to the element-wise product of the core and its conjugate, the elements of the contracted tensor are the squared magnitudes of the corresponding elements of the core, indexed by the collection of indices associated with the edges incident to the node (these correspond to the variables of some clique in the dual graph). Since each contracted tensor has the same shape as the original core, has non-negative values, and is arranged in a TN with the same graph as the underlying TN, this proves Theorem 1.
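The argument above can be illustrated on a small chain. This is a sketch under stated assumptions: a hypothetical three-node chain with cores `A1`, `A2`, `A3` (names and shapes chosen for illustration), both hidden bonds decohered, and random complex entries.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D = 2, 3

# Hypothetical cores of a three-node chain (illustrative shapes).
A1 = rng.normal(size=(d, D)) + 1j * rng.normal(size=(d, D))         # A1[x, a]
A2 = rng.normal(size=(D, d, D)) + 1j * rng.normal(size=(D, d, D))   # A2[a, y, b]
A3 = rng.normal(size=(D, d)) + 1j * rng.normal(size=(D, d))         # A3[b, z]

# Composite TN with both bonds decohered: each bond index is shared by the
# two layers, so every core meets its conjugate element-wise.
composite = np.einsum(
    "xa,xa,ayb,ayb,bz,bz->xyz",
    A1, A1.conj(), A2, A2.conj(), A3, A3.conj(),
).real

# Non-negative TN on the same graph, with cores given by squared magnitudes.
T1, T2, T3 = (np.abs(core) ** 2 for core in (A1, A2, A3))
nonneg = np.einsum("xa,ayb,bz->xyz", T1, T2, T3)

# For contrast, the ordinary (non-decohered) BM on the same cores.
born = np.abs(np.einsum("xa,ayb,bz->xyz", A1, A2, A3)) ** 2

print(np.allclose(composite, nonneg), np.allclose(born, nonneg))
```

The first comparison holds for any choice of cores, while the second generally fails, since the non-decohered BM retains interference between different values of the bond indices.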
B.2 Proof of Corollary 1
The statement of Corollary 1 gives an explicit formula for constructing BM cores from clique potentials, using an arbitrary real-valued tensor. It can be immediately verified that the conversion from BM cores to effective clique potentials under full decoherence (Theorem 1) recovers the same clique potentials we started with, proving Corollary 1. Note that the values of the complex phases have no impact on the decohered cores.
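A round trip of this kind can be sketched as follows. The formula of Corollary 1 is not reproduced in this appendix, so the construction below (square root of the potential times a unit-modulus phase) is an assumption chosen to be consistent with the description above, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def core_from_potential(potential, theta):
    """Assumed construction: magnitude sqrt(potential), phase exp(i * theta)."""
    return np.sqrt(potential) * np.exp(1j * theta)

def decohered_core(core):
    """Effective clique potential under full decoherence (squared magnitude)."""
    return np.abs(core) ** 2

# A hypothetical non-negative clique potential and two unrelated phase tensors.
potential = rng.uniform(0.1, 2.0, size=(3, 2, 3))
theta_a = rng.uniform(0.0, 2.0 * np.pi, size=potential.shape)
theta_b = rng.uniform(0.0, 2.0 * np.pi, size=potential.shape)

recovered_a = decohered_core(core_from_potential(potential, theta_a))
recovered_b = decohered_core(core_from_potential(potential, theta_b))

print(np.allclose(recovered_a, potential), np.allclose(recovered_b, potential))
```

Both phase choices recover the same potential, which is the sense in which the phases have no effect on the decohered cores.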
B.3 Decohered Born Machines are Equivalent to Locally Purified States
Although the definition of locally purified states (LPS) in [15] assumes a one-dimensional line graph for the TN, we give here a natural generalization to LPS defined on more general graphs.
A locally purified state (LPS) consists of a tensor network over a graph containing $2n$ visible edges, where each of the $n$ cores contains exactly two visible edges, one of which is designated as a purification edge. The $n$-variable probability distribution defined by an LPS is given by constructing the composite TN for a BM of order $2n$ from these cores, then marginalizing over all purification edges.
An illustration of this model family is given in Figure 8. By choosing all purification edges to have dimension 1, LPS reproduce standard BMs, whereas [15, Lemma 3] gives a construction allowing LPS to reproduce probability distributions defined by general UGMs. Owing to this expressiveness, and to corresponding results for uniform variants of LPS, we can think of LPS as representing the most general family of quantum-inspired probabilistic models. We now prove that DBMs are equivalent in expressivity to LPS, by first showing that LPS can be expressed as DBMs (Theorem 3), and then showing that DBMs can be expressed as LPS (Theorem 4).
Theorem 3. Consider an LPS whose underlying TN uses a graph with $n$ nodes, $2n$ visible edges, and $m$ hidden edges. The probability distribution represented by this LPS can be reproduced by a DBM over a graph with $2n$ nodes, $n$ visible edges, and $m + n$ hidden edges, where the decohered edge set is in one-to-one correspondence with the purification edges of the LPS.
Starting with a given LPS, we construct a TN matching the description in the Theorem statement, whose interpretation as a DBM will recover the desired distribution. We begin with the underlying TN for the LPS, whose nodes each have one purification edge. We connect each purification edge to a new hidden node, whose associated tensor is the first-order copy tensor with dimension equal to that of the purification edge. This converts all of the purification edges into hidden edges, which form the decohered edges of the DBM.
Given this new TN and choice of decohered edges, the equivalence of the DBM distribution and the original LPS distribution arises from inserting decoherence operators in the composite edges connected to the new hidden nodes, and then using copy tensor rewriting rules to express the composite TN of the DBM as that of the LPS (Figure 9a). Given that the new hidden nodes are associated with constant tensors with no free parameters, and given that all of the cores defining the LPS are kept unchanged in the DBM, the overall parameter count is unchanged. This completes the proof of Theorem 3.
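The construction in this proof can be verified on a toy LPS. The sketch below assumes a hypothetical two-node LPS with cores `A[x, p, k]` and `B[k, y, q]`, where `p` and `q` are purification indices; the helper `copy_tensor` and all shapes are illustrative choices rather than part of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_complex(*shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

def copy_tensor(order, dim):
    """Copy tensor: 1 where all indices agree, 0 elsewhere."""
    tensor = np.zeros((dim,) * order)
    for i in range(dim):
        tensor[(i,) * order] = 1.0
    return tensor

d, D, P = 2, 2, 3
A = random_complex(d, P, D)   # A[x, p, k], p is the purification index
B = random_complex(D, d, P)   # B[k, y, q], q is the purification index

# LPS distribution: Born rule on all visible edges, then marginalize p and q.
psi = np.einsum("xpk,kyq->xpyq", A, B)
lps = (np.abs(psi) ** 2).sum(axis=(1, 3))

# DBM: attach a first-order copy tensor (all-ones vector) to each former
# purification edge and place a decoherence operator on that composite edge.
ones = copy_tensor(1, P)
decoherence = copy_tensor(4, P)
dbm = np.einsum(
    "xpk,xsl,kyq,lyt,psab,a,b,qtce,c,e->xy",
    A, A.conj(), B, B.conj(),
    decoherence, ones, ones, decoherence, ones, ones,
    optimize=True,
).real

print(np.allclose(dbm, lps))
```

The bond `k` is left coherent (the two layers carry independent indices `k` and `l`), so only the former purification edges are decohered, matching the one-to-one correspondence in the theorem statement.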
Theorem 4. Consider a DBM defined on a graph $G$ with $n$ nodes and a set of decohered edges $E_D$. Given any function $f$ assigning each decohered edge to a node of $G$ incident to that edge, we can construct an LPS with $n$ nodes which represents the same probability distribution as the DBM. This LPS is defined by a TN with an identical graphical structure to the TN underlying the original DBM, but with the addition of a purification edge at each node $v$ of dimension $\prod_{e \in f^{-1}(v)} D_e$, where $D_e$ is the bond dimension of edge $e$ and $f^{-1}(v)$ is the set of decohered edges mapped to node $v$.
Despite the somewhat complicated formulation of Theorem 4, the idea is simple. In contrast to standard BMs, DBMs and LPS both permit direct vertical edges within the composite TN defining the model’s probability distribution, and the proof consists of shifting these vertical edges from decohered edges to the nodes themselves. In the case where multiple vertical edges are moved to a single node, all of these can be merged into one single purification edge by taking the tensor product of the associated vector spaces. This gives the purification dimension appearing in the Theorem statement, with the overall procedure illustrated in Figure 9b. For nodes which are not assigned any decohered edges, a trivial purification edge of dimension 1 is used. This completes the proof of Theorem 4.
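As an illustration of this procedure, the sketch below converts a small DBM into an LPS. It assumes a hypothetical three-node chain whose two bonds are both decohered and both assigned to the middle node, so that the middle purification edge has dimension equal to the product of the two bond dimensions; names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def random_complex(*shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

d, Da, Db = 2, 2, 3
A1 = random_complex(d, Da)        # A1[x, a]
A2 = random_complex(Da, d, Db)    # A2[a, y, b]
A3 = random_complex(Db, d)        # A3[b, z]

# DBM distribution with both bonds a and b decohered.
dbm = np.einsum(
    "xa,xa,ayb,ayb,bz,bz->xyz",
    A1, A1.conj(), A2, A2.conj(), A3, A3.conj(),
).real

# LPS cores: the middle node receives copies of both bond indices, merged
# into one purification edge of dimension Da * Db (tensor product).
L2 = np.einsum("ayb,ap,bq->aybpq", A2, np.eye(Da), np.eye(Db))
L2 = L2.reshape(Da, d, Db, Da * Db)
# Nodes with no assigned decohered edge get a trivial purification edge.
L1 = A1[..., None]                # shape (d, Da, 1)
L3 = A3[..., None]                # shape (Db, d, 1)

# LPS distribution: contract hidden bonds, apply the Born rule, then
# marginalize over all three purification indices.
psi = np.einsum("xap,aybq,bzr->xyzpqr", L1, L2, L3)
lps = (np.abs(psi) ** 2).sum(axis=(3, 4, 5))

print(np.allclose(dbm, lps), L2.shape[-1] == Da * Db)
```

Assigning each bond to a different incident node would instead give two purification edges of dimensions `Da` and `Db`; the resulting distribution is the same.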
Appendix C Proof of Theorem 2
From the definition of a cut set, removing the cut set from the graph of the underlying TN partitions the graph into two disjoint pieces, and the same property holds for the composite TN giving the DBM probability distribution. Figure 10 illustrates how conditioning on a decohered edge of a DBM splits the associated decoherence operator into a tensor product of two rank-1 matrices, which propagate the conditioning value to both pairs of incident nodes. Consequently, each composite edge whose value is conditioned on is removed from the composite TN, and if the set of conditioning edges forms a cut set for the graph, the post-conditional composite TN separates into two disconnected pieces. This implies the independence, in the conditional distribution, of the composite random variables associated with the two pieces, which completes our proof.
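The conditional independence statement can be checked numerically. The following sketch assumes a hypothetical four-node chain in which only the middle bond `h` is decohered (so that it forms a cut set) while the outer bonds remain coherent; core names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_complex(*shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

d, D = 2, 3
A1 = random_complex(d, D)      # A1[x1, a]
A2 = random_complex(D, d, D)   # A2[a, x2, h]
A3 = random_complex(D, d, D)   # A3[h, x3, b]
A4 = random_complex(D, d)      # A4[b, x4]

# Joint distribution over the visible variables and the latent variable h of
# the decohered middle bond. Bonds a and b stay coherent, so the two layers
# carry independent indices (a, c) and (b, e).
joint = np.einsum(
    "xa,xc,ayh,cyh,hzb,hze,bw,ew->xyhzw",
    A1, A1.conj(), A2, A2.conj(), A3, A3.conj(), A4, A4.conj(),
).real
joint /= joint.sum()

def factorization_gap(joint, h):
    """Largest deviation of P(left, right | h) from P(left | h) P(right | h)."""
    cond = joint[:, :, h] / joint[:, :, h].sum()
    left = cond.sum(axis=(2, 3))    # P(x1, x2 | h)
    right = cond.sum(axis=(0, 1))   # P(x3, x4 | h)
    return np.abs(cond - np.einsum("xy,zw->xyzw", left, right)).max()

gaps = [factorization_gap(joint, h) for h in range(D)]
print(max(gaps))
```

The gap is zero up to floating-point error for every value of `h`, reflecting the separation of the conditioned composite TN into two disconnected pieces.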
Appendix D Experimental Details
The code used to generate Figure 5 is given in the included notebook, which will install necessary libraries, retrain the models, and regenerate the figure if run in sequence. Instructions are in a README file. The saved parameters are also included. We use NumPy (BSD-compatible licence) and JAX (Apache 2.0 licence) for scientific computation, and Matplotlib (GPL-compatible licence) for visualization. In this section, we discuss the details needed for independent implementation and verification.
Figure 5 answers the model selection problem: which model family yields better performance on held-out data? We implement a Hidden Markov Model (HMM) both as a UGM and as a DBM, and learn parameters to maximize marginal likelihood on the Bars and Stripes dataset.
D.1 Model Complexity: Ensuring Equal UGM and DBM Parameter Counts
To make a fair comparison between models, we must ensure that they have identical model complexity, measured here as total parameter count. On the same underlying graph, a DBM has twice as many parameters as a UGM, because each parameter in a DBM is complex-valued and has both a real and an imaginary component. To match parameter counts, we implement a mixture of two UGMs with independent parameters $\theta_1$ and $\theta_2$, where each $\theta_i$ consists of a collection of clique potential values. The probability of an observation $x$, $p_{\mathrm{UGM}}(x)$, is given by the convex combination

$$p_{\mathrm{UGM}}(x) = \lambda\, p(x; \theta_1) + (1 - \lambda)\, p(x; \theta_2),$$

where $p(x; \theta_i)$ denotes the probability assigned by a UGM with clique potential functions contained in $\theta_i$. Note that the UGM is now represented as a mixture distribution where the mixture weight $\lambda \in [0, 1]$ is found as a point estimate. To ensure the models are representing the same family of distributions, we also make the DBM a mixture. We do so by splitting the weights of a DBM into their magnitude and phase components, with the former specified by clique potentials in the parameter set $\theta$ and the latter by phase functions in a disjoint parameter set $\phi$. The magnitude components are used to compute an HMM as in the UGM case, yielding a component $p(x; \theta)$. The complex phase components are then used to assign complex phases to the magnitudes in $\theta$, leading to a Born machine probability distribution denoted by $p_{\mathrm{BM}}(x; \theta, \phi)$. The probability of an observation is given by

$$p_{\mathrm{DBM}}(x) = \lambda\, p(x; \theta) + (1 - \lambda)\, p_{\mathrm{BM}}(x; \theta, \phi),$$

where we again make a point estimate of the mixture weight $\lambda$. By sharing $\theta$ between the two components of the DBM, we match parameter counts between the two models, with both models containing the same total number of parameters.
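The convex combination used for both models can be sketched in a few lines. The function and variable names below are hypothetical, the component distributions are toy placeholders rather than HMM outputs, and mapping an unconstrained real number to the mixture weight through a sigmoid is one way to keep the weight inside the unit interval.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mixture_probability(p_first, p_second, weight_logit):
    """Convex combination of two component probabilities.

    weight_logit is an unconstrained real number; the sigmoid maps it to a
    mixture weight in (0, 1).
    """
    lam = sigmoid(weight_logit)
    return lam * p_first + (1.0 - lam) * p_second

# Toy component distributions over four outcomes (placeholders only).
p_first = np.array([0.1, 0.2, 0.3, 0.4])
p_second = np.array([0.25, 0.25, 0.25, 0.25])

mixed = mixture_probability(p_first, p_second, weight_logit=0.0)
print(mixed, mixed.sum())
```

Because the weight lies strictly between 0 and 1, the mixture of two normalized distributions is itself normalized, so no extra renormalization step is needed.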
D.2 Dataset, Training, and Hyperparameters
For the experiments, we generated a Bars and Stripes dataset of images, and varied the HMM hidden dimension between . Code for generating the dataset is included in the UGM and DBM training notebook, as well as a standalone notebook. To represent the images as 1D sequences, we use a horizontal raster scan from upper left to bottom right. Figure 5 reports results for with 2 and 32 having given similar results to 4 and 16. The observation sequence length was fixed to .
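For readers without access to the notebooks, a Bars and Stripes generator with row-major raster scanning might look like the sketch below. The image size `n = 4` is a placeholder rather than the size used in the experiments, the function name is hypothetical, and which orientation is called a bar versus a stripe is a matter of convention.

```python
import itertools
import numpy as np

def bars_and_stripes(n):
    """All n x n Bars and Stripes images, flattened in raster-scan order."""
    images = set()
    for bits in itertools.product([0, 1], repeat=n):
        pattern = np.array(bits)
        rows_constant = np.tile(pattern[:, None], (1, n))   # each row uniform
        cols_constant = np.tile(pattern[None, :], (n, 1))   # each column uniform
        # ravel() is row-major: left to right, then top to bottom.
        images.add(tuple(int(v) for v in rows_constant.ravel()))
        images.add(tuple(int(v) for v in cols_constant.ravel()))
    return np.array(sorted(images))

dataset = bars_and_stripes(4)   # placeholder size, not the paper's setting
print(dataset.shape)
```

The set removes the two images (all zeros and all ones) that are produced by both orientations, leaving `2 ** (n + 1) - 2` distinct sequences of length `n * n`.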
Training for each hidden dimension was replicated 15 times, with each replication using a different 70% train, 30% test split. The sequence of splits was the same across hidden dimensions, which we ensured by initializing the random seed to 0 for each experiment.
The parameters of each model are the hidden-to-hidden transition probabilities, the observation probabilities, the distribution over the initial hidden state, and the convex mixture weight. For the UGM, we have two independent sets of these HMM parameters. For the DBM, we have an additional set of complex phases. Both probabilities and phases are given an unconstrained parameterization in log space. The mixture weight was parameterized through the inverse of a sigmoid, i.e., as an unconstrained real number. Probabilities and mixture weights were initialized as independent standard normal random variables. The complex phase components were independently initialized as , where and .
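One way to realize an unconstrained log-space parameterization of the HMM probabilities is sketched below. It assumes, as an illustration rather than a description of the released code, that each probability table is obtained by a softmax over unconstrained logits drawn from a standard normal, and it evaluates the sequence likelihood with the forward algorithm; all names and sizes are hypothetical.

```python
import itertools
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def logsumexp(z, axis):
    m = z.max(axis=axis)
    return m + np.log(np.exp(z - np.expand_dims(m, axis)).sum(axis=axis))

def hmm_log_likelihood(obs, init_logits, trans_logits, emit_logits):
    """Forward algorithm in log space with softmax-normalized parameters."""
    log_init = log_softmax(init_logits)             # (H,)
    log_trans = log_softmax(trans_logits, axis=1)   # (H, H), rows normalized
    log_emit = log_softmax(emit_logits, axis=1)     # (H, O), rows normalized
    alpha = log_init + log_emit[:, obs[0]]
    for symbol in obs[1:]:
        # Sum over the previous hidden state, then emit the next symbol.
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_emit[:, symbol]
    return logsumexp(alpha, axis=0)

rng = np.random.default_rng(0)
H, O, T = 3, 2, 3   # placeholder hidden size, alphabet size, sequence length
init_logits = rng.normal(size=H)
trans_logits = rng.normal(size=(H, H))
emit_logits = rng.normal(size=(H, O))

# Sanity check: probabilities of all length-T sequences sum to one.
total_probability = sum(
    np.exp(hmm_log_likelihood(seq, init_logits, trans_logits, emit_logits))
    for seq in itertools.product(range(O), repeat=T)
)
print(total_probability)
```

Working with logits keeps every gradient step inside the valid parameter region without projection, which is convenient for generic gradient-based optimizers.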
We conducted full batch gradient descent on the training split for 30 epochs for each hidden dimension and each replication, using an adaptive gradient optimizer (Adam; Kingma and Ba, 2014). Experiments were run on a MacBookPro15,1 model with a single 2.6 GHz 6-Core Intel i7 and 16 GB of memory. Each curve in Figure 5 is the mean over 15 replications, and the shaded areas show 1 standard deviation.
Appendix E Gauge Freedom in Probabilistic Tensor Networks
Tensor networks matching the description given in Definition 1 exhibit a form of symmetry in their parameters commonly referred to as gauge freedom. This symmetry is generated by edge-dependent gauge transformations, wherein an invertible matrix and its inverse are inserted in a hidden bond of a TN, and then applied to the two tensor cores on nodes incident to that hidden edge. This results in a change in the parameters of the two incident core tensors, which nonetheless leaves the global tensor parameterized by the TN unchanged. The phenomenon of gauge freedom ultimately arises from the close connection between TNs and multilinear algebra, where gauge transformations on a given hidden edge correspond to changes in basis in the vector space associated with the hidden edge.
The use of copy tensors in a TN leads to a preferred choice of basis, and thereby breaks the full gauge freedom of any edge incident to a copy tensor node. It is therefore surprising that for non-negative TNs, i.e. those with all core tensors taking non-negative values, hidden edges were shown to be expressible as latent RVs without loss of generality, via the insertion of copy tensors in the hidden state space (Section 3.1). The generality of this operation means that any non-negative TN can be converted into a UGM by associating hidden edges with latent RVs, where the original distribution over only visible edges is recovered by marginalizing over hidden edges. This fact is a key ingredient in the exact duality between non-negative TNs and UGMs, and differs from quantum-style probabilistic TN models. For example, attempting to observe the latent states associated to hidden edges in a BM will generally lead to a change in the distribution over visible edges, even after marginalizing out these new latent RVs.
We observe here that the generality in associating hidden edges of a non-negative TN to latent RVs is a consequence of the fact that non-negative TNs already have significantly diminished gauge freedom. More precisely, in order for a gauge transformation on a hidden edge to maintain the non-negativity of both tensor cores incident to that edge, we must generally have the change of basis matrix $M$, as well as its inverse $M^{-1}$, possess only non-negative entries. This is a strong limitation, and is equivalent to the gauge transformation factorizing as a product $M = PD$, where $P$ is a permutation matrix and $D$ is a diagonal matrix with strictly positive entries. We illustrate in Figure 11 how this restricted gauge freedom maintains the overall structure of the copy tensor inserted into a hidden edge, with the result being an irrelevant permutation of the discrete values of the hidden latent variable associated with that edge.
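The restriction to permutation-times-diagonal gauge matrices can be illustrated numerically. The sketch below uses hypothetical non-negative cores `A` and `B` sharing one hidden bond; the variable names and the random choices are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
D = 4

# Gauge matrix of the restricted form: permutation times positive diagonal.
P = np.eye(D)[rng.permutation(D)]
scales = rng.uniform(0.5, 2.0, size=D)
M = P @ np.diag(scales)
M_inv = np.linalg.inv(M)

# Hypothetical non-negative cores sharing one hidden bond of dimension D.
A = rng.uniform(size=(2, D))   # A[x, k]
B = rng.uniform(size=(D, 2))   # B[k, y]
A_new, B_new = A @ M, M_inv @ B

# A generic matrix with strictly positive entries, for contrast: its inverse
# must contain negative entries.
generic = rng.uniform(0.1, 1.0, size=(D, D))
generic_inv = np.linalg.inv(generic)

print(M_inv.min() >= -1e-12, B_new.min() >= -1e-12, generic_inv.min() < 0)
```

The transformed cores stay non-negative and describe the same global tensor, whereas using the generic matrix as a gauge would typically introduce negative entries into the core multiplied by its inverse.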
The situation is quite different for BMs and DBMs, and we remark that the use of decoherence operators in a DBM means that the gauge freedom of such models differs from that of BMs. In particular, two BMs whose underlying TNs are related by gauge transformations will necessarily define identical distributions, whereas the corresponding DBMs resulting from decohering some gauge-transformed hidden edges may define different distributions. In this sense, the appropriate notion of gauge freedom for a DBM lies in between that of a BM and a UGM defined on the same graph, in a manner set by the pattern of decohered edges.
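This difference can be seen on a two-node example. The sketch below assumes hypothetical cores `A` and `B` joined by a single hidden bond, applies a random invertible gauge matrix to that bond, and compares the normalized BM and DBM distributions before and after; names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def random_complex(*shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

d, D = 3, 3
A = random_complex(d, D)   # A[x, k]
B = random_complex(D, d)   # B[k, y]

# Gauge transformation on the hidden bond: insert M and its inverse.
M = random_complex(D, D)
A_g, B_g = A @ M, np.linalg.inv(M) @ B

def born(A, B):
    """BM distribution: contract the bond, then apply the Born rule."""
    p = np.abs(A @ B) ** 2
    return p / p.sum()

def decohered(A, B):
    """DBM distribution with the bond decohered."""
    p = (np.abs(A) ** 2) @ (np.abs(B) ** 2)
    return p / p.sum()

bm_gap = np.abs(born(A, B) - born(A_g, B_g)).max()
dbm_gap = np.abs(decohered(A, B) - decohered(A_g, B_g)).max()
print(bm_gap, dbm_gap)
```

The BM gap vanishes up to rounding error, while the DBM gap is generically nonzero, since decoherence is carried out in a fixed basis that the gauge transformation does not preserve.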
The choice of basis in which decoherence is performed can be treated as an additional parameter of the model, and we view the interaction between this basis-dependence of decoherence and basis-fixing procedures related to TN canonical forms as an interesting subject for future investigation.