
# Tensor network language model

We propose a new statistical model suitable for machine learning of systems with long distance correlations, such as natural languages. The model is based on a directed acyclic graph decorated by multi-linear tensor maps at the vertices and vector spaces on the edges, called a tensor network. Such tensor networks have previously been employed for effective numerical computation of the renormalization group flow on the space of effective quantum field theories and lattice models of statistical mechanics. We provide an explicit algebro-geometric analysis of the parameter moduli space for tree graphs, and discuss model properties and applications such as statistical translation.


## 1. Introduction

It must be recognized that the notion "probability of a sentence" is an entirely useless one, under any known interpretation of this term.

Noam Chomsky, 1969

In natural language processing, unsupervised statistical learning of a language aims to construct an efficient approximation to the probability measure on the set of expressions in the language learned from a sampling data set.

Currently, neural network models have proved to be the most efficient. A particular success is attributed to models which construct distributed word representations [1, 2, 3], that is, a function $v$ from the set of words in a language to a real vector space whose dimension depends on the particular model.

Impressive results for constructing such a function $v$, also called a continuous vector representation, have been achieved in [4, 5] (see also [6] for an earlier construction of a neural network language model and [7] for continuous space word representations). The model in [5] constructs the vector representation $v(w)$ of a word $w$ by training a predictor of the words within a certain range of $w$ in the training sample of the language. Curiously, the function $v$ constructed in [4, 5] was found to satisfy interesting semantic and syntactic linear relations in the English language, such as

 v(apples)−v(apple)≃v(cars)−v(car)≃v(families)−v(family)

encoding the syntactic concept of grammatical number,

encoding the semantic concept of capital of a country,

 v(king)−v(queen)≃v(man)−v(woman)≃v(uncle)−v(aunt)

encoding the semantic concept of gender.

The difficulties in the statistical modeling of language are due to long range correlations. In [8] the mutual information $I(l)$ (defined in terms of relative entropy) between two characters in a text was measured as a function of the distance $l$ between the positions of the characters in two samples of English literature (“Moby Dick” by Melville and Grimm’s Tales), and it was found that over the measured range of distances $I(l)$ is well approximated by the power law

$$I(l) = c_1 l^{-\alpha} + c_2 \tag{1.1}$$

with critical exponent $\alpha$. Further measurements in [9] showed that the long distance correlations are due to structures beyond the sentence level. In [10] it was proposed to explain the long range correlations in the text by a hierarchy of levels in language, which resembles the hierarchical structure and renormalization group flow of physical theories. The analysis in [11] confirmed long range correlations in the sequence of integers constructed from the sequence of words in the text, where each word is mapped to a positive integer equal to the rank of this word in the sorted list of individual word frequencies. Criticality of language is not surprising given the abundance of critical phenomena in biology [12].
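As a minimal sketch of the kind of measurement described above (not the procedure of [8], whose details are not reproduced here), the mutual information between characters at distance $l$ can be estimated from empirical pair frequencies. The function name `mutual_information` and the toy strings are illustrative assumptions, and this plug-in estimator is biased for small samples.

```python
import math
from collections import Counter


def mutual_information(text: str, l: int) -> float:
    """Empirical mutual information (in nats) between characters at distance l."""
    pairs = [(text[i], text[i + l]) for i in range(len(text) - l)]
    if not pairs:
        raise ValueError("text is too short for this distance")
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # I = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ), i.e. the relative entropy
    # between the joint distribution and the product of its marginals.
    return sum(
        (c / n) * math.log(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )


periodic = "ab" * 50      # neighbouring characters determine each other
constant = "a" * 100      # a single repeated symbol carries no information
i_periodic = mutual_information(periodic, 1)
i_constant = mutual_information(constant, 1)
```

Evaluating such a function over a range of $l$ on a long text and fitting the result is what distinguishes a power law (1.1) from an exponential decay.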

Moreover, in [13] it was shown that a language described by a formal probabilistic regular grammar necessarily has a mutual information function that decays exponentially fast with the distance

$$I(l) \simeq c \exp(-m l) \tag{1.2}$$

where $m$ is the inverse correlation length, or the mass gap in physics terminology. A formal probabilistic regular grammar is almost equivalent to a probabilistic finite automaton, a hidden Markov model chain, or a matrix product state, also called tensor train decomposition in machine learning (see [14, 15] for more precise statements). These models can be thought of as tensor networks based on a graph with the topology of a 1-dimensional chain constructed from three-valent vertices whose output edges correspond to the observables. In the context of language a matrix product state model is considered in [16]. The absence of a mass gap (power law correlation function, criticality, conformal structure, scale invariance) observed in natural language (1.1) implies that 1-dimensional tensor chain models (e.g. hidden Markov models or similar) fail to reproduce correctly the basic statistics of natural language: these models are mass-gapped with exponential decay (1.2), while a natural language is at criticality with the power law decay (1.1). This observation of [13] explains why hidden Markov models and similar models do not describe natural languages very well.

On the other hand, [13] also showed that a language described by a formal probabilistic context free grammar (probabilistic regular tree grammar) has a mutual information function of power law type. A probabilistic regular tree grammar is modelled by a tensor network based on a graph with the topology of a tree in which the leaves correspond to observables, see also [17]. The hierarchical tree structure, for example in the case of a binary tree, means that from the state vectors of two words we construct the state vector of the two-word sentence they form. From the state vectors of two-word sentences we construct the state vector of the four-word sentence they form, and so on.

In physics this corresponds to the idea of iteratively coarse graining a system of many locally interacting variables, and to the resulting renormalization group flow on the space of theories developed by Kadanoff [18], Wilson [19, 20, 21], Fisher [22] and many others (see e.g. the review [23]). Further, in [24, 25] the density matrix renormalization algorithm was suggested, which turned out to be quite efficient for numerical solutions of quantum 1-D lattice systems. Algorithms implementing the Kadanoff-Wilson-White real space renormalization group on tensor tree networks have been developed in [26, 27]. See [28] for a recent survey of tensor network models.

However, for critical systems the performance of bare tree tensor network models was limited by the effects of remaining long range entanglement. To handle this entanglement, the dimensions of the Hilbert spaces in the layers have to grow substantially when moving up to the higher layers, in which nodes represent compound objects.

In [29] Vidal modified the bare tensor tree network by introducing disentangling operators between neighboring blocks before applying each step of the Kadanoff-Wilson-White [21, 25] density matrix renormalization group projection operation, see Fig. 1. Such a tensor tree interlaced with disentanglers is called the Multiscale Entanglement Renormalization Ansatz (MERA), and it has been successfully applied to the numerical study of many critical systems with impressive precision, see e.g. the review in [30]. Intuitively, tree tensor networks of fixed valence could be thought of as discrete models of the AdS/CFT correspondence (holography) [31, 32, 33] between the AdS side, which is a discrete gravity theory in discrete hyperbolic geometry represented by trees of that valence, and the CFT side, which is a critical (conformal) theory that lives on the boundary of the tree, see e.g. [34].

Motivated by

1. criticality of language [9, 8, 10, 11, 12]

2. emergence of effective vector structure on the space of linguistic syntactic and semantic concepts [2, 1, 35, 4, 5]

3. real space density matrix renormalization group or discretized holographic AdS/CFT correspondence realized by tensor trees [18, 19, 25, 26, 29]

in this paper we propose to use quantum MERA-like tensor networks for a statistical model of language or other data sets with observed critical phenomena and long range power law type correlation functions.

### 1.1. Previous work

Recurrent neural networks have been shown to have long range correlations [36, 13]. The connection between deep learning architectures and the renormalization group has been pointed out in several recent works [37, 38, 39]. Moreover, in [40, 41] it was shown that an arithmetic circuit deep convolutional network is a particular case of a tree tensor network. Linear matrix product states and the density matrix renormalization group [42] in the context of image recognition have been analyzed in [43]. On the other hand, deep neural networks have been used in [44] to compute correlation functions in the Ising model, and in [45] to learn the wave-function of a quantum system. In [46] it was suggested to represent topological states with long-range quantum entanglement by a neural network, and in [47] it was suggested how to accelerate Monte Carlo statistical simulations with a deep neural network. In [48] the equivalence of certain restricted Boltzmann machines and tensor network states was analyzed.

While the present manuscript was in preparation we noticed the works [49, 50, 17], which partially overlap with the constructions presented here.

### 1.2. Acknowledgements

We would like to thank Maxim Kontsevich and John Terilla for useful discussions. The research of V.P. on this project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (QUASIFT grant agreement 677368), and research of Y.V. received funding from Simons Foundation Award 385577.

## 2. Quantum statistical models

### 2.1. Quantum states

Let $\mathsf{W}$ be the set of symbols from which a language is constructed. For example, $\mathsf{W}$ could be a set of words, a set of syllables, a set of ASCII characters, the set $\{0, 1\}$ in the binary representation, a set of musical characters, a set of tokens in a computer programming language, a set of DNA pairs and so on. By $|\mathsf{W}|$ we denote the number of symbols in the set $\mathsf{W}$.

Let $\mathsf{W}^*$ be the set of sequences of symbols in $\mathsf{W}$, including the empty sequence

$$\mathsf{W}^* = \coprod_{n \in \mathbb{Z}_{\geq 0}} \mathsf{W}^{\times n} \tag{2.1}$$

where $\mathsf{W}^{\times n} = \mathsf{W} \times \dots \times \mathsf{W}$ ($n$ factors). An element of the set $\mathsf{W}^{\times n}$ is a sequence of length $n$ consisting of symbols in $\mathsf{W}$.

Let $W$ be the vector space over the field of complex numbers $\mathbb{C}$ generated by $\mathsf{W}$. Elements of $W$ are formal linear combinations of symbols in $\mathsf{W}$ with complex coefficients. Using Dirac bra-ket notation one can denote an element $\psi \in W$ as

$$\psi = \sum_{w \in \mathsf{W}} \psi^w |w\rangle \tag{2.2}$$

where $|w\rangle$ is the basis element of $W$ labelled by a symbol $w \in \mathsf{W}$ and $\psi^w$ is a complex number. (We use the standard convention of labelling coefficients of contravariant vectors by upper indices and coefficients of covariant vectors by lower indices.) An element of the vector space $W$ is called a state.

For example, if the set of symbols is a set of English words, a state could be equal to

$$\psi = \frac{1 + \sqrt{3}\,i}{4}\,|\text{mountain}\rangle + \frac{\sqrt{3}}{2}\,|\text{hill}\rangle \tag{2.3}$$

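We equip $W$ with a Hermitian metric, that is, a positive definite sesquilinear form, also called an inner product,

$$\langle\, , \rangle : \overline{W} \otimes W \to \mathbb{C} \tag{2.4}$$

where $\overline{W}$ denotes the vector space complex conjugate to $W$, in such a way that the symbols $|w\rangle$ form the standard orthonormal basis

$$\langle w | w' \rangle = \delta_{w w'}, \qquad w, w' \in \mathsf{W} \tag{2.5}$$

where

$$\delta_{w w'} = \begin{cases} 1, & w = w' \\ 0, & w \neq w' \end{cases}$$

is the Kronecker symbol. In other words, $W$ is a finite-dimensional Hilbert space with orthonormal basis $\{|w\rangle\}_{w \in \mathsf{W}}$. For every state $|\psi\rangle$ there is an adjoint state $\langle\psi|$ induced by the Hermitian metric.

For example, the squared norm of a state $\psi$ is

$$\langle\psi|\psi\rangle = \sum_{w \in \mathsf{W}} \bar{\psi}_w \psi^w \tag{2.6}$$

where $\bar{\psi}_w$ denotes the complex conjugate of the complex number $\psi^w$.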

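To a length $n$ sequence of symbols $s = (s_1, \dots, s_n) \in \mathsf{W}^{\times n}$ we associate a basis element $|s\rangle = |s_1\rangle \otimes \dots \otimes |s_n\rangle$ in the vector space

$$W^{\otimes n} = \underbrace{W \otimes W \otimes \dots \otimes W}_{n} \tag{2.7}$$

A generic state $\Psi \in W^{\otimes n}$ is a linear combination of basis elements with complex coefficients

$$\Psi = \sum_{s \in \mathsf{W}^{\times n}} \Psi^s |s\rangle \tag{2.8}$$

An operator $o_s$ defined as

$$o_s = |s\rangle\langle s| \tag{2.9}$$

is the projection operator onto the basis element $|s\rangle$. In particular,

$$\langle\Psi|o_s|\Psi\rangle = \langle\Psi|s\rangle\langle s|\Psi\rangle = |\langle s|\Psi\rangle|^2 = \bar{\Psi}_s \Psi^s \tag{2.10}$$

is the squared absolute value of the coefficient $\Psi^s$, thus it is a real non-negative number. A state $\Psi$ is called normalized if it has unit norm

$$\langle\Psi|\Psi\rangle = 1 \tag{2.11}$$

For a normalized state $\Psi$ it holds that

$$\sum_{s \in \mathsf{W}^{\times n}} \langle\Psi|o_s|\Psi\rangle = 1 \tag{2.12}$$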

### 2.2. Pure state statistical model

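A statistical model on the set $\mathsf{W}^{\times n}$ is a family of probability distributions $\mu_u$ on $\mathsf{W}^{\times n}$ fibered over the base space of parameters $U$. That is, for each parameter $u \in U$ there is a non-negative real valued function

$$\mu_u : \mathsf{W}^{\times n} \to \mathbb{R}_{\geq 0}, \qquad u \in U \tag{2.13}$$

such that

$$\sum_{s \in \mathsf{W}^{\times n}} \mu_u(s) = 1, \qquad u \in U \tag{2.14}$$

The base space of parameters $U$ can be thought of as the moduli space of distributions in a given statistical model.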

A quantum (pure state) statistical model on $W^{\otimes n}$ with the space of parameters $U$ is a family of normalized states $\Psi_u$ fibered over the base space $U$. That is, for each point $u$ in the space of parameters $U$ there is a normalized state $\Psi_u$:

$$\langle\Psi_u|\Psi_u\rangle = 1. \tag{2.15}$$

A quantum statistical model induces a classical statistical model by the Born rule

$$\mu_u(s) := \langle\Psi_u|o_s|\Psi_u\rangle \tag{2.16}$$

Indeed, $\mu_u(s)$ is a real non-negative number, and the normalization (2.14) follows from (2.15).
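To make the Born rule concrete, the following sketch evaluates it for the one-symbol example state (2.3). The variable names are illustrative assumptions; only the coefficients come from the text.

```python
import numpy as np

symbols = ["mountain", "hill"]
# Coefficients of the example state (2.3) in the basis |mountain>, |hill>.
psi = np.array([(1 + np.sqrt(3) * 1j) / 4, np.sqrt(3) / 2])

norm_sq = np.vdot(psi, psi).real        # <psi|psi>, cf. (2.6); vdot conjugates its first argument
mu = np.abs(psi) ** 2                   # Born rule (2.16) for sequences of length n = 1
probabilities = dict(zip(symbols, mu))
```

The state (2.3) is normalized, and under these conventions it assigns probability 1/4 to "mountain" and 3/4 to "hill"; the phase of the first coefficient does not affect the induced classical distribution.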

We remark that in quantum physics with a Hilbert space of states $\mathcal{H}$, a $U$-family of normalized states $\Psi_u$ is often called a wave-function ansatz. A typical problem posed in quantum physics is to find a ground state of a quantum system, that is, an eigenstate with the lowest eigenvalue of a positive definite Hermitian operator $H$ (the Hamiltonian) acting on the Hilbert space $\mathcal{H}$. Often the exact solution of this problem is not possible, and one resorts to approximate methods. A wave-function ansatz is such an approximate method. While the exact ground state problem is equivalent to the minimization problem

$$\min_{|\Psi\rangle \in \mathcal{H},\ \langle\Psi|\Psi\rangle = 1} \langle\Psi|H|\Psi\rangle \tag{2.17}$$

over the space of all states in $\mathcal{H}$ with unit norm, an approximate solution by an ansatz searches for the minimum over a usually much smaller subset of states

$$\min_{u \in U} \langle\Psi_u|H|\Psi_u\rangle \tag{2.18}$$

Similarly, using the Born rule induction (2.16) of a statistical model $\mu_u$ from a pure state model $\Psi_u$, we can think of a family of quantum states as a particular case of a classical statistical model.

### 2.3. Learning the model

Given a statistical model $\mu_u$ on a discrete set $S$, statistical learning of the model means finding an optimal value of the parameters $u$ in $U$, such that $\mu_u$ best approximates the distribution observed in a sample of data points in $S$. Formally, a sample of data is a multiset

$$\mathcal{S} = (S, m : S \to \mathbb{Z}_{\geq 0})$$

based on the set $S$, that is, a set of pairs $(s, m(s))$ where $s$ is an observed data point and the non-negative integer $m(s)$ is the multiplicity of the point $s$ in the sample.

The observed distribution associated to a sample $\mathcal{S}$ is the normalized frequency function defined by

$$\mathring{\mu}_{\mathcal{S}}(s) = \frac{m(s)}{|\mathcal{S}|} \tag{2.19}$$

where $|\mathcal{S}| = \sum_{s \in S} m(s)$ is the cardinality of the sample $\mathcal{S}$. We use the convention that a sum over $s \in \mathcal{S}$ yields each element $s$ with its multiplicity $m(s)$.

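A distance between two probability distributions $\mathring{\mu}$ and $\mu_u$ can be defined as the Kullback-Leibler (KL) divergence

$$D(\mathring{\mu} \,\|\, \mu_u) = \sum_{s \in S} \mathring{\mu}(s) \log \frac{\mathring{\mu}(s)}{\mu_u(s)} \tag{2.20}$$

This definition of the distance between two probability distributions satisfies certain natural axioms of information theory that also lead to the standard definition of the entropy of a distribution.

Hence, a standard method of learning a statistical model from a sample of data points is minimization of the KL divergence between the probability distribution $\mu_u$ and the observed distribution $\mathring{\mu}_{\mathcal{S}}$, so that the optimal value of the parameters is

$$u^* = \operatorname*{argmin}_{u \in U} D(\mathring{\mu}_{\mathcal{S}} \,\|\, \mu_u) \tag{2.21}$$

Minimization of the KL divergence between $\mathring{\mu}_{\mathcal{S}}$ and $\mu_u$ is equivalent to maximization of the log-likelihood, because the term $\sum_s \mathring{\mu}_{\mathcal{S}}(s) \log \mathring{\mu}_{\mathcal{S}}(s)$ in (2.20) does not depend on $u$; that is

$$u^* = \operatorname*{argmax}_{u \in U} \sum_{s \in \mathcal{S}} \log \mu_u(s) \tag{2.22}$$

using the multiset yield notation, in which each $s$ is counted with multiplicity $m(s)$.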
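The equivalence of (2.21) and (2.22) can be checked numerically on a toy sample. In the sketch below, the one-parameter family `model(u)` and the grid search are hypothetical choices made purely for illustration; they are not part of the model proposed in this paper.

```python
import math
from collections import Counter

sample = ["hill", "hill", "hill", "mountain"]          # toy multiset of observations
counts = Counter(sample)                                # multiplicities m(s)
total = sum(counts.values())                            # cardinality of the sample
observed = {s: m / total for s, m in counts.items()}    # observed distribution (2.19)


def model(u: float) -> dict:
    """Hypothetical family: mu_u(hill) = u, mu_u(mountain) = 1 - u."""
    return {"hill": u, "mountain": 1.0 - u}


def kl_divergence(p: dict, q: dict) -> float:
    """D(p || q) as in (2.20), summing over the support of p."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p if p[s] > 0)


def log_likelihood(data: list, q: dict) -> float:
    """Sum of log q(s) over the sample, counting multiplicities, cf. (2.22)."""
    return sum(math.log(q[s]) for s in data)


grid = [k / 100 for k in range(1, 100)]                 # candidate parameters u
u_kl = min(grid, key=lambda u: kl_divergence(observed, model(u)))
u_ll = max(grid, key=lambda u: log_likelihood(sample, model(u)))
```

Both searches select the same grid point, the observed frequency of "hill", as expected from the argument above.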

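In particular, for a quantum pure state statistical model $\Psi_u$ and data sample $\mathcal{S}$, the optimal value of $u$ is

$$u^* = \operatorname*{argmax}_{u \in U} \sum_{s \in \mathcal{S}} \left( \log \Psi_u(s) + \log \bar{\Psi}_u(s) \right) \tag{2.23}$$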

## 3. Isometric tensor network model

Now we consider a particular pure state statistical model on the Hilbert space $W^{\otimes n}$, called the isometric (or unitary) tensor network model.

### 3.1. Isometric maps

Let $V, W$ be any Hermitian vector spaces; then a map

$$u : V \to W \tag{3.1}$$

is called an isometric embedding (isometry for short) if the Hermitian metric on $V$ is equal to the pullback by $u$ of the Hermitian metric on $W$. In other words, for any $v, v' \in V$ with images $w = u v$ and $w' = u v'$ it holds that

$$\langle w | w' \rangle_W = \langle v | v' \rangle_V \tag{3.2}$$

Since the Hermitian metric is non-degenerate,

$$\operatorname{rank} u = \dim V \tag{3.3}$$

and an isometric embedding exists only if $\dim V \leq \dim W$.

An equivalent definition of an isometric embedding of Hermitian spaces is that

$$u^* u = 1_V \tag{3.4}$$

where $u^* : W \to V$ is the adjoint map and $1_V$ is the identity map on $V$. With respect to a basis on $V$ and a basis on $W$, the isometric property of a map $u$ means

$$g_{\bar{v} v'} = \bar{u}^{\bar{w}}_{\bar{v}}\, u^{w'}_{v'}\, g_{\bar{w} w'} \tag{3.5}$$

where $g_{\bar{v} v'}$ and $g_{\bar{w} w'}$ are the components of the Hermitian metric on $V$ and $W$.

The definition (3.4) also implies that the operator $u u^*$ is a projection, since

$$(u u^*)^2 = u u^* u u^* = u u^* \tag{3.6}$$

We can think of the morphism $u u^* : W \to W$ as a projection onto the image of $V$ in $W$.
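The properties (3.4) and (3.6) are easy to check numerically. The sketch below builds an isometry from the reduced QR decomposition of a random complex matrix; this construction and the helper name `random_isometry` are assumptions made for illustration, not a method prescribed by the text.

```python
import numpy as np


def random_isometry(dim_out: int, dim_in: int, rng) -> np.ndarray:
    """Return a (dim_out x dim_in) complex matrix with orthonormal columns."""
    a = rng.normal(size=(dim_out, dim_in)) + 1j * rng.normal(size=(dim_out, dim_in))
    q, _ = np.linalg.qr(a)   # reduced QR; needs dim_out >= dim_in
    return q


rng = np.random.default_rng(0)
v_dim, w_dim = 2, 5
u = random_isometry(w_dim, v_dim, rng)

gram = u.conj().T @ u        # u* u, expected to be the identity on V, cf. (3.4)
proj = u @ u.conj().T        # u u*, expected to be a projection on W, cf. (3.6)
rank_of_projection = np.trace(proj).real
```

Here `gram` is the 2 x 2 identity up to rounding, while `proj` is a 5 x 5 matrix that squares to itself and has trace equal to $\dim V$.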

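A unitary transformation can be defined as an isometry $u : V \to V$. The set of unitary transformations of $V$ forms a group called the unitary group $U(V)$. In this sense an isometry is a generalization of the notion of unitary transformation.

In general, the space of isometries $U_{V,W}$ from $V$ to $W$ is not a group, but a homogeneous space isomorphic to the quotient

$$U_{V,W} = \frac{U(w)}{U(w-v)}, \qquad \dim_{\mathbb{R}} U_{V,W} = 2wv - v^2 \tag{3.7}$$

where $v = \dim V$ and $w = \dim W$. For example, a normalized state $\psi \in W$ can be canonically identified with an isometry $\tilde{\psi} : \mathbb{C} \to W$ by taking the image of $1 \in \mathbb{C}$

$$\psi = \tilde{\psi}(1) \tag{3.8}$$

and, indeed, the space of isometric embeddings of $\mathbb{C}$ into $W$ is a unit sphere isomorphic to the sphere of normalized states

$$U_{\mathbb{C},W} = \frac{U(w)}{U(w-1)} = S^{2w-1} \tag{3.9}$$

where

$$S^{2w-1} = \{\psi \in W \mid \langle\psi|\psi\rangle = 1\} \tag{3.10}$$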

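Also we define the reduced space of isometries $\tilde{U}_{V,W}$ from $V$ to $W$, which is obtained from $U_{V,W}$ by reduction by the unitary group $U(V)$ of automorphisms of the Hermitian space $V$.

For example, if $V = \mathbb{C}$, then

$$\tilde{U}_{\mathbb{C},W} = \frac{U(w)}{U(1)\,U(w-1)} = \mathbb{CP}^{w-1} \tag{3.11}$$

which is a familiar statement from quantum physics that the space of normalized states in $W$ up to equivalence by phase rotation forms the complex projective space $\mathbb{CP}^{w-1}$. In the generic case

$$\tilde{U}_{V,W} = \frac{U(w)}{U(v)\,U(w-v)} = \mathrm{Gr}_{v,w}, \qquad \dim_{\mathbb{C}} \tilde{U}_{V,W} = v(w-v) \tag{3.12}$$

where $\mathrm{Gr}_{v,w}$ denotes the Grassmannian of complex $v$-planes in a $w$-dimensional complex vector space. The dimension $v(w-v) = vw - v^2$ agrees with the single-vertex case of (4.9).

The reduced space $\tilde{U}_{V,W}$ is an example of the moduli spaces of isometric tree tensor networks which we discuss in more generality in section 4.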

### 3.2. Multi-linear isometric maps and tensor vertices

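For vector spaces $V_1, \dots, V_q$ and vector spaces $W_1, \dots, W_p$, a linear map of type $(p, q)$ is a morphism

$$u : V^{\otimes q} \to W^{\otimes p} \tag{3.13}$$

where $V^{\otimes q} = V_1 \otimes \dots \otimes V_q$ and $W^{\otimes p} = W_1 \otimes \dots \otimes W_p$.

Choose bases of the vector spaces $V_i$ and $W_j$ for $i = 1, \dots, q$ and $j = 1, \dots, p$. Then the tensor of type $(p, q)$ is the table of components of $u$, represented as a $(p+q)$-dimensional array with $p$ upper indices and $q$ lower indices

$$(u^w_v) = \left( u^{w_1 \dots w_p}_{v_1 \dots v_q} \;\middle|\; w \in W_1 \times \dots \times W_p,\ v \in V_1 \times \dots \times V_q \right) \tag{3.14}$$

where $w$ and $v$ run over the labels of the basis elements. A Hermitian metric on each $V_i$ and each $W_j$ induces a Hermitian metric on $V^{\otimes q}$ and $W^{\otimes p}$, and then a tensor is called an isometry if the map (3.13) is an isometry from $V^{\otimes q}$ to $W^{\otimes p}$ with respect to the induced Hermitian metric. In an orthonormal basis this means

$$\sum_{w \in W_1 \times \dots \times W_p} \bar{u}^{\tilde{v}}_{w}\, u^{w}_{v} = \delta^{\tilde{v}}_{v}, \qquad v, \tilde{v} \in V_1 \times \dots \times V_q \tag{3.15}$$

Graphically we display the tensor vertex (3.13) as in Figure 2.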

### 3.3. Directed multigraphs

Let $\gamma$ be a directed multigraph in which edges have their own identity (sometimes called a quiver). More formally, a directed multigraph is $\gamma = (\mathrm{Vert}, \mathrm{Edge}, s, t)$ where $\mathrm{Vert}$ is a set of vertices, $\mathrm{Edge}$ is a set of edges, $s$ is a source map which associates to each edge its source vertex, and $t$ is a target map which associates to each edge its target vertex.

A tensor network (without inputs and outputs) is a decoration of a quiver $\gamma$ by a vector space $V_e$ at each edge $e$, and a linear map $u^{(i)}$ at each vertex $i$.

Notice that, unlike the theory of quiver representations, in which vertices are decorated by vector spaces and edges by maps between vector spaces, in a tensor network the edges of a quiver are decorated by vector spaces, and the vertices are decorated by multi-linear maps from the tensor product of the vector spaces of incoming edges to the tensor product of the vector spaces of outgoing edges. A vertex decorated by a linear map is called a tensor vertex.

To define an open tensor network, or tensor network with a boundary, we add a set of external incoming edges and a set of external outgoing edges. Formally, a quiver with a boundary is $\gamma = (\mathrm{Vert}, \mathrm{Edge}, \mathrm{In}, \mathrm{Out}, s, t)$ where $\mathrm{Vert}$ is a set of vertices, $\mathrm{Edge}$ is a set of internal edges, $\mathrm{In}$ is a set of incoming edges, $\mathrm{Out}$ is a set of outgoing edges, $s$ is a source map which associates to each internal or outgoing edge its source vertex, and $t$ is a target map which associates to each internal or incoming edge its target vertex.

An open tensor network is a quiver $\gamma$, possibly with a boundary, in which each edge $e$ is decorated by a vector space $V_e$, and each vertex $i$ is decorated by a multi-linear map

$$u^{(i)} : \bigotimes_{e \in t^{-1}(i)} V_e \to \bigotimes_{e \in s^{-1}(i)} V_e \tag{3.16}$$

An open tensor network defines a multi-linear map $u_\gamma$, called evaluation, from the tensor product of the vector spaces on the edges $\mathrm{In}$ to the tensor product of the vector spaces on the edges $\mathrm{Out}$ by composition of maps

$$u_\gamma : \bigotimes_{e \in \mathrm{In}} V_e \to \bigotimes_{e \in \mathrm{Out}} V_e \tag{3.17}$$

If the boundary is empty, i.e. the tensor network is closed, then $u_\gamma$ is a number.

In components, the evaluation map is obtained by summing over all pairs of upper and lower indices in the product of tensor vertices

 (3.18)

We remark that a tensor network could be recognized as a Feynman diagram of the directed graph $\gamma$ for a 0-dimensional field theory with a pair of fields $\phi_e$, $\tilde{\phi}_e$ for each edge $e$, with a kinetic term pairing them and interaction tensor vertices (3.16)

$$\Big\langle \prod_{e \in t^{-1}(i)} \tilde{\phi}_e,\ u^{(i)} \prod_{e \in s^{-1}(i)} \phi_e \Big\rangle \tag{3.19}$$

Also, the graphical representation of the contraction of tensor indices corresponding to the composition of multi-linear maps in vertices is known as Penrose graphical notation.

### 3.4. Directed acyclic graphs and props

If we assume that the directed multigraph $\gamma$ is acyclic (see Figure 3), i.e. does not contain any directed cycles, then the mathematical structure that associates the evaluation map $u_\gamma$ to a tensor network is called a colored prop [51, 52, 53, 54, 55, 56, 57, 58, 59]. A prop is a generalization of the notion of operad. While an operad takes several inputs and returns a single output, a prop takes an element of the tensor product of several inputs and returns an element of the tensor product of several outputs. A tensor network for an acyclic directed graph is an object in the endomorphism prop of a set of objects in some symmetric monoidal category. In the above definition (3.16), (3.18) this category is the category of finite-dimensional vector spaces over the complex numbers with linear morphisms.

We remark that if the directed multigraph $\gamma$ contains directed cycles, the corresponding tensor network involves the trace operation. Formally this structure is encoded in the notion of a colored wheeled prop. In this situation the underlying category could be any symmetric monoidal compact closed category, so that it has a trace operation. Directed cycles in $\gamma$ correspond to the trace map and are called ‘wheels’ in the context of ‘wheeled props’. The category of finite-dimensional vector spaces with linear morphisms has a trace map, and thus it is suitable for building a tensor network (3.16), (3.18) on an arbitrary directed graph.

In the context of this paper we are interested in a particular case of tensor networks called isometric tensor networks.

An isometric tensor network is a tensor network built on a directed acyclic multigraph in which edges are decorated by Hermitian vector spaces and all multi-linear maps in vertices are isometric embeddings. Hermitian vector spaces with isometric linear maps form a category, since the composition of two isometries is again an isometry, and this category is symmetric monoidal with the standard tensor product of vector spaces. Abstractly, an isometric tensor network can be thought of as an object in the endomorphism prop of a set of objects from the category of Hermitian vector spaces with isometric morphisms. Concretely, this means that if $\gamma$ is a directed acyclic multigraph and all tensor vertices are isometries, the evaluation map (3.18) is also an isometry.

### 3.5. Isometric pure state tensor network model

Given a Hilbert space $W^{\otimes n}$, whose basis is the set of length $n$ sequences of symbols in $\mathsf{W}$, a pure state isometric tensor network model for a state

$$\Psi \in W_1 \otimes W_2 \otimes \dots \otimes W_n$$

is an isometric tensor network $\gamma$ with a single input vector space $\mathbb{C}$ and output vector space $W^{\otimes n}$ built on $n$ output edges.

The state $\Psi$ equals the tensor network morphism $u_\gamma$ evaluated on $1 \in \mathbb{C}$

$$\Psi = u_\gamma 1 \tag{3.20}$$

Notice that in general a directed acyclic graph underlying a tensor network is not a tree, i.e. there could be multiple directed paths from one node to another (and ‘acyclic’ here means that directed cycles are not allowed). In particular, a MERA-like graph [60, 30] (see Figure 1) is directed acyclic but not a tree.

However, in the particular case of an isometric tree tensor network $\gamma$, see Figure 5, it is computationally easier to evaluate the amplitude of a sequence $s$

$$\langle s | u_\gamma 1 \rangle = \overline{\langle 1 | u_\gamma^* s \rangle} \tag{3.21}$$

since pulling back the state $|s\rangle$ from the lower layer to the top vertex always keeps it in the form of a tensor product of states on the intermediate edges. However, because of the observed criticality of language, the dimension of the vector spaces at the top layers needs to grow for a sensible model.
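The following sketch illustrates (3.20) and (3.21) on the smallest nontrivial binary tree, with four leaves. The dimensions, the random isometries and the helper names are assumptions chosen for illustration; the point is only that the amplitude of one sequence can be obtained without forming the full state.

```python
import numpy as np


def random_isometry(dim_out: int, dim_in: int, rng) -> np.ndarray:
    """Return a (dim_out x dim_in) complex matrix with orthonormal columns."""
    a = rng.normal(size=(dim_out, dim_in)) + 1j * rng.normal(size=(dim_out, dim_in))
    q, _ = np.linalg.qr(a)   # reduced QR; needs dim_out >= dim_in
    return q


rng = np.random.default_rng(0)
d_w, d_v = 3, 2   # dimension of the leaf space W and of the internal space V

# Top vertex: isometry C -> V (x) V, i.e. a normalized state on two internal edges.
root = random_isometry(d_v * d_v, 1, rng).reshape(d_v, d_v)
# Two lower vertices: isometries V -> W (x) W, stored with index order (w, w', v).
u1 = random_isometry(d_w * d_w, d_v, rng).reshape(d_w, d_w, d_v)
u2 = random_isometry(d_w * d_w, d_v, rng).reshape(d_w, d_w, d_v)

# Full state Psi = u_gamma 1 on four leaves; its size grows exponentially with n.
psi = np.einsum("abi,cdj,ij->abcd", u1, u2, root)
total_probability = np.sum(np.abs(psi) ** 2)


def amplitude(s):
    """<s|Psi> obtained by pulling the basis state |s> back through the leaves."""
    s1, s2, s3, s4 = s
    left = u1[s1, s2, :]     # pullback of |s1 s2> to the left internal edge
    right = u2[s3, s4, :]    # pullback of |s3 s4> to the right internal edge
    return left @ root @ right


s = (0, 2, 1, 1)
```

Because every vertex is an isometry, `total_probability` equals 1 up to rounding, and `amplitude(s)` agrees with `psi[s]` while touching only vectors on the internal edges.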

A MERA-like graph with disentangling intermediate vertices is not a tree, and therefore algorithms for amplitude evaluation are computationally more costly. However, it is plausible that the dimensions of the vector spaces on the edges at the higher layers of a MERA graph do not have to grow as fast [28], in which case a MERA-like tensor model would turn out to be computationally more effective.

### 3.6. Slicing and layers

In any case, a directed acyclic graph can always be sliced into layers (see Figures 4 and 5) such that each layer contains vertices which are evaluated in parallel by tensor product. Grouping the vertices by layer, we write $u_{[l,l+1]}$ for the morphism of the layer that maps level $l+1$ to level $l$, and then the evaluation map of an isometric tensor network is a sequential composition of maps starting at the source (or higher) layer ‘L’ with input $\mathbb{C}$ and finishing at the target (or lower) layer with output $W^{\otimes n}$:

$$u_{\gamma,[1L]} = u_{[12]} \circ \dots \circ u_{[L-1,L]} \tag{3.22}$$

The adjoint map is evaluated in the reverse order

$$u^*_{\gamma,[L1]} = u^*_{[L,L-1]} \dots u^*_{[21]} \tag{3.23}$$

where now the morphisms $u^*_{[l+1,l]}$ are not isometries in general, but projections if we identify the source of $u_{[l,l+1]}$ with its image in the target space.

In particular, the projection operators do not preserve the inner product. Consequently, by analogy with the renormalization group flow, we expect that orthogonal commuting operators of projection onto basis states at the bottom layer (the UV scale in renormalization group/holography terminology), such as the projection operators

$$o_{[1]} = o_{|\text{large}\rangle \otimes |\text{hill}\rangle}, \qquad o'_{[1]} = o_{|\text{small}\rangle \otimes |\text{mountain}\rangle}$$

are projected by the linear map $u^*_{[l1]}$ to similar operators acting at the middle layer $l$

$$o_{[l]} = u^*_{[l1]}\, o_{[1]}\, u_{[1l]}, \qquad o'_{[l]} = u^*_{[l1]}\, o'_{[1]}\, u_{[1l]} \tag{3.24}$$

In the opposite direction, from the intermediate higher layer $l$ to the bottom layer $1$, under the map $u_{[1l]}$

$$o''_{[1]} = u_{[1l]}\, o_{[l]}\, u^*_{[l1]} \simeq u_{[1l]}\, o'_{[l]}\, u^*_{[l1]}, \tag{3.25}$$

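and in the context of the example we expect to see an operator of the form

$$o''_{[1]} = o_{c_1 |\text{large}\rangle \otimes |\text{hill}\rangle + c_2 |\text{small}\rangle \otimes |\text{mountain}\rangle} \tag{3.26}$$

where the coefficients $c_1, c_2$ encode the context-free probabilities of expressing a similar concept with different words.

In other words, the context-free choice between the UV layer expressions $|\text{large}\rangle \otimes |\text{hill}\rangle$ and $|\text{small}\rangle \otimes |\text{mountain}\rangle$ is irrelevant at a higher level, which operates within a Hilbert space of higher level concepts.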

Notice that the renormalization group flow preserves expectation values of relevant operators; in other words, if we define a state at the intermediate level $l$ as

$$|\psi\rangle_l = u_{\gamma,[lL]} |1\rangle \tag{3.27}$$

where

$$u_{\gamma,[lL]} = u_{[l,l+1]} \circ \dots \circ u_{[L-1,L]} \tag{3.28}$$

then

$$\langle\psi| o_l |\psi\rangle_l = \langle\psi| o_1 |\psi\rangle_1 \tag{3.29}$$

where $o_l$ is the image under the renormalization group flow of the operator $o_1$ acting at the base layer ‘1’.

Therefore, the expectation value of is approximately equal to the expectation value of or .
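The identity (3.29) can be illustrated with a single layer. In the sketch below a random isometry stands in for the layer map, and a basis projector at the bottom layer is pulled back as in (3.24); the dimensions and the chosen basis index are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_w, d_v = 3, 2

a = rng.normal(size=(d_w * d_w, d_v)) + 1j * rng.normal(size=(d_w * d_w, d_v))
u, _ = np.linalg.qr(a)               # isometry V -> W (x) W, so u* u = 1_V

psi_l = rng.normal(size=d_v) + 1j * rng.normal(size=d_v)
psi_l /= np.linalg.norm(psi_l)       # normalized state at the upper layer
psi_1 = u @ psi_l                    # the same state pushed down to the bottom layer

s = 4                                # flat index of some basis pair |w w'>
o_1 = np.zeros((d_w * d_w, d_w * d_w), dtype=complex)
o_1[s, s] = 1.0                      # projector |s><s| at the bottom layer
o_l = u.conj().T @ o_1 @ u           # operator pulled back to the upper layer

expect_l = np.vdot(psi_l, o_l @ psi_l).real
expect_1 = np.vdot(psi_1, o_1 @ psi_1).real
```

The two expectation values coincide, while `o_l` itself is in general no longer a projector: its trace is at most 1 and typically strictly smaller.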

## 4. Geometry of the moduli space

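Let $(\gamma, U_\gamma)$ be an isometric tensor network built on a directed acyclic graph $\gamma$. We define the automorphism group (gauge group)

$$\mathrm{Aut}(\gamma, U_\gamma) = \prod_{e \in \mathrm{Edge} \sqcup \mathrm{In}} U(V_e) \tag{4.1}$$

to be the group of unitary transformations which act on all incoming and internal edges preserving their Hermitian metric, so that a tensor vertex transforms as

$$u^{(i)} \mapsto \Big( \prod_{e \in s^{-1}(i)} u_e \Big)\, u^{(i)}\, \Big( \prod_{e \in t^{-1}(i)} u_e^{-1} \Big) \tag{4.2}$$

under the action of an automorphism group element $(u_e)$.

The moduli space of an isometric tensor network is defined as the quotient

$$U_\gamma = \bigoplus_{i \in \mathrm{Vert}} \mathrm{Isom}\Big( \bigotimes_{e \in t^{-1}(i)} V_e,\ \bigotimes_{e \in s^{-1}(i)} V_e \Big) \Big/ \prod_{e \in \mathrm{Edge} \sqcup \mathrm{In}} U(V_e) \tag{4.3}$$

where $\mathrm{Isom}(V, W)$ denotes the space of isometric maps from a Hermitian space $V$ to a Hermitian space $W$.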

### 4.1. Tree flag variety

If $\gamma$ is a directed tree graph, the moduli space $U_\gamma$ has a particularly simple algebro-geometric description. First notice that the constraint that a map $u : V \to W$ is an isometry between Hermitian spaces $V$ and $W$

$$u^* u = 1_V \tag{4.4}$$

is a level set of the symplectic moment map for the $U(V)$ group action on the Kähler space $\mathrm{Hom}(V, W)$ of all maps $V \to W$. By geometric invariant theory, the quotient of the level set of the constraint by the action of the compact group $U(V)$ is isomorphic to the quotient of the stable locus of the whole set by the complexified group $GL(V)$

$$U_{u:V \to W} = \{u \in \mathrm{Hom}(V, W) \mid u^* u = 1_V\} / U(V) \simeq \mathrm{Hom}(V, W)^{\mathrm{stab}} / GL(V) \tag{4.5}$$

In the present case the stable locus is the space of injective maps, and we obtain

$$U_{u:V \to W} \simeq \mathrm{Gr}_v(W) \tag{4.6}$$

where $\mathrm{Gr}_v(W)$ denotes the Grassmannian of $v$-dimensional planes in the vector space $W$, with $v = \dim V$, cf. (3.12).

Now, if $\gamma$ is a directed tree with a single input, then each vertex has a single incoming edge. Therefore, for a tree tensor network, the set of isometric constraints (4.4) on the maps in each vertex is a level set of the symplectic moment map of the action of the full automorphism group (4.1). Consequently,

 (4.7)

In the simplest example, when $\gamma$ is a chain quiver in which each vertex has a single input and a single output, the moduli space $U_\gamma$ is explicitly a generalized flag variety, i.e. the moduli space of flags

$$V_{\mathrm{in}} \simeq V_L \subset V_{L-1} \subset \dots \subset V_2 \subset V_1 \simeq V_{\mathrm{out}} \tag{4.8}$$

which can be thought of as an iterated tower of Grassmannian fibrations.

If $\gamma$ is a generic directed tree, the moduli space $U_\gamma$ can be thought of as a generalization of the flag variety of a linear quiver, and could be called a tree flag variety.

For the illustration, consider a tree isometric tensor network displayed on Figure 6

In this case the moduli space is a fibration over the base with the fibers where , denote tautological vector bundles over and .

For a general tree isometric tensor network $\gamma$, the moduli space $U_\gamma$ is a projective algebraic manifold which has an explicit description as a tower of fibrations, where the fiber at each level is a product of Grassmannians, and the structure of the fibration of products of Grassmannians at a given level uses the external tensor product of the tautological vector bundles of the Grassmannians at the adjacent level, according to the combinatorics of the tree vertices.

Hence, we deduce that for an isometric tree tensor network the parameter moduli space $U_\gamma$ is a smooth algebraic projective Kähler variety of complex dimension

$$\dim_{\mathbb{C}} U_\gamma = \sum_{i \in \mathrm{Vert}} \Big( v_{t^{-1}(i)} \prod_{e \in s^{-1}(i)} v_e - v_{t^{-1}(i)}^2 \Big) \tag{4.9}$$

where $v_e = \dim V_e$ and $v_{t^{-1}(i)}$ is the dimension of the space on the single incoming edge of the vertex $i$.
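As a worked instance of (4.9), the sketch below sums the Grassmannian dimensions vertex by vertex. The encoding of a tree as a list of `(input_dim, [output_dims])` pairs and the function name are hypothetical conveniences, not notation from the text.

```python
from math import prod


def tree_moduli_dimension(vertices):
    """Complex dimension (4.9); each vertex is given as (input_dim, [output_dims])."""
    total = 0
    for v_in, v_outs in vertices:
        w = prod(v_outs)                  # dimension of the tensor product of the outputs
        if v_in > w:
            raise ValueError("no isometry exists if the input dimension exceeds the output")
        total += v_in * w - v_in ** 2     # dimension of the Grassmannian of v_in-planes in C^w
    return total


# Single vertex C -> W with dim W = 5: the projective space CP^4.
dim_projective = tree_moduli_dimension([(1, [5])])
# Binary tree: root C -> V (x) V, then two vertices V -> W (x) W, with dim V = 2, dim W = 3.
dim_tree = tree_moduli_dimension([(1, [2, 2]), (2, [3, 3]), (2, [3, 3])])
```

For these two inputs the formula gives 4 and 3 + 14 + 14 = 31 complex parameters respectively.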

One can expect that the explicit algebro-geometric structure of the moduli space might be useful for optimization algorithms that minimize the target function (2.23).

## 5. Learning the network and sampling

### 5.1. Learning

For an isometric tensor network , where is the Hilbert space based on length sequences, and a training sample of strings , the effective free energy function that needs to be minimized over the moduli space of parameters , is equivalent to the KL divergence between observed probability distribution and the model probability distribution (2.23) and is given by

 F(u|S)=−∑s∈S2Relog⟨s|uγ|1⟩ (5.1)

where is the standard Hermitian metric on , and is a basis element in labelled by a sequence . The summation over takes each element from the training multiset with its multiplicity.

Since the objective function $F(u|S)$ is additive over the training sample $S$, one effective approximate algorithm to minimize $F(u|S)$ for a large sample $S$ is term-wise local gradient descent (sometimes called stochastic gradient descent). This algorithm constructs a flow on the moduli space $U_\gamma$ in which each step follows, for a limited time $\eta$ (called the learning rate), the gradient flow associated with a single term $F(u|s)$ of the objective function, so that the evolution of $u$ for a step is

$$\partial_t u = -\nabla F(u|s), \qquad t \in [0, \eta] \tag{5.2}$$

where $F(u|s)$ is

$$F(u|s) = -2\,\mathrm{Re} \log \langle s | u_\gamma | 1 \rangle \tag{5.3}$$
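The term-wise update (5.2) with the local term (5.3) can be sketched in the simplest possible setting. The sketch below assumes a network with a single vertex and real amplitudes, so that $u_\gamma|1\rangle$ is just a unit vector `psi` and the moduli space reduces to a sphere. The flow is discretized by one Euler step followed by renormalization, which is a choice made here and not something prescribed by the text. The names `sgd_step` and `train` and the toy sample are hypothetical.

```python
import numpy as np


def sgd_step(psi, s, eta):
    """One Euler step of the flow (5.2) for the single term F(u|s) = -2 log psi[s].

    Assumes real, strictly positive amplitudes and a single-vertex network,
    so the parameters are just a unit vector psi.
    """
    grad = np.zeros_like(psi)
    grad[s] = -2.0 / psi[s]            # Euclidean gradient of -2 log psi[s]
    grad -= np.dot(grad, psi) * psi    # project onto the tangent space of the unit sphere
    psi = psi - eta * grad             # follow the negative gradient for time eta
    return psi / np.linalg.norm(psi)   # renormalize to stay on the sphere


def train(sample, dim, eta=0.001, epochs=1000, seed=0):
    """Term-wise (stochastic) gradient descent over a training multiset of indices."""
    rng = np.random.default_rng(seed)
    psi = np.full(dim, 1.0 / np.sqrt(dim))   # uniform initial state
    for _ in range(epochs):
        for s in rng.permutation(sample):    # visit each element with its multiplicity
            psi = sgd_step(psi, s, eta)
    return psi


# Toy training multiset over 4 basis sequences (hypothetical data).
sample = [0, 0, 0, 0, 1, 1, 2]
psi = train(sample, dim=4)
probs = psi ** 2   # model probabilities; they should approach the empirical frequencies
print(np.round(probs, 2))
```

In this toy setting the averaged flow has a stationary point where `psi[s] ** 2` equals the empirical frequency of `s`, which is the minimum of the KL divergence. For a genuine tree network the same step would be applied to the isometries at every vertex, with a retraction appropriate to the corresponding Grassmannian.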

Other learning methods, based on the recursive singular value decomposition of effective density matrices developed for tensor networks applied to many-body quantum systems [30, 60, 29, 61], might also turn out to be effective.

### 5.2. Sampling

To sample from the probability distribution of a learned isometric tensor network, a standard recursive procedure can be used.

Namely, a sequence $s = (s_1, \dots, s_n)$ is sampled recursively according to the following algorithm.

The $k$-th element $s_k$ of $s$ is sampled from the probability distribution of the $k$-th element conditioned on the previously sampled elements $s_1, \dots, s_{k-1}$, that is

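$$\mathrm{prob}(s_k | s_1 \dots s_{k-1}) = \frac{\langle \Psi | o_{s_1 \dots s_k} | \Psi \rangle}{\langle \Psi | o_{s_1 \dots s_{k-1}} | \Psi \rangle} \tag{5.4}$$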

where $o_{s_1 \dots s_k}$ denotes the projection operator on length-$n$ sequences with the first $k$ elements fixed to be $s_1, \dots, s_k$, i.e.

$$o_{s_1 \dots s_k} = o_{s_1} \otimes \cdots \otimes o_{s_k} \otimes 1_{k+1} \otimes \cdots \otimes 1_n \tag{5.5}$$

For tree tensor networks, the state can be evaluated particularly efficiently by recursively composing over a leaf of the tree and deleting that leaf. The gradient of a local term of (5.1) is also evaluated efficiently because of the product structure of the evaluation morphism (3.18). Namely, the gradient components for the moduli of parameters at a given vertex are computed by pulling back the tangent bundle for the local variation in (3.18) along the composition map in (3.18) of all remaining vertices (the pullback, or ‘chain rule’ for the differential of a composition of functions, is sometimes referred to as ‘back propagation’ in the context of neural network optimization).
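The recursive sampling rule (5.4) with the projectors (5.5) can be illustrated on a very small example. The sketch below assumes that the state is stored as a dense array with one axis per position, all axes of the same size, which is exponentially expensive and only meant to show the logic. It does not implement the efficient leaf-by-leaf contraction for trees. The names `prefix_weight` and `sample_sequence` are hypothetical.

```python
import numpy as np


def prefix_weight(psi, prefix):
    """<Psi| o_{s_1...s_k} |Psi>: squared norm of psi with the first k indices fixed."""
    block = psi[tuple(prefix)]                 # fix the first k indices, keep the rest free
    return float(np.sum(np.abs(block) ** 2))   # identity on the remaining positions


def sample_sequence(psi, rng):
    """Sample s_1, ..., s_n one element at a time using the conditional rule (5.4).

    Assumes psi is a nonzero dense array of shape (d, d, ..., d).
    """
    n, d = psi.ndim, psi.shape[0]
    prefix = []
    for _ in range(n):
        denom = prefix_weight(psi, prefix)
        probs = np.array([prefix_weight(psi, prefix + [a]) for a in range(d)]) / denom
        prefix.append(int(rng.choice(d, p=probs)))
    return tuple(prefix)


# Toy state on two positions with alphabet size 2: equal superposition of (0,0) and (1,1).
bell = np.zeros((2, 2))
bell[0, 0] = bell[1, 1] = 2 ** -0.5
rng = np.random.default_rng(0)
print(sample_sequence(bell, rng))  # either (0, 0) or (1, 1)
```

In the toy state the second element is fully determined by the first, which is the kind of long-distance correlation the conditional probabilities in (5.4) are meant to capture.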

## 6. Discussion

### 6.1. Supervised model and classification tasks

A ‘supervised’ version of the algorithm is also possible, where a ‘supervised’ label is a relevant operator which survives at the higher (IR) levels of the network, such as the general topic of the input text or other general feature-relevant operators. We simply add another leaf input to the network at a higher level, decorated by a vector space whose basis is the set of higher-level labels.

### 6.2. Translation of natural languages

It is expected that the various human natural languages are in the same critical universality class at a sufficiently high level of the network (i.e. at a sufficiently IR scale of the renormalization group flow).

Then a translation engine from a source language (index 1) to a target language (index 2) can be constructed by connecting, by a unitary transformation $S_{21}$ at sufficiently high layers, the two isometric tensor networks describing these languages

 (6.1)

A sequence $s$ in the source language is translated to a state $|\psi\rangle_2$ in the Hilbert space of the target language equal to

$$|\psi\rangle_2 = u_2 S_{21} u_1^{*} |s\rangle_1 \tag{6.2}$$

The state $|\psi\rangle_2$, in general, is not a basis element corresponding to a single sequence, but rather a linear combination

$$|\psi\rangle_2 = \sum_{s \in W_2^{\times n}} \psi_s |s\rangle_2 \tag{6.3}$$

with complex coefficients $\psi_s$. If the state $|\psi\rangle_2$ is sampled, a basis sequence $s$ of the target language is generated with probability proportional to $|\psi_s|^2$. As expected, the translation (6.1) is not an isomorphism between languages in the base layer. However, one can conjecture an approximate isomorphism, or a single universality class of all human languages, at some deeper scale of the renormalization group flow that could be called the scale of ‘meaning’ or ‘thought’. The renormalization group flow from the base layer to the ‘meaning’ layer is many-to-one, projecting different phrases of the source language with equivalent meaning to the same state. The map from the scale of meaning to the base layer of the target language is an ordinary isometry between vector spaces; in general, however, the expansion of the image state over the basis contains many phrases weighted with probability amplitudes. Each of these phrases is a possible translation with the corresponding probability.
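The composition in (6.2) and the expansion (6.3) can be made concrete with plain matrices. The sketch below assumes that each network has been flattened into a single isometry matrix from a shared ‘meaning’ space into the space of sequences, and it uses random isometries in place of learned ones. The helper names, the dimensions, and the explicit normalization of the weights $|\psi_s|^2$ are assumptions made for illustration.

```python
import numpy as np


def random_isometry(n_out, n_in, rng):
    """Random matrix with orthonormal columns (stand-in for a learned isometry)."""
    a = rng.normal(size=(n_out, n_in)) + 1j * rng.normal(size=(n_out, n_in))
    q, _ = np.linalg.qr(a)   # reduced QR: q has shape (n_out, n_in)
    return q


def translate(s, u1, s21, u2):
    """Compute |psi>_2 = u2 S21 u1^* |s>_1 for a basis sequence with index s."""
    e_s = np.zeros(u1.shape[0], dtype=complex)
    e_s[s] = 1.0
    return u2 @ s21 @ u1.conj().T @ e_s


def translation_probabilities(psi):
    """Normalized weights |psi_s|^2 over basis sequences of the target language."""
    weights = np.abs(psi) ** 2
    return weights / weights.sum()


rng = np.random.default_rng(0)
meaning_dim, n_source, n_target = 3, 8, 9      # hypothetical dimensions
u1 = random_isometry(n_source, meaning_dim, rng)
u2 = random_isometry(n_target, meaning_dim, rng)
s21 = random_isometry(meaning_dim, meaning_dim, rng)   # square isometry, i.e. unitary

psi = translate(5, u1, s21, u2)
probs = translation_probabilities(psi)
print(probs.shape, np.linalg.norm(psi))
```

Because $u_1 u_1^{*}$ is a projector and not the identity, the norm of the translated state is at most one, which is why the sketch normalizes the weights before interpreting them as probabilities.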

### 6.3. Network architecture

In this note, for simplicity we have assumed a certain fixed topology of an isometric tensor network. However, we expect that there is a natural generalization of the construction in which the topology of the underlying graph of the model is not fixed, but arbitrary, so that the amplitudes are computed in the spirit of Feynman diagrams where different graph topologies appear.

### 6.4. Testing the model

The presented construction is theoretical. It would be very interesting to implement the suggested models and study their performance on various types of languages that display critical properties (human natural languages, DNA sequences, musical scores, etc.).

## References

• [1] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart, “Distributed representations, parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations.”
• [2] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., “Learning representations by back-propagating errors,” Cognitive modeling 5 (1988), no. 3 1.
• [3] J. L. Elman, “Distributed representations, simple recurrent networks, and grammatical structure,” Machine learning 7 (1991), no. 2-3 195–225.
• [4] T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” Microsoft Research.
• [5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” ArXiv e-prints (Jan., 2013) 1301.3781.
• [6] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” Journal of machine learning research 3 (2003), no. Feb 1137–1155.
• [7] H. Schwenk, “Continuous space language models,” Computer Speech & Language 21 (2007), no. 3 492–518.
• [8] W. Ebeling and T. Pöschel, “Entropy and long-range correlations in literary English,” EPL (Europhysics Letters) 26 (May, 1994) 241–246, cond-mat/0204108.
• [9] W. Ebeling and A. Neiman, “Long-range correlations between letters and sentences in texts,” Physica A: Statistical Mechanics and its Applications 215 (1995), no. 3 233–241.
• [10] E. G. Altmann, G. Cristadoro, and M. D. Esposti, “On the origin of long-range correlations in texts,” Proceedings of the National Academy of Sciences 109 (2012), no. 29 11582–11587, http://www.pnas.org/content/109/29/11582.full.pdf.
• [11] M. A. Montemurro and P. A. Pury, “Long-range fractal correlations in literary corpora,” eprint arXiv:cond-mat/0201139 (Jan., 2002) cond-mat/0201139.
• [12] T. Mora and W. Bialek, “Are biological systems poised at criticality?,” Journal of Statistical Physics 144 (July, 2011) 268–302, 1012.2242.
• [13] H. Lin and M. Tegmark, “Criticality in formal languages and statistical physics,” Entropy 19 (June, 2017) 299, 1606.06737.
• [14] E. Vidal, F. Thollard, C. De La Higuera, F. Casacuberta, and R. C. Carrasco, “Probabilistic finite-state machines-part I,” IEEE transactions on pattern analysis and machine intelligence 27 (2005), no. 7 1013–1025.
• [15] E. Vidal, F. Thollard, C. De La Higuera, F. Casacuberta, and R. C. Carrasco, “Probabilistic finite-state machines-part II,” IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005), no. 7 1026–1039.
• [16] V. Pestun, J. Terilla, and Y. Vlassopoulos, “Language as a matrix product state,” in preparation (2017).
• [17] A. J. Gallego and R. Orus, “The physical structure of grammatical correlations: equivalences, formalizations and consequences,” ArXiv e-prints (Aug., 2017) 1708.01525.
• [18] L. P. Kadanoff, “Scaling laws for Ising models near T(c),” Physics 2 (1966) 263–272.
• [19] K. G. Wilson, “Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture,” Phys. Rev. B 4 (Nov, 1971) 3174–3183.
• [20] K. G. Wilson, “Renormalization group and critical phenomena. II. Phase-space cell analysis of critical behavior,” Phys. Rev. B 4 (Nov, 1971) 3184–3205.
• [21] K. G. Wilson, “The renormalization group: critical phenomena and the Kondo problem,” Rev. Mod. Phys. 47 (Oct, 1975) 773–840.
• [22] M. E. Fisher, Scaling, universality and renormalization group theory, pp. 1–139. Springer Berlin Heidelberg, Berlin, Heidelberg, 1983.
• [23] M. E. Fisher, “Renormalization group theory: its basis and formulation in statistical physics,” Rev. Mod. Phys. 70 (Apr, 1998) 653–681.
• [24] S. R. White and R. M. Noack, “Real-space quantum renormalization groups,” Phys. Rev. Lett. 68 (Jun, 1992) 3487–3490.
• [25] S. R. White, “Density-matrix algorithms for quantum renormalization groups,” Phys. Rev. B 48 (Oct, 1993) 10345–10356.
• [26] Y.-Y. Shi, L.-M. Duan, and G. Vidal, “Classical simulation of quantum many-body systems with a tree tensor network,” Phys. Rev. A 74 (Aug, 2006) 022320.
• [27] L. Tagliacozzo, G. Evenbly, and G. Vidal, “Simulation of two-dimensional quantum systems using a tree tensor network that exploits the entropic area law,” 0903.5017.
• [28] P. Silvi, F. Tschirsich, M. Gerster, J. Jünemann, D. Jaschke, M. Rizzi, and S. Montangero, “The Tensor Networks Anthology: Simulation techniques for many-body quantum lattice systems,” 1710.03733.
• [29] G. Vidal, “Class of Quantum Many-Body States That Can Be Efficiently Simulated,” Physical Review Letters 101 (Sept., 2008) 110501, quant-ph/0610099.
• [30] G. Vidal, “Entanglement Renormalization: an introduction,” ArXiv e-prints (Dec., 2009) 0912.1651.
• [31] J. M. Maldacena, “The large N limit of superconformal field theories and supergravity,” Adv. Theor. Math. Phys. 2 (1998) 231–252, hep-th/9711200.
• [32] S. S. Gubser, I. R. Klebanov, and A. M. Polyakov, “Gauge theory correlators from non-critical string theory,” Phys. Lett. B428 (1998) 105–114, hep-th/9802109.
• [33] E. Witten, “Anti-de Sitter space and holography,” Adv. Theor. Math. Phys. 2 (1998) 253–291, hep-th/9802150.
• [34] S. S. Gubser, J. Knaute, S. Parikh, A. Samberg, and P. Witaszczyk, “$p$-adic AdS/CFT,” Commun. Math. Phys. 352 (2017), no. 3 1019–1059, 1605.01061.
• [35] J. L. Elman, “Finding structure in time,” Cognitive science 14 (1990), no. 2 179–211.
• [36] A. Graves, “Generating Sequences With Recurrent Neural Networks,” ArXiv e-prints (Aug., 2013) 1308.0850.
• [37] C. Bény, “Deep learning and the renormalization group,” ArXiv e-prints (Jan., 2013) 1301.3124.
• [38] P. Mehta and D. J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” ArXiv e-prints (Oct., 2014) 1410.3831.
• [39] H. W. Lin, M. Tegmark, and D. Rolnick, “Why Does Deep and Cheap Learning Work So Well?,” Journal of Statistical Physics 168 (Sept., 2017) 1223–1247, 1608.08225.
• [40] N. Cohen, O. Sharir, and A. Shashua, “On the Expressive Power of Deep Learning: A Tensor Analysis,” ArXiv e-prints (Sept., 2015) 1509.05009.
• [41] Y. Levine, D. Yakira, N. Cohen, and A. Shashua, “Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design,” ArXiv e-prints (Apr., 2017) 1704.01552.
• [42] S. R. White, “Density matrix formulation for quantum renormalization groups,” Phys. Rev. Lett. 69 (Nov, 1992) 2863–2866.
• [43] E. Miles Stoudenmire and D. J. Schwab, “Supervised Learning with Quantum-Inspired Tensor Networks,” ArXiv e-prints (May, 2016) 1605.05775.
• [44] G. Torlai and R. G. Melko, “Learning thermodynamics with Boltzmann machines,” Physical Review B 94 (2016), no. 16 165134.
• [45] G. Carleo and M. Troyer, “Solving the quantum many-body problem with artificial neural networks,” Science 355 (Feb., 2017) 602–606, 1606.02318.
• [46] D.-L. Deng, X. Li, and S. Das Sarma, “Exact Machine Learning Topological States,” ArXiv e-prints (Sept., 2016) 1609.09060.
• [47] L. Huang and L. Wang, “Accelerated Monte Carlo simulations with restricted Boltzmann machines,” Physical Review B 95 (2017), no. 3 035105.
• [48] J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang, “On the Equivalence of Restricted Boltzmann Machines and Tensor Network States,” ArXiv e-prints (Jan., 2017) 1701.04831.
• [49] Z.-Y. Han, J. Wang, H. Fan, L. Wang, and P. Zhang, “Unsupervised Generative Modeling Using Matrix Product States,” ArXiv e-prints (Sept., 2017) 1709.01662.
• [50] D. Liu, S.-J. Ran, P. Wittek, C. Peng, R. Blázquez García, G. Su, and M. Lewenstein, “Machine Learning by Two-Dimensional Hierarchical Tensor Networks: A Quantum Information Theoretic Perspective on Deep Architectures,” ArXiv e-prints (Oct., 2017) 1710.04833.
• [51] F. W. Lawvere, “Functorial semantics of algebraic theories,” Proceedings of the National Academy of Sciences of the United States of America 50 (1963), no. 5 869–872.
• [52] S. Mac Lane, “Categorical algebra,” Bull. Amer. Math. Soc. 71 (1965) 40–106.
• [53] A. Joyal and R. Street, “The geometry of tensor calculus. I,” Adv. Math. 88 (1991), no. 1 55–112.
• [54] D. Yau, “Higher dimensional algebras via colored PROPs,” ArXiv e-prints (Sept., 2008) 0809.2161.
• [55] M. Markl, S. Merkulov, and S. Shadrin, “Wheeled PROPs, graph complexes and the master equation,” J. Pure Appl. Algebra 213 (2009), no. 4 496–535.
• [56] P. Hackney and M. Robertson, “On the category of props,” Appl. Categ. Structures 23 (2015), no. 4 543–573.
• [57] D. Yau and M. W. Johnson, A foundation for PROPs, algebras, and modules, vol. 203 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2015.
• [58] J. C. Baez, B. Coya, and F. Rebro, “Props in Network Theory,” ArXiv e-prints (July, 2017) 1707.08321.
• [59] S. Yalin, “Function spaces and classifying spaces of algebras over a prop,” Algebr. Geom. Topol. 16 (2016), no. 5 2715–2749.
• [60] G. Evenbly and G. Vidal, “Tensor Network States and Geometry,” Journal of Statistical Physics 145 (Nov., 2011) 891–918, 1106.1082.
• [61] J. Y. Lee and O. Landon-Cardinal, “Practical variational tomography for critical one-dimensional systems,” Phys. Rev. A 91 (Jun, 2015) 062128.
• [62] A. J. Ferris and G. Vidal, “Perfect sampling with unitary tensor networks,” Physical Review B 85 (Apr., 2012) 165146, 1201.3974.