Tensor Networks for Language Modeling

03/02/2020 · Jacob Miller et al.

The tensor network formalism has enjoyed over two decades of success in modeling the behavior of complex quantum-mechanical systems, but has only recently and sporadically been leveraged in machine learning. Here we introduce a uniform matrix product state (u-MPS) model for probabilistic modeling of sequence data. We identify several distinctive features of this recurrent generative model, notably the ability to condition or marginalize sampling on characters at arbitrary locations within a sequence, with no need for approximate sampling methods. Despite the sequential architecture of u-MPS, we show that a recursive evaluation algorithm can be used to parallelize its inference and training, with a string of length n only requiring parallel time O(log n) to evaluate. Experiments on a context-free language demonstrate a strong capacity to learn grammatical structure from limited data, pointing towards the potential of tensor networks for language modeling applications.


1 Introduction

Learning and reproducing the statistics of natural language is a problem of significant interest in machine learning, which has seen major advances in the last few years driven by the rapid improvement of deep neural network models (LeCun et al., 2015). While much of this progress has come from models based on recurrent neural networks (RNNs) (Sutskever et al., 2014; Merity et al., 2018), the inherently sequential nature of RNNs has proved a significant obstacle to training models on large datasets. This was a key motivation for the development of the transformer model (Vaswani et al., 2017), whose ability to parallelize the evaluation of input sequences allows for a significant speedup on GPU hardware. Furthermore, the autoregressive nature of standard language models restricts the contextual information they can use in sampling, limiting their usefulness for general sentence completion tasks. For example, sampling a sequence conditioned on a given length from an RNN language model requires approximate sampling methods (e.g., naively using rejection sampling) or specifically training a conditioned RNN model for the task (Ficler and Goldberg, 2017).

In this work, we propose a uniform matrix product state (u-MPS) model specifically suited for probabilistic sequence modeling that offers an efficient parallel evaluation procedure and a flexible sampling algorithm. This model belongs to the family of tensor networks, which have enabled major practical and theoretical advances in quantum many-body physics and have recently been successfully applied in machine learning (Novikov et al., 2015; Cohen et al., 2016; Stoudenmire and Schwab, 2016; Novikov et al., 2017; Han et al., 2018; Stoudenmire, 2018; Cheng et al., 2019). In contrast to previous models built on the fixed-size MPS tensor network, where a model is parameterized by a fixed number of core tensors (Stoudenmire and Schwab, 2016; Han et al., 2018), u-MPS are parameterized by a single core, making them recurrent in nature and able to evaluate and sample strings of arbitrary length.

One of the main appeals of using tensor network models in machine learning resides in the large collection of efficient and tractable optimization and evaluation algorithms developed in the quantum physics and numerical analysis communities (see e.g., (Oseledets, 2011; Orús, 2014)). In particular, for u-MPS we leverage such algorithms to design a parallel training algorithm and a flexible sampling algorithm allowing one to condition sequence generation on arbitrary surrounding context. Sampling is carried out exactly, without any recourse to approximate sampling techniques, using an algorithm akin to the forward-backward algorithm for hidden Markov models which permits arbitrary characters in a string to be conditioned on or marginalized over. Because the outputs of a u-MPS correspond to tensor network diagrams satisfying a general notion of associativity, its evaluation can be parallelized at the level of individual sequences, so that inputs of length n can be evaluated in only O(log n) computational depth.

To highlight the usefulness and versatility of u-MPS language modeling, we conduct several experiments with synthetic text data. Using strings chosen from context-free grammars, we demonstrate the ability of u-MPS to generalize across various length scales and to accurately generate sequences from the grammar conditioned on unseen sequence lengths and arbitrary contexts (e.g., generating well-balanced sequences of parentheses of a given length where the sub-sequence "()((" appears at a given location in the middle of the string). We show that u-MPS significantly outperform unidirectional and bidirectional LSTM baselines on a subset of tasks for which RNN language models can be used. (Note that the sampling task just described on balanced parentheses cannot be straightforwardly performed by RNNs without additional techniques, such as those of (Berglund et al., 2015).)

1.1 Summary of Contributions

We introduce a uniform MPS model for sequence modeling (Section 3), and show that its convenient mathematical layout permits a highly parallel evaluation method (Section 3.1) and a flexible conditional sampling algorithm (Section 3.2). We explore the performance of the u-MPS at different language modeling tasks using experiments on synthetic text data (Section 4), and show that it is able to make significant generalizations of formal grammatical structure, a first step towards applying the model more broadly for language modeling.

1.2 Related Work

Tensor networks have a long history in quantum many-body physics (Fannes et al., 1992; Orús, 2019), but have only recently been applied to machine learning tasks. These applications include parameter-efficient representations of large neural network weight matrices (Novikov et al., 2015), provable separations in expressivity between deep and shallow networks (Cohen et al., 2016), as well as standalone models for supervised (Stoudenmire and Schwab, 2016; Novikov et al., 2017) and unsupervised (Han et al., 2018; Stoudenmire, 2018; Cheng et al., 2019) learning. Of particular relevance to us are the works of (Han et al., 2018) and (Stokes and Terilla, 2019), where fixed-length MPS are used to learn distributions of fixed-size image and text data. Our sampling algorithm is similar to that described in (Han et al., 2018) (which in turn is based on the perfect sampling algorithm developed in (Ferris and Vidal, 2012)), but by virtue of the recurrent layout of u-MPS, is able to sample sequence data of arbitrary length. The use of positive matrices and completely positive maps in our sampling algorithm is similar to their use in the evaluation of hidden quantum Markov models (Monras et al., 2010; Srinivasan et al., 2018), and likewise admits a convenient interpretation in terms of concepts from quantum mechanics.

u-MPS are closely related to weighted finite automata (WFA), sequential models with close ties to formal language theory which encompass hidden Markov models. In particular, the quadratic weighted automata model proposed in (Bailly, 2011) is equivalent to u-MPS. However, the potential for flexible sampling, efficient training, and parallel evaluation was not investigated in (Bailly, 2011), where a variant of the spectral learning algorithm is used, in contrast with the gradient-based training algorithm we propose. Our focus is on more practical aspects of u-MPS, with the aim of exploring how the unique properties of tensor network models can complement the landscape of existing deep learning methods for sequence modelling.

u-MPS can also be seen as a particular instance of linear RNNs: u-MPS are second-order RNNs with a linear recurrent activation function and a quadratic output activation function. Linear RNNs have been studied previously for parallelization and interpretability (Martin and Cundy, 2018; Foerster et al., 2017). Connections between second-order RNNs and weighted automata have also been recently studied in (Rabusseau et al., 2019). A key difference with our work is that the flexibility of such linear RNN models for complex sampling tasks was not investigated. To the best of our knowledge, we are the first to show the potential of u-MPS and linear recurrent networks as generative models.

Finally, there have been a number of works from diverse perspectives presenting theoretical applications of tensor networks for language modeling and understanding. These include (Pestun and Vlassopoulos, 2017), (Pestun et al., 2017), (Coecke et al., 2010), (Gallego and Orús, 2017), and (DeGiuli, 2019), among others. Collectively, these works draw many interesting parallels between the mathematical structure of tensor networks and the grammatical structure of natural language, with (Pestun et al., 2017) in particular proposing an MPS model similar to the u-MPS, although with a different evaluation rule than (7). Our work demonstrates that these fundamental connections can be leveraged to design efficient and flexible new models for language modeling, and we believe that the algebraic nature of tensor networks will prove to be a powerful tool to encode complex syntactic and semantic information in language modeling tasks.

2 Background

We consider sequences over a finite alphabet Σ, and use ε to denote the empty string. Σ^n refers to the set of all length-n strings, and Σ* to the set of all strings of finite length. In the context of sampling from a distribution over strings, we refer to the locations where characters can be placed as "sites".

In the following, F will denote a field which can be chosen as either ℝ or ℂ, with properties which require a particular choice stated explicitly.

||v|| denotes the 2-norm of a vector (or higher-order tensor) v, while v† is its Hermitian adjoint (conjugate transpose, where v† = v^T for F = ℝ). We say a matrix M is Hermitian when M = M†, a condition which implies that M can be expressed in some basis as a diagonal matrix of real values. Finally, we use tr(M) to represent the trace of a square matrix M.

2.1 Tensors

A tensor T of shape (d_1, d_2, ..., d_N) over a field F can be seen as an indexed collection of elements T_{i_1 i_2 ⋯ i_N} ∈ F, where each index i_k ranges over the set {1, 2, ..., d_k}. Tensors with N indices are said to be Nth order, and the set of Nth order tensors naturally forms a vector space of dimension d_1 d_2 ⋯ d_N.

The most well-known families of tensors are vectors (first-order tensors) and matrices (second-order tensors), with the basic operations available on these objects (such as matrix multiplication and inner products) having rich generalizations in the tensor setting. Tensor contraction is one such operation, acting on tensors A and B of shapes (d_1, ..., d_M) and (e_1, ..., e_N), provided that d_i = e_j for some choice of i and j. A and B can then be contracted along their ith and jth indices to give a product tensor C, with elements

C_{a_1 ⋯ a_{i-1} a_{i+1} ⋯ a_M b_1 ⋯ b_{j-1} b_{j+1} ⋯ b_N} = Σ_{k=1}^{d_i} A_{a_1 ⋯ a_{i-1} k a_{i+1} ⋯ a_M} B_{b_1 ⋯ b_{j-1} k b_{j+1} ⋯ b_N}.    (1)

Although appearing complicated in algebraic form, (1) is in fact a straightforward generalization of matrix multiplication, and can be reasoned about with a simple graphical notation (see Figure 1). In this tensor network notation, individual tensors correspond to nodes, and tensor indices correspond to edges. The contraction of two tensors is indicated by joining their corresponding edges, and (1) gives a computational evaluation procedure yielding a new node containing all the uncontracted edges of A and B (see Figure 1d).

Of great importance is the fact that tensor contraction is associative, meaning that the order of pairwise tensor contractions used to evaluate a tensor network has no impact on its final reduced value. In practice, this property often allows seemingly intractable quantities to be computed efficiently and exactly, by rearranging the order of tensor contractions used in computation.
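As a concrete illustration of (1) and of this associativity property, the following sketch (in JAX, the framework used for our experiments; the tensor names and shapes here are arbitrary) contracts a small three-tensor network in two different orders and verifies that the results agree:

import jax.numpy as jnp
from jax import random

k1, k2, k3 = random.split(random.PRNGKey(0), 3)

# Three arbitrary tensors sharing contractible indices:
# A has shape (2, 3), B has shape (3, 4), C has shape (4, 5).
A = random.normal(k1, (2, 3))
B = random.normal(k2, (3, 4))
C = random.normal(k3, (4, 5))

# Contract (A with B) first, then the result with C ...
left_first = jnp.tensordot(jnp.tensordot(A, B, axes=[[1], [0]]), C, axes=[[1], [0]])
# ... or contract (B with C) first, then A with the result.
right_first = jnp.tensordot(A, jnp.tensordot(B, C, axes=[[1], [0]]), axes=[[1], [0]])

# Associativity: both contraction orders yield the same order-2 tensor.
assert jnp.allclose(left_first, right_first, rtol=1e-4, atol=1e-5)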

Figure 1: Examples of tensor contraction. (a) The inner product between vectors u and v gives a scalar, a zeroth-order tensor. (b) The matrix-vector product between M and v gives a vector, a first-order tensor. (c) Matrix-matrix multiplication between M and N gives a matrix, a second-order tensor. (d) As a more general example, tensor contraction between tensors of order 3 and 4 gives a new tensor of order 5. (e) The fixed-size MPS decomposition, where an nth-order tensor T is represented in terms of auxiliary core tensors A_1, ..., A_n. The mapping from the individual cores to the global T is given by contracting together all the adjacent "hidden" indices. (f) The corresponding tensor in the uniform MPS decomposition, where all the cores equal a fixed core tensor A, and boundary vectors α and ω are used to terminate the initial and final hidden indices.

Although one might think that tensors of high order rarely appear in real-world settings, an important class of examples are joint distributions over n discrete variables, which can naturally be interpreted as nth order tensors. In particular, the probability distribution P(s_1 s_2 ⋯ s_n) of a language model conditioned on a fixed length n is an nth order tensor whose indices are chosen from a vocabulary Σ of size d = |Σ|. One can thus think of a language model as an infinite family of tensors {T^n}_{n ∈ ℕ}.

Although language models can be associated with an infinite family of high-order tensors, storing these tensors directly (as in n-gram models) is infeasible, owing to the exponentially growing number of independent tensor elements. Nonetheless, the tensor network formalism permits an efficient way to parameterize high-order tensors in terms of simpler low-order factors, with the matrix product state being a good example of this method which has proven useful in quantum physics and machine learning (see e.g., (Novikov et al., 2015)).

2.2 Matrix Product States

We first discuss the (fixed-size) matrix product state (MPS) (Perez-García et al., 2007), also known as the tensor train format (Oseledets, 2011), before introducing its recurrent counterpart, the uniform MPS. A fixed-size MPS representation parameterizes an nth order tensor T of shape (d, d, ..., d) in terms of a collection of tensor "cores" A_1, A_2, ..., A_n, where A_1 has shape (d, D), A_n has shape (D, d), and all other A_i have shape (D, d, D). The dimension D is referred to as the bond dimension of the model (in general, the bond dimension is allowed to vary across sites, but it is sufficient to consider equal bond dimensions at all sites in this paper) and controls the compression rate: while T has d^n entries (where d = |Σ|), the MPS parameterization only requires storing O(n d D²) parameters. The D-dimensional contractions between the core tensors allow for nontrivial correlations to be generated between different indices of the global tensor T. Given a collection of such cores, we obtain T by simply contracting together all the core tensors in order, as shown graphically in Figure 1e.
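As an illustrative sketch (not part of our training code; the shapes below are arbitrary), the snippet builds the dense tensor T of Figure 1e from randomly initialized MPS cores by contracting the hidden indices in order:

import jax.numpy as jnp
from jax import random

d, D, n = 3, 4, 5                            # vocabulary size, bond dimension, tensor order
keys = random.split(random.PRNGKey(0), n)

# Core shapes: first (d, D), middle (D, d, D), last (D, d).
first = random.normal(keys[0], (d, D))
middle = [random.normal(k, (D, d, D)) for k in keys[1:-1]]
last = random.normal(keys[-1], (D, d))

# Contract adjacent hidden indices from left to right to recover the dense tensor.
T = first                                     # shape (d, D)
for core in middle:
    T = jnp.tensordot(T, core, axes=[[-1], [0]])   # (d, ..., d, D) x (D, d, D) -> (d, ..., d, d, D)
T = jnp.tensordot(T, last, axes=[[-1], [0]])       # final shape (d,) * n

assert T.shape == (d,) * n
# Dense storage needs d**n entries; the MPS stores only O(n * d * D**2) parameters.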

Fixed-size MPS models have been applied to classification (Stoudenmire and Schwab, 2016) and regression (Novikov et al., 2017) tasks, by embedding regular data in a high-dimensional space of nth order tensors and using MPS to parameterize the weight vectors of linear models in this space. The capabilities of this model family for supervised learning were further studied in (Glasser et al., 2018), where several enhancements were proposed.

A closely related model is the uniform MPS (u-MPS), a recurrent-style factorization parameterized by a single core tensor A of shape (D, d, D), along with a pair of D-dimensional boundary vectors α and ω. Given these three objects, one can generate nth order tensors T^n for any n ∈ ℕ using a similar procedure as for fixed-size MPS. In particular, letting A^c denote the D × D transition matrix obtained by fixing the middle index of A to c ∈ Σ, the elements of T^n are given by

T^n_{s_1 s_2 ⋯ s_n} = α† A^{s_1} A^{s_2} ⋯ A^{s_n} ω.    (2)

We will use the notation A(s) := A^{s_1} A^{s_2} ⋯ A^{s_n} to indicate the transition matrix appearing in (2), which we can think of as the u-MPS's learned representation of an arbitrary string s ∈ Σ*. In the setting of natural language models, these representations can be used for downstream learning tasks, where their definition makes them a compositional representation of arbitrary text data.

u-MPS are particularly suited to modeling functions over sequences, since the entries of the tensor T^n can be thought of as the outputs of the model on any sequence of n symbols from the alphabet Σ. Thus, given a fixed number of parameters, a u-MPS defines a function over all finite sequences of symbols from Σ. For language modeling, it would therefore seem desirable to interpret the entries of T^n as probabilities of generating the corresponding sequences.
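As a minimal sketch of how (2) is evaluated in practice (hypothetical shapes and random real parameters, for illustration only), the helper below computes the u-MPS score α^T A^{s_1} ⋯ A^{s_n} ω for a string encoded as a list of symbol indices:

import jax.numpy as jnp
from jax import random

d, D = 3, 8                                  # vocabulary size, bond dimension
k1, k2, k3 = random.split(random.PRNGKey(0), 3)
A = random.normal(k1, (d, D, D))             # A[c] is the transition matrix for symbol c
alpha = random.normal(k2, (D,))              # left boundary vector
omega = random.normal(k3, (D,))              # right boundary vector

def umps_score(symbols):
    """Return alpha^T A^{s_1} ... A^{s_n} omega for a list of symbol indices (real parameters)."""
    h = alpha
    for c in symbols:
        h = h @ A[c]                         # evolve the row hidden state by one transition matrix
    return h @ omega

s = [0, 2, 1, 1]                             # an arbitrary length-4 string over {0, 1, 2}
unnormalized_weight = jnp.abs(umps_score(s)) ** 2   # Born-rule weight |T^n_s|^2, before normalization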

This direct interpretation of T^n as a probability distribution is not without issue though, as determining whether a given u-MPS generates tensors T^n with non-negative elements for all n is in fact undecidable (Denis and Esposito, 2008). One way to circumvent this issue is to restrict all entries of A, α, and ω to be non-negative real numbers, an approach which produces models equivalent to hidden Markov models. We instead take a different approach inspired by quantum mechanics, which permits deeper ties with tensor network theory.

2.3 Born Machines

An appealing means of employing tensor networks as generative models was given in (Han et al., 2018), under the name of the Born machine architecture. This construction involves mapping an nth order tensor T to a probability distribution P over length-n sequences via the rule

P(s) = |T_s|² / Z,    (3)

where the normalization factor Z = ||T||² is the squared 2-norm of T, ensuring that P sums to 1 over all sequences of length n. The name "Born machine" comes from the fact that (3) is equivalent to the Born rule of quantum mechanics (Born, 1926), giving the tensor T a formal interpretation as the state of an n-body quantum system. Although computing the normalization factor Z is intractable for arbitrary tensors T, we can exactly and efficiently compute it whenever T is parameterized by an MPS. This well-known procedure is described in Section 3, in the context of uniform MPS.

2.4 Positive Matrices and Completely Positive Maps

We say that a matrix P is positive if it is (a) Hermitian, and (b) satisfies v† P v ≥ 0 for every vector v. For a positive P, we call P positive definite if v† P v = 0 only when v = 0, and positive semidefinite otherwise. Given a positive matrix P, the diagonal elements of P will necessarily be non-negative. For any vector v, the rank-1 matrix v v† is necessarily a positive matrix, and all positive matrices can be expressed as a weighted sum of (at most) D such rank-1 matrices, with non-negative weights. This can be used to show that tr(PQ) ≥ 0 for any positive matrices P and Q.

It is common in quantum mechanics to regard positive matrices as probabilistic states (Nielsen and Chuang, 2002), a viewpoint we adopt here. To this end, the family of completely positive (CP) maps is the natural class of operations used to manipulate positive matrices. A map Φ sending a positive matrix P to Φ(P) is said to be CP if there exists a collection of matrices {K_i}_{i=1}^r such that Φ takes the form

Φ(P) = Σ_{i=1}^{r} K_i P K_i†.    (4)

The condition (4) implies in particular that Φ(P) is necessarily positive. The matrices K_i in (4) are called the Kraus operators of the map, with r referred to as its rank. By taking the adjoint of all the Kraus operators appearing in (4), we obtain a new CP map Φ†(P) = Σ_{i=1}^{r} K_i† P K_i, which is the Hermitian adjoint of Φ.

For each uniform MPS, there is an associated CP map E called the transfer operator, which is defined using the Kraus operators {A^c}_{c∈Σ} making up the core tensor A:

E(P) = Σ_{c∈Σ} A^c P (A^c)†.    (5)

Each character c ∈ Σ also produces a rank-1 CP map E_c, defined by the single Kraus operator A^c, called the conditional transfer operator of c (relative to A):

E_c(P) = A^c P (A^c)†.    (6)

Given that each E_c is associated with exactly one Kraus operator of E, it is immediate to check that Σ_{c∈Σ} E_c = E, an important fact for the proof of the correctness of the sampling procedure.
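The transfer operator (5) and the conditional transfer operators (6) are straightforward to implement once the core is stored as a stack of transition matrices. The sketch below (illustrative, with arbitrary shapes and random parameters) also checks the adjoint identity tr(X E(Y)) = tr(E†(X) Y) that is used in the proof of Theorem 1:

import jax.numpy as jnp
from jax import random

d, D = 3, 8
A = random.normal(random.PRNGKey(0), (d, D, D))   # A[c] is the transition matrix A^c

def E_c(rho, c):
    # Conditional transfer operator (6): rho -> A^c rho (A^c)^dagger.
    return A[c] @ rho @ A[c].conj().T

def E(rho):
    # Transfer operator (5): the sum of the conditional maps over the alphabet.
    return sum(E_c(rho, c) for c in range(d))

def E_adj(rho):
    # Hermitian adjoint of E, whose Kraus operators are the adjoints (A^c)^dagger.
    return sum(A[c].conj().T @ rho @ A[c] for c in range(d))

# Check the adjoint identity tr(X E(Y)) = tr(E_adj(X) Y) on arbitrary inputs.
X = jnp.eye(D)
Y = random.normal(random.PRNGKey(1), (D, D))
Y = Y @ Y.T                                       # an arbitrary positive matrix
assert jnp.allclose(jnp.trace(X @ E(Y)), jnp.trace(E_adj(X) @ Y), rtol=1e-4)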

Figure 2: Examples of different CP maps used for a u-MPS with core tensor A. (a) The rank-1 map E_c, which conditions a context on a character c occurring at that site. E_c acts on positive matrices as P ↦ A^c P (A^c)†. (b) The rank-d map E, which marginalizes over a given site. Its Kraus operators are the d matrices {A^c}_{c∈Σ}. (c,d) Right context matrices are acted on by the CP maps E and E_c, moving them leftward, while left context matrices are acted on by the adjoints of these maps, moving them rightward. (e) The normalization factor Z_n is given by the sum over all unnormalized probabilities |α† A(s) ω|², which can be expressed as a tensor network diagram. By changing the order of contraction of the diagram, this expression can be evaluated in the form of (8), using CP maps.

3 Uniform MPS Born Machines

Our model is obtained as an immediate generalization of the Born machine model from (Han et al., 2018), in which a uniform MPS is used in place of a finite-size MPS. Thus, we modify (3) to give the probability of a sequence s = s_1 s_2 ⋯ s_n with the recurrent tensor given in (2), yielding

P(s) = |α† A^{s_1} A^{s_2} ⋯ A^{s_n} ω|² / Z_n.    (7)

The normalization factor Z_n can be computed exactly, by expressing the sum Z_n = Σ_{s∈Σ^n} |α† A(s) ω|² as a ladder-shaped tensor network and rearranging the order of contraction as shown in Figure 2e. By first contracting the vertical indices (i.e. summing over the outcomes c ∈ Σ at each site) before contracting the horizontal indices, this expression can be evaluated in terms of the nth power of the CP map E associated with the MPS core, as

Z_n = tr( E^n(ω ω†) α α† ) = α† E^n(ω ω†) α.    (8)

Applying the sequence of n CP maps in (8) takes O(n d D³) time, and produces the exact normalization factor needed to properly normalize the length-n sampling distribution. In the limit of large n, the CP map E^n can be approximated arbitrarily well in terms of the spectral properties of E, via the use of so-called "infinite" methods (e.g., (McCulloch, 2008)). Although we do not utilize them here, we mention that infinite methods can be used within generative modeling for (a) sampling (bi-)infinite streams of text with a u-MPS (described in Appendix B), and (b) circumventing the sequential bottleneck associated with computing Z_n for large values of n.
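As a sanity check on (8) (a toy sketch with random real parameters, not our experimental configuration), the normalization factor can be computed both by brute-force enumeration of Σ^n and by n applications of the transfer operator, and the two agree:

import itertools
import jax.numpy as jnp
from jax import random

d, D, n = 3, 4, 5
k1, k2, k3 = random.split(random.PRNGKey(0), 3)
A = random.normal(k1, (d, D, D))
alpha = random.normal(k2, (D,))
omega = random.normal(k3, (D,))

def E(rho):
    return sum(A[c] @ rho @ A[c].T for c in range(d))    # real parameters, so dagger = transpose

# Z_n via the CP-map form (8): alpha^T E^n(omega omega^T) alpha.
rho = jnp.outer(omega, omega)
for _ in range(n):
    rho = E(rho)
Z_fast = alpha @ rho @ alpha

# Z_n by brute force: sum of |alpha^T A^{s_1} ... A^{s_n} omega|^2 over all d^n strings.
def score(s):
    h = alpha
    for c in s:
        h = h @ A[c]
    return h @ omega

Z_brute = sum(score(s) ** 2 for s in itertools.product(range(d), repeat=n))
assert jnp.allclose(Z_fast, Z_brute, rtol=1e-4)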

3.1 Evaluation

Figure 3: Illustration of our parallel evaluation procedure on an input s of length n, which produces the tensor element T^n_s = α† A(s) ω used to obtain the probability P(s) in (7). After obtaining the transition matrices A^{s_1}, ..., A^{s_n} from s, we multiply together all pairs of neighboring matrices, and continue this process until the full product is reduced to a single matrix after ⌈log₂ n⌉ steps. Although imposing additional computational overhead from the use of matrix-matrix multiplication, the ability to carry out these operations in parallel permits large speedups in the presence of GPU resources.

To obtain the unnormalized probability of a sequence s within the distribution defined by the u-MPS, we must evaluate the length-n product of matrices and vectors on the right-hand side of (7) using the characters in s. This can be carried out in one of several different ways.

One choice for evaluating this product is with iterated vector-matrix multiplications, producing a sequence of row vectors h_j† := α† A^{s_1} ⋯ A^{s_j}, where h_0† = α† and h_j† = h_{j-1}† A^{s_j}. This evaluation method is similar to that used in vanilla RNNs, and requires O(n D²) time to carry out. However, the dependence of each hidden state on the previous one makes this method inherently sequential, meaning that even with unlimited parallel computing resources, the evaluation time is still lower-bounded by Ω(n).

By contrast, an alternate means of evaluation uses hierarchical matrix-matrix multiplications to reduce all the A^{s_i} into a single transition matrix A(s), at which point the unnormalized probability is given by |α† A(s) ω|². This proceeds in a recursive fashion on the A^{s_i}, with all pairs of neighboring matrices multiplied together to yield ⌈n/2⌉ product matrices, followed by all pairs of those products, and so forth. After ⌈log₂ n⌉ repetitions of this iterated multiplication, the resultant product matrix A(s) is obtained (see Figure 3).

This recursive evaluation strategy has a total computational cost of O(n D³), a factor of D increase compared to sequential evaluation. In return, the ability to parallelize the matrix-matrix multiplications means that the parallel evaluation time of this method is only O(log n). The ideal choice of evaluation method thus depends on the size of the model and the length of the strings being trained on, with parallel evaluation becoming increasingly attractive for models of smaller size acting on longer strings.

The fact that probabilities computed by u-MPS can be expressed as a tensor network and contracted in arbitrary order is the primary factor enabling our parallel evaluation method. It is worth mentioning that if we had used nonlinear activation functions within the hidden state of u-MPS (as is typical in RNNs), this convenient parallelization property would be lost. For a discussion of these issues for first-order RNNs, see (Martin and Cundy, 2018).
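The recursive reduction of Figure 3 amounts to a few lines of array code; the sketch below (illustrative only, with arbitrary parameters) pads the list of transition matrices to a power of two with identity matrices and halves it repeatedly, reproducing the sequential product exactly while exposing all pairwise multiplications to parallel hardware:

import jax.numpy as jnp
from jax import random

d, D = 3, 8
k1, k2, k3 = random.split(random.PRNGKey(0), 3)
A = random.normal(k1, (d, D, D))
alpha = random.normal(k2, (D,))
omega = random.normal(k3, (D,))
s = [0, 2, 1, 1, 0, 2]                            # an arbitrary input string of symbol indices

# Sequential evaluation: n vector-matrix products, O(n D^2) work but depth n.
h = alpha
for c in s:
    h = h @ A[c]
f_sequential = h @ omega

# Recursive evaluation: pad the transition matrices with identities up to a power of two,
# then multiply neighboring pairs until one matrix remains (ceil(log2 n) rounds of batched
# matrix products, which run in parallel on a GPU).
mats = A[jnp.array(s)]                            # shape (n, D, D)
n_pad = 1
while n_pad < mats.shape[0]:
    n_pad *= 2
identity_pad = jnp.broadcast_to(jnp.eye(D), (n_pad - mats.shape[0], D, D))
mats = jnp.concatenate([mats, identity_pad], axis=0)
while mats.shape[0] > 1:
    mats = jnp.matmul(mats[0::2], mats[1::2])     # multiply neighboring pairs
f_parallel = alpha @ mats[0] @ omega

assert jnp.allclose(f_sequential, f_parallel, rtol=1e-4)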

3.2 Sampling

Sampling with the u-MPS is straightforward, and follows a flexible procedure permitting arbitrary sites in the random output string to be either sampled (producing a random symbol c ∈ Σ), marginalized (left unsampled), or conditioned on taking a given value c ∈ Σ. In general, the cost of this sampling procedure when generating a string of n sites using a u-MPS with vocabulary size d and bond dimension D is O(n d D³).

For simplicity, we will describe the sampling procedure for a distribution of strings with a fixed length n, in a manner which broadly parallels that of (Ferris and Vidal, 2012; Han et al., 2018). However, slight modifications of this procedure make it possible to sample strings of random (finite) length, as well as (bi-)infinite character streams from a given u-MPS. These variations of the fixed-length sampling algorithm are discussed in Appendix B.

The sampling procedure relies on an atomic operation for drawing a single character from a u-MPS, given a specification of two positive "context" matrices L and R. The probability assigned to a character c ∈ Σ by this atomic operation is given by

P(c) = tr( L E_c(R) ) / Z,    (9)

where Z = Σ_{c'∈Σ} tr( L E_{c'}(R) ) = tr( L E(R) ) is a normalization factor.

We can think of the context matrices L and R as hidden states, which contain the model's entire collection of knowledge about the sites to the left and right of the sampled site. The following theorem (whose proof is given in Appendix A) shows how the context matrices are constructed from the parameters of a u-MPS to account for any combination of marginalized and conditioned sites to the left and right of the sampling site.

Theorem 1.

Let C ⊂ {1, ..., n} be a set of conditioned positions, let {c_j}_{j∈C} ⊂ Σ be the corresponding set of conditioning symbols, and let i ∈ {1, ..., n} \ C be a sampling position not in C.

Then, given a u-MPS (A, α, ω), the probability of the ith symbol being c ∈ Σ conditioned on the symbols c_j appearing in the corresponding positions j ∈ C in a sequence of length n is given by

P(s_i = c | s_j = c_j for all j ∈ C) = tr( L_i E_c(R_i) ) / Z,    (10)

where Z = tr( L_i E(R_i) ) and the left and right context matrices are defined by

L_1 = α α†,   R_n = ω ω†,

with

L_{j+1} = F_j†(L_j)   and   R_{j-1} = F_j(R_j),   where F_j = E_{c_j} if j ∈ C and F_j = E otherwise,

for all j ∈ {1, ..., n}, where E and E_c are the transfer operator and conditional transfer operator of A defined in Eq. (5) and (6).

  Input: u-MPS (A, α, ω); disjoint subsets S, C ⊆ {1, ..., n} of sampling and conditioning sites; conditioning characters {c_j}_{j∈C}
  // Right-to-left sweep to generate right context matrices
  Initialize R_n ← ω ω†
  for j = n to 2 do
     F_j ← E_{c_j} if j ∈ C, otherwise F_j ← E
     R_{j-1} ← F_j(R_j)
  end for
  // Left-to-right sweep to sample characters
  Initialize L ← α α†
  for j = 1 to n do
     if j ∈ S then
        Sample c_j with probability P(c_j) = tr( L E_{c_j}(R_j) ) / tr( L E(R_j) )
        F_j ← E_{c_j}
     else
        F_j ← E_{c_j} if j ∈ C, otherwise F_j ← E
     end if
     L ← F_j†(L)
  end for
  Output: Sampled characters {c_j}_{j∈S}
Algorithm 1 Uniform MPS sampling algorithm

By generating a collection of left and right context matrices for the sites we wish to sample, we can use the single-character sampling operation defined by (10) to obtain all the samples needed.

The left and right sampling contexts are generated in a two-sweep procedure, with a right-to-left sweep first producing all the right contexts R_j, and a left-to-right sweep then sampling characters while conditionally evolving the left context L.

In the first sweep, a sequence of CP maps is applied to the leftward-propagating context matrix, starting with R_n = ω ω†. Different CP maps are used for different sites, with the choice of map depending on whether the corresponding site is being conditioned on or marginalized over. When conditioning on the occurrence of a character c_j, the CP map E_{c_j} is applied, otherwise E is applied. In both cases, the resultant sequence of contexts R_n, R_{n-1}, ..., R_1 is saved for the second sweep.

In the second sweep, we similarly proceed one site at a time, but now with random characters being drawn at each sampling site. Starting with L = α α†, we move rightward via the map E† when site j is marginalized over, or E_{c_j}† when it is conditioned on. When a sampling site i is reached, the single-character operation of Eq. (10) is used with the context matrices L and R_i to draw a random sample c_i, after which L is conditionally evolved using E_{c_i}†. Finishing the second left-to-right sweep completes the sampling process.

The overall sampling procedure is summarized in Algorithm 1. The proof that this process indeed samples from the appropriately conditioned and marginalized distribution (relative to the induced distribution given in Eq. (7)) is a simple corollary of Theorem 1 and the following factorization of the joint distribution over sampled sites S = {i_1 < i_2 < ⋯ < i_k} conditioned on the conditioning sites C:

P(s_{i_1}, s_{i_2}, ..., s_{i_k} | s_C) = P(s_{i_1} | s_C) · P(s_{i_2} | s_{i_1}, s_C) ⋯ P(s_{i_k} | s_{i_1}, ..., s_{i_{k-1}}, s_C).    (11)
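For concreteness, the following sketch implements the two sweeps of Algorithm 1 for a u-MPS with real parameters. It is illustrative rather than optimized, and the function and variable names are ours (they do not correspond to any released implementation).

import jax.numpy as jnp
from jax import random

def umps_sample(key, A, alpha, omega, n, sample_sites, cond_chars):
    """Sample the characters at `sample_sites` of a length-n string, conditioning on the
    characters in `cond_chars` (a dict mapping site -> symbol) and marginalizing the rest."""
    d = A.shape[0]
    E_c = lambda rho, c: A[c] @ rho @ A[c].T            # conditional transfer operator (6)
    E = lambda rho: sum(E_c(rho, c) for c in range(d))  # transfer operator (5)
    E_c_adj = lambda rho, c: A[c].T @ rho @ A[c]        # adjoint maps act on left contexts
    E_adj = lambda rho: sum(E_c_adj(rho, c) for c in range(d))

    # Right-to-left sweep: R[j] summarizes sites j+1, ..., n (conditioned or marginalized).
    R = {n: jnp.outer(omega, omega)}
    for j in range(n, 1, -1):
        R[j - 1] = E_c(R[j], cond_chars[j]) if j in cond_chars else E(R[j])

    # Left-to-right sweep: draw characters at sampling sites and evolve the left context.
    L = jnp.outer(alpha, alpha)
    samples = {}
    for j in range(1, n + 1):
        if j in sample_sites:
            probs = jnp.array([jnp.trace(L @ E_c(R[j], c)) for c in range(d)])
            key, subkey = random.split(key)
            c = int(random.choice(subkey, d, p=probs / probs.sum()))
            samples[j] = c
        else:
            c = cond_chars.get(j)                       # None at marginalized sites
        L = E_c_adj(L, c) if c is not None else E_adj(L)
    return samples

For example, umps_sample(key, A, alpha, omega, n=10, sample_sites={3, 4}, cond_chars={1: 0, 10: 2}) draws the characters at sites 3 and 4 of a length-10 string, conditioned on fixed characters at sites 1 and 10, with all other sites marginalized over.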

4 Experiments

To assess the capacity of u-MPS for learning distributions of structured text, we carry out experiments on synthetic strings from a context-free language. These experiments involve training on sequences of a fixed length, then assessing the performance at (a) predicting single missing characters in unseen sequences, and (b) generating new sequences with no prior context. In both cases, we vary the length of sequences used for assessment, and measure the fraction of sampled outputs which produce valid strings in the language. We use a bidirectional LSTM as a baseline for (a) and a unidirectional LSTM for (b), where the number of hidden units equals the bond dimension of the u-MPS. (Note that while u-MPS are able to perform significantly more complex sampling tasks, i.e., conditioning on any number of arbitrary locations, these two simple tasks were chosen because they allow for direct comparison with RNN models.)

While MPS are traditionally trained using the density matrix renormalization group (DMRG) procedure, we train the MPS model using stochastic gradient descent (SGD) relative to a negative log likelihood (NLL) loss on the probability P(s) from (7), with gradients computed via automatic differentiation. In our experiments we take the bond dimension to be D = 50, and use early stopping relative to the NLL on a separate validation set to determine the end of training.
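A minimal sketch of this setup (illustrative; the batching, optimizer state, and early stopping logic of our actual experiments are omitted, and the helper names are ours) computes the gradient of the per-string NLL with respect to the u-MPS parameters by automatic differentiation:

import jax
import jax.numpy as jnp

def log_prob(params, symbols, n):
    """log P(s) from (7): log |alpha^T A^{s_1} ... A^{s_n} omega|^2 - log Z_n (real parameters)."""
    A, alpha, omega = params
    h = alpha
    for c in symbols:                          # left-to-right product of transition matrices
        h = h @ A[c]
    log_numerator = jnp.log((h @ omega) ** 2)
    rho = jnp.outer(omega, omega)              # Z_n computed via n powers of the transfer operator, as in (8)
    for _ in range(n):
        rho = sum(A[c] @ rho @ A[c].T for c in range(A.shape[0]))
    log_Z = jnp.log(alpha @ rho @ alpha)
    return log_numerator - log_Z

nll_grad = jax.grad(lambda params, s: -log_prob(params, s, len(s)))
# A single SGD step on one training string s, with params = (A, alpha, omega):
#   grads = nll_grad(params, s)
#   params = jax.tree_util.tree_map(lambda p, g: p - learning_rate * g, params, grads)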

The context-free language we use is defined over the three-symbol vocabulary Σ = { (, ), * }, and consists of all strings in Σ* whose parentheses are properly balanced, with no constraints placed on the placement of the * characters. For example, "*(()*)*()" and "***" are contained in the language, while "(*()*)*)" and "*)*(*" are not. We will call strings in this language "Motzkin strings", owing to their correspondence with Motzkin walks (e.g., see (Alexander et al., 2018)).
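Membership in this language can be checked with a single linear scan that tracks the running parenthesis balance and ignores the * symbols, as in the following illustrative helper (included here only to make the definition concrete):

def is_motzkin(s):
    """Return True if the parentheses in s are balanced; '*' characters are ignored."""
    balance = 0
    for ch in s:
        if ch == "(":
            balance += 1
        elif ch == ")":
            balance -= 1
            if balance < 0:          # a ')' appeared before its matching '('
                return False
    return balance == 0

assert is_motzkin("*(()*)*()") and is_motzkin("***")
assert not is_motzkin("(*()*)*)") and not is_motzkin("*)*(*")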


n     N      u-MPS %   BiLSTM %
10    10k    99.9      98.3
15    10k    99.6      99.8
30    10k    93.0      49.9
50    10k    84.0      39.5
10    1k     95.8      89.1
15    1k     87.4      92.7
30    1k     75.5      55.0
50    1k     63.0      42.1
Table 1: Character completion experiment with synthetic Motzkin strings. After training on strings of length 15, we use Algorithm 1 to conditionally sample at each site of a set of Motzkin strings of length n, given knowledge of all other characters in the string. We use a single-layer bidirectional LSTM as a baseline, with the number of hidden states in its forward and backward LSTMs each equal to the bond dimension of the u-MPS, D = 50. We conduct the experiments with N = 10k and N = 1k Motzkin strings in the training data.

For the character completion task, we train on Motzkin strings of length 15, before feeding the MPS strings of varying lengths n to use for conditional sampling. For each string, Algorithm 1 is used to sample a character s_i conditioned on knowledge of all the other characters s_j, j ≠ i. This sampling process is carried out for all values of i, with the overall percentage of correctly sampled characters given in Table 1. (Given knowledge of all other characters in a Motzkin string, each site has exactly one character which can make it a valid Motzkin string.)

While both models manage to easily predict the missing character in strings of equal length to those used for training, we find the u-MPS model better able to correctly extend its predictions to strings of different lengths. This is especially true in the presence of additional training data, which causes the u-MPS’s accuracy to increase at all tested string lengths, whereas the BiLSTM ends up generalizing worse in this setting.

For the unconditional sampling task, we again train a u-MPS on fixed-length Motzkin strings, but now use the trained model for unconditional sequence generation at different lengths. Given the difficulty of sampling streams of text from a bidirectional model, we use a single-layer unidirectional LSTM as a baseline, where strings are produced in an autoregressive fashion. The results are reported in Table 2. We find the MPS capable of correctly generalizing the rules of the context-free grammar given only fixed-length training data. Even at lengths over 3 times that used for training, we find that the model produces valid strings the majority of the time. Equally surprising is the model's ability to infer the correct distribution of Motzkin strings of length 1, in which the only allowed string is the single character *.

In contrast, the sequences output by the LSTM quickly degenerate into a stream of ) and * characters, as seen in the randomly chosen length-30 sample *(****(*)())****)*)*))*))**))). While the initial substring of length 15 does indeed belong to the language, the sampled characters after this point appear to be an incorrect generalization based on characters frequently appearing at the end of training sequences. This is in contrast to output from the MPS, such as the randomly chosen sample ***((()*)(*()(()**)())*)*, which does not exhibit such degeneration.


n     N      u-MPS %   LSTM %
1     10k    87.2      38.5
10    10k    99.6      13.1
15    10k    99.1      80.9
50    10k    67.6      0.0
1     1k     56.8      33.0
10    1k     86.6      10.1
15    1k     81.0      45.9
50    1k     28.2      0.0
Table 2: Unconditional sampling experiment with synthetic Motzkin strings. After training on strings of length 15, we use Algorithm 1 to unconditionally sample new strings of length n, using N = 10k or N = 1k training strings. For each sample length, we report the percentage of output samples which are valid Motzkin strings, and compare the performance of the u-MPS to a baseline unidirectional LSTM model. While the MPS generalizes well to different lengths, samples from the LSTM tend to consist of a Motzkin string of roughly the training length, followed by a stream of ungrammatical text.

5 Discussion

We introduced a u-MPS model for probabilistic modeling of sequences of arbitrary length. We showed that this model permits a flexible sampling procedure, in which arbitrary regions of a sequence can be sampled in the presence of arbitrary conditional context in other regions of the sequence. Furthermore, we presented a parallel evaluation procedure which enables the model to parallelize its computation within individual inputs, allowing it to overcome the sequential evaluation bottleneck of standard RNNs when training on long sequences. We carried out experiments on synthetic text data, and found that u-MPS achieve remarkable accuracy in generalizing to text of significantly different length than was present in the training data.

While we used standard results from the tensor network formalism to prove our results, there is still much that remains to be explored. Of particular interest is the use of tensor networks to "tensorize" high-dimensional linear algebraic operations, which was shown in (Novikov et al., 2015) to yield significant improvements in the parameter efficiency and runtime of existing neural network models. When applying this tensorization process to a u-MPS model (which is itself a tensor network), these techniques are even more powerful, and would enable the efficient manipulation of hidden states of exponentially large dimension (in tensor network language, this amounts to using a projected entangled pair state (PEPS) (Verstraete et al., 2006) in place of a uniform MPS). We expect this technique to be crucial in scaling up the model's capacity and performance, giving additional promise to the use of these models for competitive language modeling applications.

Acknowledgements

Jacob would like to thank Shawn Tan for helpful discussions regarding language modeling. This research is supported by the Canadian Institute for Advanced Research (CIFAR AI chair program) and the MITACS Accelerate International program.

References

  • R. N. Alexander, G. Evenbly, and I. Klich (2018) Exact holographic tensor networks for the Motzkin spin chain. arXiv:1806.09626.
  • R. Bailly (2011) Quadratic weighted automata: spectral algorithm and likelihood maximization. In Asian Conference on Machine Learning, pp. 147–163.
  • B. Balle, P. Panangaden, and D. Precup (2019) Singular value automata and approximate minimization. Mathematical Structures in Computer Science 86 (1), pp. 1–35.
  • M. Berglund, T. Raiko, M. Honkala, L. Kärkkäinen, A. Vetek, and J. T. Karhunen (2015) Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pp. 856–864.
  • M. Born (1926) Quantenmechanik der Stoßvorgänge. Zeitschrift für Physik 38 (11-12), pp. 803–827.
  • J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. Wanderman-Milne (2018) JAX: composable transformations of Python+NumPy programs.
  • S. Cheng, L. Wang, T. Xiang, and P. Zhang (2019) Tree tensor networks for generative modeling. Physical Review B 99 (15), pp. 155131.
  • B. Coecke, M. Sadrzadeh, and S. Clark (2010) Mathematical foundations for a compositional distributional model of meaning. arXiv:1003.4394.
  • N. Cohen, O. Sharir, and A. Shashua (2016) On the expressive power of deep learning: a tensor analysis. In Conference on Learning Theory (CoLT), pp. 698–728.
  • G. De las Cuevas, J. I. Cirac, N. Schuch, and D. Perez-Garcia (2017) Irreducible forms of matrix product states: theory and applications. Journal of Mathematical Physics 58 (12), pp. 121901.
  • E. DeGiuli (2019) Random language model. Physical Review Letters 122, pp. 128301.
  • F. Denis and Y. Esposito (2008) On rational stochastic languages. Fundamenta Informaticae 86 (1), pp. 41–47.
  • D. E. Evans and R. Høegh-Krohn (1978) Spectral properties of positive maps on C*-algebras. Journal of the London Mathematical Society 2 (2), pp. 345–355.
  • M. Fannes, B. Nachtergaele, and R. F. Werner (1992) Finitely correlated states on quantum spin chains. Communications in Mathematical Physics 144 (3), pp. 443–490.
  • A. J. Ferris and G. Vidal (2012) Perfect sampling with unitary tensor networks. Physical Review B 85 (16), pp. 165146.
  • J. Ficler and Y. Goldberg (2017) Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pp. 94–104.
  • J. N. Foerster, J. Gilmer, J. Sohl-Dickstein, J. Chorowski, and D. Sussillo (2017) Input switched affine networks: an RNN architecture designed for interpretability. In Proceedings of the 34th International Conference on Machine Learning, pp. 1136–1145.
  • A. Gallego and R. Orús (2017) Language design as information renormalization. arXiv:1708.01525.
  • I. Glasser, N. Pancotti, and J. I. Cirac (2018) Supervised learning with generalized tensor networks. arXiv:1806.05964.
  • Z. Han, J. Wang, H. Fan, L. Wang, and P. Zhang (2018) Unsupervised generative modeling using matrix product states. Physical Review X 8 (3), pp. 031012.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444.
  • E. Martin and C. Cundy (2018) Parallelizing linear recurrent neural nets over sequence length. In Conference on Learning Theory (CoLT).
  • I. P. McCulloch (2008) Infinite size density matrix renormalization group, revisited. arXiv:0804.2509.
  • S. Merity, N. Shirish Keskar, and R. Socher (2018) Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR).
  • A. Monras, A. Beige, and K. Wiesner (2010) Hidden quantum Markov models and non-adaptive read-out of many-body states. arXiv:1002.2337.
  • M. A. Nielsen and I. L. Chuang (2002) Quantum computation and quantum information. Cambridge University Press.
  • A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov (2015) Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450.
  • A. Novikov, M. Trofimov, and I. Oseledets (2017) Exponential machines. In International Conference on Learning Representations (ICLR).
  • R. Orús (2014) A practical introduction to tensor networks: matrix product states and projected entangled pair states. Annals of Physics 349, pp. 117–158.
  • R. Orús (2019) Tensor networks for complex quantum systems. Nature Reviews Physics 1 (9), pp. 538–550.
  • I. V. Oseledets (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing 33 (5), pp. 2295–2317.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch.
  • D. Perez-García, F. Verstraete, M. M. Wolf, and J. I. Cirac (2007) Matrix product state representations. Quantum Information and Computation 7 (5-6), pp. 401–430.
  • V. Pestun, J. Terilla, and Y. Vlassopoulos (2017) Language as a matrix product state. arXiv:1711.01416.
  • V. Pestun and Y. Vlassopoulos (2017) Tensor network language model. arXiv:1710.10248.
  • G. Rabusseau, T. Li, and D. Precup (2019) Connecting weighted automata and recurrent neural networks through spectral learning. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • S. Srinivasan, G. Gordon, and B. Boots (2018) Learning hidden quantum Markov models. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • J. Stokes and J. Terilla (2019) Probabilistic modeling with matrix product states. Entropy 21 (12).
  • E. M. Stoudenmire (2018) Learning relevant features of data with multi-scale tensor networks. Quantum Science and Technology 3 (3), pp. 034003.
  • E. Stoudenmire and D. J. Schwab (2016) Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pp. 4799–4807.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • F. Verstraete, M. M. Wolf, D. Perez-Garcia, and J. I. Cirac (2006) Criticality, the area law, and the computational power of projected entangled pair states. Physical Review Letters 96 (22), pp. 220601.

Appendix A Proof of Theorem 1

Theorem.

Let C ⊂ {1, ..., n} be a set of conditioned positions, let {c_j}_{j∈C} ⊂ Σ be the corresponding set of conditioning symbols, and let i ∈ {1, ..., n} \ C be a sampling position not in C.

Then, given a u-MPS (A, α, ω), the probability of the ith symbol being c ∈ Σ conditioned on the symbols c_j appearing in the corresponding positions j ∈ C in a sequence of length n is given by

P(s_i = c | s_j = c_j for all j ∈ C) = tr( L_i E_c(R_i) ) / Z,    (12)

where Z = tr( L_i E(R_i) ) and the left and right context matrices are defined by

L_1 = α α†,   R_n = ω ω†,

with

L_{j+1} = F_j†(L_j)   and   R_{j-1} = F_j(R_j),   where F_j = E_{c_j} if j ∈ C and F_j = E otherwise,

for all j ∈ {1, ..., n}, where E and E_c are the transfer operator and conditional transfer operator of A defined in Eq. (5) and (6).

Proof.

Let M = {1, ..., n} \ (C ∪ {i}) be the set of positions that are marginalized, and for j ≠ i write F_j = E_{c_j} if j ∈ C and F_j = E if j ∈ M. We have

P(s_i = c, s_j = c_j for all j ∈ C) = (1/Z_n) Σ_{s_k ∈ Σ, k ∈ M} |α† A^{s_1} A^{s_2} ⋯ A^{s_n} ω|²   (with s_i = c)
= (1/Z_n) tr( α α† (F_1 ∘ ⋯ ∘ F_{i-1} ∘ E_c ∘ F_{i+1} ∘ ⋯ ∘ F_n)(ω ω†) ),

where the linearity of the trace and the fact that Σ_{c'∈Σ} E_{c'}(P) = E(P) for any P is used in the last equality. Using the identity tr( X Φ(Y) ) = tr( Φ†(X) Y ), which holds for arbitrary matrices X, Y and CP maps Φ, it follows that

P(s_i = c, s_j = c_j for all j ∈ C) = (1/Z_n) tr( (F_{i-1}† ∘ ⋯ ∘ F_1†)(α α†) E_c( (F_{i+1} ∘ ⋯ ∘ F_n)(ω ω†) ) ) = (1/Z_n) tr( L_i E_c(R_i) ),

and, consequently,

P(s_i = c | s_j = c_j for all j ∈ C) = tr( L_i E_c(R_i) ) / Σ_{c'∈Σ} tr( L_i E_{c'}(R_i) ) = tr( L_i E_c(R_i) ) / tr( L_i E(R_i) ).

Appendix B Alternate Sampling Methods for the u-MPS

In Section 3.2, Algorithm 1 was given as a means of sampling strings with pre-determined length from a u-MPS model. The restriction to fixed-length samples is not fundamental however, as we show here using standard constructions employed in the study of MPS and WFA. In particular, we describe how a u-MPS can be used to sample either (a) infinite streams of text, with new characters conditioned on previously sampled characters, or (b) strings of a random (finite) length. These sampling procedures somewhat resemble those used in autoregressive RNN language models (with case (b) employing the equivalent of beginning/end of sequence tokens), but have the novel feature that sampling can also proceed backwards, with a random prefix being sampled based on all characters following it.

For simplicity, we will only discuss the case of unconditional sampling, where no sites are reserved for conditioning or marginalization. This restriction is not fundamental however, and the same techniques used in Algorithm 1 to handle conditioning and marginalization can be adapted to both forms of sampling given below.

B.1 Stream Sampling

Sampling long streams of text in a left-to-right manner can be seen as the application of Algorithm 1 in the limit n → ∞, where the end-of-string boundary has moved infinitely far to the right. Algorithm 1 cannot be used directly in this limit, as we would have to produce an infinite number of right context matrices in the right-to-left sweep before sampling, but a simple analysis of the update rule used in this sweep shows a way of proceeding.

We show in the following how the right context matrices generated in the initial sweep of Algorithm 1 generically end up converging to a fixed point R_*, so that for any fixed j, the equality R_j = R_* holds to arbitrary precision in the limit n → ∞. This gives a means of performing stream sampling with a u-MPS, by first identifying a right fixed point R_*, and then iterating over the left-to-right sweep in Algorithm 1 with the constant choice of right context matrix R_j = R_*. This idea gives a general method to perform stream sampling, with the only remaining ingredients being a proof of existence and a means of calculating such fixed points.

The basic rule used in Algorithm 1 to obtain new right context matrices from old ones is, after trace normalization,

R_{j-1} = E(R_j) / tr( E(R_j) ).    (13)

(We restrict the inputs of E to be positive matrices with unit trace, a restriction which still allows for a linearly independent basis of matrices to be chosen.)

The properties of the update rule (13) are closely related to the eigenvalues and eigenmatrices of E (given that CP maps are linear mappings between spaces of matrices, we refer to their eigenvectors as eigenmatrices), and the CP nature of the latter is quite useful. We say that an orthogonal projection matrix Π is an invariant subspace of a CP map Φ when Φ(Π X Π) = Π Φ(Π X Π) Π for all matrices X. CP maps are called irreducible when their only invariant subspaces are the trivial projectors Π = 0 and Π = I. Irreducible CP maps admit a convenient characterization in terms of a generalized Perron-Frobenius theorem (Evans and Høegh-Krohn, 1978):

Theorem 2.

For any irreducible CP map Φ, there exists a positive matrix V associated with a positive eigenvalue λ > 0 such that:

  • V is the unique eigenmatrix with eigenvalue λ.

  • λ = ρ(Φ), the spectral radius of Φ (i.e. it is an eigenvalue of maximum norm).

  • V is positive definite.

Although general CP maps need not be irreducible, by identifying their invariant subspaces we can recursively express any CP map Φ as a direct sum of irreducible CP maps Φ_1 ⊕ Φ_2 ⊕ ⋯ ⊕ Φ_m, for some m with m ≤ D. Each irreducible map Φ_k is associated with a positive eigenvalue λ_k, and the eigenmatrices whose eigenvalues equal the spectral radius of Φ (λ_max = max_k λ_k) are called the dominant eigenmatrices of Φ. It is clear that under the repeated (trace-normalized) action of Φ, any weighted sum of the eigenmatrices associated to the irreducible maps composing Φ will converge to a corresponding sum containing only the maximal eigenmatrices. In other words, the maximal eigenmatrices are precisely the generic fixed points of this iteration, and at least one such eigenmatrix exists for any CP map Φ.

Given this proof of existence, fixed points of E can generally be obtained by simply repeating the action of (13) until convergence, starting from the initial positive matrix ω ω†. In theory, this simple process of repeated application can end up converging to an oscillatory sequence of positive matrices with period p > 1, a phenomenon arising from the existence of irreducible CP maps whose spectrum contains all eigenvalues of the form λ_max e^{2πik/p} for k = 0, 1, ..., p−1. More information about such "non-primitive" CP maps can be found in (Evans and Høegh-Krohn, 1978; De las Cuevas et al., 2017).

In practice however, such oscillatory behavior isn't an issue. Simply averaging many hidden states occurring at the end of the above iteration is sufficient to produce a genuine fixed point, as is the use of a sparse eigensolver to directly compute the positive matrix associated with the largest positive eigenvalue of E. More interestingly, periodic sequences of right context matrices can be used for stream sampling in place of the constant sequence R_j = R_*, a situation which can permit the generation of long streams of text exhibiting periodic behavior.
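As a sketch of this fixed-point computation (toy random parameters and our own normalization convention, for illustration only), repeated application of the trace-normalized transfer operator converges to a dominant eigenmatrix:

import jax.numpy as jnp
from jax import random

d, D = 3, 6
k1, k2 = random.split(random.PRNGKey(0))
A = random.normal(k1, (d, D, D))
omega = random.normal(k2, (D,))

def E(rho):
    # Transfer operator (5) with real parameters.
    return sum(A[c] @ rho @ A[c].T for c in range(d))

# Trace-normalized power iteration, starting from the usual right boundary matrix.
R = jnp.outer(omega, omega)
R = R / jnp.trace(R)
for _ in range(1000):
    R = E(R)
    R = R / jnp.trace(R)

# Once converged, R is (approximately) a dominant eigenmatrix of E, i.e. E(R) is proportional to R.
lam = jnp.trace(E(R))                 # eigenvalue estimate, since tr(R) = 1
residual = jnp.linalg.norm(E(R) - lam * R) / jnp.linalg.norm(E(R))
# `residual` is close to zero when the iteration has converged to a fixed point of (13).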

B.2 Random-Length Sampling

In contrast to both fixed-length sampling and infinite stream sampling, which sample strings of known (potentially infinite) length, random-length sampling produces samples from a probability distribution over all strings s ∈ Σ*. In order to normalize the distribution P(s), we must have the sum Σ_{s∈Σ*} |α† A(s) ω|² converge to a finite value, a condition which holds generically when the u-MPS's transfer operator E has a spectral radius ρ(E) < 1. Note that for fixed-length (as well as infinite stream) sampling, the spectral radius has no impact on the sampled distribution, with any rescaling of the u-MPS core tensor A being exactly cancelled out by the normalization factor Z_n appearing in (7). By performing such a rescaling, we can therefore assume without loss of generality that ρ(E) < 1, with no impact on the u-MPS's other modes of sampling.

For random-length sampling, the probability of obtaining a string s ∈ Σ* is

P(s) = |α† A(s) ω|² / Z,    (14)

where the normalization factor Z can be expressed as

Z = Σ_{s∈Σ*} |α† A(s) ω|² = α† ( Σ_{m=0}^{∞} E^m(ω ω†) ) α.    (15)

We use Ẽ to denote the affine map Ẽ(X) = ω ω† + E(X), and R_* to denote the positive matrix which is a fixed point of Ẽ. The fact that ρ(E) < 1 implies that the limit appearing in (15) exists and is unique (Balle et al., 2019), in which case simply iterating the action of Ẽ starting from ω ω† is guaranteed to produce the fixed point R_*, so that Z = α† R_* α.

With R_* in hand, sampling proceeds by a left-to-right sweep where left context matrices L are used to produce new samples, starting with L = α α†. In each step of this sweep we use L and R_* to sample a new character from Σ ∪ {EOS}, where EOS denotes a new end-of-sequence character. The probability of obtaining the outcome c is then

P(c) = tr( L E_c(R_*) ) / Z_L for c ∈ Σ,    P(EOS) = tr( L ω ω† ) / Z_L,    (16)

where the normalization factor is Z_L = tr( L R_* ). The fact that (16) is properly normalized by Z_L comes from the fixed point identity R_* = ω ω† + E(R_*). If our outcome belongs to Σ, then we evolve the left context matrix as L ← E_c†(L) and continue sampling as in the left-to-right sweep of Algorithm 1. If the outcome is EOS then we terminate our sampling and return the string of previously sampled characters as our randomly sampled string.

Note that random-length sampling with a u-MPS allows the empty string ε to be sampled, an event which occurs with probability |α† ω|² / Z. Additionally, the spectral radius of E can be adjusted by multiplying the core tensor A by a scalar, allowing the expected length of output strings to be tuned freely.
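The following sketch (ours, for illustration; it assumes real parameters and a core that has already been rescaled so that the transfer operator has spectral radius below one) implements this random-length sampler directly from the formulas above:

import jax.numpy as jnp
from jax import random

def sample_random_length(key, A, alpha, omega, fixed_point_iters=1000):
    """Sample one finite string from a u-MPS whose transfer operator has spectral radius < 1."""
    d = A.shape[0]
    E_c = lambda rho, c: A[c] @ rho @ A[c].T
    E = lambda rho: sum(E_c(rho, c) for c in range(d))

    # Fixed point of the affine map X -> omega omega^T + E(X), i.e. sum_{m >= 0} E^m(omega omega^T).
    R_star = jnp.outer(omega, omega)
    for _ in range(fixed_point_iters):
        R_star = jnp.outer(omega, omega) + E(R_star)

    L = jnp.outer(alpha, alpha)
    out = []
    while True:
        # Unnormalized weights for each symbol in the alphabet, plus one end-of-sequence outcome.
        weights = [jnp.trace(L @ E_c(R_star, c)) for c in range(d)]
        weights.append(jnp.trace(L @ jnp.outer(omega, omega)))
        probs = jnp.array(weights)
        key, subkey = random.split(key)
        c = int(random.choice(subkey, d + 1, p=probs / probs.sum()))
        if c == d:                    # the EOS outcome terminates the string
            return out
        out.append(c)
        L = A[c].T @ L @ A[c]         # evolve the left context by the adjoint conditional map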

B.3 Sampling Backwards

Although Algorithm 1 describes an initial right-to-left sweep to produce right context matrices, followed by a left-to-right sweep to generate samples, this ordering is not necessary to sample from the distribution defined by a u-MPS. For example, we could have chosen the algorithm to consist of an initial left-to-right sweep to produce left context matrices, followed by a right-to-left sweep to sample new characters. The fact that both procedures produce samples from the same distribution can be seen by applying Theorem 1 to the reverse-ordered unfolding of the conditional distribution

P(s_{i_k}, ..., s_{i_1} | s_C) = P(s_{i_k} | s_C) · P(s_{i_{k-1}} | s_{i_k}, s_C) ⋯ P(s_{i_1} | s_{i_2}, ..., s_{i_k}, s_C),    (17)

which holds for any disjoint subsets S = {i_1 < i_2 < ⋯ < i_k} and C of {1, ..., n} (cf. (11)). Given that the distribution sampled by the single-character operation (9) is manifestly invariant under the interchange of L and R, along with the replacement of all CP maps by their adjoints, the unfolding of (17) gives an algorithm which samples strings from right to left, but yields the same distribution as Algorithm 1.

This same left/right equivalence can be applied to the sampling procedures described in Sections B.1 and B.2 (the decomposition of E into irreducible CP maps utilized in Section B.1 proceeds exactly the same with the adjoint map E†, and the eigenvalue spectra, but not the eigenmatrices, of the two maps are identical). In the latter case, as with fixed-length sampling, such a reversal in direction results in samples which are drawn from the same sampling distribution. In the case of infinite stream sampling however, this permits infinite right-to-left streams of text, as well as bi-infinite streams, where new characters can be sampled at the beginning or end of the growing sampled string.

Appendix C Experiment Details

For the experiments, the u-MPS model was written in JAX (Bradbury et al., 2018), while the LSTM module from PyTorch (Paszke et al., 2017) was used for the baselines. The LSTMs were single-layer models with 50 hidden units (50 in each direction for the bidirectional LSTM), with a linear decoder and softmax output layer used to obtain character probabilities. The bond dimension of the u-MPS was 50. In both experiments, models were trained for 100 epochs of gradient descent, with a negative log likelihood (NLL) loss and the Adam optimizer (Kingma and Ba, 2015). A fixed learning rate was used, with early stopping based on the loss on a validation set determining the time at which the models' correct sampling fraction was evaluated.

The u-MPS was trained identically for both experiments, with a loss given by the NLL of training strings of length 15. The unidirectional LSTM was trained the same way, while the bidirectional LSTM was trained specifically for the conditional sampling task. In particular, the loss was taken as a sum of the NLL of the correct character at each site of the training strings, given knowledge of all characters on the other sites.

For the unconditional generation task, we first tried training the LSTM on strings with a final EOS token appended and using rejection sampling to obtain samples of arbitrary length. This was unable to generate significantly longer strings though, so we instead sampled streams of text from the LSTM and took the first n characters of each sample, with the initial hidden state of the LSTM treated as a trainable parameter.

The Motzkin strings were obtained by generating all length-15 strings in the context-free language, then choosing a random subset. The code for both experiments can be found in the supplementary material.