# Tensors over Semirings for Latent-Variable Weighted Logic Programs

Semiring parsing is an elegant framework for describing parsers by using semiring weighted logic programs. In this paper we present a generalization of this concept: latent-variable semiring parsing. With our framework, any semiring weighted logic program can be latentified by transforming weights from scalar values of a semiring to rank-n arrays, or tensors, of semiring values, allowing the modelling of latent variables within the semiring parsing framework. A semiring turns out to be too strong a notion when dealing with tensors, and we have to resort to a weaker structure: a partial semiring. We prove that this generalization preserves all the desired properties of the original semiring framework while strictly increasing its expressiveness.


## 1 Introduction

Weighted Logic Programming (WLP) is a declarative approach to specifying and reasoning about dynamic programming algorithms and chart parsers. WLP is a generalization of bottom-up logic programming where proofs are assigned weights by combining the weights of the axioms used in the proof, and the weight of a theorem is in turn calculated by combining the weights of all its possible proof paths. The combinatorial nature of this procedure makes weighted logic programs highly suitable for specifying dynamic programming algorithms. In particular, Goodman (1999) presents an elegant abstraction for specifying and computing parser values based on WLP where the values could be drawn from any complete semiring. This generalizes the case of Boolean decision problems, probabilistic grammars with Viterbi search and other quantities of interest such as the best derivation or the set of all possible derivations. It is then possible to derive a general formulation of inside and outside calculations in a way that is agnostic to the particular semiring chosen.

Latent variable models have been an important component in the NLP toolbox. The central assumption in latent variable models is that the correlations between observed variables in the training data could be explained by unobserved, hidden variables. Latent variables have been used with grammars such as Probabilistic Context-Free Grammars (PCFGs), where each node in the parse tree is represented using a vector of latent state probabilities that further extend the expressiveness of the grammar (Matsuzaki et al., 2005).

The approach of adding latent variables to formal grammars has proven to be a fruitful one: in the context of PCFG parsing, Matsuzaki et al. (2005) show that latent variable PCFGs (L-PCFGs) perform on par with models hand-annotated with linguistically motivated features. Cohen et al. (2013) report that on the Penn Treebank dataset, L-PCFGs trained with either EM or a spectral algorithm provide a 20% increase in F1 over PCFGs without latent states. Gebhardt (2018) shows that the benefits of latent variables are not limited to PCFGs by successfully enriching both Linear Context-Free Rewriting Systems and Hybrid Grammars with latent variables, and demonstrates their applicability on discontinuous constituent parsing.

Given the usefulness of latent variables, it would be desirable to have a generic inference mechanism for any latent variable grammar. WLPs can represent inference algorithms for probabilistic grammars effectively. However, this does not trivially extend to latent-variable models because latent variables are often represented as vectors, matrices and higher-order tensors, and these taken together no longer form a semiring. This is because in the semiring framework, values for deduction items and for rules must all come from the same set, and the semiring operations must be defined over all pairs of values from this set. This does not allow for letting different grammar nonterminals be represented by vectors of different sizes. More importantly, it does not allow for a rule’s value to be a tensor whose dimensionality depends on the rule’s arity, as is generally the case in latent variable frameworks.

In this paper we start with a broad interpretation of latent variables as tensors over an arbitrary semiring. While a set of tensors over semirings is no longer a semiring, we prove that if the set of tensors have certain matching dimensions for the set of grammar rules they are assigned to, then they fulfill all the desirable properties relevant for the semiring parsing framework. This paves the way to use WLPs with latent variables, naturally improving the expressivity of the statistical model represented by the underlying WLP. Introducing a semiring framework like ours makes it easier to seamlessly incorporate latent variables into any execution model for dynamic programming algorithms (or software such as Dyna, Eisner et al. 2005, and other Prolog-like/WLP-like solvers).

We focus on CFG parsing; however, the same latent variable techniques can be applied to any weighted deduction system, including systems for parsing TAG, CCG and LCFRS, and systems for Machine Translation (Lopez, 2009). The methods we present for inside and outside computation can be used to learn latent refinements of a specified grammar for any of these tasks with EM (Dempster et al., 1977; Matsuzaki et al., 2005), or used as a backbone to create spectral learning algorithms (Hsu et al., 2012; Bailly et al., 2009; Cohen et al., 2014).

## 2 Main Results Takeaway

We present a strict generalization of semiring weighted logic programming, with a particular focus on parser descriptions in WLP for context-free grammars. Throughout, we utilize the correspondence between axioms and grammar rules, deductive proofs and grammar derivations, and derived theorems and strings.

We assume that axioms/grammar rules come equipped with weights in the form of tensors over semiring values. The main issue with going from semirings to tensors over semiring values is that these weights need to be well defined, in that any valid derivation should correspond to a sequence of well defined semiring operations. For CFGs, we give a straightforward condition that ensures this is the case. This essentially boils down to making sure that each non-terminal corresponds to a fixed vector space dimension. For example, if A corresponds to a space of d_A dimensions, B to d_B and C to d_C, then a rule A → B C would have a tensor weight in 𝕊^(d_A × d_B × d_C).
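Under the real semiring, this dimension-matching condition can be checked mechanically. A minimal numpy sketch (the rule encoding and all names are our own illustration, not the paper's):

```python
import numpy as np

def weights_well_defined(rules, weights, dims):
    """Check the dimension-matching condition: the weight of a rule
    A -> B1 ... Bk must live in S^(d_A x d_B1 x ... x d_Bk).

    rules:   dict rule_name -> (lhs nonterminal, list of rhs nonterminals)
    weights: dict rule_name -> numpy array (real-semiring tensor)
    dims:    dict nonterminal -> its fixed vector space dimension
    """
    for name, (lhs, rhs_nts) in rules.items():
        expected = (dims[lhs],) + tuple(dims[nt] for nt in rhs_nts)
        if weights[name].shape != expected:
            return False
    return True
```

Note that a rule with no non-terminals on its rhs gets a rank-1 weight (a vector over the lhs non-terminal's dimension), matching the observation made later for CFGs.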

As long as the weights are well defined, the standard definitions for the value of a grammar derivation and a string according to a semiring weighted grammar extend to the case of tensors of semirings. Weighted logic programming provides the means to declaratively specify an efficient algorithm to obtain these values of interest. In line with Sikkel (1998) and Goodman (1999) we present precise conditions for when a partial-semiring WLP describes a correct parser.

The value of the WLP formulation of parsing algorithms is that it provides a unified fashion in which dynamic programming algorithms can be extracted from the program description. This relies on the ability of a WLP to decompose the value of a proof to a combination of the values of the sub-proofs. Specifically, given a derivation tree, a WLP description automatically provides algorithms for calculating the inside and outside values. We provide analogous algorithms for calculating the inside and outside values for partial-semiring WLPs. Our outside formulation addresses the non-commutative nature of tensors themselves, and could be extended to cases where the underlying semiring is non-commutative using the techniques presented by Goodman (1998).

## 3 Related Work

“Parsing as deduction” (Pereira and Warren, 1983) is an established framework that allows a number of parsing algorithms to be written as declarative rules and deductive systems (Shieber et al., 1995), and their correctness to be rigorously stated (Sikkel, 1998). Goodman (1999) has extended the parsing as deduction framework to arbitrary semirings and showed that various different values of interest could be computed using the same algorithm by changing the semiring. This led to the development of Dyna, a toolkit for declaratively specifying weighted logic programs, allowing concise implementation of a number of NLP algorithms (Eisner et al., 2005).

The semiring characterization of possible values to assign to WLPs gave rise to the formulation of a number of novel semirings. One novel semiring of interest for purposes of learning parameters is the generalized entropy semiring (Cohen et al., 2008) which can be used to calculate the KL-divergence between the distribution of derivations induced by two weighted logic programs. Other two semirings of interest are expectation and variance semirings introduced by Eisner (2002) and Li and Eisner (2009)

. These utilize the algebraic structure to efficiently track quantities needed by the expectation-maximization algorithm for parameter estimation. Their framework allows working with parameters in the form of vectors in

for a fixed , coupled with a scalar in . The semiring value of a path is roughly calculated by the multiplication of the scalars and (appropriately weighted) addition of the vectors. This is in contrast with our framework where weights could be tensors of arbitrary rank rather than only vectors, and the values of paths are calculated via tensor multiplication.

Finally, Gimpel and Smith (2009) extended the semiring framework to a more general algebraic structure with the purpose of incorporating non-local features. Their extension comes at the cost that the new algebraic structure does not obey all the semiring axioms. Our framework differs from theirs in that under reasonable conditions, tensors of semirings do behave fully like regular semirings.

## 4 Background and Notation

Our formalism could be used to enrich any WLP that implements a dynamic programming algorithm, but for simplicity, we follow Goodman (1999) and focus our presentation on parsers with a context-free backbone. (Note that given a grammar in some formalism and a string, it is possible to construct a CFG grammar from the two (Nederhof, 2003). This construction is possible even for range concatenation grammars (Boullier, 2004), which span all languages that can be parsed in poly-time.)

### 4.1 Context-free Grammars

Formally, a Context-Free Grammar (CFG) is a 4-tuple G = (N, Σ, R, S). The set N denotes the non-terminals, which will be denoted by uppercase letters A, B, C, etc., and S ∈ N is a non-terminal that is the special start symbol. The set Σ denotes the terminals, which will be denoted by lowercase letters a, b, c, etc. R is the set of rules of the form A → α, consisting of one non-terminal on the left hand side (lhs) and a string α on the right hand side (rhs). We will write α ⇒ β if β can be derived from α with the application of one grammar rule. We will say that a sentence σ can be derived from the non-terminal A if σ can be generated by starting with A and repeatedly applying rules in R until the right hand side contains only terminals, and denote this as A ⇒* σ. We will denote the language that a grammar G defines by L(G).

CFG derivations can naturally be represented as trees. We will use the notation ⟨r : T1, …, Tk⟩ to represent a tree that has the node r as its root and T1, …, Tk as its direct subtrees. We will use D(G) to denote the set of all derivation trees that can be constructed with the grammar G, and D(σ) for all valid derivation trees that generate the sentence σ in G.

### 4.2 Semirings

A semiring is an algebraic structure similar to a ring, except that it does not require additive inverses.

###### Definition 1.

A semiring is a set 𝕊 together with two operations + and ×, where + is commutative, associative and has an identity element 0. The operation × is associative, has an identity element 1 and distributes over +.

The set of non-negative integers together with the usual + and × is a semiring, and so are the probability values in [0, 1]. Booleans {TRUE, FALSE} also form a semiring with + = ∨, × = ∧, 0 = FALSE and 1 = TRUE.

There are a few less common semirings that provide useful values in parsing. The Viterbi semiring calculates the probability of the best derivation. It has values in [0, 1], with + as max and × as standard multiplication. The derivation forest, Viterbi derivation and Viterbi n-best semirings calculate the set of all derivations, the best derivation and the n best derivations respectively. Unlike the previous examples, the × operation of these semirings is not commutative. In general, if the × operation in a semiring is commutative, we refer to it as a commutative semiring, and otherwise it is referred to as non-commutative. For precise definitions and detailed descriptions of these semirings see Goodman (1999).
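The semiring interface can be captured concretely with a small sketch; the class and instance names below are our own illustration, not part of the paper:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    """A semiring packaged as (plus, times, zero, one)."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

# The probability semiring: ordinary + and × over [0, 1].
PROB = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

# The Boolean semiring: + is disjunction, × is conjunction.
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)

# The Viterbi semiring: max replaces +, so summing over derivations
# yields the probability of the best derivation.
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
```

Swapping one instance for another changes which quantity a weighted parser computes without touching the parsing logic itself.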

### 4.3 Weighted Logic Programming

A logic program consists of axioms and inference rules that can be applied iteratively to prove theorems. Inference rules are expressed in the form A1 … Ak / B, where A1, …, Ak are antecedents from which B can be concluded. Axioms are inference rules with no antecedents.

One way to express dynamic programming algorithms such as CKY is as logic programs. This approach takes the point of view of parsing as deduction: terms consist of grammar rules and items of the form [i, A, j] that correspond to the intermediate entries in the chart. Grammar rules are taken to be axioms, and the description of the parser is given as a set of inference rules. These can have both grammar rules and items as antecedents and an item as the conclusion. A logic program in this form includes a special designated goal item that stands for a successful parse.

Continuing with the example of CKY, consider the procedural description for how to obtain a chart item [i, A, j] from smaller chart items if we have the rule A → B C in the grammar:

 chart[i, A, j] := chart[i, A, j] ∨ (chart[i, B, k] ∧ chart[k, C, j])

The corresponding inference rule in a logic program would be:

 A → B C    [i, B, k]    [k, C, j]
 ─────────────────────────────────
            [i, A, j]

Note that in the inference rule above, A → B C is a rule template with free variables A, B and C. In general, the terms in inference rules can contain free variables; however, for a logic program to describe a valid dynamic programming algorithm, every free variable in the conclusion of an inference rule must appear in its antecedents as well.

A weighted logic program is a logic program where terms are assigned values from a semiring. When paired with semiring operations, inference rules provide the description of how to compute the value of the conclusion given the values of the antecedents. The result of an application of a particular inference rule is the semiring multiplication of all the antecedents. The value of a term x is then calculated as the semiring sum of the values obtained from inference rules that have x as their conclusion.
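The computation just described can be sketched for the CKY deduction rule above, parameterized by the semiring operations; the grammar encoding and function names are our own, not the paper's:

```python
def cky_value(n, rules, lexical, plus, times, zero, start="S"):
    """Semiring-weighted CKY: compute values of items [i, A, j] bottom-up.

    rules:   dict (A, B, C) -> weight of the rule A -> B C
    lexical: dict (A, i)    -> weight of A covering the word at position i
    Items are filled in order of increasing span length, so every
    antecedent value is available before it is used.
    """
    chart = {}
    for (A, i), w in lexical.items():  # axioms: items [i, A, i+1]
        chart[(i, A, i + 1)] = plus(chart.get((i, A, i + 1), zero), w)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), w in rules.items():
                    left = chart.get((i, B, k), zero)
                    right = chart.get((k, C, j), zero)
                    # one rule application: semiring product of the
                    # rule weight and both antecedent item values
                    contrib = times(w, times(left, right))
                    chart[(i, A, j)] = plus(chart.get((i, A, j), zero), contrib)
    return chart.get((0, start, n), zero)
```

Passing the probability operations computes the total probability of the goal item; passing Boolean ∨/∧ turns the same code into a recognizer.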

### 4.4 Semiring Parsing

In the context of parsing, Goodman (1999) presents a framework where a grammar comes equipped with a function w that maps each rule in R to a semiring value. Then, a grammar derivation string E consisting of the successive applications of rules R1, …, Rn is defined to have the value ∏_i w(R_i), and the value of a sentence σ is defined as ∑_i V(E_i), where the E_i are the derivations of σ in G.

A parser specification is given in the form of a weighted logic program, referred to as item-based description. From these, the value of a derivation is calculated recursively as follows:

 V(D) = { w(D)                  if D is a rule
        { ∏_{i=1}^{k} V(D_i)    if D = ⟨b : D1, …, Dk⟩

where ∏ is the semiring product.

Let inner(x) represent the set of all derivation trees headed by the item x. Then the value of x is:

 V(x) = ∑_{D ∈ inner(x)} V(D)

where ∑ is the semiring addition. The value of a sentence is then equal to V(goal).

Given the definitions of value according to the grammar and the parser, Goodman (1999) provides a theorem for conditions of correctness:

###### Theorem 4.1.

(Goodman 1999, Theorem 1; informal) An item-based description is correct if for every grammar there exists a one-to-one correspondence between the grammar and item derivations, and these derivations get the same value regardless of the weight function used.

One caveat with calculating values based on item-based derivations is that there is an ordering of items: we cannot compute the value of an item unless the values of all its children have already been computed. For this, Goodman (1999) assumes that each item x is assigned to a bucket B(x) so that if an item x depends on an item y, then B(y) ≤ B(x). If a bucket depends on itself, then it is considered a special looping bucket. For all the formulas we present in the main paper we assume that the items belong to non-looping buckets. The formulas for looping buckets are provided in Appendix B.

For an item , calculating its value might require summing over exponentially many derivation trees. To address this, it is possible to provide a general formula that efficiently computes the inner value for an item (Goodman 1999, Theorem 2):

 V(x) = ∑_{a1, …, ak s.t. a1, …, ak / x} ∏_{i=1}^{k} V(a_i)

The other important value associated with an item x is its outside value Z(x), which is the sum of the values of derivation trees modified so that x is removed together with all its subtrees. This value is complementary to the inside values (Goodman 1999, Theorem 4):

 V(x) × Z(x) = ∑_{D a derivation} V(D) C(D, x)

where C(D, x) is the count of the occurrences of the item x in the derivation D.

Z(x) can likewise be calculated using a recursive formula if the values are from a commutative semiring (Goodman 1999, Theorem 5).

### 4.5 Tensor Notation

We use the term tensor to refer to an n-dimensional array of semiring values. We use 𝕊 to denote a semiring and A, B, etc. to denote tensors. We write A ∈ 𝕊^(d1 × … × dn) to denote that A is a rank-n tensor of values drawn from 𝕊, with the i-th rank having dimension d_i. The entry at index i1, …, in will be denoted with subscripts as A_{i1, …, in}.

## 5 Latent-variable Parsing as Tensor Weighted Logic Programs

For semiring parsing to work for latent-variable models it should allow weights to be vectors, matrices and tensors. In this section we present a framework that generalizes that of Goodman (1999), and is able to capture tensors over semirings as weights. Note that this includes scalars as a special case.

### 5.1 Semiring Operations

The main reason why tensors over semirings are not semirings is that with tensor weights, addition and multiplication become partially defined – not every element can naturally be added to or multiplied with every other element anymore. We refer to these structures as partial semirings. With some reasonable constraints, we show that ⊕ and ⊗ obey the semiring axioms in the cases that are relevant for the semiring parsing framework.

Let 𝕊 be the chosen underlying semiring, + and × the semiring operations, and 0 and 1 the additive and multiplicative identities of the semiring respectively. The set of possible weights is the set of tensors 𝕊^(d1 × … × dn) for all ranks n and all dimensions d1, …, dn. ⊕ is a partial addition that is defined on two tensors as long as the dimensions of each of their ranks match. The addition is then defined component-wise:

 (A ⊕ B)_{i1, …, in} := A_{i1, …, in} + B_{i1, …, in}

The additive identity is now a class of tensors, one for each unique list of tensor dimensions. The additive identity for any 𝕊^(d1 × … × dn) is the tensor with 0 in every entry.

Multiplication is defined as the contraction of an index between two tensors with an arbitrary number of ranks. Specifically, we consider the family ⊗_[k;l] which contracts the k-th rank of the first tensor with the l-th rank of the second tensor. This is only defined if the two ranks to be contracted have the same dimension, as follows:

 (A ⊗_[k;l] B)_{i1, …, i(k−1), j1, …, j(l−1), j(l+1), …, jm, i(k+1), …, in} := ∑_{ik, jl} δ(ik, jl) A_{i1, …, in} × B_{j1, …, jm}

where δ(i, j) is the indicator function that is equal to 1 if i = j and 0 otherwise. Note that the ranks corresponding to B which are not contracted over go in between the ranks of A, replacing where the contracted rank of A was. We will use ⊗_[k] as a shorthand for ⊗_[k;1], and in cases where k = 1, we will omit the subscript on ⊗ altogether.
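Over the real semiring, this contraction can be realized with `numpy.tensordot` plus an axis permutation that restores the rank ordering defined above (the function name is ours):

```python
import numpy as np

def contract(A, B, k, l):
    """⊗_[k;l] over the real semiring: contract rank k of A with rank l
    of B (both 1-indexed, as in the text). B's remaining ranks are
    inserted where A's contracted rank was."""
    k0, l0 = k - 1, l - 1          # 0-indexed axes
    # tensordot sums over the paired axes; its result lists A's
    # remaining axes first, then B's remaining axes
    C = np.tensordot(A, B, axes=(k0, l0))
    nA = A.ndim - 1                # number of remaining A axes in C
    mB = B.ndim - 1                # number of remaining B axes in C
    # move B's axes so they sit where A's contracted axis used to be
    order = list(range(k0)) + list(range(nA, nA + mB)) + list(range(k0, nA))
    return np.transpose(C, order)
```

With k = l = 1 and a rank-1 second argument this reduces to ordinary matrix–vector–style contraction, which is the case used for derivation values below.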

More generally, we will allow multiplication operations that contract multiple consecutive dimensions. ⊗^r_[k;l] will denote contracting rank k of A with rank l of B, rank k+1 of A with rank l+1 of B, and so forth until rank k+r−1 of A and rank l+r−1 of B. Formally:

 (A ⊗^r_[k;l] B)_{i1, …, i(k−1), j1, …, j(l−1), j(l+r), …, jm, i(k+r), …, in} := ∑_{ik, …, i(k+r−1), jl, …, j(l+r−1)} (∏_{p=0}^{r−1} δ(i(k+p), j(l+p))) A_{i1, …, in} B_{j1, …, jm}

We will use the notation as a shorthand for if and otherwise.

To make the presentation clearer, we will also use the notation A ⊗ [B1, …, Bk] to denote the contraction of B1 with the first rank of A, B2 with the second, and so forth. In other words, A ⊗ [B1, …, Bk] is equivalent to contracting each Bi with the i-th rank of A.

The multiplicative identity for ⊗ and ⊗_[k;l] is the identity matrix, where the diagonal entries are the multiplicative identity of the underlying semiring and the non-diagonal entries are the additive identity. For ⊗^r the multiplicative identity is a rank-2r tensor and is defined as follows:

 I_{d1, …, d2r} = ∏_{i=1}^{r} δ(d_i, d_{r+i})

Lastly, as the higher-order analogue of the transpose operator, we will define a permutation operator A^π, where π is a permutation of [1, …, n] and n is the rank of A. The i-th rank of A^π is equal to the π(i)-th rank of A.

The key property of semirings for purposes of efficient calculation of item values is the distributive property. This property also holds for tensors over semirings.

###### Lemma 5.1.

For any k and l, ⊗_[k;l] distributes over ⊕.

A proof can be found in Appendix A.

### 5.2 Grammar Derivations

For a grammar G with a function w that provides a mapping from rules to tensor weights, we will define the value of a derivation via the derivation tree:

###### Definition 2.

Given a grammar and a weight function , the value of a derivation tree is:

 V_w^G(T) = { w(r)                                 if T = ⟨r⟩
            { w(r) ⊗ [V_w^G(T1), …, V_w^G(Tk)]     if T = ⟨r : T1, …, Tk⟩

Note that there is no guarantee that this equation is defined for an arbitrary w. We will call a weight function w well defined for a grammar G if V_w^G(T) is defined for all valid derivation trees T in D(G). For CFGs there is a straightforward method to ensure that w is well defined:

###### Lemma 5.2.

A set of weights for a given CFG is well defined if there exist consistent dimensions d_A for each non-terminal A such that for all grammar rules r with lhs non-terminal A and rhs non-terminals B1, …, Bk, w(r) ∈ 𝕊^(d_A × d_B1 × … × d_Bk).

Proof is given together with Lemma 5.3.

Note that if a weight function for a CFG is well defined, then the rank of the weights of rules with no non-terminals on their rhs is always 1.

Given a grammar derivation tree, let us call the list of derivation rules appearing in it, ordered in a depth-first, left-to-right manner, a grammar derivation string.

###### Definition 3.

Given a CFG G with tensor weights w, the value of a grammar derivation string E = R1, …, Rn is defined as:

 V_w^G(E) = ⨂_i w(R_i)

where the application of ⊗ proceeds from left to right as is standard.

For semirings, since the bracketing does not affect the final value of an expression, it is straightforward to show that the value of a grammar derivation tree corresponds to that of a grammar derivation string. With tensors over semirings this might fail with an arbitrary formalism, and in general we require the value of a derivation to be calculated with the bracketing induced by the derivation tree. However, for the special case of CFGs, the value of the grammar derivation tree and the value of its corresponding grammar derivation string are always equal. This means that for the computation of the value of the derivation, it is possible to replace the bracketing induced by the derivation tree by left-to-right bracketing without affecting the final value. Figure 1 demonstrates the calculation of the value of the tree and the string for the same derivation together with how the tensor dimensions of the intermediate results evolve with each step of the calculation.
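Specialized to the real semiring, the tree evaluation of Definition 2 can be sketched in numpy; the tree and weight encodings here are our own:

```python
import numpy as np

def tree_value(tree, weights):
    """Value of a derivation tree with tensor rule weights (real semiring).

    tree: either a rule name (a leaf; its weight is a vector over the lhs
    nonterminal's latent states) or a tuple (rule, child_tree, ...).
    weights[rule] is a tensor whose first rank is the lhs nonterminal and
    whose remaining ranks match the rhs nonterminals in order."""
    if not isinstance(tree, tuple):
        return weights[tree]               # base case: w(r), a vector
    rule, *children = tree
    value = weights[rule]
    # w(r) ⊗ [V(T1), ..., V(Tk)]: since each child's value is rank-1,
    # contracting left to right with axis 1 (axis 0 stays reserved for
    # the lhs nonterminal) matches the bracket notation
    for child in children:
        value = np.tensordot(value, tree_value(child, weights), axes=(1, 0))
    return value
```

The left-to-right loop is exactly the replacement of tree-induced bracketing by left-to-right bracketing that Lemma 5.3 licenses for CFGs.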

###### Lemma 5.3.

Given a CFG G and a weight function w that fulfills the condition in Lemma 5.2, w is well defined and V_w^G(T) = V_w^G(E) for any grammar derivation tree T and corresponding grammar derivation string E.

###### Proof.

We will proceed by induction on the derivation tree. If T consists of only one rule r, then V_w^G(T) = w(r) = V_w^G(E). Furthermore, r does not have any non-terminals on its rhs, so w(r) ∈ 𝕊^(d_A), with d_A corresponding to the lhs non-terminal of r.

Otherwise, T has a root node labeled r and the subtrees T1, …, Tk. Notice that if V_w^G(T1) ∈ 𝕊^(d_1), …, V_w^G(Tk) ∈ 𝕊^(d_k), then w(r) ⊗ [V_w^G(T1), …, V_w^G(Tk)] = (…(w(r) ⊗ V_w^G(T1)) … ⊗ V_w^G(Tk)) due to all arguments within the brackets being rank-1.

Because w fulfills the condition in Lemma 5.2, w(r) ∈ 𝕊^(d_A × d_B1 × … × d_Bk), where d_A is the dimension of the space corresponding to the non-terminal on the lhs of r, and d_Bi is the dimension of the space corresponding to the i-th non-terminal appearing on the rhs of r. Then to complete the proof, it suffices to show that V_w^G(Ti) ∈ 𝕊^(d_Bi) for all subtrees Ti. This already holds for the base case. For each Ti, by induction V_w^G(Ti) ∈ 𝕊^(d_C), where d_C is the dimension corresponding to the non-terminal on the lhs of the root rule of Ti. For the derivation to be valid, this non-terminal needs to match the i-th non-terminal on the rhs of r, hence V_w^G(Ti) ∈ 𝕊^(d_Bi). ∎

### 5.3 Item-based Descriptions

Item-based descriptions are formal descriptions of various parsers for context-free grammars. Item-based descriptions consist of a set of deduction rules of the form A1 … Ak / B, where the antecedents A1, …, Ak could either be grammar rule templates (e.g. if A → B C appears in a rule, then any non-terminals from the grammar can be substituted for A, B and C) or items. B is referred to as the conclusion, and a deduction rule may additionally carry side conditions that the parser requires in order to execute the rule, but whose values it does not use. Items correspond to chart elements in procedural descriptions of parsers, and are placeholders for intermediate results which can be combined to obtain the final result. The item-based description also provides a special goal item which is variable-free, and does not occur as a condition of any other inference rules.

###### Definition 4.

Given a grammar G and an item-based description I, a valid item derivation tree is defined as follows:

• For all r ∈ R, ⟨r⟩ is an item derivation tree.

• If D1, …, Dn are derivation trees headed by a1, …, an respectively, and a1 … an / b is the instantiation of a deduction rule in I, then ⟨b : D1, …, Dn⟩ is also an item derivation tree.

inner_σ(x) denotes the set of all trees headed by x that occur in parses for σ. Formally, D ∈ inner_σ(x) if D is headed by x and is a subtree of some D′ ∈ D(σ). The value of an item derivation tree is calculated similarly to that of a grammar tree:

 V_w^{I(G)}(D) = { w(D)                                                     if D is a rule
                { V_w^{I(G)}(D1) ⊗ [V_w^{I(G)}(D2), …, V_w^{I(G)}(Dn)]     if D = ⟨b : D1, …, Dn⟩

Notice that unlike the definition from Goodman (1999), the first antecedent in the inference rule has a special role in the calculation. Intuitively, our framework treats the value of the first antecedent as a function, and the trailing ones as the arguments. The interaction between the trailing antecedents is thus moderated through the value of the first antecedent, which corresponds to the requirement that the child nodes be independent of each other given the parent node.

###### Definition 5.

If for any instantiation a1 … an / b of a deduction rule in I and any item derivation trees D1, …, Dn headed by a1, …, an, the value V_w^{I(G)}(D1) ⊗ [V_w^{I(G)}(D2), …, V_w^{I(G)}(Dn)] is defined, then the weights are well defined.

Given an item-based description I, a grammar G, a well defined weight function w and a target sentence σ, the value of an item x is defined to be the sum of the values of all its possible derivations. Formally:

 V_w^{I(G)}(x, σ) = ⨁_{D ∈ inner_σ(x)} V_w^{I(G)}(D)
###### Definition 6.

For a given grammar G and item-based description I, the value of a sentence σ is equal to the value of the goal item which spans σ:

 V_w^{I(G)}(σ) = V_w^{I(G)}(goal, σ)
###### Definition 7.

An item-based description I is correct if for all grammars G, complete semirings 𝕊, well defined weight functions w and sentences σ, V_w^{I(G)}(σ) = V_w^G(σ).

Now we are ready to state the theorem equivalent to Theorem 4.1. Let us introduce a special symbol ⊥ and extend V_w^G and V_w^{I(G)} to any weight function w, so that if w is not well defined for a derivation d, then V_w^G(d) = ⊥, and likewise for V_w^{I(G)}.

###### Theorem 5.4.

An item-based description is correct if

• For every grammar G, the mapping f that maps each grammar derivation d to the corresponding item derivation f(d) is a bijection with an inverse function g.

• For any complete semiring 𝕊 and weight function w, f and g preserve the values assigned to a derivation:

 V_w^G(d) = V_w^{I(G)}(f(d))  and  V_w^{I(G)}(d′) = V_w^G(g(d′))

The proof proceeds similarly to that in Goodman (1999) and can be found in Appendix A.

## 6 Inside and Outside Calculations

In the following, we will omit the sentence σ from inner_σ(x) and refer to it as inner(x). Let inner(a1, …, ak, x) be the set of derivation trees where the root node is x, and the direct children of x are headed by a1, …, ak.

For efficient computation of this value, we will assume that there is a partial order ⪯ on the items so that if the item x depends on y, then y ⪯ x.

###### Theorem 6.1.
 V(x) = ⨁_{[a1, …, ak] s.t. a1, …, ak / x} V(a1) ⊗ [V(a2), …, V(ak)]

The proof uses the distributive property and follows that of Goodman (1999). It can be found in Appendix A.
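For a CFG backbone with rank-1 item values, this recursion instantiates to the familiar latent-variable inside algorithm. A sketch over the real semiring, with our own encoding of rules and items:

```python
import numpy as np

def latent_inside(n, rules, lexical, dims, start="S"):
    """Inside values V[i, A, j] are vectors over A's latent states.

    rules:   dict (A, B, C) -> tensor of shape (d_A, d_B, d_C)
    lexical: dict (A, i)    -> vector of shape (d_A,)
    dims:    dict nonterminal -> number of latent states
    """
    V = {}
    for (A, i), vec in lexical.items():  # axioms: lexical items
        V[(i, A, i + 1)] = V.get((i, A, i + 1), np.zeros(dims[A])) + vec
    for span in range(2, n + 1):         # fill longer spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), w in rules.items():
                    if (i, B, k) in V and (k, C, j) in V:
                        # V(a1) ⊗ [V(a2), V(a3)]: contract the rule
                        # tensor with both child state vectors
                        contrib = np.einsum("abc,b,c->a", w,
                                            V[(i, B, k)], V[(k, C, j)])
                        V[(i, A, j)] = V.get((i, A, j),
                                             np.zeros(dims[A])) + contrib
    return V.get((0, start, n))
```

Setting every dimension to 1 recovers the scalar probability-semiring inside algorithm as a special case.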

For the notion of the value of a derivation to extend to outside trees, we will have to make some modifications. This is because an outside tree will have one subtree ⟨b : D1, …, Dn⟩ whose value will potentially not be defined, since one of its subtrees will be missing. Note that the missing subtree will be headed by an item. We will say that a tree T ∈ outer(x) if T can be obtained by taking a tree headed by the goal item and removing one of its subtrees headed by the item x. The outer value Z(T) is defined recursively as follows:

If T is headed by the goal item then Z(T) is the multiplicative identity. Else, it has a direct parent tree T_p with direct subtrees T1, …, Tn such that T = T_k for some k. In this case,

 Z(T) = (V(T1) ⊗_k [I_{T_k × d_S}, V(T_{k+1}), …, V(T_n)])^π ⊗ [V(T2), …, V(T_{k−1})] ⊗* Z(T_p)

where I_{T_k × d_S} is the identity tensor for the space corresponding to T_k and the start symbol, and d_S is the dimension assigned to the start symbol S. The permutation π is defined as follows:

 [1,2,…,i,j+1,j+2,…,n,i+1,i+2,…,j]

where and

To understand the function of π, it is useful to consider the dimensions of the term before and after it is applied. Let the term have dimensions:

 e1×…×ek−1,d1×…×di×dS× ek×d1×…×di×dS×d′n×…×d′m

Here are the dimensions that will be contracted with with the second multiplication operation, and are the dimensions that were either introduced by the contraction with or were trailing dimensions from . The result of the contraction with are the dimensions in the middle: . Unlike the original definition of there is one dimension missing from the beginning of the sequence since it got used up during the contraction operation. What the permutation does is to move one section of the dimensions introduced by to the very end. The dimensions become:

 e1×…×ek−1,d1×…×di× d′n×…×d′m×dS×ek×d1×…×di×dS

Note that this has no effect on the next contraction since the first ranks are left in place. However, changing the order of the ranks allows the last contraction to be well defined.

###### Lemma 6.2.

Let V and Z be defined on a commutative semiring, and let T ∈ inner(x) and O ∈ outer(x). If combining T and O in the obvious way results in the complete derivation D,

 V(D) = V(T) ⊗* Z(O)
###### Proof.

(Sketch) We proceed by induction on the parse tree. The base case is where T = D and O is empty. Then V(T) = V(D) and Z(O) is the multiplicative identity, so V(T) ⊗* Z(O) = V(D) by the definition of the identity, which proves the statement.

Otherwise T has a parent tree T_p where T = T_k for some k. Furthermore, O has a parent outer tree O_p ∈ outer(T_p), and by the induction hypothesis V(D) = V(T_p) ⊗* Z(O_p).

Since T_p = ⟨T1 : T2, …, Tm⟩, we know that

 V(T_p) = V(T1) ⊗ [V(T2), …, V(Tm)]

 V(D) = (V(T1) ⊗ [V(T2), …, V(Tm)]) ⊗* Z(O_p)

The proof progresses by calculating the value of Z(T) based on the above term and shows that V(T) ⊗* Z(T) is equal to the value of D. The full proof can be found in Appendix A. ∎

In the general case, Goodman (1999) defines the reverse value Z(x) of an item x as the sum of all its outer trees.

 Z(x)=⨁T∈outer(x)Z(T)

We will see that for a well defined weight function, any item x will be assigned a reverse value Z(x) whose dimensions involve dS, the dimension assigned to the start symbol S, together with the dimensions for x.

###### Lemma 6.3.

Let C(D,x) represent the number of times x occurs in a derivation D. Then,

 V(x)⊗∗Z(x)=⨁D∈D(σ)V(D)C(D,x)
###### Proof.
 V(x)⊗∗Z(x)=⨁T∈inner(x)V(T)⊗∗⨁O∈outer(x)Z(O) =⨁T∈inner(x)⨁O∈outer(x)V(T)⊗∗Z(O)

By Lemma 6.2, V(T)⊗∗Z(O)=V(D). For an item x, any T∈inner(x) and O∈outer(x) can be combined to form a successful derivation tree containing x, and thus the number of such (T,O) pairs that yield a given derivation D corresponds exactly to the number of times x occurs in D. Hence,

 V(x)⊗∗Z(x) =⨁T∈inner(x),O∈outer(x)V(T)⊗∗Z(O) =⨁D∈D(σ)V(D)C(D,x) ∎
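The distributivity step used above can be illustrated in the rank-0 (scalar) case, where ⊕ and ⊗∗ are ordinary addition and multiplication (toy weights of our own choosing):

```python
# V(T) for T in inner(x) and Z(O) for O in outer(x), as toy scalars.
inner_vals = [0.5, 0.25]
outer_vals = [0.1, 0.2, 0.3]

# (sum of inner values) * (sum of outer values) ...
lhs = sum(inner_vals) * sum(outer_vals)
# ... equals the sum over all (T, O) pairs, one per occurrence of x
# in a derivation.
rhs = sum(v * z for v in inner_vals for z in outer_vals)
assert abs(lhs - rhs) < 1e-12
```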

Now we are ready to state how to calculate the outside value of an item. Following Goodman (1999), we extend the notation for the set of outer trees and introduce outer(k, a1⋯an/b) to mean the subset of the outer trees in outer(x) where x has parent b and siblings a1,…,ak−1,ak+1,…,an. In other words, this is the set of all outer trees where the rule from which x is removed has antecedents a1,…,an and consequent b.

###### Theorem 6.4.

If x is the goal item, then Z(x) is the identity tensor. Else,

 Z(x)=⨁k,a1,…,an,b s.t. a1⋯an/b and x=ak(V(a1)⊗k[Iak,V(ak+1),…,V(an)])π ⊗[V(a2),…,V(ak−1)]⊗∗Z(b)
###### Proof.

(Sketch) Either x is the goal item, in which case the statement holds by the definition of Z(x).

Otherwise the outer trees of x can be written as the union of the outer trees for each rule a1⋯an/b where x=ak for some k. Hence:

 Z(x)=⨁k,a1,…,an,b s.t. a1⋯an/b and x=ak⨁D∈outer(k,a1⋯an/b)Z(D)

Using the distributive property of the partial semiring, the inner part of the equation becomes:

 ⨁D∈outer(k,a1⋯an/b)Z(D)= (V(a1)⊗k[Iak,V(ak+1),…,V(an)])π ⊗[V(a2),…,V(ak−1)]⊗∗Z(b)

Replacing the inner part of the previous equation with this term gives the desired equality. ∎
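As an illustration, the recursion of Theorem 6.4 can be sketched for a single binary rule over the real semiring. The shapes, variable names, and the `einsum` realisation of the contractions below are our assumptions for the sketch, not the paper's exact construction:

```python
import numpy as np

# One binary rule b -> a1 a2 with latent dimensions d_b, d_a1, d_a2.
rng = np.random.default_rng(2)
d_b, d_a1, d_a2 = 3, 4, 5

W = rng.random((d_b, d_a1, d_a2))  # rule weight tensor
Z_b = rng.random(d_b)              # outside value of the parent item b
V_a2 = rng.random(d_a2)            # inside value of the sibling a2

# Outside value of a1: contract the rule weight with the parent's outside
# value and the sibling's inside value, leaving a1's dimension free.
Z_a1 = np.einsum('bac,b,c->a', W, Z_b, V_a2)
assert Z_a1.shape == (d_a1,)
```

This mirrors the theorem's shape: one child's slot is left open (the identity-tensor role), the siblings' inside values and the parent's outside value are contracted away.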

## 7 Conclusion

We have presented a general extension of the semiring parsing framework in which the weights of grammar rules are tensors of semiring values, with the aim of extending semiring parsing to latent-variable models. We hope that this work will enable the streamlined development of EM-based or spectral learning algorithms for latent refinements of a number of grammar formalisms.

## Acknowledgments

The authors thank the anonymous reviewers for feedback and comments on a draft of this paper, and acknowledge the support of NSF grant IIS-1813823.

## References

• Bailly et al. (2009) Raphaël Bailly, François Denis, and Liva Ralaivola. 2009. Grammatical inference as a principal component analysis problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 33–40.
• Boullier (2004) Pierre Boullier. 2004. Range concatenation grammars. In New Developments in Parsing Technology, pages 269–289. Springer.
• Cohen et al. (2008) Shay B Cohen, Robert J Simmons, and Noah A Smith. 2008. Dynamic programming algorithms as products of weighted logic programs. In International Conference on Logic Programming, pages 114–129.
• Cohen et al. (2013) Shay B Cohen, Karl Stratos, Michael Collins, Dean P Foster, and Lyle Ungar. 2013. Experiments with spectral learning of latent-variable PCFGs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 148–157, Atlanta, Georgia. Association for Computational Linguistics.
• Cohen et al. (2014) Shay B Cohen, Karl Stratos, Michael Collins, Dean P Foster, and Lyle Ungar. 2014. Spectral learning of latent-variable PCFGs: Algorithms and sample complexity. The Journal of Machine Learning Research, 15(1):2399–2449.
• Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
• Eisner (2002) Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 1–8, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
• Eisner et al. (2005) Jason Eisner, Eric Goldlust, and Noah A Smith. 2005. Compiling comp ling: Weighted dynamic programming and the Dyna language. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 281–290, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
• Gebhardt (2018) Kilian Gebhardt. 2018. Generic refinement of expressive grammar formalisms with an application to discontinuous constituent parsing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3049–3063, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
• Gimpel and Smith (2009) Kevin Gimpel and Noah A Smith. 2009. Cube summing, approximate inference with non-local features, and dynamic programming without semirings. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 318–326, Athens, Greece. Association for Computational Linguistics.
• Goodman (1999) Joshua Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–606.
• Goodman (1998) Joshua T Goodman. 1998. Parsing Inside-Out. Ph.D. thesis, Harvard University Cambridge, Massachusetts.
• Hsu et al. (2012) Daniel Hsu, Sham M Kakade, and Tong Zhang. 2012. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480.
• Kuich (1997) Werner Kuich. 1997. Semirings and formal power series: Their relevance to formal languages and automata. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages: Volume 1 Word, Language, Grammar, pages 609–677. Springer, Berlin, Heidelberg.
• Li and Eisner (2009) Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 40–51, Singapore. Association for Computational Linguistics.
• Lopez (2009) Adam Lopez. 2009. Translation as weighted deduction. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 532–540, Athens, Greece. Association for Computational Linguistics.
• Matsuzaki et al. (2005) Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 75–82.
• Nederhof (2003) Mark-Jan Nederhof. 2003. Weighted deductive parsing and Knuth's algorithm. Computational Linguistics, 29(1):135–143.
• Pereira and Warren (1983) Fernando C N Pereira and David H D Warren. 1983. Parsing as deduction. In Proceedings of the 21st Annual Meeting on Association for Computational Linguistics, pages 137–144, Cambridge, Massachusetts, USA. Association for Computational Linguistics.
• Shieber et al. (1995) Stuart M Shieber, Yves Schabes, and Fernando C N Pereira. 1995. Principles and implementation of deductive parsing. The Journal of Logic Programming, 24(1-2):3–36.
• Sikkel (1998) Klaas Sikkel. 1998. Parsing schemata and correctness of parsing algorithms. Theoretical Computer Science, 199(1-2):87–103.
• Steenstrup (1985) Martha Edmay Steenstrup. 1985. Sum-Ordered Partial Semirings. Ph.D. thesis, University of Massachusetts Amherst.

## Appendix A - Proofs of Theorems in Main Paper

###### Lemma 5.1.

For any k and l, the operation ⊗[k;l] distributes over ⊕.

###### Proof.

We will proceed by showing that:

 A⊗[k;l](B⊕C)=(A⊗[k;l]B)⊕(A⊗[k;l]C)

First, note that for the left-hand side of the equation to be defined, B and C need to be of matching ranks, and B⊕C will have the same rank as both B and C. Therefore, if the left-hand side is well defined, then both A⊗[k;l]B and A⊗[k;l]C are defined and have matching ranks. So the right-hand side is defined if and only if the left-hand side is.

 [A⊗[k;l](B⊕C)]i1,…,ik−1,j1,…,jl−1,jl+1,…,jm,ik+1,…,in =∑ik,jlδ(ik,jl)Ai1,…,in×(B⊕C)j1,…,jm =∑ik,jlδ(ik,jl)Ai1,…,in×(Bj1,…,jm+Cj1,…,jm) =∑ik,jlδ(ik,jl)(Ai1,…,in×Bj1,…,jm)+δ(ik,jl)(Ai1,…,in×Cj1,…,jm) =[(A⊗[k;l]B)⊕(A⊗[k;l]C)]i1,…,ik−1,j1,…,jl−1,jl+1,…,jm,ik+1,…,in ∎
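The identity just proved can also be checked numerically over the real semiring. In the sketch below, `contract` is our stand-in for ⊗[k;l]; note that NumPy's `tensordot` orders the remaining axes A-first, which differs from the paper's index ordering but does not affect the equality:

```python
import numpy as np

def contract(A, B, k, l):
    # Stand-in for A (x)_[k;l] B: sum A's k-th axis against B's l-th axis.
    return np.tensordot(A, B, axes=([k], [l]))

rng = np.random.default_rng(3)
A = rng.random((2, 3, 4))  # contract A's axis k=1 (size 3) ...
B = rng.random((5, 3, 6))  # ... against B's / C's axis l=1 (size 3)
C = rng.random((5, 3, 6))

lhs = contract(A, B + C, 1, 1)                     # A (x) (B (+) C)
rhs = contract(A, B, 1, 1) + contract(A, C, 1, 1)  # (A (x) B) (+) (A (x) C)
assert np.allclose(lhs, rhs)
```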

###### Theorem 5.4.

An item-based description is correct if

• For every grammar , the mapping