ÆTHEL: Automatically Extracted Type-Logical Derivations for Dutch

12/29/2019 · Konstantinos Kogkalidis, et al. · Utrecht University

We present ÆTHEL, a semantic compositionality dataset for written Dutch. ÆTHEL consists of two parts. First, it contains a lexicon of supertags for about 900,000 words in context. The supertags correspond to types of the simply typed linear lambda-calculus, enhanced with dependency decorations that capture grammatical roles supplementary to function-argument structures. On the basis of these types, ÆTHEL further provides 72,623 validated derivations, presented in four equivalent formats: natural-deduction and sequent-style proofs, linear logic proofnets and the associated programs (lambda terms) for meaning composition. ÆTHEL's types and derivations are obtained by means of an extraction algorithm applied to the syntactic analyses of LASSY-Small, the gold standard corpus of written Dutch. We discuss the extraction algorithm and show how `virtual elements' in the original LASSY annotation of unbounded dependencies and coordination phenomena give rise to higher-order types. We suggest some example use cases highlighting the benefits of a type-driven approach at the syntax-semantics interface. The following resources are open-sourced with ÆTHEL: the lexical mappings between words and types, a subset of the dataset consisting of 8,569 semantic parses, and the Python code that implements the extraction algorithm.

1 Introduction

Typelogical categorial grammars offer an attractive view on the syntax-semantics interface. The key idea is that the derivation establishing the well-formedness of a phrase systematically corresponds to a procedure that computes the meaning of that phrase, given the meanings of its constituent words.

This connection between logical derivations and programs, known as the Curry-Howard correspondence [sorensen2006lectures], plays a central role in current typelogical grammars. In extended versions of the Lambek calculus [lambek1958mathematics] such as [morrillTLG, DBLP:journals/jolli/Moortgat96, DBLP:journals/jolli/MorrillVF11, KubotaLevineMIT], the Curry-Howard correspondence applies to a language of semantic types, where meaning composition is addressed. The semantic type language is obtained by a structure-preserving translation from the syntactic type system that handles surface word order, prosodic structure and the like. Alternatively, the type system of Abstract Categorial Grammar [de2001towards] and Lambda Grammars [Muskens2001-MUSGAT] is designed to capture a ‘tectogrammatical’ [curry1961some] view of language structure that abstracts away from the surface realization. From the abstract syntax, both the meaning representations and the surface forms are then obtained by compositional translations, where the latter relies on the standard encoding of strings, trees, etc. in the lambda calculus [barendregttypes].

A common feature of these approaches is the use of resource-sensitive type logics, with Linear Logic [girard1987linear] being the key representative. Curry’s original formulation of the logic-computation correspondence was for Intuitionistic Logic, the logic of choice for the formalization of mathematical reasoning. For many real-world application areas, Linear Logic offers a more attractive alternative. In Linear Logic, formulas by default stand for non-renewable resources that are used up in the course of a derivation; free copying or deletion is not available.

The resource-sensitive view of logic and computation fits the nature of grammatical composition particularly well. Despite this, linear logic is underrepresented when it comes to linguistic resources that exhibit its potential at the syntax-semantics interface. Motivated by this lack of resources, we introduce ÆTHEL (Automatically Extracted Theorems from LASSY), a dataset of typelogical derivations for Dutch. The type logic ÆTHEL employs rests on the non-directional types of simply typed intuitionistic linear logic, enhanced with modal operators that allow the inclusion of dependency information on top of simple function-argument relations. ÆTHEL consists of 72 623 sentences, every word of which is assigned a linear type that captures its functional role within the phrase-local structure. Under the parsing-as-deduction view, these types give rise to proofs, theorems of the logic showing how the sentence type is derived from the types of the words it consists of. These derivations are presented in four formats which, albeit morally equivalent, can be varyingly suitable for different tasks, viz. proofnets, natural deduction and sequent-style proofs, and the corresponding λ-terms. ÆTHEL is generated by an automated extraction algorithm applied to LASSY Small [vanNoord2013], the gold standard corpus of written Dutch. The validity of the extracted derivations is verified by LinearOne [Moot2017], a mechanized theorem prover for first-order linear logic.

The structure of the paper is as follows: we begin in Section 2 by providing an overview of the type logic used. We proceed by detailing the proof extraction algorithm in Section 3, before describing the dataset and its accompanying resources in Section 4. We give some concluding remarks in Section 5.

Related work

The work reported on here shares many of the goals of the CCGBank project [ccgbank].

A difference is that CCGBank uses syntactic types aiming at wide-coverage parsing, whereas our linear types are geared towards meaning composition. Also different is the treatment of long-range dependencies, where CCG establishes the connection with the ‘gap’ by means of type-raising and composition rules. From a logical point of view, these CCG derivations are suboptimal: they contain detours that disappear under proof normalization. In our approach, long-range gaps are hypotheses that are withdrawn in a single inference step of conditional reasoning.

2 Type Logic

2.1 ILL

The core of the type system used in this paper is ILL, the simply typed fragment of Intuitionistic Linear Logic. ILL types are defined by the following inductive scheme:

T, T′ ::= A | T ⊸ T′

where A is a set of atomic types and linear implication (⊸) is the only type-forming operation for building complex types. A type A ⊸ B is the type signature of a unary linear function that transforms an object of type A into another object of type B. Atomic types are assigned to functionally independent phrases; a complex type denotes a curried (multi-argument) function and is assigned to the functional head of a phrase, which consumes its arguments, producing a larger phrase in the process.
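For concreteness, the following is a minimal sketch of how such types could be represented as a datatype. It is illustrative Python, not the released ÆTHEL code; the atom names are merely examples drawn from the translation table in the Appendix.

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Atom:
    name: str                    # an atomic type, e.g. 'np', 'n', 'smain'

@dataclass(frozen=True)
class Arrow:
    argument: 'Type'             # A in A ⊸ B
    result: 'Type'               # B in A ⊸ B

Type = Union[Atom, Arrow]

def show(t: Type) -> str:
    """Render a type with explicit brackets."""
    if isinstance(t, Atom):
        return t.name
    return f'({show(t.argument)} ⊸ {show(t.result)})'

# a hypothetical intransitive verb type:
print(show(Arrow(Atom('np'), Atom('smain'))))   # (np ⊸ smain)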

Given Curry’s “proofs as programs” approach, one can view the inference rules that determine whether a conclusion B can be derived from a multiset of assumptions Γ as the typing rules of a functional programming language. The program syntax is given by the following grammar:

s, t ::= x | λx.t | (t s)

Typing judgements take the form of sequents

x1 : A1, …, xn : An ⊢ t : B

stating that a program t of type B can be constructed with parameters xi of type Ai. Such a program, in the linguistic application envisaged here, is a procedure to compute the meaning of a phrase of type B, given meanings for the constituting words as bindings for the parameters xi. The typing rules then comprise the Axiom x : A ⊢ x : A, together with inference rules for the elimination and introduction of the linear implication, where (⊸E) corresponds to function application, and (⊸I) to function abstraction.

(⊸E) from Γ ⊢ t : A ⊸ B and Δ ⊢ s : A, infer Γ, Δ ⊢ (t s) : B    (1)
(⊸I) from Γ, x : A ⊢ t : B, infer Γ ⊢ λx.t : A ⊸ B    (2)
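The rules can be mimicked in a few lines of Python. The sketch below is illustrative only: types are written as plain tuples, the lexical entries are invented, and the three functions mirror the axiom, application (⊸E), which joins the two premises' contexts, and abstraction (⊸I), which withdraws exactly one hypothesis.

def axiom(x, a):
    return ([(x, a)], x, a)                       # x : A ⊢ x : A

def elim(fun, arg):                               # (⊸E)
    (gamma, t, f_type), (delta, s, a_type) = fun, arg
    assert f_type[0] == '⊸' and f_type[1] == a_type, 'argument type mismatch'
    return (gamma + delta, ('app', t, s), f_type[2])

def intro(judgement, x, a):                       # (⊸I)
    gamma, t, b = judgement
    assert gamma.count((x, a)) == 1, 'the withdrawn hypothesis must occur exactly once'
    rest = [h for h in gamma if h != (x, a)]
    return (rest, ('lam', x, t), ('⊸', a, b))

# e.g. deriving smain from a hypothetical intransitive verb and its subject:
verb = axiom('slaapt', ('⊸', 'np', 'smain'))
subj = axiom('alice', 'np')
print(elim(verb, subj))
# ([('slaapt', ('⊸', 'np', 'smain')), ('alice', 'np')], ('app', 'slaapt', 'alice'), 'smain')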

2.2 Dependency enhancement

As it stands, our type system doesn’t have the expressivity to capture grammatical role information. For example, a ditransitive verb such as ‘give’ would be typed as np ⊸ np ⊸ np ⊸ smain (type brackets can be omitted, reading ⊸ as right-associative); we would like to be able to distinguish the subject, direct object and indirect object among the three NP arguments.

To address this limitation, we use the modalities of Multimodal Categorial Grammars [DBLP:journals/jolli/Moortgat96]. Modalities are unary type-forming operations, commonly used for syntactic control, licensing restricted forms of restructuring or reordering, or blocking overgenerating applications of such operations. Here instead, we use them as a means of injecting dependency information directly into the type system in the form of feature decorations. For earlier applications of modalities-as-features, see [heylen99, Johnson99].

We augment the type system with a set of unary operators ◇d, one for each dependency label d in the corpus. On the lexical level, we use ◇d A ⊸ B to denote a functor that consumes an argument A with dependency role d to produce a B. The extended type scheme is:

T, T′ ::= A | ◇d T | T ⊸ T′

The dependency operators come with their own introduction and elimination rules. We assume that the dependency annotations do not affect the Curry-Howard program terms associated with a proof, so we can formulate the inference rules purely on the type level.

(◇I) from Γ ⊢ A, infer ⟨Γ⟩d ⊢ ◇d A    (3)
(◇E) from Δ ⊢ ◇d A and Γ, ◇d A ⊢ B, infer Γ, ⟨Δ⟩d ⊢ B    (4)

The introduction rule (3) says that if from resources one can derive a phrase of syntactic type A, then one can assign this A the dependency role by grouping together the resources as a dependency domain with the delimiter . The elimination rule (4) allows one to substitute a multiset of resources that together constitute a -dependency domain for a single formula that is assigned the dependency role.

In practice, the dependency refinement serves two main purposes. From a semantic perspective, the added operators can be meaningful in the interpretation of the type system, allowing distinct lexical recipes for types which would otherwise be equated. From a parsing perspective, ILL by itself is word-order agnostic, meaning it admits more proofs than linguistically desired. The dependency decorations in this respect act as a balancing counter-weight, which on the one hand increases lexical type ambiguity, but on the other hand provides valuable information to constrain proof search. Finally, it is worth noting that the dependency decorations may be dropped, trivially transforming the dataset into pure ILL format.
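As an illustration of the last point, the sketch below (repeating the classes from the earlier sketch and adding a Diamond wrapper for the dependency decorations) shows how the decorations can be erased to fall back to plain ILL types; the determiner type used in the example is only illustrative.

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Atom:
    name: str

@dataclass(frozen=True)
class Arrow:
    argument: 'Type'
    result: 'Type'

@dataclass(frozen=True)
class Diamond:
    label: str                     # dependency label, e.g. 'su', 'obj1'
    body: 'Type'

Type = Union[Atom, Arrow, Diamond]

def strip(t: Type) -> Type:
    """Erase all dependency decorations, leaving a pure ILL type."""
    if isinstance(t, Atom):
        return t
    if isinstance(t, Diamond):
        return strip(t.body)
    return Arrow(strip(t.argument), strip(t.result))

# e.g. a determiner typed ◇invdet n ⊸ np collapses to n ⊸ np
det = Arrow(Diamond('invdet', Atom('n')), Atom('np'))
print(strip(det))   # Arrow(argument=Atom(name='n'), result=Atom(name='np'))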

3 Extraction

The formulation of the extraction process takes its inspiration from [DBLP:conf/lrec/MoortgatM02, moot2015type], modulo some adaptations to account for discrepancies between the type logic and the source corpus. Algorithmically, it may be perceived as a pipeline of three distinct components. The first concerns the transformation of the syntactically annotated input sentences into a directed acyclic graph (DAG) format that satisfies the input requirements of the remainder of the algorithm. The second is responsible for assigning types to the DAG’s nodes and asserting their validity within the phrase-local context. Finally, the third component accepts the type-decorated DAG and identifies the interactions between its constituent types, essentially transforming it into a typelogical derivation proper.

3.1 Lassy

Lassy [vanNoord2013] is a dataset of syntactically annotated written Dutch. It is subdivided into two parts, Lassy-Small and Lassy-Large, both of which have been automatically annotated by the Alpino parser [bouma2001alpino]. For the purposes of this work we aim our attention at the gold-standard Lassy-Small, which (unlike its sibling) has been manually verified and corrected. Lassy-Small enumerates about 65 000 sentences (or a million words), originating from various sources such as e-magazines, legal texts, manuals, Wikipedia content and newspaper articles, among others. The Lassy annotations are essentially DAGs, with lexical and syntactic category information labeling the nodes, and dependency roles labeling the edges (a concise description of the syntactic category tags and dependency labels of Lassy can be found at http://nederbooms.ccl.kuleuven.be/eng/tags). DAGs allow for reentrancy, where a node has multiple incoming dependency edges. In the Lassy annotation, cases of reentrancy are handled via phantom syntactic nodes coindexed with nodes that carry real content, so as to allow the annotation DAG to be visually represented as a tree.

3.2 Preprocessing

The extraction algorithm is formulated in terms of operations on DAGs. The first step mandated is therefore to collapse all coindexed duplicate nodes of each word or phrase into one, which inherits all the incoming edges of the original tree.
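A rough sketch of this step follows (plain Python over an edge list, not the released implementation; the node/edge encoding and the has_content test are illustrative assumptions):

def collapse_coindexed(edges, index_of, has_content):
    """edges: iterable of (parent, dependency_label, child) triples;
       index_of: node id -> coindexation index (None if not coindexed);
       has_content: node id -> True for the node carrying actual material."""
    # pick one representative per coindexation group, preferring the content node
    representative = {}
    for node, idx in index_of.items():
        if idx is not None and (has_content(node) or idx not in representative):
            representative[idx] = node

    def resolve(node):
        idx = index_of.get(node)
        return representative[idx] if idx is not None else node

    # redirect every edge onto the representative and drop exact duplicates
    return sorted({(resolve(p), lab, resolve(c)) for p, lab, c in edges})

# phantom node 3 is coindexed with content node 2:
edges = [(0, 'su', 2), (1, 'obj1', 3)]
print(collapse_coindexed(edges, {2: 7, 3: 7}, lambda n: n == 2))
# [(0, 'su', 2), (1, 'obj1', 2)]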

Next, we wish to treat annotation instances that are problematic for our type logic, stemming from schemes that under-specify the phrase structure. The two usual culprits are discourse-level annotations, which do not exhibit a consistent function-argument articulation, and multi-word units, the children of which are syntactically indiscernible. In the first case, we erase branches related to discourse structure (in many cases, these consist simply of structures annotated as “discourse unit” with substructures each marked as “discourse part”). To minimize the amount of annotations lost, we reconstruct an independent sample from each disconnected subgraph positioned under the cut-off point. To resolve multi-word phrases without resorting to ad-hoc typing schemes, we merge participating nodes, essentially treating the entire phrase as a single word. Punctuation symbols are dropped, as they are left untreated by the original annotation.

Beyond ensuring compatibility, we apply a number of transformations to Lassy’s annotations designed to homogenize the types extracted. First, we remove the Lassy phantom syntactic nodes used to express the ‘understood’ subject or object of non-finite verb forms, since this is semantic information that doesn’t belong in a proper syntactic annotation. Second, we replace the generic conj and mwu tags (for conjunctions and multi-word units, respectively), by recursively performing a majority vote over their daughters’ tags. Finally, we revise a number of dependency role labels. For noun-phrases, we swap the head status from nouns to determiners, thus allowing the latter to assume the role of the functor in the phrase construction – nouns previously under a hd dependency are now marked with invdet. We refine the generic body label by specifying the particular kind of clause it applies to, thus splitting it into r_body, wh_body and cmp_body for body of a relative, wh- and comparative clause respectively.

Edge erasures performed by the above procedures might lead to artefacts, which we take measures against. We resolve DAGs with multiple sources by creating a distinct graph for each source, consisting of (a replica of) the source-reachable subset of the original graph. We remove redundant intermediate nodes that have a single incoming and a single outgoing edge, redirecting the incoming edge to the corresponding descendant. Conjunctions left with no more than one conjunct are truncated and replaced by their sole daughter.

3.3 Type Assignment

The output of the preprocessing procedure is a number of rooted DAGs per Lassy sample. The goal now is to assign types to the nodes of these DAGs, according to the dependency-decorated scheme of Section 2. We begin this endeavour by specifying two look-up tables that the type assignment algorithm is parametric to; one from part-of-speech tags and phrasal categories into atomic types, and one from dependency labels into diamond operators (refer to the translation tables in the Appendix for details). Further, we specify the edge labels that correspond to phrasal heads and the ones that correspond to modifiers. We then formulate the type assignment process as a conditional iteration over a DAG, with each iteration progressively inferring and assigning the types of a subset of the DAG’s nodes, and the termination condition being the absence of any untyped nodes.
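By way of illustration, excerpts of what these look-up tables might look like in Python, with entries drawn from Tables 1 and 2 in the Appendix; the head and modifier label sets shown are illustrative subsets, not the full configuration.

# POS / phrasal category tags to atomic types (excerpt of Table 1)
POS_TO_ATOM = {
    'n': 'n', 'ww': 'ww', 'adj': 'adj', 'lid': 'lid', 'vnw': 'vnw',
    'np': 'np', 'pp': 'pp', 'smain': 'smain', 'ssub': 'ssub', 'sv1': 'sv1',
}

# dependency labels to diamond decorations (excerpt of Table 2)
DEP_TO_DIAMOND = {
    'su': 'su', 'obj1': 'obj1', 'obj2': 'obj2', 'predc': 'predc',
    'vc': 'vc', 'mod': 'mod', 'app': 'app', 'cnj': 'cnj',
}

# edge labels treated as phrasal heads and as modifiers (illustrative subsets)
HEAD_LABELS = {'hd', 'rhd', 'whd', 'cmp'}
MOD_LABELS = {'mod'}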

In abstract terms, the algorithm looks as follows:

TypeDag :: dag → dag
TypeDag g = return last (unfold Step g)
Step :: dag → option dag
Step g = g′  = last (unfold TypeContextless g)
         g″  = last (unfold TypeHeads g′)
         g‴ = last (unfold TypeNonLocal g″)
         return g‴ if g‴ ≠ g else Nothing

The iteration loop consists of three steps, each being the unfold of a function that takes a DAG, selects its fringe of typeable nodes, assigns a type to each node within, and returns the new (partially) typed DAG. Each of these three functions differs in how it selects its fringe and how it manufactures type assignments. We detail their functionality in the next paragraphs; an illustrative example is also given in Figure 1(a).

Context-Independent Nodes

The first step internalizes the bottom-up typing of nodes that are contextually independent of their topological predecessors. In this case, the membership condition of a node in the fringe is specified by the conjunction of four criteria: the node is not typed, all of its daughter nodes are typed (possibly excluding daughters whose edge to the parent node is labeled as either head or modifier), the node is not the target of a modifier or head dependency itself, and it is not a conjunction.

The type assignment is accomplished by a simple translation of the node’s part-of-speech or phrasal category tag into its corresponding atomic type.
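A sketch of the corresponding fringe test is given below; the dag interface with tag/incoming/outgoing accessors is a hypothetical convenience, not the released API.

def in_contextless_fringe(node, dag, typed, head_labels, mod_labels):
    """True iff `node` can be typed bottom-up without looking at its context."""
    skip = head_labels | mod_labels
    if node in typed or dag.tag(node) == 'conj':
        return False                                   # already typed, or a conjunction
    if any(label in skip for label, _ in dag.incoming(node)):
        return False                                   # node is itself a head/modifier target
    return all(child in typed                          # all relevant daughters already typed
               for label, child in dag.outgoing(node)
               if label not in skip)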

Heads and Modifiers

The next step is to assign a type to untyped nodes acting as phrasal heads and modifiers. These may only be typed insofar as both their parents and all of their siblings are typed, imposing the analogous fringe conditions. The dependence on topological predecessors necessitates a switch to a top-down iteration.

Phrasal heads are given the complex type

◇d1 A1 ⊸ … ⊸ ◇dn An ⊸ R

with ◇d1 A1, …, ◇dn An the dependency-decorated sibling types for the arguments and R the phrasal type for the result. In order to obtain a consistent currying of multi-argument function types, we appeal to an obliqueness ordering of the dependency labels (see e.g. [dowty]), and have the curried n-ary function type consume its arguments from most oblique to least oblique. We refer to the obliqueness hierarchy in the Appendix for details of the order assumed here. Similarly, modifiers are typed based on the polymorphic scheme of a dependency-decorated endomorphism X ⊸ X on the phrasal type X they are modifying, with the decoration fixed to determine the dependency they realize. Top-down iteration assures that modifiers of modifiers (e.g. adverbs) are correctly assigned their higher-order type.
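A sketch of the head-typing step under these conventions follows; types are written as strings, the precedence values are an excerpt of the table in the Appendix, and the ordering is read as described there (lower value, more oblique, consumed first).

# obliqueness precedences (excerpt of Table 2 in the Appendix)
OBLIQUENESS = {'cnj': 0, 'obcomp': 1, 'invdet': 2, 'sup': 3, 'su': 4,
               'pobj1': 5, 'obj1': 6, 'predc': 7, 'obj2': 8, 'se': 9,
               'pc': 10, 'hdf': 11, 'ld': 12, 'me': 13, 'vc': 14}

def head_type(arguments, result):
    """arguments: list of (dependency_label, argument_type) pairs; result: R."""
    ordered = sorted(arguments, key=lambda pair: OBLIQUENESS[pair[0]])
    functor = result
    for label, arg in reversed(ordered):   # fold so the most oblique argument ends up outermost
        functor = f'(◇{label} {arg} ⊸ {functor})'
    return functor

# a hypothetical two-argument head producing a verb-initial clause:
print(head_type([('su', 'np'), ('obj1', 'np')], 'sv1'))
# (◇su np ⊸ (◇obj1 np ⊸ sv1))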

(a) Following the steps of the type-assignment algorithm, we begin by typing leaf nodes 24 and 31 as nouns (N). Node 18 is typed as an NP (being a conjunction of noun-phrases), and propagates its type to conjuncts 19 and 26, as well as to the leaf 21 (being a head with no arguments). This then allows assigning nodes 20, 22, 28 and 29 the NP-modifier type, in turn allowing the typing of phrasal heads 23 and 30 as functors from nouns to noun-phrase modifiers. Finally, the coordinator 25 is typed as the polymorphic conjunction of two conjuncts of type X, where X is the conjuncts' type NP from which the shared hypothesis type is withdrawn, yielding a higher-order coordinator type.
(b) Continuation of the previous example, with types showing atom polarities and indices. Prior to axiom linking, we assign each atomic formula a unique index, based on its position within the sentence. Afterwards, the copied node 21 is detached from its phrase-local position and re-attached at the conjunction level (to be consumed by the coordinator). The positive types that would normally be supplied by the coordinator's conjunct arguments (see the previous subfigure) are now attached to their respective incomplete phrases in the form of placeholder nodes -1, -2. Axiom linking takes place bottom-up, matching functors to their arguments and propagating indices upwards; for instance, the N argument of node 23 is matched with the N of node 24, and the remainder propagates to node 22.
Figure 1: Typed DAG for the phrase “eerst koningin van Portugal en later van Frankrijk” (first queen of Portugal and later of France), part of Lassy sample wiki-138.p.35.s.1. The phrase is a symmetric conjunction of noun-phrases, with the conjuncts sharing their head “koningin”. The top figure illustrates the typing process, the bottom figure the axiom linking.

Phenomena of coordination and ellipsis represent a challenge for our resource-sensitive type logic: deriving conjunction types for incomplete phrases from the type for the conjunction of the corresponding complete phrases requires copying in the logic, an operation that our linear type system rules out. Our approach here is to replace syntactic copying in the logical derivations by lexical polymorphism. More precisely, coordinators are typed according to the generically polymorphic scheme

◇cnj X ⊸ … ⊸ ◇cnj X ⊸ X

where the number of ◇cnj X arguments agrees with the number of conjuncts. To determine the value of X, we take the following steps. We begin by setting X0 to the atomic type assigned to each of the conjuncts (we ensure a single shared conjunct type by standardizing the conjuncts' types using a majority-biased promotion). We then select copied nodes (that is, nodes with multiple incoming edges of the same label) which lie under the inspected conjunction node and not below any other common ancestor, and set X to

◇d1 A1 ⊸ … ⊸ ◇dk Ak ⊸ X0

where ◇d1 A1, …, ◇dk Ak is the obliqueness-sorted list of dependency-decorated types of the copied nodes. This scheme provides a uniform treatment of arbitrarily nested elliptical conjunctions and allows their logical derivation by means of higher-order types and hypothetical reasoning, without appealing to copying.

If more than one heading node is present in a single branch (e.g. in the context of possibly discontinuous coordinator or determiner pairs), we fall back on assigning the non-primary head(s) an extra-logical null type denoted by an underscore.

Non-Local Dependencies

Lassy treats non-local phenomena such as relative clauses and constituent questions by inserting a secondary edge pointing from a phrasal node embedded (possibly deeply) within the embedded clause to the relativizing or interrogative pronoun. Such pronouns then serve two roles. In their primary role, they act as head of the relative or interrogative clause they project. They are distinguishable in this respect from simply copied nodes, as the labels of their incoming dependency edges are distinct. In their secondary role, they contribute to the composition of the dependency domain of the embedded phrase where their secondary edge has its origin. Of these two roles, only the second one has been acknowledged by the algorithm described thus far (the first step of the iteration purposefully omits typing arguments that are also the targets of dependencies that are head candidates).

We select, therefore, our fringe as the nodes falling under this construction which have already been assigned some implicational type X ⊸ Y depicting the top-level clause functor. We then inspect the secondary dependency edge originating from X or a subphrase of X in order to update the earlier X ⊸ Y to the higher-order (◇δ E ⊸ X) ⊸ Y, with ◇δ E the decorated type of the hypothesis. The updated higher-order type has a nested implication: from the parsing-as-deduction perspective this means that in order to obtain a result of type Y, the relative or question pronoun has to assemble its argument X with the aid of a hypothesis ◇δ E. Hypothetical reasoning (the Introduction rule) is a key feature of our typelogical toolkit; it obviates the need for phantom ‘gap’ categories in unbounded dependency constructions.

The identity of δ and E depends on the label of the secondary dependency edge (the interpretation of the δ marking in long-range dependencies is in fact slightly different from its use in local dependency configurations: because the hypothesis may be used as an immediate constituent of the X dependency domain itself, or of the dependency domain of a subphrase of X, we combine the dependency feature with a structural control modality that allows the hypothesis to traverse intermediate dependency domains if necessary). In case of an argument edge, δ and E are simply the translations of the dependency label and of the node's part-of-speech or phrasal syntactic tag, respectively. In case of a modifying edge, we use a dummy dependency label emb for δ and set E to the modifying endomorphism on the type of the edge's source node. For any non-terminal node whose type has been altered, we iteratively update the types of all heads and modifiers lying underneath it in the DAG so as to account for the new phrasal type.

3.4 Axiom Linking

The algorithm specified above assigns a type to each DAG node; the multiset of types given to terminal nodes should admit the derivation of the root’s type (i.e. the type of the sample phrase as a whole should be derivable from the types of its constituent words); it does not, however, specify the derivation itself. To that end we design an additional algorithmic component, which accepts a type-annotated DAG and produces the proof it prescribes.

The first choice to make is how to encode proofs; standard choices would be Gentzen-style proofs (either in natural deduction or sequent format) or proofnets [girard1987linear]. The proofnet presentation is more appealing, as it combines the pleasant property of natural deduction (one-to-one correspondence with the program terms for meaning composition) with the good computational properties of sequent calculus (decidable proof search).

Providing the full theory behind proofnets goes beyond the scope of this paper; what follows is a simplified summary. We begin by assigning types a polarity. In the context of a logical judgement, types appearing in antecedent position (i.e. assumptions left of the turnstile) are negative, whereas types appearing in succedent position (i.e. conclusions right of the turnstile) are positive. Polarities are then propagated to subformulas as follows. If a type is atomic, its polarity remains unaltered. If it is an implication A ⊸ B, then the polarity of B persists, whereas the polarity of A is reversed.

Atoms nested within complex types are assigned a polarity by recursive application of the above scheme. At its essence, a proofnet is a bijection between positive and negative atoms, i.e. a pairing of each positive atom with an (otherwise equal) negative one (this bijection must satisfy certain correctness criteria, i.e. not all such bijections constitute valid proofnets).
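A compact sketch of this polarity computation (types as nested tuples, not the prover's own representation):

def polarities(t, positive):
    if isinstance(t, str):                       # atomic type: keep the current polarity
        return [(t, '+' if positive else '-')]
    _, a, b = t                                  # implication A ⊸ B: flip A, keep B
    return polarities(a, not positive) + polarities(b, positive)

# an antecedent (lexical) type starts out negative, the succedent (goal) type positive:
print(polarities(('⊸', 'np', 'smain'), positive=False))
# [('np', '+'), ('smain', '-')]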

Finding a proof is then equated with constructing the appropriate bijection. To ensure that such a bijection is indeed possible, we first perform a rudimentary correctness check, asserting the branch-wise invariance count of atoms and implications [van1991language]. We then project the types of the DAG’s terminal nodes into a flattened sequence, sort them based on the corresponding word order, and decorate each atom with an integer, thus distinguishing between unique occurrences of the same atom. Our goal then lies in propagating these indices upwards along the DAG, linking atom pairs as we go. The algorithmic procedure is outlined below; Figure 1(b) provides a visual reference.
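A simplified rendering of that sanity check is sketched below: it only verifies the global balance of positive and negative atom occurrences (a necessary but not sufficient condition), rather than the branch-wise count of the cited reference; types are the same nested tuples as before.

from collections import Counter

def polarities(t, positive):
    if isinstance(t, str):
        return [(t, positive)]
    _, a, b = t
    return polarities(a, not positive) + polarities(b, positive)

def count_check(antecedents, succedent):
    occurrences = Counter()
    for t in antecedents:
        occurrences.update(polarities(t, positive=False))
    occurrences.update(polarities(succedent, positive=True))
    return all(occurrences[(atom, True)] == occurrences[(atom, False)]
               for atom, _ in occurrences)

# 'alice' : np and 'slaapt' : np ⊸ smain can yield smain:
print(count_check(['np', ('⊸', 'np', 'smain')], 'smain'))   # True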

We first instantiate an empty proofnet in the form of a bijective function from indices to indices. We additionally implement a function, which takes a proofnet and two equal types of inverse polarity, and recursively matches their corresponding atoms, updating the proofnet in the process. For simple branchings, we isolate the arguments of the phrasal head’s functor and identify them with the type of a phrasal dependent on the basis of the diamond decoration of the former and the dependency label of the latter. The resulting pairs are matched and the proofnet is expanded. The type of the node dominating the branch is then indexed with the head functor’s result index(es). A bottom-up iteration then gradually indexes the DAG’s non-terminal nodes while filling in the proofnet.


Figure 2: Fraction of words covered as a function of types included. The Y axis depicts the percentage of words that can be analyzed (logit-transformed); the X axis depicts the percentage of unique types considered. A point (x, y) then represents the percentage y of words in the corpus that could be type-assigned if all but the x% most common types were discarded.


Figure 3: Lexical ambiguity histogram. The Y axis depicts the number of words (log-scaled); the X axis depicts the number of types (log2-scaled). A bar spanning a horizontal range with a given height indicates the total number of words associated with that many types.

For constructions involving hypothetical reasoning, such as elliptic conjunctions and non-local phenomena, the process is a bit more intricate. The higher-order types involved in these do double duty, both providing the functor that composes the outer phrase and supplying the material missing from the inner phrase(s). To resolve these, we simplify the higher-order types by subtracting their embedded arguments and traversing the DAG to find the branch that misses them. Once there, we detach edges that are secondary or point to copied nodes, and replace them by edges (of the same label) that point to placeholder nodes carrying the aforementioned subtracted arguments. This transformation essentially converts the typed DAG into a typed tree, reducing the problem once more to the simple case.

3.5 Verification

The extraction procedure described in the previous section produces type formulas together with an axiom linking, i.e. a bijection matching each atomic subformula occurrence to another of opposite polarity. Although the extraction algorithm should produce only proofs, formulas together with matchings need not correspond to proofs: only structures satisfying a so-called correctness condition are proofs (that is, proof nets). To ensure the extraction procedure works correctly, we therefore verify whether the structures obtained correspond to linear logic proofs. If the extracted structure is a proof, then there must be a corresponding sequent proof (a proof net typically corresponds to many sequent proofs, which differ only in inessential rule permutations). We use LinearOne (available at https://github.com/RichardMoot/LinearOne), a linear logic theorem prover, to verify that all extracted structures are indeed linear logic proofs. In addition to checking that all structures satisfy the correctness conditions, thereby providing a strong sanity check on the extraction algorithm, LinearOne also produces a sequent calculus proof, a natural deduction proof and the corresponding λ-term for each input proof net candidate.

3.6 Analysis

The end-to-end pipeline yields a number of samples, each corresponding to a mechanically verified typelogical derivation for a sentential or phrasal syntactic analysis. Even though everything on the output end is proven correct (i.e. all extracted proof nets satisfy the correctness criteria), the extraction algorithm fails to produce a derivation in a limited number of instances. These failures are recognized during runtime and their cause is pinpointed; we provide a breakdown of the algorithm’s coverage at each step of the process.


Figure 4: Sentence length histogram. The Y axis depicts the number of sentences (log-transformed); the X axis depicts the number of words. A bar spanning a horizontal range with a given height indicates the total number of sentences whose length falls within that range (with punctuation removed and multi-word phrases counted as a single lexical item).


Figure 5: Fraction of sentences covered as a function of types included. The Y axis depicts the percentage of sentences that can be analyzed (logit-transformed); the X axis depicts the percentage of unique types considered. A point (x, y) then represents the percentage y of sentences in the corpus, all of whose words could be type-assigned if all but the x% most common types were discarded.

Preprocessing

Lassy-Small contains 65 200 analyses. We drop 156 of these because they consist of nothing but a single punctuation mark. Out of the remaining 65 044, we obtain 75 059 independent DAGs (1.15 DAGs per Lassy sample on average).

Typing

The type assignment algorithm produces a correct output for 72 628 samples (96.76% input coverage). The majority of failed cases relates to conjunction schemata; in particular, 1 494 samples contain conjunction branches that lack a coordinating word and 443 samples contain asymmetric conjunctions. Both cases could be trivially solved, for instance by promoting the first conjunct to the branch head or expanding the polymorphic coordinator scheme to account for unequal constituents; the ad-hoc nature of such solutions would however degrade the quality and consistency of the lexical types, and we therefore abstain from implementing them. The remaining 494 cases (0.6% of the input) are generic failures arising from copying outside the context of a conjunction, annotation errors and discrepancies, and preprocessing artifacts.

Linking and Verifying

Out of the 72 628 typed DAGs, the axiom linking algorithm fails to produce a sane bijection in 5 (of which 4 are edge cases and 1 is an annotation error). All 72 623 outputs of the axiom linking algorithm are validated by the theorem prover, leading to an overall coverage of 96.75%.

4 Resources

4.1 Code

We make the code implementing the extraction algorithm publicly available (the code is written in Python 3.7 and closely follows the structure described earlier; it may be found at https://github.com/konstantinosKokos/Lassy-TLG-Extraction/tree/lrec2020). It is parametric and tunable with respect to the part-of-speech and phrasal category translation tables (both in terms of domain and codomain, allowing either a refinement or a consolidation of the atomic type set considered), but also the dependency labels (similarly allowing an adjustment of the number of diamond operators). The algorithm is, to a certain degree, agnostic to the underlying type system specification; in other words, it is easy to adapt to different grammars, and can be fine-tuned according to the user’s needs and purposes. Further, the algorithm is immediately applicable to Lassy-Large, which boasts a total of 700 million words, significantly outnumbering Lassy-Small. Its massive size, together with the extraction algorithm, grants easy access to a large amount of cross-domain silver-standard type-theoretic analyses, as well as a potential corpus for unsupervised language modeling enhanced with lexicalized structural biases. Last but not least, the extraction algorithm is compatible with the Alpino parser’s [bouma2001alpino] output format; combining the two would essentially amount to an “off the shelf” typelogical theorem prover for written Dutch that produces computational terms along with the standard parses.

4.2 Lexicon

Dissociating leaf nodes from their DAG, we obtain a weighted lexicon mapping words to type occurrences. Our lexicon enumerates a total of 81 730 distinct words and 5 771 unique types made out of 31 atomic types and 26 dependency decorations. On average, each word is associated with 1.73 types and each type with 23.93 words. Figure 3  presents a histogram of lexical type ambiguity; most words are unambiguous, being always assigned a single type, whereas 20 words (mostly coordinators and auxiliary verbs) are highly ambiguous, being associated with more than 128 unique types each. Figure 2  presents the relation between lexical coverage and types considered; evidently, the 1 154 most common types (20%) suffice to cover 99% of the corpus (on a per-word basis). The lexicon can be utilized as a stand-alone resource, useful for studying grammatical relations and syntactic variation at the lexical level.

4.3 Theorems

The core resource is a collection of 72 623 typelogical derivations (the public subset of the dataset consists of 8 569 sentences and is available online at https://github.com/konstantinosKokos/aethel-public).

The primitive component behind each derivation is a type-annotated sentence or phrase. On average, samples consist of 12.6 lexical items (a 25% drop compared to the unprocessed source corpus, due to multi-word merges and detached branches); Figure 4 presents a histogram of sample lengths. As already hinted by Figure 2, the fine-grained nature of the type system has the side-effect of enlarging the lexicon’s co-domain, and therefore the type sparsity, in comparison to other lexicalized grammar formalisms. Figure 5 presents the relation between corpus coverage and types considered; no less than the 4 976 most common types (86%) are required to parse 99% of the corpus. This suggests that, in themselves, the annotated sentences constitute a challenging supertagging task as well as a potential benchmark for open-world classification [cts].

Of higher importance, however, are the derivations themselves. As stated earlier, they are provided in four distinct formats: as natural deduction and sequent calculus proofs, as bijections between atomic formulas (proofnets), and as λ-terms. The type system, being based on ILL, is agnostic to word order and thus inherently ambiguous. Out of the (possibly many) potential proofs, the dataset specifies the one that is linguistically acceptable, determining the correct flow of information within the sentence’s constituents (an example of such a derivation is presented in Appendix Figure 6). This has multiple ramifications and use cases.

First and foremost, it opens a path towards integrated neuro-symbolic approaches to parsing. On one hand, lexical interactions are constrained to just those that are respectful of the typing information. On the other hand, neural approaches can be applied to narrow down the resulting search space, simultaneously utilizing semantic and syntactic information, with the dependency enrichment providing an additional heuristic. In this context, selecting a parsing methodology and the appropriate proof format is up to the end-user; for instance, shift-reduce parsing would be more easily accomplished on the binary-branching natural deduction structures [shieber1995principles, ambati2016shift], λ-terms would be more accommodating for generalized translation architectures like sequence transducers [zettlemoyer2012learning], and proofnet bijections could be obtained via neural permutation learning [mena2018learning]; supertagging and type representations can also be jointly optimized [lewis2016lstm, kasai2017tag].

The dataset can also emerge as a useful resource for “pure” parsing as deduction. The type grammar and its derivations can be utilized as a stepping stone towards stricter type systems. Noting that our abstract types are in alignment with Lambek types (modulo directionality), linear implications can be gradually replaced with directed divisions based on their aggregated corpus-wide behavior, easing the transition towards an (either hybrid or multi-modal) Type-Logical dataset.

5 Conclusion

We have described a linear type system that captures abstract syntactic function-argument relations, but is also able to distinguish between arguments on the basis of their dependency roles. We have presented a methodology for extracting these linear types as well as their interactions out of dependency parsed treebanks. Our approach is modular, allowing a large degree of parameterization, and general enough to accommodate alternative type systems and source corpora. We implemented and applied a concrete algorithmic instantiation, which we ran on the Lassy treebank, generating a large dataset of type-theoretic syntactic derivations for written Dutch. Utilizing a theorem prover, we verified the correctness of the algorithm’s output, and transformed it into a number of different realizations to facilitate its use in different contexts. Taking advantage of the Curry-Howard correspondence between linear logic and the simply-typed -calculus, we naturally equated our derivations to computational terms which characterize the flow of information within a sentence. Disconnecting types from their derivations, we are left with a pairing of sentences to type sequences; disconnecting them from their surrounding context, we obtain a highly refined lexicon mapping words to type occurrences. We make a significant portion of the above resources publicly available. Our hope is that our resources will find meaningful applications at the intersection of formal and data-driven methods, in turn giving rise to practically applicable insights on the syntax-semantics interface.

Acknowledgements

Konstantinos and Michael are supported by the Dutch Research Council (NWO) under the scope of the project “A composition calculus for vector-based semantic modelling with a localization for Dutch” (360-89-070).


Appendix

.1 Translation Tables

Atomic Types

Table 1 presents the set of atomic types and their origins (part-of-speech and phrasal category tags). The current translation domain utilizes the Lassy part-of-speech tagset (pt); other options could be either the Alpino tagset (pos) or even the detailed tagset (postag). Note also that in our use case there is a one-to-one correspondence between tags and types. This does not necessarily need to be the case; one could just as well choose to collapse one or more tags onto the same type (e.g. translate vnw to np, or all sentential tags to s). The extraction algorithm is parametric with respect to all of the above variations.

Tag Description Assigned Type
Short POS Tags

adj Adjective adj
bw Adverb bw
let Punctuation let
lid Article lid
n Noun n
spec Special Token spec
tsw Interjection tsw
tw Numeral tw
vg Conjunction vg
vnw Pronoun vnw
vz Preposition vz
ww Verb ww
Phrasal Category Tags
advp Adverbial Phrase adv
ahi Aan-Het Infinitive ahi
ap Adjectival Phrase ap
cp Complementizer Phrase cp
detp Determiner Phrase detp
inf Bare Infinitival Phrase inf
np Noun Phrase np
oti Om-Te Infinitive oti
pp Prepositional Phrase pp
ppart Past Participial Phrase ppart
ppres Present Participial Phrase ppres
rel Relative Clause rel
smain SVO Clause smain
ssub SOV Clause ssub
sv1 VSO Clause sv1
svan Van Clause svan
ti Te Infinitive ti
whq Main WH-Q whq
whrel Free Relative whrel
whsub Subordinate WH-Q whsub
Table 1: Part-of-speech tags and phrasal categories, and their corresponding type translations.

Dependency Decorations

Table 2 presents the set of dependency decorations (i.e. diamond operators) and their descriptions. Most decorations coincide with a Lassy dependency label or a derivative thereof. Some heading edges, however, do not materialize as diamonds; higher-order types embedding functors mark these with the dummy decoration embedded to signify the absence of a material dependency.

Decoration Description Precedence
app Apposition 18
cmp_body Complementizer body 1 7
cmp Complementizer -
cnj Conjunct 0
embedded Hypothetical Functor -
invdet Head of noun phrase 2
hd Phrasal Head -
hdf Final part of circumposition 11
ld Locative Complement 12
me Measure Complement 13
mod Modifier 20
obcomp Comparison Complement 1
obj1 Direct Object 6
obj2 Secondary Object 8
pc Prepositional Complement 10
pobj1 Preliminary Direct Object 5
predc Predicative Complement 7
predm Predicative Modifier 19
rhd Relative clause head -
rhd-body Relative clause body 17
se Obligatory Reflexive Object 9
su Subject 4
sup Preliminary Subject 3
svp Separable Verbal Participle 15
vc Verbal Complement 14
whd-body WH-question body 16
Table 2: Dependency relations and their corresponding implication labels.

.2 Obliqueness Hierarchy

Phrasal heads are assigned functor types; in the multi-argument case, these would be of the form

(◇d1 A1, …, ◇dn An) ⊸ R

or their curried equivalent

◇d1 A1 ⊸ … ⊸ ◇dn An ⊸ R
To avoid the inconvenience of an (in this case superficial) distinction between different argument type permutations, we impose a strict total order on argument sequences, loosely based on the obliqueness hierarchy of their syntactic roles (apparent through their diamond decorations). The ordering is presented in the third column of Table 2; the lower a label’s number, the less prominent (more oblique) the argument it marks, causing it to be consumed first. This scheme, recursively applied, provides a unique implicational type for each functor’s argument-permutation class. Functors carrying cnj-decorated arguments (i.e. coordinators and their derivatives) are the only kind of functor that permits two distinct arguments with the same decoration; we sort those based on the conjuncts’ order within the sentence.

Figure 6: ILL proof for the running example of Figure 1. Judgement premises and dependency decorations have been dropped for brevity. Axiom leaves are marked as lexical type assignments. The proof corresponds to the λ-term en λx.(van Portugal (eerst x)) λz.(van Frankrijk (later z)) koningin, which encodes the phrase’s computational semantics. Intuitively, the conjunction acts as a derived modifier to koningin, with the coordinator providing access to hypothetical reasoning about two noun-phrases (marked as hypotheses at the proof level and as the bound variables x and z at the term level).