
Natural Language Specifications in Proof Assistants

Interactive proof assistants are computer programs carefully constructed to check a human-designed proof of a mathematical claim with high confidence in the implementation. However, this only validates truth of a formal claim, which may have been mistranslated from a claim made in natural language. This is especially problematic when using proof assistants to formally verify the correctness of software with respect to a natural language specification. The translation from informal to formal remains a challenging, time-consuming process that is difficult to audit for correctness. This paper argues that it is possible to build support for natural language specifications within existing proof assistants, in a way that complements the principles used to establish trust and auditability in proof assistants themselves.



1 Introduction

Proof assistants can establish very high confidence in the correctness of formal proofs, both because of their rigorous checking and because of their attention to producing independently auditable evidence that the argument is correct [73, 76]. But one of the unavoidable points of trust for even a carefully-implemented proof assistant is the specifications themselves: proving the wrong theorem is of limited use. And only those who can read both formal and informal specifications can even consider whether this has occurred. This is particularly crucial for software verification: software specifications typically originate in natural language, and any accompanying formal specification comes afterwards, as increasingly occurs for compilers [54], operating systems [45], and other high-value software. Currently, the only bridge between the formal and informal specifications is the humans who perform the translation. There is no independently checkable record of this translation aside from the possibility of comments or notes by the translators, themselves largely in informal (though likely careful) natural language. And simply being familiar with both the specification language and the intended specification is insufficient by itself to bridge this gap [26, 25]: relating the two is a separate skill that is independently challenging to develop.

Ideally, it would be possible to give natural language specifications directly to the proof assistant, for example:

Goal "addone is monotone".

Robust support for such specifications could enable significant improvements in requirements tracing (machine checked mappings from natural language to formal results), including for artifact evaluation; education, where it could help students check their understanding of how either mathematics or program specifications are formalized logically; and even communication with non-technical clients who might wish to have some confidence that a formalization they do not themselves fully understand is correct. On the last point, Wing’s classic paper introducing formal methods [95] posits that customers may read the formal specifications produced from informal requirements, but this is only possible if the client can make sense of the formal specification itself. Machine-checked relationships between natural-language and formalized expressions of software properties can help bridge this knowledge gap by connecting formal properties to natural language a reader with less background in a specific formal logic could understand.

By the end of this paper, we will have developed a way to accept the very similar

Goal spec "addone is monotone".

(where spec returns the logical form of the natural language utterance in quotes). We envision such a system can eventually be used for the purposes above, to generate formal claims about mathematics or programs verified in a proof assistant, whether specified in a proof assistant’s own logic, or indirectly through a foundational program logic [4] built inside a proof assistant. While these generated specifications may not match the expert human formalizer’s choices (often influenced by setting up definitions in a particular way to simplify proofs), they should be implied by the specification used to conduct the primary verification activities, offering an additional, optional level of assurance.

This is not a job for machine-learning-centric natural language processing, which is incompatible with the goals of using a proof assistant for formal verification, and in particular inappropriate for foundational verification. There is no guarantee a learned translation is sensible, and if a translation from natural to formal language ends up being surprising, few machine learning approaches produce an auditable trail of evidence for why that translation is believed correct (in the eyes of the trained model), and there is no way to precisely fix misunderstandings of specific words. Meanwhile, proof certificates play a central role in the design of trustworthy proof assistants [76, 31] and foundational program verification [4]. Moreover, as proof assistants are often used to formalize properties of new mathematics or new programs, often using new terminology, there will often be a lack of training data for mapping natural language to a formal property. Later, we point out additional ways that the needs of trusted formalization of natural language specifications run afoul of many of machine learning’s known limitations, while requiring few of its advantages. (We also point out limited ways machine learning can play a role in optimizing the techniques we employ.)

Fortunately the field of linguistics predates machine learning. Formalizing categorial grammar [2, 7, 52, 87] carefully in a proof assistant offers a path to a natural, auditable way to bridge the gap between formal and natural-language specification. In this paper we show a prototype demonstrating that it is possible to parse a string containing a natural language specification into a semantic representation that can be used directly in proofs within a proof assistant (i.e., a proposition in Coq’s logic) in a principled way using Coq’s typeclasses [86], and argue that this approach is modular and can extend to sophisticated natural language. A partial adaptation to the Lean theorem prover suggests that with Lean’s algorithmic improvements [85] this can be made efficient enough for interactive use. We also analyze how the trusted computing base is affected when considering trust of a formal verification up to the natural language specification.

This paper establishes that the theoretical core is readily within reach. But the robust realization of these ideas posited above will require significant efforts to develop a broad core vocabulary to act as a starting point for extensions (we outline how such extensions would work) and eventually a robust library of domain-specific language support. It will require time and collective effort to collect precise natural language descriptions of partial specifications to guide development and evaluation of such a system. And it raises potential opportunities for fruitful collaboration with linguists on building systems that simultaneously benefit consumers of proof assistants while offering a live proving ground for work on semantic parsing.

2 Background & Motivation

This section provides a condensed (and therefore somewhat biased) background in natural language processing and categorial grammar, from a programming languages point of view.

Categorial grammar is a body of work concerned with the use of techniques and ideas from logic to relate the syntax of natural language with the meaning of language in a way independent of syntax. The core idea is to build a sort of type theory where base types correspond to grammatical categories (hence categorial), from which more complex grammatical categories can be defined. A set of inference rules is then used to define, simultaneously, how grammatical categories combine into larger sentence fragments and how those smaller fragments’ meanings (logical forms) are combined into larger meanings. This process bottoms out at a lexicon, giving for each word its grammatical roles (types) and associated denotations. Thus categorial grammar is a system for simultaneously parsing natural language from strings and assigning denotational semantics, a process traditionally referred to as semantic parsing in the computational linguistics literature.

Most prominent in the linguistics community are combinatory categorial grammars (CCGs) [87], though also relevant to our goals are the categorial type logics (CTLs, occasionally also called type-logical grammars or TLGs). Work on CCGs emphasizes appropriate constructs for linguistic ends, while CTLs hew close to Lambek’s view [52] of categorial grammars as substructural logics for linguistics. While these reflect very different philosophical and practical aims, for our present purposes the distinction is immaterial: it is widely held, and in some cases formalized [47, 30], that rules used in CCGs (including the variant with the most sophisticated linguistic treatments [6]) correspond to theorems in particular CTLs [63]. In this work we use only principles common to both CCGs and CTLs.

All categorial grammars parse by combining sentence fragments based on their grammatical types. These types include both atomic primitives (such as noun phrases) as well as more complex types, namely so-called slash-types that indicate a predicate argument structure (which are used to model, for example, most classes of verbs). Oversimplifying slightly, categorial grammars treat parsing as logical deduction in a residuated non-commutative linear logic. (Technically only CTLs [67] take this as an epistemological commitment, while CCGs [87] are agnostic, inheriting such a relation via Baldridge and Kruijff’s work [6].) This is essentially a family of linear logics without the structural rule for freely commuting the order of assumptions, thus modeling sensitivity to word order, and picking up as a consequence two forms of implication corresponding to whether an implication expects its argument to the left or to the right. (It is the presence of the ability to commute assumptions arbitrarily that allows a single implication to suffice in standard logics.) The model for the logic is a sequence of words, and types correspond to the grammatical role of a sentence fragment.

A/B (A over B) is the grammatical type for a fragment that, when given a B to its right, forms an A. A\B (A under B) is the grammatical type for a fragment that, when given an A to its left, forms a B. In both cases, the argument is “under” the slash, and the result is “above” it. (We follow CTL notation rather than CCG notation, which always puts results to the left, as users of proof assistants tend to be familiar with a range of logics, so the CCG syntax would likely confuse users already familiar with the Lambek calculus and related systems. This notational choice is orthogonal to the choice of which rules to employ.) These are called slash types. The grammars include rules to combine adjacent parts of a sentence. The elimination rules are the first two in Figure 1.

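[\-Elim]
  Γ ⊢ A ⇒ a     Δ ⊢ A\B ⇒ f
  ──────────────────────────
  Γ,Δ ⊢ B ⇒ (f a)

[/-Elim]
  Γ ⊢ A/B ⇒ f     Δ ⊢ B ⇒ a
  ──────────────────────────
  Γ,Δ ⊢ A ⇒ (f a)

[/-Comp]
  Γ ⊢ A/B ⇒ e     Δ ⊢ B/C ⇒ f
  ────────────────────────────
  Γ,Δ ⊢ A/C ⇒ (e∘f)

[\-Comp]
  Γ ⊢ A\B ⇒ e     Δ ⊢ B\C ⇒ f
  ────────────────────────────
  Γ,Δ ⊢ A\C ⇒ (f∘e)

[Reassoc]
  (Γ,Δ),Υ ⊢ A ⇒ e
  ────────────────
  Γ,(Δ,Υ) ⊢ A ⇒ e

[Shift]
  Γ ⊢ A\(B/C) ⇒ f
  ──────────────────────────────
  Γ ⊢ (A\B)/C ⇒ (λr,l. f l r)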

Figure 1: A selection of rules used in this paper, all derivable in CTLs and CCGs.

The judgment Γ ⊢ A ⇒ e is read as claiming the sequence of words Γ can be combined to form a sentence fragment of grammatical type A, whose underlying semantic form (its logical form) is given by e. Here e is a term drawn from the logical language being used to represent sentence meaning, typically a simply-typed lambda calculus in keeping with Montague [60, 59, 61, 71], though in our work we follow the alternative approach [11, 89, 17, 79] of targeting a dependently-typed calculus.

A lexicon gives the grammatical role and semantics for individual words, providing the starting point for combining fragments. Categorial grammars push all knowledge specific to a particular human language into the lexicon, in categorizing how individual words are used. This allows the core principles to be reused across languages, which has been put to use in building wide-coverage lexicons for a variety of natural languages [34, 32, 3, 1, 65].

Together, the rules and the lexicon allow filling in choices for the metavariables in the rules above, which permits derivations of sentences such as “four is even”.

Each grammatical type corresponds to a particular type in the underlying lambda calculus, and the underlying semantic type is determined by a systematic translation from the syntactic grammatical type. Borrowing more notation from logic (where this idea is known as a Tarski-style universe [56]), we write ⟦A⟧ for A’s semantic type. Both slash types correspond to function types in the lambda calculus: ⟦A/B⟧ = ⟦B⟧ → ⟦A⟧ and ⟦A\B⟧ = ⟦A⟧ → ⟦B⟧. An invariant of the judgment Γ ⊢ A ⇒ e is that in the underlying logic, e always has type ⟦A⟧. This invariant explains why it is correct for “four” to have semantics 4 while “is even” has a function as its semantics. In proof assistants based on type theory, like Coq, the set of grammatical types can be given as a datatype declaration, and the interpretation function as a function from grammatical types to proof assistant types.

Figure 1 includes a selection of additional rules, each of which is either an axiom or derived rule in CCGs, and a theorem in the Lambek calculus. Thus we do not commit to one approach over the other at this time.

These few rules are enough to formalize a small fragment of English, and demonstrate the possibility of interpreting natural language specifications within a proof assistant, in a well-founded and extensible way. Our initial choice of rules is limited, but not fundamentally so. CCGs recognize mildly context-sensitive languages [39, 48], which are believed to cover the full range of grammatical constructions in any natural language [88]. All rules of CCGs are encodable in the way we describe, as are the rules of Turing-complete CTLs [15]. Ultimately the question of which rules are required is an empirical one, considering both linguistic constructions and the complexity of recognizing these grammars using typeclass search. For now we are adding additional CCG rules as needed.

3 Exploring Categorial Grammars for Coq Specifications

This section describes a model of a very small fragment of English for describing simple mathematics. Our goal is not to present a polished and complete natural language fragment suitable for a wide range of specifications. Producing such a result is a very long-term goal, elaborated in Section 4. The purpose of this section then is to emphasize:

  1. that existing work on linguistic semantics covers many of the grammatical aspects of the language constructions we use when speaking or writing about formal claims, and is sufficiently flexible to admit extensions for grammar unique to mathematical prose;

  2. that existing proof assistants (by example Coq, but by association similar systems like Lean) offer an environment ready to implement linguistic semantics in a way directly integrated into the use of specifications in proof assistants; and

  3. that extensions of this idea are well worth investigating.

We propose using Coq’s typeclass support [86] to perform semantic parsing from natural language (in this paper English, but in principle any other natural language with thorough treatments in categorial grammar [34, 33, 32, 1, 3, 65, 58]) to Coq’s specification logic, or alternatively (see Section 4.4) embedded logics. This approach naturally supports an open-ended lexicon, which is essential to modularly extending the words handled by semantic parsing (Section 4). Moreover, it has the advantage over external tools that it works within existing proof assistants today, with no need to involve an external toolchain for the translation. This ensures the extended lexicon can grow in tandem with a formal development, with organization chosen by developers rather than dictated by an external tool, and with Coq automatically checking validity of the lexicon extensions at the same time it checks validity of the rest of the formalization.

Most work on categorial grammars lumps all entities (frogs, people, books) into a single semantic type. But for logical forms to talk about entities in Coq’s logic, which distinguishes natural numbers, rings, monoids, and so on with different types, adjustments must be made. Noun phrases and related syntactic categories must be parameterized by the semantic (Coq) type of entity they concern. Adapting categorial grammar to refer to objects in Coq’s logic, a dependently-typed lambda calculus known as the Calculus of Inductive Constructions (CIC) [72], requires making the base grammatical categories multi-sorted, distinguishing nouns with different semantic types. Traditional categorial grammars assume the logic used for logical forms has a single sort for “entities”, with different kinds of entity distinguished by predicates over that sort. This approach is rooted in the assumption that first order logic is an appropriate semantic model for sentence meaning. That assumption is plausible for many general circumstances, and useful for modeling figurative speech, but is clearly incompatible with making natural language claims about mathematical objects defined in intuitionistic type theory.

We follow linguistically-motivated work using intuitionistic type theories like Coq’s for exploring possible logical forms [89, 78, 79, 16, 11, 10, 80], in using an alternative model where many grammatical categories are indexed by the underlying semantic type to which they refer.

Coq’s logic is expressive enough to give the set of grammatical types as a datatype Cat, and to give the interpretation of those types into semantic types as a recursive function interp within Coq, as shown in Figure 2.

Inductive Cat : Type :=
  | S (* Sentence/proposition *)
  | NP : forall {x:Type}, Cat
  | rSlash : Cat -> Cat -> Cat (* A/B *)
  | lSlash : Cat -> Cat -> Cat (* A\B *)
  | ADJ : forall {x:Type}, Cat
  | CN : forall {A:Type}, Cat.
Fixpoint interp (c : Cat) : Type :=
  match c with
  | S => Prop (* Coq's type of propositions *)
  | @NP t => t
  | rSlash a b => interp b -> interp a
  | lSlash a b => interp a -> interp b
  | @ADJ t => t -> Prop
  | @CN t => unit
  end.
Figure 2: Core grammatical categories as a Coq type Cat and the mapping from grammatical types to semantic types as interp.
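As a small sanity check of how interp computes semantic types, the following sketch restates the Figure 2 definitions in simplified form and asks Coq to confirm two instances of the mapping by computation. It is only an illustration under stated assumptions: the type index of NP, ADJ and CN is made an explicit argument here (rather than implicit, as in Figure 2) to keep the block short, and the names is_even_type and is_type are hypothetical.

```coq
(* Simplified restatement of Figure 2; the type index is explicit here. *)
Inductive Cat : Type :=
  | S : Cat
  | NP : Type -> Cat
  | ADJ : Type -> Cat
  | CN : Type -> Cat
  | rSlash : Cat -> Cat -> Cat   (* A/B *)
  | lSlash : Cat -> Cat -> Cat.  (* A\B *)

Fixpoint interp (c : Cat) : Type :=
  match c with
  | S => Prop
  | NP t => t
  | ADJ t => t -> Prop
  | CN _ => unit
  | rSlash a b => interp b -> interp a
  | lSlash a b => interp a -> interp b
  end.

(* A fragment like "is even" has type NP\S over nat: a predicate on nat. *)
Example is_even_type :
  interp (lSlash (NP nat) S) = (nat -> Prop) := eq_refl.

(* The adjective-applying "is": noun phrase on the left, adjective on the right. *)
Example is_type :
  interp (lSlash (NP nat) (rSlash S (ADJ nat)))
  = (nat -> (nat -> Prop) -> Prop) := eq_refl.
```

Both examples are accepted by eq_refl alone, which reflects that the grammatical-to-semantic translation is a plain computation inside Coq rather than a separate trusted tool.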

Grammatical types (categories) Cat include the aforementioned slash types; sentences (S); noun phrases (NP t) denoting objects of Coq type t; adjectives (ADJ t) denoting predicates over such objects; and common nouns (CN t) denoting unit types, but imposing constraints on the semantic type indices of other phrases in the context of a sentence, used in cases where a sentence must refer to a common class of objects (i.e., a type), such as “natural numbers” or “rings.” As mentioned previously, the slash types correspond to function types (the direction is relevant only in the grammar, not the semantics), sentences are modeled by Coq propositions (the type of logical claims), noun phrases of Coq type t correspond to elements of t, and similarly adjectives correspond to predicates on such types (i.e., elements of t -> Prop). These are modeled by the Coq function interp, which maps grammatical types to other Coq types. Common nouns denote as the unit type. This may seem odd, but common nouns essentially serve as a means to force certain type variables to be the ones corresponding to a specific word. (In traditional semantics in first-order logic, common nouns often denote a kind of predicate to guard quantifications.)

We use Coq’s notation facilities (essentially, macros) to write the slash types as A // B for A/B, and A \\ B for A\B. We also use these macros to define more interesting grammatical categories. (We could use Coq definitions as well, but macros work better with the unification in the next section.) For example, a quantifier over a certain common noun type A is given as: Quant A := (S // (@NP A \\ S)) // @CN A. That is, a quantifier discussing a Coq type looks first to its right for a common noun (which constrains the type to the one named by the natural language common noun). Then the result of that looks to its right for a sentence fragment expecting a noun phrase to its left. The latter sentence fragment is essentially a predicate.

3.1 Combination Rules

Parsing natural language specifications requires automatically applying rules like /-Elim to combine sentence fragments. Rather than modifying Coq, we can use existing trusted functionality to do this for us: typeclasses [86]. (Officially, typeclasses are not part of Coq’s trusted computing base, as they elaborate to record operations before being passed to the core proof checking apparatus. In practice, they mediate which terms are passed to the core, so calling them untrusted would be a misnomer.) These are a mechanism for parameterizing function definitions by a set of (often derivable) operations. Coq permits declaring a typeclass (roughly, an interface), and declaring implementations associated with certain types. The implementations may be parameterized by implementations for other types (such as defining an ordering on pairs in terms of orderings for each component of the pair). When a function is called that relies on a set of operations, Coq attempts to use higher-order unification to construct an appropriate implementation. It is possible to encode the rules of a system like a CTL or CCG into typeclasses. Each judgment signature corresponds to a typeclass, and each rule corresponds to an instance (implementation) of the typeclass. We define the judgment form as:

Class Synth (l : list word) (cat : Cat) :=
  { denote : interp cat }.

If an instance of Synth l C exists, it comes with an operation denote that produces a Coq value of type interp C. Because the denotation is viewed as an output we would like to query, it is defined as a member of the typeclass, rather than as an additional index. When deriving a formal specification for a sentence s, we will arrange for the typeclass machinery to locate an instance of Synth s S (checking that s is a grammatically valid sentence) and request its term denotation when necessary. Translating a specification given by the list of words w corresponds to parsing w as a sentence: finding an instance of Synth w S. We define an instance for each rule to encode, such as this one corresponding to \-Elim, which applies the logical form of the functional to the logical form of the argument:

Instance SynthLApp {l1 l2 A B}
 (L:Synth l1 A)(R:Synth l2 (A \\ B)) :
 Synth (l1 ++ l2) B :=
{denote := (@denote _ _ R) (@denote _ _ L)}.

and for leftward composition (\-Comp):

Instance LComp {l1 l2 A B C}
 (L:Synth l1 (A \\ B))(R:Synth l2 (B \\ C))
  : Synth (l1 ++ l2) (A \\ C) :=
{denote := fun x1 => (@denote _ _ R) (@denote _ _ L x1)}.

The remaining rules of Figure 1 can also be encoded in this way.
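For concreteness, here is one way three of the remaining rules (/-Elim, /-Comp and Shift) might be written as instances. This is a sketch under assumptions, not the development's actual code: words are taken to be Coq strings (declared via a notation), the categories use explicit type indices and constructor syntax instead of the // and \\ macros, and the instance names SynthRApp, SynthRComp and SynthShift are hypothetical. The block only checks that each rule's logical form is well typed; it does not run any proof search.

```coq
From Coq Require Import String List.

(* Assumption: a word is a Coq string. *)
Notation word := string.

Inductive Cat : Type :=
  | S : Cat
  | NP : Type -> Cat
  | ADJ : Type -> Cat
  | CN : Type -> Cat
  | rSlash : Cat -> Cat -> Cat   (* A/B *)
  | lSlash : Cat -> Cat -> Cat.  (* A\B *)

Fixpoint interp (c : Cat) : Type :=
  match c with
  | S => Prop
  | NP t => t
  | ADJ t => t -> Prop
  | CN _ => unit
  | rSlash a b => interp b -> interp a
  | lSlash a b => interp a -> interp b
  end.

Class Synth (l : list word) (cat : Cat) := { denote : interp cat }.

(* /-Elim: the functional fragment is on the left, its argument on the right. *)
Global Instance SynthRApp {l1 l2 A B}
  (L : Synth l1 (rSlash A B)) (R : Synth l2 B)
  : Synth (l1 ++ l2)%list A :=
  { denote := (@denote _ _ L) (@denote _ _ R) }.

(* /-Comp: compose two rightward-looking fragments (e after f). *)
Global Instance SynthRComp {l1 l2 A B C}
  (L : Synth l1 (rSlash A B)) (R : Synth l2 (rSlash B C))
  : Synth (l1 ++ l2)%list (rSlash A C) :=
  { denote := fun x => (@denote _ _ L) ((@denote _ _ R) x) }.

(* Shift: A\(B/C) becomes (A\B)/C by swapping the order of the arguments. *)
Global Instance SynthShift {l A B C}
  (L : Synth l (lSlash A (rSlash B C)))
  : Synth l (rSlash (lSlash A B) C) :=
  { denote := fun r a => (@denote _ _ L) a r }.
```

In each case the body of denote is exactly the logical form written after ⇒ in the conclusion of the corresponding rule of Figure 1, and Coq's type checker enforces the invariant that this term inhabits interp of the resulting category.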

In addition to exploiting the proof assistant’s built-in search for parsing, the use of typeclasses means the set of rules is extensible. As mentioned, the core CCG rules are already known to be expressive enough to cover any known linguistic construction (the technical term is that CCGs are mildly context-sensitive [39, 48]). However:

  • Additional derived rules could be added to accelerate proof search for recurring intricate constructions.

  • The rules can be used to offer generalized constructions for words with many roles. This is particularly valuable for linguistic constructs like coordination (Section 3.2.2).

  • As discussed momentarily, it allows easy extension of the lexicon with new words without modifying a fixed database.

3.2 Lexicon

The lexicon is encoded via another typeclass which assigns grammatical types and logical forms to individual words rather than series of words. Coq permits declaring multiple instances for the same word (e.g., if a word has multiple meanings of different grammatical types), giving essentially a free variant of intersection types [64] without the coherence issues described by Carpenter [14] (only one definition will be chosen per appearance of the word). We represent our lexicon with another type class, and tie it into the Synth typeclass:

Class lexicon (w : word) (cat : Cat) := { denotation : interp cat }.
Instance SynthLex {w cat} `(lexicon w cat) : Synth [w] cat :=
  { denote := denotation }.

Thus, a dictionary for our approach consists of a set of instance declarations for lexicon:

Instance fourlex : lexicon "four" (@NP nat) := { denotation := 4 }.
Instance noun_is_adj_sentence {A:Type} :
  lexicon "is" (@NP A \\ (S // @ADJ A)) :=
   { denotation n p := p n }.
Instance noun_is_noun_sentence {A:Type} :
  lexicon "is" (@NP A \\ (S // @NP A)) :=
   { denotation n a := n = a }.

Here we have defined two different meanings for “is”, allowing it to be used to apply an adjective or to denote equality. The difference between the two, beyond their denotation, is the grammatical types: both expect a noun phrase to the left, and some other word to the right: an adjective in the first case, or another noun in the second. Note that in both cases, the adjective or noun phrase must match the type of underlying Coq object the left-side noun phrase refers to: the argument n is in both cases a variable of type interp (@NP A) = A because that is the argument of the outermost slash type, while the second argument in each entry corresponds to the interpretation of the second slash type’s argument (a predicate or an additional term, respectively). Coupled with a development-specific bit of lexicon to name a particular Coq object of interest:

Instance addone_lex : lexicon "addone" (@NP (nat -> nat)) :=
  { denotation := addone (* fun x => x + 1 *) }.

this approach permits giving correct denotations to both:

3.2.1 Quantifiers

Earlier we mentioned that quantifiers over a Coq type A can be given the grammatical type Quant A = (S // (@NP A \\ S)) // @CN A.

Thus, a quantifier looks to its right first for a common noun (corresponding to the word identifying the Coq type to quantify over), and after that is combined, the result looks further to the right for a sentence fragment expecting such a thing to its left. Then adding appropriate lexicon entries for “every”:

Instance forall_lex {A:Type} : lexicon "every" (Quant A) :=
{ denotation := fun _ P => (forall (x:A), P x) }.

and another for the common noun “natural” (number) allows correctly parsing sentences like

(Recall, we must still be able to state claims that are false.) The common noun contributes nothing directly to the denotation, but constrains the quantifier to work with noun phrases referring to natural numbers.
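The typing of the quantifier entry can be checked in isolation, without running the parser. The sketch below makes several assumptions for brevity: Quant is written as an ordinary definition over explicit constructors (the paper prefers notations, which interact better with unification), the type index is an explicit argument, and the names every_sem and every_natural_nonneg are hypothetical. It shows that the denotation of “every”, given the (ignored) unit denotation of a common noun and a predicate, computes to a universally quantified Coq proposition.

```coq
Inductive Cat : Type :=
  | S : Cat
  | NP : Type -> Cat
  | ADJ : Type -> Cat
  | CN : Type -> Cat
  | rSlash : Cat -> Cat -> Cat   (* A/B *)
  | lSlash : Cat -> Cat -> Cat.  (* A\B *)

Fixpoint interp (c : Cat) : Type :=
  match c with
  | S => Prop
  | NP t => t
  | ADJ t => t -> Prop
  | CN _ => unit
  | rSlash a b => interp b -> interp a
  | lSlash a b => interp a -> interp b
  end.

(* (S / (NP A \ S)) / CN A, written with constructors instead of macros. *)
Definition Quant (A : Type) : Cat :=
  rSlash (rSlash S (lSlash (NP A) S)) (CN A).

(* Same logical form as the "every" entry: ignore the common noun's unit
   denotation and quantify the predicate over all of A. *)
Definition every_sem {A : Type} : interp (Quant A) :=
  fun _ P => forall x : A, P x.

(* Logical form one would expect for "every natural is non-negative". *)
Example every_natural_nonneg :
  @every_sem nat tt (fun n => 0 <= n) = (forall x : nat, 0 <= x) := eq_refl.
```

The common noun contributes only tt here; its real work in the full system is to fix the index A = nat during instance resolution.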

3.2.2 Coordination

One aspect of natural language which is the source of some interest is that the words “and” and “or” (or their equivalents in other languages) can often be used to combine sentence fragments of widely varying grammatical types. For example, in “four is even and positive” the word “and” conjoins two adjectives: “even” and “positive.” Yet in the sentence “four is even and is positive” it conjoins two phrases of grammatical type NP \ S (“is even” and “is positive”).

We can directly adopt a solution from the computational linguistics literature [14], and formalize that “and” and “or” apply to any semantic type that is a function into (a function into…) the type Prop of propositions. We define an additional typeclass to recognize such “Prop-like” grammatical types inductively, starting with the grammatical types S and ADJ, and inductively including slash types whose result type is also “Prop-like”; its instances define an operation to lift boolean semantics through repeated functions. We then add a polymorphic lexicon entry for each of “and” and “or” which assigns them any “Prop-like” type.

Thus in a sentence like “four is even and is positive” the two conjuncts are recognized as Prop-like (their underlying semantic type is nat -> Prop), and the operations of the typeclass recognizing this automatically lift a binary operation on Prop to a binary operation on predicates. For “and” this lifts logical conjunction to λP Q x. P x ∧ Q x, which is exactly what is needed: the grammar rules will apply this function to the semantics of the even and positive predicates, and finally 4. Disjunction is handled similarly, and this generalizes to arbitrarily complex slash types whose final semantic result is Prop.
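A minimal sketch of such a “Prop-like” typeclass follows. The class name PropLike, its operation lift2, and the predicates is_even and is_pos are all hypothetical names chosen for illustration, and the sketch omits the bound on repeated lifting that the prototype imposes (Section 3.5). As in the other sketches, type indices are explicit arguments.

```coq
Inductive Cat : Type :=
  | S : Cat
  | NP : Type -> Cat
  | ADJ : Type -> Cat
  | CN : Type -> Cat
  | rSlash : Cat -> Cat -> Cat   (* A/B *)
  | lSlash : Cat -> Cat -> Cat.  (* A\B *)

Fixpoint interp (c : Cat) : Type :=
  match c with
  | S => Prop
  | NP t => t
  | ADJ t => t -> Prop
  | CN _ => unit
  | rSlash a b => interp b -> interp a
  | lSlash a b => interp a -> interp b
  end.

(* Categories whose semantics is a (possibly nested) function into Prop. *)
Class PropLike (c : Cat) :=
  { lift2 : (Prop -> Prop -> Prop) -> interp c -> interp c -> interp c }.

(* Base cases: sentences and adjectives. *)
Global Instance PropLikeS : PropLike S :=
  { lift2 := fun op p q => op p q }.
Global Instance PropLikeADJ {t} : PropLike (ADJ t) :=
  { lift2 := fun op p q => fun x => op (p x) (q x) }.

(* Inductive cases: a slash type is Prop-like when its result type is. *)
Global Instance PropLikeL {a b} (pb : PropLike b) : PropLike (lSlash a b) :=
  { lift2 := fun op f g => fun x => @lift2 b pb op (f x) (g x) }.
Global Instance PropLikeR {a b} (pa : PropLike a) : PropLike (rSlash a b) :=
  { lift2 := fun op f g => fun x => @lift2 a pa op (f x) (g x) }.

Definition is_even (n : nat) : Prop := exists k, n = 2 * k.
Definition is_pos (n : nat) : Prop := 0 < n.

(* Conjoining two NP\S fragments ("is even", "is positive") pointwise. *)
Example and_lifts :
  @lift2 (lSlash (NP nat) S) _ and is_even is_pos
  = (fun x : nat => is_even x /\ is_pos x) := eq_refl.
```

Instance resolution fills the underscore by walking the category's structure, so the same lexicon entry for “and” would serve adjectives, verb phrases, and whole sentences alike.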

3.3 Using Specifications

We couple these with an additional typeclass Semantics s C which is defined when the string s splits into a list of words w (string splitting is implemented with another typeclass) and there exists an instance of Synth w C. Then we define a function from strings to their denotations:

Definition spec (s:string) `{sem:Semantics s S} := sdenote sem.

When invoked with a string s, Coq will search for an instance of Semantics s S — a parse of the string as a complete sentence. The logical form of a sentence has Coq type Prop (a proposition, or logical claim, to be proven).

Thus we may translate a range of specifications given an appropriate lexicon, including those below (sugared into math notation for space and readability):

Because Coq’s logic allows proof goals to be computed, spec can be used to declare proof goals:

Goal spec "addone is monotone".
> spec "addone is monotone"
simpl. (* simplify the goal *)
> forall x y : nat, x <= y -> addone x <= addone y

If Coq cannot find a Semantics instance for a specification, the user sees an error; because instance search reuses existing proof search functionality, a skilled user could manually debug the failure.
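To make the pipeline concrete end to end, the following self-contained sketch parses “four is even” by instance resolution and then proves the resulting proposition. It simplifies the setup described above in several ways, all of them assumptions of this sketch rather than features of the prototype: the input is already tokenized and already bracketed with ++ (there is no tokenizer and no Reassoc rule), only the two elimination rules are present, type indices are explicit, and parse, is_even and the instance names are hypothetical stand-ins (parse plays the role of spec).

```coq
From Coq Require Import String List.
Import ListNotations.
Open Scope string_scope.
Open Scope list_scope.

Notation word := string.  (* assumption: words are Coq strings *)

Inductive Cat : Type :=
  | S : Cat
  | NP : Type -> Cat
  | ADJ : Type -> Cat
  | rSlash : Cat -> Cat -> Cat   (* A/B *)
  | lSlash : Cat -> Cat -> Cat.  (* A\B *)

Fixpoint interp (c : Cat) : Type :=
  match c with
  | S => Prop
  | NP t => t
  | ADJ t => t -> Prop
  | rSlash a b => interp b -> interp a
  | lSlash a b => interp a -> interp b
  end.

Class lexicon (w : word) (cat : Cat) := { denotation : interp cat }.
Class Synth (l : list word) (cat : Cat) := { denote : interp cat }.

(* A single word parses according to its lexicon entry. *)
Global Instance SynthLex {w cat} (lx : lexicon w cat) : Synth [w] cat :=
  { denote := @denotation _ _ lx }.
(* \-Elim *)
Global Instance SynthLApp {l1 l2 A B}
  (L : Synth l1 A) (R : Synth l2 (lSlash A B)) : Synth (l1 ++ l2) B :=
  { denote := (@denote _ _ R) (@denote _ _ L) }.
(* /-Elim *)
Global Instance SynthRApp {l1 l2 A B}
  (L : Synth l1 (rSlash A B)) (R : Synth l2 B) : Synth (l1 ++ l2) A :=
  { denote := (@denote _ _ L) (@denote _ _ R) }.

Definition is_even (n : nat) : Prop := exists k, n = 2 * k.

Global Instance four_lex : lexicon "four" (NP nat) := { denotation := 4 }.
Global Instance even_lex : lexicon "even" (ADJ nat) :=
  { denotation := is_even }.
Global Instance is_adj_lex {A : Type} :
  lexicon "is" (lSlash (NP A) (rSlash S (ADJ A))) :=
  { denotation := fun n p => p n }.

(* Simplified stand-in for spec: the logical form of a word list parsed as S. *)
Definition parse (l : list word) `{p : Synth l S} : Prop := @denote l S p.

Goal parse ((["four"] ++ ["is"]) ++ ["even"]).
Proof.
  hnf.        (* head-reduce the parse down to the statement about 4 *)
  exists 2.   (* witness that 4 is twice some number *)
  reflexivity.
Qed.
```

If a word were missing from the lexicon, stating the goal would fail at elaboration time with an unsatisfied-constraint error for the Synth instance, which corresponds to the parse error described above.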

3.4 Predicativity

A careful reader will notice that Section 3 defines a Type whose constructors quantify over Type. Our current use cases produce elements of Prop, so we could simply define Cat as having sort Prop. However, there are many reasons to conduct proofs avoiding Prop even in specifications (e.g., homotopy type theory [9]). Making all of the typeclass definitions, instances, and goals universe-polymorphic (Polymorphic) allows us to remain predicative.

3.5 Performance

The performance of semantic parsing from natural language into formal specifications depends on both the underlying typeclass resolution procedure, as well as the space of derivations that must be explored during parsing. Our tokenization code for splitting strings runs in linear time because there is at most one typeclass instance that can apply for each character of a string (i.e., depending on whether or not the character is a space). So the rules which determine the search space are primarily the structural rules encoded in the Synth typeclass instances, and the lexicon instances. Since most words have only one or a very small number of grammatical roles (in general, not just in our small prototype [33, 34, 32]), lexicon ambiguity will not be a major driver of search costs. Instead, most costs come from exploring structural rule applications, particularly the rules for associativity and shifting of left and right slash type nesting, each of which may send search off into a dead end. To limit ambiguity, our typeclass for coordinators imposes an upper bound on how many times boolean operations can be lifted, and in Coq we set the typeclass resolution to use a depth limit of 15 on instance resolution. This reflects that by and large, categorial grammar derivations tend to be shallow and wide.

As a coarse measure of base efficiency, under these conditions parsing “every natural is non-negative and some natural is even” takes approximately 26 seconds on a 4-core Intel Core i5 (with 16GB of RAM, but this is CPU-bound). This was measured by setting Semantics "every natural is non-negative and some natural is even" S as the proof goal, and using time typeclasses eauto to time the search. Faster would be ideal, but this is at least suitable for using one file to cache the parses, which can be compiled infrequently, with other files pulling in the relationships as needed. Inspection of the search trace confirms that most search time is spent in dead-ends related to different combinations of associativity and shifting rules.
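The depth limit and the timing method described above can be tried on a toy problem. The following is a minimal sketch, not the prototype's code: the Small class and its two instances are hypothetical stand-ins for the Semantics class and its instances, while Set Typeclasses Depth, time, and typeclasses eauto are standard Coq commands and tactics.

```coq
(* Cap the depth of instance resolution, mirroring the limit of 15 used above. *)
Set Typeclasses Depth 15.

(* Optional: print the resolution trace to see where search spends its time. *)
(* Set Typeclasses Debug. *)

(* Hypothetical toy class; it only exists to give resolution something to search. *)
Class Small (n : nat) : Prop := { small_witness : True }.

#[global] Instance small_zero : Small 0 := { small_witness := I }.
#[global] Instance small_succ (n : nat) `{Small n} : Small (S n) :=
  { small_witness := I }.

Goal Small 5.
Proof.
  (* `time` reports how long the enclosed tactic takes to run. *)
  time typeclasses eauto.
Qed.
```

Replacing the goal with a Semantics goal for a concrete sentence would reproduce the style of measurement reported here, assuming the prototype's instances are in scope.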

To investigate whether alternative typeclass resolution algorithms might have an impact, we ported most of our machinery to version 4 of the Lean theorem prover [68]. We ported everything except string tokenization, which takes negligible time in Coq for reasons outlined above and does not directly translate because Lean uses a different datatype for strings. Lean 4 uses a new typeclass resolution algorithm [85] designed specifically to accelerate complex typeclass resolution problems like the one our work presents. Working with pre-tokenized inputs, compiling the entire formalization and parsing multiple examples, including the one above, consistently takes less than 3.5 seconds. More complex sentences of course may require more time and investment in optimization, as the simplest categorial grammars model the context-free languages (which have worst-case cubic time parsing), while the most complete mildly-context-sensitive classes of categorial grammars have worst-case O(n^6) parsing cost [39], though in practice the common case can be made quite fast [19, 18, 20].

4 Modularity and Extension: Growing a Lexicon, Handling More Logics

The previous section described only a small fragment of English suitable for formalizing mathematical claims. Categorial grammars are what linguistic semanticists call lexicalized grammar formalisms. Unlike phrase-structure grammars (e.g., context-free grammars) which build in an explicit classification of grammatical phrase types, lexicalized grammars use a small set of general rules (like those in Figure 1), and then rely on the lexicon to give the precise grammatical types of every word. The availability of slash types (directed function types) affords significant flexibility, and extensions to attach modalities to the slashes [6, 63] allow further constraints capturing the subtleties of natural language to be captured solely by giving precise grammatical types (and semantics) to individual words.

4.1 Managing Words

Adding new words to a categorial grammar lexicon is conceptually as simple as adding the word, particular grammatical type, and associated denotation to the database. This makes it easy to extend a system with new concepts (e.g., new algebraic structures); lexicon entries to deal with concepts defined in a proof assistant library can be distributed as a part of that library. Conversely, if a word or particular usage of a word is found to be confusing to humans, leading to ambiguity, or otherwise problematic, it can be removed from the lexicon while affecting only inputs that use that word in that way (i.e., the problematic ones).
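To make the shape of a lexicon entry concrete, here is a minimal sketch in Coq. It is an illustration under stated assumptions rather than the prototype's definitions: Cat, interp, Lexicon, denote, and all instance names are hypothetical and much simpler than a realistic grammar. Each entry bundles a word, a grammatical type, and a denotation, and extending the lexicon is one more instance declaration.

```coq
Require Import String.
Open Scope string_scope.

(* Hypothetical, simplified grammatical types. *)
Inductive Cat : Type :=
| CS : Cat                      (* sentence *)
| CNP : Cat                     (* noun phrase referring to a natural number *)
| CADJ : Cat                    (* adjective over natural numbers *)
| RSlash : Cat -> Cat -> Cat    (* RSlash a b: yields a after a b on its right *)
| LSlash : Cat -> Cat -> Cat.   (* LSlash a b: yields b after an a on its left *)

(* The Coq type of denotations for each grammatical type. *)
Fixpoint interp (c : Cat) : Type :=
  match c with
  | CS => Prop
  | CNP => nat
  | CADJ => nat -> Prop
  | RSlash a b => interp b -> interp a
  | LSlash a b => interp a -> interp b
  end.

(* A lexicon entry: word, grammatical type, and denotation. *)
Class Lexicon (w : string) (c : Cat) : Type := { denote : interp c }.

#[global] Instance four_lex : Lexicon "four" CNP := { denote := 4 }.
#[global] Instance even_lex : Lexicon "even" CADJ :=
  { denote := fun n : nat => exists k, n = 2 * k }.
#[global] Instance is_lex : Lexicon "is" (RSlash (LSlash CNP CS) CADJ) :=
  { denote := fun (p : nat -> Prop) (x : nat) => p x }.

(* Extending the lexicon with a new word is one more instance. *)
#[global] Instance odd_lex : Lexicon "odd" CADJ :=
  { denote := fun n : nat => exists k, n = 2 * k + 1 }.

(* Combining the entries by hand: "is" takes "even", then "four". *)
Example four_is_even_denotes :
  (@denote "is" (RSlash (LSlash CNP CS) CADJ) _)
    (@denote "even" CADJ _) (@denote "four" CNP _)
  = (exists k, 4 = 2 * k) := eq_refl.
```

Removing a problematic word amounts to deleting its instance; inputs that never use the word still resolve exactly as before.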

In practice the situation will be more complex, but we expect most extension to require little, if any, special linguistic knowledge. Assuming a robust core lexicon (Section 4.3), it is likely that most extensions will be additions of words with simpler categories. Experiments on a large standard-English lexicon showed [33] that when training on most of the lexicon, the unseen words in a held-out test set were primarily nouns (35.1%) or transformations of nouns (e.g., adjectives, at 29.1%). These are the simplest categories to provide semantics for (types, objects, and predicates), strongly suggesting that proof assistant users with no special linguistics background could make most extensions themselves. Similar experiments for a wide-coverage lexicon of German [32] show over half of unknown words to be nouns, suggesting this feasibility extends beyond just English.

4.2 Supporting Additional Grammatical Constructions

Formalization of significant fragments of language must deal with more subtle constructions than what we have described so far can handle. However, what we have described thus far is essentially read directly out of the literature on linguistic semantics. Linguists have spent many decades building out knowledge of how to handle more sophisticated uses of quantification [88, 62] (“every,” “some,” “most”), resolving pronoun references [36], discontinuity [66] (where a word is far from a word it modifies), and much more [14, 67].

4.3 A Full-Featured Core Lexicon

How large should a base lexicon with reasonably wide coverage be? The largest lexicon for a categorial grammar is Hockenmaier and Steedman’s CCGBank [32, 34], which models the usage of language in a particular sample of the Wall Street Journal. It contains roughly 75K unique words, and some of the most common words have dozens of grammatical categories, or more (the English word “as” is overloaded with 130 distinct, though related, grammatical types). This has motivated work on learning lexicons with semantics [41, 5, 51, 50], as well as work on learning more compact lexicons that automatically capture standard word variations (e.g., automatically generalizing singular definitions to work for plural forms) [55, 93, 49]. These works all focus on cases where CCGs parse sentences into a variant of first-order logic, but should in principle generalize to targeting richer logics like the Calculus of Inductive Constructions underlying Coq and Lean. While the need to scale to large lexicons draws us back to a kind of machine learning, it draws us back to a kind with eminently auditable results, producing lexicon entries which have well-defined individual meaning, which can be manually adjusted or removed if necessary.

Any initial broad-coverage lexicon for technical prose will need to be manually constructed (including input to a future learning algorithm). However, since technical prose about math and code is still a particular stylistic use of a standard natural language, it mostly reuses words in the same grammatical role — and therefore, same categorial grammar type — as non-specialist grammars. This means we can bootstrap an initial lexicon for English by reusing grammatical types from existing categorial grammar lexicons for English [34, 35], and similarly German [32], Hindi [3], Japanese [58], and other languages [1].

This means initial efforts can focus mostly on defining semantics for existing grammar-only lexicon entries, rather than starting from scratch. And for many of the words appearing in specifications, particularly quantifiers (“every”, “all”, etc.), determiners (“a”, “the”, etc.), and prepositions (“in”, “of”, etc.), the semantics are typically very simple (quantifier semantics look like the examples in earlier sections; prepositions are typically identity functions, functioning similarly to linguistic phantom types [27] / units of measure [42] for other words to locate parameters.)

Careful readers or prior students of linguistics may have wondered when matters of verb tense, noun case and number, grammatical gender (which does not exist in English, but does in German, French, and so on), etc. would arise. In full linguistic treatments, these are reflected in additional parameters to some grammatical categories. So for example, in our setting a noun phrase would be parameterized not only by the underlying referent type, but also by the case, number and so on; lexicon entries would then carry these through appropriately (making it possible, for example, to require the direct object of a verb to be in the accusative case rather than nominative). We have omitted such a treatment here partly because it would obscure the key ideas while adding little value, partly because many of these distinctions are less important for our examples in English (which has fewer syntactic case distinctions than other languages), and partly because some aspects (like tense) may make sense only for specific embedded specification logics.

4.4 Beyond CIC

While the framing in this paper has focused on generating specifications which in Coq and Lean have type Prop, this is not required. Categorial grammars require only that their top-level semantic truth value type have the structure of a Heyting Algebra [53]: a type with binary operators for standard logical operators.

Our Coq and Lean formalizations in fact make this generalization: the core machinery is polymorphic over an arbitrary choice of Heyting Algebra, with a lexicon split between entries polymorphic over the Heyting Algebra being targeted (e.g., “or” and “and”) and words specific to a given Heyting Algebra.
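The split between logic-generic and logic-specific lexicon entries can be sketched as follows. This is a hedged illustration, not the formalization itself: HeytingAlg, its field names, the two instances, and and_sem / or_sem are hypothetical, and the algebraic laws a real Heyting algebra structure would carry are omitted for brevity.

```coq
(* Hypothetical interface: operations only, laws omitted. *)
Class HeytingAlg (H : Type) : Type := {
  htrue : H;
  hfalse : H;
  hand : H -> H -> H;
  hor : H -> H -> H;
  himpl : H -> H -> H
}.

(* Coq's own propositions as one target logic. *)
#[global] Instance prop_heyting : HeytingAlg Prop := {
  htrue := True;
  hfalse := False;
  hand := and;
  hor := or;
  himpl := fun p q : Prop => p -> q
}.

(* Predicates over some A, with pointwise operations, as a second target. *)
#[global] Instance pred_heyting (A : Type) : HeytingAlg (A -> Prop) := {
  htrue := fun _ => True;
  hfalse := fun _ => False;
  hand := fun p q x => p x /\ q x;
  hor := fun p q x => p x \/ q x;
  himpl := fun p q x => p x -> q x
}.

(* Denotations for "and" and "or", written once for every target logic. *)
Definition and_sem {H : Type} `{HeytingAlg H} : H -> H -> H := hand.
Definition or_sem {H : Type} `{HeytingAlg H} : H -> H -> H := hor.

(* The same entry specializes to each logic by instance resolution. *)
Example and_in_prop : and_sem True False = (True /\ False) := eq_refl.

Example and_in_pred :
  and_sem (fun n : nat => n = 0) (fun n : nat => n = 1)
  = (fun n : nat => n = 0 /\ n = 1) := eq_refl.
```

Words whose meaning only makes sense in one logic (for example, a temporal operator) would instead be given entries tied to that logic's instance.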

This means the core idea is not limited to specifications of type Prop: the machinery can be readily retargeted to any logic formalized within the proof assistant, such as LTL [75], CTL [21], or the BI-algebras underlying separation logics like Iris [40].

This is not itself a new observation: as we discuss in related work (Section 7), others have previously instantiated categorial grammars to generate, for example, CTL [23]. However, these prior applications have targeted only specific use cases, while this setting permits reusing many lexicon entries across many logics, which should help in retargeting this machinery to new applications and future logics which may be formalized within a proof assistant.

5 Trust and Auditing

One of the essential criteria for an LCF-style proof assistant is the production of an independently-checkable proof certificate. While we have proposed using Coq’s typeclass machinery to automatically parse and denote, and the typeclass resolution itself is typically not viewed as part of the trusted computing base (TCB), it does effectively produce a proof certificate. The typeclass machinery explicitly constructs an instance of the typeclass (an element of the corresponding record type) and passes it to spec. So Coq’s kernel sees (effectively) a categorial grammar proof, constructed via typeclass instances rather than constructors of an inductive data type. This explicit term persists into the proof certificates Coq already produces, and could be identified by an independent proof checker that wished to also validate the natural language interpretation. For example, the textual representation of the term witnessing that “four is even” parses to even 4 is:

bridgeStringWords (* The typeclass instance to tokenize then parse *)
  (split1 NotSpace4 (* Start of tokenization *)
          (split2 NotSpace6 NotSpace6
                  (split4 NotSpace6 NotSpace6 NotSpace6 NotSpace6)))
  (SynthLApp (SynthLex fourlex) (* Parse tree of tokenized string *)
     (SynthRApp (SynthShift (SynthLex noun_is_adj_sentence))
                (SynthLex even_lex)))

which encodes both the tokenization (split1, etc.) and the constructed parse tree (with SynthLex calls referring to individual lexicon instances).

We can think of several ways a user might accidentally or maliciously risk confusing an independent checker. All but one can easily be detected by a checker aware of the categorial grammar specification typeclasses. The final possibility amounts to changing the specification in the proof certificate.

  • A user may redefine or extend our core instances (for Synth) to produce a different denotation. A certificate checker would already ensure these are type-correct. A natural-language-specification-aware extension could check that the Synth instances correspond to the desired rules. Or to better support some of the extensibility arguments made earlier, the Synth typeclass could be modified to also carry a justification of its conclusions in a more general substructural logic [47, 30], which would amount to requiring extensions to carry conservativity proofs.

  • A user may extend the lexicon with additional words or additional grammatical roles for a given word, introducing ambiguity into the parsing. Checking for ambiguity is relatively straightforward: ignoring indexing by Coq types, equivalence of grammatical types is decidable, and a checker could conservatively require that any lexicon entries with the same index-erased grammatical types have clearly-distinct indices. An independent checker could verify the absence of ambiguity in the lexicon, or alternatively surface the use of any ambiguity in a parsing derivation for human inspection.

  • A user could also manipulate the lexicon, for example redefining “monotone” to carry a different denotation. This is arguably a form of modifying the specification by changing definitions, rather than sneaking a broken proof past a certificate checker. It is analogous to changing a definition of a property verified by a proof: a working proof with the wrong definition is wrong, but this leaves behind evidence of the incorrect definition.

These possible forms of attack highlight the main sources of trust added when considering natural language specifications in the approach we describe: the grammatical rules for combining phrases, well-formedness of the lexicon, and the definitions of words in the lexicon.
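The conservative ambiguity check from the second item can be sketched as a small decision procedure. This is a minimal sketch under stated assumptions: ECat, ecat_eqb, clash, and ambiguous are hypothetical names, and the lexicon is modelled as a plain list of words paired with index-erased grammatical types. It flags any word that has two entries with the same erased type, which is exactly the situation in which a checker would need to inspect the indices or ask a human.

```coq
Require Import String List.
Import ListNotations.
Open Scope string_scope.

(* Hypothetical index-erased grammatical types (no Coq types as indices). *)
Inductive ECat : Type :=
| ES : ECat
| ENP : ECat
| EADJ : ECat
| ERSlash : ECat -> ECat -> ECat
| ELSlash : ECat -> ECat -> ECat.

(* Equality of erased grammatical types is decidable by structural comparison. *)
Fixpoint ecat_eqb (a b : ECat) : bool :=
  match a, b with
  | ES, ES => true
  | ENP, ENP => true
  | EADJ, EADJ => true
  | ERSlash a1 a2, ERSlash b1 b2 => andb (ecat_eqb a1 b1) (ecat_eqb a2 b2)
  | ELSlash a1 a2, ELSlash b1 b2 => andb (ecat_eqb a1 b1) (ecat_eqb a2 b2)
  | _, _ => false
  end.

(* Two entries clash when both the word and the erased type coincide. *)
Definition clash (e1 e2 : string * ECat) : bool :=
  andb (String.eqb (fst e1) (fst e2)) (ecat_eqb (snd e1) (snd e2)).

(* A lexicon is flagged if any entry clashes with a later one. *)
Fixpoint ambiguous (lex : list (string * ECat)) : bool :=
  match lex with
  | [] => false
  | e :: rest => orb (existsb (clash e) rest) (ambiguous rest)
  end.

Example unambiguous_lexicon :
  ambiguous [("four", ENP); ("even", EADJ)] = false := eq_refl.

Example ambiguous_lexicon :
  ambiguous [("even", EADJ); ("four", ENP); ("even", EADJ)] = true := eq_refl.
```

A checker built this way could either reject a flagged lexicon outright or report the clashing entries for human inspection.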

A minor point of trust is the requirement that lexicon entries at least give semantics whose Coq type is consistent with the grammatical type at hand. In our prototype this is expressed via the requirement that the type of a word’s denotation is given by applying the interp function to the grammatical type. This is checked automatically by encoding this requirement in Coq’s types, and enforced by any proof certificate checker for Coq’s logic. This might seem trivial, but it is worth noting because other implementations of categorial grammars often do not check types. In early experiments we encountered type-related bugs in NLTK’s CCG implementation, and our attempt to directly reproduce an existing use of CCGs for temporal logic specifications [23] failed not because of our prototype’s limitations, but because we encountered cases where the published lexicon entries were inconsistent with the stated grammatical types (e.g., giving a word a grammatical type indicating two arguments, but a logical form which only accepted one). It is possible these were merely typographical errors, as is often found by any mechanization based on a published paper rather than a code artifact [44], but in either event the type system detected the incorrect entries when we attempted to enter them directly from the paper.
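The kind of inconsistency described above, a two-argument grammatical type paired with a one-argument logical form, can be reproduced in a few lines. The following is a sketch with hypothetical names (Cat, interp, TV, divides_sem, bad_sem); only the idea of computing the denotation type from the grammatical type is taken from the text.

```coq
(* Hypothetical, simplified grammatical types. *)
Inductive Cat : Type :=
| CS : Cat                      (* sentence *)
| CNP : Cat                     (* noun phrase referring to a natural number *)
| RSlash : Cat -> Cat -> Cat    (* RSlash a b: yields a after a b on its right *)
| LSlash : Cat -> Cat -> Cat.   (* LSlash a b: yields b after an a on its left *)

(* The Coq type that a denotation of each grammatical type must have. *)
Fixpoint interp (c : Cat) : Type :=
  match c with
  | CS => Prop
  | CNP => nat
  | RSlash a b => interp b -> interp a
  | LSlash a b => interp a -> interp b
  end.

(* A transitive verb: takes an object on its right, then a subject on its left. *)
Definition TV : Cat := RSlash (LSlash CNP CS) CNP.

(* Accepted: "divides" with a two-argument logical form (object, then subject). *)
Definition divides_sem : interp TV :=
  fun (obj subj : nat) => exists k, obj = k * subj.

(* Rejected: a one-argument logical form does not have type interp TV. *)
Fail Definition bad_sem : interp TV := fun (n : nat) => n = 0.
```

Because the expected type is computed by interp, the mismatch surfaces as an ordinary type error at the point where the entry is declared.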

6 A Limited Role for Machine Learning

As discussed earlier, the need for predictability, auditability, and modular lexicons plays to the strengths of categorial grammar. These also correspond to established weaknesses of common statistical and neural approaches to natural language processing, which often handle equivalent expressions in surprisingly incompatible ways (despite intriguing ongoing work to alleviate this [96]), provide no interpretable justification for linguistic choices, and are inherently non-modular (one cannot simply retrofit a small collection of additional words into a large model). Beyond this, neural approaches appear to systematically struggle to deal consistently and accurately with boolean operations (especially negation) [91, 24, 70], which are often critical to formal specifications.

One of the primary advantages of neural and statistical models for natural language processing is that natural language is constantly evolving, with new expressions and terms being invented regularly, and neural and statistical models can often handle unseen words somewhat reasonably by mimicking known usage patterns of other words, despite lacking any ground notion of meaning [12, 57]. However, this advantage has little role to play for formal specifications: because of the heightened certainty requirements, because new terms often carry very specific meanings, and because the natural language used to describe formal properties tends to evolve more conservatively in order to avoid human misunderstandings.

The biggest general advantage of machine learning techniques is their ability to process large amounts of data with reduced human effort. Historically, large databanks of grammatical types and semantics for English [34], German [32], Hindi [3], Japanese [58], French [65], and other languages [1] have been rooted in enormous human efforts to manually label data, either directly or via translation from a corpus manually annotated for another grammar formalism. This has inspired a range of techniques for learning a lexicon for semantic parsing from a small set of initial examples [51, 5, 50, 49]. These techniques could in principle be used to rapidly expand an initially-hand-crafted core lexicon, or eventually to learn a domain- or program-specific lexicon to be added to a base lexicon after auditing. Currently all of these techniques target learning logical forms in first-order logic, and would require adaptation to deal with the indexed grammars required to target a proof assistant’s logic.

One way machine learning could play a major role in this endeavour would be analogous to one of its major roles in traditional semantic parsing, which is in learning an optimal search strategy over derivations from a large corpus of complete derivations [20, 18, 19]. Such a statistical model over likely parse structures can be used to dramatically accelerate parsing by using a model to choose priorities for proof search, thereby avoiding more dead-ends. Using machine learning in this narrow way could lead to substantial performance gains without compromising the categorial grammar properties relevant for formal specifications: a successful parse still yields a full derivation. In principle it should be possible to learn how to assign rule priorities and/or generate custom (derived) rules. However, a prerequisite for training such models is a large corpus of complete parses, which at least initially would need to be obtained through regular unification alone.

7 Related Work

Both categorial grammars of the form we work with and the use of dependent type theories for natural language semantics have long histories [52, 92, 89], and we are hardly the first to propose reducing the gap between formal or semi-formal specifications and natural language. Our proposal differs from the former primarily in that we argue for employing these not for the study of linguistics, but for the application of linguistics research to build a system for integrating natural language descriptions into the main intended use of proof assistants, including cases where the proof assistant is used to construct proofs in an embedded logic. Our proposal differs from most work on the latter primarily in our focus on employing actual linguistic models of grammar and meaning to extract intent from natural language, rather than using a range of shallow (though often effective) heuristics, in order to afford a higher degree of freedom in expressing expectations. We offer more details on these relationships below.

Others have used type theories like Coq’s for logical forms [89, 78, 79, 16, 11], broadly making the argument that variants of dependent type theory offer a range of appealing options for modeling natural language semantics that fix some perceived deficiencies in the use of a lambda calculus over first-order logic formulas, but consistently focused on using this as a means to study linguistics. The notion of indexing some grammatical categories by the type of a referent in such an underlying type theory comes from Ranta’s work [80] on studying the linguistics of mathematical statements. Ranta [79], Kokke [46] and Kiselyov [43] have formalized variants of categorial grammar with semantics in proof assistants, but only as object logics of study in order to prove properties of those systems, rather than as working parsers integrated with other uses of proof assistants. This leaves much to explore in integrating categorial grammar with various forms of type-theoretical language semantics [17], some of which coincide with common specification patterns.

Others have worked towards using categorial grammars and related techniques to translate natural language into formal specifications in a variety of other logics. Dzifcak et al. [23] used CCG to translate natural language specifications to temporal logic, though as mentioned in Section 5 their semantics contain semantic type errors which are caught by working within a proof assistant that enforces consistency between grammatical and semantic types. Seki et al. [84, 83] is the earliest approach we are aware of, using an alternative grammar formalism (HPSG [77]) to translate natural language to first-order logic. Each of these approaches targets only a single logic, and assumes the translation is divorced from any particular use of formal specifications.

There have also been notable attempts to translate formal specifications into English, such as the support in the KeY theorem prover [38], and a predecessor system [13] that directly uses categorial grammar in reverse for natural language generation [81] (another established use for categorial grammars beyond the semantic parsing we focus on).

There are also other approaches to bringing rigorous formalization closer to natural language, without attempting to capture natural language grammar in a systematic way. Isabelle/HOL’s [69] Isar proof language [94] attempts to make proofs themselves more readable using proof manipulation commands resembling English. This is an example of what is known as controlled natural language [28], a pattern of system development where the input language is a heavily restricted fragment of natural language, usually (though not always) simple enough to enable fully automatic processing, usually by heavily restricting both grammatical constructions and vocabulary. This includes examples like Cramer et al. [22], who have worked on heavily restricted subsets of natural language that address both the specification of lemmas and their proofs, but like Isar do not attempt to capture any general natural language structure, and techniques which focus only on stating formulas in natural language [29]. By contrast, our proposed approach (1) reuses existing proof assistant machinery (typeclasses [86, 85]) rather than requiring specialized support, and (2) aims to (eventually) permit almost arbitrary natural language grammar once an adequate base lexicon is developed (which can then be directly extended by individual proof developments).

8 Looking Forward

We have presented evidence that it is plausible to support natural language specifications in current proof assistants by exploiting existing typeclass machinery, with no additional tooling required. Carried further, this could be useful in many ways. It can reduce the gap between informal and formal specifications, reducing (though not eliminating) trust in the manual formalization of requirements. Potentially non-experts in verification could understand some theorem statements, gaining confidence that a verification result matched their understanding of desired properties. And this could be used in educational contexts to help students learn or check informal-to-formal translations.

Of course, the details matter as well, and it will take time to realize a prototype that is broadly useful. First and foremost, a rich lexicon is required. As explained earlier, at least the initial lexicon will need to be manually constructed (borrowing grammatical categories from existing lexicons, and filling in the semantics) before it would be fruitful to adapt techniques for learning lexicons. Guiding this effort would require a substantial collection of examples of natural-language descriptions of formal claims, both for prioritizing lexicon growth and for validation that the approach is growing to encompass real direct descriptions of claims. Despite the now-enormous body of formalized proofs of program properties and mathematical results, early efforts in this direction have revealed this is less trivial than it seems. Even popular texts introducing formal specification like Software Foundations [74] and classic texts like Type Theory and Functional Programming [90] have remarkably few crisp natural language statements matching a specific formal statement, instead discussing various needs at length in order to motivate eventual details of the final formalization (which is sensible for expository texts). Reynolds’ classic paper introducing separation logic [82] contains no English-language description of any full invariant involving separating conjunction. We do not necessarily require exemplars of full descriptions in a single sentence; it is common for one specification to imply multiple high-level properties, and we envision one style of use for natural language specifications to be checking that a given verified result implies multiple natural language claims which each cover part of the desired results.

It is possible that small differences will be required between standard natural language grammars and those used by this approach, arising from distinctions important to proof assistants but irrelevant to colloquial language. This is already the case, as mentioned, with the indexing of some grammatical categories with the semantic types of referents, following Ranta’s early work on formalizing mathematical prose [80]. This direction offers opportunities to collaborate with linguists working in syntax and compositional semantics [8, 37, 88]. Such collaborations could both help with possible novel linguistic features of “semi-formal” natural language, and offer a setting for applying classical linguistic techniques in a domain where they provide unique value.

A great deal of work lies ahead, but the potential benefits seem to more than justify further exploration in this direction.


  • [1] Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. The parallel meaning bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. In EACL, 2017.
  • [2] Kazimierz Ajdukiewicz. Die syntaktische konnexität. studia philosophica, 1: 1–27. reprinted in storrs mccall, ed., polish logic 1920–1939, 207–231, 1935.
  • [3] Bharat Ram Ambati, Tejaswini Deoskar, and Mark Steedman. Hindi ccgbank: A ccg treebank from the hindi dependency treebank. Language Resources and Evaluation, 52(1):67–100, 2018.
  • [4] Andrew W Appel. Foundational proof-carrying code. In Proceedings of the 16th Annual IEEE Symposium on Logic in Computer Science, LICS 2001, pages 247–256. IEEE, 2001.
  • [5] Yoav Artzi and Luke Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62, 2013.
  • [6] Jason Baldridge and Geert-Jan M. Kruijff. Multi-modal combinatory categorial grammar. In Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics - Volume 1, EACL ’03, pages 211–218, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics. doi:10.3115/1067807.1067836.
  • [7] Yehoshua Bar-Hillel. A quasi-arithmetical notation for syntactic description. Language, 29(1):47–58, 1953.
  • [8] Chris Barker and Pauline Jacobson, editors. Direct compositionality. Oxford University Press, 2007.
  • [9] Andrej Bauer, Jason Gross, Peter LeFanu Lumsdaine, Michael Shulman, Matthieu Sozeau, and Bas Spitters. The hott library: a formalization of homotopy type theory in coq. In Proceedings of the 6th ACM SIGPLAN Conference on Certified Programs and Proofs, pages 164–172, 2017.
  • [10] Daisuke Bekki. Dependent type semantics: An introduction. In Logic and Interactive Rationality (LIRA) Yearbook 2012, Volume 1, pages 277–300. 2012.
  • [11] Daisuke Bekki. Representing anaphora with dependent types. In Logical Aspects of Computational Linguistics - 8th International Conference, LACL 2014, Toulouse, France, June 18-20, 2014. Proceedings, pages 14–29, 2014. doi:10.1007/978-3-662-43742-1_2.
  • [12] Emily M Bender and Alexander Koller. Climbing towards nlu: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, 2020.
  • [13] David A Burke and Kristofer Johannisson. Translating formal software specifications to natural language. In International Conference on Logical Aspects of Computational Linguistics, pages 51–66. Springer, 2005.
  • [14] Bob Carpenter. Type-logical semantics. MIT press, 1997.
  • [15] Bob Carpenter. The turing-completeness of multimodal categorial grammars. JFAK: Essays dedicated to Johan van Benthem on the occasion of his 50th birthday. Institute for Logic, Language, and Computation, University of Amsterdam. Available on CD-ROM at http://turing. wins. uva. nl, 1999.
  • [16] Stergios Chatzikyriakidis and Zhaohui Luo. Natural language inference in coq. Journal of Logic, Language, and Information, 23, 2014.
  • [17] Stergios Chatzikyriakidis, Zhaohui Luo, et al. Modern perspectives in type-theoretical semantics, volume 98. Springer, 2017.
  • [18] Stephen Clark and James R. Curran. Log-linear models for wide-coverage ccg parsing. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Conference on Empirical Methods on Natural Language Processing ’03, pages 97–104, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
  • [19] Stephen Clark and James R. Curran. Wide-coverage efficient statistical parsing with ccg and log-linear models. Computational Linguistics, 33(4):493–552, December 2007.
  • [20] Stephen Clark, Julia Hockenmaier, and Mark Steedman. Building deep dependency structures with a wide-coverage ccg parser. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 327–334, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi:10.3115/1073083.1073138.
  • [21] Edmund M. Clarke, E. Allen Emerson, and A. Prasad Sistla. Automatic Verification of Finite-state Concurrent Systems Using Temporal Logic Specifications. ACM Transactions on Programming Languages and Systems (TOPLAS), 8(2):244–263, 1986.
  • [22] Marcos Cramer, Bernhard Fisseni, Peter Koepke, Daniel Kühlwein, Bernhard Schröder, and Jip Veldman. The naproche project controlled natural language proof checking of mathematical texts. In International Workshop on Controlled Natural Language, pages 170–186. Springer, 2009.
  • [23] Juraj Dzifcak, Matthias Scheutz, Chitta Baral, and Paul Schermerhorn. What to do and how to do it: Translating natural language directives into temporal and dynamic logic representation for goal management and action execution. In 2009 IEEE International Conference on Robotics and Automation, pages 4163–4168. IEEE, 2009.
  • [24] Allyson Ettinger. What bert is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48, 2020.
  • [25] Kate Finney. Mathematical notation in formal specification: Too difficult for the masses? IEEE Transactions on Software Engineering, 22(2):158–159, 1996.
  • [26] Kate M Finney and Alex M Fedorec. An empirical study of specification readability. In Teaching and Learning Formal Methods. Academic Press, 1996.
  • [27] Matthew Fluet and Riccardo Pucella. Phantom types and subtyping. Journal of Functional Programming, 16(6):751–791, 2006.
  • [28] Norbert E Fuchs. Controlled natural language. In Workshop on Controlled Natural Language, CNL. Springer, 2009.
  • [29] Norbert E Fuchs, Uta Schwertel, and Sunna Torge. Controlled natural language can replace first-order logic. In 14th IEEE International Conference on Automated Software Engineering, pages 295–298. IEEE, 1999.
  • [30] Gerhard Jäger. Anaphora and type logical grammar, volume 24. Springer Science & Business Media, 2005.
  • [31] Mike Gordon. From LCF to HOL: a short history. In Proof, Language, and Interaction: Essays in Honour of Robin Milner, pages 169–186. 2000.
  • [32] Julia Hockenmaier. Creating a CCGbank and a wide-coverage CCG lexicon for German. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 505–512. Association for Computational Linguistics, 2006.
  • [33] Julia Hockenmaier and Mark Steedman. CCGbank: User’s manual. Technical report, 2005.
  • [34] Julia Hockenmaier and Mark Steedman. CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33(3):355–396, 2007.
  • [35] Matthew Honnibal, James R Curran, and Johan Bos. Rebanking CCGbank for improved NP interpretation. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 207–215, 2010.
  • [36] Pauline Jacobson. Towards a variable-free semantics. Linguistics and philosophy, 22(2):117–185, 1999.
  • [37] Pauline I Jacobson. Compositional semantics: An introduction to the syntax/semantics interface. Oxford University Press, 2014.
  • [38] Kristofer Johannisson. Natural language specifications. In Verification of Object-Oriented Software. The KeY Approach, pages 317–333. Springer, 2007.
  • [39] Aravind K. Joshi, David J. Weir, and K. Vijay-Shanker. The convergence of mildly context-sensitive grammar formalisms. Technical Report MS-CIS-90-01, University of Pennsylvania (Philadelphia, PA US), Philadelphia, 1990.
  • [40] Ralf Jung, Robbert Krebbers, Jacques-Henri Jourdan, Aleš Bizjak, Lars Birkedal, and Derek Dreyer. Iris from the ground up: A modular foundation for higher-order concurrent separation logic. Journal of Functional Programming, 28, 2018.
  • [41] Makoto Kanazawa. Learnable classes of categorial grammars. CSLI Publications, Stanford University, 1995.
  • [42] Andrew Kennedy. Dimension types. In European Symposium on Programming, pages 348–362. Springer, 1994.
  • [43] Oleg Kiselyov. Applicative abstract categorial grammars in full swing. In JSAI International Symposium on Artificial Intelligence, pages 66–78. Springer, 2015.
  • [44] Casey Klein, John Clements, Christos Dimoulas, Carl Eastlund, Matthias Felleisen, Matthew Flatt, Jay A McCarthy, Jon Rafkind, Sam Tobin-Hochstadt, and Robert Bruce Findler. Run your research: on the effectiveness of lightweight mechanization. In Proceedings of the 39th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 285–296, 2012.
  • [45] Gerwin Klein, June Andronick, Kevin Elphinstone, Toby Murray, Thomas Sewell, Rafal Kolanski, and Gernot Heiser. Comprehensive formal verification of an OS microkernel. ACM Trans. Comput. Syst., 32(1):2:1–2:70, February 2014. doi:10.1145/2560537.
  • [46] Wen Kokke. Formalising type-logical grammar in Agda. In 1st Workshop on Type Theory and Lexical Semantics, 2015.
  • [47] Geert-Jan M. Kruijff and Jason Baldridge. Relating categorial type logics and CCG through simulation, 2000. Unpublished manuscript.
  • [48] Marco Kuhlmann, Alexander Koller, and Giorgio Satta. Lexicalization and generative power in CCG. Computational Linguistics, 41(2):187–219, 2015.
  • [49] Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1512–1523, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
  • [50] Tom Kwiatkowski, Luke S. Zettlemoyer, Sharon Goldwater, and Mark Steedman. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Conference on Empirical Methods on Natural Language Processing, pages 1223–1233. ACL, 2010.
  • [51] L. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2005.
  • [52] Joachim Lambek. The mathematics of sentence structure. The American Mathematical Monthly, 65(3):154–170, 1958.
  • [53] Joachim Lambek. Categorial and categorical grammars. In Categorial grammars and natural language structures, pages 297–317. Springer, 1988.
  • [54] Xavier Leroy. A formally verified compiler back-end. Journal of Automated Reasoning, 43(4):363–446, 2009.
  • [55] Mike Lewis and Mark Steedman. Improved CCG parsing with semi-supervised supertagging. Transactions of the Association for Computational Linguistics, pages 327–338, 2014.
  • [56] Per Martin-Löf and Giovanni Sambin. Intuitionistic type theory, volume 9. Bibliopolis Naples, 1984.
  • [57] William Merrill, Yoav Goldberg, Roy Schwartz, and Noah A Smith. Provable limitations of acquiring meaning from ungrounded form: What will future language models understand? arXiv preprint arXiv:2104.10809, 2021. To Appear in Transactions of the ACL.
  • [58] Koji Mineshima, Ribeka Tanaka, Pascual Martínez-Gómez, Yusuke Miyao, and Daisuke Bekki. Building compositional semantics and higher-order inference system for a wide-coverage Japanese CCG parser. In EMNLP, 2016.
  • [59] Richard Montague. English as a formal language. In Bruno Visentini, editor, Linguaggi nella societa e nella tecnica, pages 188–221. Edizioni di Communita, 1970.
  • [60] Richard Montague. Universal grammar. Theoria, 36(3):373–398, 1970.
  • [61] Richard Montague. The proper treatment of quantification in ordinary English. In Approaches to natural language, pages 221–242. Springer, 1973.
  • [62] Michael Moortgat. Generalized quantifiers and discontinuous type constructors. In Discontinuous Constituency, volume 6 of NATURAL LANGUAGE PROCESSING, pages 181–208. Mouton de Gruyter, 1996.
  • [63] Michael Moortgat. Multimodal linguistic inference. Journal of Logic, Language and Information, 5(3):349–385, Oct 1996. doi:10.1007/BF00159344.
  • [64] Michael Moortgat. Constants of grammatical reasoning. In Constraints and resources in natural language syntax and semantics, pages 195–219. 1999.
  • [65] Richard Moot. A type-logical treebank for French. Journal of Language Modelling, 3(1):229–264, 2015.
  • [66] Glyn Morrill. Discontinuity in categorial grammar. Linguistics and Philosophy, 18(2):175–219, 1995.
  • [67] Glyn V Morrill. Type logical grammar: Categorial logic of signs. Springer Science & Business Media, 2012.
  • [68] Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In International Conference on Automated Deduction, pages 625–635. Springer, 2021.
  • [69] Tobias Nipkow, Lawrence C Paulson, and Markus Wenzel. Isabelle/HOL: a proof assistant for higher-order logic, volume 2283. Springer Science & Business Media, 2002.
  • [70] Lalchand Pandia and Allyson Ettinger. Sorting through the noise: Testing robustness of information processing in pre-trained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1583–1596, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  • [71] Barbara H Partee and Herman LW Hendriks. Montague grammar. In Handbook of logic and language, pages 5–91. Elsevier, 1997.
  • [72] Christine Paulin-Mohring. Inductive definitions in the system Coq: rules and properties. In International Conference on Typed Lambda Calculi and Applications, pages 328–345. Springer, 1993.
  • [73] Lawrence C Paulson. Logic and computation: interactive proof with Cambridge LCF, volume 2. Cambridge University Press, 1990.
  • [74] Benjamin C. Pierce, Chris Casinghino, Marco Gaboardi, Michael Greenberg, Catalin Hritcu, Vilhelm Sjoberg, and Brent Yorgey. Software Foundations. 2011–2016.
  • [75] Amir Pnueli. The Temporal Logic of Programs. In FOCS. IEEE, 1977.
  • [76] Robert Pollack. How to believe a machine-checked proof. In Twenty Five Years of Constructive Type Theory, pages 205–220. Oxford University Press, 1998.
  • [77] Carl Pollard and Ivan A Sag. Head-driven phrase structure grammar. University of Chicago Press, 1994.
  • [78] Aarne Ranta. Intuitionistic categorial grammar. Linguistics and Philosophy, 14(2):203–239, 1991.
  • [79] Aarne Ranta. Type-theoretical Grammar. Oxford University Press, Inc., New York, NY, USA, 1994.
  • [80] Aarne Ranta. Context-relative syntactic categories and the formalization of mathematical text. In International Workshop on Types for Proofs and Programs, pages 231–248. Springer, 1995.
  • [81] Aarne Ranta. Grammatical framework. Journal of Functional Programming, 14(2):145–189, 2004.
  • [82] John C Reynolds. Separation logic: A logic for shared mutable data structures. In Proceedings of the 17th Annual IEEE Symposium on Logic in Computer Science, LICS 2002, pages 55–74. IEEE, 2002.
  • [83] Hiroyuki Seki, Tadao Kasami, Eiji Nabika, and Takashi Matsumura. A method for translating natural language program specifications into algebraic specifications. Systems and computers in Japan, 23(11):1–16, 1992.
  • [84] Hiroyuki Seki, Eiji Nabika, Takashi Matsumura, Yujii Sugiyama, Mamoru Fujii, Koji Torii, and Tadao Kasami. A processing system for programming specifications in a natural language. In [1988] Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences. Volume II: Software track, volume 2, pages 754–763. IEEE, 1988.
  • [85] Daniel Selsam, Sebastian Ullrich, and Leonardo de Moura. Tabled typeclass resolution. arXiv preprint arXiv:2001.04301, 2020.
  • [86] Matthieu Sozeau and Nicolas Oury. First-class type classes. In International Conference on Theorem Proving in Higher Order Logics, pages 278–293. Springer, 2008.
  • [87] Mark Steedman. The Syntactic Process. The MIT Press, 2001.
  • [88] Mark Steedman. Taking scope: The natural semantics of quantifiers. MIT Press, 2012.
  • [89] Göran Sundholm. Proof theory and meaning. In Handbook of philosophical logic, pages 471–506. Springer, 1986.
  • [90] Simon Thompson. Type Theory and Functional Programming. Addison-Wesley, 1999.
  • [91] Aaron Traylor, Roman Feiman, and Ellie Pavlick. And does not mean or: Using formal languages to study language models’ representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2021.
  • [92] Johan van Benthem. Categorial grammar and type theory. Journal of Philosophical Logic, 19(2):115–168, 1990.
  • [93] Adrienne Wang, Tom Kwiatkowski, and Luke Zettlemoyer. Morpho-syntactic lexical generalization for CCG semantic parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1284–1295, 2014.
  • [94] Markus Wenzel. Isar—a generic interpretative approach to readable formal proof documents. In International Conference on Theorem Proving in Higher Order Logics, pages 167–183. Springer, 1999.
  • [95] Jeannette M Wing. A specifier’s introduction to formal methods. Computer, 23(9):8–22, 1990.
  • [96] Yuhao Zhang, Aws Albarghouthi, and Loris D’Antoni. Certified robustness to programmable transformations in LSTMs. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1068–1083, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.