QED at Large: A Survey of Engineering of Formally Verified Software

03/13/2020 ∙ by Talia Ringer, et al. ∙ Association for Computing Machinery, The University of Texas at Austin, University of Washington, Yale-NUS College

Development of formal proofs of correctness of programs can increase actual and perceived reliability and facilitate better understanding of program specifications and their underlying assumptions. Tools supporting such development have been available for over 40 years, but have only recently seen wide practical use. Projects based on construction of machine-checked formal proofs are now reaching an unprecedented scale, comparable to large software projects, which leads to new challenges in proof development and maintenance. Despite its increasing importance, the field of proof engineering is seldom considered in its own right; related theories, techniques, and tools span many fields and venues. This survey of the literature presents a holistic understanding of proof engineering for program correctness, covering impact in practice, foundations, proof automation, proof organization, and practical proof development.




1.1 Challenges at Scale

Scaling up leads to new challenges and additional demand for tool support in proof development and maintenance. For example, users may have to reformulate properties to facilitate library reuse [Hales2017], or to encode data structures in specific ways to aid in automation of proofs about them [Gonthier2008]. Proof development environments need to allow users to efficiently write, check, and share proofs [Faithfull2016]; proof libraries need to allow easy search and seamless integration of results into local developments [Gauthier2015]. Evolving projects face the possibility of previous proofs breaking due to seemingly unrelated changes, justifying design principles [Woos2016] as well as support for quick error detection [Celik2017] and repair [Ringer2018].

The research community has answered these challenges with theories, techniques, and tools for proofs of program correctness that scale—all of which fall under the umbrella of proof engineering, or software engineering for proofs. Many of these techniques draw inspiration from work in software engineering on large-scale development practices and tools [Klein2014]. However, even with close conceptual ties between construction of programs and proofs, research in software engineering requires careful translation to the world of formal proofs. For example, proof engineers can benefit from regression testing techniques by considering lemmas and their proofs in place of tests, as in regression proving [Celik2017]; yet, the standard metric used to prioritize regression tests—statement coverage—has no clear analogue for lemmas with complex conditions and quantification.
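As a rough illustration of the regression-proving idea, the sketch below picks which lemmas to re-check after a change by following a dependency graph, much as regression test selection picks tests. The graph, the lemma names, and the `affected_lemmas` helper are hypothetical and are not taken from [Celik2017].

```python
from collections import deque


def affected_lemmas(deps, changed):
    """Return every item that transitively depends on a changed item.

    `deps` maps each lemma or definition to the names it uses directly.
    """
    # Invert the graph: for each name, record who uses it.
    users = {}
    for name, used in deps.items():
        for u in used:
            users.setdefault(u, set()).add(name)

    # Breadth-first search outward from the changed items.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for user in users.get(current, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


# Hypothetical project; all names are made up for illustration.
deps = {
    "app_nil": ["app"],
    "app_assoc": ["app"],
    "rev_app": ["rev", "app_assoc"],
    "rev_involutive": ["rev_app"],
    "length_app": ["length", "app"],
}

# Only proofs downstream of `rev` need to be checked again.
to_recheck = affected_lemmas(deps, ["rev"])
```

Selection by dependency is the easy part; as the paragraph above notes, prioritizing among the selected proofs is harder because there is no obvious counterpart to statement coverage.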

This survey serves to gather these theories, techniques, and tools into a single place, drawing parallels to software engineering, and pointing out challenges that are especially pronounced in proof development. It discusses the problems engineers encounter when verifying large systems, existing solutions to these problems, and future opportunities for research to address underserved problems.

1.2 Scope: Domain and Literature

We consider proof engineering research in the context of interactive theorem provers (ITPs) or proof assistants (used interchangeably with ITPs in this survey) that satisfy the de Bruijn criterion [Barendregt2002, Barendregt2351], which requires that they produce proof objects that a small proof-checking kernel can verify; the general workflow of such tools is illustrated in Figure 1.1. That is, we consider proof assistants such as Coq [coq], Isabelle/HOL [isabelle], HOL Light [hollight], and Agda [agda]; we do not consider program verifiers, theorem provers, and constraint solvers such as Dafny [Leino2010], ACL2 [acl2], and Z3 [z3] except when contributions carry over. We focus on proof engineering for software verification, but consider contributions from mathematics and other domains when relevant.
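The de Bruijn criterion can be sketched with a toy kernel. The example below is a minimal checker for implicational propositional logic, where proof objects are lambda terms; the term encoding and the `infer` and `check` functions are assumptions made for illustration and do not correspond to the kernel of any proof assistant named above.

```python
def infer(term, ctx=None):
    """Kernel: return the formula that `term` proves, or raise ValueError.

    Formulas are atom names (strings) or ("->", premise, conclusion).
    Proof terms are ("var", h), ("lam", h, formula, body), ("app", f, a).
    """
    ctx = ctx or {}
    tag = term[0]
    if tag == "var":  # use a hypothesis
        _, name = term
        if name not in ctx:
            raise ValueError(f"unknown hypothesis {name}")
        return ctx[name]
    if tag == "lam":  # implication introduction
        _, name, hyp, body = term
        return ("->", hyp, infer(body, {**ctx, name: hyp}))
    if tag == "app":  # implication elimination (modus ponens)
        _, fn, arg = term
        fn_type = infer(fn, ctx)
        arg_type = infer(arg, ctx)
        is_arrow = isinstance(fn_type, tuple) and fn_type[0] == "->"
        if not is_arrow or fn_type[1] != arg_type:
            raise ValueError("ill-formed application")
        return fn_type[2]
    raise ValueError(f"unknown proof constructor {tag}")


def check(term, claim):
    """Accept `term` as a proof of `claim` only if the kernel agrees."""
    try:
        return infer(term) == claim
    except ValueError:
        return False


# Claim: A -> ((A -> B) -> B), with a proof object for it.
claim = ("->", "A", ("->", ("->", "A", "B"), "B"))
proof = ("lam", "a", "A",
         ("lam", "f", ("->", "A", "B"),
          ("app", ("var", "f"), ("var", "a"))))

# A bogus proof object that refers to a hypothesis that does not exist.
bogus = ("lam", "a", "A", ("var", "b"))
```

However the proof object was produced (by hand, by tactics, or by heavy automation), only this small checker needs to be trusted.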

Sometimes, the key design principles in engineering a large program verification effort are not the focus of the most well-known publications on the effort. Instead, they can be in less standard references such as workshop papers [Komendantskaya2012, Blanchette2013, CompanyCoq2016, Mulhern06proofweaving], invited talks [WenzelIsabelleFuture], blog posts [VerifiedCryptoFirefox], and online documents [LeroyDeepSpecSS17, WenzelScalingIsabelle]. One purpose of this survey is to bring such design principles front-and-center. Naturally, we shall aim to survey the relevant literature with our best effort to provide accurate and thorough citations. To that end, we will not hesitate to cite both traditional research papers in well-known venues and relevant discussions in less traditional forms, without further distinction among them.

1.3 Overview


Figure 1.1: Typical proof assistant workflow, adapted from pa-history-geuvers-sadhana09. (Diagram labels: logic engine, proof checker, proof assistant.)

After motivating chapters (Chapters 2 and 3), this survey discusses the history and foundations of proof assistants (Chapter 4). It then surveys proof engineering research under three headings: languages and automation (Chapter 5), proof organization and scalability (Chapter 6), and practical proof development and evolution (Chapter 7). At a glance, Chapter 5 concerns proof automation approaches and languages, Chapter 6 concerns methods to express and organize programs and proofs, and Chapter 7 concerns development processes and tools. Each of these three chapters is divided into sections; each section surveys a more granular area of proof engineering research, then concludes with a discussion of opportunities for future work within that area when applicable. The survey concludes (Chapter 8) with a discussion of opportunities for future work within proof engineering more broadly. In the case of factual errors, errata may be found at https://proofengineering.org.

1.4 Reading Guide

This survey aims to reach a broad audience of researchers, proof engineers, and community members who are interested in understanding, using, or contributing to proof engineering research. Readers need not be deterred for lack of background knowledge. It is not always necessary to understand previous chapters in order to understand later chapters; readers should feel free to skip sections or chapters, or to consult later chapters or cited resources for more information. This guide lists topics with which basic familiarity is helpful in order to get the most out of the referenced chapters (all chapters unless otherwise specified), along with resources (cited next to each topic) for the interested reader:

  • Programming languages, type systems, and metatheory [pierce2002types, harper2016practical], including:

    • ITPs [pa-history-geuvers-sadhana09, Harrison2014], especially:

      • Coq [CPDT, Pierce-al:SF, CoqArt]

      • Isabelle/HOL [wenzel2004isabelle, Nipkow2014concrete]

    • Automated reasoning [Bradley2007, Kroening2008] (Chapters 3 and 5)

    • The Curry-Howard correspondence [pfenning2010curry, sorensen2006lectures]

    • Dependent and inductive types [CPDT]

    • Equality [CPDT, nlab:equality] (Chapters 3 and 4)

    • Compilers [cooper2011engineering] (Chapters 3, 4, and 6)

  • Software engineering [se-canon]

  • Systems [Anderson2014, Cachin2011] (Chapters 3 and 6)

  • Formalized mathematics [ams-proof, nlab:foundation_of_mathematics] (Chapter 3)

3.1 Proof Engineering for Program Verification

We discuss a sample of the domains in which proof engineering has had a large impact: certified compilers (Section 3.1.1), low-level systems software (Section 3.1.2), concurrent and distributed systems (Section 3.1.3), and certified solvers and checkers (Section 3.1.4).

3.1.1 Certified Compilers

Development of certified compilers is a classic application of proof assistants (Section 4.2). In spite of their long history, however, certified compilers for practical and widely used programming languages have only started to appear in the last decade. This is mainly due to the sheer size and complexity of the semantics of these languages—challenges that have necessitated developments in proof engineering.

The impetus for those developments came in 2006 with CompCert, a project that marked a pivotal moment in the history of program verification.

Leroy:POPL06 received the POPL test of time award in 2016 [popl-time], with the release noting its pivotal role:

The paper was (and still is) groundbreaking in that it demonstrates the feasibility of using an interactive theorem prover—specifically, Coq—to both program and formally verify a realistic compiler … [it] made a convincing case that theorem-proving technology is mature enough to be applied to the full functional verification of realistic systems, and in so doing heralded a new age of “big verification.”

Many projects built on CompCert by, for example, using its C semantics, or by targeting C or any of its intermediate representations. Appel2011 developed a program logic based on CompCert’s C semantics, allowing Coq users to prove properties of deeply embedded C programs in Coq. These properties then hold for machine code generated by CompCert, via CompCert’s main correctness result [Appel:BOOK14]. Kaestner2017 described many extensions and enhancements to the basic compilation toolchain of CompCert, e.g., translation validation of the process of linking machine code to produce object files and executable files. Cao2018 presented a CompCert-based C program verification environment called VST-Floyd in Coq based on separation logic, simplifying the process of specifying and verifying properties that hold down to machine code.
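To make the shape of a compiler correctness result concrete, here is a toy sketch: an arithmetic expression language, a compiler to a stack machine, and the semantic preservation property relating them. Everything in it (`evaluate`, `compile_expr`, `run`, `preserves_semantics`) is hypothetical and far simpler than CompCert. A certified compiler proves the property for all source programs in a proof assistant, whereas this sketch can only check it on sample inputs.

```python
def evaluate(expr):
    """Source semantics: `expr` is an int or a tuple (op, left, right)."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r


def compile_expr(expr):
    """Compile an expression to a list of stack-machine instructions."""
    if isinstance(expr, int):
        return [("push", expr)]
    op, left, right = expr
    last = ("add",) if op == "+" else ("mul",)
    return compile_expr(left) + compile_expr(right) + [last]


def run(code):
    """Target semantics: execute instructions and return the top of stack."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if instr[0] == "add" else left * right)
    return stack[-1]


def preserves_semantics(expr):
    """The property a certified compiler would prove for every `expr`."""
    return run(compile_expr(expr)) == evaluate(expr)


sample = ("+", 1, ("*", 2, 3))
code = compile_expr(sample)
```

Because properties proved about the source semantics carry over to the compiled code through such a theorem, a program logic built on the source semantics can yield guarantees about machine code.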

Certified compilers now span broad applications: For example, the Cogent language for system programming and verification is accompanied by a certifying compiler [oconnor2016b] in Isabelle/HOL that produces a proof that the generated C code is correct. The compiler uses a refinement framework (Section 6.1.2) to automate relating the Cogent semantics to the generated code [Rizkallah16]. The Standard ML language variant CakeML [Kumar2014] has a verified compiler in HOL4 with a certified machine-code implementation produced by bootstrapping (applying the compiler to itself). Using the Vellvm [zhao2012] framework in Coq, proof engineers can reason about transformations on the LLVM intermediate language representation.

Certified compilers have also covered new ground with respect to modularity and compositionality. The IMM [Podkopaev2019] memory model modularizes certified compilation from high-level concurrent programming languages to different hardware models. The Bedrock [Chlipala2013] intermediate language and verification environment in Coq for low-level programming contains a notion of macros that can be reasoned about modularly. The Pilsner [Neis2015] certified compiler in Coq from a higher-order ML-like language to machine code supports programs that can, in contrast to CompCert, be compositionally verified. Compositional CompCert [Stewart2015] is a variant of CompCert with a correctness theorem that can be applied compositionally.

Proof engineering for certified compilers has had applications directly to the languages these compilers are verified in. For example, there are certified compilers both from HOL4 [Myreen2012] and from Isabelle/HOL [Hupel2018] to CakeML; it is possible to produce machine code by composing these compilers with the certified CakeML compiler. Both CertiCoq [Anand2017] and Œuf [Mullen2018] describe certified compilers for Coq’s specification language Gallina. CertiCoq is an ongoing project to build a certified compiler from Gallina to machine code, using a hierarchy of custom intermediate languages. Œuf presents a certified compiler from a subset of Gallina to assembly code. Both compilers target CompCert intermediate languages: CertiCoq targets Clight, while Œuf targets Cminor. Each of these compilers provides an alternative to untrusted program extraction (Section 4.4.2).

3.1.2 Low-Level Systems Software

In addition to compilers, low-level software such as operating systems, file systems, and network stacks are building blocks of large software systems. In turn, such software relies on interfaces to hardware, and on hardware behavior. Through considerable effort, proof engineers have specified and verified important pieces of systems software and their hardware bases.

In pioneering work, Klein2009, Klein2014micro developed a small general-purpose OS kernel, called seL4, in the C programming language, with correctness proofs in Isabelle/HOL. The proven properties include correctness of interprocess communication, access-control enforcement, and information-flow noninterference. Many verified extensions to seL4 have been proposed since its inception, e.g., to ensure hard deadlines are met for system calls, which is required for applications in real-time systems [Sewell2017].

Verification of OS kernels has brought new developments in proof engineering. For example, as part of the CertiKOS project, Gu-al:POPL15 presented a framework in Coq for specifying and verifying abstraction layers, which they used to develop several certified OS kernels. In doing so, they introduced the idea of a deep specification (Section 6.2.3). Gu-al:OSDI16 designed and verified a concurrent kernel for the x86 architecture with fine-grained locking.

An Instruction Set Architecture (ISA) provides an important interface to hardware for, e.g., compilers. Fox2010 developed formal specifications (semantics) of the ISA for ARMv7 in HOL4. Armstrong2019 later used a domain-specific language to provide ISA specifications for Isabelle/HOL, HOL4, and Coq for the ARMv8, RISC-V, and CHERI-MIPS architectures. Morrisett2012 modeled a subset of the x86 ISA in Coq, and used it to build a verified checker for a sandbox policy.

Security policies of systems software are apt targets for verification due to the importance of them being correct. Along these lines, Dam2013 proposed a separation kernel (hypervisor) based on the ARMv7 processor architecture and its formalization by Fox and Myreen, and proved, in HOL4, an information flow security property that ensures OS instances can only communicate via explicit channels. Guanciale2016 proved memory virtualization security in HOL4 for ARMv7.

OS kernel subsystems and formats are additional formalization targets: Bishop2006, Bishop2018 defined and validated executable specifications in HOL4 for the TCP/IP stack, and Kell2016 formalized the ELF binary format in HOL4 for executables used in Unix-like operating systems such as Linux.

Verification has also reached file systems, such as FSCQ [Chen2015], a file system with guarantees about crash safety that have been verified in Coq, and DFSCQ [Chajed2017], an efficient crash-safe file system with several verified optimizations. Using the Cogent language and its certifying compiler (Section 3.1.1), Amani16 developed a file system called BilbyFS in Isabelle/HOL with executable code in C; they also implemented and verified the legacy Linux file system ext2. Ridge2015 developed a specification of POSIX file systems in HOL4, which they tested against real-world file system behavior.

3.1.3 Concurrent and Distributed Systems

Concurrent and distributed systems can be difficult to develop, understand, and debug. Researchers have developed many different theories and frameworks for reasoning about such systems in proof assistants. For example, FCSL [Sergey-al:PLDI15] is a framework in Coq for reasoning about fine-grained concurrency, based on a shallow embedding of concurrent programs. The Disel [Sergey2017] framework for distributed separation logic in Coq builds on this, using the shallow embedding approach from FCSL. The formalization is extracted to executable OCaml code and run on real hardware. Two different frameworks [zeller2014, gomes2017verifying] in Isabelle/HOL exist for verifying two different models of CRDTs, replicated datatypes which provide strong eventual consistency guarantees.
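For readers unfamiliar with CRDTs, the sketch below shows one well-known state-based example, a grow-only counter. It is an illustrative assumption, not code from the cited Isabelle/HOL frameworks. Convergence rests on `merge` being commutative, associative, and idempotent, which is the kind of property such frameworks establish by proof rather than by testing.

```python
class GCounter:
    """Grow-only counter: one non-negative slot per replica."""

    def __init__(self, replica_id, counts=None):
        self.replica_id = replica_id
        self.counts = dict(counts or {})

    def increment(self, amount=1):
        """Local update: a replica only ever bumps its own slot."""
        mine = self.counts.get(self.replica_id, 0)
        self.counts[self.replica_id] = mine + amount

    def value(self):
        """The counter's value is the sum over all replica slots."""
        return sum(self.counts.values())

    def merge(self, other):
        """Join of two states: pointwise maximum of the slots."""
        merged = dict(self.counts)
        for rid, n in other.counts.items():
            merged[rid] = max(merged.get(rid, 0), n)
        return GCounter(self.replica_id, merged)


# Two replicas update independently, then exchange state in either order.
a = GCounter("a")
b = GCounter("b")
a.increment(2)
b.increment(3)

ab = a.merge(b)
ba = b.merge(a)
```

Both merge orders produce the same slots, and merging the same state again changes nothing, so replicas that have seen the same updates agree on the value.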

Progress in proof engineering for verification of concurrent and distributed systems has had implications for the formalization of practical programming languages and protocols. This is illustrated by the Jung2017 formalization of the imperative and threaded Rust programming language using the Iris [Jung2018] framework for concurrent separation logic in Coq, and by the Woos2016 implementation and verification of the key correctness property of the Raft consensus algorithm using the Verdi [Wilcox2015] framework for verification of asynchronous message-passing distributed systems.

3.1.4 Certified Solvers and Checkers

Proof engineers have formalized and proven correct automated solvers for first-order logic and other more restricted logics. blanchette2018verified verified a SAT solver in Isabelle/HOL with conventional features such as clause learning. schlichtkrull2019verified verified a purely functional first-order superposition-based solver in Isabelle/HOL and obtained executable code in Standard ML.

Another line of work in Isabelle/HOL formalizes various model checkers, for example for Linear Temporal Logic [esparza2013fully] and for timed automata [wimmer2018verified].

3.2 Proof Engineering for Other Domains

While this survey focuses on proof engineering for program verification, domains outside of program verification encounter similar proof engineering challenges. The solutions that these communities develop have implications for proof engineering for program verification. We discuss these implications for two domains: mathematics (Section 3.2.1) and programming languages metatheory (Section 3.2.2).

3.2.1 Mathematics

Mathematics is a natural application domain for proof assistants. Formalized mathematics is the attempt to formalize mathematical theories in part or in whole using proof assistants, so that proofs can be mechanically checked. Formalized mathematics was one of the first major application domains for proof assistants; several early ITPs and their predecessors were designed with this domain in mind (Section 4.2).

Early formal developments in mathematics arose in the 1990s [zucker1994formalization, van1994checking, bancerek1990fundamental, qed-manifesto-boyer-cade94]. Since then, there have been many notable developments for formalized mathematics, including the Four Color Theorem [Gonthier2008], the Kepler conjecture [Hales2011, Hales2017], the fundamental theorem of algebra [Geuvers2000], Gödel-Rosser incompleteness [OConnor2005], and the Jordan curve theorem [Hales2007]. Other interesting mathematical proofs can be found in a comparative overview of different proof assistants for mathematics [Wiedijk2006]. The QED manifesto [qed-manifesto-boyer-cade94] called for a complete database of formalized mathematics. The UniMath [UniMath] library is an ongoing attempt to formalize foundations of mathematics [Voevodsky2015] in Coq, using homotopy type theory (Section 4.3.2).

This section samples tools and design principles for formalized mathematics, and discusses how they are relevant to program verification.

Design Principles

Many proof developments in mathematics are mature and involve a large community of contributors. Several of these developments have style guides for contributors. For example, the UniMath library has a style guide that serves to make proofs rigorous, easy to port to other proof assistants, and less fragile, and to standardize and improve appearance and readability. Among other things, the style guide prohibits the addition of new axioms, and encourages the use of tactics whose semantics are well-defined. Similarly, the HoTT library [Bauer2017] for homotopy type theory contains a style guide which, among other things, encourages uniform naming principles, outlines methods for defining equivalences, describes how to use axioms uniformly, and encourages the use of tactics that have well-defined relationships with the terms they produce.

In addition to style guides, proof engineers have developed design principles to handle certain kinds of problems common in mathematics. Gonthier2013, for example, outlines a number of techniques used in the proof of the Odd Order Theorem.

Wiedijk2006 compares the styles of seventeen different proof assistants for mathematics. The book is a collection of proofs of the irrationality of √2 from users of each of the proof assistants, and a discussion of each of the proof assistants and the proofs in those proof assistants. This comparison can be useful for understanding the tradeoffs of and design considerations in each of the proof assistants.


Formalized mathematics has also seen the development of tooling to support entire classes of proofs. Notable examples include autarkic computations for algebraic reasoning [Barendregt2002], special support for equational reasoning in proof checkers [Barthe1996], decision procedures for fragments of arithmetic (Section 5.2.1), techniques for reasoning modulo associativity and commutativity [Braibant2011], transport methods (Sections 6.4.2 and 6.4.3), and theory exploration (Section 5.2.1).

Beyond Mathematics

In formalized mathematics, user communities of certain frameworks or libraries adhere to style guides and design principles. These style guides and design principles have different emphases. Proof engineers in communities outside of mathematics may similarly benefit from standardizing some elements of style depending on the desired outcome. On a project-by-project basis, this may make collaboration between proof engineers easier, and limit the accidental introduction of untrusted code.

Style guides and design principles also have implications for proof understanding. Mathematicians are the original proof checkers, so it’s perhaps unsurprising that many mathematics communities emphasize proof understanding by humans. Beyond mathematics, human understanding of proofs communicates information to the reader beyond the theorem statement itself. Furthermore, just like in software engineering, effective collaboration between proof engineers hinges on mutual understanding of the underlying code.

Standardizing style may also make automation easier. For example, by limiting the set of tactics used within a community, the produced proofs are more clearly defined, which may make higher-level automation such as refactoring and repair tools (Section 7.2) less challenging.

Proof engineers in communities outside of mathematics may also benefit from comparative studies across different proof assistants, similar to Wiedijk2006 but for other domains.

Finally, much of the tooling developed for mathematics addresses problems that occur in proof developments outside of mathematics. For example, dealing with equivalences and isomorphisms is a problem that is not exclusive to the domain of mathematics. Many of the techniques and tooling from mathematics that solve this problem may be useful for proof engineers who encounter this same problem in other domains.

3.2.2 Programming Language Metatheory

One large domain of focus is mechanized metatheory: proofs about programming languages. The desire to mechanize theory led to the introduction of the Edinburgh Logical Framework (LF) by Harper1987 (and later in more detail, Harper1993), building on ideas from Automath. LF defined a methodology for encoding and reasoning about a simpler programming language from within the higher-order dependently typed lambda calculus. Mechanizations of metatheory followed shortly after, both in Nuprl (for example, howe1988computational) and in LF (see Harper2007 for an overview of mechanized metatheory in LF, and harrison-reflection for an early history of mechanized metatheory more broadly).

Since then, the domain has grown to reach practical languages: The mechanization  [mechanized-sml] of Standard ML in Twelf formalized the metatheory of a practical language in its entirety. WebAssembly has had a formal semantics from the very beginning, which has been mechanized [watt2018mechanising] in Isabelle/HOL. Simplified languages representing the core underlying theories of Scala [Rompf2016, Amin2017] and of OCaml [owens2008] have been verified. Results from mechanized metatheory have also influenced verification of real compilers, like CakeML (Section 3.1.1).

The success of metatheory has brought with it benchmark suites and design principles. In addition, mechanized metatheory has influenced new additions to the core languages of ITPs. This section describes a small sample of those benchmark suites, design principles, and language features, and discusses how the lessons learned from mechanized metatheory generalize beyond this domain.

Benchmark Suites

Some of the success of mechanized metatheory is attributable to benchmark suites that have clearly established the importance of the domain and set out to define how to measure progress within it. The POPLMark challenge [Aydemir2005] has been particularly influential.

The benchmarks in the POPLMark challenge are proofs of properties of the language System F-Sub [Cardelli1994], which has parametric polymorphism and subtyping. POPLMark highlights specific problems in proof engineering for metatheory, and outlines criteria for evaluating the success of technology that addresses these problems.

15 solutions to the POPLMark challenge remain accessible online [poplmark-website]. Of these solutions, 8 are in Coq, 2 are in Isabelle/HOL, and the remaining 5 are spread across 5 other ITPs. The solutions cover 8 different ways to represent variable binders, one of the problems that POPLMark highlights.

Personal communications [harpersonal, piercenal2] suggest that the solution using Twelf [Pfenning1999] (an LF implementation) was the first solution to solve all of the difficult parts of the challenge. The website notes that this solution demonstrated the benefits of using that framework, including the style of binders it supports, while also sparking an interesting discussion on different ways of specifying the problem across different frameworks. Only Arthur Charguéraud attempted the same solution in the same proof assistant with different styles of binders; while his solutions were inconclusive, they inspired later work on making binders easier to represent [Aydemir2008, Chargueraud2011].

The POPLMark solutions may be thought of as a springboard for later work. They provided information about what the state-of-the-art was at the time, which enabled later researchers to measure progress. Over 300 papers have cited POPLMark since its introduction in 2005.

Still, there is some dissatisfaction with the outcomes of POPLMark. For example, most papers that cite POPLMark focus on the difficulty of dealing with binders, which is just one challenge that proof engineers face when mechanizing metatheory [piercenal]. The List-machine Benchmark [appel2012list], developed in parallel with POPLMark, deemphasizes binders and instead emphasizes connections between proofs and real implementations of compilers. POPLMark Reloaded [abel2017poplmark] emphasizes logical relations proofs. The ORBI [felty2018benchmarks, felty2015benchmarks] benchmarks focus on the tradeoffs of design decisions of different systems for mechanizing metatheory, rather than on different approaches using a given system.

Design Principles

Many papers that cite POPLMark focus on difficulties of dealing with binders. Paper proofs typically use a named representation, where variables are represented by names. These are easy for humans to reason about, but make it difficult for tools to reason about alpha-equivalence. Nominal logics [Aydemir2006, Urban2008] encode names such that equivalent terms are alpha-equivalent. These representations have the benefits of named representations, but without the cost of difficulty reasoning about alpha-equivalence. While the nominal approach is common in Isabelle [Urban2008], only preliminary approaches exist in Coq [Aydemir2006]. Developing such a tool in Coq may be difficult due to the differences in logics between Coq and Isabelle/HOL, as Nominal Isabelle makes use [urban2011] of quotient types [Homeier05] in HOL.
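The difficulty with named representations can be seen in a small sketch. Below, lambda terms carry variable names, so two alpha-equivalent terms are not syntactically equal, and comparing them requires tracking which binder each name refers to. The tuple encoding and the `alpha_eq` function are assumptions made for illustration.

```python
def alpha_eq(t1, t2, env1=None, env2=None, depth=0):
    """Decide alpha-equivalence of two named lambda terms.

    Terms are ("var", x), ("lam", x, body), or ("app", fn, arg).
    Each environment maps a bound name to the depth of its binder.
    """
    env1 = env1 or {}
    env2 = env2 or {}
    if t1[0] != t2[0]:
        return False
    if t1[0] == "var":
        b1, b2 = env1.get(t1[1]), env2.get(t2[1])
        if b1 is None and b2 is None:  # both free: names must match
            return t1[1] == t2[1]
        return b1 == b2  # both must be bound by the same binder
    if t1[0] == "lam":
        return alpha_eq(t1[2], t2[2],
                        {**env1, t1[1]: depth},
                        {**env2, t2[1]: depth},
                        depth + 1)
    # Application: compare function parts and argument parts.
    return (alpha_eq(t1[1], t2[1], env1, env2, depth)
            and alpha_eq(t1[2], t2[2], env1, env2, depth))


id_x = ("lam", "x", ("var", "x"))
id_y = ("lam", "y", ("var", "y"))
first = ("lam", "a", ("lam", "b", ("var", "a")))
second = ("lam", "a", ("lam", "b", ("var", "b")))
```

Here `id_x` and `id_y` denote the same function but differ as data, so every lemma about such terms has to be stated up to `alpha_eq` rather than plain equality.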

In the absence of a nominal tool for Coq, proof engineers may explore tradeoffs between de Bruijn indexes (introduced by debruijn1972 and explored by owens2008 and schafer2015, among others) and a locally nameless representation (introduced by Gordon1993b and explored by Leroy2007, Aydemir2008, and Chargueraud2011, among others). De Bruijn indexes make reasoning about alpha-equivalence simple because alpha-equivalence is definitional equality (Section 4.3.2), but they require shifting operations in the code, which complicates human understanding; Berghofer2007 detail the tradeoffs between de Bruijn indexes and names. Locally nameless attempts to capture the best of both worlds: it uses names for free variables and de Bruijn indexes for bound variables. Consequently, locally nameless requires reasoning about indexes and shifting operations only for locally closed terms.
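A minimal sketch of these representations follows, under an assumed tuple encoding. `to_indices` replaces bound names with indexes and keeps free names, in the locally nameless style. `shift` is the renumbering that pure de Bruijn substitution needs, and `open_term` is the locally nameless operation that instantiates a binder. None of these definitions come from the cited developments.

```python
def to_indices(term, bound=()):
    """Convert a named term so that bound variables become indexes.

    Input terms are ("var", x), ("lam", x, body), or ("app", fn, arg).
    `bound` lists binder names from innermost to outermost.
    """
    tag = term[0]
    if tag == "var":
        name = term[1]
        if name in bound:
            return ("bvar", bound.index(name))  # distance to the binder
        return ("fvar", name)
    if tag == "lam":
        return ("lam", to_indices(term[2], (term[1],) + bound))
    return ("app", to_indices(term[1], bound), to_indices(term[2], bound))


def shift(term, amount, cutoff=0):
    """Add `amount` to every index that points past `cutoff` binders."""
    tag = term[0]
    if tag == "bvar":
        return ("bvar", term[1] + amount) if term[1] >= cutoff else term
    if tag == "fvar":
        return term
    if tag == "lam":
        return ("lam", shift(term[1], amount, cutoff + 1))
    return ("app", shift(term[1], amount, cutoff),
            shift(term[2], amount, cutoff))


def open_term(body, arg, k=0):
    """Replace index `k` in `body` with `arg`, a locally closed term.

    Since `arg` has no dangling indexes, it is inserted without shifting.
    """
    tag = body[0]
    if tag == "bvar":
        return arg if body[1] == k else body
    if tag == "fvar":
        return body
    if tag == "lam":
        return ("lam", open_term(body[1], arg, k + 1))
    return ("app", open_term(body[1], arg, k), open_term(body[2], arg, k))


# Alpha-equivalent named terms become identical data.
id_x = to_indices(("lam", "x", ("var", "x")))
id_y = to_indices(("lam", "y", ("var", "y")))

# Body of `lam x. x (lam y. x)`, opened with the free variable z.
body = ("app", ("bvar", 0), ("lam", ("bvar", 1)))
opened = open_term(body, ("fvar", "z"))
```

After conversion, checking alpha-equivalence is ordinary equality on the data, while `shift` illustrates the index bookkeeping that makes pure de Bruijn terms harder for humans to read.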

LF-based systems like Twelf [Pfenning1999], Delphin [poswolsky2009system], and Beluga [pientka2008programming] use higher-order abstract syntax (HOAS) [Pfenning1988] to simplify reasoning about binders. HOAS gives an encoding of binders in the object language (the language that is reasoned about) as binders in the meta-language (the language of reasoning; in the case of LF, the higher-order dependently typed lambda calculus). Beluga uses ideas from contextual modal type theory [nanevski2008contextual] to further simplify reasoning about HOAS encodings. The Hybrid [ambler2002] tool makes it possible to use HOAS within Isabelle/HOL; capretta2007hybrid describe a version of Hybrid for Coq, and felty2012hybrid describe how to use Hybrid to reason about an object language in a manner similar to Twelf. felty2010hoas compare Twelf, Beluga, and Hybrid on case studies of metatheory using HOAS, and present a set of challenge problems that highlight the differences between these systems. Variants of HOAS such as weak HOAS [Ciaffaglione2012] and parametric higher-order abstract syntax (PHOAS) [Chlipala2008] make HOAS more tractable in general-purpose ITPs like Coq.
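The core idea of HOAS can be sketched outside of any logical framework by letting the host language's functions play the role of object-language binders. In the assumed encoding below, a lambda abstraction stores a Python function, so substitution is just function application and no renaming or shifting code is needed. The constructors `Lam`, `App`, and `Var` and the helpers `eval_term` and `show` are hypothetical. Unlike LF, Python does nothing to rule out functions that inspect their argument, so this only conveys the flavor of the technique.

```python
def Lam(fn):
    """Object-language binder, represented by a Python function on terms."""
    return ("lam", fn)


def App(fn, arg):
    return ("app", fn, arg)


def Var(name):
    """A free variable of the object language."""
    return ("free", name)


def eval_term(term):
    """Reduce a term to weak head normal form."""
    if term[0] == "app":
        fn = eval_term(term[1])
        if fn[0] == "lam":
            # Beta reduction: substitution is Python function application.
            return eval_term(fn[1](term[2]))
        return ("app", fn, term[2])
    return term


def show(term, depth=0):
    """Print a term, inventing names x0, x1, ... for binders.

    Assumes free variables are not themselves named x0, x1, and so on.
    """
    if term[0] == "free":
        return term[1]
    if term[0] == "lam":
        name = "x" + str(depth)
        body = term[1](("free", name))
        return "(fun " + name + " => " + show(body, depth + 1) + ")"
    return "(" + show(term[1], depth) + " " + show(term[2], depth) + ")"


identity = Lam(lambda x: x)
const = Lam(lambda x: Lam(lambda y: x))

# (const a) b reduces to a without any explicit substitution function.
result = eval_term(App(App(const, Var("a")), Var("b")))
```

The price appears in `show`: to inspect a binder's body, one has to apply the function to a fresh variable, which hints at why analyzing HOAS terms in general-purpose ITPs calls for variants such as weak HOAS or PHOAS.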

There is a small amount of work on design principles that address the concerns of POPLMark beyond dealing with binders. For example, Engineering Formal Metatheory [Aydemir2008] identifies specific lemmas that are useful and discusses the organization of theorems, proofs, and automation. It also introduces cofinite quantification of free variables in inductive relations—defining relations that hold on all but finitely many variables, rather than for some fresh variable. This strengthens the premise of the relations, which in turn strengthens inductive hypotheses for proofs.
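The difference between the two styles of quantification can be seen in the typing rule for abstraction in a locally nameless setting. The notation below is an assumption made for illustration: $e^x$ denotes the body $e$ opened with the free variable $x$, $\mathrm{fv}(e)$ is the set of free variables of $e$, and $L$ is some finite set of names chosen when the rule is applied.

```latex
% "Exists-fresh" rule: the premise holds for one suitably fresh x.
\[
\frac{x \notin \mathrm{fv}(e) \cup \mathrm{dom}(\Gamma)
      \qquad \Gamma,\, x{:}T_1 \vdash e^x : T_2}
     {\Gamma \vdash \lambda T_1.\, e : T_1 \to T_2}
\]

% Cofinite rule: the premise holds for every x outside a finite set L.
\[
\frac{\forall x \notin L.\;\; \Gamma,\, x{:}T_1 \vdash e^x : T_2}
     {\Gamma \vdash \lambda T_1.\, e : T_1 \to T_2}
\]
```

With the cofinite rule, induction over a typing derivation yields a hypothesis about all names outside $L$, so a proof can pick whichever fresh name it needs instead of renaming the single name fixed by the first rule.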

Design principles for mechanized metatheory often go hand-in-hand with high-level frameworks such as 3MT (Section 6.3.2), or with domain-specific languages such as Ott and Lem (Section 6.1). Other work in design principles for mechanized metatheory includes an overview of different ways of formalizing language semantics in an ITP for the same language [Bertot2009], and the use of the coinductive partiality monad [Capretta2005] in Agda to define denotational semantics [Danielsson2012].

Beyond Metatheory

Few domains have seen as much movement in the development of design principles for proof engineering as mechanized metatheory. Opinions on the role that POPLMark played in this are mixed [piercenal, personappel]. There is little disagreement that POPLMark was timely: Proof assistants were becoming more usable, and the ongoing development of CompCert (Section 3.1.1) inspired confidence in their usefulness. At the same time, the properties that researchers wanted to prove about their languages were becoming larger and more complex. It was becoming difficult to know that these properties were actually correct, and to maintain confidence in correctness in the face of changes.

POPLMark gave a common platform for experimentation and offered concrete criteria for success in a timely domain. The benchmarks were difficult enough to stress technology, but simple enough that they were easy to understand and that experts could prove them in a few weeks. The work in metatheory that followed POPLMark demonstrated that general-purpose proof assistants really were usable for proving the properties that researchers cared about.

Work to this day continues to use POPLMark as an evaluation metric. Amin2017, for example, introduce a proof technique using definitional interpreters that addresses the open challenge of scaling type soundness proofs to realistic languages, and evaluate the success of this technique using the F-Sub language from the POPLMark benchmarks. This highlights that sometimes, proving a slightly different property and then showing how that relates to the original property can be much simpler than proving the original property directly.

POPLMark suggests that timely benchmark suites are instrumental in bringing the challenges of design for proof engineering to the attention of the research community; in doing so, however, they can narrow the focus to one particularly difficult problem, sometimes to the exclusion of the bigger picture. Domains outside of metatheory can take this into consideration.

In addition, the success of experts using LF and Twelf on the POPLMark benchmarks and in mechanizing a practical programming language suggests that it is worth weighing carefully the tradeoffs of using different ITPs, including ITPs with special support for a given domain. Along those lines, Miller2018 argues that handling of variable bindings should be built into ITPs. It is also worth considering the barriers to adoption by non-experts of tools with which experts have demonstrated success within a domain, and how to overcome those barriers. SASyLF [Aldrich2008], for example, is one attempt to make LF-based ITPs more accessible to students.

3.3 Practical Impact

Proof engineering has already had a large impact on program verification in many domains, including those from Section 3.1. Proof engineers have in recent years verified operating system [Klein2009] and web browser [Jang2012] kernels, machine learning systems [DBLP:journals/corr/SelsamLD17], distributed systems [Woos2016], quantum circuits [rand2017], constraint solvers [blanchette2018verified, schlichtkrull2019verified], compilers [Leroy:POPL06, Kumar2014], and file systems [Chen2015, Amani16].

So far, proof assistants have had the strongest practical impact in systems software. The CompCert verified compiler, sold as a commercial product, is finding applications in embedded systems, such as those used in aviation [CompCert-ERTS-2018]. The BoringSSL library, used in the popular Google Chrome Web browser, recently started to include high-performance cryptographic code in C verified in Coq [Erbsen2019]. The seL4 verified operating system kernel is used in SCADA systems, and aviation and automotive systems [Klein2018].

4.1 Proof Assistant Pre-History

Specification and verification of software systems can be viewed as reducing human, informal notions and reasoning to systematic application of logical principles and axioms. From this perspective, Aristotle’s systematization of the principles of correct reasoning [AristotlePriorAnalytics, sep-aristotle-logic] is arguably the oldest precursor. The proposal of Leibniz1685 to reduce human reasoning to mathematical calculation is a second important step. Leibniz also laid the foundations for symbolic propositional logic, although this was also done independently by, e.g., Boole.

A later important development was the introduction of predicate logic (or predicate calculus) by Frege in the late 19th century [Frege1893, sep-frege]. Two key innovations in Frege’s logical system were (1) the introduction of quantifiers of expressions in propositions, and (2) a notion of proof (sequences of valid inferences) for propositions with quantifiers. Frege was able to capture and prove concepts from number theory in his system from first principles. However, his system included an axiom later shown by Russell to make the system inconsistent [sep-frege-theorem]. Nevertheless, the logics of ITPs based on higher-order logic are reminiscent of Frege’s logic (which included second-order quantification), and his notion of proof is similar to the modern conception.

In the early 20th century, Russell and Whitehead continued Frege’s work of putting mathematics on a firm logical basis. Crucially, this included developing methods for avoiding inconsistencies, e.g., due to unrestricted formation of sets of entities [Russell1918]. In the end, they proved many significant theorems of arithmetic and set theory in their logical system by rigorous inference [Whitehead1997], but relied on axioms that were considered questionable at the time [sep-principia-mathematica]; this is echoed in more recent concerns for philosophical justification of the basis for the logical system underpinning a proof assistant [Barras2010]. Goedel1930 established the connection between truth and provability for first-order predicate logic, showing that proofs of true propositions can always be constructed, in principle (systems of inference rules can be made complete). However, he then subsequently established that even modest extensions of expressibility in first-order logic lead to incompleteness [Goedel1931]: there can be no system that allows constructing proofs for all true propositions. Together with other negative results, e.g., by Tarski1936, this ended the search for a single universal logical system as a foundation for mathematics and all mathematical endeavors.

At roughly the same time, a theoretical basis for computation and computer programs was given by Church1936 in the λ-calculus, and computers were developed more practically by von Neumann and others in the 1940s [vonNeumann1993]. As pointed out by Backus1978, the λ-calculus and computers as described by von Neumann gave rise to two distinct program styles: the functional style is characterized by computational steps as reductions of expressions and an absence of state, and the imperative style is characterized by computation as transitions between complex states and statements that effect such transitions.

Also around that time, Curry1934 observed a connection between axioms and type systems. This and later observations culminated in 1969 (published in Howard1980) with the principle of formulae-as-types, also known as propositions as types, the Curry-Howard correspondence, or the Curry-Howard isomorphism. This principle established the connection between programs and proofs, which provided groundwork for the later development of ITPs.

turing1949 first considered the problem of correctness for an imperative program that computes the factorial of its input by repeated additions. He described how full correctness could be decomposed into verifying assertions associated with certain points in the code (today called invariants), and how to ensure program termination by finding a consistently decreasing quantity (today called a variant or ranking function). However, this work remained obscure, and more systematic approaches for reasoning about imperative programming languages were presented only late in the 1960s [Floyd1967Flowcharts, Hoare:CACM69].

McCarthy1960 proposed a practical realization of the functional style of programming in the form of the Lisp language. McCarthy1963 also highlighted the problem of putting computing and programs on a formal foundation. To this end, he proposed several formalisms for capturing different classes of functions, and showed how to reason about the equivalence of such functions. He also described how datatypes could be constructed recursively and be subject to inductive reasoning. Burstall1969 showed how to reason about more practical programs in the functional style using the principle of structural induction.

Research on logical reasoning using computers initially took two main forms [Warden2009Book]: (1) fully automated proofs of propositions in simple proof systems such as Robinson’s resolution system [Robinson1965], and (2) computer checking of the validity of single steps in human-constructed mathematical proofs, as in the Automath system by de Bruijn [DeBruijn1970, DeBruijn1994]; Automath is notable for representing both propositions and proofs in the same formal system (a variant of the λ-calculus). The former approach is limited by the difficulty (and resulting long machine time) of finding proofs algorithmically and its bounds on expressiveness of propositions, while the latter is limited by the ingenuity (and labor supply) of the humans who construct the proofs that the system checks. The legacy of Automath includes the de Bruijn principle [Barendregt2002], which states that proof-checking programs should be as small and simple as possible to facilitate high assurance and trustworthiness.

4.2 Proof Assistant Early History

In the early 1970s, Milner proposed an approach to computer proofs in between full automation and basic inference checking. One of his insights was that fine-grained automation can be directed by human ingenuity through so-called proof tactics (Section 5.1.1), alleviating the burden on users in Automath-style systems. He also chose an underlying formal system (Scott’s logic of computable functions [Scott1993]) that could represent concepts familiar to computer scientists and programmers, such as integers, lists, and computer programs themselves [Gordon2000]. The first implementation of his approach, called Stanford LCF [Milner1972b], provided a workflow still used in several modern proof assistants, where the user inputs a command (e.g., a single tactic to apply to attempt to reach the current proof goal) and the system executes the command, resulting in a complete proof or in a number of subgoals. Although tactics could be complex, the system guaranteed that a proof reported as finished could be exported and verified independently by an Automath-style checker [Warden2009Book].

Limitations on the flexibility and scalability of Stanford LCF prompted Milner to develop ML (Meta Language), a programming language for use in a new version of LCF. ML was a typed language, and Milner defined a theorem in LCF as an abstract data type whose predefined values were instances of axioms and whose operations were inference rules. This technique, which persists in some proof assistants today, is usually referred to as the “LCF approach,” and it in effect reduces the soundness of inferences in an embedded logical system to the soundness of the type system (and type checking mechanism) of the host language. Adventurous and flexible tactics could be implemented in ML and applied without concern for affecting soundness, although, e.g., termination was not guaranteed.
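The LCF approach can be sketched in a few lines. The following Lean 4 fragment is a toy kernel for an implicational logic, written under stated assumptions: the names `Form`, `Thm`, `ruleAssume`, `ruleImpIntro`, `ruleImpElim`, and `proveSelfImp` are hypothetical, and Lean's `private` constructor (hidden outside the defining file) stands in for ML's abstract types. It is not how any of the systems discussed here is implemented.

```lean
namespace LcfSketch

/-- Formulas of a tiny implicational logic. -/
inductive Form where
  | atom : String → Form
  | imp  : Form → Form → Form
  deriving DecidableEq

/-- A theorem `hyps ⊢ concl`. The constructor is private, so code outside
    this file can only obtain a `Thm` through the rule functions below. -/
structure Thm where
  private mk ::
  hyps  : List Form
  concl : Form

/-- Assumption rule: `a ⊢ a`. -/
def ruleAssume (a : Form) : Thm := ⟨[a], a⟩

/-- Implication introduction: from `Γ ⊢ b` derive `Γ \ {a} ⊢ a → b`. -/
def ruleImpIntro (a : Form) (th : Thm) : Thm :=
  ⟨th.hyps.filter (fun h => h != a), .imp a th.concl⟩

/-- Implication elimination (modus ponens); returns `none` on a mismatch. -/
def ruleImpElim (thImp thA : Thm) : Option Thm :=
  match thImp.concl with
  | .imp a b =>
    if a = thA.concl then some ⟨thImp.hyps ++ thA.hyps, b⟩ else none
  | .atom _ => none

/-- A "tactic" is ordinary untrusted code: it can only combine the rules. -/
def proveSelfImp (a : Form) : Thm := ruleImpIntro a (ruleAssume a)

end LcfSketch
```

However elaborate a function like `proveSelfImp` becomes, every `Thm` it returns was built by the three rule functions, which is the sense in which soundness reduces to the host language's type abstraction.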

The resulting implementation of LCF in ML was called Edinburgh LCF [Milner1979], and was further developed mainly by Paulson and Huet, who enhanced its reasoning capabilities and wrote a compiler for ML to avoid the overhead of interpretation [Gordon2000]. Paulson then went on to develop the Isabelle proof assistant framework [Paulson1994, Paulson2000], and Huet to develop, with Coquand, the first version of the Coq proof assistant [Coquand1985]. Gordon used the last version of the LCF system, called Cambridge LCF, as a basis for the HOL proof assistant [Gordon1993]. The Nuprl proof assistant [Constable1986] also followed in the LCF tradition. Together, these proof assistants comprise the LCF family, and their recent incarnations are now widely used in the research community. Recently developed proof assistants such as RedPRL [redprl] and Lean [deMoura2015] have also joined the LCF family. ML was standardized as Standard ML [Milner1997], and it and its dialects are widely used as implementation languages for proof assistants.

While Automath targeted mathematics, the initial applications of LCF-style systems for verification were in the area of programming languages and compilers. Stanford LCF had case studies for verified compilation of an imperative language to a stack-based language [Milner1972], and equational theories on integers and lists [Newey1973]. Edinburgh LCF had case studies for verified programming language implementations [Cohn1983], and an important use case of HOL was hardware verification [Boulton1992].

pa-history-geuvers-sadhana09 and Harrison2014 provide a more comprehensive description of the history of ITPs.

4.3 Proof Assistant Foundations

The foundational theories of many ITPs are based on some variation of the theory of types, which goes back to Russell and his attempt in the early 1900s to avoid inconsistency in formal systems by forbidding pathological cases such as the set of all sets with some arbitrary property [sep-russell-paradox]. Specifically, Church1940 introduced the simply typed λ-calculus to avoid inconsistencies in the original λ-calculus. This typed calculus, also referred to as Higher-Order Logic (HOL), is the basis of the proof assistants HOL4 and HOL Light.

While the λ-calculus gives an account of computation, the principle of propositions-as-types was a later development, related to the conception of intuitionistic mathematics by Brouwer, Heyting, Kolmogorov, and others [Heyting1956]. A basic tenet of intuitionism (or constructivism) is to only admit mathematical objects that can be mentally constructed from basic principles; postulation of existence or axiomatization is not enough. At the level of logical reasoning, this leads to intuitionists rejecting certain proofs established by an appeal to the law of excluded middle (LEM), which states that all propositions are either true or false. Moreover, intuitionists interpret functions as effective methods of computation rather than, say, relations that satisfy some set of equations. Consequently, many classical mathematical theorems do not hold as typically formulated with such a restricted logic. However, similar theorems turn out to be possible to prove in many cases, as shown, e.g., by Bishop1985. Widely used proof assistants based on intuitionistic type theories, following the tradition of MartinLof1982a, MartinLof1982b, include Coq, Agda, and Lean.

Logical frameworks [Harper1993] support reasoning about many different logics from within a single system. Automath is a logical framework, as is LF (Section 3.2.2). The popular general-purpose proof assistant Isabelle similarly supports many logics, as long as they can be made to conform to the underlying framework for simply-typed higher-order natural deduction. While HOL is the most commonly used logic for Isabelle, other bundled logics include first-order logic with Zermelo-Fraenkel set theory, and constructive type theory [Paulson2000].

4.3.1 Proof Objects

Barendregt2013 characterizes proof assistants according to how they deal with proof objects, i.e., the certificates that some property is true according to the underlying logic. In proof assistants closely related to LCF such as Isabelle, HOL4, and HOL Light, proof objects are normally not represented in full, but constructed and checked piece by piece, i.e., they are ephemeral. In contrast, Coq and Agda produce complete proof objects, although such objects are usually not kept in memory once constructed, but are stored on disk.

The time to construct and validate proof objects (piecemeal) in Isabelle, HOL4, and HOL Light is directly proportional to the object’s size. However, the time to check proofs in Coq, Agda, and other proof assistants that support reflection [Boutin1997], i.e., computational steps in proofs, may not be proportional to proof object size. In effect, one computational step can take as long as all conventional steps combined.

Barendregt2007 uses a formula of the following kind to illustrate the usefulness of reflection and its relation to proof object sizes:

Proving directly requires repeated and tedious use of basic derivation rules. In addition, even if rule application is automated, the proof object will be large. Instead, it is possible to perform the proof indirectly by using computation. To this end, we define:

We then prove by induction on that whenever , we have . We conclude the proof of by rewriting using two equalities, and , and apply the fact about that we just proved. In proof assistants that support reflection, the final proof object contains no trace of the proofs of the two equalities, since they are established using reductions in the logic engine. In proof assistants without reflection, full proofs must be provided for the equalities before they can be used for rewriting, resulting in large proof objects. For example, proof objects in Isabelle/HOL are typically large, but this does not necessarily mean that proof checking is slower overall, since they are ephemeral [Wenzel2015]. In effect, large proof objects in Isabelle can be viewed as a consequence of deliberate design decisions, e.g., concerning how to perform rewriting, computation, and proof checking.
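The general pattern of proof by reflection can be sketched in Lean 4 with a different, self-chosen example (this is not the formula used by Barendregt2007). The names `Even`, `evenb`, and `evenb_sound` are hypothetical. A Boolean function decides the property, a soundness lemma is proved once by induction, and individual instances are then discharged by computation.

```lean
namespace ReflSketch

/-- Evenness as an inductive predicate: a direct proof of `Even n`
    needs about `n / 2` applications of `step`. -/
inductive Even : Nat → Prop where
  | zero : Even 0
  | step : Even n → Even (n + 2)

/-- A Boolean decision procedure for the same property. -/
def evenb : Nat → Bool
  | 0     => true
  | 1     => false
  | n + 2 => evenb n

/-- Soundness of the decision procedure, proved once by induction. -/
theorem evenb_sound : ∀ n, evenb n = true → Even n
  | 0,     _ => .zero
  | 1,     h => by simp [evenb] at h
  | n + 2, h => .step (evenb_sound n (by simpa [evenb] using h))

-- The proof term is just `evenb_sound 100 rfl`; the fifty unfolding
-- steps happen inside the checker and leave no trace in the term.
example : Even 100 := evenb_sound 100 rfl

end ReflSketch
```

The proof object for `Even 100` stays constant in size while checking time grows with the argument, which is the trade-off between proof object size and checking time described above.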

4.3.2 Equality

In logical systems and type theories, there is a conceptual difference between definitional equality, used for proof checking (type checking), and propositional equality, used in expressing statements to prove.

In intensional type theories, such as the early intuitionistic type theory by MartinLof1982a and the Calculus of Constructions [Coquand1988] (CoC), these concepts are completely distinct, which can limit what can conveniently be proven to be equal. For example, in Coq, 0 + n = n follows by definitional equality, while n + 0 = n requires inductive reasoning. In contrast, in extensional type theories, such as that implemented in Nuprl [Constable1986], definitional and propositional equality coincide. However, this means that proof checking is inherently undecidable, since propositional equality can be used to specify undecidable problems. Intuitively, intensionally equal entities are such that they are “constructed in the same way,” while extensionally equal entities “behave in the same way.” Proofs in extensional systems can sometimes be translated into intensional theories after adding a few axioms [Oury2005].
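The gap between the two equalities shows up even for addition on natural numbers. The Lean 4 sketch below defines a hypothetical function `plus` by recursion on its first argument, mirroring the way addition is defined in Coq's standard library (Lean's built-in addition recurses on the second argument instead, so the roles there are swapped).

```lean
namespace EqSketch

/-- Addition by recursion on the first argument. -/
def plus : Nat → Nat → Nat
  | 0,     m => m
  | n + 1, m => plus n m + 1

-- `plus 0 n` computes to `n`, so the two sides are definitionally
-- equal and reflexivity suffices.
example (n : Nat) : plus 0 n = n := rfl

/-- `plus n 0` is stuck on the variable `n`, so the equation only holds
    propositionally and the proof needs induction. -/
theorem plus_zero_right : ∀ n, plus n 0 = n
  | 0     => rfl
  | n + 1 => by simp [plus, plus_zero_right n]

end EqSketch
```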

Even within intensional type theories, not all notions of propositional equality are created equal. Homotopy type theory [univalent2013homotopy] (HoTT), for example, is an intensional type theory [nlab:intensional_type_theory] in which the notion of propositional equality corresponds to type equivalence. A type equivalence between types A and B is a function:

f : A -> B.

for which there exists some function:

g : B -> A.

that is a mutual inverse:

section : ∀ (a : A), g (f a) = a.
retraction : ∀ (b : B), f (g b) = b.

Univalence in HoTT states that propositional equality between types is equivalent to type equivalence between those types. Consequently, in HoTT, it is possible to treat equivalent types as being the same.
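The data of a type equivalence can be packaged as a record. The Lean 4 sketch below uses hypothetical names (`TypeEquiv`, `sect`, `retr`, `notEquiv`); `sect` is used because `section` is a reserved word in Lean. Lean itself does not validate univalence, so this only defines equivalences and does not turn them into equalities between types.

```lean
namespace EquivSketch

/-- A function `f` together with a two-sided inverse `g`. -/
structure TypeEquiv (A B : Type) where
  f    : A → B
  g    : B → A
  sect : ∀ a : A, g (f a) = a
  retr : ∀ b : B, f (g b) = b

/-- Boolean negation is its own inverse, so it is an equivalence
    between `Bool` and itself that differs from the identity. -/
def notEquiv : TypeEquiv Bool Bool where
  f    := fun b => !b
  g    := fun b => !b
  sect := by intro a; cases a <;> rfl
  retr := by intro b; cases b <;> rfl

end EquivSketch
```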

Both CoC and HoTT are intensional. In both of these type theories, propositional equality corresponds to inhabitance of the identity type. However, in HoTT, univalence provides a means of constructing a term of the identity type [escardo2018self] that is not present in CoC. This has implications for other properties of these intensional type theories. For example, in HoTT, as a consequence of univalence, functional extensionality holds: functions can be proven equal merely from the fact that they always return the same values for the same arguments. This is not true in CoC, though it may be consistently assumed as an axiom.
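For comparison, Lean 4 takes a third route: `funext` is a theorem there, derived from its built-in quotient types, so it is neither assumed as an axiom nor obtained from univalence. The sketch below shows the statement and one instance where pointwise equality holds only propositionally.

```lean
-- Pointwise equal functions are equal.
example (f g : Nat → Nat) (h : ∀ x, f x = g x) : f = g :=
  funext h

-- `0 + n` does not reduce to `n` for a variable `n` in Lean, so the two
-- functions are not definitionally equal; `funext` bridges the gap.
example : (fun n : Nat => 0 + n) = (fun n => n) :=
  funext fun n => Nat.zero_add n
```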

There are many other weaker equalities than propositional equality that can be useful for reasoning about programs and systems. McBride2002 proposes a heterogeneous equality relation for type theories where terms can be considered equal despite having different types. As CPDT remarks, researchers are continually discovering new ways for entities such as functions and data to be equal.

4.3.3 Predicativity

The term predicative was first used by Russell1906 to describe so-called propositional functions that define a class, i.e., for which the class actually exists. He distinguished such functions from impredicative functions for which no such class exists [Feferman2005]. For example, the propositional function specifying that a class has itself as member does not define a class, and is thus impredicative. In modern logical systems, enforcing predicativity means that when objects are defined using quantifiers, no such quantifier may be instantiated with the object itself (see Chapter 12 of CPDT).

Isabelle’s meta-logic stays within predicative simple type theory [Paulson2000]. In Coq, the Type universe (including Set) is predicative, but Prop is impredicative. Including both gives users a balance between consistency with common axioms and expressivity: impredicative Type with large elimination (pattern matching that returns terms of type Type) would not be consistent with LEM, which can be added as an axiom in Coq [predicavity1]. On the other hand, there is an informal consensus that the impredicativity of Prop adds expressivity which is useful for expressing most mathematical proofs [predicavity1]. Otherwise, in predicative logic, some proofs are more complex [predicativity2], though it is not known to what extent this has practical implications on what it is possible to express in each logic. Thus, Coq includes an impredicative Prop universe in which large elimination is disabled.
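Lean 4 has a universe structure of the same shape (an impredicative `Prop` below a predicative hierarchy `Type`, `Type 1`, ...), so it can serve to illustrate the distinction. The names `AllProps` and `PolyId` are hypothetical.

```lean
namespace PredSketch

/-- Impredicative: this proposition quantifies over all propositions,
    including itself, and still lives in `Prop`. -/
def AllProps : Prop := ∀ P : Prop, P → P

example : AllProps := fun _ p => p

/-- Predicative: quantifying over all types in `Type` yields a type
    one universe higher. -/
def PolyId : Type 1 := (α : Type) → α → α

def polyId : PolyId := fun _ a => a

-- Declaring `PolyId : Type` instead is rejected with a universe error,
-- so `PolyId` can never be instantiated with itself.

end PredSketch
```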

4.3.4 Definitional Mechanisms

Programs of interest to computer scientists and engineers often involve classic data structures such as lists, trees, and natural numbers. These data structures can be described, e.g., by initial algebras in category theory or in fixpoint theory [Scott1970]. Most proof assistants provide mechanisms for defining such data structures; these mechanisms take one of three forms [Berghofer1999]: (1) axiomatic, (2) inherent, and (3) definitional.

In the first approach, taken by early users of the LCF system, datatype constructors are defined by introducing new axioms, from which induction principles are proved [Paulson1984]. In the second approach, the underlying logic is extended to support custom datatypes, which requires metatheoretic investigation, e.g., as carried out by Coquand1990 for inductive types in CoC, and then implemented in Coq by PaulinMohring1993. In the third approach, datatype support is added on top of already existing mechanisms; this is done by Pfenning1990 for the CoC and by Berghofer1999 for HOL. Church’s classic encoding of numbers as functions repeatedly applying an argument function in the λ-calculus may be considered an example of the definitional approach.
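The difference between an inherent datatype and an encoding built from existing mechanisms can be seen side by side in Lean 4. In this sketch, `N` is a primitive inductive type, while `Church` (a hypothetical name, as are its operations) represents a number as the function that iterates its argument that many times.

```lean
namespace ChurchSketch

/-- Inherent mechanism: the logic provides inductive types directly,
    together with a generated induction principle. -/
inductive N where
  | zero : N
  | succ : N → N

/-- Encoding on top of functions: a numeral applies `s` repeatedly to `z`. -/
def Church : Type 1 := (α : Type) → (α → α) → α → α

def Church.zero : Church := fun _ _ z => z

def Church.succ (n : Church) : Church := fun α s z => s (n α s z)

/-- Read a Church numeral back by iterating `Nat.succ` on `0`. -/
def Church.toNat (n : Church) : Nat := n Nat Nat.succ 0

example : Church.toNat (Church.succ (Church.succ Church.zero)) = 2 := rfl

end ChurchSketch
```

Note that the encoded version comes with no induction principle of its own, which is part of what dedicated definitional packages have to construct.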

Initial support for datatypes in proof assistants only included inductive datatypes, i.e., the minimal solutions to fixpoint equations. Inductive datatypes are arguably the most important, since they facilitate proofs by the fundamental technique of structural induction [Burstall1969, harper2016practical]. However, some applications require coinductive datatypes (maximal solutions), which can be accounted for in most type theories [Coquand1994]. Gimenez1995 initially implemented support for coinductive types and corecursive functions in Coq, while Paulson1997 did the same for Isabelle/HOL.

A long-standing issue in proof assistants is developing mechanisms for quotient types, which are defined by dividing members of an existing type into equivalence classes. Quotients are widely used in mathematical reasoning, in particular in algebra. An initial approach to quotients in Isabelle/HOL was proposed by Slotosch1997, with later alternatives by Paulson2006 and Huffman2013. Cohen2013b proposed an approach to quotient types in Coq that avoids the conventional approach using setoids [GEUVERS2002271], which significantly restricts the scope of rewriting tactics.

The dependently-typed language Cedille [stump2017calculus] makes it possible to define induction principles in a language based on Church-encodings [Church1941], and encodes datatypes in terms of induction principles. This allows for, among other things, zero-cost reuse of functions and proofs across certain datatypes (Section 6.4).

Research on definitional mechanisms is still an active topic. Sozeau2010 designed and implemented a Coq extension for defining functions equationally which compiles definitions down to eliminators for inductive types; this extension was used for the function acc in Chapter 2. Biendarra2017 presented a redesigned Isabelle/HOL library, following the definitional approach, for writing and reasoning about inductive and coinductive datatypes. While Coq and Agda inherently allow nonuniform datatypes, i.e., recursive types whose arguments vary recursively, HOL systems did not support them until the advent of this library, which reduces such definitions to uniform counterparts.

4.3.5 Totality of Functions and Termination

The logic of Church’s simply-typed λ-calculus, HOL, is a logic of total functions. This means that partial functions cannot be directly described in proof assistants based on HOL, such as Isabelle/HOL. Similarly, the Calculus of Inductive Constructions (CIC), which Coq is based on, supports only total functions. Partial functions can still be indirectly encoded in CIC and HOL, for example by (a) returning values in a monad [McBride2015], such as the coinductive delay monad described by Capretta2005, (b) requiring proofs of argument value subset membership as function arguments [CoqArt], (c) letting functions return values in the option type, or (d) capturing functions as inductive relations between input and output.
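Three of these encodings, (b), (c), and (d), can be sketched in Lean 4 for the partial function that returns the first element of a list. The names `headOpt`, `headOf`, and `HeadRel` are hypothetical and chosen for this example.

```lean
namespace PartialSketch

-- (c) Return a value in the option type; the undefined case is `none`.
def headOpt : List Nat → Option Nat
  | []     => none
  | x :: _ => some x

-- (b) Demand a proof that the argument lies in the function's domain.
def headOf : (xs : List Nat) → xs ≠ [] → Nat
  | [],     h => absurd rfl h   -- impossible: `h` claims `[] ≠ []`
  | x :: _, _ => x

-- (d) Capture the function as a relation between input and output.
inductive HeadRel : List Nat → Nat → Prop where
  | intro (x : Nat) (xs : List Nat) : HeadRel (x :: xs) x

example : headOpt [3, 1, 2] = some 3 := rfl
example : headOf [3, 1, 2] (by simp) = 3 := rfl
example : HeadRel [3, 1, 2] 3 := .intro 3 [1, 2]

end PartialSketch
```

The option-typed version pushes the failure case onto every caller, the proof-carrying version pushes a proof obligation onto every caller, and the relational version gives up computation altogether.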

Functions in proof assistants based on intuitionistic type theories like CIC need to be terminating for the sake of consistency. When a function is defined in Coq, for example, termination is automatically proven for cases where functions recurse on a subterm of the input and in other simple cases; in more advanced cases, users must manually prove termination or rely on approaches that use, e.g., sizes of argument terms [Abel2017]. In contrast, functions in HOL are not required to be computable at the outset, and thus do not need to be accompanied by termination witnesses. On the other hand, the uses of functions with unproven termination are somewhat limited.

Requirements for totality and termination are two hurdles that new users of ITPs face. They are constraints even to users familiar with functional programming languages, where no such requirements are typically imposed. For certain functions, arguing and formally proving termination may not even be a key concern. In that spirit, Zombie [Casinghino2014] separates out a logical, terminating fragment from a programmatic, possibly non-terminating fragment, so that the programmer can move freely between those fragments.

In other languages, a common technique to encode such functions in a total setting is to define a “fuel” argument, such as a natural number. Either the fuel argument is empty (0) and the function terminates, or there is enough fuel to continue to, e.g., perform recursive calls. This allows for proving termination by a simple structural argument on the fuel type. When calling the function in some other context, passing “infinite fuel” may be possible, which implicitly trusts that the function always terminates. Jourdan2012 provide a detailed description of the fuel technique in the context of a verified parsing function on a potentially infinite stream of tokens in Coq.
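A minimal Lean 4 sketch of the fuel technique follows, using the Collatz iteration as a function whose termination is not known in general. The name `collatzSteps` is hypothetical, and this is not the parser of Jourdan2012. Recursion is structural on the fuel argument, so the definition is accepted without any termination proof.

```lean
namespace FuelSketch

/-- Number of Collatz steps needed to reach 1 from `n`,
    or `none` if the fuel runs out first. -/
def collatzSteps : Nat → Nat → Option Nat
  | 0,        _ => none                      -- out of fuel
  | fuel + 1, n =>
    if n ≤ 1 then
      some 0
    else
      let next := if n % 2 = 0 then n / 2 else 3 * n + 1
      (collatzSteps fuel next).map (· + 1)   -- one step plus the rest

#eval collatzSteps 100 6   -- some 8
#eval collatzSteps 3 6     -- none

end FuelSketch
```

A `none` result only says that the given fuel was insufficient; it says nothing about whether the underlying computation terminates.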

4.4 Trusted Computing Bases of Proofs and Programs

The concept of a Trusted Computing Base (TCB) was introduced by Rushby1981 in the context of security of computer systems. The basic idea is that the security of a system may be reduced to the security of a proper subset of all system components. If these components behave as expected, the system as a whole is secure. For verified software, security is replaced with correctness, e.g., functional correctness. Kumar2015 divides the TCB into the following categories:

  • formal models of system components (e.g., model of a processor);

  • system components for which there are no explicit formal models (e.g., linker or operating system);

  • tools used to check proofs about the system (e.g., proof assistant).

4.4.1 TCB of Proofs

Pollack1998 considers the question of how to trust specific machine-checked proofs, and by extension, programs that such proofs pertain to. He divides the question into a purely formal part—whether a provided proof is derivable in a given formal system—and an informal part that asks whether the proof has a purported meaning as expressed outside any formal system.

As to the formal part, trusting the proof can be reduced, by computer, to trusting the (implementation of) the proof checker of the formal system; the source code for such proof checkers can be compact and readable. However, Pollack argues that a complicated semantics of the checker’s implementation language can still pose a serious obstacle to trust, and proposes that the language itself should be a logical framework designed to represent formal systems, such as LF [Harper1993] or Isabelle [Paulson1994]. As to the informal part, Pollack points out that understanding specific pieces of mathematics relies on acceptance of previous mathematics, whose trust may be partly due to its wide acceptance.

Based on Pollack’s investigation, Wiedijk2012 defined the notion of Pollack-inconsistency, which is expressed in terms of the mechanisms a proof checker uses to print and parse its formulas (which are what the user ultimately must interpret informally). In particular, Wiedijk argues that a system should always be able to parse formulas it outputs. He then demonstrates that current proof assistants are Pollack-inconsistent to some extent, but outlines how this can be addressed by modifying the implementations of printing and parsing.

With the goal of determining how small a trusted proof checker can be for a practical application, Appel2003 attempted to minimize the size of a proof checker for proof-carrying machine code. The result was less than 2700 lines of code. The Lean theorem prover attempted to minimize the size of the proof-checking kernel from the start [deMoura2015].

coqincoq encoded a limited version of the formal system underlying Coq in Coq itself, and proved strong normalization of its type system. Barras2010 addressed the problem of providing set-theoretical models of CoC, the logic that underpins Coq, with the ultimate goal of ensuring that Coq’s theory is consistent with the theory implicitly or explicitly assumed by most mathematicians. Anand2014 encoded and verified the foundations of the Nuprl proof assistant in Coq. Davis2015 certified the Milawa theorem prover. Kuncar2018 proved the relative consistency of extensions made to the foundations of Isabelle/HOL. Anand2018 encoded Coq’s internal data structures in Coq itself and gave a semantics for type checking, leading up to the MetaCoq project [Sozeau2019] for building verified checking and extraction for Coq.

4.4.2 TCB of Programs

Coq and other similar proof assistants contain logic engines that can execute functions that have been verified. However, execution inside such a logic engine is generally slow compared to execution of native functional programs [Leroy2015], and does not directly support handling of input and output. Instead, to obtain practical verified programs, proof assistant users rely on mechanisms such as program extraction to produce programs that can be integrated into larger systems or executed in conventional runtime environments. However, these mechanisms may increase the trusted base of verified programs.

Program Extraction

Paulin1989, Paulin-Mohring1989 proposed realizability for CoC to Fω. To obtain practical executable programs from Coq functions, Paulin1993 extended the realizability from Fω to ML. Letouzey2003, Letouzey2004 later introduced a new extraction mechanism for Coq which removed several restrictions. This introduced an intermediate language called MiniML, which can be translated to OCaml, Haskell, and Scheme. The new mechanism, argued correct by a conventional proof, was evaluated on several large projects [Berger2005, CruzFilipe2006, Letouzey2008].

Berghofer2002 identified a subset of HOL that can be translated to practical functional languages and implemented code generation for Isabelle/HOL. Haftmann2010 proposed a redesign of the code generation mechanism in Isabelle/HOL; their approach is based on translating HOL to an intermediate language called Mini-Haskell, and then further to Standard ML, Haskell, and OCaml. The correctness argument is reminiscent of that for Coq’s extraction mechanism. Haftmann2013 proposed a data refinement (Section 6.1.2) framework which replaces abstract datatypes with concrete ones, which widens the scope of code generation.

Beyond Extraction and Code Generation

Practical functional programming languages such as OCaml and Haskell lack a fully formal (and machine-checked) semantics. Proof assistant users who want to avoid trusting extraction may use deep embeddings (Section 6.2.5) of target practical programming languages along with language semantics. These embeddings and semantics can then be used in certified compilers (Section 3.1.1), which may include formal models of system components such as processors, to produce verified machine code. However, these approaches may have more restrictions and inconveniences than extraction. Removing or circumventing these restrictions may be fruitful. Continuing to develop and improve certified compilers for Coq like Œuf and CertiCoq, for example, may help proof engineers circumvent extraction to OCaml and Haskell altogether. Instead, proof engineers may be able to directly compile certified programs to machine or assembly code and run those programs directly.

5.1 Styles of Automation

Proof engineers can construct proofs of theorems in a wide variety of ways. There are three common styles of proof automation: writing sequences of proof search tactics (which can be defined either using a metalanguage like Standard ML or a specialized tactic language like Ltac), writing high-level programs in a structured proof language, and using reflection to write proof-checking procedures within the host language itself. When executed, all expressions or commands reduce to primitive inference rule applications in the proof-checking kernel.

Ltac and Isabelle/Isar are examples of a tactic language and a proof language, respectively. Some proof assistants (for example, both Coq and Isabelle) have support for all three styles of automation, either natively or through extensions. However, not all do; Agda, for example, takes a minimalistic approach, supporting only reflection (it is possible to imitate tactics using reflection). Table 5.1 references examples of supported styles of automation for a sample of major proof assistants.

Proof assistant   Metalanguage   Tactic Lang.   Proof Lang.   Reflection
Agda              -              -              -             Possible
Coq               OCaml          Ltac           SSReflect     Possible
Isabelle          ML             Eisbach        Isar          Possible
Nuprl             ML             -              -             Possible
Table 5.1: Examples of styles of automation a sample of major proof assistants support, including some external developments.

These styles often merge, and the lines between them can be blurry. It is possible, for example, to write proofs in the high-level proof language SSReflect in Coq, or to combine this language with Ltac tactics. It is possible to write ML tactics in Isabelle/HOL, and to write tactic-style proofs in Isabelle/HOL that look similar to Ltac proofs in Coq.

This section explores the design space and uses in common proof assistants of languages for different styles of automation: Tactics and tactic languages (Section 5.1.1), proof languages (Section 5.1.2), and reflection (Section 5.1.3). It then concludes with a discussion of future styles of automation (Section 5.1.4).

5.1.1 Tactics & Tactic Languages

LCF introduced the language ML (metalanguage) to let users write high-level proof automation [Gordon1978]. In LCF, theorems are represented using the abstract type thm in ML; the only way to construct an inhabitant of thm is using the axioms and inference rules of the logic. A proof in LCF has the following type in ML:

type proof = thm list -> thm

In other words, a proof is a function that takes a list of hypotheses and, from them, proves the conclusion.

A basic unit in LCF proof automation is the tactic:

type tactic = goal -> (goal list * proof)

The goal type (not shown) represents proof goals. Thus, a tactic is a function that takes a proof goal and then produces a list of new goals which the goal reduces to; such goals are conventionally called subgoals. When no more goals remain, the tactic produces a value of type proof.

Based on this tactic definition, it is possible to define higher-order functions that take tactics as arguments and return new tactics. Milner called such functions tacticals. For example, a collection of tactic combinators may include the tactical repeat with type tactic -> tactic, which repeatedly applies its argument tactic to the proof goal. In Coq, the composition tactical t1; t2 runs the first tactic t1, then runs the second tactic t2 on all goals produced by the first tactic.
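The tactic and tactical types above can also be modeled outside ML. The following is a minimal Python sketch, not LCF's actual implementation: the two-rule logic (true_intro, and_intro), the token check that stands in for ML's abstract thm type, and the names split_tac, trivial_tac, id_tac, then_, orelse, repeat, and prove are all assumptions made for illustration. Here then_ plays the role of Coq's t1; t2.

```python
from typing import Callable, List, Tuple

Formula = tuple                      # e.g. ("true",) or ("and", p, q)
_KERNEL = object()                   # token held only by the inference rules


class Thm:
    """A proved formula; only the inference rules below should build one."""

    def __init__(self, concl: Formula, token: object):
        if token is not _KERNEL:
            raise ValueError("theorems can only be built by inference rules")
        self.concl = concl


def true_intro() -> Thm:
    return Thm(("true",), _KERNEL)


def and_intro(left: Thm, right: Thm) -> Thm:
    return Thm(("and", left.concl, right.concl), _KERNEL)


Proof = Callable[[List[Thm]], Thm]                       # thm list -> thm
Tactic = Callable[[Formula], Tuple[List[Formula], Proof]]


class TacticFailure(Exception):
    pass


def trivial_tac(goal):
    if goal != ("true",):
        raise TacticFailure("goal is not true")
    return [], lambda thms: true_intro()


def split_tac(goal):
    if goal[0] != "and":
        raise TacticFailure("goal is not a conjunction")
    return [goal[1], goal[2]], lambda thms: and_intro(thms[0], thms[1])


def id_tac(goal):
    return [goal], lambda thms: thms[0]


def then_(t1: Tactic, t2: Tactic) -> Tactic:
    """Run t1, then run t2 on every subgoal that t1 produced."""
    def tac(goal):
        subgoals, proof1 = t1(goal)
        results = [t2(g) for g in subgoals]
        new_goals = [g for gs, _ in results for g in gs]

        def proof(thms):
            proved, i = [], 0
            for gs, p in results:        # hand each sub-proof its own slice
                proved.append(p(thms[i:i + len(gs)]))
                i += len(gs)
            return proof1(proved)

        return new_goals, proof
    return tac


def orelse(t1: Tactic, t2: Tactic) -> Tactic:
    def tac(goal):
        try:
            return t1(goal)
        except TacticFailure:
            return t2(goal)
    return tac


def repeat(t: Tactic) -> Tactic:
    """Apply t until it fails; loops forever if t succeeds without progress."""
    def tac(goal):
        return orelse(then_(t, repeat(t)), id_tac)(goal)
    return tac


def prove(goal: Formula, tactic: Tactic) -> Thm:
    subgoals, proof = tactic(goal)
    if subgoals:
        raise TacticFailure(f"{len(subgoals)} subgoal(s) remain")
    return proof([])
```

Under these assumptions, prove(goal, then_(repeat(split_tac), trivial_tac)) splits a nested conjunction of ("true",) leaves into its leaves, closes each leaf, and then replays the stored proof functions to rebuild a Thm for the original goal.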

However, the LCF representation of tactics means that, by definition, the outputted goals are mutually independent—a proof for one is unrelated to proofs for others. In practice, constraints during proof search can apply across subgoals [SpiwackTactic]. For this reason, Coq previously used, up to at least version 8.3 in 2010 [Spiwack2010], a tactic type definition along the following lines:

type proof = thm list -> thm
type tactic = goal * state -> (goal list * state * proof)

Here, the state returned from a tactic call can be used to figure out dependencies between subgoals, such as shared variables.
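To see why threading state matters, consider two subgoals that mention the same unresolved variable. The sketch below is a hypothetical Python model, not Coq's implementation: goals are equations, the state is a dictionary of assignments to metavariables written "?x", and the proof component of the tactic type is omitted for brevity. The names resolve, refl_tac, and solve_all are assumptions of the sketch.

```python
from typing import Dict, List, Tuple

Goal = Tuple[str, object, object]     # ("eq", lhs, rhs)
State = Dict[str, object]             # assignments to shared metavariables


class TacticFailure(Exception):
    pass


def is_meta(term) -> bool:
    return isinstance(term, str) and term.startswith("?")


def resolve(term, state: State):
    """Follow metavariable assignments until reaching an unassigned term."""
    while is_meta(term) and term in state:
        term = state[term]
    return term


def refl_tac(goal: Goal, state: State) -> Tuple[List[Goal], State]:
    """Close lhs = rhs by reflexivity, instantiating a metavariable if needed."""
    _, lhs, rhs = goal
    lhs, rhs = resolve(lhs, state), resolve(rhs, state)
    if lhs == rhs:
        return [], state
    for var, val in ((lhs, rhs), (rhs, lhs)):
        if is_meta(var):
            return [], {**state, var: val}   # return a new state
    raise TacticFailure(f"cannot unify {lhs!r} with {rhs!r}")


def solve_all(goals: List[Goal], state: State) -> State:
    """Run refl_tac on each goal, passing the state from one goal to the next."""
    for goal in goals:
        remaining, state = refl_tac(goal, state)
        if remaining:
            raise TacticFailure("unexpected subgoals")
    return state
```

With independent goals, both ("eq", "?x", 1) and ("eq", "?x", 2) could be closed separately. With the shared state, solving the first fixes ?x, so the second fails, which is the cross-subgoal dependency the state is meant to capture.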

In the early days of proof assistants, users combined custom tactics written in an ML dialect with built-in tactics and tactical combinators to write custom automation [Constable1986, cornes1995coq, paulson1988preliminary, paulson1983tactics]. This tradition for programming proof automation is the default workflow in the HOL4 and HOL Light proof assistants, and remains a possibility in Coq and Isabelle. However, Coq and Isabelle also support writing tactics in tactic languages rather than in the tool’s implementation language.

Tactic-Based Proofs

The original proof development workflow in LCF was to write sequences of tactic calls until no proof goals were left. This style is still prevalent in modern proof assistants. Consider an inductive proof of the theorem app_nil_r in Coq, which states that appending the empty list to any list produces the original list. We can write this using tactics and tacticals:

Theorem app_nil_r : forall (A : Type) (l : list A), l ++ [] = l.
Proof.
  intros. induction l; auto. simpl. rewrite IHl. auto.
Qed.

Executing these tactics produces a Gallina proof term:

(fun (A : Type) (l : list A) =>                         (* hypotheses *)
  list_ind                                              (* induction principle for lists *)
    (fun (l0 : list A) => l0 ++ [] = l0)                (* motive to prove *)
    eq_refl                                             (* base case (by reflexivity) *)
    (fun (a : A) (l0 : list A) (IHl : l0 ++ [] = l0) =>
      (* inductive case (by rewriting) *)
      eq_ind_r (fun (l1 : list A) => a :: l1 = a :: l0) eq_refl IHl)
    l)                                                  (* argument to induction *)
: forall (A : Type) (l : list A), l ++ [] = l.          (* theorem type *)

There is an analogous lemma in the Isabelle/HOL standard library:

lemma append_Nil2 : "append xs [] = xs"
by (induct xs) auto

While it is not typical to do so, using Isabelle/HOL-Proofs, it is possible to reconstruct and inspect a proof object for this proof.

Tactic Languages

Tactic languages allow proof engineers to write custom tactics alongside specifications and proofs, rather than in an implementation language such as Standard ML. We describe two tactic languages—Ltac for Coq and Eisbach for Isabelle—in detail, then conclude with a brief discussion of other tactic languages.


Nearly 20 years ago, Coq introduced the Ltac tactic language [Delahaye2000], which has since become the standard for tactic development in Coq. Ltac is an untyped domain-specific language with support for pattern matching on terms and goals, as well as writing custom tactics and tacticals. The Ltac manual can be found in the Coq documentation [ltac].

To understand Ltac, consider the break_match tactic from the StructTact [structtact] library:

Ltac break_match := break_match_goal || break_match_hyp.

This tactic breaks down match statements, both in goals:

Ltac break_match_goal := match goal with
  | [ |- context [ match ?X with _ => _ end ] ] =>
    match type of X with
      | sumbool _ _ => destruct X
      | _ => destruct X eqn:?
    end
  end.

and in hypotheses:

Ltac break_match_hyp := match goal with
  | [ H : context [ match ?X with _ => _ end ] |- _] =>
    match type of X with
      | sumbool _ _ => destruct X
      | _ => destruct X eqn:?
    end
  end.

Both tactics perform syntactic pattern matching over the goal, using the syntax name : cpattern |- cpattern, where the left cpattern represents hypotheses and the right cpattern represents the conclusion. In break_match_goal, pattern matching looks only in the conclusion; in break_match_hyp, pattern matching looks only in the hypotheses. Both tactics use the context syntax to find all subterms of the term that are match statements, then pattern match on the type of the result, using the destruct tactic to break down those match statements.
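The search that the context pattern performs can be pictured as a traversal over the subterms of a term. The following Python sketch is only an illustration of that idea, not of how Ltac is implemented: terms are modeled as nested tuples whose first component is a head symbol, a match expression is any tuple headed by "match" with its scrutinee in second position, and the traversal order and the absence of backtracking are assumptions of the sketch. The function names are hypothetical.

```python
from typing import Dict, Iterator, Optional

Term = object   # a str atom, or a tuple (head, arg1, arg2, ...)


def subterms(term: Term) -> Iterator[Term]:
    """Yield term and every subterm, outermost first."""
    yield term
    if isinstance(term, tuple):
        for arg in term[1:]:
            yield from subterms(arg)


def find_match_scrutinee(term: Term) -> Optional[Term]:
    """Return the scrutinee of the first match expression inside term."""
    for sub in subterms(term):
        if isinstance(sub, tuple) and sub and sub[0] == "match":
            return sub[1]
    return None


def break_match_target(hyps: Dict[str, Term], concl: Term) -> Optional[Term]:
    """Mirror break_match_goal || break_match_hyp: conclusion first, then hypotheses."""
    found = find_match_scrutinee(concl)
    if found is not None:
        return found
    for body in hyps.values():
        found = find_match_scrutinee(body)
        if found is not None:
            return found
    return None
```

The returned term corresponds to the X that the Ltac version passes to destruct. Because the search depends only on the shape of the terms, hypothesis names such as "H" never appear in the logic, which is the robustness property discussed next.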

The effect of break_match is to simplify tedious but conceptually simple proofs by case analysis, and to do so without relying on the names of hypotheses, which can make proofs likely to break as specifications change [Woos2016]. Consider, for example, an interpreter correctness proof:

Lemma interp_eval : forall op v v', interp op v = Some v' -> eval op v v'.
  unfold interp. intros. destruct op; destruct v;
  try discriminate; inversion H; constructor.

Here, the intros tactic introduces hypotheses named op, v, v', and H. The destruct op; destruct v sequence of tactics then does case analysis on op and v. We can use repeat break_match instead of destruct op; destruct v to simplify this proof:

  unfold interp. intros. repeat break_match;
  try discriminate; inversion H; constructor.

The resulting proof is more concise. It is also less likely to break as specifications change, since it does not depend on the automatically generated names op and v, which may later change (Section 6.2.3).

Ltac was designed to achieve a balance between the flexibility of writing tactics in a powerful language like OCaml and the ease of using built-in combinators to write custom tactics directly inside of the Coq proof assistant. Like ML and OCaml, it is Turing-complete (see Chapter 16 of CPDT). However, it gives proof engineers limited access to underlying features like environment management. This means that proof engineers do not have to deal with low-level issues in OCaml such as managing de Bruijn indexes; proof engineers who want that level of control may write plugins in OCaml.

Ltac2 [ltac2], the next generation of Ltac, is in development. It comes full circle, returning to the ML family of languages.


The tactic language Eisbach [Matichuk2015EisbachAP] for Isabelle was inspired by Ltac. Eisbach is a tactic language that is integrated into the proof language Isabelle/Isar. Using Eisbach, proof engineers write tactics (called proof methods) directly in Isar syntax. The Eisbach manual can be found in matichuk2015eisbach.

To understand Eisbach, consider an example proof method from the Eisbach manual for solving existentials:

method solve_ex =
  (match conclusion in ∃x. Q x for Q ⇒
    ‹match premises in U : Q y for y ⇒
      ‹rule exI [where P = Q and x = y, OF U]››)

This matches the current conclusion with the Q in ∃x. Q x, then looks in the hypotheses for a term that matches Q y for the matched Q and some y, and then calls the introduction rule for existentials with that hypothesis. In other words, if some hypothesis is Q y and the goal is ∃x. Q x, then solve_ex will automatically prove the goal from the hypothesis. The manual shows an example of calling this proof method to solve a proof:

lemma halts p ⟹ ∃x. halts x
  by solve_ex

Like Ltac, Eisbach provides limited access to low-level details. Isabelle proof engineers who want access to these details may write proof methods directly in Isabelle/ML.

Other Tactic Languages

The untyped nature of languages like Ltac may make it difficult to debug custom automation and to provide strong guarantees on custom tactics. Typed tactic languages such as those of the Delphin [poswolsky2009system] and Beluga [pientka2008programming] logical frameworks, the tactic language for VeriML [Stampoulis2010], and Mtac [Ziliani2015] and Mtac2 [Kaiser2018] in Coq address these problems.

PSGraph [lin2016understanding] is a graphical tactic language which aims to make debugging and refactoring of tactics easier. In PSGraph, tactics are flow graphs for proof subgoals, and executing tactics amounts to following the flow graph directly.

The Matita [asperti2007user] proof assistant introduces a language of tinycals to address some challenges that tacticals pose for user interaction. In particular, proof assistants traditionally execute tacticals atomically. This makes it difficult to communicate how the tactical is executed to the user, as well as to debug tacticals and provide useful error messaging when they fail. Tinycals, in contrast, act like traditional tacticals, except that they allow for more fine-grained execution.

Some tactic languages merge the tactic approach with metaprogramming constructs, so that users can write tactics in the proof language itself. In Idris [brady2013idris], it is possible to implement tactics using the elaboration monad, which exposes elaboration to users and enables metaprogramming within Idris [Christiansen2016]. Agda recently replaced its reflection mechanism with one based on Idris’ elaborator reflection [agda-elab].

Similarly, Lean exposes several metaprogramming constructs (including a tactic monad), which enable users to write tactics in Lean itself, to access proof state, and to access internal methods in the underlying C++ codebase [Ebner2017]; this enables proof engineers to write powerful procedures.

Some tools merge the approach of proof by reflection with a tactic language; we discuss these in Section 5.1.3.

5.1.2 Proof Languages

A proof language is a mechanism for structuring and composing propositions, facts, and proof goals. Proof languages generally allow both backward reasoning, going from the proof goal to a new set of goals, and forward reasoning, where new facts are added to the proof context but the goal remains the same. When formulating reasoning steps in a proof language, procedures in lower-level languages, such as tactics, can be invoked explicitly or implicitly. In turn, these procedures can invoke specialized external proof search programs. In contrast to plain unstructured sequences of commands (“tactic soups”), proofs written in proof languages are usually meant to convey key proof ideas, i.e., to be understandable by humans. To ensure readability, proof languages take inspiration from traditional mathematical vernacular. We consider two proof languages—Coq/SSReflect and Isabelle/Isar—in detail, and then briefly discuss other proof languages.


Coq/SSReflect is a proof language for Coq that emphasizes reasoning by rewriting using equalities and proofs by computation via small-scale reflection [Gonthier2010]. It was originally developed by Gonthier in the context of his proof of the four-color theorem [Gonthier2008]. Idiomatic proofs in Coq/SSReflect make use of “bullets” (-, *, or +) to structure proofs similarly to how Isabelle/Isar uses indentation of blocks to indicate structure.

The basis of Coq/SSReflect is that a step in a proof is one of the following:

  • a deduction step that directly constructs parts of a proof, either by backwards or forwards reasoning;

  • a bookkeeping step that performs a management operation on the proof context, e.g., introducing or renaming assumptions;

  • a rewriting step that changes parts of the proof goal or some assumption, either by way of some equality lemma or by computation.

In idiomatic proofs, these kinds of steps are interleaved and tend to be used in equal proportions.

The reflect part of SSReflect refers to the convention of performing deduction steps that translate between symbolic representations, such as expressions involving boolean functions, and logical representations, such as inductive predicates. For example, when a proof goal can be solved by reasoning in propositional logic, we can convert the proof goal to boolean form and perform computation in Coq’s logic engine instead of applying multiple propositional derivations manually. In contrast to general proofs by reflection (Section 5.1.3), which focus on computational efficiency and managing the sizes of proof objects, SSReflect leverages reflection for convenience and user productivity. For example, a conjunct can be reflected to a boolean value that directly computes to true, saving the manual effort of applying tactics.

The proof language contains special syntax for translating between representations, which is called application of view lemmas. Moreover, Coq users traditionally use different commands for rewriting, definition expansion, and partial evaluation. In Coq/SSReflect, all of these tasks are performed via parameters to the rewrite tactic. Rewriting operations can pinpoint specific subterms in the current proof goal through the use of pattern expressions [Gonthier2012] that may mention some constant names for disambiguation but leave others implicit.

Consider the following lemma from the Mathematical Components project whose proof is written in idiomatic SSReflect:

Lemma edivnP : forall m d, edivn_spec m d (edivn m d).
Proof.
rewrite /edivn => m [|d] //=; rewrite -{1}[m]/(0 * d.+1 + m).
elim: m {-2}m 0 (leqnn m) => [|n IHn] [|m] q //=; rewrite ltnS => le_mn.
rewrite subn_if_gt; case: (ltnP m d) => [// | le_dm].
rewrite -{1}(subnK le_dm) -addSn addnA -mulSnr; apply: IHn.
apply: leq_trans le_mn; exact: leq_subr.
Qed.

Here, the first rewrite unfolds the edivn definition, while the last rewrite performs chained rewriting using arithmetic facts, such as that addition is associative (addnA). Explicit names for quantified variables are given after the operator =>, which reduces the chance of brittleness due to reliance on machine-generated variable names. The tactic elim performs induction on the natural number m.


Isabelle/Isar is a proof language for Isabelle that aims for human readability while retaining some symbolism of formal deduction systems [Wenzel2007isar]. It is built on the Isabelle/Pure logic, which is an intuitionistic fragment of HOL. Isabelle/Isar can be understood as an interpreter for block-structured syntax capturing the flow of facts and proof goals [Wenzel2006].

As an example, the Isabelle/Isar manual [wenzel2004isabelle] contains a definition of a group that assumes only a left identity element, along with the following proof that for any group, the identity element of the group is a right identity (we expand the term notation from the version in the reference manual for clarity):

theorem right_unit : x ∘ 1 = x
proof -
  have 1 = x⁻¹ ∘ x by (rule left_inv [symmetric])
  also have x ∘ (x⁻¹ ∘ x) = (x ∘ x⁻¹) ∘ x by (rule assoc [symmetric])
  also have x ∘ x⁻¹ = 1 by (rule right_inv)
  also have 1 ∘ x = x by (rule left_unit)
  finally show x ∘ 1 = x .
qed

Translated directly into English, we can think of this as the following proof (with implicit symmetry of equality):

Proof (right unit). By left inverse, 1 = x⁻¹ ∘ x. By associativity, x ∘ (x⁻¹ ∘ x) = (x ∘ x⁻¹) ∘ x. By right inverse, x ∘ x⁻¹ = 1. By left unit, 1 ∘ x = x. Then by the above, x ∘ 1 = x. ∎

Isabelle/Isar completes and checks this proof much like a human reader would, by making all of the appropriate substitutions. We could render an alternate English proof with all of these substitutions explicit, rather than leaving them to the reader:


We can write x ∘ 1 as x ∘ (x⁻¹ ∘ x) by left inverse, which is (x ∘ x⁻¹) ∘ x by associativity, which is 1 ∘ x by right inverse, which equals x by left unit. Thus, x ∘ 1 = x. ∎

At this point, we are much closer to proof by a sequence of commands. The key difference for readability is that the intermediate goals are explicit in the proof language, so to reconstruct an English proof, the reader does not need to step through tactics one-by-one and track the transformation of the goal.
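The fully explicit chain of equalities can also be written as a calculational proof in other systems. The following Lean 4 sketch is offered only as a comparison and is not taken from the Isar manual: the group structure is passed in as explicit hypotheses (op, e, inv, and the four laws are names assumed here), and right_inv is assumed rather than derived from the left-handed axioms as in the manual's development.

```lean
example {G : Type} (op : G → G → G) (e : G) (inv : G → G)
    (assoc : ∀ a b c : G, op (op a b) c = op a (op b c))
    (left_unit : ∀ a : G, op e a = a)
    (left_inv : ∀ a : G, op (inv a) a = e)
    (right_inv : ∀ a : G, op a (inv a) = e)
    (x : G) : op x e = x :=
  calc op x e = op x (op (inv x) x) := by rw [left_inv]
    _ = op (op x (inv x)) x := by rw [assoc]
    _ = op e x := by rw [right_inv]
    _ = x := left_unit x
```

As in the Isar version, every intermediate term is stated in the proof text, so a reader can follow the argument without replaying the rewriting steps.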

Other Proof Languages

Mathematically-inclined proof languages such as Isar for Isabelle and Czar for Coq [Corbineau2008] were influenced by the language of the Mizar proof system [trybulec1985computer], which tilts more towards natural language than logical symbolism for representing deduction steps. Many proof assistants following the LCF tradition now have Mizar modes; for example, Mizar modes have been implemented in HOL [harrison-mizar] and HOL Light [wiedijk2001mizar], as well as in Coq [giero2003mmode].

Building a proof system specifically for human-readability means that there is less of a barrier for humans and computers to check the same proofs. Following in this spirit, the Formalized Mathematics journal consists entirely of mathematical properties and proofs in Mizar that are automatically translated into English and generated as PDFs, such as the properties of sets [darmochwal1990finite], naturals [bancerek1990fundamental], and reals [JFR1411].

While most proof languages were designed with mathematics in mind, their use has not been confined to mathematics. For example, using Isar is the recommended style for submissions to the Isabelle Archive of Formal Proofs [isabelleafp], which contains more computer science formalizations than mathematics formalizations [Blanchette2015].

The language PSL [Nagashima2017] for Isabelle/HOL allows expressing high-level proof strategies. PSL generates efficient Isar proof scripts from user-written strategies.

5.1.3 Proofs by Reflection

Writing proofs by reflection, that is, calling certified procedures within the host language itself [Allen1990], can be viewed as an alternative to writing proofs in a proof language or using tactics.

Chapter 15 of CPDT [CPDT] illustrates this style of proof in Coq and demonstrates its benefits on a proof that a natural number is even; we present a slightly modified version of that example that is self-contained. Given some inductive predicate for evenness:

Inductive isEven : nat -> Prop :=
| Even_O : isEven O
| Even_SS : forall n, isEven n -> isEven (S (S n)).

we construct a verified function to check evenness:

Fixpoint check_even (n : nat) : option (isEven n) := match n with
| 0 => Some Even_O
| 1 => None
| S (S n') =>
    match check_even n' with
    | Some p => Some (Even_SS n' p)
    | _ => None
    end
end.

For a given n, check_even returns an optional proof of isEven n (None when n is not even). As CPDT notes, this type signature guarantees that it only returns a proof when n actually is even.

Our goal is to write a tactic that uses check_even to prove evenness. To write this tactic, we need to extract the proof from the option type above when possible. We define a dependently-typed function optionOut that does this:

Definition optionOutType (P : Prop) (o : option P) :=
  match o with
  | Some _ => P
  | _ => True
  end.
Definition optionOut (P : Prop) (o : option P) : optionOutType P o :=
  match o with
  | Some pf => pf
  | _ => I
  end.

We then write a tactic that extracts the proof that check_even returns:

Ltac prove_even_reflective :=
  match goal with
  | [ |- isEven ?N] => exact (optionOut (isEven N) (check_even N))
  end.

With this, the following proof goes through:

Theorem even_256 : isEven 256.
Proof. prove_even_reflective. Qed.

and similarly for any other even number; the tactic fails (as expected) for odd numbers. As CPDT notes, the size of the resulting proof term is manageable even for large numbers. This is a particular advantage of this style of proof.
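The division of labor in this example, a procedure that either produces evidence or reports failure, can be modeled in an untyped setting. The sketch below is a loose Python analogue and not a faithful rendering of reflection: Python has no proof-checking kernel and no dependent types, so the guarantee that Coq obtains from the type of check_even is replaced here by a separate checker (conclusion) that validates the derivation. The tuple encoding of derivations and the function names other than check_even are assumptions.

```python
from typing import Optional, Tuple

# A derivation is ("Even_O",) or ("Even_SS", n, derivation_of_isEven_n).
Derivation = Tuple


def check_even(n: int) -> Optional[Derivation]:
    """Build a derivation of isEven n bottom-up, or return None when n is odd."""
    if n < 0:
        raise ValueError("n must be a natural number")
    proof, k = ("Even_O",), 0
    while k + 2 <= n:
        proof = ("Even_SS", k, proof)   # from isEven k conclude isEven (k + 2)
        k += 2
    return proof if k == n else None


def conclusion(proof: Derivation) -> Optional[int]:
    """Return the n that a derivation proves even, or None if it is ill-formed."""
    indices = []
    while proof != ("Even_O",):
        if not (isinstance(proof, tuple) and len(proof) == 3
                and proof[0] == "Even_SS"):
            return None
        indices.append(proof[1])
        proof = proof[2]
    n = 0
    for k in reversed(indices):         # innermost step first
        if k != n:
            return None
        n += 2
    return n


def prove_even_reflective(n: int) -> Derivation:
    """Analogue of the tactic: succeed with evidence or fail loudly."""
    proof = check_even(n)
    if proof is None:
        raise ValueError(f"{n} is not even")
    return proof
```

One difference is worth noting: this sketch materializes the whole derivation, whereas the Coq proof term only mentions the call to check_even and leaves the computation to the kernel, which is why the term stays small.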

The concept of computational reflection predates ITPs; an early history of reflection can be found in Demers95reflectionin, and an early history of its use in theorem proving can be found in harrison-reflection. Accordingly, it is one of the oldest styles of proof automation. Its use in modern proof assistants with support for higher-order logics can be traced back to the 1990s, starting with a proof of the existence of this class of proofs in Nuprl [Allen1990], and following soon after in other ITPs such as LEGO [Pollack1994] and Coq [Boutin1997]; in Coq, this approach predates Ltac [Delahaye2000].

Idris recently replaced its specialized tactic language with a mechanism for reflection called elaborator reflection [Christiansen2016]. This mechanism exposes Idris’ elaborator directly to the programmer, which allows for powerful proof automation. For example, Christiansen2016 demonstrates how to use elaborator reflection to write a mush tactic, which can be used to dispatch many goals in the Idris standard library:

mush : Elab ()
mush =
  do attack
    x <- gensym "x"
    intro x
    try intros
    induction (Var x) andThen auto

Proof by reflection is also the dominant style of proof automation in Agda, which does not support tactics. van2012engineering demonstrate the isEven example from earlier using Agda’s old mechanism for reflection. This mechanism was replaced in 2016 with a reflection mechanism based on Idris’ elaborator reflection. Other proof assistants that support proofs by reflection include HOL4 [fallenstein2015proof], Isabelle/HOL [chaieb2008proof], and Milawa [Davis2015].

Nowadays, there is a trend of integrating the approach of proof by reflection with tactic languages. For example, Cybele [claret2013lightweight] is a plugin for writing reflective tactics in Coq, with support for effects and non-termination. Rtac [Malecha2016] is a reflective tactic language for Coq, which contains specialized automation to make it simpler to write soundness proofs of decision procedures when writing reflective tactics. These mixed approaches enable proof engineers to take advantage of the benefits of both approaches to more easily build efficient automation.

5.1.4 Future Styles of Automation

One drawback of using tactics is that they can sometimes impede proof understanding. In the future, we expect more tools for proof understanding (in addition to existing structured proof languages). For example, a tool could use tactics to find proofs, then simplify the result, or otherwise output a format that is easier to understand.

Along those lines, debugging tactics and tacticals can be difficult, since the execution of tactics and tacticals often is not conducive to fine-grained debugging, and since fully informative debugging of tactics sometimes requires interfacing with multiple languages (such as Ltac and OCaml). Future tactic languages should better support debugging. The continued development of alternative tactic execution models as well as typed and graphical tactic languages may help with both of these problems.

Another opportunity for improvement with existing automation is improved performance of tactics and tactic languages. We expect more exploration of improving tactic performance, both by writing tactics differently and by improving the performance of the underlying engine.

The continued development of tools that integrate several styles of automation may help proof engineers better take advantage of the benefits of each of these approaches.

5.2 Automation in Practice

Both specialized and general-purpose automation help move the burden of proof away from the proof engineer and toward the tooling with which the proof engineer interacts. This section briefly discusses a non-exhaustive sample of automation procedures (Section 5.2.1). It then concludes with a discussion of the future of automation (Section 5.2.2).

5.2.1 Automation Procedures

Automation can be built using any of the various styles of automation (Section 5.1); since these styles of automation overlap, we consider automation by what it achieves, rather than by the style of automation that it utilizes.

Domain-Specific Automation

Domain-specific automation automates proofs within particular domains. For example, the omega [coq-omega] tactic in Coq implements a decision procedure for quantifier-free Presburger arithmetic based on the Omega Test [Pugh1991], an integer programming algorithm. It can automatically prove mathematical statements that can be difficult for Coq users to prove by hand.

Some domains include verifying programs within specific languages [cao2015practical, Ricketts2014], writing mathematical proofs [nipkow1990, slind1994ac, Braibant2011, narboux2004, gregoire2005, agdasemiring], deciding regular expressions [braibant2010], and reasoning about embedded logics such as separation logic [appel2006tactics, mccreight2009practical, Krebbers-al:POPL17].

General-Purpose Automation

General-purpose automation is machinery that is useful across many domains. For example, the break_match tactic that we used as an example for the Ltac tactic language (Section 5.1.1) contains useful machinery to make proofs by case analysis simpler and more robust.

Many proof assistants ship with useful general-purpose automation; third-party tools may build on these. For example, many proof assistants come with automation for inversion and induction [nipkow1989term, mcbride1996inverting, cornes1995automating]. Isabelle/Isar has special support for performing induction proofs [Wenzel2006]; besides specifying the variable to perform induction on, and the induction principle (rule) to use, a user can indicate that certain variables are to be arbitrary, i.e., that they are not bound in the resulting assumptions and proof goals. For example, the following clause opens a proof by strong induction for natural numbers on the expression x - y, where x and some other variable z from the previous context are arbitrary:

proof (induct "x - y" arbitrary: z x rule:less_induct)

In Coq, a similar proof requires building a custom induction principle as a separate lemma.

Hint databases [coq-commands] in Coq store theorems that its other tactics [coq-tactics] such as auto and rewrite can use as hints. For example, the tactic auto with arith tells auto to use the arithmetic theorems defined in the arith database when it tries to solve the goal. Hints can help make proofs not only simpler, but more robust (see Chapter 3.8 of CPDT), though they may negatively impact proof search performance or even cause it not to terminate (see Chapter 13 of CPDT).
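The interaction between hint databases and proof search can be illustrated with a small model. The following Python sketch is hypothetical and much simpler than Coq's auto: goals are opaque strings with no unification, a hint is a pair of premises and a conclusion, and the search is plain depth-bounded backward chaining. It does reflect two documented behaviors, namely that the core database is consulted in addition to any databases named after with, and that search depth is bounded; the data layout and parameter names are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

Hint = Tuple[Tuple[str, ...], str]          # (premises, conclusion)


def auto(goal: str,
         hyps: Set[str],
         hint_dbs: Dict[str, List[Hint]],
         with_dbs: Iterable[str] = (),
         depth: int = 5) -> bool:
    """Depth-bounded backward chaining over hypotheses and selected hints."""
    if goal in hyps:                         # close the goal by assumption
        return True
    if depth == 0:                           # give up instead of looping
        return False
    for db in ("core",) + tuple(with_dbs):
        for premises, concl in hint_dbs.get(db, []):
            if concl == goal and all(
                auto(p, hyps, hint_dbs, with_dbs, depth - 1)
                for p in premises
            ):
                return True
    return False
```

In this model a hint whose premise is its own conclusion makes the search revisit the same goal until the depth bound is exhausted, which gives a small-scale picture of how poorly chosen hints can slow down proof search.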

Some third-party libraries such as the StructTact library [structtact] and the code distributed with CPDT and FRAP [FRAPBook] ship a variety of general-purpose automation that builds on the standard library packaged in one place. Automation from these libraries ranges from machinery to better handle induction such as prep_induction [structtact] and induct [FRAPBook] to powerful tactics like crush [CPDT], which can dispatch many proof obligations automatically. The agda-prelude [agda-prelude] library provides efficient alternatives to automation in the Agda standard library.

Theory exploration (the automatic discovery, and sometimes proof, of theorems for a given theory) is a form of general-purpose automation that first arose in the context of automated theorem proving for mathematics [buchberger2000theory]. This style of automation aims to mimic the way that mathematicians explore theories when writing proofs by hand. While theory exploration began with specialized standalone tools, such tools can be used in tandem with an ITP; dramnesc2015theory, for example, uses the theory exploration tool Theorema [buchberger2006theorema] in combination with Coq to explore the theory of binary trees. The tool Hipster [JohanssonRSC14, DBLP:journals/eceasst/ValbuenaJ15, Johansson2017] for Isabelle/HOL integrates theory exploration directly with an ITP.

Other examples of useful general-purpose automation include simple general-purpose proof automation [coq-tactics, auto2, Lindblad2004], rewriting [coq-tactics, nipkow1989term], solving logical fragments [paulson1999generic, lescuyer2009improving, hurd2003first, kumar1991integrating, busch1994first, dahn1997integration, hurd1999integrating], and techniques for reasoning about executable specifications [barthe2002efficient], as well as an implementation of a generalization of congruence closure to dependent type theory [DBLP:journals/corr/SelsamM17]. In addition, Chapter 6 describes general-purpose automation and tooling for proof reuse (Section 6.4.3), as well as general-purpose automation built on type classes and canonical structures (Section 6.2.1).


Hammers

Hammers are systems for general reasoning over large libraries of formal proofs [Blanchette2016]. Like the verification language F* [Swamy2016] or the congruence closure algorithm [DBLP:journals/corr/SelsamM17] in Lean, hammers leverage automated theorem provers (ATPs) from within an ITP, and they do so while preserving the small trusted bases of ITPs. They are also able to learn from previous proof efforts. In proof assistants, a hammer is exposed as a collection of tactics that in effect comprise a brute-force method for discharging a proof goal.

A hammer for a proof assistant typically has three components:

  1. a premise selector that selects facts (axioms) to be used by ATPs from the large library available to the proof assistant;

  2. a translator that converts the selected facts and proof goal to the restricted logics of the ATPs;

  3. a proof reconstructor that builds proofs accepted by the proof assistants from the evidence provided by the ATPs.

The premise selector arguably has the most challenging task, since the database of facts can be large, and it is difficult to determine whether a fact is relevant to the given proof goal. The standard approach is to leverage machine learning techniques, such as naive Bayes and k-nearest neighbors [Blanchette2016, Czajka2018].

The translator must take into account the particular foundations and features of the proof assistant, such as polymorphic or dependent types, and provide faithful representation in target logics, which may have no types or only monomorphic types.

The proof reconstructor, like the translator, is highly specific to the proof assistant. Reconstructors can use many different approaches of varying robustness, such as ATP proof replay, reflection, or proof assistant source generation, augmented by various heuristics. As a result, reconstruction may sometimes fail.

Implementations of hammers include Sledgehammer for Isabelle/HOL [Blanchette2013], HOL(y)Hammer for HOL Light and HOL4 [Kaliszyk2014], and CoqHammer for Coq [Czajka2018, coqhammer]. Invocations of hammers typically spawn many parallel instances of different ATPs, such as Z3, Vampire, the E theorem prover, and CVC4. Hammer services to proof assistants can also be provided remotely, overcoming local limitations on processing power and memory [Kaliszyk2015].

As an example of applying a hammer, consider a Coq lemma about lists from the StructTact library:

Lemma app_cons_singleton_inv : forall A xs (y : A) zs w,
 xs ++ y :: zs = [w] -> xs = [] /\ y = w /\ zs = [].

Invoking the CoqHammer hammer tactic finds a proof via Z3:

Extracting features...
Running provers (using 8 threads)...
Z3 (nbayes-32) succeeded
- dependencies: List.app_eq_unit

The output also gives the following tactic call to replace the hammer invocation, yielding a proof of the lemma without ATPs:

Proof. Reconstr.rcrush List.app_eq_unit Reconstr.Empty. Qed.
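To make the role of the reported dependency concrete, here is a hand-written proof of the same statement that uses only List.app_eq_unit and basic tactics. This is a sketch for comparison, not the script that CoqHammer produces; the lemma name app_cons_singleton_inv_manual is hypothetical.

```coq
Require Import List.
Import ListNotations.

Lemma app_cons_singleton_inv_manual : forall A xs (y : A) zs w,
  xs ++ y :: zs = [w] -> xs = [] /\ y = w /\ zs = [].
Proof.
  intros A xs y zs w H.
  (* app_eq_unit splits a singleton append into its two possible shapes. *)
  apply app_eq_unit in H.
  destruct H as [[Hxs Hyzs] | [Hxs Hyzs]].
  - (* xs = [] and y :: zs = [w] *)
    inversion Hyzs. auto.
  - (* xs = [w] and y :: zs = [], which is impossible *)
    discriminate Hyzs.
Qed.
```

The manual proof needs one library fact plus a short case analysis, which is the kind of obligation that premise selection followed by reconstruction can discharge without user guidance.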

While CoqHammer generates sequences of calls to custom tactics for reconstruction, Isabelle’s Sledgehammer typically results in calls to the built-in superposition prover metis [Blanchette2011], which works similarly to ATPs such as Vampire.

The usual way to evaluate the effectiveness of a hammer for a particular proof assistant is to apply the hammer on a standard library by replacing proof scripts with invocations of hammer tactics. For example, CoqHammer was able to reprove 44.5% of all results in the Coq standard library [Czajka2018], which is in line with success rates for HOL Light benchmarks (40%). Success rates for benchmarks in Isabelle/HOL can be as high as 70%, for databases with upwards of 100,000 facts [Blanchette2016]. However, these rates do not reflect practical application of hammers in evolving projects, where proof goals may be reformulated based on manual exploration using certain proof strategies.

In contrast to property-based testing [Paraskevopoulou2015] and counterexample generators [Blanchette2010], hammers do not give feedback when ATPs are unable to discharge a proof goal. Consequently, applying hammers does not necessarily lead to progress. On the other hand, hammers do not require decidable properties, generation of datatype instances, or domain knowledge.

Augmentation of hammer components to increase effectiveness and success rates is an active research topic [Blanchette2016b, Wang2017, Peng2017].

5.2.2 Future of Automation in Practice

Hammers have been around in Isabelle for a long time, but until recently, it was not known if a hammer could be implemented for a dependent type theory. We expect more development to follow in the lines of CoqHammer.

The existence of third-party libraries for general-purpose proof automation in many ways mirrors the rise of third-party libraries for other programming languages which supplement the standard library; we expect more of these libraries to come into existence, and we expect existing libraries to grow in popularity. We also expect domain-specific tactics for common domains to continue to develop, and to cover new domains as they arise.

6.1 Property Specification and Encodings

Proof engineers leverage many constructs and notations to express programs and their specifications. For example, Coq offers a single basic language called Gallina for both logical formulas and (computable) functions, while Isabelle offers both an object language (e.g., HOL) and a metalogic with different operators and quantifiers. The specification languages can be extended inside proof assistants by using notations, which provide new syntax for existing concepts; this is crucial for emulating mathematical vernacular, which can aid understanding of formal definitions.

On top of basic specification constructs, sophisticated properties can be expressed using inductive predicates via familiar definitional mechanisms (Section 4.3.4). These predicates can be interpreted as higher-order Prolog programs [CoqArt]. For generality and reuse, collections of such specifications can be abstracted over using mechanisms such as parametric polymorphism [Strachey2000], modules, and type classes (Section 6.2).

6.1.1 Domain-Specific Specification Languages

Sewell2010 presented a domain-specific language called Ott for expressing inductive definitions and inductive properties over such definitions, suited in particular for formalizing programming language semantics. Ott files can be exported to Isabelle/HOL, Coq, and HOL4, using specific annotations for each proof assistant. For example, the regular expression datatype from Chapter 2 can be expressed in Ott as

regexp :: regexp_ ::= {{ com regexp }} {{ coq-universe Type }}
 | 0 :: :: zero | 1 :: :: unit | c :: :: char
 | r + r :: :: plus | r r :: :: times | r * :: :: star

while the last matching rule for the Kleene star becomes

s in L ( r )  s in L ( r * )
----------------------------- :: star_2
s s in L ( r * )

Note that the extra spacing is necessary for Ott’s parser to properly disambiguate the syntax.

The more general proof assistant-agnostic specification language Lem [Mulligan2014] also includes definition of recursive functions and other programming language constructs, as well as a standard library useful for semantic definitions. Ott files can be exported to Lem format and thus incorporated into larger definitions.

6.1.2 Refinement of Programs, Data, and Proofs

Stepwise program refinement is the construction of a program by a sequence of refinement steps, where each refinement step breaks the original problem into a subproblem [Wirth1971]; these steps can be verified in an ITP. Each refinement can be a refinement of a program without changing the datatypes, or a refinement of the datatypes themselves (data refinement [de1998data]). Via the principle of propositions-as-types, similar approaches can be used to develop proofs by stepwise proof refinement of an existing specification.

Program Refinement

A proof of refinement formally relates an abstract program to a concrete, refined version of that program. It establishes that all of the behaviors of the concrete program are contained in the set of behaviors of the abstract program [de1998data]. This relation can also be stated and proven in terms of the program specifications (as in the refinement calculus [back1988calculus]) or in terms of a simulation relation (Section 6.2.4). FRAPBook contains an overview of using program refinement to derive verified correct programs from their specifications.
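A small Coq sketch can illustrate the simplest case, in which both programs are deterministic functions and behavior containment collapses to input-output equality. The functions sum_spec and sum_acc are hypothetical examples chosen for this illustration: the first is a direct specification, the second an accumulator-passing refinement of it.

```coq
Require Import Lia.

(* Abstract program: a direct recursive specification of 1 + 2 + ... + n. *)
Fixpoint sum_spec (n : nat) : nat :=
  match n with
  | 0 => 0
  | S k => n + sum_spec k
  end.

(* Concrete program: an accumulator-passing version. *)
Fixpoint sum_acc (n acc : nat) : nat :=
  match n with
  | 0 => acc
  | S k => sum_acc k (n + acc)
  end.

(* Generalized statement relating the two for every accumulator value. *)
Lemma sum_acc_refines : forall n acc, sum_acc n acc = sum_spec n + acc.
Proof.
  induction n as [| k IH]; intros acc; simpl.
  - reflexivity.
  - rewrite IH. lia.
Qed.

(* The refinement statement proper: started at 0, sum_acc agrees with sum_spec. *)
Corollary sum_acc_correct : forall n, sum_acc n 0 = sum_spec n.
Proof. intros n. rewrite sum_acc_refines. lia. Qed.
```

Clients that depend only on sum_acc_correct are unaffected if the accumulator-based implementation is later replaced, as long as the corollary is re-established.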

back1991 formalized the refinement calculus in HOL. vonwright1994 presented a tool for verified program refinement using the refinement calculus in HOL. Since then, there have been a number of refinement tools in Isabelle/HOL with support for logic [hemer2001], object-oriented [Liu2011], functional [lammich2013refinement], and imperative [lammich2015refinement] programs. Cohen2013 developed a framework for Coq called CoqEAL which automates key steps of data refinement. Delaware2015 presented Fiat, a refinement framework for deductive synthesis of abstract data types in Coq.

Proof engineers use proofs of program refinement to break down large proof developments or to compose modular proof developments, for example for the verification of storage systems [Chajed2019], compilers [Leroy2009, Rizkallah16, Kumar2014], and OS kernels [Klein2014micro, Gu-al:POPL15, Gu-al:OSDI16]. Refinement proofs can also help make proof developments robust to changes (Section 6.2.3).

Proof Refinement

Reasoning backwards in proof assistants, from goals to premises, can be viewed as a form of proof refinement [Bates1979, krafft1981], where the proof is the refinement of the specification. The idea of proof refinement is to refine the goal to proofs of subgoals, then refine those subgoals further. Bates1979, for example, describes the rule for refining a conjunction A ∧ B given hypotheses S:

  S pr A ∧ B by
    S pr A
    S pr B

In other words, A ∧ B follows from S if each of A and B follows from S. Each of A and B follows from S if it can be refined using other rules.
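In Coq, the same refinement step can be sketched with the refine tactic, which takes a partial proof term with holes and turns each hole into a subgoal. The example below is only an illustration of the conjunction rule; the hypothesis names HA and HB are chosen here.

```coq
Goal forall A B : Prop, A -> B -> A /\ B.
Proof.
  intros A B HA HB.
  (* Refine the conjunction into one subgoal per conjunct. *)
  refine (conj _ _).
  - exact HA.
  - exact HB.
Qed.
```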

Refinement logics such as Nuprl [Constable1986] and RedPRL [angiuli2018] as well as other proof assistants following in the LCF tradition (Section 4.2) encourage this style of reasoning. SterlingH17 contains an overview of proof refinement.

6.2 Proof Design Principles

Good design principles can make proofs easier to develop and maintain. These design principles mirror software engineering design principles in many ways, but also address challenges unique to proof engineering.

Consider an example in Coq from Woos2016, which demonstrates a design principle that addresses challenges unique to proof engineering. In this example, we have a proof eg_proof of a theorem eg, which shows that if two functions map equal inputs to equal outputs, then any proposition that holds on all outputs of g must also hold on all outputs of f:

Definition eg : Prop :=
  forall (A B : Type) (f g : A -> B) (P : B -> Prop),
    (forall x, P (g x)) ->
    (forall x, f x = g x) ->
    (forall x, P (f x)).
Lemma eg_proof : eg.
Proof.
  unfold eg. intros. rewrite H0. auto.
Qed.

Suppose we later change eg by adding a second predicate Q, a hypothesis about Q, and a second conjunct in the conclusion:

Definition eg : Prop :=
  forall (A B : Type) (f g : A -> B) (P Q : B -> Prop),
    (forall x, P (g x)) ->
    (forall x, Q (g x)) ->
    (forall x, f x = g x) ->
    (forall x, P (f x) /\ Q (f x)).

As the authors note, our proof eg_proof no longer holds, since the automatically generated hypothesis name H0 is now called H1. One way to address this is to change the hypothesis name in the proof as well:

  unfold eg. intros. rewrite H1. auto.

But if we continue changing eg, then we will need to keep making these kinds of changes. Instead, the authors advocate for using the tactic find_rewrite, since it does not depend on hypothesis names:

  unfold eg. intros. find_rewrite. auto.

This proof goes through for both definitions of eg.
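The idea can be sketched without installing StructTact by defining a small name-independent rewriting tactic. The tactic find_rewrite_sketch below is a hypothetical, simplified stand-in and not the library's definition; eg1 is the original statement, and eg2 is an assumed variant with one extra hypothesis, introduced here only to shift the automatically generated names.

```coq
(* Hypothetical, simplified stand-in for a structural rewriting tactic:
   it picks an equational hypothesis by its shape rather than by its name. *)
Ltac find_rewrite_sketch :=
  match goal with
  | [ H : forall _, _ = _ |- _ ] => rewrite H
  | [ H : _ = _ |- _ ] => rewrite H
  end.

Definition eg1 : Prop :=
  forall (A B : Type) (f g : A -> B) (P : B -> Prop),
    (forall x, P (g x)) ->
    (forall x, f x = g x) ->
    (forall x, P (f x)).

Lemma eg1_proof : eg1.
Proof. unfold eg1. intros. find_rewrite_sketch. auto. Qed.

(* Assumed variant: the extra hypothesis about Q renames the equation. *)
Definition eg2 : Prop :=
  forall (A B : Type) (f g : A -> B) (P Q : B -> Prop),
    (forall x, P (g x)) ->
    (forall x, Q (g x)) ->
    (forall x, f x = g x) ->
    (forall x, P (f x) /\ Q (f x)).

Lemma eg2_proof : eg2.
Proof. unfold eg2. intros. find_rewrite_sketch. auto. Qed.
```

Both lemmas use the same proof script up to the unfolded name, because the tactic selects the equation by matching on the proof context.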

The design principle from this example addresses a challenge unique to proof engineering, since it deals with the consequences of proof automation. Other proof engineering design principles mirror software engineering design principles. This section provides an overview of design principles for proof engineering, drawing parallels to software engineering when appropriate. It focuses on general-purpose design principles, and discusses domain-specific design principles (beyond those from Chapter 3) only when relevant more broadly.

6.2.1 Design Principles for Abstraction

As in software engineering, design principles for proof engineers prevent changes in implementation from breaking dependencies that ought to rely only on specifications. For example, a software engineer may write an interface for a collection of functions so that implementation details, such as the underlying data structure, can be switched out without breaking functionality that depends on those functions. Similarly, a proof engineer may write an interface for a collection of lemmas so that changes to the proofs of those lemmas do not break other lemmas and theorems that depend on them [Woos2016].

There are many ways to achieve this sort of abstraction in ITPs, some of which have different implications for proof automation. For example, in Coq, it is possible to write interfaces using modules, type classes, or canonical structures; Coq has special support for proof search for type classes [Sozeau2008] and canonical structures [Saibi:PhD]. This section describes some means of abstraction in an ITP.


Modules

Modules, as manifested in languages such as Standard ML [MacQueen1986] and proof assistants such as Coq [Chrzaszcz2003], are collections of named components which may be types, values, or nested modules. A central property is the separation of module interfaces (signatures or module types) and module implementations (structures). The interface-implementation relation is many-to-many; one signature can be implemented by several structures, and one structure can implement several signatures. A structure can choose to hide all information not specified in the signatures it implements.

Parametric modules, called functors, take structures that implement certain signatures as arguments. In proof assistants, functors can provide abstraction and reuse of both functions and proofs. This approach is taken to implement finite sets and maps in Coq using AVL trees [Filliatre2004], and later to implement balanced binary search trees using red-black trees [Appel2011b].
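The following sketch shows the three ingredients in Coq's module system: a signature, a structure implementing it, and a functor whose proofs are reused for any implementation. All names (MONOID, NatAdd, MonoidFacts, and their components) are hypothetical and chosen for this illustration.

```coq
(* Signature: an operation with a left identity, including a proof obligation. *)
Module Type MONOID.
  Parameter t : Type.
  Parameter neutral : t.
  Parameter op : t -> t -> t.
  Axiom left_id : forall x, op neutral x = x.
End MONOID.

(* Structure: natural numbers with addition implement the signature. *)
Module NatAdd <: MONOID.
  Definition t := nat.
  Definition neutral := 0.
  Definition op := Nat.add.
  Lemma left_id : forall x, op neutral x = x.
  Proof. intros x. reflexivity. Qed.
End NatAdd.

(* Functor: a lemma proved once against the signature only. *)
Module MonoidFacts (M : MONOID).
  Lemma left_id_twice : forall x, M.op M.neutral (M.op M.neutral x) = x.
  Proof. intros x. rewrite !M.left_id; reflexivity. Qed.
End MonoidFacts.

(* Instantiating the functor reuses the proof for NatAdd. *)
Module NatAddFacts := MonoidFacts NatAdd.
Check NatAddFacts.left_id_twice.
```

Because MonoidFacts sees only the signature, the proof of left_id_twice cannot depend on how NatAdd proves left_id.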

Type Classes

Type classes were first implemented in the Haskell programming language [Wadler1989]. In a proof assistant context, type classes have notably been implemented for Coq [Sozeau2008] and Isabelle/HOL [Haftmann2006]; instance arguments [Devriese2011] are a similar feature in Agda. Type classes can be viewed as a particular use of a module system as in Standard ML, and type classes can coexist with such a module system [Dreyer2007].

A type class can be viewed as an abstract data type that defines a collection of functions by their parameter types, while not fixing function implementations. The abstract data type can then be implemented in different ways for different parameter types. For example, Volume 4 of Software Foundations [Pierce-al:SF] describes an equality type class with a single function eqb which, when provided two arguments of the same type, returns a boolean:

Class Eq A :=
  { eqb : A -> A -> bool }.

Different implementations (instances) of this class can then be provided for different types; Software Foundations describes one for booleans:

Instance eqBool : Eq bool :=
  { eqb := fun (b c : bool) =>
      match b, c with
      | true, true => true
      | true, false => false
      | false, true => false
      | false, false => true
      end }.

and one for natural numbers:

Instance eqNat : Eq nat :=
  { eqb := Nat.eqb }.

A compiler can translate programs that use functions defined for type classes to programs that do not by looking up and applying the appropriate function instances, using information about function invocation types.

A key use of type classes in Haskell programs is as a way to structure programs by abstracting certain code over appropriate type classes, and concretizing the abstracted code with appropriate type instances elsewhere, avoiding duplication and facilitating reuse; proof engineers can use type classes similarly to achieve both code and proof reuse. However, type classes in ITPs also provide benefits beyond those of type classes in other languages such as Haskell. For example, one drawback of using only type classes for structuring programs in Haskell is that a type can implement a type class in exactly one way [HarperModules2011], even though it may be useful to define different type class instances for sorting of integers depending on the size of the input data, or to use different orders on integers for sorting. Unlike in Haskell, type classes in Coq can support multiple instances.

The Coq implementation of type classes is first-class, meaning that it is a thin layer on top of existing functionality (specifically, implicit arguments and dependent records). In addition, Sozeau2008 added specific support for type class resolution into Coq’s proof search mechanism. In contrast to Coq’s type classes, the type classes of Isabelle/HOL are restricted to one type variable, and are not first-class. One particular advantage of type classes in proof assistants (as opposed to type classes in Haskell) is that propositions can be type class members, e.g., a type class for a monad can require witnesses (proofs) for the monad laws along with monad operations. By extension, this means that proofs in one type class instance can be derived partly from proofs in other type class instances (e.g., of some more general class). In contrast with Haskell, the type class instance resolution system in Coq can always be elided by manually passing implicit type class instances.
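As a hedged illustration of propositions as class members, the sketch below extends an equality class with a law, proves a lemma generically from that law, and then passes an instance by hand. The names EqLaw, eqNatLaw, and eqb_pair_refl are hypothetical; Nat.eqb_refl is assumed to come from the standard library's PeanoNat module.

```coq
Require Import PeanoNat.

(* A class whose members are an operation and a proof about that operation. *)
Class EqLaw (A : Type) := {
  eqb : A -> A -> bool;
  eqb_refl : forall x, eqb x x = true
}.

(* The instance must supply the proof along with the function. *)
Global Instance eqNatLaw : EqLaw nat := {
  eqb := Nat.eqb;
  eqb_refl := Nat.eqb_refl
}.

(* A generic lemma that relies only on the law, for any instance. *)
Lemma eqb_pair_refl {A : Type} `{EqLaw A} (x y : A) :
  andb (eqb x x) (eqb y y) = true.
Proof. rewrite !eqb_refl; reflexivity. Qed.

(* Instance resolution picks eqNatLaw automatically here. *)
Example nat_instance_used : andb (eqb 2 2) (eqb 5 5) = true.
Proof. apply eqb_pair_refl. Qed.

(* Resolution can be bypassed by passing the instance explicitly. *)
Example explicit_instance : @eqb nat eqNatLaw 3 3 = true.
Proof. reflexivity. Qed.
```

Passing the instance explicitly is what allows several instances for one type to coexist: the caller can always name the one it wants.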

Type classes have been used for abstraction and reuse in many proof developments. For example, Spitters2011 used type classes to represent a standard algebraic hierarchy in Coq, along with parts of category theory. Woos2016 used type classes to organize the correctness proof of the Raft consensus protocol in Coq, and for abstracting the Raft protocol implementation for replication over arbitrary state machines.

Type classes are closely tied to other language features for abstraction. General parametrization of theories in Isabelle can be achieved via locales [Kammueller1999, Ballarin2006], which is the mechanism used to provide type class support [Haftmann2009]; a locale can be viewed as a persistent proof context that includes arbitrary variables and assumptions, and which can be instantiated in other proofs.

Canonical Structures

In Coq, canonical structures [Mahboubi-Tassi:ITP13, Saibi:PhD] provide an alternative to type classes. Canonical structures are a mechanism to provide theory-specific dictionaries to datatypes, allowing for a more flexible resolution strategy than the more widely used type classes.

To show their typical use, consider partial commutative monoids (PCMs), an algebraic structure which recurs in our current ongoing work on the verification of stateful and concurrent programs [Nanevski-al:ESOP14]. We implement PCMs using two of Coq's native constructs: dependent records and canonical structures. We follow the established SSReflect design pattern of defining algebraic data structures by means of mix-in composition [Garillot:PhD], whereby different dependent records formalize different algebraic properties, which can be combined using the packed classes mechanism. The latter also defines the field resolution strategy [Garillot-al:TPHOL09] in the case of overlapping names. For instance, in Coq the mix-in defining PCMs is represented by the following dependent record:

Record mixin_of (T : Type) := Mixin {
  valid : T -> bool;
  join : T -> T -> T;
  unit : T;
  _ : commutative join;
  _ : associative join;
  _ : left_id unit join;
  _ : forall x y, valid (join x y) -> valid x;
  _ : valid unit }.

The type T is the carrier type of the structure. The field valid selects a subset of T, standing for the “defined” elements. The invalid (or “undefined”) elements help model partiality: a partial function over T will return some invalid element on an input on which it is mathematically undefined. join is the binary operation of the PCM, and unit is the unit element. The remaining five unnamed fields enumerate the axioms that have to be satisfied by each PCM instance.

Next, the mix-in “interface” is packaged with a carrier type, into a dependent record type, which represents PCMs. We also introduce a coercion from the package to the underlying carrier type, so that the two can be conflated. This coercion essentially accounts for the delegation hierarchy from object-oriented languages.

Structure pcm : Type := Pack {type : Type; _ : mixin_of type}.
Coercion type : pcm >-> Sortclass.

Next, we explain the mechanism of packaging all necessary definitions along with lemmas about data structures (such as join’s commutativity and associativity in the case of PCMs) into a single module that should be imported by the clients of the algebraic structure. For example, we introduce appropriate notation for the join operation, and specifically name and prove the lemmas that correspond to the PCM properties that we left unnamed in the mixin.

Notation "x \+ y" := (join x y) (at level 43, left associativity).
Lemma joinC (U : pcm) (x y : U) : x \+ y = y \+ x.
Lemma joinA (U : pcm) (x y z : U) : x \+ y \+ z = x \+ (y \+ z).

Lemmas such as joinC and joinA are proved by destructing the package U, but notice how the coercion allows conflating U with its carrier type. Also notice how the notation \+ allows the PCM U to be omitted from the equations themselves, as the typechecker can infer it from the context.

Algebraic structures can inherit the properties of other, more basic structures. Thus, we also require an analogue of object-oriented inheritance. We illustrate how this can be done in Coq, by defining an interface for a cancellative PCM, which inherits from an ordinary PCM. The cancellative PCM is defined as the following mix-in record:

Record mixin_of (U : pcm) := Mixin {
  _ : forall a b c: U, valid (a \+ b) -> a \+ b = a \+ c -> b = c
}.

Notice that the dependent record mixin_of in this case is parametrized via the carrier PCM U, which is used as a target for a coercion whenever an instance of a plain PCM or a carrier type U is required, since coercions are transitive.

Let us now instantiate the definition of abstract structure with concrete datatypes. It turns out that it is insufficient to merely prove that a datatype satisfies the PCM axioms. To work comfortably with an algebraic structure in practice, one has to explicitly “register” the structure with the type inference engine.

We first show what goes wrong if one doesn't perform the "registration." For instance, assume we first define an instance of a PCM for nat with addition, by proving that + with unit 0 satisfies the PCM axioms. Then the following lemma, which uses the generic notation \+ for the PCM operation, is considered ill-formed by Coq. The reason is that Coq cannot figure out that there is a PCM associated with nat, and that the generic notation \+ should be resolved with addition. Indeed, we could have defined the PCM for nat via multiplication with unit 1, in which case \+ should be resolved by *.

Lemma add_perm (a b c : nat) : a \+ (b \+ c) = c \+ (b \+ a).

In the above case, once a structure is registered as the default PCM for nat, the add_perm lemma can be proved by selective rewriting using the standard PCM properties.
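The effect of registration can be sketched on a structure much simpler than a PCM. The example below uses a hypothetical magma record with just a carrier and a binary operation, so it omits the axioms, the mix-in layer, and the coercion described above; only the resolution behavior is illustrated.

```coq
(* A hypothetical, stripped-down structure: a carrier with a binary operation. *)
Structure magma : Type := Magma {
  carrier : Type;
  op : carrier -> carrier -> carrier
}.

(* Generic notation; the structure argument is left for unification to infer. *)
Notation "x \+ y" := (op _ x y) (at level 43, left associativity).

(* Before registration, Coq knows no magma whose carrier is nat. *)
Fail Check (1 \+ 2).

(* Register addition as the canonical magma for nat. *)
Canonical Structure nat_add_magma : magma := Magma nat Nat.add.

(* Now unification solves "carrier ?m = nat" with nat_add_magma. *)
Check (1 \+ 2).

Example resolved_by_addition : 1 \+ 2 = 3.
Proof. reflexivity. Qed.
```

Had multiplication been registered instead, the same notation would resolve to Nat.mul, which is why the choice has to be declared rather than guessed.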

Some notable uses of canonical structures include telescopes [Garillot-al:TPHOL09] and higher-order tactics for separation logic [Gonthier2011].

6.2.2 Design Principles for Programming with Dependent Types

In order to use dependent types to their full extent, proof engineers have developed many paradigms to deal with the challenges they present. For example, using dependent types, we can define heterogeneous lists and a selection function over heterogeneous lists. To write the selection function, however, we must first define the type that it has, which depends on the case. CPDT accomplishes this using a membership predicate:

Section hlist.
  Variable A : Type.
  Variable B : A -> Type.
  Inductive hlist : list A -> Type :=
  | HNil : hlist nil
  | HCons : forall (x : A) (ls : list A), B x -> hlist ls -> hlist (x :: ls).
  Variable elm : A.
  Inductive member : list A -> Type :=
  | HFirst : forall ls, member (elm :: ls)
  | HNext : forall x ls, member ls -> member (x :: ls).
  Fixpoint hget ls (mls : hlist ls) : member ls -> B elm :=
    (* ... *)
End hlist.

In other words, the type of hget states that whenever elm is a member of some list ls, then we can select some element of type B elm from any mls : hlist ls. This is one example of a common style of writing functions and proofs using dependent types, wherein the proof engineer first defines the type of the function or proof inductively, and then defines the function or proof that has that type. It is a powerful style that makes it possible to define very expressive types.

CPDT provides a comprehensive overview of dependently-typed programming in Coq with many more examples. Tanter2015 outlines design principles for gradual verification in Coq, which may help reduce the burden of verification with dependent types and increase adoption.

6.2.3 Design Principles for Scale

The scale of programs verified in ITPs has increased over the years. In recent years, proof engineers have begun to look at how to address the challenges that come with this increase in scale. Proof engineering in the large [Kaivola2003], for example, describes a methodology for verifying large-scale circuits; it is among the earliest work noting that proof design for large verification projects is important. This section describes design principles dealing with the challenges of scale such as robustness in the face of changes, compositionality of components, and efficiency of code and proofs. In addition, Section 6.4 describes design principles for proof reuse.

Design Principles for Robustness

A major source of inefficiency in verification is proof brittleness: Even a minor change to a single theorem or definition can break many dependent proofs. This makes proofs difficult to maintain [Woos2016, Aydemir2008, plse-coevolve-djg-fose14, Delaware2013ICFP]. Design principles help make proofs robust in the face of changes. This is one approach to proof evolution (Section 7.2).

One approach to building robust proofs is to make use of proof automation (Chapter 5) to dispatch similar goals. Proof engineers who use the default tactics included in many proof assistants already take advantage of this, since the same tactic can discharge different goals. For example, a proof engineer who uses omega in Coq or presburger in Isabelle/HOL need not change the proof script as the goal changes, so long as the goal stays within the same fragment of arithmetic solved by those tactics.
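The point can be sketched with two versions of a hypothetical arithmetic lemma. The sketch uses lia, the linear arithmetic tactic shipped with Coq, as a stand-in for omega (which is no longer available in recent Coq releases); the lemma names and statements are invented for illustration.

```coq
Require Import Lia.

(* Original statement. *)
Lemma bound_v1 : forall x y : nat, x < y -> x + 1 <= y.
Proof. intros. lia. Qed.

(* Revised statement with an extra variable, hypothesis, and scaling factor.
   The proof script is unchanged because the goal is still linear arithmetic. *)
Lemma bound_v2 : forall x y z : nat, x < y -> y <= z -> 2 * x + 2 <= 2 * z.
Proof. intros. lia. Qed.
```

Neither script mentions a hypothesis name or the shape of the goal, so the change in the statement requires no change in the proof.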

The degree to which proof engineers rely on automation varies by style. CPDT [CPDT], for example, advocates for the heavy use of program-specific automation, noting that this makes proofs more robust; the tagless interpreter proofs from Chapter 8.3 of CPDT contain an example of automation of this kind. This style of development localizes the burden of change to the automation itself as opposed to the many proofs that use the automation.

While automation can make proofs more robust, it can also be brittle in itself. For example, some Coq tactics automatically generate hypothesis names; small changes in specifications can cause proofs that rely on those names to break. One approach to this problem is to always explicitly specify hypothesis names, so that Coq never generates hypothesis names automatically; the IDE Company-Coq [CompanyCoq2016] provides some built-in support for this approach. Planning for Change [Woos2016] notes that, while this approach helps, it is still necessary to update those explicit names as specifications change. Instead, the authors advocate for the use of structural tactics, or tactics that do not depend on hypothesis names and hypothesis ordering; many tactics of this style can be found in the associated StructTact [structtact] library.

Planning for Change addresses design for robustness not only at the level of automation, but also at the level of specifications and proof objects. It presents a methodology for writing robust proofs independently of any domain or framework. This methodology is informed by a large proof engineering effort verifying the Raft consensus protocol. It is a set of five recommendations. Some of these recommendations draw on software engineering design principles. For example, the authors recommend using information hiding techniques similar to those used in software engineering to hide definitions. That way, the burden of change is localized to interface changes, and changes in only implementation do not cause breaking changes in dependencies. Other recommendations tackle challenges that are unique to proof engineering. For example, the authors advocate for the use of custom induction principles to capture common patterns in inductive proofs.

Refinement (Section 6.1.2) can help make proofs robust to changes. For example, the proofs of the seL4 microkernel in Isabelle/HOL have evolved alongside the implementation for over eight years [Klein2014micro]. The proof development makes use of two layers of specifications: an abstract specification which describes only behavior of the system, and an executable specification which includes implementation details. These two layers are connected by a refinement proof. Using this approach, the authors found that both making low-level changes and adding new simple features were not very costly, though more complex changes that interacted with other parts of the code significantly were still costly.

Design Principles for Compositionality

CompCert (Section 3.1.1) employs a compositional design for describing the different intermediate languages and how they interact with each other. Affinity lemmas from Planning for Change also capture this concept. The CertiKOS project introduces the idea of a deep specification [Gu-al:POPL15] that makes compositional verification more tractable. DeepSpec [DeepSpec], an ongoing project, is addressing this problem more generally. Section 6.3 describes frameworks for compositional verification.

Design Principles for Efficiency

Some proof assistants like Coq work by extraction (Section 4.4.2) from the core language into an executable language. The resulting extracted code can be slow, which can be a barrier for verifying a realistic system. CruzFilipe2003, CruzFilipe2006 describe proof design principles for optimizing the efficiency of extracted code.

6.2.4 Style Guides and Proof Techniques

Style guides and proof techniques help guide proof engineers in dealing with common patterns to address common challenges.

Style Guides

Section 3.2.1 described style guides in mathematics. A few general-purpose style guides exist. Gerwin’s style guide [isabellestyle] for Isabelle, for example, is a set of guidelines that are used within Isabelle itself and in several large developments. The Isabelle Archive of Formal Proofs requires that submitted proofs follow some of these guidelines, and recommends others [isabelleafp]. The CoqStyle [coqstyle] style guide is a set of guidelines for Coq in the main Coq repository which is used within the standard library.

Proof Techniques

Proof techniques are techniques that handle common classes of proofs, or that make it easier to write proofs in a particular style. For example, one widely-used proof technique is simulation [Lynch1994]; an overview of this technique can be found in FRAPBook. This technique helps proof engineers prove that systems preserve liveness and safety properties. Refinement (Section 6.1.2) reduces to simulation [Klein2014micro].

One application of simulation is to show compiler correctness. CompCert [Leroy2009], for example, uses this technique to show that the program transformations that the compiler makes are semantics-preserving. In the case of compiler correctness as in CompCert, both directions of simulation (forward simulation and backward simulation) start with a source program $s_1$ and a target program $t_1$ that are related along some relation $R$. The forward simulation (Figure 6.1, left) states that if $s_1$ steps to $s_2$, then $t_1$ can step to some $t_2$, where $s_2$ and $t_2$ are related by $R$. Similarly, the backward simulation (Figure 6.1, right) states that if $t_1$ steps to $t_2$, then $s_1$ can step to some $s_2$, where $s_2$ and $t_2$ are related by $R$. Intuitively, a forward simulation shows that “anything the source program could do, the target program could do too,” and a backward simulation shows that “anything the target program could do, the source program could do too.” Together, a forward and a backward simulation establish indistinguishability: any entity restricted to only observe “visible” program transitions (e.g., input and output) will never be able to determine if they are interacting with the source or target program [Sangiorgi2011]. If the source language is deterministic, then the forward simulation follows from the backward simulation; if the target language is deterministic, then the backward simulation follows from the forward simulation [Leroy2009]. CompCert takes advantage of this to show backward simulation from only forward simulation and determinism of the target language.





Figure 6.1: Forward (left) and backward (right) simulation for compiler correctness. Premises are shown as solid lines and goals are shown as dashed lines.
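As a rough illustration of the two simulation diagrams, the following Python sketch checks them on small finite transition systems. The function names, the `Steps` representation, and the toy source and target systems are all hypothetical and are not taken from CompCert. The sketch matches one source step with exactly one target step, which is a simplification, and it tests the property by enumeration rather than proving it.

```python
from typing import Dict, Set, Tuple

Steps = Dict[str, Set[str]]  # state -> set of successor states


def forward_simulation(src: Steps, tgt: Steps, rel: Set[Tuple[str, str]]) -> bool:
    """Whenever (s1, t1) is in rel and s1 -> s2, some t1 -> t2 has (s2, t2) in rel."""
    for s1, t1 in rel:
        for s2 in src.get(s1, set()):
            if not any((s2, t2) in rel for t2 in tgt.get(t1, set())):
                return False
    return True


def backward_simulation(src: Steps, tgt: Steps, rel: Set[Tuple[str, str]]) -> bool:
    """Mirrored condition: every target step is matched by some source step."""
    for s1, t1 in rel:
        for t2 in tgt.get(t1, set()):
            if not any((s2, t2) in rel for s2 in src.get(s1, set())):
                return False
    return True


# Toy source program: may step from s1 to either s2a or s2b (nondeterministic).
source = {"s1": {"s2a", "s2b"}}
# Toy target program: deterministic, always steps from t1 to t2.
target = {"t1": {"t2"}}
# Candidate relation between source and target states.
R = {("s1", "t1"), ("s2a", "t2")}

print(forward_simulation(source, target, R))   # the step s1 -> s2b has no match
print(backward_simulation(source, target, R))  # t1 -> t2 is matched by s1 -> s2a
```

In this toy example the target resolves a nondeterministic choice of the source, so under the chosen relation the backward check succeeds while the forward check fails.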

Section 5.1 discusses some techniques for interacting with automation. Proof techniques can also help proof engineers reason within certain domains. bahr2015, for example, describes a technique for deriving correct compilers from specifications in Coq. Section 6.2.5 describes techniques for reasoning about imperative programs.

6.2.5 Design Principles for Reasoning about Imperative Programs

When designing a verification framework for imperative programs based on a dependently-typed proof assistant (e.g., Coq), the most common approach is to implement a version of a Floyd-Hoare style program logic [Floyd1967Flowcharts, Hoare:CACM69] in it. When doing so, the framework designer is faced with the following choices:

  • How to embed into a proof assistant a language with features that the host language does not have (e.g., mutable state and concurrency)?

  • How to encode verification conditions for imperative programs specified in Floyd-Hoare style, and implement the corresponding reasoning principles in a proof assistant?

Below, we elaborate on these two design choices and provide a survey of the most prominent approaches implementing them, both for sequential and concurrent reasoning about imperative programs.

On Shallow and Deep Embedding

An important decision when designing a framework for verification of effectful (i.e., heap-manipulating or concurrent) programs on top of a general-purpose proof assistant is whether to use a shallow or a deep embedding of the language to be verified.

Shallow embedding is an approach to implementing programming languages in which the language of interest (usually called a domain-specific language or DSL) is represented as a subset of another general-purpose host language, so that programs in the former are simply programs in the latter. The idea of shallow embedding originated in the early ’60s with the beginning of the era of the Lisp programming language [Graham:BOOK], which, thanks to its macro-expansion system, serves as a powerful platform for implementing DSLs by means of shallow embedding (such DSLs are sometimes called internal or embedded).

In practical programming, shallow embedding is advocated for its high speed of language prototyping and the ability to reuse most of the host language infrastructure. An alternative approach to implementing and encoding programming languages is called deep embedding, and amounts to implementing a DSL from scratch: essentially, writing its parser, interpreter, and type checker in a general-purpose language. Deep embedding is preferable when the overall performance of the implemented language runtime is of more interest than the speed of DSL implementation, since many intermediate abstractions that are artifacts of the host language can then be avoided.
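The contrast between the two styles can be sketched in Python for a tiny expression language. This is only an illustration: the class and function names (`Lit`, `Var`, `Add`, `evaluate`, `simplify`, `lit`, `var`, `add`) are hypothetical, and Python stands in for the host language that a proof assistant would normally provide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Env = Dict[str, int]


# Deep embedding: the object language has its own syntax tree...
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Var, Add]


# ...and its own interpreter, written in the host language.
def evaluate(e: Expr, env: Env) -> int:
    if isinstance(e, Lit):
        return e.value
    if isinstance(e, Var):
        return env[e.name]
    if isinstance(e, Add):
        return evaluate(e.left, env) + evaluate(e.right, env)
    raise TypeError(f"not an expression: {e!r}")


# Because syntax is data, program transformations can inspect it.
def simplify(e: Expr) -> Expr:
    """Rewrite `x + 0` and `0 + x` to `x`."""
    if isinstance(e, Add):
        left, right = simplify(e.left), simplify(e.right)
        if right == Lit(0):
            return left
        if left == Lit(0):
            return right
        return Add(left, right)
    return e


# Shallow embedding: object programs are just host-language functions.
ShallowExpr = Callable[[Env], int]


def lit(n: int) -> ShallowExpr:
    return lambda env: n


def var(name: str) -> ShallowExpr:
    return lambda env: env[name]


def add(a: ShallowExpr, b: ShallowExpr) -> ShallowExpr:
    return lambda env: a(env) + b(env)


deep = Add(Var("x"), Lit(0))
shallow = add(var("x"), lit(0))
env = {"x": 41}
print(evaluate(deep, env), shallow(env))  # both styles run the same program
print(simplify(deep))                     # only the deep term can be inspected
```

The shallow version needs no interpreter, but its programs are opaque closures; the deep version requires more code, but supports syntactic transformations such as `simplify`.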

In the world of mechanized program verification, both deep and shallow embeddings have their own strengths and weaknesses. Although implementations of deeply embedded languages and calculi naturally tend to be more verbose, design choices in them are usually simpler to explain and motivate. Moreover, the deep embedding approach makes the problem of name binding explicit, so that it can be appreciated as an important aspect of the design of and reasoning about programming languages [Aydemir2008, Weirich-al:ICFP11, Chargueraud2011]. We believe that these are the reasons why this approach is typically preferred when teaching program specification and verification in Coq [Pierce-al:SF].

Importantly, deep embedding gives the programming language implementor full control over its syntax and semantics. In particular, the expressivity of a defined logic or type system is not limited by the expressivity of the host language’s type system. Deep embedding also makes it much more straightforward to reason about pairs of programs, by defining relations as propositions on pairs of syntax trees, which are implemented as elements of corresponding datatypes. This point becomes crucial when one needs to reason about the correctness of program transformations and optimizing compilers [Appel:BOOK14].

In contrast, the choice of shallow embedding, while sparing one the labor of implementing the parser, name binder, and type checker, may limit the expressivity of the logical calculus or type system to be defined. In the case of Hoare Type Theory [Nanevski-al:POPL10], for instance, it makes it impossible to specify programs that store effectful functions and their specifications into a heap. (This limitation can, however, be overcome by postulating the necessary axioms.)

In the past decade Coq has been used in a large number of projects targeting formalization of logics and type systems of various programming languages and proving their soundness, with most of them preferring the deep embedding approach to the shallow one. We believe that the explanation of this phenomenon is that it is much more straightforward to define the semantics of a deeply-embedded “featherweight” calculus [Igarashi-al:TOPLAS01] and prove soundness of its type system or program logic, when that is the ultimate goal of the research project. However, in order to use the implemented framework to specify and verify realistic programs, a significant implementation effort is required to extend the deep implementation beyond the “core language,” which makes shallow embedding preferable.

Encoding Verification Conditions

In a Floyd-Hoare style logic, the specification of a program $c$ is given in the form of a triple $\{P\}\ c\ \{Q\}$, where the assertions $P$ and $Q$ are referred to as the precondition and the postcondition, respectively. The standard semantics of the triple requires that, for any initial state satisfying $P$, the final state after $c$ terminates satisfies $Q$. This definition corresponds to termination-insensitive partial correctness (i.e., $c$ is allowed not to terminate at all, in which case any postcondition holds). Some program logics impose a stronger semantics of total correctness, additionally requiring $c$ to terminate [Dockins-Hobor:DS10].

This treatment of a Floyd-Hoare triple allows for verifying programs by following the inference rules of a program logic, which decompose the proof of $\{P\}\ c\ \{Q\}$ into proofs about $c$’s sub-programs [Hoare:CACM69]. While this style of reasoning seems natural and relatively easy to implement in a proof assistant, and is advocated by the most widely used tutorials [Pierce-al:SF], it is not the most convenient to conduct proofs in, due to the need to constantly discharge the weakening obligations required for “massaging” the verification goal, which are represented by the following inference rule:

$$\frac{P \Rightarrow P' \qquad \{P'\}\ c\ \{Q'\} \qquad Q' \Rightarrow Q}{\{P\}\ c\ \{Q\}}\ (\textsc{Weaken})$$

A more proof-assistant-friendly way to encode Floyd-Hoare-style verification conditions, “compressing” the necessary applications of the weakening rule, is to use the idea of a predicate transformer by [Dijkstra:CACM75], which can be used to compute a precondition for a computation, for any context in which that computation may be used. This approach, dubbed the weakest precondition (WP) calculus, allows one to encode the meaning of a Floyd-Hoare triple (roughly) as follows:

$$\{P\}\ c\ \{Q\} \;\triangleq\; P \Rightarrow \mathsf{wp}(c, Q)$$

where $\mathsf{wp}(c, Q)$ is the program $c$’s weakest precondition with respect to the imposed postcondition $Q$, expressed as a logical formula. Therefore, what is left for the designer of the mechanised program logic is to provide the implementation of the primitive $\mathsf{wp}$, which “compiles” a program and its postcondition to a logical assertion that can later be discharged using the host proof assistant’s machinery.
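The weakest precondition encoding can be sketched in Python for a small loop-free command language. This is an illustrative toy under several assumptions: the names `Skip`, `Assign`, `Seq`, `If`, `wp`, and `triple_holds` are hypothetical, assertions are represented as Python predicates over states, loops (which would require invariants) are omitted, and the implication is tested on a finite set of states, whereas a proof assistant would prove it for all states.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, Union

State = Dict[str, int]
Assertion = Callable[[State], bool]  # assertions as predicates on states
Expr = Callable[[State], int]        # expressions as functions of the state


@dataclass(frozen=True)
class Skip:
    pass


@dataclass(frozen=True)
class Assign:
    var: str
    expr: Expr


@dataclass(frozen=True)
class Seq:
    first: "Cmd"
    second: "Cmd"


@dataclass(frozen=True)
class If:
    cond: Assertion
    then: "Cmd"
    orelse: "Cmd"


Cmd = Union[Skip, Assign, Seq, If]


def wp(c: Cmd, post: Assertion) -> Assertion:
    """Weakest precondition of the loop-free command `c` for `post`."""
    if isinstance(c, Skip):
        return post
    if isinstance(c, Assign):
        # wp(x := e, Q) is Q with e substituted for x, done by updating the state.
        return lambda s: post({**s, c.var: c.expr(s)})
    if isinstance(c, Seq):
        return wp(c.first, wp(c.second, post))
    if isinstance(c, If):
        wp_then, wp_else = wp(c.then, post), wp(c.orelse, post)
        return lambda s: wp_then(s) if c.cond(s) else wp_else(s)
    raise TypeError(f"not a command: {c!r}")


def triple_holds(pre: Assertion, c: Cmd, post: Assertion, states) -> bool:
    """Test {pre} c {post} as: pre(s) implies wp(c, post)(s) on the given states."""
    weakest = wp(c, post)
    return all(weakest(s) for s in states if pre(s))


# Program: if x < 0 then y := -x else y := x   (stores |x| in y)
abs_prog = If(lambda s: s["x"] < 0,
              Assign("y", lambda s: -s["x"]),
              Assign("y", lambda s: s["x"]))

states = [{"x": x, "y": y} for x, y in product(range(-3, 4), repeat=2)]
always = lambda s: True
print(triple_holds(always, abs_prog, lambda s: s["y"] >= 0, states))
print(triple_holds(always, abs_prog, lambda s: s["y"] == s["x"], states))
```

Note that no explicit weakening step appears: the single implication from the precondition to the computed assertion plays that role.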

The weakest precondition approach to encoding verification conditions for imperative programs is amazingly versatile, and, to the best of our knowledge, has been adopted in most of the major implementations of program logics embedded into proof assistants [Nanevski2008, mccreight2009practical, Nanevski-al:POPL10, Chargueraud2010, Chlipala:PLDI11, Swamy-al:PLDI13, Appel:BOOK14, Krebbers-al:POPL17].

Verifying sequential heap-manipulating programs

The main success in a program logic-based verification of heap-manipulating programs has been achieved with the discovery of Separation Logic [OHearn-al:CSL01, Reynolds:LICS02]. It did not take long for Separation Logic to be mechanised in an ITP [Nanevski2008, Mehta-Nipkow:CADE03, mccreight2009practical], using both deep and shallow embedding.

One of the most successful formalizations in Coq by means of shallow embedding is the series of work on Hoare Type Theory (HTT) by Nanevski2008, known as YNot. YNot has been used, among other things, in verifying a relational database system [Malecha2010] and a secure browser kernel [Jang2012].

In addition to adopting the WP-calculus for expressing verification conditions in an embedding of Separation Logic into Coq, HTT first made active use of binary postconditions, enabling a straightforward treatment of logical variables, whose scope spans both pre- and postconditions of a Floyd-Hoare triple. Specifically, this has been achieved by giving a postcondition not the type $\mathsf{heap} \to \mathsf{Prop}$ (i.e., unary, as suggested by the textbook expositions of program logics), but rather the type $\mathsf{heap} \to \mathsf{heap} \to \mathsf{Prop}$, i.e., constraining both the pre- and the post-state. This style of specification has later been adopted by multiple other verification frameworks [Swierstra:TPHOLS09, Swamy-al:PLDI13].
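The difference between unary and binary postconditions can be illustrated with a small Python sketch. The heap model (a dictionary from pointer names to integers), the program `incr`, and all specification names are hypothetical and chosen only for illustration; they do not reproduce HTT's actual definitions.

```python
from typing import Callable, Dict

Heap = Dict[str, int]
UnaryAssertion = Callable[[Heap], bool]
BinaryPost = Callable[[Heap, Heap], bool]


def incr(heap: Heap, ptr: str) -> Heap:
    """Toy heap-manipulating program: increment the cell at `ptr`."""
    return {**heap, ptr: heap[ptr] + 1}


# Unary style: a logical variable `v`, quantified outside the triple,
# is needed to connect the precondition with the postcondition.
def incr_pre(v: int) -> UnaryAssertion:
    return lambda h: h["p"] == v


def incr_post_unary(v: int) -> UnaryAssertion:
    return lambda h: h["p"] == v + 1


# Binary style: the postcondition refers to the pre-state directly.
incr_post_binary: BinaryPost = lambda pre, post: post["p"] == pre["p"] + 1

h0 = {"p": 7, "q": 0}
h1 = incr(h0, "p")
v = 7
print(incr_pre(v)(h0) and incr_post_unary(v)(h1))  # unary spec, instantiated at v = 7
print(incr_post_binary(h0, h1))                    # binary spec, no logical variable
```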

The Sepref [lammich2015refinement] tool for verifying imperative programs in Isabelle/HOL includes a separation logic framework built on top of Imperative HOL, which is built on top of Isabelle/HOL. Sepref uses refinement (Section 6.1.2) to derive an imperative heap-based program and correctness proof from a functional program and correctness proof.

YNot [Nanevski2008] has implemented the heap disjointness, inherent to Separation Logic, by means of a deep embedding of logic reasoning principles. Such an embedding of a domain-specific logic required a later development of a number of tactics for making large mechanised proofs tractable [Chlipala-al:ICFP09]. In a later work, Nanevski-al:POPL10 have shown how to achieve almost the same expressivity with very little domain-specific automation, by making reasoning about finite heaps decidable and leveraging the machinery of small-scale reflection [Gonthier2010].

Various successful deep embeddings of Floyd-Hoare style reasoning into Coq have been demonstrated viable for the sake of reasoning about low-level programs using different versions of Separation Logic [Chlipala-al:ICFP09, Chlipala:PLDI11, Chlipala2013, Chen2015, Cao2018]. All those efforts came supplied with tailored libraries of domain-specific tactics, with those tactics automatically applying Separation Logic’s Frame rule and thus progressively reducing the size of the verification goal.
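To make heap disjointness and the Frame rule more concrete, the following Python sketch models heaps as dictionaries and checks the separating conjunction by enumerating all splits of a heap. The names `points_to`, `star`, and `write` are hypothetical, and the brute-force enumeration is only viable for tiny heaps; the frameworks above instead rely on tactics to find the split.

```python
from itertools import combinations
from typing import Callable, Dict

Heap = Dict[str, int]
Assertion = Callable[[Heap], bool]


def points_to(ptr: str, value: int) -> Assertion:
    """`ptr |-> value`: the heap is exactly the single cell `ptr` holding `value`."""
    return lambda h: h == {ptr: value}


def star(p: Assertion, q: Assertion) -> Assertion:
    """Separating conjunction: the heap splits into disjoint parts satisfying p and q."""
    def holds(h: Heap) -> bool:
        keys = list(h)
        for size in range(len(keys) + 1):
            for left in combinations(keys, size):
                h1 = {k: h[k] for k in left}
                h2 = {k: v for k, v in h.items() if k not in left}
                if p(h1) and q(h2):
                    return True
        return False
    return holds


def write(h: Heap, ptr: str, value: int) -> Heap:
    """Toy command `*ptr := value`; its footprint is the single cell `ptr`."""
    return {**h, ptr: value}


heap = {"p": 1, "q": 2}
frame = points_to("q", 2)

# Small-footprint spec: {p |-> 1} *p := 5 {p |-> 5}
# Framed spec:          {p |-> 1 * q |-> 2} *p := 5 {p |-> 5 * q |-> 2}
print(star(points_to("p", 1), frame)(heap))
print(star(points_to("p", 5), frame)(write(heap, "p", 5)))
print(star(points_to("p", 1), points_to("p", 1))({"p": 1}))  # no disjoint split exists
```

The frame `q |-> 2` is carried unchanged from the precondition to the postcondition, which is what allows the verification goal to shrink to the command's footprint.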

6.2.6 Future of Design

While we think that domain-specific design principles will always be important, we expect that there is a lot of potential for general-purpose design principles that frame proof engineering in the context of software engineering and make novel use of what we already know. Planning for Change investigates where proof engineering diverges from software engineering and where it calls for specialized techniques; continuing along these lines should drive more useful proof design techniques.

Compared to design principles for mathematics, current general-purpose design principles place little emphasis on proof understanding. While this is an understandable difference in emphasis, this can inhibit collaboration for proof engineers as well. We expect more work on proof understanding to become common as collaboration between proof engineers increases with the growth in large-scale verification projects.

Automation-heavy styles can help prevent breaking changes, but have drawbacks. Some of these drawbacks may be avoidable. For example, one limitation is that proof checking of the large and complex terms these procedures produce can be slow. Developments in proof checking such as term simplification could make this style more tractable. Debugging is also difficult; alleviating this concern could be as simple as better debugging tooling for tactics.

6.3 High-Level Verification Frameworks

In the context of software engineering, a framework is distinguished from a library or domain-specific language in that the client relinquishes control of execution to the framework. In practice, the concept of a framework often refers to some combination of design principles, libraries, and tooling that together give structure to code, often within a certain domain, regardless of control of execution. We use the term in the latter sense, as it is the sense used most often in proof engineering papers, and as the concept of control of execution does not always make sense in the context of proof development.

Several of the libraries and languages we have already discussed (for example, Bedrock [Chlipala2013]) fit this definition of a framework. This section extends that discussion to cover frameworks for two common domains: concurrent applications (Section 6.3.1) and language design and metatheory (Section 6.3.2). It then discusses frameworks for a few other domains (Section 6.3.3), and concludes with a discussion of the future of frameworks (Section 6.3.4) for proof engineering.

6.3.1 Frameworks for Verifying Concurrent Applications

Reasoning about concurrent programs brings new challenges into mechanised reasoning: due to the excessively large state-space of possible interactions between simultaneously executing processes or threads, simply enumerating them is no longer tractable. However, since in most practical applications the interaction between processes on some sort of shared state happens only at dedicated program points, via specific programming primitives, a plausible way to reduce this complexity is to reduce concurrent reasoning to sequential reasoning. This idea has been pioneered in the work on Concurrent Separation Logic by ohearn07resources, which provided a series of inference rules for compositional sequential and concurrent reasoning for shared-memory concurrency.
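A back-of-the-envelope calculation shows how quickly the number of interleavings grows. The sketch below assumes a simple model in which each thread executes a fixed sequence of atomic steps and any interleaving that preserves per-thread order is possible; the function name `interleavings` is hypothetical.

```python
from math import factorial


def interleavings(steps_per_thread):
    """Number of order-preserving interleavings of threads with the given step counts.

    This is the multinomial coefficient (n1 + ... + nk)! / (n1! * ... * nk!).
    """
    total = factorial(sum(steps_per_thread))
    for n in steps_per_thread:
        total //= factorial(n)
    return total


print(interleavings([2, 2]))        # 6
print(interleavings([10, 10]))      # 184756
print(interleavings([10, 10, 10]))  # already in the trillions
```

Even three threads of ten steps each admit more interleavings than could reasonably be examined one by one, which motivates the compositional reasoning described here.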

Similarly to plain Separation Logic, variants of CSL have been implemented as both shallow and deep embedding with the corresponding benefits and drawbacks.

The first shallow embedding of Subjective Concurrent Separation Logic, a CSL-like logic for concurrency, was due to LeyWild-Nanevski:POPL13, who implemented it using Coq’s indexed types. Unlike the prior work on Hoare Type Theory [Nanevski-al:POPL10], in which Coq’s dependent types were only capturing the effect of an imperative program on a state, in SCSL, the types were also carrying information about resource invariants, capturing the contract of a concurrent interaction between threads. That work has been later extended to a more expressive Fine-Grained Concurrent Separation Logic (FCSL) [Nanevski-al:ESOP14, Sergey-al:PLDI15], which provided a more general treatment of concurrent resources, incorporating ideas from both CSL and Rely-Guarantee-based verification methodologies [Jones:IFIP83, Feng:POPL09], and implementing them in a form of a shallowly-embedded type theory for state.

The main shortcoming of both SCSL and FCSL, both being shallowly-embedded type theories for state, stems from the limitations of Coq’s model with respect to impredicativity. At the time of this writing, FCSL did not support higher-order heaps (i.e., the possibility to reason about arbitrary storable effectful procedures). It was conjectured by FCSL’s authors that this obstacle could be overcome by relying on the universe polymorphism feature introduced in Coq version 8.5 [Sozeau-Tabareau:ITP14]. An approach based on Rely-Guarantee references, similar in spirit to FCSL, employed Coq as a host framework for implementing a DSL (without proving its soundness with respect to some semantics) that streamlines reasoning about certain concurrency patterns [Gordon-al:TOPLAS17]; this reasoning is enabled by Rely-Guarantee contracts, but lacks CSL-enabled proof modularity.

Implementation of concurrent imperative programs in Coq by means of deep embedding was first considered in the context of verifying low-level code with dynamic thread creation in the CAP and CCAP program logics [Yu-Shao:ICFP04, Feng-Shao:ICFP05]. Targeting real architectures, those formal verification efforts required very high proof effort and have eventually been superseded by a mechanized proof methodology based on certified abstraction layers, not grounded in any specific Hoare-style program logic [Gu-al:POPL15, Kim-al:APLAS17, Gu-al:PLDI18].

Iris is another CSL-inspired mechanised verification framework that has been in development in parallel with FCSL, with an aim to provide more uniform foundations for reasoning about concurrency [Jung-al:POPL15]. Due to the chosen semantic foundations, allowing for impredicativity in the presence of mutable state (and hence, storable higher-order procedures) [Svendsen-Birkedal:ESOP14], Iris could not have been implemented as a shallow embedding and, hence, has been encoded as a deeply-embedded logic.

While that was initially considered a significant obstacle for verifying large concurrent programs in Iris, due to a large proof overhead, the later introduction of Iris Proof Mode (IPM) [Krebbers-al:POPL17] fixed this shortcoming, significantly lowering the entrance threshold for conducting mechanised Iris proofs [iris-tutorial]. This has been achieved in IPM by effectively leveraging Coq’s extensible parsing and proof-by-reflection, and introducing a library of domain-specific tactics that mimic, for the end user of the framework, standard CSL-style inference rules. Due to its success, IPM itself has later been generalised to MoSeL, an extensible proof mode allowing for reasoning not just with Iris but with any separation-style program logic [Krebbers2018].

As frameworks implementing Hoare-style reasoning about concurrency with pre/postconditions, both FCSL and IPM follow the encoding style with Dijkstra-style weakest preconditions.

6.3.2 Frameworks for Language Design and Metatheory

Several frameworks deal with the challenge of component reuse, one of the challenges from POPLMark (Section 3.2.2). Meta-Theory à la Carte (MTC) [Delaware2013POPL] is a framework and Coq library that builds on Data Types à la Carte [Swierstra2008] to address challenges in reuse: extensibility of definitions and proofs through algebraic properties that provide control over the evaluation order, and modular reasoning about partial definitions and proofs through algebraic combinators. Using MTC, the proof engineer can assemble a language from existing components.

MTC does not address extensibility of languages with effects: adding new effects breaks existing proofs. Modular Monadic Meta-Theory (3MT) [Delaware2013ICFP] extends MTC with a methodology and monad library that includes monads for effects as well as algebraic laws. The methodology and library make proofs resilient to the addition of new effects to a language. MTC and 3MT both use algebraic properties to address difficulties with component reuse—algebra in many ways offers natural abstractions, and those techniques can apply more broadly outside of formal metatheory.

Other notable frameworks for language design include the Fiat [Chlipala2017, Delaware2015] framework for Coq, as well as the Hybrid [felty2012hybrid] framework for Isabelle/HOL and Coq, which addresses the difficulties of using HOAS with inductive and coinductive proofs.

6.3.3 Frameworks for Other Domains

Concurrent applications and language metatheory are just two domains for which verification frameworks are useful. Frameworks assist proof engineers in many other domains. For example, a few frameworks exist for verifying distributed systems. Verdi [Wilcox2015] is a framework for building verified distributed systems in Coq; it has been used to build and verify an implementation of the Raft consensus protocol [Woos2016]. Disel [Sergey2017] is a framework for compositional verification of distributed protocols.

6.3.4 Future Frameworks

The expressiveness of the underlying logics of common proof assistants combined with their interactive natures makes it possible to develop useful frameworks for a variety of domains. We expect that proof engineers will continue to develop and improve on frameworks that tackle challenges associated with common domains, as well as build new frameworks to handle challenges associated with new domains for verification as they arise. In addition, we expect that frameworks will address challenges that current frameworks do not fully address, such as language extension and component reuse in metatheory.

While it is natural to apply frameworks to challenges within common domains, we also expect the development of more general-purpose frameworks building on common proof assistants to address challenges that proof engineers face independently of domain, or when following specific design principles.

6.4 Proof Reuse

Large proof developments may involve redundant efforts that can be time-consuming. Proof reuse addresses this by repurposing existing proofs as much as possible, minimizing the amount of redundant work that proof engineers must do. Early examples of proof reuse include proof by analogy [curien1995], the technique of adapting a proof of a theorem to a proof of a related theorem, and proof generalization [hasker1992generalization], the technique of adapting a proof of a theorem to prove a more general theorem.

Proof reuse is the proof engineering analogue to software reuse. Like software reuse, proof reuse leverages design principles (Section 6.4.1) and language constructs (Section 6.4.2). In addition, the interactive nature of proof assistants naturally leads to a class of proof reuse technologies less explored in the software engineering world: automated tooling (Section 6.4.3). This section samples these approaches.

6.4.1 Design Principles for Modularity and Reuse

Good design principles can help maximize the reusability of existing proofs. Some of these design principles are natural generalizations of design principles for software reuse more generally, such as aspect-oriented software development (AOSD), a programming approach that optimizes for separation of concerns [Filman2004]. Others, like the affinity lemmas from Planning for Change (Section 6.2.3), are unique to proof engineering.

Design Principles from Software Engineering

In software engineering, encapsulating behavior can help not only protect against future changes, but can also help with reusing multiple implementations of interfaces with the same behavior. Likewise, the interfaces and information hiding recommendations from Planning for Change are useful not only to protect proofs against future changes, but also to switch between different data structure implementations with the same high-level behavior.

The work by Delaware2011 attacks the problem of language metatheory extension, along with the corresponding formalization in a proof assistant and changing the corresponding type safety proofs (i.e., progress and preservation theorems), from the perspective of Software Product Lines (SPL). SPL is an approach to AOSD that opportunistically reuses software by deriving many different pieces of software from a common producer. Delaware2011 starts from formalizing a core language, taking a “core” Featherweight Java (cFJ), and considering all further extensions to the language (casts, interfaces, generics) as features.

What is inherent to the SPL approach is reasoning about composition and possible interaction between features, expressed by means of an algebra of three feature operators. Introducing multiple features can lead to an exponential explosion of pairwise interactions, which, however, is rarely observed in practice, as most of the features are mutually independent.

In order to enable feature-based decomposition of a language, all its components (syntax, dynamic semantics, safety proofs, etc.) are written in specific languages that are amenable to feature composition. For instance, the language syntax and its semantic and typing rules can be extended by introducing the mechanism of variation points (VPs) into the corresponding grammar productions and into the premises and conclusions of the rules, reusing the intuition of SPL design.

On the implementation side, the modularity of extensions is achieved by reusing Coq’s capabilities for higher-order parametrization: language component definitions are parameterized by the corresponding variation point contexts. The crux of the technique is identifying the effect of the VPs on the safety proofs, which are conducted in a way that is parametric with respect to the inductive cases to be considered. For each specific combination of the features, the top-level proof dispatches to the proofs from the corresponding feature module.

The shortcoming of the approach is that designing a core language requires significant foresight when identifying the appropriate VPs, which provide the opportunity for feature extensions. While the paper demonstrates how to do this in the context of a language whose safety is formalised via the syntactic approach, it provides little guidance with respect to other ways of stating type soundness (e.g., via logical relations), nor does it consider other domains beyond PL design. Overall, the approach seems to be somewhat ad hoc, which is why further advances in this direction led to the creation of the monadic MTC [Delaware2013POPL] and 3MT [Delaware2013ICFP] frameworks we have already discussed.

Beyond Software Engineering

Proof assistants in the LCF family are complex systems with multiple languages at different levels. Accordingly, reuse in these systems happens not only at the term level, but also at the tactic level. Designing powerful tactics can maximize reuse of proof scripts to prove different goals (Section 6.2.3).

Among the recommendations that Planning for Change makes is the use of affinity lemmas that describe relationships between components. These lemmas show that properties that hold over one component also hold over another, which facilitates reuse of proofs across components.

6.4.2 Language Constructs for Organization and Reuse

As in software engineering, proof assistants often provide support for reuse at the language level. These range from entire languages optimized for reuse to useful constructs built on existing languages that make reuse easier. This section describes a sample of languages and language constructs for proof reuse.

Languages for Reuse

Some languages are designed with the goal of optimizing for proof reuse. For example, Felty1994 describes an ITP that is optimized for reuse at the tactic level.

In this system, reuse works by replaying tactics in a new proof setting. To make this possible, the system automatically generalizes proofs using metavariables. In contrast, the logical framework PR [Caplan1995] optimizes for reuse at the level of the type theory. The framework builds on an embedded Hoare logic, adding constructs to the logic that aid in abstraction and reuse of proof terms.

HoTT (introduced in Section 4.3.2) has practical proof engineering applications. HoTT’s univalence axiom gives rise to automatic transport of functions and proofs across type equivalences: to write the same function or proof about two equivalent types, the proof engineer needs only to write the function or proof over one of these two types, and then show the equivalence between them. Cubical type theory [cohen2016cubical] provides a computational interpretation of HoTT’s univalence, so that it is no longer an axiom.
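The idea of transporting a function across an equivalence can be approximated in Python, with the caveat that Python has no notion of proof: the sketch transports only functions, and the equivalence itself is merely tested rather than proved. The two representations of natural numbers and the names `to_unary`, `from_unary`, and `transport` are hypothetical choices made for illustration.

```python
from typing import Callable, List

# Two equivalent representations of natural numbers: non-negative Python ints,
# and unary numerals encoded as lists of n placeholder entries.
Unary = List[None]


def to_unary(n: int) -> Unary:
    return [None] * n


def from_unary(u: Unary) -> int:
    return len(u)


def transport(f: Callable[[int, int], int]) -> Callable[[Unary, Unary], Unary]:
    """Lift a binary function on ints to unary numerals along the equivalence."""
    return lambda a, b: to_unary(f(from_unary(a), from_unary(b)))


def add_int(a: int, b: int) -> int:
    return a + b


# The function is written once, over ints, and obtained for unary numerals for free.
add_unary = transport(add_int)
print(from_unary(add_unary(to_unary(2), to_unary(3))))

# Sanity check that the two maps are mutually inverse on a few inputs.
print(all(from_unary(to_unary(n)) == n for n in range(10)))
```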

Proof assistants or extensions to proof assistants built on HoTT or cubical type theory include Cubical Agda [cubical-agda], CoqHoTT [coq-hott], and RedPRL [redprl]. While these ITPs are relatively new, we expect that reuse will be easier in these proof assistants. However, univalence is incompatible with the popular axiom UIP (Uniqueness of Identity Proofs, which states that all proofs of equality at a given type are equal), and univalent ITPs present their own difficulties, so these are not a catch-all solution. We discuss tooling for transport that does not rely on univalence at the level of the type theory in Section 6.4.3.

Language Constructs for Reuse

Even in languages that are not designed with the goal of proof reuse in mind, certain language features can help make proof reuse more tractable. The modules, type classes, and canonical structures discussed in Section 6.2.1 are examples of these features, as are other mechanisms for inheritance. In addition, many proof assistants implement subtyping or type coercions [barthe1995implicit, aspinall1996subtyping, Saibi1997, luo1999coercive, asperti2007user, callaghan2001implementation, deMoura2015] in various forms, and these can also help make proof reuse more tractable.

One recent development is the notion of an ornament [mcbride2010], a programming mechanism for describing relationships between inductive types that preserve inductive structure. That is, there is an ornament between natural numbers and lists, and between lists and length-indexed vectors; there is no ornament between lists and trees, since these types have different inductive structures. Ornaments allow for the derivation of new types from existing types, and for the automatic lifting of functions and proofs from each existing type to the corresponding new type. Lifting functions and proofs necessitates some additional automation beyond the addition of ornaments to a language. So far, ornaments exist in various forms as deep embeddings in Agda [Dagand17jfp, Williams2014, ko2016], and as tooling for proof reuse (Section 6.4.3) in Coq.
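The relationship between natural numbers and lists can be sketched in Python as follows. The encoding is hypothetical: `None` plays the role of both zero and the empty list, and `append` is written by hand here, whereas ornament-based tooling aims to derive such lifted functions with some automation. The final check tests one instance of the coherence property rather than proving it.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Succ:
    pred: "Nat"


Nat = Optional[Succ]    # None plays the role of zero


@dataclass(frozen=True)
class Cons:
    head: Any           # the extra data added to the successor case
    tail: "ListT"


ListT = Optional[Cons]  # None plays the role of the empty list


def forget(xs: ListT) -> Nat:
    """Forgetful map: erase the elements, keeping only the inductive structure."""
    return None if xs is None else Succ(forget(xs.tail))


def add(m: Nat, n: Nat) -> Nat:
    return n if m is None else Succ(add(m.pred, n))


def append(xs: ListT, ys: ListT) -> ListT:
    """`add` lifted to lists: the same recursion, plus the stored element."""
    return ys if xs is None else Cons(xs.head, append(xs.tail, ys))


xs = Cons("a", Cons("b", None))
ys = Cons("c", None)
# Coherence: forgetting after appending agrees with adding the forgotten lengths.
print(forget(append(xs, ys)) == add(forget(xs), forget(ys)))
```

The definitions of `add` and `append` share the same recursive shape, which reflects the shared inductive structure of the two types.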

The language Cedille makes it possible to define combinators that allow for reuse of functions and proofs across certain related datatypes without any performance penalty [Diehl2018]. Like ornaments, these combinators facilitate reuse between unindexed and indexed versions of types like lists and vectors. They do not support incompletely determined relations that ornaments support, such as the ornament between natural numbers and lists (lists have a new element in the inductive case). Applications of these combinators definitionally reduce in such a way as to facilitate efficient reuse thanks to properties of the underlying type theory of Cedille.

6.4.3 Automated Tooling for Proof Reuse

Since proof assistants typically involve a heavily interactive workflow like the REPL, they lend themselves naturally to automation. As such, in addition to the design principles and language features for proof reuse found in typical software engineering projects, there is a body of work that uses automated tooling to repurpose existing proofs. Section 6.3.2 describes some frameworks for component reuse, a kind of proof reuse, in mechanized metatheory. This section describes other tooling for proof reuse.

Adapting Inductive Proofs

Boite2004 describes a tactic to adapt proof obligations to changes in inductive types. This technique constructs and analyzes a dependency graph to determine when reuse of existing proofs is possible, then reuses existing proofs when possible and generates new proof obligations for new branches of the proof. Mulhern06proofweaving provides a high-level description of a possible method to synthesize missing proofs for those new obligations using a type reconstruction algorithm, though it is not currently implemented.

Proof Planning

Proof planning [Bundy1998] is a proof search technique that uses plans to guide search for proofs with similar structures. Proof planning can involve the use of critics [ireland1996], which reuse information from failing proofs to guide search for correct proofs. While it was originally designed for use with automated theorem provers, it has also reached interactive theorem provers. For example, IsaPlanner [Dixon2003] is a proof planner for Isabelle with support for rippling [shah2005], a technique for automatic induction. Rippling has also been implemented in an induction automation tool for Coq [wilson2010].

Proof Generalization

Proof generalization tools generalize proofs of a theorem to obtain proofs of more general theorems. Proof generalization first arose in the 1990s [hasker1992generalization, kolbe1998proof, pons1999conception]. A simple example of proof generalization is the Coq generalize tactic, which does basic syntactic generalization. The Coq documentation demonstrates this tactic on the following proof state [coq-tactics]:

x, y : nat
0 <= x + y + y

Running generalize (x + y + y) on this goal produces the following proof state:

x, y : nat
forall n : nat, 0 <= n

The final generated proof term proves the original goal by specialization of this generalized goal.
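The syntactic nature of this kind of generalization can be sketched outside of Coq. The Python fragment below is a minimal model, not Coq's implementation: terms are nested tuples, the functions replace and generalize are hypothetical, and the sketch assumes that the chosen variable name is fresh and ignores typing entirely.

```python
# Terms are modeled as nested tuples: (operator, argument, argument).
# Assumption: `var` does not already occur in the goal (no freshness check).

def replace(term, target, var):
    """Replace every syntactic occurrence of `target` in `term` by `var`."""
    if term == target:
        return var
    if isinstance(term, tuple):
        return tuple(replace(sub, target, var) for sub in term)
    return term

def generalize(goal, target, var="n"):
    """Abstract `target` out of `goal`, yielding a quantified goal."""
    return ("forall", var, replace(goal, target, var))

x_plus_y_plus_y = ("+", ("+", "x", "y"), "y")
goal = ("<=", 0, x_plus_y_plus_y)

print(generalize(goal, x_plus_y_plus_y))
# ('forall', 'n', ('<=', 0, 'n'))

# A term that is equal only up to commutativity is not matched:
print(generalize(goal, ("+", "y", ("+", "x", "y"))))
```

The second call leaves the body of the goal unchanged, because the match is purely structural; this is the sense in which the generalization is only syntactic.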

The generalization technique implemented in Coq’s generalize tactic can handle only simple syntactic substitution. A few tools can handle more complex transformations. For example, Johnsen2004 presents a proof generalization tool for Isabelle with proof terms which can handle generalizing over dependencies on other theorems, as well as generalization over functions and types. The Coq proof repair tool PUMPKIN PATCH (Section 7.2.3) includes an abstraction component which does not just syntactic generalization, but also type-driven generalization.


Transport (also known as transfer) methods automatically adapt proofs along relations. These tools aim to mimic the experience of mathematical proofs on paper, in which simply stating a relation between two structures (such as an equivalence) can be enough to use theorems about one structure as theorems about the other.

This idea developed both as an extension to the language and as an approach to automation: Barthe2001 introduced an extension to dependent type theory with a computational interpretation of isomorphisms using rewrites. Around the same time, Magaud2002 introduced an automatic method for adapting proofs along binary and unary representations of the natural numbers.

Since then, there have been many more transport tools handling more than just those two types, including the Transfer and Lifting packages [Huffman2013] for Isabelle/HOL, and a prototype Coq plugin for transporting proofs across isomorphisms and implications [ZimmermannH15].
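The basic shape of transport along an equivalence can be sketched for functions, using unary and binary representations of the natural numbers as the running example. The Python fragment below is only an illustration under stated assumptions: the encodings (nested tuples for unary numbers, bit strings for binary numbers) and the names to_binary, to_unary, and transport2 are hypothetical and do not correspond to any of the tools above, which also transport proofs and not just functions.

```python
# Unary naturals: ZERO = () and succ(n) = (n,).
# Binary naturals: strings of bits, most significant bit first, "0" for zero.

ZERO = ()

def succ(n):
    return (n,)

def unary_add(m, n):
    """Addition by recursion on the first argument."""
    while m != ZERO:
        (m,) = m      # peel one successor off m
        n = succ(n)   # and push it onto n
    return n

def to_binary(n):
    """One direction of the equivalence: unary to binary."""
    count = 0
    while n != ZERO:
        (n,) = n
        count += 1
    return bin(count)[2:]

def to_unary(b):
    """The other direction of the equivalence: binary to unary."""
    n = ZERO
    for _ in range(int(b, 2)):
        n = succ(n)
    return n

def transport2(f):
    """Transport a two-argument function on unary numbers to binary numbers."""
    return lambda a, b: to_binary(f(to_unary(a), to_unary(b)))

binary_add = transport2(unary_add)
print(binary_add("101", "11"))  # 5 + 3 in binary: 1000
```

A function obtained this way round-trips through the unary representation on every call, which gives some intuition for why the efficiency of transported functions is a practical concern.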

Univalent transport is the particular kind of transport across type equivalences that arises from HoTT’s univalence axiom (see Sections 4.3.2 and 6.4.2). Equivalences for Free! [tabareau2018] uses insights from HoTT to develop and formalize a powerful tool for transporting proofs across equivalences in Coq. In many cases, it is possible to use this tool to port functions and proofs without any axiomatic dependencies; in some cases, the tool relies on the functional extensionality axiom. Thus far, the primary barriers to usability of the tool are the proof burden on the user to configure the automation, and the inefficiency of the generated functions. Nonetheless, this is a significant step toward a robust tool for automatic transport in a proof assistant that does not depend on univalence.

The DEVOID [Ringer2019] Coq plugin automates transport across certain equivalences that correspond to algebraic ornaments, a particular class of ornaments (Section 6.4.2). DEVOID automatically discovers and proves the equivalences that correspond to these ornaments, and then transports functions and proofs across those equivalences using a program transformation. DEVOID handles a narrow class of equivalences relative to Equivalences for Free!, but the functions and proofs that it produces for the cases it can handle are small and efficient in comparison.

6.4.4 Packaging and Distributing Programs and Proofs

Proof assistant projects are software artifacts, and can thus be packaged and distributed in much the same way as other software. For Isabelle/HOL, the venue for distribution is the Archive of Formal Proofs [isabelleafp, Blanchette2015]. Coq uses the OCaml infrastructure around the OPAM package manager to provide a similar collection of packages [CoqOPAM]. In principle, executable verified software can also be distributed on these platforms, but it can also use conventional channels, which may raise issues of trust.

6.4.5 Future of Reuse

Component reuse, a form of proof reuse, is an underaddressed tenet of POPLMark. The solutions which do exist are able to take advantage of common proof structure within the domain of metatheory. We expect that similar common structure exists for proofs in other domains, and this area is ripe for the development of design principles, frameworks, and automated tooling to maximize reuse.

More generally, we expect to see mainstream proof assistants continue to integrate language reuse constructs. For example, ornaments are a promising feature designed specifically for reuse in a dependently typed language, but most existing implementations require the user to write programs and proofs in a domain-specific deeply embedded logic. DEVOID takes some steps toward integrating ornaments into an existing ITP without an embedding, but it handles only a small class of ornaments and makes some additional restrictions beyond those that the original ornaments work assumes. We expect ornaments to integrate more smoothly with existing ITPs in the future.

We expect that recent developments in HoTT will fundamentally change how people view proof reuse, and that concepts from HoTT will continue to influence the design of proof reuse tools for other languages. Approaches like Equivalences for Free! have the benefit of principled design of automation with guaranteed properties, but do not introduce univalence, and so are not incompatible with other assumptions that programmers may want in their type theories. These two views of univalent transport can continue to evolve alongside one another.

7.1 User Interfaces and Tooling for User Support

Most early proof assistants shipped with a very simple user interface: the Read-Eval-Print Loop (REPL). This interface reads in user-written expressions in the proof assistant language, evaluates those expressions, then prints a result or error for the user.

User interfaces for proof assistants have come a long way from the REPL. The support that these REPLs provide users is minimal, and so soon after their development, many techniques to ease interaction with REPLs arose. While the interfaces from earlier eras still see common use, we are now entering an era of interaction that emphasizes full integrated development environments (IDEs), with support for project management and for asynchronous development.

In parallel to this evolution of user interfaces (Section 7.1.1), we are seeing an increase in specialized interfaces (Section 7.1.2), usability analysis of user interfaces (Section 7.1.3), and advanced tooling for user support (Section 7.1.4). We expect these traditions will merge and drive the future of interaction (Section 7.1.5) with proof assistants.

7.1.1 The Evolution of User Interfaces

We can think of proof assistant user interfaces as evolving in three generations:

  1. Generation I: The REPL

  2. Generation II: Separation of Concerns

  3. Generation III: Full IDEs

Generation I: The REPL

The REPL was the earliest form of interaction with the proof assistant. For example, the description of Stanford LCF [Milner1972b] calls the proof process a “conversation between the user and the computer.” The LCF user writes commands, which the computer evaluates and replies to with feedback such as new goals. In part of the example from the LCF description, the user cuts an inline lemma:

*****GOAL f ⊂ g;

The computer then responds acknowledging the new goal:

NEWGOAL #1 f ⊂ g

The user tells the computer to prove this goal inductively:

*****TRY 1 INDUCT 1;

The computer responds with two intermediate goals, a base case and an inductive case:

NEWGOAL #1#1 UU ⊂ g
NEWGOAL #1#2 fun(f1) ⊂ g ASSUME f1 ⊂ g

The user then proves those subgoals, then uses the inline lemma to prove the original result.

Many ITPs followed in this tradition and introduced command line REPLs. Examples of command line REPLs include the coqtop [coq-commands] command for Coq and the hol [hol-interact] command for HOL. Some of these tools are still accessible even when graphical interfaces exist. For example, Coq still exposes its coqtop command, in spite of the existence of the graphical interfaces CoqIDE [coqide] and Proof General [Aspinall2000].

Figure 7.1: Agda Emacs mode (from barret-agda)

However, not all ITPs followed in this tradition. Nuprl, for example, was distributed with a graphical interface from the start [Constable1986]. Even those ITPs that followed in this tradition sometimes later diverged. For example, Agda’s --interactive option to interact with the REPL directly is no longer supported; Agda interaction happens through an Emacs [emacs] or Atom [atom] mode which calls out directly to the backend theorem prover [agda-editing]. Figure 7.1 shows an example of the Agda Emacs mode. Isabelle/HOL has recently done away with its REPL; the default interface is now Isabelle/jEdit [Wenzel2012], which instead builds on Isabelle/PIDE [Wenzel2014].

Generation II: Separation of Concerns

The 1990s saw a surge in the release of interfaces for ITPs decoupled from the proof checker, typically communicating with the system through a protocol. In some ways, this was a natural path of evolution from the way that users typically interacted with the REPLs for existing ITPs. While the REPL was typically exposed through a command line tool, it was common to instead use multiple Emacs buffers, one for development and one for the proof assistant top-level, and to copy definitions between the two. This approach is still used in some modern proof assistants such as HOL [hol-tutorial].

This mode of interaction naturally led to the development of Emacs modes which interact with the REPL or theorem prover backend. For example, Isamode [isamode] for Isabelle99 was an Emacs mode for Isabelle which smoothed interaction with the REPL. The HOL4 Emacs mode [hol4-interact] is still used to this day. The Agda Emacs mode, which interacts with the Agda backend, is similar in spirit but contains more advanced functionality; it allows the user to, for example, define holes in terms and fill those holes in later in development. Idris includes an Emacs mode [mehnert2014tool] for interacting with the REPL, which is inspired by both the Agda Emacs mode and Proof General.

Other interfaces beyond Emacs modes communicate with the backend theorem prover or REPL in this style. For example, ALF, a predecessor to Agda, included a window-based interface for communicating with the backend [altenkirch1994user]. The lightweight interface TkHOL [tkhol] for HOL also follows in this style. BERTOT1998 describes a generic approach for building an interface that communicates with the ITP using a protocol, inspired by the early Coq user interface CtCoq.

Figure 7.2: Proof General (left) and CoqIDE (right) for Coq

In some cases, these interfaces were entirely independent of the underlying proof assistant. One notable example of such an interface from this generation is the Emacs extension Proof General [Aspinall2000], an interface for proof development that supports multiple proof assistants. Proof General has seen widespread use, especially within the Coq community. While Proof General best supports Coq, it also has support for LEGO, PhoX, and an old version of Isabelle, as well as experimental support for other proof assistants [proofgeneralwebsite]. It is simple yet easily extensible, both to support new proof assistants and to add new functionality for existing proof assistants. Company-Coq [CompanyCoq2016], for example, extends Proof General with many new features for Coq, including improved autocompletion and integration of documentation.

Following the success of Proof General, Coq released the lightweight interface CoqIDE [coqide] as part of Coq 8.0 [coqide-commit]. Its main selling point was speed: It claimed to be faster than Proof General. In addition, CoqIDE’s native support for Coq means that it is always maintained and distributed with new versions of Coq, imposing minimal overhead on users. Figure 7.2 shows CoqIDE and Proof General side-by-side for Coq.

Both third-party interfaces and native interfaces from this generation continue to be popular to this day. This separation of concerns has also inspired a new generation of specialized interfaces (Section 7.1.2) for proof assistants.

Generation III: Full IDEs

The third generation of user interfaces coincides with the rise of proof development of large projects and the corresponding increase in concern for good proof engineering support. Interfaces from this generation focus on scaling to large developments. For example, early user interfaces did not support asynchronous development: they did not allow the user to run the proof checker on some proofs while modifying others. Early user interfaces also did not have support for project management, and so were not truly full-scale IDEs.

Figure 7.3: Isabelle/jEdit (from isabelle-jedit)

Many IDEs in the latest wave of development address these concerns. Coqoon [Faithfull2016], for example, is an IDE for Coq built on Eclipse with support for both project management and asynchrony. The PIDE framework, originally developed for use with the Isabelle IDE Isabelle/jEdit [Wenzel2014], also supports asynchronous development; Isabelle/jEdit is shown in Figure 7.3.

PIDE is ultimately indifferent to the backend theorem prover; Wenzel2013 and Barras2015 describe interfaces built on PIDE for Coq. PIDE also has additional interfaces in Isabelle aside from the default Isabelle/jEdit, including Isabelle/VSCode [isabelle-vscode], an Isabelle plugin for Visual Studio Code [vscode]. PIDE has seen enough success that Isabelle has done away with its REPL entirely.

Like Isabelle, Lean also has an IDE implemented as a Visual Studio Code plugin [lean-vscode]. This IDE communicates with the Lean server, and supports incremental compilation and proof checking, debugging, documentation, and batch execution.

Many existing proof assistant interfaces have integrated features from this generation. For example, CoqIDE now supports asynchrony [coqide]. It remains to be seen to what extent full-scale IDEs for proof assistants will continue to evolve and to grow in popularity.

7.1.2 Specialized Interfaces

The separation of concerns from Generation II interfaces for proof assistants inspired the development of specialized interfaces. For example, web-based interfaces require minimal setup and installation, and so are thought to be less intimidating to new users, especially students. Many web-based interfaces are built with students as the key audience to address concerns students have about installing and using heavyweight IDEs. Examples of web-based interfaces for proof development include ProofWeb [kaliszyk2007web], jsCoq [Gallego2016], and PeaCoq [peacoq]. The Lean 2 tutorial [lean-tutorial] uses the Lean.JS [lean-js] web interface for Lean to provide an interactive learning experience directly in the browser.

Proof assistant users sometimes note that the experience of writing proofs has game-like elements. The interactive nature of a proof assistant, for example, is similar to interacting with an adversary in a game. There is some work on gamification of proofs that reifies this intuition into the interface itself. In these games, players can generate program annotations [dietl2012verification], write natural deduction proofs [lerner2015polymorphic], and identify inductive invariants [bounov2018inferring], all the while having low-level details of these proofs abstracted away from them. While these games are not interfaces for well-known ITPs like Coq and Isabelle, they may help with tasks that can assist users in writing proofs, such as finding inductive invariants. Applying this same intuition to build new interfaces for commonly-used proof assistants may help make them more accessible to non-experts in the future.

7.1.3 Interface Usability Analysis

Traditional software engineering tools and interfaces are often subject to usability analyses according to the conventions in human-computer interaction (HCI). There are some similar analyses related to proof assistants. Aitken1998 propose a three-layer model to account for user interaction with a proof assistant, and perform an empirical study which concludes that there is support for the view of “proof as programming” for proof assistant interaction, rather than “proof by pointing” [Bertot1994] and “proof as structure editing”. Kadoda1999 analyze the usability of theorem provers in a cognitive framework by using questionnaires. Aitken2000 analyze errors in proof attempts. HOFM2014raey use focus groups to evaluate usability of proof assistants, finding that users prefer proof assistants that produce intuitive proofs, can present comprehensible proof steps, and provide a convenient interface.

7.1.4 Tooling for User Support

In parallel with the evolution of user interfaces, recent years have seen an emphasis on tooling to help users with proof development. These features are more useful than ever because of the advent of Generation II user interfaces that are not tightly tied to the REPL.

Many of the user support features that are now arising for proof engineering echo similar features that already exist in languages with more mature IDEs. For example, languages with more mature IDEs often integrate refactoring tools into those IDEs; now that proof assistant interfaces are maturing, interfaces with refactoring support such as the Coq interface CoqPIE [Roe2016] are beginning to emerge. We discuss more refactoring tools in Section 7.2.2.

In addition, new techniques are extending the reach of user support features to support the challenges particular to proof development. For example, one common challenge in proof engineering is efficiently finding relevant datatypes and proofs. Many proof assistants distribute tools for this by default. For example, Coq includes the Search command, which SSReflect [Gonthier2010] extends; Isabelle includes the find_theorems and find_consts commands. This challenge has also inspired several external tools, including the web-based tool Whelp [asperti2004content] for Coq, upon which Matita builds [asperti2007user].

Machine-learning techniques can also help with challenges in proof developments, for example by suggesting hints to users. Recent tooling of this flavor includes the ML4PG [Komendantskaya2012] extension to Proof General, which uses machine learning to suggest hints during proof development, and ACL2(ml) [Heras2013], which uses machine learning to suggest auxiliary lemmas for ACL2 development. Nagashima and He propose a proof method recommendation system for Isabelle/HOL based on machine learning, which is trained on large proof corpora [Nagashima2018].

Unlike traditional software development, proof development with ITPs often involves significant interaction with automation. Accordingly, one question that many tools explore is the ideal user experience for interacting with automation. The web-based IDE PeaCoq [peacoq], for example, has extra support for tactic previews and context management. Matita [asperti2007user] includes special support for contextual term manipulation, and for understanding the execution of tactical-like chains of tactics.

Another common problem in proof development is that the proof engineer may accidentally state a false theorem, or may be unsure if a stated theorem is true. When a stated theorem is false, it can be difficult to determine that the theorem is actually false; the proof engineer may instead attribute the inability to prove the theorem to their own shortcomings. Hammers and other general-purpose automation (Section 5.2.1) can help a proof engineer discharge simple proof obligations and quickly determine that a theorem is true; the proof engineer can then reprove the theorem in a different way if desired. Property-based testing tools like Quickcheck for Isabelle [Bulwahn2012b] and QuickChick [lampropoulos2017generating, Paraskevopoulou2015] for Coq can help users identify counterexamples to false properties.
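The idea of searching for counterexamples before attempting a proof can be sketched in a few lines of Python. This is not the interface of Quickcheck or QuickChick: the helper names small_lists and find_counterexample are hypothetical, and for the sake of determinism the sketch enumerates all small inputs exhaustively instead of generating random ones.

```python
from itertools import product

def small_lists(max_len, values=(0, 1)):
    """Enumerate all lists over `values` up to length `max_len`, shortest first."""
    for n in range(max_len + 1):
        for tup in product(values, repeat=n):
            yield list(tup)

def find_counterexample(prop, max_len=3):
    """Return the first pair of small lists that falsifies `prop`, or None."""
    for xs in small_lists(max_len):
        for ys in small_lists(max_len):
            if not prop(xs, ys):
                return xs, ys
    return None

def rev(xs):
    return xs[::-1]

def false_claim(xs, ys):
    """A false 'theorem': reverse distributes over append without swapping."""
    return rev(xs + ys) == rev(xs) + rev(ys)

def true_claim(xs, ys):
    """The corrected statement swaps the two reversed lists."""
    return rev(xs + ys) == rev(ys) + rev(xs)

print(find_counterexample(false_claim))  # ([0], [1])
print(find_counterexample(true_claim))   # None
```

Finding no counterexample does not establish the corrected statement, which still needs a proof; the search only helps avoid spending effort on a statement that is false.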

7.1.5 Future of User Interfaces

Many of the Generation I interfaces still exist today. We expect some of these will continue to exist, since they are useful when resources are limited. However, there is a growing trend of moving away from the REPL in some ITPs such as Isabelle/HOL; perhaps more ITPs will move in that direction.

The interactive nature of the REPL makes it simple to collect fine-grained data on how proof engineers develop code. For proof assistants that are backed by a REPL, collecting this data could help with the development of better tooling to support proof engineers during development. Similarly, while there is some empirical information on how proof engineers interact with different user interfaces, there is still a lot more ground to cover. Even collecting simple information like the number of users of each interface for each proof assistant over time may help gauge the impacts of different design decisions. More user studies on interacting with proof assistants could also help pave the way for more useful interfaces.

We expect that the separation of concerns emphasized with Generation II interfaces has had a strong influence and will likely continue to have a strong influence. Separation of concerns and extensibility may be part of why Proof General has continued to be successful after so many years. In many ways, this mirrors the success that is seen in successful IDEs for software engineers, such as Eclipse [eclipse] or IntelliJ IDEA [intellij]. We expect that future developments will continue to work on separation of concerns and extensibility, with better plugin systems for IDEs to support more features with minimal effort for the interface developer and for the proof engineer.

Emacs has played a crucial role in the history of the development of IDEs for ITPs, with many early interfaces implemented as Emacs modes. Recent years have seen the development of IDEs as Visual Studio Code plugins for Isabelle/HOL and for Lean. In the future, perhaps more proof assistants will implement IDEs as plugins for existing IDEs such as Visual Studio Code and Eclipse, and perhaps these existing IDEs will play a similar role to Emacs in the continued development of proof assistant IDEs.

We also expect that project management will continue to grow in importance as large proof developments become more common. Few interfaces for proof assistants currently have strong support for project management; this is an area ripe for improvement. Better integration of build tools and continuous integration tools can greatly improve development experience.

Finally, we expect more productivity tools to emerge, like the refactoring tools that already exist, and for these tools to be integrated into IDEs. For example, most mature IDEs for existing languages have strong support for debugging. Debugging tools for proof assistants, on the other hand, are few and far between. Better plugin systems for interfaces could help minimize the friction in supporting these features at the IDE level.

7.2 Proof Evolution

Programs change over time, and so proofs about programs must change with those programs. This concern is raised in the Social Processes [DeMillo1977] critique of program verification as a barrier for the verification of real programs. This barrier has been realized in real developments; a review [Elphinstone2013] of the evolution of the seL4 verified OS microkernel [Klein2009], for example, notes that while customizing the kernel to different environments may be desirable, “the formal verification of seL4 creates a powerful disincentive to changing the kernel.” leroy2012 motivates and describes updates to the initial CompCert memory model that include changes in specifications, automation, and proofs [leroy-mem-2010].

Changes in programs and proofs are not always in the proof engineer’s control: updating a standard library, for example, can lead to proofs in client code failing during regression proof checking (Section 7.2.1). Reactive approaches to proof evolution address changes that occur outside of the proof engineer’s control. These approaches contrast with and are complementary to proactive approaches that address brittleness ahead of time, such as the design principles discussed in Section 6.2.

Consider, for example, a Coq proof that uses the intros tactic. If the user does not pass identifiers to intros, then Coq automatically chooses hypothesis names. Small changes to the theorem statement or to the proof can change the names of the hypotheses that Coq chooses, which can make proofs that refer to those hypotheses brittle. We briefly discussed two proactive approaches to this problem in Section 6.2.3: explicitly choosing identifiers to pass to intros, and writing tactics that do not refer to these hypotheses at all. In contrast with these proactive approaches, the IDE CoqPIE [Roe2016] automatically renames references to hypotheses in proofs to work around this problem reactively.

The renaming functionality of CoqPIE is an example of proof refactoring (Section 7.2.2), a reactive approach to proof evolution. Proof repair (Section 7.2.3) is a similar reactive approach to proof evolution. The main distinction between these two approaches is that proof refactoring is semantics-preserving, while proof repair need not be. Nonetheless, these technologies often overlap.

7.2.1 Regression Proving

Regression proving is the process of rechecking proofs after a change to a verification project, mirroring regression testing for software projects. For large-scale projects, regression proving may require considerable machine time—from tens of minutes and hours up to several days. This can negatively affect the productivity of proof engineers. Absent domain- and context-specific knowledge, as in proof refactoring, the two main techniques to speed up regression proving are proof-checking parallelization [Wenzel2013MultiProcessing, Barras2013] and proof selection [Celik2017].

Support for parallelization varies in degree and kind among proof assistants. Isabelle leverages the support for threads in its host compiler, Poly/ML, to spawn proof checking tasks processed by parallel workers. Using a notion of proof promises, proofs that require previous unfinished results can proceed normally and become finalized when extant tasks terminate [Wenzel2013]. Isabelle also includes a build system with integrated support for checking of proofs and management of parallel workers. The lack of native threads in OCaml prevents similar low-cost fine-grained parallelism for Coq. However, spawning parallel operating system processes is still possible, and such processes can be leveraged for both file-level parallelism and to check fine-grained proof tasks [Barras2015]. Lean supports fine-grained parallel proof checking [deMoura2015]. Compared to parallelization of test execution for software projects, checking a proof is deterministic and has no side-effects detrimental to checking other proofs.

A regression proof selection (RPS) technique limits the scope of regression proving to those proofs that are affected by a change to a project. While selection at the file level (modulo file dependencies) is broadly supported via build systems such as make, only some proof assistants such as Isabelle and Coq support selection of individual proofs; this is made possible by support for asynchronous proof checking [Wenzel2014, Barras2015]. Celik et al. proposed an RPS technique for Coq that combines dependency analysis at the file and proof levels [Celik2017]; their tool implementation, dubbed iCoq, compares checksums of files, terms, and proof scripts to locate and run affected proofs sequentially. In an evaluation on the revision histories of several large-scale Coq projects, iCoq was up to 3 times faster than using conventional make-style checking with a persistent store, and up to 10 times faster than conventional checking when each revision is checked from a clean slate.
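A checksum-based selection step at the proof level can be sketched as follows. This is a simplified model and not iCoq's algorithm: the project representation (a dictionary of lemmas with statement, script, and deps fields), the function names, and the lemma texts are all hypothetical placeholders. The sketch also assumes that proofs are opaque, so that users of a lemma need rechecking only when its statement changes.

```python
import hashlib

def checksum(text):
    """Stable checksum of a piece of source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def summarize(project):
    """Map each lemma name to (statement checksum, proof script checksum)."""
    return {name: (checksum(lemma["statement"]), checksum(lemma["script"]))
            for name, lemma in project.items()}

def select_proofs(old, new):
    """Return the sorted names of lemmas in `new` whose proofs need rechecking."""
    before, after = summarize(old), summarize(new)
    # Directly affected: the lemma is new, or one of its checksums changed.
    changed = {name for name in after if before.get(name) != after[name]}
    # Users are affected only if a dependency's statement changed or vanished.
    unstable = {name for name in after
                if name not in before or before[name][0] != after[name][0]}
    unstable |= set(before) - set(after)
    users = {name for name, lemma in new.items()
             if unstable & set(lemma["deps"])}
    return sorted(changed | users)

def revise(project, name, **fields):
    """Copy `project`, overriding some fields of one lemma."""
    copy = {n: dict(lemma) for n, lemma in project.items()}
    copy[name].update(fields)
    return copy

# Placeholder lemma texts; they are treated as opaque strings here.
old = {
    "add_0_r": {"statement": "forall n, n + 0 = n",
                "script": "induction n; auto.", "deps": []},
    "add_comm": {"statement": "forall n m, n + m = m + n",
                 "script": "induction n; auto using add_0_r.",
                 "deps": ["add_0_r"]},
    "mul_comm": {"statement": "forall n m, n * m = m * n",
                 "script": "induction n; auto using add_comm.",
                 "deps": ["add_comm"]},
}

script_only = revise(old, "add_0_r", script="intros n; induction n; auto.")
statement_too = revise(old, "add_0_r", statement="forall n : nat, n + 0 = n")

print(select_proofs(old, script_only))    # ['add_0_r']
print(select_proofs(old, statement_too))  # ['add_0_r', 'add_comm']
```

In the second case mul_comm is not selected, because the statement of its only dependency, add_comm, is unchanged; under the opacity assumption its proof cannot observe the change.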

Palmskog2018 defined a taxonomy of regression proving techniques for proof assistants that include both parallelism and selection. Along one axis, they consider parallelization at the file and proof granularity. Along the other axis, they consider selection of files and proofs. Their most sophisticated technique combines proof selection and fine-grained parallelization, and consistently outperforms other techniques on the revision histories of several large-scale Coq projects.

WenzelScalingIsabelle, WenzelFurtherScalingIsabelle outlined how to scale Isabelle for large projects using both parallelism and other techniques.

7.2.2 Proof Refactoring

Refactoring is the restructuring of code in a way that preserves semantics [opdyke1992]; proof refactoring is the refactoring of proofs [WhitesidePhD]. Proof refactoring tools help automate this process, propagating a single change throughout the proof development. Like program refactoring tools, proof refactoring tools can help keep developments maintainable as they change over time [Bourke12]. In that way, it is possible to consider refactoring tools as both proactive and reactive approaches to proof evolution, though we consider them here through a reactive lens.

Some proof assistants expose tactics (Section 5.1.1) or proof languages (Section 5.1.2) in which the proof engineer can write high-level proof scripts to guide proof search. Some proof refactoring tools refactor these proof scripts directly. One such tool is POLAR [Dietrich2013], a generic framework for proof script refactoring. POLAR is instantiated with two languages, both of which are based on Isabelle/Isar [Wenzel2007isar]: Hiscript [WhitesidePhD], a language with support for refactoring, and script [dietrich2011], a language with support for proof planning (Section 6.4.3). Refactoring in POLAR works through a combination of rewrite rules that operate over a graph representation of the underlying language. POLAR implements ten kinds of refactorings by default, and also supports custom refactorings. It guarantees that all lemmas that go through before the refactoring continue to go through after the refactoring.

Some proof refactoring tools focus on specific refactoring tasks that are common in proof development. For example, Levity [Bourke12] is a proof refactoring tool for an old version of Isabelle/HOL that automatically moves lemmas to maximize reuse. The design of Levity is informed by experiences with two large proof developments. Levity addresses problems that are especially pronounced in the domain of proof refactoring, such as the context-sensitivity of proof scripts. Tactician [adams2015] is a refactoring tool for proof scripts in HOL Light that focuses on refactoring proofs between sequences of tactics and tacticals.

There is little work on refactoring proof terms (Section 4.3.1) directly. This is the main focus of Chick [robert2018], which refactors terms in a dependently-typed functional language similar to Gallina. To use Chick, the proof engineer applies some refactorings. Chick then uses a program differencing algorithm to determine the changes to make elsewhere in the program, then makes those changes. Chick supports insertion, deletion, modification, and permutation of subterms. Similarly, RefactorAgda [wibergh2019] is a refactoring tool for a subset of Agda that operates directly over Agda terms. RefactorAgda supports many changes, including changing indentation, renaming terms, moving terms, converting between implicit and explicit arguments, reordering subterms, and adding or removing constructors to or from types; it also documents ideas for supporting other refactorings, such as adding and removing arguments and indices to and from types.

For both Chick and RefactorAgda, only some of these changes are semantics-preserving. Adding a new index to a type, for example, does not preserve the semantics of the original program. Accordingly, these tools can be viewed as both refactoring and repair tools, though the algorithms that they use are syntactic.

A natural integration point for a proof refactoring tool is at the level of a platform or an IDE. The IDE CoqPIE [Roe2016] for Coq takes this approach for refactoring proof scripts. CoqPIE includes a Replay button which steps through the proof while renaming any changed hypothesis names. CoqPIE can also automatically split out intermediate goals from a proof into separate lemmas. There are plans to support more refactoring functionality in CoqPIE in the future.

7.2.3 Proof Repair

Program repair [Monperrus2018] is the automatic patching of programs to fix bugs; proof repair is program repair for proofs. Proof repair tools automatically fix broken proofs. Recent lessons from a review of a certain class of program repair tools [Qi2015] highlight why proof repair is a particularly good domain of program repair. The review demonstrates that many existing tools produce incorrect patches. Among the recommendations the authors make to remedy this is the suggestion that program repair tools make use of extra information such as specifications, code from other applications, or example patches when generating patches.

In proof repair, a specification is always available: the theorem the repaired proof ought to prove. Some proof repair tools take this a step further and make use of additional information, such as example patches. One such tool is PUMPKIN PATCH [Ringer2018], a proof repair tool for Coq that generalizes example patches. PUMPKIN PATCH takes as inputs an old proof and a new proof that addresses some change in specification. From those, it identifies a reusable patch that describes the change in specification; for the kinds of changes PUMPKIN PATCH can currently handle, this patch is a Gallina function. The proof engineer can then use this patch to patch other proofs broken by the change in specification. PUMPKIN PATCH has only preliminary tooling [pumpkin-git] for applying patches automatically, and currently handles only simple changes.

Chick [robert2018] was developed in parallel to PUMPKIN PATCH, and has a similar workflow: Chick takes a set of example changes supplied by the programmer, and uses a program differencing algorithm to determine the changes to make elsewhere. Unlike PUMPKIN PATCH, Chick also applies the changes it finds. However, Chick does this using a syntactic algorithm that handles only simple transformations; for this reason, it presents itself primarily as a refactoring tool, even though the changes it makes may not preserve semantics. The refactoring tool RefactorAgda [wibergh2019] similarly describes some semantics-changing repairs for a subset of Agda.

While proof repair is analogous to program repair, it was born out of traditional proof reuse (Section 6.4). For example, PUMPKIN PATCH discovers patches which help adapt a proof of a theorem to a proof of a related theorem, and can thus be thought of as a tool to assist in proof by analogy [curien1995]. Similarly, the proposed proof weaving [Mulhern06proofweaving] method to automatically satisfy new obligations generated in response to changes in inductive types can be viewed as a proof repair technique. Proof planning critics [ireland1996] can also be viewed as a technique for proof repair.

New technologies continue to make proof repair more feasible. GALILEO [chan2011galileo] is a tool built on Isabelle for identifying and repairing faulty ontologies in response to contradictory evidence; it has been applied to repair faulty physics ontologies, and may have applications more generally for mathematical proofs. GALILEO uses repair plans to determine when to trigger a repair, as well as how to repair the ontology.

Knowledge sharing methods [gauthier2014] match concepts across different proof assistants with similar logics and identify isomorphic types, and may have implications for proof repair. Later work uses these methods in combination with HOL(y)Hammer to reprove parts of the standard library of HOL4 and HOL Light using combined knowledge from the two proof assistants [Gauthier2015]. More recently, this approach has been used to identify similar concepts across libraries in proof assistants with different logics [gauthier2017]. These methods combined with automation like hammers may help the proof engineer adapt proofs between isomorphic types, and may have applications when repairing proofs even within the same logic, using information from different libraries, different commits, or different representations of similar types.

7.2.4 Future of Proof Evolution

There is a lot of room for work in proof evolution—only a few techniques exist so far, many of which emerged in parallel. We expect these reactive approaches to continue to evolve alongside proactive approaches like design principles, as the two approaches are complementary. Proof evolution can help with changes that occur outside of the programmer’s control, such as changes in dependencies (examples of this can be found in Ringer2018) and changes that are difficult to protect against even with informed design (examples of this can be found in Klein2014).

Ideally, proof evolution tools ought to integrate naturally with the workflows of proof engineers, for example through integration with existing tactic or proof languages, or through IDE or continuous integration support. While some proof evolution tools focus on this already and can offer useful insights, this can be challenging. For example, refactoring Ltac proof scripts can be difficult, since the semantics of Ltac are not well-defined; Ltac2 [ltac2] may simplify this in the future. robert2018 discusses the challenges involved in refactoring proof scripts in more detail. Ringer2018 also discusses the challenges of workflow integration, along with other open problems in proof repair. We expect to see more emphasis on addressing these challenges in the future.

There is only preliminary work exploring how much of the work from existing refactoring and repair tools for programming carries over to the domain of proof assistants. It is worth exploring in more detail which challenges are unique to this domain. For example, Qi2015 provides several recommendations for how program repair tools can make use of extra information such as examples to make searching for patches more feasible; PUMPKIN PATCH and machine learning tools use examples for this purpose already. Future proof refactoring and repair tools can similarly learn from those recommendations.

One tempting use case for proof refactoring and repair tools is when a library changes a specification, breaking proofs in client code that uses that library. Current refactoring and repair tools, however, rely on each individual client to determine the appropriate refactorings and repairs to make to fix those proofs. To better address this problem, future refactoring and repair tools can provide support at the level of library design. A library designer may, for example, specify how something has changed to a tool; the tool may then apply this information in client code automatically. Some program repair tools already support library-provided patches [Monperrus2018]; we expect to see this extend to proof refactoring and repair tools in the future.

One barrier to useful refactoring and repair tools for proof engineers is the lack of information on the kinds of changes that proof engineers make in practice. Collecting data on the changes that proof engineers make and classifying it could help guide refactoring and repair tools to handle classes of changes that matter in practice, and could also help machine learning tools gather both positive and negative examples. Similarly, collecting the benchmarks and examples from both proactive and reactive approaches to proof evolution such as Planning for Change, seL4, iCoq, and PUMPKIN PATCH can help drive the development of future proof evolution tools and measure their success meaningfully.

7.3 User Productivity and Cost Estimation

Bourke12 outline challenges in large-scale verification projects using proof assistants: (1) new proof engineers joining the project, (2) expert proof engineering during main development, (3) proof maintenance, and (4) social and management aspects. They highlight three lessons: (1) proof automation is crucial, (2) introspective tools for quickly finding facts in large databases gain importance for productivity, and (3) tools that shorten the edit-check cycle increase productivity, even when sacrificing soundness.

Zhang2012 present a simulation model of the process of verifying the operating system kernel seL4. Their model is expressed as a software process using the tool Vensim. Andronick2012 describe the development process and management issues in verifying seL4. They conclude that formal verification, and re-verification, for systems requiring on the order of 10,000 LOC is feasible using a proof assistant. Staples2013 studied the relationships between sizes of artifacts in seL4. They find that the size of the formal specifications has a significant relationship with the size of the verified executable code. Staples2014 study the proof productivity problem in the context of seL4; they find that effort is correlated linearly with proof size. Matichuk2015 analyze the Isabelle/HOL specifications and proof scripts from the seL4 project, and find a quadratic relationship between the size of a formal property and the proof script required to prove it.
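
One simple way to tell a linear relationship from a quadratic one is to fit the slope of a log-log regression: a slope near 1 suggests linear growth and a slope near 2 suggests quadratic growth. The sketch below illustrates this on synthetic data only. It is not the statistical method used in the cited studies, and all names and numbers in it are invented for illustration.

```python
import math


def fit_exponent(sizes, outcomes):
    """Least-squares slope of log(outcome) against log(size).

    For data following outcome = c * size**p, the slope estimates p.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(o) for o in outcomes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data (not measurements from seL4).
statement_sizes = [10, 20, 40, 80, 160]
linear_outcomes = [3 * s for s in statement_sizes]         # grows like s
quadratic_outcomes = [0.5 * s * s for s in statement_sizes]  # grows like s**2

linear_slope = fit_exponent(statement_sizes, linear_outcomes)
quadratic_slope = fit_exponent(statement_sizes, quadratic_outcomes)
```

On real project data the points would be noisy, so the fitted exponent should be read together with a goodness-of-fit measure rather than on its own.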

Jeffery2015 identify 30 research questions about productivity in application of formal methods, such as verification using proof assistants. Klein2015 outline the benefits of trustworthy systems.

7.4 Mining and Learning from Proof Repositories

Mining software repositories is an emerging field that analyzes software repositories to yield actionable information about software systems and their development and evolution. We describe similar forms of analysis that have been carried out for repositories with proof assistant code.

Wiedijk2009 compared statistics for standard libraries of several proof assistants for versions available around 2009, including Isabelle/HOL, Coq, and HOL Light. For each library, he reports the number of lines of comments, proofs, definitions, etc. Despite foundational differences, the numbers are similar, with HOL Light having the smallest number of lines for definitions. For example, the LOC shares of theorem statements, definitions, and proofs in the Coq version 8.1 standard library were 11%, 8%, and 53%, respectively. Wiedijk argues informally that fewer definitions per proof means higher trustworthiness, since having proofs of relevant properties yields higher confidence in the adequacy of definitions.

Blanchette2015 investigated Isabelle’s Archive of Formal Proofs (AFP), analyzing among other properties the number and sizes of proofs, interdependencies between projects, and number of authors. For the AFP in aggregate, the LOC shares of theorem statements, definitions, and proofs were 19%, 8%, and 58%, respectively. They found that the Isabelle Sledgehammer tool for proof automation [Blanchette2013] could prove about 60% of all theorems in the AFP.

Software metrics provide quantitative ways to describe software artifacts and processes and discover new properties. Aspinall2016 first considered analogous metrics for formal proofs. More specifically, they define an abstract model of formal proofs and a set of proof metrics for this model, which they implement for three different proof assistants (Isabelle, Mizar, and HOL Light) and apply to several large proof corpora.

Komendantskaya2012 used machine learning with clustering algorithms to identify patterns in large collections of Coq tactic sequences and proof trees, e.g., to find structural similarities between lemmas, and Heras2013b, Heras2014 highlighted how statistical patterns in proofs can be leveraged during interactive proof development. Aspinall2016b used machine learning, in the form of a k-nearest-neighbor classifier, to learn and suggest theorem names in HOL Light projects that accurately reflect their property definitions.
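
As an illustration of the general k-nearest-neighbor idea, the sketch below suggests a name component for a new theorem statement by voting among the most similar named statements. The token-set features, the Jaccard distance, and the toy examples are all assumptions made for this sketch; they are not the features or data used by Aspinall2016b.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Distance between two token collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def suggest_label(examples, query_tokens, k=3):
    """Majority label among the k examples closest to query_tokens.

    examples is a list of (tokens, label) pairs.
    """
    nearest = sorted(
        examples, key=lambda ex: jaccard_distance(ex[0], query_tokens)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: statement tokens paired with a name component.
examples = [
    (("forall", "+", "0", "="), "ADD"),
    (("forall", "+", "SUC", "="), "ADD"),
    (("forall", "*", "0", "="), "MULT"),
    (("forall", "*", "SUC", "+", "="), "MULT"),
    (("forall", "<=", "SUC"), "LE"),
]

suggestion = suggest_label(examples, ("forall", "+", "=", "SUC", "0"))
```

Here the two closest examples are both labeled `ADD`, so the vote among the three nearest neighbors suggests `ADD` for the query statement.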

Muller2017 proposed a format and database for capturing and leveraging alignments between concepts in different proof assistants, e.g., between natural numbers in Coq on one hand and Isabelle/HOL on the other. One of the basic assumptions in alignment is that concepts have syntactic and semantic similarities across environments, consistent with repetitiveness assumptions in naturalness. gauthier2017 proposed an algorithm based on heuristics for generating alignments given two proof assistant libraries, and evaluated it on libraries from six proof assistants. For example, by evaluating a library against itself for alignment, duplicated concepts can be found.
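
The idea of finding duplicated concepts by aligning a library against itself can be sketched as follows. This is a deliberately simplified stand-in, not the heuristic algorithm of gauthier2017: it assumes each concept comes with a list of properties given as token tuples, abstracts the concept's own name to a placeholder, and pairs concepts whose abstracted property sets overlap strongly. The library contents and the threshold are hypothetical.

```python
from itertools import combinations


def normalize(prop, name):
    """Replace occurrences of the concept's own name by a placeholder."""
    return tuple("_" if tok == name else tok for tok in prop)


def candidate_alignments(library, threshold=0.5):
    """Pair concepts whose normalized property sets overlap enough.

    library maps a concept name to a list of properties (token tuples).
    Returns (name_a, name_b, score) triples, best score first.
    """
    profiles = {
        name: {normalize(p, name) for p in props}
        for name, props in library.items()
    }
    pairs = []
    for a, b in combinations(sorted(profiles), 2):
        union = profiles[a] | profiles[b]
        score = len(profiles[a] & profiles[b]) / len(union) if union else 0.0
        if score >= threshold:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda t: -t[2])


# Toy library in which "plus" and "add" state the same properties.
library = {
    "plus": [("plus", "x", "0", "=", "x"),
             ("plus", "x", "y", "=", "plus", "y", "x")],
    "add": [("add", "x", "0", "=", "x"),
            ("add", "x", "y", "=", "add", "y", "x")],
    "mult": [("mult", "x", "0", "=", "0"),
             ("mult", "x", "y", "=", "mult", "y", "x")],
}

duplicates = candidate_alignments(library)
```

In this toy library only `add` and `plus` are reported, since they share every abstracted property, while `mult` shares commutativity alone and falls below the threshold.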

Kaliszyk2017b leveraged statistical machine learning techniques in a tool that automatically translates (“formalizes”) mathematical texts to proof assistant code. Their approach and evaluation are based on learning and cross-validation using a corpus with established alignments between English texts and HOL Light documents, based on the Flyspeck project [Hales2017]. They find that the number of correct translations among the top 20 is 64%.

There are many recent lines of work that learn from large proof assistant corpora to directly perform various automated reasoning tasks [Kuhlwein2012, Kuhlwein2013, Kaliszyk2014, Irving2016, Loos2017, Peng2017]; the HOLStep dataset [Kaliszyk2017] is designed as a benchmark for training and evaluating such techniques in a proof assistant context. Gauthier2017b, Gauthier2018 proposed a technique for learning from HOL4 tactic sequences and proof states and automatically suggesting tactic-based proofs of theorems. They achieved a success rate of around 66% on the HOL4 standard library, and by also incorporating the automated E prover into the toolchain, they raised the success rate to 69%. Huang2018 similarly learn from tactics and proof states, but in the context of Coq and for a limited set of algebraic proof goals. Yang2019 proposed a more general tactic-based approach for learning and automatic proof suggestion for Coq, which achieved a success rate of around 12% on proofs from a large dataset of 123 Coq projects. Nagashima2018 used custom encodings of proof state in Isabelle/HOL for learning in order to predict suitable proof methods (essentially powerful domain-specific proof tactics) to apply. Bansal2019 presented a learning environment for HOL Light.