Learning to Format Coq Code Using Language Models

by Pengyu Nie, et al.

Should the final right bracket in a record declaration be on a separate line? Should arguments to the rewrite tactic be separated by a single space? Different people and teams tend to write Coq code in distinct manners. The expressiveness, flexibility, and extensibility of Coq's languages and notations mean that Coq projects have a wide variety of recognizable coding styles, sometimes explicitly documented as conventions on naming and formatting. In particular, even inexperienced users can distinguish vernacular using the standard library and plain Ltac from idiomatic vernacular using the Mathematical Components (MathComp) library and SSReflect. While coding conventions are important for comprehension and maintenance, they are costly to document and enforce. Rule-based formatters, such as Coq's beautifier, have limited flexibility and capture only a small fraction of the desired conventions in large verification projects. We believe that applying language models, a class of Natural Language Processing (NLP) techniques for capturing regularities in corpora, can provide a solution to this conundrum. More specifically, we believe that an approach based on automatically learning conventions from existing Coq code, and then suggesting idiomatic code to users in the proper context, can be superior to manual approaches and static analysis tools, both in terms of effort and results. As a first step, we here outline initial models to learn and suggest space formatting in Coq files, with a preliminary implementation for Coq 8.10, evaluated on a corpus based on MathComp 1.9.0 which comprises 164k lines of Coq code from four core projects.
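To make the idea of learning space formatting from a corpus concrete, here is a minimal sketch in Python. It is not the models outlined in the abstract: it is a toy bigram counter that records which whitespace most often separates each pair of adjacent tokens and replays the most frequent choice. The tokenizer regex and the function names (tokens_with_gaps, train, suggest) are assumptions made for illustration only, and the tokenizer is far cruder than Coq's actual lexer.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, simplified tokenizer: identifiers, numbers, or any single
# non-space character. Coq's real lexer is considerably richer.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_']*|\d+|\S")


def tokens_with_gaps(source):
    """Split source into tokens and the whitespace found between neighbours."""
    matches = list(TOKEN.finditer(source))
    tokens = [m.group() for m in matches]
    gaps = [source[a.end():b.start()] for a, b in zip(matches, matches[1:])]
    return tokens, gaps


def train(corpus):
    """Count, for each adjacent token pair, how often each gap was used."""
    counts = defaultdict(Counter)
    for source in corpus:
        tokens, gaps = tokens_with_gaps(source)
        for left, right, gap in zip(tokens, tokens[1:], gaps):
            counts[(left, right)][gap] += 1
    return counts


def suggest(counts, source, default=" "):
    """Re-space source using the most frequent gap seen for each token pair."""
    tokens, _ = tokens_with_gaps(source)
    if not tokens:
        return ""
    out = [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        seen = counts.get((left, right))
        # Fall back to a single space for pairs never seen in training.
        gap = seen.most_common(1)[0][0] if seen else default
        out.append(gap + right)
    return "".join(out)
```

Trained on lines such as "rewrite -addnA.", the sketch would re-space "rewrite  - addnA ." to match the corpus. A real model would condition on more context than one neighbouring token; the point here is only that conventions can be read off existing code rather than written down as rules.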


