Alignment Elimination from Adams' Grammars

06/20/2017 ∙ by Härmel Nestra, et al. ∙ University of Tartu 0

Adams' extension of parsing expression grammars enables specifying indentation sensitivity using two non-standard grammar constructs --- indentation by a binary relation and alignment. This paper proposes a step-by-step transformation of well-formed Adams' grammars for elimination of the alignment construct from the grammar. The idea that alignment could be avoided was suggested by Adams but no process for achieving this aim has been described before.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Parsing expression grammars (PEG) introduced by Ford [6] serve as a modern framework for specifying the syntax of programming languages and are an alternative to the classic context-free grammars (CFG). The core difference between CFG and PEG is that descriptions in CFG can be ambiguous while PEGs are inherently deterministic. A syntax specification written in PEG can in principle be interpreted as a top-down parser for that syntax; in the case of left recursion, this treatment is not straightforward but doable (see, e.g., [8]).

Formally, a PEG is a quadruple where:

  • is a finite set of non-terminals;

  • is a finite set of terminals;

  • is a function mapping each non-terminal to its replacement (corresponding to the set of productions of CFG);

  • is the start expression (corresponding to the start symbol of CFG).

So and , where the set of all parsing expressions writable in is defined inductively as follows:

  1. (the empty string);

  2. for every (the terminals);

  3. for every (the non-terminals);

  4. whenever , (concatenation)

  5. whenever , (choice);

  6. whenever (negation, or lookahead);

  7. whenever (repetition).

All constructs of PEG except for negation are direct analogues of constructs of the EBNF form of CFG, but their semantics is always deterministic. So repeats parsing of until failure, and always tries to parse first, is parsed only if fails. For example, the expression consumes the input string ab entirely while only consumes its first character. The corresponding EBNF expressions and are equivalent, both can match either or from the input string. Negation tries to parse and fails if succeeds; if fails then succeeds with consuming no input. Other constructs of EBNF like non-null repetition and optional occurrence can be introduced to PEG as syntactic sugar.

Languages like Python and Haskell allow the syntactic structure of programs to be shown by indentation and alignment, instead of the more conventional braces and semicolons. Handling indentation and alignment in Python has been specified in terms of extra tokens INDENT and DEDENT that mark increasing and decreasing of indentation and must be generated by the lexer. In Haskell, rules for handling indentation and alignment are more sophisticated. Both these languages enable to locally use a different layout mode where indentation does not matter, which additionally complicates the task of formal syntax specification. Adams and Agacan [3] proposed an extension of PEG notation for specifying indentation sensitivity and argued that it considerably simplifies this task for Python, Haskell and many other indentation-sensitive languages.

In this extension, expression , for example, denotes parsing of while assuming a greater indentation than that of the surrounding block. In general, parsing expressions may be equipped with binary relations (as was in the example) that must hold between the baselines of the local and the current indentation block. In addition, denotes parsing of while assuming the first token of the input being aligned, i.e., positioned on the current indentation baseline. For example, the do expressions in Haskell can be specified by

Here, and stand for statement lists in the indentation and relaxed mode, respectively. In the indentation mode, a statement list is indented (marked by in the second production) and all statements in it are aligned (marked by ). In the relaxed mode, however, relation is used to indicate that the indentation baseline of the contents can be anything. (Technically, is the binary relation containing all pairs of natural numbers.) Terminals and are also equipped with to meet the Haskell requirement that subsequent tokens of aligned blocks must be indented more than the first token.

Alignment construct provides fulcra for disambiguating the often large variety of indentation baseline candidates. Besides simplicity of this grammar extension and its use, a strength of it lies in the fact that grammars can still serve as parsers.

The rest of the paper is organized as follows. Section 2 formally introduces additional constructs of PEG for specifying code layout, defines their semantics and studies their semantic properties. In Sect. 3, a semantics-preserving process of eliminating the alignment construct from grammars is described. Section  4 refers to related work and Sect. 5 concludes.

2 Indentation extension of PEG

Adams and Agacan [3] extend PEGs with the indentation and alignment constructs. We propose a slightly different extension with three rather than two extra constructs. Our approach agrees with that implemented by Adams in his indentation package for Haskell [1], whence calling the grammars in our approach Adams’ grammars is justified. All differences between the definitions in this paper and in [3] are listed and discussed in Subsect. 2.4.

Let denote the set of all natural numbers, and let (the Boolean domain). Denote by the set of all subsets of set , and let denote the set of all binary relations on set , i.e., . Standard examples are (consisting of all pairs of natural numbers such that ) and (the identity relation consisting of all pairs of equal natural numbers); the indentation extension also makes use of (the relation containing all pairs of natural numbers). Whenever and , denote (the image of under relation ). The inverse relation of is defined by , and the composition of relations and by . Finally, denote .

2.1 Adams’ grammars

Extend the definition of given in Sect. 1 with the following three additional clauses:

  1. for every and (indentation);

  2. for every and (token position);

  3. for every (alignment).

Parsing of an expression means parsing of while assuming that the part of the input string corresponding to forms a new indentation block whose baseline is in relation to the baseline of the surrounding block. (Baselines are identified with column numbers.) The position construct , missing in [3], determines how tokens of the input can be situated the current indentation baseline. Finally, parsing an expression means parsing of  while assuming the first token of the input being positioned on the current indentation baseline (unlike the position operator, this construct does not affect processing the subsequent tokens).

Inspired by the indentation package [1], we call the relations that determine token positioning the indentation baseline token modes. In the token mode for example, tokens may appear only to the right of the indentation baseline. Applying the position operator with relation  to parts of Haskell grammar to be parsed in the indentation mode avoids indenting every single terminal in the example in Sect. 1. Also, indenting terminals with  is inadequate for do expressions occurring inside a block of relaxed mode but the position construct can be easily used to change the token mode for such blocks (e.g., to ).

We call a PEG extended with these three constructs a . Recall from Sect. 1 that and denote the set of non-terminal and terminal symbols of the grammar, respectively, and is the production function. Concerning the semantics of , each expression parses an input string of terminals () in the context of a current set of indentation baseline candidates () and a current alignment flag indicating whether the next terminal should be aligned or not (), assuming a certain token mode (). Parsing may succeed, fail, or diverge. If parsing succeeds, it returns as a result a new triple containing the rest of the input , a new set of baseline candidates updated according to the information gathered during parsing, and a new alignment flag . This result is denoted by . If parsing fails, there is no result in a triple form; failure is denoted by .

Triples of the form are behaving as operation states of parsing, as each parsing step may use these data and update them. We will write (as we never deal with different terminal sets, dependence on is not explicitly marked), and denote by the set of possible results of parsing, i.e., .

The assertion that parsing expression in grammar with input string in the context of and assuming token mode results in is denoted by . The formal definition below must be interpreted inductively, i.e., an assertion of the form is valid iff it has a finite derivation by the following ten rules:

  1. .

  2. For every , holds in two cases:

    • If for , , such that ( denotes occurring at column ) and either and , , or and , ;

    • If , and there are no and such that with either and or and .

  3. For every , holds if holds.

  4. For every , holds in two cases:

    • If there exists a triple such that and ;

    • If and .

  5. For every , holds in two cases:

    • If there exists a triple such that and ;

    • If and .

  6. For every , holds in two cases:

    • If and ;

    • If there exists a triple such that and .

  7. For every , holds in two cases:

    • If and ;

    • If there exists a triple such that and .

  8. For every and , holds in two cases:

    • If there exists a triple such that and ;

    • If and .

  9. For every and , holds if holds.

  10. For every , holds in two cases:

    • If there exists a triple such that and ;

    • If and .

The idea behind the conditions and occurring in clause 2 is that any column  where a token may appear is in relation  with the current indentation baseline (known to be in ) if no alignment flag is set, and coincide with the indentation baseline otherwise. For the same reason, consuming a token in column restricts the set of allowed indentations to or depending on the alignment flag. In both cases, the alignment flag is set to . In clause 8 for , the set of allowed indentation is replaced by as the local indentation baseline must be in relation with the current indentation baseline known to be in . After successful parsing of with the resulting set of allowed local indentations being , the set of allowed indentations of the surrounding block is restricted to . Clause 10 similarly operates on the alignment flag.

For a toy example, consider parsing of with the operation state assuming the token mode . For that, we must parse with by clause 8 since . For that in turn, we must parse with by clause 10. By clause 2, we have (as ) and (as ). Therefore, by clause 4, . Finally, and by clauses 10 and 8. The set in the final state shows that only and are still candidates for the indentation baseline outside the parsed part of the input (before parsing, the candidate set was the whole ).

Note that this definition involves circular dependencies. For instance, if for some then if by clause 3. There is no result of parsing in such cases (not even ). We call this behaviour divergence.

2.2 Properties of the semantics

Ford [6] proves that parsing in PEG is unambiguous, whereby the consumed part of an input string always is its prefix. Theorem 2.2 below is an analogous result for . Besides the uniqueness of the result of parsing, it states that if we only consider relations in then the whole operation state in our setting is in a certain sense decreasing during parsing.

Denote by the suffix order of strings (i.e., iff for some ) and by the implication order of truth values (i.e., ). Denote by the pointwise order on operation states, i.e., iff , and .

Let be a , , and . Then for at most one , whereby implies . Also if and then implies both and , and implies .


By induction on the shape of the derivation tree of the assertion . ∎

Theorem 2.2 enables to observe a common pattern in the semantics of indentation and alignment. Denoting by either or , both clauses 8 and 10 have the following form, parametrized on two mappings :

  1. For , holds in two cases:

    • If there exists a state such that and ;

    • If and .

The meanings of indentation and alignment constructs are distinguished solely by and . For many properties, proofs that rely on this abstract common definition can be carried out, assuming that is monotone, preserves the largest element and follows together with the axiom . The class of all meet semilattices with top element, equipped with mappings , satisfying these three conditions, contains identities (i.e., semilattices with ) and is closed under compositions (of different , defined on the same semilattice ) and under direct products. If then the conditions hold for with , , similarly in the case if with , . Now the direct product of the identities of and with on gives the indentation case, and the direct product of the identities of and and the Boolean lattice with gives the alignment case.

If satisfy the conditions then since . Adding dual conditions ( monotone, and ) would make a Galois’ connection. In our cases, the dual axioms do not hold.

2.3 Semantic equivalence

Let be a  and . We say that and are semantically equivalent in and denote iff for every , and .

For example, one can easily prove that , , , , for all [6]. We are particularly interested in equivalences involving the additional operators of . In Sect. 3, they will be useful in eliminating alignment and position operators. The following Theorem 2.3 states distributivity laws of the three new operators of  other constructs:

Let be a . Then:

  1. , , , , , , for all ;

  2. , , , for all ;

  3. , , , .


The equivalences in claim 1 hold as the token mode steadily distributes to each case of the semantics definition. Those in claims 2 and 3 have straightforward proofs using the joint form of the semantics of indentation and alignment and the axioms of , . ∎

Note that indentation does not distribute with concatenation, i.e., . This is because assumes one indentation block with a baseline common to and while tolerates different baselines for and . For example, take , , let the token mode be and the input state be (recall that means terminal occurring in column ). We have and (since ), therefore and . On the other hand, implies (since ) and, analogously, (since and ). Consequently, .

We can however prove the following facts:

Let be a .

  1. Identity indentation law: For all , .

  2. Composition law of indentations: For all and , .

  3. Distributivity of indentation and alignment: For all and , .

  4. Idempotence of alignment: For all , .

  5. Cancellation of outer token modes: For all and , .

  6. Terminal alignment property: For all , .


For claim 1, note that an indentation with the identity relation corresponds to , being identity mappings. Hence

where can be replaced with because by Theorem 2.2.

Concerning claims 2–4, let be two constructs whose semantics follow the common pattern of indentation and alignment with mapping pairs and , respectively. Then

By monotonicity of and the fact that , we have . By the third axiom of and , we also have whence . Consequently, can be replaced with . Hence the semantics of the composition of and follows the pattern of semantics of indentation and alignment for mappings and . To prove claim 2, it now suffices to observe that the mappings in the semantics of equal the compositions of the corresponding mappings for the semantics of and . For claim 3, it suffices to observe that the mappings , given for an indentation and for alignment modify different parts of the operation state whence their order of application is irrelevant. Claim 4 holds because the mappings , in the alignment semantics are both idempotent.

Finally, claim 5 is trivial and claim 6 follows from a straightforward case study. ∎

Theorems 2.3 and 2.3 enact bringing alignments through all syntactic constructs except concatenation. Alignment does not distribute with concatenation, because in parsing of an expression of the form , the terminal to be aligned can be in the part of the input consumed by or (if parsing of succeeds with consuming no input) by . Alignment can nevertheless be moved through concatenation if any successful parsing of the first expression in the concatenation either never consumes any input or always consumes some input:

Let be a  and .

  1. If implies for all , , then .

  2. If implies for all , , then .


Straightforward case study. ∎

Theorem 2.3 (1) holds also for indentation (instead of alignment), the same proof in terms of , is valid. Finally, the following theorem states that position and indentation of terminals are equivalent if the alignment flag is false and the token mode is the identity:

Let be a . Let , and , , . Then .


Straightforward case study. ∎

2.4 Differences of our approach from previous work

Our specification of differs from the definition used by Adams and Agacan [3] by three essential aspects listed below. The last two discrepancies can be understood as bugs in the original description that have been corrected in the Haskell indentation package by Adams [1]. This package also provides means for locally changing the token mode. All in all, our modifications fully agree with the indentation package.

  1. The position operator is missing in [3]. The treatment there assumes just one default token mode applying to the whole grammar, whence token positions deviating from the default must be specified using the indentation operator. The benefits of the position operator were shortly discussed in Subsect. 2.1.

  2. According to the grammar semantics provided in [3], the alignment flag is never changed at the end of parsing of an expression of the form . This is not appropriate if succeeds without consuming any token, as the alignment flag would unexpectedly remain true during parsing of the next token that is out of scope of the alignment operator. The value the alignment flag had before starting parsing should be restored in this case. This is the purpose of conjunction in the alignment semantics described in this paper.

  3. In [3], an alignment is interpreted the indentation baseline of the block that corresponds to the parsing expression to which the alignment operator is applied. Indentation operators occurring inside this expression and processed while the alignment flag is true are neglected. In the semantics described in our paper, raising the alignment flag does not suppress new indentations. Alignments are interpreted the indentation baseline in force at the aligned token site. This seems more appropriate than the former approach where the indentations cancelled because of an alignment do not apply even to the subsequent non-aligned tokens. Distributivity of indentation and alignment fails in the semantics of [3]. Note that alignment of a block nevertheless suppresses the influence of position operators whose scope extend over the first token of the block.

Our grammar semantics has also two purely formal deviations from the semantics used by Adams and Agacan [3] and Ford [6].

  1. We keep track of the rest of the input in the operation state while both [3, 6] expose the consumed part of the input instead. This difference was introduced for simplicity and to achieve uniform decreasing of operation states in Theorem 2.2.

  2. We do not have explicit step counts. They were used in [6] to compose proofs by induction. We provide analogous proofs by induction on the shape of derivation trees.

3 Elimination of alignment and position operators

Adams [2] describes alignment elimination in the context of CFGs. In [3], Adams and Agacan claim that alignment elimination process for PEGs is more difficult due to the lookahead construct. To our knowledge, no concrete process of semantics-preserving alignment elimination is described for PEGs before. We provide one below for well-formed grammars. We rely on the existence of position operators in the grammar; this is not an issue since we also show that position operators can be eliminated from alignment-free grammars.

3.1 Approximation semantics and well-formed expressions

For defining well-formedness, we first need to introduce approximation semantics that consists of assertions of the form where and . This semantics is a decidable extension of the predicate that tells whether parsing of may succeed with consuming no input (result ), succeed with consuming some input (result ) or fail (result ). No particular input strings, indentation sets etc. are involved, whence the semantics is not deterministic. The following set of clauses define the approximation semantics inductively.

  1. .

  2. For every , and .

  3. For every , if .

  4. For every , holds in four cases:

    • , and ;

    • There exist such that , , and ;

    • There exists such that , and ;

    • and .

  5. For every , holds in two cases:

    • and ;

    • and .

  6. For every , holds in two cases:

    • and ;

    • There exists such that and .

  7. For every , holds in two cases:

    • and ;

    • , and .

  8. For every and , if .

  9. For every and , if .

  10. For every , if .

On the PEG constructs (1–7), our definition basically copies that given by Ford [6], except for the case where our definition requires besides . This is sound since if parsing of never fails then parsing of cannot terminate. The difference does not matter in the grammar transformations below as they assume repetition-free grammars.

Let be a . Assume that for some and , . Then:

  1. If then ;

  2. If for some then ;

  3. If then .


By induction on the shape of the derivation tree of the assertion