π: Towards a Simple Formal Semantic Framework for Compiler Construction

05/12/2018 ∙ by Christiano Braga, et al. ∙ Universidade Federal Fluminense

This paper proposes π, a formal semantic framework for compiler construction together with program validation. π comprises π Lib, a set of programming language constructs inspired by Peter Mosses' Component-Based Semantics, and π Automata, an automata-based formalism to describe the operational semantics of programming languages that generalizes Gordon Plotkin's Interpreting Automata.

1 Introduction

Compiler construction is considered an intimidating discipline in Computer Science and related courses. This is perhaps captured quite graphically by the cover of the standard book on the subject, the so-called “Dragon book” (Compilers: Principles, Techniques, and Tools [1]), by Alfred V. Aho and Jeffrey D. Ullman, later joined by Ravi Sethi and Monica S. Lam. There are “red”, “green” and “purple dragon” editions, but the Dragon, representing how burdensome people consider the subject to be, is always there.

The author has been developing and applying a formal approach, called π, for compiler construction, aiming at a simple technique that relies on basic mathematics and standard Computer Science courses and that could eventually ease compiler construction and help in teaching the subject. Even though preliminary attempts at its pedagogical use have been made, the main objective of this paper is to present the framework and its implementation in the Maude language.

π comprises π Lib, a set of programming language constructs inspired by Peter Mosses’ Component-Based Semantics [20], and π Automata, an automata-based formalism to describe the operational semantics of programming languages that generalizes Gordon Plotkin’s Interpreting Automata approach [23]. To write a compiler using π, one needs to transform the (abstract) syntax tree of a given language into a description in π Lib. Then, one can execute, validate or generate machine code using different formal tools developed for π Lib, such as an interpreter, a model checker, or a code generator, implemented following the formal semantics of π Lib given in terms of π Automata.

This paper contributes π, an automata-based semantic framework for formal compiler construction, and its implementation in the Maude language. A Python implementation as a Jupyter notebook is also underway, to explore different compilation and validation techniques. In this paper, we focus on π Automata for the dynamic semantics of programming languages, and its Maude implementation.

For the moment, parsing and transformation to π Lib depend on the particular framework used to implement π, Maude in this paper. Optimization is also left to an external framework, such as LLVM. In the foreseeable future we intend to cover all phases of the compiler construction process, in a formal way, based on π Automata.

The remainder of this paper is organized as follows. Section 2 recalls some preliminary material for the discussion of π Automata, the subject of Section 3. The π Automata semantics of π Lib is discussed in Section 4.1, together with its Maude implementation. Related work is discussed in Section 5. Section 6 concludes this paper with the usual final remarks and indication of future work.

2 Preliminaries

2.1 Transition systems, structural operational semantics and model checking

This section recalls, very briefly, just for completeness, the basic concepts of labeled and unlabeled transition systems, structural operational semantics and model checking.

A transition system [2] (TS) is a pair $\mathcal{T} = (S, \rightarrow)$, where $S$ denotes the set of the states of the system and $\rightarrow \subseteq S \times S$ is the transition relation. Transition systems are the standard models of structural operational semantics (SOS) descriptions. (As a matter of fact, SOS has labeled transition systems as models, with the set of labels denoting actions of the system. Labels are essential while modeling action synchronization in concurrent systems. Since we are not discussing concurrency primitives in this paper, considering the more liberal transition systems as models of SOS descriptions will not cripple the proposal of this paper.) Given an SOS description $\mathcal{M} = (G, R)$ specifying the semantics of a programming language $L$, the set $G$ defines the grammar of $L$ while relation $R$ represents the semantics of $L$ (either static or dynamic) in a syntax-directed way. Rule 1 presents the general form of the transition rule for the inductive step of the evaluation of a programming language construct in the SOS framework, where $\rho$ and $\rho'$ are environments, $\rho'$ is the result of some computation involving $\rho$, $c$ is a programming language construct and $p_i$ its parameters, $\sigma, \sigma'$ are memory stores, with $\sigma'$ the result of some computation of $p_i$, and $\mathit{Cond}$ is a predicate not involving transitions.

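$$\frac{\rho' \vdash p_i, \sigma \Rightarrow p_i', \sigma' \qquad \mathit{Cond}}{\rho \vdash c(p_1, \ldots, p_i, \ldots, p_n), \sigma \Rightarrow c(p_1, \ldots, p_i', \ldots, p_n), \sigma'} \qquad (1)$$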

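Typically, SOS rules have a sequent in the conclusion of the form $\rho \vdash \gamma \Rightarrow \gamma'$, where $\gamma$ and $\gamma'$ are derivations of $G$. If one looks at the conclusion as a transition of the form $(\rho, \gamma) \rightarrow (\rho, \gamma')$ then the construction of the transition system from $\mathcal{M}$ becomes straightforward, with the pairs $(\rho, \gamma)$ as the states in $S$.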

Model checking [10] is an automata-based automated validation technique to solve the “question” $\mathcal{K}, s_0 \models \varphi$, that is, does model $\mathcal{K}$, a transition system, or Kripke structure in Modal Logic [15] jargon, with initial state $s_0$, satisfy property $\varphi$? The standard algorithm checks if the language accepted by the intersection Büchi automaton (a regular $\omega$-automaton, that is, an automaton that accepts infinite words) of $\mathcal{K}$ and $\neg\varphi$ is empty.

2.2 Maude

In this section we introduce the main elements of the Maude language, our choice of programming language for this work.

The Maude system and language [12] is a high-performance implementation of Rewriting Logic [18], a formalism for the specification of concurrent systems that has been shown to be able to represent quite naturally many logical and semantic frameworks [17].

Maude is an algebraic programming language. (Maude allows for programming with different Equational Logics: Many-sorted, Order-sorted or Membership Equational Logic. In this paper, Maude programs are described using Order-sorted Equational Logic.) A program in Maude is organized by modules, and every module has an initial algebra [14] semantics. Module inclusion may occur in one of three different modes: including, extending and protecting. The including mode is the most liberal one and imposes no constraints on the preservation of the algebra of the included module into the including one, that is, both “junk” and “confusion” may be added. (Informally, when “junk” may be added to an algebra but “confusion” may not, as in extending mode, new terms may be included but are not identified with old ones.) Inclusion in extending mode may add “junk” but no “confusion”, while inclusion in protecting mode adds no “junk” and no “confusion” to the included algebra. The inclusion mode is not enforced by the Maude engine, being understood only as an indication of the intended inclusion semantics. Such declarations, however, are part of the semantics of the module hierarchy and may be important for Maude-based tools, such as a theorem prover for Maude specifications, that would have to discharge the proof obligations generated by such declarations.

Computations in Maude are represented by rewrites according to either equations, rules or both in a given module. Functional modules may only declare equations while system modules may declare both equations and rules. Equations are assumed (that is, yield proof-obligations) to be Church-Rosser and terminating [3]. Rules have to be coherent: no rewrite should be missed by alternating between the application of rules and equations. A (concurrent) system is specified by a rewrite system $\mathcal{R} = (\Sigma, E \cup A, R)$ where $\Sigma$ denotes its signature, $E$ the set of equations, $A$ a set of axioms, and $R$ the set of rules. The equational theory $(\Sigma, E \cup A)$ specifies the states of the system, which are terms in the $\Sigma$-algebra modulo the set of equations and axioms, such as associativity, commutativity and identity. Combinations of such axioms give rise to different rewrite theories such that rewriting takes place modulo such axioms. Rules $R$ specify the (possibly) non-terminating behavior, that takes place modulo the equational theory $(\Sigma, E \cup A)$. Another interesting feature of Maude is its support for non-linear patterns (when the same variable appears more than once in a pattern) both in equations and rules. Section 3.3.1 exemplifies how this feature is intensively used in the Maude implementation of the π Automata framework.

An interesting remark regards the decision between modeling behavior as equations or rules. One may specify (terminating) system behavior with equations. The choice between equations and rules provides an observability gauge. In the context of a software architecture, for instance, non-observable (terminating) actions, internal to a given component, may be specified by equations, while observable actions, that relate components in a software architecture, may be specified as rules. Section 3.3.1 illustrates how this “gauge” is used in the Maude implementation of the π framework.

A compiler can be implemented in Maude as a meta-level application. Such a Maude application uses the so-called descent functions [11, Ch.11], defined over modules represented as terms in a universal theory, which is implemented in Maude by a module called META-LEVEL. Some of the descent functions are metaParse, metaReduce, metaRewrite and metaSearch.

  • Function metaParse receives a (meta-represented) module denoting a grammar, a list of quoted identifiers representing the (user) input and a quoted identifier representing the sort (that is, the grammar variable) at which the given input qids should be parsed, and returns a term in the signature of the given module.

  • Descent function metaReduce receives a (meta-represented) module and a (meta-represented) term and returns the (meta-represented) canonical form of the given term by the exhaustive application of the (Church-Rosser and terminating) equations, only, of the given module. An interesting example of metaReduce is the invocation of the model checker at the meta-level: (i) first, module MODEL-CHECKER must be included in a module that also includes the Maude description of the system to be analyzed, and (ii) one may invoke metaReduce of a meta-representation of a term that is a call to function modelCheck, with appropriate parameters, defined in module MODEL-CHECKER.

  • Finally, function metaRewrite simplifies, in a certain number of steps, a given term according to both equations and rules (assumed coherent, that is, no term is missed by the alternate application of equations and rules) of the given module. The descent function metaSearch looks for a term that matches a given pattern, from a given term, according to a choice of rewrite relation from $\Rightarrow^*$, $\Rightarrow^+$, $\Rightarrow^!$, denoting the reflexive-transitive closure of the rewrite relation, the transitive closure of the rewrite relation or the rewrite relation that produces only canonical forms.

3 π Automata

3.1 Interpreting automata

In [23], Plotkin defines the concept of Interpreting Automata, finite-state Transition Systems, as a semantic framework for the operational semantics of programming languages. Interpreting Automata are now recalled from the perspective of Automata Theory.

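Let $L$ be a programming language accepted by a Context Free Grammar (CFG) $G = (V, T, P, S)$ defined in the standard way, where $V$ is the finite set of variables (or non-terminals), $T$ is the set of terminals, $P \subseteq V \times (V \cup T)^*$ is the set of productions and $S \in V$ is the start symbol of $G$. An Interpreting Automaton for $L$ is a tuple $\mathcal{I} = (\Sigma, \Gamma, \rightarrow, \gamma_0, F)$ where $\Sigma = T$, $\Gamma$ is the set of configurations, $\rightarrow \subseteq \Gamma \times \Gamma$ is the transition relation, $\gamma_0 \in \Gamma$ is the initial configuration, and $F \subseteq \Gamma$ the finite set of final configurations. Configurations in $\Gamma$ are triples of the form $\Gamma = \mathit{Value\ Stack} \times \mathit{Memory} \times \mathit{Control\ Stack}$, where $\mathit{Value\ Stack} = (L(G))^*$ with $L(G)$ the language generated by $G$, the set $\mathit{Memory}$ is a finite map $\mathit{Var} \rightarrow \mathit{Storable}$ with $\mathit{Var} \subseteq V$ and $\mathit{Storable} \subseteq T^*$, and the elements of the $\mathit{Control\ Stack} = (L(G) \cup \mathit{KW})^*$, where $\mathit{KW}$ is the set of keywords of $L$. (There are some situations where one may need to push not only computed values but code as well into the value stack. One such situation is when a loop is being evaluated and both the loop’s test and body are pushed in order to “reconstruct” the loop for the next iteration.) A computation in $\mathcal{I}$ is defined as $\rightarrow^*$, the reflexive-transitive closure of the transition relation.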

As an example, let us consider the CFG of a programming language with arithmetic expressions, Boolean expressions and commands.

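The values in the value stack are elements of the set $\mathbb{B} \cup \mathbb{N} \cup \mathit{Var} \cup \mathit{BExp} \cup \mathit{Cmd}$, where $\mathbb{B}$ is the set of Boolean values, $\mathbb{N}$ is the set of natural numbers, with $\mathit{Var}$ the set of variables, $\mathit{BExp}$ the set of Boolean expressions, and $\mathit{Cmd}$ the set of commands of $L$. The control stack is defined as the set $(\mathit{Cmd} \cup \mathit{BExp} \cup \mathit{AExp} \cup \mathit{KW})^*$, where $\mathit{AExp}$ is the set of arithmetic expressions and $\mathit{KW}$ is the set of keywords of $L$.

Informally, the computations of an Interpreting Automaton mimic the behavior of a calculator in Łukasiewicz postfix notation, also known as reverse Polish notation. A typical computation of an Interpreting Automaton interprets a statement on top of the control stack of a configuration by unfolding its subtrees, which are then pushed back onto the control stack, possibly updating the value stack with intermediary results of the interpretation of the statement, and the memory, should the statement affect it.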

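For the transition relation of $\mathcal{I}$, let us consider the rules for arithmetic sum expressions.

$$\langle S, M, n\,C \rangle \rightarrow \langle n\,S, M, C \rangle \qquad (2)$$
$$\langle S, M, (e_1 + e_2)\,C \rangle \rightarrow \langle S, M, e_1\,e_2\,+\,C \rangle \qquad (3)$$
$$\langle n_2\,n_1\,S, M, +\,C \rangle \rightarrow \langle (n_1 + n_2)\,S, M, C \rangle \qquad (4)$$

where $e_1, e_2$ are metavariables for arithmetic expressions, $n, n_1, n_2 \in \mathbb{N}$, and $S$, $M$ and $C$ denote the value stack, the memory and the control stack, respectively. Rule 3 specifies that when the arithmetic expression $e_1 + e_2$ is on top of the control stack $C$, then its operands should be pushed to $C$ and then the operator +. Operands $e_1$ and $e_2$ will be recursively evaluated, as a computation is the reflexive-transitive closure of relation $\rightarrow$, leading to a configuration with an element in $\mathbb{N}$ left on top of the value stack $S$, as specified by Rule 2. Finally, when + is on top of the control stack $C$, and there are two natural numbers on top of $S$, they are popped, added and pushed back to the top of $S$ (Rule 4).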
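The three rules for sum can be sketched as a small executable machine. The following is a minimal Python sketch, not the Maude implementation discussed in this paper; the names `Add`, `PLUS`, `step` and `run` are hypothetical, and configurations are assumed to be triples (value stack, memory, control stack) with the top of each stack at index 0.

```python
from typing import NamedTuple, Union


class Add(NamedTuple):
    """Abstract syntax for e1 + e2 (hypothetical representation)."""
    left: "Exp"
    right: "Exp"


Exp = Union[int, Add]
PLUS = "+"  # keyword pushed on the control stack


def step(config):
    """Apply one transition; return None when no rule applies."""
    values, memory, control = config
    if not control:
        return None
    top, rest = control[0], control[1:]
    if isinstance(top, Add):
        # Rule 3: unfold the sum, pushing operands and then the operator.
        return values, memory, (top.left, top.right, PLUS) + rest
    if isinstance(top, int):
        # Rule 2: a number on the control stack moves to the value stack.
        return (top,) + values, memory, rest
    if top == PLUS and len(values) >= 2:
        # Rule 4: pop two numbers, add them, push the result.
        n2, n1 = values[0], values[1]
        return (n1 + n2,) + values[2:], memory, rest
    return None


def run(exp):
    """Iterate `step` from the initial configuration until no rule applies."""
    config = ((), {}, (exp,))
    trace = [config]
    while (nxt := step(config)) is not None:
        config = nxt
        trace.append(config)
    return config, trace
```

Calling `run(Add(1, Add(2, 3)))` ends in a configuration whose control stack is empty and whose value stack holds the single number 6. Note that no transition appears in the condition of any rule: the control stack alone records what remains to be evaluated.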

Finally, there is one quite interesting characteristic of Interpreting Automata: transitions do not appear in the conditions of the rules, a characteristic that can be quite desirable from a proof theoretic standpoint, in particular in the context of term rewriting systems (see Section 3.3), as pointed out by Viry [26] and later by Roşu in [24], for instance. As opposed to transition rules that admit transitions in their premises, as in the Structural Operational Semantics (SOS) framework, also defined in [23], Interpreting Automata evaluation uses the control stack to push the evaluation context, so to speak, to the configuration. Unconditional rewriting also has very desirable computational consequences, in particular in Maude, regarding executability and performance. Model checking, for instance, does not consider transitions (rewrites) in the conditions of rules. Also, narrowing does not work, for the time being, on conditional rules. Regarding performance, the combination of a proper use of equations instead of rules together with unconditional rules provides an effective search mechanism. The use of equations reduces the state space, and unconditional rules do not create “scratch pad” rewrites, performing only forward rewriting.

As an example, let us recall Rule 1, the general form of the rule for the recursive step of the evaluation of a programming language construct in the SOS framework, where $\rho$ and $\rho'$ are environments, $\rho'$ is the result of some computation involving $\rho$, $c$ is a programming language construct and $p_i$ its parameters, $\sigma, \sigma'$ are memory stores, with $\sigma'$ the result of some computation of $p_i$, and $\mathit{Cond}$ is a predicate not involving transitions.

The Interpreting Automata rule for Rule 1 is as follows,

(5)

where $C$ is the control stack. Note that recursion will take care of evaluating a $p_i$ when it is on top of the control stack, so there is no need to explicitly require transitions of the form $\rho' \vdash p_i, \sigma \Rightarrow p_i', \sigma'$ as premises or conditions to Rule 5. Copies of the environment (such as $\rho'$ in Rule 1) and side-effects are naturally calculated during the computation process by the application of the appropriate rule for the term on top of the control stack.

3.2 π Automata

π Automata are Interpreting Automata whose configurations are sets of semantic components that include, at least, a value stack, a memory and a control stack. Plotkin’s stacks and memory in Interpreting Automata (or environment and stores of Structural Operational Semantics) are generalized to the concept of semantic component, as proposed by Peter Mosses in the Modular SOS approach to the formal semantics of programming languages.

Formally, a π-automaton is an Interpreting Automaton where, given an abstract finite pre-order Sem, for semantic components, its configurations are defined by $\Gamma = \biguplus_{i=1}^{n} \mathit{Sem}_i$, with $n \in \mathbb{N}$, $\biguplus$ denoting the disjoint union operation of semantic components, with $\mathit{Value\ Stack}$, $\mathit{Memory}$ and $\mathit{Control\ Stack}$ subsets of $\mathit{Sem}$.

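The semantic rules for arithmetic sum in π Automata look very similar to the ones from Interpreting Automata.

$$\langle \mathit{val} : S,\ \mathit{cnt} : n\,C,\ \ldots \rangle \rightarrow \langle \mathit{val} : n\,S,\ \mathit{cnt} : C,\ \ldots \rangle \qquad (6)$$
$$\langle \mathit{cnt} : (e_1 + e_2)\,C,\ \ldots \rangle \rightarrow \langle \mathit{cnt} : e_1\,e_2\,+\,C,\ \ldots \rangle \qquad (7)$$
$$\langle \mathit{val} : n_2\,n_1\,S,\ \mathit{cnt} : +\,C,\ \ldots \rangle \rightarrow \langle \mathit{val} : (n_1 + n_2)\,S,\ \mathit{cnt} : C,\ \ldots \rangle \qquad (8)$$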

The ellipsis “…” (a notation similar to the one defined by Chalub and Mosses in the Modular SOS Description Formalism, which is implemented in the Maude MSOS Tool [8]) is adopted as notation for “don’t care” semantic components, that is, those components that are not relevant for the specification of the semantics of a particular language construct.

The point is that extending an Interpreting Automata specification with new semantic components, say by disjointly uniting an output component (representing standard output in the C language, for instance), understood as a sequence of values, to the already existing disjoint set of environments and stores, would require a reformulation of the existing specification. For instance, the specification for arithmetic sum in Interpreting Automata would require such a reformulation, while the one in π Automata would not. The rules in the latter have the “don’t care” variable, which matches any component, or no component at all, that may appear together with the value stack, the memory and the control stack. Semantic component composition is monotonic, as the addition of new semantic components does not affect the transition relation, that is, $\gamma \rightarrow \gamma' \implies f(\gamma) \rightarrow f(\gamma')$, where $\gamma, \gamma' \in \Gamma$ and $f$ is a function that adds a new semantic component to a configuration.
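The role of the “don’t care” components can be illustrated with configurations represented as dictionaries of semantic components. This is a minimal Python sketch under assumed names (`step`, `run`, `with_output`, and the component keys `cnt`, `val`, `sto`, `out`), not the Maude code of the framework: each rule inspects only the components it needs and copies everything else unchanged.

```python
ADD = "ADD"  # keyword for sum on the control stack


def step(config):
    """One transition on a configuration given as a dict of semantic components.

    Only the 'cnt' and 'val' components are inspected; every other
    component plays the role of the "don't care" part and is copied as is.
    """
    control, values = config["cnt"], config["val"]
    if not control:
        return None
    top, rest = control[0], control[1:]
    if isinstance(top, tuple) and top[0] == "add":
        # Unfold add(e1, e2): push operands, then the ADD keyword.
        _, e1, e2 = top
        return {**config, "cnt": (e1, e2, ADD) + rest}
    if isinstance(top, int):
        # A number moves from the control stack to the value stack.
        return {**config, "cnt": rest, "val": (top,) + values}
    if top == ADD and len(values) >= 2:
        # Add the two top-most values.
        total = values[1] + values[0]
        return {**config, "cnt": rest, "val": (total,) + values[2:]}
    return None


def run(config):
    """Iterate `step` until no rule applies."""
    while (nxt := step(config)) is not None:
        config = nxt
    return config


def with_output(config):
    """Hypothetical extension f: add an output component named 'out'."""
    return {**config, "out": ()}
```

With `base = {"cnt": (("add", 1, ("add", 2, 3)),), "val": (), "sto": {}}`, both `step(with_output(base)) == with_output(step(base))` and `run(with_output(base)) == with_output(run(base))` hold, which is the monotonicity property stated above for this particular extension. The rules for sum did not have to be rewritten when the new component was added.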

3.3 π Automata and Term Rewriting

A π-automaton can be seen as an unlabeled Transition System and therefore as a Term Rewriting System [3] when the latter is understood as $(A, \rightarrow)$ where $A$ is a set and $\rightarrow$ a reduction relation on $A$. Clearly, the set of configurations $\Gamma$ is $A$ and the transition relation of the Interpreting Automata is the reduction relation of the Term Rewriting System.

There is an interesting point on the relation between the semantics of a programming language construct, specified by a π-automaton, and the properties that one may require from the reduction relation of the associated Term Rewriting System (TRS). Let us first recall two basic properties of a reduction relation from [3, Def.2.1.3],

  • Church-Rosser: $x \overset{*}{\leftrightarrow} y \implies x \downarrow y$,

  • termination: there is no infinite reduction $a_0 \rightarrow a_1 \rightarrow \cdots$.

where $x, y \in A$, $\overset{*}{\leftrightarrow}$ denotes the reflexive-transitive-symmetric closure of $\rightarrow$, and $x \downarrow y$ denotes that $x$ and $y$ are joinable, that is, $\exists z.\ x \overset{*}{\rightarrow} z \overset{*}{\leftarrow} y$. In rewriting modulo equational theories [3, Ch. 11], otherwise non-terminating systems become terminating when an algebraic property, such as commutativity, is incorporated into the rewriting process. Given a TRS whose reduction relation is induced by a set of identities $E$, let $B \subseteq E$ be a set with the identities induced by a given property, such as commutativity, and $R = E \setminus B$ the remaining identities. Rewriting then occurs on equivalence classes of terms, giving rise to a new relation, $\rightarrow_{R/B}$, defined as follows:

$$[s]_{\approx_B} \rightarrow_{R/B} [t]_{\approx_B} \iff \exists s', t'.\ s \approx_B s' \rightarrow_R t' \approx_B t$$

Moving back to π Automata, the semantics of a programming language construct $c(p_1, \ldots, p_n)$, where $c$ is the construct and $p_i$ its parameters, is functional when, given any configuration $\gamma$ with $c(p_1, \ldots, p_n)$ on top of its control stack, there exists a single $\gamma'$ such that $\gamma \rightarrow^* \gamma'$ and the computation is finite. The semantics of a programming language construct is relational when, given any such configuration $\gamma$, the computations starting in $\gamma$ may lead to different $\gamma'$ and may not terminate.

Therefore, if the semantics of a programming language construct is functional, one must require the associated reduction relation to be Church-Rosser and terminating. No constraints are imposed on the reduction relation when the semantics is relational.

As an illustration, according to this definition, the semantics of addition is functional but the semantics of an unbounded loop (such as a while command) is relational, as its execution may not terminate.

In order to support the specification of monotonic rules in a modular way, one last thing is required from the TRS associated with a π-automaton: rewriting modulo associativity, idempotence and commutativity. In other words, set-rewriting takes place, not simply term rewriting, while representing π Automata as TRS, as each rule rewrites a set of semantic components.

3.3.1 π Automata in Maude

Maude's parameterized programming capabilities are used to implement π Automata. The main datatype of π Automata is Generalized SMC (GSMC in Listing 1), a disjoint union set of semantic components. The trivial view SemComp maps sort Elt to sort SemComp. Module GSMC then imports module SET parameterized by view SemComp, of semantic components, the sort declared in functional module SEMANTIC-COMPONENTS. A configuration of a π-automaton is declared with constructor <_> : Set{SemComp} -> Conf that gives rise to terms such as < $c_1$, $c_2$ > where each $c_i$ is a semantic component.

fmod SEMANTIC-COMPONENTS is sorts SemComp . endfm
view SemComp from TRIV to SEMANTIC-COMPONENTS is sort Elt to SemComp . endv
fmod GSMC is
    ex VALUE-STACK . ex MEMORY . ex CONTROL-STACK . ex ENV .
    ex SET{SemComp} * (op empty to noSemComp) .
    sorts Attrib Conf EnvAttrib StoreAttrib ControlAttrib ValueAttrib .
    subsort EnvAttrib StoreAttrib ControlAttrib ValueAttrib < Attrib .
    --- A configuration wraps a set of semantic components.
    op <_> : Set{SemComp} -> Conf [format(c! c! c! o)] .
    --- Semantic components: each attribute indexes one component.
    op env : -> EnvAttrib .
    op sto : -> StoreAttrib .
    op cnt : -> ControlAttrib .
    op val : -> ValueAttrib .
    --- Ordered pairs (index : component).
    op _:_ : EnvAttrib Env -> SemComp [ctor format(c! b! o o)] .
    op _:_ : StoreAttrib Store -> SemComp [ctor format(r! b! o o)] .
    op _:_ : ControlAttrib ControlStack -> SemComp [ctor format(c! b! o o)] .
    op _:_ : ValueAttrib ValueStack -> SemComp [ctor format(c! b! o o)] .
endfm
Listing 1: Generalized SMC in Maude

Recall that the elements of the disjoint union $\biguplus_{i \in I} A_i$ are ordered pairs $(x, i)$ such that $i$ serves as an index indicating which semantic component $A_i$ the element $x$ came from. This is exemplified in Maude with the memory store component. The constructor operator sto functions as the index for the memory store component and the constructor operator _:_ represents ordered pairs sto : $M$, where $M$ is the memory store.
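The indexing of a disjoint union can be made concrete with a few lines of Python. This is only an illustration of the set-theoretic construction; the function name `disjoint_union` and the sample component contents are assumptions of this sketch.

```python
def disjoint_union(indexed_sets):
    """Return the set of pairs (x, i) for every x in the set indexed by i."""
    return {(x, i) for i, xs in indexed_sets.items() for x in xs}


# Two hypothetical components that share the element 0.
components = {"val": {0, 1}, "cnt": {0, "ADD"}}
tagged = disjoint_union(components)
```

Here `tagged` has four elements, whereas the plain union of the two sets has three: the index keeps the 0 of the value stack apart from the 0 of the control stack, just as sto, cnt, val and env keep the components of a configuration apart in Listing 1.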

Now, for the transition rules, they are represented either by equations or rules, depending on the semantic character of the programming language construct being formalized. Arithmetic expressions have a functional character and are therefore implemented as equations in Maude. For sum, in equation add-exp1, operands E1:Exp and E2:Exp are first unfolded and then pushed back to the control stack C, together with ADD, an element of set $\mathit{KW}$. (Recall that $\mathit{Control\ Stack} = (L(G) \cup \mathit{KW})^*$.) Keyword variant is an attribute for equations and means that the given equation should be used in the variant unification process. Due to space constraints, this feature is not discussed in this paper. The keyword is left in the code snippet to present the actual executable code for the tool. Equation add-exp2 implements the case where both E1:Exp and E2:Exp have been evaluated and their associated values (Rationals, in this implementation) were pushed to the value stack. When ADD is on top of the control stack, the two top-most values in the value stack are added. (Note that the + symbol in add-exp2 denotes sum in the Rationals, whereas add in add-exp1 is the symbol for sum in language $L$.)

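--- Unfold the sum: push both operands and then the ADD keyword.
eq [add-exp1] : < cnt : (add(E1:Exp, E2:Exp) C:ControlStack), ... > =
                < cnt : (E1:Exp E2:Exp ADD C:ControlStack), ... > [variant] .
--- With ADD on top of the control stack, add the two top-most values.
eq [add-exp2] : < cnt : (ADD C:ControlStack),
                  val : (val(R1:Rat) val(R2:Rat) SK:ValueStack), ... > =
                < cnt : C:ControlStack,
                  val : (val(R1:Rat + R2:Rat) SK:ValueStack), ... > [variant] .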

3.4 Model checking π Automata

Model checking (e.g. [10]) is perhaps the most popular formal method for the validation of concurrent systems. The fact that it is an automata-based automated validation technique makes it a nice candidate to join a simple framework for teaching language construction that also aims at validation, such as the one proposed in this paper.

This section recalls the syntax and semantics for (a subset of) Linear Temporal Logic, one of the Modal Logics used in model checking, and discusses how to use this technique to validate π Automata, covering only what is necessary to follow Section 4.3.

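The syntax of Linear Temporal Logic is given by the following grammar

$$\varphi ::= \top \mid \bot \mid p \mid \neg\varphi \mid \varphi \wedge \varphi \mid \varphi \rightarrow \varphi \mid \Diamond\varphi \mid \Box\varphi$$

where $p$ ranges over atomic propositions and connectives $\Diamond$, $\Box$ are called temporal modalities. They denote “Future state” and “Globally (all future states)”. There is a precedence among them given by: first unary modalities, in the following order $\neg$, $\Diamond$ and $\Box$, then binary modalities, in the following order, $\wedge$ and $\rightarrow$.

The standard models for Modal Logics (e.g. [15]) are Kripke structures, triples $\mathcal{K} = (W, R, L)$ where $W$ is a set of worlds, $R \subseteq W \times W$ is the world accessibility relation and $L$ is the labeling function that associates to a world a set of atomic propositions that hold in the given world. Depending on the modalities (or operators in the logic) and the properties of $R$, different Modal Logics arise such as Linear Temporal Logic. A path in a Kripke structure represents a possible (infinite) scenario (or computation) of a system in terms of its states. The path $\tau = s_1 \rightarrow s_2 \rightarrow \cdots$ is an example. A suffix of $\tau$, denoted $\tau^i$, is the sequence of states starting in the $i$-th state. Let $\mathcal{K}$ be a Kripke structure and $\tau$ a path in $\mathcal{K}$. Satisfaction of an LTL formula $\varphi$ in a path $\tau$, denoted $\tau \models \varphi$, is defined as follows,

$$\begin{array}{lcl}
\tau \models \top & & \\
\tau \not\models \bot & & \\
\tau \models p & \iff & p \in L(s_1) \\
\tau \models \neg\varphi & \iff & \tau \not\models \varphi \\
\tau \models \varphi_1 \wedge \varphi_2 & \iff & \tau \models \varphi_1 \text{ and } \tau \models \varphi_2 \\
\tau \models \varphi_1 \rightarrow \varphi_2 & \iff & \tau \models \varphi_2 \text{ whenever } \tau \models \varphi_1 \\
\tau \models \Diamond\varphi & \iff & \exists i \geq 1.\ \tau^i \models \varphi \\
\tau \models \Box\varphi & \iff & \forall i \geq 1.\ \tau^i \models \varphi
\end{array}$$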

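A π-automaton, when understood as a Transition System, is also a frame, that is, $\mathcal{F} = (W, R)$, where $W = \Gamma$ is the set of worlds and $R = \rightarrow$ the accessibility relation. A Kripke structure is defined from a frame representing a π-automaton by declaring the labeling function with the following state proposition scheme:

$$\forall \gamma = \langle \mathit{sto} : M, \ldots \rangle \in \Gamma,\ \forall v \in \mathrm{dom}(M):\quad p_v(x) \in L(\gamma) \iff M(v) = x \qquad (9)$$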

meaning that for every variable $v$ in the index of the memory store component (which is a necessary semantic component) there exists a unary proposition $p_v$ that holds in every state where $v$ is bound to $p_v$’s parameter in the memory store. A poetic license is taken here and π Automata, from now on, refers to the pair composed by a π-automaton and its state propositions. As an illustrative specification, used in Section 4.3, the LTL formula $\Box\neg(p_1(\mathit{crit}) \wedge p_2(\mathit{crit}))$ specifies safety (“nothing bad happens”), in this case both $p_1$ and $p_2$ in the critical section, when $p_1, p_2$ are state proposition formulae denoting the states of two processes and $\mathit{crit}$ is a constant denoting that a given process is in the critical section, and formula $\Box(p_1(\mathit{try}) \rightarrow \Diamond p_1(\mathit{crit}))$ specifies liveness (“something good eventually happens”), by stating that if a process, $p_1$ in this case, tries to enter the critical section it will eventually do so.
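A safety property of this shape can be checked on a finite transition system by exploring its reachable states. The sketch below does that in Python for a toy two-process system; it is an explicit-state invariant check, a much simpler technique than the Büchi-automaton-based LTL algorithm, and it is not the Maude model checker. The transition function `successors`, the state encoding and all names are assumptions made for illustration, and the equivalence with the LTL safety formula relies on every reachable state of this toy system having a successor.

```python
from collections import deque


def successors(state):
    """Transitions of a toy two-process system (an assumption of this sketch).

    A state is a pair of process states, each 'idle', 'try' or 'crit'.
    """
    result = []
    for i in (0, 1):
        me, other = state[i], state[1 - i]
        if me == "idle":
            nxt = "try"
        elif me == "try":
            if other == "crit":
                continue  # wait while the other process is critical
            nxt = "crit"
        else:
            nxt = "idle"
        new = list(state)
        new[i] = nxt
        result.append(tuple(new))
    return result


def check_invariant(initial, successors, holds):
    """Return (True, None) if `holds` is true in every reachable state,
    otherwise (False, path) with a shortest path to a violating state."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not holds(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return False, path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, None


def mutual_exclusion(state):
    """Never both processes in the critical section."""
    return state != ("crit", "crit")
```

Calling `check_invariant(("idle", "idle"), successors, mutual_exclusion)` reports that the invariant holds. Replacing the predicate by one that forbids the first process from ever being critical yields a counterexample path, which is the kind of diagnostic a model checker returns when a property fails.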

4 π Lib: Basic Programming Language Constructs

π Lib is a subset of Constructive MSOS [21], as implemented in [6, Ch. 6]. In Section 4.1, π Lib constructions are presented, their π Automata semantics is discussed in Section 4.2 and a simple compiler for the Imp language in Maude, using π Lib, is described in Section 4.3.

4.1 π Lib signature

The signature of π Lib is organized in five parts, and implemented in four different modules in Maude: (i) Expressions, that include basic values (such as Rational numbers and Boolean values), identifiers, arithmetic and Boolean operations, (ii) Commands, statements that produce side effects to the memory store, (iii) Declarations, which are statements that construct the constant environment, (iv) output and (v) abnormal termination.

Due to space constraints, only the π Lib signature for arithmetic expressions is discussed. The remaining declarations follow a similar pattern. First, it includes modules QID, RAT, and GSMC, for quoted identifiers, rational numbers and Generalized SMC machines, respectively. Modules QID and RAT are part of the Maude standard prelude while GSMC was defined in Listing 1. Next, module EXP declares sorts Exp, BExp and AExp, for (general) expressions, Boolean expressions and arithmetic expressions. Sort Id, for identifiers, is a subsort of both Boolean expressions and arithmetic expressions, which are in turn subsorts of expressions. The latter are included in Control. Operator idn constructs identifiers from Maude built-in quoted identifiers. Arithmetic and Boolean operations alike are declared as Maude operators, and so are the elements of set $\mathit{KW}$.

fmod EXP is
    --- Imports described in the text; the inclusion mode is an assumption.
    ex QID . ex RAT . ex GSMC .
    sorts Exp BExp AExp . subsort Id < BExp AExp < Exp < Control .
    --- Identifiers
    op idn : Qid -> Id [ctor format(!g o)] .
    op rat : Rat -> AExp [ctor format(!g o)] .
    --- Arithmetic
    op add : AExp AExp -> AExp [format(! o)] .
    op sub : AExp AExp -> AExp [format(! o)] .
    op mul : AExp AExp -> AExp [format(! o)] .
    op div : AExp AExp -> AExp [format(! o)] .
    --- Keywords pushed on the control stack
    ops ADD SUB MUL DIV : -> Control [ctor] .
endfm

4.2 π Automata transitions for π Lib dynamic semantics in Maude

Again, due to space constraints, the transition relation is not discussed for the complete π Lib signature. Transitions for loop evaluation have been chosen to illustrate π Automata transitions for π Lib.

The semantics of the loop construction in module CMD is implemented in terms of equations and a rule in Maude. The first equation (i) pushes the loop test, followed by the keyword LOOP, onto the control stack and (ii) pushes the whole loop onto the value stack. These steps are of functional character, that is, they are Church-Rosser and terminating, therefore satisfying the requirements to be implemented by an equation in Maude. The execution of the body of the loop, however, may not terminate as there could be a nested loop, for instance, that does not terminate its execution. For that reason it is implemented as a rule in Maude. When the test evaluates to false, a second equation discards the loop and resumes with the rest of the control stack.

--- Push the loop test and LOOP on the control stack, and the loop itself on the value stack.
eq < cnt : loop(E:Exp, K:Cmd) C:ControlStack, val : V:ValueStack, ... > =
   < cnt : E:Exp LOOP C:ControlStack, val : val(loop(E:Exp, K:Cmd)) V:ValueStack, ... >
   [variant] .
--- Test is true: run the body, then the loop again (may not terminate, hence a rule).
rl [loop] :
   < cnt : LOOP C:ControlStack, val : val(true) val(loop(E:Exp, K:Cmd)) V:ValueStack, ... > =>
   < cnt : K:Cmd loop(E:Exp, K:Cmd) C:ControlStack, val : V:ValueStack, ... > [narrowing] .
--- Test is false: discard the loop.
eq [loop] :
   < cnt : LOOP C:ControlStack, val : val(false) val(loop(E:Exp, K:Cmd)) V:ValueStack, ... > =
   < cnt : C:ControlStack, val : V:ValueStack, ... > [variant] .
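The loop transitions can be exercised end to end in a small Python sketch. It is not the Maude implementation: the tuple encoding of programs (`"loop"`, `"assign"`, `"lt"`, `"add"`, `"id"`), the keywords other than LOOP, the function names and the step budget in `run` are all assumptions introduced here so that the example is self-contained. The budget stands in for the observation that loop execution may not terminate.

```python
ADD, LT, ASSIGN, LOOP = "ADD", "LT", "ASSIGN", "LOOP"


def step(cfg):
    """One transition over the components 'cnt', 'val' and 'sto'."""
    cnt, val, sto = cfg["cnt"], cfg["val"], cfg["sto"]
    if not cnt:
        return None
    top, rest = cnt[0], cnt[1:]
    if isinstance(top, (bool, int)):                    # literal value
        return {**cfg, "cnt": rest, "val": (top,) + val}
    if isinstance(top, tuple):
        tag = top[0]
        if tag == "id":                                 # variable lookup
            return {**cfg, "cnt": rest, "val": (sto[top[1]],) + val}
        if tag in ("add", "lt"):                        # unfold binary operation
            keyword = ADD if tag == "add" else LT
            return {**cfg, "cnt": (top[1], top[2], keyword) + rest}
        if tag == "assign":                             # evaluate rhs, remember the name
            return {**cfg, "cnt": (top[2], ASSIGN) + rest, "val": (top[1],) + val}
        if tag == "loop":                               # push test and LOOP; save the loop
            return {**cfg, "cnt": (top[1], LOOP) + rest, "val": (top,) + val}
    if top == ADD:
        return {**cfg, "cnt": rest, "val": (val[1] + val[0],) + val[2:]}
    if top == LT:
        return {**cfg, "cnt": rest, "val": (val[1] < val[0],) + val[2:]}
    if top == ASSIGN:
        value, name = val[0], val[1]
        return {**cfg, "cnt": rest, "val": val[2:], "sto": {**sto, name: value}}
    if top == LOOP:
        test, loop = val[0], val[1]
        if test:                                        # run the body, then the loop again
            return {**cfg, "cnt": (loop[2], loop) + rest, "val": val[2:]}
        return {**cfg, "cnt": rest, "val": val[2:]}     # discard the loop
    return None


def run(cfg, max_steps=10_000):
    """Iterate `step`; give up after `max_steps` transitions."""
    for _ in range(max_steps):
        nxt = step(cfg)
        if nxt is None:
            return cfg
        cfg = nxt
    raise RuntimeError("step budget exhausted; the loop may not terminate")


# while x < 3 do x := x + 1
program = ("loop", ("lt", ("id", "x"), 3),
           ("assign", "x", ("add", ("id", "x"), 1)))
final = run({"cnt": (program,), "val": (), "sto": {"x": 0}})
```

After the run, `final["sto"]` maps `x` to 3 and both stacks are empty. A loop whose test is the literal `True` exhausts the step budget instead, which mirrors why the true branch is a rule rather than an equation in the Maude code.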

4.3 A compiler for Imp in Maude

In this Section, the use of π Lib is illustrated by a compiler for a simple (and yet Turing-complete) imperative language called Imp. The current implementation of π Lib in Maude supports execution by rewriting, symbolic execution by narrowing and LTL model-checking. These are the tools that are “lifted” from Maude to Imp.

A compiler for a language L, such as Imp, defined as a denotation of π Lib constructions, has the following main components: (i) A read-eval-loop function (or command-line interface) that invokes different meta-functions depending on the given command. For example, a load command invokes the parser, exec invokes metaRewrite and mc invokes metaReduce with the model checker. (ii) A parser for L, which is essentially a meta-function that given a list of qids returns a meta-term according to a given grammar, specified as a functional module; (iii) A transformer from L to π Lib, a meta-function that given a term in the data-type of the source language, produces a term in the data-type of the target language; (iv) A pretty-printer from π Lib to L, a meta-function that given a term in the data-type of the target language produces a list of qids. Each component is discussed next, except pretty-printing, due to space constraints.

Imp’s command-line interface

In Listing 2 we describe an excerpt of the implementation for Imp’s command-line interface, detailing only the module inclusions, sort declarations for the command-line state and rules for loading an Imp program. A full description is not possible due to space constraints. However, the pattern explained in this excerpt is the same for every command: a qidlist denotes the input, a different meta-function is called on them, depending on the input, that updates or not the state of Imp’s command-line interface and or the output of the system as whole, with a message to the end user. Operation op <_;_;_> : MetaIMPModule Dec? QidList -> IMPState represents the state of the command-line interface, which is a triple comprised by (i) the meta-representation of an Imp module, (ii) the Lib representation of the Imp module in the first projection, and (iii) a qid list denoting the message from the last processed command. Rule labeled in is responsible for processing the input of an Imp module, therefore, in this case, the first projection of the System term contains a list of qids representing an Imp module. Should the parsing process be successful, variable T:ResultPair? will be bound to a pair whose first projection is the term resulting from parsing and in the second projection its sort, or a unary function, of sort ResultPair?, denoting that the parsing process was not properly carried on, with a parameter representing the qid where the parsing process failed. With a successful parsing, the following term becomes now the state of the system

< getTerm(T:ResultPair?) ; compileMod(getTerm(T:ResultPair?)) ; 'IMP: '\b 'Module Q:Qid 'loaded. '\o >

where getTerm(T:ResultPair?), in the first projection, denotes the meta-term, according to Imp’s grammar, representing the input module; compileMod(getTerm(T:ResultPair?)), in the second projection, denotes the input module in π Lib; and the third component of the IMPState triple is a qid list representing a message that lets the user know that the module was properly loaded.

mod IMP-INTERFACE is

  op <_;_;_> : MetaIMPModule Dec? QidList -> IMPState .
  vars QIL QIL' QIL'' QIL1 QIL2 : QidList .
  --- Loading a module: parse the input qids against IMP-GRAMMAR and,
  --- on success, keep the meta-term and its π Lib compilation in the state.
  crl [in] : ['module Q:Qid QIL, < M:MetaIMPModule ; D:Dec? ; QIL' >, QIL''] =>
             if (T:ResultPair? :: ResultPair) then
                    [nil, < getTerm(T:ResultPair?) ;
                          compileMod(getTerm(T:ResultPair?)) ;
                          'IMP: '\b 'Module Q:Qid 'loaded. '\o >, QIL'']
             else [nil, < noModule ; noDec ; nil >,
                      printParseError('module Q:Qid QIL, T:ResultPair?)]
             fi
   if T:ResultPair? :=
      metaParse(upModule('IMP-GRAMMAR, false), 'module Q:Qid QIL, 'ModuleDecl) .

endm
Listing 2: Imp’s command-line interface in Maude
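
The success test used in the condition of rule in can be tried in isolation. The following is a minimal sketch, assuming Maude's predefined NAT module as a stand-in for IMP-GRAMMAR; the module PARSE-CHECK-SKETCH and the operation parsed? are hypothetical names that are not part of the Imp implementation.

```maude
--- Hypothetical sketch: NAT stands in for IMP-GRAMMAR.
fmod PARSE-CHECK-SKETCH is
  protecting META-LEVEL .
  op parsed? : QidList -> Bool .
  var QIL : QidList .
  --- metaParse yields a term of sort ResultPair on success and a term
  --- that only has sort ResultPair? (such as noParse) otherwise.
  eq parsed?(QIL)
    = (metaParse(upModule('NAT, false), QIL, 'Nat) :: ResultPair) .
endfm

reduce in PARSE-CHECK-SKETCH : parsed?('1 '+ '2) .   --- well-formed input
reduce in PARSE-CHECK-SKETCH : parsed?('1 '+) .      --- truncated input
```

The first reduction should yield true and the second false, which mirrors the two branches of the if-then-else in Listing 2.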
Imp parser

To write a parser in Maude, one first has to define the grammar of the language as a Maude functional module. The IMP-GRAMMAR module is the first argument in the call to metaParse in rule in of module IMP-INTERFACE in Listing 2, which implements Imp’s read-eval loop. Essentially, variables (or non-terminals) are represented by sorts, terminals by constants, and grammar rules by operations. Module IMP-GRAMMAR encodes an excerpt of the Imp grammar, only enough to discuss the main elements of the grammar representation; we exemplify it with sum expressions. Sort ExpressionDecl, for instance, specifies both arithmetic and Boolean expressions. A grammar rule relating two grammar variables is represented by a subsort declaration. Therefore, PredicateDecl, the sort for Boolean expressions, is a subsort of ExpressionDecl. Attributes prec and gather are declared for disambiguation.

fmod IMP-GRAMMAR is
  inc PREDICATE-DECL . inc COMMAND-DECL .
  sorts VariablesDecl ConstantsDecl OperationsDecl ProcDeclList
        ProcDecl FormalsDecl BlockCommandDecl ExpressionDecl
        InitDecl InitDeclList InitDecls ClausesDecl
        ModuleDecl Expression .
  subsort InitDecl < InitDeclList .
  subsort VariablesDecl ConstantsDecl ProcDeclList InitDecls < ClausesDecl .
  subsort BlockCommandDecl < CommandDecl .
  subsort ProcDecl < ProcDeclList .
  subsort PredicateDecl < ExpressionDecl .
  --- Arithmetic expressions
  op _+_ : Token Token -> ExpressionDecl [gather(e E) prec 15] .
  op _+_ : Token ExpressionDecl -> ExpressionDecl [gather(e E) prec 15] .
  op _+_ : ExpressionDecl Token -> ExpressionDecl [gather(e E) prec 15] .
  op _+_ : ExpressionDecl ExpressionDecl -> ExpressionDecl [gather(e E) prec 15] .

endfm
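
The encoding of a grammar as a functional module can be exercised on a much smaller example. The sketch below is not the Imp grammar: TINY-GRAMMAR and the constants a, b and c are hypothetical, and a subsort declaration replaces the four overloaded declarations of _+_ for brevity.

```maude
--- Hypothetical grammar: sorts are non-terminals, constants are terminals,
--- operations are grammar rules, and a subsort relates two non-terminals.
fmod TINY-GRAMMAR is
  sorts Token ExpressionDecl .
  subsort Token < ExpressionDecl .
  ops a b c : -> Token .
  op _+_ : ExpressionDecl ExpressionDecl -> ExpressionDecl
           [gather (e E) prec 15] .
endfm

--- Parse a list of qids against the grammar at the meta-level.
reduce in META-LEVEL :
  metaParse(upModule('TINY-GRAMMAR, false), 'a '+ 'b '+ 'c, 'ExpressionDecl) .
```

With gather (e E), the first argument of _+_ must have precedence strictly below 15, so the sum groups to the right and the resulting meta-term is expected to nest b + c under the outer _+_.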
Imp to Lib transformer

Compilation from Imp to π Lib is quite trivial as there exists a one-to-one correspondence between Imp constructions and π Lib. (This is not the case for variable and constant declarations, which require initializations to be mapped to a π Lib ref declaration: a simple exercise, but a useful one to motivate non-bijective mappings between the source language and π Lib. A similar situation arises when compiling Imp code to Python, as variable declarations are local by default, requiring the global modifier otherwise. This may also not be the case for other programming languages, such as in the denotation of object-oriented constructions [6].) Essentially, an Imp module gives rise to a π Lib dec. Imp var and const are declarations and so is a proc declaration, which gives rise to a prc declaration in π Lib. The compilation from Imp to π Lib exp relates Imp tokens to π Lib Id, and Imp arithmetic and Boolean expressions to π Lib Exp. In particular, the compilation of an Imp token has to check whether the token is a value of a primitive type, either Rat (for rational numbers) or Bool (for Boolean values), or an identifier. Since Rat and Bool values are tokenized, we need Maude’s meta-level descent function downTerm to help us parse them into proper constants.

  op compileId : Term -> Id .
  --- A plain qid is an identifier.
  eq compileId(I:Qid) = idn(downTerm(I:Qid, 'Qid)) .
  --- A token is tried as a Rat, then as a Bool, and is otherwise an identifier.
  eq compileId('token[I:Qid]) =
     if (metaParse(upModule('RAT, false),
          downTerm(I:Qid, 'Qid), 'Rat)  :: ResultPair)
     then rat(downTerm(getTerm(metaParse(upModule('RAT, false),
         downTerm(I:Qid, 'Qid), 'Rat)), 1/2))
     else
       if (metaParse(upModule('BOOL, false),
            downTerm(I:Qid, 'Qid), 'Bool) :: ResultPair)
       then boo(downTerm(getTerm(metaParse(upModule('BOOL, false),
         downTerm(I:Qid, 'Qid), 'Bool)), true))
       else idn(downTerm(I:Qid, 'Qid))
       fi
     fi .
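
The Rat branch of compileId combines metaParse and downTerm. The sketch below isolates that combination, assuming only Maude's predefined META-LEVEL and RAT modules; DOWN-TERM-SKETCH and toRat are hypothetical names introduced for illustration.

```maude
--- Hypothetical sketch of the parse-then-descend step in compileId.
fmod DOWN-TERM-SKETCH is
  protecting META-LEVEL .
  protecting RAT .
  op toRat : QidList ~> Rat .
  var QIL : QidList .
  --- Parse the qids in RAT, then descend from the meta-term to a Rat value.
  --- 1/2 is only the default that downTerm returns when the descent fails.
  eq toRat(QIL)
    = downTerm(getTerm(metaParse(upModule('RAT, false), QIL, 'Rat)), 1/2) .
endfm

reduce in DOWN-TERM-SKETCH : toRat('3 '/ '4) .
```

The reduction is expected to produce the rational 3/4 at the object level. The second argument of downTerm fixes the kind of the result, which is why compileId passes 1/2 for rationals and true for Booleans.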

To conclude this section, the compilation of Imp arithmetic expressions simply maps them to their prefix-syntax counterparts in π Lib; e.g., an Imp expression a + b is compiled to add(compileExp(a), compileExp(b)).

  ceq compileExp(I:Qid) = compileId(I:Qid) if not(I:Qid :: Constant) .
  eq compileExp('token[I:Qid]) = compileId('token[I:Qid]) .
  eq compileExp('_+_[T1:Term, T2:Term]) = add(compileExp(T1:Term), compileExp(T2:Term)) .
  eq compileExp('_-_[T1:Term, T2:Term]) = sub(compileExp(T1:Term), compileExp(T2:Term)) .
  eq compileExp('_*_[T1:Term, T2:Term]) = mul(compileExp(T1:Term), compileExp(T2:Term)) .
  eq compileExp('_/_[T1:Term, T2:Term]) = div(compileExp(T1:Term), compileExp(T2:Term)) .
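
The syntax-directed mapping can be run on its own with a reduced target data type. The following sketch is a simplification under stated assumptions: the sorts Id and Exp and the constructors idn, add and mul only mimic their π Lib namesakes, identifiers are taken directly from meta-level constants with getName instead of going through compileId, and the module name is hypothetical.

```maude
--- Hypothetical, self-contained miniature of compileExp.
fmod COMPILE-EXP-SKETCH is
  protecting META-LEVEL .
  sorts Id Exp .
  subsort Id < Exp .
  op idn : Qid -> Id [ctor] .
  op add : Exp Exp -> Exp [ctor] .
  op mul : Exp Exp -> Exp [ctor] .
  op compileExp : Term -> Exp .
  vars T1 T2 : Term .
  var C : Constant .
  --- A meta-level constant such as 'a.Token becomes the identifier idn('a).
  eq compileExp(C) = idn(getName(C)) .
  --- Infix meta-terms are mapped to their prefix counterparts.
  eq compileExp('_+_[T1, T2]) = add(compileExp(T1), compileExp(T2)) .
  eq compileExp('_*_[T1, T2]) = mul(compileExp(T1), compileExp(T2)) .
endfm

reduce in COMPILE-EXP-SKETCH :
  compileExp('_+_['a.Token, '_*_['b.Token, 'c.Token]]) .
```

The input is the meta-representation of a + b * c, and the reduction yields add(idn('a), mul(idn('b), idn('c))). Each equation handles exactly one syntactic construct, which is what makes the transformer straightforward to extend.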

Figure 1 illustrates loading an Imp program that implements a Mutex protocol and model checking it for safety and liveness properties. (Imp command denotes non-deterministic choice.) State propositions and are automatically generated by the compiler to properly construct the π Automata for Mutex, as discussed in Section 3.4.

Figure 1: Loading and model checking a Mutex protocol in Imp that is safe but not live.

5 Related work

First and foremost there is the work by Peter Mosses on Component-Based Semantics [20] and funcons [9, 22], where programming language constructs are specified in Modular Structural Operational Semantics (MSOS). π Lib is inspired by this research and is also a result of the research on the relation between MSOS and Rewriting Logic, with an implementation in Maude, that started in [19, 5], with Edward Hermann Haeusler, Peter Mosses and José Meseguer, and continued with Fabricio Chalub [7, 8]. Despite their common roots, funcons and π Lib have different models. The models of funcons are Arrow-labeled Transition Systems, whereas π Lib descriptions are to be interpreted as π Automata, as described in Section 3. π Automata can be understood as unlabeled transition systems. This makes it easy to relate π Automata with term rewriting systems and to implement them efficiently when transition rules are mapped to unconditional rewrite rules. This is in contrast, for instance, with previous work by Chalub and the author on the MSOS Tool in Maude [8], which represents transition rules in MSOS as conditional rewrite rules in Maude.

In [22], Mosses and Vesely propose an implementation of Component-Based Semantics using the K Framework (e.g. [25]). K aims at being a methodology to define languages with tools for formal language development. It is based on concepts from Rewriting Logic Semantics, with some intuitions from Chemical Abstract Machines [4] (CHAMs) and Reduction Semantics [13] (RS). Abstract computational structures contain the context needed to produce a future computation (like continuations). Computations take place in the context of a configuration, which is hierarchically made up of K cells. Each cell holds specific pieces of information such as computations, the environment, and the memory store. K specifications allow for equations and rules. Equations (representing heating and cooling processes) manipulate term structure, as opposed to rules, which are computational and may be concurrent, similar to how Rewriting Logic understands equations and rules. K has established itself as a powerful framework for language semantics (e.g. the formal semantics for the Ethereum Virtual Machine [16]). However, it has a non-trivial model, with many different concepts coming from different frameworks such as MSOS, Rewriting Logic, Reduction Semantics, and CHAM. The combination of Component-Based Semantics and K in [22] indeed provides a powerful tool for language semantics descriptions.

π Automata, as described in Section 3, is a less ambitious framework when compared with K, being conceived to be simple, easily integrated into an undergraduate-level course, and with an efficient implementation in Maude, as K is. As a matter of fact, it has several intersections with K given their common roots in Rewriting Logic Semantics and MSOS. Due to π Automata’s simpler automata-based model, it appears to be a better candidate than K for teaching formal semantics and compiler construction. It smoothly connects with courses such as Introduction to Programming Languages, Programming Languages Semantics and Formal Languages and Automata Theory, with good properties such as expressivity, efficiency and support for automata-based automated specification and reasoning.

6 Conclusion

Summary. This paper discusses the π framework for teaching formal compiler construction. It has a denotational character, and its implementation, called π Lib, builds on Peter Mosses’ Component-Based Semantics [21]. The framework implements a library of common programming language constructions, such as assignments, function declarations and function calls. The semantics of a programming language is then given in a syntax-directed way, by expressing the denotations of the given programming language constructs in terms of π Lib elements. The semantics of π Lib is also described formally: each element in π Lib is specified in terms of π Automata. Essentially, a π Automata describes both static and dynamic semantics by means of (unconditional) rules that relate sets of semantic components, such as the memory store, the environment, a control stack and a value stack. The term π Automata is overloaded to refer also to a π Automata with a set of state propositions that are used to validate it using automata-based techniques such as model checking. π Automata is a generalization of Plotkin’s Interpreting Automata [23]. Currently, the approach is implemented in Maude, yielding an effective tool for formal compiler construction and program verification. The latter is accomplished when the formal tools in Maude, such as term rewriting, narrowing and LTL model checking, are lifted to a given programming language in π. The current prototype implementation of π in Maude is available at http://github.com/ChristianoBraga/BPLC, together with an implementation for an imperative language called Imp, available in the same repository.

Preliminary assessment and future work. π appears to be a suitable approach to teach compiler construction, as much as Component-Based Semantics is to teach formal semantics of programming languages [21], since one only works with a small set of programming constructions that may be used to give semantics to different programming languages, in different paradigms, with a model amenable to automated verification, as discussed in Section 3.4. The approach proposed in this paper has been class tested for the past year, with quite positive results. All students completed their projects within the academic semester and reported it as a rewarding experience, albeit one with a lot of work. The framework appears to ease understanding of the meaning of the constructions when compared to their SOS counterparts. Even though a complete Maude implementation of π is available as a reference, ways of stimulating its use and sandboxing with it need to be developed. In this context, perhaps an interesting discussion regards the definition of a meta-language for describing compilers. At first, our intention is to make the π Lib library available in different programming languages and let one choose one’s preferred parsing/transformation framework. However, this choice appears to have some undesirable pedagogical consequences. The Maude implementation of π, for instance, uses meta-programming techniques that create some resistance to the understanding of the rather simple aspects of π Lib and its π Automata semantics. The author foresees the continuation of this work by addressing the issues raised in this preliminary assessment and by extending the π Lib library with new constructs and improving code generation and validation techniques.

Acknowledgements

The author would like to warmly thank Fabricio Chalub, Narciso Martí-Oliet and Leonardo Moura for their comments on a draft of this paper, and Fabricio Chalub, Edward Hermann Haeusler, José Meseguer and Peter D. Mosses for the long-term collaboration that inspired the work discussed in this manuscript.

References

  • [1] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 2006.
  • [2] A. Arnold. Finite Transition Systems: Semantics of Communicating Systems. Prentice Hall, Upper Saddle River, New Jersey 07458, 1994.
  • [3] F. Baader and T. Nipkow. Term Rewriting and All That. Cambridge University Press, New York, NY, USA, 1998.
  • [4] G. Berry and G. Boudol. The chemical abstract machine. Theoretical Computer Science, 96(1):217 – 248, 1992.
  • [5] C. Braga and J. Meseguer. Modular rewriting semantics in practice. In N. Martí-Oliet, editor, Proceedings of 5th International Workshop on Rewriting Logic and its Applications, WRLA 2004, volume 117, pages 393–416. Elsevier, 2005.
  • [6] F. Chalub. An implementation of Modular Structural Operational Semantics in Maude. Master’s thesis, Universidade Federal Fluminense, 2005.
  • [7] F. Chalub and C. Braga. A modular rewriting semantics for CML. Journal of Universal Computer Science, 10(7):789–807, 2004.
  • [8] F. Chalub and C. Braga. Maude MSOS Tool. Electronic Notes in Theoretical Computer Science, 176(4):133–146, 2006.
  • [9] M. Churchill, P. D. Mosses, N. Sculthorpe, and P. Torrini. Reusable components of semantic specifications. In S. Chiba, É. Tanter, E. Ernst, and R. Hirschfeld, editors, Transactions on Aspect-Oriented Software Development XII, pages 132–179, Berlin, Heidelberg, 2015. Springer Berlin Heidelberg.
  • [10] E. M. Clarke, Jr., O. Grumberg, and D. A. Peled. Model Checking. MIT Press, Cambridge, MA, USA, 1999.
  • [11] M. Clavel, F. Durán, S. Eker, S. Escobar, N. Martí-Oliet, P. Lincoln, J. Meseguer, and C. Talcott. Maude Manual (Version 2.7.1). SRI International and University of Illinois at Urbana-Champaign, http://maude.cs.uiuc.edu/maude2-manual/, July 2016.
  • [12] M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and C. Talcott. All About Maude - a High-performance Logical Framework: How to Specify, Program and Verify Systems in Rewriting Logic. Springer-Verlag, Berlin, Heidelberg, 2007.
  • [13] M. Felleisen and R. Hieb. The revised report on the syntactic theories of sequential control and state. Theoretical Computer Science, 103(2):235 – 271, 1992.
  • [14] J. A. Goguen and G. Malcolm. Algebraic Semantics of Imperative Programs. MIT Press, Cambridge, MA, USA, 1996.
  • [15] R. Goldblatt. Logics of Time and Computation, volume 7 of CSLI Lecture Notes. Center for the Study of Language and Information, Stanford, CA. ISBN: 0-937073-94-6, second edition, 1992.
  • [16] E. Hildenbrandt, M. Saxena, X. Zhu, N. Rodrigues, P. Daian, D. Guth, and G. Rosu. KEVM: A complete semantics of the Ethereum Virtual Machine. Technical report, University of Illinois at Urbana-Champaign, http://hdl.handle.net/2142/97207, 08 2017.
  • [17] N. Martí-Oliet and J. Meseguer. Handbook of Philosophical Logic, volume 9, chapter Rewriting logic as a logical and semantic framework, pages 1–87. Kluwer Academic Publishers, P.O. Box 17, 3300 AA Dordrecht, the Netherlands, 2002.
  • [18] J. Meseguer. Conditional rewriting as a unified model of concurrency. Theoretical Computer Science, 96(1):73–155, April 1992.
  • [19] J. Meseguer and C. Braga. Modular rewriting semantics of programming languages. In C. Rattray, S. Maharaj, and C. Shankland, editors, Algebraic Methodology and Software Technology: proceedings of the 10th International Conference, AMAST 2004, volume 3116 of LNCS, pages 364–378, Stirling, Scotland, UK, July 2004. Springer.
  • [20] P. D. Mosses. Component-based description of programming languages. In Proceedings of the 2008 International Conference on Visions of Computer Science: BCS International Academic Conference, VoCS’08, pages 275–286, Swindon, UK, 2008. BCS Learning & Development Ltd.
  • [21] P. D. Mosses. Fundamental concepts and formal semantics of programming languages. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.457.4959&rep=rep1&type=pdf, 2009.
  • [22] P. D. Mosses and F. Vesely. FunKons: Component-based semantics in K. In S. Escobar, editor, Rewriting Logic and Its Applications, pages 213–229, Cham, 2014. Springer International Publishing.
  • [23] G. D. Plotkin. A structural approach to operational semantics. Journal of Logic and Algebraic Programming, 60–61:17–139, 2004. Special issue on SOS.
  • [24] G. Roşu. From conditional to unconditional rewriting. In J. L. Fiadeiro, P. D. Mosses, and F. Orejas, editors, Recent Trends in Algebraic Development Techniques, pages 218–233, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
  • [25] G. Roşu and T. F. Şerbănuţă. An overview of the K semantic framework. Journal of Logic and Algebraic Programming, 79(6):397–434, 2010.
  • [26] P. Viry. Elimination of conditions. J. Symb. Comput., 28(3):381–400, 1999.