Parametric Denotational Semantics for Extensible Language Definition and Program Analysis

11/30/2018 ∙ by In-Ho Yi, et al. ∙ The University of Melbourne

We present a novel approach to construction of a formal semantics for a programming language. Our approach, using a parametric denotational semantics, allows the semantics to be easily extended to support new language features, and abstracted to define program analyses. We apply this in analysing a duck-typed, reflective, curried dynamic language. The benefits of this approach include its terseness and modularity, and the ease with which one can gradually build language features and analyses on top of a previous incarnation of a semantics.


1 Introduction

Programming language semantics is a sub-field of theoretical computer science in which researchers develop formal descriptions of the meaning of computer programs. Over the years, we have seen the development of denotational semantics, in which we mathematically model the effect of executing a language construct, and of operational semantics, which formalises the mechanical steps that transform program states given a particular program. As we shall argue in this thesis, the challenge of analysing dynamic languages, both concretely and abstractly, necessitates a semantics that bridges the gap between these two styles in order for such a task to be feasible.

Abstract interpretation is a unifying theory for program analysis and verification with which we ascertain run-time properties of a program by approximating its semantics. The properties of interest are almost always undecidable. The task of abstract interpretation can be thought of as over-approximating a set of concrete states in a finite number of steps. The usual semantic domain is replaced by an abstract domain whose elements each describe a set of run-time states. Mathematically, such an abstract domain is a partially ordered set (forming a lattice), the ordering corresponding to the subset ordering of the powerset of concrete states.

Two distinct needs motivated the development of the present work. First, there is the theoretician’s need for a simple, concise, elegant way of presenting a formal semantics for a programming language and of developing that into various static analyses. We base our approach on a parametric denotational semantics that is modularised to allow the concrete semantics and the abstract interpretation to share a common framework that uniformly handles most aspects of the programming language. Use of denotational semantics provides a strong foundation in proving correctness of an abstract interpretation, and allows us to focus on algorithmic details of analysis.

Second, there is a practical need for program analyses suitable for the dynamic languages that have been growing in popularity in recent years. Traditionally these languages were called “scripting” languages, as they were mainly used for automating tasks and processing strings. However, with the advent of web applications, languages such as Perl and PHP gained popularity as languages for web application development. On the client side, web pages make heavy use of JavaScript, a dynamically typed language, to deliver dynamic contents to the browser. Recent years have seen an increasing use of JavaScript on the server side, as well.

What these languages provide is an ability to rapidly prototype and validate application models in a real time read-eval-print loop. Another strength comes from the fact that programmers do not need to have a class structure defined upfront. Rather, class structures and types of variables in general are dynamically built. This reduces the initial overhead of software design.

However, these features come at a cost. The lack of a formal, static definition of type information makes dynamically typed languages harder to analyse. This difficulty causes several practical problems.

  • As applications become more mature, more effort is devoted to program unit testing and writing assertions to ensure type safety of systems. This extra effort can sometimes outweigh the benefit of having a dynamically typed language.

  • Whereas programmers using statically typed languages enjoy an abundance of development tools, the choice of tools for development in dynamically typed languages is limited, and the tools that do exist lack much of the power of the tools for statically typed languages, owing largely to the difficulty or infeasibility of type analysis for such languages.

  • Lack of static type structure has a significant impact on the performance of dynamically typed languages.

With these problems in mind, we have designed a model language whose dynamic features are comparable to those of the aforementioned scripting languages: duck typing, reflection, and partial function application. A notable omission is closure scoping. However, function currying gives the language expressive power comparable to that of languages with closures or lexical scoping.

The two concerns are not distinct ones, but an interconnected dialectic. The theoretical need is there because of the difficulty of describing the abstract and concrete meaning of dynamic languages, which often allow side-effect causing, type-altering functions. With such complexity, duck-typed languages are interesting test cases for which we formulate concrete semantics, abstract interpretation and the proof of correctness. Our Haskell implementation of both concrete and abstract analysis, appearing in the appendix to this thesis, illustrates the practicality of the proposed programming language semantics.

This work is inspired by Haskell’s use of monads, and we assume the reader’s familiarity with monadic style Haskell programming. We also assume knowledge of lambda notation, denotational semantics, order and fixed point theory at the level of Nielson and Nielson’s textbook [1].

In the following section, we discuss other works in the field of language semantics and differentiate our work from them. In section 3, we give a general overview of the proposed language semantics framework and analysis. In sections 4 and 5, we formally introduce our framework. In section 6, we develop a model language with features gradually added on. We also present concrete and abstract analysis of the language in each stage of development in parallel. In sections 7 and 8, we argue formal properties of the language analysis. Finally, in section 9, we conclude this thesis and discuss future directions.

2 Related work

Denotational semantics is the starting point of our development of a formal framework. The idea of incorporating monads into denotational definitions was developed by Liang and Hudak [2, 3]. Whereas these works modularise an analytic framework by having multiple layers of monadic transformations, we instead parametrise the definition of a program state.

Action semantics, as advanced by Mosses [4], shares the motivation that semantics ought to be pragmatic, yet expressive enough to deal with non-trivial, feature-rich languages. While action semantics endeavours to devise a new meta-language for describing semantics, we constrain ourselves to the language of denotational semantics, and seek to devise a formalism largely compatible with denotational semantics.

The idea of constructing formulae with parametric types can be found in Wadler’s work [5]. The present work is a special application of the parametricity in the field of language semantics and analysis.

Regarding the type analysis of dynamic languages, there have been numerous studies [6, 7, 8, 9] that consider simple toy languages and their semantics for the purpose of static analysis of dynamic languages. A major difference between those languages and the model language presented in this paper is that our language is designed to capture the critical feature of real world languages which allows functions to alter types through side-effect causing statements. We point out similarities and differences of this work compared to the cited works as we encounter them in this thesis.

Type analysis plays a crucial part in compiling scripting languages, mainly to improve performance. Ancona et al [10] and Dufour [11] design restricted versions of scripting languages so that static inference of types can be performed. We adopt several techniques employed in those projects, such as the use of named memory allocation sites as static references.

An important use case of functions in dynamically typed languages is “mixin” functions [12]. By passing arguments to a mixin function, objects can be extended with extra methods; that is, functionality can be added dynamically. There are model languages and formalisations of mixin functions, such as the works of Anderson et al [6] and Mens et al [13]. Where those works seek to find functional models for mixins, we define instead a language (with side-effect causing functions) that is expressive enough to program mixin inheritance.

Jensen et al [14] describe a feature-complete analyser for the JavaScript language. Our work can be extended further to provide the semantic foundation for such an analyser. Such an attempt to formalise the analysis might pave the way for further refinement and improvement.

3 Overview

Our semantic framework consists of two components: one for the syntactic structure, and the other for giving meanings to the primitive operations. What divides the two is the following separation of concerns:

  1. What are the semantic operations entailed in a particular syntactic structure? For example, the syntactic structure of an assignment statement entails a primitive assignment operation.

  2. How do we interpret such semantic operations from a particular point of view? If we were to give a concrete interpretation, we would interpret assignment as updating an environment with the newly defined variable.

Observe that an interpretation of syntactic structure can remain agnostic of the structure of a program state at a given point. Therefore, once we remove the actual interpretation of primitive operations, what remains in a semantics can be re-used for multiple interpretations of the language. Hence, not only are the primitive operations parametrised, but so is the whole definition of the domain of the program state. Such a separation of concerns also helps to define an extensible semantics, to which adding a new feature takes as little effort as possible.
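This separation can be illustrated with a minimal sketch. It is written in Python for compactness (the paper's own implementation is in Haskell) and is not the paper's formalisation: the primitive names `con` and `asg`, the dictionary-based state, and the type names `Num` and `Bool` are assumptions made for illustration. One syntax-directed function for assigning a constant is written once against the primitive operations, then reused under a concrete and an abstract interpretation.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]
Prims = Dict[str, Callable[..., Any]]


def sem_assign(ident: str, const: Any, prims: Prims) -> Callable[[State], State]:
    """Meaning of `ident = const`, stated only via primitive operations.

    This function never inspects the state itself, so it can be shared by
    every interpretation that supplies `con` and `asg` (hypothetical names).
    """
    def run(state: State) -> State:
        value = prims["con"](const)               # meaning of the constant
        return prims["asg"](ident, value, state)  # meaning of assignment
    return run


def update_env(ident: str, value: Any, state: State) -> State:
    """Return a copy of the environment with `ident` bound to `value`."""
    return {**state, ident: value}


# Concrete interpretation: a constant denotes itself.
concrete: Prims = {"con": lambda c: c, "asg": update_env}

# Abstract interpretation: a constant is approximated by its type.
# bool is tested first because Python's bool is a subclass of int.
abstract: Prims = {
    "con": lambda c: "Bool" if isinstance(c, bool) else "Num",
    "asg": update_env,
}

print(sem_assign("x", 50, concrete)({}))   # {'x': 50}
print(sem_assign("x", 50, abstract)({}))   # {'x': 'Num'}
```

Only the table of primitives changes between the two runs; `sem_assign` is untouched, which is the kind of reuse the parametrisation is meant to provide.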

Now we give a formal definition of our framework.

Definition 1 (Parametric semantics).

A parametric semantics is a quintuple where is a collection of semantic functions for syntactic structures, as outlined below; is a set of representations of computation state, which can be anything to suit a particular analysis; is a set of all possible values that an expression can be evaluated to be; is an initial program state; and is the set of primitive operations of the semantics. We assume throughout that different states are incomparable. In other words, is ordered by identity.

Throughout the analysis, these primitive operations are the parameters of our analysis:

  • takes a state and reports whether it is escaping (i.e., whether or not control flow reaches the successor statement)

  • interprets the meaning of a branching point when a value and two transformations (one for true and another for false) are given

  • takes an identifier and a value, and performs assignment

  • takes an identifier and produces its meaning

  • takes a constant and produces its meaning

  • and define the meanings of console I/O operations

  • defines the meaning of all binary operations given two values

  • defines the meaning of a return statement given the value to be returned

  • defines the meaning of dynamic execution of a function declaration

  • defines the meaning of (possibly partially) applying a function to a list of values

  • and define the meaning of getting or setting a member of an object

  • and define the meanings of keywords and , respectively

  • defines the meaning of instantiating a new object from a particular allocation site

Types of these operations are given in section 6 as we introduce them.

Our model language, as we let it evolve through this thesis, has a set of features found commonly in scripting languages. In the remainder of the thesis we provide the semantics for a language with many different features. We introduce the components of the language step by step. The aim is to demonstrate that the semantic formalism enables such a stepwise development, each step being incremental in the sense that it does not require revision of the semantic equations developed in earlier steps.

 1  function fact(f,x) {
 2    if(x < 2) { return 1; }
 3    return x * f(f,x-1);
 4  }
 5  output fact(fact,input);
 6
 7  fa=fact(fact);
 8  output fa(input);
 9
10  function Fruit(v) {
11    this.value = v;
12  }
13
14  global.answer = 0;
15
16  function juicible(fruit, juice) {
17    function juiceMe(j,x) {
18      return this.value + j + x;
19    }
20    fruit.juice = juiceMe(juice); #currying
21    global.answer=42;
22  }
23
24  # Juicibles
25  apple = new Fruit(15);
26  juicible(apple, 20);
27  grape = new Fruit(30);
28  juicible(grape, 50);
29
30  # Non-juicibles
31  banana = new Fruit(20);
32  watermelon = new Fruit(25);
33
34  output apple.juice(10); # 15 + 20 + 10
35  output grape.juice(10); # 30 + 50 + 10
36  output global.answer; # 42
37
38  try {
39    if(input > 42) {throw 42;}
40  } catch(e) {
41    output e;
42  }

Figure 3.1: Example SDTL program

Figure 3.1 is an example of a program written in the model language. We call this model language Simple Duck-Typed Language (SDTL). A locally-scoped procedural language with support for higher order functions (lines 1 to 5) is introduced in Section 6.1. Function currying (lines 7, 8 and 20) is introduced in Section 6.2. Object oriented features, including duck-typing and reflection, are introduced in Section 6.3. Finally, exception handling (lines 38 to 42) is introduced in Section 6.4.

4 Analytic framework

In this section we introduce a monadic construct specifically designed for the purpose of program analysis. We then introduce polymorphic auxiliary functions that are useful in extending theories in a modular manner.

First we define the monadic constructions. We define a type constructor and a bind operator .

Definition 2 (Type constructor).

The type constructor has the following polymorphic definition. is the set of program semantics. It is necessary to have this as an input to the state transformation in order to give the fixed point characterisation of semantics. is given a formal definition in section 5. The parameter to the type is used in different contexts to extract different information from the semantics.

Observe that a single state can give rise to multiple corresponding successor states. We are essentially modelling a non-deterministic state transformation. This gives us the flexibility to handle both concrete and abstract semantics within a single framework.

Every statement is understood as a state transformer. We distinguish between “normal” and “escaping” statements, the latter yielding an “escape” state. For example, when a function returns, the return statement transforms the current state into an escape state. Our “bind” operator relies on a parametric operation to spell out the precise mechanism for escaping the current program execution flow. The function returns true if a state does not continue to the next expression or statement (having encountered a return statement, for example). This provides a flexible and general formalisation of a control flow, and it allows the handling of exceptions as well as function return statements.

In such escaping cases, there is no appropriate value of the type to be associated with the successor states. Hence, we introduce to be assigned to successor states of the escaping states.

Definition 3 (Bind operator).

We define a bind operator .
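The behaviour described above can be sketched as follows. This is a simplified illustration, not the paper's definition: the computation type omits the program-semantics input, successor states are kept in a list, `None` stands in for the placeholder value attached to escaping states, and the state layout (`env`, `ret`) and helper names are assumptions.

```python
from typing import Any, Callable, List, Tuple

State = dict
# A monadic computation maps one state to a list of (value, state) successors.
M = Callable[[State], List[Tuple[Any, State]]]

NO_VALUE = None  # stands in for the placeholder value given to escaping states


def bind(m: M, k: Callable[[Any], M], is_escaping: Callable[[State], bool]) -> M:
    """Run m, then feed each normal result to k; escaping states bypass k."""
    def run(state: State) -> List[Tuple[Any, State]]:
        successors = []
        for value, mid in m(state):
            if is_escaping(mid):
                successors.append((NO_VALUE, mid))  # skip the rest of the flow
            else:
                successors.extend(k(value)(mid))
        return successors
    return run


# Hypothetical concrete instance: a state escapes once "ret" holds a value.
def is_escaping(state: State) -> bool:
    return state["ret"] is not None

def ret_stmt(value: Any) -> M:
    return lambda s: [(NO_VALUE, {**s, "ret": value})]

def assign(ident: str, value: Any) -> M:
    return lambda s: [(NO_VALUE, {**s, "env": {**s["env"], ident: value}})]

start = {"env": {}, "ret": None}
after_return = bind(ret_stmt(1), lambda _: assign("x", 2), is_escaping)
normal_flow = bind(assign("x", 2), lambda _: assign("y", 3), is_escaping)
print(after_return(start))  # [(None, {'env': {}, 'ret': 1})]
print(normal_flow(start))   # [(None, {'env': {'x': 2, 'y': 3}, 'ret': None})]
```

Because the escape test is a parameter of `bind`, the same sequencing code could serve an interpretation in which a thrown exception, rather than a return value, marks a state as escaping.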

Definition 4 (Point-wise ordering of state transformations).

Given ,
iff

Definition 5 (Point-wise ordering of monadic functions).

Given , iff

Theorem 6 (Preservation of monotonicity).

Given monads , and ,

Proof.

When a state is a member of for some and an initial state , there exists an intermediate state from which is derived by . Clearly, such intermediate state is also a member of by definition of point-wise ordering.

Formally, by the definition of bind operation. Now, . Hence, . ∎

Having a monadic structure helps provide modularity. For example, if a particular parametrised operation takes a state but only produces a value, it would be redundant to include a state as part of its return type merely to match the definition of monadic binding. In such a case, we take a function that returns only a value and lift it for use in the monadic context.

Definition 7 (Monadic functions).

We define the following auxiliary functions to incorporate non-monadic functions as a part of monadic transformation:

  • (return for ) is an identity state transformer that takes a constant and lifts it to an identity state transformer with the constant as a return value

  • (lift for ) lifts a function that takes a state and returns a value to a monadic function

  • takes a non-deterministic transformation and lifts it to a monadic function
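A minimal sketch of these three auxiliary functions, under the same simplifying assumptions as before (no program-semantics input, lists in place of sets). The names `unit`, `lift` and `lift_nd` are hypothetical, and the value paired with the successors in `lift_nd` is an assumption, since the original definition is not reproduced here.

```python
from typing import Any, Callable, List, Tuple

State = dict
M = Callable[[State], List[Tuple[Any, State]]]


def unit(constant: Any) -> M:
    """Identity state transformer that returns `constant` as its value."""
    return lambda state: [(constant, state)]


def lift(f: Callable[[State], Any]) -> M:
    """Lift a state -> value function; the state passes through unchanged."""
    return lambda state: [(f(state), state)]


def lift_nd(t: Callable[[State], List[State]], value: Any = None) -> M:
    """Lift a non-deterministic state -> states transformation.

    Each successor state is paired with `value`; None is an assumption.
    """
    return lambda state: [(value, successor) for successor in t(state)]


lookup_x = lift(lambda s: s["x"])
either_branch = lift_nd(lambda s: [{**s, "x": "Num"}, {**s, "x": "Bool"}])

print(unit(42)({"x": 1}))        # [(42, {'x': 1})]
print(lookup_x({"x": 1}))        # [(1, {'x': 1})]
print(either_branch({"x": 1}))   # [(None, {'x': 'Num'}), (None, {'x': 'Bool'})]
```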

Definition 8 (Record updater).

We model a state as a record with named fields. In this way, an update operation written for a particular set of fields can be reused without redefining it when we add extra dimensions to a domain to accommodate features that are orthogonal to the features of the previous version.

When we have a record with named fields , and when an updater function updates fields , we define a function that takes a record, projects its fields into an n-tuple corresponding to the selected fields (), lets update the tuple, and finally updates the whole record with the updated tuple ().

where is a value for the field of a record

Similarly, we define an operation to update a record and return a value.

where and

Finally, we define a value extractor, that takes a record and selects a value from it.

When is an n-tuple space for chosen fields and is a domain of a record, the functions defined here have the following type signatures:

Example 9 (Record updater example).

To see these functions in use, suppose we have a simple record structure for personal contacts.

  • updates age field of a contact record.

  • returns the previous age field value while updating the age field.

  • extracts age information from a contact record.
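The three operations of Definition 8 might look as follows on a dictionary-based record. The function names, the contact fields, and the order in which the updated record and the returned value are paired are all assumptions for illustration.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

Record = Dict[str, Any]


def project(record: Record, fields: Sequence[str]) -> Tuple[Any, ...]:
    """Project the selected fields of a record into a tuple."""
    return tuple(record[name] for name in fields)


def updater(fields, f: Callable[[tuple], tuple]) -> Callable[[Record], Record]:
    """Update only `fields`; every other field is carried over untouched."""
    def run(record: Record) -> Record:
        new_values = f(project(record, fields))
        return {**record, **dict(zip(fields, new_values))}
    return run


def updater_with_value(fields, f: Callable[[tuple], Tuple[tuple, Any]]):
    """Like `updater`, but f also yields a value that is returned alongside."""
    def run(record: Record) -> Tuple[Record, Any]:
        new_values, value = f(project(record, fields))
        return {**record, **dict(zip(fields, new_values))}, value
    return run


def extractor(fields, f: Callable[[tuple], Any]) -> Callable[[Record], Any]:
    """Select a value from the chosen fields without changing the record."""
    return lambda record: f(project(record, fields))


contact = {"name": "Ada", "age": 36, "email": "ada@example.org"}

set_age_37 = updater(("age",), lambda t: (37,))
swap_age_37 = updater_with_value(("age",), lambda t: ((37,), t[0]))
get_age = extractor(("age",), lambda t: t[0])

print(set_age_37(contact)["age"])   # 37
print(swap_age_37(contact)[1])      # 36 (the previous age)
print(get_age(contact))             # 36
```

Adding a new field to `contact` requires no change to `set_age_37`, which mirrors the reuse argument made for extending the state domain.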

Definition 10 (Singleton lifting).

Another commonly occurring pattern is that functions often return a singleton set. We define a function that takes a function returning a value and lifts it to be a function that returns a singleton set.

For simplicity of notation, we compose this function with the other functions from Definition 8.

We now have monadic constructs and auxiliary functions to describe the semantic functions of the model language. We can now define the semantic functions of the language.

5 Semantic functions

We use syntax nodes as references to the various items constituting the program environment. To every statement and expression in a program we assign a unique identifier by which to reference it. For that purpose, we define the following syntactic nodes and unique identifier spaces.

  • is the set of statement nodes.

  • is the set of expression nodes.

  • is the set of left-expression nodes.

  • is the set of statement identifiers.

  • is the set of expression identifiers.

  • is the set of alphanumeric identifiers.

Note that we use an sid of a function declaration statement as a reference point for the function defined. The and functions take such an sid and return a list of parameter names, and the arity of the function, respectively.

Where we specifically refer to an identifier to a syntactic construct, we write to mean a statement or expression with an id . In cases where such identifiers are not directly referenced, we omit them for simplicity.

Definition 11 (Semantic functions).

The analytic framework contains the following semantic functions:

and are semantic functions for statements, expressions and left expressions, respectively. is a function space to model the collection of functions in a program. Given the sid of a function declaration site, it gives a statement node for function declaration and a state transformer. Note that in this picture a function “returns” a value by giving a state transformation. Incorporating such a concept as a return value in a itself provides a greater flexibility in describing the effects of executing a statement or an expression at a particular program point.

We define the following auxiliary functions to describe the use of the references to syntax nodes.

6 The language under study

We define the model language, the SDTL (Simple Duck-Typed Language).

6.1 The procedural core language

We start off with a procedural language with C-like syntax.

<con> ::= <Num> | <Bool>

<Lexp> ::= ID

<Exp> ::= <con>
        | <Lexp>
        | ‘input’
        | <Lexp> ‘(’ [<Exp> [,<Exp>]*]? ‘)’
        | <Exp> <binop> <Exp>
        | ‘(’ <Exp> ‘)’

<binop> ::= ‘+’ | ‘-’ | ‘*’ | ‘/’ | ‘>’ | ‘<’ | ‘==’

<Stm> ::= nil
        | <Stm> ‘;’ <Stm>
        | <Exp>
        | ‘output’ <Exp>
        | <Lexp> ‘=’ <Exp>
        | ‘if’ ‘(’ <Exp> ‘)’ ‘{’ <Stm> ‘}’
        | ‘if’ ‘(’ <Exp> ‘)’ ‘{’ <Stm> ‘}’ ‘else’ ‘{’ <Stm> ‘}’
        | ‘while’ ‘(’ <Exp> ‘)’ ‘{’ <Stm> ‘}’
        | ‘function’ Id ‘(’ [Id [, Id]*]? ‘)’ ‘{’ <Stm> ‘}’
        | ‘return’ <Exp>

Here Num and Bool are the syntactic categories for integers and boolean values.

SDTL does not have a separate category for function and variable declarations. Variables are declared ad hoc whenever such variables appear as a left expression to assignment statements. Function declarations are statements themselves, which allow them to appear anywhere in the program.

SDTL supports higher-order functions, which allows functions to be recursively referenced. For example, we can define a factorial function in a recursive manner.

Example 12 (Recursively defined factorial function).

In SDTL, the factorial function can be implemented in a recursive way.

1  function fact(f,n) {
2    if(n>1) { return f(f,n-1) * n; } else { return 1; }
3  }
4
5  z=fact(fact,input);
6  output z;

In this example, the function takes two arguments. The first is the function pointer to recursively invoke, and the second is the usual argument to the function. This example illustrates that recursive functions are possible even in the absence of lexical scoping or other special scoping rules to allow a function body to refer to the function itself.

Given the availability of higher-order functions, we formulate the meaning of a function as a fixed point (see the definition of ). The semantic functions for SDTL are defined in figure 6.1. We use auxiliary functions and to describe a function call. (Note that we use for the empty sequence, for the set of sequences of any number of values of type , and the notation to denote concatenation of sequences and .)

is a parametrised function that takes a caller’s state at the time of function invocation and an id-to-constant value mapping, and constructs an initial state for a callee. takes both the caller’s state and the resulting states of callee’s, and constructs the caller’s states after the function call. These functions are parametrised so as to allow each interpretation to define the exact shape of a program state and its manipulation during a function call and return. These functions have the following types:

Figure 6.1: Semantic equations for a procedural core of SDTL

The types of primitive operations are as follows.

6.1.1 Concrete interpretation

Domain

is a Cartesian product of environment, input/output state and a return value. The return value is set to be a value when a function is returning any value inside a function body. This has been incorporated as a part of a program state so that we can signal escaping from a program flow. The initial program state is where is an initial IO state.

Functions

We omit a detailed description of the IO environment. Normally, IO can be modelled as a queue of inputs and outputs as they are given and produced during the execution of a program.

Example 13 (Concrete interpretation of recursive factorial function).

The program in example 12 is concretely interpreted as follows.

  • At line 1, updates environment to be assuming the function declaration has a unique id of 1.

  • At line 5, gives a user input. Assume that the input was 2. gives . In a function call , we first construct the initial state of a function call. gives .

  • At line 2, evaluating expression yields true. invokes another function call, with initial state .

  • On the second call fact(f,1), yields false. Hence, invokes which gives final state of .

  • On the first call, this state is first evaluated to yield value by . Then, f(f,1) * 2 evaluates to 2, which becomes the ultimate return value.

  • At line 5, after the function call gives as the final state. adds symbol to the environment:

  • At line 6, evaluates to 2, which is the final output of the program.

6.1.2 Abstract interpretation

At this stage, abstract interpretation looks largely similar to concrete interpretation. Notable differences are that we approximate each constant by its type, and that is a non-deterministic transformation where it collects effects of both branches at a branching point.
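The two differences can be sketched side by side. The function names, the type names `Num` and `Bool`, and the representation of a transformer as a function from one state to a list of successor states are assumptions for illustration, not the paper's definitions.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Transformer = Callable[[State], List[State]]


def con_abstract(constant: object) -> str:
    """Approximate a constant by its type (bool first: bool subclasses int)."""
    return "Bool" if isinstance(constant, bool) else "Num"


def cond_concrete(value: bool, if_true: Transformer, if_false: Transformer) -> Transformer:
    """Concrete branching: exactly one branch runs, chosen by the value."""
    return lambda state: if_true(state) if value else if_false(state)


def cond_abstract(value: str, if_true: Transformer, if_false: Transformer) -> Transformer:
    """Abstract branching: the value is only a type, so keep both branches."""
    def run(state: State) -> List[State]:
        merged: List[State] = []
        for successor in if_true(state) + if_false(state):
            if successor not in merged:      # drop duplicate states
                merged.append(successor)
        return merged
    return run


set_num = lambda s: [{**s, "x": con_abstract(1)}]
set_bool = lambda s: [{**s, "x": con_abstract(True)}]

print(cond_abstract("Bool", set_num, set_bool)({}))
# [{'x': 'Num'}, {'x': 'Bool'}]
```

The abstract version returns two successor states where the concrete one returns a single state, which is why the framework models state transformations as non-deterministic from the outset.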

Domain

Composition of an abstract domain is similar to that of the concrete counterpart, except that it does not include an IO state. is an undetermined value, which is used to approximate unknown function calls at an initial stage. The initial program state is .

Functions

The function definitions are largely similar to those of the concrete interpretation.

Example 14 (Abstract interpretation of recursive factorial function).

The program in example 12 is abstractly interpreted as follows.

  • At line 1, updates environment to be assuming the function declaration has a unique id of 1.

  • At line 5, gives an abstract value . gives . In a function call , we first construct initial state of a function call. gives

  • The meaning of this function call is determined via a fixed point iteration by progressively updating the current approximation of the meaning of the function call, starting from a null hypothesis that the function call does not return any state.

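[Table: successive approximations of the meaning of the function call, with columns “Current approximation”, “Meaning of function call” and “Note”, ending in a row marked “Fixed point”.]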

Implementation of this fixed point iteration is found in the function in appendix A.4.

  • This yields

  • After , and , we have the final state of the program:
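The fixed point iteration for the recursive call can be illustrated with a heavily simplified sketch. Here the meaning of a call to `fact` is reduced to the set of abstract values it may return, the starting hypothesis is the empty set (the call returns nothing), and `Num` is an assumed name for the abstract number type. The functions `fact_step` and `fixed_point` are hypothetical and are not the appendix implementation.

```python
from typing import Callable, FrozenSet

Approx = FrozenSet[str]   # set of abstract values a call may return


def fact_step(approx: Approx) -> Approx:
    """One refinement of the meaning of a call to fact(f, n).

    `approx` is the current guess for what the recursive call f(f, n-1)
    may return; the result is what the whole body may return.
    """
    else_branch = {"Num"}                                  # return 1
    then_branch = {"Num" for t in approx if t == "Num"}    # f(f,n-1) * n
    return frozenset(else_branch | then_branch)


def fixed_point(step: Callable[[Approx], Approx], bottom: Approx):
    """Iterate `step` from `bottom` until the approximation stops changing."""
    trace = [bottom]
    while True:
        nxt = step(trace[-1])
        if nxt == trace[-1]:
            return nxt, trace
        trace.append(nxt)


result, trace = fixed_point(fact_step, frozenset())
print([sorted(a) for a in trace])   # [[], ['Num']]
print(sorted(result))               # ['Num']
```

Termination of such a loop relies on the step function being monotone over a finite domain; with a single abstract value the iteration settles after one refinement.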

Example 15 (Abstract interpretation of a while loop).

The following example illustrates an interpretation of a while loop through a fixed point iteration.

The program calculates the sum of a sequence. For the purpose of illustration, we have added a variable x that changes its type inside the while loop.

 1  sum = 0;
 2  z = input;
 3  x = 50;
 4  while(z>0) {
 5    sum = sum + z;
 6    z = z - 1;
 7    x = true;
 8  }
 9
10  output sum;

  • At line 4, we have environment .

  • As an initial hypothesis, we assume that the statement body of a while loop does not cause any change in the program state for any given initial state. We progressively update this approximation until we meet a fixed point.

[Table: successive approximations of the loop body, with columns “Current Approximation”, “Init” and “Final State”.]

Implementation of this fixed point iteration can be found in the function in appendix A.4.

  • The resulting final states of the program are calculated to be
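A simplified sketch of the while loop analysis of Example 15 follows. It assumes that `input` is abstracted to `Num`, that the abstract loop condition cannot be decided (so both leaving and re-entering the loop are possible from every state), and that environments are plain variable-to-type mappings. The function names are hypothetical and this is not the appendix implementation.

```python
from typing import Dict, FrozenSet, Set, Tuple

Env = FrozenSet[Tuple[str, str]]   # hashable environment: (variable, type) pairs


def freeze(env: Dict[str, str]) -> Env:
    return frozenset(env.items())


def loop_body(env: Env) -> Env:
    """Abstract effect of lines 5 to 7 of the program on one environment."""
    updated = dict(env)
    updated["sum"] = "Num"    # sum + z with both operands Num
    updated["z"] = "Num"      # z - 1 with both operands Num
    updated["x"] = "Bool"     # x = true
    return freeze(updated)


def while_states(initial: Env) -> Set[Env]:
    """States that may hold on loop exit, since z > 0 is undetermined."""
    reachable = {initial}          # zero iterations: exit immediately
    worklist = [initial]
    while worklist:
        successor = loop_body(worklist.pop())
        if successor not in reachable:   # nothing new means a fixed point
            reachable.add(successor)
            worklist.append(successor)
    return reachable


at_line_4 = freeze({"sum": "Num", "z": "Num", "x": "Num"})
for env in sorted(sorted(e) for e in while_states(at_line_4)):
    print(dict(env))
# {'sum': 'Num', 'x': 'Bool', 'z': 'Num'}
# {'sum': 'Num', 'x': 'Num', 'z': 'Num'}
```

Under these assumptions the analysis ends with two possible states, differing only in whether `x` is still a number (the loop never ran) or has become a boolean (it ran at least once).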

6.2 Function currying

We now introduce function currying to the SDTL language. Introduction of this language feature allows the language to be flexible enough to express what JavaScript programmers would do with lexical scoping.

Example 16 (Function currying).

We take a simple add function, and curry one argument to produce different adders.

1  function add(x,y) {
2    return x+y;
3  }
4
5  add5 = add(5);
6  add7 = add(7);
7
8  output add5(input) + add7(input);

If we were to write this in JavaScript, we could have written the following for the same effect.

function adder(toadd) {
  return function(y) {
    return toadd+y;
  }
}

add5 = adder(5);
add7 = adder(7);
// suppose input is a platform-specific console input function
console.log(add5(input()) + add7(input()));

The introduction of function currying does not change the syntax of the language. Therefore, there is no inherent reason for changing semantic functions. However, we do redefine the semantics of function calls to include an eid as an input, for the reason explained below.

Here we introduce the function. It invokes the function when the function arguments are saturated, or it returns a pointer to a curried function otherwise. Its type is as follows.

6.2.1 Concrete interpretation

We need to extend the definition of to hold curried parameters. This means that the function also needs to be modified to match the new type signature.

Domain

The initial program state is unchanged.

Functions

Here, is the length of a sequence .

Example 17 (Concrete interpretation of function currying).

Consider the program shown in example 16.

  • At line 1, we are presented with a function declaration. adds the identifier as a reference to a function pointer with no curried value. Assuming has given a unique id of during the parsing of the program, we have in environment .

  • At lines 5 and 6, we partially apply the function. Since the number of arguments is not saturated, gives for . Similarly, gets .

  • At line 8, we saturate the parameters, give , where is an arbitrary number given from user input. Hence we have the addition done. It works similarly for .
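The saturation check at the heart of a curried call can be sketched as below. The representation of a function value as a pair of an sid and the arguments curried so far, the function table, and the name `call` are assumptions for illustration; the eid input mentioned in the text is omitted because the concrete case does not need it here.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# A function value: (sid of the declaration, arguments curried so far).
FunPtr = Tuple[int, Tuple[Any, ...]]

# Hypothetical function table: sid -> (arity, body). Here sid 1 is `add`.
FUNCTIONS: Dict[int, Tuple[int, Callable[..., Any]]] = {
    1: (2, lambda x, y: x + y),
}


def call(fn: FunPtr, args: Sequence[Any]) -> Any:
    """Invoke fn if its arguments are saturated, else return a curried pointer."""
    sid, curried = fn
    arity, body = FUNCTIONS[sid]
    supplied = curried + tuple(args)
    if len(supplied) < arity:
        return (sid, supplied)        # still waiting for more arguments
    return body(*supplied)            # saturated: run the function body


add: FunPtr = (1, ())
add5 = call(add, [5])
add7 = call(add, [7])
user_input = 10                        # stands in for `input`

print(add5)                                                  # (1, (5,))
print(call(add5, [user_input]) + call(add7, [user_input]))   # 32
```

Supplying more arguments than the arity is not handled in this sketch; the Python call to `body` would simply raise a `TypeError`.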

6.2.2 Abstract interpretation

Note that curried functions introduce a possibility of creating closures requiring an infinite number of arguments.

Example 18 (Currying loop).

Consider the following program.

1  function foo(a,b) {
2    return a;
3  }
4
5  x = 0;
6  while(true) {
7    x = foo(x);
8  }

If we naively interpret this program, we would not be able to reach a fixed point in analysis. Instead, we would have:

A solution to this problem is to have a curried function anchored to a particular language construct. In this case, we can use an eid of a curried expression as a point of reference (or ’0’ if not curried).
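The effect of anchoring can be illustrated with a toy model of the abstract value of `x` in Example 18. In the naive version a curried function value records its curried argument verbatim, so each loop iteration nests one level deeper. In the anchored version the value is identified by the eid of the call expression alone. The tuple encoding, the eid value 7, and the function names are assumptions; where the curried arguments are stored in the anchored case is not modelled.

```python
from typing import Callable, Hashable, Optional

Value = Hashable


def naive_step(x: Value) -> Value:
    """x = foo(x) where the closure records its curried argument verbatim."""
    return ("closure", "foo", (x,))


def anchored_step(x: Value, eid: int = 7) -> Value:
    """x = foo(x) where the closure is identified by the eid of the call."""
    return ("closure", "foo", eid)


def steps_to_fixed_point(step: Callable[[Value], Value], start: Value,
                         limit: int = 50) -> Optional[int]:
    """Applications needed until the value repeats, or None within `limit`."""
    current = start
    for n in range(1, limit + 1):
        nxt = step(current)
        if nxt == current:
            return n
        current = nxt
    return None


print(steps_to_fixed_point(naive_step, "Num"))     # None
print(steps_to_fixed_point(anchored_step, "Num"))  # 2
```

The naive encoding never repeats because its values grow without bound, whereas the anchored encoding draws from a finite set of program points and therefore stabilises.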

Domain