Original data for the text to code experiments of Richardson and Kuhn
Recent work by Richardson and Kuhn (2017a,b) and Richardson et al. (2018) looks at semantic parser induction and question answering in the domain of source code libraries and APIs. In this brief note, we formalize the representations being learned in these studies and introduce a simple domain-specific language and a systematic translation from this language to first-order logic. By recasting the target representations in terms of classical logic, we aim to broaden the applicability of existing code datasets for investigating more complex natural language understanding and reasoning problems in the software domain.
Recent work in natural language processing has looked at learning text-to-code translation models using parallel pairs of text and code samples from example source code libraries (for a review, see Neubig (2016)). In particular, Richardson and Kuhn (2017a, b) and Richardson et al. (2018) look at learning to translate short text descriptions to function signature representations as a first step towards modeling the semantics of function documentation. Example pairs of docstring and function signature representations are shown in Figure 1; using such pairs, the goal is to learn a general model that can robustly translate a given description of a function to a formal representation of that function.
Initially, these datasets were proposed as a synthetic resource for studying semantic parser induction (Mooney, 2007), or for building models that learn to translate text to formal meaning representations from parallel data (see Richardson et al. (2017) for a proposal on using these datasets for the inverse problem of data-to-text generation). To date, we have built around 45 API datasets across 11 popular programming languages (e.g., Python, Java, C, Scheme, Haskell, PHP) and 7 natural languages (see Richardson (2017)), each using an ad hoc rendering of the target function signature representations. In this brief note, we define a unified syntax for expressing these representations, as well as a systematic mapping into first-order logic and a small subject domain model. In doing this, we aim to answer the following question: what do the function signatures being learned actually mean, and how can they be used for solving more complex natural language understanding problems (for a similar idea, see Bos (2016))?
By recasting the learned representations in terms of classical logic, the hope is that our datasets will in particular be made more accessible to studies on natural language based program synthesis (Raza et al., 2015) and natural language programming more generally. In what follows, we first define a general syntax for these representations, then discuss the mapping into logic and the various applications that motivate our particular approach and subject domain model.
Definition 1 (Syntax of Function Signatures)
As shown in Figure 1, function signature representations across different programming languages consist of the following components: a namespace N (indicating the position or path in the target API), a class or local name identifier C, a function name f, a sequence of named parameters p (each optionally annotated with a type t), and an (optional) return value r. For the Java max function in Figure 1, for example, the namespace is lang, the class is Math, the function name is max, the parameters are a and b (both of type long), and the return type is long.
In languages or software projects where some of this information is missing, we can mark the positions using special tokens, such as UNK, or unknown, for types in dynamically typed languages, or core and builtin in cases where the namespace and class information are missing. In Definition 1, we define a generic syntax for function signature representations in order to eliminate superficial differences between different programming languages. This definition includes an additional token l that identifies the particular programming language or software project from which the function f is drawn (see Figure 2 for a normalized version of our Java example).
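The normalized form can be made concrete with a small parser. The following is a minimal sketch, assuming the concrete layout of the normalized Java example (`java lang Math::max(long:a,long:b) -> long`); the regular expression, the `Signature` class, and the `parse_signature` function are illustrative names and not part of the original datasets or tooling.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Assumed layout, inferred from the normalized Java example:
#   <lang> <namespace> <class>::<function>(<type>:<name>,...) -> <return>
SIGNATURE_RE = re.compile(
    r"^(?P<lang>\S+)\s+(?P<namespace>\S+)\s+(?P<cls>\S+)::(?P<name>[^\s(]+)"
    r"\((?P<params>[^)]*)\)\s*->\s*(?P<ret>\S+)$"
)


@dataclass(frozen=True)
class Signature:
    lang: str                             # language or project token l
    namespace: str                        # namespace N
    cls: str                              # class or local name identifier C
    name: str                             # function name f
    params: Tuple[Tuple[str, str], ...]   # (type t, parameter name p) pairs
    ret: str                              # return value r


def parse_signature(text: str) -> Signature:
    """Parse one normalized signature string into its components."""
    match = SIGNATURE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a normalized signature: {text!r}")
    params = []
    for chunk in match.group("params").split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        ptype, sep, pname = chunk.partition(":")
        if not sep:
            # Untyped parameter: only a name was given, so mark the type UNK.
            ptype, pname = "UNK", ptype
        params.append((ptype, pname))
    return Signature(
        lang=match.group("lang"),
        namespace=match.group("namespace"),
        cls=match.group("cls"),
        name=match.group("name"),
        params=tuple(params),
        ret=match.group("ret"),
    )


java_max = parse_signature("java lang Math::max(long:a,long:b) -> long")
```

Splitting the signature into named fields keeps the language-specific surface form out of later processing, which is the purpose of the normalization step described above.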
Figure 1: Example pairs of docstring and function signature representations.
| Docstring | Returns the greater of two long values |
| Signature | (Java) lang Math long max(long a,long b) |
| Docstring | Compares two values numerically and returns the maximum |
| Signature | (Python) decimal Context max(a b) |
| Docstring | gibt den größeren dieser Werte zurück |
| Signature | (PHP) mixed max(mixed $value1, mixed $value2, ..) |
In order to provide a model-theoretic semantics of these signature representations, we define a systematic mapping from the normalized syntax to logic. We also use a small inventory of domain-specific predicates to define the semantics, which are motivated by some of the applications that we discuss in the concluding section.
Definition 2 shows the semantics of general function signatures:
The semantics can be described in the following way: for a given function with some set of function variables (bound here using lambda abstraction), there should exist a value which is equal to (shown here using a special predicate eq) the value that results when the particular function constant fun is applied to said variables. For example, the existentially bound value variable in the following example (where lambda conversion is performed on the input 4L, 5L):
takes the value of 5L, or the result of applying max(4L,5L). In order to capture additional constraints about typing, naming, and the language from which the function is drawn, we use the following domain-specific predicates: fun (associates the function variable with the function constant or name f, e.g., max), lang (the language or project associated with the function), and type (the type of a given variable, in this case relating the function return variable with the return type constant r, e.g., long).
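The extensional reading of this semantics can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: Python's built-in `max` stands in for the Java function, plain integers stand in for the `long` literals, equality plays the role of the eq predicate, and the helper names `semantics` and `witness` are hypothetical.

```python
from typing import Callable, Iterable, Optional


def semantics(fun: Callable[..., int]) -> Callable[..., Callable[[int], bool]]:
    """Lambda-abstract over the arguments of fun.

    Applying the result to concrete arguments yields a predicate over a
    candidate value z that holds exactly when z equals fun(*args).
    """
    def applied(*args: int) -> Callable[[int], bool]:
        return lambda z: z == fun(*args)  # equality stands in for eq
    return applied


def witness(pred: Callable[[int], bool], domain: Iterable[int]) -> Optional[int]:
    """Search a finite domain for a value satisfying the existential claim."""
    for z in domain:
        if pred(z):
            return z
    return None


# Lambda conversion on the inputs 4 and 5 (4L and 5L in the Java example).
holds_for = semantics(max)(4, 5)
value = witness(holds_for, range(10))
```

Here `value` is the element of the searched domain for which the existential statement holds, mirroring the description of the value variable above.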
Figure 2: Translation of an ad hoc signature representation to a normalized representation and to logic.
| Docstring | Returns the greater of two long values |
| Signature (informal) | lang Math long max(long a,long b) |
| Normalized | java lang Math::max(long:a,long:b) -> long |
| Expansion to Logic | [formula not preserved] |
Definition 3 shows the semantics of function arguments.
The same naming and typing constraints are expressed using similar predicates for variables. The predicate var associates a given variable assignment with an argument name p. In addition, the predicate has_param explicitly associates a given argument or parameter and its position with a function.
Definition 4 shows the semantics of namespaces and classes:
Definition 4 (Namespace and Class Semantics)
Here, we use the predicates namespace and class to identify the types of the namespace and class variables. As with arguments, two additional predicates, in_namespace and in_class, are introduced in order to associate particular namespaces and classes with particular function values.
Figure 2 shows a full translation from an ad hoc signature representation to a normalized representation and finally to a representation in logic using the definitions introduced above. We note that while we use a specific, and seemingly arbitrary, set of domain predicates, new predicates and information can be added as needed. In the next section, we motivate the particular predicates chosen above by describing some possible applications of our formulation.
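To make the role of the domain predicates more tangible, the sketch below expands the components of a normalized signature into a flat list of predicate atoms. It is an illustration only: the predicate names come from the text, but their arities, argument order, the variable names (f, z, n, c, x1, x2, ...) and the omission of explicit quantifiers are assumptions, and the output should not be read as the exact formulas of Definitions 2 to 4.

```python
from typing import List, Sequence, Tuple


def expand_to_atoms(
    lang: str,
    namespace: str,
    cls: str,
    name: str,
    params: Sequence[Tuple[str, str]],  # (type, parameter name) pairs
    ret: str,
) -> List[str]:
    """Return predicate atoms for one signature (assumed argument orders)."""
    atoms = [
        f"fun(f, {name})",             # function variable f names the constant
        f"lang(f, {lang})",            # language or project of the function
        f"type(z, {ret})",             # return variable z has the return type
        f"namespace(n, {namespace})",  # n is a namespace variable
        "in_namespace(f, n)",
        f"class(c, {cls})",            # c is a class variable
        "in_class(f, c)",
    ]
    arg_vars = []
    for position, (ptype, pname) in enumerate(params, start=1):
        var = f"x{position}"
        arg_vars.append(var)
        atoms.append(f"var({var}, {pname})")
        atoms.append(f"type({var}, {ptype})")
        atoms.append(f"has_param(f, {var}, {position})")
    # The value z equals the result of applying f to its arguments.
    atoms.append(f"eq(z, f({', '.join(arg_vars)}))")
    return atoms


atoms = expand_to_atoms(
    "java", "lang", "Math", "max", [("long", "a"), ("long", "b")], "long"
)
formula = " & ".join(atoms)
```

Keeping the atoms in a list rather than a single string makes it easy to add further predicates later, in line with the remark that new predicates and information can be added as needed.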
In any application of logic, logical formulas can be used either to reason extensionally (i.e., about the particular real-world entities denoted by or involved in a given formula) or intensionally (i.e., about abstract relationships and consequences between concepts). Taking the example in Figure 2 and its expansion to logic, we could reason extensionally using pure logic about the exact value that this function will return given a particular input. In contrast, we could also, with the help of additional domain-specific knowledge, reason intensionally about abstract relationships between different programming languages, class and namespace structures, and so on.
While we think that there is value in the first type of reasoning, especially for building executable models of functions, our primary focus is on reasoning abstractly about programming language constructs and relationships across different programming languages and projects. One benefit of the source code domain is that much of the declarative knowledge needed for reasoning can be extracted directly from the target libraries, including information about class containment and subsumption relations, lists of related utilities (e.g., via see-also annotations and documentation hyperlinking), function naming alternatives or aliases, and the relative position or distance between different functions and namespaces. Having such knowledge and an expressive logical language can in general facilitate more complex forms of API question answering and code retrieval (see Richardson and Kuhn (2017b)). As an example, we might use the following notation (in which each slot marked with ? expands to an existential variable in Definition 2):
|java N? C?::f?(long:a,long:p?) -> long||(2)|
to request the following: find some Java function somewhere (i.e., in some class and namespace) that takes two long values as arguments (with the first value having the name a) and returns a long value. Such a request might be used for finding structurally related functions or for mining software clones (Rattan et al., 2013).
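A request of this kind can be approximated by simple pattern matching over signature records. The following sketch assumes that signatures are stored as dictionaries with the fields lang, namespace, cls, name, params and ret, and that any slot ending in `?` is open; the record layout, the helper names, and the lowercase language tokens are illustrative choices rather than part of the formalism. The two candidate records reuse functions that appear in the figures.

```python
from typing import Any, Dict, List

SignatureRecord = Dict[str, Any]


def is_open(slot: str) -> bool:
    """A slot ending in '?' is treated as an existential variable."""
    return slot.endswith("?")


def slot_matches(pattern: str, value: str) -> bool:
    return is_open(pattern) or pattern == value


def matches(query: SignatureRecord, candidate: SignatureRecord) -> bool:
    """Check whether a candidate signature satisfies a partially open query."""
    scalar_keys = ("lang", "namespace", "cls", "name", "ret")
    if not all(slot_matches(query[k], candidate[k]) for k in scalar_keys):
        return False
    if len(query["params"]) != len(candidate["params"]):
        return False
    return all(
        slot_matches(q_type, c_type) and slot_matches(q_name, c_name)
        for (q_type, q_name), (c_type, c_name)
        in zip(query["params"], candidate["params"])
    )


# Query (2): java N? C?::f?(long:a,long:p?) -> long
query: SignatureRecord = {
    "lang": "java", "namespace": "N?", "cls": "C?", "name": "f?",
    "params": [("long", "a"), ("long", "p?")], "ret": "long",
}

library: List[SignatureRecord] = [
    {"lang": "java", "namespace": "lang", "cls": "Math", "name": "max",
     "params": [("long", "a"), ("long", "b")], "ret": "long"},
    {"lang": "java", "namespace": "java.math", "cls": "BigInteger",
     "name": "shiftLeft", "params": [("int", "n")], "ret": "BigInteger"},
]

found = [candidate["name"] for candidate in library if matches(query, candidate)]
```

A logic-based implementation would instead treat the open slots as existentially quantified variables and hand the query to a theorem prover or database engine; the matcher above only captures the purely structural part of such a request.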
Figure 3: Polyglot translation of a description of a bit-shifting operation from the Haskell standard library.
| 1. | Source API: (en, Haskell) | Input: Shift the argument left by the specified number of bits. |
| Output | Language: Haskell | Translation: Haskell Data.Bits builtin::shiftL(UNK:a,Int:UNK) -> UNK |
| | Language: Java | Translation: Java java.math BigInteger::shiftLeft(int:n) -> BigInteger |
| | Language: Clojure | Translation: Clojure clojure.core builtin::bit-shift-left(UNK:x,UNK:n) -> UNK |
Our primary focus is on building models that can robustly translate high-level natural language descriptions to code, and hence to the logical representations proposed above. We believe that under this scenario, natural language can prove to be a useful tool for deriving new forms of declarative knowledge. For example, our recent work looks at polyglot translation (Richardson et al., 2018), or building text-to-code translators that can translate descriptions to function representations in multiple APIs. An example is provided in Figure 3, where the model translates the description about bit-shifting operations (originally drawn from the Haskell standard library) to equivalent function translations in the Haskell, Java and Clojure standard libraries. With this output, one could straightforwardly extract rules about function equivalences in different languages (e.g., bit-shift-left in Clojure is the same function as shiftLeft in Java), and learn further relationships between the associated function names and variables.
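The rule extraction described here can be sketched as follows, using the three function names from Figure 3 as input. The dictionary layout, the lowercase language tokens, and the function names `equivalence_rules` and `equivalent_in` are assumptions made for illustration; a real system would derive the input from model predictions, which may be noisy.

```python
from itertools import permutations
from typing import Dict, Optional, Tuple

# One polyglot prediction: language -> function name predicted for the same
# description ("Shift the argument left by the specified number of bits.").
polyglot_output: Dict[str, str] = {
    "haskell": "shiftL",
    "java": "shiftLeft",
    "clojure": "bit-shift-left",
}

Rule = Tuple[Tuple[str, str], Tuple[str, str]]


def equivalence_rules(output: Dict[str, str]) -> Dict[Tuple[str, str, str], str]:
    """Build a lookup table (source lang, source name, target lang) -> name."""
    table: Dict[Tuple[str, str, str], str] = {}
    for (src_lang, src_name), (dst_lang, dst_name) in permutations(output.items(), 2):
        table[(src_lang, src_name, dst_lang)] = dst_name
    return table


def equivalent_in(
    table: Dict[Tuple[str, str, str], str],
    lang: str,
    name: str,
    target_lang: str,
) -> Optional[str]:
    """Return the equivalent function in target_lang, if a rule exists."""
    return table.get((lang, name, target_lang))


rules = equivalence_rules(polyglot_output)
haskell_name = equivalent_in(rules, "java", "shiftLeft", "haskell")
```

Treating every pair of outputs as equivalent is a deliberate simplification: each prediction is taken at face value, whereas in practice one would likely filter rules by model confidence or by agreement across several descriptions.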
Using the notation introduced above, we can express cross language queries about equivalent functions in the following way:
|java java.math BigInteger::EquivIn(shiftLeft,haskell)(long:a,long:b) -> long||(3)|
where the special predicate EquivIn is used to request the Haskell equivalent of the shiftLeft function in Java. The semantics of EquivIn can therefore be defined in the following way (where background knowledge about the eq predicate can be derived from the output of our polyglot model as discussed above):
One interesting direction is using general knowledge about software libraries, together with logical reasoning, to help learn more robust translation models. The formalism introduced above is part of an effort to move in this direction, and we hope that integrating symbolic reasoning more generally will open the door to new ideas and approaches for solving everyday software search and reusability issues.