Naturalizing a Programming Language via Interactive Learning

04/23/2017 ∙ by Sida I. Wang, et al. ∙ Stanford University

Our goal is to create a convenient natural language interface for performing well-specified but complex actions such as analyzing data, manipulating text, and querying databases. However, existing natural language interfaces for such tasks are quite primitive compared to the power one wields with a programming language. To bridge this gap, we start with a core programming language and allow users to "naturalize" the core language incrementally by defining alternative, more natural syntax and increasingly complex concepts in terms of compositions of simpler ones. In a voxel world, we show that a community of users can simultaneously teach a common system a diverse language and use it to build hundreds of complex voxel structures. Over the course of three days, these users went from using only the core language to using the naturalized language in 85.9% of the last 10K utterances.




1 Introduction

In tasks such as analyzing and plotting data gulwani2014nlyze, querying databases zelle96geoquery; berant2013freebase, manipulating text kushman2013regex, or controlling the Internet of Things campagna2017almond and robots tellex2011understanding, people need computers to perform well-specified but complex actions. To accomplish this, one route is to use a programming language, but this is inaccessible to most and can be tedious even for experts because the syntax is uncompromising and all statements have to be precise. Another route is to convert natural language into a formal language, which has been the subject of work in semantic parsing (zettlemoyer05ccg; artzi11conversations; artzi2013weakly; pasupat2015compositional). However, the capability of semantic parsers is still quite primitive compared to the power one wields with a programming language. This gap is increasingly limiting the potential of both text and voice interfaces as they become more ubiquitous and desirable.

In this paper, we propose bridging this gap with an interactive language learning process which we call naturalization. Before any learning, we seed a system with a core programming language that is always available to the user. As users instruct the system to perform actions, they augment the language by defining new utterances — e.g., the user can explicitly tell the computer that ‘X’ means ‘Y’. Through this process, users gradually and interactively teach the system to understand the language that they want to use, rather than the core language that they are forced to use initially. While the first users have to learn the core language, later users can make use of everything that is already taught. This process accommodates both users’ preferences and the computer action space, where the final language is both interpretable by the computer and easier to produce by human users.

Compared to interactive language learning with weak denotational supervision wang2016games, definitions are critical for learning complex actions (Figure 1). Definitions equate a novel utterance to a sequence of utterances that the system already understands. For example, ‘go left 6 and go front’ might be defined as ‘repeat 6 [go left]; go front’, which eventually can be traced back to the expression ‘repeat 6 [select left of this]; select front of this’ in the core language. Unlike function definitions in programming languages, the user writes concrete values rather than explicitly declaring arguments. The system automatically extracts arguments and learns to produce the correct generalizations. For this, we propose a grammar induction algorithm tailored to the learning from definitions setting. Compared to standard machine learning, say from demonstrations, definitions provide a much more powerful learning signal: the system is told directly that ‘a 3 by 4 red square’ is ‘3 red columns of height 4’, and does not have to infer how to generalize from observing many structures of different sizes.

We implemented a system called Voxelurn, which is a command language interface for a voxel world initially equipped with a programming language supporting conditionals, loops, variable scoping, etc. We recruited 70 users from Amazon Mechanical Turk to build 230 voxel structures using our system. All users teach the system at once, and what is learned from one user can be used by another user. Thus a community of users evolves the language to become more efficient over time, in a distributed way, through interaction. We show that the user community defined many new utterances: short forms, alternative syntax, and also complex concepts such as ‘add green monster, add yellow plate 3 x 3’. As the system learns, users increasingly prefer to use the naturalized language over the core language: 85.9% of the last 10K accepted utterances are in the naturalized language.

Figure 2: Interface used by users to enter utterances and create definitions.
Rule(s)                    Example(s)                                        Description

A → A; A                   select left; add red                              perform actions sequentially
A → repeat N A             repeat 3-1 add red top                            repeat action N times
A → if S A                 if has color red [select origin]                  action if S is non-empty
A → while S A              while not has color red [select left of this]     action while S is non-empty
A → foreach S A            foreach this [remove has row row of this]         action for each item in S
A → [A]                    [select left or right; add red; add red top]      group actions for precedence
A → {A}                    {select left; add red}                            scope only selection
A → isolate A              isolate [add red top; select has color red]       scope voxels and selection

A → select S               select all and not origin                         set the selection
A → remove S               remove has color red                              remove voxels
A → update R S             update color [color of left of this]              change property of selection

S                          this                                              current selection
S                          all | none | origin                               all voxels, empty set, origin
R of S | has R S           has color red or yellow | has row [col of this]   lambda DCS joins
not S | S and S | S or S   this or left and not has color red                set operations
N | N+N | N-N              1,…,10 | 1+2 | row of this + 1                    numbers and arithmetic
argmax R S | argmin R S    argmax col has color red                          superlatives

R                          color | row | col | height | top | left | …       voxel relations
C                          red | orange | green | blue | black | …           color values
D                          top | bot | front | back | left | right           direction values
S → very D of S            very top of very bot of has color green           syntax sugar for argmax
A → add C [D] | move D     add red | add yellow bot | move left              add voxel, move selection

Table 1: Grammar of the core language (DAL), which includes actions (A), relations (R), and sets of values (S). The grammar rules are grouped into four categories. From top to bottom: domain-general action compositions, actions using sets, lambda DCS expressions for sets, and domain-specific relations and actions.

2 Voxelurn


A world state in Voxelurn contains a set of voxels, where each voxel has relations ‘row’, ‘col’, ‘height’, and ‘color’. There are two domain-specific actions, ‘add’ and ‘move’, and one domain-specific relation, ‘direction’. In addition, the state contains a selection, which is a set of positions. While our focus is Voxelurn, we can think more generally about the world as a set of objects equipped with relations, such as events on a calendar, cells of a spreadsheet, or lines of text.

Core language.

The system is born understanding a core language called Dependency-based Action Language (DAL), which we created (see Table 1 for an overview).

The language composes actions using the usual but expressive control primitives such as ‘if’, ‘foreach’, ‘repeat’, etc. Actions usually take sets as arguments, which are represented using lambda dependency-based compositional semantics (lambda DCS) expressions liang2013lambdadcs. Besides standard set operations like union, intersection and complement, lambda DCS leverages the tree dependency structure common in natural language: for the relation ‘color’, ‘has color red’ refers to the set of voxels that have color red, and its reverse ‘color of has row 1’ refers to the set of colors of voxels having row number 1. Tree-structured joins can be chained without using any variables, e.g., ‘has color [yellow or color of has row 1]’.
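
The join semantics described above can be illustrated with a small sketch. This is not the DAL implementation; the `Voxel` record, the `has` and `of` helpers, and the toy world are assumptions made only for illustration. It treats ‘has R S’ as the voxels whose relation R takes a value in S, and ‘R of S’ as the reverse join.

```python
from typing import NamedTuple


class Voxel(NamedTuple):
    """Hypothetical voxel record with the four relations named in the text."""
    row: int
    col: int
    height: int
    color: str


def has(relation: str, values: set, voxels: set) -> set:
    """'has R S': the voxels whose relation R takes a value in S."""
    return {v for v in voxels if getattr(v, relation) in values}


def of(relation: str, subset: set) -> set:
    """'R of S': the values that relation R takes over the voxels in S."""
    return {getattr(v, relation) for v in subset}


# Toy world, invented for this example.
world = {
    Voxel(1, 1, 0, "red"),
    Voxel(1, 2, 0, "blue"),
    Voxel(2, 1, 0, "yellow"),
    Voxel(2, 2, 0, "green"),
}

# 'has color red'
red_voxels = has("color", {"red"}, world)

# 'color of has row 1'
colors_in_row_1 = of("color", has("row", {1}, world))

# 'has color [yellow or color of has row 1]': joins chain without variables
chained = has("color", {"yellow"} | colors_in_row_1, world)
```

Because each join consumes and produces a plain set, expressions nest directly, which mirrors the variable-free chaining in the text.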

We protect the core language from being redefined so it is always precise and usable. (Footnote 1: Not doing so resulted in ambiguities that propagated uncontrollably, e.g., once ‘red’ can mean many different colors.) In addition to expressivity, the core language interpolates well with natural language. We avoid explicit variables by using a selection, which serves as the default argument for most actions. (Footnote 2: The selection is like the turtle in LOGO, but can be a set.) For example, ‘select has color red; add yellow top; remove’ adds yellow on top of red voxels and then removes the red voxels.

To enable the building of more complex structures in a more modular way, we introduce a notion of scoping. Suppose one is operating on one of the palm trees in Figure 2. The user might want to use ‘select all’ to select only the voxels in that tree rather than all of the voxels in the scene. In general, an action $A$ can be viewed as taking a set of voxels $V$ and a selection $S$, and producing an updated set of voxels $V'$ and a modified selection $S'$. The default scoping is ‘[A]’, which is the same as ‘A’ and returns $(V', S')$. There are two constructs that alter the flow. First, ‘{A}’ takes $(V, S)$ and returns $(V', S)$, thus restoring the selection. This allows $A$ to use the selection as a temporary variable without affecting the rest of the program. Second, ‘isolate [A]’ takes $(V, S)$, calls $A$ with $(V \cap S, S)$ (restricting the set of voxels to just the selection) and returns $(V'', S)$, where $V''$ consists of voxels in $V'$ and voxels in $V$ that occupy empty locations in $V'$. This allows $A$ to focus only on the selection (e.g., one of the palm trees). Although scoping can be explicitly controlled via ‘[ ]’, ‘isolate’, and ‘{ }’, it is an unnatural concept for non-programmers. Therefore when the choice is not explicit, the parser generates all three possible scoping interpretations, and the model learns which is intended based on the user, the rule, and potentially the context.
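
The three scoping constructs can be sketched as wrappers around an action. This is a sketch under stated assumptions, not Voxelurn's code: voxels are modeled as a dictionary from position to color, the selection as a frozen set of positions, and `paint_all` is a hypothetical action standing in for ‘select all’ followed by a recoloring.

```python
from typing import Callable, Dict, FrozenSet, Tuple

Pos = Tuple[int, int, int]        # (row, col, height)
Voxels = Dict[Pos, str]           # position -> color
Selection = FrozenSet[Pos]
Action = Callable[[Voxels, Selection], Tuple[Voxels, Selection]]


def default_scope(action: Action) -> Action:
    """'[A]': identical to A, returns the updated voxels and selection."""
    return action


def selection_scope(action: Action) -> Action:
    """'{A}': keep the voxels produced by A, restore the original selection."""
    def run(voxels: Voxels, selection: Selection):
        new_voxels, _ = action(voxels, selection)
        return new_voxels, selection
    return run


def isolate(action: Action) -> Action:
    """'isolate [A]': run A on the selected voxels only, then merge back."""
    def run(voxels: Voxels, selection: Selection):
        restricted = {p: c for p, c in voxels.items() if p in selection}
        new_voxels, _ = action(restricted, selection)
        merged = dict(new_voxels)
        for pos, color in voxels.items():
            # Original voxels fill only locations left empty by A's output.
            merged.setdefault(pos, color)
        return merged, selection
    return run


def paint_all(color: str) -> Action:
    """Hypothetical action: select every visible voxel and recolor it."""
    def run(voxels: Voxels, selection: Selection):
        return {p: color for p in voxels}, frozenset(voxels)
    return run


world = {(0, 0, 0): "brown", (0, 0, 1): "green", (5, 5, 0): "blue"}
tree = frozenset({(0, 0, 0), (0, 0, 1)})

v1, s1 = default_scope(paint_all("red"))(world, tree)
v2, s2 = selection_scope(paint_all("red"))(world, tree)
v3, s3 = isolate(paint_all("red"))(world, tree)
```

Under ‘isolate’, the action only ever sees the voxels of the selected tree, so its internal ‘select all’ cannot reach the blue voxel elsewhere in the scene.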

3 Learning interactively from definitions

The goal of the user is to build a structure in Voxelurn. In wang2016games, the user provided interactive supervision to the system by selecting from a list of candidates. This is practical when there are fewer than tens of candidates, but is completely infeasible for a complex action space such as Voxelurn. Roughly, with 10 possible colors at each position of the box containing the palm tree in Figure 2, the number of distinct denotations is exponential in the size of the box, and there are many more programs. Obtaining the structures in Figure 1 by selecting candidates alone would be infeasible.

This work thus uses definitions in addition to selecting candidates as the supervision signal. Each definition consists of a head utterance and a body, which is a sequence of utterances that the system understands. One use of definitions is paraphrasing and defining alternative syntax, which helps naturalize the core language (e.g., defining ‘add brown top 3 times’ as ‘repeat 3 add brown top’). The second use is building up complex concepts hierarchically. In Figure 2, ‘add yellow palm tree’ is defined as a sequence of steps for building the palm tree. Once the system understands an utterance, it can be used in the body of other definitions. For example, Figure 3 shows the full definition tree of ‘add palm tree’. Unlike function definitions in a programming language, our definitions do not specify the exact arguments; the system has to learn to extract arguments to achieve the correct generalization.

def: add palm tree
      def: brown trunk height 3
            def: add brown top 3 times
                  repeat 3 [add brown top]
      def: go to top of tree
            select very top of has color brown
      def: add leaves here
            def: select all sides
                  select left or right or front or back
            add green
Figure 3: Defining ‘add palm tree’, tracing back to the core language (utterances without def:).
begin execute x:
      if x does not parse then define x;
      if user rejects all parses then define x;
      execute user choice
begin define x:
      repeat starting with X ← [ ]
            user enters x′;
            if x′ does not parse then define x′;
            if user rejects all parses of x′ then define x′;
            X ← [X; x′];
      until user accepts X as the def’n of x;
Figure 4: When the user enters an utterance, the system tries to parse and execute it, or requests that the user define it.

The interactive definition process is described in Figure 4. When the user types an utterance $x$, the system parses $x$ into a list of candidate programs. If the user selects one of them (based on its denotation), then the system executes the resulting program. If the utterance is unparsable or the user rejects all candidate programs, the user is asked to provide the definition body for $x$. Any utterances in the body not yet understood can be defined recursively. Alternatively, the user can first execute a sequence of commands $X$, and then provide a head utterance for body $X$.
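
The recursive structure of this process can be sketched with a toy session. Everything here is a simplifying assumption: parsing is reduced to exact string lookup, the `ask_body` callback is a hypothetical stand-in for the user typing a body, and the branch where the user rejects all candidate parses is omitted.

```python
from typing import Callable, Dict, List, Set


class Session:
    """Toy stand-in for the execute/define loop; not the Voxelurn system."""

    def __init__(self, core: Set[str], ask_body: Callable[[str], List[str]]):
        self.core = set(core)                         # utterances the core language parses
        self.definitions: Dict[str, List[str]] = {}   # head -> body
        self.ask_body = ask_body                      # hypothetical hook: ask user for a body
        self.log: List[str] = []                      # core utterances actually executed

    def parses(self, utterance: str) -> bool:
        return utterance in self.core or utterance in self.definitions

    def define(self, head: str) -> None:
        body: List[str] = []
        for step in self.ask_body(head):      # the user enters body utterances in order
            if not self.parses(step):
                self.define(step)             # unknown body utterances are defined recursively
            body.append(step)
        self.definitions[head] = body         # the user accepts the body as the definition

    def execute(self, utterance: str) -> None:
        if not self.parses(utterance):
            self.define(utterance)
        if utterance in self.core:
            self.log.append(utterance)
        else:
            for step in self.definitions[utterance]:
                self.execute(step)


# Scripted user, loosely following part of the definition tree in Figure 3.
script = {
    "add palm tree": ["brown trunk height 3", "go to top of tree"],
    "brown trunk height 3": ["add brown top 3 times"],
    "add brown top 3 times": ["repeat 3 [add brown top]"],
    "go to top of tree": ["select very top of has color brown"],
}
core = {"repeat 3 [add brown top]", "select very top of has color brown"}

session = Session(core, ask_body=lambda head: script[head])
session.execute("add palm tree")
```

After the call, every head in the script has a stored body, and the log contains only core-language utterances, which is the sense in which each definition traces back to the core language.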

When constructing the definition body, users can type utterances with multiple parses; e.g., ‘move forward’ could either modify the selection (‘select front’) or move the voxel (‘move front’). Rather than propagating this ambiguity to the head, we force the user to commit to one interpretation by selecting a particular candidate. Note that we are using interactivity to control the exploding ambiguity.

4 Model and learning

Let us turn to how the system learns and predicts. This section contains prerequisites before we describe definitions and grammar induction in Section 5.

Semantic parsing.

Our system is based on a semantic parser that maps utterances $x$ to programs $z$, which can be executed on the current state $s$ (set of voxels and selection) to produce the next state $s'$. Our system is implemented as the interactive package in SEMPRE (berant2013freebase); see liang2016executable for a gentle exposition.

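A derivation $d$ represents the process by which an utterance $x$ turns into a program $z = \mathrm{prog}(d)$. More precisely, $d$ is a tree where each node contains the corresponding span of the utterance $(\mathrm{start}(d), \mathrm{end}(d))$, the grammar rule $\mathrm{rule}(d)$, the grammar category $\mathrm{cat}(d)$, and a list of child derivations $[d_1, \dots, d_n]$.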

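Following zettlemoyer05ccg, we define a log-linear model over derivations $d$ given an utterance $x$ produced by the user $u$:

$$p_\theta(d \mid x, u) \propto \exp\left(\theta^\top \phi(d, x, u)\right),$$

where $\phi(d, x, u) \in \mathbb{R}^p$ is a feature vector and $\theta \in \mathbb{R}^p$ is a parameter vector. The user $u$ does not appear in previous work on semantic parsing, but we use it to personalize the semantic parser trained on the community.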
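
A log-linear model over derivations amounts to a softmax over feature-weight dot products. The sketch below assumes sparse features stored in dictionaries; the feature names and weights are invented for illustration and are not taken from the system.

```python
import math
from typing import Dict, List

Features = Dict[str, float]


def score(theta: Dict[str, float], phi: Features) -> float:
    """Dot product between the parameter vector and a sparse feature vector."""
    return sum(theta.get(name, 0.0) * value for name, value in phi.items())


def derivation_probs(theta: Dict[str, float], candidates: List[Features]) -> List[float]:
    """Normalize exponentiated scores over the candidate derivations."""
    scores = [score(theta, phi) for phi in candidates]
    top = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights and two candidate derivations of the same utterance.
theta = {"Rule.Type=core": 0.5, "Rule.Type=induced": 0.2, "Social.Self": 1.0}
candidates = [
    {"Rule.Type=core": 1.0},
    {"Rule.Type=induced": 1.0, "Social.Self": 1.0},
]
probs = derivation_probs(theta, candidates)
```

Because the user enters the model only through the features, personalization needs no separate model per user, only user-specific feature names.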

We use a standard chart parser to construct a chart. For each chart cell, indexed by the start and end indices of a span, we construct a list of partial derivations recursively by selecting child derivations from subspans and applying a grammar rule. The resulting derivations are sorted by model score and only the top $K$ are kept. We use $\text{chart}(x)$ to denote the set of all partial derivations across all chart cells. The set of grammar rules starts with the set of rules for the core language (Table 1), but grows via grammar induction when users add definitions (Section 5). Rules in the grammar are stored in a trie based on the right-hand side to enable better scalability to a large number of rules.
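
Two of the ingredients mentioned here, a trie over rule right-hand sides and beam pruning of a chart cell, can be sketched briefly. The class and function names, the rule names, and the `$Category` notation are assumptions made for this example; the SEMPRE implementation may differ.

```python
import heapq
from typing import Dict, List, Tuple


class RuleTrie:
    """Trie keyed on the symbols of a rule's right-hand side."""

    def __init__(self) -> None:
        self.children: Dict[str, "RuleTrie"] = {}
        self.rules: List[str] = []           # names of rules whose RHS ends at this node

    def insert(self, rhs: List[str], rule_name: str) -> None:
        node = self
        for symbol in rhs:
            node = node.children.setdefault(symbol, RuleTrie())
        node.rules.append(rule_name)

    def lookup(self, rhs: List[str]) -> List[str]:
        node = self
        for symbol in rhs:
            if symbol not in node.children:
                return []
            node = node.children[symbol]
        return node.rules


def prune(cell: List[Tuple[float, str]], k: int) -> List[Tuple[float, str]]:
    """Keep the k highest-scoring (score, derivation) pairs of one chart cell."""
    return heapq.nlargest(k, cell, key=lambda item: item[0])


trie = RuleTrie()
trie.insert(["repeat", "$Number", "$Action"], "core.repeat")
trie.insert(["$Action", "times", "$Number"], "induced.times")

matches = trie.lookup(["repeat", "$Number", "$Action"])
kept = prune([(0.1, "d1"), (2.0, "d2"), (1.5, "d3")], k=2)
```

Rules that share a right-hand-side prefix share trie nodes, so the cost of a lookup depends on the length of the span being matched rather than on the number of rules.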

Feature          Description
Rule.ID          ID of the rule
Rule.Type        core?, used?, used by others?
Social.Author    ID of author
Social.Friends   (ID of author, ID of user)
Social.Self      rule is authored by user?
Span             (left/right token(s), category)
Scope            type of scoping for each user
Table 2: Summary of features.


Derivations are scored using a weighted combination of features. There are four types of features (rule, social, span, and scoping), summarized in Table 2.

Rule features fire on each rule used to construct a derivation. ID features fire on specific rules (by ID). Type features track whether a rule is part of the core language or induced, whether it has been used again after it was defined, if it was used by someone other than its author, and if the user and the author are the same.

Social features fire on properties of rules that capture the unique linguistic styles of different users and their interaction with each other. Author features capture the fact that some users provide better and more generalizable definitions that tend to be accepted. Friends features are cross products of author ID and user ID, which capture whether rules from a particular author are systematically preferred or not by the current user, due to stylistic similarities or differences.

Span features include conjunctions of the category of the derivation and the leftmost/rightmost token on the border of the span. In addition, span features include conjunctions of the category of the derivation and the 1 or 2 adjacent tokens just outside of the left/right border of the span. These capture a weak form of context-dependence that is generally helpful.

Scoping features track how the community, as well as individual users, prefer each of the 3 scoping choices (none, selection only ‘{A}’, and voxels+selection ‘isolate {A}’), as described in Section 2. 3 global indicators and 3 indicators for each user fire every time a particular scoping choice is made (3 + 3 × #users features).

Parameter estimation.

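When the user types an utterance, the system generates a list of candidate next states. When the user chooses a particular next state $s'$ from this list, the system performs an online AdaGrad update (duchi10adagrad) on the parameters $\theta$ according to the gradient of the following loss function:

$$-\log \sum_{d \,:\, [\![\mathrm{prog}(d)]\!]_s = s'} p_\theta(d \mid x, u),$$

where $[\![\mathrm{prog}(d)]\!]_s$ denotes the state obtained by executing the program of derivation $d$ on the current state $s$. The loss attempts to increase the model probability on derivations whose programs produce the next state $s'$.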
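
One online update of this kind can be sketched as follows. The sketch assumes a loss equal to the negative log of the total probability of the derivations consistent with the user's choice, with no regularization term, and uses a per-feature AdaGrad step; the learning rate, the feature names, and the requirement that at least one candidate is marked correct are assumptions of the example.

```python
import math
from collections import defaultdict
from typing import DefaultDict, Dict, List


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adagrad_step(
    theta: DefaultDict[str, float],
    grad_sq: DefaultDict[str, float],
    candidates: List[Dict[str, float]],
    correct: List[bool],
    lr: float = 0.1,
    eps: float = 1e-8,
) -> float:
    """Update theta in place and return the loss before the update.

    candidates[i] holds the features of derivation i; correct[i] is True when
    its program yields the state the user chose (at least one must be True).
    """
    scores = [sum(theta[f] * v for f, v in phi.items()) for phi in candidates]
    p_all = softmax(scores)
    good_mass = sum(p for p, ok in zip(p_all, correct) if ok)
    loss = -math.log(good_mass)

    # Gradient: expected features under the model minus expected features
    # under the model restricted to the correct derivations.
    grad: DefaultDict[str, float] = defaultdict(float)
    for phi, p, ok in zip(candidates, p_all, correct):
        p_good = p / good_mass if ok else 0.0
        for f, v in phi.items():
            grad[f] += (p - p_good) * v

    for f, g in grad.items():
        grad_sq[f] += g * g                               # accumulate squared gradients
        theta[f] -= lr * g / (math.sqrt(grad_sq[f]) + eps)
    return loss


theta: DefaultDict[str, float] = defaultdict(float)
grad_sq: DefaultDict[str, float] = defaultdict(float)
candidates = [{"Rule.ID=core.select": 1.0}, {"Rule.ID=induced.7": 1.0}]
correct = [False, True]

loss_before = adagrad_step(theta, grad_sq, candidates, correct)
loss_after = adagrad_step(theta, grad_sq, candidates, correct)
```

Each accepted candidate shifts weight toward the features of the consistent derivations, so repeated choices by the same user gradually reorder the candidate list.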


5 Grammar induction

Recall that the main form of supervision is via user definitions, which allows creation of user-defined concepts. In this section, we show how to turn these definitions into new grammar rules that can be used by the system to parse new utterances.

Previous systems of grammar induction for semantic parsing were given utterance-program pairs $(x, z)$. Both the GENLEX (zettlemoyer05ccg) and higher-order unification (kwiatkowski10ccg) algorithms over-generate rules that liberally associate parts of $x$ with parts of $z$. Though some rules are immediately pruned, many spurious rules are undoubtedly still kept. In the interactive setting, we must keep the number of candidates small to avoid a bad user experience, which means a higher precision bar for new rules.

Fortunately, the structure of definitions makes the grammar induction task easier. Rather than being given an utterance-program pair, we are given a definition, which consists of an utterance $x$ (head) along with the body $X = [x_1, \dots, x_n]$, which is a sequence of utterances. The body $X$ is fully parsed into a derivation $d$, while the head $x$ is likely only partially parsed. These partial derivations are denoted by $\text{chart}(x)$.
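
To make the setup concrete, the simplest case can be sketched: primitive values that appear in both the head and the body of a definition are replaced by their categories, turning one concrete definition into a reusable rule. This is a token-level simplification, not the paper's algorithm; the lexicon, the category names, and the tokenizer are assumptions, and the sketch handles neither compositional matches nor two values of the same category.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lexicon from primitive values to category names.
LEXICON: Dict[str, str] = {
    "red": "Color",
    "brown": "Color",
    "top": "Direction",
    "3": "Number",
}


def tokenize(utterance: str) -> List[str]:
    """Split on whitespace, keeping square brackets as separate tokens."""
    return re.findall(r"\[|\]|[^\s\[\]]+", utterance)


def induce_rule(head: str, body: str) -> Tuple[List[str], List[str]]:
    """Abstract primitive values shared by head and body into categories."""
    head_tokens, body_tokens = tokenize(head), tokenize(body)
    shared = {t for t in head_tokens if t in LEXICON and t in body_tokens}

    def abstract(tokens: List[str]) -> List[str]:
        return ["$" + LEXICON[t] if t in shared else t for t in tokens]

    return abstract(head_tokens), abstract(body_tokens)


# Definition taken from the text: 'add brown top 3 times' as 'repeat 3 [add brown top]'.
rule_head, rule_body = induce_rule("add brown top 3 times", "repeat 3 [add brown top]")
```

The resulting pair reads as a rule whose left side matches any color, direction, and number in the same pattern, and whose right side is the core-language program with the same slots.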

At a high level, we find matches: partial derivations $\text{chart}(x)$ of the head $x$ that also occur in the full derivation $d$ of the body $X$. A grammar rule is produced by substituting any set of non-overlapping matches by their categories. As an example, suppose the user defines

‘add red top times 3’ as ‘repeat 3 [add red top]’.

Then we would be able to induce the following two grammar rules:

A → add C D times N : λCDN.repeat N [add C D]
A → A times N : λAN.repeat N [A]

The first rule substitutes primitive values (‘red’, ‘top’, and ‘3’) with their respective pre-terminal categories ($C$, $D$, $N$). The second rule contains compositional categories like actions ($A$), which require some care. One might expect that greedily substituting the largest matches or the match that covers the largest portion of the body would work, but the following example shows that this is not the case: