1 Introduction
Learning computer programs from input-output examples, or Inductive Program Synthesis (IPS), is a fundamental problem in computer science, dating back at least to Summers (1977) and Biermann (1978). The field has produced many successes, with perhaps the most visible example being the FlashFill system in Microsoft Excel (Gulwani, 2011; Gulwani et al., 2012).
Learning from examples is also studied extensively in the statistics and machine learning communities. Trained decision trees and neural networks could be considered to be synthesized computer programs, but it would be a stretch to label them as such. Relative to traditional computer programs, these models typically lack several features: (a) key functional properties are missing, like the ability to interact with external storage, (b) there is no compact, interpretable source code representation of the learned model (in the case of neural networks), and (c) there is no explicit control flow (e.g.
while loops and if-statements). The absence of explicit control flow is a particular hindrance, as it can lead to poor generalization. For example, whereas natural computer programs are often built with control statements that ensure correct execution on inputs of arbitrary size, models like Recurrent Neural Networks can struggle to generalize from short training instances to instances of arbitrary length.
Several models have already been proposed which start to address the functional differences between neural networks and computer programs. These include Recurrent Neural Networks (RNNs) augmented with a stack or queue memory (Giles et al., 1989; Joulin and Mikolov, 2015; Grefenstette et al., 2015), Neural Turing Machines
(Graves et al., 2014), Memory Networks (Weston et al., 2014), Neural GPUs (Kaiser and Sutskever, 2016), Neural Programmer-Interpreters (Reed and de Freitas, 2016), and Neural Random Access Machines (Kurach et al., 2015). These models combine deep neural networks with external memory, external computational primitives, and/or built-in structure that reflects a desired algorithmic structure in their execution. Furthermore, they have been shown to be trainable by gradient descent. However, they do not fix all of the absences noted above. First, none of these models produce programs as output. That is, the representation of the learned model is not interpretable source code. Instead, the program is hidden inside “controllers” composed of neural networks that decide which operations to perform, and the learned “program” can only be understood in terms of the executions that it produces on specific inputs. Second, there is still no concept of explicit control flow in these models. These works raise questions of (a) whether new models can be designed specifically to synthesize interpretable source code that may contain looping and branching structures, and (b) whether searching over program space using techniques developed for training deep neural networks is a useful alternative to the combinatorial search methods used in traditional IPS. In this work, we make several contributions in both of these directions.
To address the first question we develop models inspired by intermediate representations used in compilers like LLVM (Lattner and Adve, 2004) that can be trained by gradient descent. These models address all of the deficiencies highlighted at the beginning of this section: they interact with external storage, handle nontrivial control flow with explicit if statements and loops, and, when appropriately discretized, a learned model can be expressed as interpretable source code. We note two concurrent works, Adaptive Neural Compilation (Bunel et al., 2016) and Differentiable Forth (Riedel et al., 2016), which implement similar ideas. Each design choice when creating differentiable representations of source code has an effect on the inductive bias of the model and the difficulty of the resulting optimization problem. Therefore, we seek a way of rapidly experimenting with different formulations to allow us to explore the full space of modelling variations.
To address the second question, concerning the efficacy of gradient descent, we need a way of specifying an IPS problem such that the gradient-based approach can be compared to a variety of alternative approaches in a like-for-like manner. These alternative approaches originate from a rich history of IPS in the programming languages community and a rich literature of techniques for inference in discrete graphical models in the machine learning community. To our knowledge, no such comparison has previously been performed.
These questions demand that we explore both a range of model variants and a range of search techniques on top of these models. Our answer to both of these issues is the same: TerpreT, a new probabilistic programming language for specifying IPS problems. TerpreT provides a means for describing an execution model (e.g., a Turing Machine, an assembly language, etc.) by defining a parameterization (a program representation) and an interpreter that maps inputs to outputs using the parametrized program. This TerpreT description is independent of any particular inference algorithm. The IPS task is to infer the execution model parameters (the program) given an execution model and pairs of inputs and outputs. To perform inference, TerpreT is automatically “compiled” into an intermediate representation which can be fed to a particular inference algorithm. Interpretable source code can be obtained directly from the inferred model parameters. The driving design principle for TerpreT is to strike a subtle balance between the breadth of expression needed to precisely capture a range of execution models, and the restriction of expression needed to ensure that automatic compilation to a range of different backends is tractable.
TerpreT currently has four backend inference algorithms, which are listed in tbl:backends: gradient descent (thus any TerpreT model can be viewed as a differentiable interpreter), (integer) linear program (LP) relaxations, SMT, and the Sketch program synthesis system (Solar-Lezama, 2008). To allow all of these backends to be used regardless of the specified execution model requires some generalizations and extensions of previous work. For the gradient descent case, we generalize the approach taken by Kurach et al. (2015), lifting discrete operations to operate on discrete distributions, which then leads to a differentiable system. For the linear program case, we need to extend the standard LP relaxation for discrete graphical models to support if statements. In sec:lpbackend, we show how to adapt the ideas of gates (Minka and Winn, 2009) to the linear program relaxations commonly used in graphical model inference (Schlesinger, 1976; Werner, 2007; Wainwright and Jordan, 2008). This could serve as a starting point for further work on LP-based message passing approaches to IPS (e.g., following Sontag et al. (2008)).
Technique name | Family | Optimizer/Solver | Description
FMGD (Forward marginals, gradient descent) | Machine learning | TensorFlow | A gradient descent based approach which generalizes the approach used by Kurach et al. (2015).
(I)LP ((Integer) linear programming) | Machine learning | Gurobi | A novel linear program relaxation approach based on adapting standard linear program relaxations to support Gates (Minka and Winn, 2009).
SMT (Satisfiability modulo theories) | Program synthesis | Z3 | Translation of the problem into a first-order logical formula with existential constraints.
Sketch | Program synthesis | Sketch | View the TerpreT model as a partial program (the interpreter) containing holes (the source code) to be inferred according to a specification (the input-output examples).
Finally, having built TerpreT, it becomes possible to develop understanding of the strengths and weaknesses of the alternative approaches to inference. To understand the limitations of using gradient descent for IPS problems, we first use TerpreT
to define a simple example where gradient descent fails, but which the alternative backends solve easily. By studying this example we can better understand the possible failure modes of gradient descent. We prove that there are exponentially many local optima in the example and show empirically that they arise often in practice (although they can be mitigated significantly by using optimization heuristics like adding noise to gradients during training
(Neelakantan et al., 2016b)). We then perform a comprehensive empirical study comparing different inference backends and program representations. We show that some domains are significantly more difficult for gradient descent than others, and we present results suggesting that gradient descent performs best when given redundant, overcomplete parameterizations. However, the overwhelming trend in the experiments is that the techniques from the programming languages community outperform the machine learning approaches by a significant margin. In summary, our main contributions are as follows:

A novel ‘Basic Block’ execution model that enables learning programs with complex control flow (branching and loops).

TerpreT, a probabilistic programming language tailored to IPS, with backend inference algorithms including techniques based on gradient descent, linear programming, and highly efficient systems from the programming languages community (SMT and Sketch). TerpreT also allows “program sketching”, in which a partial solution is provided to the IPS system. For this, some parameters of an execution model can simply be fixed, e.g. to enforce control flow of a specified shape.

A novel linear program relaxation to handle the if statement structure that is common in execution models, and a generalization of the smoothing technique from Kurach et al. (2015) to work on any execution model expressible in TerpreT.

Analytic and experimental comparisons of different inference techniques for IPS and experimental comparisons of different modelling assumptions.
This report is arranged as follows: We briefly introduce the ‘Basic Block’ model in sec:motivating example to discuss what features TerpreT needs to support to allow modeling of rich execution models. In sec:frontend we describe the core TerpreT language and illustrate how to use it to explore different modeling assumptions using several example execution models. These include a Turing Machine, Boolean Circuits, a RISC-like assembly language, and our Basic Block model. In sec:backend we describe the compilation of TerpreT models to the four backend algorithms listed in tbl:backends. Quantitative experimental results comparing these backends on the aforementioned execution models are presented in sec:experiments. Finally, related work is summarized in sec:relatedWork and we discuss conclusions and future work in sec:discussion.
2 Motivating Example: Differentiable Control Flow Graphs
As an introductory example, we describe a new execution model that we would like to use for IPS. In this section, we describe the model at a high level. In later sections, we describe how to express the model in TerpreT and how to perform inference.
Control flow graphs (CFGs) (Allen, 1970) are a representation of programs commonly used for static analysis and compiler optimizations. They consist of a set of basic blocks, which contain sequences of instructions with no jumps (i.e., straightline code) followed by a jump or conditional jump instruction to transfer control to another block. CFGs are expressive enough to represent all of the constructs used in modern programming languages like C++. Indeed, the intermediate representation of LLVM is based on basic blocks.
Our first model is inspired by CFGs but is limited to a restricted set of instructions and does not support function calls. We refer to the model as the Basic Block model. An illustration of the model appears in fig:basicBlock. In more detail, we specify a fixed number of blocks, and we let there be a fixed number of registers that can take on integer values from a finite range. We are given a fixed set of instructions that implement basic arithmetic operations, like ADD, INCREMENT, and LESSTHAN. An external memory can be written to and read from using the special instructions READ and WRITE. An instruction pointer keeps track of which block is currently being executed. Each block has a single statement parameterized by two argument registers, the instruction to be executed, and the register in which to store the output. After the statement is executed, a condition is checked, and a branch is taken. The condition is parameterized by a choice of register to check for equality to 0 (C-style interpretation of integers as booleans). Based upon the result, the instruction pointer is updated to be equal to the then block or the else block. The identities of these blocks are the parameterization of the branch decision.
The model is set up to always start execution in block 0, and a special end block is used to denote termination. The program is executed for a fixed maximum number of timesteps. To represent input-output examples, we can set an initial state of external memory and assert that particular elements in the final memory should have the desired value upon termination.
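To make the execution model concrete, the forward (interpreter) direction can be sketched in plain Python. This is an illustrative simplification, not TerpreT code: the dictionary encoding of blocks, the reduced instruction set, and all names are our own, and values wrap modulo the heap size.

```python
def run_basic_block(blocks, heap, num_regs=2, max_steps=20):
    """Execute a Basic Block program for at most max_steps timesteps.

    Each block is a dict with an instruction, two argument registers,
    an output register, a condition register, and then/else targets.
    Block index len(blocks) is treated as the special end block.
    """
    M = len(heap)  # registers and heap cells hold integers mod M
    regs = [0] * num_regs
    ip = 0  # instruction pointer: execution always starts in block 0
    end_block = len(blocks)
    for _ in range(max_steps):
        if ip == end_block:
            break  # special end block: terminate
        b = blocks[ip]
        a1, a2 = regs[b["arg1"]], regs[b["arg2"]]
        instr = b["instr"]
        if instr == "INC":
            result = (a1 + 1) % M
        elif instr == "ADD":
            result = (a1 + a2) % M
        elif instr == "READ":
            result = heap[a1]
        elif instr == "WRITE":
            heap[a1] = a2
            result = a2
        else:  # NOOP
            result = regs[b["out"]]
        regs[b["out"]] = result
        # branch: C-style interpretation of the condition register
        ip = b["then"] if regs[b["cond"]] == 0 else b["else"]
    return heap, regs
```

For example, a two-block program can increment a register and then WRITE it to heap cell 0 before jumping to the end block.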
The job for TerpreT in this case is to precisely describe the execution model—how statements are executed and the instruction pointer is updated—in a way which can be translated into a fully differentiable interpreter for the Basic Block language or into an intermediate representation for passing to other backends. In the next sections, we describe in more detail how TerpreT execution models are specified and how the backends work.
3 Frontend: Describing an IPS problem
One of our central aims is to disentangle the description of an execution model from the inference task so that we can perform like-for-like comparisons between different inference approaches to the same IPS task. For reference, the key components for solving an IPS problem are illustrated in fig:highlevel. In the forward mode, the system is analogous to a traditional interpreter; in the reverse mode, the system infers a representation of source code given only observed outputs from a set of inputs. Even before devising an inference method, we need both a means of parameterizing the source code of the program and a precise description of the interpreter layer’s forward transformation. This section describes how these modeling tasks are achieved in TerpreT.
3.1 The TerpreT Probabilistic Programming Language
The full grammar for syntactically correct TerpreT programs is shown in fig:SyntaxDef, and we describe the key semantic features of the language in the following sections. For illustration, we use a running example of a simple automaton, shown in fig:automaton1. In this example the ‘source code’ is parameterised by a boolean array, ruleTable, and we take as input the first two values on a binary tape of length const_T, {tape[0], tape[1]}. The forward execution of the interpreter could be described by the following simple Python snippet:
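With ruleTable and tape represented as plain Python lists, the forward pass simply applies the rule table to the two preceding cells at each timestep (a minimal sketch; the function wrapper is ours):

```python
def run_automaton(rule_table, tape, const_T):
    """Forward execution of the toy automaton: each new tape cell is
    looked up in rule_table using the two preceding cells."""
    for t in range(1, const_T - 1):
        tape[t + 1] = rule_table[tape[t - 1]][tape[t]]
    return tape
```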
Given an observed output, tape[const_T - 1], inference of a consistent ruleTable is very easy in this toy problem, but it is instructive to analyse the TerpreT implementation of this automaton in the following sections. These sections describe variable declarations, control flow, user-defined functions and the handling of observations in TerpreT.
3.1.1 Declarations and Assignments
We allow declarations to give names to “magic” constants, as in line 1 of fig:automaton1. Additionally, we allow the declaration of parameters and variables ranging over a finite domain {0, …, N - 1} using Param(N) and Var(N), where N has to be a compile-time constant (i.e., a natural number or an expression over constants). Parameters are used to model the source code to be inferred, whereas variables are used to model the computation (i.e., intermediate values). For convenience, (multi-dimensional) arrays of variables can be declared using the syntax foo = Var(N)[dim1, dim2, …], and accessed as foo[i1, i2, …]. Similar syntax is available for Params. These arrays can be unrolled during compilation such that unique symbols representing each element are passed to an inference algorithm, i.e., they do not require special support in the inference backend. For this reason, dimensions and indices need to be compile-time constants. Example variable declarations can be seen in lines 6 and 11 of fig:automaton1.
Assignments to declared variables are not made via the usual assignment operator (=) but are instead written as variable.set_to(expression), to better distinguish them from assignments to constants. Static single assignment (SSA) form is enforced: it is only legal for a variable to appear multiple times as the target of set_to statements if each assignment appears in a different case of a conditional block. Because of the SSA restriction, a variable can only be written once. However, note that programs that perform multiple writes to a given variable can always be translated to their corresponding SSA forms.
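The rewrite into single-assignment form is the standard SSA transformation; a toy illustration in plain Python (names are ours):

```python
# Non-SSA: x is written twice.
x = 3
x = x + 1

# SSA translation: each write targets a fresh variable,
# so every variable is assigned exactly once.
x0 = 3
x1 = x0 + 1

assert x == x1  # both forms compute the same final value
```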
 1 const_T = #__HYPERPARAM_const_T__
 2
 3 #######################################################
 4 #             Source code parametrisation             #
 5 #######################################################
 6 ruleTable = Param(2)[2, 2]
 7
 8 #######################################################
 9 #                  Interpreter model                  #
10 #######################################################
11 tape = Var(2)[const_T]
12
13 #__IMPORT_OBSERVED_INPUTS__
14 for t in range(1, const_T - 1):
15     with tape[t] as x1:
16         with tape[t - 1] as x0:
17             tape[t + 1].set_to(ruleTable[x0, x1])
18 #__IMPORT_OBSERVED_OUTPUTS__

3.1.2 Control flow
TerpreT supports standard control-flow structures such as if-else (where elif is the usual shorthand for else if) and for. In addition, TerpreT uses a unique with structure. The need for the latter is induced by our requirement to only use compile-time constants for accessing arrays. Thus, to set the 2nd element of tape in our toy example (i.e., the first step of the computation), we need code like the following to access the values of the first two cells on the tape:
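A sketch of this explicit case analysis in TerpreT syntax (a reconstruction following the pattern of lines 14–17 of fig:automaton1):

```
if tape[0] == 0:
    if tape[1] == 0:
        tape[2].set_to(ruleTable[0, 0])
    elif tape[1] == 1:
        tape[2].set_to(ruleTable[0, 1])
elif tape[0] == 1:
    if tape[1] == 0:
        tape[2].set_to(ruleTable[1, 0])
    elif tape[1] == 1:
        tape[2].set_to(ruleTable[1, 1])
```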
Intuitively, this snippet simply performs a case analysis over all possible values of tape[1] and tape[0]. To simplify this pattern, we introduce the with <var> as <name>: <stmt> control-flow structure, which allows us to automate this unrolling, or to avoid it for backends that do not require it (such as Sketch). To this end, all possible values of <var> (known from its declaration) are determined, and the with-statement is transformed into if <var> == 0 then: <stmt>[0/<name>]; elif <var> == 1 then: <stmt>[1/<name>]; … elif <var> == N - 1 then: <stmt>[N - 1/<name>], where <stmt>[v/<name>] denotes the statement <stmt> in which all occurrences of the variable <name> have been replaced by the constant v. Thus, the snippet from above can be written as follows.
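In TerpreT syntax this reads as follows (a sketch mirroring lines 15–17 of fig:automaton1, specialized to the first step of the computation):

```
with tape[1] as x1:
    with tape[0] as x0:
        tape[2].set_to(ruleTable[x0, x1])
```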
In TerpreT, for loops may only take the shape for <var> in range(<a>, <b>): <stmt>, where <a> and <b> are compile-time constants. Similar to the with statement, we can unroll such loops explicitly during compilation, generating the sequence <stmt>[<a>/<var>]; <stmt>[<a> + 1/<var>]; …; <stmt>[<b> - 1/<var>]. Using the with and for statements, we can thus describe the evaluation of our example automaton for const_T timesteps as shown in lines 14–17 of fig:automaton1.
3.1.3 Operations
TerpreT supports user-defined functions to facilitate modelling interpreters supporting non-trivial instruction sets. For example, bar(x1, x2, x3) will apply the function bar to the arguments x1, x2, x3. The function bar can be defined as a standard Python function with the additional decoration @CompileMe(input_domains, output_domain), specifying the domains of the input and output variables.
 1 const_T = #__HYPERPARAM_const_T__
 2
 3 @CompileMe([2, 2], 3)
 4 def add(a, b):
 5     s = a + b
 6     return s
 7
 8 #######################################################
 9 #             Source code parametrisation             #
10 #######################################################
11 ruleTable = Param(2)[3]
12
13 #######################################################
14 #                  Interpreter model                  #
15 #######################################################
16 tape = Var(2)[const_T]
17 tmpSum = Var(3)[const_T - 1]
18
19 #__IMPORT_OBSERVED_INPUTS__
20 for t in range(1, const_T - 1):
21     tmpSum[t].set_to(add(tape[t - 1], tape[t]))
22     with tmpSum[t] as s:
23         tape[t + 1].set_to(ruleTable[s])
24 #__IMPORT_OBSERVED_OUTPUTS__
To illustrate this feature, fig:automaton2 shows a variation of the running example in which the automaton updates the tape according to a ruleTable that depends only on the sum of the two preceding entries. This is implemented using the function add in lines 3–6. Note that we use standard Python to define this function and leave it up to the compiler to present the function appropriately to the inference algorithm.
3.1.4 Modelling Inputs and Outputs
Using statements from the preceding sections, an execution model can be fully specified, and we now connect this model to input/output observations to drive the program induction. To this end, we use the statements set_to_constant (resp. observe_value) to model program inputs (resp. program outputs). Thus, a single input-output observation for the running example could be written in TerpreT as follows.
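An observation snippet might look like this (a hypothetical example for the automaton of fig:automaton1; the particular values are illustrative):

```
tape[0].set_to_constant(1)
tape[1].set_to_constant(0)
tape[const_T - 1].observe_value(1)
```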
To keep the execution model and the observations separate, we store the observation snippets in a separate file and use preprocessor directives #__IMPORT_OBSERVED_*__ to pull in the appropriate snippets before compilation (see lines 13 and 18 of fig:automaton1). We also allow constant literals to be stored separately from the TerpreT execution model, and we import these values using preprocessor directives of the form <name> = #__HYPERPARAM_<name>__.
In general, we want to infer programs from multiple input-output examples. The simplest implementation achieves this by augmenting each Var declaration with an additional array dimension of size equal to the number of examples and wrapping the execution model in a for loop over the examples. Examples of this are the outermost loops in the models in app:models.
3.2 Example Execution Models
To illustrate the versatility of TerpreT, we use it to describe four example execution models. Broadly speaking, the examples progress from more abstract execution models towards models which closely resemble assembly languages for RISC machines.
In each case, we present the basic model and fill in three representative synthesis tasks in tab:benchmarkTasks to investigate. In addition, we provide metrics for the “difficulty” of each task, calculated from the minimal computational resources required in a solution. Since the difficulty of a synthesis problem generally depends on the chosen inference algorithm, these metrics are primarily intended to give a sense of the scale of the problem. The first difficulty metric, D, is the number of structurally distinct (but not necessarily functionally distinct) programs which would have to be enumerated in a worst-case brute-force search, and the second metric, T, is the unrolled length of all steps in the synthesized program.
3.2.1 Automaton: Turing Machine
A Turing machine consists of an infinite tape of memory cells, each containing one of a finite set of symbols, and a head which moves over the tape, at each point in time occupying one of a finite set of states (one state is the special halt state). At each execution step, while the head is in an unhalted state h, it reads the symbol s at its current position p on the tape; it then writes the symbol newValue[s, h] to position p, moves in the direction specified by direction[s, h] (one cell left or right, or no move) and adopts the new state newState[s, h]. The source code for the Turing machine is the set of entries of the control tables newValue, direction and newState.
We modify the canonical Turing machine to have a circular tape of finite length, as described in the TerpreT model in app:turingModel. For each of our examples, we represent the symbols on the tape as small integers, reserving one symbol value for the blank cell.
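The execution model on the circular tape can be sketched in Python (an illustration, not the TerpreT model itself; the [symbol][state] indexing convention, the start state, and the encoding of directions as -1/0/+1 are our own assumptions):

```python
def run_turing(new_value, direction, new_state, tape, max_steps=20):
    """Simulate the modified Turing machine on a circular tape.

    Control tables are indexed [symbol][state]; state 0 is the halt
    state. direction entries are -1 (left), 0 (stay) or +1 (right).
    """
    L = len(tape)
    pos, state = 0, 1  # head starts in state 1 at cell 0
    for _ in range(max_steps):
        if state == 0:  # halt state reached
            break
        s = tape[pos]
        tape[pos] = new_value[s][state]
        pos = (pos + direction[s][state]) % L  # circular tape
        state = new_state[s][state]
    return tape
```

For example, the Invert task needs only one unhalted state: swap symbols 1 and 2, move right, and halt on the blank symbol 0.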
TURING MACHINE | | | | | Description
Invert | 1 | 5 | 4 | 6 | Move from left to right along the tape and invert all the binary symbols, halting at the first blank cell.
Prepend zero | 2 | 5 | 9 | 6 | Insert a “0” symbol at the start of the tape and shift all other symbols rightwards one cell. Halt at the first blank cell.
Binary decrement | 2 | 5 | 9 | 9 | Given a tape containing a binary encoded number and all other cells blank, return a tape containing a binary encoding of that number decremented by one, with all other cells blank.

BOOLEAN CIRCUITS | | | | Description
2-bit controlled shift register | 4 | 10 | 4 | Given a control bit and two argument bits in input registers, output the argument bits unchanged if the control bit is 0, and swapped otherwise (i.e. the control bit states whether the two arguments should be swapped).
Full adder | 4 | 13 | 5 | Given input registers representing a carry bit and two argument bits, output the sum bit and carry bit of their binary addition.
2-bit adder | 5 | 22 | 8 | Perform binary addition on two-bit numbers: given registers holding the bits of the two arguments, output registers holding the bits of their sum, including the carry.

BASIC BLOCK | | | | | | Description
Access | 5 | 2 | 5 | 14 | 5 | Access the k-th element of a contiguous array: given an initial heap containing the array and the index k, terminate with the k-th element stored in a designated location.
Decrement | 5 | 2 | 5 | 19 | 18 | Decrement all elements in a contiguous array: given an initial heap containing the array, terminate with every element decreased by one.
ListK | 8 | 2 | 8 | 33 | 11 | Access the k-th element of a linked list: the initial heap contains a linked list represented as adjacent [next pointer, value] pairs in random order, a pointer to the head element, and the index k; terminate with the k-th value stored in a designated location.

ASSEMBLY | | | | | | Description
Access | 5 | 2 | 5 | 13 | 5 | As above.
Decrement | 5 | 2 | 7 | 20 | 27 | As above.
ListK | 8 | 2 | 10 | 29 | 16 | As above.
3.2.2 Straightline programs: Boolean Circuits
As a more complex model, we now consider a simple machine capable of performing a sequence of logic operations (AND, OR, XOR, NOT, COPY) on a set of registers holding boolean values. Each operation takes two registers as input (the second register is ignored in the NOT and COPY operations) and outputs to one register, reminiscent of standard three-address code assembly languages. To embed this example in a real-world application, analogies can be drawn linking the instruction set to electronic logic gates and the registers to electronic wires. This analogy highlights one benefit of interpretability in our model: the synthesized program describes a digital circuit which could easily be translated to real hardware (see e.g. fig:overcompleteCircuit). The TerpreT implementation of this execution model is shown in app:booleanModel.
There are (5R^3)^T possible programs (circuits) for a model consisting of T sequential instructions (logic gates), each chosen by selecting one of the 5 possible operations, two input registers and one output register from the R available registers (wires).
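The straight-line execution model can be sketched in a few lines of Python (an illustration, not the TerpreT model; the (op, in1, in2, out) tuple encoding is our own):

```python
# Each instruction is (op, in1, in2, out); NOT and COPY ignore in2.
OPS = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NOT":  lambda a, b: 1 - a,
    "COPY": lambda a, b: a,
}

def run_circuit(program, registers):
    """Execute a straight-line sequence of logic operations on registers."""
    regs = list(registers)
    for op, in1, in2, out in program:
        regs[out] = OPS[op](regs[in1], regs[in2])
    return regs
```

For example, a two-instruction program implements a half adder: XOR registers 0 and 1 into register 2 (the sum bit), then AND them into register 0 (the carry bit).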
3.2.3 Loopy programs 1: Basic block model
To build loopy execution models, we take inspiration from compiler intermediate languages (e.g., the LLVM Intermediate Representation), modeling full programs as graphs of “basic blocks”. Such programs operate on a fixed number of registers and a byte-addressable heap store accessible through the special instructions READ and WRITE. Each block has an instruction of the form register = instruction(register, register), followed by a branch decision of the form if register goto block else goto block (see fig:basicBlock, and the TerpreT model in app:bbModel). This representation can easily be transformed back and forth to higher-level program source code (by standard compilation/decompilation techniques) as well as into executable machine code.
We use an instruction set containing 9 instructions: ZERO, INC, DEC, ADD, SUB, LESSTHAN, READ, WRITE and NOOP. This induces a large space of possible programs for a system with a handful of registers and basic blocks (including a special stop block which executes NOOP and redirects to itself). We consider the case where registers and heap memory cells all store a single data type: integers in the range 0, …, M - 1, where M is the number of memory cells on the heap. This single data type allows both intermediate values and pointers into the heap to be represented in the registers and heap cells.
While this model focuses on interpretability, it also builds on an observation from the results of Kurach et al. (2015). In the Neural Random-Access Machine (NRAM) model, an RNN-based controller chooses a short sequence of instructions to execute next based on observations of the current program state. However, the empirical evaluation reports that correctly trained models usually picked one sequence of instructions in the first step, and then repeated another sequence over and over until the program terminated. Intuitively, this corresponds to a loop initialization followed by repeated execution of a loop body, something which can naturally be expressed in the Basic Block model.
3.2.4 Loopy programs 2: Assembly model
In the Basic Block model every expression is followed by a conditional branch, giving the model great freedom to represent rich control flow graphs. However, useful programs often execute a sequence of several expressions between branches. Therefore, it may be beneficial to bias the model towards chains of sequentially ordered basic blocks with only occasional branching where necessary. This is achieved by replacing the basic blocks with objects which more closely resemble lines of assembly code. The instruction set is augmented with the jump statements jump-if-zero (JZ) and jump-if-not-zero (JNZ), whose operation is shown in fig:assembly (and in the TerpreT code in app:assemblyModel). Each line of code acts as a conditional branch only if it is assigned one of the jump instructions; otherwise it acts as a single expression which executes and passes control to the next line of code. This assembly model can express the same set of programs as the Basic Block model, and serves as an example of how the design of the model affects the success of program inference.
In addition, we remove NOOP from the instruction set (its effect can be achieved by a jump to the next line), and we always include a special stop line as the last line of the program. The total size of the search space is then determined analogously to the Basic Block model.
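The resulting control flow can be sketched as follows (an illustration with a toy arithmetic instruction set; encodings and names are our own):

```python
def run_assembly(program, regs, max_steps=50):
    """Execute assembly-style lines: control falls through to the next
    line except at JZ/JNZ instructions and the final STOP line."""
    line = 0
    for _ in range(max_steps):
        instr = program[line]
        op = instr[0]
        if op == "STOP":
            break
        if op == "JZ":      # jump to target line if register is zero
            _, cond, target = instr
            line = target if regs[cond] == 0 else line + 1
        elif op == "JNZ":   # jump to target line if register is non-zero
            _, cond, target = instr
            line = target if regs[cond] != 0 else line + 1
        elif op == "INC":   # stand-in arithmetic instructions
            regs[instr[1]] += 1
            line += 1
        elif op == "DEC":
            regs[instr[1]] -= 1
            line += 1
    return regs
```

A small loop transferring the value of r1 into r0 illustrates the chains-of-lines-with-occasional-branches style this model favours.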
4 Backends: Solving the IPS problem
TerpreT is designed to be compiled to a variety of intermediate representations which are then handed to different inference algorithms. This section outlines the compilation steps for each of the backend algorithms listed in tbl:backends.
For each backend we present the compiler transformation of the TerpreT primitives listed in fig:factorGraphParts. For some backends, we find it useful to present these transformations via an intermediate graphical representation resembling a factor graph, or more specifically, a gated factor graph (Minka and Winn, 2009), which visualises the TerpreT program. Below we describe gated factor graphs and provide the mapping from TerpreT syntax to primitives in these models. Then, in sec:fmgd through 4.5, we show how to compile TerpreT for each backend solver.
Graph element | TerpreT representation
Random variable (intermediate) | Xi = Var(N)
Random variable (inference target) | Xi = Param(N)
Observed variable (input) | Xi.set_to_constant(x)
Observed variable (output) | Xi.observe_value(x)
Factor (copy) | X0.set_to(X1)
Factor (general) | X0.set_to(f(X1, X2, …))
Gates | if C == 0: … elif C == 1: … elif C == 2: … elif C == n: …
4.1 TerpreT for Gated Factor Graph Description
A factor graph is a means of representing the factorization of a complex function or probability distribution into a composition of simpler functions or distributions. In these graphs, inputs, outputs and intermediate results are stored in
variable nodes linked by factor nodes describing the functional relationships between variables. A TerpreT model defines the structure of a factor graph, and an inference algorithm is used to populate the variable nodes with values consistent with observations. Particular care is needed to describe factor graphs containing conditional branches, since the value of a variable appearing in conditions of the form <var> == <constant> is not known until inference is complete. This means that we must explore all branches during inference. Gated factor graphs can be used to handle these if statements, and we introduce additional terminology to describe these gated models below. Throughout the next sections we refer to the TerpreT snippet shown in fig:gatesExample for illustration.
Local unary marginal.
We restrict attention to the case where each variable x is discrete, with finite domain {0, …, N - 1}. For each variable x we instantiate a local unary marginal μ(x) defined on this support. In an integral configuration, we demand that μ(x) is nonzero only at a single value x*, allowing us to interpret the marginal as the assignment x = x*. Some inference techniques relax this constraint and consider continuous marginals μ(x). In these relaxed models, we apply continuous optimization schemes which, if successful, will converge on an interpretable integral solution.
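This lifting of discrete operations to marginals is the core of the FMGD backend: a discrete function is applied to marginal vectors by summing the joint input probability over all input configurations. A minimal NumPy sketch, restricted for simplicity to two-argument functions (the function name and signature are our own):

```python
import numpy as np

def lift(f, mu_a, mu_b, out_dim):
    """Lift a discrete function f(a, b) to act on marginal vectors:
    the output marginal accumulates mass mu_a[i] * mu_b[j] on the
    value f(i, j), for every input configuration (i, j)."""
    mu_out = np.zeros(out_dim)
    for i, pa in enumerate(mu_a):
        for j, pb in enumerate(mu_b):
            mu_out[f(i, j)] += pa * pb
    return mu_out
```

Because mu_out is a smooth (multilinear) function of the input marginals, composing such lifted operations yields a system that can be trained by gradient descent; point-mass inputs recover ordinary discrete execution.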
1  # X4 = 0        if X0 == 0 and X1 == 0
2  #    = X2 + 1   if X0 == 0 and X1 == 1
3  #    = 2*X2     if X0 == 1
4  #
5  # Observe X4 = 5; infer X0, X1, X2
6
7  @CompileMe([2, 10], 10)
8  def Plus(a, b): return (a + b) % 10
9  @CompileMe([10], 10)
10 def MultiplyByTwo(a): return (2 * a) % 10
11
12 X0 = Param(2); X1 = Param(2); X2 = Param(10)
13 X3 = Var(10); X4 = Var(10)
14
15 if X0 == 0:
16     if X1 == 0:
17         X3.set_to(0); X4.set_to(X3)
18     elif X1 == 1:
19         X3.set_to(Plus(X1, X2)); X4.set_to(X3)
20 elif X0 == 1:
21     X4.set_to(MultiplyByTwo(X2))
22
23 X4.observe_value(5)
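For this small example, plain brute-force enumeration (our own Python sketch, not part of TerpreT) already recovers the unique parameter setting consistent with the observation, which is useful as a mental model for what the backends must achieve:

```python
from itertools import product

def run_example(x0, x1, x2):
    """Deterministic execution of the program in the listing above."""
    if x0 == 0:
        if x1 == 0:
            return 0
        elif x1 == 1:
            return (x1 + x2) % 10   # Plus(X1, X2)
    elif x0 == 1:
        return (2 * x2) % 10        # MultiplyByTwo(X2)

def solve(observed_x4):
    """Enumerative baseline: all (X0, X1, X2) consistent with X4 = observed_x4."""
    return [(x0, x1, x2)
            for x0, x1, x2 in product(range(2), range(2), range(10))
            if run_example(x0, x1, x2) == observed_x4]

# solve(5) == [(0, 1, 4)]: only X0=0, X1=1, X2=4 yields X4 = 5.
```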


Gates.
Following Minka and Winn (2009), we refer to if statements as gates. More precisely, an if statement consists of a condition (an expression that evaluates to a boolean) and a body (a set of assignments or factors). We will refer to the condition as the gate condition and the body as the gate body. In this work, we restrict attention to cases where all gate conditions are of the form X == c for a variable X and constant c. In future work we could relax this restriction.
In the example in fig:gatesExample, there is a nested gate structure. At the outermost level, there are two gates with gate conditions (X0 == 0) (lines 16–20) and (X0 == 1) (lines 21–22). Inside the (X0 == 0) gate, there are two nested gates (corresponding to (X1 == 0) and (X1 == 1)).
Path conditions.
Each gate 𝒢 has a path condition φ, which is a list of variables and the values they need to take on in order for the gate body to be executed. For example, in fig:gatesExample, the path condition for the innermost gate body on lines 19–20 is φ = (X0 = 0, X1 = 1), where commas denote conjunction. We will use the convention that the condition in the deepest gate's if statement is the last entry of the path condition. Gates belong to a tree structure, and if a gate 𝒢 with gate condition C == c is nested inside a gate 𝒢′ with path condition φ, then we say that 𝒢′ is a parent of 𝒢, and the path condition for 𝒢 is (φ, C = c). We can equally speak of the path condition of a factor, which is the path condition of the most deeply nested gate that the factor is contained in.
Active variables.
Define a variable X to be active in a gate 𝒢 if both of the following hold:

X is used in 𝒢 or one of its descendants, and

X is declared in 𝒢 or one of its ancestors.

That is, X is active in 𝒢 iff 𝒢 lies on the path between X's declaration and one of X's uses.
For each gate 𝒢 in which a variable X is active, we instantiate a separate local marginal μ_X^{φ} annotated with the path condition φ of 𝒢. For example, inside the gate corresponding to (X0 == 0) in fig:gatesExample, the local marginal for X3 is μ_{X3}^{X0=0}. (Strictly speaking, this notation does not handle the case where there are multiple gates with identical path conditions; for clarity of notation, assume that all gate path conditions are unique. The implementation handles repeated path conditions by identifying local marginals according to a unique gate id.) In the global scope we drop the superscript annotation and just use μ_X. We can refer to parent–child relationships between different local marginals of the same variable; the parent (child) of a local marginal μ_X^{φ} is the local marginal for X in the parent (child) gate of 𝒢.
Gate marginals.
Let the gate marginal of a gate 𝒢 be the marginal of 𝒢's condition variable in the parent gate of 𝒢. In fig:gatesExample, the first outer gate's gate marginal is μ_{X0}(0), and the second outer gate's is μ_{X0}(1). In the inner gate, the gate marginal for the (X1 == 0) gate is μ_{X1}^{X0=0}(0).
4.2 Forward Marginals Gradient Descent (FMGD) Backend
The factor graphs discussed above are easily converted into computation graphs representing the execution of an interpreter by the following operations.

Annotate the factor graph edges with the direction of traversal during forwards execution of the TerpreT program.

Associate an executable function with each factor. A factor representing X_out = f(X_1, ..., X_D) operates on the scope (X_1, ..., X_D, X_out); the function transforms the incoming variables into the outgoing variable, X_out = f(X_1, ..., X_D).
In the FMGD approach, we initialize the source nodes of this directed graph by instantiating independent random marginals at each Param node, and δ_x distributions at nodes associated with input observations of the form Xi.set_to_constant(x). Here δ_x is a distribution over D_{Xi} with unit mass at the value x. We then propagate these distributions through the computation graph using the FMGD approximation, described below, to obtain distributions at the output nodes associated with an observe_value(x) statement. This fuzzy system of distributions is fully differentiable. Inference therefore becomes an optimization task: maximize the weight assigned to the observations by updating the parameter distributions via gradient descent.
The key FMGD approximation arises whenever a derived variable X_out depends on several immediate input variables X_1, ..., X_D. In an ungated graph, this occurs at factor nodes where X_out = f(X_1, ..., X_D). FMGD operates under the approximation that all X_1, ..., X_D are independent. In this case, we imagine a local joint distribution over (X_1, ..., X_D, X_out) constructed according to the definition of f and the independent unary marginal distributions μ_{X_1}, ..., μ_{X_D}. From this distribution we marginalize out all of the input variables to obtain the unary marginal μ_{X_out} (see sec:fm). Only μ_{X_out} is propagated forward out of the factor node, and correlations between X_out and X_1, ..., X_D (captured only by the full local joint distribution) are lost. In the next section we explicitly define these operations and extend the technique to allow for gates in the factor graph.

It is worth noting that there is a spectrum of approximations in which we form joint distributions for subgraphs of size ranging from single nodes (FMGD) to the full computation graph (enumerative search), with only independent marginal distributions propagated between subgraphs. Moving along this spectrum trades computational efficiency for accuracy, as more correlations can be captured in larger subgraphs. An exploration of this spectrum could be a basis for future work.
4.2.1 Forward Marginals…
fig:fmgdFactBox illustrates the transformation of each graphical primitive to allow a differentiable forward propagation of marginals through a factor graph. Below we describe more details of factor and gate primitives in this algorithm.

Factors.
The scope of a factor function f contains the immediate input variables X_1, ..., X_D and the immediate output X_out. In this restricted environment, we enumerate the possible outputs from all possible input configurations of the form x^{(c)} = (x_1^{(c)}, ..., x_D^{(c)}) for c = 1, ..., ∏_d |D_{X_d}|. We then marginalise over the configuration index c, using weightings w_c, to produce μ_{X_out} as follows:

    μ_{X_out}(x) = ∑_c w_c 1[x = f(x^{(c)})],    (1)

where 1[·] is an indicator function and the weighting function is:

    w_c = ∏_{d=1}^{D} μ_{X_d}(x_d^{(c)}).    (2)
Gates.
We can include gates in the FMGD formulation as follows. Let {𝒢_c} be the set of child gates of a gate 𝒢 which are controlled by gate marginal μ_C. Inside gate 𝒢_c, there is a subgraph described by TerpreT code T_c which references a set of active variables 𝒳_c. We divide 𝒳_c into 𝒲_c, containing variables which are written-to during execution of T_c (i.e. appear on the left hand side of expressions in T_c), and ℛ_c, containing variables which are not written-to (i.e. appear only on the right hand side of expressions in T_c). In addition, we use 𝒲 = ∪_c 𝒲_c to refer to the written-to active variables across all children, and 𝒰 to denote variables used in the graph downstream of the gates, on paths which terminate at observed variables.

On entering gate 𝒢_c, we import references to the variables in the parent scope, μ_X^{φ}, for all X ∈ ℛ_c:

    μ_X^{φ, C=c}(x) = μ_X^{φ}(x).    (3)

We then run T_c to produce the variables μ_X^{φ, C=c} for X ∈ 𝒲_c. Finally, when leaving a gate, we marginalise using the gate marginal to set the variables X ∈ 𝒲 ∩ 𝒰:

    μ_X^{φ}(x) = ∑_c μ_C^{φ}(c) μ_X^{φ, C=c}(x).    (4)
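Leaving a gate mixes the per-branch local marginals by the gate marginal; a minimal sketch (names are ours):

```python
def leave_gate(gate_marginal, branch_mus):
    """Mix per-branch marginals mu_X^{phi,C=c} with weights mu_C(c):
    mu_X^phi(x) = sum_c mu_C(c) * mu_X^{phi,C=c}(x)."""
    n = len(branch_mus[0])
    return [sum(gate_marginal[c] * branch_mus[c][x]
                for c in range(len(branch_mus)))
            for x in range(n)]
```

If the gate condition is uncertain (say, weight 0.3 on branch 0 and 0.7 on branch 1), a variable written deterministically in each branch leaves the gate with exactly that mixture over its branch values.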
Restrictions on factor functions.
The description above is valid for any f, subject to the condition that Range(f) ⊆ D_{X_out}, where D_{X_out} is the domain of the variable used to store the output of f. One scenario where this condition could be violated is illustrated below:
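A TerpreT-style sketch of such a program (hypothetical reconstruction: only the names makeSmall and isLarge come from the surrounding text; the domain sizes and other names are our assumptions):

```
@CompileMe([10], 2)
def computeIsLarge(x): return 1 if x >= 5 else 0   # hypothetical helper

@CompileMe([10], 5)
def makeSmall(x): return x - 5   # range {-5, ..., 4} is not contained in {0, ..., 4}

X = Var(10); isLarge = Var(2); Y = Var(5)
isLarge.set_to(computeIsLarge(X))
if isLarge == 1:
    Y.set_to(makeSmall(X))   # safe: the path condition guarantees x >= 5
```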
The function makeSmall has a range which contains elements outside the domain of the variable storing its output. However, deterministic execution of this program does not encounter any error, because the path condition isLarge == 1 guarantees that the invalid cases are never reached. In general, it only makes sense to violate Range(f) ⊆ D_{X_out} if we are inside a gate whose path condition ensures that the input values lie in a restricted domain on which the range of f does fit inside D_{X_out}. In this case we can simply enforce the normalisation of μ_{X_out} to account for any weight leaked onto values outside D_{X_out}:

    μ_{X_out}(x) ← μ_{X_out}(x) / ∑_{x′ ∈ D_{X_out}} μ_{X_out}(x′).    (5)

With this additional caveat, there are no further constraints on the factor functions f.
4.2.2 … Gradient Descent
Given a random initialization of marginals for the Param variables, we use the techniques above to propagate marginals forward through the TerpreT model to reach all variables X associated with an observe_value statement. Then we use a cross-entropy loss L to compare each computed marginal μ_X to its observed value x_X^{obs}:

    L = − ∑_{X ∈ 𝒪} log μ_X(x_X^{obs}),    (6)

where 𝒪 is the set of observed variables.
The loss L reaches its lower bound of zero if each of the marginals representing the Params puts unit weight on a single value such that the assignments describe a valid program which explains the observations. The synthesis task is therefore an optimisation problem: minimise L, which we try to solve using backpropagation and gradient descent to reach a zero-loss solution.
To preserve the normalisation of the marginals during this optimisation, rather than updating μ_X directly, we update log parameters m = (m_x)_{x ∈ D_X} defined by μ_X(x) ∝ exp(m_x). These are initialized according to

    m_x = σ h_x,  h_x ∼ N(0, 1),    (7)

where the initialization scale σ is a hyperparameter.
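A minimal sketch of this parameterization and loss (function names are ours; the log input is clamped away from zero, anticipating the heuristic described under "Limiting the values of logarithms" below):

```python
import math
import random

def softmax(m):
    """mu_x proportional to exp(m_x); updating the unconstrained log
    parameters m keeps the marginal normalized during gradient descent."""
    mx = max(m)
    z = [math.exp(v - mx) for v in m]  # subtract max for numerical stability
    s = sum(z)
    return [v / s for v in z]

def init_log_params(domain_size, scale, rng=random):
    """m_x = scale * h_x with h_x ~ N(0, 1); scale = 0 gives the
    uniform marginal on the simplex."""
    return [scale * rng.gauss(0.0, 1.0) for _ in range(domain_size)]

def loss(observed, eps=1e-12):
    """Cross-entropy loss: -sum over observations (mu_X, x_obs) of
    log mu_X(x_obs), with log inputs clamped to at least eps."""
    return -sum(math.log(max(mu[x_obs], eps)) for mu, x_obs in observed)
```

The loss is zero exactly when every observed variable's computed marginal puts unit mass on its observed value.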
4.2.3 Optimization Heuristics
Using gradient information to search over program space is only guaranteed to succeed if all points with zero gradient correspond to valid programs which explain the observations. Since many different programs can be consistent with the observations, there can be many global optima (points with L = 0) in the FMGD loss landscape. However, the FMGD approximation can also lead to local optima which, if encountered, stall the optimization at an uninterpretable point where the marginals assign weight to several distinct parameter settings. For this reason, we try several different random initializations of the parameters and record the fraction of initializations which converge at a global optimum. Specifically, we try two approaches for learning using this model:

Optimized FMGD. Add the heuristics below, which are inspired by Kurach et al. (2015) and designed to avoid getting stuck in local minima, and optimize the hyperparameters for these heuristics by random search. We also include the initialization scale σ and the gradient descent optimization algorithm in the random search (see sec:paritychainexperiments for more details). Setting σ = 0 initializes the parameters uniformly on the simplex; smaller values of σ give more uniform initial distributions, and larger values give peakier ones.
Gradient clipping.
The FMGD neural network depth grows linearly with the number of time steps. We mitigate the "exploding gradient" problem (Bengio et al., 1994) by globally rescaling the whole gradient vector so that its norm does not exceed a hyperparameter threshold.

Noise.
Following Kurach et al. (2015), we add random noise to the computed gradients after each backpropagation step.
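The global rescaling used for gradient clipping can be sketched as follows (the function and parameter names are ours; max_norm stands for the hyperparameter threshold):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient vector so that its L2 norm does not
    exceed max_norm; gradients within the threshold pass unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Rescaling the whole vector (rather than clipping per coordinate) preserves the gradient's direction while bounding its magnitude.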
Entropy.
Ideally, the algorithm would explore the loss surface to find a global minimum rather than committing to some particular configuration early in the training process and getting stuck in a local minimum from which it is unlikely to escape. To bias the network away from committing to any particular solution during early iterations, we add an entropy bonus to the loss function. Specifically, for each softmax distribution in the network, we subtract its entropy scaled by a coefficient which is a hyperparameter. The coefficient is exponentially decayed with a rate which is another hyperparameter.

Limiting the values of logarithms.
FMGD uses logarithms in computing both the cost function and the entropy. Since the inputs to these logarithms can be very small, this can lead to very large cost values and floating-point overflows. We avoid this problem by replacing log(x) with log(max(x, ε)) wherever a logarithm is computed, for some small value of ε.

Kurach et al. (2015) considered two additional tricks which we did not implement generally.
Enforcing Distribution Constraints.
Because of the depth of the networks, propagation of numerical errors can result in marginals that no longer sum to 1. Kurach et al. (2015) solve this by adding rescaling operations to ensure normalization. We find that we can avoid this problem by using 64-bit floating-point precision.
Curriculum learning.
Kurach et al. (2015) used a curriculum learning scheme which involved first training on small instances of a given problem, and only moving to larger instances once the error rate had dropped below a certain value. Our benchmarks contain a small number of short examples (e.g., 5–10 examples acting on memory arrays of up to 8 elements), so there is less room for curriculum learning to be helpful. We manually experimented with hand-crafted curricula for two hard problems (shift and adder), but it did not lead to improvements.
To explore the hyperparameters for these optimization heuristics, we ran preliminary experiments to manually choose a distribution over hyperparameter space for use in random search. The aim was to find a distribution broad enough not to disallow reasonable hyperparameter settings, while narrow enough that runs of random search were not wasted on settings that would never lead to convergence. This distribution over hyperparameters was then fixed for all random search experiments.
4.3 (Integer) Linear Program Backend
We now turn attention to the first alternative backend to be compared with the FMGD. Casting the TerpreT program as a factor graph allows us to build upon standard practice in constructing LP relaxations for solving maximum a posteriori (MAP) inference problems in discrete graphical models (Schlesinger, 1976; Wainwright and Jordan, 2008). In the following sections we describe how to apply these techniques to the TerpreT models, and in particular, how to extend the methods to handle gates.
4.3.1 LP Relaxation
The inference problem can be phrased as the task of finding the highest-scoring configuration of a set of discrete variables x = (x_1, ..., x_n). The score is defined as a sum of local factor scores, S(x) = ∑_α θ_α(x_α), where x_α ∈ 𝒳_α and 𝒳_α is the joint configuration space of the variables whose indices span the scope of factor α. In the simplest case (when we are searching for any valid solution), the factor score at a node representing a function f simply measures the consistency of the inputs x_1, ..., x_D and output x_out at that factor:

    θ_α(x_1, ..., x_D, x_out) = 0 if x_out = f(x_1, ..., x_D), and −∞ otherwise.    (9)
Alongside these scoring functions, we can build a set of linear constraints and an overall linear objective function which represent the graphical model as an LP. The variables of this LP are the local unary marginals μ_i(x_i) as before, together with new local factor marginals μ_α(x_α) associated with each factor α.

In the absence of gates, we can write the LP as:

    max_μ  ∑_α ∑_{x_α} θ_α(x_α) μ_α(x_α)
    s.t.   μ_α(x_α) ≥ 0,  μ_i(x_i) ≥ 0,
           ∑_{x_i} μ_i(x_i) = 1,
           ∑_{x_α \ x_i} μ_α(x_α) = μ_i(x_i),    (10)

where the final set of constraints says that when x_i is fixed to a value and all other variables are marginalized out of the local factor marginal, the result equals the value that the local marginal for x_i assigns to that value. This ensures that factors and their neighboring variables have consistent local marginals.
If all local marginals are integral, i.e., restricted to be 0 or 1, then the LP above becomes an integer linear program corresponding exactly to the original discrete optimization problem. When the local marginals are real-valued (as above), the resulting LP is not guaranteed to have an equivalent solution to the original problem, and fractional solutions can appear. More formally, the LP constraints define what is known as the local polytope, which is an outer approximation to the convex hull of all valid integral configurations of the local marginals (known as the marginal polytope). In the case of program synthesis, fractional solutions are problematic, because they do not correspond to discrete programs and thus cannot be represented as source code or executed on new instances. When a fractional solution is found, heuristics such as rounding, cuts, or branch & bound search must be used in order to find an integral solution.
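A toy numeric illustration of the local-consistency constraints (our own sketch): marginalizing a local factor marginal down to each variable in its scope must reproduce that variable's unary marginal.

```python
from itertools import product

def factor_marginal_from_joint(joint, domains):
    """Given a local factor marginal mu_alpha (a dict mapping joint
    configurations to weights) over variables with the given domain
    sizes, compute the implied unary marginal mu_i for each variable
    by summing out all other variables."""
    unaries = []
    for i, n in enumerate(domains):
        mu_i = [0.0] * n
        for config in product(*(range(d) for d in domains)):
            mu_i[config[i]] += joint.get(config, 0.0)
        unaries.append(mu_i)
    return unaries
```

Note that a perfectly correlated factor marginal (mass split between (0,0) and (1,1)) and a fully independent one can imply the same uniform unary marginals, which is exactly the information the local polytope relaxation is allowed to lose.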
4.3.2 Linear Constraints in Gated Models
We now extend the LP relaxation above to cater for models with gates. In each gate we instantiate local unary marginals μ_i^{φ}(x_i) for each active variable and local factor marginals μ_α^{φ}(x_α) for each factor, where φ is the path condition of the gate.
The constraints in the LP are then updated to handle these gate specific marginals as follows:

Normalization constraints.
The main difference in the gated LP from the standard LP is how normalization constraints are handled. The key idea is that each local marginal in a gate 𝒢 is normalized to sum to 𝒢's gate marginal. Thus the local marginal for X in the gate with path condition (φ, C = c) and gate marginal μ_C^{φ} satisfies:

    ∑_x μ_X^{φ, C=c}(x) = μ_C^{φ}(c).    (11)
For local marginals in the global scope (not in any gate), the marginals are constrained to sum to 1, as in the standard LP.
Factor local marginals.
The constraint enforcing local consistency between the factor local marginals and the unary local marginals is augmented with path condition superscripts:

    ∑_{x_α \ x_i} μ_α^{φ}(x_α) = μ_i^{φ}(x_i).    (12)
Parentchild consistency.
There needs to be a relationship between different local marginals for the same variable. We establish one by enforcing consistency between parent–child local marginals. Let 𝒢′ be a parent gate of 𝒢, and let X be active in both 𝒢′ and 𝒢. Then we need to enforce consistency between μ_X^{φ} (in the parent) and the child marginals. It is not quite as simple as setting these quantities equal; in general there are multiple children gates of 𝒢′, and X may be active in many of them. Let {𝒢_c} be the set of children gates of 𝒢′, with gate conditions C = c, and suppose that X is active in all of the children. Then the constraint is

    μ_X^{φ}(x) = ∑_c μ_X^{φ, C=c}(x).    (13)
This can be thought of as setting a parent local marginal to be a weighted average of children local marginals, where the “weights” come from children marginals being capped at their corresponding gate marginal’s value.
Ghost marginals.
A problem arises if a variable is used in some but not all children gates. It may be tempting in this case to replace the above constraint with one that leaves out the children where the variable is inactive:

    μ_X^{φ}(x) = ∑_{c : X active in 𝒢_c} μ_X^{φ, C=c}(x).    (14)
This turns out to lead to a contradiction. To see this, consider X1 in fig:gatesExample. X1 is inactive in the (X0 == 1) gate, and thus the parent–child consistency constraint would be

    μ_{X1}(x) = μ_{X1}^{X0=0}(x).    (15)
However, the normalization constraints for these local marginals are

    ∑_x μ_{X1}(x) = 1,    (16)
    ∑_x μ_{X1}^{X0=0}(x) = μ_{X0}(0).    (17)

This implies that μ_{X0}(0) = 1, which means we must assign zero probability to the case where X1 is not active. This removes the possibility of X0 = 1 from consideration, which is clearly undesirable, and if there are disjoint sets of variables active in the different children cases, then the result is an infeasible LP.
The solution is to instantiate ghost marginals, which are local marginals for a variable in the case where it is undefined (hence the term "ghost"). We denote a ghost marginal with a path condition entry whose value is set to ⊥, as in μ_{X1}^{X0=⊥}. Ghost marginals represent the distribution over values in all cases where a variable is not defined, so the normalization constraint is defined as follows:

    ∑_x μ_X^{φ, C=⊥}(x) = ∑_{c : X inactive in 𝒢_c} μ_C^{φ}(c).    (18)
Finally, we can fix the parent–child consistency constraints in the case where a variable is active in only some children. The solution is to treat the ghost marginal as one of the child cases. In the example of X1, the constraint would be the following:

    μ_{X1}(x) = μ_{X1}^{X0=0}(x) + μ_{X1}^{X0=⊥}(x).    (19)
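A numeric check of this ghost-marginal bookkeeping for the running example (the concrete values below are ours, chosen to satisfy the constraints):

```python
def check_ghost_constraints(mu_X0, mu_X1_given_X0eq0, mu_X1_ghost, tol=1e-9):
    """Verify the running example's constraints: the branch marginal for
    X1 sums to mu_X0(0), the ghost marginal absorbs the remaining mass
    mu_X0(1), and the parent marginal (their elementwise sum) sums to 1."""
    parent = [a + b for a, b in zip(mu_X1_given_X0eq0, mu_X1_ghost)]
    ok = (abs(sum(mu_X1_given_X0eq0) - mu_X0[0]) < tol
          and abs(sum(mu_X1_ghost) - mu_X0[1]) < tol
          and abs(sum(parent) - 1.0) < tol)
    return ok, parent
```

With μ_{X0} = [0.6, 0.4], a branch marginal summing to 0.6 and a ghost marginal summing to 0.4 combine into a properly normalized parent marginal, with no constraint forcing μ_{X0}(0) = 1.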
The full set of constraints for solving TerpreT IPS problems using gated (integer) LPs is summarized in fig:lp.
4.4 SMT Backend
At its core, an IPS problem in TerpreT induces a simple linear integer constraint system. To exploit mature constraint-solving systems such as Z3 (de Moura and Bjørner, 2008), we have implemented a satisfiability modulo theories (SMT) backend. For this, a TerpreT instance is translated into a set of constraints in the SMT-LIB standard (Barrett et al., 2015), after which any standard SMT solver can be called.
To this end, we have defined a syntax-guided transformation function that translates TerpreT expressions into SMT-LIB expressions over integer variables, shown in fig:smtExprTranslation. We make use of the unrolling techniques discussed earlier to eliminate arrays, for loops and with statements. When encountering a function call as part of an expression, we use inlining, i.e., we replace the call by the function definition in which formal parameters have been substituted by the actual arguments. This means that some TerpreT statements have to be expressed as SMT-LIB expressions, and also that the SMT backend only supports a small subset of functions, namely those using only TerpreT (but not arbitrary Python) constructs.
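To give a flavor of the kind of SMT-LIB text such a translation emits, here is a minimal sketch (ours, not the actual backend; the function name is hypothetical) that encodes one finite-domain integer variable and its defining equation:

```python
def compile_assignment_to_smtlib(var, expr, domain_size):
    """Emit SMT-LIB for one integer variable with a finite domain
    {0, ..., domain_size - 1} and one defining equation var = expr,
    where expr is already an SMT-LIB expression string."""
    return "\n".join([
        f"(declare-const {var} Int)",
        f"(assert (and (<= 0 {var}) (< {var} {domain_size})))",
        f"(assert (= {var} {expr}))",
    ])
```

For example, the assignment X3.set_to(Plus(X1, X2)) from fig:gatesExample would, after inlining Plus, produce an equation like (= X3 (mod (+ X1 X2) 10)) alongside the domain constraint on X3; the resulting text can be handed to any SMT-LIB-conformant solver.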