An emerging body of work advocates program synthesis as an approach to machine learning. The methods here learn functions represented as programs in symbolic, domain-specific languages (DSLs) ellis2016sampling; ellis2018learning; young2019learning; houdini; pirl; propel. Such symbolic models have a number of appeals: they can be more interpretable than neural models, they use the inductive bias embodied in the DSL to learn reliably, and they use compositional language primitives to transfer knowledge across tasks.
In this paper, we study how to learn differentiable programs, which use structured, symbolic primitives to compose a set of parameterized, differentiable modules. Differentiable programs have recently attracted much interest due to their ability to leverage the complementary advantages of programming language abstractions and differentiable learning. For example, recent work has used such programs to compactly describe modular neural networks that operate over rich, recursive data types houdini.
To learn a differentiable program, one needs to induce the program’s “architecture” while simultaneously optimizing the parameters of the program’s modules. This co-design task is difficult because the space of architectures is combinatorial and explodes rapidly. Prior work has approached this challenge using methods such as greedy enumeration, Monte Carlo sampling, and evolutionary algorithms pirl; houdini; Ellis2019Write. However, such approaches are often expensive because they do not fully exploit the structure of the underlying combinatorial search problem.
In this paper, we show that the differentiability of programs opens up a new line of attack on this search problem. A standard strategy for combinatorial optimization is to exploit (ideally fairly tight) continuous relaxations of the search space. Optimization in the relaxed space is typically easier and can efficiently guide search algorithms towards good or optimal solutions. In the case of program learning, we propose to use various classes of neural networks as relaxations of partial programs. We frame our problem as searching a graph, in which nodes encode program architectures with missing expressions, and paths encode top-down program derivations. For each partial architecture encountered during this search, the relaxation amounts to substituting its unknown part with a neural network with free parameters. Because programs are differentiable, this network can be trained on the problem’s end-to-end loss. If the space of neural networks is an (approximate) proper relaxation of the space of programs (and training identifies a near-optimum neural network), then the training loss for the relaxation can be viewed as an (approximately) admissible heuristic.
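The graph-search framing above can be illustrated with a minimal A*-style best-first search over partial architectures. This is a sketch under stated assumptions, not the paper's implementation: the toy grammar, the rule costs, the data, and the names `successors`, `heuristic`, and `a_star` are all hypothetical, and the heuristic for partial architectures is a trivial placeholder (zero) standing in for the training loss of a neural relaxation.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Sentence = Tuple[str, ...]

# Hypothetical toy grammar: nonterminal -> list of (right-hand side, rule cost).
GRAMMAR: Dict[str, List[Tuple[Sentence, float]]] = {
    "P": [(("id", "A"), 0.1), (("neg", "A"), 0.1)],
    "A": [(("x",), 0.1), (("one",), 0.1)],
}

# Hypothetical data for the target function y = -x.
XS = [1.0, 2.0, 3.0]
YS = [-1.0, -2.0, -3.0]

def is_complete(node: Sentence) -> bool:
    """A node is complete when it contains no nonterminals."""
    return not any(sym in GRAMMAR for sym in node)

def successors(node: Sentence) -> List[Tuple[Sentence, float]]:
    """Expand the leftmost nonterminal with every applicable rule."""
    for i, sym in enumerate(node):
        if sym in GRAMMAR:
            return [(node[:i] + rhs + node[i + 1:], cost) for rhs, cost in GRAMMAR[sym]]
    return []

def loss(program: Sentence) -> float:
    """Mean squared error of a complete program of the form (op, arg)."""
    op, arg = program
    total = 0.0
    for x, y in zip(XS, YS):
        value = x if arg == "x" else 1.0
        pred = -value if op == "neg" else value
        total += (pred - y) ** 2
    return total / len(XS)

def heuristic(node: Sentence) -> float:
    """Complete nodes get their true loss. Partial nodes get 0.0, a placeholder
    (trivially admissible) for the loss of a trained neural relaxation."""
    return loss(node) if is_complete(node) else 0.0

def a_star(start: Sentence) -> Optional[Tuple[Sentence, float]]:
    """Pop nodes in order of f = (structural cost so far) + heuristic."""
    frontier = [(heuristic(start), 0.0, start)]
    while frontier:
        f, g, node = heapq.heappop(frontier)
        if is_complete(node):
            return node, f
        for child, cost in successors(node):
            heapq.heappush(frontier, (g + cost + heuristic(child), g + cost, child))
    return None

best_program, best_score = a_star(("P",))
```

A tighter heuristic than the zero placeholder would let the search discard unpromising partial architectures earlier, which is the role the trained relaxation is meant to play.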
We instantiate our approach, called NEAR (abbreviation for Neural Admissible Relaxation), on top of two informed search algorithms: A* and an iteratively deepened depth-first search that uses a heuristic to direct branching as well as branch-and-bound pruning. We evaluate the algorithms on the task of learning programmatic classifiers in three behavior classification applications. We show that the algorithms substantially outperform state-of-the-art methods for program learning, and can learn classifier programs that bear natural interpretations and are close to neural models in accuracy.
To summarize, the paper makes three contributions. First, we identify a tool — heuristics obtained by training neural relaxations of programs — for accelerating combinatorial searches over differentiable programs. So far as we know, this is the first approach to exploit the differentiability of a programming language in program synthesis. Second, we instantiate this idea using two classic search algorithms. Third, we present promising experimental results in three sequence classification applications.
2 Problem Formulation
We view a program in our domain-specific language (DSL) as a pair $(\alpha, \theta)$, where $\alpha$ is a discrete (program) architecture and $\theta$ is a vector of real-valued parameters. The architecture $\alpha$ is generated using a context-free grammar hopcroftullman. The grammar consists of a set of rules $X \rightarrow \sigma_1 \dots \sigma_k$, where $X$ is a nonterminal and $\sigma_1, \dots, \sigma_k$ are either nonterminals or terminals. A nonterminal stands for a missing subexpression; a terminal is a symbol that can actually appear in a program’s code. The grammar starts with an initial nonterminal, then iteratively applies the rules to produce a series of partial architectures: sentences made from one or more nonterminals and zero or more terminals. The process continues until there are no nonterminals left, i.e., we have a complete architecture.
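As an illustration of this top-down derivation process, the following sketch expands a partial architecture one rule at a time. The grammar is a hypothetical toy example (arithmetic expressions in prefix form), not the DSL used in this paper, and the helper names `is_complete` and `expand_leftmost` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Sentence = Tuple[str, ...]

# Hypothetical toy grammar: keys are nonterminals, all other symbols are terminals.
GRAMMAR: Dict[str, List[Sentence]] = {
    "E": [("add", "E", "E"), ("mul", "E", "E"), ("x",), ("c",)],
}

def is_complete(sentence: Sentence) -> bool:
    """A complete architecture contains no nonterminals."""
    return not any(sym in GRAMMAR for sym in sentence)

def expand_leftmost(sentence: Sentence) -> List[Sentence]:
    """Apply every rule to the leftmost nonterminal, yielding the child sentences."""
    for i, sym in enumerate(sentence):
        if sym in GRAMMAR:
            return [sentence[:i] + rhs + sentence[i + 1:] for rhs in GRAMMAR[sym]]
    return []  # no nonterminal left: nothing to expand

start: Sentence = ("E",)               # initial nonterminal
children = expand_leftmost(start)      # partial and complete architectures
grandchildren = expand_leftmost(children[0])
```

Here `("add", "E", "E")` is a partial architecture with two missing subexpressions, while `("x",)` is already complete.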
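The semantics of the architecture $\alpha$ is given by a function $[\![\alpha]\!](x, \theta)$ of an input $x$ and the parameters $\theta$, defined by rules that are fixed for the DSL. We require this function to be differentiable in $\theta$. Also, we define a structural cost for architectures. Let each rule $r$ in the DSL grammar have a non-negative real cost $s(r)$. The structural cost of $\alpha$ is $s(\alpha) = \sum_{r \in \mathcal{R}(\alpha)} s(r)$, where $\mathcal{R}(\alpha)$ is the multiset of rules used to create $\alpha$.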
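The structural cost described above is a sum of rule costs over the multiset of rules in a derivation, so a rule that is applied twice is counted twice. A minimal sketch, assuming hypothetical rule names and cost values that are not taken from the paper:

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical non-negative cost for each grammar rule.
RULE_COST: Dict[str, float] = {
    "E -> add E E": 1.0,
    "E -> mul E E": 1.0,
    "E -> x": 0.0,
    "E -> c": 0.5,
}

def structural_cost(rules_used: Iterable[str]) -> float:
    """Sum rule costs over the multiset of rules used in a derivation."""
    counts = Counter(rules_used)  # multiset: rule -> number of applications
    return sum(RULE_COST[rule] * n for rule, n in counts.items())

# Derivation of add(x, mul(x, c)); the rule "E -> x" is applied twice.
derivation = ["E -> add E E", "E -> x", "E -> mul E E", "E -> x", "E -> c"]
cost = structural_cost(derivation)
```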
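To define our learning problem, we assume an unknown distribution $D(x, y)$ over inputs $x$ and labels $y$, and consider the prediction error function $\zeta(\alpha, \theta) = \mathbb{E}_{(x, y) \sim D}\left[\mathbb{1}\left([\![\alpha]\!](x, \theta) \neq y\right)\right]$, where $\mathbb{1}$ is the indicator function. Our goal is to find an architecturally simple program with low prediction error, i.e., to solve the optimization problem: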