Learning Differentiable Programs with Admissible Neural Heuristics

We study the problem of learning differentiable functions expressed as programs in a domain-specific language. Such programmatic models can offer benefits such as composability and interpretability; however, learning them requires optimizing over a combinatorial space of program "architectures". We frame this optimization problem as a search in a weighted graph whose paths encode top-down derivations of program syntax. Our key innovation is to view various classes of neural networks as continuous relaxations over the space of programs, which can then be used to complete any partial program. This relaxed program is differentiable and can be trained end-to-end, and the resulting training loss is an approximately admissible heuristic that can guide the combinatorial search. We instantiate our approach on top of the A* algorithm and an iteratively deepened branch-and-bound search, and use these algorithms to learn programmatic classifiers in three sequence classification tasks. Our experiments show that the algorithms outperform state-of-the-art methods for program learning, and that they discover programmatic classifiers that yield natural interpretations and achieve competitive accuracy.




1 Introduction

An emerging body of work advocates program synthesis as an approach to machine learning. The methods here learn functions represented as programs in symbolic, domain-specific languages (DSLs) [ellis2016sampling; ellis2018learning; young2019learning; houdini; pirl; propel]. Such symbolic models have a number of appeals: they can be more interpretable than neural models, they use the inductive bias embodied in the DSL to learn reliably, and they use compositional language primitives to transfer knowledge across tasks.

In this paper, we study how to learn differentiable programs, which use structured, symbolic primitives to compose a set of parameterized, differentiable modules. Differentiable programs have recently attracted much interest due to their ability to leverage the complementary advantages of programming language abstractions and differentiable learning. For example, recent work has used such programs to compactly describe modular neural networks that operate over rich, recursive data types [houdini].

To learn a differentiable program, one needs to induce the program’s “architecture” while simultaneously optimizing the parameters of the program’s modules. This co-design task is difficult because the space of architectures is combinatorial and explodes rapidly. Prior work has approached this challenge using methods such as greedy enumeration, Monte Carlo sampling, and evolutionary algorithms [pirl; houdini; Ellis2019Write]. However, such approaches can often be expensive, as they do not fully exploit the structure of the underlying combinatorial search problem.

In this paper, we show that the differentiability of programs opens up a new line of attack on this search problem. A standard strategy for combinatorial optimization is to exploit (ideally fairly tight) continuous relaxations of the search space. Optimization in the relaxed space is typically easier and can efficiently guide search algorithms towards good or optimal solutions. In the case of program learning, we propose to use various classes of neural networks as relaxations of partial programs. We frame our problem as searching a graph, in which nodes encode program architectures with missing expressions, and paths encode top-down program derivations. For each partial architecture $\alpha$ encountered during this search, the relaxation amounts to substituting the unknown parts of $\alpha$ with neural networks with free parameters. Because programs are differentiable, this relaxed program can be trained on the problem’s end-to-end loss. If the space of neural networks is an (approximate) proper relaxation of the space of programs (and training identifies a near-optimal network), then the training loss of the relaxation can be viewed as an (approximately) admissible heuristic.
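As a concrete illustration of this idea (a minimal sketch, not the paper's implementation), the snippet below relaxes the single hole of a toy partial program into a linear map with free parameters, trains it end-to-end by gradient descent, and returns the final training loss as the heuristic value for that search node. The toy data, the identity prefix, and all names here are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: the data is generated by y = 2*x, i.e., by a "program"
# that the search would eventually have to find.
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0]

def heuristic_from_relaxation(prefix_fn, steps=500, lr=0.1):
    """Fill the hole of a partial program with a differentiable surrogate
    (here, a linear map), train it end-to-end on the task loss, and
    return the final loss as the heuristic value for this search node."""
    w = np.zeros(1)            # free parameters of the relaxation
    for _ in range(steps):
        z = prefix_fn(X)       # run the fixed, known part of the program
        pred = z @ w           # hole filled by the relaxation
        grad = 2 * z.T @ (pred - y) / len(y)
        w -= lr * grad         # gradient step on the mean squared error
    pred = prefix_fn(X) @ w
    return float(np.mean((pred - y) ** 2))

# A partial program whose known prefix is the identity: the relaxation
# can fit the data, so this node looks promising to the search.
h = heuristic_from_relaxation(lambda x: x)
```

Because the relaxation can reach (near-)zero loss whenever some completion of the partial program fits the data, a low heuristic value signals a promising node, which is exactly the role an admissible heuristic plays in informed search.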

We instantiate our approach, called NEAR (an abbreviation for Neural Admissible Relaxation), on top of two informed search algorithms: A* and an iteratively deepened depth-first search that uses a heuristic to direct branching as well as branch-and-bound pruning (IDS-BB). We evaluate these algorithms on the task of learning programmatic classifiers in three behavior classification applications. We show that the algorithms substantially outperform state-of-the-art methods for program learning, and can learn classifier programs that bear natural interpretations and are close to neural models in accuracy.
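To make the search side concrete, here is a minimal generic A* in Python with a pluggable heuristic function, which is the slot where a trained relaxation's loss would go. This is a textbook sketch, not the paper's implementation; the function names are assumptions.

```python
import heapq

def a_star(start, successors, is_goal, heuristic):
    """Generic A*: successors(node) yields (child, edge_cost) pairs;
    heuristic(node) estimates the remaining cost to a goal.
    Returns (path, cost) of the cheapest goal found, or (None, inf)."""
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_g = {}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return path, g
        if node in best_g and best_g[node] <= g:
            continue               # already expanded via a cheaper path
        best_g[node] = g
        for child, cost in successors(node):
            g2 = g + cost
            # Priority = cost so far + heuristic estimate of the rest.
            heapq.heappush(frontier, (g2 + heuristic(child), g2, child, path + [child]))
    return None, float("inf")
```

In the program-learning setting, nodes would be partial architectures, edge costs would come from the structural cost of applied rules, and the heuristic would be the training loss of the node's neural relaxation; with an (approximately) admissible heuristic, A* returns (approximately) optimal goals.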

To summarize, the paper makes three contributions. First, we identify a tool — heuristics obtained by training neural relaxations of programs — for accelerating combinatorial searches over differentiable programs. So far as we know, this is the first approach to exploit the differentiability of a programming language in program synthesis. Second, we instantiate this idea using two classic search algorithms. Third, we present promising experimental results in three sequence classification applications.

2 Problem Formulation

We view a program in our domain-specific language (DSL) as a pair $(\alpha, \theta)$, where $\alpha$ is a discrete (program) architecture and $\theta \in \mathbb{R}^d$ is a vector of real-valued parameters. The architecture $\alpha$ is generated using a context-free grammar [hopcroftullman]. The grammar consists of a set of rules $X \to \sigma_1 \sigma_2 \ldots \sigma_k$, where $X$ is a nonterminal and $\sigma_1, \ldots, \sigma_k$ are either nonterminals or terminals. A nonterminal stands for a missing subexpression; a terminal is a symbol that can actually appear in a program’s code. Generation starts with an initial nonterminal and iteratively applies the rules to produce a series of partial architectures: sentences made from one or more nonterminals and zero or more terminals. The process continues until there are no nonterminals left, i.e., we have a complete architecture.
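The top-down generation process can be sketched as follows. The toy grammar here, with nonterminals `E` and `F` and terminals like `map` and `atom`, is invented for illustration and is not the paper's DSL.

```python
from collections import deque

# Toy CFG over architectures: a rule maps a nonterminal to a
# sequence of symbols (nonterminals or terminals).
RULES = {
    "E": [["map", "F", "E"], ["fold", "F", "E"], ["atom"]],
    "F": [["add"], ["mul"]],
}

def is_nonterminal(sym):
    return sym in RULES

def expand(partial):
    """Yield the children of a partial architecture by rewriting
    its leftmost nonterminal with every applicable rule."""
    for i, sym in enumerate(partial):
        if is_nonterminal(sym):
            for rhs in RULES[sym]:
                yield partial[:i] + rhs + partial[i + 1:]
            return  # only the leftmost nonterminal is expanded

def derive_complete(start="E", max_len=5):
    """Enumerate complete architectures (no nonterminals left),
    bounding the architecture size so the search terminates."""
    complete, frontier = [], deque([[start]])
    while frontier:
        partial = frontier.popleft()
        if len(partial) > max_len:
            continue
        children = list(expand(partial))
        if not children:           # no nonterminals remain: complete
            complete.append(partial)
        else:
            frontier.extend(children)
    return complete
```

Each call to `expand` corresponds to one edge in the search graph described in the introduction, and a sequence of expansions from the initial nonterminal to a complete architecture is one top-down derivation, i.e., one path in that graph.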

The semantics of the architecture $\alpha$ is given by a function $\llbracket \alpha \rrbracket(x, \theta)$, defined by rules that are fixed for the DSL. We require this function to be differentiable in $\theta$. Also, we define a structural cost for architectures. Let each rule $r$ in the DSL grammar have a non-negative real cost $s(r)$. The structural cost of $\alpha$ is $s(\alpha) = \sum_{r \in \mathcal{R}(\alpha)} s(r)$, where $\mathcal{R}(\alpha)$ is the multiset of rules used to create $\alpha$.
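A structural cost computation under this definition might look like the following sketch; the rule names and their costs are invented for illustration.

```python
# Each grammar rule carries a non-negative cost; the structural cost of
# an architecture is the sum over the multiset of rules in its derivation.
RULE_COST = {
    "E->map F E": 1.0,
    "E->fold F E": 1.0,
    "E->atom": 0.5,
    "F->add": 0.2,
    "F->mul": 0.2,
}

def structural_cost(derivation):
    """derivation: the multiset (as a list, so repeats count) of rule
    names applied to build the architecture."""
    return sum(RULE_COST[r] for r in derivation)
```

For example, the derivation `["E->map F E", "F->add", "E->atom"]` has structural cost 1.0 + 0.2 + 0.5 = 1.7; a rule applied twice contributes its cost twice, which is why the multiset (rather than the set) of rules is summed.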

To define our learning problem, we assume an unknown distribution $D(x, y)$ over inputs $x$ and labels $y$, and consider the prediction error function $\zeta(\alpha, \theta) = \mathbb{E}_{(x,y) \sim D}[\mathbf{1}(\llbracket \alpha \rrbracket(x, \theta) \neq y)]$, where $\mathbf{1}$ is the indicator function. Our goal is to find an architecturally simple program with low prediction error, i.e., to solve the optimization problem: $(\alpha^*, \theta^*) \in \arg\min_{(\alpha, \theta)} \big( s(\alpha) + \zeta(\alpha, \theta) \big)$.