People infer recursive visual concepts from just a few examples

04/17/2019
by Brenden M. Lake, et al.

Machine learning has made major advances in categorizing objects in images, yet the best algorithms miss important aspects of how people learn and think about categories. People can learn richer concepts from fewer examples, including causal models that explain how members of a category are formed. Here, we explore the limits of this human ability to infer causal "programs" -- latent generating processes with nontrivial algorithmic properties -- from one, two, or three visual examples. People were asked to extrapolate the programs in several ways, for both classifying and generating new examples. As a theory of these inductive abilities, we present a Bayesian program learning model that searches the space of programs for the best explanation of the observations. Although variable, people's judgments are broadly consistent with the model and inconsistent with several alternatives, including a pre-trained deep neural network for object recognition, indicating that people can learn and reason with rich algorithmic abstractions from sparse input data.


1 Introduction

Computer vision now approaches or exceeds human performance on certain large-scale object recognition tasks (Krizhevsky et al., 2012; Russakovsky et al., 2015; Simonyan & Zisserman, 2014; He et al., 2015), yet the best systems miss critical aspects of how people learn and think about categories (Lake et al., 2017). People learn richer and more abstract concepts than the best machines and, remarkably, they need fewer examples in order to do it. Children can make meaningful generalizations after seeing just one or a handful of “pineapples” or “school buses” (Smith et al., 2002; Bloom, 2000; Xu & Tenenbaum, 2007), while the leading machine systems require far more examples. People can also use their concepts in more sophisticated ways than machines can – not just for classification (Fig. 1A), but also for generation, explanation, and extrapolation (Fig. 1B). A central challenge is understanding how people learn such rich concepts from such limited amounts of experience.

An important limitation of contemporary AI systems is that they lack a human-like understanding of causality (Lake et al., 2017). People utilize causal understanding for classification (Rehder & Hastie, 2001; Murphy, 2002), yet leading object recognition systems based on convolutional neural networks (ConvNets; LeCun et al., 1989) often fail to do so. For instance, people can use causality to help group a young tree and an old tree in the same class, while ConvNets fail to see this type of similarity, even after training on over a million natural images (Fig. 1A). In addition to classification, people can predict how a tree will grow, or how a tree could be trimmed to keep it healthy (Fig. 1B), demonstrating the causal basis of many everyday explanations and extrapolations.

Figure 1: Causal understanding influences everyday conceptual judgments in classification (A) and extrapolation (B). The top and left images of trees (A) have the same causal structure and were generated from the same simple program (L-system; Prusinkiewicz & Lindenmayer, 1990). However, leading object recognition systems trained on natural images (Simonyan & Zisserman, 2014; He et al., 2015) understand little of that causal structure, perceiving the other two images as more similar (top and right) than the natural continuation (top and left; based on Euclidean/cosine distance in the top hidden layer). (B) People also use their causal knowledge to make extrapolations, including predicting how trees grow. (C) In addition to trees, natural fractal concepts with rich causal structure include broccoli, ice drippings, and peacock plumage.

To capture more causal and flexible types of learning, concept learning has been modeled as Bayesian program induction (Lake et al., 2015; Goodman et al., 2015; Stuhlmuller et al., 2010; Piantadosi et al., 2012; Ellis et al., 2015, 2018). Many real-world concepts can be well described as programs with high-level algorithmic units (e.g., loops and recursion), reflecting causal structure about how examples of that category are formed. Bayesian inference provides a powerful engine for learning causal and compositional representations from examples, and these rich representations allow for flexible generalization to multiple tasks. For instance, Bayesian “motor” program induction has been used to learn new handwritten characters from a single example, leading to human-like patterns of generalization in various generative tasks (Lake et al., 2015, 2019). Beyond learning visual concepts, program induction has been used to model the acquisition of number words (Piantadosi et al., 2012) and problem solving with toy train systems (Khemlani et al., 2013).

This previous work covers interesting yet specialized cases: motor programs for generating characters or rearranging toy trains have unusually concrete and embodied semantics that are not representative of all programs. Moreover, program induction has steep computational demands, and traditional inference algorithms struggle to search the non-smooth and combinatorial spaces of programs. If the mind can induce genuine programs to represent concepts, what are the limits of this ability? Do people need explicit instruction in the underlying causal process – as when practicing writing handwritten letters – or can they infer the causal process from its outputs alone? Do mental concepts naturally include powerful computational techniques such as recursion?

Program induction over abstract recursive structures is not just a theoretical exercise. Recursion is central to language and thought (Hauser et al., 2002; Corballis, 2014), and many natural categories arise through recursive generative processes (Fig. 1C). Visual concepts such as trees, Romanesco broccoli, peacock plumage, ice drippings, rivers, fingerprints, wax drippings, and clouds are natural fractals – objects with characteristic patterns that appear at every scale (Mandelbrot, 1983). Breaking off a piece of Romanesco, chopping off the branch of a tree, or zooming in on a subset of a river delta each results in a miniature version of the original. This signature invites learners to search for simpler causal processes that can explain the visual complexity.

In this paper, we studied how people and machines learn abstract, recursive visual concepts from examples. The tasks were designed to explore the limits of the human ability to infer structured programs from examples – in terms of the difficulty of the concepts, the amount of data provided (just one or a few examples), and the range of ways people can generalize (both classification and generation). While examining human learning, our tasks also present a new challenge for computational cognitive modeling and machine learning. We develop a Bayesian program learning (BPL) model to learn program-based concepts from examples (Lake et al., 2015), providing both a measure of ideal performance and an algorithmic-level model of the learning process. We compare with multiple computational approaches – deep neural networks, classic pattern recognition algorithms, and a lesioned model – that do not operationalize concept learning as program induction.

2 Model

We introduce a hierarchical Bayesian model for learning visual concepts from examples. During learning, the model receives a limited set of outputs (in this case, images) from an unknown program without the intermediate generative steps. The aim is to search the space of possible programs for those most likely to have generated those outputs, considering both sequential and recursive programs as candidate hypotheses. To construct the model, we first specify a language for visual concepts that is used both for generating the experimental stimuli and for the computational modeling. Second, we describe how to infer programs from their outputs through the hierarchical Bayesian framework.

Figure 2: A hierarchical generative model for recursive visual concepts. A probabilistic context-free grammar (G) samples an L-system (ψ) which defines a type of image (a concept). The L-system specifies an axiom, a turn angle, and re-write rules for the “F” and “G” symbols. Tokens of a concept have both a symbolic (S_D) and a visual “Turtle graphics” form (I_D), where D indicates the depth of recursion. In this example, the recursion operates as follows: the axiom “F” (S_0) is re-written to become “G-G+F+G-G” (S_1), which is rewritten to become S_2, and so on (the “…” marks indicate line breaks and are not symbols). To transform S_D into I_D, the turtle starts at the bottom leftmost point of each figure with a rightward heading.

A language for recursive visual concepts

Lindenmayer systems (L-systems) provide a flexible language for recursive visual concepts, with applications to modeling cellular division, plant growth, and procedural graphics (Lindenmayer, 1968; Prusinkiewicz & Lindenmayer, 1990; Mĕch & Prusinkiewicz, 1996). We use a class of L-systems that closely resemble context-free grammars, specifying a start symbol (axiom) and a set of symbol re-write rules. Each recursive application of the re-write rules produces a new string of symbols. Unlike context-free grammars, which apply their re-write rules sequentially, L-systems apply all rules in parallel. As the rules are applied, each intermediate output is a different member of the category, with both a symbolic (S_D) and a visual interpretation (I_D), where D indicates the depth of recursion. An example L-system is shown in Fig. 2.
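For concreteness, the parallel re-writing process can be sketched in a few lines of Python. This is a minimal illustration of the L-system semantics described above; the function name and the G-rule shown are illustrative choices rather than details of our implementation.

```python
# Minimal sketch of parallel L-system rewriting. The F-rule is the example
# from Fig. 2; the G-rule shown here is an illustrative choice that keeps
# plain segments growing in step with the F-rule's net advance (three unit
# steps when the turn angle is 90 degrees).

def expand(axiom: str, rules: dict, depth: int) -> str:
    """Apply all re-write rules in parallel for `depth` iterations."""
    s = axiom
    for _ in range(depth):
        # Every symbol is replaced simultaneously; symbols without a rule
        # (the turns "+" and "-") map to themselves.
        s = "".join(rules.get(c, c) for c in s)
    return s

rules = {"F": "G-G+F+G-G", "G": "GGG"}
print(expand("F", rules, 1))  # G-G+F+G-G
print(expand("F", rules, 2))  # GGG-GGG+G-G+F+G-G+GGG-GGG
```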

Building on prior work that understands figure perception as symbolic compression (Leeuwenberg, 1969; Buffart et al., 1981), the symbolic description (S_D) of an example is interpreted visually (I_D) using “Turtle graphics” (Fig. 2). Turtle graphics is a common environment for teaching programming, and it has been used for program induction as well (Ellis et al., 2015). The environment provides a virtual turtle that lives on a 2D canvas with a location and an orientation. The turtle can be controlled with simple instructions such as “go straight” (here, denoted by the symbols “F” and “G”), turn left (“-”), and turn right (“+”). As she moves, the turtle produces ink on the canvas. In this paper, the turtle always moves a fixed number of steps, turns a fixed number of degrees, and produces ink while moving, although more complex control structures for the turtle are possible. Together with L-systems, this framework can represent a large number of visual concepts. We use recursive concepts to study how people learn program-like abstractions from examples, as well as the learning conditions that best facilitate generalization.
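The turtle semantics above can likewise be sketched directly. The following minimal Python fragment converts a symbol string into line segments, assuming unit step lengths and a fixed turn angle; the function name `trace` is illustrative.

```python
# A minimal sketch of the Turtle-graphics interpretation: "F" and "G" both
# mean "go straight" (inking as she moves), "-" turns left, "+" turns right.
import math

def trace(symbols: str, angle_deg: float = 90.0):
    """Convert a symbol string into a list of line segments ((x0,y0), (x1,y1))."""
    x, y, heading = 0.0, 0.0, 0.0  # start at the origin with a rightward heading
    segments = []
    for c in symbols:
        if c in "FG":  # move one unit step forward, leaving ink
            nx = x + math.cos(math.radians(heading))
            ny = y + math.sin(math.radians(heading))
            segments.append(((x, y), (nx, ny)))
            x, y = nx, ny
        elif c == "-":  # turn left by the fixed angle
            heading += angle_deg
        elif c == "+":  # turn right by the fixed angle
            heading -= angle_deg
    return segments

print(len(trace("G-G+F+G-G")))  # 5 unit segments
```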

Bayesian program learning

A Bayesian Program Learning (BPL) model is used to infer an unknown program given just one or a small number of images produced by the program (Fig. 2; Lake et al., 2015) (see also Probabilistic Language of Thought models; Piantadosi, 2011; Goodman et al., 2015; Overlan et al., 2016). The core of the BPL model is the hierarchical generative process shown in Fig. 2. First, a meta-grammar G samples a concept type, which is an L-system-based program ψ. To produce a token i, the depth of recursion D_i is either pre-specified or sampled from a uniform distribution (from 0 to 4). The program ψ is applied to its own output D_i times, and the symbolic form is stochastically rendered as a binary image I_i. The joint distribution of a type ψ and tokens I_1, …, I_n is

P(ψ, I_1, …, I_n, D_1, …, D_n) = P(ψ) ∏_{i=1}^{n} P(I_i | ψ, D_i) P(D_i).    (1)

Concept learning becomes a problem of posterior inference: reasoning about the distribution of programs given a set of images, either P(ψ | I_1, …, I_n, D_1, …, D_n) (if the depths are pre-specified) or P(ψ, D | I) (for a single image of unknown depth).
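As a concrete reading of Eq. 1, the joint score factorizes into a prior term plus a per-token likelihood and depth term. The following sketch assumes `log_prior` and `log_likelihood` helpers defined elsewhere; all names are illustrative.

```python
# Sketch of the log of the joint distribution in Eq. 1:
# log P(psi, I_1..n, D_1..n) = log P(psi) + sum_i [log P(I_i | psi, D_i) + log P(D_i)].
import math

MAX_DEPTH = 4  # depths are sampled uniformly from 0..4

def log_joint(psi, images, depths, log_prior, log_likelihood):
    """Score a program psi against a set of image tokens and their depths."""
    score = log_prior(psi)                        # log P(psi), from the meta-grammar
    for image, depth in zip(images, depths):
        score += log_likelihood(image, psi, depth)  # Bernoulli pixel model (see below)
        score += -math.log(MAX_DEPTH + 1)           # uniform prior over depths 0..4
    return score
```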

We now describe each term in Eq. 1 to specify the full model. The meta-grammar G is a probabilistic context-free grammar (PCFG) for generating ψ. At each step, a PCFG re-write rule is chosen uniformly at random (from the applicable set), which defines the prior on L-systems, P(ψ). The random variables in ψ are the turtle’s turning angle and the F-rule, the re-write rule for the “F” symbol. The PCFG generates an F-rule by beginning at “Start” and applying production rules until the string consists of only terminal symbols (Fig. 2). Instead of allowing a free combination of primitives, the model is restricted to a special set of L-systems that resemble natural growth when rendered in visual form. At each iteration, these programs sprout symmetric growths from a subset of their straight line segments (“F” symbols sprout and “G” symbols do not), while otherwise maintaining the same global shape (Fig. 2). The details for generating ψ from G are provided in the Appendix.

Last, we sample from the stochastic process for generating images, P(I_i | ψ, D_i). First, ψ is unrolled for D_i iterations to produce a string of turtle symbols (Fig. 2). Second, the turtle traces her trajectory, which is centered and rescaled to have a common width. Third, the stochastic ink model from Lake et al. (2015) is used to line the trajectory with grayscale ink, a refinement of the approach developed by Hinton & Nair (2006). Each real-valued pixel defines the probability of producing a black pixel (rather than white) under an independent Bernoulli model. (Images presented to participants were rendered with standard Python graphics rather than the BPL ink model; the BPL ink model parameters were fit, via maximum likelihood, to these graphics using random turtle scribbles.)
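As a concrete illustration of this pixel-level likelihood, the following sketch scores a binary image against a grayscale ink-probability map. The function name `bernoulli_log_likelihood` is an illustrative stand-in, and producing the ink map itself is left to the renderer.

```python
# Sketch of the per-pixel Bernoulli image likelihood: each grayscale pixel
# gives the probability of inking the corresponding binary pixel black.
import numpy as np

def bernoulli_log_likelihood(binary_image: np.ndarray, ink_prob: np.ndarray,
                             eps: float = 1e-6) -> float:
    """log P(I | psi, D) with each pixel an independent Bernoulli draw."""
    p = np.clip(ink_prob, eps, 1.0 - eps)  # avoid log(0) at saturated pixels
    I = binary_image.astype(bool)
    return float(np.where(I, np.log(p), np.log1p(-p)).sum())

# Example: a 2x2 binary image scored against an ink-probability map.
image = np.array([[1, 0], [1, 1]])
ink = np.array([[0.9, 0.1], [0.8, 0.7]])
print(bernoulli_log_likelihood(image, ink))
```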

To summarize, the BPL model specifies a grammar (PCFG) for generating another grammar-like program (L-system), and a process for expanding and rendering L-systems as raw images (via turtle graphics). The model can also solve the inverse problem: given an image, it can search for the underlying program (L-system) that is most likely to have generated it. Any given image is consistent with both recursive and non-recursive interpretations, and thus the model decides for itself which interpretation to take. An interpretation unrolled to depth D > 0 is a recursive generative process, while an interpretation with D = 0 is a static, non-recursive generative process. Recovered programs can be run forward to generate further recursive iterations, generalizing beyond the input to perform a range of tasks.

To approximate Bayesian inference, we can draw posterior samples from P(ψ | I_1, …, I_n, D_1, …, D_n) (pre-specified depth) or P(ψ, D | I) (unknown depth) using Markov chain Monte Carlo (MCMC), with a general inference scheme based on the Metropolis-Hastings algorithm and sub-tree regrowth proposals (Goodman, Tenenbaum et al., 2008; Piantadosi et al., 2012). The algorithm was implemented in the LOTlib software package (Piantadosi, 2014). This algorithm is effective for our problem, except that complex hypotheses producing very long symbolic forms (S_D) can lead to computational issues during simulation. To address this, we capped the length of a hypothesis to be just above the longest concept in the experiments; hypotheses that exceeded this limit were decremented in recursive depth.
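The inference loop can be sketched schematically as follows. The hypothesis representation and the `regrow_subtree` proposal here are illustrative stand-ins for LOTlib's machinery, not its actual API.

```python
# Schematic Metropolis-Hastings loop with sub-tree regrowth proposals, in
# the spirit of Goodman, Tenenbaum et al. (2008).
import math
import random

def mh_chain(init_hypothesis, regrow_subtree, log_posterior, n_steps=240):
    """Return the last sample after n_steps Metropolis-Hastings proposals."""
    current = init_hypothesis
    current_score = log_posterior(current)
    for _ in range(n_steps):
        # Pick a node in the hypothesis tree and regrow it from the PCFG;
        # the proposal also returns forward/backward proposal log-densities.
        proposal, log_fwd, log_bwd = regrow_subtree(current)
        proposal_score = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio * proposal correction).
        log_accept = (proposal_score - current_score) + (log_bwd - log_fwd)
        if math.log(random.random()) < log_accept:
            current, current_score = proposal, proposal_score
    return current  # decisions in the paper are based on the last sample
```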

It is important to note that the model has several key advantages over people, at least for the purposes of our experiments. The model is provided with exactly the right programming language, allowing it to learn concepts in this family but not others – an advantage people do not share. To the extent that people are successful on these tasks, they are unlikely to use precisely the same representation language as the model; instead, their “language of thought” (Fodor, 1975; Piantadosi, 2011; Goodman et al., 2015; Goodman, Mansinghka et al., 2008) must be general enough to learn these programs as well as many others.

Alternative models

We compare the BPL model with several competitive alternatives that do not have explicit program-like structures or abstractions. The first is a deep convolutional neural network (ConvNet; Krizhevsky et al., 2012), which uses extensive knowledge of objects and natural scene statistics instead of symbolic programs. The ConvNet is pre-trained on the large-scale ImageNet object recognition challenge with 1.2 million natural images (Russakovsky et al., 2015), and similarity is measured as cosine distance in the top-most feature vector before the output layer.

We also compare with two classic distance metrics: the Euclidean metric, used as a simple measure of similarity between images, and the modified Hausdorff metric, a more sophisticated measure of shape applied to the binarized images (Dubuisson & Jain, 1994).
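For reference, the modified Hausdorff distance can be computed directly from the inked pixel coordinates of two binarized images, as in this minimal sketch.

```python
# Sketch of the modified Hausdorff distance of Dubuisson & Jain (1994):
# the max of the two directed mean nearest-neighbor distances.
import numpy as np

def modified_hausdorff(A: np.ndarray, B: np.ndarray) -> float:
    """A, B: binary images; distance between their sets of inked pixels."""
    pa = np.argwhere(A)  # (row, col) coordinates of black pixels in A
    pb = np.argwhere(B)
    # Pairwise Euclidean distances between the two point sets.
    d = np.sqrt(((pa[:, None, :] - pb[None, :, :]) ** 2).sum(-1))
    # Directed distances are means of nearest-neighbor distances; take the max.
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())

A = np.zeros((8, 8)); A[2, 2:6] = 1
B = np.zeros((8, 8)); B[3, 2:6] = 1
print(modified_hausdorff(A, B))  # 1.0 for these vertically shifted strokes
```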

Finally, we compare with a lesioned “non-recursive BPL” model restricted to depth D = 0, meaning it cannot use recursion to explain the visual examples. This algorithm seeks to explain the (most mature) example contour with a complex sequence of turtle commands (the most complex contour contains 470 symbols). To reduce its considerable search burden, the model is provided with the ground-truth generative sequence for the most mature example. Given that the contour is modeled as a flat sequence, the likelihood of any new image is simply P(I_new | ψ, D = 0), without any recursive expansion.

3 Experiments

Two experiments explored the human limits of inferring program-based concepts from examples. Participants were asked to learn new visual concepts from just one or a few examples of their outputs, and their ability to generalize was evaluated either through classifying new examples (Experiment 1) or generating new examples (Experiment 2). People, BPL, and the alternative models were compared on a set of tasks of varying difficulty, providing a comprehensive picture of the human ability and its boundaries. All of the experiments are available online (https://cims.nyu.edu/~brenden/supplemental/lrvc/vp-exp.html), and the details are provided below.

Figure 3: Classifying a new example of a recursive visual concept. Example trials are shown for the block condition (A) and the incremental condition (B). Answers: bottom-left (A) and top-middle (B).
Figure 4: Mean performance across participants on classification (A) and generation (B) tasks with recursive visual concepts. Accuracy for classification is based on a six-way choice. Accuracy for generation is measured on the basis of individual decisions (left) and whether exactly the right exemplar was produced (right). Error bars are SEM.

Experiment 1: Classification

This experiment examines how people classify new examples of a recursive visual concept.

Methods

Thirty participants in the United States were recruited on Amazon Mechanical Turk using psiTurk (Gureckis et al., 2015). Participants were paid $2.50, and there was no special incentive for high accuracy. The experiment took an average of 11:53 minutes, with a range from 3:52 to 36:51 minutes.

Participants were shown examples from 24 different recursive visual concepts and asked to classify new examples (Fig. 3). (One trial was removed after data collection because two visually identical distractors were mistakenly included.) Each of the 24 trials introduced a separate concept, and participants made one classification judgment per trial. No feedback was provided, in order to prevent supervised learning during the task. The instructions specified that each trial introduced “a new type of alien crystal” that had infected a surface and had been growing for some time. Participants were asked to predict what the crystal would look like as it continued to grow, choosing among six images. The stimuli were quite visually complex, and participants could magnify the details by rolling their mouse over a particular image area. After reading the instructions, participants were quizzed on their content and cycled back to re-read until they answered all of the comprehension questions correctly (Crump et al., 2013).

Participants were assigned to one of two conditions that differed in the number of training examples: the “incremental” condition observed each step of growth up to D = 2 (Fig. 3B), and the “block” condition observed only the final step of growth (D = 2; Fig. 3A). The incremental condition is an instance of the Visual Recursive Task, and previous work has shown that both children and adults can perform it successfully (Martins et al., 2014, 2015). Our aim differs in that the incremental classification experiment is just the simplest of the evaluations we study: the more challenging block condition requires generalization from just a single static example of the concept (one-shot learning), and the generation task (Experiment 2) probes richer forms of generalization beyond classification.

The 24 visual concepts were created by sampling L-systems from the BPL prior distribution. The provided examples were unrolled to depth D = 2, and the task was to choose the natural extension (formally, the next iteration in depth, D = 3). The distractors in the forced-choice judgments were created by taking the example stimulus at D = 2 (the “After infection” image in Fig. 3A or the “Step 2” image in Fig. 3B) and applying the expansion rules from a different L-system. The 24 concepts were sampled from the prior with a few additional constraints that standardized the concepts and eased cognitive penetrability: the fractal grows upwards, the turtle does not cross over her own path, and the F-rule expansion does not allow two adjacent straight-line symbols. Distractors were sampled without these constraints to ensure a sufficient variety of continuation types.

To perform the classification, BPL uses the last sample produced by MCMC to approximate the posterior predictive distribution, which is either P(I_new | I_1, …, I_n) (incremental condition with known depths) or P(I_new | I) (block condition with a single exemplar at unknown depth D). Each possible answer choice is scored according to this distribution, and the image option with the highest posterior predictive probability is selected.
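In code form, the decision rule is a simple argmax over candidate images. This sketch assumes a `log_likelihood` scoring function as in the earlier fragments, with illustrative names throughout.

```python
# Sketch of the six-way classification decision: score each candidate image
# under the predictive distribution implied by the last MCMC sample psi,
# and pick the argmax.
def classify(psi, candidates, target_depth, log_likelihood):
    """Return the index of the candidate image with the highest predictive score."""
    scores = [log_likelihood(img, psi, target_depth) for img in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```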

Results

Overall, participants extrapolated in ways consistent with the underlying program (Fig. 4A). The average accuracy across participants was 64.9%, significantly better than the chance rate of 16.7%. Neither the pre-trained deep ConvNet nor the modified Hausdorff distance could classify the images better than chance (accuracy was 4.4% and 17.4%, respectively, choosing the test image that most closely matched the last training image). There was no significant difference in average accuracy between the incremental and block conditions, and average item accuracy was correlated across the two conditions. Evidently, people could intelligently reason about latent generative processes: seeing the incremental steps did not noticeably help with inferring the underlying L-system, indicating a powerful inductive ability.

There was substantial variability in participant performance. Overall accuracy was correlated with time spent on the experiment. In a post-experiment survey, participants who reported recognizing the stimuli as fractals performed better than those who did not; importantly, even participants who did not recognize the stimuli as fractals performed above chance. Additionally, the degree of familiarity with fractals and whether or not a participant was a computer programmer did not significantly predict accuracy, which is noteworthy since recursion is an important technique in computer programming.

Classification with BPL was simulated using MCMC for posterior inference. With enough samples, the model is a perfect classifier. To achieve high performance, the model depends crucially on its ability to learn recursive programs. In the block condition, the posterior mode program (aggregated across chains) was always recursive and correctly predicted the ground truth recursive depth, across all of the trials. In contrast, the non-recursive BPL model failed to classify new examples correctly, achieving only 30.4% accuracy, performing at the level of the simple Euclidean distance metric.

BPL can classify perfectly given enough samples, but people are not perfect classifiers. The behavioral deviations from this ideal can be modeled as failures of search (or limited sampling; Lieder et al., 2012; Vul et al., 2014; Bramley et al., 2017), with individual participants simulated as separate MCMC chains, some of which may fail to discover a suitable explanation of the data. With 15 simulated participants in each condition, 240 Metropolis-Hastings proposals per chain were sufficient to match human-level accuracy when decisions are based on the last sample (64.5% correct on average). Indeed, with this algorithmic limitation, BPL’s predictions for which concepts are easier or harder to learn were moderately associated with human performance, in that mean accuracy for the model and the participants was correlated across items. Unlike the alternatives, BPL can both make correct classification decisions and predict which decisions are easier or harder for people.

Note, however, that it is possible that people used a heuristic to make classification decisions and that this heuristic only approximated BPL. In particular, choosing the image that contains a smaller (possibly rotated) version of the example image is closely related to, but distinct from, learning a recursive visual program. Although this heuristic does solve the classification task, it does not help to explain, as BPL does, why some concepts are easier to learn than others. Moreover, it does not specify how to generate new examples of the concept, which is the task evaluated next.

Experiment 2: Generation

This experiment examines how people generate new examples of a recursive visual concept. Compared to the classification experiment, this is a more difficult and unconstrained form of generalization that further explores the boundaries of the human ability.

Methods

Thirty participants in the United States were recruited on Amazon Mechanical Turk using psiTurk. As before, participants were paid $2.50 and there was no special incentive for high accuracy. The experiment took an average of 15:02 minutes with a range of 6:35 to 30:17 minutes.

The procedures were adapted from Experiment 1 and were the same except where noted. As before, participants were randomly assigned to either the incremental or the block condition. There were 13 trials, each with a novel concept; example trials are shown in Fig. 5. In the incremental condition, participants saw novel concepts with three steps of growth (unrolled to depths D = 1, 2, and 3), and they were asked to predict just one additional step (D = 4) beyond the most mature exemplar. In the block condition, participants saw just the last, most mature step of growth (D = 3) and were asked to demonstrate what the crystal would look like as it continued to grow (D = 4).

Participants used a custom web interface to generate the new example (Fig. 5ii). Clicking on a line segment toggles it from a deactivated state to an activated state (turning a “G” into an “F” in the symbolic language), or vice versa. Moving the mouse over a line segment highlights it in another color, and the color reveals how a click would affect the state. When highlighted green, clicking the segment sprouts a growth, activating it. When highlighted red, the segment is already activated and clicking it causes it to deactivate. Shortcut buttons allow participants to activate or deactivate all of the segments with a single click. Participants could interact with the display for as long as they needed.

Participants were informed that the order of their actions was not important, and only the final product mattered. For some of the concepts in the previous experiment, the segments were too small for participants to effectively see and click; thus, this experiment used the 13 concepts from the classification experiment with the largest line segments. For 3 of the 13 concepts, some of the segments are redundant in that they create the same visual growth when activated as other segments (see the example in Fig. 5A-ii). For this reason, accuracy was scored according to the resulting visual form rather than the individual clicks, since different click patterns could result in the same visual pattern.

Four participants were excluded from the analysis. Two participants activated all of the growths for every single trial, and one participant did so for all but one trial. The data failed to record for one participant.

To generate a new example, BPL uses the last sample from MCMC to approximate the posterior predictive distribution, which is either P(I_new | I_1, …, I_n) (incremental condition with known depths) or P(I_new | I) (block condition with a single exemplar at unknown depth D). BPL makes a response using the same interactive display that participants used, allowing for 2^k possible responses given a display with k expandable segments. To make a choice, the model visits each segment once in random order and greedily decides whether or not to expand it in order to maximize the posterior predictive probability.
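This greedy policy can be sketched as follows, where `score` stands in for the posterior predictive log-probability of the image produced by a given activation pattern; all names are illustrative.

```python
# Sketch of BPL's greedy response policy on the interactive display: visit
# segments in random order and keep a toggle only when it raises the score.
import random

def greedy_generate(segments, score):
    """segments: list of booleans (activated?); score: maps an activation
    pattern to the posterior predictive log-probability of its image."""
    state = list(segments)
    current = score(state)
    order = list(range(len(state)))
    random.shuffle(order)  # visit each segment exactly once, in random order
    for i in order:
        flipped = state.copy()
        flipped[i] = not flipped[i]  # toggle "G" <-> "F" for this segment
        candidate = score(flipped)
        if candidate > current:      # keep the toggle only if it helps
            state, current = flipped, candidate
    return state
```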

Figure 5: Generating new examples of recursive visual concepts. Responses for individual participants are shown for two trials (A and B) with different concepts (i). The incremental condition observed all three steps, while the block condition observed just step 0 and step 3. An interactive display allowed participants to grow the figure by clicking on line segments of the example at step 3 (ii). Generated examples are shown in (iii) and (iv), and the number below each figure is the number of participants who generated it. The most frequently generated stimulus was correct in all cases except A-iv (not all responses shown for this group).

Results

Participants generated exemplars that were highly structured and generally consistent with the underlying program, although there was substantial variability across participants (Fig. 5iii-iv). Since each trial consists of many individual judgments (stimuli ranged between 22 and 125 segments), the accuracy of a trial was first computed by averaging across individual segment decisions, and the accuracy of a participant was then computed by averaging across trials. Since growths tend to be sparse in the ground-truth concepts, a baseline that deactivates all segments achieves 57.7% correct. Both groups performed significantly better than this baseline (Fig. 4B): the incremental condition achieved 89.2% correct and the block condition achieved 70.0% correct, and the difference in means between the groups was also significant.

Remarkably, participants were able to generate precisely the right example on a substantial number of trials, where an example was marked as correct only if every individual segment was correct (Fig. 4B). A random responder is effectively guaranteed to perform at 0% correct on this measure, since even the simplest trial has over 4 million possible responses; alternatively, a baseline that activates all segments achieves 23.1% correct. Participants in the incremental condition produced precisely the right exemplar in 59.9% of cases, while participants in the block condition did so in 24.4% of cases (a significant difference in means). Although both groups were far better than the random baseline, only the incremental group was significantly better than the complete-activation baseline on this conservative measure of accuracy. Thus, participants were accurate both in individual decisions and in aggregate, producing up to 125 correct decisions to generate the right example. In the most difficult condition, participants produced exactly the right example only about a quarter of the time, even though their accuracy on individual decisions was 70%, suggesting people’s inductive abilities were nearing their limits.

The BPL model can also use the interactive interface to generate new examples. If the MCMC simulation is run for long enough, BPL achieves perfect performance, demonstrating that it can successfully produce new examples of recursive visual concepts. As with classification, we used short MCMC chains to simulate individual participants and modeled responses based on the last sample. The number of MCMC steps was fit to match human accuracy for generating exactly the right example: 160 steps matched the incremental group and 80 steps matched the block group. Unlike in the classification task, MCMC alone did not predict which items were easier for participants, and model and participant accuracies were not significantly correlated in either condition. Instead, properties of the response interface were the driving factors in item difficulty. Examples that could be generated with one or two well-placed actions were easier, while examples that required many actions were more difficult. For instance, the concept in Fig. 5B can be correctly extrapolated by activating all the segments with the appropriate shortcut button, while the concept in Fig. 5A requires eight individual activation actions. Assuming participants can begin from a fully activated or deactivated display, item accuracy was predicted by the number of actions required to produce the correct exemplar, in both the incremental and block conditions. This effect can be reproduced by the BPL model with response noise when acting through the response interface, although in our simulations this did not account for additional variance beyond the number of optimal actions.

BPL demonstrates the computational abilities necessary for solving the generation task, differentiating itself from a range of alternatives. The deep ConvNet, modified Hausdorff distance, Euclidean distance, and non-recursive BPL all fail to provide reasonable responses on this task: for these algorithms, the best response is always to (incorrectly) create a new exemplar with zero growths activated, since it is maximally similar to the previous exemplar. Program-like representations, in contrast, provide an account of how a range of generalizations is possible from only the briefest exposure to a new concept.

4 Discussion

Compared to the best object recognition systems, people learn richer concepts from fewer examples. Recent research in cognitive science, machine learning, and computer vision has begun to model learning as a form of program induction (Lake et al., 2015; Goodman et al., 2015; Stuhlmuller et al., 2010; Piantadosi et al., 2012; Khemlani et al., 2013; Savova & Tenenbaum, 2008; Savova et al., 2009; Zhu & Mumford, 2006; Ellis et al., 2015, 2018), yet there are steep computational obstacles to building a general-purpose program learner, and it remains unclear how the mind could learn genuine programs in a general-purpose way.

Here, we explored the boundaries of the human ability to learn programs from examples. We probed several key dimensions in a concept learning task: the difficulty of the concepts, the number of examples, and the format of generalization. We found that people could both classify and generate new examples in ways consistent with a Bayesian program learning model, even though the model was provided with substantial knowledge about the structure of the concept space. In a classification task that fools the best object recognition algorithms, participants responded with high accuracy and in accordance with the underlying recursive structure, regardless of whether they saw just one or two examples (block versus incremental condition). In a more challenging generation task, people constructed new examples that were consistent with the underlying programs. For generation, additional examples provided a boost in accuracy (three examples versus one example), while the one-shot case proved taxing and approached the boundary of people’s inductive abilities.

People’s success contrasts with the performance of pre-trained object recognition systems (ConvNets) and other pattern recognition techniques that do not explicitly represent causal processes. Although feature-based approaches have proven effective in machine learning and computer vision, causal modeling and program induction hold promise for additional advances. Causal representations are central to human perceptual and conceptual abilities (Murphy & Medin, 1985; Gelman, 2003; Leyton, 2003; Rehder & Hastie, 2001; Bever & Poeppel, 2010), and they can also form the basis for high-performance classification and prediction algorithms (Lake et al., 2015; Pearl, 2019). Causal models can help explain wide variations in appearance without the need for extensive training data – for instance, highlighting the commonalities between young and old trees of the same species, despite dramatic differences in their superficial features (Fig. 1A). Causal knowledge can also inform other types of everyday conceptual judgments (Fig. 1B): Is this tree growing too close to my house? What will it look like next summer as it continues to grow?

There are several straightforward and meaningful extensions of the representation language studied here (Prusinkiewicz & Lindenmayer, 1990). Although the current concepts have some stochastic properties, including the depth of recursion and the stochastic renderer, they are more deterministic than their natural analogs (Fig. 1C). Our concepts grow in an orderly sequence of recursive steps, while natural growth is more stochastic and flexible. Simple extensions to our language can produce these characteristics. Both failures to grow and spontaneous growths can be modeled with additional stochasticity: before applying the re-write rules at each step, the model could allow “F” (growth) symbols to mutate to “G” (non-growth) symbols, and vice versa, with a small probability. Moreover, context-sensitive L-systems produce richer and more naturalistic concepts (Prusinkiewicz & Lindenmayer, 1990), which would provide a more powerful language for studying other types of concepts. Finally, people likely have primitives for simple shapes like triangles, squares, and rectangles; the current BPL implementation can produce these shapes but does not represent them differently from other contours. Providing models with a richer set of primitives could further help close the gap between human and machine performance. All of these extensions provide additional opportunities to study program induction in more challenging contexts, with the aim of further illuminating its relationship to conceptual representation.

Our tasks pose a new challenge problem for machine learning. Although one could explore different pre-training regimens for deep ConvNets, which could improve their performance on our tasks, an alternative is to explore learning generative, program-like representations with neural networks. There has been significant recent interest in “differentiable programming,” or using neural networks to learn simple types of programs from examples, including sorting (Graves et al., 2014), arithmetic (Weston et al., 2015), and finding shortest paths in a graph (Graves et al., 2016). It is unclear whether this toolkit can tackle the tasks studied here, but it is a worthy pursuit. Domain-general program induction remains an important computational and scientific goal, with the potential to deepen our understanding of how people learn such rich concepts, from such little data, across such a wide range of domains.

Overall, our results suggest that the best program learning techniques will likely need to include explicit – or easily formulated – high-level computational abstractions. As demonstrated here, people can learn visual concepts with rich notions of recursion, growth, and graphical rendering from just one or a few examples. Computational approaches must similarly engage rich algorithmic content to achieve human-level concept learning.

Acknowledgments

We gratefully acknowledge support from the Moore-Sloan Data Science Environment, and we thank the NYU ConCats group for helpful discussion and suggestions. We thank Neil Bramley for providing helpful comments on a preliminary draft.

Appendix: Generating L-systems from the meta-grammar

The L-system axiom is always “F”. The turning angle is sampled uniformly from a discrete set of possible angles. The constrained F-rule is generated as follows. The start symbol produces three non-terminals: “X” (prefix), “Y” (body), and “Z” (middle). After each non-terminal grounds out as a string of terminals, the right-hand side of the F-rule is defined by the concatenation x·y·z·rev(y)·rev(x), where rev(·) reverses a string. For instance, if x = “G-”, y = “G+”, and z = “F”, the rule F → G-G+F+G-G is produced. The G-rule is defined deterministically given the F-rule, so that all line segments (“F” and “G”) grow at the same rate at each iteration.
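Read this way (the concatenation order above is reconstructed from the worked example), the F-rule construction is a one-liner; `make_f_rule` is an illustrative name.

```python
# Sketch of the constrained F-rule construction: prefix, body, middle,
# then the reversed body and reversed prefix, yielding a symmetric rule.
def make_f_rule(x: str, y: str, z: str) -> str:
    """Concatenate x, y, z, rev(y), rev(x) into the F-rule's right-hand side."""
    return x + y + z + y[::-1] + x[::-1]

print(make_f_rule("G-", "G+", "F"))  # G-G+F+G-G
```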

References

• Bever, T. G., & Poeppel, D. (2010). Analysis by synthesis: A (re-)emerging program of research for language and vision. Biolinguistics, 4, 174–200.
• Bloom, P. (2000). How children learn the meanings of words. Cambridge, MA: MIT Press.
• Bramley, N. R., Dayan, P., Griffiths, T. L., & Lagnado, D. A. (2017). Formalizing Neurath’s ship: Approximate algorithms for online causal learning. Psychological Review.
• Buffart, H., Leeuwenberg, E., & Restle, F. (1981). Coding theory of visual pattern completion. Journal of Experimental Psychology: Human Perception and Performance, 7(2), 241–274.
• Corballis, M. C. (2014). The recursive mind: The origins of human language, thought, and civilization. Princeton, NJ: Princeton University Press.
• Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3).
• Dubuisson, M. P., & Jain, A. K. (1994). A modified Hausdorff distance for object matching. In International Conference on Pattern Recognition (pp. 566–568).
• Ellis, K., Ritchie, D., Solar-Lezama, A., & Tenenbaum, J. B. (2018). Learning to infer graphics programs from hand-drawn images. In Advances in Neural Information Processing Systems 31.
• Ellis, K., Solar-Lezama, A., & Tenenbaum, J. B. (2015). Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems.
• Fodor, J. (1975). The language of thought. Harvard University Press.
• Gelman, S. A. (2003). The essential child: Origins of essentialism in everyday thought. New York, NY: Oxford University Press.
• Goodman, N. D., Mansinghka, V. K., Roy, D. M., Bonawitz, K., & Tenenbaum, J. B. (2008). Church: A language for generative models. In Uncertainty in Artificial Intelligence.
• Goodman, N. D., Tenenbaum, J. B., Feldman, J., & Griffiths, T. L. (2008). A rational analysis of rule-based concept learning. Cognitive Science, 32(1), 108–154.
• Goodman, N. D., Tenenbaum, J. B., & Gerstenberg, T. (2015). Concepts in a probabilistic language of thought. In E. Margolis & S. Laurence (Eds.), Concepts: New directions. Cambridge, MA: MIT Press.
• Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint. http://arxiv.org/abs/1410.5401v1
• Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., … Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature.
• Gureckis, T. M., Martin, J., McDonnell, J., Alexander, R. S., Markant, D. B., Coenen, A., … Chan, P. (2015). psiTurk: An open-source framework for conducting replicable behavioral experiments online. Behavior Research Methods.
• Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598), 1569–1579.
• He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint. http://arxiv.org/abs/1512.03385
• Hinton, G. E., & Nair, V. (2006). Inferring motor programs from images of handwritten digits. In Advances in Neural Information Processing Systems 18 (pp. 515–522).
• Khemlani, S. S., Mackiewicz, R., Bucciarelli, M., & Johnson-Laird, P. N. (2013). Kinematic mental simulations in abduction and deduction. Proceedings of the National Academy of Sciences, 110(42), 16766–16771.
• Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (pp. 1097–1105).
• Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.
• Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2019). The Omniglot challenge: A 3-year progress report. arXiv preprint. http://arxiv.org/abs/1902.03477
• Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, e253.
• LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
• Leeuwenberg, E. L. (1969). Quantitative specification of information in sequential patterns. Psychological Review, 76(2), 216–220.
• Leyton, M. (2003). A generative theory of shape. Springer.
• Lieder, F., Griffiths, T. L., & Goodman, N. D. (2012). Burn-in, bias, and the rationality of anchoring. In Advances in Neural Information Processing Systems 25.
• Lindenmayer, A. (1968). Mathematical models for cellular interactions in development. I. Filaments with one-sided inputs. Journal of Theoretical Biology, 18, 280–299.
• Mandelbrot, B. B. (1983). The fractal geometry of nature. San Francisco, CA: W. H. Freeman and Company.
• Martins, M. D., Laaha, S., Freiberger, E. M., Choi, S., & Fitch, W. T. (2014). How children perceive fractals: Hierarchical self-similarity and cognitive development. Cognition, 133(1), 10–24.
• Martins, M. D., Martins, I., & Fitch, W. (2015). A novel approach to investigate recursion and iteration in visual hierarchical processing. Behavior Research Methods.
• Mĕch, R., & Prusinkiewicz, P. (1996). Visual models of plants interacting with their environment. In Proceedings of SIGGRAPH (pp. 397–410).
• Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: MIT Press.
• Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92(3), 289–316.
• Overlan, M. C., Jacobs, R. A., & Piantadosi, S. T. (2016). A hierarchical probabilistic language-of-thought model of human visual concept learning. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.
• Pearl, J. (2019). The book of why: The new science of cause and effect. Basic Books.
• Piantadosi, S. T. (2011). Learning and the language of thought. Doctoral dissertation, Massachusetts Institute of Technology.
• Piantadosi, S. T. (2014). LOTlib: Learning and inference in the language of thought.
• Piantadosi, S. T., Tenenbaum, J. B., & Goodman, N. D. (2012). Bootstrapping in a language of thought: A formal model of numerical concept learning. Cognition, 123(2), 199–217.
• Prusinkiewicz, P., & Lindenmayer, A. (1990). The algorithmic beauty of plants. Springer-Verlag.
• Rehder, B., & Hastie, R. (2001). Causal knowledge and categories: The effects of causal beliefs on categorization, induction, and similarity. Journal of Experimental Psychology: General, 130(3), 323–360.
• Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge.
• Savova, V., Jakel, F., & Tenenbaum, J. B. (2009). Grammar-based object representations in a scene parsing task. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.
• Savova, V., & Tenenbaum, J. B. (2008). A grammar-based approach to visual category learning. In Proceedings of the 30th Annual Conference of the Cognitive Science Society.
• Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
• Smith, L. B., Jones, S. S., Landau, B., Gershkoff-Stowe, L., & Samuelson, L. (2002). Object name learning provides on-the-job training for attention. Psychological Science, 13(1), 13–19.
• Stuhlmuller, A., Tenenbaum, J. B., & Goodman, N. D. (2010). Learning structured generative concepts. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society.
• Vul, E., Goodman, N., Griffiths, T. L., & Tenenbaum, J. B. (2014). One and done? Optimal decisions from very few samples. Cognitive Science, 38(4), 599–637.
• Weston, J., Chopra, S., & Bordes, A. (2015). Memory networks. In International Conference on Learning Representations (ICLR).
• Xu, F., & Tenenbaum, J. B. (2007). Word learning as Bayesian inference. Psychological Review, 114(2), 245–272.
• Zhu, S. C., & Mumford, D. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259–362.