Generative Grading: Neural Approximate Parsing for Automated Student Feedback

05/23/2019 ∙ by Ali Malik, et al. ∙ Stanford University

Open access to high-quality education is limited by the difficulty of providing student feedback. In this paper, we present Generative Grading with Neural Approximate Parsing (GG-NAP): a novel approach for providing feedback at scale that is capable of both accurately grading student work and providing verifiability, a property whereby the model is able to substantiate its claims with a provable certificate. Our approach uses generative descriptions of student cognition, written as probabilistic programs, to synthesise millions of labelled example solutions to a problem; it then trains inference networks to approximately parse real student solutions according to these generative models. We achieve feedback prediction accuracy comparable to professional human experts in a variety of settings: short-answer questions, programs with graphical output, block-based programming, and short Java programs. In a real classroom, we ran an experiment where humans used GG-NAP to grade, yielding doubled grading accuracy while halving grading time.




1 Introduction

Computer-assisted education promises open access to world-class instruction and a reduction in the growing cost of learning [2]. A major barrier to this promise of scalable education is the need to automatically provide feedback on student work.

Learning to provide feedback has proven to be a hard machine learning problem. Despite dozens of projects that combine massive education data with cutting-edge deep learning in NeurIPS and beyond [16; 1; 28; 23; 15; 10], most approaches fall short. Five issues have emerged: (1) student work is Zipf distributed, so most incorrect solutions are unique even in large corpora, (2) student work is hard and expensive to label, (3) we want to provide feedback (without historical data) for even the very first student, (4) there is a high human cost to inaccurate predictions, and (5) predictions must be explainable to instructors and students. These challenges are typical of many human-centred AI problems, such as diagnosing rare diseases.

Rather than labelling student solutions, experts are much more adept at thinking “generatively”: they can easily imagine the misconceptions a student might have, and construct the space of solutions a student with these misconceptions would produce. Recently, Wu et al. [26] used this intuition to show that a neural network trained on samples from a teacher-written probabilistic context free grammar (PCFG) describing student decisions outperforms a data-hungry supervised neural network and a deep generative model [24]. While groundbreaking, it is difficult for experts to write cognitive models in the form of PCFGs when assignments are complex and open-ended. Further, the inference techniques of [26] do not scale well to very large grammars. The technical contributions of our work address these issues by introducing a more flexible generative model class for describing assignment solutions and providing an inference technique able to handle this class.

In this paper we introduce probabilistic program based grammars, an expressive class that allows for the functional transformations and complex variable dependencies supported by probabilistic programming languages (PPLs). However, with added power, this class presents a challenging inference problem. In response, we develop Generative Grading with Neural Approximate Parsing (GG-NAP) with two important ideas: (1) to handle complex context-sensitivity, GG-NAP learns to parse as a form of compiled inference, and (2) to handle the long-tailed nature of these generative distributions, GG-NAP is trained via “adaptive” sampling that ensures sufficient resolution in the tails. While we explore it for the education domain, we believe GG-NAP will be useful for many simulation-based modelling applications.

When we apply GG-NAP to open-access datasets we are able to grade student work with close to expert human-level fidelity, substantially improving upon the state of the art across a spectrum of public education datasets: introduction to computer programming, short answers to a US citizenship test, and graphics-based programs. We show a 50%, 160% and 350% improvement above the state of the art, respectively. When used with human verification in a real classroom, we are able to double grading accuracy while halving grading time. Moreover, the grading decisions made by our algorithm are auditable and interpretable by an expert teacher. Our algorithm is “zero-shot” and thus works for the very first student. Since predicted labels correspond to meaningful cognitive states, not merely grades, they can be used in many ways, for example to give hints to students without teachers or to help teachers understand their class.

Figure 1: (a) Datasets in computational education: the prompt and example solutions for four problems, ranging from programming assignments to history tests. (b)-(e) Rank-frequency plots for P8, CS1: Liftoff, Pyramid, and Powergrading. These datasets all have Zipf-like distributions, as shown by the linear relationship between frequency and rank in log space. This phenomenon holds across several modalities: image rendering (Pyramid), natural language (Powergrading), block-based code (P8), and Java code (Liftoff).

1.1 Datasets

We consider four educational contexts. Fig. 1(a) shows example solutions for each problem.

P8 (Block Coding)

Wu et al. [26] released a dataset of student responses to 8 block-coding exercises that involve drawing shapes with nested loops. We take the most difficult problem, drawing polygons with an increasing number of sides, which has 302 human-graded responses with 26 labels regarding looping and geometry (e.g. “missing for loop” or “incorrect angle”).

Powergrading (Language)

Powergrading [1] contains 700 responses to a US citizenship exam, each graded for correctness by 3 humans. Responses are in natural language, but are typically short (average of 4.2 words). We focus on the most difficult question, as measured by [17]: “name one reason the original colonists came to America". Responses span economics, politics, and religion.

PyramidSnapshot (Graphics)

PyramidSnapshot is a university CS1 course assignment intended to be a student’s first exposure to variables, objects, and loops. The task is to build a pyramid using Java’s ACM graphics library. The dataset is composed of images of rendered pyramids from intermediate “snapshots” of student work. Yan et al. [28] annotated 12k unique snapshots with 5 categories representing “knowledge stages” of understanding.

Liftoff (Java)

Liftoff is another assignment in a CS1 course. Students write a program that prints a countdown from 10 to 1 followed by the phrase "Liftoff". We use GG-NAP with human verification to grade 176 solutions from a semester of students and measure accuracy and grading time.

1.2 The Grading Task

Figure 2: A visual representation of a grammar for the Powergrading dataset.

There are two important machine learning tasks related to grading. First, auto-predicting feedback, or labelling a given student solution with meaningful mistakes. Second, verifiable nearest neighbour, an alternative when the cost of grading errors is high, in which the algorithm produces a nearest neighbour example whose feedback can be verified with respect to the expert grammar. This system can work with a human-in-the-loop, who focuses on the differences between solutions, to achieve super-human grading precision while reducing grading time.

1.3 Generative Grading

We approach these grading problems by having an expert describe the decisions students make and their resulting answer to an assignment. If we can instantiate these expert priors as a real generative model (e.g. a grammar), then we possess a simulator from which we can sample unlimited amounts of labelled data, allowing for zero-shot (in terms of real data) learning. While generating solutions to large problems is difficult, representing the prior as a hierarchical set of decisions allows decomposition of this hard task into simpler ones, making it surprisingly easy for experts to express their knowledge. The challenge is then defining a robust enough class of probabilistic models (Sec. 2.1) that can capture the complexities of expert priors (and student behaviour), and constructing the machinery needed to infer student thinking from their solutions (Sec. 2.2). Figure 2 shows a pictorial representation of the generative model we use for the Powergrading task and samples that it produces. Theoretical inspiration for our grammars derives from Brown’s “Repair Theory”, which argues that the best way to help students is to understand the generative origins of their mistakes [4].

2 Neural Parsing for Inference in Grammars

In this section, we define the class of grammars called Probabilistic Program Grammars and describe several motivating properties that make them useful for generative grading.

2.1 Probabilistic Program Grammar

We aim to describe a class of grammars powerful enough to easily encode any instructor’s knowledge of the student decision-making process. While it is easy to reason about context free grammars, context independence is a strong restriction that generally limits what instructors can express. As an example, imagine capturing the intuition that students can write a for loop two ways:

for (int i = 0; i < 10; i++) { println(10 - i); }  // version 1
for (int n = 10; n > 0; n-=1) { println(n); }      // version 2

Clearly, the choices of loop header (i < 10; i++) and print statement depend on the start index (i = 0) and the choice of variable name (i), as do future decisions like off-by-one errors. Coordinating these decisions in a context-free grammar requires a great profusion of nonterminals and production rules, which are burdensome for a human to create. More generally, the ability to express arbitrary functional relationships between variables in the grammar is crucial for real-world applications, for instance to capture method decomposition in programming code or tense and sentence structure in natural language, which are basic building blocks for a good generative model in education.

We thus introduce a broader class of grammars called Probabilistic Program Grammars (PPGs) that enable us to condition choices on previous decisions and a globally accessible state. A Probabilistic Program Grammar $G$ is more rigorously defined as a subclass of general probabilistic programs, equipped with a tuple $(N, \Sigma, S, Q, \Pi)$ denoting a set of nonterminals, a set of terminals, a start node, a global state, and a set of probabilistic programs, respectively. A production from the grammar is a recursive generation from the start node $S$ to a sequence of terminals based on production rules. Unlike PCFGs, a production rule is described by a probabilistic program $\pi \in \Pi$ so that a given nonterminal can be expanded in different ways based on samples from random variables in $\pi$, the shared state $Q$, and contextual information about other nonterminals rendered in the production. Further, the production rule can also modify the global state $Q$, thus affecting the behaviour of future nonterminals. Lastly, the PPG can transform the final sequence of terminals into an arbitrary space (e.g. from strings to images) to yield the production $y$. Each derivation is associated with a trajectory $x = \{(i_t, z_{i_t})\}_{t=1}^{T}$ of nonterminals encountered during execution (note that the length $T$ of the trajectory can vary for different $y$). Here, $i_t$ denotes a unique lexical identifier for each random variable encountered in order and $z_{i_t}$ stores the specific value that was sampled. Define the joint distribution (induced by $G$) over trajectories and productions as $p_G(x, y)$. We refer to the procedure of generating a sample $(x, y) \sim p_G(x, y)$ as executing $G$.
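To make the definition concrete, the following is a minimal sketch of a PPG-style generator for the countdown loop discussed above. It is an illustration under stated assumptions, not the authors' grammar: the function `liftoff_ppg`, the decision names, and all probabilities are hypothetical. It shows the three ingredients named in the text: a shared state, rules that branch on earlier decisions, and a recorded trajectory whose length varies between productions.

```python
import random


def liftoff_ppg(rng):
    """Toy PPG for a countdown loop (hypothetical, not the authors' grammar).

    Returns (trajectory, production). The trajectory lists every
    (identifier, sampled value) pair in the order it was encountered.
    """
    trajectory = []
    state = {}  # global state visible to every production rule

    def choose(name, options, weights):
        value = rng.choices(options, weights=weights, k=1)[0]
        trajectory.append((name, value))
        return value

    # Early decisions are written to the shared state.
    state["direction"] = choose("direction", ["up", "down"], [0.5, 0.5])
    state["var"] = choose("var_name", ["i", "n", "count"], [0.6, 0.3, 0.1])
    v = state["var"]

    # Later rules branch on the state, which a context-free rule cannot do.
    if state["direction"] == "up":
        header = f"for (int {v} = 0; {v} < 10; {v}++)"
        body = choose("up_body", ["correct", "off_by_one"], [0.8, 0.2])
        expr = f"10 - {v}" if body == "correct" else f"9 - {v}"
    else:
        step = choose("decrement_style", ["postfix", "compound"], [0.7, 0.3])
        cond = choose("down_condition", [">", ">="], [0.8, 0.2])
        update = f"{v}--" if step == "postfix" else f"{v} -= 1"
        header = f"for (int {v} = 10; {v} {cond} 0; {update})"
        expr = v

    production = f"{header} {{ println({expr}); }}"
    return trajectory, production


trajectory, production = liftoff_ppg(random.Random(0))
```

Counting up records three decisions while counting down records four, which mirrors the remark that the trajectory length depends on the production.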

Given such a grammar, we are interested in parsing: the task of mapping a production $y$ to the most likely trajectory $x$ that could have produced $y$. This is a difficult search problem: the number of trajectories grows exponentially even for simple grammars, and common methods for parsing by dynamic programming (Viterbi, CYK) are not applicable in the presence of context-sensitivity and functional transformations. To make this problem tractable, we present deep neural networks to approximate the posterior distribution over trajectories. We call this approach neural approximate parsing with generative grading, or GG-NAP.

2.2 Neural Inference Engine

Inference over trajectories is difficult. Trajectories can vary in length and contain nonterminals with different support. To approach this with neural nets, we decompose the inference task into a set of easier sub-tasks. The posterior distribution over a trajectory $x = (x_1, \dots, x_T)$ given a production $y$ can be written as the product of individual posteriors over each nonterminal $x_t$ using the chain rule:

$$p_G(x \mid y) = \prod_{t=1}^{T} p_G(x_t \mid x_{<t}, y) \qquad (1)$$

where $x_{<t}$ denotes previous nonterminals $x_1, \dots, x_{t-1}$. Eqn. 1 shows that we can learn each posterior $p_G(x_t \mid x_{<t}, y)$ separately. With an RNN, we efficiently represent the influence of previous nonterminals $x_{<t}$ autoregressively using a shared hidden representation over $T$ timesteps. To encode the production $y$, we use standard machinery (e.g. CNNs for images, RNNs for text). To allow for nonterminals with different support, we define three layers for each random variable $x_t$: (1) an index embedding layer that maps the index of $x_t$ to a fixed dimension vector, (2) a value embedding layer that maps the value of $x_t$ to a fixed dimension vector, and (3) an inference layer that transforms the RNN hidden state into parameters of the posterior for the next nonterminal $x_{t+1}$. Thus, the input to the RNN has a fixed size, being the concatenation of the value embedding, index embedding, and production encoding.
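The fixed-size input described above can be sketched without any deep learning library. In this illustration the embeddings are random lists rather than learned parameters, and the names `SUPPORT`, `rnn_input`, `EMBED_DIM`, and `ENCODING_DIM` are assumptions introduced here. The point is only that nonterminals with supports of different sizes still map to an input of identical length.

```python
import random

EMBED_DIM = 4      # the appendix reports 32; kept small for readability
ENCODING_DIM = 6   # size of the (hypothetical) production encoding

rng = random.Random(0)


def random_vector(size):
    """Stand-in for a learned embedding or encoder output."""
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


# Two nonterminals whose supports have different sizes.
SUPPORT = {"direction": ["up", "down"], "var_name": ["i", "n", "count"]}

# One index embedding per nonterminal, one value embedding per possible value.
index_embedding = {name: random_vector(EMBED_DIM) for name in SUPPORT}
value_embedding = {
    name: {value: random_vector(EMBED_DIM) for value in values}
    for name, values in SUPPORT.items()
}


def rnn_input(name, value, production_encoding):
    """Concatenate value embedding, index embedding, and production encoding."""
    return value_embedding[name][value] + index_embedding[name] + production_encoding


encoding = random_vector(ENCODING_DIM)
step_1 = rnn_input("direction", "down", encoding)
step_2 = rnn_input("var_name", "count", encoding)
```

Only the inference layer needs to know the size of each support, since it must output one probability per possible value of the next nonterminal.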

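To train GG-NAP, we optimize the objective

$$\mathcal{L}(\theta) = \mathbb{E}_{p_G(x, y)}\left[\log q_\theta(x \mid y)\right] \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T_m} \log q_\theta\big(x_t^{(m)} \mid x_{<t}^{(m)}, y^{(m)}\big) \qquad (2)$$

where $\theta$ are all trainable parameters and $q_\theta(x \mid y)$ represents the posterior distribution defined by the inference engine (since we are given $G$, we can parameterise $q_\theta$ to be from the correct distributional family). The right-hand expression is a Monte Carlo estimate using a dataset of $M$ samples $\{(x^{(m)}, y^{(m)})\}_{m=1}^{M}$ from $p_G(x, y)$. At test time, given only a production $y$, GG-NAP recursively samples $\hat{x}_t \sim q_\theta(x_t \mid \hat{x}_{<t}, y)$ for $t = 1, \dots, T$ and uses this sample as the input to the next RNN step, as in standard sequence generation models [8].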
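The sketch below illustrates the two operations in this passage with a hand-written stand-in for the inference network: a Monte Carlo estimate of the log-likelihood objective over a sampled dataset, and step-by-step sampling of a trajectory at test time. The function `q_step` and its probabilities are hypothetical and replace the RNN purely for illustration.

```python
import math
import random


def q_step(t, prefix, y):
    """Hypothetical per-step posterior: a dict mapping each value to its probability.

    A trained model would compute this from the RNN hidden state; this stub
    ignores `prefix` and inspects the production string directly.
    """
    if t == 0:  # direction of the loop
        p_up = 0.9 if "++" in y else 0.1
        return {"up": p_up, "down": 1.0 - p_up}
    guess = "i" if "int i" in y else "n"  # variable name
    return {name: (0.8 if name == guess else 0.2) for name in ["i", "n"]}


def log_q(x, y):
    """log q(x | y) as a sum of per-step log posteriors (chain rule)."""
    return sum(math.log(q_step(t, x[:t], y)[x[t]]) for t in range(len(x)))


def monte_carlo_objective(dataset):
    """Average log-likelihood over (trajectory, production) samples."""
    return sum(log_q(x, y) for x, y in dataset) / len(dataset)


def sample_trajectory(y, length, rng):
    """Test-time decoding: each sampled value is fed to the next step."""
    x = []
    for t in range(length):
        dist = q_step(t, x, y)
        values, probs = zip(*dist.items())
        x.append(rng.choices(values, weights=probs, k=1)[0])
    return x


dataset = [
    (["up", "i"], "for (int i = 0; i < 10; i++)"),
    (["down", "n"], "for (int n = 10; n > 0; n--)"),
]
objective = monte_carlo_objective(dataset)
x_hat = sample_trajectory(dataset[0][1], length=2, rng=random.Random(0))
```

In training, the gradient of this estimate with respect to the network parameters would be followed; the stub has no parameters, so only the evaluation is shown.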

2.3 Relationship to Viterbi Parsing

In [26], the authors released PCFGs for two exercises (P1 and P8) that produce code. These grammars are large: P1 has 3k production rules whereas P8 has 263k. Given a PCFG, we compare GG-NAP to Viterbi (CYK) in terms of retrieving the correct trajectory for productions from the grammar. We measure trajectory accuracy: the fraction of nodes that are in both parses.

Table 1: Comparison of Inference and Cost between Viterbi and Neural Parsing

PCFG               Trajectory Acc.
P1 (MAP)           0.943
P1 (best-of-10)    0.987
P8 (MAP)           0.917
P8 (best-of-10)    0.921

PCFG   Parser    # Production Rules   Cost (Sec.)
P1     Viterbi   3k                   0.79 ± 1.2
P1     NAP       3k                   0.17 ± 0.1
P8     Viterbi   263k                 182.8 ± 40.2
P8     NAP       263k                 0.25 ± 0.2

Using 5k samples from each PCFG, we found trajectory accuracies of 94% and 92% for P1 and P8 respectively, meaning that Viterbi and GG-NAP agree in almost all cases. Further, if we draw multiple samples from the GG-NAP posterior and take the best one, we find improvements of up to 4%. In exchange for being approximate, GG-NAP is not restricted to PCFGs, can invert transformations on productions, and is orders of magnitude faster than Viterbi (0.3 vs 183 sec).
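The trajectory accuracy metric can be written in a few lines. The text does not state the denominator, so this sketch assumes it is the number of nodes in the reference (Viterbi) parse; the function name and the example nodes are hypothetical.

```python
def trajectory_accuracy(predicted, reference):
    """Fraction of reference nodes that also appear in the predicted parse.

    Each node is an (identifier, value) pair. Normalising by the reference
    length is an assumption; the source only says "nodes in both parses".
    """
    predicted_nodes = set(predicted)
    shared = sum(1 for node in reference if node in predicted_nodes)
    return shared / len(reference)


reference = [("direction", "up"), ("var_name", "i"), ("body", "correct"), ("start", 0)]
predicted = [("direction", "up"), ("var_name", "i"), ("body", "off_by_one"), ("start", 0)]
accuracy = trajectory_accuracy(predicted, reference)
```

A best-of-k variant would evaluate this function on each of k sampled trajectories and keep the maximum.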

2.4 Verifiable Nearest Neighbour Retrieval

Given a production $y$ from a grammar $G$, the GG-NAP algorithm can provide a verifiable certificate for its predicted parse. Let $\hat{x}$ refer to the inferred trajectory for $y$ and $x^*$ refer to the true (unknown) trajectory. If we repeatedly execute $G$ while fixing the value of each encountered random variable to its value in $\hat{x}$, then we should be able to generate the exact production $y$, showing with certainty that $\hat{x}$ is a valid parse of $y$. In practice, very few samples are needed to recover $y$. On the other hand, if an observation $y$ is not in the grammar (like some real student programs), $x^*$ is not well-defined and the inferred trajectory $\hat{x}$ will be incorrect. However, $\hat{x}$ will still specify a production $\hat{y}$ that we can interpret as an approximate nearest neighbour to $y$ in $G$. Intuitively, we expect $y$ and $\hat{y}$ to be semantically “similar”, as specified by the nonterminals in $\hat{x}$. In practice, we can measure a domain-specific distance between $y$ and $\hat{y}$, e.g. token edit distance for text.
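The certificate check amounts to replaying the grammar with its random choices pinned to the inferred trajectory and comparing the output to the observed production. A minimal sketch follows; `toy_grammar`, its decisions, and the `verify` helper are hypothetical stand-ins, and because this toy grammar has no randomness left once every decision is pinned, a single replay already settles the check.

```python
import random


def toy_grammar(rng, forced=None):
    """Hypothetical countdown grammar; `forced` pins decisions to given values."""
    forced = forced or {}
    trajectory = []

    def choose(name, options):
        value = forced[name] if name in forced else rng.choice(options)
        trajectory.append((name, value))
        return value

    var = choose("var_name", ["i", "n"])
    start = choose("start", [10, 9])
    cond = choose("condition", [">", ">="])
    production = f"for (int {var} = {start}; {var} {cond} 0; {var}--) {{ println({var}); }}"
    return trajectory, production


def verify(inferred_trajectory, observed, attempts=5, seed=0):
    """Return True if replaying the inferred trajectory reproduces `observed`."""
    rng = random.Random(seed)
    forced = dict(inferred_trajectory)
    for _ in range(attempts):
        _, replayed = toy_grammar(rng, forced)
        if replayed == observed:
            return True
    return False


x_true, y = toy_grammar(random.Random(1))
flipped = "i" if x_true[0][1] == "n" else "n"
x_wrong = [("var_name", flipped)] + x_true[1:]
```

When the check fails, the replayed production can still be shown to a grader as the approximate nearest neighbour described above.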

2.5 k-Nearest Neighbour Baseline

We present a strong baseline that is also capable of performing verifiable approximate parsing. This algorithm is simply a $k$-nearest neighbour classifier: we generate and store a dataset $D$ with hundreds of thousands of unique productions as well as their associated trajectories. At test time, given an input to parse, we find its nearest neighbour in the stored samples and return its associated trajectory. If the neighbour is an exact match, the prediction is verifiable. We refer to this baseline as GG-kNN. Depending on the grammar, the production $y$ will be in a different output space (images, code, text) and thus the distance metric used for GG-kNN will be domain dependent.
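For text or code productions, the baseline can be sketched with a token-level Levenshtein distance. This is an illustrative sketch assuming whitespace tokenisation and k = 1; the function names and the stored examples are hypothetical, and a practical version would need an index to search hundreds of thousands of samples efficiently.

```python
def token_edit_distance(a, b):
    """Levenshtein distance between two strings, counted in whitespace tokens."""
    a_tokens, b_tokens = a.split(), b.split()
    previous = list(range(len(b_tokens) + 1))
    for i, token_a in enumerate(a_tokens, 1):
        current = [i]
        for j, token_b in enumerate(b_tokens, 1):
            current.append(min(
                previous[j] + 1,                       # delete token_a
                current[j - 1] + 1,                    # insert token_b
                previous[j - 1] + (token_a != token_b),  # substitute
            ))
        previous = current
    return previous[-1]


def gg_knn(query, stored):
    """Return the trajectory of the closest stored production.

    `stored` is a list of (trajectory, production) pairs sampled from the
    grammar. The third return value says whether the match is exact, in
    which case the prediction is verifiable.
    """
    trajectory, production = min(
        stored, key=lambda item: token_edit_distance(query, item[1])
    )
    return trajectory, production, production == query


stored = [
    ([("direction", "down")], "for ( int n = 10 ; n > 0 ; n -- )"),
    ([("direction", "up")], "for ( int i = 0 ; i < 10 ; i ++ )"),
]
query = "for ( int k = 0 ; k < 10 ; k ++ )"
trajectory, neighbour, exact = gg_knn(query, stored)
```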

2.6 Adaptive Sampling

Figure 3: Good-Turing Estimates

As both GG-kNN and GG-NAP require a dataset of samples for training, we must be able to generate unique productions from a grammar efficiently. For GG-kNN specifically, the number of unique productions strictly defines the quality of the model. However, because the distribution is Zipfian, generating unique data points can be expensive: naive sampling over-samples the most common productions. A second concern is that we do not want to completely ignore the prior distributions defined by the grammar. Otherwise we would sample very unlikely (albeit unique) productions that do not describe student behaviour.

To balance these competing interests, we present a novel method called Adaptive Grammar Sampling that downweights the probabilities of decisions proportional to how many times they lead to duplicate productions. This algorithm has many useful properties and is based on Monte-Carlo Tree Search and the Wang-Landau algorithm from statistical physics. We consider this an interesting corollary and refer the reader to the supplement. Fig. 7 shows an example of how much more efficient this algorithm is than sampling naively from the Liftoff grammar, by plotting the Good-Turing estimates (probability of encountering an unseen program) over the number of samples made so far. In practice, adaptive sampling has a parameter that can be tuned to control how fast we explore the Zipf, allowing us to preserve likely productions from the head and body.
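The Good-Turing estimate mentioned here has a simple closed form: the probability that the next sample is unseen is estimated by the fraction of samples so far that occurred exactly once. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def good_turing_unseen(samples):
    """Estimate the probability that the next sample is new.

    Uses the classic Good-Turing missing-mass estimate N1 / N, where N1 is
    the number of distinct items seen exactly once and N is the sample count.
    """
    counts = Counter(samples)
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(samples)


estimate = good_turing_unseen(["a", "a", "b", "c"])
```

Tracking this value as samples accumulate gives the kind of curve the text describes: it decays towards zero when a sampler keeps returning duplicates.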

3 Results

Figure 4: Summary of results for three datasets. GG-NAP outperforms the old state of the art (SOTA).

For the task of inferring student understanding, we find that GG-NAP beats the previous state-of-the-art (SOTA) by a significant margin in all four educational domains. Further, it approaches (or surpasses) human performance (see Fig. 4). Below, we first describe GG-NAP’s performance on labelled datasets as compared to previous work followed by its performance when used for grading student code in a real classroom.

To evaluate our models, we separately calculate performance for different regions of the Zipf: we define the head as the most popular solutions, the tail as solutions that appear only once or twice, and the body as the rest. As solutions in the head can be memorised, we focus on the body and tail.
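The head, body, and tail split can be sketched as follows. The text fixes the tail (solutions seen once or twice) but not the size of the head, so `head_size` below is a hypothetical parameter, as are the function name and the example data.

```python
from collections import Counter


def split_zipf(solutions, head_size=2, tail_max_count=2):
    """Partition distinct solutions into head, body, and tail by frequency.

    tail: solutions seen at most `tail_max_count` times (once or twice)
    head: the `head_size` most frequent of the remaining solutions (assumed cut-off)
    body: everything else
    """
    regions = {"head": [], "body": [], "tail": []}
    for rank, (solution, count) in enumerate(Counter(solutions).most_common()):
        if count <= tail_max_count:
            regions["tail"].append(solution)
        elif rank < head_size:
            regions["head"].append(solution)
        else:
            regions["body"].append(solution)
    return regions


solutions = ["A"] * 10 + ["B"] * 6 + ["C"] * 3 + ["D"] * 2 + ["E"]
regions = split_zipf(solutions)
```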

3.1 Autonomous Grading

In each domain, we are given a dataset of student programs and labelled feedback. By design, we include each of the labels as a nonterminal in the grammar, thereby reducing prediction to parsing.

Figure 5: Cumulative distribution of token edit distance between student programs and nearest-neighbours produced by various strategies. GG-NAP has 30% exact matches and 55% in 5 token edits. GG-kNN only captures 15%.

GG-NAP sets the new SOTA on the block-coding dataset, beating [26] in both the body and tail, and surpassing human performance (historically measured in F1). There is a rich history of previous work involving supervised classifiers [26; 23] that struggled with the tiny amount of labelled data, resulting in poor performance. Even some zero-shot approaches like [26], which trains an RNN on synthetically labelled samples from an expert-designed PCFG, are significantly below human quality. The potential impact of a human-level autonomous grader is large: the platform hosting these exercises is used by 610 million students worldwide, and has unsuccessfully launched initiatives in the past to crowdsource feedback for student solutions. Instead of thousands of hours of teacher work, GG-NAP could provide the same quality of feedback at scale.


For Powergrading, an open dataset of short answer responses, GG-NAP outperforms the previous SOTA with an F1 score of 0.93, an increase of 0.35 points. We close most of the gap to human performance, measured to be F1 = 0.97 (which generously considers the majority of the three raters to be the gold label). Earlier work either used hand-crafted features for natural language [6] or the latest supervised neural network architecture [17]. With 700 labelled data points, these methods heavily overfit to the training set. Further, since the Powergrading task is unique in that it contains natural language, the PPG we designed had to explain variations both in writing style and in semantic understanding. The strong performance of GG-NAP suggests that even beyond education, the idea of representing expert priors as expressive simulators can be generalised to many domains.


As in the last two cases, GG-NAP is the new SOTA on PyramidSnapshot, outperforming baselines from [28] (kNN between images and a VGG classifier) that are trained on 200 labelled images, with a gain in accuracy of about 50%. Unlike other datasets, PyramidSnapshot includes students’ intermediate work, showing stages of progression through multiple attempts at solving the problem. With our near-human level performance, instructors could use GG-NAP to measure student cognitive understanding over time as students work. This builds a real-time feedback loop between the student and teacher that enables a quick and accurate way of assessing teaching quality and characterising both individual and classroom learning progress. From a technical perspective, since PyramidSnapshot only includes rendered images (and not student code), GG-NAP was responsible for parsing student understanding from unstructured images, a feat not possible with simpler grammars like PCFGs.

3.2 Human Guided Grading

While good performance on benchmark datasets is promising, a true test of an algorithm is its effectiveness in the real world. For GG-NAP, we would like to gauge its impact on grading accuracy and speed in a real classroom setting. We hired a cohort of expert graders (teaching assistants from a large private university with similar experience) to each grade 30 real student solutions to Liftoff, a university course assignment. For each student solution, we also retrieve the auto-graded nearest neighbour using GG-NAP. (As an aside, GG-NAP excels at finding semantically relevant neighbours compared to baseline methods. Fig. 5 compares the token edit distance between the student program and the nearest neighbours retrieved by GG-NAP versus GG-kNN, finding significantly better matches with the former.) For control, half the graders proceed normally, assigning a set of feedback labels measuring understanding of looping concepts by analysing student solutions. The other half of graders additionally have access to (1) the feedback assigned to the nearest neighbour by GG-NAP and (2) a code diff between the student program and the nearest neighbour. Some example feedback labels include “off by one increment", “uses while loop", or “confused > with <". All grading is done on a web application that keeps track of the time taken for the grader to grade a problem.

We found that the average time (to grade 30 problems) for graders with GG-NAP is 507 sec. Without GG-NAP, the average time is 1130 sec, more than twice as long. With GG-NAP, 3 grading errors were made with respect to gold-standard feedback given by the course professor. Without GG-NAP, 8 errors were made. By more than halving both the number of errors and the amount of time, GG-NAP can have a large impact in classrooms today, saving instructors and teaching assistants unnecessary hours and worry over grading assignments.
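The size of these effects follows directly from the four reported numbers. The short calculation below uses only those figures; the variable names are introduced here for illustration.

```python
# Figures reported above for grading 30 Liftoff solutions.
time_with, time_without = 507, 1130      # average seconds per grader
errors_with, errors_without = 3, 8       # errors against gold-standard feedback

speedup = time_without / time_with                   # how many times faster
time_saved_fraction = 1 - time_with / time_without   # share of time saved
error_reduction = 1 - errors_with / errors_without   # share of errors removed
```

The speedup comes to roughly 2.2x (about 55% of grading time saved), and the error count falls by 62.5%.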

Figure 6: (a) Plot of average time taken to grade 30 student solutions to Liftoff. GG-NAP convincingly reduces grading time for 26/30 solutions. The amount of time saved correlates with the token edit distance (yellow). (b) GG-NAP allows for automatically associating student work with fine-grained automated feedback.

4 Related Work

“Rubric sampling” [26] first introduced the concept of encoding expert priors in grammars of student decisions, and was the inspiration for our work. The authors design PCFGs to curate synthetically labelled datasets to train a supervised classifier. Our approach builds on this, but GG-NAP operates on a more expressive family of grammars that are context sensitive. Due to this complexity, new innovations were required to do inference effectively. Our results suggest that this expressivity is responsible for pushing GG-NAP past human-level performance. Further, our paradigm adds an important notion of verifiability lacking in previous work: rubric sampling as previously presented suffers from the black-box nature of neural networks.

Inference over grammar trajectories is similar to “compiled inference” for execution traces in probabilistic programs. As such, our inference engine shares similarities with the PPL literature [14]. By limiting ourselves to a class of grammars, we get a nice interpretation of compiled inference as a parsing algorithm. Further, we show the promise of compiled inference in much larger probabilistic programs (with skewed prior distributions). Previous work [14; 27; 13] usually involves 4 or 5 random variables whereas our grammars grow to hundreds.

The design of PPGs also draws on many influences from natural language processing. For starters, our neural inference engine can be viewed as an encoder (or “inference network”) in an RNN-based variational autoencoder that specifies a posterior distribution over many categorical variables. Further, the index embedding layer serves as a unique identifier similar to the positional encoding in transformers [21]. Finally, the verifiable properties of GG-NAP have strong ties to explainable AI [19; 9; 12], especially in the healthcare domain [25; 18] where interpretability is paramount.

5 Discussion

Highlighting feedback in student solutions

Rather than predicting feedback labels, it would be even more useful to provide “dense” feedback that highlights the section of the code or text responsible for the student misunderstanding. To achieve this, we use GG-NAP to infer a trajectory $\hat{x}$ for a given production $y$. For every nonterminal $\hat{x}_t$, we want to measure its impact on $y$. If for each $\hat{x}_t$ we have an associated production rule with an intermediate output $y_t$, then highlighting amounts to finding the part of $y$ which $y_t$ was responsible for. Fig. 6 shows a random program with automated, segment-specific feedback given by GG-NAP. This level of explainability is sorely needed in both online education and AI and could revolutionise how students are given feedback at scale.

Automatically improving grammars

Building PPGs is an iterative process, requiring time for improvements in design. A user wishing to improve a PPG would like a sense of where their grammar is lacking. Fortunately, given a set of difficult examples where GG-NAP does poorly, we can deduce the set of nodes in the PPG that consistently led to mistakes. To illustrate this, we took the Liftoff PPG which crucially contains a node that decides between incrementing up or down in a “for" loop, and removed the option of incrementing down. If we train GG-NAP on the smaller PPG, we will fail to parse examples that “increment down". In this case, the set of nodes that consistently led to mistakes all related to incrementation. At this time, an expert can quickly diagnose the issue.

Need for probabilistic program grammars

In practice, we have experienced the benefits of having a grammar which allows for the full expressivity of a computer program. One non-obvious benefit of having state is the ability to break independence assumptions between mistakes. If a PCFG describes $n$ different places where a student could err, then as $n$ tends towards infinity it becomes increasingly improbable to produce a sample with only one mistake, which we know to be a very common case among students. A PPG allows for a natural way to have a continuous ability for students, which can model the bimodal phenomenon of either making many mistakes or only a few. Adding ability as a state alone increased F1 scores by 0.1 points.
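A small calculation makes the independence argument concrete. Assume, purely for illustration, 50 places where a mistake can occur and an average per-place mistake rate of 0.2. Under independence, the chance of exactly one mistake is tiny. If a shared ability state instead splits students into a strong group (rate 0.02) and a weak group (rate 0.38) with the same average rate, single-mistake solutions become common. All numbers and names below are hypothetical.

```python
def prob_exactly_one(n, p):
    """Binomial probability of exactly one mistake among n independent places."""
    return n * p * (1 - p) ** (n - 1)


N_PLACES = 50

# PCFG-style model: every place errs independently at the average rate.
independent = prob_exactly_one(N_PLACES, 0.2)

# PPG-style model: a shared "ability" state sets the rate for the whole solution.
ability_rates = {"strong": 0.02, "weak": 0.38}   # mean rate is still 0.2
with_ability = sum(0.5 * prob_exactly_one(N_PLACES, p) for p in ability_rates.values())
```

Here `independent` is on the order of 0.0002 while `with_ability` is roughly 0.19, even though both models make the same number of mistakes on average.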

Not only experts can write good grammars

Writing a good grammar does not require immense experience. For instance, the PyramidSnapshot grammar that sets the new SOTA was written by a first-year undergraduate. Further, grammars are re-usable: similar assignments will share nonterminals and some invariances (e.g. all the ways of writing i++ are the same everywhere).

6 Conclusion

In this paper we make novel contributions to the task of providing automated student feedback that beat numerous state-of-the-art approaches and show significant impact when used in practice. The ability to finely predict student decisions opens up many doors in education. This work could be used to automate feedback, visualise student approaches for instructors, and make grading easier, faster, and more consistent. Although more work needs to be done on making powerful grammars easier to write, we believe this is an exciting direction for the future of education and a huge step in the quest for combining machine learning and human-centred artificial intelligence.


Appendix A Model Hyperparameters

For reproducibility, we include all hyperparameters used in training GG-NAP. Unless otherwise stated, we use a batch size of 64 and train for 10 or 20 epochs on 100k samples from a PPG. The default learning rate is 5e-4 with a weight decay of 1e-7. We use Adam [Kingma and Ba, 2014] for optimization. If the encoder network is an RNN, we use the Elman network with 4 layers, a hidden size of 256, and a probability of dropping out hidden units of 1%. If the encoder network is a CNN, we train VGG-11 [Simonyan and Zisserman, 2014] with Xavier initialization [Glorot and Bengio, 2010] from scratch. For training VGG, we found it important to lower the learning rate to 1e-5. The neural inference engine itself is an unrolled RNN: we use a gated recurrent unit with a hidden dimension of 256 and no dropout. The value and index embedding layers output a vector of dimension 32. These hyperparameters were chosen using grid search.
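For convenience, the settings listed above can be collected in a single configuration object. The values are taken from the paragraph; the dictionary layout and key names are assumptions introduced here and do not correspond to any released code.

```python
# Hyperparameters as stated in this appendix; key names are hypothetical.
GG_NAP_HYPERPARAMETERS = {
    "batch_size": 64,
    "epochs": (10, 20),               # 10 or 20, depending on the task
    "num_training_samples": 100_000,  # samples drawn from the PPG
    "optimizer": "adam",
    "learning_rate": 5e-4,
    "weight_decay": 1e-7,
    "rnn_encoder": {
        "cell": "elman",
        "num_layers": 4,
        "hidden_size": 256,
        "dropout": 0.01,
    },
    "cnn_encoder": {
        "architecture": "vgg11",
        "init": "xavier",
        "pretrained": False,          # trained from scratch
        "learning_rate": 1e-5,        # lowered for VGG
    },
    "inference_engine": {"cell": "gru", "hidden_size": 256, "dropout": 0.0},
    "embedding_dim": 32,              # value and index embeddings
}
```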

Appendix B Adaptive Grammar Sampling

In the text, we introduced a nearest neighbour baseline (KNN) for verifiable parsing. The success of KNN is highly dependent on storing a set of unique samples. With Zipfs, i.i.d. sampling often over-samples from the head of the distribution, resulting in a low count of unique samples and poor performance. To build a strong baseline, we must sample uniques more efficiently.

Algorithm 1 Adaptive Sampling

Input: probabilistic program grammar, decay factor, reward, and desired size of dataset.
Output: dataset of unique samples from the grammar.

procedure AdaptiveSample(grammar, decay factor, reward, dataset size)
    while ... do
        if ... then
        for each node index in the trajectory do
            get the node at that index in the trajectory

Further, training the neural inference engine requires sampling a dataset from a PPG . These samples need to cover enough of the grammar to allow the model to learn meaningful representations and, moreover, they again need to be unique. The uniqueness requirement is paramount for Zipfs since otherwise models would be overwhelmed by the most probable samples.

Naively, we can i.i.d. sample a set of unique observations and use it to train NAP. However, again due to the Zipfian nature of the distribution, generating unique data points becomes expensive as the desired dataset size grows, because duplicates must be discarded. To sample efficiently, a simple idea is to pick each decision uniformly (we call this uniform sampling). Although this will generate uniques more often, it has two major issues: (1) it disregards the priors, resulting in very unlikely productions, and (2) it might not be effective, as multiple paths can lead to the same production.
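A minimal sketch of the two naive strategies, assuming a toy grammar of three independent decisions. The grammar, its option names, and its prior probabilities are invented for illustration and are far smaller than any PPG in the paper.

```python
import random

# Hypothetical toy grammar: one dict of {option: prior probability} per decision.
TOY_GRAMMAR = [
    {"for": 0.8, "while": 0.15, "none": 0.05},
    {"up": 0.9, "down": 0.1},
    {"println": 0.7, "print": 0.2, "printf": 0.1},
]


def sample_iid(rng):
    # Follow the priors: the head of the distribution is drawn again and again.
    return tuple(
        rng.choices(list(d), weights=list(d.values()))[0] for d in TOY_GRAMMAR
    )


def sample_uniform(rng):
    # Ignore the priors: every option of every decision is equally likely.
    return tuple(rng.choice(list(d)) for d in TOY_GRAMMAR)


def count_unique(sampler, n, seed=0):
    rng = random.Random(seed)
    return len({sampler(rng) for _ in range(n)})
```

Comparing `count_unique(sample_iid, n)` with `count_unique(sample_uniform, n)` for growing `n` gives a rough feel for the trade-off: uniform sampling reaches all 18 trajectories quickly, but it draws the unlikely ones as often as the likely ones.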

Ideally, we would sample in a manner such that we cover all the most likely programs and then smoothly transition into sampling increasingly unlikely programs. This would generate uniques efficiently while also retaining samples that are relatively likely. To address these desiderata, we propose a method called Adaptive Grammar Sampling (Alg. 1) that downweights the probabilities of decisions in proportion to how many times they lead to duplicate productions. We avoid overly punishing nodes early in the decision trace by discounting the downweighting by a decay factor. This method is inspired by Monte-Carlo Tree Search [Chang et al., 2005] and shares similarities with the Wang-Landau algorithm from statistical physics [Wang and Landau, 2001].
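The idea can be sketched as follows. This is one possible reading, not the exact update of Alg. 1: the toy grammar, the multiplicative penalty `1 / (1 + reward * decay ** k)`, and the `max_draws` guard are all assumptions made for illustration, and the roles of `reward` and `decay` here may differ from the paper's parametrization.

```python
import random

# Hypothetical toy grammar: one dict of {option: prior probability} per decision.
TOY_GRAMMAR = [
    {"for": 0.8, "while": 0.15, "none": 0.05},
    {"up": 0.9, "down": 0.1},
    {"println": 0.7, "print": 0.2, "printf": 0.1},
]


def adaptive_sample(grammar, decay, reward, size, seed=0, max_draws=100_000):
    rng = random.Random(seed)
    # Start from the priors; the weights adapt as duplicates appear.
    weights = [dict(d) for d in grammar]
    seen, dataset = set(), []
    draws = 0
    while len(dataset) < size and draws < max_draws:
        draws += 1
        trajectory = tuple(
            rng.choices(list(w), weights=list(w.values()))[0] for w in weights
        )
        if trajectory not in seen:
            seen.add(trajectory)
            dataset.append(trajectory)
            continue
        # Duplicate: downweight every choice on this trajectory. Earlier
        # decisions receive a smaller penalty, discounted by `decay`.
        length = len(trajectory)
        for i, choice in enumerate(trajectory):
            penalty = reward * decay ** (length - 1 - i)
            weights[i][choice] /= 1.0 + penalty
            # Renormalize so that weights never underflow to zero.
            total = sum(weights[i].values())
            for option in weights[i]:
                weights[i][option] /= total
    return dataset
```

For example, `adaptive_sample(TOY_GRAMMAR, decay=0.5, reward=1.0, size=12)` returns twelve distinct trajectories. Because the weights start at the priors, the first few samples tend to come from the head of the distribution.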

(a) Uniqueness and Good-Turing Estimates
(b) Likelihood of Samples over Time
Figure 7: Effectiveness of sampling strategies for Liftoff. Left/Middle: Number of unique programs generated (left) and Good-Turing estimate (middle) as a function of total samples. Right: Likelihood of generated samples over time for various sampling strategies. In particular, we note the effect of reward and decay on the exploration rate. The ideal sampling strategy for Zipfs first samples from the head, then body, and finally the tail.

B.1 Properties of Adaptive Sampling

In the main text, we expressed the belief that adaptive grammar sampling increases the likelihood of generating unique samples. To test this hypothesis, we sampled 10k (non-unique) Java programs using the Liftoff PPG and tracked the number of uniques over time. Fig. 7a shows that the number of unique programs grows linearly under adaptive sampling, compared to sublinear growth with i.i.d. or uniform sampling. In Fig. 7b, we compute the Good-Turing estimate, a measure of the probability that the next sample is unique, and find that under adaptive sampling it “converges” to a constant, while under the other sampling methods it approaches zero. Interestingly, adaptive sampling is customisable. Fig. 7c shows the log probability of the sampled trajectories over time. With a higher reward or a smaller decay rate, adaptive sampling will sample less from the head/body of the Zipf. In contexts where we care about the rate of sample exploration, adaptive sampling provides a tunable algorithm to search a distribution.
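For reference, the Good-Turing estimate of the probability that the next sample is previously unseen is the number of items observed exactly once divided by the total number of samples. A minimal sketch that tracks this quantity over time is below; the function name is an assumption and is not taken from the GG-NAP code.

```python
from collections import Counter


def good_turing_curve(samples):
    """Return the Good-Turing unseen-mass estimate after each sample."""
    counts = Counter()
    singletons = 0  # number of items seen exactly once so far
    curve = []
    for n, sample in enumerate(samples, start=1):
        counts[sample] += 1
        if counts[sample] == 1:
            singletons += 1
        elif counts[sample] == 2:
            singletons -= 1  # the item is no longer a singleton
        curve.append(singletons / n)
    return curve
```

A sampler that keeps producing new items holds this curve near a constant, whereas a sampler that mostly repeats the head of the distribution drives it towards zero.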

Appendix C Grammar Descriptions

We provide an overview of the grammars for each domain, covering the important choices.

P8

This PPG contains 52 decisions. The primary innovation in this grammar's design is the use of a global random variable that represents the ability of the student. This in turn affects the distributions over values for nonterminals later in the trajectory, such as those deciding the loop structure and body. The intuition this captures is that high-ability students make few to no mistakes, whereas low-ability students tend to make many correlated misunderstandings (e.g. looping and recursion).
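A global ability variable of this kind can be sketched as below. The Beta prior, the linear mapping from ability to mistake probability, and the two node names are hypothetical choices made for illustration; the actual P8 grammar has 52 decisions.

```python
import random


def mistake_probability(ability):
    # Hypothetical mapping: lower ability means a higher chance of a mistake.
    return 0.05 + 0.6 * (1.0 - ability)


def sample_student(rng):
    # Global latent variable drawn once per student (hypothetical prior).
    ability = rng.betavariate(2, 2)
    p_mistake = mistake_probability(ability)
    decisions = {"ability": ability}
    # Later decisions share the same latent, so their mistakes are correlated.
    for node in ("loop_structure", "loop_body"):
        decisions[node] = "incorrect" if rng.random() < p_mistake else "correct"
    return decisions
```

Because both nodes condition on the same `ability` draw, a low-ability sample is more likely to be incorrect at both nodes at once, which is the correlation the grammar is meant to capture.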

CS1: Liftoff

This PPG contains 26 decisions. It first determines whether to use a loop, and, if so, chooses between “for” and “while” loop structures. It then formulates the loop syntax, choosing a condition statement and whether to count up or count down. Finally, it chooses the syntax of the print statements. Notably, each choice is dependent on previous ones. For example, choosing an end value in a for loop is sensibly conditioned on a chosen start value.
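The order of decisions described above can be sketched as a small generative program. All probabilities, option sets, and decision names below are invented for illustration, and the sketch covers only a handful of the 26 decisions.

```python
import random


def sample_liftoff_like(rng):
    """Return (decision trace, rendered Java-like snippet)."""
    trace = {"use_loop": rng.random() < 0.9}
    if not trace["use_loop"]:
        # No loop: the student writes the countdown out by hand.
        return trace, 'System.out.println("10 9 8 7 6 5 4 3 2 1");'

    trace["loop_type"] = rng.choices(["for", "while"], weights=[0.7, 0.3])[0]
    trace["direction"] = rng.choices(["down", "up"], weights=[0.8, 0.2])[0]
    trace["start"] = 10 if trace["direction"] == "down" else 1

    # The end value is conditioned on the start value chosen above.
    if trace["start"] == 10:
        trace["end"] = rng.choices([1, 0], weights=[0.8, 0.2])[0]
        condition, step = f"i >= {trace['end']}", "i--"
    else:
        trace["end"] = rng.choices([10, 9], weights=[0.8, 0.2])[0]
        condition, step = f"i <= {trace['end']}", "i++"

    trace["print"] = rng.choices(["println", "print"], weights=[0.9, 0.1])[0]
    body = f"System.out.{trace['print']}(i);"

    if trace["loop_type"] == "for":
        code = f"for (int i = {trace['start']}; {condition}; {step}) {{ {body} }}"
    else:
        code = f"int i = {trace['start']}; while ({condition}) {{ {body} {step}; }}"
    return trace, code
```

Each call returns both the rendered program and the trace of decisions that produced it, so every synthetic sample comes with its labels.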

Powergrading: Short Answer

This PPG contains 53 nodes. Unlike code, grammars over natural language need to explain variance in both semantic meaning and prose. This is not as difficult for short sentences. In designing the grammar, we inspected the first 100 responses to gauge student thinking. Procedurally, the grammar’s first decision is choosing whether the production will be correct or incorrect. It then chooses a subject, verb, and noun. These three choices are dependent on the correctness. Correct answers lead to topics like religion, politics, and economics, while incorrect answers are about taxation, exploration, or physical goods. Finally, the grammar chooses a writing style to craft a sentence. To capture variations in tense, we use a conjugator (Python’s mlconjug library) as a functional transformation on the output.


PyramidSnapshot

The grammar contains 121 nodes, the first of which decides between 13 “strategies” (e.g. making a parallelogram, a right triangle, a brick wall, etc.). Each of the 13 options leads to its own set of nodes that are responsible for deciding shape, location, and colour. Finally, the trajectory of decisions is used to render an image. The first version of the grammar was created by peeking at 200 images. A second version was updated by viewing 50 more.

Appendix D NAP Architecture

Figure 8: Architecture of the neural inference engine. We show a single RNN update, which parameterizes the distribution over one decision in the trajectory. This procedure is repeated for each step, up to the length of the trajectory.

Fig. 8 visualizes the architecture for the neural inference engine in NAP. Critically, NodeEmbeddingLayer, IndexEmbeddingLayer, and InferenceLayer are specific to each nonterminal to support arbitrary dimensionality and distributions for random variables. The EncoderNetwork is responsible for transforming unstructured images and text to vector space.

Appendix E Grading UI

Figure 9: Grading UI based on GG-NAP

We show an image of the user interface used in the field experiment. This is the view a grader (with access to NAP) would see. The real student response is given on the left and the nearest neighbour given by GG-NAP on the right. A differential between the two images is provided, inspired by GitHub's design. On the far right is a set of labels to which the grader is responsible for assigning values.

Appendix F Improving the Grammar

Figure 10: Given a Liftoff grammar that can only increment up from 1 to 10 (e.g. i++), if we attempt inference on an unseen program that increments down from 10 to 1 (e.g. i--), we can track at which nonterminals inference fails, and use that to estimate where we need to add additional nodes, thereby helping the user improve the grammar. The height of each bar represents the likelihood that improvements are needed for that nonterminal, the highest of which are all related to looping.

In the discussion of the main text, we introduced an experiment to test if we could detect nodes at which we were failing to parse out-of-distribution examples: we took the Liftoff PPG (which crucially contains a node that decides between incrementing up or down in a “for” loop), and removed the option of incrementing down. If we train GG-NAP on the smaller PPG, we will fail to parse examples that “increment down”. In this case, the set of nodes that consistently led to mistakes all related to incrementation. Fig. 10 shows the distribution over which nodes GG-NAP believes to be responsible for the failed parse. The top 6 nonterminals that GG-NAP picked out related to looping and incrementation. For an expert, this is enough of a diagnosis to improve the grammar.
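One way to turn per-example failures into a distribution like the one in Fig. 10 is to count how often each nonterminal is flagged across failed parses and then normalize. The sketch below assumes that each failed parse has already been reduced to a list of suspect nonterminal names; how GG-NAP decides which nodes are suspect is not specified here, and the node names in the usage example are hypothetical.

```python
from collections import Counter


def blame_distribution(failed_parses):
    """Normalize counts of flagged nonterminals across failed parses."""
    # Count each nonterminal at most once per failed parse.
    counts = Counter(node for parse in failed_parses for node in set(parse))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Most frequently flagged nonterminals come first.
    return {node: count / total for node, count in counts.most_common()}
```

For instance, `blame_distribution([["loop_direction", "loop_end"], ["loop_direction"]])` assigns two thirds of the blame to `loop_direction` and one third to `loop_end`.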

Appendix G Grammar Sample Zoo: Powergrading

Appendix H Grammar Sample Zoo:

Appendix I Grammar Sample Zoo: Liftoff

Appendix J Grammar Sample Zoo: PyramidSnapshot