ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback

07/23/2021 ∙ by Mike Wu, et al. ∙ Stanford University 3

High-quality computer science education is limited by the difficulty of providing instructor feedback to students at scale. While this feedback could in principle be automated, supervised approaches to predicting the correct feedback are bottlenecked by the intractability of annotating large quantities of student code. In this paper, we instead frame the problem of providing feedback as few-shot classification, where a meta-learner adapts to give feedback to student code on a new programming question from just a few examples annotated by instructors. Because data for meta-training is limited, we propose a number of amendments to the typical few-shot learning framework, including task augmentation to create synthetic tasks, and additional side information to build stronger priors about each task. These additions are combined with a transformer architecture to embed discrete sequences (e.g. code) to a prototypical representation of a feedback class label. On a suite of few-shot natural language processing tasks, we match or outperform state-of-the-art performance. Then, on a collection of student solutions to exam questions from an introductory university course, we show that our approach reaches an average precision of 88 assistants. Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university. This is, to the best of our knowledge, the first successful deployment of a machine learning based feedback to open-ended student code.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

High quality education at scale is a long-standing unsolved challenge that is only becoming more important. The price of education per student is growing faster than economy-wide costs (Bowen, 2012), bounding the amount of resources available to support individual student learning. However, the demand for higher education is steadily increasing (Kumar and Hurwitz, 2015). This is especially urgent for computer science education as workers in traditional jobs are being displaced and require re-skilling. Massive open online courses, or MOOCs, have responded to this challenge by democratizing access to high quality content. But, content is only a part of the solution. MOOCs notoriously suffer from high rates of student disengagement and dropout (Kizilcec et al., 2013) – only a small fraction of students who start the courses finish. This is in part because MOOCs ignore another important ingredient for learning: feedback. Focusing on computer science education, this paper studies the problem of providing feedback on student code, a challenging process that is critical to learning but requires expertise and is difficult to scale due to the vast diversity of questions and student responses. Existing autonomous methods such as unit tests can help recognize a correct solution; however they are not especially helpful at providing feedback which could guide a student to better understanding. As such, well resourced courses typically spend a huge amount of human effort manually providing feedback to student code.

With datasets of student work becoming more commonplace (Settles, 2018; Husain et al., 2019; Riid, 2020)

, there is an opportunity to use machine learning to provide feedback at scale. But, naively treating this as a standard supervised learning problem faces challenges with overfitting to small datasets and trouble generalizing to new students and questions

(Wu et al., 2019b; Basu et al., 2013; Yan et al., 2019; Wang et al., 2017; Liu et al., 2019a; Hu and Rangwala, 2019). In order for a method to be applicable in the real world, it must use expert annotations more efficiently.

Figure 1: ProtoTransformer Networks for Few-Shot Classification of Student Code: Given a programming question, the ProtoTransformer Network is trained to predict feedback for student code using only a small set of annotated examples. Feedback categories are specified according to a rubric e.g. “Incorrect syntax” or “Missing variable”. Question and rubric descriptions are embedded with pretrained SBERT. Student code is then encoded through stacked transformer layers conditioned on question and rubric embeddings. A “prototype” is the average code embedding for each class label. New examples are embedded and compared to each prototype.

We therefore formulate feedback to students as a few-shot classification problem: given a handful of annotated examples of student solutions to a new question, a model must quickly adapt to provide feedback to new student responses. Meta-learning approaches have demonstrated promising results on such problems by learning how to quickly learn a classifier from a few examples. However, prior meta-learning methods have largely focused on few-shot visual recognition tasks constructed by repurposing existing datasets

(Lake et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017; Yu et al., 2018b). In contrast, the feedback problem presents a number of new challenges that arise in real-world applications including imbalanced data distributions, limitations on meta-learning data size, and shifts in the distribution of tasks. Together, these challenges make it nontrivial to apply existing meta-learning algorithms.

The main contribution of this work is a meta-learning framework for few-shot classification of sequence data, including programming code, that aims to mitigate these challenges. We propose an augmentation technique for code data that can generate additional tasks for meta-learning. To handle the ambiguity of learning from only a few examples and better guide the few-shot learner, we incorporate side information. To address the exponential number of unique variable and function names in code sequences, we propose to use byte-pair encoding or program obfuscation. All of these components are integrated with an architecture that uses transformers (Vaswani et al., 2017) to compute prototypical embeddings of a class label.

To benchmark our model in existing meta-learning contexts, we first study a suite of few-shot text classification tasks, where we find our approach to rival, if not outperform, several state-of-the-art meta-learning NLP algorithms. Then, on a dataset of student solutions to university exams, our approach achieves an average precision (AP) of 88% when providing feedback to unseen problems, compared to the 82% AP from teaching assistants and 65% AP from standard supervised approaches. That is, our model rivals the average human instructor, opening opportunities for real-world impact in the classroom. We also include a detailed ablation study to measure the impact of individual components of our approach, and exhaustively compare to alternative choices. Finally, we deployed our approach to provide feedback to 16,000 student solutions for a diagnostic exam in a large online programming course, demonstrating the immediate impact possible.

2 Preliminaries

We review the basics of few-shot learning and introduce the feedback challenge.

Few-Shot Learning Setup Suppose we are given a collection of tasks containing examples for each of classes. Each task is divided into a support set of examples per class, , and a query set of examples from the same classes, . For simplicitly, we assume is the same across all tasks. Every

represents a data example (e.g. image, text, or vector), and every

is a class label. The challenge of few-shot learning is to fit a model that uses the support set to predict labels on the query set for every task. Tasks are divided into a meta-training set and a meta-test set. The model parameters are chosen based on performance on the meta-training set. In evaluation, the model is judged by performance on the query set of each task in the meta-test set.

Prototypical Networks Prototypical Networks (Snell et al., 2017) learn an embedding function that maps every example to a vector in . In practice,

is a deep neural network where

represents trainable parameters. This embedding is used to map the support set to define a prototype embedding for each class (usually the average embedding over shots). The goal is for the embedding of query examples to be closer to the prototype of their correct class than alternative classes.

Precisely, the prototype for the -th class is where is the subset of the support set with examples labeled with class . Then, the prototypical network objective is to minimize a softmax over distances of each query example to each prototype:


where represents L distance. The parameter can be either a constant or learned (Oreshkin et al., 2018). Eq. 1

, averaged over all query examples, is minimized with stochastic gradient descent. At meta-test time, the support set

is again embedded to construct prototypes, following which a label is predicted for a query example by finding the closest prototype to its embedding, .

The Feedback Challenge To introduce the feedback challenge, suppose we have access to student solutions to a set of programming exercises. Each student response is annotated with a rubric, which contains several rubric items, each describing a student misconception. For example, if the problem required the student to use a “for loop”, rubric items may explore misunderstandings with the iterator, the termination condition, or general syntax. Each rubric item contains a text description and several rubric options that an annotator may pick, varying from a “perfect solution” to multiple misconception types. More than one option may be chosen for a single rubric item. See Figure 2 for example rubrics. Notably, every programming question has its own rubric with mostly unique items and options. The goal of the feedback challenge is to predict the correct assignments for every rubric item from student code with as little required annotation as possible.

Figure 2: Feedback rubrics: a rubric contains multiple items, each of which has a text description and several options. Rubric options are varied in granularity and number.

The hypothesis of this paper is tackle the feedback challenge through few-shot learning of student prototypes. Our motivation for learning prototypes comes from a close relationship to applications of item response theory in examination (Edgeworth, 1888), a popular framework that measures student ability and question difficulty by averaging over responses (i.e. a simple prototype). This is contrast with most previous attempts at feedback (Wu et al., 2019b; Malik et al., 2019), which leverage mostly hand-crafted models. Despite their success on block-based programs, scaling feedback in education is still an unsolved real-world problem. We see an important contribution of ours as carefully formulating student feedback as a machine learning problem, and more specifically a few-shot meta-learning problem. With a performant few-shot feedback system, instructors could cheaply provide feedback for students, allowing the automated system to provide feedback for the rest.

Although we have explained the crux of the problem, there are multiple challenges that persist in real-world applications that make the feedback challenge so difficult.

Limitations on Task Annotation Unlike labeling object classes in an image, labeling domain-specific data requires field expertise and intensive labor. To make matters worse, meta-learning algorithms require not just one dataset, but a set of datasets, one for each task. Whereas benchmark meta-learning datasets, e.g. Omniglot or miniImagenet, leverage thousands of tasks, limited annotation in real world settings could restrict the number of training tasks to be a few hundred at best. In computer science education, grading programming code requires the “annotator” to infer the core misconception made by the student. This complicated inference task is at least at difficult as debugging someone else’s code, as the reader must understand the student’s thought process and trace errors back to the source. Unsurprisingly, this is incredibly hard to scale. Wu et. al. (Wu et al., 2019b) found that annotating 800 Blockly programs took university teaching assistants 25 human-hours.

Long-tailed Data Distributions Real world datasets face a challenging problem of data imbalance. Several classes might be much more rare than others, such as uncommon coding mistakes students make on a class exercise. Strong class imbalance limits the size of the support set available to train a meta-learning model. Rare classes also pose a more difficult generalization challenge as examples in the long-tail of the data distribution are very diverse. In an average classroom, students have an exponential number of ways to approach the same problem. For a coding assignment, the distribution of student code has been shown to be approximately Zipfian (Wu et al., 2019b). Despite this, instructors deeply care about feedback in the “long-tail”, as it contains struggling students.

Side Information In real-world contexts, we often have access to weak auxiliary information about a task, such as a description, a webpage, or metadata. However, many few-shot learning methods are deprived of this information, operating only on the support and query examples. Careful consideration of such “side information” could reduce ambiguity of individual tasks that arises from limited data. In education, graded student work can be accompanied by a description of the question, the rubric for grading, and even metadata about the student, grader, or course.

3 ProtoTransformer Networks

In this section, we introduce a few-shot learning framework with several components designed to help address the above challenges. We start by introducing the ProtoTransformer architecture, which can process sequence data points and readily incorporate side information to solve a few-shot task. Following, we introduce two components designed to address limited and long-tailed distributions of programming code: task augmentation and variable name encodings.

As in Section 2, we are given a support set and a query set for every task. Additionally, we now assume that every example is a sequence of discrete tokens i.e. where each token is one of words in a vocabulary and is a maximum length. In the ProtoTransformer, we parameterize the embedding network as stacked transformer encoder layers as presented in RoBERTa (Liu et al., 2019c) (see Figure 1

). The RoBERTa architecture maps a sequence of tokens to a sequence of embeddings, which we aggregate into a single embedding by averaging over non-padding tokens. Given this average embedding, we compute prototypes and optimize Eq. 

1. Even for the prototypical objective, we found it important to use Adam with = 0.9, = 0.999, = 1e-6, and warm up the learning to 1e-4, followed by linear decay, as in (Liu et al., 2019c).

3.1 Conditioning on Side Information

For a given task, we may have some side information , along with the support and query sets. We assume side information is known apriori as a fixed discrete sequence of tokens, . Further, we assume access to a second embedding function that maps to a vector. For instance, if is a English description, we may choose to be a pretrained sentence model, such as SBERT (Reimers and Gurevych, 2019).

Figure 3: Side Information as a “Task Token”: Side information (pink) is embedded by a network and prepended to token embeddings (blue). The combined sequence is fed to the transformer.

A question remains on how to incorporate side information vectors into the embedding function so that the model can fuse information from the support set and the side information together. We propose a very simple approach: first, the side information embedding is transformed to be the same dimensionality as the input embedding in (e.g. 768 for RoBERTa) using a linear layer. Then, we prepend this resulting vector to each input sequence in the support set and query set. That is, we treat as the embedding of a special “task token”. This new input sequence is provided to the stacked transformers, in which the attention layers will mix the side information into the final token embeddings. See Figure 3 for a illustration. In our experiments, we compare performance with and without side information, finding a 2-4 point increase by including it.

3.2 ProtoTransformers for Code

When applying our approach to few-shot classification of programming code, we introduce technical innovations specific to code but not to any single programming language nor application domain.

Standardizing Code Input While it is common to use abstract syntax trees (Wu et al., 2019b; Jain et al., 2020; Alon et al., 2018), we opt for a simpler standardization that is more faithful to the original program string: we use the built-in parser to deconstruct the string into tokens (e.g. pythonlang.tokenize for Python). The resulting output tags segments of code with semantic labels, such as comments, names, numerics, symbols, etc. For every program, we remove all inline comments, convert all variable and function names from camel to snake case, replace newlines with a unique symbol (<newline>), and explicitly demarcate entering and exiting a new scope with unique symbols (e.g. <scope> and </scope>). Note that programs with syntax errors can still be tokenized in this manner.

Self-Supervised Tasks for Code We explore two approaches of constructing synthetic tasks from unlabeled programming code. First, inspired by subset masked language modeling tasks, or SMLMT (Bansal et al., 2020), we similarly wish to utilize an “un-masking” task. As SMLMT was designed for natural language, it has a subtle shortcoming when applied to programming code: as a significant portion of tokens in a program are used for function and variable names, randomly masking tokens through SMLMT will often mask these naming tokens. Unfortunately, naming tokens have little to do with program syntax nor structure, and can be arbitrary. That is, multiple choices can be equally as good despite only one being the true token. For example, sum, z, or total are all good choices for un-masking <mask> = x + y. Therefore, we propose a separate “cloze” task to work around this issue: we do not mask function or variable names to focus on program semantics. We use the parser to identify naming tokens. Then, we randomly choose tokens (e.g. def or range in Python) and examples for each class that contain the chosen tokens — together, these compose a -way classification problem. For each program, all instances of the chosen token are replaced with <mask>. In Section 5, we compare our cloze task to the SMLMT task and find 3-4 points of improvement.

We propose a second synthetic task by predicting compilation errors. In Python, all programs can be compiled with the pythonlang.tokenize function. Given a set of possible compilation outcomes e.g. TypeError, SyntaxError, or IndentationError, we construct a task by randomly choosing outcomes, and picking programs that compile to each of the classes. Successfully predicting compilation error requires the model to understand program logic. Alternatively, if one had access to unit tests, they could be used in addition to compilation tasks.

Variable and Function Names A subtle difficulty of working with programming code is handling variable and function names, which are responsible for more than 99% of the unique vocabulary. Naively training the ProtoTransformer on programs with their original variable and function names is challenging. since unique names in unseen programs make it very hard for the embedding function to generalize. We explore two approaches to “normalize” naming.

The first approach is to take advantage of byte-pair encodings (BPE). Given we have converted all variable and function names to snake case, most naming is comprised of english words conjoined by underscores (e.g. my_variable). As such, byte-pairs of programming names resemble standard English. To this end, we can leverage the RoBERTa tokenizer (Liu et al., 2019c).

A second approach is to obfuscate variable and function names. To do this, we allot the model variable name tokens (e.g. <var:1>, , <var:>) and function name tokens (e.g. <func:1>, , <func:>). In meta-training, every support or query example is transformed to replace all names with randomly chosen name tokens, in a consistent manner. The randomness is important so as not to memorize relationships between specific name tokens and positions in code. In meta-test, we sequentially replace variable and function names from left-to-right. The motivation is to provide the model the expressivity to represent variables and functions but minimize the vocabulary size.

Pretraining on Unlabeled Data Given a large corpus of unlabeled programming code, we learn an unsupervised representation that captures semantic information on code, which could be useful for solving few-shot classification tasks: since CodeBERT (Feng et al., 2020) also use stacked transformer layers, we can initialize the ProtoTransformer Network with its pretrained weights, and finetune only the top few layers with Eq. 1. Since the unlabeled corpus is magnitudes of order larger than the annotated one, we found initializing weights to improve meta-learning capability by up to 30 points.

4 Few-Shot Natural Language Processing Experiments

Before we focus on educational feedback, we recognize that education is a relatively new application for machine learning with limited public data due to student privacy. At the same time, our approach is not specific to code nor education. We find side information and task augmentation benefit meta-learning more generally. To compare our approach in a more familiar context, we study few-shot text classification problems in natural language processing, and find encouraging results.

2-way 3-way
Model 1-shot 2-shot 1-shot 2-shot
All (ours) 85.5 15 87.0 17 77.9 11 82.3 07
No Pretrain 55.5 14 62.5 11 57.9 15 57.6 12
No Side 83.5 16 85.5 20 73.3 08 79.3 09
No SMLMT 84.0 20 85.5 21 72.3 11 76.9 12
LSTM 52.5 14 52.5 09 47.9 09 47.3 10
Matching 84.5 17 81.6 16
Supervised 53.0 23 57.0 12 35.8 06 39.9 08
Model Meta-Test Acc.
Matching Network (Vinyals et al., 2016) 65.73
Prototypical Network (Snell et al., 2017) 68.17
Graph Network (Garcia and Bruna, 2017) 82.61
Relation Network (Sung et al., 2018) 83.07
SNAIL (Mishra et al., 2017) 82.57
ROBUSTTC-FSL (Yu et al., 2018a) 83.12
Induction Network (Geng et al., 2019) 85.63
ProtoTransformer (Ours) 85.07 ( 2.4)
ProtoTransformer (No Pretrain) 75.86 ( 2.5)
ProtoTransformer (No Side Info) 83.67 ( 2.0)
ProtoTransformer (No Cloze) 84.82 ( 2.2)
ProtoTransformer (LSTM) 77.67 ( 2.1)
Table 1: Few-shot topic classification (left): ProtoTransformer performance on few-shot topic classification using the 20-newsgroups dataset, varying the number of shots and classes. Few-shot sentiment classification (right): we matches state-of-the-art performance on 5-shot ARSC multi-domain sentiment classification.

Experiment Setup First, we re-purpose the 20-newsgroups dataset, which contains posts on 20 topics (such as sports, politics, or religion) for few-shot topic classification. To build a task, we randomly choose topics and pick examples per topic (We vary and but fix =10). We reserve 5 topics for meta-testing randomly. To incorporate side information, we provide the model the SBERT embedding of the

topic names. Second, we study few-shot sentiment analysis using the Amazon Reviews (ARSC) dataset

(Blitzer et al., 2007), replicating Geng et al. (2019) to construct 69 five-shot binary classification tasks. Here, we leverage the product name (e.g. books, DVDs, or electronics) as side information. Finally, we evaluate our model with a suite of five text classification datasets proposed by Bao et al. (2019). Unless ablated, we encode text with BPE, use a pretrained RoBERTa, and add 10% SMLMT tasks.

Model 20News Amazon HuffPost Reuters FewRel
MAML (Finn et al., 2017) 43.9 47.2 49.6 62.5 57.8
Proto. Net (Snell et al., 2017) 45.4 51.9 41.6 68.1 46.5
Ridge (Bertinetto et al., 2018) 57.2 72.7 54.8 90.0 72.2
Signatures (Bao et al., 2019) 68.3 81.1 63.5 96.0 83.5
ProtoTransformer 61.3 (0.4) 85.2 (0.3) 65.0 (0.3) 97.7 (0.1) 75.9 (0.3)
Table 2: Suite of 5-way 5-shot text classification tasks. The ProtoTransformer Network closely matches state-of-the-art algorithms, outperforming them in 3 out of 5 datasets.

Table 1 (left) reports the meta-learning approach to outperform a supervised baseline by 32 points absolute. Table 1 (left) also reports several ablations isolating the impact of different components of our approach. We find large improvements of 30 points by utilizing transformers and pretraining as well as 2-4 points of improvement through side information and augmented tasks. Table 1 (right) and Table 2 show that the ProtoTransformer rivals state-of-the-art methods. Induction Networks and ProtoTransformers are comparable on ARSC, surpassing SNAIL, graph, and relation networks by 3% absolute and standard prototypical networks by 17% absolute. On Bao et al. (2019)’s suite of datasets, we outperform Distributional Signatures (Bao et al., 2019) on 3 out of 5 datasets, also surpassing MAML and Prototypical Networks by 20-30% absolute. In light of these results, we view the ProtoTransformer as a modernization of the Prototypical Network — with straightforward extensions, the prototype approach remains competitive with more sophisticated meta-learning algorithms for natural language.

5 Scaling Feedback for University Exams

In the coming experiments, we are interesting in the following questions: (1) How does feedback from few-shot learning with the ProtoTransformer Network compare to human feedback? (2) In the low data meta-learning setting, do constructing synthetic tasks, adding side information, and careful data preprocessing improve performance? Which ones are more important?

Held-out rubric
Model AP P@50 P@75 ROC-AUC
ProtoTransformer 84.2 85.2 74.2 82.9
(1.7) (3.8) (1.4) (1.3)
Supervised 66.9 59.1 53.9 61.0
(2.2) (1.7) (1.5) (2.1)
Human TA 82.5
Held-out question
Model AP P@50 P@75 ROC-AUC
ProtoTransformer 88.1 89.8 82.1 89.0
(1.5) (2.1) (0.8) (0.9)
Supervised 64.5 58.3 53.6 59.5
(1.4) (1.6) (1.0) (1.6)
Human TA 82.5
Held-out exam
Model AP P@50 P@75 ROC-AUC
ProtoTransformer 74.2 77.3 67.3 77.0
(1.6) (2.7) (2.0) (1.4)
Supervised 65.8 60.1 54.3 60.7
( 2.1) (3.0) (1.8) (1.6)
Human TA 82.5
Table 3: Performance for feedback prediction in terms of precision and area under the ROC curve. P@X is the precision at a recall of X. Averaged over 3 runs.

Computer Science Course Dataset We use a corpus of student solutions to 4 final exams and 4 midterm exams from a university “introduction to computer science” course collected over 3 years. Every student solution has feedback, in the form of a rubric, from at least one course teaching assistant. 10% of the questions were annotated by more than one teaching assistant to measure grader consistency. In total, the dataset includes 63 exam questions with 24.8k student solutions in Python. We remove rubric options with fewer than positive examples and those where all students scored perfectly. After processing, we have 259 tasks, of which 20 are held-out. During meta-training, we randomly sample examples per class to form the support set and examples for the query set . The corpus also contains student solutions to 55 assignment questions. However, assignment solutions are 8 times longer than exam solutions, as the two distributions are different. On the next page, we investigate using assignments to construct additional tasks.

Few-shot Feedback Tasks Recall the feedback challenge from Section 2. To make this challenge amenable to few-shot learning, we need to build tasks from a rubric. The most obvious choice is to treat each rubric item as a task. However, this faces technical challenges. First, options in a rubric item are very unbalanced e.g. some misunderstandings such as “variable scoping issue” are rare. Second, rubric items have different numbers of options, requiring models to generalize over varying class counts. We instead opted to treat each rubric option as a task. For example, in Figure 2, the first rubric item on the left would comprise of four tasks. This implies that the same student programs may appear in multiple tasks, although the labels will be different. This also implies that all tasks are binary classification problems. Under this formulation, the number of tasks increases considerably, which we found to be crucial. In creating train test splits, we will always ensure rubric options from the same item are in the same split, preventing information leakage.

In addition, every programming question comes with a text description of the prompt. We propose to use both the question and the rubric item as side information to condition the embedding, as discussed in Section 3.2. Concretely, let be a sequence of tokens representing the prompt and represent the rubric description. Then we condition the embedding of any example in the support or query set on , the sum of the prompt and rubric embeddings. In practice, we choose to be a pretrained SBERT.

Evaluation Protocol We evaluate the model on three splits: first, we reserve 10% of rubric items, uniformly sampled, for meta-test tasks. In this held-out rubric items setup, the same student programs might appear in both splits, but under different rubric items. (We could imagine a setting where instructors partially grade every exam while the model is responsible for the rest.) Second, we hold-out a set of questions for meta-test split. This held-out question setup ensures all rubric items for the same question are in the same split. (We could imagine instructors leaving a few questions to be auto-graded.) Third, we hold-out a full exam. This held-out exam setup poses a more difficult challenge as unseen tokens will appear at meta-test time. This more faithfully models a few-shot autonomous system.

Figure 4: Model Comparisons: model variations compared to the best configuration (red). Shaded regions represent repeated runs. The x-axis plots AP on held-out rubric items whereas the y-axis plots held-out exams.

Main Results Table 3

reports the test performance of the ProtoTransformer on providing feedback to student solutions in both splits, averaged over three runs with different exams and rubrics being held-out. We train the model for 300 epochs, preprocess code with BPE, and augment the meta-train set with 10% cloze and 10% compilation tasks. See Figure 

1 for a model summary. For metrics, we report AP, precision at recall 0.50 and 0.75, and area under the ROC curve (ROC-AUC).

We include a supervised baseline that fits a new model per task. For every held-out task, we use examples to train a classifier from student code to a prediction for the rubric option associated with the task. We evaluate performance on

held-out examples. To ensure the baseline is comparable, we use the same setup as the ProtoTransformer. To train the supervised baseline, we found it crucial to use pretrained weights from CodeBERT, otherwise performance remained at chance. Further, we estimate human performance (“human TA” in Table 


) by computing agreement in grading between multiple teaching assistants (TA) assigned to grade the same student solutions. We found agreement to be relatively low at 82.5%. This variance can be attributed to multiple interpretations of student misconceptions from code as well as grader fatigue. As the few-shot model is trained with labels from many TAs, in principle, a perfect model can outperform any individual grader.

Table 3 shows that the ProtoTransformer outperforms the supervised baseline by 17% in held-out rubrics, 23% in held-out questions, and 9% in held-out exams. Further, in held-out rubrics and questions, we find the ProtoTransformer to be highly performant, surpassing human precision by 1.7 points absolute and 5.6 points absolute, respectively. The strong performance in the held-out questions setting might suggest a benefit of meta-training on all rubrics of a question at once. As held-out exams pose a more difficult challenge, we see a decrease of 10 points in AP and 5 points in ROC-AUC compared to held-out rubrics. However, the ProtoTransformer still outperforms the supervised baseline by a healthy margin, showing the utility of meta-learning.

Model Comparisons To pinpoint the effectiveness of individual techniques used in the ProtoTransformer, we conduct comparisons that change the architecture, auxiliary tasks, pretraining, and more. To compare to prior methods, we vary the meta-learning algorithm. Figure 4

show the results. In each comparison, all hyperparameters were kept constant aside from the one of interest.

Comparing the prototypical network to competing meta-learning algorithms, such as matching networks (Vinyals et al., 2016) and relation networks (Sung et al., 2018), we find the prototypical network to outperform others by 3 and 12 points respectively (green curve). We also find a significant benefit (of up to 10 points) in using transformer architectures opposed to recurrent architectures (blue curves). As the number of stacked transformer layers increased, the performance strengthened. For these depth experiments, models were trained from scratch.

We find that adding either the cloze or compile augmentation tasks introduced in Section 3 improves performance by 6-7 points (pink curve in Figure 4), with adding both further increasing performance by 1 point. Notably, we find that adding cloze tasks improves performance by 3 points over adding SMLMT, despite the similarity between the two methods. Recall that SMLMT predominantly masks function and variable names, unlike cloze. These results suggest that tasks predicting diverse naming tokens can negatively impact performance. Finally, we investigate using student code from course assignments to augment the meta-training tasks. Although these provide a 1 to 2 point benefit, it does not rival the performance of using cloze or compile tasks. The much longer solutions from assignments are different in distribution, likely distracting the meta-learner. We see evidence of this when using cloze, compile, and assignment tasks together: performance is 5 points worse than using cloze and compile tasks alone.

To measure the impact of unlabeled data, we compare several pretraining schemes. First, to confirm that unsupervised data has a positive impact, we compare performance using CodeBERT weights as initialization versus a random initialization (“no pretrain” in the yellow curve). We observe over 10 points of increase for both held-out rubrics and exams, suggesting pretraining to be crucial. Moreover, to show that pretraining on unlabeled code is important, we compare CodeBERT to the standard RoBERTa which is trained on natural language. Using RoBERTa weights shows roughly half of the improvement of CodeBERT over random initialization. While CodeBERT is fit on a corpus of 6 programming languages (including Python), we finetuned a RoBERTa model only on the Python subset of CodeSearchNet (Husain et al., 2019), 1.1M of the 6.5M total examples. Using PythonBERT outperforms standard RoBERTa but does not match CodeBERT.

Incorporating side information proved useful (cyan curve). We compared the method proposed in Section 3 (“Task Token”) to several alternatives. FiLM (Perez et al., 2018) uses side information to learn an additive and multiplicative shift applied in each transformer layer. The “adapter” approach fits a bottleneck network to combine side information with attention outputs, inspired by (Houlsby et al., 2019). Finally, the “Concatenation” approach merely joins side information with the final embedding post-transformer. All in all, the “Task Token” approach is the simplest and most performant.

Lastly, to test preprocessing techniques, we compare the ProtoTransformer trained using obfuscated code, byte-pair encoded (BPE) code, and vanilla code (“no preprocess”). While both obfuscation and BPE improve performance, the latter achieves 1 to 3 points of additional improvement (brown curve in Figure 4). We hypothesis this to be because BPE preserves some semantic meaning in variable and function names (e.g. sum = x + y) whereas obfuscation does not.

Real World Deployment We deployed our approach to provide feedback to 16,000 student solutions on a diagnostic programming exam in a course offered by a tier 1 university. Normally, no feedback is given to students as it is not feasible for the teaching team to properly examine so many solutions. Rather, students are asked to self assess. On an important ethical note, our model’s feedback was used solely for enrichment, and had no impact on course grading or evaluation. 1,096 students responded to a survey after receiving 15,134 pieces of feedback. The students’ reception to the feedback was overwhelmingly positive: Across the 15k pieces of feedback students agreed with AI suggestion 97.9% 0.001 of the time. The 1,096 students were asked how useful they found the feedback on a 5 point scale. Their average rating was 4.6 0.018 out of 5. To the best of our knowledge this is the first successful deployment of AI feedback for open ended student code. Analysis of the long term impact of such feedback, the reception of students to messaging concerning AI delivered feedback and the user interface for delivering such feedback is ongoing research.

6 Related Work


There is a rich collection of literature on meta-learning spanning multiple decades (Schmidhuber, 1999; Bengio et al., 1990, 1992; Younger et al., 2001; Vanschoren, 2018; Hospedales et al., 2020). Our model is based on the popular prototypical network (Snell et al., 2017). Traditionally, prototypical networks have focused on few-shot visual learning e.g. with Omniglot (Lake et al., 2011) and miniImageNet (Vinyals et al., 2016), although recent work has expanded to text classification (Sun et al., 2019; Gao et al., 2019; Wu et al., 2019a; Geng et al., 2019; Obamuyide and Vlachos, 2020) and medical diagnosis (Prabhu, 2019). In this vein, we study a new problem of few-shot code classification for education. Recent research has expanded the prototypical network in several directions. One such direction (Medina et al., 2020; Rajasegaran et al., 2020) joins self-supervision together with few-shot learning. Similar papers (Liu et al., 2019b; Ren et al., 2018; Bateni et al., 2020; Liu et al., 2018) leverage unlabeled examples to modify prototype embeddings. Our approach does not use unlabeled data at the few-shot learning stage, only for initialization. Recent work also introduced task-specific parameters during meta-optimization (Oreshkin et al., 2018; Logeswaran et al., 2020; Denevi et al., 2020). Unlike this direction, our method has no trainable task parameters, only fixed side information. On a third front, many papers have studied task curation: AAL (Antoniou and Storkey, 2019) randomly assigns labels; CACTUS (Hsu et al., 2018) uses unsupervised clusters as labels; LASIUM (Khodadadeh et al., 2020)

fits a generative model to sample task examples. In computer vision,

(Liu et al., 2020; Khodadadeh et al., 2019) create new tasks from image transformations. In natural language, DReCa (Murty et al., 2020) clusters BERT embeddings whereas SMLMT (Bansal et al., 2020) predicts masked tokens for tasks. Building on these, we curate tasks specific to code, which we find to work much better than tasks specifically for natural language.

Machine learning for code Applying neural networks to code (outside of education) has a substantial body of work (Allamanis et al., 2018), studying type inference (Hellendoorn et al., 2018; Pandi et al., 2020), code summarization (Allamanis et al., 2016; Iyer et al., 2016), or program induction (Kant, 2018; Devlin et al., 2017; Huang et al., 2018). Several works investigate learning code representations (Alon et al., 2018; Feng et al., 2020; Kanade et al., 2019; Jain et al., 2020), which we leverage to initialize our few-shot model. However, on their own, we find that existing representations out-of-the-box are not sufficient for feedback.

Machine learning for education In knowledge tracing, a machine models student knowledge as they interact with coursework (Corbett and Anderson, 1994; Piech et al., 2015; Shin et al., 2020). However, these models are not able to provide feedback. Recently, (Wu et al., 2019b; Malik et al., 2019) propose to provide feedback using expert-built grammars. However, developing faithful grammars becomes intractable for complex programs, such as university level exams. Our approach relies only on standard annotation for a small set of examples: a cheap cost for any expert. We find good performance on university level student code.

7 Limitations and Conclusion

While promising, there are important limitations to consider. The few-shot learning setup implicitly assumes access to labeled examples per class. But in practice, we may have to label many more as some classes are rare. Table 4 shows how using less than annotations impacts performance. Second, it is not clear how far the approach can generalize e.g. a new exam or a new course. On a broader scale, we are conscious of the responsibility of AI in education. While an automated system can be powerful, if it is not interpretable and equitable in its predictions, there is a potential negative impact to students’ education. There could be bias especially if the model performs differently for lower performing students who have more complex solutions, raising ethical considerations. We believe it is crucial that our system works together with a teaching staff. In deployment, we took extreme care in quality analysis and crafting careful language around presenting feedback.

In summary, we introduced a few-shot approach to predict feedback on student programming code, matching human experts in accuracy. We observe the combination of transformers, side information, and augmented tasks made impactful improvements in education and NLP. As priori methods struggled with university assignments, we believe our work takes an important step forward.


  • M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton (2018) A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51 (4), pp. 1–37. Cited by: §6.
  • M. Allamanis, H. Peng, and C. Sutton (2016) A convolutional attention network for extreme summarization of source code. In International conference on machine learning, pp. 2091–2100. Cited by: §6.
  • U. Alon, S. Brody, O. Levy, and E. Yahav (2018) Code2seq: generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400. Cited by: §3.2, §6.
  • A. Antoniou and A. Storkey (2019) Assume, augment and learn: unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884. Cited by: §6.
  • T. Bansal, R. Jha, T. Munkhdalai, and A. McCallum (2020) Self-supervised meta-learning for few-shot natural language classification tasks. arXiv preprint arXiv:2009.08445. Cited by: §A.2, §3.2, §6.
  • Y. Bao, M. Wu, S. Chang, and R. Barzilay (2019) Few-shot text classification with distributional signatures. In International Conference on Learning Representations, Cited by: Table 2, §4, §4.
  • S. Basu, C. Jacobs, and L. Vanderwende (2013) Powergrading: a clustering approach to amplify human effort for short answer grading. Transactions of the Association for Computational Linguistics 1, pp. 391–402. Cited by: §1.
  • P. Bateni, J. Barber, J. van de Meent, and F. Wood (2020) Improving few-shot visual classification with unlabelled examples. arXiv preprint arXiv:2006.12245. Cited by: §6.
  • S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei (1992) On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, Vol. 2. Cited by: §6.
  • Y. Bengio, S. Bengio, and J. Cloutier (1990) Learning a synaptic learning rule. Citeseer. Cited by: §6.
  • L. Bertinetto, J. F. Henriques, P. Torr, and A. Vedaldi (2018) Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, Cited by: Table 2.
  • J. Blitzer, M. Dredze, and F. Pereira (2007) Biographies, bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics, pp. 440–447. Cited by: §4.
  • W. G. Bowen (2012) The “cost disease” in higher education: is technology the answer. The Tanner Lectures Stanford University. Cited by: §1.
  • A. T. Corbett and J. R. Anderson (1994) Knowledge tracing: modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction 4 (4), pp. 253–278. Cited by: §6.
  • G. Denevi, M. Pontil, and C. Ciliberto (2020) The advantage of conditional meta-learning for biased regularization and fine-tuning. arXiv preprint arXiv:2008.10857. Cited by: §6.
  • J. Devlin, R. R. Bunel, R. Singh, M. Hausknecht, and P. Kohli (2017) Neural program meta-induction. In Advances in Neural Information Processing Systems, pp. 2080–2088. Cited by: §6.
  • F. Y. Edgeworth (1888) The statistics of examinations. Journal of the Royal Statistical Society 51 (3), pp. 599–635. Cited by: §2.
  • e. al. Falcon (2019) PyTorch lightning. GitHub. Note: 3. Cited by: Appendix H.
  • Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, et al. (2020) Codebert: a pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155. Cited by: §A.1, §3.2, §6.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400. Cited by: §1, Table 2.
  • T. Gao, X. Han, Z. Liu, and M. Sun (2019) Hybrid attention-based prototypical networks for noisy few-shot relation classification. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 33, pp. 6407–6414. Cited by: §6.
  • V. Garcia and J. Bruna (2017) Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043. Cited by: Table 1.
  • R. Geng, B. Li, Y. Li, X. Zhu, P. Jian, and J. Sun (2019) Induction networks for few-shot text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3895–3904. Cited by: Table 1, §4, §6.
  • V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis (2018) Deep learning type inference. In Proceedings of the 2018 26th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp. 152–162. Cited by: §6.
  • T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey (2020) Meta-learning in neural networks: a survey. arXiv preprint arXiv:2004.05439. Cited by: §6.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019)

    Parameter-efficient transfer learning for nlp

    arXiv preprint arXiv:1902.00751. Cited by: §A.3, §5.
  • K. Hsu, S. Levine, and C. Finn (2018) Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334. Cited by: §6.
  • Q. Hu and H. Rangwala (2019) Reliable deep grade prediction with uncertainty estimation. arXiv preprint arXiv:1902.10213. Cited by: §1.
  • P. Huang, C. Wang, R. Singh, W. Yih, and X. He (2018) Natural language to structured query generation via meta-learning. arXiv preprint arXiv:1803.02400. Cited by: §6.
  • H. Husain, H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt (2019) Codesearchnet challenge: evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436. Cited by: §A.1, §1, §5.
  • S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer (2016)

    Summarizing source code using a neural attention model

    In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2073–2083. Cited by: §6.
  • P. Jain, A. Jain, T. Zhang, P. Abbeel, J. E. Gonzalez, and I. Stoica (2020) Contrastive code representation learning. arXiv preprint arXiv:2007.04973. Cited by: §3.2, §6.
  • A. Kanade, P. Maniatis, G. Balakrishnan, and K. Shi (2019) Pre-trained contextual embedding of source code. arXiv preprint arXiv:2001.00059. Cited by: §6.
  • N. Kant (2018) Recent advances in neural program synthesis. arXiv preprint arXiv:1802.02353. Cited by: §6.
  • S. Khodadadeh, L. Boloni, and M. Shah (2019) Unsupervised meta-learning for few-shot image classification. In Advances in Neural Information Processing Systems, pp. 10132–10142. Cited by: §6.
  • S. Khodadadeh, S. Zehtabian, S. Vahidian, W. Wang, B. Lin, and L. Bölöni (2020)

    Unsupervised meta-learning through latent-space interpolation in generative models

    arXiv preprint arXiv:2006.10236. Cited by: §6.
  • R. F. Kizilcec, C. Piech, and E. Schneider (2013) Deconstructing disengagement: analyzing learner subpopulations in massive open online courses. In Proceedings of the third international conference on learning analytics and knowledge, pp. 170–179. Cited by: §1.
  • A. Kumar and M. Hurwitz (2015) Supply and demand in the higher education market: college enrollment. research brief.. College Board. Cited by: §1.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §1.
  • B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum (2011) One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, Vol. 33. Cited by: §6.
  • J. Liu, F. Chao, and C. Lin (2020) Task augmentation by rotating for meta-learning. arXiv preprint arXiv:2003.00804. Cited by: §6.
  • J. Liu, Y. Xu, and L. Zhao (2019a) Automated essay scoring based on two-stage learning. arXiv preprint arXiv:1901.07744. Cited by: §1.
  • L. Liu, T. Zhou, G. Long, J. Jiang, L. Yao, and C. Zhang (2019b) Prototype propagation networks (ppn) for weakly-supervised few-shot learning on category graph. arXiv preprint arXiv:1905.04042. Cited by: §6.
  • Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang (2018) Learning to propagate labels: transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002. Cited by: §6.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019c) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §A.1, Appendix F, §3.2, §3.
  • L. Logeswaran, A. Lee, M. Ott, H. Lee, M. Ranzato, and A. Szlam (2020) Few-shot sequence learning with transformers. arXiv preprint arXiv:2012.09543. Cited by: §A.3, §6.
  • A. Malik, M. Wu, V. Vasavada, J. Song, J. Mitchell, N. Goodman, and C. Piech (2019) Generative grading: neural approximate parsing for automated student feedback. arXiv preprint arXiv:1905.09916. Cited by: §2, §6.
  • C. Medina, A. Devos, and M. Grossglauser (2020) Self-supervised prototypical transfer learning for few-shot classification. arXiv preprint arXiv:2006.11325. Cited by: §6.
  • N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141. Cited by: Table 1.
  • S. Murty, T. B. Hashimoto, and C. D. Manning (2020) DReCa: a general task augmentation strategy for few-shot natural language inference. arXiv preprint. Cited by: §6.
  • A. Obamuyide and A. Vlachos (2020) Model-agnostic meta-learning for relation classification with limited supervision. arXiv preprint. Cited by: §6.
  • B. Oreshkin, P. R. López, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731. Cited by: §A.3, §2, §6.
  • I. V. Pandi, E. T. Barr, A. D. Gordon, and C. Sutton (2020) OptTyper: probabilistic type inference by optimising logical and natural constraints. arXiv preprint arXiv:2004.00348. Cited by: §6.
  • E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) Film: visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §A.3, §5.
  • C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein (2015) Deep knowledge tracing. In Advances in neural information processing systems, pp. 505–513. Cited by: §6.
  • V. U. Prabhu (2019) Few-shot learning for dermatological disease diagnosis. Ph.D. Thesis, Georgia Institute of Technology. Cited by: §6.
  • J. Rajasegaran, S. Khan, M. Hayat, F. S. Khan, and M. Shah (2020) Self-supervised knowledge distillation for few-shot learning. arXiv preprint arXiv:2006.09785. Cited by: §6.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. Cited by: §3.1.
  • M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676. Cited by: §6.
  • Riid (2020) Riiid aied challenge. External Links: Link Cited by: §1.
  • J. Schmidhuber (1999) A general method for incremental self-improvement and multi-agent learning. In Evolutionary Computation: Theory and Applications, pp. 81–123. Cited by: §6.
  • B. Settles (2018) Data for the 2018 duolingo shared task on second language acquisition modeling (slam). Available at: doi 10. Cited by: §1.
  • D. Shin, Y. Shim, H. Yu, S. Lee, B. Kim, and Y. Choi (2020) SAINT+: integrating temporal features for ednet correctness prediction. arXiv preprint arXiv:2010.12042. Cited by: §6.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §1, §2, Table 1, Table 2, §6.
  • S. Sun, Q. Sun, K. Zhou, and T. Lv (2019) Hierarchical attention prototypical networks for few-shot text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 476–485. Cited by: §6.
  • F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 1199–1208. Cited by: Table 1, §5.
  • J. Vanschoren (2018) Meta-learning: a survey. arXiv preprint arXiv:1810.03548. Cited by: §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3637–3645. Cited by: §1, Table 1, §5, §6.
  • L. Wang, A. Sy, L. Liu, and C. Piech (2017) Learning to represent student knowledge on programming exercises using deep learning.. In EDM, Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link Cited by: Appendix H.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §A.1.
  • J. Wu, W. Xiong, and W. Y. Wang (2019a) Learning to learn and predict: a meta-learning approach for multi-label classification. arXiv preprint arXiv:1909.04176. Cited by: §6.
  • M. Wu, M. Mosse, N. Goodman, and C. Piech (2019b) Zero shot learning for code education: rubric sampling with deep learning inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 782–790. Cited by: §1, §2, §2, §2, §3.2, §6.
  • L. Yan, N. McKeown, and C. Piech (2019) The pyramidsnapshot challenge: understanding student process from visual output of programs. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, SIGCSE ’19, New York, NY, USA, pp. 119–125. External Links: ISBN 978-1-4503-5890-3, Link, Document Cited by: §1.
  • A. S. Younger, S. Hochreiter, and P. R. Conwell (2001)

    Meta-learning with backpropagation

    In IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), Vol. 3. Cited by: §6.
  • M. Yu, X. Guo, J. Yi, S. Chang, S. Potdar, Y. Cheng, G. Tesauro, H. Wang, and B. Zhou (2018a) Diverse few-shot text classification with multiple metrics. In NAACL-HLT, Cited by: Table 1.
  • T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine (2018b) One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557. Cited by: §1.

Appendix A Model Details

a.1 Details: Unsupervised Pretraining

Pretrained RoBERTa weights are taken from the HuggingFace Transformers repository [Wolf et al., 2019] under the key roberta-base. Pretrained CodeBERT [Feng et al., 2020] weights are publically available at and can be accessed with the HuggingFace Transformers library under the key microsoft/codebert-base. Pretrained PythonBERT weights are obtained by finetuning the top 6 layers of a pretrained RoBERTa model (roberta-base) on the CodeSearchNet repository [Husain et al., 2019] limited to Python code, unlike CodeBERT which is trained on multiple languages. We optimize with Adam with weight decay of 0.01, =0.9, =0.999 for 10 epochs, a batch size of 8, and the learning schedule from [Liu et al., 2019c] with a peak learning rate of 1e-4 and 10k warmup steps.

a.2 Details: Task Augmentation

All programs in the dataset are in python. For the “compile” task, we use the built-in python function compile. We use a try-catch statement on failure with the exception string denoting a class label. A successful compilation indices a “success” class label. We group compilation errors that contain the text “Did you mean to print …” into one class, and similarly for the text “invalid character in identifier …”. In total, we have 26 possible compilation class labels, of which two are randomly chosen at a time to construct a task.

The SMLMT [Bansal et al., 2020] and our “cloze” task are very similar as both create classification tasks that involve predicting the identity of masked tokens in a discrete sequence. The primary difference for the cloze task is (1) it is always a binary classification problem, and (2) we ignore all special tokens (e.g. start of sentence, etc) and all function and variable names. As such, the masked tokens largely represent built-in python functions (e.g. def or in).

a.3 Details: Side Information Architectures

(a) Task Token
(b) Adapter Network
(c) FiLM Network
(d) Concatenation
Figure 5: Four options for integrating side information into a stacked transformer architecture. Some approaches modify the input to the ProtoTransfer (a) while others modify each transformer layer (b,c), and others still modify the final transformer representation.

To incorporate side information in to the ProtoTransformer architecture, we considered four different options:

The simplest method, denoted “concat”, first separately feeds the input sequence through the stacked transformer model, and computes an average embedding over non-padding sequence tokens of shape batch_size x dim. This embedding is then concatenated with the side information, which we assume to be a vector, or equivalently, pre-embedded into a vector. The concatenated vector is then fed into an MLP to collapse the embedding back to shape batch_size x dim. While simple, one disadvantage of this approach is that the side information does not directly influence how the input sequence is embedded in the transformer layers.

The second method, denoted “FiLM”, attempts to modify the internal architecture of each transformer layer. As shown in Fig. 5c, two FiLM layers are added after dense layers in the transformer layer architecture. The FiLM layer [Perez et al., 2018]

was proposed as a general method to condition convolutional neural networks on known information by learning and additive and multiplicative shift applied to the output of a layer in the neural network. The FiLM idea was extended to transformers in TADAM

[Oreshkin et al., 2018]. In this work, we propose to use FiLM to condition the embedding function on side information. Side information is put through an MLP to produce two terms, and which are then respectively multiplied and added (elementwise) to the output of the a dense layer in the transformer.

The third method, denoted “adapter”, similarly edits the transformer layer architecture. Proposed in [Houlsby et al., 2019], it was introduced as a parameter-efficient finetuning procedure. Two “adapter networks” are added after dense layers in the transformer (see Fig. 5

b). Each adapter is a bottleneck network that down-projects the input to a small hidden dimension before up-projecting it back to the original dimensionality, with an nonlinearity in between. There is a residual connection between the original input and the reconstructed input of the adapter network. The original architecture from

[Houlsby et al., 2019] did not include side information. However, we propose to concatenate side information with the input (call this ) to the adapter network. In this new design, the up-projection layer produces a vector of the same dimensionality as (not including added dimensions by concatenating side information). The output of this adapter network is again the original with the reconstructed residual added. Intuitively, the residual captures side information.

The fourth and final approach, is based on TAM [Logeswaran et al., 2020]. We use an MLP to project the side information to the side dimensionality as the token embedding in the transformer (e.g. 768 for RoBERTa). We then treat this vector as a “special” token, prepending it to the sequence of embedded tokens. See Fig. 5a for an illustration.

a.4 Details: Code Preprocessing

For byte-pair encoding, we use the RobertaTokenizer from HuggingFace’s Transformers repository. To do obfuscation, we first must identify variable and function names. To accomplish this, we use the pythonlang.tokenize function, which is a lexical scanner for python source code and semantically tags tokens with types (e.g. variable, function, etc.). This lexical scanner does not require the program to compile. Then we denote a finite set of reserved tokens for representation token and variable names e.g. <VAR:0>, …, <VAR:100>, or <FUNC:0>, …, <FUNC:10>. When obfuscating program, every time we see a variable or function token, we replace all instances of it with the next available reserved token. During training, we randomly shuffle variable and function tokens, e.g. all instances of <VAR:0> to <VAR:5>, in order to prevent memorizing. The lexical scanner is also used for the “cloze” task to decide which tokens to mask.

a.5 Details: Supervised Baseline

The supervised baseline is trained for 25 epochs to avoid overfitting. We first tried to train for 300 epochs to be a faithful comparison to the ProtoTransformer but found no improvements after 25 epochs. Initializing weights from CodeBERT slowed overfitting to some extent. Without it, performance for the supervised baseline plateaued at chance (AP near 0.5).

a.6 Details: Additional Examples

We provide more examples of ProtoTransformer predictions for feedback on exam questions. Fig. 6 shows two more examples in addition to the one in the main text. Since we are not able to share student work, the examples are generated by an author who purposefully introduces a misconception in their solution. In particular, in Fig. 6a, the comparision betweens string_1 and string_2 should be an inequality. In Fig. 6b, the line if content[i][j] // 2 == 0: should instead be if content[i][j] % 2 == 0: e.g. replace integer division with a mod. In Fig. 6c, we did not check to see if the key k exists in the object dict2 before accessing it. In each of the cases, the ProtoTransformer predicts feedback labels that reflect the original misconception.

(a) Example: string comparisons
(b) Example: file reading and processing
(c) Example: dictionary comparisons
Figure 6: More Examples of rubric items from a trained ProtoTransformer. The question description and the true misconception are shown in the left two boxes whereas the predicted rubric item is shown with a checkmark out of three possible items.

Appendix B Dataset Details

Fig. 7 compares student solutions to course exams to course assignments in terms of code length and uniqueness. Fig. 7a and b show that assignments are nearly 10 times longer than exams, as assignments are often more difficult than exam problems. As a consequence, student solutions to assignment problems are much more likely to be unique as well (see Fig. 7d). The diversity of student solutions also results in a larger vocabulary size (see Fig. 7c). For purposes of learning algorithms, the distribution of student exam solutions and assignment solutions are significantly different. It is likely much more difficult to model assignment solutions. A final note: our data analysis indicates that students programming code has a very long tail with a Zipf-like nature. Fig. 7 plots unique student solutions (x-axis) against the log-log count of number of appearances in the dataset (we log twice in order to visualize the fall-off). It is clear that a handful of programs appear often whereas the majority of student solutions have not been seen before. This property of student data makes traditional supervised learning approaches difficult as one has to annotate the tail to generalize.

(a) Exams
(b) Assignments
(c) Token Uniqueness
(d) Program Uniqueness
Figure 7: Sub-figures (a) and (b) show number of lines in the exams and assignments (black dashed line shows median). Sub-figure (c) shows the log counts of unique tokens in exams and assignments. Sub-figure (d) shows the log-log counts of unique programs.

b.1 Dataset Specifics

Student submissions were collected from 3 years of an introductory programming course at a university (information redacted for anonymity). Each course contains 500 to 1000 students. For every course, students are given programming assignments and two exams: one midterm and one final. We do not record student attempts, only the code of their final submission. A team of teaching assistants was assigned to grade every student submission according to a rubric, which we use as side information in the model. Because grading styles are not normalized, a set of randomly chosen problems are graded by multiple teaching assistants, which we use to compute precision.

Personal information about students was scrubbed from the dataset: we do not utilize grader or student information in the model, only the question text, rubric text, and solution. In the solution, all comments were removed.

b.2 Dataset Release

We are cognizant that the “Introduction to computer science” dataset of student solutions from Section LABEL:sec:dataset

is not public. We have attempted to open-source this dataset many times but found it extremely difficult to do so given important privacy concerns over student data. We are working towards open-sourcing a version of a dataset of student code for feedback prediction so that others can improve upon our approach. Our motivation on exploring applications of the ProtoTransformer to natural language was to construct a proxy task for others to compare against our method while we work towards a public dataset of student code.

Appendix C Additional Results

We include plots of the precision recall curve and ROC curve of the ProtoTransformer and supervised baseline. See Fig. 8. The black dotted line represents performance of random chance. Thus, in both subfigures, a curve closer to the black dotted line is a worse performing model.

(a) Precision-Recall
(b) ROC curve
Figure 8: Comparision of the ProtoTransformer (red) to a supervised baseline (blue). The dotted black line represents chance.

Next, we explore how meta-test performance degrades as we build prototypes in test time with fewer than examples (Note we will assume access to exapmles in meta-train tasks). Table 4 shows a gradual loss in performance as fewer shots are provided during meta-test. However, the performance drop is relatively small when using 5 shots (half the number).

Held-out rubrics Held-out exams
Table 4: Effect of support set size on meta-test performance: Using a ProtoTransformer Network trained on meta-training tasks with shots, we find gradual loss in performance as fewer shots are provided during meta-test.

Appendix D Additional Ablations

We include two ablation experiments not shown in the main paper. First, we vary the amount of task augmentation to answer the question: how many synthetic tasks is too many? From Table 4(g), adding 10% of augmented tasks results in the best performing model. As the percentage of synthetic tasks grows below or above 10%, performs continues to decrease. Second, we want to know how does the number of shots affect performance? In our main experiments, we set . As Table 4(h) shows, the bigger is, the higher the performance, though there may be a marginally decreasing effect. In practice, we cannot choose to be too big. For the education application, it is not feasible to annotate too many examples per feedback class. By choosing , we believe it to be a compromise between annotation effort and performance. In Table 4(g), we also include numbers for the ablations in the main text.

Meta Algorithm Held-out rubric Held-out exam
Prototypical Network 80.1 71.6
Matching Network 77.8 68.3
Relation Network 67.5 61.3
(a) Ablation: meta-learning algorithm.
Architecture Held-out rubric Held-out exam
LSTM (4x) 67.7 58.8
LSTM (8x) 67.7 59.6
LSTM (10x) 67.6 59.4
Transformer (4x) 73.4 64.8
Transformer (8x) 74.2 66.3
Transformer (10x) 77.0 68.2
(b) Ablation: embedding network architecture.
Augmented Task Held-out rubric Held-out exam
None 71.4 63.3
SMLMT 75.6 67.8
Assignments (Assn.) 72.8 66.3
Compile 77.0 71.7
Cloze 78.4 73.2
Cloze, Compile 79.1 73.7
Cloze, Compile, Assn. 73.5 68.5
(c) Ablation: task augmentation.
Pretraining Held-out rubric Held-out exam
None 70.9 61.8
RoBERTa 76.3 66.1
CodeBERT 82.4 72.4
PythonBERT 77.6 68.4
(d) Ablation: pretraining scheme
Side Info. Arch. Held-out rubric Held-out exam
None 80.1 70.5
Concatenation 80.6 66.1
Adapter 83.4 72.8
Task Token 85.1 74.4
FiLM 75.8 68.2
(e) Ablation: side information architecture.
Code Preprocessing Held-out rubric Held-out exam
None 70.9 63.3
Obfuscation 73.3 66.9
Byte-pair 74.2 69.8
(f) Ablation: code preprocessing.
Aug % Held-out rubric Held-out exam
1% 72.2 68.1
5% 76.4 69.6
10% 78.2 72.6
25% 77.7 68.2
50% 75.0 65.3
100% 70.0 64.8
(g) Ablation: task augmentation amount.
# shots Held-out rubric Held-out exam
1-shot 57.8 58.9
2-shot 64.5 60.1
5-shot 76.1 66.9
10-shot 80.5 71.4
(h) Ablation: number of shots.
Table 5: Albation experiments testing the impact of different components of the ProtoTransformer on model performance.

Appendix E Analysis of Embeddings

We can explore what the ProtoTransformer has learned about student code by exploring clusters in the shared embedding space. Figure 9 shows a PCA projection of embedding solutions to two dimensions. For each solution, we are given a true numeric grade assigned by a teaching assistant (between 0 and 100), which is not used in training nor evaluation.

(a) All Students
(b) Random Student
(c) Random Question
(d) Random Question
Figure 9: Embedding clustering: The color represents the true grade (darker is lower). In (a), we visualize the average embedding (over questions) for every student. Subfigure (b) shows embeddings for each question for a random student. Subfigures (c) and (d) show all student solutions for two randomly chosen questions.

For every student, we first compute the average embedding over all questions, which we can interpret as a measure of student understanding. We see that Figure 9a uncovers a linear relationship as students in the top-left corner have lower performance. In Figure 9b we focus on a single student, visualizing their embedded solutions. We observe distinct clusters that separate the questions that this student excels at versus those this student does poorly on. This information could be useful for an instructor in measuring learning and growth.

Similarly, we visualize all student attempts for a single question in Figures 9c & d, finding surprisingly different structure across questions. Figure 9c contains a large cluster of high-scoring solutions with several distinct clusters of low-scoring solutions. In contrast, Figure 9d contains a single cluster with solutions scoring higher from bottom-left to top-right in a continuous manner. In practice, this analysis could help instructors evaluation question difficulty and curriculum design.

Appendix F NLP Experiment Details

In all cases, we fit the ProtoTransformer with Adam for 30 epochs with weight decay of 0.01, =0.9, =0.999, and a linear learning rate schedule to 1e-4 with 10k steps warmup [Liu et al., 2019c]. The LSTM baseline was bi-directional with 4 stacked layers. Pretraining weights were taken a RoBERTa model from HuggingFace.

f.1 Additional NLP Experiments

The results in the main paper split meta-training and meta-test tasks by topic such that tasks in the meta-test set use topics not seen in meta-training. Here we run a similar study in which we ensure that no combination of classes appearing in a meta-training task appear in a meta-test task. That is, the meta-test set might contain all 20 topic classes but every meta-test task of classes is novel e.g., sci.crypt, and alt.atheism form a 3-way task that is unique if this exact combination does not appear in the meta-training set, although each class individually may be used in tasks for meta-training. Note that this is an easier setup than the experiment in the main paper. Examples for meta-training tasks are taken from training split of 20-newsgroups whereas examples for meta-test tasks are taken from the test split. All hyperparameters and architecture choices are as in supplement  F.

2-way 5-way
Model 1-shot 2-shot 1-shot 2-shot
All 92.4 93.2 82.2 85.2
No Pretrain 62.1 80.1 55.7 63.5
No Side 89.5 91.5 76.8 82.2
No Cloze 88.4 90.9 80.3 83.3
LSTM 62.5 64.2 38.1 43.6
Matching 92.2 82.3
Table 6: Few-shot topic classification: Performance of the ProtoTransformer Network with various ablations on predicting sentence topic using the 20-newsgroups dataset.

Appendix G Runtime and Cost

Training the ProtoTransformer for 300 epochs on the dataset of university coursework takes 12 to 13 hours on a single Titan X GPU. Finetuning 3 layers of a pretrained RoBERTa model fits within the 11Gb memory. All models, baselines, and ablations used the same compute resources. A supervised baseline for a single task takes roughly 15 to 30 minutes to train.

Appendix H Training Tools

We use the Huggingface transformers library [Wolf et al., 2020] for a RoBERTa implementation, which is under a Apache 2.0 license. We use the CodeBERT repository (, which is under a MIT license. For our infrastructure, we use PyTorch and PyTorch Lightning [Falcon, 2019], which have BSD and Apache 2.0 licenses, respectively.