The need for high quality education at scale poses a difficult challenge. The price of education per student is growing faster than economy-wide costs [Bowen2012], limiting the resources available to support student learning. When also considering the rising need to provide adult retraining, the gap between the demand for education and our ability to provide is especially large. In recent years, massively open online courses (MOOCs) from platforms like Coursera and Code.org have made progress by scaling the delivery of content. However, MOOCs largely ignore an equally important ingredient for learning: high quality feedback. The clear societal need, alongside massive amounts of data has led to a machine learning grand challenge: learn how to provide feedback for education at scale, especially in computer science due to its apparent structure and high demand.
Scaling feedback (a.k.a. “feedback" challenge) has proven to be a hard machine learning problem. Despite dozens of projects to combine massive datasets with cutting edge deep learning, current approaches fall short. Three issues emerge: (1) for even basic computer science education, homework datasets have statistical distributions with heavy tails similar to natural language; (2) hand labeling feedback is expensive, rendering supervised solutions infeasible; (3) in real world contexts feedback is needed for assignments with small (or zero) historical records of student learning. For the billions of learners around the world, most education and assessments have at most hundreds of records. Even if students use Code.org, assignments are constantly changing, making the small-data context perennial. It is a zero-shot solution that has potential to deliver enormous social impact.
We build upon a simple insight that enables us to move beyond the supervised paradigm: When experts give feedback they are asked to perform the hard task of predicting misconception () given program (). When breaking down the cognitive steps that experts go through, they often solve the inference by first thinking generatively . They imagine, “if a student were to have a particular set of misconceptions, what sorts of programs are they likely to produce." Thinking generatively is much easier: while there are a finite set of decomposable misconceptions, they combine into exponential amounts of unique solutions.
We formalize this intuition into a technique we call “rubric sampling" to elicit samples from an expert prior of the joint distributionand use deep learning for inference . With no historical examples, rubric sampling enables feedback with accuracy close to the fidelity of human teachers, outperforming data-intensive state of the art algorithms. We case study this technique on Code.org, an online programming platform that has been used by 610 million students and has provided a full curriculum to 29 million students, equivalent to 39% of the US K12 population.
Specific contributions in this paper:
We introduce the Zero Shot Feedback Challenge and an open access dataset from 8 assignments from Code.org along with an evaluation set of 800 labels.
We articulate a novel solution: rubric sampling with deep learning inference which sets the new state of the art in code feedback prediction: F1 score doubled over baseline, approaching human level accuracy.
We introduce the ability to (i) attribute feedback to specific parts of code, (ii) trace learning over time and (iii) generate synthetic datasets.
The Zero Shot Feedback Challenge
The “Zero-Shot" Feedback Challenge is to infer the misconceptions that led to errors in a student’s answer using zero historical examples of student work and zero expert annotations. Though this challenge is difficult, it is a task that humans find straightforward. Experts are especially adept at generalizing: an instructor does not need to see thousands of instances of a misunderstanding in order to understand it.
Why is zero-shot so important? Human annotated examples are surprisingly time consuming to acquire. In 2014, Code.org launched an initiative where hundreds of thousands of instructors were crowdsourced to provide feedback to student solutions111http://www.code.org/hints. Labeling was hard and undesirable work and the long tail of unique solutions meant that even after thousands of human hours of teacher work, the annotations were only scratching the surface of feedback. The initiative was cancelled after two years and the effort has not been reproduced since. For small classrooms and massive online platforms alike, it is infeasible to acquire the supervision required for contemporary nonlinear methods.
We foresee three main approaches: (1) learn to transfer information from one assignment to another, (2) learn to incorporate expert knowledge, and (3) form algorithms that can generalize from small amounts of human annotations.
If you were to solve an assignment on Code.org today, the hints you would be given are generated from a unit test system combined with static analysis of the students solution. It has been a widely reported social-good objective to improve upon these hints [Price and Barnes2017] especially since the state of the art is far from ideal [O’Rourke, Ballweber, and Popovií2014]. Achieving this goal has proved to be hard. Previous research on a more basic set of Code.org challenges (the “Hour of Code") have scratched the surface with respect to providing feedback at scale. Original work found that latent patterns in how students solve programming assignments have signal as to how they should proceed [Piech et al.2015c]
. Applying a neural network improved prediction of feedback[Piech et al.2015a] but models were (1) too far from human accuracy, (2) weren’t able to explain its predictions and (3) required massive amounts of data. The current state of the art combines these ideas and provides some improvements [Wang et al.2017]. In this paper we propose a method which uses less data, approaches human accuracy and works on more complex Code.org assignments by diverging from the classic supervised framework. Research on feedback for even more complex assignments such as medical education [Geigle, Zhai, and Ferguson2016] and natural language questions [Bulgarov and Nielsen2018] has also relied on data-hungry supervised learning and perhaps would benefit from a rubric sampling inspired approach.
Theoretical inspiration for our expert-based generative rubric sampling derives from Brown’s “Repair Theory" which argues that the best way to help students is to understand the generative origins of their mistakes [Brown and VanLehn1980]. Simulating student cognition has been applied to simple arithmetic problems [Koedinger et al.2015] and recent hard coded models have been very successful in inferring why students make subtraction mistakes [Feldman et al.2018]. Researchers have argued that such expert models are infeasible for assignments as complex as coding [Paaßen et al.2017]. However, the automated hierarchical decomposition achieved by [Nguyen et al.2014] inspired us to develop rubric sampling, a simple expert model that works for programming assignments.
The Code.org Exercises Dataset
Code.org is an online education platform for teaching beginners fundamental concepts in programming. Students build their solutions in a drag-and-drop interface that pieces together blocks of code. Its growing popularity since 2013 has captured a large audience, having been used by 610 million students worldwide. We investigate a single curriculum consisting of 8 exercises from Code.org’s catalog. In this particular unit, students are learning to combine nested for loops with variables, and particularly the use of a for loop counter in calculations. The problems are all presented as drawing geometric shapes in a 2D coordinate space, requiring knowledge of angles. For instance, the curriculum begins with the task of drawing an equilateral triangle (see Figure 1).
The dataset is compiled from 54,488 students. Each time a student runs their code, the submission is saved. Each student is then associated with a trajectory of programs whose length depends on how many tries the student took. In total, there are 1,598,375 submissions. Since the exercises do not have a bounded solution space, students could produce arbitrarily long programs. This implies that, much like natural language, the distribution of student submissions has extremely heavy tails. Figure 2 shows how closely the submissions follow a Zipf distribution. To emphasize the difficulty, even after a million students, there is still a 15% chance that a new student generates a solution never seen before.
If we only cared about accuracy, we would prioritize the handful of 5 to 10 programs that make up the majority of the dataset. But given the infinite number of possible programs, struggling students who would benefit most from feedback will almost certainly not submit any one of the “most likely" programs. Knowing this, we define our evaluation metrics in terms of the Zipf itself: let thehead (of the Zipf) refer to the top 20 programs ordered by frequency, the tail as any program with a frequency of three or less, and the body as everything in between. Figure 2 shows the rough placement of these three splits. When evaluating models, we ignore the head: these very common programs can be manually labeled. Instead, we will report two F1 scores222We choose F1 score over accuracy as the labels are not close to balanced. Thus, accuracy tends to overinflate numbers.: one for programs in the body and one for the tail.
We collected fine-grained human annotations to over 800 unique solutions (chosen randomly from P1 and P8) from 7 teaching assistants from the Stanford introduction to programming course. The annotations are binary labels of 20 misconceptions that cover geometric concepts (e.g. doesn’t understand equilateral is 60 degrees) to control flow concepts (e.g. repeats code instead of using a loop). 200 annotations were used to measure inter-rater reliability and the remaining 600 were used for evaluation. We refer to this dataset as . Labeling took 25.9 hours (117 seconds per program). At this rate, the entire dataset would take 9987 hours of expert time, over a year of continual work.
We consider learning tasks given a dataset of labeled examples, where each example (indexed by ) has an input stringcomposed of independent binary labels. In this context, we assume each string represents a block-based program in Lisp-like notation. Specifically, a program string is a sequence of tokens, each representing either an operator (functions, for loop, if statements, etc.), an operand (i.e. variable), an open parenthesis “(", or a close parenthesis “)". See Listing 1 for an example program. Formally then, we describe the dataset as: where . The goal is to learn a function such that we minimize the error metric defined above, . For supervised approaches, we split into a training () and test set () via a 80-20 ratio. For unsupervised methods, we evaluate on the entire set .
As a sanity check, we can naively make predictions by voting for the majority label from .
Predicting from Output
The ubiquitous way to provide feedback is via unit tests: analyze the output of executing . For Code.org, each execution trace results in a sequence of output vectors
representing coordinates in 2D space where a line has been drawn. We train a recurrent neural network (RNN) to predictfrom . Unfortunately, this model requires to compile.
Feedforward Neural Network
To circumvent compilation, one can tackle the more difficult problem of predicting feedback from raw program strings [Piech et al.2015b]. We train a
-dimensional classifier composed of a RNN over tokens by minimizing the binary cross entropy between predictionsand ground truth vectors. The model architecture borrows the sentence encoder (without any stochastic layers) from [Bowman et al.2015]
and concatenates a 3-layer perceptron with a softmax overoutput dimensions. As we will reuse these architectures for other models, we refer to deterministic encoder as the program network and the latter MLP as the feedback network.
No model so far uses the fact that each student submits many programs before either stopping or reaching the correct solution. In fact, the current SOTA [Wang et al.2017] is to associate a trajectory of programs with the label corresponding to the last program, . For each program , we can train an embedding , where is the program network. This results in a sequence of embeddings for a single trajectory. We concurrently train a second RNN to compress the sequence to a single vector . This is then provided as input to the feedback network. The hope is that structure induced by a trajectory implicitly provides labels that strengthen learning.
Deep Generative Model
Finally, we present a new baseline that is promising in the context of limited data. If we consider programs and feedback as two modalities, one approach is to capture the joint distribution . Doing so, we can make predictions by sampling from the conditional having seen the program:
. To do this, we train a multimodal variational autoencoder, MVAE[Wu and Goodman2018] with two channels. Succinctly, this generative model uses a product-of-experts inference network where the joint distribution factorizes into a product of distributions defined by two modalities: . We optimize the multimodal evidence lower bound [Wu and Goodman2018, Vedantam et al.2017], which is a sum of three lower bounds:
To parameterize and , we use architectures from [Bowman et al.2015]. For and , we use 3-layer MLPs333 is composed of the program network and a stochastic layer; is equivalent to the feedback network.. To the best of our knowledge, this is the first application of a deep generative model to the feedback challenge.
So far the models have been severely constrained by the number of labels. If we had a more efficient labeling strategy, we could better train these deep models to their full potential. For instance, imagine instead of focusing on individual programs, we ask experts to describe a student’s thought process, enumerating strategies to get to a right or wrong answer. Given a detailed enough description, we can use it to label indefinitely. Granted, these labels will be noisy but the quantity should make up for any uncertainty. In fact, we can formalize such a “description" as a context-free grammar.
A context-free grammar (CFG) is composed of a set of acyclic production rules describing a space of output strings. As its name suggests, each rule is applied regardless of context (meaning no conditional arguments). Formally, a production rule is made up of non-terminal and terminal symbols. Non-terminal symbols are hidden and either produce another non-terminal symbol or a terminal one. Terminal symbols are made up of tokens that appear in the final output string. For instance, consider the following CFG: . S and A are non-terminal symbols whereas and are terminal. It is easy to see that this CFG can only generate one of . A probabilistic context-free grammar (PCFG) is a CFG parameterized by a vector where each production rule has a probability attached. For example, we can make our CFG from above probabilistic: . Now, the space of possible outputs has not changed but for example, will be more probable than .
For the feedback challenge, the non-terminal symbols are labels, and the terminal symbols are programs, . For example, a possible production rule might be:
With a PCFG, we can generate an infinite amount of synthetically labeled programs, , and use to train data-hungry models. We refer to this process as rubric sampling. In practice, we sample 1e6 synthetic examples.
Creating rubrics is surprisingly easy. For a novice (undergraduate) and an expert (professor), making a PCFG took 19.4 minutes. To make the process even easier, we developed a simple meta language for representing a grammar.444This language will be open sourced after review.
Further Learning from Unlabeled Programs
As students use the platform, unlabeled submissions accumulate over time. We refer to the dataset as .
We can use unlabeled data in rubric sampling to automatically learn . This means alleviating some of the burden for a human-in-the-loop, since choosing is often more difficult than designing the grammar itself. But since a PCFG is discrete, we cannot directly differentiate. However, we can hope to approximate local gradients by sampling values within some -neighborhood and computing finite differences along these random directions [Salimans et al.2017]. If we repeatedly take a linear combination of the “best" samples as measured by a fitness function, then over many iterations, we expect the PCFG to improve. The challenge is in picking the fitness function.
A good choice is to pick whose generated distribution, is “closest" to 555In tuning , we consider all programs, not just the unique set.. As both are Zipf-ian, we can properly measure “closeness" using a rank order metric [Havlin1995], as rank is independent of frequency.
Rubric Sampling with MVAE
Another way to service unlabeled data is to train with it: one of the features of the MVAE is that it can handle missing modalities. We can fit the MVAE with two data sources: and .
Thus, the MVAE is shown both a synthetic minibatch, , which is used to compute the multimodal elbo, and an unlabeled minibatch, , which computes Equation 2
. We can jointly optimize the two losses by summing the gradients prior to taking an optimization step. Intuitively, this interpolates betweenand , no longer completely relying on the PCFG. One can also interpret this as a soft-version of structure learning since using is somewhat akin to “editing" the PCFG.
Capturing a Zipf is hard for a generative model since it is very easy to memorize the top 5 programs and very hard to capture the tail. To make it easier, we apply a log transformation,666We preserve examples that appear only once in to i.e. where and . i.e. if a program appears 10 times in , it only appears once in the transformed dataset, . Then, when generating with the MVAE, we invert the log by exponentiating the frequency of each unique program (exp-Zipf). Intuitively, log-Zipf is similar to “tempering" a distribution as it reduces any extremes.
Recreation of Human Labels
reports a set of F1 scores, including human-level performance estimated from annotations. Each model is given two bar plots, one for programs in the body (left) and one in the tail (right). First, we see that baselines have lower F1 scores compared to models that take advantage of synthetic data. That being said, the new baseline MVAE we introduced already performs far better than the previous SOTA. In P1, using rubric sampling increases the F1 score by 0.31 in the body and 0.13 in the tail (we find similar gains in P8). These promising results imply that the grammar indeed is effective. We also find that combining the MVAE with rubric sampling boosts the F1 by an additional 0.2, reaching 94% accuracy in P1 and 95% in P8. With these scores, we are reasonably confident that for a new student, despite the likelihood that he/she will submit a never-before-seen program, we will provide good feedback.
To put these results in terms of impact, Table 1 estimates the number of correct feedback we could have given to students in P1 to P8 based on projections from the annotated set. Over the full curriculum, our best model would have provided the correct feedback to an expected 126,000 additional programs compared to what Code.org currently uses, potentially helping thousands more students.
|Model||Amount of Correct Feedback|
|Predicting from output||1,483,157 (86.0%)|
|Rubric sampling with MVAE||1,610,020 (93.7%)|
|Expert human||1,658,162 (96.2%)|
Tracing Knowledge Across Curricula
We can build similar rubrics and train supervised networks to predict feedback for P2 through P7. This is quite powerful as it allows us to estimate student understanding over a curricula. Critically, we can gauge the performance of both individual students and the student body as a whole. These sort of queries are valuable to teachers to be able to (1) measure a student’s progress scalably and (2) judge how useful assignments and lessons have been.
In Figure 5, we analyze the average student’s level of understanding over the 8 problems for two main concepts: loops and geometry (i.e. shapes, angles, movement). For each submission in a student’s trajectory, we classify it as having either 1) no errors, 2) more loop errors, or 3) more geometry errors777We do so by comparing the summed predicted probabilities for all labels related to loops, and labels related to geometry, . If both and , then we classify this program as “no errors". Otherwise, we classify based on which quantity is larger.. The figure shows the distribution of students in each of the three categories from the first 10 submissions. From looking at behavior within a problem and between problems, we can make the following inferences:
Most students are successfully completing each problem. In other words, the blue area is increasing over time. Equivalently, the number of students still struggling by the 10th submission approaches a small number.
The difficulty of problems is not uniform. P6 is much more difficult than the others as the fraction of students with correct solutions is roughly constant. In contrast, P1, P4, and P5 are easier, where students quickly cease to make mistakes. As a teacher, one could use this information to improve the educational content and better hone in on areas where more students struggle.
Students are learning geometry better than looping. The rate that the pink area approaches zero is consistently faster than that of the orange area. By P8, students are barely struggling with geometry but a fair proportion still find looping difficult. As the curriculum was intended to teach nested looping, one interpretation is that the drawing aspect was somewhat distracting.
Fine-grain Feedback: Code Highlighting
With most online programming education, the form of feedback is limited to pop-up tips or compiler messages. But, with generative models we can provide more fine-grain feedback through highlighting subsets of the program responsible for the model predicting each feedback label.
First, if we consider a PCFG, the task of highlighting a program is equivalent to finding the most likely parsing in a probabilistic tree that would generate . In practice, we use the A* algorithm for fast Viterbi parsing [Klein and Manning2003]. Given the most likely parsing, we can follow the trajectory from root to leaf and record which sub-programs are generated by non-terminal nodes. This has one major limitation: only programs within the support of the PCFG can be highlighted. To bypass this, we can curate a synthetic dataset with each program having a segmentation mask denoting which tokens to highlight. If we treat the mask as an additional modality, we can then learn the joint distribution over programs, labels, and masks. See [Wu and Goodman2018] for details in defining a VAE with three modalities. In Figure 4b, we randomly sample 4 programs and show segmentation masks. The running time to compute a mask is negligible, meaning that this can be used for providing on-the-fly feedback to students. Moreover, highlighting provide a notion of interpretability (which is extremely important if we are working with students), much like Grad-Cam [Selvaraju et al.2017]
did for computer vision.
Clustering Students by Level of Understanding
With any latent variable generative model, the rich latent space provides a vector representation for unstructured data. Using the MVAE, we first sample for all ; then we train a t-SNE model [Maaten and Hinton2008] on the samples to reduce to two dimensions. In Figure 6b and 6c, we color each embedded program from by whether the true label is positive or negative. Clearly, we see that the space is partitioned to group programs with similar feedback together. In Figure 6a, we see that is also organized into disjoint clusters. This implies that even with no data about a new student we can draw inferences just by knowing who their neighbors are in latent space.
We get closer to human level performance than previous SOTA.
Any of the rubric sampling models beat the SOTA by at least 0.46 in P1 and 0.24 in P8, nearly tripling the F1 score in P1 and doubling in P8. In both exercises, our best model is just under 95% accuracy, which is encouraging for this to be implemented in the real world.
We can effectively track student growth over time.
With a high performing model, we can analyze students over time. For Code.org, we were able to (1) gauge what individual students struggle with, (2) gauge what a collective of students struggle with, (3) identify the effectiveness of a curriculum, and (4) identify the difficulty of problems.
Making a rubric is not difficult nor costly.
It is not the case that only experts can make a good rubric. We also asked an undergraduate (novice) to make a rubric for P1 and found that while an experts’ rubric averaged in F1 score, the novice’s averaged , both of which are much higher than baselines (). Furthermore, we measured that it took a group of teaching assistants 24.9 hours to label 800 unique programs. In comparison, it took a novice an average of 19.4 minutes to make a rubric.
We provide feedback to programs that do not compile.
Rubic sampling and MVAE make no assumptions on program behavior, working out-of-the-box from the 1st student.
We do not need to handpick when designing a rubric.
In Figure 1, the PCFG uses hand-picked by the creators. However, one can argue that it is difficult to know how often students make mistakes and yet, the choice of is important: performance drops if we randomly sample. For example, in P1, using hand-picked has a increase over random in F1-score. Fortunately, we can use evolutionary strategies to find a good minima starting from a random initialization. Over 3 runs, we found that learning reduces the difference to in P1 and even beating expert parameters by in P8. The takeaway is that we only have to define the rubric structure, not the probabilities.
We can generate and open-source a large dataset of student programs.
Datasets with student data are difficult to release to the open community. But large public datasets have been a dominant force in pushing the boundaries of research. Luckily, with generative models, we can curate our own “Imagenet" for education. But, we want to ensure that our dataset matchesin distribution. Intuitively, it is impossible that a PCFG can capture since that would require production rules that span the entire tail of the Zipf. In fact, as shown in Figure 7, the PCFG is not that faithful. One remedy is to use the MVAE trained with as that is interpolating between distributions. Figure 7, confirms that the MVAE matches the shape of much better.
We introduce the zero shot feedback challenge and offered a novel solution. On a widely used platform, we show rubric sampling to far surpass the SOTA. We combine this with a generative model to cluster students, highlight misconceptions, and incorporate historical data. We see our approach as a viable form for feedback ready for real world use.
- [Bowen2012] Bowen, W. G. 2012. The cost disease in higher education: is technology the answer? The Tanner Lectures Stanford University.
- [Bowman et al.2015] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
- [Brown and VanLehn1980] Brown, J. S., and VanLehn, K. 1980. Repair theory: A generative theory of bugs in procedural skills. Cognitive science 4(4):379–426.
- [Bulgarov and Nielsen2018] Bulgarov, F. A., and Nielsen, R. 2018. Proposition entailment in educational applications using deep neural networks. In AAAI.
- [Feldman et al.2018] Feldman, M. Q.; Cho, J. Y.; Ong, M.; Gulwani, S.; Popović, Z.; and Andersen, E. 2018. Automatic diagnosis of students’ misconceptions in k-8 mathematics. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 264. ACM.
- [Geigle, Zhai, and Ferguson2016] Geigle, C.; Zhai, C.; and Ferguson, D. C. 2016. An exploration of automated grading of complex assignments. In Proceedings of the Third (2016) ACM Conference on Learning@ Scale, 351–360. ACM.
- [Havlin1995] Havlin, S. 1995. The distance between zipf plots. Physica A: Statistical Mechanics and its Applications 216(1-2):148–150.
- [Klein and Manning2003] Klein, D., and Manning, C. D. 2003. A parsing: fast exact viterbi parse selection. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 40–47. Association for Computational Linguistics.
- [Koedinger et al.2015] Koedinger, K. R.; Matsuda, N.; MacLellan, C. J.; and McLaughlin, E. A. 2015. Methods for evaluating simulated learners: Examples from simstudent. In AIED Workshops.
- [Maaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. Journal of machine learning research 9(Nov):2579–2605.
- [Nguyen et al.2014] Nguyen, A.; Piech, C.; Huang, J.; and Guibas, L. 2014. Codewebs: scalable homework search for massive open online programming courses. In Proceedings of the 23rd international conference on World wide web, 491–502. ACM.
- [O’Rourke, Ballweber, and Popovií2014] O’Rourke, E.; Ballweber, C.; and Popovií, Z. 2014. Hint systems may negatively impact performance in educational games. In Proceedings of the first ACM conference on Learning@ scale conference, 51–60. ACM.
- [Paaßen et al.2017] Paaßen, B.; Hammer, B.; Price, T. W.; Barnes, T.; Gross, S.; and Pinkwart, N. 2017. The continuous hint factory-providing hints in vast and sparsely populated edit distance spaces. arXiv preprint arXiv:1708.06564.
- [Piech et al.2015a] Piech, C.; Huang, J.; Nguyen, A.; Phulsuksombati, M.; Sahami, M.; and Guibas, L. 2015a. Learning program embeddings to propagate feedback on student code. In Proceedings of the 32nd International Conference on Machine Learning.
- [Piech et al.2015b] Piech, C.; Huang, J.; Nguyen, A.; Phulsuksombati, M.; Sahami, M.; and Guibas, L. 2015b. Learning program embeddings to propagate feedback on student code. arXiv preprint arXiv:1505.05969.
- [Piech et al.2015c] Piech, C.; Sahami, M.; Huang, J.; and Guibas, L. 2015c. Autonomously generating hints by inferring problem solving policies. In Proceedings of the Second (2015) ACM Conference on Learning@ Scale, 195–204. ACM.
- [Price and Barnes2017] Price, T. W., and Barnes, T. 2017. Position paper: Block-based programming should offer intelligent support for learners. In Blocks and Beyond Workshop (B&B), 2017 IEEE, 65–68. IEEE.
- [Salimans et al.2017] Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
- [Selvaraju et al.2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 618–626.
- [Vedantam et al.2017] Vedantam, R.; Fischer, I.; Huang, J.; and Murphy, K. 2017. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762.
- [Wang et al.2017] Wang, L.; Sy, A.; Liu, L.; and Piech, C. 2017. Learning to represent student knowledge on programming exercises using deep learning. In Proceedings of the 10th International Conference on Educational Data Mining; Wuhan, China, 324–329.
- [Wu and Goodman2018] Wu, M., and Goodman, N. 2018. Multimodal generative models for scalable weakly-supervised learning. arXiv preprint arXiv:1802.05335.