AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam challenge. This paper reports unprecedented success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90% on the exam's non-diagram, multiple choice (NDMC) questions. In addition, our Aristo system, building upon the success of recent language models, exceeded 83% on the corresponding Grade 12 Science Exam NDMC questions. The results, on unseen test questions, are robust across different test years and different variations of this kind of test. They demonstrate that modern NLP methods can result in mastery on this task. While not a full solution to general question-answering (the questions are multiple choice, and the domain is restricted to 8th Grade science), it represents a significant milestone for the field.
This paper reports on the history, progress, and lessons from the Aristo project, a six-year quest to answer grade-school and high-school science exams. Aristo has recently surpassed 90% on multiple choice questions from the 8th Grade New York Regents Science Exam (see Figure 2; the experimental methodology is described in Section 4.1). We begin by offering several perspectives on why this achievement is significant for NLP and for AI more broadly.
|1. Which equipment will best separate a mixture of iron filings and black pepper? (1) magnet (2) filter paper (3) triple-beam balance (4) voltmeter|
|2. Which form of energy is produced when a rubber band vibrates? (1) chemical (2) light (3) electrical (4) sound|
|3. Because copper is a metal, it is (1) liquid at room temperature (2) nonreactive with other substances (3) a poor conductor of electricity (4) a good conductor of heat|
|4. Which process in an apple tree primarily results from cell division? (1) growth (2) photosynthesis (3) gas exchange (4) waste removal|
In 1950, Alan Turing proposed the now well-known Turing Test as a possible test of machine intelligence: If a system can exhibit conversational behavior that is indistinguishable from that of a human during a conversation, that system could be considered intelligent (Turing, 1950). As the field of AI has grown, the test has become less meaningful as a challenge task for several reasons. First, its setup is not well defined (e.g., who is the person giving the test?). A computer scientist would likely know good distinguishing questions to ask, while a random member of the general public may not. What constraints are there on the interaction? What guidelines are provided to the judges? Second, recent Turing Test competitions have shown that, in certain formulations, the test itself is gameable; that is, people can be fooled by systems that simply retrieve sentences and make no claim of being intelligent (Aron, 2011; BBC, 2014). John Markoff of The New York Times wrote that the Turing Test is more a test of human gullibility than machine intelligence. Finally, the test, as originally conceived, is pass/fail rather than scored, thus providing no measure of progress toward a goal, something essential for any challenge problem.
Instead of a binary pass/fail, machine intelligence is more appropriately viewed as a diverse collection of capabilities associated with intelligent behavior. Finding appropriate benchmarks to test such capabilities is challenging; ideally, a benchmark should test a variety of capabilities in a natural and unconstrained way, while additionally being clearly measurable, understandable, accessible, and motivating.
Standardized tests, in particular science exams, are a rare example of a challenge that meets these requirements. While not a full test of machine intelligence, they do explore several capabilities strongly associated with intelligence, including language understanding, reasoning, and use of common-sense knowledge. One of the most interesting and appealing aspects of science exams is their graduated and multifaceted nature; different questions explore different types of knowledge, varying substantially in difficulty. For this reason, they have been used as a compelling—and challenging—task for the field for many years (Brachman et al., 2005; Clark and Etzioni, 2016).
With the advent of contextualized word-embedding methods such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), and most recently RoBERTa (Liu et al., 2019b), the NLP community’s benchmarks are being felled at a remarkable rate. These are, however, internally-generated yardsticks, such as SQuAD (Rajpurkar et al., 2016), GLUE (Wang et al., 2019), SWAG (Zellers et al., 2018), TriviaQA (Joshi et al., 2017), and many others.
In contrast, the 8th Grade science benchmark is an external, independently-generated benchmark where we can compare machine performance with human performance. Moreover, the breadth of the vocabulary and the depth of the questions are unprecedented. For example, in the ARC question corpus of science questions, the average question length is 22 words using a vocabulary of over 6300 distinct (stemmed) words (Clark et al., 2018). Finally, the questions often test scientific knowledge by applying it to everyday situations and thus require aspects of common sense. For example, consider the question: Which equipment will best separate a mixture of iron filings and black pepper? To answer this kind of question robustly, it is not sufficient to understand magnetism. Aristo also needs to have some model of “black pepper” and “mixture” because the answer would be different if the iron filings were submerged in a bottle of water. Aristo thus serves as a unique “poster child” for the remarkable and rapid advances achieved by leveraging contextual word-embedding models in NLP.
Within NLP, machine understanding of textbooks is a grand AI challenge that dates back to the ’70s, and was re-invigorated in Raj Reddy’s 1988 AAAI Presidential Address and subsequent writing (Reddy, 1988, 2003). However, progress on this challenge has a checkered history. Early attempts side-stepped the natural language understanding (NLU) task, in the belief that the main challenge lay in problem-solving. For example, Larkin et al. (1980) manually encoded a physics textbook chapter as a set of rules that could then be used for question answering. Subsequent attempts to automate the reading task were unsuccessful, and the language task itself has emerged as a major challenge for AI.
In recent years there has been substantial progress in systems that can find factual answers in text, starting with IBM’s Watson system (Ferrucci et al., 2010), and now with high-performing neural systems that can answer short questions provided they are given a text that contains the answer (e.g., Seo et al., 2016; Wang et al., 2018). The work presented here continues along this trajectory, but aims to also answer questions where the answer may not be written down explicitly. While not a full solution to the textbook grand challenge, this work is thus a further step along this path.
Project Aristo emerged from the late Paul Allen’s long-standing dream of a Digital Aristotle, an “easy-to-use, all-encompassing knowledge storehouse…to advance the field of AI.” (Allen, 2012). Initially, a small pilot program in 2003 aimed to encode 70 pages of a chemistry textbook and answer the questions at the end of the chapter. The pilot was considered successful (Friedland et al., 2004), with the significant caveat that both text and questions were manually encoded, side-stepping the natural language task, similar to earlier efforts. A subsequent larger program, called Project Halo, developed tools allowing domain experts to rapidly enter knowledge into the system. However, despite substantial progress (Gunning et al., 2010; Chaudhri et al., 2013), the project was ultimately unable to scale to reliably acquire textbook knowledge, and was unable to handle questions expressed in full natural language.
In 2013, with the creation of the Allen Institute for Artificial Intelligence (AI2), the project was rethought and relaunched as Project Aristo (connoting Aristotle as a child), designed to avoid earlier mistakes. In particular: handling natural language became a central focus; most knowledge was to be acquired automatically (not manually); machine learning was to play a central role; questions were to be answered exactly as written; and the project restarted at elementary-level science (rather than college-level) (Clark et al., 2013).
The progress of the Aristo system on the Regents 8th Grade exams (non-diagram, multiple choice part, for a hidden, held-out test set) is shown in Figure 2. The figure shows the variety of techniques attempted, and mirrors the rapidly changing trajectory of the Natural Language Processing (NLP) field in general. Early work was dominated by information retrieval, statistical, and automated rule extraction and reasoning methods (Clark et al., 2014, 2016; Khashabi et al., 2016; Khot et al., 2017; Khashabi et al., 2018). Later work has harnessed state-of-the-art tools for large-scale language modeling and deep learning (Trivedi et al., 2019; Tandon et al., 2018), which have come to dominate the performance of the overall system and reflect the stunning progress of the field of NLP as a whole.
We now describe the architecture of Aristo, and provide a brief summary of the solvers it uses.
The current configuration of Aristo comprises eight solvers, described shortly, each of which attempts to answer a multiple choice question. To study particular phenomena and develop solvers, the project has created larger datasets to amplify and study different problems, resulting in 10 new datasets (ARC, OBQA, SciTail, ProPara, QASC, WIQA, QuaRel, QuaRTz, PerturbedQns, and SciQ) and 5 large knowledge resources (the ARC Corpus, the AristoMini corpus, the TupleKB, the TupleInfKB, and Aristo’s Tablestore) for the community, all available at https://allenai.org/data/data-aristo-all.html.
The solvers can be loosely grouped into:
Statistical and information retrieval methods
Large-scale language model methods
Over the life of the project, the relative importance of the methods has shifted towards large-scale language methods.
Several methods make use of the Aristo Corpus, comprising a large Web-crawled corpus ( tokens (280GB)) originally from the University of Waterloo, combined with targeted science content from Wikipedia, SimpleWikipedia, and several smaller online science texts (Clark et al., 2016).
Three solvers use information retrieval (IR) and statistical measures to select answers. These methods are particularly effective for “lookup” questions where an answer is explicitly stated in the Aristo corpus.
The IR solver searches to see if the question $q$ along with an answer option is explicitly stated in the corpus, and returns the confidence that such a statement was found. To do this, for each answer option $a_i$, it sends $q$ + $a_i$ as a query to a search engine (we use ElasticSearch), and returns the search engine’s score for the top retrieved sentence $s$, where $s$ also has at least one non-stopword overlap with $q$, and at least one with $a_i$. This ensures $s$ has some relevance to both $q$ and $a_i$. This is repeated for all options to score them all, and the option with the highest score is selected. Further details are available in (Clark et al., 2016).
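The retrieve-and-filter loop described above can be sketched as follows. This is a minimal illustration, not Aristo's implementation: `toy_search` is a hypothetical in-memory stand-in for the search engine (it ranks sentences by plain word overlap), and `STOPWORDS` and `content_words` are assumptions introduced only for the example.

```python
import re

# Tiny illustrative stopword list (an assumption; the paper does not specify one).
STOPWORDS = {"a", "an", "the", "of", "by", "be", "may", "is", "are",
             "with", "and", "to", "in"}

def content_words(text):
    """Lowercased non-stopword tokens of a string."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def toy_search(query, corpus):
    """Hypothetical stand-in for a search engine: rank sentences by word overlap."""
    q = content_words(query)
    scored = [(len(q & content_words(s)), s) for s in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def ir_score(question, option, corpus):
    """Score of the best hit that overlaps both the question and the option."""
    q_words, a_words = content_words(question), content_words(option)
    for score, sentence in toy_search(question + " " + option, corpus):
        s_words = content_words(sentence)
        # Keep only sentences relevant to BOTH the question and the option.
        if s_words & q_words and s_words & a_words:
            return score
    return 0

def ir_answer(question, options, corpus):
    """Select the option whose query retrieves the highest-scoring sentence."""
    return max(options, key=lambda opt: ir_score(question, opt, corpus))
```

Calling `ir_answer(question, options, corpus)` with a list of corpus sentences returns the selected option. A real deployment would replace `toy_search` with queries to an actual search index, and would likely normalize word forms before testing overlap.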
The PMI solver uses pointwise mutual information (Church and Hanks, 1989) to measure the strength of the associations between parts of $q$ and parts of $a_i$. Given a large corpus $C$, PMI for two n-grams $x$ and $y$ is defined as $\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$. Here $p(x, y)$ is the joint probability that $x$ and $y$ occur together in $C$, within a certain window of text (we use a 10 word window). The term $p(x)\,p(y)$, on the other hand, represents the probability with which $x$ and $y$ would occur together if they were statistically independent. The ratio of $p(x, y)$ to $p(x)\,p(y)$ is thus the ratio of the observed co-occurrence to the expected co-occurrence. The larger this ratio, the stronger the association between $x$ and $y$. The solver extracts unigrams, bigrams, trigrams, and skip-bigrams from the question $q$ and each answer option $a_i$. It outputs the answer with the largest average PMI, calculated over all pairs of question n-grams and answer option n-grams. Further details are available in (Clark et al., 2016).
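As a rough illustration of the ratio of observed to expected co-occurrence, the sketch below computes PMI for unigrams from window counts over a tokenized corpus. The probability estimates (counts divided by the number of tokens) and the function names are assumptions made for the example; the solver's actual estimation over n-grams and skip-bigrams is described in Clark et al. (2016).

```python
import math
from collections import Counter

def window_counts(tokens, window=10):
    """Count unigrams and unordered pairs co-occurring within `window` tokens."""
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, x in enumerate(tokens):
        # Look ahead at the tokens that share a window with position i.
        for y in tokens[i + 1 : i + window]:
            if x != y:
                pairs[frozenset((x, y))] += 1
    return unigrams, pairs

def pmi(x, y, unigrams, pairs, total):
    """Log of observed co-occurrence probability over that expected under independence."""
    joint = pairs[frozenset((x, y))] / total
    if joint == 0:
        return float("-inf")  # never seen together within the window
    expected = (unigrams[x] / total) * (unigrams[y] / total)
    return math.log(joint / expected)
```

A positive value means the two terms co-occur more often than independence would predict. How unseen pairs are handled when averaging over all question and answer n-grams is a design choice this sketch leaves open.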
Finally, ACME (Abstract-Concrete Mapping Engine) searches for a cohesive link between a question and candidate answer using a large knowledge base of vector spaces that relate words in language to a set of 5000 scientific terms enumerated in a term bank. ACME uses three types of vector spaces: terminology space, word space, and sentence space. Terminology space is designed for finding a term in the term bank that links a question to a candidate answer with strong lexical cohesion. Word space is designed to characterize a word by the context in which the word appears. Sentence space is designed to characterize a sentence by the words that it contains. The key insight in ACME is that we can better assess lexical cohesion of a question and answer by pivoting through scientific terminology, rather than by simple co-occurrence frequencies of question and answer words. Further details are provided in (Turney, 2017).
These solvers together are particularly good at “lookup” questions where an answer is explicitly written down in the Aristo Corpus. For example, they correctly answer:
Infections may be caused by (1) mutations (2) microorganisms [correct] (3) toxic substances (4) climate changes
as the corpus contains the sentence “Products contaminated with microorganisms may cause infection.” (for the IR solver), as well as many other sentences mentioning both “infection” and “microorganisms” together (hence they are highly correlated, for the PMI solver), and both words are strongly correlated with the term “microorganism” (ACME).
The TupleInference solver uses semi-structured knowledge in the form of tuples, extracted via Open Information Extraction (Open IE) (Banko et al., 2007). Two sources of tuples are used:
A knowledge base of 263k tuples, extracted from the Aristo Corpus plus several domain-targeted sources, using training questions to retrieve science-relevant information.
On-the-fly tuples, extracted at question-answering time from the same corpus, to handle questions from new domains not covered by the training set.
TupleInference treats the reasoning task as searching for a graph that best connects the terms in the question (qterms) with an answer choice via the knowledge; see Figure 3 for a simple illustrative example. Unlike standard alignment models used for tasks such as Recognizing Textual Entailment (RTE) (Dagan et al., 2010), however, we must score alignments between the tuples retrieved from the two sources above and a (potentially multi-sentence) multiple choice question.
The qterms, answer choices, and tuple fields (i.e., subject, predicate, objects) form the set of possible vertices, $\mathcal{V}$, of the support graph. Edges connecting qterms to tuple fields and tuple fields to answer choices form the set of possible edges, $\mathcal{E}$. The support graph, $G(V, E)$, is a subgraph of $\mathcal{G}(\mathcal{V}, \mathcal{E})$ where $V$ and $E$ denote “active” nodes and edges, respectively. We define an ILP optimization model to search for the best support graph (i.e., the active nodes and edges), where a set of constraints define the structure of a valid support graph (e.g., an edge must connect an answer choice to a tuple) and the objective defines the preferred properties (e.g., active edges should have high word-overlap). We then use the SCIP ILP optimization engine (Achterberg, 2009) to solve the ILP model. To obtain the score for each answer choice $a_i$, we force the node for that choice to be active and use the objective function value of the ILP model as the score. The answer choice with the highest score is selected. Details of the constraints and of the solver are given in (Khot et al., 2017).
Multee (Trivedi et al., 2019) is a solver that repurposes existing textual entailment tools for question answering. Textual entailment (TE) is the task of assessing if one text implies another, and there are several high-performing TE systems now available. However, question answering often requires reasoning over multiple texts, and so Multee learns to reason with multiple individual entailment decisions. Specifically, Multee contains two components: (i) a sentence relevance model, which learns to focus on the relevant sentences, and (ii) a multi-layer aggregator, which uses an entailment model to obtain multiple layers of question-relevant representations for the premises and then composes them using the sentence-level scores from the relevance model. Finding relevant sentences is a form of local entailment between each premise and the answer hypothesis, whereas aggregating question-relevant representations is a form of global entailment between all premises and the answer hypothesis. This means we can effectively repurpose the same pre-trained entailment function for both components. An example of a typical question and scored, retrieved evidence is shown in Figure 4. Further details are available in (Trivedi et al., 2019).
The QR (qualitative reasoning) solver is designed to answer questions about qualitative influence, i.e., how more/less of one quantity affects another (see Figure 5). Unlike the other solvers in Aristo, it is a specialist solver that only fires for a small subset of questions that ask about qualitative change, identified using (regex) language patterns.
The solver uses a knowledge base of 50,000 (textual) statements about qualitative influence, e.g., “A sunscreen with a higher SPF protects the skin longer.”, extracted automatically from a large corpus. It has then been trained to apply such statements to qualitative questions, e.g.,
John was looking at sunscreen at the retail store. He noticed that sunscreens that had lower SPF would offer protection that is (A) Longer (B) Shorter [correct]
In particular, the system learns through training to track the polarity of influences: For example, if we were to change “lower” to “higher” in the above example, the system will change its answer choice. Another example is shown in Figure 5. Again, if “melted” were changed to “cooled”, the system would change its choice to “(B) less energy”.
The QR solver learns to reason using the BERT language model (Devlin et al., 2018), using the approach described in Section 3.4 below. It is fine-tuned on 3800 crowdsourced qualitative questions illustrating the kinds of manipulation required, along with the associated qualitative knowledge sentence. The resulting system is able to answer questions that include significant linguistic and knowledge gaps between the question and retrieved knowledge (Table 1).
Because the number of qualitative questions is small in our dataset, the solver does not significantly change Aristo’s performance, although it does provide an explanation for its answers. For this reason we omit it in the results later. Further details and a detailed separate evaluation are available in (Tafjord et al., 2019).
Table 1: Examples of linguistic and knowledge gaps between the question and retrieved knowledge.
|“warmer”|“increase temperature”|
|“more difficult”|“slower”|
|“need more time”|“have lesser amount”|
|“decreased distance”|“hugged”|
|“cost increases”|“more costly”|
|“increase mass”|“add extra”|
|“more tightly packed”|“add more”|
|“more land development”|“city grow larger”|
|“not moving”|“sits on the sidelines”|
|“caught early”|“sooner treated”|
|“lets more light in”|“get a better picture”|
|“stronger electrostatic force”|“hairs stand up more”|
|“less air pressure”|“more difficult to breathe”|
|“more photosynthesis”|“increase sunlight”|
|“stronger acid”|“vinegar” vs. “tap water”|
|“more energy”|“ripple” vs. “tidal wave”|
|“closer to Earth”|“ball on Earth” vs. “ball in space”|
|“mass”|“baseball” vs. “basketball”|
|“rougher”|“notebook paper” vs. “sandpaper”|
|“heavier”|“small wagon” vs. “eighteen wheeler”|
The field of NLP has advanced substantially with the advent of large-scale language models such as ELMo (Peters et al., 2018), ULMFit (Howard and Ruder, 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and RoBERTa (Liu et al., 2019b). These models are trained to perform various language prediction tasks such as predicting a missing word or the next sentence, using large amounts of text (e.g., BERT was trained on Wikipedia + the Google Book Corpus of 10,000 books). They can also be fine-tuned to new language prediction tasks, such as question-answering, and have been remarkably successful in the few months that they have been available.
We apply BERT to multiple choice questions by treating the task as classification: Given a question $q$ with answer options $a_i$ and optional background knowledge $K_i$, we provide it to BERT as:
[CLS] $K_i$ [SEP] $q$ [SEP] $a_i$ [SEP]
for each option (only the answer option is assigned as the second BERT “segment”). The [CLS] output token for each answer option is projected to a single logit and fed through a softmax layer, trained using cross-entropy loss against the correct answer.
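The classification setup can be sketched without any model code: one input sequence is built per answer option, and the per-option logits are turned into a distribution with a softmax and scored with cross-entropy. The ordering of fields inside the sequence (knowledge, then question, then answer option) is an assumption of this sketch, as are the function names. The logits would come from the fine-tuned model and are supplied by hand here.

```python
import math

def build_inputs(question, options, knowledge=""):
    """One '[CLS] knowledge [SEP] question [SEP] option [SEP]' string per option."""
    return [f"[CLS] {knowledge} [SEP] {question} [SEP] {opt} [SEP]"
            for opt in options]

def softmax(logits):
    """Numerically stable softmax over the per-option logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, correct_index):
    """Negative log-probability assigned to the correct option."""
    return -math.log(softmax(logits)[correct_index])
```

At prediction time the option with the largest softmax probability is chosen. During training, `cross_entropy` is the quantity being minimized over the question's options.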
The AristoBERT solver uses three methods to apply BERT more effectively. First, we retrieve and supply background knowledge along with the question when using BERT. This provides the potential for BERT to “read” that background knowledge and apply it to the question, although the exact nature of how it uses background knowledge is more complex and less interpretable. Second, we fine-tune BERT using a curriculum of several datasets, including some that are not science related. Finally, we ensemble different variants of BERT together.
For background knowledge we use up to 10 of the top sentences found by the IR solver, truncated to fit into the BERT max tokens setting (we use 256).
Following earlier work on multi-step fine-tuning (Sun et al., 2019), we first fine-tune on the large (87866 qs) RACE training set (Lai et al., 2017), a challenging set of English comprehension multiple choice exams given in Chinese middle and high schools.
We then further fine-tune on a collection of science multiple choice questions sets:
OpenBookQA train (4957 qs) (Mihaylov et al., 2018)
ARC-Easy train (2251 qs) (Clark et al., 2018)
ARC-Challenge train (1119 qs) (Clark et al., 2018)
22 Regents Living Environment exams (665 qs), taken from https://www.nysedregents.org/livingenvironment: months 99/06, 01/06, 02/01, 02/08, 03/08, 04/01, 05/01, 05/08, 07/01, 08/06, 09/01, 09/08, 10/01, 11/01, 11/08, 12/06, 13/08, 15/01, 16/01, 17/06, 17/08, 18/06
We optimize the final fine-tuning using scores on the development set, performing a small hyperparameter search as suggested in the original BERT paper (Devlin et al., 2018).
We repeat the above using three variants of BERT, the original BERT-large-cased and BERT-large-uncased, as well as the later released BERT-large-cased-whole-word-masking (see the 5/31/2019 notes at https://github.com/google-research/bert). We also add a model trained without background knowledge and ensemble them using the combination solver described below.
The AristoRoBERTa solver takes advantage of the recent release of RoBERTa (Liu et al., 2019b), a high-performing and optimized derivative of BERT trained on significantly more text. In AristoRoBERTa, we simply replace the BERT model in AristoBERT with RoBERTa, repeating similar fine-tuning steps. We ensemble two versions together, namely with and without the first fine-tuning step using RACE.
Each solver outputs a non-negative confidence score for each of the answer options along with other optional features. The Combiner then produces a combined confidence score (between 0 and 1) using the following two-step approach.
Each solver can also provide other features capturing aspects of the question or the reasoning path. The output of this first-step classifier is then a calibrated confidence for each solver $s$ and answer option $i$:

$$\mathrm{calib}^{s}_{i} = \frac{1}{1 + \exp\left(-\beta^{s} \cdot f^{s}_{i}\right)}$$

where $f^{s}_{i}$ is the solver-specific feature vector and $\beta^{s}$ the associated feature weights.
The second step uses these calibrated confidences as (the only) features to a second logistic regression classifier from answer option to correct/incorrect, resulting in a final confidence in $[0, 1]$, which is used to rank the answers:

$$\mathrm{confidence}_{i} = \frac{1}{1 + \exp\left(-\gamma_{0} - \sum_{s \in \mathrm{Solvers}} \gamma^{s}\, \mathrm{calib}^{s}_{i}\right)}$$

Here, the feature weights $\gamma^{s}$ indicate the contribution of each solver to the final confidence. Empirically, this two-step approach yields more robust predictions given limited training data compared to a one-step approach where all solver features are fed directly into a single classification step.
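The two-step combination can be illustrated with a small sketch in which both steps are logistic functions of a weighted sum. The weights below are hypothetical values chosen for the example. In the system itself they are learned, and the exact feature set and parameterization are not reproduced here.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def calibrate(features, weights, bias=0.0):
    """Step 1: per-solver logistic model over that solver's own features."""
    return sigmoid(bias + sum(w * f for w, f in zip(weights, features)))

def combine(calibrated, solver_weights, bias=0.0):
    """Step 2: logistic model whose only features are the calibrated confidences."""
    return sigmoid(bias + sum(solver_weights[s] * c for s, c in calibrated.items()))

# Hypothetical learned contributions of two solvers.
solver_weights = {"IR": 0.5, "AristoRoBERTa": 3.0}
option_a = combine({"IR": 0.9, "AristoRoBERTa": 0.2}, solver_weights)
option_b = combine({"IR": 0.3, "AristoRoBERTa": 0.8}, solver_weights)
```

With these illustrative weights, `option_b` receives the higher final confidence even though the IR solver prefers option A, because the second step gives more weight to the stronger solver.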
|Test Set||Num Q||IR||PMI||ACME||TupInf||Multee||AristoBERT||AristoRoBERTa||ARISTO|
ARC-Challenge is defined using IR and PMI results, i.e., it consists of questions that, by definition, both IR and PMI get wrong (Clark et al., 2018).
ARC (Easy + Challenge) includes Regents 4th and 8th as a subset.
This section describes our precise experimental methodology followed by our results.
In the experimental results reported below, we omitted questions that utilized diagrams. While these questions are frequent in the test, they are outside of our focus on language and reasoning. Moreover, the diagrams are highly varied (see Figure 6) and despite work that tackled narrow diagram types, e.g., food chains (Krishnamurthy et al., 2016), overall progress has been quite limited (Choi et al., 2017).
We also omitted questions that require a direct answer (rather than selecting from multiple choices), for two reasons. First, after removing questions with diagrams, they are rare in the remainder. Of the 482 direct answer questions over 13 years of Regents 8th Grade Science exams, only 38 (8%) do not involve a diagram. Second, they are complex, often requiring explanation and synthesis. Both diagram and direct-answer questions are natural topics for future work.
We evaluate Aristo using several datasets of independently-authored science questions taken from standardized tests. Each dataset is divided into train, development, and test partitions, the test partitions being “blind”, i.e., hidden to both the researchers and the Aristo system during training. All questions are taken verbatim from the original sources, with no rewording or modification. As mentioned earlier, we use only the non-diagram, multiple choice (NDMC) questions. We exclude questions with an associated diagram that is required to interpret the question. In the occasional case where two questions share the same preamble, the preamble is repeated for each question so they are independent. The Aristo solvers are trained using questions in the training partition (each solver is trained independently, as described earlier), and then the combination is fine-tuned using the development set.
The Regents exam questions are taken verbatim from the New York Regents Examination board, using the 4th Grade Science, 8th Grade Science, and 12th Grade Living Environment examinations (see https://www.nysedregents.org/ for the original exams). The questions are partitioned into train/dev/test by exam, i.e., each exam is either in train, dev, or test but not split up between them. The ARC dataset is a larger corpus of science questions drawn from public resources across the country, spanning grades 3 to 9, and also includes the Regents 4th and 8th questions (using the same train/dev/test split). Further details of the datasets are described in (Clark et al., 2018). The datasets are publicly available at http://data.allenai.org/arc/, and the 12th Grade Regents data is available on request. Dataset sizes are shown in Table 3. All but 39 of the 9366 questions are 4-way multiple choice, the remaining 39 (0.5%) being 3- or 5-way. A random score over the entire dataset is 25.02%.
For each question, the answer option with the highest overall confidence from Aristo’s combination module is selected, scoring 1 point if the answer is correct, 0 otherwise. In the (very rare) case of N options having the same confidence (an N-way tie) that includes the correct option, the system receives 1/N points (equivalent to the asymptote of random guessing between the N).
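The scoring rule, including partial credit for ties, can be written compactly. This is a sketch of the rule as stated; the function names and the input layout (a list of confidences plus the index of the correct option) are assumptions made for illustration.

```python
def question_score(confidences, correct_index):
    """1 point for a unique correct top choice, 1/N for an N-way tie that
    includes the correct option, and 0 otherwise."""
    best = max(confidences)
    tied = [i for i, c in enumerate(confidences) if c == best]
    return 1.0 / len(tied) if correct_index in tied else 0.0

def exam_score(results):
    """Percentage score over (confidences, correct_index) pairs."""
    return 100.0 * sum(question_score(c, k) for c, k in results) / len(results)
```

For instance, a two-way tie at the top that includes the correct option earns 0.5 points, which is the expected score of guessing at random between the two tied options.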
The results are summarized in Table 2, showing the performance of the solvers individually, and their combination in the full Aristo system. Note that Aristo is a single system run on the five datasets (not retuned for each dataset in turn).
In addition, the results show the dramatic impact of new language modeling technology, embodied in AristoBERT and AristoRoBERTa, the scores for these two solvers dominating the performance of the overall system. Even on the ARC-Challenge questions, containing a wide variety of difficult questions, the language modeling based solvers dominate. The general increasing trend of solver scores from left to right in the table loosely reflects the progression of the NLP field over the six years of the project.
To check that we have not overfit to our data, we also ran Aristo on the most recent years of the Regents Grade Exams (4th and 8th Grade), years 2017-19, that were unavailable at the start of the project and were not part of our datasets. The results are shown in Table 4, showing scores similar to those on our larger datasets, suggesting the system is not overfit.
On the entire exam, the NY State Education Department considers a score of 65% as “Meeting the Standards”, and over 85% as “Meeting the Standards with Distinction” (https://www.nysedregents.org/grade8/science/618/home.html). If this rubric applies equally to the NDMC subset we have studied, this would mean Aristo has met the standard with distinction in 8th Grade Science.
Several authors have observed that for some multiple choice datasets, systems can still perform well even when ignoring the question body and looking only at the answer options (Gururangan et al., 2018; Poliak et al., 2018). This surprising result is particularly true for crowdsourced datasets, where workers may use stock words or phrases (e.g., “not”) in incorrect answer options that give them away. A dataset with this characteristic is clearly problematic, as systems can spot such cues and do well without even reading the question.
To measure this phenomenon on our datasets, we trained and tested a new AristoRoBERTa model giving it only the answer options (no question body nor retrieved knowledge). The results on the test partition are shown in Table 5. We find scores significantly above random (25%), in particular for the 12th Grade set which has longer answers. But the scores are sufficiently low to indicate the datasets are relatively free of annotation artifacts that would allow the system to often guess the answer independent of the question. This desirable feature is likely due to the fact these are natural science questions, carefully crafted by experts for inclusion in exams, rather than mass-produced through crowdsourcing.
|“Answer only”||% Drop|
|Test dataset||4-way MC||8-way MC||(relative)|
One way of testing robustness in multiple choice is to change or add incorrect answer options, and see if the system’s performance degrades (Khashabi et al., 2016). If a system has mastery of the material, we would expect its score to be relatively unaffected by such modifications. To explore this, we investigated adversarially adding extra incorrect options, i.e., searching for answer options that might confuse the system, using AristoRoBERTa, and adding them as extra choices to the existing questions. (For computational tractability, we slightly modify the way background knowledge is retrieved for this experiment only, namely using a search query of just the question body rather than question + answer option.)
To do this, for each question, we collect a large number (roughly 100) of candidate additional answer choices using the correct answers to other questions in the same dataset (and train/test split), where the top 100 are chosen by a superficial alignment score (features such as answer length and punctuation usage). We then re-rank these additional choices using AristoRoBERTa, take the top N, and add them to the original K (typically 4) choices for the question.
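The two-stage selection just described can be sketched as follows. This is a minimal illustration under stated assumptions: `superficial_alignment` is a toy stand-in for the unspecified surface features (answer length and punctuation usage), and `model_score` is a hypothetical callable standing in for AristoRoBERTa's confidence that a candidate answers the question. Neither is the authors' implementation.

```python
from typing import Callable, List, Sequence


def superficial_alignment(candidate: str, correct: str) -> float:
    """Toy surface similarity: closer word count and punctuation count
    give a higher score (0 is a perfect surface match). Hypothetical."""
    def punct(s: str) -> int:
        return sum(not (ch.isalnum() or ch.isspace()) for ch in s)

    length_gap = abs(len(candidate.split()) - len(correct.split()))
    punct_gap = abs(punct(candidate) - punct(correct))
    return -float(length_gap + punct_gap)


def add_adversarial_options(
    question: str,
    options: Sequence[str],
    correct_index: int,
    other_correct_answers: Sequence[str],
    model_score: Callable[[str, str], float],
    n_extra: int = 4,
    pool_size: int = 100,
) -> List[str]:
    """Return the original K options followed by the top-N confusing extras."""
    correct = options[correct_index]
    existing = set(options)
    # Candidate pool: correct answers to other questions, de-duplicated,
    # excluding anything already offered as a choice.
    pool = [c for c in dict.fromkeys(other_correct_answers) if c not in existing]
    # Stage 1: keep the candidates that superficially resemble the answer.
    pool.sort(key=lambda c: superficial_alignment(c, correct), reverse=True)
    shortlist = pool[:pool_size]
    # Stage 2: re-rank by how strongly the model is drawn to each candidate.
    shortlist.sort(key=lambda c: model_score(question, c), reverse=True)
    return list(options) + shortlist[:n_extra]
```

The original choices keep their positions, so `correct_index` stays valid for the enlarged question. With `n_extra=4`, a 4-way question becomes 8-way.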
If we add N=4 extra choices to the normal 4-way questions, they become 8-way multiple choice, and performance drops dramatically (over 40 percentage points), albeit unfairly as we have by definition added choices that confuse the system. We then train the model further on this 8-way adversarial dataset, a process known as inoculation (Liu et al., 2019a). After further training, we still find a drop, but significantly less (around 10 percentage points absolute, 13.8% relative, Table 6), even though many of the new distractor choices would be easy for a human to rule out.
For example, while the solver gets the right answer to the following question:
The condition of the air outdoors at a certain time of day is known as (A) friction (B) light (C) force (D) weather [selected, correct]
it fails for the 8-way variant:
The condition of the air outdoors at a certain time of day is known as (A) friction (B) light (C) force (D) weather [correct] (Q) joule (R) gradient [selected] (S) trench (T) add heat
These results show that while Aristo performs well, it still has some blind spots that can be artificially uncovered through adversarial methods such as this.
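The adversarial results above are quoted both as an absolute drop (percentage points) and as a relative drop (percent of the original score). The short sketch below shows how the two differ. The scores in the usage note are purely illustrative and are not taken from Table 6.

```python
def absolute_drop(score_before: float, score_after: float) -> float:
    """Drop in percentage points between two accuracy scores (in %)."""
    return score_before - score_after


def relative_drop(score_before: float, score_after: float) -> float:
    """Drop expressed as a percentage of the original score."""
    if score_before <= 0:
        raise ValueError("score_before must be positive")
    return 100.0 * (score_before - score_after) / score_before
```

For example, with hypothetical scores of 80.0% on the 4-way questions and 70.0% on the 8-way variant, `absolute_drop(80.0, 70.0)` is 10.0 points while `relative_drop(80.0, 70.0)` is 12.5%. For a given absolute drop, the relative drop is larger when the starting score is lower.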
This section describes related work on answering standardized-test questions, and on math word problems in particular. It provides an overview rather than exhaustive citations.
Standardized tests have long been proposed as challenge problems for AI (e.g., Bringsjord and Schimanski, 2003; Brachman et al., 2005; Clark and Etzioni, 2016; Piatetsky-Shapiro et al., 2006), as they appear to require significant advances in AI technology while also being accessible, measurable, understandable, and motivating.
Earlier work on standardized tests focused on specialized tasks, for example, SAT word analogies (Turney, 2006), GRE word antonyms (Mohammad et al., 2013), and TOEFL synonyms (Landauer and Dumais, 1997). More recently, there have been attempts at building systems to pass university entrance exams. Under NII’s Todai project, several systems were developed for parts of the University of Tokyo Entrance Exam, including maths, physics, English, and history (Strickland, 2013; NII, 2013; Fujita et al., 2014), although in some cases questions were modified or annotated before being given to the systems (e.g., Matsuzaki et al., 2014). Similarly, a smaller project worked on passing the Gaokao (China’s college entrance exam) (e.g., Cheng et al., 2016; Guo et al., 2017). The Todai project was reported as ended in 2016, in part because of the challenges of building a machine that could “grasp meaning in a broad spectrum” (Mott, 2016).
Substantial progress has been achieved on math word problems. On plane geometry questions, Seo et al. (2015) demonstrated an approach that achieves 61% accuracy on SAT practice questions. The Euclid system (Hopkins et al., 2017) achieved 43% recall and 91% precision on SAT “closed-vocabulary” algebra questions, a limited subset of questions that nonetheless constitutes approximately 45% of a typical math SAT exam. Closed-vocabulary questions are those that do not reference real-world situations (e.g., “what is the largest prime smaller than 100?” or “Twice the product of x and y is 8. What is the square of x times y?”).
Work on open-world math questions has continued, but results on standardized tests have not been reported and thus it is difficult to benchmark the progress relative to human performance. See Amini et al. (2019) for a recent snapshot of the state of the art, and references to the literature on this problem.
Answering science questions is a long-standing AI grand challenge (Reddy, 1988; Friedland et al., 2004). This paper reports on Aristo—the first system to achieve a score of over 90% on the non-diagram, multiple choice part of the New York Regents 8th Grade Science Exam, demonstrating that modern NLP methods can result in mastery of this task. Although Aristo only answers multiple choice questions without diagrams, and operates only in the domain of science, it nevertheless represents an important milestone towards systems that can read and understand. The momentum on this task has been remarkable, with accuracy moving from roughly 60% to over 90% in just three years. Finally, the use of independently authored questions from a standardized test allows us to benchmark AI performance relative to human students.
Beyond the use of a broad vocabulary and scientific concepts, many of the benchmark questions intuitively appear to require reasoning to answer (e.g., Figure 5). To what extent is Aristo reasoning to answer questions? For many years in AI, reasoning was thought of as the discrete, symbolic manipulation of sentences expressed in a formally designed language (Brachman and Levesque, 1985; Genesereth and Nilsson, 2012). With the advent of deep learning, this notion of reasoning has shifted, with machines performing challenging tasks using neural architectures rather than explicit representation languages. Today, we do not have a sufficiently fine-grained notion of reasoning to answer this question precisely, but we can observe surprising performance on answering science questions. This suggests that the machine has indeed learned something about language and the world, and how to manipulate that knowledge, albeit neither symbolically nor discretely.
Although an important milestone, this work is only a step on the long road toward a machine that has a deep understanding of science and achieves Paul Allen’s original dream of a Digital Aristotle. A machine that has fully understood a textbook should not only be able to answer the multiple choice questions at the end of the chapter—it should also be able to generate both short and long answers to direct questions; it should be able to perform constructive tasks, e.g., designing an experiment for a particular hypothesis; it should be able to explain its answers in natural language and discuss them with a user; and it should be able to learn directly from an expert who can identify and correct the machine’s misunderstandings. These are all ambitious tasks still largely beyond the current technology, but with the rapid progress happening in NLP and AI, solutions may arrive sooner than we expect.
We gratefully acknowledge the many other contributors to this work, including Niranjan Balasubramanian, Matt Gardner, Peter Jansen, Daniel Khashabi, Jayant Krishnamurthy, Souvik Kundu, Todor Mihaylov, Michael Schmitz, Harsh Trivedi, Peter Turney, and the Beaker team at AI2.
Which is the effective way for gaokao: information retrieval or neural networks? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 111–120.