General-Purpose Question-Answering with Macaw

Despite the successes of pretrained language models, there are still few high-quality, general-purpose QA systems that are freely available. In response, we present Macaw, a versatile, generative question-answering (QA) system that we are making available to the community. Macaw is built on UnifiedQA, itself built on T5, and exhibits strong performance, zero-shot, on a wide variety of topics, including outperforming GPT-3 by over 10% (absolute) on Challenge300, a suite of 300 challenge questions, despite being an order of magnitude smaller (11 billion vs. 175 billion parameters). In addition, Macaw allows different permutations ("angles") of its inputs and outputs to be used: for example, Macaw can take a question and produce an answer; or take an answer and produce a question; or take an answer and question, and produce multiple-choice options. We describe the system, and illustrate a variety of question types where it produces surprisingly good answers, well outside the training setup. We also identify question classes where it still appears to struggle, offering insights into the limitations of pretrained language models. Macaw is freely available, and we hope that it proves useful to the community. Macaw is available at




1 Introduction

Although modern pretrained language models have proved surprisingly effective at solving datasets, e.g., (Radford et al., 2018; Raffel et al., 2020; Khashabi et al., 2020a), there are still few high-quality, general-purpose, off-the-shelf question-answering (QA) systems freely available. UnifiedQA (Khashabi et al., 2020a) is a powerful QA system, but is mainly trained for span prediction and multiple-choice selection rather than answer generation. GPT-3 appears powerful, but is not freely available to the public (Brown et al., 2020). The system nearest to our goal is Google’s T5-based CBQA (closed-book QA) system (Roberts et al., 2020), but in our tests of the T5-CBQA model trained on Natural Questions (Kwiatkowski et al., 2019), it did not perform as well as Macaw (Section 4.2). (There are other T5-CBQA versions alternatively trained on WebQuestions and TriviaQA that we did not evaluate, although NaturalQuestions is arguably the most general and varied of the alternate training sets.) Similarly, the new Jurassic-1 model from AI21 Labs appears powerful, but again did not perform as well in our QA tests (Section 4.2). In response, we are releasing Macaw, a general-purpose, T5-based QA system that we have found to be surprisingly strong, zero-shot. In this paper, we describe how Macaw was trained, illustrate the variety of capabilities it has, and also identify various question classes it still struggles with. We hope Macaw proves useful for the community.

Q: How would you make a house conduct electricity? A: paint it with a metal paint
Q: What is an implication of a cat being an animal? A: it has feelings
Q: Could a newborn look over the top of a fully grown horseradish plant? A: no, because the plant is so tall
Q: Why couldn’t Aristotle have used a cellphone? A: he was born before cellphones
Q: A young boy was rushed to the hospital emergency room, but the ER doctor saw the boy and refused to operate. “This boy is my son,” the doctor said. But the doctor wasn’t the boy’s father. How could this be? [Intended answer to the riddle: The doctor was the boy’s mom] A: the doctor mistook the boy for his own son [Reasonable alternative answer!]

Figure 1: A few examples of the more impressive answers given by Macaw. More examples (both good and bad) are given in Section 5.

Macaw has three interesting features. First, it often produces high-quality answers to questions far outside the domain it was trained on, sometimes surprisingly so. Several examples are shown in Figure 1, and we show numerous other examples later in this paper (Section 5.1). However, it can also make mistakes. We also give a selection of these, and attempt to characterize where its weaknesses are (Section 5.2).

Second, Macaw allows different permutations (“angles”) of inputs and outputs to be used. For example, we can give it a question and get an answer; or give it an answer and get a question; or give it a question and answer and get a set of multiple-choice (MC) options for that question. This multi-angle QA capability (hence the model’s name Macaw, “Multi-angle c(q)uestion-answering”) allows versatility in the way Macaw can be used, including recursively using outputs as new inputs to the system. While other researchers have explored permuting inputs and outputs to some degree, e.g., (Hase et al., 2020), Macaw has such capabilities built into its machinery.

Finally, Macaw also generates explanations as an optional output (or even input) element. Although Macaw’s explanations are typically of lower quality than its answers, and are not full chains of reasoning, the fact that it can generate plausible explanations at all is an unusual feature.

We first describe multi-angle training and how Macaw was trained. We then report quantitative experiments with Macaw, including comparing its zero-shot performance with several other large language models on Challenge300, a suite of 300 challenge questions designed to push various limits of question-answering behavior. We find Macaw outperforms other large-scale models by over 10% (absolute) on this dataset, including GPT-3 despite being an order-of-magnitude smaller (11B Macaw vs. 175B GPT-3). We then give a qualitative analysis of Macaw’s behavior, identifying classes of problems where it succeeds and also where it fails. Finally we reflect on its behavior and offer Macaw to the community. Macaw is available at

2 Multi-Angle Question-Answering

2.1 Slots, Values and Angles

We take advantage of the flexible nature of text-to-text transformers like T5 (Raffel et al., 2020) to train models across multiple “angles” for each dataset. Each example in the dataset is considered as a set of slots and corresponding values. An angle then corresponds to a specific set of source slots and a set of target slots, and the associated task is to predict the values of the target slots given the source values.

For instance, in a multiple-choice QA dataset, like ARC (Clark et al., 2018) or RACE (Lai et al., 2017), the slots might be Q (question), M (MC options), A (correct answer), C (context). The usual task is represented by the primary angle QMC→A (given question, MC choices and context, what is the correct answer?). Other angles might include QC→A (answer the question without seeing the MC choices), QAC→M (generate plausible MC choices), AC→QM (given answer and context, generate a question and answer options). See Figure 2 for more examples.

The semantics of slots are defined by what Macaw saw during training (Section 3). During training, the context (C) contains either a passage or retrieved text relevant to the question, and the explanation (E) consists of a few (typically two or three) general sentences relevant to the answer (but not a formal chain of reasoning). Examples are given in Figure 2 (upper box) and Section 2.6.

Slot: Example value
C (context): Roller skating is a popular hobby these days. Roller skates have four wheels….
Q (question): Which surface is best for roller-skating?
M (MC options): (A) gravel (B) blacktop (C) sand
A (answer): blacktop
E (explanation): A wheeled vehicle requires smooth surfaces.

Sample angle descriptions:

  • Generate answer and explanation given question, choices and context (primary angle).

  • Same, but in absence of retrieved context.

  • Only generate answer.

  • Generate answer without access to MC options.

  • Also include explanation in input.

  • Generate plausible MC options given question, answer and context.

  • Generate plausible question and MC options, given answer and context.

Figure 2: The different slots (input/output elements) and sample angles supported by Macaw.

For each dataset we collect a set of angles which would be considered reasonable tasks to perform. E.g., in the RACE dataset the context is usually critical to answer a situated question (e.g., “What does the doctor think of Heelys?”) so we do not consider the QM→A angle without the context, while this angle is appropriate for ARC where the context is just a potentially helpful retrieved text passage.

2.2 Text Encoding of Angles

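We employ a simple text format to encode an angle input/output pair:

INPUT: "$<target slot 1>$ ; $<target slot 2>$ ; ... $<source slot 1>$ = <source value 1> ; $<source slot 2>$ = <source value 2> ; ..."
OUTPUT: "$<target slot 1>$ = <target value 1> ; $<target slot 2>$ = <target value 2> ; ..."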

In other words, in the INPUT field the desired output slots are listed without any associated value, and the input slots are listed with their corresponding input values. For instance, to provide the “question” and “mcoptions” (multiple-choice options) as inputs, and request the “answer” and “explanation” slots in the output, the INPUT format might look as below, resulting in the corresponding OUTPUT from Macaw:

INPUT: "$answer$ ; $explanation$ ; $question$ = Which surface is best for rollerskating? ; $mcoptions$ = (A) gravel (B) sand (C) blacktop"
OUTPUT: "$answer$ = blacktop ; $explanation$ = A wheeled vehicle requires smooth surfaces."
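As an illustration of the encoding described above, the following is a minimal sketch that builds an INPUT string from a list of requested output slots and a dictionary of input slot values. The function name `encode_angle_input` and its argument names are hypothetical and are not part of the Macaw release; only the string format itself is taken from the text.

```python
def encode_angle_input(output_slots, input_values):
    """Build an INPUT string in the format described in Section 2.2.

    output_slots: names of the slots to be generated (listed without values).
    input_values: dict mapping each input slot name to its text value.
    Both names are illustrative assumptions, not Macaw's own API.
    """
    # Requested output slots come first, with no value attached.
    parts = [f"${slot}$" for slot in output_slots]
    # Input slots follow, each with "= value".
    parts += [f"${slot}$ = {value}" for slot, value in input_values.items()]
    return " ; ".join(parts)


example = encode_angle_input(
    ["answer", "explanation"],
    {
        "question": "Which surface is best for rollerskating?",
        "mcoptions": "(A) gravel (B) sand (C) blacktop",
    },
)
print(example)
```

With these arguments the sketch reproduces the INPUT string of the roller-skating example above.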

2.3 Ordering of Slots within an Angle

We can treat the input slots either as an unordered set or as a sequence in a certain fixed order. Given the nature of the transformer encoder, it is not expected that the input order has great significance. In practice we scramble the order of the input and output slots during training, except for putting the “context” slot at the end, as it tends to be the one that might run over the token limit (512 subword tokens in the case of T5).

If there are multiple output slots, such as producing both answer and explanation, their ordering might carry more significance due to the left-to-right nature of decoding. For example, producing the explanation first and then the answer means that the answer is technically generated conditioned on the already generated explanation. Again, for simplicity and practicality, for Macaw we train (and evaluate) with randomly scrambled orders of output slots.

2.4 Sampling of Angles during Training

We describe the precise training of Macaw shortly. During training, we sample the possible angles across the training set rather than considering every angle for every training instance. The training recipe includes the following:

  • Each angle can have a heuristic scaling factor for how often it is sampled relative to others (used as weak bias for which angles are more meaningful).

  • We iterate through the training instances multiple times (especially if there are many angles and not that many training instances).

  • If a sampled angle does not exist for a training instance (e.g., the explanation value is only available for a subset of instances), the angle is resampled. This allows handling of heterogeneous datasets where slots are partially available.
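The recipe above can be sketched as weighted sampling with rejection. This is a rough illustration under stated assumptions rather than the authors' training code: the representation of an angle as a `(source_slots, target_slots)` pair of slot-letter strings, the function name `sample_angle`, and the example scaling factors are all hypothetical.

```python
import random


def sample_angle(instance, angle_weights, rng):
    """Sample one angle for a training instance, resampling unavailable ones.

    instance: dict mapping slot letter (e.g. "Q") to its value, containing
        only the slots this instance actually has.
    angle_weights: dict mapping (source_slots, target_slots) to a heuristic
        scaling factor (hypothetical values; higher means sampled more often).
    """
    angles = list(angle_weights)
    weights = [angle_weights[a] for a in angles]

    def available(angle):
        source, target = angle
        return set(source + target) <= set(instance)

    # Guard against looping forever when no angle can be used at all.
    if not any(w > 0 and available(a) for a, w in zip(angles, weights)):
        raise ValueError("no angle is available for this instance")

    while True:
        angle = rng.choices(angles, weights=weights, k=1)[0]
        if available(angle):
            return angle  # otherwise: resample


rng = random.Random(0)
weights = {("QMC", "AE"): 2.0, ("QMC", "A"): 1.0, ("AC", "QM"): 0.5}
# This instance has no explanation (E), so QMC->AE can never be returned.
no_explanation = {"Q": "...", "M": "...", "C": "...", "A": "..."}
angle = sample_angle(no_explanation, weights, rng)
print(angle)
```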

For evaluation we generate all angles for every instance (with random scrambling of the slot orders if that was the chosen mode during training, as was done for the Macaw model).

2.5 Decoding and Evaluation

By default, Macaw uses greedy decoding, optionally with a small beam search, which is appropriate for well-defined slot values like answers. For more open-ended slot values, like question generation, Macaw also supports sampling (e.g., nucleus sampling (Holtzman et al., 2020)), allowing alternate outputs to be generated.

When the full output string has been generated (e.g., as in Section 2.2), it is straightforward to parse it with a regular expression pattern to extract the produced slots and values. These can then be evaluated according to their usual metrics. If an expected slot is missing, it is counted as a failure, but in practice this almost never happens.
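One possible way to do this parsing is sketched below. The regular expression and the names `SLOT_PATTERN` and `parse_output` are assumptions made for illustration; the paper does not give the exact pattern used.

```python
import re

# Matches "$slot$ = value", where the value runs until the next
# " ; $slot$ = " separator or the end of the string.
SLOT_PATTERN = re.compile(r"\$(\w+)\$ = (.*?)(?= ; \$\w+\$ = |$)", re.DOTALL)


def parse_output(output, expected_slots=()):
    """Return (slots, missing): parsed slot values and any expected slots
    that did not appear in the output (which would count as failures)."""
    slots = {name: value.strip() for name, value in SLOT_PATTERN.findall(output)}
    missing = [slot for slot in expected_slots if slot not in slots]
    return slots, missing


slots, missing = parse_output(
    "$answer$ = blacktop ; $explanation$ = A wheeled vehicle requires smooth surfaces.",
    expected_slots=("answer", "explanation"),
)
print(slots, missing)
```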

An alternative decoding approach, most suitable for multiple-choice answers, is to forcibly generate each answer choice; the product of the probabilities of the output tokens then provides a confidence score for each answer choice.
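The scoring step can be sketched as follows, assuming the per-token log-probabilities for each forced answer choice have already been obtained from the model (how they are obtained is not shown here). The function names and the numbers in the example are hypothetical.

```python
import math


def choice_confidence(token_logprobs):
    """Product of the token probabilities, computed as exp(sum of log-probs)
    for numerical stability."""
    return math.exp(sum(token_logprobs))


def rank_choices(logprobs_by_choice):
    """Return (choice, confidence) pairs sorted from most to least confident."""
    scored = {c: choice_confidence(lp) for c, lp in logprobs_by_choice.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Made-up token probabilities for two forced answer choices.
ranked = rank_choices(
    {
        "whale": [math.log(0.9), math.log(0.5)],  # 0.9 * 0.5
        "mouse": [math.log(0.1), math.log(0.2)],  # 0.1 * 0.2
    }
)
print(ranked)
```

Because each score is a product over the tokens of one specific answer string, the scores of the listed choices need not sum to one.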

2.6 Examples of Different Angles

Some examples of using Macaw with different angles are shown below, where the right arrow (“→”) indicates Macaw’s generated output, given the inputs. Note that some of these angles were not seen during training (in fact, only the first two are seen during training), yet Macaw is still able to reliably handle them. (The full list of angles seen in training is described shortly in Section 3.2 and Table 3).

Given a question, produce an answer.

Q: What gas do producers produce that consumers need to survive? → A: oxygen

Given a question, produce an answer + explanation.

Q: Which force pulls objects to the ground? → A: gravity E: Gravitational force causes objects that have mass to be pulled down on a planet.

Given an answer, produce a question + explanation.

A: elephant → Q: Which animal has the largest ears? E: The ears of an elephant are the largest.

Given an answer, produce a question + multiple-choice options.

A: car battery → Q: Which of these items is necessary for a car to start? M: (A) car battery (B) windshield wiper blade (C) car radio (D) car radio antenna

Given an explanation, generate a question + answer.

E: The leaves of a plant convert sunlight into food for the plant. → Q: How do plants get energy? A: from the sun

2.7 Output Confidences

Macaw can also output the generation scores for a set of candidate answers, conventionally invoked with an “X” (eXplicit outputs) slot, e.g.:

Q: What is the largest animal in the world? X: (A) mouse (B) whale (C) elephant A: whale (0.007), elephant (0.005), mouse (1.4e-8)

Note the confidences do not add to one, as other answers (e.g., “blue whale”) are possible but are not listed. To further condition the naturally generated answers, the question can be formulated as multiple-choice using the “M” slot as well:

Q: What is the largest animal in the world? M: (A) mouse (B) whale (C) elephant X: (A) mouse (B) whale (C) elephant A: whale (0.999), elephant (3.9e-5), mouse (2.4e-11)

In this case the confidences, which are the product of the internal output token probabilities, do tend to add up to one as the model is strongly biased towards picking one of the answers from the “M” slot.

3 Training Macaw

Macaw is built on top of the text-to-text pretrained T5 transformer (Raffel et al., 2020), by first training a multi-angle version of UnifiedQA (Khashabi et al., 2020b), followed by further fine-tuning on science questions with explanations, using the ARC and ARC-DA datasets along with explanations from WorldTree (Jansen et al., 2018).

3.1 Multi-Angle UnifiedQA

Datasets                          Angles
BoolQ, NarrativeQA, SQuAD 2.0     QC→A, AC→Q
Table 1: Datasets and angles used in training of multi-angle UnifiedQA (the slots are Q=Question, C=Context, M=MC options, A=Answer).

The multi-angle version of UnifiedQA was trained on the 7 core datasets with associated angles listed in Table 1. The 11B model was fine-tuned for 120k steps starting from T5-11B with a batch size of 8 and the Adafactor optimizer. These datasets vary greatly in size (from 1.5k to 130k training instances); following UnifiedQA, we sample equally from the 7 datasets. For direct comparison we also trained a similar single-angle version using the same setup.
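To make the effect of equal sampling concrete, here is a small sketch under the assumption that "sampling equally" means first picking a dataset uniformly at random and then an instance uniformly within it. The function name `sample_equally` and the toy dataset sizes are hypothetical.

```python
import random


def sample_equally(datasets, num_samples, rng):
    """Pick a dataset uniformly at random, then an instance within it.

    datasets: dict mapping dataset name to a list of training instances.
    Each dataset is drawn equally often regardless of its size.
    """
    names = list(datasets)
    batch = []
    for _ in range(num_samples):
        name = rng.choice(names)
        batch.append((name, rng.choice(datasets[name])))
    return batch


rng = random.Random(0)
# Toy stand-ins for a small and a large dataset (sizes are illustrative).
toy = {"small": list(range(15)), "large": list(range(1300))}
draws = sample_equally(toy, 10000, rng)
share_small = sum(1 for name, _ in draws if name == "small") / len(draws)
print(f"share of draws from the small dataset: {share_small:.2f}")
```

Under this reading, instances of the smaller datasets are revisited far more often than instances of the larger ones.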

For the ARC and OBQA datasets, the context (“C”) consists of 10 sentences retrieved from a general text corpus based on the question text plus each of the multiple-choice options (ranked by IR score, but always keeping the top result for each option). For this we use the Aristo Corpus, a Web-crawled corpus containing 280GB of general and science-related sentences augmented with 80k additional science textbook sentences (Clark et al., 2016).
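One way to read the selection rule above (rank by IR score, but always keep the top result for each option) is sketched below. This is an interpretation, not the authors' retrieval code: the function name `select_context`, the input layout, and the tie-breaking and de-duplication behavior are assumptions.

```python
def select_context(hits_by_option, max_sentences=10):
    """Choose context sentences from per-option retrieval results.

    hits_by_option: dict mapping each answer option to a list of
        (sentence, ir_score) pairs retrieved for question + option.
    Assumes there are no more options than max_sentences.
    """
    # Best score seen for each distinct sentence across all options.
    best = {}
    for hits in hits_by_option.values():
        for sentence, score in hits:
            if score > best.get(sentence, float("-inf")):
                best[sentence] = score

    # Always keep the top-scoring hit of every option.
    kept = []
    for hits in hits_by_option.values():
        if hits:
            top_sentence = max(hits, key=lambda hit: hit[1])[0]
            if top_sentence not in kept:
                kept.append(top_sentence)

    # Fill the remaining places with the best-scoring sentences overall.
    for sentence in sorted(best, key=best.get, reverse=True):
        if len(kept) >= max_sentences:
            break
        if sentence not in kept:
            kept.append(sentence)

    return sorted(kept, key=best.get, reverse=True)


context = select_context(
    {
        "(A) gravel": [("s1", 5.0), ("s2", 4.0), ("s3", 3.0)],
        "(B) blacktop": [("t1", 1.0)],
    },
    max_sentences=2,
)
print(context)
```

In the toy call, the low-scoring sentence `t1` is kept because it is the top hit for its option, even though `s2` and `s3` score higher.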

The performance of these models on the primary angle is very similar to the original UnifiedQA model. Table 2 shows a comparison between the scores of the single-angle and multi-angle models on the development sets, showing that the multi-angle model is generally not far behind the single-angle variant, while providing more general functionality through the alternate angles.

We train multi-angle UnifiedQA in three sizes based on T5-11B, T5-3B, and T5-large. As seen in Table 2, for some of the datasets there is a significant drop in evaluation scores for smaller sizes, but the scores are still high in general.

Model          Single-Angle   Multi-Angle UnifiedQA
Dataset        11B            11B     3B      large (770M)
BoolQ          90.8           90.3    89.1    85.4
NarrativeQA    66.5           66.8    65.4    62.8
SQuAD 2.0      91.1           90.3    89.4    86.8
ARC            88.6           87.0    81.9    72.2
MCTest         96.6           95.9    94.4    90.9
OBQA           87.4           88.4    81.8    71.4
RACE           88.0           87.7    84.4    79.2
Table 2: Model performance (averaged over UnifiedQA datasets (dev partition), measuring accuracy except for SQuAD 2.0 (token F1) and NarrativeQA (ROUGE-L)). Multi-angle UnifiedQA retains performance compared with single-angle (for same size models, columns 1 and 2), while adding multi-angle capabilities. Evaluation is on the primary angle (QC→A for the first 3 datasets, QMC→A for the other 4). ARC includes both the Easy and Challenge categories.

3.2 Macaw

For the final Macaw model, we further fine-tune multi-angle UnifiedQA on the ARC dataset as well as the ARC-DA dataset, a dataset of Direct Answer (“open response”, “freeform”) science questions (Bhakthavatsalam et al., 2021) (with 1250 questions in the training set).

For each question we add an input context (“C”) based on retrieval from a text corpus as described in the previous section (for ARC-DA the retrieval is based only on question text as there are no answer options available).

We also add an explanation (“E”) to each question using data from the WorldTree V2 explanation bank (Jansen et al., 2018). WorldTree contains explanations for a subset of the questions in ARC and ARC-DA (the fraction of questions covered is about 65% for ARC and 50% for ARC-DA). We construct a short explanation paragraph by randomly shuffling the sentences marked as “CENTRAL” (in the few cases with more than 5 such sentences, we sample 5 of them).
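The construction of the explanation paragraph can be sketched as below. The input layout (a list of `(text, role)` pairs) and the function name `build_explanation` are hypothetical; the actual WorldTree file format is not reproduced here.

```python
import random


def build_explanation(sentences, rng, max_central=5):
    """Join the CENTRAL sentences of an explanation in random order.

    sentences: list of (text, role) pairs, where role "CENTRAL" marks the
        sentences to use (an assumed layout for illustration).
    If there are more than max_central such sentences, sample max_central.
    """
    central = [text for text, role in sentences if role == "CENTRAL"]
    if len(central) > max_central:
        central = rng.sample(central, max_central)
    rng.shuffle(central)
    return " ".join(central)


rng = random.Random(0)
toy_sentences = [(f"central{i}.", "CENTRAL") for i in range(7)]
toy_sentences += [("other0.", "OTHER"), ("other1.", "OTHER")]
paragraph = build_explanation(toy_sentences, rng)
print(paragraph)
```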

Dataset Angles
Table 3: Datasets and angles used in training of Macaw (with slots as in Table 1 plus E=Explanation).

With five available input/output slots there is a plethora of possible angles to train on. We select a subset that seems the most interesting, as listed in Table 3, and use these for fine-tuning for 6k further steps.

Figure 3: Average score of the four different models on different categories of questions (ignoring categories with fewer than five questions). The numbers in parentheses denote the number of questions in each category. Categories are ordered by average-of-averages (highest to lowest), i.e., the models together perform best on general knowledge and worst on false presuppositions. The tabular version of this data is in the Appendix.

4 Quantitative Performance of Macaw

Dataset+Model    AE      A       A       AE      A       A
ARC (Easy):
Macaw (11B)      90.9    91.2    94.0    85.1    84.9    91.4
Macaw-3B         87.5    87.9    91.6    77.7    76.7    85.3
Macaw-large      82.5    82.5    86.1    66.3    63.9    79.8
ARC (Challenge):
Macaw (11B)      76.9    76.9    86.3    74.6    74.6    86.6
Macaw-3B         69.6    68.2    80.9    66.2    67.9    77.9
Macaw-large      53.9    57.2    66.9    48.2    50.5    67.2
Table 4: Scores for the answer output slot A on ARC (Easy and Challenge) multiple-choice development sets, for six different angles.

4.1 The ARC dataset

While this paper mainly focuses on an analysis of Macaw’s capabilities and limitations, we note that the “answer-focused” variant Macaw-answer-11B is, at the time of publication, at the top of the leaderboards for the datasets ARC (with a score of 81.4%), ARC-Easy (92.7%), and ARC-DA (81%). This variant was trained without the explanation slot and with a focus on the answer-output angles. This model is also available in our software release.

To get a sense of the variation with model size, Table 4 gives scores on the ARC development set for the smaller Macaw-3B and Macaw-large in addition to the default Macaw (11B). There is a clear performance drop for smaller sizes, but the smaller models still provide good performance and might be more practical for deployment and further research iterations.

In Table 4, we observe that if the answer explanation is included in the input angle (result columns 3 and 6), the answer accuracy significantly improves. This is perhaps not surprising as the explanation typically strongly hints at (or even includes) the right answer (the effect is larger than indicated in the tables, since only a subset of questions actually have explanations in the dataset).

One could hypothesize that feeding the model’s own explanation back as input could also help (first run QC→AE, then use the E output as input to QEC→A), but in small-scale tests this generally had only a minor effect on the score, while tending to make the precision-recall curves look worse (presumably because originally uncertain, incorrect answers will now get reinforced through the explanation to higher confidence).

Category # Qns Description + Example
commonsense 38 Obvious (to a person) facts about the world
If I put some cheese in the fridge, will it melt?
comparison 2 Relation between two entities
How do pandas and parrots differ?
entity substitution 4 Find a suitable replacement entity for a task
How would you bang in tent pegs without a hammer?
entity tracking 13 Tracking entity states over time
My house is red. I painted my house white. What color is my house now?
estimation 4 Fermi-style numeric estimation problems (Kalyan et al., 2021)
How many banknotes can you fit in a school bus?
example generation 2 Create an illustration of a general phenomenon
If you let go of an object, then gravity pulls it to the ground. What is an example of this phenomenon?
explanation 14 “Why…?” questions
Why do houses have roofs?
false presupposition 9 Trick questions that presuppose something that is not true (Kim et al., 2021)
What year did Tom Hanks land on the moon?
general knowledge 70 General facts about the world
What is shiplap?
generation 1 Production of prose
Tell me a story about a car accident.
history 2 Questions about world history
What were the causes of World War II?
human behavior 5 Questions involving human emotions
I feel sad. What could I do to cheer myself up?
hypothetical 29 Questions about hypothetical and/or counterfactual situations
If plastic was a conductor, then would a plastic spoon conduct electricity?
math 2 Numeric computations
What is 241 + 7864?
meta-reasoning 6 Questions requiring reflection about reasoning itself
What is an incorrect implication of a cat being an animal?
riddle 2 Trick stories with a non-obvious explanation
A young boy was rushed to the hospital emergency room, but the ER doctor saw the boy and refused to operate. “This boy is my son,” the doctor said. But the doctor wasn’t the boy’s father. How could this be?
science 41 Questions in the general area of science
What gases are involved in photosynthesis?
spatial 11 Various spatial reasoning tasks
John is left of Sue. Where is Sue relative to John?
steps 15 List the sequence of actions to achieve a goal
What are the steps involved in replacing a light bulb?
story understanding 25 Tests for facts implicit in a short story
I crashed my car. When I finally left the hospital, all I wanted to do was sleep. I had to call a taxi. Why was I in hospital?
temporal 2 Reasoning including temporal constraints (example below from (Marcus and Davis, 2020a))
Moshe posted on Facebook a photograph showing Maurice Ravel, Francois Poulenc, Frederic Mompou, and Erik Satie. Satie died in 1925. Poulenc was born in 1899. So the photograph must have been taken when?
Winograd 3 Winograd schema questions (requires commonsense for pronoun resolution) (Levesque et al., 2011)
The elephant couldn’t fit into the box because it was too big. What was too big?
Table 5: Categories of questions in the Challenge300 dataset.

Model                             Score (%)   # incoherent
T5-CBQA (T5.1.1.XXL, NaturalQ)    57.3        28
Jurassic-1 (jumbo, T=0)           64.9        12
GPT-3 (davinci T=0)               64.9        10
Macaw (11B)                       75.0        2
Table 6: Scores on the Challenge300 dataset, plus absolute number of incoherent (nonsensical) answers produced. Macaw significantly outperforms the other systems on this dataset. All models are applied zero-shot.

4.2 The Challenge300 Dataset

We also assembled a dataset of 300 challenge questions, called Challenge300, based on our attempts to “break” Macaw using a wide variety of question types. Most of the questions were created from scratch, probing different styles of problem, plus a handful were drawn from the excellent challenge questions in (Davis, 2016) and (Marcus and Davis, 2020a). We recorded all the questions tried (both those Macaw got right, and those it got wrong), rather than cherry-picking good/bad cases. We also performed a loose classification of those questions into 22 different categories, described in Table 5. Note that this categorization is somewhat approximate, as questions can fall into more than one category (in such cases, we attempted to select the dominant category). However, it is still informative for analyzing the successes and failures of Macaw, which we discuss in detail in Section 5 shortly.

For comparison, we also gave the Challenge300 questions to T5-CBQA (size XXL; the most powerful CBQA model, built on T5-11B with further pretraining using salient span masking (SSM)), GPT-3 (davinci), and the recent Jurassic-1 (jumbo) model from AI21 Labs. For GPT-3, the prompt was simply “Q: question A:”; the GPT-3 continuation invariably contains the answer followed by more QA pairs labeled “Q:” “A:”, and we truncate off those additional QA pairs. As the questions are direct answer (“open response”), with (typically) multiple, valid answers, we scored the answers manually. We also noted the (rare) cases where an answer was incoherent, i.e., was nonsensical (rather than just wrong). For example, for the question “Mary owns a canary named Paul. Did Paul has any ancestors who were alive in the year 1750?” the answer “Paul” was marked as incoherent. The Challenge300 dataset, answers given by each of the four systems tested, and their scores are available as part of the Macaw release.

The results on this dataset are shown in Table 6. On this small but challenging test, Macaw significantly outperforms the other systems (+10% absolute), as well as having fewer incoherent answers.

We also computed the average scores per question category, shown in Figure 3, ignoring categories with fewer than five questions (where averages are more unreliable). The tabular version of this data, including all categories, is also provided in the Appendix. At a high level, the general trends are similar among the different language models tested, e.g., they all tend to excel at general knowledge questions, but struggle with questions based on false presuppositions.

5 Qualitative Results and Analysis

We now present a qualitative analysis of Macaw’s strengths and weaknesses, based on the Challenge300 questions. Note that most of these questions are well outside the domain of the training questions, and no other context is supplied to Macaw to answer them beyond the questions themselves. Also note that all these examples are using Macaw zero-shot.

We first present some of the impressive (to us) examples of Macaw’s behavior. We then show some cases where it produces wrong answers, and characterize those failures.

5.1 Macaw: Good Examples

Below, Q denotes the question posed to Macaw, while A denotes Macaw’s generated answer. We also tentatively speculate about what skills the answers appear to reveal.

1. Entity Substitution: Identifying a suitable replacement item. We observe similar good answers for several other probes (e.g., if no stove, use a microwave).

Q: James went camping in the woods, but forgot to bring a hammer to bang the tent pegs in. What else might he use? A: rocks
Q: James went camping in the woods, but forgot to bring a hammer to bang the tent pegs in. Why might a rock be a useful substitute for the hammer? A: it is hard

2. Hypotheticals: Macaw shows some ability to reason about novel (hypothetical/counterfactual) situations. The second contrast query below suggests that Macaw was doing more than just guessing.

Q: If plastic was a conductor, then would a plastic spoon conduct electricity? A: yes
Q: If plastic was an insulator, then would a plastic spoon conduct electricity? A: no

Similarly, Macaw gives a surprisingly insightful answer about a novel situation:

Q: How would you make a house conduct electricity? A: paint it with a metal paint

Finally, the two examples below hint at some awareness of the role of an object (here, a roof) in a hypothetical situation:

Q: If my house did not have a roof, and it rained, what would happen to my bed upstairs? A: it would get wet
Q: If my house had a roof, and it rained, what would happen to my bed upstairs? A: it would stay dry

as well as correctly reasoning about a hypothetical bird:

Q: If a bird didn’t have wings, how would it be affected? A: it would be unable to fly

3. Meta-reasoning:

This surprising behavior hints at knowing what an implication is, and being able to illustrate that knowledge in a specific context.

Q: What is an implication of a cat being an animal? A: it has feelings
Q: What is an incorrect implication of a cat being an animal? A: it is a human

4. False presuppositions:

Questions based on a false premise, such as the one below, frequently fool language models (Kim et al., 2021). The example below is somewhat well-known, e.g., Google (incorrectly) answers 1995, but Macaw surprisingly handles this gracefully. In general, however, Macaw is fooled by this kind of example (Section 5.2).

Q: What year did Tom Hanks land on the moon? A: no year

5. Story Understanding:

In our tests, Macaw generally answers reading comprehension questions about short stories well, for example:

Q: Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating. What happened that might cause Karen to become good friends with her roommate? A: they had a good time together

6. Commonsense:

The example below suggests Macaw has some commonsense awareness about the relationships between melting, temperature, and fridges.

Q: If I put some cheese in the fridge, will it melt? A: no, because it will be too cold

7. Comparisons:

Macaw is able to answer unusual comparison questions, such as below. Such questions are interesting as they are unlikely to have been seen in Macaw’s/T5’s training corpora.

Q: How do pandas and parrots differ? A: pandas have black and white fur, while parrots have brightly colored feathers and are usually green or yellow in color

5.2 Macaw: Example Failures

Macaw answered about 24% of the Challenge300 questions incorrectly. Below we show some examples of these failures, and categorize the nature of the problem that it struggles with. Again, note that all these examples are using Macaw zero-shot.

1. Non-trivial Arithmetic: Macaw generally fails at non-trivial arithmetic, although it appears to know the answer is some number. Macaw will handle simple arithmetic correctly, e.g., Q: What is 7 + 5? 12 [correct]; Q: What is 25 + 59? 84 [correct], but with larger numbers (3 or more digits) it typically fails. Interestingly, the answers are often in the right ballpark, e.g., below. Similarly, Q: What is 42623 + 543456? 870315 [incorrect, though the right order of magnitude].

Q: What is 241 + 7864? A: 12345 [wrong]
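
For reference, the correct sums for the two failing examples above can be computed directly. The following is only an illustrative check in Python, not part of Macaw or its evaluation: the `reported` list transcribes the answers quoted in this section, and the helper `check` and its field names are our own.

```python
# Each entry: (first operand, second operand, answer reported for Macaw above).
reported = [
    (241, 7864, 12345),
    (42623, 543456, 870315),
]

def check(a, b, predicted):
    """Compare a predicted sum against the true sum a + b."""
    truth = a + b
    return {
        "truth": truth,
        "correct": predicted == truth,
        # Equal digit counts serve as a crude proxy for "same order of magnitude".
        "same_digit_count": len(str(predicted)) == len(str(truth)),
        "relative_error": abs(predicted - truth) / truth,
    }

results = [check(a, b, p) for a, b, p in reported]
for (a, b, p), r in zip(reported, results):
    print(f"{a} + {b} = {r['truth']}; reported answer {p}; "
          f"relative error {r['relative_error']:.2f}")
```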

2. Entity Tracking and State Changes: Macaw struggles with stories about a changing world state. The below example is similar to those in the bAbI dataset (Tasks 2 and 3) (Weston et al., 2016). While models can be trained for reasoning about processes and a changing world state, e.g., (Henaff et al., 2016; Dalvi et al., 2018; Tandon et al., 2018), Macaw does not have this capability zero-shot.

Q: John was in the bedroom. John put a coin in a bag. Then John took the bag to the kitchen. Then John took the coin out of the bag. John took the bag to the garden. Where was the bag at the end? A: in the kitchen [wrong]
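
To make the expected answer concrete, the story above can be replayed with an explicit world state. The sketch below is a hand-written Python illustration, not a description of how Macaw or any cited model works. The event encoding (`put`, `carry`, `take_out`) is our own, and it assumes that the bag and coin start in the room where John is and that a removed item stays in the room where it was taken out.

```python
def replay(events, start_room="bedroom"):
    """Replay story events and return the final room of each entity."""
    # Assumption: the bag and the coin start wherever John starts.
    room = {"John": start_room, "bag": start_room, "coin": start_room}
    inside = {}  # item -> container currently holding it
    for verb, *args in events:
        if verb == "put":            # ("put", item, container)
            item, container = args
            inside[item] = container
            room[item] = room[container]
        elif verb == "take_out":     # ("take_out", item)
            (item,) = args
            inside.pop(item)         # assumption: the item stays in the current room
        elif verb == "carry":        # ("carry", obj, destination)
            obj, dest = args
            room["John"] = room[obj] = dest
            # Anything inside the carried object moves with it.
            for item, container in inside.items():
                if container == obj:
                    room[item] = dest
    return room

story = [
    ("put", "coin", "bag"),
    ("carry", "bag", "kitchen"),
    ("take_out", "coin"),
    ("carry", "bag", "garden"),
]
final = replay(story)
print(final["bag"])  # the bag ends up in the garden, not the kitchen
```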

Similarly, Macaw struggles to combine counting and object tracking, as illustrated below.

Q: Imagine an empty cup. Now put a coin in the cup. Now take the coin out of the cup. How many coins are in the cup now? A: 1 [wrong]

A final example of struggling to track worlds is below:

Q: My house is red. I painted my house white. What color is my house now? A: it is now white [correct]
Q: My house is blue. I painted my house red. What color was my house before I painted it? A: blue [correct]
Q: My house is red. My neighbor’s house is blue. I painted my house white. What color is my house now? A: it’s the same color as my neighbor’s house [wrong]
Q: My house is red. I painted my neighbor’s house white. What color is my house now? A: half red, half white [wrong]
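
The intended answers to the cup and house questions above follow from very simple state updates. As a minimal Python sketch (the variable names and the `paint` helper are ours, purely for illustration), the cup is a counter and painting overwrites only the color of the house that was painted:

```python
# Cup example: a counter that goes up and then down again.
coins_in_cup = 0       # "Imagine an empty cup."
coins_in_cup += 1      # "Now put a coin in the cup."
coins_in_cup -= 1      # "Now take the coin out of the cup."

def paint(colors, house, new_color):
    """Return a new color map with `house` repainted; other houses are unchanged."""
    updated = dict(colors)
    updated[house] = new_color
    return updated

# Third house question: only my house is repainted.
third = paint({"mine": "red", "neighbor": "blue"}, "mine", "white")
# Fourth house question: the neighbor's original color is not stated (None).
fourth = paint({"mine": "red", "neighbor": None}, "neighbor", "white")

print(coins_in_cup, third["mine"], fourth["mine"])
```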

3. Spatial Reasoning:

Macaw is unreliable at non-trivial spatial reasoning, as illustrated below, suggesting it is not constructing a latent spatial layout during question interpretation:

Q: John is left of Sue. Where is Sue relative to John? A: to his right [correct]
Q: John is left of Sue. Sue is left of Mike. Where is Mike relative to John? A: to his right [correct]
Q: John is left of Sue. Sue is left of Mike. Where is Sue relative to John? A: to his left [wrong]
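
For comparison, an explicit spatial layout makes these questions easy to answer. The sketch below is an illustrative Python example only (the function names are ours): it orders the people from left to right using the stated "left of" facts and then reads off relative positions. It assumes the facts determine a single left-to-right order, as they do in the examples above; with fewer facts, the order returned is just one of several consistent layouts.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def layout(left_of):
    """Assign each name a left-to-right index consistent with (left, right) facts."""
    sorter = TopologicalSorter()
    for left, right in left_of:
        sorter.add(right, left)  # `left` must be placed before `right`
    return {name: index for index, name in enumerate(sorter.static_order())}

def where_is(a, b, positions):
    """Answer 'Where is a relative to b?' from the layout."""
    return "to the left" if positions[a] < positions[b] else "to the right"

positions = layout([("John", "Sue"), ("Sue", "Mike")])
print(where_is("Sue", "John", positions))   # Sue relative to John
print(where_is("Mike", "John", positions))  # Mike relative to John
```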

4. Hypotheticals: Meaning-Shifting Modifiers. Below, a subtle change (“electric car”) fools Macaw: an electric car does not run on gas, so it would be unaffected.

Q: If an electric car did not have any gas in it, how would it be affected? A: it would not move [wrong]

5. Some Winograd Schema: Winograd Schema questions ask a model to correctly resolve a pronoun, typically requiring commonsense (Levesque et al., 2011). While Macaw generally answers them correctly, it can make mistakes such as the example below.

Q: The elephant couldn’t fit into the box because it was too big. What was too big? A: the box [wrong]

6. False Presuppositions: In general, Macaw is fooled by questions with false presuppositions, such as the one below. (A counterexample was shown earlier in Section 5.1, where Macaw noted that Tom Hanks has not been to the moon.)

Q: How old was Mark Zuckerberg when he founded Google? A: 17 [wrong, trick question]

7. Commonsense: While Macaw correctly answers many questions apparently requiring commonsense, it can make errors, such as the below:

Q: What happens if I drop a glass on a bed of feathers? A: the glass shatters [wrong]

Similarly, Macaw fails the following false-presupposition commonsense question, adapted from (Davis, 2016):

Q: Sally’s favorite cow died yesterday. When will the cow be alive again? A: in a few days [wrong]

Marcus and Davis have similarly probed GPT-3 with other commonsense problems (asking for a sensible continuation of a setup), and note that GPT-3 fails a substantial number of them (Marcus and Davis, 2020b). While we have not probed such commonsense capabilities systematically for Macaw, we expect it would similarly fail many of them. We note that in our results by category in Figure 6, all the models perform relatively poorly on commonsense questions (compared with other types).

Orthogonal to our question categories, we also note the following interesting behaviors of Macaw:

Overly Brief Answers: Macaw typically generates short answers, sometimes so short that they can barely be considered correct, such as below:

Q: How does photosynthesis work? A: sun shines on plants [highly incomplete]

Iterative queries: Below, we ask a simple question, then re-ask the question but ask for a different answer. Macaw is able to sustain this for several iterations, but eventually breaks down. (Below, the manual question and Macaw answer are shown together on the same line.)

What is made of metal? a car
What is made of metal, besides a car? a door
What is made of metal, besides a car and a door? a bicycle
What is made of metal, besides a car, a door, and a bicycle? a spoon
What is made of metal, besides a car, a door, a bicycle, and a spoon? a spoon [Now repeating an answer]
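
The iterative querying procedure above can be sketched as a small loop. This is a hypothetical Python illustration, not code from the Macaw release: `ask` stands for any function mapping a question string to an answer string, and here it is replaced by `replay_model`, a stand-in that simply replays the answers reported above. The helpers `join_items` and `iterate` are our own names.

```python
def join_items(items):
    """Join items as an English list: 'a', 'a and b', 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]

def iterate(ask, stem, max_rounds=10):
    """Re-ask `stem`, excluding earlier answers, until an answer repeats."""
    answers = []
    for _ in range(max_rounds):
        suffix = f", besides {join_items(answers)}" if answers else ""
        answer = ask(f"{stem}{suffix}?")
        if answer in answers:
            return answers, answer   # the model has started repeating itself
        answers.append(answer)
    return answers, None

# Stand-in for the model: replays the answers reported in the text above.
canned = iter(["a car", "a door", "a bicycle", "a spoon", "a spoon"])
asked = []

def replay_model(question):
    asked.append(question)
    return next(canned)

answers, repeated = iterate(replay_model, "What is made of metal")
print(answers, repeated)
```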

Generating Narratives: We can similarly ask Macaw to generate a plausible event sequence (“story”) by iteratively giving a scenario, asking “What happens next?”, and then adding the answer back into the question and re-asking it. For example:

Some kids are planning a rollerskating race. What happens next? They practice.
Some kids are planning a rollerskating race. They practice. What happens next? They fall.
Some kids are planning a rollerskating race. They practice. They fall. What happens next? …

Eventually Macaw starts repeating answers, as illustrated below as a continuation of the earlier questions. The sequence of events in the question below reflects Macaw’s earlier answers to the “… What happens next?” questions.

Some kids are planning a rollerskating race. They practice. They fall. They get up and try again. They fall again. They give up. They lose interest in the sport. They stop trying. They never learn. They never learn. They never learn. They never learn. … What happens next? They never learn.

While a possibly plausible sequence, this is hardly a good story.
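
The narrative-generation procedure can be sketched in the same spirit. The Python below is a hypothetical illustration rather than code from the Macaw release: `ask` is any question-to-answer function, replaced here by `replay_model`, a stand-in that replays the answers listed above, and the loop stops as soon as the newest answer repeats the previous event.

```python
def tell_story(ask, scenario, max_steps=20):
    """Grow a story by appending each answer to the prompt and re-asking."""
    events = []
    for _ in range(max_steps):
        prompt = " ".join([scenario, *events, "What happens next?"])
        answer = ask(prompt)
        if events and answer == events[-1]:
            break  # the answer repeats the last event, so stop
        events.append(answer)
    return events

# Stand-in for the model: replays the answers reported in the text above.
canned = iter([
    "They practice.", "They fall.", "They get up and try again.",
    "They fall again.", "They give up.", "They lose interest in the sport.",
    "They stop trying.", "They never learn.", "They never learn.",
])
prompts = []

def replay_model(prompt):
    prompts.append(prompt)
    return next(canned)

events = tell_story(replay_model, "Some kids are planning a rollerskating race.")
print(" ".join(events))
```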

5.3 Other Models’ Answers

While our focus here is on Macaw, we note that the three other models tested (GPT-3, T5-CBQA, and Jurassic-1) similarly exhibit moments of both brilliance and ignorance on Challenge300, with overall lower scores than Macaw (Table 6). The full list of all the models’ answers is included in the Macaw release.

5.4 Harmful and Offensive Answers: A Note of Caution

As a final word of caution: like other pretrained models that have seen potentially harmful/offensive text in pretraining, Macaw is capable of producing biased and/or offensive answers depending on the question, a phenomenon of concern and significant attention in the community, e.g., (Li et al., 2020; Zhou et al., 2021). Care must be used when deploying large-scale language models such as Macaw in practical settings.

6 Summary

To assist other researchers, we have released Macaw, a high-quality, T5-based QA system that exemplifies both the power and limits of current pretrained language models. Macaw exhibits strong performance, zero-shot, on a wide variety of topics, including outperforming GPT-3 by over 10% (absolute) on Challenge300, a suite of 300 challenge questions, despite being an order of magnitude smaller (11 billion vs. 175 billion parameters). In addition, a Macaw-based model currently tops the leaderboards on the ARC datasets (Section 4.1). One might consider Macaw as a language model highly optimized for question-answering tasks, including allowing different permutations of input/output slots (“angles”) related to question-answering.

We have also illustrated some surprisingly impressive answers Macaw produces, as well as some categories of questions that it still struggles with, providing insights into the strengths and weaknesses of Macaw and likely other transformer-based QA systems. We hope that Macaw proves useful to the community, both as a zero-shot QA system, and as a strong starting point for further fine-tuning on specific tasks where training data is available and the highest precision possible is required. Macaw is available at


  • Bhakthavatsalam et al. (2021) S. Bhakthavatsalam, D. Khashabi, T. Khot, B. D. Mishra, K. Richardson, A. Sabharwal, C. Schoenick, O. Tafjord, and P. Clark. Think you have solved direct-answer question answering? try arc-da, the direct-answer ai2 reasoning challenge. ArXiv, abs/2102.03315, 2021.
  • Brown et al. (2020) T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In NeurIPS, 2020.
  • Clark et al. (2018) P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. ArXiv, abs/1803.05457, 2018.
  • Clark et al. (2016) P. Clark, O. Etzioni, T. Khot, A. Sabharwal, O. Tafjord, P. D. Turney, and D. Khashabi. Combining retrieval, statistics, and inference to answer elementary science questions. In AAAI, 2016.
  • Dalvi et al. (2018) B. Dalvi, L. Huang, N. Tandon, W. tau Yih, and P. Clark. Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension. In NAACL-HLT, 2018.
  • Davis (2016) E. Davis. How to write science questions that are easy for people and hard for computers. AI Mag., 37:13–22, 2016.
  • Hase et al. (2020) P. Hase, S. Zhang, H. Xie, and M. Bansal. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In EMNLP, 2020.
  • Henaff et al. (2016) M. Henaff, J. Weston, A. D. Szlam, A. Bordes, and Y. LeCun. Tracking the world state with recurrent entity networks. In ICLR, 2016.
  • Holtzman et al. (2020) A. Holtzman, J. Buys, M. Forbes, and Y. Choi. The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2020.
  • Jansen et al. (2018) P. A. Jansen, E. Wainwright, S. Marmorstein, and C. T. Morrison. Worldtree: A corpus of explanation graphs for elementary science questions supporting multi-hop inference. In LREC, 2018. Also arXiv:1802.03052.
  • Kalyan et al. (2021) A. Kalyan, A. Kumar, A. Chandrasekaran, A. Sabharwal, and P. Clark. How much coffee was consumed during emnlp 2019? fermi problems: A new reasoning challenge for ai. In EMNLP, 2021.
  • Khashabi et al. (2020a) D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. Unifiedqa: Crossing format boundaries with a single qa system. In EMNLP, 2020a.
  • Khashabi et al. (2020b) D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi. Unifiedqa: Crossing format boundaries with a single QA system. In EMNLP, 2020b.
  • Kim et al. (2021) N. Kim, E. Pavlick, B. K. Ayan, and D. Ramachandran. Which linguist invented the lightbulb? presupposition verification for question-answering. ArXiv, abs/2101.00391, 2021.
  • Kwiatkowski et al. (2019) T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.
  • Lai et al. (2017) G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. RACE: Large-scale reading comprehension dataset from examinations. In EMNLP, 2017.
  • Levesque et al. (2011) H. Levesque, E. Davis, and L. Morgenstern. The winograd schema challenge. In KR, 2011.
  • Li et al. (2020) T. Li, D. Khashabi, T. Khot, A. Sabharwal, and V. Srikumar. Unqovering stereotyping biases via underspecified questions. In EMNLP, 2020.
  • Marcus and Davis (2020a) G. Marcus and E. Davis. Experiments testing gpt-3’s ability at commonsense reasoning: Results. Technical report, NYU, 2020a.
  • Marcus and Davis (2020b) G. Marcus and E. Davis. Gpt-3, bloviator: Openai’s language generator has no idea what it’s talking about. MIT Technology Review, Aug 2020b.
  • Radford et al. (2018) A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
  • Raffel et al. (2020) C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
  • Roberts et al. (2020) A. Roberts, C. Raffel, and N. M. Shazeer. How much knowledge can you pack into the parameters of a language model? In EMNLP, 2020.
  • Tandon et al. (2018) N. Tandon, B. Dalvi, J. Grus, W. tau Yih, A. Bosselut, and P. Clark. Reasoning about actions and state changes by injecting commonsense knowledge. In EMNLP, 2018.
  • Weston et al. (2016) J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-Complete question answering: A set of prerequisite toy tasks. In ICLR, 2016.
  • Zhou et al. (2021) X. Zhou, M. Sap, S. Swayamdipta, N. A. Smith, and Y. Choi. Challenges in automated debiasing for toxic language detection. In EACL, 2021.

Appendix: Average Scores of Models on the Challenge300 Question Categories

Table 7 provides the histogram data from Figure 3 in tabular form, plus remaining question categories with fewer than 5 questions that were not included in the histogram (where average scores may be unreliable).

                             Model
Qn Category           # Qns  Macaw  GPT-3  T5-CBQA  Jurassic-1  Average of Averages
commonsense              38   0.50   0.53     0.42        0.47                 0.48
comparison                2   1.00   0.50     0.50        1.00                 0.75
entity substitution       4   1.00   0.63     1.00        1.00                 0.91
entity tracking          13   0.50   0.62     0.65        0.54                 0.58
estimation                4   0.88   1.00     0.75        0.50                 0.78
example generation        2   1.00   1.00     0.50        0.00                 0.63
explanation              14   0.68   0.43     0.68        0.64                 0.61
false presupposition      9   0.11   0.00     0.00        0.00                 0.03
general knowledge        70   0.93   0.79     0.73        0.80                 0.81
generation                1   1.00   1.00     0.00        1.00                 0.75
history                   2   1.00   1.00     1.00        1.00                 1.00
human behavior            5   0.70   0.60     0.90        0.60                 0.70
hypothetical             29   0.78   0.59     0.71        0.52                 0.65
math                      2   0.00   0.50     0.00        0.00                 0.13
meta-reasoning            6   1.00   0.67     0.67        0.33                 0.67
riddle                    2   1.00   0.50     0.00        0.50                 0.50
science                  41   0.76   0.65     0.43        0.74                 0.65
spatial                  11   0.73   0.77     0.45        0.82                 0.69
steps                    15   0.87   0.73     0.31        0.77                 0.67
story understanding      25   0.88   0.72     0.77        0.72                 0.77
temporal                  2   0.25   0.00     0.25        0.25                 0.19
Winograd                  3   0.67   1.00     0.00        1.00                 0.67
ALL                     300   0.75   0.65     0.57        0.65                 0.66

Table 7: Average score of models on different question categories in Challenge300. (See Figure 3 for histogram).
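
As a sanity check on the transcription of Table 7, the final column can be recomputed as the unweighted mean of the four model scores in each row, and the question counts can be summed. The Python below is our own illustrative check, not part of the Macaw release; the tolerance of 0.006 is an assumption that allows for the two-decimal rounding of the reported values.

```python
# (category, # questions, Macaw, GPT-3, T5-CBQA, Jurassic-1, reported average)
rows = [
    ("commonsense", 38, 0.50, 0.53, 0.42, 0.47, 0.48),
    ("comparison", 2, 1.00, 0.50, 0.50, 1.00, 0.75),
    ("entity substitution", 4, 1.00, 0.63, 1.00, 1.00, 0.91),
    ("entity tracking", 13, 0.50, 0.62, 0.65, 0.54, 0.58),
    ("estimation", 4, 0.88, 1.00, 0.75, 0.50, 0.78),
    ("example generation", 2, 1.00, 1.00, 0.50, 0.00, 0.63),
    ("explanation", 14, 0.68, 0.43, 0.68, 0.64, 0.61),
    ("false presupposition", 9, 0.11, 0.00, 0.00, 0.00, 0.03),
    ("general knowledge", 70, 0.93, 0.79, 0.73, 0.80, 0.81),
    ("generation", 1, 1.00, 1.00, 0.00, 1.00, 0.75),
    ("history", 2, 1.00, 1.00, 1.00, 1.00, 1.00),
    ("human behavior", 5, 0.70, 0.60, 0.90, 0.60, 0.70),
    ("hypothetical", 29, 0.78, 0.59, 0.71, 0.52, 0.65),
    ("math", 2, 0.00, 0.50, 0.00, 0.00, 0.13),
    ("meta-reasoning", 6, 1.00, 0.67, 0.67, 0.33, 0.67),
    ("riddle", 2, 1.00, 0.50, 0.00, 0.50, 0.50),
    ("science", 41, 0.76, 0.65, 0.43, 0.74, 0.65),
    ("spatial", 11, 0.73, 0.77, 0.45, 0.82, 0.69),
    ("steps", 15, 0.87, 0.73, 0.31, 0.77, 0.67),
    ("story understanding", 25, 0.88, 0.72, 0.77, 0.72, 0.77),
    ("temporal", 2, 0.25, 0.00, 0.25, 0.25, 0.19),
    ("Winograd", 3, 0.67, 1.00, 0.00, 1.00, 0.67),
]
all_row = ("ALL", 300, 0.75, 0.65, 0.57, 0.65, 0.66)

def mismatches(table, tolerance=0.006):
    """Return categories whose reported average differs from the mean of the four scores."""
    bad = []
    for name, _, *scores, reported in table:
        mean = sum(scores) / len(scores)
        if abs(mean - reported) > tolerance:
            bad.append(name)
    return bad

total_questions = sum(row[1] for row in rows)
print(total_questions, mismatches(rows + [all_row]))
```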