BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles

by Yunxiang Zhang, et al.
Peking University

A riddle is a question or statement with double or veiled meanings, followed by an unexpected answer. Solving riddles is a challenging task for both machines and humans: it tests the capability of understanding figurative, creative natural language and of reasoning with commonsense knowledge. We introduce BiRdQA, a bilingual multiple-choice question answering dataset with 6614 English riddles and 8751 Chinese riddles. For each riddle-answer pair, we provide four distractors with additional information from Wikipedia. The distractors are generated automatically at scale with minimal bias. Existing monolingual and multilingual QA models fail to perform well on our dataset, indicating that there is a long way to go before machines can beat humans at solving tricky riddles. The dataset has been released to the community.






1 Introduction

In recent years, many large-scale, high-quality datasets have tackled Question Answering from various angles, including extractive questions rajpurkar2016squad, multi-hop questions yang2018hotpotqa, commonsense questions talmor-etal-2019-commonsenseqa and logical questions liu2020logiqa. While the reasoning required to answer questions is going deeper and deeper, the QA community seems to neglect the importance of understanding the question itself before reasoning toward the answer. Understanding language requires both linguistic knowledge and commonsense knowledge lobue-yates-2011-types, and we should keep in mind that words should not always be taken at face value veale2011creative. However, existing QA datasets are commonly sourced from Wikipedia yang2015wikiqa; rajpurkar2016squad and news articles trischler2017newsqa, which inevitably leads to superficial, literal questions that are easy to comprehend. In order to find more challenging questions with diverse linguistic styles and rich commonsense knowledge, we turn to riddles for help.

Figure 1: Examples from the English (the upper one) and Chinese (the lower one) part of BiRdQA. There are four distractors and one correct answer for a riddle and an introduction from English Wikipedia or Chinese Baidu Baike is provided for each candidate. Note that Chinese riddles have additional hints. The checkmark indicates the correct answer.

A riddle is a traditional verbal expression which contains one or more descriptive elements, a pair of which may be in opposition; the referent of the elements is to be guessed robert1963definition. As a classical form of guessing game, the riddle has been a popular part of folk literature across many countries. One of the most widespread riddles appears in Sophocles’s play Oedipus the King sophocles1994oedipus, where Oedipus is tasked with saving the city of Thebes from a Sphinx by solving her riddle: “What goes on four legs in the morning, on two legs at noon, and on three legs in the evening? – human”. While it was a matter of life and death for Oedipus to answer the riddle correctly, it is interesting to investigate the potential of AI for solving riddles, given the rise of deep learning and pretraining based methods in NLP.

Consider the English riddle shown in Figure 1, “As far as I run, I never meet my brother on my right. What am I? – wheel”. When people initially come across the word “brother”, they naturally think of a living creature such as human or animal. But if they think deeper about the description of the brother’s location (“on my right”) and the reason why it “never meets” its “brother”, they will realize that the riddle may be talking about an inanimate object. In fact, it is a wheel that runs in parallel with its counterpart on the opposite side of a vehicle. This typical example shows the following properties of riddle which make it challenging:

  • Figurative. A common trick of riddles is personification, which gives everyday objects human characteristics. Other types of figurative language used in riddles include metaphor, pun, and hyperbole senderovich2016review. The use of various figurative devices can confound existing QA models 10.1145/3375547.

  • Misleading. Hard riddles are meant to play a trick on readers with an apparently irreconcilable contradiction or incongruity robert1963definition. Aristotle commented on this characteristic of the riddle in the Poetics 2013poetics: “the very nature indeed of a riddle is this, to describe a fact in an impossible combination of words”. Commonsense reasoning and creative thinking schank1987natural are required in order to develop a valid explanation for the seemingly impossible description of the riddle object.

We believe that riddles, as a thought-provoking challenge for humans, can help sharpen the deeper language understanding and reasoning abilities of machines as well. On the one hand, the figurative property of riddles enables communication of far richer meanings than factual questions, so it is beneficial to introduce Figurative Language Processing (FLP) into the field of QA. On the other hand, the misleading nature of riddles puts obstacles in the way of commonsense reasoning.

The riddle is a universal art, existing in hundreds of different cultures taylor1977english. Moreover, many riddles and riddle objects (i.e., the objects that riddles describe) are internationally widespread aarne1918vergleichende. Given the recent progress in multilingual QA jing2019bipar; lewis-etal-2020-mlqa, we conduct a cross-lingual study of riddles in English and Chinese, two primary languages for which a large number of riddles are easily accessible online. We present BiRdQA (Bilingual Riddle Question Answering), a collection of 6614 English riddles and 8751 Chinese riddles. BiRdQA is a multiple-choice question answering dataset: we provide a correct answer and four distractors for every riddle. In this way, it is easy and fair to adopt accuracy as the evaluation metric to compare the performance of different QA models on our dataset. Our proposed method for automatic distractor generation allows for robust multiple-choice dataset creation at scale with minimal human effort and sufficient quality. To bring in more commonsense knowledge for reasoning, we also provide a brief introduction of each candidate, collected from its Wikipedia article. Note that BiRdQA is a non-parallel bilingual dataset, because it is extremely difficult and sometimes impossible to correctly translate riddles between languages. (For example, for the riddle “what do you call a fruit that is never alone? – pear”, the answer makes sense because “pear” sounds like “peer”, which means “never alone”. In Chinese or other languages, however, the corresponding words do not sound alike.) Figure 1 shows two selected examples of our dataset, one in English and the other in Chinese.

To assess the difficulty of BiRdQA, we conduct experiments on several baseline models. Results show a significant gap between machine and human performance, indicating that there is a lot of headroom for improvement. We also evaluate multilingual pretrained models in zero-shot cross-lingual transfer and multilingual training settings.

The contributions of our work are summarized as follows:

  • We construct a large challenge dataset, BiRdQA, in a non-trivial way. The dataset can be used to test question answering ability on tricky riddles, and it has been released to benefit the QA and NLP communities.

  • We conduct a thorough analysis of the BiRdQA dataset and benchmark the performance of pretraining-based QA models on it. Results show a significant gap between machine and human performance.

2 Dataset Collection

We cast riddle solving as a five-way multiple-choice task instead of an open-ended format, to allow efficient and fair evaluation of the knowledge proficiency of a model zellers2019recognition; qiu-etal-2020-automatic. Given a question with five candidates and their introductions, a model should select one as the correct answer. We create BiRdQA in three steps, carried out independently for English and Chinese riddles: collecting riddles, generating distractors, and collecting introductions for candidates. (From now on, we use the term “answer” for the correct solution to a riddle, “distractors” for the four wrong choices of a question, and “candidates” for the union of the “answer” and “distractors” of a riddle.) We dub the English part of our dataset “BiRdQA-en” and the Chinese part “BiRdQA-zh”. We aim to maintain symmetry between the two languages, with similar data formats and standards and comparable methods of dataset creation.

2.1 Riddle Collection

We crawl 16,000+ English riddle-answer pairs from several websites where uniquely crafted, high-quality riddles are in abundance, and 12,000+ Chinese riddles and their answers from a single website. Different from English riddles, every Chinese riddle contains a hint in parentheses (e.g., “打一物” (guess an object) in Figure 1), which we separate into an additional field. Details about raw data preprocessing are described in Appendix A.

2.2 Distractor Generation

It is quite challenging to obtain high-quality distractors at scale. Asking humans to write distractors for each correct answer is expensive and time-consuming gierl2017developing; zellers2019recognition, especially for a large-scale dataset. Moreover, it may suffer from spurious correlations, also called annotation artifacts: subtle patterns become prediction rules that work for the majority of examples but do not hold in general gururangan2018annotation; poliak2018hypothesis; tu2020empirical. Therefore, we require that a high-quality, automatically generated distractor should

  • be close enough to the correct answer goodrich1977distractor to make it more plausible and adversarial, so long as it does not become another valid answer.

  • be robust to candidate-only bias si2019does; yu2020counterfactual, disincentivizing models from leveraging shortcuts to directly predict the correct answer without reading the question. (While it is also important for distractors to be consistent with the semantic context of the question gao2019generating, our proposed method proves effective enough for distractor generation on this dataset, as demonstrated empirically in Section 5; for simplicity, we therefore do not take that criterion into account.)

Now we describe our distractor generation method in detail. Our goal is to generate four distractors for each correct answer. Two of them are designed to be similar to the correct answer in terms of the cosine similarity of word embeddings (if the correct answer of a riddle does not have a pretrained embedding, we simply remove the riddle); we refer to these as global distractors. The other two distractors are drawn from the correct answers of other questions zellers2019recognition, so as to overcome the issue of candidate-only bias; we refer to these as local distractors. For example, for the two riddles in Figure 1, “carriage”, “brake”, “弹棉花” (fluffing cotton), and “纺车” (spinning wheel) are local distractors, while the rest are global ones.

Global Distractor

For each English riddle, we use the correct answer as the query and retrieve the top-2 most similar words or phrases from the pretrained Sense2Vec Trask2015sense2vecA; pan-etal-2021-zero. For each Chinese riddle, we use the word embedding provided by song-etal-2018-directional.

Local Distractor

For each language, we build an answer pool containing all distinct correct answers of the riddles. For each riddle, we obtain the top-2 most similar correct answers from that pool. Similarity is again measured by the cosine similarity of word embeddings, and the choices of pretrained embeddings for English and Chinese are the same as for global distractors.
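The two retrieval steps above can be sketched as follows. This is a minimal illustration only: the toy embedding table stands in for the Sense2Vec / pretrained Chinese vectors used in the actual pipeline, and the words are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_similar(query, vocab, embeddings, k=2):
    """Rank `vocab` entries by cosine similarity to `query` (excluding it)."""
    q = embeddings[query]
    scored = [(w, cosine(q, embeddings[w])) for w in vocab if w != query]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [w for w, _ in scored[:k]]

# Toy embedding table standing in for real pretrained vectors.
emb = {
    "wheel":  [1.0, 0.1, 0.0],
    "tire":   [0.9, 0.2, 0.0],
    "axle":   [0.8, 0.3, 0.1],
    "cloud":  [0.0, 1.0, 0.2],
    "candle": [0.1, 0.0, 1.0],
}

# Global distractors: nearest neighbours over the whole embedding vocabulary.
global_d = top_k_similar("wheel", list(emb), emb, k=2)

# Local distractors: nearest neighbours restricted to other riddles' answers.
answer_pool = ["cloud", "candle", "axle"]
local_d = top_k_similar("wheel", answer_pool, emb, k=2)
```

In the real pipeline the vocabulary for global distractors is the full pretrained embedding vocabulary, while local distractors are restricted to the answer pool, which is what makes them resistant to candidate-only bias.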

Invalid Distractor Replacement

To avoid cases where a generated distractor becomes another correct answer or duplicates another distractor, we define rules to ensure that each distractor has little lexical overlap with the correct answer pan-etal-2021-zero and the other distractors of the same riddle. Specifically, two candidates are considered lexically overlapping if they share at least one word in English (e.g., “horse” and “wild horse”) or one character in Chinese (e.g., “蘑菇” (mushroom) and “平菇” (oyster mushroom)). We iteratively replace a problematic distractor with the next best (i.e., next most similar to the correct answer) generated one until it is valid. We generate local distractors before global ones because the larger search space of global distractors allows more alternatives for replacement. We provide a detailed quality analysis of the generated distractors in Section 3.
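The overlap check and iterative replacement can be sketched as below; the riddle answer and candidate list are invented for illustration, and the ranked list plays the role of the similarity-ordered output of the previous step.

```python
def overlaps(a, b, lang="en"):
    """Lexical overlap: a shared word (English) or a shared character (Chinese)."""
    if lang == "en":
        return bool(set(a.lower().split()) & set(b.lower().split()))
    return bool(set(a) & set(b))

def pick_valid(ranked, answer, taken, lang="en"):
    """Walk a similarity-ranked candidate list and return the first distractor
    that overlaps neither the correct answer nor an already chosen distractor."""
    for cand in ranked:
        if overlaps(cand, answer, lang):
            continue
        if any(overlaps(cand, t, lang) for t in taken):
            continue
        return cand
    return None

answer = "horse"
ranked = ["wild horse", "pony", "pony cart", "donkey"]  # most similar first
d1 = pick_valid(ranked, answer, [])      # skips "wild horse" (shares "horse")
d2 = pick_valid(ranked, answer, [d1])    # also skips "pony cart" (shares "pony")
```

Generating local distractors first, as the paper describes, means any replacement pressure falls on the global step, where the candidate pool is far larger.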

2.3 Candidate Introduction Collection

Wikipedia has been widely used as an important knowledge source for Question Answering yang2015wikiqa; chen-etal-2017-reading; liu-etal-2020-rikinet and Commonsense Reasoning storks2019commonsense; lv2020graph. We provide a brief Wikipedia-based introduction of each candidate to help models solve riddles with commonsense knowledge. For Chinese riddles we use Baidu Baike, the Chinese equivalent of Wikipedia, because it has richer Chinese entries. (We use the word “Wikipedia” to refer to both English Wikipedia and Baidu Baike hereafter.) We match each candidate to its corresponding Wikipedia entry and use the introductory part of the article (i.e., the content before the first section) as its introduction. A candidate is matched to a Wikipedia entry if its name is the same as the title of the entry or can be redirected to it (e.g., “stone” → “Rock (geology)”). To resolve disambiguation issues (e.g., “nail” may refer to “Nail (anatomy)”, “Nail (beak)”, etc.), we select the entry whose introduction has the highest similarity with the question, calculated with Sentence-BERT reimers-2019-sentence-bert. If the correct answer of a riddle does not have a Wikipedia entry, we simply remove the riddle. If a generated distractor does not have a linked Wikipedia entry, we iteratively replace it with the next best (i.e., next most similar to the correct answer) generated one until it has one. In this way, we ensure that all candidates in our dataset are accompanied by introductions. We evaluate the quality of the collected Wikipedia introductions in Section 3.
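The disambiguation step (pick the entry whose introduction best matches the question) can be sketched as follows. Token-set Jaccard overlap here is only a cheap stand-in for the Sentence-BERT similarity the paper actually uses, and the entry texts are abbreviated for the example.

```python
def jaccard(a, b):
    """Toy stand-in for Sentence-BERT similarity: token-set Jaccard overlap."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def disambiguate(question, entries):
    """Pick the Wikipedia entry whose introduction best matches the question."""
    return max(entries, key=lambda e: jaccard(question, e["intro"]))

question = "what grows at the end of your finger and must be trimmed"
entries = [
    {"title": "Nail (anatomy)", "intro": "a nail grows at the tip of the finger"},
    {"title": "Nail (fastener)", "intro": "a nail is a pin shaped object of metal"},
]
best = disambiguate(question, entries)
```

Swapping `jaccard` for embedding-based sentence similarity recovers the described method; the selection logic is unchanged.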

Overall, we collect 6614 English and 8751 Chinese riddles with their answers, distractors and introductions. We observe that multiple riddles may describe the same object in different ways and thus share an identical answer (Section 3). If such riddles appeared in both the training and the development/test sets, a machine might memorize the correct answer rather than reason it out zellers2019recognition; talmor-etal-2019-commonsenseqa. Therefore, when splitting the dataset into training/development/test sets, we ensure that the three sets have disjoint correct answers.
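An answer-disjoint split can be implemented by grouping riddles by answer and assigning whole groups to a single split; this is a sketch under assumed split ratios, not the released splitting code.

```python
import random
from collections import defaultdict

def answer_disjoint_split(riddles, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split riddles so that the train/dev/test answer sets are disjoint:
    group riddles by answer, then assign each whole group to one split."""
    groups = defaultdict(list)
    for r in riddles:
        groups[r["answer"]].append(r)
    answers = sorted(groups)
    random.Random(seed).shuffle(answers)
    cut1 = int(len(answers) * ratios[0])
    cut2 = cut1 + int(len(answers) * ratios[1])
    splits = {"train": [], "dev": [], "test": []}
    for i, ans in enumerate(answers):
        key = "train" if i < cut1 else ("dev" if i < cut2 else "test")
        splits[key].extend(groups[ans])
    return splits

# Tiny demo: three riddles share the answer "water"; they must land together.
riddles = [{"answer": a} for a in
           ["water", "water", "water", "clock", "clock",
            "egg", "sun", "moon", "fire", "tree"]]
splits = answer_disjoint_split(riddles)
```

Because groups are assigned atomically, no answer can leak across splits, which is exactly the memorization risk the paragraph above describes.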

3 Dataset Analysis

Measurement BiRdQA-en BiRdQA-zh
# Training examples 4093 5943
# Validation examples 1061 1042
# Test examples 1460 1766
# Total examples 6614 8751
# Avg question tokens 29.74 8.23
# Distinct question tokens 15043 18298
# Distinct hints N/A 690
# Distinct answers 3057 4750
# Distinct candidate tokens 6593 14944
# Avg introduction tokens 231.64 90.93
Table 1: Statistics of BiRdQA.

Table 1 describes the key statistics of BiRdQA. Tokens in Chinese refer to words segmented by the jieba toolkit. Generally speaking, English riddles and candidate introductions are much longer than Chinese ones, while Chinese riddles have more diverse tokens in questions and candidates. The hints in Chinese riddles do not vary a lot, because they only tell the category of the answer.

Answer # en # zh
water 45 0
clock 43 3
shadow 38 0
egg 33 0
fire 30 0
moon 30 0
sun 29 0
heart 26 0
tree 26 0
candle 25 10
Table 2: Frequencies of the 10 most common answers and their counterparts in English and Chinese riddles of BiRdQA. (The rows listing the most common Chinese answers did not survive extraction and are omitted here.)

Quality of Distractors

Although we have reduced lexical overlap between the correct answer and the distractors, we cannot completely rule out the case where an automatically generated distractor coincidentally becomes another correct answer for the riddle (i.e., multiple candidates are equally valid). For example, in a Chinese riddle, the distractor “润色先生” (inkstone) is just another name for the answer “砚” (inkstone). Similarly, in an English riddle, the distractor “acne scar” has the same meaning as the answer “zit”. However, we observe that such coincidences due to synonymy rarely happen: we randomly sampled 100 English and 100 Chinese riddles from the development set and observed only 2 such cases in each group. Therefore, we leave them as noise in the dataset. This shows that eliminating lexical overlap between generated distractors and the correct answer is an effective strategy for ensuring distractor quality.

Quality of Introductions

We manually check whether the Wikipedia introduction of a correct answer actually matches the answer itself, by inspecting 100 English and 100 Chinese random samples from the development set. (We do not check matching quality for distractor introductions, because an ambiguous Wikipedia entry for a distractor is still a wrong choice for the question.) We observe only 3 and 6 mismatches, respectively, mainly due to the lack of corresponding Wikipedia entries. As the Wikipedia introduction only serves as an auxiliary source of information in our dataset, we consider this quality acceptable.

Distribution of Answers

We notice that multiple riddles may describe the same object from different perspectives, within a language or even across languages, and thus share the same answer taylor1977english, though they are not duplicate riddles (we have already removed duplicates; see Appendix A). For example, the riddles “I’m often running yet I have no legs. You need me but I don’t need you. What am I?” and “When it rains, I never get any wetter. What am I?” both have “water” as their answer. Table 2 shows the frequencies of the most common answers in BiRdQA. We observe that common answers in English have higher frequencies than in Chinese, and English riddles have fewer distinct answers (Table 1). This indicates that Chinese answers are more diverse than English ones. It is also interesting that common answers for Chinese riddles are mostly living things such as animals, while non-living things are more popular in English riddles. The overlap of riddle objects between English and Chinese makes it easier for models to learn riddles in one language and transfer the knowledge to the other, which is further explored in the cross-lingual setting in Section 4.

Figure of Speech # en # zh
Hyperbole 3 3
Metaphor 6 12
Personification 18 11
Pun 11 16
Used 31 44
Table 3: Figures of Speech and their frequencies in 100 English and 100 Chinese sampled riddles.

Figures of Speech in Riddles

An important property of riddles is the usage of figures of speech (Section 1). We randomly sample 100 English and 100 Chinese riddles from the development set and manually identify the existence and types of figures of speech. As each example can be annotated with multiple figures of speech, the per-type frequencies in Table 3 do not sum to “Used”, which counts riddles using at least one type of figure of speech. Table 3 shows that a significant proportion of riddles in both English and Chinese leverage figurative devices to confuse readers about their true meanings.

4 Methods

We conduct experiments with multiple pretraining models on BiRdQA under monolingual, cross-lingual and multilingual settings jing2019bipar.

  • Monolingual. We use data in the same language for training and evaluating models (i.e., en → en, zh → zh).

  • Cross-lingual. We test performance in zero-shot cross-lingual transfer learning, where a multilingual pretrained model is fine-tuned on one source language and evaluated on a different target language (i.e., en → zh, zh → en).

  • Multilingual. We directly mix training instances of the two languages into a single training set and build a single QA model to handle bilingual riddles in BiRdQA (i.e., en+zh → en, en+zh → zh).
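The three data configurations above can be expressed as a small helper; the setting names and dictionary layout are illustrative, not from the released code.

```python
def build_setting(setting, en, zh):
    """Map an experimental setting to (training data, evaluation sets).
    `en` and `zh` are dicts with "train" and "test" example lists."""
    if setting == "mono-en":          # en -> en
        return en["train"], {"en": en["test"]}
    if setting == "mono-zh":          # zh -> zh
        return zh["train"], {"zh": zh["test"]}
    if setting == "xling-en2zh":      # zero-shot: fine-tune on en, test on zh
        return en["train"], {"zh": zh["test"]}
    if setting == "xling-zh2en":      # zero-shot: fine-tune on zh, test on en
        return zh["train"], {"en": en["test"]}
    if setting == "multi":            # en+zh -> en and en+zh -> zh
        return en["train"] + zh["train"], {"en": en["test"], "zh": zh["test"]}
    raise ValueError(f"unknown setting: {setting}")

en = {"train": ["e1", "e2"], "test": ["et"]}
zh = {"train": ["z1"], "test": ["zt"]}
train, evals = build_setting("multi", en, zh)
```

The cross-lingual settings require a multilingual encoder, since the model never sees target-language training data.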

We also conduct additional experiments to investigate the influence of adding introductions (“w/ introduction”) or hints (“w/ hint”), removing questions from inputs (“w/o question”), and transfer learning with CommonsenseQA talmor-etal-2019-commonsenseqa (“Train + CQA”, “Train = CQA”) in one or more settings, detailed in Section 5. (To clarify, we only use introductions and hints in these additional experiments.)

4.1 Baseline Models

Below we describe the specific English models and Chinese models for monolingual setting, as well as the multilingual models for cross-lingual and multilingual settings.

English Models

We test the performance of BERT devlin2019bert and its variants, RoBERTa liu2019roberta and ALBERT lan2019albert. Following the standard multiple-choice QA setup for pretrained language models devlin2019bert, we treat the question as segment A and each candidate as segment B, then linearize them into [CLS] A [SEP] B [SEP] for encoding. The [CLS] token embedding is used to produce a score for each candidate, and the candidate scores are passed through a softmax layer for the final prediction. We also experiment with UnifiedQA khashabi2020unifiedqa, the state-of-the-art model for many QA benchmarks. We feed the question and all five candidates together to UnifiedQA so that it can choose among the candidates instead of judging them independently as BERT does.
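The per-candidate scoring scheme for the BERT-style models can be sketched as follows. The lexical `scorer` stub is a placeholder for the fine-tuned [CLS] regression head, which is the only non-trivial part omitted, and the candidate list is taken from the Figure 1 example.

```python
import math

def linearize(question, candidate):
    """BERT-style multiple-choice input: [CLS] A [SEP] B [SEP]."""
    return f"[CLS] {question} [SEP] {candidate} [SEP]"

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict(question, candidates, scorer):
    """Score each (question, candidate) pair independently, softmax over the
    five scores, and return the argmax candidate with the probabilities."""
    probs = softmax([scorer(linearize(question, c)) for c in candidates])
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best], probs

def scorer(text):  # toy stand-in for the learned [CLS] scoring head
    return 2.0 if "wheel" in text else 0.0

pred, probs = predict(
    "As far as I run, I never meet my brother on my right. What am I?",
    ["carriage", "wheel", "brake", "windmill", "cart"],
    scorer,
)
```

UnifiedQA differs precisely in the `linearize` step: all five candidates are concatenated into one input, so the model can compare them directly instead of scoring each pair in isolation.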

Chinese Models

We experiment with popular Chinese pretrained models, including Chinese BERT devlin2019bert, BERT-wwm, RoBERTa-wwm cui2019pre and ERNIE zhang-etal-2019-ernie. BERT-wwm and RoBERTa-wwm are pretrained with Whole Word Masking (WWM). ERNIE incorporates knowledge graphs (KGs) to enhance language representation with external knowledge. We adopt a fine-tuning procedure similar to that of English BERT.

Multilingual Models

Multilingual pretrained language models are able to perform cross-lingual zero-shot learning on multilingual NLU tasks such as XNLI conneau-etal-2018-xnli and MLQA lewis-etal-2020-mlqa. We test the performance of multilingual BERT (mBERT) devlin2019bert and XLM-R conneau2019unsupervised on BiRdQA, adopting a fine-tuning procedure similar to that of English BERT.

Input Format

For the additional experiments on adding an introduction/hint, the B segment in [CLS] A [SEP] B [SEP] is the concatenation of the candidate and its introduction/hint, while A remains the question.

4.2 Evaluation Metric

We evaluate model performance with two metrics, accuracy and mean reciprocal rank (MRR), which show how well the model selects the answer and ranks the candidates, respectively. The model ranks candidates by their softmax prediction probabilities, except for UnifiedQA: because UnifiedQA is an end-to-end generative model, the MRR metric cannot be applied to it.
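Both metrics are straightforward to compute from ranked candidate lists; a minimal sketch with invented example data is below. Note that with five candidates, a uniform random ranking gives an expected MRR of (1 + 1/2 + 1/3 + 1/4 + 1/5)/5 ≈ 0.4567, which matches the “Random Guess” rows in Tables 4 and 5.

```python
def accuracy(rankings, golds):
    """Fraction of questions whose top-ranked candidate is the gold answer."""
    return sum(r[0] == g for r, g in zip(rankings, golds)) / len(golds)

def mrr(rankings, golds):
    """Mean reciprocal rank: average of 1 / (gold answer's 1-based rank)."""
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, golds)) / len(golds)

rankings = [
    ["wheel", "brake", "carriage"],   # gold ranked 1st
    ["sun", "moon", "shadow"],        # gold ranked 2nd
]
golds = ["wheel", "moon"]
acc = accuracy(rankings, golds)   # 1/2
score = mrr(rankings, golds)      # (1 + 1/2) / 2 = 0.75
```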

4.3 Experimental Setup

We use Huggingface implementations for all baseline models. Due to limited computational resources, we restrict the input length to 256 tokens for all models except UnifiedQA, for which we use 150. All hyperparameters are chosen by model performance on the development set. For the cross-lingual experiments, model selection is constrained to be strictly zero-shot, using only source-language development data to pick hyperparameters lewis-etal-2020-mlqa. To keep input formats symmetric between English and Chinese riddles, we do not use the hints in Chinese riddles by default, because English riddles have no hints; we investigate the effect of adding hints in additional experiments.

4.4 Human Performance Evaluation

We evaluate human performance on BiRdQA under monolingual setting. We employ three bilingual post-graduate students to independently answer 100 Chinese riddles and 100 English riddles randomly sampled from the test sets. They are instructed not to directly search for the answers of riddles online. We then calculate the average accuracy of the three human workers as the final human performance on our dataset.

5 Results

Tables 4 and 5 show the monolingual experiment results on BiRdQA-en and BiRdQA-zh, respectively. Table 6 describes the cross-lingual and multilingual results. We refer readers to Appendix B for more comprehensive results. Human accuracy is 81.33% for English and 87.67% for Chinese (because human performance is evaluated on a subset of 100 random samples from the test set, it cannot be strictly compared with model performance evaluated on the whole test set), which indicates that it is not difficult for human testees to distinguish the correct answer from the distractors, though humans may struggle to answer riddles “from scratch” (i.e., with no candidates provided). In contrast, all of the evaluated QA models perform much worse than humans, indicating that these methods are relatively weak at solving riddles and that there is ample room for improvement.

Model BiRdQA-en
Dev Test
Random Guess 20.00 / 0.4567 20.00 / 0.4567
BERT-Base 48.82 / 0.6834 41.92 / 0.6404
 w/ introduction 51.46 / 0.6949 44.38 / 0.6609
 w/o question 29.59 / 0.5519 21.44 / 0.4835
 Train + CQA 46.56 / 0.6771 44.86 / 0.6599
 Train = CQA 42.41 / 0.6463 38.29 / 0.6164
BERT-Large 46.65 / 0.6735 44.25 / 0.6519
RoBERTa-Large 47.31 / 0.6749 45.21 / 0.6653
ALBERT-XXL 63.52 / 0.7856 58.70 / 0.7590
 Train + CQA 67.11 / 0.8045 64.79 / 0.7978
UnifiedQA (T5-Large) 67.20 62.60
Human - 81.33
Table 4: Results (accuracy/MRR) of English models on the development and the test data of BiRdQA-en. CQA means CommonsenseQA talmor-etal-2019-commonsenseqa. Human performance is tested on a subset of 100 random samples.

English vs. Chinese

In the monolingual setting, the best performance on BiRdQA-en is 62.60%, by UnifiedQA (without additional data), higher than the best on BiRdQA-zh, which is 59.29% by ERNIE. However, English results are almost always worse than the corresponding Chinese results for the same type of model (e.g., BERT-Base vs. BERT-base-chinese). We observe similar trends in the multilingual setting. This does not necessarily mean that English riddles are generally harder than Chinese riddles, since there are fewer training samples for English riddles and English answers are less diverse than Chinese ones (Section 3). In other words, models tend to learn less knowledge from the English data. In the future, we plan to increase the diversity of English riddles and make the dataset more balanced between the two languages. We provide an error analysis in Appendix C.

Monolingual vs. Multilingual

In the monolingual training setting, a monolingual model (BERT) can outperform its multilingual counterpart (mBERT), an instance of the so-called curse of multilinguality conneau2019unsupervised. We also observe that multilingual training (e.g., en+zh → en) substantially improves performance compared with monolingual training (e.g., en → en), especially for English riddles. This shows that, on our dataset, multilingual models are able to transfer knowledge from a source language to improve performance on a different target language.

Comparing Different QA Models

In the monolingual English setting (Table 4), UnifiedQA performs best (without additional data), since it sees all the candidates in the input sequence at once rather than considering them independently as BERT does. Another possible reason is that UnifiedQA has been pretrained on other QA datasets, making it possible to transfer external knowledge and skills to BiRdQA. In the monolingual Chinese setting (Table 5), ERNIE surprisingly achieves the best accuracy, given that it has fewer parameters than RoBERTa-wwm-ext-large. The difference in the domains of the pretraining text may account for this: ERNIE is trained on data beyond Wikipedia, including text from online forums, which is more useful for casual text like riddles cui2019pre. We note that higher accuracy does not necessarily mean higher MRR, especially when the accuracies of two models are very close. For example, in Table 5, ERNIE beats RoBERTa on accuracy but falls behind on MRR, which shows that RoBERTa is better at ranking candidates. We also observe that larger models consistently perform better than smaller ones. In the cross-lingual setting, mBERT is the best model for zero-shot transfer, which differs from the case on MLQA lewis-etal-2020-mlqa. In the multilingual setting, XLM-R-Large outperforms the other models on both the English and Chinese test sets.

Model BiRdQA-zh
Dev Test
Random Guess 20.00 / 0.4567 20.00 / 0.4567
BERT-base-chinese 53.45 / 0.7099 55.10 / 0.7210
 w/ introduction 55.95 / 0.7244 54.08 / 0.7102
 w/o question 28.79 / 0.5413 22.48 / 0.4954
 w/ hint 54.32 / 0.7162 54.08 / 0.7139
BERT-wwm-ext 58.25 / 0.7421 57.02 / 0.7353
RoBERTa-wwm-ext-large 60.17 / 0.7563 58.04 / 0.7415
ERNIE 60.65 / 0.7550 59.29 / 0.7479
Human - 87.67
Table 5: Results (accuracy/MRR) of Chinese models on the development and the test data of BiRdQA-zh. Human performance is tested on a subset of 100 random samples.
Setting Cross-lingual Multilingual
Train en en zh zh en+zh en+zh
Test en zh zh en en zh
mBERT 36.30 / 0.6003 40.60 / 0.6188 50.17 / 0.6845 38.97 / 0.6144 40.14 / 0.6246 50.51 / 0.6906
 w/ introduction 42.40 / 0.6467 30.12 / 0.5521 47.85 / 0.6677 41.23 / 0.6359 43.08 / 0.6505 48.75 / 0.6771
XLM-R-Base 28.77 / 0.5408 31.20 / 0.5552 46.49 / 0.6646 25.21 / 0.5181 33.77 / 0.5783 48.41 / 0.6769
XLM-R-Large 39.73 / 0.6249 33.01 / 0.5651 57.25 / 0.7319 38.15 / 0.6099 43.97 / 0.6591 57.13 / 0.7335
Table 6: Results (accuracy/MRR) of multilingual models under the cross-lingual and multilingual settings of BiRdQA. We only mark in bold the best performances in the cross-lingual (en → zh, zh → en) and multilingual (en+zh → en/zh) experiments.

Influence of Additional Information

Wikipedia introductions and hints (Chinese riddles only) are auxiliary information in our dataset and may be useful to models. We evaluate the impact of appending the Wikipedia introduction to each candidate in the input. We find that adding the Wikipedia introduction (“w/ introduction” in Tables 4, 5 and 6) can boost performance on English riddles across the three experimental settings, but, unexpectedly, does not improve performance on Chinese ones. We conjecture that this is because solving Chinese riddles requires deeper information about the candidates, such as the meaning of each character (unlike English, each Chinese character has its own meaning, and the combination of character meanings does not necessarily equal the word meaning), rather than factual/literal descriptions from Wikipedia. We also observe that using hints in Chinese riddles (“w/ hint” in Table 5) does not lead to higher performance, indicating that they do not carry much useful information. For example, if all five candidates are animals, the hint “打一动物” (guess an animal) is useless. This is also correlated with the fact that the hint is a low-diversity feature in our dataset: Table 1 shows that there are only 690 distinct hints across all Chinese riddles.

Investigating Candidate-only Bias

We conduct ablation experiments to measure candidate-only bias by removing the question from the input (“w/o question” in Tables 4 and 5). We observe that models can only make near-random guesses (20%) without the question, demonstrating the effectiveness of local distractors (Section 2.2) in preventing models from learning spurious correlations between distractors and correct answers.
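The 20% reference point follows from having five candidates per riddle. A small illustrative sketch (not the authors' code) of the chance level and an empirical random-guess check:

```python
import random

def chance_accuracy(num_candidates=5):
    # With the question removed and unbiased distractors, a model has no
    # signal and should match random guessing among the candidates.
    return 1.0 / num_candidates

def random_guess_accuracy(gold_indices, num_candidates=5, seed=0):
    """Empirical accuracy of a uniform random guesser over a list of
    gold candidate indices."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(num_candidates) == g for g in gold_indices)
    return hits / len(gold_indices)
```

A candidate-only model that scores meaningfully above `chance_accuracy(5)` would indicate residual bias in the distractors.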

Transfer Learning with CommonsenseQA

BiRdQA-en is similar to CommonsenseQA talmor-etal-2019-commonsenseqa in that both require commonsense reasoning. We conduct transfer learning experiments to gauge the overlap in knowledge required by the two datasets. Appending the training set of CommonsenseQA to that of BiRdQA (“Train + CQA” in Table 4) improves the performance of BERT-Base from 41.92% to 44.86% and yields the highest performance so far (64.79%) with ALBERT-XXL. Moreover, fine-tuning a model on the CommonsenseQA data only (“Train = CQA” in Table 4) achieves a zero-shot transfer accuracy of 38.29% with BERT-Base on BiRdQA-en, indicating an overlap in the commonsense knowledge needed to solve problems in BiRdQA and CommonsenseQA.

6 Related Work

To the best of our knowledge, RiddleSense lin-etal-2021-riddlesense is the dataset most similar to ours. It contains 5.7k English riddles and also comes in a multiple-choice format. However, our work differs in the following ways:

  • BiRdQA contains bilingual riddles and thus can facilitate a cross-lingual study of riddles.

  • BiRdQA provides Wikipedia introduction for each candidate to promote future research on commonsense reasoning for riddles.

  • We generate distractors in an automated fashion with quality guarantees and less human effort. Our method works for English, Chinese, and other languages, as long as pretrained word embeddings are available.

  • The bias in BiRdQA-en is much weaker than in RiddleSense. A BERT-Base model achieves an accuracy of 42.21% on RiddleSense even without the question input, compared with only 21.44% on BiRdQA-en.

  • RiddleSense makes use of ConceptNet for distractor generation, which is potentially unfair to non-ConceptNet-based models and can limit future research directions (the leaderboard of CommonsenseQA, which also generates distractors with ConceptNet, does not accept submissions of models that make use of ConceptNet).
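The embedding-based distractor generation mentioned above can be illustrated with a toy nearest-neighbor sketch (toy vectors and function names are ours; the actual pipeline described in Section 2.2 is more involved):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_distractors(answer, embeddings, k=4):
    """Pick the k words whose pretrained embeddings are closest to the
    answer's embedding -- plausible but wrong candidates."""
    target = embeddings[answer]
    scored = [(w, cosine(embeddings[w], target))
              for w in embeddings if w != answer]
    scored.sort(key=lambda x: -x[1])
    return [w for w, _ in scored[:k]]
```

Because the method only needs a pretrained embedding table, it transfers to any language with available word vectors.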

We further provide a detailed comparison of BiRdQA-en and RiddleSense in Appendix D.

tan2016solving studied the solving and generation of Chinese character riddles, which pose a challenge related to the structure of Chinese characters rather than the meaning of the answer. BiRdQA-zh explicitly excludes this type of Chinese riddle (Appendix A).

There are also commonsense-based QA datasets such as COSMOS QA huang2019cosmos and CommonsenseQA talmor-etal-2019-commonsenseqa. However, solving riddles requires more than commonsense knowledge. While normal commonsense questions guide readers toward the right answer, riddles mislead them on purpose. Moreover, a machine needs to understand the figurative and ambiguous nature of riddle language. Riddles are therefore a great benchmark for higher-order understanding of natural language and broader intelligence.

Multilingual QA has drawn growing attention in recent years, with the release of BiPaR jing2019bipar, XQuAD Artetxe:etal:2019, MLQA lewis-etal-2020-mlqa, and other cross-lingual QA datasets. While translating an English dataset is a common way to create a multilingual dataset, we build BiRdQA independently yet symmetrically in English and Chinese, due to the difficulty of translating riddles.

7 Conclusion and Future Work

In this paper, we introduce BiRdQA, a large-scale, bilingual multiple-choice question answering dataset to facilitate the development of QA systems capable of solving tricky riddles. The large gap between human and machine performance leaves much room for improvement. In future work, we plan to extend BiRdQA with riddles in other languages and to incorporate figurative language understanding into riddle solving. We hope that BiRdQA will spur more research on question answering over riddles.


Appendix A Data Preprocessing

English Riddle

We take the following steps to clean the raw data of English riddles.

  1. We remove prefixes such as “I am …”, “A …”, and “The …” from the answer to keep only the riddle object. For example, the answer “I am money” becomes “money”.

  2. We remove riddles with more than three words in their answers, since longer answers usually present scenarios too complicated for machines. For instance, we remove the riddle: “A cowboy rides into town on Friday, stays for three days, then leaves on Friday. How did he do it? — His horse’s name was Friday.” In other words, we focus on riddles asking “what” questions instead of “how” or “why” questions.

  3. We remove alphabet riddles whose answer is a single English letter.

  4. We manually remove math riddles, since the ability of mathematical reasoning ling2017program; pikekos2021measuring is beyond the scope of this study. An example of a math riddle is “Eggs are $0.12 a dozen. How many eggs can you get for a dollar? — 100 eggs.”

  5. We remove duplicate riddles collected from different sources. Two riddles are considered the same if they have the same answer and the cosine similarity between their questions exceeds a certain threshold (we set the threshold to 0.8 based on manual inspection of the data). We calculate sentence similarity with Sentence-BERT reimers2019sentence.
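The deduplication step above can be sketched as follows. For self-containedness we use a bag-of-words cosine as a stand-in similarity function; the paper uses Sentence-BERT sentence embeddings with the same 0.8 threshold:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Stand-in similarity; the paper uses Sentence-BERT embeddings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(riddles, threshold=0.8, sim=bow_cosine):
    """riddles: list of (question, answer) pairs. Two riddles are
    duplicates if their answers match and their question similarity
    exceeds the threshold; the first occurrence is kept."""
    kept = []
    for q, a in riddles:
        if not any(a == ka and sim(q, kq) > threshold for kq, ka in kept):
            kept.append((q, a))
    return kept
```

Requiring both an answer match and high question similarity avoids discarding distinct riddles that happen to share an answer.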

Chinese Riddle

We exclude Chinese character riddles (they are manually separated from other riddles on the source website), which have single Chinese characters as solutions and describe the structures of those characters tan2016solving. Instead, we focus on common Chinese riddles, which imply the meaning of the answer. We also observe that the answers of Chinese riddles are already neat and concise, requiring no extra cleaning. Since the Chinese riddles are collected from a single source, we do not need to remove duplicates either.

Appendix B More Experimental Results

We show more results on BiRdQA-en in Table 7 and on BiRdQA-zh in Table 8. Due to limited computational resources, we could not run experiments on ALBERT-XXL when the input is lengthened with introductions. For UnifiedQA, the concatenation of five candidates and their introductions is often too long to be fed into the model. We observe that adding introductions is beneficial for English riddle solving but hurts performance on Chinese riddles. Fine-tuning on CommonsenseQA talmor-etal-2019-commonsenseqa greatly improves performance on BiRdQA-en. Adding hints does not help prediction on BiRdQA-zh.

Model | BiRdQA-en Dev | BiRdQA-en Test | w/ introduction Dev | w/ introduction Test | Train = CQA Dev | Train = CQA Test | Train + CQA Dev | Train + CQA Test
BERT-Base | 48.82 / 0.6834 | 41.92 / 0.6404 | 51.46 / 0.6949 | 44.38 / 0.6609 | 42.41 / 0.6463 | 38.29 / 0.6164 | 46.56 / 0.6771 | 44.86 / 0.6599
BERT-Large | 46.65 / 0.6735 | 44.25 / 0.6519 | 49.58 / 0.6868 | 46.30 / 0.6728 | 41.28 / 0.6370 | 39.04 / 0.6260 | 49.76 / 0.6987 | 50.41 / 0.6993
RoBERTa-Large | 47.31 / 0.6749 | 45.21 / 0.6653 | 45.90 / 0.6623 | 42.19 / 0.6480 | 47.03 / 0.6715 | 47.26 / 0.6751 | 55.98 / 0.7384 | 54.79 / 0.7302
ALBERT-XXL | 63.52 / 0.7856 | 58.70 / 0.7590 | - | - | 55.42 / 0.7305 | 54.66 / 0.7277 | 67.11 / 0.8045 | 64.79 / 0.7978
UnifiedQA (T5-Large) | 67.20 | 62.60 | - | - | 48.92 | 48.42 | 65.22 | 63.22
Table 7: Results (accuracy/MRR) of English models on the development and test data of BiRdQA-en.
Model | BiRdQA-zh Dev | BiRdQA-zh Test | w/ introduction Dev | w/ introduction Test | w/ hint Dev | w/ hint Test
BERT | 53.45 / 0.7099 | 55.10 / 0.7210 | 55.95 / 0.7244 | 54.08 / 0.7102 | 53.26 / 0.7114 | 50.62 / 0.6954
BERT-wwm-ext | 58.25 / 0.7421 | 57.02 / 0.7353 | 57.77 / 0.7380 | 55.78 / 0.7213 | 56.33 / 0.7328 | 56.34 / 0.7276
RoBERTa-wwm-ext-large | 60.17 / 0.7563 | 58.04 / 0.7415 | 53.26 / 0.7073 | 51.19 / 0.6983 | 58.16 / 0.7456 | 58.15 / 0.7406
ERNIE | 60.65 / 0.7550 | 59.29 / 0.7479 | 62.48 / 0.7674 | 56.23 / 0.7285 | 57.01 / 0.7362 | 57.53 / 0.7418
Table 8: Results (accuracy/MRR) of Chinese models on the development and test data of BiRdQA-zh.

Appendix C Error Analysis

Table 9 shows two error cases each from the English and Chinese parts of our dataset. The lower the prediction probability, the less confident the model is in its prediction. The first English riddle uses personification to present a seemingly impossible scenario (“doesn’t sleep”, “never eats”) regarding the river bed and river mouth. The distractor “ocean” is invalid because an ocean does not have a “mouth”, though it does have a “bed”, the so-called seafloor. The second English riddle requires creative thinking schank1987natural to come up with the rationale that when a person says the word “silence”, the silence is broken. The third example is a Chinese riddle that uses several metaphors (e.g., “吃肉”, “eat meat”) to describe the kitchen knife; the “but never” clue also marks the misleading nature of riddles. The fourth example is very difficult for machines because it describes the meaning of each character in the answer word rather than the meaning of the whole word: “云长” → “关” (“云长”, Yunchang, is the courtesy name of Guan Yu, “关羽”), “想起” → “怀” (both mean “recall”), “玄德” → “备” (“玄德”, Xuande, is the courtesy name of Liu Bei, “刘备”), and “来” → “至” (“来” serves as a complement in this riddle but shares the meaning “come” with “至”). This technique is uncommon in English riddles. Note that this is still not a Chinese character riddle, because it does not concern the structure of a single Chinese character in the answer. Overall, the figurative and misleading properties of riddles present a novel challenge for existing QA models.

Riddle: What has a bed but doesn’t sleep and a mouth but never eats?
Candidate: ocean (✗) | pond | river (✓) | equator | lake
Prediction probability: 0.9125 | 0.0586 | 0.0167 | 0.0064 | 0.0058

Riddle: What disappears the moment you say its name?
Candidate: whisper (✗) | darkness | noise | silence (✓) | annihilation
Prediction probability: 0.4283 | 0.2516 | 0.1583 | 0.1328 | 0.0291

Riddle: 薄薄一张口,能啃硬骨头。吃肉不喝汤,吃瓜不嚼豆。 (It has a thin mouth and can gnaw bones. It eats meat but doesn’t drink soup, eats melons but never munches beans.)
Candidate: 砧板 cutting board (✗) | 菜刀 kitchen knife (✓) | 锅铲 turner | 木棍 wooden stick | 斧头 axe
Prediction probability: 0.3568 | 0.2489 | 0.1856 | 0.1622 | 0.0465

Riddle: 云长想起玄德来 (Yunchang recalled Xuande.)
Candidate: 和蔼可亲 be affable (✗) | 关怀备至 care for sb. (✓) | 以礼相待 be polite | 推心置腹 confide in sb. | 体贴入微 be considerate
Prediction probability: 0.2747 | 0.2210 | 0.2028 | 0.1613 | 0.1402

Table 9: Error cases during testing by ALBERT-XXL on BiRdQA-en (the upper two) and ERNIE on BiRdQA-zh (the lower two). The cross (✗) indicates the candidate chosen by the model, while the checkmark (✓) indicates the correct answer. Candidates of each riddle are sorted by their prediction probabilities.

Appendix D Comparison with RiddleSense

Table 10 compares key statistics of BiRdQA-en and RiddleSense. BiRdQA-en contains more examples and longer, more diverse questions than RiddleSense, while RiddleSense is more varied in terms of answers and candidates.

Measurement BiRdQA-en RiddleSense
# Total examples 6614 5715
# Training examples 4093 3510
# Validation examples 1061 1021
# Test examples 1460 1184
# Avg question tokens 29.74 24.04
Long questions (≥ 20 tokens) 55.00% 47.30%
# Distinct question tokens 15043 7110
# Distinct answers 3057 3622
# Distinct candidate tokens 6593 9912
Table 10: Comparison of key statistics between BiRdQA-en and RiddleSense.

Table 11 shows experimental results on BiRdQA-en and RiddleSense. Our dataset is of comparable difficulty to RiddleSense. This is non-trivial, since we generate distractors automatically while the distractors in RiddleSense are written by humans.

Model | BiRdQA-en Dev | BiRdQA-en Test | RiddleSense Dev | RiddleSense Test
BERT-Base | 48.82 / 0.6834 | 41.92 / 0.6404 | 54.16 | 42.43
BERT-Large | 46.65 / 0.6735 | 44.25 / 0.6519 | 55.24 | 45.09
RoBERTa-Large | 47.31 / 0.6749 | 45.21 / 0.6653 | 60.72 | 52.58
ALBERT-XXL | 63.52 / 0.7856 | 58.70 / 0.7590 | 66.99 | 60.65
UnifiedQA (T5-Large) | 67.20 | 62.60 | 56.21 | 56.40
Table 11: Comparison of results (accuracy/MRR) between BiRdQA-en and RiddleSense. We show the reported dev/test accuracies on RiddleSense lin-etal-2021-riddlesense for reference.