Question answering (QA) addresses the task of returning a precise and concise answer to a natural language question posed by the user. QA has received a great deal of attention in both academia and industry. Two main directions within QA are Open-Domain Question Answering (ODQA) and Knowledge Base Question Answering (KBQA). ODQA searches for the answer in a large collection of text documents; the process is often divided into two stages: 1) retrieval of potentially relevant paragraphs and 2) spotting an answer span within the paragraph (referred to as machine reading comprehension, MRC). In contrast, KBQA uses a knowledge base as a source of answers. A knowledge base is a large collection of factual knowledge, commonly structured in subject–predicate–object (SPO) triples, for example (Vladimir_Nabokov, spouse, Véra_Nabokov).
A potential benefit of KBQA is that it uses knowledge in a distilled and structured form that enables reasoning over facts. In addition, knowledge base structure is inherently language-independent (entities and predicates are assigned unique identifiers that are tied to specific languages through labels and descriptions), which makes KBs more suitable for multilingual QA. The task of KBQA can be formulated as a translation of a natural language question into a formal KB query (expressed in SPARQL, SQL, or λ-calculus). Many real-life applications, such as the Jeopardy!-winning IBM Watson [DeepQA] and major search engines, employ hybrid QA systems that rely on both text document collections and structured knowledge bases.
High-quality annotated data is crucial for measurable progress in question answering. Since the advent of SQuAD [squad], a wide variety of datasets for machine reading comprehension have emerged; see a recent survey [zhang2019machine]. We are witnessing a growing interest in multilingual question answering, which has led to the creation of multilingual MRC datasets [mlqa, artetxe2019cross, tydiqa]. Multilingual KBQA has also received considerable attention in the literature [hakimov2017amuse, diefenbach2018question]. However, almost all available KBQA datasets are in English, Chinese datasets being an exception. Existing multilingual QALD datasets are rather small.
In this paper we present RuBQ (pronounced [‘rubik]) – Russian Knowledge Base Questions, a KBQA dataset that consists of 1,500 Russian questions of varying complexity along with their English machine translations, corresponding SPARQL queries, answers, as well as a subset of Wikidata covering entities with Russian labels. To the best of our knowledge, this is the first Russian KBQA and semantic parsing dataset. To construct the dataset, we started with a large collection of trivia Q&A pairs harvested on the Web. We built a dedicated recall-oriented Wikidata entity linking tool and verified the obtained answers’ candidate entities via crowdsourcing. Then, we generated paths between possible question entities and answer entities and carefully verified them.
The freely available dataset is of interest to a wide community of Semantic Web, natural language processing (NLP), and information retrieval (IR) researchers and practitioners who deal with multilingual question answering. The proposed dataset generation pipeline proved to be efficient and can be employed in other data annotation projects.
2 Related work
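| Dataset | Year | #Questions | KB | Complex questions | Logical forms | Question source | Language |
|---|---|---|---|---|---|---|---|
| ComplexQuestions [bao-etal-2016-constraint] | 2016 | 2,100 | Fb | + | – | L, SE | en |
| 30M Factoid QA Corpus [30Mfactoids] | 2017 | 30M | Fb | – | t | FA | en |
| LC-QuAD [lc_quad1] | 2017 | 5,000 | DBp | + | S | T+PP | en |
| LC-QuAD 2.0 [lc-quad20] | 2019 | 30,000 | DBp, Wd | + | S | T+PP | en |
| RuBQ (this work) | 2020 | 1,500 | Wd | + | S | QZ | ru |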
Table 1 summarizes the characteristics of KBQA datasets that have been developed to date. These datasets vary in size, underlying knowledge base, presence of questions’ logical forms and their formalism, question types and sources, as well as the language of the questions.
The questions of the earliest Free917 dataset [free917] were generated by two people without consulting a knowledge base; the only requirement was a diversity of question topics. Each question is provided with its logical form to query Freebase. Berant et al. [webquestions] created the WebQuestions dataset, which is significantly larger but does not contain questions’ logical forms. Questions were collected through the Google Suggest API: the authors fed parts of the initial question to the API and repeated the process with the returned questions until 1M questions were reached. After that, 100K randomly sampled questions were presented to MTurk workers, whose task was to find an answer entity in Freebase. Later studies have shown that only two-thirds of the questions in the dataset are completely correct; many questions are ungrammatical and ill-formed [yih2016value, wu2020perq]. Yih et al. [yih2016value] enriched 81.5% of WebQuestions with SPARQL queries and demonstrated that semantic parses substantially improve the quality of KBQA. They also showed that semantic parses can be obtained at an acceptable cost when the task is broken down into smaller steps and facilitated by a handy interface. Annotation was performed by five people familiar with Freebase design, which suggests that the task is still too hard for crowdsourcing. WebQuestions was used in further studies aimed at generating complex questions [bao-etal-2016-constraint, talmor2018web].
SimpleQuestions [simpleq] is the largest manually created KBQA dataset to date. Instead of providing logical parses for existing questions, the approach explores the opposite direction: based on a formal representation, a natural language question is generated by crowd workers. First, the authors sampled SPO triples from a Freebase subset, favoring non-frequent subject–predicate pairs. Then, the triples were presented to crowd workers, whose task was to generate a question about the subject, with the object being the answer. This approach doesn’t guarantee that the answer is unique: Wu et al. [wu2020perq] estimate that SOTA results on the dataset (about 80% correct answers) reach its upper bound, since the rest of the questions are ambiguous and cannot be answered precisely. The dataset was used for the fully automatic generation of a large collection of natural language questions from Freebase triples with neural machine translation methods [30Mfactoids]. Diefenbach et al. [simpleQ2wikidata] succeeded in a semi-automatic matching of about one-fifth of the dataset to Wikidata.
The approach behind FreebaseQA dataset [freebaseqa] is the closest to our study – it builds upon a large collection of trivia questions and answers (borrowed largely from TriviaQA dataset for reading comprehension [triviaqa]). Starting with about 130K Q&A pairs, the authors run NER over questions and answers, match extracted entities against Freebase, and generate paths between entities. Then, human annotators verify automatically generated paths, which resulted in about 28K items marked relevant. Manual probing reveals that many questions’ formal representations in the dataset are not quite precise. For example, the question eval-25: Who captained the Nautilus in 20,000 Leagues Under The Sea? is matched with the relation book.book.characters that doesn’t represent its meaning and leads to multiple answers along with a correct one (Captain Nemo). Our approach differs from the above in several aspects. We implement a recall-oriented IR-based entity linking since many questions involve general concepts that cannot be recognized by off-the-shelf NER tools. After that, we verify answer entities via crowdsourcing. Finally, we perform careful in-house verification of automatically generated paths between question and answer entities in KB. We can conclude that our pipeline leads to a more accurate representation of questions’ semantics.
The questions in the KBQA datasets can be simple, i.e. corresponding to a single fact in the knowledge base, or complex. Complex questions require a combination of multiple facts to answer them. WebQuestions consists of 85% simple questions; SimpleQuestions and 30M factoid QA Corpus contain only simple questions. Many studies [lc-quad20, bao-etal-2016-constraint, duan2019overview, talmor2018web, saha2018complex, CompositionGoogle] purposefully target complex questions.
The majority of datasets use Freebase [freebase] as the target knowledge base. Freebase was discontinued and exported to Wikidata [freebase2wikidata]; the latest available Freebase dump dates back to early 2016. Three collections [QALD9, lc_quad1, lc-quad20] use DBpedia [dbpedia]. Newer datasets [freebase2wikidata, saha2018complex, lc-quad20] use Wikidata [wikidata], which is much larger, up-to-date, and has more multilingual labels and descriptions. The majority of datasets where natural language questions are paired with logical forms employ SPARQL as a more practical and immediate option compared to lambda calculus.
Existing KBQA datasets are almost exclusively English, with the Chinese MSParS dataset being an exception [duan2019overview]. QALD-9 [QALD9], the latest edition of the QALD shared task (see [usbeck2019benchmarking] for an overview of previous QALD datasets), contains questions in 11 languages: English, German, Russian, Hindi, Portuguese, Persian, French, Romanian, Spanish, Dutch, and Italian. The dataset is rather small; at least the Russian questions appear to be ungrammatical machine translations.
There are several studies on knowledge base question generation [30Mfactoids, elsahar2018zero, indurthi2017generating, CompositionGoogle]. These works vary in the amount and form of supervision, as well as the structure and the complexity of the generated questions. However, automatically generated questions are intended primarily for training; the need for high-quality, human-annotated data for testing still persists.
3 Dataset Creation
Following previous studies [freebaseqa, triviaqa], we opted for quiz questions that can be found in abundance online along with the answers. These questions are well-formed and diverse in terms of properties and entities, difficulty, and vocabulary, although we don’t control these properties directly during data processing and annotation.
The dataset generation pipeline consists of the following steps: 1) data gathering and cleaning; 2) entity linking in answers and questions; 3) verification of answer entities by crowd workers; 4) generation of paths between answer entities and question candidate entities; 5) in-house verification/editing of generated paths. In parallel, we created a Wikidata sample containing all entities with Russian labels. This snapshot mitigates the problem of Wikidata’s dynamics – a reference answer may change with time as the knowledge base evolves. In addition, the smaller dataset lowers the threshold for KBQA experiments. In what follows we elaborate on these steps.
3.1 Raw Data
We mined about 150,000 Q&A pairs from several open Russian quiz collections on the Web (http://baza-otvetov.ru, http://viquiz.ru, and others). We found out that many items in the collection aren’t actual factoid questions, for example, cloze quizzes (Leonid Zhabotinsky was a champion of Olympic games in … [Tokyo]; here and below, English examples are translations of the original Russian questions and answers), crossword, definition, and multi-choice questions, as well as puzzles (Q: There are a green one, a blue one, a red one and an east one in the white one. What is this sentence about? A: The White House). We compiled a list of Russian question words and phrases and automatically removed questions that don’t contain any of them. We also removed duplicates and crossword questions mentioning the number of letters in the expected answer. This resulted in 14,435 Q&A pairs.
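The cleaning step described above can be illustrated with a minimal sketch. It is not the authors' code: the question-word list is a hypothetical English stand-in for the Russian list, and the letter-count pattern is an assumption about what a crossword hint looks like.

```python
import re

# Hypothetical English stand-ins; the actual Russian list is not given in the paper.
QUESTION_WORDS = ("who", "what", "which", "where", "when", "how many", "in what year")
# Assumed shape of a crossword-style hint such as "(5 letters)".
LETTER_HINT = re.compile(r"\b\d+\s+letters?\b", re.IGNORECASE)


def is_factoid_candidate(question: str) -> bool:
    """Keep a question only if it has a question word and no letter-count hint."""
    text = question.lower()
    has_question_word = any(
        re.search(rf"\b{re.escape(word)}\b", text) for word in QUESTION_WORDS
    )
    return has_question_word and not LETTER_HINT.search(text)


def clean_pairs(pairs):
    """Filter (question, answer) pairs and drop repeated questions, keeping the first."""
    seen = set()
    kept = []
    for question, answer in pairs:
        key = " ".join(question.lower().split())  # case- and whitespace-insensitive key
        if key in seen or not is_factoid_candidate(question):
            continue
        seen.add(key)
        kept.append((question, answer))
    return kept
```

Note that a keyword filter of this kind still lets through puzzles phrased as questions, so it only narrows the pool for the later annotation steps.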
3.2 Entity Linking in Answers and Questions
We implemented an IR-based approach for generating Wikidata entity candidates mentioned in answers and questions. First, we collected all Wikidata entities with Russian labels and aliases. We filtered out Wikimedia disambiguation pages, dictionary and encyclopedic entries, Wikimedia categories, Wikinews articles, and Wikimedia list articles. We also removed uninformative entities with fewer than four outgoing relations. These steps resulted in 4,114,595 unique entities with 5,430,657 different labels and aliases.
After removing punctuation, we indexed the collection with Elasticsearch using built-in tokenization and stemming. Each text string (question or answer) produces three types of queries to the Elasticsearch index: 1) all token trigrams; 2) capitalized bigrams (many named entities follow this pattern, e.g. Alexander Pushkin, Black Sea); and 3) free text query containing only nouns, adjectives, and numerals from the original string. N-gram queries (types 1 and 2) are run as phrase queries, whereas recall-oriented free text queries (type 3) are executed as Elasticsearch fuzzy search queries. Results of the latter search are re-ranked using a combination of BM25 scores from Elasticsearch and page view statistics of corresponding Wikipedia articles. Finally, we combine search results preserving the type order and retain Top-10 results for further processing. The proposed approach effectively combines precision- (types 1 and 2) and recall-oriented (type 3) processing.
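A minimal sketch of the query generation and merging logic may help make the three query types concrete. The function names are hypothetical, and the Elasticsearch calls, the part-of-speech filtering for type 3, and the BM25/page-view re-ranking are deliberately left out; the sketch only assumes that each query type returns a ranked list of entity IDs.

```python
import re


def token_trigrams(text):
    """Type 1: all consecutive token trigrams of the string."""
    tokens = re.findall(r"\w+", text)
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def capitalized_bigrams(text):
    """Type 2: adjacent token pairs in which both tokens start with a capital letter."""
    tokens = re.findall(r"\w+", text)
    return [
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first[0].isupper() and second[0].isupper()
    ]


def merge_results(trigram_hits, bigram_hits, free_text_hits, top_k=10):
    """Concatenate ranked hit lists in type order, drop repeats, keep the top_k."""
    merged = []
    for hits in (trigram_hits, bigram_hits, free_text_hits):
        for entity_id in hits:
            if entity_id not in merged:
                merged.append(entity_id)
    return merged[:top_k]
```

Keeping the type order in `merge_results` means that the precise phrase matches always outrank the fuzzy free-text matches, which is one plausible reading of "preserving the type order".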
3.3 Crowdsourcing Annotations
Entity candidates for answers obtained through the entity linking described above were verified on the Yandex.Toloka crowdsourcing platform (https://toloka.yandex.com/). Crowd workers were presented with a Q&A pair and a ranked list of candidate entities. In addition, they could consult a Wikipedia page corresponding to the Wikidata item, see Figure 1. The task was to select a single entity from the list or the None of the above option. The average number of candidates on the list is 5.43.
Crowd workers were provided with a detailed description of the interface and a variety of examples. To proceed to the main task, crowd workers had to first pass a qualification consisting of 20 tasks covering various cases described in the instruction. We also included 10% of honeypot tasks for live quality monitoring. These results are in turn used for calculating confidence of the annotations obtained so far as a weighted majority vote (see details in [ipeirotis2014repeated]). Confidence value governs overlap in annotations: if the confidence is below 0.85, the task is assigned to the next crowd worker. We hired Toloka workers from the best 30% cohort according to internal rating. As a result, the average confidence for the annotation is 98.58%; the average overlap is 2.34; average time to complete a task is 19 seconds.
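The confidence-driven overlap can be sketched as follows. This is a simplified illustration, not the aggregation procedure of [ipeirotis2014repeated]: confidence is taken to be the weight share of the leading label, worker weights are assumed to be positive numbers (for example, derived from honeypot accuracy), and the minimum overlap of two votes is a hypothetical parameter. Only the 0.85 threshold comes from the text.

```python
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.85  # threshold stated in the text
MIN_OVERLAP = 2              # hypothetical: minimum number of votes before stopping


def vote_confidence(votes):
    """votes: list of (label, worker_weight). Returns (leading label, its weight share)."""
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    top_label = max(totals, key=totals.get)
    return top_label, totals[top_label] / sum(totals.values())


def needs_another_worker(votes):
    """Decide whether the task should be assigned to one more crowd worker."""
    if len(votes) < MIN_OVERLAP:
        return True
    _, confidence = vote_confidence(votes)
    return confidence < CONFIDENCE_THRESHOLD
```

Under these assumptions, two agreeing workers close a task immediately, while any disagreement triggers further assignments until the leading label dominates.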
In total, 9,655 out of 14,435 answers were linked to Wikidata entities. Among the matched entities, the average rank of the correct candidate appeared to be 1.5. The combination of automatic candidate generation and subsequent crowdsourced verification proved to be very efficient. A possible downside of the approach is a lower share of literals (dates and numerical values) in the annotated answers. We could match only a fraction of those answers with Wikidata: Wikidata’s standard formatted literals may look completely different even if representing the same value. Out of 1,255 date and numerical answers, 683 were linked to a Wikidata entity such as a particular year. For instance, the answer for In what year was Immanuel Kant born? matches Q6926 (year 1724), whereas the corresponding Wikidata value is "1724-04-22"^^xsd:dateTime. Although the linkage is deemed correct, this barely helps generate a correct path between question and answer entities.
3.4 Path Generation and In-house Annotation
We applied the entity linking described above to the 9,655 questions with verified answers and obtained 8.56 candidate entities per question on average. Next, we generated candidate subgraphs spanning question and answer entities, restricting the distance between them to two hops. We examined the questions in the sample and found out that longer distances between question and answer entities are very rare.
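The search for short paths between question and answer entities can be illustrated on an in-memory list of triples. This is a sketch under assumptions: the paper queries a Wikidata endpoint rather than a local index, and whether edges were followed in both directions is not stated, so the bidirectional traversal and all function names here are illustrative choices.

```python
from collections import defaultdict


def build_index(triples):
    """Index (subject, predicate, object) triples by node, recording edge direction."""
    neighbours = defaultdict(list)
    for subject, predicate, obj in triples:
        neighbours[subject].append((predicate, "forward", obj))
        neighbours[obj].append((predicate, "reverse", subject))
    return neighbours


def find_paths(neighbours, question_entities, answer_entity, max_hops=2):
    """Return all paths of at most max_hops edges from a question entity to the answer."""
    paths = []
    for start in question_entities:
        frontier = [(start, [])]
        for _ in range(max_hops):
            next_frontier = []
            for node, steps in frontier:
                for predicate, direction, target in neighbours.get(node, []):
                    new_steps = steps + [(node, predicate, direction, target)]
                    if target == answer_entity:
                        paths.append(new_steps)  # candidate path for manual verification
                    else:
                        next_frontier.append((target, new_steps))
            frontier = next_frontier
    return paths
```

Each returned path is only a candidate: a path may connect the two entities without expressing the meaning of the question, which is why a verification step is still needed.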
We investigated the option of filtering out erroneous question entities using crowdsourcing analogous to answer entity verification. A pilot experiment on a small sample of questions showed that this task is much harder – we got only 64% correct matches on a test set. Although the average number of generated paths decreased (from 1.9 to 0.9 and from 6.2 to 3.5 for paths of length one and two, respectively), it also led to losing correct paths for 14% of questions. Thus, we decided to perform an in-house verification of the generated paths. The work was performed by the authors of the paper.
After sending queries to the Wikidata endpoint, we were able to find chains of length one or two for 3,194 questions; the remaining 6,461 questions were left unmatched. We manually inspected 200 random unmatched questions and found out that only 10 of them could possibly be answered with Wikidata, but the required facts are missing in the KB.
Out of 2,809 1-hop candidates corresponding to 1,799 questions, 866 were annotated as correct. For the remaining 2,328 questions, we verified 3,591 2-hop candidates, but only 55 of them were deemed correct. 279 questions were marked as answerable with Wikidata. To increase the share of complex questions in the dataset, we manually constructed SPARQL queries for them.
Finally, we added 300 questions marked as non-answerable over Wikidata, although their answers are present in the knowledge base. These adversarial examples are akin to unanswerable questions in the second edition of SQuAD dataset [squad20]. The majority of these questions are unanswerable because required predicates are missing in Wikidata, e.g. How many bells does the tower of Pisa have? (7). In some cases, although both question and answer entities are present, the relation between them is missing, e.g. What circus was founded by Albert Vilgelmovich Salamonsky in 1880? (Moscow Circus on Tsvetnoy Boulevard). The presence of such questions makes the task more challenging and realistic.
4 RuBQ Dataset
4.1 Dataset Statistics
Our dataset has 1,500 unique questions in total. It mentions 2,357 unique entities – 1,218 in questions and 1,250 in answers. There are 242 unique relations in the dataset. The average length of the original questions is 7.99 words (median 7); machine-translated English questions are 10.58 words on average (median 10). 131 questions have more than one correct answer. For 1,154 questions the answers are Wikidata entities, and for 46 questions the answers are literals.
Inspired by a taxonomy of query complexity in LC-QuAD 2.0 [lc-quad20], we annotated the obtained SPARQL queries in a similar way. The query type is defined by the constraints in the SPARQL query, see Table 2. Note that some queries have multiple type tags. For example, the SPARQL query for the question How many moons does Mars have? is assigned the 1-hop and count types and therefore isn’t simple in the sense of the SimpleQuestions dataset.
Taking into account RuBQ’s modest size, we propose to use the dataset primarily for testing rule-based systems, cross-lingual transfer learning models, and models trained on automatically generated examples, similarly to recent MRC datasets [tydiqa, artetxe2019cross, mlqa]. We split the dataset into development (300) and test (1,200) sets so as to keep a similar distribution of query types in both subsets.
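One way to obtain such a split is to stratify by the combination of type tags. The sketch below is an assumption about how this could be done, not a description of the authors' procedure; the function name, the seed, and the rounding rule are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(items, dev_fraction=0.2, seed=0):
    """items: list of (question_id, type_tags). Returns (dev_ids, test_ids)."""
    groups = defaultdict(list)
    for question_id, tags in items:
        groups[tuple(sorted(tags))].append(question_id)  # one stratum per tag combination
    rng = random.Random(seed)
    dev, test = [], []
    for key in sorted(groups):
        ids = groups[key]
        rng.shuffle(ids)
        n_dev = round(len(ids) * dev_fraction)
        dev.extend(ids[:n_dev])
        test.extend(ids[n_dev:])
    return dev, test
```

With `dev_fraction=0.2`, a 1,500-question collection yields roughly the 300/1,200 proportion mentioned above, although very small strata can shift the exact counts because of rounding.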
| Type | Dev/Test | Description |
|---|---|---|
| 1-hop | 198/760 | Query corresponds to a single SPO triple |
| multi-hop | 14/55 | Query’s constraint is applied to more than one fact |
| multi-constraint | 21/110 | Query contains more than one SPARQL constraint |
| qualifier-answer | 1/5 | Answer is a value of a qualifier relation, similar to “fact with qualifiers” in LC-QuAD 2.0 |
| qualifier-constraint | 4/22 | Query poses constraints on qualifier relations; a superclass of “temporal aspect” in LC-QuAD 2.0 |
| reverse | 6/29 | Answer’s variable is a subject in at least one constraint |
| count | 1/4 | Query applies COUNT operator to the resulting entities, same as in LC-QuAD 2.0 |
| ranking | 3/16 | ORDER and LIMIT operators are applied to the entities specified by constraints, same as in LC-QuAD 2.0 |
| 0-hop | 3/12 | Query returns an entity already mentioned in the question. The corresponding questions usually contain definitions or entity’s alternative names |
| exclusion | 4/18 | Query contains NOT IN, which excludes entities mentioned in the question from the answer |
| no-answer | 60/240 | Question cannot be answered with the knowledge base, although answer entity may be present in the KB |
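Some of the type tags follow directly from SPARQL keywords, which a short sketch can illustrate. This is a hypothetical helper, not the annotation procedure used for the dataset: it only covers the count, ranking, and exclusion tags, since the remaining types depend on the graph structure of the query rather than on surface keywords.

```python
import re


def surface_type_tags(sparql: str) -> set:
    """Return the type tags that can be read off SPARQL keywords alone."""
    query = sparql.upper()
    tags = set()
    if re.search(r"\bCOUNT\s*\(", query):
        tags.add("count")
    if re.search(r"\bORDER\s+BY\b", query) and re.search(r"\bLIMIT\b", query):
        tags.add("ranking")
    if re.search(r"\bNOT\s+IN\b", query):
        tags.add("exclusion")
    return tags
```

A tagger of this kind could serve as a sanity check on manually assigned tags rather than as a replacement for them.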
4.2 Dataset Format
| Question | SPARQL query | Answer IDs |
|---|---|---|
| Who wrote the novel “Uncle Tom’s Cabin”? | SELECT ?answer WHERE { wd:Q2222 wdt:P50 ?answer } | Q102513 (Harriet Beecher Stowe) |
| Who played Prince Andrei Bolkonsky in Sergei Bondarchuk’s film “War and Peace”? | SELECT ?answer WHERE { wd:Q845176 p:P161 [ ps:P161 ?answer; pq:P453 wd:Q2737140 ] } | Q312483 (Vyacheslav Tikhonov) |
| Who uses a theodolite for work? | SELECT ?answer WHERE { wd:Q181517 wdt:P366 [ wdt:P3095 ?answer ] } | Q1734662 (cartographer), Q11699606 (geodesist), Q294126 (land surveyor) |
For each entry in the dataset, we provide: the original question in Russian, machine-translated English question obtained through Yandex.Translate (https://translate.yandex.com/), original answer text (may differ textually from the answer entity’s label retrieved from Wikidata), SPARQL query representing the meaning of the question, a list of entities in the query, a list of relations in the query, a list of answers (a result of querying the Wikidata subset, see below), and a list of query type tags, see Table 3 for examples. RuBQ is distributed under CC BY-SA license and is available in JSON format.
The dataset is accompanied by RuWikidata8M, a Wikidata sample containing all the entities with Russian labels (available at https://zenodo.org/record/3751761; the project’s page on GitHub points there). It consists of about 212M triples with 8.1M unique entities. As mentioned before, the sample guarantees the correctness of the queries and answers and makes the experiments with the dataset much simpler. For each entity, we executed a series of CONSTRUCT SPARQL queries to retrieve all the truthy statements and all the full statements with their linked data (details about Wikidata statement types can be found at https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Statement_types). We also added all the triples with the subclass of (P279) predicate to the sample. This class hierarchy can be helpful for the question answering task in the absence of an explicit ontology in Wikidata. The sample contains Russian and English labels and aliases for all its entities.
| 1-hop + reverse | 0 | 3 |
| 1-hop + count | 0 | 2 |
| 1-hop + exclusion | 0 | 2 |
We provide two RuBQ baselines from third-party systems – DeepPavlov and WDAqua – that illustrate two possible approaches to cross-lingual KBQA.
To the best of our knowledge, the KBQA library (http://docs.deeppavlov.ai/en/master/features/models/kbqa.html) from the open NLP framework DeepPavlov [deeppavlov] is the only freely available KBQA implementation for the Russian language. The library uses Wikidata as a knowledge base and implements the standard question processing steps: NER, entity linking, and relation detection. According to the developers of the library, they used machine-translated SimpleQuestions and a dataset for zero-shot relation extraction [levy-etal-2017-zero] to train the model. The library returns a single string or not found as an answer. We obtained an answer entity ID using the reverse ID-label mapping embedded in the model. If no ID was found, we treated the answer as a literal.
WDAqua [WDAqua] is a rule-based KBQA system that answers questions in several languages using Wikidata. WDAqua returns a (possibly empty) ranked list of Wikidata item IDs along with corresponding SPARQL queries. We obtain WDAqua’s answers by sending RuBQ questions machine-translated into English to its API (www.wdaqua.eu/qa).
WDAqua outperforms DeepPavlov in terms of precision@1 on the answerable subset (16% vs. 13%), but demonstrates a lower accuracy on unanswerable questions (43% vs. 73%). Table 4 presents detailed results. In contrast to DeepPavlov, WDAqua returns a ranked list of entities as a response to the query, and for 23 out of 131 questions with multiple correct answers, it managed to perfectly match the set of answers. For eight questions with multiple answers, WDAqua’s top-ranked answers were correct, but the lower-ranked ones contained errors. To facilitate different evaluation scenarios, we provide an evaluation script that calculates precision@1, exact match, and precision/recall/F1 measures, as well as the breakdown of results by query types.
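The measures named above can be sketched for a single question as follows. This is an illustrative implementation, not the evaluation script distributed with the dataset; in particular, the treatment of unanswerable questions (an empty gold set counts as correct only when nothing is predicted) is an assumption.

```python
def precision_at_1(predicted, gold):
    """predicted: ranked list of IDs; gold: set of correct IDs (empty if unanswerable)."""
    if not gold:
        return 1.0 if not predicted else 0.0
    return 1.0 if predicted and predicted[0] in gold else 0.0


def exact_match(predicted, gold):
    """1.0 only if the predicted set coincides with the gold set."""
    return 1.0 if set(predicted) == set(gold) else 0.0


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one question."""
    predicted_set, gold_set = set(predicted), set(gold)
    if not predicted_set and not gold_set:
        return 1.0, 1.0, 1.0
    overlap = len(predicted_set & gold_set)
    precision = overlap / len(predicted_set) if predicted_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Precision@1 rewards a system that returns one correct answer, whereas exact match and F1 distinguish systems that recover the full answer set for questions with multiple correct answers.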
5 Conclusion and Future Work
We presented RuBQ – the first Russian dataset for Question Answering over Wikidata. The dataset consists of 1,500 questions, their machine translations into English, and annotated SPARQL queries. 300 RuBQ questions are unanswerable, which poses a new challenge for KBQA systems and makes the task more realistic. The dataset is based on a collection of quiz questions. The data generation pipeline combines automatic processing, crowdsourced and in-house verification, and proved to be very efficient. The dataset is accompanied by a Wikidata sample of 212M triples that contain 8.1M entities with Russian and English labels, and an evaluation script. The provided baselines demonstrate the feasibility of the cross-lingual approach in KBQA, but at the same time indicate there is ample room for improvements. The dataset is of interest for a wide community of researchers in the fields of Semantic Web, Question Answering, and Semantic Parsing.
In the future, we plan to explore other data sources and approaches for RuBQ expansion: search query suggest APIs as for WebQuestions [webquestions], a large question log [volske2015users], and Wikidata SPARQL query logs (https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en). We will also address complex questions and questions with literals as answers, as well as the creation of a stronger baseline for RuBQ.
We thank Mikhail Galkin, Svitlana Vakulenko, Vladimir Kovalenko, Yaroslav Golubev, and Rishiraj Saha Roy for their valuable comments and fruitful discussion on the paper draft. We also thank Pavel Bakhvalov, who helped collect RuWikidata8M sample and contributed to the first version of the entity linking tool. We are grateful to Yandex.Toloka for their data annotation grant.