
VLSP 2021 Shared Task: Vietnamese Machine Reading Comprehension

One of the emerging research trends in natural language understanding is machine reading comprehension (MRC), the task of finding answers to human questions based on textual data. Existing Vietnamese datasets for MRC research concentrate solely on answerable questions. However, in reality, questions can be unanswerable, i.e., the correct answer is not stated in the given textual data. To address this weakness, we provide the research community with a benchmark dataset named UIT-ViQuAD 2.0 for evaluating the MRC task and question answering systems for the Vietnamese language. We use UIT-ViQuAD 2.0 as the benchmark dataset for the shared task on Vietnamese MRC at the Eighth Workshop on Vietnamese Language and Speech Processing (VLSP 2021). This task attracted 77 participant teams from 34 universities and other organizations. In this article, we present details of the organization of the shared task, an overview of the methods employed by the shared-task participants, and the results. The highest performances are 77.24% in F1-score and 67.43% in Exact Match on the private test set. The Vietnamese MRC systems proposed by the top 3 teams use XLM-RoBERTa, a powerful pre-trained language model based on the transformer architecture. The UIT-ViQuAD 2.0 dataset motivates more researchers to explore Vietnamese machine reading comprehension, question answering, and question generation.


1 Introduction

Machine Reading Comprehension (MRC) is an emerging and challenging task of natural language understanding in which computers read and understand texts and then find correct answers to questions about them. Recently, many shared tasks for machine reading comprehension cui2018span; fisch2019mrqa; zheng2021semeval and various benchmarks richardson-etal-2013-mctest; rajpurkar-etal-2016-squad; rajpurkar-etal-2018-know; joshi2017triviaqa; trischler-etal-2017-newsqa; kocisky-etal-2018-narrativeqa; lai-etal-2017-race; reddy-etal-2019-coqa have attracted a range of researchers from academia and industry. As a result, significant progress has been made over the last ten years in building computational models for semantics based on deep neural networks and transformers seo2016bidirectional; devlin2019bert; conneau2019unsupervised; van2021vireader. However, there has been no MRC shared task for Vietnamese, which is the motivation for us to organize this one.

We hope to use this shared task to examine the capabilities of state-of-the-art deep learning and transformer models to represent and simulate machine reading comprehension in Vietnamese texts.

We introduce the VLSP-2021 Task 4, Vietnamese Machine Reading Comprehension. Inspired by machine reading comprehension benchmarking rajpurkar-etal-2018-know, we design this shared task of Vietnamese reading comprehension, in which computers are given a document D as well as a human question Q to comprehend. In this work, we construct UIT-ViQuAD 2.0, a new dataset that combines answerable questions from the previous version of UIT-ViQuAD (UIT-ViQuAD 1.0 nguyen-etal-2020-vietnamese) with over 12K new unanswerable questions about the same passages. Table 1 illustrates examples of both question types.

Passage: Mã máy nhị phân (khác với mã hợp ngữ) có thể được xem như là phương thức biểu diễn thấp nhất của một chương trình đã biên dịch hay hợp dịch, hay là ngôn ngữ lập trình nguyên thủy phụ thuộc vào phần cứng (ngôn ngữ lập trình thế hệ đầu tiên). Mặc dù chúng ta hoàn toàn có thể viết chương trình trực tiếp bằng mã nhị phân, việc này rất khó khăn và dễ gây ra những lỗi nghiêm trọng vì ta cần phải quản lý từng bit đơn lẻ và tính toán các địa chỉ và hằng số học một cách thủ công. Do đó, ngoại trừ những thao tác cần tối ưu và gỡ lỗi chuyên biệt, chúng ta rất hiếm khi làm điều này. (English: Binary machine code (as opposed to assembly code) can be thought of as the most basic representation of a compiler or assembled program or as a hardware-dependent primitive programming language (the first generation programming). Although it is capable of building programs directly in binary, doing so would be complex and prone to major errors because we must handle every bit as well as compute addresses and constants. As a result, except for procedures requiring optimization and specialized debugging, we very rarely do this.)
Question 1 Dù có thể sử dụng mã máy nhị phân để lập trình, nhưng tại sao các lập trình viên lại không sử dụng nó? (Why don’t programmers utilize binary machine code, even though it is possible?)
Answer những thao tác cần tối ưu và gỡ lỗi chuyên biệt (procedures requiring optimization and specialized debugging)
Answer start 493
Question 2 Ngôn ngữ lập trình thế hệ đầu tiên là ngôn ngữ gì? (What is a first-generation programming language?)
Answer Mã máy nhị phân (Binary machine code)
Answer start 0
Question 3 Ngôn ngữ lập trình hợp ngữ đầu tiên là ngôn ngữ gì? (What is the first assembly language?)
Answer -
Answer start -
Plausible answer Mã máy nhị phân (Binary machine code)
Plausible answer start 0
Table 1: Several passage-question-answer triples extracted from the dataset.

The participating teams produced 590 total submissions within the official VLSP-2021 evaluation period. In this work, we introduce the shared task and present a summary of the evaluation.

In this paper, we make three main contributions, described as follows.

  • Firstly, we propose a Vietnamese span-extraction reading comprehension dataset containing nearly 36,000 human-annotated questions, both answerable and unanswerable, to add linguistic diversity to machine reading comprehension and question answering.

  • Secondly, we organize the VLSP2021-MRC Shared Task for evaluating MRC and question answering models in Vietnamese at VLSP 2021. Our baseline approach obtains under 65% F1-score on the public and private test sets, and no participating team's model passes 78% (in F1) on the private test set, which indicates that our dataset is challenging and can encourage the development of MRC models for Vietnamese.

  • Finally, UIT-ViQuAD 2.0 can also be a good resource for multilingual and cross-lingual research when studied along with other MRC and QA datasets.

The rest of the article is organized as follows. In Section 2, we provide a brief overview of the background and relevant studies. We introduce the VLSP2021-MRC Shared Task in Section 3. Our new dataset (UIT-ViQuAD 2.0) is presented in detail in Section 4. Section 5 presents the systems and results of the participating teams. In Section 6, we provide further analysis of the shared task results. Finally, Section 7 summarizes the findings of the VLSP2021-MRC shared task and suggests several research directions.

2 Background and Related Work

Dataset Language Size Answerable Unanswerable
SQuAD1.1 rajpurkar-etal-2016-squad English 100k+ ✓
SQuAD2.0 rajpurkar-etal-2018-know English 150k+ ✓ ✓
KorQuAD lim2019korquad1 Korean 70k+ ✓
SberQuAD braslavski2020sberquad Russian 50k+ ✓
CMRC-2018 cui2018span Chinese 20k+ ✓
FQuAD1.1 d2020fquad French 60k+ ✓
FQuAD2.0 heinrich2021fquad2 French 60k+ ✓ ✓
UIT-ViNewsQA van2020new Vietnamese 23k+ ✓
UIT-ViQuAD 1.0 nguyen-etal-2020-vietnamese Vietnamese 22k+ ✓
UIT-ViQuAD 2.0 (Ours) Vietnamese 35k+ ✓ ✓
Table 2: Benchmark of existing reading comprehension datasets, including UIT-ViQuAD.

Machine Reading Comprehension (MRC) has attracted many researchers to developing machine learning-based MRC models since the introduction of SQuAD (a large-scale and high-quality dataset) rajpurkar-etal-2016-squad. The growth in human-annotated datasets and computing capabilities are key factors behind the dramatic progress of machine reading comprehension models. In particular, many datasets have been constructed for evaluating the machine reading comprehension task, including extractive MRC datasets (SQuAD rajpurkar-etal-2016-squad, SQuAD 2.0 rajpurkar-etal-2018-know, TriviaQA joshi2017triviaqa, and NewsQA trischler-etal-2017-newsqa), abstractive MRC datasets (NarrativeQA kocisky-etal-2018-narrativeqa, RECAM zheng-etal-2021-semeval), multiple-choice datasets (RACE lai-etal-2017-race and MCTest richardson-etal-2013-mctest), and a conversational reading comprehension dataset (CoQA reddy-etal-2019-coqa). In addition to the creation of MRC datasets, various neural network techniques seo2016bidirectional; devlin2018bert; conneau2019unsupervised; van2021deep have been presented and have made significant progress in this field. Table 2 shows a comparison of different MRC datasets.

Various efforts to create Vietnamese MRC datasets have been conducted. UIT-ViQuAD nguyen-etal-2020-vietnamese and UIT-ViNewsQA van2020new are two corpora for the extractive machine reading comprehension task in Vietnamese. Besides, two Vietnamese QA systems xlmrserini; bertqas have been developed based on automatic reading comprehension techniques. In addition, ViMMRC 9247161 and ViCoQA 10.1007/978-3-030-88113-9_44 are two Vietnamese corpora for multiple-choice reading comprehension and conversational reading comprehension, respectively. Moreover, several MRC and QA methods have been studied on Vietnamese MRC datasets, such as BERT 9352127, ViReader van2021vireader, XLMRserini xlmrserini, and ViQAS bertqas.

Ultimately, SQuAD 2.0 rajpurkar-etal-2018-know and NewsQA trischler-etal-2017-newsqa are two corpora that address the challenge of unanswerable questions in machine reading comprehension, which is similar to our shared task. In general, extractive MRC requires the computer to understand the reading text and retrieve the correct answer from it, which evaluates the computer's comprehension of natural language texts. In our setting, however, the computer must not only answer the given questions as usual but also recognize which questions are unanswerable. Our purpose in the shared task is to construct a dataset that evaluates the ability of the computer on both answerable and unanswerable questions for the extractive machine reading comprehension task.

3 The VLSP2021-MRC Shared Task

3.1 Task Definition

This task aims to evaluate the ability of computers to understand natural language texts and answer relevant questions from users. The task is defined as follows:

  • Input: Given a text T = {t_1, …, t_n} and a question Q = {q_1, …, q_m}, which can be answerable or unanswerable.

  • Output: An answer A = [a_start, a_end], which can be a span extracted directly from T or empty if no answer is found.

The answers returned by the system are represented as character-level answer spans extracted from the reading text. A span begins with an index indicating the location of the answer in the reading text, and its end is the index determined by the sum of the start index and the length of the answer text. Moreover, the questions in this task include both answerable and unanswerable questions (as described in Table 1), which makes it more difficult than the UIT-ViQuAD 1.0 dataset nguyen-etal-2020-vietnamese.
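For concreteness, the minimal sketch below shows how such a character-level span is recovered from the start index and the answer text, using the second question of Table 1; the record layout and field names are illustrative, not the official schema.

```python
# Minimal sketch of character-level answer spans, assuming a SQuAD-style
# record layout; the field names are illustrative, not the official schema.
passage = "Mã máy nhị phân (khác với mã hợp ngữ) có thể được xem như là ..."

example = {
    "question": "Ngôn ngữ lập trình thế hệ đầu tiên là ngôn ngữ gì?",
    "answer_text": "Mã máy nhị phân",
    "answer_start": 0,  # character offset into the passage
}

start = example["answer_start"]
end = start + len(example["answer_text"])  # end index = start + answer length
assert passage[start:end] == example["answer_text"]
```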

According to Table 1, the first and second questions are answerable; their answers are extracted directly from the reading passage (highlighted by colors in the original passage: blue for the first question's answer and red for the second's). The third question is unanswerable; however, following rajpurkar-etal-2018-know, plausible answers are added to the dataset to make it more diverse and to create a challenge for current machine reading comprehension systems, enhancing the ability of computers to understand natural languages.

3.2 Evaluation Metrics

Following the evaluation metrics of SQuAD2.0 rajpurkar-etal-2018-know, we use EM and F1-score as evaluation metrics for Vietnamese machine reading comprehension. These metrics are described below:

  • Exact Match (EM): If the characters of the MRC system's predicted answer exactly match the characters of (one of) the gold standard answer(s), EM = 1 for that question-answer pair; otherwise, EM = 0. The EM metric is a strict all-or-nothing measurement: a single character error results in a score of 0. If the system predicts any textual span as an answer for an unanswerable question, that question receives a score of zero.

  • F1-score: The F1-score is a popular metric in natural language processing and is also used in machine reading comprehension. It is estimated over the individual tokens of the predicted answer against those of the gold standard answers and is based on the number of tokens shared between the predicted and gold standard answers.

The final ranking is evaluated on the test set according to the F1-score (with EM as a secondary metric in case of a tie).
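A minimal sketch of how these two metrics are typically computed, following the SQuAD-style evaluation (the normalization step here is simplified and illustrative):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Simplified normalization: lowercase and split on whitespace.
    # The official SQuAD script also strips punctuation and articles.
    return text.lower().split()

def exact_match(prediction: str, gold: str) -> int:
    # For an unanswerable question, both strings are empty, so EM = 1
    # only when the system also predicts "no answer".
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # If either side is empty (unanswerable question), F1 is 1 only
        # when both are empty.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Per question, the score against the best-matching gold answer is kept,
# and scores are averaged over the whole test set.
```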

3.3 Schedule and Overview Summary

Table 3 shows important dates of the VLSP2021-MRC shared task. It lasted for two months, during which the participating teams spent 27 days developing the models.

Time Phase
October 1st Trial Data
October 5th Public test
October 25th Private test
October 27th Competition end
November 15th Submission deadline
December 15th Notification of acceptance
December 28th Camera-ready due
Table 3: Schedule of the VLSP2021-MRC shared task.

Besides, Table 4 gives an overview of the participants who joined the competition. To get access to the submission system, each team must nominate a delegate and register with the organizers. Only team delegates can submit results to the system (as shown on the leaderboard).

Metric Value
#Registration Teams 77
#Joined Teams 42
#Signed Data Agreements 42
#Paper Submissions 6
Table 4: Participation summary.
Public Test Private Test Overall
Total Entries 551 39 590
Highest F1 84.24 77.24 84.24
Highest EM 77.99 67.43 77.99
Mean F1 70.70 60.96 66.37
Mean EM 61.13 50.47 56.39
Std. F1 12.34 23.38 18.52
Std. EM 12.57 20.82 17.38
Table 5: Results summary.

Finally, Table 5 shows statistics of the participants' results by F1 and EM scores. Overall, the highest EM score is below 80 percent, while the highest F1 score is nearly 84 percent. Both of these highest scores come from the public test; the results on the private test set are lower. Notably, the standard deviations of the F1 and EM scores on the private test set are significantly higher than on the public test set, which means the results vary more widely among the participating teams.

4 Dataset Construction

We propose a new dataset named UIT-ViQuAD 2.0 for this task, the latest version of the Vietnamese Question Answering Dataset. This dataset includes questions from the first version of UIT-ViQuAD nguyen-etal-2020-vietnamese and nearly 13,000 newly human-generated questions, both unanswerable (see Section 4.1) and answerable (see Section 4.2).

Instead of generating unanswerable questions from scratch as in SQuAD 2.0 rajpurkar-etal-2018-know, we transform answerable questions into unanswerable ones. We randomly sample one half of the answerable questions in the original dataset and ask our annotators to transform them into unanswerable questions, which are impossible to answer given the information in the passage. The answers to the answerable questions are then used as plausible answers for the unanswerable questions. This ensures that the unanswerable questions are similar to answerable ones and that the quality of the plausible answers is high enough for further research into the behavior of question answering models.
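Assuming a SQuAD 2.0-style JSON layout (the field names below are illustrative and may differ from the released files), an answerable question and its derived unanswerable counterpart from Table 1 might be stored as follows:

```python
# Hypothetical record layout for an answerable / unanswerable pair,
# following the SQuAD 2.0 convention of "plausible_answers" plus an
# "is_impossible" flag; the exact field names in the released files
# may differ.
answerable = {
    "question": "Ngôn ngữ lập trình thế hệ đầu tiên là ngôn ngữ gì?",
    "is_impossible": False,
    "answers": [{"text": "Mã máy nhị phân", "answer_start": 0}],
}

unanswerable = {
    "question": "Ngôn ngữ lập trình hợp ngữ đầu tiên là ngôn ngữ gì?",
    "is_impossible": True,
    "answers": [],  # no extractable answer in the passage
    # The answer to the original question is kept as a plausible
    # (but incorrect) answer for the transformed question.
    "plausible_answers": [{"text": "Mã máy nhị phân", "answer_start": 0}],
}
```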

4.1 Generating Unanswerable Questions

To generate unanswerable questions, we do a strict process of two phases: (1) unanswerable question creation and (2) unanswerable question validation.

4.1.1 Unanswerable Question Creation

We hire 13 high-quality annotators for the process of generating unanswerable questions, most of whom have experience in annotating datasets for Vietnamese natural language processing. The annotators are carefully trained in 6 phases over 10 days, with 30 questions per phase. In the first 2 phases, we mainly focus on getting the annotators familiar with the task. In the next 4 phases, annotators are asked to create questions covering a diverse range of unanswerable categories. We do this by having all 13 annotators transform the same set of questions; when more than two annotators transform an answerable question into an unanswerable one in the same way, those annotators are asked to transform that question again. As a result, our dataset contains many categories of unanswerable questions, such as Antonym, Overstatement, Understatement, Entity Swap, Normal Word Swap, Adverbial Clause Swap, and Modifiers Swap. This poses new challenges for Vietnamese Machine Reading Comprehension researchers. Table 6 presents the categories of unanswerable questions in UIT-ViQuAD 2.0.

Besides the newly generated unanswerable questions, we include all answerable questions from the previous version of our dataset. This gives us a dataset with roughly one unanswerable question for every two answerable questions. Table 7 summarizes the dataset's overall statistics.

Reasoning Description Example
Antonym Antonym used
Sentence: Vào năm 1171, Richard khởi hành đến Aquitaine với mẹ mình và Henry phong ông là Công tước xứ Aquitaine theo yêu cầu của Eleanor. (In 1171, Richard departed for Aquitaine with his mother, and Henry made him Duke of Aquitaine at Eleanor’s request)
Original question: Richard khởi hành đến Aquitaine với mẹ vào năm nào? (In what year did Richard depart for Aquitaine with his mother?)
Unanswerable question: Richard khởi hành từ Aquitaine với mẹ vào năm nào? (In what year did Richard depart from Aquitaine with his mother?)
Overstatement
Word that has a similar meaning but with a stronger shade of meaning is used
Sentence: Ngày 9 tháng 11 năm 1989, vài đoạn của Bức tường Berlin bị phá vỡ, lần đầu tiên hàng ngàn người Đông Đức vượt qua chạy vào Tây Berlin và Tây Đức. (On November 9, 1989, several parts of the Berlin Wall were collapsed, and for the first time thousands of East Germans crossed into West Berlin and West Germany.)
Original question: Bức tường Berlin đã bị sụp đổ một vài đoạn vào ngày nào? (On which date were some parts of Berlin Wall collapsed?)
Unanswerable question: Bức tường Berlin đã bị sụp đổ hoàn toàn vào ngày nào? (On which date was the Berlin Wall completely collapsed?)
Understatement
Word that has a similar meaning but with a weaker shade of meaning is used
Sentence: Quân đội Nhật Bản chiếm đóng Quảng Châu từ năm 1938 đến 1945 trong chiến tranh thế giới thứ hai. (The Japanese army captured Guangzhou from 1938 to 1945 during the second world war.)
Original question: Khi Chiến tranh Thế giới thứ hai xảy ra thì Quảng Châu bị nước nào chiếm đóng? (During World War II, Guangzhou was captured by which country?)
Unanswerable question: Khi Chiến tranh Thế giới thứ hai xảy ra thì Quảng Châu bị nước nào đe dọa? (During World War II, Guangzhou was threatened by which country?)
Entity Swap Entity replaced by another entity
Sentence: Là cảng Trung Quốc duy nhất có thể tiếp cận được với hầu hết các thương nhân nước ngoài, thành phố này đã rơi vào tay người Anh trong chiến tranh nha phiến lần thứ nhất. (As the only Chinese port accessible to most foreign merchants, the city fell to the British during the First Opium War.)
Original question: Trong cuộc chiến nào thì Anh Quốc đã chiếm được Quảng Châu? (In which war did Britain capture Guangzhou?)
Unanswerable question: Trong cuộc chiến nào thì Nhật đã chiếm được Quảng Châu? (In which war did Japan capture Guangzhou?)
Normal Word Swap
A normal word replaced by another normal word
Sentence: Sự phát hiện của Hofmeister năm 1851 về các thay đổi xảy ra trong túi phôi của thực vật có hoa […] (Hofmeister’s discovery in 1851 of changes occurring in the embryo sac of flowering plants […])
Original question: Năm 1851 nhà sinh học Hofmeister đã tìm ra điều gì ở thực vật có hoa? (In 1851, the biologist Hofmeister discovered what in flowering plants?)
Unanswerable question: Năm 1851 nhà sinh học Hofmeister đã công nhận điều gì ở thực vật có hoa? (In 1851, the biologist Hofmeister accepted what in flowering plants?)
Adverbial Clause Swap
Adverbial clause replaced by another adverbial clause related to the context
Sentence: Trước đó Phạm Văn Đồng từng giữ chức vụ Thủ tướng Chính phủ Việt Nam Dân chủ Cộng hòa từ năm 1955 đến năm 1976. Ông là vị Thủ tướng Việt Nam tại vị lâu nhất (1955–1987). Ông là học trò, cộng sự của Chủ tịch Hồ Chí Minh. (Pham Van Dong previously held the position of Prime Minister of the Democratic Republic of Vietnam from 1955 to 1976. He was the longest-serving Prime Minister of Vietnam (1955-1987). He was a student and collaborator of President Ho Chi Minh.)
Original question: Giai đoạn năm 1955-1976, Phạm Văn Đồng nắm giữ chức vụ gì? (In the period 1955-1976, what position did Pham Van Dong hold?)
Unanswerable question: Khi là cộng sự của chủ tịch Hồ Chí Minh, Phạm Văn Đồng nắm giữ chức vụ gì? (As a collaborator of President Ho Chi Minh, what position did Pham Van Dong hold?)
Modifiers Swap
Modifier of one word in the given context is used for another word
Sentence: Các phần mềm giáo dục đầu tiên trong lĩnh vực giáo dục đại học (cao đẳng) và tập trung được thiết kế chạy trên máy tính đơn (hoặc các thiết bị cầm tay). Lịch sử của các phần mềm này được tóm tắt trong SCORM 2004 2nd edition Overview (phần 1.3) (The first educational software in the field of higher education (college) and concentration was designed to run on a single computer (or portable devices). The history of these software is summarized in SCORM 2004 2nd edition Overview (section 1.3).)
Original question: Lịch sử của các phần mềm giáo dục đầu tiên trong lĩnh vực giáo dục đại học (cao đẳng) được tóm tắt, ghi nhận ở đâu? (Where was the history of the first educational software in the field of higher education (college) summarized and recorded?)
Unanswerable question: Lịch sử của các phần mềm giáo dục trong lĩnh vực giáo dục đại học (cao đẳng) được tóm tắt, ghi nhận đầu tiên ở đâu? (Where was the history of the educational software in the field of higher education (college) first summarized and recorded?)
Table 6: Categories of unanswerable questions in UIT-ViQuAD 2.0.

4.1.2 Unanswerable Question Validation

Before publishing the dataset for the evaluation campaign, we carefully validated the newly created unanswerable questions following a procedure inspired by nguyen-etal-2020-vietnamese. To help annotators gradually improve at generating unanswerable questions, after every 3,000 unanswerable questions generated, we asked our annotators to self-validate the questions they had created and write short documents reflecting on their errors. This effort minimizes the possibility that annotators repeat their errors too many times.

To further reduce the error rate in our unanswerable questions, we have a separate cross-validation phase after finishing the creation of 12,000 unanswerable questions. For this phase, we hired ten annotators, each of whom had generated over 1,000 unanswerable questions during the creation phase. This helped filter out annotators with little experience in annotating unanswerable questions and reduce noise during the validation phase. Our team then investigated and confirmed every error detected by the annotators. To maximize the probability of detecting errors in newly generated unanswerable questions, we give our annotators incentives to check the dataset carefully by additionally rewarding them for each error they correctly detect.

4.2 Additional Difficult Answerable Questions

In addition to the answerable questions from UIT-ViQuAD 1.0, we also hire five annotators who have experience in research on Vietnamese natural language processing and clearly understand the different reasoning skills sugawara-etal-2017-evaluation that are important for evaluating the comprehension ability of models, to annotate more challenging answerable questions that require more reasoning ability from models to answer correctly. The selected annotators are encouraged to spend at least 3 minutes per question. When generating this set of questions, our purpose is to pose more challenges to researchers in the VLSP 2021 Evaluation Campaign and to encourage further analysis of the effects of unanswerable questions in future work.

4.3 Overview Statistics of UIT-ViQuAD 2.0

Train Public Test Private Test All
Number of articles 138 19 19 176
Number of passages 4,101 557 515 5,173
Number of total questions 28,457 3,821 3,712 35,990
Number of unanswerable questions 9,217 1,168 1,116 11,501
Average passage length 179.0 167.6 177.3 177.6
Average answerable question length 14.6 14.3 14.7 14.6
Average unanswerable question length 14.7 14.0 14.5 14.6
Table 7: Overview Statistics of UIT-ViQuAD 2.0.

The general statistics of the dataset are given in Table 7. UIT-ViQuAD 2.0 comprises 35,990 question-answer-passage triples (including 9,217 unanswerable questions in the training set). The organizers provide training, public test, and private test sets to the participating teams. For the public and private test sets, we only provide passages and their questions, without answers, to the teams.

5 Systems and Results

5.1 Baseline System

Following devlin2019bert, we adopt transfer learning based on BERT (Bidirectional Encoder Representations from Transformers) for our baseline system. To adapt it to our dataset, we slightly modify the run_squad.py script (https://github.com/google-research/bert/blob/master/run_squad.py) while keeping the majority of the original code. mBERT is trained on 104 languages, including Vietnamese. In addition, we use the transformers library by Hugging Face (https://huggingface.co/) to fine-tune mBERT on our question-answering dataset. We tuned the parameters to suit our dataset in both the training and model evaluation processes.

For the baseline system, we used an initial learning_rate of 3e-5 with a batch_size of 32 and trained for two epochs. The max_seq_length and doc_stride are set to 384 and 128, respectively.
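A minimal sketch of this fine-tuning setup with the Hugging Face transformers library is shown below; the checkpoint name and preprocessing details are assumptions rather than the exact script used for the baseline.

```python
# Sketch of the baseline setup: fine-tuning multilingual BERT for
# span-extraction QA with the Hugging Face transformers library.
# The checkpoint name and preprocessing are assumptions, not the
# organizers' exact script.
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          TrainingArguments)

checkpoint = "bert-base-multilingual-cased"  # mBERT, covers Vietnamese
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# Hyperparameters reported for the baseline system; a Trainer would
# consume these arguments together with the preprocessed dataset.
args = TrainingArguments(
    output_dir="mbert-uit-viquad2",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=2,
)

def preprocess(question: str, context: str):
    # Long passages are split into overlapping windows of 384 tokens
    # with a stride of 128, as in the baseline configuration.
    return tokenizer(
        question,
        context,
        max_length=384,
        stride=128,
        truncation="only_second",
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
    )
```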

5.2 Shared Task Submissions

The AIHUB platform (https://aihub.vn) was used to manage all submissions. We received entries from 24 teams for the public test, while for the private test we received submissions from 19 teams. The systems using the pre-trained language model XLM-R achieved state-of-the-art results. Six of these teams submitted system description papers. Each is briefly described below.

5.2.1 The vc-tus team

To address unanswerable questions, vlspmrc1 present a novel Vietnamese MRC technique based on the Retrospective Reader zhang2021retrospective. Furthermore, they concentrate on increasing answer-extraction ability by effectively using attention mechanisms and boosting representation ability through semantic information. They also provide an ensemble method that obtains considerable improvements over single-model results. This strategy helped them achieve first place in the Vietnamese MRC shared task.

5.2.2 The ebisu_uit team

vlspmrc4 suggest a novel method for Vietnamese reading comprehension. To tackle the machine reading comprehension (MRC) task in Vietnamese, they apply BLANC (BLock AttentioN for Context prediction) seonwoo2020context on top of pre-trained language models. With this strategy, their model produced good results, achieving an F1-score of 77.222% on the private test set of the MRC task at the VLSP 2021 shared task and placing second overall.

5.2.3 The F-NLP team

To learn the correlation between the start position and the end position in pure-MRC output prediction, vlspmrc3 present two types of joint models for answerability prediction and pure-MRC prediction, with and without a dependence mechanism. In addition, they employ ensemble models and a verification technique that selects the best answer from among the top K answers provided by various models.

5.2.4 The UIT-MegaPikachu team

vlspmrc5 propose a new system that employs simple yet highly effective methods. The system uses a pre-trained language model (PrLM), XLM-RoBERTa (XLM-R) conneau2019unsupervised, combined with filtering results from multiple outputs to produce the final result. The system generates about 5-7 output files and selects the answer with the most repetitions as the final predicted answer.
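This filtering step can be read as simple majority voting over the per-question predictions of several runs. A hedged sketch, assuming each output file maps question IDs to predicted answer strings:

```python
# Sketch of majority voting over several prediction files, assuming each
# file maps question IDs to predicted answer strings (illustrative only).
from collections import Counter

def vote(prediction_files: list[dict[str, str]]) -> dict[str, str]:
    final = {}
    question_ids = prediction_files[0].keys()
    for qid in question_ids:
        answers = [preds[qid] for preds in prediction_files]
        # Keep the answer string repeated most often across the runs;
        # an empty string stands for "unanswerable".
        final[qid], _ = Counter(answers).most_common(1)[0]
    return final
```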

5.2.5 The UITSunWind team

vlspmrc2 describe a new approach to this task at the VLSP 2021 shared task on Vietnamese Machine Reading Comprehension. They propose a model called MRC4MRC, a combination of two MRC components. MRC4MRC, based on the XLM-RoBERTa pre-trained language model, achieves an F1-score of 79.13% and an EM (Exact Match) of 69.72% on the public test set. Although this model ranks in the top 5, its EM performance on answerable questions is the highest on the private test. Their experiments also show that the XLM-R language model is better than the powerful PhoBERT language model.

5.2.6 The HN-BERT team

vlspmrc6 present an unsupervised context selector that reduces the length of a given context while keeping the answers within related contexts. They also apply numerous training strategies on the VLSP2021-MRC shared task dataset, including unanswerable question sample selection and several adversarial training approaches, which improve performance by 2.5% in EM and 1% in F1.

Public Test Private Test
Team F1 EM Team F1 EM
Human 87.335 81.818 Human 82.849 75.500
NLP_HUST 84.236 77.728 vc-tus 77.241 66.137
NTQ 84.089 77.990 ebisu_uit 77.222 67.430
ebisu_uit 82.622 73.698 F-NLP 76.456 64.655
vc-tus 81.013 71.316 UIT-MegaPikachu 76.386 65.329
F-NLP 80.578 70.662 SDSOM 75.981 63.012
SDSOM 79.594 69.092 UITSunWind 75.587 64.871
UITSunWind 79.130 69.720 Big Heroes 74.241 61.126
UIT-MegaPikachu 78.637 68.804 914-clover 73.027 61.853
914-Clover 78.515 69.013 NTQ 72.863 60.938
Big Heroes 78.491 68.150 Hey VinMart 70.352 57.786
PhoKho-UIT 75.894 65.533 PhoKho-UIT 70.198 58.378
HN-BERT 75.842 63.544 HN-BERT 70.100 56.466
Hey VinMart 75.759 64.590 Deep-NLP 69.220 59.429
Deep-NLP 74.767 66.789 ABC 63.625 55.280
ABC 69.287 57.864 BASELINE 60.338 49.353
ct-nlp 68.971 58.859
tpp 68.484 57.786
S-NLP 67.589 65.140
BASELINE 63.031 53.546
Table 8: Final results on the public and private test sets, evaluated on EM and F1 scores. Participating teams are ranked by their highest F1-score.

5.3 Human Performance

To estimate human performance on this task, we employ a team to answer a set of 100 samples from the public test set and 100 samples from the private test set. There are four annotators, with two of them working on each set.

In each instance, the annotator is given a passage with a question and must answer the question using the information in the passage. If there is no answer, the question is unanswerable, and the annotator marks "true" in the "is unanswerable" field. Following the answering phase, we compute human performance using the F1-score and exact match scores for both the public and private tests.

To calculate human performance, we use the method given for SQuAD2.0 rajpurkar-etal-2018-know. We have four responses per question in the ground truth, so we choose the final ground truth by majority voting, preferring the shortest answer, as explained in SQuAD2.0. After obtaining the gold answer, we compute the F1 and EM scores between the gold answer and the answers of the two annotators who answered the public test set. Then, by averaging the results of the two annotators, we compute the final F1 and EM scores of human performance on the public test. The computation is carried out on the private test in the same manner. As a result, the final F1 and EM scores of human performance are 87.34% and 81.82% on the public test set, respectively, and 82.85% and 75.50% on the private test set.
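This gold-answer selection can be summarized as majority voting with a tie broken toward the shortest answer; a minimal sketch (the layout of the response list is illustrative):

```python
# Sketch of choosing the gold answer from several human responses:
# majority voting, preferring the shortest answer on ties (the layout
# of the response list is illustrative).
from collections import Counter

def choose_gold(responses: list[str]) -> str:
    counts = Counter(responses)
    best_count = max(counts.values())
    candidates = [ans for ans, c in counts.items() if c == best_count]
    # Prefer the shortest answer among the most frequent ones.
    return min(candidates, key=len)
```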

5.4 Experimental Results

According to our statistics, a total of 25 teams registered to participate in the Vietnamese Machine Reading Comprehension task of the VLSP 2021 shared task. These teams come from prestigious universities, companies, and organizations. Then, out of the 25 teams participating in the development phase of their systems on the public test, we selected the 18 teams that outperformed the baseline to further evaluate their systems on the private test.

The results of the teams in the two rounds are aggregated and shown in Table 8. The teams are ranked by F1 score in both rounds. In the public test round, our mBERT baseline model achieved 63.03% F1 and 53.55% EM. There were 18 teams whose results outperformed the baseline in terms of F1. Overall, 14 teams obtained F1 scores above 70%, and 5 teams exceeded 80%. Specifically, the top three teams in the public test, NLP_HUST, NTQ, and ebisu_uit, achieved F1 scores of 84.24%, 84.09%, and 82.62%, respectively. The results of the top two teams are very close; the difference between them is less than 0.2%. Additionally, the NTQ team's model scored slightly lower in F1 than NLP_HUST, but it achieved the highest EM performance of 77.99%.

Regarding the private test round, the baseline model achieved an F1 score of 60.34% and an EM score of 49.35%. Of the 18 teams that passed the public test, 14 continued to evaluate their systems on the private test set. There were many unexpected changes in the teams' results, especially in how the top three teams emerged. While placing only 4th with an F1 of 81.01% on the public test set, team vc-tus took 1st position in the private test round with an F1 score of 77.24%. Besides, the ebisu_uit team maintained a stable level from the public test round to the private test round, keeping 2nd place in the rankings with an F1 score of 77.22%. Once again, the results of the 1st and 2nd place teams are not much different. Furthermore, ebisu_uit is also the team with the highest result on the EM measure, with 67.43%. The F-NLP team shows a similar trend to vc-tus: after finishing 5th in the public test round, their system finished this task in 3rd place with an F1 score of 76.46%. Generally, all the teams in this round had trouble with the private test set since its difficulty had increased significantly; as a result, the teams' submission results dropped considerably compared to the public test round.

6 Result Analysis

To gain deeper insight into machine reading comprehension and question answering in Vietnamese, we conduct an analysis of the results based on the strongest models at the VLSP2021-MRC shared task.

Figure 1: Submission progress of the top 5 teams in the public test phase.
Figure 2: Submission progress of the teams whose results were higher than the baseline score in the private test phase.
Answerable Unanswerable Overall
Teams Models EM F1 EM F1 EM F1
vc-tus Retrospective Reader + XLM-R (Ensemble) 57.67 73.54 85.84 85.84 66.14 77.24
ebisu_uit BLANC + XLM-R/SemBERT (Ensemble) 56.59 70.59 92.65 92.65 67.43 77.22
F-NLP XLM-R (Ensemble) 58.78 75.66 78.32 78.32 64.66 76.46
UIT-MegaPikachu XLM-R (Single) 58.82 74.63 80.47 80.47 65.33 76.39
UITSunWind XLM-R + BiLSTM (Ensemble) 58.94 74.26 78.67 78.67 64.87 75.59
HN-BERT PhoBERT_Large+R3F+CS (Single) 47.50 66.99 77.33 77.33 56.47 70.10
Baseline mBERT (Single) 41.72 57.43 67.11 67.11 49.35 60.34
Table 9: Final results on answerable and unanswerable questions of the private test set, evaluated on EM and F1 scores.

6.1 Competition Progress Analysis

Figure 1 illustrates the submission progress of the top 5 teams on the public test from October 5, 2021, to October 24, 2021. In this phase, we allowed 10 submissions per day. However, as shown in Figure 1, the submission results on both F1 and EM scores are not stable and oscillate over the submission period. Besides, the results by EM score are no higher than 80%, indicating the challenge the dataset poses for the participants.

In addition, Figure 2 illustrates the final submission results of the participating teams. The private test started on October 25, 2021, and ended on October 27, 2021. Within the 3 days of submission, the results on the F1 score did not change much, and both the F1 and EM scores achieved by participants were not higher than 80% in this phase. Notably, in the final results, the ebisu_uit team has a lower F1 result than the vc-tus team but achieved a higher result on the EM score. It can be seen from the chart that team vc-tus achieved the best result by F1-score and team ebisu_uit achieved the best result by EM score, which placed them first and second in the competition.

6.2 Answerable vs. Unanswerable Analysis

To better understand the ability of the MRC systems to answer questions, we analyze human performance and the experimental results of the baseline model and the participating teams vlspmrc1; vlspmrc2; vlspmrc3; vlspmrc4; vlspmrc5; vlspmrc6. Table 9 shows the final results on answerable and unanswerable questions of the private test set, evaluated with EM and F1 scores. As seen from the table, performance on unanswerable questions is always higher than on answerable questions. The ebisu_uit team achieved the best performance on unanswerable questions with over 92% F1. However, the F-NLP and UITSunWind teams achieved the highest scores on answerable questions, with 75.66% F1 and 58.94% EM, respectively. Interestingly, the vc-tus team did not obtain the best performance on either answerable or unanswerable questions, but it achieved the best overall F1-score because it balanced the performance between the two question types better than the other teams.

6.3 Challenging Question Examples

We select several typical examples of answerable and unanswerable questions that were difficult for the models proposed by the participating teams vlspmrc1; vlspmrc2; vlspmrc3; vlspmrc4; vlspmrc5; vlspmrc6. Table 10 presents several examples and explanations of why the models failed to predict correct answers. We will explore more complex questions inspired by the works sugawara2018makes; sugawara2020assessing.

Example Explanation
Passage: Sau khi sinh ra, Edward được một nhũ mẫu có tên Mariota hoặc Mary Maunsel chăm sóc trong vài tháng trước khi bà ta phát bệnh, và Alice de Leygrave trở thành dưỡng mẫu của ông. Ông có thể hoàn toàn không biết mặt người mẹ ruột Eleanor đã ở Gascony với cha ông trong những năm đầu đời của ông. (After his birth, Edward was cared for by a wet nurse named Mariota or Mary Maunsel for several months before she fell ill, and Alice de Leygrave became his foster mother. He may have been completely unfamiliar with his biological mother Eleanor, who was in Gascony with his father during his early years.)
Question: Thân mẫu của Edward II là ai? (Who is Edward II’s mother?)
Correct answer: Eleanor
Questions requiring multiple reasoning steps challenge all MRC systems in the shared task. For example, this question requires co-reference, lexical knowledge, and external knowledge to find the correct answer.
Co-reference: Ông (He) refers to Edward.
Lexical knowledge: Mẹ ruột (biological mother) has the same meaning as Thân mẫu.
External knowledge: Edward in this context is Edward II, not Edward I, Edward III, etc.
Passage: Tháng 6 năm 2010, Apple cho ra mắt chiếc iPhone 4, chiếc smartphone thiết kế cao cấp với hai mặt kính và khung kim loại, màn hình độ phân giải cao nhất với độ phân giải 960x640 pixel được gọi là màn hình Retina, cùng với vi xử lý Apple A4 (ARM Cortex A8) mạnh mẽ và bộ nhớ Ram 512 MB và camera nâng cấp lớn lên đến 5 MP quay phim 720p với 30 khung hình 1 giây và có đèn Led ở đằng sau, đây cũng là chiếc smartphone được trang bị camera trước với độ phân giải VGA và tính năng gọi video call lần đầu tiên có tên là Facetime độc quyền của Apple. (In June 2010, Apple released the iPhone 4, a premium-design smartphone with two glass sides and a metal frame, the highest-resolution screen with a resolution of 960x640 pixels called the Retina display, a powerful Apple A4 (ARM Cortex A8) processor and 512 MB of RAM, and a sizeable camera upgrade to 5 MP with 720p video recording at 30 frames per second and an LED light on the back; this is also the smartphone equipped with a VGA-resolution front camera and, for the first time, the video-calling feature called FaceTime, exclusive to Apple.)
Question: Dung lượng bộ nhớ của Apple A4 là bao nhiêu? (How much memory does the Apple A4 have?)
Correct answer: “”
Predicted answer by top 5 teams: 512 MB
There are many entity objects in the context, and the relationships between these objects are ambiguous. In this example, all MRC systems fail to resolve the ambiguity of whether 512 MB relates to the Apple A4 or to the RAM.
Table 10: Several examples and explanations that the models failed to predict correct answers. The example texts in the ViQuAD 2.0 dataset are taken from the Vietnamese Wikipedia.

7 Conclusion and Future Work

The VLSP2021-MRC Shared Task on Machine Reading Comprehension for Vietnamese was organized at VLSP 2021. Although 77 teams signed up to receive the training datasets, only 24 teams submitted their results. Because several teams enrolled in multiple shared tasks at VLSP 2021, the other teams may not have had enough time to explore MRC models. This shared task provides valuable resources for developing Vietnamese machine reading comprehension, question answering, and other AI applications based on MRC and QA models.

To increase the performance of machine reading comprehension systems, we intend to increase the quantity and quality of annotated questions in the future. In addition, we plan to create more difficult questions based on findings proposed in the research works sugawara2017evaluation; sugawara2018makes; sugawara2021benchmarking. UIT-ViQuAD 2.0 can also be used to evaluate various other NLP tasks: question answering with retriever-reader techniques chen2017reading; bertqas, question generation du2017learning, and information retrieval karpukhin2020dense. We will explore more complex questions inspired by the works sugawara2018makes; sugawara2020assessing. Finally, UIT-ViQuAD 2.0 will be provided for evaluating MRC and QA models, including the training set, the development set (public test set), and the test set (private test set).

Acknowledgments

The authors would like to thank the aihub.vn team (https://aihub.vn/) and the annotators for their hard work in supporting the shared task. The VLSP Workshop was supported by the following organizations: VINIF, Aimsoft, Zalo, Bee, and INT2, and the following universities: VNU-HCM University of Information Technology, VNU University of Science, and VNU University of Engineering and Technology. Kiet Van Nguyen was funded by Vingroup JSC and supported by the Master, PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Institute of Big Data, code VINIF.2021.TS.026.

References