1 Introduction
Machine Reading Comprehension (MRC) is an emerging and challenging natural language understanding task in which computers read texts and then find correct answers to questions about them. Recently, many shared tasks for machine reading comprehension cui2018span; fisch2019mrqa; zheng2021semeval and various benchmarks richardson-etal-2013-mctest; rajpurkar-etal-2016-squad; rajpurkar-etal-2018-know; joshi2017triviaqa; trischler-etal-2017-newsqa; kocisky-etal-2018-narrativeqa; lai-etal-2017-race; reddy-etal-2019-coqa have attracted researchers from both academia and industry. As a result, significant progress has been made in building computational models for semantics based on deep neural networks and transformers seo2016bidirectional; devlin2019bert; conneau2019unsupervised; van2021vireader over the last ten years. However, there has been no MRC shared task for Vietnamese, which motivated us to organize this one. We hope to use this shared task to examine the capabilities of state-of-the-art deep learning and transformer models to represent and simulate machine reading comprehension in Vietnamese texts.
We introduce the VLSP-2021 Task 4: Vietnamese Machine Reading Comprehension. Inspired by machine reading comprehension benchmarking rajpurkar-etal-2018-know, we design this shared task of Vietnamese reading comprehension, in which computers are given a document D as well as a human question Q to comprehend. In this work, we construct UIT-ViQuAD 2.0, a new dataset that combines the answerable questions from the previous version of UIT-ViQuAD (UIT-ViQuAD 1.0 nguyen-etal-2020-vietnamese) with over 12K new unanswerable questions about the same passages. Table 1 illustrates examples of both answerable and unanswerable questions.
Passage: Mã máy nhị phân (khác với mã hợp ngữ) có thể được xem như là phương thức biểu diễn thấp nhất của một chương trình đã biên dịch hay hợp dịch, hay là ngôn ngữ lập trình nguyên thủy phụ thuộc vào phần cứng (ngôn ngữ lập trình thế hệ đầu tiên). Mặc dù chúng ta hoàn toàn có thể viết chương trình trực tiếp bằng mã nhị phân, việc này rất khó khăn và dễ gây ra những lỗi nghiêm trọng vì ta cần phải quản lý từng bit đơn lẻ và tính toán các địa chỉ và hằng số học một cách thủ công. Do đó, ngoại trừ những thao tác cần tối ưu và gỡ lỗi chuyên biệt, chúng ta rất hiếm khi làm điều này. (English: Binary machine code (as opposed to assembly code) can be thought of as the lowest-level representation of a compiled or assembled program, or as a hardware-dependent primitive programming language (a first-generation programming language). Although it is entirely possible to write programs directly in binary code, doing so is difficult and prone to serious errors, because we must manage every single bit and compute addresses and numeric constants manually. As a result, except for procedures requiring optimization and specialized debugging, we very rarely do this.)
Question 1 | Dù có thể sử dụng mã máy nhị phân để lập trình, nhưng tại sao các lập trình viên lại không sử dụng nó? (Why don’t programmers utilize binary machine code, even though it is possible?) |
Answer | những thao tác cần tối ưu và gỡ lỗi chuyên biệt (procedures requiring optimization and specialized debugging) |
Answer start | 493 |
Question 2 | Ngôn ngữ lập trình thế hệ đầu tiên là ngôn ngữ gì? (What is a first-generation programming language?) |
Answer | Mã máy nhị phân (Binary machine code) |
Answer start | 0 |
Question 3 | Ngôn ngữ lập trình hợp ngữ đầu tiên là ngôn ngữ gì? (What is the first assembly language?) |
Answer | - |
Answer start | - |
Plausible answer | Mã máy nhị phân (Binary machine code) |
Plausible answer start | 0 |
The participating teams produced 590 submissions in total within the official VLSP-2021 evaluation period. In this work, we introduce the shared task and present a summary of the evaluation.
In this paper, we have three main contributions described as follows.
-
Firstly, we propose a Vietnamese span-extraction reading comprehension dataset containing nearly 36,000 human-annotated questions, both answerable and unanswerable, which adds linguistic diversity to machine reading comprehension and question answering.
-
Secondly, we organize the VLSP2021-MRC Shared Task for evaluating MRC and question answering models in Vietnamese at VLSP 2021. Our baseline approach obtains under 65% F1-score on the public and private test sets, and no model from the participating teams passes 78% (F1) on the private test set, which indicates that our dataset is challenging and can encourage the development of MRC models for Vietnamese.
-
Finally, UIT-ViQuAD 2.0 could also be a good resource for multilingual and cross-lingual research purposes when studied along with other MRC and QA datasets.
The rest of this article is organized as follows. Section 2 provides a brief overview of the background and related work. Section 3 introduces the VLSP2021-MRC Shared Task. Section 4 presents our new dataset (UIT-ViQuAD 2.0) in detail. Section 5 presents the systems and results of the participating teams. In Section 6, we provide further analysis of the shared task results. Finally, Section 7 summarizes the findings of the VLSP2021-MRC shared task and suggests several research directions.
2 Background and Related Work
Dataset | Language | Size | Answerable | Unanswerable |
SQuAD1.1 rajpurkar-etal-2016-squad | English | 100k+ | ✓ | |
SQuAD2.0 rajpurkar-etal-2018-know | English | 150k+ | ✓ | ✓ |
KorQuAD lim2019korquad1 | Korean | 70k+ | ✓ | |
SberQuAD braslavski2020sberquad | Russian | 50k+ | ✓ | |
CMRC-2018 cui2018span | Chinese | 20k+ | ✓ | |
FQuAD1.1 d2020fquad | French | 60k+ | ✓ | |
FQuAD2.0 heinrich2021fquad2 | French | 60k+ | ✓ | ✓ |
UIT-ViNewsQA van2020new | Vietnamese | 23k+ | ✓ | |
UIT-ViQuAD 1.0 nguyen-etal-2020-vietnamese | Vietnamese | 22k+ | ✓ | |
UIT-ViQuAD 2.0 (Ours) | Vietnamese | 35k+ | ✓ | ✓ |
Machine Reading Comprehension (MRC) has attracted many researchers to develop machine learning-based MRC models since the introduction of SQuAD, a large-scale and high-quality dataset rajpurkar-etal-2016-squad. The growth in human-annotated datasets and computing capabilities is a key factor behind the dramatic progress of machine reading comprehension models. In particular, many datasets have been constructed for evaluating the machine reading comprehension task, including extractive MRC datasets (SQuAD rajpurkar-etal-2016-squad, SQuAD 2.0 rajpurkar-etal-2018-know, TriviaQA joshi2017triviaqa, and NewsQA trischler-etal-2017-newsqa), abstractive MRC datasets (NarrativeQA kocisky-etal-2018-narrativeqa, RECAM zheng-etal-2021-semeval), multiple-choice datasets (RACE lai-etal-2017-race and MCTest richardson-etal-2013-mctest), and a conversational reading comprehension dataset (CoQA reddy-etal-2019-coqa). In addition to the creation of MRC datasets, various neural network techniques seo2016bidirectional; devlin2018bert; conneau2019unsupervised; van2021deep have been presented and have made significant progress in this field. Table 2 compares different MRC datasets.
Various efforts to create Vietnamese MRC datasets have been conducted. UIT-ViQuAD nguyen-etal-2020-vietnamese and UIT-ViNewsQA van2020new are two corpora for the extractive machine reading comprehension task in Vietnamese. Besides, two Vietnamese QA systems xlmrserini; bertqas have been developed based on automatic reading comprehension techniques. In addition, ViMMRC 9247161 and ViCoQA 10.1007/978-3-030-88113-9_44 are two Vietnamese corpora for multiple-choice reading comprehension and conversational reading comprehension, respectively. Furthermore, a few MRC and QA methods have been studied on Vietnamese MRC datasets, such as BERT 9352127, ViReader van2021vireader, XLMRserini xlmrserini, and ViQAS bertqas.
Notably, SQuAD 2.0 rajpurkar-etal-2018-know and NewsQA trischler-etal-2017-newsqa are two corpora that pose the challenge of unanswerable questions in machine reading comprehension, which is similar to our shared task. In general, extractive MRC requires the computer to understand the reading text and retrieve the correct answer from it, which can evaluate the computer's comprehension of natural language texts. In our setting, however, the computer not only answers given questions as usual but also has to recognize which questions are unanswerable. Our purpose in the shared task is to construct a dataset to evaluate the ability of the computer on both answerable and unanswerable questions for the extractive machine reading comprehension task.
3 The VLSP2021-MRC Shared Task
3.1 Task Definition
This task aims to assess the ability of computers to understand natural language texts and answer relevant questions from users. The task is defined as follows:
-
Input: Given a text $T = \{t_1, t_2, \dots, t_n\}$ and a question $Q = \{q_1, q_2, \dots, q_m\}$, which can be answerable or unanswerable.
-
Output: An answer $A = [a_{start}, a_{end}]$, which can be a span extracted directly from $T$ or empty if no answer is found.
The answers returned by the system are represented as character-level spans extracted from the reading text. Each span begins at an index indicating the location of the answer in the reading text, and its end index is the sum of the start index and the length of the answer text. Moreover, the questions in this task include both answerable and unanswerable questions (as described in Table 1), which makes it more difficult than the original ViQuAD dataset nguyen-etal-2020-vietnamese.
According to Table 1, the first and second questions are answerable; their answers are extracted directly from the reading passage (highlighted in the passage: blue for the first question and red for the second). The third question is unanswerable; however, following rajpurkar-etal-2018-know, plausible answers are added to the dataset to make it more diverse and to challenge current machine reading comprehension models, thereby enhancing the ability of computers to understand natural languages.
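To make this representation concrete, the sketch below shows how a character-level span maps back to the passage, using the second question from Table 1; the function name extract_span is purely illustrative and not part of the released tooling.

```python
# Minimal sketch of the character-level answer-span representation described
# above; extract_span is an illustrative helper, not part of the shared task code.

def extract_span(passage: str, answer_start: int, answer_text: str) -> str:
    """Return the span [answer_start, answer_start + len(answer_text)) of the passage."""
    answer_end = answer_start + len(answer_text)
    return passage[answer_start:answer_end]

# Passage and gold answer for Question 2 in Table 1 (passage truncated here).
passage = "Mã máy nhị phân (khác với mã hợp ngữ) có thể được xem như là ..."
answer_text = "Mã máy nhị phân"
answer_start = 0  # character offset of the answer in the passage

assert extract_span(passage, answer_start, answer_text) == answer_text

# For an unanswerable question (e.g., Question 3 in Table 1) there is no gold
# span, and systems are expected to return an empty answer.
```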
3.2 Evaluation Metrics
Following the evaluation metrics of SQuAD 2.0 rajpurkar-etal-2018-know, we use EM and F1-score as the evaluation metrics for Vietnamese machine reading comprehension. These metrics are described as follows:
-
Exact Match (EM): For each question-answer pair, EM = 1 if the characters of the MRC system's predicted answer exactly match the characters of (one of) the gold standard answer(s); otherwise, EM = 0. The EM metric is a strict all-or-nothing measure: a single wrong character results in a score of 0. If the system predicts any textual span as an answer to an unanswerable question, that question receives a score of 0.
-
F1-score: The F1-score is a popular metric for natural language processing tasks and is also used in machine reading comprehension. It is estimated over the individual tokens in the predicted answer against those in the gold standard answers, based on the number of tokens shared between the predicted and gold standard answers.
The final ranking is evaluated on the test set, according to the F1-score (EM as a secondary metric when there is a tie).
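For clarity, the following is a minimal Python sketch of the SQuAD-style EM and token-level F1 computation described above; the official evaluation script additionally normalizes answers (e.g., lowercasing and stripping punctuation and articles), which is omitted here.

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    # Unanswerable questions are scored against the empty string.
    return int(prediction.strip() == gold.strip())

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # If either side is empty (e.g., an unanswerable question),
    # F1 is 1 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def score(prediction: str, golds: list[str]) -> tuple[int, float]:
    # A question may have several gold answers; the best match counts.
    return (max(exact_match(prediction, g) for g in golds),
            max(f1_score(prediction, g) for g in golds))
```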
3.3 Schedule and Overview Summary
Table 3 shows important dates of the VLSP2021-MRC shared task. It lasted for two months, during which the participating teams spent 27 days developing the models.
Time | Phase |
October 1st | Trial Data |
October 5th | Public test |
October 25th | Private test |
October 27th | Competition end |
November 15th | Submission deadline |
December 15th | Notification of acceptance |
December 28th | Camera-ready due |
Besides, Table 4 gives an overview of the participants who joined the competition. To access the submission system, each team had to nominate a delegate and register with the organizers. Only team delegates could submit results to the system (as shown on the leaderboard).
Metric | Value |
#Registration Teams | 77 |
#Joined Teams | 42 |
#Signed Data Agreements | 42 |
#Paper Submissions | 6 |
Public Test | Private Test | Overall | |
Total Entries | 551 | 39 | 590 |
Highest F1 | 84.24 | 77.24 | 84.24 |
Highest EM | 77.99 | 67.43 | 77.99 |
Mean F1 | 70.70 | 60.96 | 66.37 |
Mean EM | 61.13 | 50.47 | 56.39 |
Std. F1 | 12.34 | 23.38 | 18.52 |
Std. EM | 12.57 | 20.82 | 17.38 |
Finally, Table 5 shows statistics of the participants' results in terms of F1 and EM scores. Overall, the highest EM score does not exceed 80 percent, while the highest F1 score is just over 84 percent. Both the highest F1 and EM scores come from the public test, and the results on the private test set are lower. Notably, the standard deviations of the F1 and EM scores on the private test set are significantly higher than on the public test set, meaning that results vary more widely across participating teams on the private test.
4 Dataset Construction
We propose a new dataset named UIT-ViQuAD 2.0 for this task, the latest version of the Vietnamese Question Answering Dataset. It includes the questions from the first version of UIT-ViQuAD nguyen-etal-2020-vietnamese and nearly 13,000 newly human-generated questions, both unanswerable (see Section 4.1) and answerable (see Section 4.2).
Instead of generating unanswerable questions from scratch like SQuAD 2.0 rajpurkar-etal-2018-know, we transform answerable questions into unanswerable questions. We randomly sample one-half of answerable questions in the original dataset and ask our annotators to transform these questions into unanswerable ones, which are impossible to answer given the information of the passage. The answers for answerable questions are then used as the plausible answers for unanswerable questions. This ensures that the unanswerable questions are similar to answerable ones, and the quality of plausible answers for unanswerable questions is high enough for further research into the behavior of Question Answering models.
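Assuming UIT-ViQuAD 2.0 keeps the SQuAD 2.0 JSON layout it is modeled on (an assumption on our part; the exact field names are illustrative), a transformed question pair might look like the following sketch, reusing the example from Table 1.

```python
# Hypothetical UIT-ViQuAD 2.0 entries in a SQuAD 2.0-style layout; the field
# names ("is_impossible", "plausible_answers") are assumptions based on the
# SQuAD 2.0 format, not a published specification of this dataset.
answerable_example = {
    "question": "Ngôn ngữ lập trình thế hệ đầu tiên là ngôn ngữ gì?",
    "is_impossible": False,
    "answers": [{"text": "Mã máy nhị phân", "answer_start": 0}],
}

unanswerable_example = {
    "question": "Ngôn ngữ lập trình hợp ngữ đầu tiên là ngôn ngữ gì?",
    "is_impossible": True,      # no gold span exists in the passage
    "answers": [],              # empty for unanswerable questions
    "plausible_answers": [      # the original answer kept as a plausible answer
        {"text": "Mã máy nhị phân", "answer_start": 0}
    ],
}
```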
4.1 Generating Unanswerable Questions
To generate unanswerable questions, we do a strict process of two phases: (1) unanswerable question creation and (2) unanswerable question validation.
4.1.1 Unanswerable Question Creation
We hire 13 high-quality annotators for the process of generating unanswerable questions, most of whom have experience in annotating other Vietnamese Natural Language Processing datasets. The hired annotators are carefully trained over 6 phases in 10 days, with 30 questions per phase. The first 2 phases mainly focus on getting the annotators familiar with the task. In the next 4 phases, annotators are asked to create questions covering a diverse range of unanswerable categories. We achieve this by having all 13 annotators transform the same set of questions: whenever more than two annotators transform an answerable question into an unanswerable one in the same way, those annotators are asked to transform the question again. As a result, our dataset contains many categories of unanswerable questions, such as Antonym, Overstatement, Understatement, Entity Swap, Normal Word Swap, Adverbial Clause Swap, and Modifiers Swap, which poses new challenges to Vietnamese Machine Reading Comprehension researchers. Table 6 presents the categories of unanswerable questions in UIT-ViQuAD 2.0.
Besides the newly generated unanswerable questions, we include all answerable questions from the previous version of our dataset. This gives us a dataset with roughly one unanswerable question per two answerable questions. Table 7 summarizes the dataset's overall statistics.
Reasoning | Description | Example
Antonym | Antonym used |
Overstatement | |
Understatement | |
Entity Swap | Entity replaced by other entity |
Normal Word Swap | |
Adverbial Clause Swap | |
Modifiers Swap | |
4.1.2 Unanswerable Question Validation
Before publishing the dataset for the evaluation campaign, we carefully validated the newly generated unanswerable questions, following a procedure inspired by nguyen-etal-2020-vietnamese. To help annotators gradually improve at generating unanswerable questions, after every 3,000 generated unanswerable questions we asked the annotators to self-validate the questions they had generated and to write short documents reflecting on their errors. This effort minimizes the possibility that annotators repeat the same errors too many times.
To further reduce the error rate in our unanswerable questions, we ran a separate cross-validation phase after finishing the creation of 12,000 unanswerable questions. For this phase, we hired the ten annotators who had each generated over 1,000 unanswerable questions during the generation phase. This helped filter out annotators with little experience in annotating unanswerable questions and reduced noise during validation. Our team then investigated and confirmed every error detected by the annotators. To maximize the probability of detecting errors in the newly generated unanswerable questions, we gave annotators incentives to check the dataset carefully, additionally rewarding them for each error they correctly detected.
4.2 Additional Difficult Answerable Questions
In addition to the answerable questions from UIT-ViQuAD 1.0, we also hire five annotators who have experience in Vietnamese natural language processing research and clearly understand the different reasoning skills sugawara-etal-2017-evaluation that are important for evaluating the comprehension ability of models. They annotate more challenging answerable questions that require more reasoning ability to answer correctly, and are encouraged to spend at least 3 minutes per question. In generating this set of questions, our purpose is to propose more challenges to researchers in the VLSP 2021 Evaluation Campaign and to encourage further analysis of the effects of unanswerable questions in future work.
4.3 Overview Statistics of UIT-ViQuAD 2.0
Train | Public Test | Private Test | All | |
Number of articles | 138 | 19 | 19 | 176 |
Number of passages | 4,101 | 557 | 515 | 5,173 |
Number of total questions | 28,457 | 3,821 | 3,712 | 35,990 |
Number of unanswerable questions | 9,217 | 1,168 | 1,116 | 11,501 |
Average passage length | 179.0 | 167.6 | 177.3 | 177.6 |
Average answerable question length | 14.6 | 14.3 | 14.7 | 14.6 |
Average unanswerable question length | 14.7 | 14.0 | 14.5 | 14.6 |
The general statistics of the dataset are given in Table 7. UIT-ViQuAD 2.0 comprises 35,990 question-answer-passage triples, of which 11,501 questions are unanswerable (9,217 in the training set). The organizers provide training, public test, and private test sets to the participating teams. For the public and private test sets, we only provide the passages and their questions, without answers, to the teams.
5 Systems and Results
5.1 Baseline System
Following devlin2019bert, we adopt transfer learning based on BERT (Bidirectional Encoder Representations from Transformers) for our baseline system. To adapt to our dataset, we slightly modify the run_squad.py script (https://github.com/google-research/bert/blob/master/run_squad.py) while keeping the majority of the original code. mBERT is pre-trained on 104 languages, including Vietnamese. In addition, we use the Transformers library by Hugging Face (https://huggingface.co/) to fine-tune mBERT on our question-answering dataset, tuning the parameters to suit our dataset in both the training and evaluation processes. For the baseline system, we used an initial learning rate of 3e-5 with a batch size of 32 and trained for two epochs. The max_seq_length and doc_stride are set to 384 and 128, respectively.
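A minimal sketch of this baseline configuration, assuming the standard Hugging Face question-answering fine-tuning recipe, is shown below; the output directory and the prepare_features preprocessing step are placeholders, not the organizers' exact code.

```python
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments, Trainer)

model_name = "bert-base-multilingual-cased"   # mBERT, pre-trained on 104 languages
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Hyperparameters reported for the baseline system.
args = TrainingArguments(
    output_dir="mbert-uit-viquad2-baseline",   # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=2,
)

# max_seq_length=384 and doc_stride=128 are applied when each passage-question
# pair is tokenized into (possibly overlapping) windows, as in the standard
# SQuAD preprocessing; prepare_features below is a placeholder for that step.
# train_dataset = raw_train_dataset.map(prepare_features, batched=True)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
#                   tokenizer=tokenizer)
# trainer.train()
```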
5.2 Shared Task Submissions
The AIHUB platform (https://aihub.vn) was used to manage all submissions. We received entries from 24 teams for the public test and from 19 teams for the private test. The systems using the pre-trained language model XLM-R achieved the best results. Six of these teams submitted system description papers; each is briefly described below.
5.2.1 The vc-tus team
To address unanswerable questions, vlspmrc1 present a novel Vietnamese MRC technique based on the Retrospective Reader zhang2021retrospective. Furthermore, they concentrate on increasing answer-extraction ability by effectively using attention mechanisms and boosting representation ability through semantic information. They also provide an ensemble method that obtains considerable improvements over single-model results. Their strategy helped them achieve first place in the Vietnamese MRC shared task.
5.2.2 The ebisu_uit team
vlspmrc4 suggest a novel method for Vietnamese machine reading comprehension. To tackle the MRC task in Vietnamese, they apply BLANC (BLock AttentioN for Context prediction) seonwoo2020context on top of pre-trained language models. With this strategy, their model produced good results, achieving an F1-score of 77.22% on the private test of the MRC task at the VLSP 2021 shared task and placing second overall.
5.2.3 The F-NLP team
To learn the correlation between a start position and an end position in pure-MRC output prediction, vlspmrc3 present two types of joint models for answerability prediction and pure-MRC prediction, with and without a dependence mechanism. In addition, they employ ensemble models and a verification technique that selects the best answer from among the top K answers provided by various models.
5.2.4 The UIT-MegaPikachu team
vlspmrc5 propose a new system that employs simple yet highly effective methods. The system uses a pre-trained language model (PrLM), XLM-RoBERTa (XLM-R) conneau2019unsupervised, combined with filtering results from multiple outputs to produce the final result. The system generated about 5-7 output files and selected the answer with the most repetitions as the final predicted answer.
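The repetition-based filtering can be pictured as a simple majority vote over the answers produced by the different runs; the sketch below is our own illustration of that idea (including the tie-breaking rule), not the team's code.

```python
from collections import Counter

def vote_answer(candidate_answers: list[str]) -> str:
    """Pick the answer that appears most often across the output files.

    Illustrative re-implementation of repetition-based filtering; the
    tie-breaking behaviour (first most-common answer wins) is an assumption.
    """
    counts = Counter(answer.strip() for answer in candidate_answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

# Predictions for one question collected from five output files.
print(vote_answer(["Mã máy nhị phân", "Mã máy nhị phân", "mã hợp ngữ",
                   "Mã máy nhị phân", ""]))  # -> "Mã máy nhị phân"
```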
5.2.5 The UITSunWind team
vlspmrc2 describe a new approach to solving this task at the VLSP 2021 shared task on Vietnamese Machine Reading Comprehension. They propose a model called MRC4MRC, which is a combination of two MRC components. MRC4MRC, based on the XLM-RoBERTa pre-trained language model, achieves an F1-score of 79.13% and an EM (Exact Match) of 69.72% on the public test set. Although this model ranks in the top 5, its EM performance on answerable questions is the highest on the private test. Their experiments also show that the XLM-R language model is better than the powerful PhoBERT language model.
5.2.6 The HN-BERT team
vlspmrc6 present an unsupervised context selector that reduces the length of a given context while retaining the answers within the relevant contexts. They also apply numerous training strategies on the VLSP2021-MRC shared task dataset, including unanswerable question sample selection and several adversarial training approaches, which improve performance by 2.5% in EM score and 1% in F1 score.
Public Test | | | Private Test | |
Team | F1 | EM | Team | F1 | EM
Human | 87.335 | 81.818 | Human | 82.849 | 75.500 |
NLP_HUST | 84.236 | 77.728 | vc-tus | 77.241 | 66.137 |
NTQ | 84.089 | 77.990 | ebisu_uit | 77.222 | 67.430 |
ebisu_uit | 82.622 | 73.698 | F-NLP | 76.456 | 64.655 |
vc-tus | 81.013 | 71.316 | UIT-MegaPikachu | 76.386 | 65.329 |
F-NLP | 80.578 | 70.662 | SDSOM | 75.981 | 63.012 |
SDSOM | 79.594 | 69.092 | UITSunWind | 75.587 | 64.871 |
UITSunWind | 79.130 | 69.720 | Big Heroes | 74.241 | 61.126 |
UIT-MegaPikachu | 78.637 | 68.804 | 914-clover | 73.027 | 61.853 |
914-Clover | 78.515 | 69.013 | NTQ | 72.863 | 60.938 |
Big Heroes | 78.491 | 68.150 | Hey VinMart | 70.352 | 57.786 |
PhoKho-UIT | 75.894 | 65.533 | PhoKho-UIT | 70.198 | 58.378 |
HN-BERT | 75.842 | 63.544 | HN-BERT | 70.100 | 56.466 |
Hey VinMart | 75.759 | 64.590 | Deep-NLP | 69.220 | 59.429 |
Deep-NLP | 74.767 | 66.789 | ABC | 63.625 | 55.280 |
ABC | 69.287 | 57.864 | BASELINE | 60.338 | 49.353 |
ct-nlp | 68.971 | 58.859 | |||
tpp | 68.484 | 57.786 | |||
S-NLP | 67.589 | 65.140 | |||
BASELINE | 63.031 | 53.546 |
5.3 Human Performance
To estimate human performance on this task, we employ a team to answer a set of 100 samples from the public test set and 100 samples from the private test set. There are four annotators; two of them work on each set, performing the same task.
In each instance, we have a passage and a question. The annotator must answer the question using the information in the passage; if there is no answer, the question is unanswerable, and the annotator marks "true" in the "is unanswerable" field. Following the answering phase, we compute human performance using F1 and exact match scores for both the public and private tests.
To calculate human performance, we use the method given in SQuAD 2.0 rajpurkar-etal-2018-know. We have four responses per question in the ground truth, so we choose the final ground truth by majority voting and prefer the shortest answer as the final ground truth, as explained in SQuAD 2.0. After obtaining the gold answer, we compute the F1 and EM scores between the human answers and the gold answers for the two annotators who answered the public test set. Then, by averaging the results of the two annotators, we compute the final F1 and EM scores of human performance on the public test. The computation is carried out on the private test in the same manner. As a result, the final F1 and EM scores of human performance are 87.34% and 81.82% on the public test set, and 82.85% and 75.50% on the private test set, respectively.
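The gold-answer selection step can be summarized by the short sketch below: majority voting over the four ground-truth responses, with the shortest answer preferred on ties, as described above. It is a sketch of the described procedure, not the organizers' evaluation script.

```python
from collections import Counter

def choose_gold_answer(responses: list[str]) -> str:
    """Majority vote over annotator responses; ties go to the shortest answer."""
    counts = Counter(r.strip() for r in responses)
    best_count = max(counts.values())
    tied = [answer for answer, count in counts.items() if count == best_count]
    return min(tied, key=len)   # prefer the shortest answer

# Four ground-truth responses for one question ("" marks "unanswerable").
print(choose_gold_answer(["Mã máy nhị phân", "Mã máy nhị phân (Binary)",
                          "Mã máy nhị phân", ""]))  # -> "Mã máy nhị phân"
```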
5.4 Experimental Results
According to our statistics, a total of 25 teams registered to participate in the Vietnamese Machine Reading Comprehension task of the VLSP 2021 shared task; these teams come from prestigious universities, companies, and organizations. Out of the 25 teams that participated in the development phase on the public test, we selected the 18 teams that outperformed the baseline to further evaluate their systems on the private test.
The results of the teams in the two rounds are aggregated and shown in Table 8, where teams are ranked by F1-score in both rounds. In the public test round, our mBERT baseline model achieved 63.03% F1 and 53.55% EM. Eighteen teams outperformed the baseline in terms of F1. Overall, 14 teams achieved F1 scores above 70% and 5 teams above 80%. The top three teams in the public test were NLP_HUST, NTQ, and ebisu_uit, with F1 scores of 84.24%, 84.09%, and 82.63%, respectively. The results of the top two teams are very close, differing by less than 0.2%. Additionally, although the NTQ team's model scored slightly lower in F1 than NLP_HUST, it achieved the highest EM performance of 77.99%.
In the private test round, the baseline model achieved an F1 score of 60.34% and an EM score of 49.35%. Out of the 18 teams that passed the public test, 14 continued to evaluate their systems on the private test set. There were many unexpected changes in the teams' results, especially among the top three. While only placing 4th with 81.01% F1 on the public test set, team vc-tus took 1st position in the private test round with an F1 score of 77.24%. Meanwhile, the ebisu_uit team maintained a stable performance from the public test round to the private test round, keeping 2nd place with an F1 score of 77.22%. Once again, the results of the 1st and 2nd place teams are very close. Furthermore, ebisu_uit obtained the highest EM score, 67.43%. The F-NLP team shows a similar trend to vc-tus: ranked 5th in the public test round, their system finished 3rd with an F1 score of 76.46%. In general, all teams struggled with the private test set, as its difficulty increased significantly; consequently, the teams' results dropped considerably compared to the public test round.
6 Result Analysis
To gain deeper insight into machine reading comprehension and question answering in Vietnamese, we conduct an analysis of the results of the five strongest models in the VLSP2021-MRC shared task.
[Figure 1: Submission progress of the top five teams on the public test set (F1 and EM scores).]
[Figure 2: Final submission results of the participating teams on the private test set (F1 and EM scores).]
Answerable | Unanswerable | Overall | |||||
Teams | Models | EM | F1 | EM | F1 | EM | F1 |
vc-tus | Retrospective Reader + XLM-R (Ensemble) | 57.67 | 73.54 | 85.84 | 85.84 | 66.14 | 77.24 |
ebisu_uit | BLANC + XLM-R/SemBERT (Ensemble) | 56.59 | 70.59 | 92.65 | 92.65 | 67.43 | 77.22 |
F-NLP | XLM-R (Ensemble) | 58.78 | 75.66 | 78.32 | 78.32 | 64.66 | 76.46 |
UIT-MegaPikachu | XLM-R (Single) | 58.82 | 74.63 | 80.47 | 80.47 | 65.33 | 76.39 |
UITSunWind | XLM-R + BiLSTM (Ensemble) | 58.94 | 74.26 | 78.67 | 78.67 | 64.87 | 75.59 |
HN-BERT | PhoBERT_Large+R3F+CS (Single) | 47.50 | 66.99 | 77.33 | 77.33 | 56.47 | 70.10 |
Baseline | mBERT (Single) | 41.72 | 57.43 | 67.11 | 67.11 | 49.35 | 60.34 |
6.1 Competition Progress Analysis
Figure 1 illustrates the submission progress of the top 5 teams on the public test from October 5, 2021, to October 24, 2021. In this phase, we allowed 10 submissions per day. As shown in Figure 1, the submission results in both F1 and EM scores are not stable and oscillate throughout the submission period. Moreover, the EM scores never exceed 80%, indicating the challenge the dataset poses to the participants.
In addition, Figure 2 illustrates the final submission results of the participating teams. The private test started on October 25, 2021, and ended on October 27, 2021. Within the 3 days of submission, the F1 scores did not change much, and both the F1 and EM scores achieved by participants did not exceed 80% in this phase. Notably, in the final results, the ebisu_uit team has a lower F1 score than the vc-tus team but a higher EM score. As shown in the chart, the vc-tus team achieved the best result by F1-score, and the ebisu_uit team achieved the best result by EM score, placing them 1st and 2nd in the competition, respectively.
6.2 Answerable vs. Unanswerable Analysis
To better understand the ability of MRC systems to answer questions, we analyze human performance and the experimental results of the baseline model and the participating teams vlspmrc1; vlspmrc2; vlspmrc3; vlspmrc4; vlspmrc5; vlspmrc6. Table 9 shows the final results on answerable and unanswerable questions of the private test set, evaluated with EM and F1 scores. As seen from the table, performance on unanswerable questions is always higher than on answerable questions. The ebisu_uit team achieved the best performance on unanswerable questions with over 92% F1. However, the F-NLP and UITSunWind teams achieved the highest scores on answerable questions, with 75.66% F1 and 58.94% EM, respectively. Interestingly, the vc-tus team did not obtain the best performance on either unanswerable or answerable questions, but it achieved the best overall F1-score because it balanced the performance between the two types of questions better than the other teams.
6.3 Challenging Question Examples
We select several typical examples of answerable and unanswerable questions that are difficult for the models proposed by the participating teams vlspmrc1; vlspmrc2; vlspmrc3; vlspmrc4; vlspmrc5; vlspmrc6. Table 10 presents several such examples, together with explanations of why the models failed to predict the correct answers. We will explore more complex questions inspired by the works sugawara2018makes; sugawara2020assessing.
Example | Explanation
7 Conclusion and Future Work
The VLSP2021-MRC Shared Task on Machine Reading Comprehension for Vietnamese was organized at VLSP 2021. Although 77 teams signed up to obtain the training datasets, only 24 teams submitted their results. Because several teams enrolled in multiple shared tasks at VLSP 2021, the other teams may not have had enough time to explore MRC models. This shared task provides valuable resources for developing Vietnamese machine reading comprehension, question answering, and other AI applications based on MRC and QA models.
To improve the performance of machine reading comprehension systems, we intend to increase the quantity and quality of annotated questions in the future. In addition, we will create difficult questions based on findings proposed in the research works sugawara2017evaluation; sugawara2018makes; sugawara2021benchmarking. UIT-ViQuAD 2.0 can also be used to evaluate various other NLP tasks: question answering systems that use retriever-reader techniques chen2017reading; bertqas, question generation du2017learning, and information retrieval karpukhin2020dense. Finally, UIT-ViQuAD 2.0 will be made available for evaluating MRC and QA models, including the training set, the development set (public test set), and the test set (private test set).
Acknowledgments
The authors would like to thank the aihub.vn team (https://aihub.vn/) and the annotators for their hard work in supporting the shared task. The VLSP Workshop was supported by the following organizations: VINIF, Aimsoft, Zalo, Bee, and INT2, and universities: VNU-HCM University of Information Technology, VNU University of Science, and VNU University of Engineering and Technology. Kiet Van Nguyen was funded by Vingroup JSC and supported by the Master, PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Institute of Big Data, code VINIF.2021.TS.026.