FAT ALBERT: Finding Answers in Large Texts using Semantic Similarity Attention Layer based on BERT

08/22/2020 ∙ by Omar Mossad, et al.

Machine-based text comprehension has always been a significant research field in natural language processing. Once a full understanding of the text context and semantics is achieved, a deep learning model can be trained to solve a large subset of tasks, e.g. text summarization, classification and question answering. In this paper we focus on the question answering problem, specifically multiple-choice questions. We develop a model based on BERT, a state-of-the-art transformer network. Moreover, we improve the ability of BERT to support large text corpora by extracting the most influential sentences through a semantic similarity model. Evaluations of our proposed model demonstrate that it outperforms the leading models in the MovieQA challenge, and we are currently ranked first on the leader board with a test accuracy of 87.79%. Finally, we discuss the limitations of our model and possible improvements to overcome these limitations.




1 Introduction

One of the main challenges in Natural Language Processing (NLP) is the ability of a machine to read and understand an unstructured text and then reason about answering some related questions. Such question answering (QA) models can be applied to a wide range of applications such as financial reports, customer service, and health care.

A significant number of datasets, namely MovieQA Tapaswi et al. (2016), RACE Lai et al. (2017) and SWAG Zellers et al. (2018), were created to provide a ground truth for training and evaluating a specific type of QA model that involves Multiple Choice Questions (MCQs). The MovieQA dataset challenge has attracted a large number of promising solutions, such as Blohm et al. (2018), who implement two attention models based on Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs) and combine both models using ensemble aggregation. However, these models are prone to various systematic adversarial attacks, which differ in linguistic level (word vs. sentence level) and in the knowledge of the adversary (black-box vs. white-box). These models only learn to match patterns to select the right answer rather than performing plausible inferences as humans do.

Recently, Devlin et al. (2018) introduced Bidirectional Encoder Representations from Transformers (BERT), which has since been used as a pre-trained model to tackle a large subset of NLP tasks. BERT's key technical innovation is applying the bidirectional training of the Transformer, a popular attention model, to language modelling. The paper's results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. A plethora of deep learning models have since incorporated BERT into several tasks, continually improving state-of-the-art performance.

A notable limitation of BERT is that it cannot support large texts that exceed the pre-trained model's maximum number of tokens. Therefore, when dealing with large texts, the performance of BERT is severely affected. In our work, we aim to overcome this limitation by analyzing and improving how accurately we predict the answer extracted from a large text in the MovieQA MCQ dataset. Our approach relies on the concept of sentence attention to extract the most significant sentences from a large corpus. We accomplish this using a pre-trained semantic similarity model.

The remainder of this paper is organized as follows: Section 2 describes the dataset, focusing on the semantics and the nature of the questions. Next, in Section 3, we present the proposed model and the intuition behind all the components we used. We evaluate the performance of our model in Section 4. Finally, we conclude our report and suggest future modifications for the model in Section 5.

2 MovieQA Dataset

In this section, we give an overview of the dataset used to evaluate our model. The MovieQA Tapaswi et al. (2016) dataset was created by generating a number of MCQs that can be solved from specific context extracted from the plots of real movies. An existing challenge (MovieQA Challenge: http://movieqa.cs.toronto.edu/) is still ongoing to find the highest accuracy in solving these MCQs on movie topics. The models can be trained to use video scenes, subtitles, scripts or movie plots (extracted from Wikipedia). The leader board for this challenge is divided according to the source of input selected for the model. The dataset consists of almost 15,000 MCQs obtained from over 400 movies and features high semantic diversity. Each question comes with a set of 5 highly plausible answers, only one of which is correct. The dataset structure and semantics for the movie plots are described in Table 1. On average, each movie plot has 35.2 sentences, with 20.3 words per sentence. All training and validation sets are labelled with the correct answer. However, the test dataset is not labelled and can only be evaluated using the challenge's submission server. Because the plot texts are large, we have selected this dataset to demonstrate how we can incorporate BERT in relatively large texts.

                       Train   Val.   Test   Total
# of Movies            269     56     83     408
# of Questions         9848    1958   3138   14944
Avg. Q. # of words     9.3     9.3    9.5    9.3 ± 3.5
Avg. CA. # of words    5.7     5.4    5.4    5.6 ± 4.1
Avg. WA. # of words    5.2     5.0    5.1    5.1 ± 3.9
Table 1: MovieQA dataset description

3 FAT ALBERT Model Description

Existing BERT MCQ implementations currently lack support for large text documents since they are restricted to a sequence length of at most 512 tokens. Furthermore, due to our limited compute capabilities, we are restricted to 130 tokens. According to Devlin et al. (2018), only TPUs with 64GB of memory are able to train models with 512 tokens, as opposed to GPUs with 16GB.

Therefore, we have used another model on top of BERT MCQ to select the top 5 sentences from the text that are highly similar to the question context. Subsequently, instead of processing the entire corpus, BERT MCQ uses the top 5 sentences only.

Figure 1: FAT ALBERT model overview

The maximum span of a certain question (i.e. the part from the text needed to answer this question) is 5 sentences, and all plot alignments are consecutive (i.e. all questions span a specific passage from the text, not the entire text).

Our model consists of the following:

  • Semantic Similarity Classifier pre-trained on STS and Clinical datasets

  • BERT for MCQ trained on MovieQA dataset

A description of the entire model is depicted in Fig. 1. The large movie plot text, along with the concatenated question and answer sequences, is fed to the similarity model to produce the top 5 similar sentences for each question. We feed the questions, answers and the attention-reduced plot text (top 5 similar sentences) to the BERT for MCQ model. The output of this model is an array of probabilities whose size equals the number of possible choices. Finally, we select the choice with the highest probability.
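The flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `similarity` and `mcq_probs` are hypothetical stand-ins for the pre-trained similarity model and the fine-tuned BERT for MCQ model, and the toy word-overlap functions exist only to make the example runnable.

```python
from typing import Callable, List, Sequence


def top_k_sentences(plot: Sequence[str], query: str,
                    similarity: Callable[[str, str], float],
                    k: int = 5) -> List[str]:
    """Return the k plot sentences most similar to the query, best first."""
    ranked = sorted(range(len(plot)),
                    key=lambda i: similarity(plot[i], query),
                    reverse=True)
    return [plot[i] for i in ranked[:k]]


def answer_question(plot, question, choices, similarity, mcq_probs, k=5):
    """Return the index of the choice with the highest probability."""
    # The question is concatenated with the candidate answers to form the query.
    query = question + " " + " ".join(choices)
    context = top_k_sentences(plot, query, similarity, k)
    probs = mcq_probs(context, question, choices)  # one probability per choice
    return max(range(len(choices)), key=lambda i: probs[i])


# Toy stand-ins (assumptions for illustration only).
def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lower-cased word sets of two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


def overlap_mcq(context, question, choices):
    """Score each choice by its word overlap with the selected context."""
    text = " ".join(context)
    scores = [word_overlap(text, c) + 1e-9 for c in choices]
    total = sum(scores)
    return [s / total for s in scores]
```

In a real system, `word_overlap` would be replaced by the semantic similarity network of Section 3.1 and `overlap_mcq` by the model of Section 3.2.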

3.1 Semantic Similarity Network

Figure 2: Semantic Similarity Network model

We use two pre-trained models based on STS and Clinical datasets to measure the similarity between two sentences. A number of differences in semantic similarity exist between our dataset and the aforementioned ones, but the selected model has proven to be effective in our application. A detailed comparison between the performance of these models on the MovieQA dataset is provided in the evaluation section. The model uses BERT to extract the embeddings from both sentences. These embeddings undergo a number of similarity functions, namely cosine similarity, q-gram distance, Levenshtein similarity, etc., and the outputs are sent to a fully connected network followed by a softmax layer to provide the similarity index. We have chosen to use a combination of the STS and Clinical BERT models after normalizing each model's probabilities. Eq. 1 describes how the cosine similarity index between two sentences is calculated. Fig. 2 highlights the contextual structure of the BERT similarity model.
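For reference, the standard definition of cosine similarity between two embedding vectors can be written as below. The notation is an assumption made for this illustration: $\mathbf{u}$ and $\mathbf{v}$ denote the BERT embeddings of the two sentences and $d$ their dimension.

```latex
\[
\mathrm{sim}_{\cos}(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i=1}^{d} u_i v_i}
         {\sqrt{\sum_{i=1}^{d} u_i^{2}} \, \sqrt{\sum_{i=1}^{d} v_i^{2}}}
\]
```

The value lies in $[-1, 1]$, with 1 indicating embeddings that point in the same direction.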


3.2 BERT for MCQ

The BERT MCQ model uses a pre-trained BERT transformer network modified and fine-tuned on the MovieQA dataset. The embedding outputs of BERT are passed to a fully connected layer to produce the predicted probabilities of each possible choice. We used the bert_large_uncased pre-trained model, which has 24 layers, a hidden size of 1024, 16 attention heads and 340M parameters. The fine-tuning is performed on the MovieQA dataset after modifying the BERT outputs to support a variable number of choices. When running BERT for MCQ on the perfectly aligned plot sentences, the model achieved a validation accuracy of 92.34%. Although the model was initially developed to support sentence-completion questions, we modified it to handle MCQs by changing the network structure to output probabilities for each choice instead of complete text tokens.
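The output stage can be illustrated with a small sketch of one common multiple-choice head design, which may differ from the authors' exact network. It assumes each (context, question, choice) sequence has already been encoded into a pooled vector; the weights and the 3-dimensional toy vectors are hypothetical values, not taken from the fine-tuned model.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probabilities(pooled: Sequence[Sequence[float]],
                         weights: Sequence[float],
                         bias: float = 0.0) -> List[float]:
    """Map one pooled embedding per choice to one probability per choice.

    A single shared linear layer yields one logit per choice, so the same
    weights work for any number of choices.
    """
    logits = [sum(w * x for w, x in zip(weights, vec)) + bias for vec in pooled]
    return softmax(logits)


# Toy example: hypothetical 3-dimensional embeddings for 5 choices.
pooled = [[0.1, 0.3, -0.2], [0.9, 0.4, 0.1], [0.0, 0.0, 0.0],
          [-0.5, 0.2, 0.3], [0.2, -0.1, 0.4]]
weights = [1.0, 0.5, -1.0]
probs = choice_probabilities(pooled, weights)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because the softmax is taken across choices rather than across a vocabulary, the head returns an array whose length equals the number of choices.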

4 Evaluation Results

We evaluate the performance of our model on the MovieQA dataset. In this section, we indicate whether the results were obtained from our own implementation or as reported in the reference paper. Some differences appear between our evaluation and the published results, probably due to parameter changes by the authors that were not mirrored in their source code. We also provide a brief case study to highlight two cases from the same movie plot where the model succeeds and fails, respectively.

4.1 Evaluation on MovieQA Dataset

A properly trained BERT for MCQ can reach an accuracy in the range of 90% on the MovieQA dataset once it is aligned with the exact sentences containing the answer. However, without any sentence selection, the accuracy drops to 20% (a random choice) because BERT truncates inputs whose word count exceeds the maximum sequence length (130 words in our case). Therefore, using the semantic similarity model, we captured the top 5 sentences most similar to the question at hand. We selected 5 sentences since the average word count of 5 sentences over the entire dataset is in the range of 110, so truncating a few words from the least similar sentences seems acceptable. The similarity model performance is shown in Table 2. The combined model aggregates the output probabilities of the Web and Clinical models after normalization. The model accuracy could possibly be improved by further training it on the MovieQA dataset.
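One way to combine two similarity models after normalization is sketched below. The text does not specify the normalization or the aggregation rule, so min-max scaling followed by averaging is an assumption made purely for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine(web: Sequence[float], clinical: Sequence[float]) -> List[float]:
    """Average the two models' normalized per-sentence scores (assumed rule)."""
    return [(a + b) / 2 for a, b in zip(min_max(web), min_max(clinical))]


def top_k(scores: Sequence[float], k: int = 5) -> List[int]:
    """Indices of the k highest-scoring sentences, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Normalizing first keeps a model with a wider raw score range from dominating the combined ranking.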

Next, we compare the accuracy of our model against the current MovieQA challenge leader board models. We have included the validation accuracy and the test evaluations we received from the MovieQA authors after submission. Another contribution is an ensemble model that aggregates the results from the top 4 approaches in the challenge and performs a majority vote to select the label with the highest probability. This model uses the CNN and LSTM models previously described along with the BERT model. The main incentive behind this ensemble is to allow different models to correct one another and collaboratively avoid making mistakes. Table 3 shows the accuracy of our models compared to the leader board models.
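A majority vote over several models' outputs can be sketched as below. The tie-breaking rule (highest summed probability across models) is an assumption made for this example; the text does not state how ties are resolved.

```python
from collections import Counter
from typing import Sequence


def majority_vote(prob_rows: Sequence[Sequence[float]]) -> int:
    """Return the winning choice index.

    Each row holds one model's probabilities over the same set of choices.
    """
    # Each model votes for its most probable choice.
    votes = [max(range(len(row)), key=row.__getitem__) for row in prob_rows]
    counts = Counter(votes)
    best = max(counts.values())
    tied = [choice for choice, n in counts.items() if n == best]
    if len(tied) == 1:
        return tied[0]
    # Tie-break (assumption): highest summed probability across all models.
    return max(tied, key=lambda c: sum(row[c] for row in prob_rows))
```

For example, if two models prefer choice 1 and one prefers choice 0, the ensemble returns 1 regardless of how confident the dissenting model is.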

                 Train. Acc.%   Val. Acc.%
Web BERT         88.87          89.03
Clinical BERT    88.12          89.28
Combined         90.27          91.07
Table 2: MovieQA semantic similarity evaluations

                                      Val. Acc.%        Test Acc.%
FAT ALBERT (Ensemble)                 87.48
FAT ALBERT                            87.28             87.79
LSTM Blohm et al. (2018)              80.90 / 83.14*    85.12
CNN Blohm et al. (2018)               79.47 / 79.62*    82.73
Word-Level CNN Blohm et al. (2018)    73.39 / 76.50*

* as included in the paper/leader board (not obtained from our evaluations)

Table 3: MovieQA dataset evaluations

To highlight the effect of the number of tokens in BERT, we showcase in Fig. 3 the training loss for models with different numbers of tokens. The main observation is that, in order to support a larger number of tokens, we had to reduce the batch size used during training. Although the results indicate that larger token counts have lower losses in general, it is clear that reducing the batch size has notably affected the model accuracy. For instance, in Fig. 3(a), when using a maximum of 80 tokens, the model accuracy increases significantly compared to a maximum of 50 tokens, where the model truncates any input exceeding that length. Due to limited compute capability, we were not able to run the model with more tokens at this batch size, as the GPUs could not allocate enough memory. Therefore, we evaluated the loss for larger numbers of tokens using a smaller batch size in Fig. 3(b). The loss gradually decreases as the number of tokens increases until it stabilizes once the inputs generally become shorter than the maximum number of tokens. In that case, the input sentences are padded with zeros to reach the required number of tokens.
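The truncation and zero-padding behaviour discussed above can be sketched as follows. This is a simplified illustration with a hypothetical helper name; real BERT tokenization also reserves positions for special tokens such as [CLS] and [SEP], which are ignored here.

```python
from typing import List, Tuple


def pad_or_truncate(token_ids: List[int], max_len: int,
                    pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Fit a token sequence to exactly max_len positions.

    Returns the fitted ids and an attention mask (1 = real token, 0 = padding).
    """
    ids = token_ids[:max_len]            # drop tokens beyond the limit
    mask = [1] * len(ids)
    padding = max_len - len(ids)         # zero when the input was truncated
    return ids + [pad_id] * padding, mask + [0] * padding
```

With a small limit, most inputs lose trailing tokens; with a large limit, most inputs are padded and additional capacity no longer changes what the model sees.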

Figure 3: BERT MCQ training loss on MovieQA dataset. (a) Batch size = 8. (b) Batch size = 4.

4.2 Case Study

Figure 4: Forrest Gump (1994): Plot

We demonstrate two cases extracted from the movie Forrest Gump (1994). The movie plot has 59 sentences, and a brief extract is shown in Fig. 4. Two sample questions are displayed. In Fig. 5, the model selects the top 5 sentences most similar to the question, and in that case the answer can be fully inferred from one of these sentences (highlighted in bold). Hence, after passing these sentences instead of the entire plot to BERT MCQ, it successfully selects the correct choice. On the other hand, the similarity model was not able to select the best sentences in the second example, as shown in Fig. 6. Despite finding one of the relevant sentences for this specific question, the model missed the most informative one. Therefore, the QA model subsequently failed to select the correct answer.

Figure 5: Correct model prediction example
Figure 6: Wrong model prediction example

5 Conclusions and Future Work

In this paper, we have created an attention model based on semantic similarity to overcome the input-length limitation of BERT. In order to solve an MCQ, we begin by extracting the most relevant sentences from a large text, thereby reducing the complexity of answering the question. At the time of writing this report, our latest submission is ranked first in the MovieQA Challenge.

As future work, we plan to extend our model to process other input signals provided by the MovieQA dataset, such as subtitles. We could build more powerful models by incorporating other human-like processing mechanisms such as referential relations, entailment, and answer by elimination. Finally, migrating the code to run on TPUs with higher computational power instead of GPUs may allow us to handle larger texts and avoid sentence truncation.


References

  • M. Blohm, G. Jagfeld, E. Sood, X. Yu, and N. T. Vu (2018) Comparing attention-based convolutional and recurrent neural networks: success and limitations in machine reading comprehension. In Proceedings of the 22nd Conference on Computational Natural Language Learning.
  • J. Devlin, M. W. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017) RACE: large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
  • M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler (2016) MovieQA: understanding stories in movies through question-answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.