An Accurate Model for Predicting the (Graded) Effect of Context in Word Similarity Based on Bert

05/03/2020 ∙ by Wei Bao, et al. ∙ 0

Natural Language Processing (NLP) has been widely used in semantic analysis in recent years. This paper discusses a methodology for analyzing the effect that context has on human perception of similar words, which is the third task of SemEval 2020. We apply several methods for calculating the distance between two embedding vectors generated by Bidirectional Encoder Representations from Transformers (BERT). Our team, willgo, won first place in the Finnish track of subtask 1 and second place in the English track of subtask 1.




1 Introduction

Computing the difference in meaning between words at the semantic level is a widely discussed task. In areas of natural language processing (NLP) such as information retrieval (IR), many applications rely on similarity, including text summarization [10], text categorization [9], and text Q&A [13].

Task 3 of SemEval-2020 focuses on the influence of context when humans perceive similar words. Polysemous words have different meanings in entirely different contexts, which current translation systems can recognize very well. However, many translation systems cannot accurately predict the subtle variation in the meaning of a word that is also caused by a different context.

Task 3 has two subtasks. In subtask 1, we are required to predict the extent to which the similarity score that human annotators give to two words changes between different contexts. In subtask 2, we predict the absolute similarity scores themselves rather than their difference. This paper discusses only subtask 1.

Our team uses different algorithms to calculate the distance between two embedding vectors generated by BERT [4] and defines this distance as the similarity. The change in similarity is then obtained by subtraction between the two distances. However, this approach alone did not perform well in the task evaluation, so we improved it by blending different BERT models, as described in Section 3.2.

2 Related Work

There are many methods and models for estimating the similarity between long paragraphs. Most of them treat it as a binary classification problem. Hatzivassiloglou et al. [6] compute a linguistic feature vector that includes primitive and composite features, then build criteria on these feature vectors to classify paragraphs. As for similarity between short sentences, Foltz et al. [5] suggest a method that characterizes the extent of semantic similarity between two adjacent short sentences by comparing their high-dimensional semantic vectors, which is a Latent Semantic Analysis (LSA) model. Both LSA and Hyperspace Analogue to Language (HAL) [2] are corpus-based models. HAL uses lexical co-occurrence to generate a set of high-dimensional semantic vectors, in which each word is represented as a vector or high-dimensional point so that similarities can be measured by computing distances.

Although computing similarity between words is less difficult than between texts, some subtle problems remain. Similarity between words lies not only in morphology but, more significantly, in semantic meaning. A common first step in estimating the similarity between words is Word2Vec [11], a group of corpus-based models for generating word embeddings that mainly uses two architectures: continuous bag-of-words (CBOW) and continuous Skip-gram. In the CBOW model, the distributed representations of the context are the input and the model predicts the center word, whereas the Skip-gram model takes the center word as input and predicts the context, making one prediction per context word. As a result, the Skip-gram model learns efficiently from context and performs better than the CBOW model, but it takes much more training time. Hierarchical softmax and negative sampling [12] were later proposed to optimize the Skip-gram model when training on large-scale data.
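The difference between the two architectures can be illustrated by the training examples each one draws from the same sentence. The following is a minimal sketch, not the Word2Vec implementation: the function names, the toy sentence, and the window size are assumptions chosen for illustration.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding the center word."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cbow_examples(tokens, window=2):
    """CBOW: one example per position, (context words -> center word)."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: one example per (center word -> single context word) pair."""
    examples = []
    for i, center in enumerate(tokens):
        for context_word in context_window(tokens, i, window):
            examples.append((center, context_word))
    return examples


# Hypothetical toy sentence, for illustration only.
sentence = "the cat sat on the mat".split()
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)
```

In this toy setting CBOW yields one example per token, while Skip-gram yields one example per (center, context) pair, which is consistent with the longer training time noted above.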

Word2Vec cannot be used for computing similarity between polysemous words because it generates only one vector per word. Embeddings from Language Models (ELMo) [15], inspired by semi-supervised sequence tagging [14], can handle this issue. ELMo consists of a bidirectional LSTM [8], which gives it an understanding of both the next and the previous words, and it obtains contextualized word embeddings as a weighted sum over the outputs of the hidden layers. Compared with the LSTM used in ELMo, Bidirectional Encoder Representations from Transformers (BERT) [4] is a stack of Transformer encoders [17], which can be computed in parallel and saves much training time. There are two BERT versions of different sizes: BERT Base has 12 encoder layers with 768 hidden units and 12 attention heads, and BERT Large has 24 encoder layers with 1024 hidden units and 16 attention heads. BERT Large achieved state-of-the-art results according to that paper.

3 System Overview

3.1 Data

The source of our test data is the CoSimLex dataset [1], which is based on the well-known SimLex-999 dataset [7] and provides pairs of words.

In Task 3, the English dataset consists of 340 pairs, and the Finnish, Croatian, and Slovenian datasets consist of 24, 112, and 111 pairs respectively. Table 1 lists these counts.

Language     Abbr.   Count
English      En      340
Croatian     HR      112
Finnish      FI      24
Slovenian    SL      111
Table 1: Test Dataset in Task3 from CoSimLex dataset

Each language data file has eight columns, namely word1, word2, context1, context2, word1context1, word2context1, word1context2, and word2context2. These denote, respectively, the first word, the second word, the first context, the second context, the first word as it appears in the first context, the second word as it appears in the first context, the first word as it appears in the second context, and the second word as it appears in the second context. In addition, the form of word1 or word2 as it appears in a context may differ lexically from the form given in the word1 and word2 columns.

3.2 Methodology

The BERT model architecture is based on a multilayer bidirectional Transformer, as shown in Figure 1. Instead of the traditional left-to-right language modeling objective, BERT is trained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other. BERT achieves state-of-the-art performance on many tasks, and we also use it in our strategy. We approach this task as one of calculating the similarity between two words. In our model, the context data is first fed into BERT, as shown in Figure 2.

Figure 1: Bidirectional Transformer architectures of BERT
Figure 2: BERT schematic diagram

Once our model has obtained the embedding of each token, it calculates the distance with several algorithms, and we then predict the graded effect of context on word similarity in the following steps:

  • Step 1: Take the two embeddings corresponding to word1context1 and word2context1 and compute the distance between them with one algorithm, denoted $d_1$.

  • Step 2: Replace the words in Step 1 with word1context2 and word2context2 and repeat the computation to obtain $d_2$.

  • Step 3: By subtraction between $d_1$ and $d_2$, we get the change in similarity, denoted $\Delta$.

  • Step 4: Change the distance computing algorithm and repeat Step 1 to Step 3.

  • Step 5: After Step 1 to Step 4, we obtain a vector of changes $(\Delta_1, \dots, \Delta_n)$, where $n$ denotes the number of distance algorithms used in our model, from which we get the final change by blending.


Figure 3 shows a flow chart of the process from Step 1 to Step 4.

Figure 3: Part Flow Chart of our model
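The five steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the text does not give the sign convention of the subtraction or the blending formula, so the code assumes the change is the context 2 value minus the context 1 value, negates the Euclidean distance so that larger values mean "more similar" for both measures, and blends with an unweighted mean. The function names and the toy vectors standing in for BERT embeddings are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_euclidean(u, v):
    """Negated Euclidean distance (assumption: negated so larger = more similar)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


MEASURES = (cosine_similarity, negative_euclidean)


def similarity_change(w1_c1, w2_c1, w1_c2, w2_c2, measures=MEASURES):
    """Return the per-measure changes and their blend for one word pair."""
    changes = []
    for measure in measures:            # Step 4: loop over the algorithms
        s1 = measure(w1_c1, w2_c1)      # Step 1: both words in context 1
        s2 = measure(w1_c2, w2_c2)      # Step 2: both words in context 2
        changes.append(s2 - s1)         # Step 3: assumed sign (context 2 minus context 1)
    blended = sum(changes) / len(changes)  # Step 5: assumed unweighted mean
    return changes, blended


# Toy 3-dimensional vectors standing in for BERT token embeddings.
word1_ctx1, word2_ctx1 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
word1_ctx2, word2_ctx2 = [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]
changes, blended = similarity_change(word1_ctx1, word2_ctx1, word1_ctx2, word2_ctx2)
```

In the toy input the two words are orthogonal in the first context and identical in the second, so both measures report an increase in similarity and the blended change is positive.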

3.3 Experiment

We trained one standard BERT Large model and one multilingual BERT Base model with MXNet [3]. The dataset used for the BERT Large model is openwebtext_book_corpus_wiki_en_cased, which is maintained by GluonNLP, and the Multilingual BERT (M-BERT) [16] Base model was trained on the wiki_multilingual_uncased dataset, which is also provided by GluonNLP. Training a BERT model takes a long time, so we recommend using the pre-trained BERT models from bert-embedding, which can be installed with pip or conda (install bert-embedding).

After configuring the models, we follow Section 3.2 with the input from Section 3.1 to obtain the experimental results, which are presented in Section 4. Task 3 has four language tracks: English, Croatian, Finnish, and Slovenian. We use the BERT Large model for the English track and the Multilingual BERT Base model for the other three tracks.

In Section 3.2, we use several algorithms to compute similarity. Here we introduce the two main algorithms used in our experiments.

  • Cosine similarity, which calculates the cosine of the angle between two vectors.

  • Euclidean distance, which calculates the square root of the sum of squared differences over all dimensions.
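For reference, the standard definitions of the two measures are given below. The symbols are notation introduced here for illustration: $\mathbf{u}$ and $\mathbf{v}$ are the two embedding vectors and $m$ is their dimension.

```latex
\[
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i=1}^{m} u_i v_i}
         {\sqrt{\sum_{i=1}^{m} u_i^2} \, \sqrt{\sum_{i=1}^{m} v_i^2}},
\qquad
d_{\mathrm{euc}}(\mathbf{u}, \mathbf{v})
  = \sqrt{\sum_{i=1}^{m} (u_i - v_i)^2}
\]
```

Note that the two measures are oriented in opposite directions: a larger cosine similarity means the vectors are closer, whereas a larger Euclidean distance means they are farther apart.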

4 Results

In our experiment on subtask 1, the English track uses the BERT Large model: the score is 0.718 with Euclidean distance, 0.752 with cosine distance, and 0.768 with the blend, which ranks second on the online leaderboard (LB). The Croatian, Finnish, and Slovenian tracks all use the Multilingual BERT model. The Croatian track scores 0.590 with Euclidean distance, 0.587 with cosine distance, and 0.594 with the blend, ranking sixth. The Finnish track scores 0.750 with Euclidean distance, 0.671 with cosine distance, and 0.772 with the blend, ranking first. The Slovenian track scores 0.576 with Euclidean distance, 0.603 with cosine distance, and 0.583 with the blend, ranking seventh. The results are summarized in Table 2.

Language & Abbr.   Model         Euclidean Dis.   Cosine   Blend   Rank
English, En        BERT Large    0.718            0.752    0.768   2
Croatian, HR       M-BERT Base   0.590            0.587    0.594   6
Finnish, FI        M-BERT Base   0.750            0.671    0.772   1
Slovenian, SL      M-BERT Base   0.576            0.603    0.583   7
Table 2: Experiment Results

5 Conclusion

In this paper, we propose a model that computes similarity and similarity change by blending cosine similarity and Euclidean distance, both calculated from two word embedding vectors. We first transform the words in the dataset introduced in Section 3.1 into word embedding vectors with BERT, as discussed in Section 3.2, then calculate the distance between the two vectors, and finally blend the distances computed by the different algorithms into the final prediction. In subtask 1 of Task 3, our team, willgo, won first place in the Finnish track and second place in the English track.


  • [1] C. S. Armendariz, M. Purver, M. Ulčar, S. Pollak, N. Ljubešić, M. Robnik-Šikonja, M. Granroth-Wilding, and K. Vaik (2019) CoSimLex: a resource for evaluating graded word similarity in context. External Links: 1912.05320 Cited by: §3.1.
  • [2] C. Burgess, K. Livesay, and K. Lund (1998) Explorations in context space: words, sentences, discourse. Discourse Processes 25 (2-3), pp. 211–257. Cited by: §2.
  • [3] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. External Links: 1512.01274 Cited by: §3.3.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §1, §2.
  • [5] P. W. Foltz, W. Kintsch, and T. K. Landauer (1998) The measurement of textual coherence with latent semantic analysis. Discourse processes 25 (2-3), pp. 285–307. Cited by: §2.
  • [6] V. Hatzivassiloglou, J. L. Klavans, and E. Eskin (1999) Detecting text similarity over short passages: exploring linguistic feature combinations via machine learning. In 1999 Joint SIGDAT conference on empirical methods in natural language processing and very large corpora, Cited by: §2.
  • [7] F. Hill, R. Reichart, and A. Korhonen (2014) SimLex-999: evaluating semantic models with (genuine) similarity estimation. External Links: 1408.3456 Cited by: §3.1.
  • [8] S. Hochreiter and J. Schmidhuber (1997-11) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §2.
  • [9] Y. Ko, J. Park, and J. Seo (2004-01) Improving text categorization using the importance of sentences. Inf. Process. Manage. 40 (1), pp. 65–79. External Links: ISSN 0306-4573, Link, Document Cited by: §1.
  • [10] C. Lin and E. Hovy (2003) Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150–157. External Links: Link Cited by: §1.
  • [11] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. External Links: 1301.3781 Cited by: §2.
  • [12] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. External Links: 1310.4546 Cited by: §2.
  • [13] M. Mohler and R. Mihalcea (2009-01) Text-to-text semantic similarity for automatic short answer grading. pp. 567–575. External Links: Document Cited by: §1.
  • [14] M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power (2017) Semi-supervised sequence tagging with bidirectional language models. External Links: 1705.00108 Cited by: §2.
  • [15] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. External Links: 1802.05365 Cited by: §2.
  • [16] T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual bert?. External Links: 1906.01502 Cited by: §3.3.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. External Links: 1706.03762 Cited by: §2.