Measuring the difference in meaning between words at the semantic level is a widely discussed task. In areas of natural language processing (NLP) such as information retrieval (IR), many applications rely on similarity, including text summarization, text categorization, and text question answering.
Task 3 of SemEval-2020 (https://competitions.codalab.org/competitions/20905) focuses on the influence of context on how humans perceive the similarity of words. It is well known that polysemous words take on different meanings in entirely different contexts, which current translation systems recognize well. However, many translation systems cannot accurately predict the subtler variations in word meaning that are also caused by a change of context.
Task 3 has two subtasks. In subtask 1, we are required to predict the degree of change in the similarity score that human annotators assign to two words across two different contexts. In subtask 2, we predict the absolute similarity score in each context rather than the difference in scores. In this paper we discuss only subtask 1.
Our team uses different algorithms to calculate the distance between two embedding vectors generated by BERT and defines this distance as the similarity. The change in similarity is then obtained by subtracting one distance from the other. However, this method alone did not perform well in the task evaluation, so we improved it by blending different BERT models, as introduced in Section 3.2.
2 Related Work
There are many methods and models for estimating the similarity between long paragraphs. Most of them treat it as a binary classification problem. Hatzivassiloglou et al. compute a linguistic feature vector that includes primitive and composite features, and then build criteria on these feature vectors to classify paragraphs. For similarity between short sentences, Foltz et al. propose a method that characterizes the degree of semantic similarity between two adjacent short sentences by comparing their high-dimensional semantic vectors; it is a Latent Semantic Analysis (LSA) model. LSA and Hyperspace Analogue to Language (HAL) are both corpus-based models. HAL uses lexical co-occurrence to generate a set of high-dimensional semantic vectors, in which each word is represented as a vector, or high-dimensional point, so that the similarity between words can be measured by computing the distance between them.
Although computing the similarity between words is less difficult than computing it between texts, some subtle problems remain. Similarity between words lies not only in morphology but, more significantly, in semantic meaning. A first step in estimating the similarity between words is Word2Vec, a group of corpus-based models for generating word embeddings that mainly uses two architectures: continuous bag-of-words (CBOW) and continuous Skip-gram. In the CBOW model, the distributed representations of the context words are the input, and the model predicts the center word. The Skip-gram model instead takes the center word as input and predicts its context, making one prediction for each surrounding context word. The Skip-gram model can therefore learn efficiently from context and performs better than the CBOW model, but it takes considerably more training time. Hierarchical softmax and negative sampling were later proposed to optimize the Skip-gram model when training on large-scale data.
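The difference between the two architectures can be illustrated by the training examples each one derives from the same sentence. The following is a minimal sketch in plain Python, not the Word2Vec implementation itself; the function names and the `window` parameter are illustrative assumptions.

```python
from typing import List, Tuple


def context_window(tokens: List[str], center: int, window: int) -> List[str]:
    """Return the tokens within `window` positions of `center`, excluding the center."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: one example per position, mapping all context words to the center word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: one example per (center word, context word) pair."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]
```

For a five-token sentence with `window=1`, this sketch yields 5 CBOW examples but 8 Skip-gram examples, which is one way to see why Skip-gram training takes longer.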
Word2Vec cannot be used to compute similarity between polysemous words because it generates only one vector per word, whereas Embeddings from Language Models (ELMo), inspired by semi-supervised sequence tagging, can handle this issue. ELMo consists of a bidirectional LSTM, which gives it an understanding of both the following and the preceding words; it obtains contextualized word embeddings by a weighted sum over the outputs of the hidden layers. In contrast to the LSTM used in ELMo, Bidirectional Encoder Representations from Transformers (BERT) is a stack of Transformer encoders, which can be computed in parallel and therefore saves much training time. BERT comes in two sizes: BERT Base, which has 12 encoder layers with 768 hidden units and 12 attention heads, and BERT Large, which has 24 encoder layers with 1024 hidden units and 16 attention heads and achieved state-of-the-art results according to the BERT paper.
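The weighted summation over hidden-layer outputs mentioned for ELMo can be sketched as follows. This is a simplified illustration in plain Python for a single token, assuming softmax-normalized layer weights and a scalar scale `gamma` as we understand the ELMo formulation; the function names and toy vectors are hypothetical.

```python
import math
from typing import List


def softmax(weights: List[float]) -> List[float]:
    """Normalize raw layer weights so that they are positive and sum to one."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_outputs: List[List[float]],
               raw_weights: List[float],
               gamma: float = 1.0) -> List[float]:
    """Combine the per-layer vectors of one token into a single embedding."""
    if len(layer_outputs) != len(raw_weights):
        raise ValueError("one weight is needed per layer")
    s = softmax(raw_weights)
    dim = len(layer_outputs[0])
    return [
        gamma * sum(s[j] * layer_outputs[j][k] for j in range(len(layer_outputs)))
        for k in range(dim)
    ]


# Toy example: three layers, two-dimensional hidden states, equal raw weights.
mixed = scalar_mix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
```

With equal raw weights the result is simply the mean of the layer vectors; in a trained model the weights would be learned for the downstream task.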
3 System Overview
In Task 3, the English dataset consists of 340 pairs, and the Finnish, Croatian, and Slovenian datasets consist of 24, 112, and 111 pairs respectively. Table 1 lists these counts.
|Language||Pairs|
|English||340|
|Finnish||24|
|Croatian||112|
|Slovenian||111|
Each language data file has eight columns: word1, word2, context1, context2, word1context1, word2context1, word1context2, and word2context2. They hold, respectively, the first word, the second word, the first context, the second context, the first word as it appears in the first context, the second word as it appears in the first context, the first word as it appears in the second context, and the second word as it appears in the second context. Note that the forms in the four word-in-context columns may differ lexically from word1 and word2 themselves.
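A record with these eight columns could be read as sketched below. This is only an illustration: the tab-separated layout with a header row is an assumption, the column names are copied as they appear in this text, and the sample row is invented for demonstration rather than taken from the dataset.

```python
import csv
import io
from typing import Dict, List, TextIO

COLUMNS = [
    "word1", "word2", "context1", "context2",
    "word1context1", "word2context1", "word1context2", "word2context2",
]

# Hypothetical sample row (not from the real data): the in-context forms
# "banks" and "rivers" differ lexically from the base words.
SAMPLE = "\t".join(COLUMNS) + "\n" + "\t".join([
    "bank", "river",
    "They walked along the banks of the river .",
    "Many rivers flood their banks in spring .",
    "banks", "river", "banks", "rivers",
]) + "\n"


def read_pairs(handle: TextIO) -> List[Dict[str, str]]:
    """Read word-pair records and check that all eight columns are present."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        missing = [c for c in COLUMNS if row.get(c) is None]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        rows.append(row)
    return rows


pairs = read_pairs(io.StringIO(SAMPLE))
```

For a real file, the `io.StringIO` wrapper would be replaced by an opened file handle.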
The BERT model architecture is based on a multilayer bidirectional Transformer, as shown in Figure 1. Instead of the traditional left-to-right language modeling objective, BERT is trained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other. BERT achieves state-of-the-art performance on many tasks, and we also use it in our strategy. We approach this task as a word-similarity problem, that is, computing the similarity between two words. In our model, the context data is first fed into BERT, as shown in Figure 2.
As soon as the model obtains the embedding of each token, it computes the distance between the target words with several algorithms. We then predict the graded effect of context on word similarity in the following steps:
Step 1: Select the embeddings of word1context1 and word2context1 and compute the distance between them with one distance algorithm; denote it $d_1$.
Step 2: Replace the words in Step 1 with word1context2 and word2context2 and repeat the computation to obtain $d_2$.
Step 3: Subtract one distance from the other to obtain the change in similarity, $\Delta$.
Step 4: Switch to another distance algorithm and repeat Steps 1–3.
Step 5: After Steps 1–4, we obtain a vector of changes, $(\Delta_1, \ldots, \Delta_n)$, where $n$ denotes the number of distance algorithms used in our model, and from it we derive the final change.
Figure 3 shows the process from Step 1 to Step 4 as a flow chart.
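The steps above can be sketched in plain Python as follows. This is a hedged illustration rather than our exact implementation: the direction of the subtraction (second context minus first), the use of the negated Euclidean distance so that both measures grow with similarity, and the plain average used as the blend are all assumptions made for this example, and the embedding vectors are toy values instead of BERT outputs.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]
Measure = Callable[[Vector, Vector], float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def negative_euclidean(u: Vector, v: Vector) -> float:
    """Negated Euclidean distance, so that larger values mean more similar (assumption)."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


MEASURES: Dict[str, Measure] = {
    "cosine": cosine_similarity,
    "euclidean": negative_euclidean,
}


def similarity_changes(w1c1: Vector, w2c1: Vector,
                       w1c2: Vector, w2c2: Vector) -> Dict[str, float]:
    """Steps 1-4: one change value per measure."""
    changes = {}
    for name, measure in MEASURES.items():
        first = measure(w1c1, w2c1)    # Step 1: the word pair in context 1
        second = measure(w1c2, w2c2)   # Step 2: the word pair in context 2
        changes[name] = second - first  # Step 3: assumed sign convention
    return changes


def blend(changes: Dict[str, float]) -> float:
    """Step 5: combine the per-measure changes; a plain average is assumed here."""
    return sum(changes.values()) / len(changes)


# Toy vectors: orthogonal in context 1, identical in context 2.
changes = similarity_changes([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0])
final_change = blend(changes)
```

Because the two measures live on different scales, a weighted or normalized combination may be preferable to the plain average shown here.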
We trained one standard BERT Large model and one multilingual BERT Base model with MXNet. The BERT Large model was trained on the openwebtextbookcorpuswikiencased dataset, which is maintained by GluonNLP (https://gluon-nlp.mxnet.io/modelzoo/bert/index.html), and the Multilingual BERT (M-BERT) Base model was trained on the wikimultilingualuncased dataset, also provided by GluonNLP. Training a BERT model takes a long time, so we recommend using the pre-trained BERT models from bert-embedding, which can be installed simply with pip or conda install bert-embedding.
After configuring the models, we follow the procedure in Section 3.2 with the input described in Section 3.1 and obtain the experimental results, which are presented in Section 4. Task 3 has four language tracks: English, Croatian, Finnish, and Slovenian. We use the BERT Large model for the English track and the Multilingual BERT Base model for the other three tracks.
In our experiments on subtask 1, the English track uses the BERT Large model: the score is 0.718 with the Euclidean distance, 0.752 with the cosine distance, and 0.768 with the blend, which ranks second on the online leaderboard (LB). The Croatian, Finnish, and Slovenian tracks all use the Multilingual BERT model. For Croatian, the scores are 0.590 (Euclidean), 0.587 (cosine), and 0.594 (blend), ranking sixth. For Finnish, the scores are 0.750 (Euclidean), 0.671 (cosine), and 0.772 (blend), ranking first. For Slovenian, the scores are 0.576 (Euclidean), 0.603 (cosine), and 0.583 (blend), ranking seventh. Table 2 summarizes these results.
|Language, Abbr.||Model||Euclidean Dis.||Cosine Dis.||Blend||Rank|
|English, EN||BERT Large||0.718||0.752||0.768||2|
|Croatian, HR||M-BERT Base||0.590||0.587||0.594||6|
|Finnish, FI||M-BERT Base||0.750||0.671||0.772||1|
|Slovenian, SL||M-BERT Base||0.576||0.603||0.583||7|
In this paper, we propose a model that computes similarity and the change in similarity by blending cosine similarity and Euclidean distance, both calculated from two word embedding vectors. We first transform the words in the dataset introduced in Section 3.1 into word embedding vectors with BERT, as discussed in Section 3.2; we then calculate the distance between the two vectors; finally, we blend the distances computed by the different algorithms to obtain the final prediction. In subtask 1 of Task 3, our team, willgo, ranked first in the Finnish track and second in the English track.
- (2019) CoSimLex: a resource for evaluating graded word similarity in context.
- (1998) Explorations in context space: words, sentences, discourse. Discourse Processes 25 (2-3), pp. 211–257.
- MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- (1998) The measurement of textual coherence with latent semantic analysis. Discourse Processes 25 (2-3), pp. 285–307.
- (1999) Detecting text similarity over short passages: exploring linguistic feature combinations via machine learning. In 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
- (2014) SimLex-999: evaluating semantic models with (genuine) similarity estimation.
- (1997-11) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- (2004-01) Improving text categorization using the importance of sentences. Inf. Process. Manage. 40 (1), pp. 65–79.
- Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 150–157.
- (2013) Efficient estimation of word representations in vector space.
- (2013) Distributed representations of words and phrases and their compositionality.
- (2009-01) Text-to-text semantic similarity for automatic short answer grading. pp. 567–575.
- (2017) Semi-supervised sequence tagging with bidirectional language models.
- (2018) Deep contextualized word representations.
- (2019) How multilingual is multilingual BERT?
- (2017) Attention is all you need.