
Semantic Relatedness for All (Languages): A Comparative Analysis of Multilingual Semantic Relatedness Using Machine Translation

This paper provides a comparative analysis of the performance of four state-of-the-art distributional semantic models (DSMs) over 11 languages, contrasting the native language-specific models with the use of machine translation over English-based DSMs. The experimental results show that there is a significant improvement (average of 16.7% for the Spearman correlation) using state-of-the-art machine translation approaches. The results also show that the benefit of using the most informative corpus outweighs the possible errors introduced by the machine translation. For all languages, the combination of machine translation over the Word2Vec English distributional model provided the best results consistently (average Spearman correlation of 0.68).



1 Introduction

Distributional Semantic Models (DSMs) are becoming established as fundamental components for supporting automatic semantic interpretation in different application scenarios in natural language processing. From question answering systems to semantic search and text entailment, distributional semantic models support a scalable approach for representing the meaning of words, which can automatically capture comprehensive associative commonsense information by analysing word-context patterns in large-scale corpora in an unsupervised or semi-supervised fashion[thesisAndre, turney, linse].

However, distributional semantic models are strongly dependent on the size and the quality of the reference corpora, which embeds the commonsense knowledge necessary to build comprehensive models. While high-quality texts containing large-scale commonsense information are present in English, such as Wikipedia, other languages may lack sufficient textual support to build distributional models.

To address this problem, this paper investigates how different distributional semantic models built from corpora in different languages and with different sizes perform in semantic similarity and relatedness tasks. Additionally, we analyse the role of machine translation approaches in supporting the construction of better distributional vectors and in computing semantic similarity and relatedness measures for other languages. In other words, when there is not enough information to create a DSM for a particular language, this work evaluates whether the benefit of the corpus volume available for English outweighs the error introduced by machine translation.

Given a pair of words and a human judgement score that represents the semantic relatedness of these two words, the evaluation method aims at indicating how close distributional models score to humans. Three widely used word-pairs datasets are employed in this work: Miller & Charles (MC)[miller1991contextual], Rubenstein & Goodenough (RG)[rubenstein1965contextual] and WordSimilarity 353 (WS-353)[finkelstein2001placing].
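The evaluation described above can be sketched in a few lines. This is a minimal illustration, not the code used in the experiments: the scores in `human` and `model` are hypothetical values for five word pairs, and the helper names are introduced here for illustration only. Spearman correlation is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human judgements and model scores for five word pairs.
human = [3.9, 3.1, 1.2, 0.4, 2.5]
model = [0.81, 0.52, 0.33, 0.10, 0.60]
rho = spearman(human, model)
print(f"Spearman correlation: {rho:.2f}")
```

Only the ordering of the scores matters here, so the human scale and the model scale do not need to be comparable.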

In the proposed model, the word-pairs datasets are translated into English as a reference language and the distributional vectors are defined over the target end model (Figure 1). Although the proposed machine-translation-based method is simple, it is highly relevant for the distributional semantics user/practitioner because it is easy to use and significantly improves the results.

Figure 1: Depiction of the experimental setup.
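The MT-mediated setup can be illustrated with a toy sketch. Everything concrete below is an assumption made for illustration: `TOY_TRANSLATIONS` is a hypothetical lookup table standing in for a machine translation service, `ENGLISH_VECTORS` holds made-up three-dimensional vectors in place of a real English DSM, and cosine similarity is assumed as the relatedness measure.

```python
from math import sqrt

# Hypothetical stand-in for an MT service: a tiny German-to-English lookup.
TOY_TRANSLATIONS = {"auto": "car", "wagen": "automobile", "wald": "forest"}

# Hypothetical 3-dimensional English vectors; real models are far larger.
ENGLISH_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "forest": [0.0, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translate(word):
    """Translate a source-language word into English via the toy lookup."""
    return TOY_TRANSLATIONS[word.lower()]


def mt_relatedness(word_a, word_b):
    """Translate both words, then score the pair in the English model."""
    en_a, en_b = translate(word_a), translate(word_b)
    return cosine(ENGLISH_VECTORS[en_a], ENGLISH_VECTORS[en_b])


related = mt_relatedness("Auto", "Wagen")
unrelated = mt_relatedness("Auto", "Wald")
print(f"Auto/Wagen: {related:.2f}, Auto/Wald: {unrelated:.2f}")
```

A translation error at the first step propagates directly into the score, which is the trade-off against corpus size that the experiments measure.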

This work presents a systematic study involving 11 languages and four distributional semantic models (DSMs), providing a comparative quantitative analysis of the performance of the distributional models and the impact of machine translation approaches for different models.

In summary, this paper answers the following research questions:

  1. Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?

  2. Which DSMs and languages benefit more and less from the translation?

  3. What is the quality of state-of-the-art machine translation approaches for word pairs (for each language)?

Moreover, this paper contributes two resources which can be used by the community to evaluate multi-lingual semantic similarity and relatedness models: (i) a high quality manual translation of the three word-pairs datasets - Miller & Charles (MC)[miller1991contextual], Rubenstein & Goodenough (RG)[rubenstein1965contextual] and WordSimilarity 353 (WS-353)[finkelstein2001placing] - for 10 languages and (ii) the 44 pre-computed distributional models (four distributional models for each one of the 11 languages) which can be accessed as a service (footnote 1: The service is available at), together with the multi-lingual approaches mediated by machine translation.

This paper is organised as follows: Section 2 describes the related work; Section 3 describes the experimental setting; Section 4 analyses the results and provides the comparative analysis across models and languages. Finally, Section 5 provides the conclusion.

2 Related Work

Most of the related work has concentrated on leveraging joint multilingual information to improve the performance of the models.

Faruqui & Dyer[faruqui-dyer:2014:EACL] use the distributional invariance across languages and propose a technique based on canonical correlation analysis (CCA) for merging multilingual evidence into vectors generated monolingually. They evaluate the resulting word representations on semantic similarity/relatedness evaluation tasks, showing the improvement of multi-lingual over the monolingual scenario.

Utt & Pado[utt-pado:2014:tacl] develop methods that take advantage of the availability of annotated corpora in English, using a translation-based approach to transport the word-link-word co-occurrences to support the creation of syntax-based DSMs.

Navigli & Ponzetto[navigli2012babelrelate] propose an approach to compute semantic relatedness exploiting the joint contribution of different languages mediated by lexical and semantic knowledge bases. The proposed model uses a graph-based approach of joint multi-lingual disambiguated senses which outperforms the monolingual scenario and achieves competitive results for both resource-rich and resource-poor languages.

Zou et al.[zou2013bilingual] describe an unsupervised semantic embedding (bilingual embedding) for words across two languages that represents not only the semantic information of monolingual words but also semantic relationships across different languages. Their work was motivated by the fact that it is hard to identify semantic similarities across languages, especially when co-occurring words are rare in the parallel training text. Al-Rfou et al.[al2013polyglot] produced multilingual word embeddings for about 100 languages using Wikipedia as the reference corpus.

In contrast, this work provides a comparative analysis of existing state-of-the-art distributional semantic models for different languages and analyses the impact of machine translation over an English DSM.

3 Experimental Setup

The experimental setup consists of the instantiation of four distributional semantic models (Explicit Semantic Analysis (ESA)[gabrilovich2007computing], Latent Semantic Analysis (LSA)[landauer1998introduction], Word2Vec (W2V)[mikolov2013efficient] and Global Vectors (GloVe)[pennington2014Glove]) in 11 different languages - English, German, French, Italian, Spanish, Portuguese, Dutch, Russian, Swedish, Arabic and Farsi.

The DSMs were generated from Wikipedia dumps (January 2015), which were preprocessed by lowercasing, stemming and removing stopwords. For LSA and ESA, the models were generated using the SSpace Package[sspace], while W2V and GloVe were generated using the code shared by the respective authors. For the experiment, the vector dimensions for LSA, W2V and GloVe were set to 300, while ESA was defined with 1500 dimensions. The difference in size occurs because ESA is composed of sparse vectors. All models were generated with the default parameters defined in each implementation.

Each distributional model was evaluated for the task of computing semantic similarity and relatedness measures using three human-annotated gold standard datasets: Miller & Charles (MC)[miller1991contextual], Rubenstein & Goodenough (RG)[rubenstein1965contextual] and WordSimilarity 353 (WS-353)[finkelstein2001placing]. As these word-pairs datasets were originally in English, the word pairs were translated and reviewed with the help of professional translators skilled in data localisation tasks, except for those languages already available in previous works ([faruqui2014community, camacho2015framework]). The datasets are available at

Two automatic machine translation approaches were evaluated: the Google Translate Service and the Microsoft Bing Translation Service. As Google Translate Service performed 16% better for overall word-pairs translations, this was set as the main machine translation model.

The DInfra platform [barzegar2015dinfra] provided the DSMs used in the work. To support experimental reproducibility, both experimental data and software are available at

4 Evaluation & Results

4.1 Spearman Correlation and Corpus Size

Table 1 shows the correlation between the average Spearman correlation values for each DSM and two indicators of corpus size: # of tokens and # of unique tokens.

ESA is consistently more robust (on average) than the other models in relation to the corpus size, due to the fact that ESA has larger context windows than the other distributional models. While ESA considers the whole document as its context window, the other models are restricted to five (LSA) and ten (Word2Vec and GloVe) words.
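The difference between a fixed-size window and a document-level context can be made concrete with a small sketch. The sentence, the window size of two and the symmetric window are all illustrative assumptions; the models in the experiments use their own default window handling.

```python
def window_contexts(tokens, index, size):
    """Tokens within `size` positions on either side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right


def document_contexts(tokens, index):
    """Every other token in the document (a document-level context)."""
    return tokens[:index] + tokens[index + 1:]


# Hypothetical one-sentence "document".
doc = "the old car drove slowly through the dense forest near the river".split()
target = doc.index("car")

narrow = window_contexts(doc, target, 2)
wide = document_contexts(doc, target)
print(narrow)
print(wide)
```

With the narrow window, "car" never co-occurs with "forest", whereas the document-level context records that association from a single occurrence, which is one way to read the robustness of ESA on smaller corpora.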

Another observation is that the evaluation of the WS-353 dataset is more dependent on the corpus size, which can be explained by the broader number of semantic relations expressed under the semantic relatedness umbrella.

Table 2 shows the size of each corpus in different languages regarding the number of unique tokens and the number of tokens.

Gold standard   MC                       RG                       WS353
Model           unique tokens   tokens   unique tokens   tokens   unique tokens   tokens
ESA             0.39            0.48     0.67            0.73     0.33            0.39
LSA             0.74            0.75     0.82            0.68     0.64            0.66
W2V             0.43            0.58     0.71            0.72     0.57            0.79
GloVe           0.34            0.51     0.51            0.61     0.59            0.63
Table 1: Correlation between corpus size and different models.
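As a rough illustration of how corpus size and model quality can be related, the sketch below computes a rank correlation between the token counts in Table 2 and the per-language averages in Table 4. It does not reproduce Table 1, which is broken down by model and dataset, and the choice of a rank correlation is an assumption made here. Because neither list contains ties, the simplified formula based on squared rank differences applies.

```python
# Token counts from Table 2 and "Lang AVG." from Table 4, in the same order.
LANGS = ["en", "de", "fr", "ru", "it", "nl", "pt", "sv", "es", "ar", "fa"]
TOKENS = [902.044, 312.380, 247.492, 202.163, 178.378, 105.224,
          96.712, 82.376, 76.587, 46.481, 32.557]
LANG_AVG = [0.70, 0.61, 0.47, 0.58, 0.49, 0.54, 0.46, 0.45, 0.53, 0.35, 0.50]


def ranks(values):
    """1-based ascending ranks; valid only when all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


rho = spearman_no_ties(TOKENS, LANG_AVG)
print(f"rank correlation between corpus size and average score: {rho:.2f}")
```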

4.2 Word-pair Machine Translation Quality

The second step evaluates the accuracy of state-of-the-art machine translation approaches for word-pairs (Table 3). The accuracy of the translation for the WS-353 word pairs significantly outperforms the other datasets. This shows that the higher semantic distance between word pairs (semantic relatedness) has the benefit of increasing the contextual information during the machine translation process, subsequently improving the mutual disambiguation process.

lang   unique tokens   tokens
en     4.238           902.044
de     4.233           312.380
fr     1.749           247.492
ru     1.766           202.163
it     1.411           178.378
nl     2.021           105.224
pt     0.873           96.712
sv     1.730           82.376
es     0.829           76.587
ar     1.653           46.481
fa     0.925           32.557
Table 2: The sizes of the corpora in terms of the number of unique tokens and tokens (scale of ).

dataset/lang   de     fr     ru     it     nl     pt     sv     es     ar     fa
MC             0.48   0.47   0.58   0.42   0.57   0.60   0.55   0.60   0.53   0.38
RG             0.45   0.65   0.53   0.41   0.59   0.51   0.58   0.59   0.43   0.36
WS353          0.78   0.85   0.76   0.76   0.85   0.81   0.78   0.79   0.57   0.43
Table 3: Translation accuracy.

DS      Model   en     de     fr     ru     it     nl     pt     sv     es     ar     fa     Model AVG.   DS AVG.
MC      ESA     0.69   0.67   0.54   0.66   0.37   0.54   0.67   0.37   0.58   0.37   0.56   0.53         0.56
        LSA     0.79   0.70   0.55   0.63   0.58   0.55   0.41   0.58   0.66   0.46   0.45   0.56
        W2V     0.84   0.70   0.55   0.64   0.74   0.57   0.37   0.40   0.74   0.38   0.68   0.58
        GloVe   0.69   0.64   0.64   0.76   0.51   0.55   0.62   0.40   0.65   0.38   0.45   0.56
RG      ESA     0.80   0.68   0.45   0.63   0.50   0.58   0.51   0.50   0.59   0.36   0.57   0.54         0.53
        LSA     0.72   0.65   0.30   0.51   0.48   0.52   0.30   0.53   0.35   0.35   0.46   0.45
        W2V     0.85   0.78   0.57   0.64   0.69   0.63   0.42   0.57   0.64   0.36   0.55   0.58
        GloVe   0.74   0.69   0.50   0.70   0.59   0.54   0.52   0.49   0.61   0.32   0.59   0.56
WS353   ESA     0.50   0.39   0.32   0.44   0.34   0.53   0.44   0.43   0.37   0.26   0.37   0.39         0.41
        LSA     0.54   0.45   0.35   0.40   0.33   0.47   0.39   0.40   0.36   0.28   0.43   0.39
        W2V     0.69   0.54   0.50   0.53   0.50   0.58   0.53   0.45   0.53   0.44   0.53   0.51
        GloVe   0.49   0.41   0.34   0.42   0.30   0.46   0.38   0.33   0.32   0.26   0.36   0.36
Lang AVG.       0.70   0.61   0.47   0.58   0.49   0.54   0.46   0.45   0.53   0.35   0.50   0.50
Table 4: Spearman correlation for the language-specific models.

DS      Model      de     fr     ru     it     nl     pt     sv     es     ar     fa     Model AVG.   Diff.
MC      ESA-MT     0.55   0.53   0.42   0.38   0.45   0.38   0.48   0.39   0.31   0.58   0.45         -0.08 (-15.1%)
        LSA-MT     0.61   0.72   0.65   0.67   0.66   0.70   0.74   0.78   0.69   0.75   0.70         0.14 (25.0%)
        W2V-MT     0.68   0.79   0.68   0.77   0.69   0.76   0.81   0.83   0.71   0.74   0.75         0.17 (29.3%)
        GloVe-MT   0.45   0.78   0.67   0.64   0.63   0.56   0.61   0.82   0.69   0.79   0.66         0.10 (17.9%)
RG      ESA-MT     0.62   0.53   0.52   0.61   0.63   0.57   0.56   0.47   0.38   0.71   0.56         0.02 (3.7%)
        LSA-MT     0.63   0.62   0.59   0.74   0.67   0.64   0.67   0.62   0.55   0.70   0.64         0.19 (42.2%)
        W2V-MT     0.69   0.79   0.69   0.78   0.74   0.75   0.71   0.73   0.57   0.79   0.72         0.14 (24.1%)
        GloVe-MT   0.62   0.77   0.71   0.77   0.78   0.66   0.66   0.72   0.65   0.80   0.71         0.15 (26.8%)
WS353   ESA-MT     0.42   0.45   0.41   0.41   0.44   0.43   0.40   0.35   0.42   0.32   0.40         0.01 (2.6%)
        LSA-MT     0.51   0.51   0.47   0.48   0.51   0.39   0.51   0.44   0.37   0.43   0.46         0.07 (17.9%)
        W2V-MT     0.62   0.59   0.57   0.57   0.63   0.51   0.59   0.55   0.50   0.52   0.57         0.06 (11.8%)
        GloVe-MT   0.45   0.48   0.42   0.43   0.46   0.33   0.42   0.41   0.33   0.37   0.41         0.05 (13.9%)
Lang AVG.          0.57   0.63   0.57   0.60   0.61   0.56   0.60   0.59   0.52   0.63   0.56
Table 5: Spearman correlation for the machine translation models over the English corpora. Diff. represents the difference of machine translation score minus the language specific.

DS      M       de      fr      ru      it      nl      pt      sv      es      ar      fa      M. AVG   DS. AVG
MC      ESA     -0.18   -0.03   -0.36   0.03    -0.16   -0.44   0.31    -0.32   -0.16   0.03    -0.13    0.41
        LSA     -0.13   0.31    0.04    0.16    0.20    0.70    0.27    0.17    0.50    0.68    0.29
        W2V     -0.02   0.43    0.07    0.05    0.21    1.04    1.00    0.13    0.88    0.09    0.39
        GloVe   -0.31   0.22    -0.11   0.25    0.14    -0.10   0.51    0.26    0.85    0.75    0.25
RG      ESA     -0.09   0.19    -0.18   0.21    0.08    0.11    0.12    -0.19   0.06    0.25    0.06     0.41
        LSA     -0.03   1.04    0.14    0.52    0.30    1.15    0.26    0.77    0.57    0.52    0.52
        W2V     -0.11   0.39    0.08    0.14    0.18    0.76    0.23    0.14    0.59    0.44    0.28
        GloVe   -0.11   0.55    0.01    0.31    0.43    0.28    0.35    0.17    1.04    0.36    0.34
WS353   ESA     0.08    0.40    -0.07   0.18    -0.18   -0.02   -0.07   -0.07   0.60    -0.13   0.07     0.36
        LSA     0.12    0.43    0.19    0.45    0.09    -0.01   0.27    0.21    0.34    0.01    0.21
        W2V     0.14    0.19    0.09    0.14    0.08    -0.04   0.33    0.04    0.12    0.00    0.11
        GloVe   0.10    0.41    0.00    0.41    0.00    -0.14   0.28    0.30    0.28    0.04    0.17
AVG             0.06    0.52    0.13    0.36    0.23    0.29    0.70    0.22    0.59    0.82
Table 6: Difference between the language-specific and the machine translation approach. M. AVG represents the average of the models and DS. AVG represents the average of the datasets.

For WS-353 the set of best-performing translations has an average accuracy of 80% (with maximum 85% and minimum 76%). This value dropped significantly for Arabic and Farsi (average 50%).

For MC and RG, the average translation accuracy for the semantic similarity pairs is 51.5%. This difference may be a result of a deficit of contextual information during the machine translation process. For these word-pairs datasets, the difference between best translation performers and lower performers (across languages) is smaller. Additionally, the final translation accuracy for all languages and all word-pairs datasets is 59%. French, Dutch and Spanish are the languages with best automatic translations.
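The aggregate figures quoted above can be recomputed from Table 3. The sketch below is a worked check, assuming that "the set of best-performing translations" for WS-353 means every language except Arabic and Farsi; the variable names are introduced here for illustration.

```python
# Translation accuracy per dataset and language, copied from Table 3.
LANGS = ["de", "fr", "ru", "it", "nl", "pt", "sv", "es", "ar", "fa"]
ACCURACY = {
    "MC":    [0.48, 0.47, 0.58, 0.42, 0.57, 0.60, 0.55, 0.60, 0.53, 0.38],
    "RG":    [0.45, 0.65, 0.53, 0.41, 0.59, 0.51, 0.58, 0.59, 0.43, 0.36],
    "WS353": [0.78, 0.85, 0.76, 0.76, 0.85, 0.81, 0.78, 0.79, 0.57, 0.43],
}


def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


ws = dict(zip(LANGS, ACCURACY["WS353"]))

# Assumption: the best-performing set is everything except Arabic and Farsi.
ws_best = mean(v for lang, v in ws.items() if lang not in ("ar", "fa"))
ws_ar_fa = mean([ws["ar"], ws["fa"]])
mc_rg = mean(ACCURACY["MC"] + ACCURACY["RG"])
overall = mean(v for row in ACCURACY.values() for v in row)

print(f"WS-353, best-performing languages: {ws_best:.1%}")
print(f"WS-353, Arabic and Farsi:          {ws_ar_fa:.1%}")
print(f"MC and RG, all languages:          {mc_rg:.1%}")
print(f"All datasets, all languages:       {overall:.1%}")
```

Under this reading the recomputed values agree with the reported 80%, 50% and 59%, and the MC and RG average comes out within rounding of the reported 51.5%.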

4.3 Language-Specific DSMs

In the first part of the experiment, the Spearman correlations (ρ) between the human assessments and the computation of the semantic similarity and relatedness for all DSMs instantiated for all languages were evaluated (Figure 1 (ii)). Table 4 shows the Spearman correlation for each DSM using language-specific corpora (without machine translation), for the three word-pairs datasets.

The comparative language-specific analysis indicates that English is the best-performing language (0.70), followed by German (0.61). The lowest Spearman correlation was observed in Arabic (0.35). From the tested DSMs, W2V is consistently the best-performing DSM (0.56). The language-specific DSMs achieved higher correlations for MC and RG (0.56 and 0.53, respectively), in comparison to 0.41 for WS-353.

The results for the language-specific DSMs were contrasted with the machine translation (MT) approach, according to the diagram depicted in Figure 1 (i). The Spearman correlations for the MT-mediated approach are shown in Table 5.

4.4 Machine Translation based Semantic Relatedness

Using the MT models, W2V is consistently the best performing DSM (average 0.68), while ESA is consistently the worst performing model (0.47). We can interpret this result by stating that using machine translation for ESA does not introduce significant performance improvements in comparison to the language-specific baselines.

The best performing languages are French and Farsi (ρ = 0.63). The Spearman correlation variance across languages in the MT models is low, as the use of the English corpus in the DSM has a stronger positive impact on the results than the variation in the quality of the machine translation. The results for all languages achieve very similar correlation values.

The impact of the MT model can be better interpreted by examining the difference between the machine translation and the language-specific models (depicted in Table 6). LSA accounts for the largest average percent improvement (28.4%) using the MT model, while ESA accounts for the lowest value (-2.9%). As previously noted, this can be explained by the sensitivity of these models to the corpus size due to the dimensional reduction strategy (LSA) or the broader context window (ESA). The remaining models accounted for substantial improvements (W2V = 21.7%, GloVe = 19.5%).

Arabic and French achieved the highest percent gains (47% and 38%, respectively), while German accounts for the worst result (-4%). These numbers are consistent with the corpus size. For German, the result shows that the corpus volume of the German Wikipedia crossed a threshold size (34% of the English corpus) above which improvements for computing semantic similarity for the target word-pairs dataset might be only marginally relevant, while the translation error counts negatively in the final result.

The average improvement for the MT over the language specific model for each word-pairs dataset is consistently significant: MC = 20%, RG = 30% and WS353 = 14%.
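The percentage figures in the Diff. column of Table 5 and the entries of Table 6 are consistent with a relative difference, that is, the MT score minus the language-specific score, divided by the language-specific score. The sketch below checks this reading against the MC model averages from Tables 4 and 5 and against one individual cell; the function name is introduced here for illustration.

```python
def relative_gain(native, translated):
    """Relative change of the MT score with respect to the native score."""
    return (translated - native) / native


# MC model averages: (language-specific, Table 4; MT-mediated, Table 5).
MC_AVERAGES = {
    "ESA": (0.53, 0.45),
    "LSA": (0.56, 0.70),
    "W2V": (0.58, 0.75),
    "GloVe": (0.56, 0.66),
}

gains = {
    model: round(100 * relative_gain(native, translated), 1)
    for model, (native, translated) in MC_AVERAGES.items()
}
print(gains)

# One individual cell: German, ESA, MC (0.67 native versus 0.55 with MT).
german_esa_mc = round(relative_gain(0.67, 0.55), 2)
print(german_esa_mc)
```

The four percentages match the Diff. column for MC in Table 5, and the German ESA value matches the corresponding cell in Table 6.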

4.5 Summary

Below, the interpretation of the results is summarised according to the core research questions which this paper aims to answer:

Question 1: Does machine translation to English perform better than the word vectors in the original language (for which languages and for which distributional semantic models)?

Machine translation to English consistently performs better for all languages, with the exception of German, which presents equivalent results for the language-specific models. The MT approach provides an average improvement of 16.7% over language-specific distributional semantic models.

Question 2: Which DSMs or MT-DSMs work best for the set of analysed languages?

W2V-MT consistently performs as the best model for all word-pairs datasets and languages, except German, in which the difference between W2V-MT and language-specific W2V is not significant.

Question 3: What is the quality of state-of-the-art machine translation approaches for word-pairs?

The average translation accuracy for all languages and all word-pairs datasets is 59%. Translation quality varies according to the nature of the word-pair (better translations are provided for word pairs which are semantically related compared to semantically similar word pairs), reaching a maximum of 85% and a minimum of 36% across different languages.

For the distributional semantics user/practitioner, as a general practice, we recommend using W2V built over an English corpus, supported by machine translation. Additionally, state-of-the-art machine translation approaches are more accurate when translating semantically related word pairs than when translating semantically similar word pairs.

5 Conclusion

This work provides a comparative analysis of the performance of four state-of-the-art distributional semantic models over 11 languages, contrasting the native language-specific models with the use of machine translation over English-based DSMs. The experimental results show that there is a significant improvement (average of 16.7% for the Spearman correlation) by using off-the-shelf machine translation approaches and that the benefit of using a more informative (English) corpus outweighs the possible errors introduced by the machine translation approach. The average accuracy of the machine translation approach is 59%. Moreover, for all languages, W2V showed consistently better results, while ESA proved to be more robust to smaller corpus sizes. For all languages, the combination of machine translation over the W2V English distributional model provided the best results consistently (average Spearman correlation of 0.68).

Future work will focus on the analysis and translation of two other word-pairs datasets: SimLex-999[hill2015simlex999] and MEN-3000[bruni].


This publication has emanated from research supported by the National Council for Scientific and Technological Development, Brazil (CNPq) and by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.