BCWS: Bilingual Contextual Word Similarity

by Ta-Chung Chi, et al.

This paper introduces the first dataset for evaluating English-Chinese Bilingual Contextual Word Similarity, namely BCWS. The dataset consists of 2,091 English-Chinese word pairs with the corresponding sentential contexts and their similarity scores annotated by humans. Our annotated dataset has higher consistency compared to other similar datasets. We establish several baselines for the bilingual embedding task to benchmark the experiments. Modeling cross-lingual sense representations as provided in this dataset has the potential of moving artificial intelligence from monolingual understanding towards multilingual understanding.



1 Introduction

Distributed word representations have made a huge impact in the field of NLP by capturing semantics in low-dimensional vectors, namely, word embeddings Mikolov et al. (2013b). However, a word is usually represented by a single vector, ignoring the polysemy phenomenon in language. To deal with this problem, Reisinger and Mooney (2010) first proposed multi-prototype embeddings of a word and motivated a new research direction for sense embedding learning.

Following this pioneering work, many studies have sought to improve the quality of both word and sense embeddings. Several word-level similarity datasets were collected for intrinsically evaluating embedding performance, such as WS-353 Finkelstein et al. (2001), MEN Bruni et al. (2012), RW Luong et al. (2013), and MC-30 Faruqui et al. (2016). However, few datasets are available for sense-level evaluation. The first one is the Stanford contextual word similarity (SCWS) dataset proposed by Huang et al. (2012). Although this dataset alleviated the polysemy issue, it is a purely English dataset, and its inter-annotator consistency is only about 0.52 in terms of Spearman’s rank correlation, which upper-bounds the performance that models can achieve. Another is the recently proposed Word in Context (WiC) dataset Pilehvar and Camacho-Collados (2018), which frames sense disambiguation as a binary classification task and has a reasonable inter-rater agreement rate, but it is also a purely English dataset.

Recently, several works have focused on learning cross-lingual embeddings in one space Adams (2017). A set of well-learned cross-lingual word embeddings can directly benefit several downstream tasks, such as unsupervised machine translation Lample et al. (2017); Artetxe et al. (2017). In addition, Camacho-Collados et al. (2017) proposed the cross-lingual semantic similarity dataset in SemEval-2017, which measures the semantic similarity of word pairs within and across five languages: English, Farsi, German, Italian and Spanish. Although this dataset has high inter-annotator agreement (consistently around 0.9), it cannot evaluate sense similarity due to the lack of word contexts. Therefore, the semantic similarity evaluation on this dataset may not be precise enough.

Moreover, none of the cross-lingual datasets considers multi-sense issues, where a word in one language may have multiple translations in another language according to its different meanings. Because learning word-level embeddings is inadequate, the concept of sense embeddings should also be extended to cross-lingual embeddings. To address the above drawbacks of the prior datasets, we introduce a large and high-quality bilingual contextual word similarity (BCWS) dataset, which includes 2,091 English-Chinese word pairs with their sentential contexts and human-labeled similarity scores for evaluating cross-lingual sense embeddings. This is the first and only bilingual word similarity dataset with sentential contexts for evaluating cross-lingual sense similarity. Note that our collected dataset can also be used as a cross-lingual word similarity dataset, although it is designed for evaluating multi-sense embeddings.

2 Dataset Construction

Figure 1: Illustration of the workflow.

To establish the bilingual contextual word similarity (BCWS) dataset, we collect the data by a five-step procedure as illustrated in Figure 1.

2.1 Chinese Multi-Sense Word Extraction

First, we extract the 10,000 most frequent Chinese words from the Chinese Wikipedia dump. Considering the common parts of speech (PoS), we then select the words that are nouns, adjectives, or verbs based on Chinese Wordnet Huang et al. (2010). In order to test sense-level representations, we remove words with only a single sense to ensure that the selected words are polysemous. Words with more than 20 senses are also removed, since those senses are too fine-grained and hard even for humans to disambiguate. We refer to the result as the Chinese word list.
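The filtering rules of this step can be sketched as follows. This is a minimal illustration, assuming each candidate word is already available as a hypothetical `(word, pos, n_senses)` tuple; the function name, the PoS labels, and the toy entries (including their sense counts) are made up for the example and are not taken from Chinese Wordnet or from the authors' code.

```python
# Assumed PoS label names; the real resource may use different tags.
ALLOWED_POS = {"noun", "adjective", "verb"}


def select_polysemous(entries, min_senses=2, max_senses=20):
    """Keep words with an allowed PoS and between 2 and 20 senses (inclusive)."""
    return [
        word
        for word, pos, n_senses in entries
        if pos in ALLOWED_POS and min_senses <= n_senses <= max_senses
    ]


# Hypothetical toy entries: (word, PoS, number of senses).
toy_entries = [
    ("制服", "noun", 2),        # kept: polysemous noun
    ("重要", "adjective", 1),   # dropped: only one sense
    ("打", "verb", 25),         # dropped: more than 20 senses
    ("的", "particle", 3),      # dropped: PoS not selected
]
selected = select_polysemous(toy_entries)
```

Words with exactly 20 senses are kept here, since the text only removes words with more than 20.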

| English Sentence | Chinese Sentence | Score |
| --- | --- | --- |
| Judges must give both sides an equal opportunity to state their cases. | 我非常喜歡這個故事,它告訴我們一些重要的啟示。 (I like this story a lot, which tells us some important inspiration.) | 7.00 |
| It was of negligible importance prior to 1990, with antiquated weapons and few members. | 黃斑部病變的預防及早期治療是相當重要的。 (The prevention and early treatment of macular lesions is very important.) | 6.94 |
| Due to the San Andreas Fault bisecting the hill, one side has cold water, the other has hot. | 水果攤老闆似乎很意外真有人買這貨,露出「你真內行」的眼神與我聊了幾句。 (The owner of the fruit stall seemed surprised that someone bought this unpopular product, talking me few words about “you are such a pro”.) | 3.70 |
Table 1: Sentence pair examples and average annotated scores in BCWS.

2.2 English Candidate Word Extraction

Second, the goal is to find an English counterpart for each word in the Chinese word list. We utilize BabelNet Navigli and Ponzetto (2010), a free and open-source knowledge resource, as our bilingual dictionary. Specifically, we first query the selected Chinese word using the free API provided by BabelNet to retrieve all WordNet senses (BabelNet contains sense definitions from various resources such as WordNet, Wiktionary, Wikidata, etc.). For example, the Chinese word “制服” has two major meanings:

  • uniform: a type of clothing worn by members of an organization

  • subjugate: force to submit or subdue

Hence, we can obtain two candidate English words, “uniform” and “subjugate”. Each word in the Chinese word list retrieves its associated English candidate words in this way, forming a bilingual dictionary.

2.3 Enriching Semantic Relationship

Note that this dictionary is merely a simple translation mapping between Chinese and English words. It is desirable to have more complicated and interesting relationships between bilingual word pairs. Hence, for each English word in the dictionary, we find its hyponyms, hypernyms, holonyms and attributes, and add these additional words to the dictionary. In our example, we may obtain {制服: [uniform, subjugate, livery, clothing, repress, dominate, enslave, dragoon…]}. To form the final bilingual pairs, we sample 3 English words if the number of English candidate words is more than 10, 2 English words if it is more than 5, and 1 English word otherwise. For example, a bilingual word pair (制服, enslave) can be formed accordingly. After this step, we obtain 2,091 bilingual word pairs.
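The sampling rule of this step can be sketched as below. This is only an illustration under stated assumptions: the enriched dictionary is represented as a plain Python dict, sampling is done without replacement (the text does not say), and the function names are hypothetical rather than the authors' implementation.

```python
import random


def num_to_sample(n_candidates):
    """Number of English words to draw, following the thresholds in the text."""
    if n_candidates > 10:
        return 3
    if n_candidates > 5:
        return 2
    return 1


def sample_pairs(dictionary, rng):
    """Form (Chinese word, English word) pairs from an enriched dictionary."""
    pairs = []
    for zh_word, candidates in dictionary.items():
        # Guard against an empty candidate list; assumption: no replacement.
        k = min(num_to_sample(len(candidates)), len(candidates))
        for en_word in rng.sample(candidates, k):
            pairs.append((zh_word, en_word))
    return pairs


# The example entry from the text (8 listed candidates, so 2 are sampled).
toy_dictionary = {
    "制服": ["uniform", "subjugate", "livery", "clothing",
             "repress", "dominate", "enslave", "dragoon"],
}
toy_pairs = sample_pairs(toy_dictionary, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampled pairs reproducible across runs.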

2.4 Adding Contextual Information

Given the bilingual word pairs, appropriate contexts should be found in order to form full sentences for human judgment. For each Chinese word, we randomly sample one example sentence in Chinese Wordnet that matches the PoS tag selected in Section 2.1. For each English word, we find all sentences containing the target word in the English Wikipedia dump. We then sample one sentence in which the target word is tagged with the matching PoS tag (we use the NLTK PoS tagger to obtain the tags).

2.5 Human Labeling

In order to associate a similarity score with each collected bilingual word pair and its contexts, we recruit 11 human annotators to annotate the semantic scores. To ensure the annotators’ proficiency, all recruited annotators are native Chinese speakers who scored at least 29 in the TOEFL reading section or 157 in the GRE verbal section. All pairs are scored by all 11 annotators in a random order. To ensure consistency of labeling, the annotators are strongly encouraged to consult a designated dictionary, the English Oxford dictionary (https://www.oxforddictionaries), chosen for its plentiful example sentences. Note that they are asked not to rely solely on dictionary definitions but to consider the contextual information given in the questions.

The annotators are asked to determine the sense similarity of the two target words based on their contexts in the sentences. Each question is given a score between 0.0 and 10.0 depending on how semantically related the two words are.

  • 0.0 indicates that the semantic meanings of the two target words are entirely different.

  • 10.0 indicates that the semantic meanings of two target words are entirely the same.

If a particular question is difficult to answer, for example because missing words prevent the annotators from understanding the meaning, they can mark it with 0.0. To ensure a consistent grading standard, the annotators are asked to finish all questions within 3 days, and we also retest some previously answered questions to make sure they receive similar scores.

3 Data Analysis

Our collected BCWS dataset includes 2,091 questions, each of which contains exactly one Chinese sentence and one English sentence. Moreover, each sentence contains exactly one marked target word, as shown in Table 1. After labeling is finished, the inter-annotator consistency is calculated. Specifically, we leave one annotator out and calculate Spearman’s rank correlation between the scores from the annotator who is left out and the average of the remaining annotators. The average of these correlations can be viewed as the human performance, the upper bound for the embedding models. The average agreement of BCWS is 0.83, while the agreement of the previous similar dataset SCWS Huang et al. (2012) is about 0.52. The distribution of the correlation scores for the two datasets is shown in Figure 2. Our BCWS dataset has much higher consistency among annotators than SCWS, demonstrating its better quality for evaluating sense embeddings.
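The leave-one-out agreement described above can be sketched in plain Python. This is a minimal sketch, assuming the annotations are stored as a list of per-annotator score lists; the helper names and the toy scores are hypothetical, and Spearman’s correlation is computed here as the Pearson correlation of average ranks (the usual treatment of ties).

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def leave_one_out_agreement(scores):
    """scores[a][q] is the score annotator a gave to question q."""
    correlations = []
    for held_out, own_scores in enumerate(scores):
        others = [s for a, s in enumerate(scores) if a != held_out]
        mean_others = [sum(col) / len(col) for col in zip(*others)]
        correlations.append(spearman(own_scores, mean_others))
    return sum(correlations) / len(correlations), correlations


# Hypothetical toy annotations: 3 annotators, 4 questions.
toy_scores = [
    [1.0, 5.0, 9.0, 3.0],
    [2.0, 6.0, 8.0, 4.0],
    [0.0, 4.0, 10.0, 2.0],
]
mean_agreement, per_annotator = leave_one_out_agreement(toy_scores)
```

The list of per-annotator correlations is what a histogram such as Figure 2 would be drawn from; their mean is the single agreement figure.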

Figure 2: The distribution of the annotated Spearman’s rank correlation computed by leave-one-out.

From the prior work on SCWS, the current state-of-the-art score is around 0.7, and most work cannot improve the performance much further, because models have already surpassed human-level agreement on SCWS. This observation is also pointed out by Pilehvar and Camacho-Collados (2018). Moreover, note that a mere 300-dimensional word-level skip-gram model can achieve a score of 0.65 Bartunov et al. (2016) on SCWS. In contrast, our baseline word-level skip-gram model can only obtain a score of 0.49 on BCWS, indicating that our dataset leaves more room for improvement for follow-up work.

4 Baseline Experiments

We benchmark the dataset by presenting several baseline models for cross-lingual embeddings. We assume that a sentence-level parallel corpus is available but without word-level alignments. The parallel data used is UM-corpus Tian et al., which contains 15,764,200 parallel sentences with 381,921,583 English words and 572,277,658 unsegmented Chinese words. We use the widely used tool jieba to perform Chinese word segmentation. For the baseline models that train word-level embeddings, the word similarity score is obtained by calculating the cosine similarity between the two target words’ embeddings. Then Spearman’s rank correlation between the human-labeled scores and the cosine similarity scores is calculated to measure how well the two sets of scores are correlated. We briefly introduce four baseline methods below and show all results in Table 2.


Pretrained Word Vectors

The naïve baseline is to simply pretrain word embeddings for the two languages separately. We use word2vec Mikolov et al. (2013a) to train word embeddings on the Chinese and English parts of the UM-corpus, adopting the default hyperparameter settings. As expected, this method performs poorly (1.16 on Spearman’s rank correlation), because it does not consider any interaction or alignment between the two languages. In other words, the two sets of embeddings do not live in the same vector space.

Bilingual Word Embeddings

Luong et al. (2015) proposed a bilingual word representation system which extends the skip-gram architecture to predict not only neighbor words in the same language, but also neighbor words in its bilingual counterpart. It assumes that the system uses either the given ground truth word alignment or naive monotonic order alignment. For a fair comparison, we experiment with the version that uses no given word alignment. This method trains cross-lingual word embeddings jointly from scratch. We train 300-dimensional word vectors with 25 negative samples and leave other parameters at their default configuration. The achieved performance is 49.20 on Spearman’s correlation; a possible reason is that the learned embeddings contain more noise due to the lack of word alignments, showing the difficulty of bridging the signal between the two languages.

Multilingual Word Embedding

Conneau et al. (2017) proposed MUSE, an unsupervised method for mapping two sets of monolingual word embeddings into the same space via adversarial training. It learns a transformation matrix W which is nearly orthogonal and utilizes it to align the two word embedding spaces. In the adversarial training, the embedding of a randomly selected word is fed to a discriminator, which tries to determine which vector space the word comes from.

This method requires two sets of pre-trained embeddings, which we train with fastText Bojanowski et al. (2017): we select the 6,000 most frequent words in each of the Chinese and English parts of the UM-corpus and train 300-dimensional word vectors with the default settings. Then we perform the adversarial matrix transformation to map the vectors into the same space and compute the correlation. Although the linguistic structures of English and Chinese are totally different, MUSE can still align the two embedding spaces quite well, achieving 54.7 on Spearman’s correlation.
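One property that makes a (nearly) orthogonal W attractive is that an exactly orthogonal matrix preserves dot products and norms, so cosine similarities among the mapped vectors of one language are unchanged. The sketch below checks this on toy 2-dimensional vectors with a rotation matrix. It illustrates the orthogonality property only and is not MUSE's adversarial training procedure; all names and values are illustrative.

```python
from math import cos, pi, sin, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def apply(W, x):
    """Multiply matrix W (given as a list of rows) by vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


# A 2-D rotation is orthogonal: its transpose is its inverse.
theta = pi / 3
W = [[cos(theta), -sin(theta)],
     [sin(theta), cos(theta)]]

# Two toy "monolingual" vectors before and after the mapping.
u, v = [1.0, 2.0], [3.0, -1.0]
before = cosine(u, v)
after = cosine(apply(W, u), apply(W, v))
# `before` and `after` agree up to floating-point error.
```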

Bilingual Sense Embeddings

Chi and Chen (2018) proposed the first sense-level cross-lingual representation learning model with efficient sense induction, where several monolingual and bilingual modules are jointly optimized. We train this model on the UM-corpus and achieve 58.5 on Spearman’s correlation.

| Baseline Model | Correlation |
| --- | --- |
| Mikolov et al. (2013a) | 1.16 |
| Luong et al. (2015) | 49.20 |
| Conneau et al. (2017) | 54.70 |
| Chi and Chen (2018) | 58.80 |
| Human performance | 82.58 |

Table 2: Results of current baselines. The reported numbers indicate Spearman’s rank correlation ρ × 100.

Although the results of sense embeddings have improved significantly in recent work, all current results show the difficulty of learning bilingual sense embeddings. The proposed dataset still leaves much room for improvement, offering a research direction for future exploration.

5 Conclusion

We present the first dataset for evaluating bilingual contextual word similarity. Unlike most word similarity datasets, this dataset measures word similarity given the words’ sentential contexts in different languages. Moreover, this dataset has high inter-annotator consistency, leaving much room for improvement towards human performance. The new dataset has the potential of helping researchers explore a new direction of cross-lingual word and sense embeddings and of moving monolingual understanding towards multilingual understanding.


References

  • Adams (2017) Oliver Adams. 2017. Automatic understanding of unwritten languages. Ph.D. thesis.
  • Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
  • Bartunov et al. (2016) Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking sticks and ambiguities with adaptive skip-gram. In Artificial Intelligence and Statistics, pages 130–138.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Bruni et al. (2012) Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 136–145. Association for Computational Linguistics.
  • Camacho-Collados et al. (2017) Jose Camacho-Collados, Mohammad Taher Pilehvar, Nigel Collier, and Roberto Navigli. 2017. SemEval-2017 task 2: Multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 15–26.
  • Chi and Chen (2018) Ta-Chung Chi and Yun-Nung Chen. 2018. CLUSE: Cross-lingual unsupervised sense embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • Conneau et al. (2017) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
  • Faruqui et al. (2016) Manaal Faruqui, Yulia Tsvetkov, Pushpendre Rastogi, and Chris Dyer. 2016. Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.
  • Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM.
  • Huang et al. (2010) Chu-Ren Huang, Shu-Kai Hsieh, Jia-Fei Hong, Yun-Zhu Chen, I-Li Su, Yong-Xiang Chen, and Sheng-Wei Huang. 2010. Chinese wordnet: Design, implementation, and application of an infrastructure for cross-lingual knowledge processing. Journal of Chinese Information Processing, 24(2):14–23.
  • Huang et al. (2012) Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL).
  • Lample et al. (2017) Guillaume Lample, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159.
  • Luong et al. (2013) Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Navigli and Ponzetto (2010) Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 216–225. Association for Computational Linguistics.
  • Pilehvar and Camacho-Collados (2018) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2018. WiC: 10,000 example pairs for evaluating context-sensitive representations. arXiv preprint arXiv:1808.09121.
  • Reisinger and Mooney (2010) Joseph Reisinger and Raymond J Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117. Association for Computational Linguistics.
  • Tian et al. Liang Tian, Derek F Wong, Lidia S Chao, Paulo Quaresma, and Francisco Oliveira. UM-Corpus: A large English-Chinese parallel corpus for statistical machine translation.